Message-ID: <CAKGDhHXMh78LocrhnJ2omEUq2quVeKAuNSmCstB2mhvZOMdjrQ@mail.gmail.com>
Date: Sat, 9 May 2015 22:25:06 +0200
From: Agnieszka Bielec <bielecagnieszka8@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: [GSoC] John the Ripper support for PHC finalists

2015-05-05 19:05 GMT+02:00 Solar Designer <solar@...nwall.com>:
> Agnieszka,
>
> I notice that you use 2x interleaving factor with SSE2, but 4x with
> AVX2. Why this specific choice?

No particular reason.

> I suggest that you try 2x interleaving with AVX2. There's little point
> in going for 4x without having seen speedup with 2x first.

With 2x interleaving for AVX2 the decrease in speed at cost=6:6,6:6 is
smaller, and for costs 2 or 4 there is no difference between 2x and 4x.

> Then, you're unnecessarily relying on compiler optimizations too much.
> You've turned random_number, index_local, and index_global from
> variables into arrays, and you're hoping(?) that the compiler will
> allocate groups of 2 or 4 registers for them anyway. Well, it might or
> it might not. Please use simple variables, like the original code did.
> So random_number0, random_number1, etc. You're using explicit indices
> anyway, so this won't complicate your code.

Changed.

> Please omit or rework index_global_t. There's no point in precomputing
> values like "index_global_t[3]=index_global[3]+3;" when you're then only
> using them like "S[index_global_t[3]]". If S[] elements were no larger
> than 8 bytes each, then the CPU's addressing modes would enable e.g.
> S+i*8+24 to be calculated during effective address calculation in the
> load instruction at no extra cost. This doesn't work for elements
> larger than 8 (and yours are __m256i), so it makes sense to precompute
> them multiplied by sizeof(__m256i), and then access the data via a macro
> that would do the proper typecasts to use byte offsets. Not only for
> index_global_t, but also for i0 and index_local*, so that the
> multiplication by sizeof(__m256i) (shift left by 5) would be performed
> less frequently, and then +32, +64, +96, etc. would be added to it.

I omitted index_global_t.
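For reference, a minimal sketch of the byte-offset macro idea described
above (the names S_BYTE and load_four are made up for this sketch, not
taken from my patch):

#include <immintrin.h>
#include <stdint.h>

/* Access an __m256i element of S[] through a precomputed byte offset,
 * so the shift left by 5 (* sizeof(__m256i)) is performed once and the
 * +32/+64/+96 constants fold into the loads' addressing modes. */
#define S_BYTE(base, ofs) (*(__m256i *)((uint8_t *)(base) + (ofs)))

static void load_four(__m256i *S, uint64_t index_global, __m256i t[4])
{
	uint64_t i0 = index_global << 5; /* byte offset, computed once */

	t[0] = S_BYTE(S, i0);      /* S[index_global]     */
	t[1] = S_BYTE(S, i0 + 32); /* S[index_global + 1] */
	t[2] = S_BYTE(S, i0 + 64); /* S[index_global + 2] */
	t[3] = S_BYTE(S, i0 + 96); /* S[index_global + 3] */
}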
On Sat, May 02, 2015 at 06:14:05AM +0200, Agnieszka Bielec wrote:
>> I made interleaving for no-SIMD, SSE2 and AVX2 version, the speed for
>> costs 2,2 and 0,0 is slightly better but for costs 6,6 and 8,8 is
>> worse, so I'm not sure if I did everything correctly.
>
> Given POMELO's use of memory, interleaving might in fact be of little
> help, as the more memory you use at once, the slower the memory accesses
> become as you're getting further out of cache. I think this is why
> you're not seeing a speedup with only your initial implementation, not
> optimized yet. You might or might not see more of a speedup when you
> implement optimizations such as what I suggested above.
>
> I suggest that you review the generated assembly code without and with
> interleaving. See if extra instructions get generated (such as spilling
> registers to memory and loading them back).

The interleaved function (SSE2) contains more lea instructions:

none@...e ~/Desktop $ cat sse.asm | grep lea | wc -l
189
none@...e ~/Desktop $ cat sseold.asm | grep lea | wc -l
59
none@...e ~/Desktop $ cat sse.asm | grep rbp | wc -l
141
none@...e ~/Desktop $ cat sseold.asm | grep rbp | wc -l
27
none@...e ~/Desktop $ cat sseold.asm | grep movdqu | wc -l
126
none@...e ~/Desktop $ cat sse.asm | grep movdqu | wc -l
264

It looks like the additional set of instructions isn't interleaved with
the original one.

> Also, find those left shifts that are used to calculate byte offsets.
> See if any can be avoided or moved to outer loops.
> Perhaps some of these optimizations can also be made to non-interleaved
> code (and even submitted back to the author of POMELO).

In my opinion they can't be moved to a better place.

>> Maybe it's because we have bigger gaps between chunks of data in memory
>
> No, I think the memory layout is fine. When different cache lines are
> accessed, it does not matter how large or small the gap between their
> currently cached addresses is.

But what if the same cache lines are accessed when we use less memory,
and different ones when we use more? I mean the H function jumping to
random locations, and the L2 cache.

> However, I suggest that you align the memory allocations to be on cache
> line boundary. Right now, you align them to 32 bytes as AVX2 requires,
> but our cache lines are 64 bytes. Crossing a cache line boundary
> unnecessarily has performance cost and it thrashes other valuable data
> out of cache (it thrashes two cache lines instead of just one).
>
> Oh, and in the SSE2/AVX code you're not aligning the memory allocation
> of S at all, so you only get the current malloc()'s guaranteed 16-byte
> alignment. This might or might not happen to also be 64-byte aligned.
> You should explicitly make it at least 64-byte aligned.

This is also done now.
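A minimal sketch of the 64-byte-aligned allocation, assuming
posix_memalign() is available (jumbo also has its own aligned-allocation
helpers, e.g. mem_alloc_tiny() with MEM_ALIGN_CACHE, so the actual code
may use those instead; state_bytes is a made-up size here):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	size_t state_bytes = 8192 * 32; /* hypothetical size of S */
	void *S = NULL;

	/* 64 = cache line size; this also satisfies the 16- and 32-byte
	 * alignment that SSE2 and AVX2 loads require */
	if (posix_memalign(&S, 64, state_bytes)) {
		fprintf(stderr, "allocation failed\n");
		return 1;
	}
	/* ... POMELO's S[] state lives here ... */
	free(S);
	return 0;
}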
> As magnum correctly suggested, this should be automatic. Also, it
> should be reported as "AVX" when #ifdef __AVX__, because in that case
> the compiler generates AVX instructions for the same SSE2 intrinsics.

It's automatic. I tested AVX2 and SSE2 on well, where AVX2 is
supported, so I changed only one function call for the test. I could
also have not made the export for gcc, but changing a function call
was faster for me.

___

I noticed that SSE2 is slightly faster after the modifications.

sse2, usual interleaving:

a@...l:~/hmm/run$ ./john --format=pomelo --test --cost=2:2,2:2
Will run 8 OpenMP threads
Benchmarking: POMELO, Generic pomelo [SSE2]... (8xOMP) DONE
Speed for cost 1 (N) of 2, cost 2 (r) of 2
Many salts:     104192 c/s real, 13024 c/s virtual
Only one salt:  104448 c/s real, 13056 c/s virtual

a@...l:~/hmm/run$ ./john --format=pomelo --test --cost=4:4,4:4
Will run 8 OpenMP threads
Benchmarking: POMELO, Generic pomelo [SSE2]... (8xOMP) DONE
Speed for cost 1 (N) of 4, cost 2 (r) of 4
Many salts:     6525 c/s real, 816 c/s virtual
Only one salt:  6525 c/s real, 817 c/s virtual

a@...l:~/hmm/run$ ./john --format=pomelo --test --cost=6:6,6:6
Will run 8 OpenMP threads
Benchmarking: POMELO, Generic pomelo [SSE2]... (8xOMP) DONE
Speed for cost 1 (N) of 6, cost 2 (r) of 6
Many salts:     322 c/s real, 43.4 c/s virtual
Only one salt:  320 c/s real, 42.7 c/s virtual

sse2, interleaving after modifications:

a@...l:~/hmm/run$ ./john --format=pomelo --test --cost=2:2,2:2
Will run 8 OpenMP threads
Benchmarking: POMELO, Generic pomelo [SSE2]... (8xOMP) DONE
Speed for cost 1 (N) of 2, cost 2 (r) of 2
Many salts:     109056 c/s real, 13649 c/s virtual
Only one salt:  109056 c/s real, 13632 c/s virtual

a@...l:~/hmm/run$ ./john --format=pomelo --test --cost=4:4,4:4
Will run 8 OpenMP threads
Benchmarking: POMELO, Generic pomelo [SSE2]... (8xOMP) DONE
Speed for cost 1 (N) of 4, cost 2 (r) of 4
Many salts:     6776 c/s real, 844 c/s virtual
Only one salt:  6710 c/s real, 843 c/s virtual

a@...l:~/hmm/run$ ./john --format=pomelo --test --cost=6:6,6:6
Will run 8 OpenMP threads
Benchmarking: POMELO, Generic pomelo [SSE2]... (8xOMP) DONE
Speed for cost 1 (N) of 6, cost 2 (r) of 6
Many salts:     341 c/s real, 45.2 c/s virtual
Only one salt:  341 c/s real, 45.3 c/s virtual

The reduction in speed at bigger costs stays the same, except for AVX2:
after I changed its interleaving to 2x, the reduction is now similar to
the SSE2 and no-SIMD versions.

thanks