Message-ID: <20150530025524.GA14513@openwall.com>
Date: Sat, 30 May 2015 05:55:25 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Interleaving of intrinsics

On Fri, May 29, 2015 at 10:52:41PM +0200, magnum wrote:
> Here's a GitHub issue where we discuss interleaving and present some
> benchmarks:
> https://github.com/magnumripper/JohnTheRipper/issues/1217

Thanks.  Most of this (including all of the benchmarking discussion)
should be on john-dev (and some of it was), not (only) on GitHub.

magnum, what exactly did you benchmark for these tables? -

https://github.com/magnumripper/JohnTheRipper/issues/1217#issuecomment-106695709

I fail to make sense of them, except that they seem to show
interleaving not helping.  It's unclear what hash type this is - raw or
something else.  Are the 8x, 14x, and 6x the numbers of OpenMP threads?
If so, why 14x?  Was max_keys_per_crypt kept constant, or was it
increasing along with the interleave factor (if so, we might have been
exceeding L1 data cache size at the higher interleave factors)?

Lei -

https://github.com/magnumripper/JohnTheRipper/issues/1217#issuecomment-106839338

These are reasonable results for pbkdf2-hmac-sha512, but there's
something "wrong" for pbkdf2-hmac-sha256.  It is suspicious whenever
there's a performance regression when going from x1 to x2 interleaving,
but then a performance improvement at higher interleave factors.  This
suggests there's some overhead incurred with interleaving, and it is
probably avoidable.

Lei wrote:
> BTW, OMP_SCALE = 1 and OMP_SCALE = 4 has nearly the same performance
> on MIC.

This is no surprise for a slow hash like this.  OMP_SCALE is a kludge
we use for fast and semi-fast hashes.  We should just keep it at 1, or
remove it altogether, for slow hashes.

Lei wrote:
> Somehow I couldn't get useful info from a forked run of
> pbkdf2-hmac-sha256/512. I used the same settings as I benchmarked
> raw-md4(5), but only got output like:
> Will run 4 OpenMP threads per process (240 total across 60 processes)
> Node numbers 1-60 of 60 (fork)
> Session stopped (max run-time reached)
>
> I tried increasing --max-run, but to no avail. Something wrong here?

I suspect max_keys_per_crypt might be too high.  You'd want to find out
what it was by examining the log file.  It shouldn't actually be high
for a slow hash like this, but maybe it is (and needs to be lowered).
Also, you never mentioned your full command line.  There might be
something specific to the cracking mode you invoked and its settings.

Anyway, let's be using OpenMP alone, without --fork, for very slow
hashes like this on Xeon Phi.  --fork helps for faster hashes and/or
when there's other load on the same device (not expected on Xeon Phi).
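For example, something like this (just a sketch: the thread count
assumes the 60-core Phi setup quoted above, and the format name matches
Lei's benchmarks):

OMP_NUM_THREADS=240 ./john --test --format=pbkdf2-hmac-sha512

rather than a 60-process --fork run of the same format.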
> I think you will have some educated thoughts about this; Here's part
> of our current SHA-1:
>
> #define SHA1_PARA_DO(x) for((x)=0;(x)<SIMD_PARA_SHA1;(x)++)
>
> #define SHA1_ROUND2(a,b,c,d,e,F,t) \
>     SHA1_PARA_DO(i) tmp3[i] = tmpR[i*16+(t&0xF)]; \
>     SHA1_EXPAND2(t+16) \
>     F(b,c,d) \
>     SHA1_PARA_DO(i) e[i] = vadd_epi32( e[i], tmp[i] ); \
>     SHA1_PARA_DO(i) tmp[i] = vroti_epi32(a[i], 5); \
>     SHA1_PARA_DO(i) e[i] = vadd_epi32( e[i], tmp[i] ); \
>     SHA1_PARA_DO(i) e[i] = vadd_epi32( e[i], cst ); \
>     SHA1_PARA_DO(i) e[i] = vadd_epi32( e[i], tmp3[i] ); \
>     SHA1_PARA_DO(i) b[i] = vroti_epi32(b[i], 30);
>
> And here's a similar part of SHA256:
>
> #define SHA256_STEP0(a,b,c,d,e,f,g,h,x,K) \
> { \
>     SHA256_PARA_DO(i) \
>     { \
>         w = _w[i].w; \
>         tmp1[i] = vadd_epi32(h[i], S1(e[i])); \
>         tmp1[i] = vadd_epi32(tmp1[i], Ch(e[i],f[i],g[i])); \
>         tmp1[i] = vadd_epi32(tmp1[i], vset1_epi32(K)); \
>         tmp1[i] = vadd_epi32(tmp1[i], w[x]); \
>         tmp2[i] = vadd_epi32(S0(a[i]),Maj(a[i],b[i],c[i])); \
>         d[i] = vadd_epi32(tmp1[i], d[i]); \
>         h[i] = vadd_epi32(tmp1[i], tmp2[i]); \
>     } \
> }
>
> This file is -O3 (from a pragma), so I guess both cases will be
> unrolled, but there is obviously a big difference after just
> unrolling.  Assuming a perfect optimizer it wouldn't matter, but
> assuming a non-perfect one, is the former better?  I'm guessing SHA-1
> was written that way for a reason?

I wasn't involved in writing either.  I think JimF was first to
introduce the *_PARA_DO stuff to JtR.

To me, both of these depend on compiler optimization too much.  When I
interleave stuff, I use non-array temporary variables and explicit
unrolling via macros - see BF_std.c, MD5_std.c, and php_mt_seed.  I
have little experience with how well or not the approach with arrays
and loops works across different compilers/versions - I only
benchmarked these formats in JtR just like others in here did.

I understand that use of arrays and loops makes it simpler to have
shared macros yet be able to adjust the interleave factors easily.
Yet maybe we should explore alternatives to it, such as heavier use of
cpp features (merging a base variable name with a suffix to form
distinct non-array variable names - see the sketch further below).

As to interleaving of individual operations vs. interleaving of SHA-*
steps, I think either is reasonable (assuming full loop unrolling),
since the number of operations in a SHA-* step is small enough that
typically they're all within a compiler's instruction scheduling range
and within an out-of-order CPU's reorder buffer.  (There will indeed be
differences in generated code and in its performance, but those will be
"random" - it's just whatever happens to match a given compiler version
and CPU better.)

[ Things become different when we try to interleave bigger things, such
as e.g. entire crypto primitives (so one group of multiple rounds
follows another group of multiple rounds, with each of them possibly
exceeding the scheduling ranges and/or reorder buffers). ]

In a recent experiment with another program (a revision of Argon2 in
PHC, where I tried to mix unrelated SIMD and scalar processing), I got
very slightly better results by adding "-fsched2-use-superblocks
-fsched-stalled-insns=0" to the gcc flags.  (That's on bull with its
gcc.)  Please feel free to try these.

magnum wrote:
> One difference is that the former is almost guaranteed to be fully
> unrolled while the latter might not be.

I am not so sure.  I think these might be at similar risk of not being
unrolled.
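To illustrate the cpp alternative mentioned above, here's a rough
sketch (hypothetical macro names, not actual code from the tree; vtype,
vadd_epi32, vroti_epi32, and cst are assumed to be the same wrappers
and round constant as in the quoted SHA-1 code):

/* The interleave factor is set by how many times the iterator macro
   invokes its argument - edit this one line to change it */
#define PARA_DO(M) M(0) M(1) M(2) M(3)

#define SHA1_DECL(n) vtype a##n, b##n, e##n, tmp##n;

/* Part of a SHA-1 step, on distinct scalar variables a0..a3, etc. */
#define SHA1_STEP_PART(n) \
	tmp##n = vroti_epi32(a##n, 5); \
	e##n = vadd_epi32(e##n, tmp##n); \
	e##n = vadd_epi32(e##n, cst); \
	b##n = vroti_epi32(b##n, 30);

	PARA_DO(SHA1_DECL)      /* declares a0..a3, b0..b3, e0..e3, ... */
	PARA_DO(SHA1_STEP_PART) /* four copies emitted by cpp - no loop
	                           left for the compiler to (not) unroll */

This keeps the "adjust the factor in one place" property, yet the
unrolling no longer depends on the optimizer at all.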
If we continue to use this arrays+loops approach, maybe we should
compile the corresponding source files with forced unrolling of these
loops somehow.  "gcc --param" has max-unrolled-insns,
max-average-unrolled-insns, and max-unroll-times, but it might not be a
good idea to specify these, as the available parameters change between
gcc versions (hardcoding them may result in failed builds with a future
gcc version).
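A per-file pragma might be safer (a sketch only, in the spirit of the
-O3 pragma already in that file; whether this actually forces the
*_PARA_DO loops to be unrolled is an untested assumption):

#if defined(__GNUC__) && !defined(__clang__)
/* Request -O3 plus -funroll-loops for this translation unit only,
   without hardcoding --param values that vary across gcc versions */
#pragma GCC optimize("O3", "unroll-loops")
#endif

Alexander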