Message-ID: <20150530025524.GA14513@openwall.com>
Date: Sat, 30 May 2015 05:55:25 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Interleaving of intrinsics

On Fri, May 29, 2015 at 10:52:41PM +0200, magnum wrote:
> Here's a GitHub issue where we discuss interleaving and present some 
> benchmarks:
> https://github.com/magnumripper/JohnTheRipper/issues/1217

Thanks.  Most of this (including all of the benchmarking discussion)
should be on john-dev (and some of it was), not (only) on GitHub.

magnum, what exactly did you benchmark for these tables? -

https://github.com/magnumripper/JohnTheRipper/issues/1217#issuecomment-106695709

I fail to make sense of them, except that they seem to show interleaving
not helping.  It's unclear what hash type this is - raw or something
else.  Are the 8x, 14x, and 6x the numbers of OpenMP threads?  If so,
why 14x?  Was max_keys_per_crypt kept constant, or did it grow along
with the interleaving factor (if so, we might have been exceeding L1
data cache size at the higher interleaving factors)?

Lei -

https://github.com/magnumripper/JohnTheRipper/issues/1217#issuecomment-106839338

These are reasonable results for pbkdf2-hmac-sha512, but there's
something "wrong" for pbkdf2-hmac-sha256.  It is suspicious whenever
there's a performance regression going from 1x to 2x interleaving, but
then a performance improvement at higher interleaving factors.  This
suggests there's some overhead incurred with interleaving, and it is
probably avoidable.

Lei wrote:
> BTW,  OMP_SCALE = 1  and  OMP_SCALE = 4  has nearly the same performance on MIC.

This is no surprise for a slow hash like this.  OMP_SCALE is a kludge we
use for fast and semi-fast hashes.  We should just keep it at 1, or
remove it altogether, for slow hashes.
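
For context, the typical pattern in a format's init() is roughly this
(illustrative only - the exact code varies across formats):

    #ifdef _OPENMP
        int omp_t = omp_get_max_threads();
        self->params.min_keys_per_crypt *= omp_t;
        omp_t *= OMP_SCALE;      /* extra per-thread buffering */
        self->params.max_keys_per_crypt *= omp_t;
    #endif

With OMP_SCALE at 1, max_keys_per_crypt simply scales with the thread
count, which is all a slow hash needs.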

Lei wrote:
> Somehow I couldn't get useful info from a forked run of pbkdf2-hmac-sha256/512. I used the same settings as I benchmarked raw-md4(5), but only got output like:
> Will run 4 OpenMP threads per process (240 total across 60 processes)
> Node numbers 1-60 of 60 (fork)
> Session stopped (max run-time reached)
> 
> I tried increasing  --max-run , but to no avail. Something wrong here?

I suspect max_keys_per_crypt might be too high.  You'd want to find out
what it was by examining the log file.  It shouldn't actually be high
for a slow hash like this, but maybe it is (and needs to be lowered).
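
For example, something like this should show the chunk size actually
used (assuming the usual log file name; the exact wording of the log
message may differ):

    grep 'tried in chunks' john.log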

Also, you never mentioned your full command line.  There might be
something specific to the cracking mode you invoked and its settings.

Anyway, let's use OpenMP alone, without --fork, for very slow
hashes like this on Xeon Phi.  --fork helps for faster hashes and/or
when there's other load on the same device (not expected on Xeon Phi).
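
That is, something along these lines (240 being the 60 cores x 4
hardware threads of that MIC; adjust as needed):

    OMP_NUM_THREADS=240 ./john --test --format=pbkdf2-hmac-sha512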

> I think you will have some educated thoughts about this; Here's part of 
> our current SHA-1:
> 
> 
> #define SHA1_PARA_DO(x)	for((x)=0;(x)<SIMD_PARA_SHA1;(x)++)
> 
> #define SHA1_ROUND2(a,b,c,d,e,F,t)                      \
>     SHA1_PARA_DO(i) tmp3[i] = tmpR[i*16+(t&0xF)];       \
>     SHA1_EXPAND2(t+16)                                  \
>     F(b,c,d)                                            \
>     SHA1_PARA_DO(i) e[i] = vadd_epi32( e[i], tmp[i] );  \
>     SHA1_PARA_DO(i) tmp[i] = vroti_epi32(a[i], 5);      \
>     SHA1_PARA_DO(i) e[i] = vadd_epi32( e[i], tmp[i] );  \
>     SHA1_PARA_DO(i) e[i] = vadd_epi32( e[i], cst );     \
>     SHA1_PARA_DO(i) e[i] = vadd_epi32( e[i], tmp3[i] ); \
>     SHA1_PARA_DO(i) b[i] = vroti_epi32(b[i], 30);
> 
> 
> And here's a similar part of SHA256:
> 
> #define SHA256_STEP0(a,b,c,d,e,f,g,h,x,K)                    \
> {                                                            \
>     SHA256_PARA_DO(i)                                        \
>     {                                                        \
>         w = _w[i].w;                                         \
>         tmp1[i] = vadd_epi32(h[i],    S1(e[i]));             \
>         tmp1[i] = vadd_epi32(tmp1[i], Ch(e[i],f[i],g[i]));   \
>         tmp1[i] = vadd_epi32(tmp1[i], vset1_epi32(K));       \
>         tmp1[i] = vadd_epi32(tmp1[i], w[x]);                 \
>         tmp2[i] = vadd_epi32(S0(a[i]),Maj(a[i],b[i],c[i]));  \
>         d[i]    = vadd_epi32(tmp1[i], d[i]);                 \
>         h[i]    = vadd_epi32(tmp1[i], tmp2[i]);              \
>     }                                                        \
> }
> 
> This file is -O3 (from a pragma) so I guess both cases will be unrolled 
> but there is obviously a big difference after just unrolling. Assuming a 
> perfect optimizer it wouldn't matter but assuming a non-perfect one, is 
> the former better? I'm guessing SHA-1 was written that way for a reason?

I wasn't involved in writing either.  I think JimF was first to
introduce the *_PARA_DO stuff to JtR.

To me, both of these depend on compiler optimization too much.  When I
interleave stuff, I use non-array temporary variables and explicit
unrolling via macros - see BF_std.c, MD5_std.c, and php_mt_seed.  I have
little experience with how well (or not) the arrays-and-loops approach
works across different compilers/versions - I only benchmarked these
formats in JtR just like others here did.

I understand that use of arrays and loops makes it simpler to have
shared macros yet be able to adjust the interleave factors easily.  Yet
maybe we should explore alternatives to it, such as heavier use of cpp
features (merging a base variable name with a numeric suffix to form
distinct non-array variable names), as sketched below.
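
Here's a rough sketch of what I mean (hypothetical macro names; vtype
stands for whatever vector type the file uses):

    #define CAT(base, n) base##n
    /* instantiate a macro once per interleaved lane; this list is the
       only thing to edit when changing the interleave factor */
    #define FOR_EACH_LANE(M) M(0) M(1) M(2) M(3)

    #define DECL_LANE(n) vtype CAT(a, n), CAT(e, n), CAT(tmp, n);
    #define ADD_E_TMP(n) CAT(e, n) = vadd_epi32(CAT(e, n), CAT(tmp, n));

    FOR_EACH_LANE(DECL_LANE)     /* vtype a0, e0, tmp0; vtype a1, ... */
    FOR_EACH_LANE(ADD_E_TMP)     /* e0 = vadd_epi32(e0, tmp0); ... */

This gives the compiler straight-line code over distinct named
variables, with no reliance on loop unrolling or on array elements
staying in registers.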

As to interleaving of individual operations vs. interleaving of SHA-*
steps, I think either is reasonable (assuming full loop unrolling),
since the number of operations in a SHA-* step is small enough that
typically they're all within a compiler's instruction scheduling range
and within an out-of-order CPU's reorder buffer.  (There will indeed be
differences in generated code and in its performance, but those will be
"random" - it's just whatever happens to match a given compiler version
and CPU better.)  [ Things become different when we try to interleave
bigger things, such as entire crypto primitives (so one group of
multiple rounds follows another group of multiple rounds, with each of
them possibly exceeding the scheduling ranges and/or reorder buffers). ]
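
Schematically, for a 2x interleave, the two shapes look like this
(sketch only; SHA1_STEP is a hypothetical single-lane step macro):

    /* interleaving individual operations: */
    e0 = vadd_epi32(e0, tmp0);  e1 = vadd_epi32(e1, tmp1);
    tmp0 = vroti_epi32(a0, 5);  tmp1 = vroti_epi32(a1, 5);

    /* interleaving whole steps: */
    SHA1_STEP(a0, b0, c0, d0, e0)  /* all ops of one step, lane 0 */
    SHA1_STEP(a1, b1, c1, d1, e1)  /* same step, lane 1 */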

In a recent experiment with another program (a revision of Argon2 in
PHC, where I tried to mix unrelated SIMD and scalar processing), I got
very slightly better results by adding "-fsched2-use-superblocks
-fsched-stalled-insns=0" to gcc flags.  (That's on bull with its gcc.)
Please feel free to try these.
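
With our autoconf-based build, that'd be something like (other flags
omitted; add whatever you normally use):

    ./configure CFLAGS="-O2 -fsched2-use-superblocks -fsched-stalled-insns=0"
    make -s clean && make -sj8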

magnum wrote:
> One difference is that the former is almost guaranteed to be fully
> unrolled while the latter might not be.

I am not so sure.  I think these might be at similar risk of not being
unrolled.  If we continue to use this arrays+loops approach, maybe we
should compile the corresponding source files with forced unrolling of
these loops somehow.  "gcc --param" has max-unrolled-insns,
max-average-unrolled-insns, and max-unroll-times, but it might not be
a good idea to specify these, as the available parameters change
between gcc versions (which may result in failed builds with a future
version).
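
If someone wants to experiment anyway, the syntax would be along these
lines (parameter values are only examples, and the file name is just a
placeholder for whichever source file we'd single out):

    gcc -c -O3 --param max-unroll-times=16 \
        --param max-unrolled-insns=4000 interleaved-file.c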

Alexander
