Message-ID: <20150529043212.GA20585@openwall.com>
Date: Fri, 29 May 2015 07:32:12 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Cc: Alain Espinosa <alainesp@...ta.cu>
Subject: Re: bitslice SHA-256

On Fri, May 29, 2015 at 06:35:29AM +0300, Solar Designer wrote:
> "So we can expect that bitslice SHA256 will be (79-62)/62 = 27% slower
> than normal SHA256"
> 
> This is based on instruction count.  And in a private e-mail to me Alain
> reported actual speeds, where the difference is much bigger.  I guess it
> may be bigger because we're exceeding L1 code cache size.  I recently
> suggested how to deal with that: keep the instruction stream size per
> cycle at no more than 16 bytes, so that it gets fetched from L2 cache
> fast enough to execute at full speed.

In my experiments, this works great for unrolled code sizes of up to
roughly one half of L2 cache size, or a little bit more.  The threshold
appears to vary slightly between program invocations, typically staying
in the 130 KB to 140 KB range.  That's testing with just 1 thread on
i7-4770K, which has 256 KB of L2 cache per core.  I suspect the L2
cache might be physically indexed, and perhaps this threshold is where
the probability becomes high that more than 8 of our pages map to the
same cache set, exceeding the cache's 8-way associativity.  Maybe we
could use the full 256 KB without speed
loss with smarter page allocation (page coloring in the kernel, or a
custom allocator in a kernel module just for this special use).

> This may be 3 5-byte AVX2
> instructions, each one with a different 1-byte offset against one of 8
> general-purpose registers, thereby giving us a window of 64 "virtual
> registers" that we can shift by occasional ADDs/SUBs to the GPRs.  But
> this won't remove the 27% slowdown estimated from instruction counts.
> Unless we find a way to reduce the instruction count, bitslicing SHA-256
> on this architecture is not worthwhile.
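
(For illustration - operand choices here are arbitrary - one such
5-byte instruction is a load-op vpxor with an 8-bit displacement off a
general-purpose base register:

	vpxor	0x20(%rbp), %ymm1, %ymm1	# encodes as C5 F5 EF 4D 20

and 8 base registers times eight 32-byte slots within signed disp8
range is where the window of 64 "virtual registers" comes from.)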

I think it might actually be possible to reduce this 27% increase in
instruction count.  Specifically:

Alain, you appear to count each ADD as 5 instructions when bitsliced.
You do it for each ADD even in expressions like:

"W[0] += R1(W[14]) + W[9] + R0(W[1]);"

This is 3 ADDs in a row.  We don't have to treat them as 3 arbitrary
ADDs.  Instead, we can treat them as one 4-input ADD.  I expect that
once we've optimized it, the total instruction count for a 4-input ADD
will be lower than 15.
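
To make this concrete, here is a rough sketch in portable C (uint64_t
standing in for the AVX2 vectors; identifiers are mine, not Alain's):
a plain ripple-carry bitslice ADD costs those 5 bitwise ops per bit,
whereas a 4-input ADD can first compress its inputs with 3:2 carry-save
adders, where "shifting" the carry slices up by one bit is free:

#include <stdint.h>

/* 3:2 compressor: a + b + c == sum + 2*carry, per bit slice */
static inline void csa(uint64_t *sum, uint64_t *carry,
    uint64_t a, uint64_t b, uint64_t c)
{
	uint64_t t = a ^ b;
	*sum = t ^ c;
	*carry = (a & b) | (c & t); /* majority(a, b, c) */
}

/* s = a + b + c + d mod 2^32, each word held as 32 bit slices */
static void add32_4(uint64_t s[32], const uint64_t a[32],
    const uint64_t b[32], const uint64_t c[32], const uint64_t d[32])
{
	uint64_t s1[32], c1[32], s2[32], c2[32], cr = 0;
	int i;

	for (i = 0; i < 32; i++)
		csa(&s1[i], &c1[i], a[i], b[i], c[i]);
	/* c1 is weighted x2, so bit i takes c1[i - 1]: a free shift */
	for (i = 0; i < 32; i++)
		csa(&s2[i], &c2[i], s1[i], i ? c1[i - 1] : 0, d[i]);
	/* final ripple-carry add of s2 + 2*c2 */
	for (i = 0; i < 32; i++) {
		uint64_t hi = i ? c2[i - 1] : 0;
		uint64_t t = s2[i] ^ hi;
		s[i] = t ^ cr;
		cr = (s2[i] & hi) | (cr & t);
	}
}

With plain AND/OR/XOR this is still about 15 ops per bit, but each
majority() collapses to a single bitselect or ternary logic instruction
where available, the constant-zero inputs at bit 0 come for free once
unrolled, and subexpressions can be shared, which is why I expect the
optimized count to end up below 15.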

Similarly, in:

"H += R_E(E) + (G ^ (E & (F ^ G))) + 0xD807AA98 + W[0];"

it should be less than 5 instructions per 2-input ADD, because they get
merged into a single 5-input ADD, and moreover since one of the inputs
is a constant we can save instructions on handling of carry bits where
the constant has a 0 bit.  On top of this, where the original
expression doesn't already expose a bitselect (like this one does),
some of the logical operations from our implementations of the ADDs
might be merged with nearby logical operations from the original
expression, producing more complicated yet still single-instruction
logical operations such as bitselect or the AVX-512/Maxwell ternary
logic operations.
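
As a hypothetical illustration of the constant trick (identifiers are
mine), here is a bitslice ADD of a 32-bit constant k; with the loop
fully unrolled, only one branch survives per bit:

#include <stdint.h>

/* s = a + k mod 2^32: 2 or 3 bitwise ops per bit instead of ~5,
   because each bit of k is known at compile time */
static void add32k(uint64_t s[32], const uint64_t a[32], uint32_t k)
{
	uint64_t c = 0; /* carry slice */
	int i;

	for (i = 0; i < 32; i++) {
		if ((k >> i) & 1) {
			s[i] = ~(a[i] ^ c); /* a + 1 + c */
			c = a[i] | c; /* carry = majority(a, 1, c) */
		} else {
			s[i] = a[i] ^ c; /* a + 0 + c */
			c = a[i] & c; /* carry = majority(a, 0, c) */
		}
	}
}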

I briefly experimented with merged ADDs in this md5slice.c revision:

http://www.openwall.com/lists/john-dev/2015/03/11/10

add32c() is a 3-input ADD where one of the inputs is a constant (the
loop is assumed to be unrolled, leaving only one of the two if/else
paths for each iteration).  FF() has everything merged manually (and
again its loops are assumed to be unrolled).  This can be taken further,
applying it to GG, HH, II as well.  Even the final ADDs to state can
probably be reasonably merged with the last step.

Finally, much of the estimated 27% increase in instruction count comes
from extra loads.  But extra loads are not necessarily extra
instructions.  On x86, we have load-op instructions (where one of the
inputs is in memory), and the SIMD ones typically execute in a single
cycle (when the data is in L1 cache) - just as fast as instructions with
both inputs in registers.  This works as long as at most 2 out of every
3 instructions issued in a given clock cycle are load-ops, since we
only have 2 read ports from the L1 data cache.  I think 2 out of 3 is
good enough for
our needs here.
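
For example (a hypothetical fragment, just to show the code generation
pattern), a compiler will typically fold each of these loads into the
memory operand of the SIMD instruction:

#include <immintrin.h>

/* each statement can become one load-op instruction, e.g.
   vpxor ymm0, ymm0, [mem] - no separate load, and with the data in
   L1 it runs as fast as a register-register op */
static inline __m256i mix(__m256i x, const __m256i *w)
{
	x = _mm256_xor_si256(x, _mm256_load_si256(&w[0]));
	x = _mm256_and_si256(x, _mm256_load_si256(&w[1]));
	return x;
}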

So to me this experiment does not yet rule out the possibility of
obtaining faster SHA-256 through bitslicing on Haswell.  We just need to
be far more thorough and careful.  And C+intrinsics won't be good enough
when we target L2 cache and need to optimize code size and unroll at the
same time.

Alain, I think you might be the person who could pull this off, given
the additional advice and encouragement above. :-)

Alexander
