|
Message-ID: <20150529043212.GA20585@openwall.com> Date: Fri, 29 May 2015 07:32:12 +0300 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Cc: Alain Espinosa <alainesp@...ta.cu> Subject: Re: bitslice SHA-256 On Fri, May 29, 2015 at 06:35:29AM +0300, Solar Designer wrote: > "So we can expect that bitslice SHA256 will be (79-62)/62 = 27% slower > than normal SHA256" > > This is based on instruction count. And in a private e-mail to me Alain > reported actual speeds, where the difference is much bigger. I guess it > may be bigger because we're exceeding L1 code cache size. I recently > suggested how to deal with that: keep the instruction stream size per > cycle at no more than 16 bytes, so that it gets fetched from L2 cache > fast enough to execute at full speed. In my experiments, this works great for unrolled code sizes of up to roughly one half of L2 cache size, or a little bit more. The threshold appears to vary slightly between program invocations, typically staying in the 130 KB to 140 KB range. That's testing with just 1 thread on i7-4770K, which has 256 KB L2 cache per core. I suspect L2 cache might be physically-indexed, and perhaps this threshold is where probability becomes high that we'd have duplicate indices in excess of the cache's 8-way associativity. Maybe we could use the full 256 KB without speed loss with smarter page allocation (page coloring in the kernel, or a custom allocator in a kernel module just for this special use). > This may be 3 5-byte AVX2 > instructions, each one with a different 1-byte offset against one of 8 > general-purpose registers, thereby giving us a window of 64 "virtual > registers" that we can shift by occasional ADDs/SUBs to the GPRs. But > this won't remove the 27% slowdown estimated from instruction counts. > Unless we find a way to reduce the instruction count, bitslicing SHA-256 > on this architecture is not worthwhile. I think it might actually be possible to reduce this 27% increase in instruction count. Specifically: Alain, you appear to count each ADD as 5 instructions when bitsliced. You do it for each ADD even in expressions like: "W[0] += R1(W[14]) + W[9] + R0(W[1]);" This is 3 ADDs in a row. We don't have to treat them as arbitrary 3 ADDs. Instead, we can treat them as one 4-input ADD. I expect that once we've optimized it, the total instruction count for 4-input ADD will be lower than 15. Similarly, in: "H += R_E(E) + (G ^ (E & (F ^ G))) + 0xD807AA98 + W[0];" it should be less than 5 instructions per 2-input ADD, because they get merged into a single 5-input ADD, and moreover since one of the inputs is a constant we can save instructions on handling of carry bits where the constant has a 0 bit. On top of this, if we didn't already have a bitselect exposed in the original expression (like we do here), some of the logical operations from our implementations of the ADDs might be merged with nearby logical operations from the original expression producing more complicated yet single-instruction logical operations such as bitselect or AVX-512/Maxwell ternary logic operations. I briefly experimented with merged ADDs in this md5slice.c revision: http://www.openwall.com/lists/john-dev/2015/03/11/10 add32c() is a 3-input ADD where one of the inputs is a constant (the loop is assumed to be unrolled, leaving only one of the two if/else paths for each iteration). FF() has everything merged manually (and again its loops are assumed to be unrolled). This can be taken further, applying it to GG, HH, II as well. Even the final ADDs to state can probably be reasonably merged with the last step. Finally, much of the estimated 27% increase in instruction count comes from extra loads. But extra loads are not necessarily extra instructions. On x86, we have load-op instructions (where one of the inputs is in memory), and the SIMD ones typically execute in a single cycle (when the data is in L1 cache) - just as fast as instructions with both inputs in registers. This works as long as at most 2 out of 3 instructions on a given clock cycle are load-ops, since we only have 2 read ports from L1 data cache. I think 2 out of 3 is good enough for our needs here. So to me this experiment does not yet rule out the possibility of obtaining faster SHA-256 through bitslicing on Haswell. We just need to be far more thorough and careful. And C+intrinsics won't be good enough when we target L2 cache and need to optimize code size and unroll at the same time. Alain, I think you might be the person who could pull this off, given the additional advice and encouragement above. :-) Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.