Message-ID: <20150602110124.GA20487@openwall.com>
Date: Tue, 2 Jun 2015 14:01:25 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Interleaving of intrinsics

magnum -

On Mon, Jun 01, 2015 at 01:37:14PM +0200, magnum wrote:
> Or perhaps as soon as we use interleaving, things like tmp[SIMD_PARA]
> end up being stack arrays? That should hurt a lot.

This is quite possible.  In general, one of the things limiting the
interleaving factor is register pressure - and the compiler might in
fact do a worse job at register allocation when we use arrays.

> Actually, here's a bug we have: Using the wide loops as in SHA2, we
> don't need to use "tmp[i]" at all - we do fine with just "tmp".

Huh?  Doesn't this defeat interleaving, replacing it with sequential
processing, because our source code sort of hints to the compiler to
reuse the same register across instances?  Or are we hoping that the
compiler or the CPU will recognize that we're reusing the variable, and
actually allocate a new register or a new rename register, respectively?
The compiler might and a CPU capable of register renaming at all
probably will, but didn't we intend to reduce rather than increase our
reliance on luck?

I just took a look at commit cde0fb470f35ef6dc5949d3b11137dd27ca2672b,
and it does look as problematic as I had thought from reading your
message. :-(

> I tried this but there was very little difference (but to the better).

This suggests one of two things, or maybe an in-between:

1. Interleaving in that code never really worked, so breaking it further
does not hurt further.

-OR-

2. The compiler and/or the CPU are so good that interleaving still works
in spite of us trying to kill it so hard.

We could also try making the temporary scope of those variables
explicit, by defining them inside e.g. a SHA512_PARA_DO(i) { ... }
block, etc.  Then the compiler, after having unrolled this loop, might
have a better idea that it can substitute different registers for the
different loop iterations.  Or it might not.

IIRC, I did try experimenting with the temporary scope approach when I
first introduced BF_X2, and it didn't work as well as keeping separate
variables at the outer scope.

> I tried changing MD4/5 and SHA1 to use fewer, wider loops similar to
> SHA2 and consequently use single temps instead of arrays. There was
> about 4% boost for MD4/MD5 but SHA1 got slightly worse.

Why?  Luck.

Would it be reasonable for us to try my usual approach, with separate
variables at the outer scope (inside the hashing function, but not
inside the individual steps)?  And if those are in fact separate
variables rather than array elements, this implies manual or cpp level
loop unrolling.

Alexander
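
To make the three styles discussed above concrete, here is a minimal
sketch.  It is an illustration only, not code from the tree: uint32_t
stands in for a SIMD vector type, PARA stands in for SIMD_PARA (fixed
at 2 here), the "step" is a toy operation rather than a real MD4/MD5
round, and all function and macro names are hypothetical.  The three
functions compute the same result; they differ only in what the source
hints to the compiler about temporaries.

```c
#include <assert.h>
#include <stdint.h>

/* Assumptions for illustration: uint32_t plays the role of a SIMD
 * vector, PARA the role of SIMD_PARA, and the step is a toy. */
#define PARA 2
#define ROTL(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

/* Style 1: narrow loops with an array temporary, tmp[PARA].
 * The array may end up on the stack instead of in registers. */
static void step_array_tmp(uint32_t a[PARA], const uint32_t b[PARA],
                           const uint32_t x[PARA])
{
    uint32_t tmp[PARA];
    int i;

    for (i = 0; i < PARA; i++)
        tmp[i] = b[i] ^ x[i];
    for (i = 0; i < PARA; i++)
        a[i] += tmp[i];
    for (i = 0; i < PARA; i++)
        a[i] = ROTL(a[i], 3);
}

/* Style 2: one wide loop reusing a single tmp across instances.
 * The source now hints at reusing one register for every instance. */
static void step_shared_tmp(uint32_t a[PARA], const uint32_t b[PARA],
                            const uint32_t x[PARA])
{
    uint32_t tmp;
    int i;

    for (i = 0; i < PARA; i++) {
        tmp = b[i] ^ x[i];
        a[i] += tmp;
        a[i] = ROTL(a[i], 3);
    }
}

/* Style 3: separate variables at function scope, with the
 * interleaving spelled out at cpp level (only valid for PARA == 2). */
#define STEP_X2(a0, b0, x0, a1, b1, x1) \
    do { \
        tmp0 = (b0) ^ (x0);    tmp1 = (b1) ^ (x1); \
        (a0) += tmp0;          (a1) += tmp1; \
        (a0) = ROTL((a0), 3);  (a1) = ROTL((a1), 3); \
    } while (0)

static void step_separate_vars(uint32_t a[PARA], const uint32_t b[PARA],
                               const uint32_t x[PARA])
{
    uint32_t a0 = a[0], a1 = a[1];
    uint32_t tmp0, tmp1;

    STEP_X2(a0, b[0], x[0], a1, b[1], x[1]);
    a[0] = a0;
    a[1] = a1;
}

/* Sanity check: all three styles must produce identical output. */
static void check_variants_agree(void)
{
    const uint32_t b[PARA] = { 0x01234567u, 0x89abcdefu };
    const uint32_t x[PARA] = { 0xdeadbeefu, 0x0badf00du };
    uint32_t r1[PARA] = { 1, 2 }, r2[PARA] = { 1, 2 }, r3[PARA] = { 1, 2 };
    int i;

    step_array_tmp(r1, b, x);
    step_shared_tmp(r2, b, x);
    step_separate_vars(r3, b, x);
    for (i = 0; i < PARA; i++)
        assert(r1[i] == r2[i] && r2[i] == r3[i]);
}
```

Whether any of these actually ends up interleaved is up to the compiler
and the CPU, so the only way to tell them apart is to benchmark and to
look at the generated code.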