Message-ID: <20150314040000.GA9163@openwall.com>
Date: Sat, 14 Mar 2015 07:00:00 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: MD4/MD5 round 3 common XOR (was: bitslice MD*/SHA*, AVX2)

On Fri, Mar 13, 2015 at 09:01:25AM +0100, magnum wrote:
> On 2015-03-11 23:07, Solar Designer wrote:
> > On Wed, Mar 11, 2015 at 10:45:19PM +0100, magnum wrote:
> >> On 2015-03-11 22:21, Solar Designer wrote:
> >>> In my testing, this might not be beneficial on 2-operand archs such as
> >>> plain x86, but it should be on 3-operand archs such as AVX. So we
> >>> should update the code in sse-intrinsics.c, and benchmark. And we should
> >>> update the plain C code anyway, such as for non-x86 archs (which are
> >>> mostly 3-operand RISC).
> >>>
> >>> magnum, Jim?
> >>
> >> Yeah... unless we have some GSoC candidate wanting to show his/her
> >> teeth? That would be a good start!
> >
> > OK, I don't mind keeping this on hold until GSoC student application
> > period ends. Would you track it, so it doesn't get forgotten in case no
> > GSoC candidate takes care of it?
>
> Out of curiosity I did some experiments with sse-intrinsics.c and I only
> see regression when trying to implement this. Does that make sense? I
> also tried with no interleaving, still a regression. Could this somehow
> break some other optimization made by the compiler? In the MD4 case I
> didn't even have to add a new temp variable, it already has tmp2 free to
> use at that place.
>
> It doesn't get much slower, but always definitely slower.

As long as you're building for AVX or better and have enough registers,
that's unexpected. Yes, if you're observing this anyway, it might be
breaking other optimizations.

Like I wrote, on 2-operand archs, including SSE2, this might hurt
performance as it might be replacing XOR's with MOV's and eating up
extra registers. Also, on register-starved archs, like on x86 in 32-bit
mode, the extra register pressure might be hurting performance. But on
x86-64 with AVX or better, it should be beneficial.

A month or so ago, atom said on Twitter that it was always beneficial
for him, though. Oh, maybe that's because he wasn't testing on anything
older than Bulldozer or Ivy Bridge, and these have free MOVs via
register renaming?

Maybe your reuse of "tmp2", or the compiler's reuse of a register of its
choosing, causes extra anti-dependencies?

http://en.wikipedia.org/wiki/Data_dependency#Anti-dependency

I don't know how good or not specific CPUs' register renaming is at
dealing with these. You could try introducing a new variable for this,
to possibly make it more likely that the compiler would allocate a new
register if it can.

Alexander
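
For reference, here is a minimal sketch of the transformation under discussion, in plain C rather than the intrinsics used in sse-intrinsics.c. It assumes the standard MD5 round 3 function H(x, y, z) = x ^ y ^ z and shows only the first two round 3 steps. The function names, the "bc" temporary, and the test inputs are made up for illustration and are not taken from the JtR source. Both steps need b ^ c, so computing it once into a fresh variable saves one XOR per pair of steps, at the cost of keeping one more value live.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate left; s must be in 1..31. */
static uint32_t rol32(uint32_t x, unsigned s)
{
	return (x << s) | (x >> (32 - s));
}

/* Two consecutive MD5 round 3 steps, each doing its own 3-input XOR
 * (4 XORs total). x0/x1 are message words, t0/t1 are step constants. */
void r3_pair_plain(uint32_t *a, uint32_t *d, uint32_t b, uint32_t c,
    uint32_t x0, uint32_t t0, uint32_t x1, uint32_t t1)
{
	*a += (b ^ c ^ *d) + x0 + t0;	/* H(b, c, d) */
	*a = rol32(*a, 4) + b;
	*d += (*a ^ b ^ c) + x1 + t1;	/* H(a, b, c) */
	*d = rol32(*d, 11) + *a;
}

/* The same two steps with the common b ^ c computed once (3 XORs total).
 * "bc" is a new variable (hypothetical name), not a reused temporary. */
void r3_pair_shared(uint32_t *a, uint32_t *d, uint32_t b, uint32_t c,
    uint32_t x0, uint32_t t0, uint32_t x1, uint32_t t1)
{
	uint32_t bc = b ^ c;

	*a += (bc ^ *d) + x0 + t0;
	*a = rol32(*a, 4) + b;
	*d += (*a ^ bc) + x1 + t1;
	*d = rol32(*d, 11) + *a;
}

/* Returns 1 if both variants agree, starting from the MD5 initial state
 * with arbitrary message words and the first two round 3 constants. */
int r3_pair_selfcheck(void)
{
	uint32_t a1 = 0x67452301, d1 = 0x10325476;
	uint32_t a2 = a1, d2 = d1;
	const uint32_t b = 0xefcdab89, c = 0x98badcfe;

	r3_pair_plain(&a1, &d1, b, c,
	    0x01234567, 0xfffa3942, 0x89abcdef, 0x8771f681);
	r3_pair_shared(&a2, &d2, b, c,
	    0x01234567, 0xfffa3942, 0x89abcdef, 0x8771f681);
	assert(a1 == a2 && d1 == d2);
	return a1 == a2 && d1 == d2;
}
```

Whether the shared form is actually faster depends on the target: with 2-operand instructions the compiler may need an extra MOV to keep "bc" alive across the first step, so this is something to benchmark rather than assume.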