|
Message-ID: <20120709063029.GA2069@openwall.com>
Date: Mon, 9 Jul 2012 10:30:29 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Cc: Tavis Ormandy <taviso@...xchg8b.com>
Subject: Re: Rotate and bitselect investigation
magnum, Sayantan, Tavis -
On Mon, Jul 09, 2012 at 10:15:54AM +0530, Sayantan Datta wrote:
> On Mon, Jul 9, 2012 at 6:27 AM, Solar Designer <solar@...nwall.com> wrote:
> > On Mon, Jul 09, 2012 at 01:24:14AM +0530, Sayantan Datta wrote:
> > > I was able to squeeze in another bitselect optimization in the kernel
> > > resulting in 2% performance jump.
> >
> > Where did you put that other bitselect()?
>
> F(x,y,z) ((x & y) | (z & (x | y)))==F(x,y,z) (bitselect(x, y, z) ^
> bitselect(x, (uint)0, y))
Wow. I wonder if this trick for SHA-1 was known at all. Not to us, it
seems. The second bitselect() is essentially an and-not, so the speed
might be better if it's written as such (if there's an and-not
instruction). Also, I guess this change should hurt on NVIDIA (does
it?), so you'll need to wrap it in some #ifdef.
Anyway, I've just tried it on CPU (XOP). Patch attached. Here are the
speeds (best of several invocations in each case):
Before:
Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 XOP intrinsics 4x]... DONE
Raw: 28925K c/s real, 28925K c/s virtual
After:
Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 XOP intrinsics 4x]... DONE
Raw: 28435K c/s real, 28435K c/s virtual
Somehow a slowdown here, even though one instruction is saved. I guess
we incur some data dependency stall, which apparently could be avoided
with greater parallelism (interleaved instructions for two sets of SHA-1
computations - that is, for 8 of them at once).
With the sse-intrinsics.c code, there's actually a speedup from an
equivalent change:
Before:
Benchmarking: Raw SHA-1 [128/128 XOP intrinsics 8x]... DONE
Raw: 22221K c/s real, 22221K c/s virtual
Benchmarking: M$ Cache Hash 2 (DCC2) PBKDF2-HMAC-SHA-1 [128/128 XOP intrinsics 8x]... (8xOMP) DONE
Raw: 4515 c/s real, 562 c/s virtual
After:
Benchmarking: Raw SHA-1 [128/128 XOP intrinsics 8x]... DONE
Raw: 23629K c/s real, 23629K c/s virtual
Benchmarking: M$ Cache Hash 2 (DCC2) PBKDF2-HMAC-SHA-1 [128/128 XOP intrinsics 8x]... (8xOMP) DONE
Raw: 4698 c/s real, 586 c/s virtual
I guess we should commit at least the change to sse-intrinsics.c. As to
the change to rawSHA1_ng_fmt.c, maybe we should commit it in
commented-out form - e.g., replace the "#ifdef __XOP__" with "#if 0" for
now, just to have this alternative code recorded.
Finally, I think more OpenCL kernels may benefit from this - essentially
all where we have SHA-1.
Thanks,
Alexander
View attachment "john-sha1-r3-bitselect.diff" of type "text/plain" (2716 bytes)
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.