john-dev - Re: Rotate and bitselect investigation

Follow @Openwall on Twitter for new release announcements and other news

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20120709063029.GA2069@openwall.com>
Date: Mon, 9 Jul 2012 10:30:29 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Cc: Tavis Ormandy <taviso@...xchg8b.com>
Subject: Re: Rotate and bitselect investigation

magnum, Sayantan, Tavis -

On Mon, Jul 09, 2012 at 10:15:54AM +0530, Sayantan Datta wrote:
> On Mon, Jul 9, 2012 at 6:27 AM, Solar Designer <solar@...nwall.com> wrote:
> > On Mon, Jul 09, 2012 at 01:24:14AM +0530, Sayantan Datta wrote:
> > > I was able to squeeze in another bitselect optimization in the kernel
> > > resulting in 2% performance jump.
> >
> > Where did you put that other bitselect()?
> 
> F(x,y,z) ((x & y) | (z & (x | y)))==F(x,y,z) (bitselect(x, y, z) ^
> bitselect(x, (uint)0, y))

Wow.  I wonder if this trick for SHA-1 was known at all.  Not to us, it
seems.  The second bitselect() is essentially an and-not, so the speed
might be better if it's written as such (if there's an and-not
instruction).  Also, I guess this change should hurt on NVIDIA (does
it?), so you'll need to wrap it in some #ifdef.

Anyway, I've just tried it on CPU (XOP).  Patch attached.  Here are the
speeds (best of several invocations in each case):

Before:

Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 XOP intrinsics 4x]... DONE
Raw:    28925K c/s real, 28925K c/s virtual

After:

Benchmarking: Raw SHA-1 (pwlen <= 15) [128/128 XOP intrinsics 4x]... DONE
Raw:    28435K c/s real, 28435K c/s virtual

Somehow a slowdown here, even though one instruction is saved.  I guess
we incur some data dependency stall, which apparently could be avoided
with greater parallelism (interleaved instructions for two sets of SHA-1
computations - that is, for 8 of them at once).

With the sse-intrinsics.c code, there's actually a speedup from an
equivalent change:

Before:

Benchmarking: Raw SHA-1 [128/128 XOP intrinsics 8x]... DONE
Raw:    22221K c/s real, 22221K c/s virtual

Benchmarking: M$ Cache Hash 2 (DCC2) PBKDF2-HMAC-SHA-1 [128/128 XOP intrinsics 8x]... (8xOMP) DONE
Raw:    4515 c/s real, 562 c/s virtual

After:

Benchmarking: Raw SHA-1 [128/128 XOP intrinsics 8x]... DONE
Raw:    23629K c/s real, 23629K c/s virtual

Benchmarking: M$ Cache Hash 2 (DCC2) PBKDF2-HMAC-SHA-1 [128/128 XOP intrinsics 8x]... (8xOMP) DONE
Raw:    4698 c/s real, 586 c/s virtual

I guess we should commit at least the change to sse-intrinsics.c.  As to
the change to rawSHA1_ng_fmt.c, maybe we should commit it in
commented-out form - e.g., replace the "#ifdef __XOP__" with "#if 0" for
now, just to have this alternative code recorded.

Finally, I think more OpenCL kernels may benefit from this - essentially
all where we have SHA-1.

Thanks,

Alexander

View attachment "john-sha1-r3-bitselect.diff" of type "text/plain" (2716 bytes)

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.