Message-ID: <20150908101714.GA12952@openwall.com>
Date: Tue, 8 Sep 2015 13:17:14 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: md5crypt mmxput*()
On Sat, Sep 05, 2015 at 05:51:51AM +0300, Solar Designer wrote:
> magnum, Jim, Simon, Lei -
>
> Some speedup for md5crypt on CPU might be possible through vectorizing
> the mmxput*() functions, or through use of SHLD/SHRD instructions
> (available since 386) or other archs' equivalents (I think ARM has this
> too) in mmxput3() when not vectorized (somehow gcc does not do it for
> us). These functions are similar to buf_update() in cryptmd5_kernel.cl,
> where I've added uses of amd_bitalign() and NVIDIA's funnel shifter
> recently (analogous to SHLD/SHRD), and which obviously is processed on
> the SIMD units on GPUs (can do it on CPUs as well, although no SHLD/SHRD
> then, unless a given CPU architecture has them in SIMD form as well -
> need to look into that).
I looked into making mmxput3() use SHLD/SHRD, and found this comment:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55583#c6
"We can convince the current compiler to generate shrd by constructing
((((unsigned long long)a)<<32) | b) >> n"
I tried doing this, but since I'm on x86_64 I actually got a 64-bit
shift instead. Good news: it's also faster. Before:
Benchmarking: md5crypt, crypt(3) $1$ [MD5 128/128 XOP 4x2]... (8xOMP) DONE
Raw: 228864 c/s real, 28608 c/s virtual
After:
Benchmarking: md5crypt, crypt(3) $1$ [MD5 128/128 XOP 4x2]... (8xOMP) DONE
Raw: 231424 c/s real, 28928 c/s virtual
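For reference, the construct from the gcc bugzilla comment can be sketched in C as below. This is only an illustration, not the code from the attached patch: the names funnel_shr() and funnel_shr_32() are hypothetical, and the second function shows the usual two-shift 32-bit form for comparison. On x86_64 a compiler would be expected to turn the first one into a single 64-bit shift, as described above, while on 32-bit x86 it may (or may not) emit SHRD.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the attached patch.
 * Returns the low 32 bits of the 64-bit value hi:lo shifted right by n,
 * for 0 <= n <= 31.  The shift by 32 is a compile-time constant.
 */
static inline uint32_t funnel_shr(uint32_t hi, uint32_t lo, unsigned int n)
{
	return (uint32_t)((((uint64_t)hi << 32) | lo) >> n);
}

/*
 * Plain 32-bit form with two variable shift counts, for comparison.
 * Only valid for 1 <= n <= 31 (a shift by 32 would be undefined).
 */
static inline uint32_t funnel_shr_32(uint32_t hi, uint32_t lo, unsigned int n)
{
	return (lo >> n) | (hi << (32 - n));
}
```

Both forms should give the same result for n in 1..31; only the first also handles n == 0.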
Patch attached. I only tested this on bull so far. I put an #if in
there enabling the 64-bit approach on any 64-bit arch as well as on
32-bit x86 (expecting that SHRD would be generated, as per the comment
above) - but this needs to be tested on different machines and with
different versions of gcc (I wouldn't be surprised if there's a
regression with some old version).
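A rough sketch of what such an #if could look like follows. The macro name ALIGN_VIA_64BIT, the function align_right(), and the exact set of predefined compiler macros tested are assumptions made for illustration; the condition in the attached patch may well be different.

```c
#include <stdint.h>

/*
 * Hypothetical selection macro: use the 64-bit approach on x86 (32- or
 * 64-bit) and on LP64 targets.  Not the condition from the actual patch.
 */
#if defined(__x86_64__) || defined(__i386__) || \
    defined(__LP64__) || defined(_LP64)
#define ALIGN_VIA_64BIT 1
#else
#define ALIGN_VIA_64BIT 0
#endif

/* Low 32 bits of hi:lo >> n, for 0 <= n <= 31 */
static inline uint32_t align_right(uint32_t hi, uint32_t lo, unsigned int n)
{
#if ALIGN_VIA_64BIT
	/* One 64-bit shift on 64-bit archs; possibly SHRD on 32-bit x86 */
	return (uint32_t)((((uint64_t)hi << 32) | lo) >> n);
#else
	/* Two 32-bit shifts; n == 0 is special-cased to avoid a shift by 32 */
	return n ? (lo >> n) | (hi << (32 - n)) : lo;
#endif
}
```

Either branch should return the same values, so the condition can be tuned per CPU and compiler version without changing results.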
I think further speedup is possible by using a switch statement to make
the shift counts into constants (we have an if anyway, we'll just
replace it with a switch) like cryptmd5_kernel.cl has. And indeed by
vectorizing this. But for a trivial patch, the above speedup isn't bad.
In fact, with a switch there might not be a speedup from the 64-bit
shift anymore (except on 32-bit x86, where it should enable SHRD).
I think the speedup comes from one of the shift counts becoming a
constant now (the constant 32), but with a switch all of them would be
constants anyway. So maybe that #if would need to be revised then.
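To illustrate the switch idea, here is a minimal sketch in which every shift count is a compile-time constant. It is not mmxput3() itself and not the code in cryptmd5_kernel.cl: put_word() is a hypothetical scalar helper that stores one 32-bit word at byte offset 0..3 into a pair of little-endian 32-bit buffer words, preserving the bytes already present below that offset.

```c
#include <stdint.h>

/*
 * Hypothetical helper for illustration only.
 * Writes src into dst[0] (and, when needed, dst[1]) starting at byte
 * offset off within dst[0], assuming little-endian byte order within
 * each 32-bit word.  Each case uses constant shift counts.
 */
static inline void put_word(uint32_t *dst, uint32_t src, unsigned int off)
{
	switch (off & 3) {
	case 0:
		dst[0] = src;
		break;
	case 1:
		dst[0] = (dst[0] & 0x000000ffU) | (src << 8);
		dst[1] = src >> 24;
		break;
	case 2:
		dst[0] = (dst[0] & 0x0000ffffU) | (src << 16);
		dst[1] = src >> 16;
		break;
	case 3:
		dst[0] = (dst[0] & 0x00ffffffU) | (src << 24);
		dst[1] = src >> 8;
		break;
	}
}
```

With all counts constant like this, the 64-bit trick would presumably no longer save a variable shift, which is the reason the #if might need revisiting.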
If we're not going to vectorize this soon, then maybe the next steps are
to try switch and then to tune the #if based on testing on different
CPUs and compiler versions.
... or just test the attached patch a bit more and commit it.
Maybe we should enable the 64-bit approach for ARM as well (IIRC, it has
a similar instruction to 386's SHRD).
Alexander
View attachment "john-md5crypt-bitalign.diff" of type "text/plain" (1804 bytes)