|
Message-ID: <20150311205509.GA19294@openwall.com>
Date: Wed, 11 Mar 2015 23:55:09 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: bitslice MD*/SHA*, AVX2
Hi,
So I was thinking whether to drop trying bitslice SHA-2 from the "John
the Ripper SIMD support enhancements" / "Better SIMD implementations of
SHA-2 hashes and higher-level schemes" GSoC sub-task description, since
this portion is too(?) unlikely to result in better performance on
current hardware.
I dug up the latest revision of my md5slice.c experiment (attached to
this message) and tried it out on our i7-4770K. Here's the speed with
128-bit AVX, building with "gcc version 4.6.3 (Ubuntu/Linaro
4.6.3-1ubuntu5)":
solar@...l:~/md5slice$ gcc md5slice.c -o md5slice -Wall -s -O3 -fomit-frame-pointer -funroll-loops -DVECTOR -march=native
solar@...l:~/md5slice$ time ./md5slice
vector size = 128 bits
c09c4c1f 21876746 18aed2 70b452f0
real 0m0.630s
user 0m0.628s
sys 0m0.000s
That's for 10 million MD5s, giving us a speed of 10M/0.63 = ~15.9M MD5/s.
To try AVX2, I changed one line:
typedef element vector __attribute__ ((vector_size (16)));
to:
typedef element vector __attribute__ ((vector_size (32)));
and built with:
solar@...l:~/md5slice$ export PATH=~gcc/gcc-4.9-20130707/bin:$PATH
solar@...l:~/md5slice$ gcc md5slice.c -o md5slice -Wall -s -O3 -fomit-frame-pointer -funroll-loops -DVECTOR -march=native
This gave "warning: always_inline function might not be inlinable" about
FF(), I(), H(), F(), add32r(), add32c(), add32() - but then it built
fine. The speed is:
solar@...l:~/md5slice$ time ./md5slice
vector size = 256 bits
c09c4c1f 21876746 18aed2 70b452f0
real 0m0.354s
user 0m0.348s
sys 0m0.004s
which corresponds to 10M/0.35 = ~28.5 MD5/s.
For comparison, a non-OpenMP build of john-1.8.0-jumbo-1 on this machine
achieves:
solar@...l:~/j/john-1.8.0-jumbo-1/run$ ./john -te -form=raw-md5
Benchmarking: Raw-MD5 [MD5 128/128 AVX 12x]... DONE
Raw: 31923K c/s real, 31923K c/s virtual
So the speed is finally comparable. That's AVX vs. AVX2, so a
straightforward implementation of MD5 with AVX2 will likely run faster
yet, but the results so far are not conclusively anti-bitslice.
And the ease of upgrading from 128-bit AVX to 256-bit AVX2 with a
one-line change (a two-character change, even) is appealing. BTW, I did
review the generated assembly code to confirm that it was 128-bit AVX
and 256-bit AVX2 in these two tests as expected.
There's definitely some further room for optimization in md5slice.c -
for example, it does not yet take advantage of the common subexpressions
in the invocations of HH() (we can save 8*32 XORs). FF() has everything
merged into it manually and specialized, whereas GG(), HH(), and II()
use other functions. Also, explicit (such as via cpp or with a script)
unrolling and inlining might (or might not) work better than gcc's (the
various *ptr++'s will be replaced with explicit array accesses at fixed
indices, which I think gcc does for us anyway).
Overall, based on this test, I think I'll leave the task listed. And we
might even need to give it a try for MD4, MD5, and SHA-1 if AVX2 somehow
does not speedup straightforward implementations of them a lot (although
from this test I expect that it will, and the bitslice implementations
will remain much slower than straightforward).
Alexander
View attachment "md5slice.c" of type "text/x-c" (10797 bytes)
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.