john-dev - XOP for MD5/MD4/SHA-1

Message-ID: <20120318012753.GA19597@openwall.com>
Date: Sun, 18 Mar 2012 05:27:53 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: XOP for MD5/MD4/SHA-1

On Thu, Mar 15, 2012 at 12:34:16AM +0200, Milen Rangelov wrote:
> Actually (I know that's offtopic) this CPU demonstrates some weird
> behavior. With my SSE2 code, a 4-core Phenom II @3.2GHz is almost as fast
> as the 6-core FX-6100 @3.3 GHz. At first that seemed strange, then I
> implemented the XOP codepaths for MD5/MD4/SHA1 and then things look better
> now, the _mm_roti_epi32/_mm_cmov_si128 optimizations lead to ~ 40%
> improvement as compared to the SSE2 code (still worse than what I expected
> though). Same for SHA1 and MD4.

I've just tried the above for MD5 in JtR.

Originally, a linux-x86-64i build gave me:

Benchmarking: FreeBSD MD5 [SSE2i 12x]... (8xOMP) DONE
Raw:    159531 c/s real, 20002 c/s virtual

When I rebuilt as linux-x86-64, I got:

Benchmarking: FreeBSD MD5 [SSE2i 12x]... (8xOMP) DONE
Raw:    141168 c/s real, 17614 c/s virtual

So indeed the Intel compiler's code had an advantage (about 13%) over
gcc's, even on this AMD CPU.  This is "gcc version 4.6.3 (Ubuntu/Linaro
4.6.3-1ubuntu3)".

Then I rebuilt as linux-x86-64-xop and got:

Benchmarking: FreeBSD MD5 [SSE2i 12x]... (8xOMP) DONE
Raw:    142191 c/s real, 17741 c/s virtual

With "objdump -d sse-intrinsics.o | less", I confirmed that I got
3-operand AVX instructions there - so apparently this alone only gave a
1% speedup here.  (Maybe a greater speedup would be possible with some
tuning, such as of MD5_SSE_PARA.)

Then I started editing sse-intrinsics.c to actually use XOP.  I made
MD5_F() and MD5_G() use _mm_cmov_si128().  This gave me:

Benchmarking: FreeBSD MD5 [SSE2i 12x]... (8xOMP) DONE
Raw:    153300 c/s real, 19143 c/s virtual
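
A minimal sketch of why a bitwise select fits MD5_F() and MD5_G(): both
round functions pick each result bit from one of two inputs under
control of a third, which is the operation _mm_cmov_si128() performs in
one instruction.  The helper names below (bit_select, md5_f_sel, and so
on) are hypothetical and are not taken from sse-intrinsics.c; the XOP
part is only compiled when the compiler defines __XOP__ (e.g. with
"-mxop"), and the operand order shown there is an assumption based on
the VPCMOV definition.

```c
#include <stdint.h>

/* Bitwise select: each result bit comes from a where sel is 1, else from b. */
static inline uint32_t bit_select(uint32_t a, uint32_t b, uint32_t sel)
{
    return (a & sel) | (b & ~sel);
}

/* Textbook MD5 round functions F and G (RFC 1321), on one 32-bit lane. */
static inline uint32_t md5_f_ref(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) | (~x & z);
}

static inline uint32_t md5_g_ref(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & z) | (y & ~z);
}

/* The same functions expressed as selects (hypothetical helper names). */
static inline uint32_t md5_f_sel(uint32_t x, uint32_t y, uint32_t z)
{
    return bit_select(y, z, x);   /* x chooses between y and z */
}

static inline uint32_t md5_g_sel(uint32_t x, uint32_t y, uint32_t z)
{
    return bit_select(x, y, z);   /* z chooses between x and y */
}

#ifdef __XOP__
#include <x86intrin.h>
/*
 * Assumption: _mm_cmov_si128(a, b, sel) computes (a & sel) | (b & ~sel)
 * over all 128 bits, i.e. four 32-bit MD5 lanes at once.
 */
static inline __m128i md5_f_xop(__m128i x, __m128i y, __m128i z)
{
    return _mm_cmov_si128(y, z, x);
}

static inline __m128i md5_g_xop(__m128i x, __m128i y, __m128i z)
{
    return _mm_cmov_si128(x, y, z);
}
#endif
```

Without XOP, each of these functions takes several SSE2 logic
instructions per vector, so the select form saves instructions in the
two rounds that use F and G.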

Finally, I edited MD5_STEP() to use _mm_roti_epi32(), and I got:

Benchmarking: FreeBSD MD5 [SSE2i 12x]... (8xOMP) DONE
Raw:    175200 c/s real, 21924 c/s virtual
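
A rough sketch of what the rotate change amounts to: SSE2 has no vector
rotate, so a 32-bit rotate is built from two shifts and an OR, while XOP
provides it as a single instruction via _mm_roti_epi32().  The names
rotl32, md5_step_scalar and VROTL32 are hypothetical and do not
reproduce the actual MD5_STEP() macro; the scalar functions only model
one 32-bit lane.

```c
#include <stdint.h>

/* Scalar rotate-left of one 32-bit lane; valid for 1 <= s <= 31. */
static inline uint32_t rotl32(uint32_t v, unsigned s)
{
    return (v << s) | (v >> (32 - s));
}

/*
 * One MD5 step on a single lane: a = b + rotl(a + f + m + k, s),
 * where f is the round function output, m the message word and
 * k the round constant.
 */
static inline uint32_t md5_step_scalar(uint32_t a, uint32_t b, uint32_t f,
                                       uint32_t m, uint32_t k, unsigned s)
{
    return b + rotl32(a + f + m + k, s);
}

#if defined(__XOP__)
#include <x86intrin.h>
/* XOP: one rotate instruction per vector; s must be a constant. */
#define VROTL32(v, s) _mm_roti_epi32((v), (s))
#elif defined(__SSE2__)
#include <emmintrin.h>
/* SSE2: emulate the rotate with shift left, shift right, and OR. */
#define VROTL32(v, s) \
    _mm_or_si128(_mm_slli_epi32((v), (s)), _mm_srli_epi32((v), 32 - (s)))
#endif
```

Every one of the 64 MD5 steps contains a rotate, which is consistent
with this change giving a larger gain than the select change did.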

Now that's some success.  At least we don't need an *i-xop target (but
an *i-avx target would be useful for Intel CPUs).
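
For reference, the relative changes between the builds above can be
worked out from the "real" c/s figures.  This is just arithmetic on the
numbers quoted in this message; the function names are hypothetical.

```c
#include <stdio.h>

/* Percentage change of new_cps relative to base_cps. */
double speedup_percent(double base_cps, double new_cps)
{
    return (new_cps / base_cps - 1.0) * 100.0;
}

/* Print each build's change relative to the plain gcc build. */
void print_speedups(void)
{
    const double gcc_sse2 = 141168.0;   /* linux-x86-64 */
    const double icc_sse2 = 159531.0;   /* linux-x86-64i */
    const double gcc_avx  = 142191.0;   /* linux-x86-64-xop, unmodified source */
    const double gcc_cmov = 153300.0;   /* plus _mm_cmov_si128() in F and G */
    const double gcc_roti = 175200.0;   /* plus _mm_roti_epi32() in the step */

    printf("icc vs gcc:           %+.1f%%\n", speedup_percent(gcc_sse2, icc_sse2));
    printf("3-operand AVX only:   %+.1f%%\n", speedup_percent(gcc_sse2, gcc_avx));
    printf("with cmov:            %+.1f%%\n", speedup_percent(gcc_sse2, gcc_cmov));
    printf("with cmov and roti:   %+.1f%%\n", speedup_percent(gcc_sse2, gcc_roti));
    printf("gcc XOP vs icc SSE2:  %+.1f%%\n", speedup_percent(icc_sse2, gcc_roti));
}
```

Calling print_speedups() from a main() prints the five percentages; the
last line compares the final gcc XOP build against the original Intel
compiler build.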

"make testpara" with "-mxop" gave me:

model name      : AMD FX(tm)-8120 Eight-Core Processor
gcc "gcc" version: gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu3)
Best -m64 paras:
MD4:    para_3 60640K c/s       para_2 49710K c/s       para_1 27280K c/s
MD5:    para_2 44150K c/s       para_1 24314K c/s
MD5C:   para_2 201K c/s para_3 175K c/s para_1 128K c/s
SHA1:   para_2 4x16 24855K c/s  para_2 4x80 21509K c/s  para_1 4x16 16777K c/s para_1 4x80 12807K c/s

Note that I haven't modified MD4 and SHA-1 to actually use XOP yet (so
this should be AVX rather than XOP here), and that for raw MD5 para_2
was a lot better than para_3 (but the latter is better for MD5-crypt).

It did not test para_4, so I've just tried that manually:

Benchmarking: FreeBSD MD5 [SSE2i 16x]... (8xOMP) DONE
Raw:    172126 c/s real, 21604 c/s virtual

(a slight slowdown compared to para_3, which gave 175200 c/s above).

Alexander