Message-ID: <20120318012753.GA19597@openwall.com>
Date: Sun, 18 Mar 2012 05:27:53 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: XOP for MD5/MD4/SHA-1

On Thu, Mar 15, 2012 at 12:34:16AM +0200, Milen Rangelov wrote:
> Actually (I know that's offtopic) this CPU demonstrates some weird
> behavior. With my SSE2 code, a 4-core Phenom II @3.2GHz is almost as fast
> as the 6-core FX-6100 @3.3 GHz. At first that seemed strange, then I
> implemented the XOP codepaths for MD5/MD4/SHA1 and then things look better
> now, the _mm_roti_epi32/_mm_cmov_si128 optimizations lead to ~ 40%
> improvement as compared to the SSE2 code (still worse than what I expected
> though). Same for SHA1 and MD4.

I've just tried the above for MD5 in JtR.

Originally, a linux-x86-64i build gave me:

Benchmarking: FreeBSD MD5 [SSE2i 12x]... (8xOMP) DONE
Raw:    159531 c/s real, 20002 c/s virtual

When I rebuilt as linux-x86-64, I got:

Benchmarking: FreeBSD MD5 [SSE2i 12x]... (8xOMP) DONE
Raw:    141168 c/s real, 17614 c/s virtual

So indeed Intel compiler's code had an advantage over gcc's even for
this AMD CPU.  This is "gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu3)".

Then I rebuilt as linux-x86-64-xop and got:

Benchmarking: FreeBSD MD5 [SSE2i 12x]... (8xOMP) DONE
Raw:    142191 c/s real, 17741 c/s virtual

With "objdump -d sse-intrinsics.o | less", I confirmed that I got
3-operand AVX instructions there - so apparently this only gave a 1%
speedup here.  (Maybe greater speedup would be possible with some
tuning, such as of MD5_SSE_PARA.)

Then I started editing sse-intrinsics.c to actually use XOP.  I made
MD5_F() and MD5_G() use _mm_cmov_si128().  This gave me:

Benchmarking: FreeBSD MD5 [SSE2i 12x]... (8xOMP) DONE
Raw:    153300 c/s real, 19143 c/s virtual
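
For illustration, the MD5_F()/MD5_G() change could look roughly like the
sketch below.  This is not the actual sse-intrinsics.c edit: vsel() and
md5_fg_check() are hypothetical names, an x86 compiler with SSE2 is
assumed, and the non-XOP branch is just a generic SSE2 fallback so the
same file builds with or without -mxop.  _mm_cmov_si128(a, b, mask)
selects bits of a where mask is 1 and bits of b where mask is 0, which
is exactly the shape of both round functions.

```c
#include <assert.h>
#include <stdint.h>
#ifdef __XOP__
#include <x86intrin.h>          /* XOP intrinsics; build with gcc -mxop */
#else
#include <emmintrin.h>          /* plain SSE2 fallback */
#endif

/* Bitwise select: bits of a where mask is 1, bits of b where mask is 0.
 * vsel() is a hypothetical helper name, not one from JtR. */
static inline __m128i vsel(__m128i a, __m128i b, __m128i mask)
{
#ifdef __XOP__
    return _mm_cmov_si128(a, b, mask);          /* a single VPCMOV */
#else
    /* b ^ (mask & (a ^ b)): three SSE2 operations */
    return _mm_xor_si128(b, _mm_and_si128(mask, _mm_xor_si128(a, b)));
#endif
}

/* MD5 round functions (RFC 1321) on four 32-bit lanes at once. */
#define MD5_F(x, y, z) vsel((y), (z), (x))      /* (x & y) | (~x & z) */
#define MD5_G(x, y, z) vsel((x), (y), (z))      /* (x & z) | (y & ~z) */

/* Hypothetical self-check: compare lane 0 with the scalar definitions.
 * Returns 1 when both vector results match. */
static int md5_fg_check(uint32_t x, uint32_t y, uint32_t z)
{
    __m128i vx = _mm_set1_epi32((int)x);
    __m128i vy = _mm_set1_epi32((int)y);
    __m128i vz = _mm_set1_epi32((int)z);
    uint32_t f = (uint32_t)_mm_cvtsi128_si32(MD5_F(vx, vy, vz));
    uint32_t g = (uint32_t)_mm_cvtsi128_si32(MD5_G(vx, vy, vz));

    return f == ((x & y) | (~x & z)) && g == ((x & z) | (y & ~z));
}
```

The point of the change is the operation count: one instruction per
select with XOP, versus three logical operations in the SSE2 form.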

Finally, I edited MD5_STEP() to use _mm_roti_epi32(), and I got:

Benchmarking: FreeBSD MD5 [SSE2i 12x]... (8xOMP) DONE
Raw:    175200 c/s real, 21924 c/s virtual
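
Again for illustration only, a rotate built on _mm_roti_epi32() could be
sketched as below.  VROTL32() and rotl7_lane0() are hypothetical names,
not the ones in sse-intrinsics.c, and an x86 compiler with SSE2 is
assumed.  It is a macro because _mm_roti_epi32() takes the rotate count
as a compile-time constant (a positive count rotates left); the non-XOP
branch is the usual shift/shift/or sequence.

```c
#include <assert.h>
#include <stdint.h>
#ifdef __XOP__
#include <x86intrin.h>          /* XOP intrinsics; build with gcc -mxop */
/* One VPROTD per rotate; n must be a compile-time constant. */
#define VROTL32(v, n) _mm_roti_epi32((v), (n))
#else
#include <emmintrin.h>          /* plain SSE2 fallback */
/* Two shifts plus an OR; note that v is evaluated twice here. */
#define VROTL32(v, n) \
    _mm_or_si128(_mm_slli_epi32((v), (n)), _mm_srli_epi32((v), 32 - (n)))
#endif

/* Hypothetical helper: rotate all four lanes left by 7 (the first MD5
 * shift amount) and return lane 0 so it can be compared with scalar code. */
static uint32_t rotl7_lane0(uint32_t x)
{
    __m128i v = _mm_set1_epi32((int)x);
    __m128i r = VROTL32(v, 7);

    return (uint32_t)_mm_cvtsi128_si32(r);
}
```

MD5 does one such rotate in each of its 64 steps, so replacing three
SSE2 operations with one instruction here touches every step.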

Now that's some success: the gcc build with XOP (175200 c/s) beats the
icc build (159531 c/s).  At least we don't need an *i-xop target (but
an *i-avx target would be useful for Intel CPUs).

"make testpara" with "-mxop" gave me:

model name      : AMD FX(tm)-8120 Eight-Core Processor
gcc "gcc" version: gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu3)
Best -m64 paras:
MD4:    para_3 60640K c/s       para_2 49710K c/s       para_1 27280K c/s
MD5:    para_2 44150K c/s       para_1 24314K c/s
MD5C:   para_2 201K c/s para_3 175K c/s para_1 128K c/s
SHA1:   para_2 4x16 24855K c/s  para_2 4x80 21509K c/s  para_1 4x16 16777K c/s para_1 4x80 12807K c/s

Note that I haven't modified MD4 and SHA-1 to actually use XOP yet (so
this should be AVX rather than XOP here), and that for raw MD5 para_2
was a lot better than para_3 (but the latter is better for MD5-crypt).

"make testpara" did not try para_4, so I've just tried it manually:

Benchmarking: FreeBSD MD5 [SSE2i 16x]... (8xOMP) DONE
Raw:    172126 c/s real, 21604 c/s virtual

(a slight slowdown compared to para_3's 175200 c/s).

Alexander
