Message-ID: <20120318012753.GA19597@openwall.com>
Date: Sun, 18 Mar 2012 05:27:53 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: XOP for MD5/MD4/SHA-1

On Thu, Mar 15, 2012 at 12:34:16AM +0200, Milen Rangelov wrote:
> Actually (I know that's offtopic) this CPU demonstrates some weird
> behavior. With my SSE2 code, a 4-core Phenom II @3.2GHz is almost as fast
> as the 6-core FX-6100 @3.3 GHz. At first that seemed strange, then I
> implemented the XOP codepaths for MD5/MD4/SHA1 and then things look better
> now, the _mm_roti_epi32/_mm_cmov_si128 optimizations lead to ~ 40%
> improvement as compared to the SSE2 code (still worse than what I expected
> though). Same for SHA1 and MD4.

I've just tried the above for MD5 in JtR.  Originally, a linux-x86-64i
build gave me:

Benchmarking: FreeBSD MD5 [SSE2i 12x]... (8xOMP) DONE
Raw:    159531 c/s real, 20002 c/s virtual

when I rebuilt as linux-x86-64, I got:

Benchmarking: FreeBSD MD5 [SSE2i 12x]... (8xOMP) DONE
Raw:    141168 c/s real, 17614 c/s virtual

So indeed Intel compiler's code had an advantage over gcc's even for
this AMD CPU.  This is "gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu3)".

Then I rebuilt as linux-x86-64-xop and got:

Benchmarking: FreeBSD MD5 [SSE2i 12x]... (8xOMP) DONE
Raw:    142191 c/s real, 17741 c/s virtual

With "objdump -d sse-intrinsics.o | less", I confirmed that I got
3-operand AVX instructions there - so apparently this only gave a 1%
speedup here.  (Maybe greater speedup would be possible with some
tuning, such as of MD5_SSE_PARA.)

Then I started editing sse-intrinsics.c to actually use XOP.  I made
MD5_F() and MD5_G() use _mm_cmov_si128().  This gave me:

Benchmarking: FreeBSD MD5 [SSE2i 12x]... (8xOMP) DONE
Raw:    153300 c/s real, 19143 c/s virtual

Finally, I edited MD5_STEP() to use _mm_roti_epi32(), and I got:

Benchmarking: FreeBSD MD5 [SSE2i 12x]... (8xOMP) DONE
Raw:    175200 c/s real, 21924 c/s virtual

Now that's some success.
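For readers who want to see why these two intrinsics fit MD5, here is a
minimal scalar sketch in portable C.  It is not the actual
sse-intrinsics.c change: bitselect() and rotl32() are hypothetical
stand-ins that model, on a single 32-bit word, what _mm_cmov_si128()
and _mm_roti_epi32() compute on a vector, and the md5_*_ref/md5_*_sel
names are made up for illustration.  The point is only that MD5's F and
G are each a single bitwise select, and that the rotate in each step,
which SSE2 has to build from two shifts and an OR, is a single
operation.

```c
#include <assert.h>
#include <stdint.h>

/* Per-bit select: take the bit from a where mask is 1, else from b.
 * Scalar model of _mm_cmov_si128(a, b, mask); hypothetical helper. */
static uint32_t bitselect(uint32_t a, uint32_t b, uint32_t mask)
{
    return (a & mask) | (b & ~mask);
}

/* Rotate left by n bits, 0 < n < 32.
 * Scalar model of one 32-bit lane of _mm_roti_epi32(x, n). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* MD5 round functions F and G as written in RFC 1321. */
static uint32_t md5_f_ref(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) | (~x & z);
}

static uint32_t md5_g_ref(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & z) | (y & ~z);
}

/* The same functions, each expressed as one select. */
static uint32_t md5_f_sel(uint32_t x, uint32_t y, uint32_t z)
{
    return bitselect(y, z, x);   /* x chooses between y and z */
}

static uint32_t md5_g_sel(uint32_t x, uint32_t y, uint32_t z)
{
    return bitselect(x, y, z);   /* z chooses between x and y */
}

/* Returns 1 if the select forms agree with the reference forms on a
 * stream of pseudo-random inputs, 0 otherwise. */
static int xop_identities_hold(void)
{
    uint32_t s = 0x12345678u;
    for (int i = 0; i < 1000; i++) {
        uint32_t x = s = s * 1664525u + 1013904223u;
        uint32_t y = s = s * 1664525u + 1013904223u;
        uint32_t z = s = s * 1664525u + 1013904223u;
        if (md5_f_sel(x, y, z) != md5_f_ref(x, y, z))
            return 0;
        if (md5_g_sel(x, y, z) != md5_g_ref(x, y, z))
            return 0;
    }
    return rotl32(0x12345678u, 8) == 0x34567812u;
}
```

A caller could simply assert(xop_identities_hold()) before switching a
build over to the select-based forms.
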
At least we don't need an *i-xop target (but an *i-avx target would be
useful for Intel CPUs).

"make testpara" with "-mxop" gave me:

model name       : AMD FX(tm)-8120 Eight-Core Processor
gcc "gcc" version: gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu3)
Best -m64 paras:
  MD4:  para_3 60640K c/s  para_2 49710K c/s  para_1 27280K c/s
  MD5:  para_2 44150K c/s  para_1 24314K c/s
  MD5C: para_2 201K c/s  para_3 175K c/s  para_1 128K c/s
  SHA1: para_2 4x16 24855K c/s  para_2 4x80 21509K c/s
        para_1 4x16 16777K c/s  para_1 4x80 12807K c/s

Note that I haven't modified MD4 and SHA-1 to actually use XOP yet (so
this should be AVX rather than XOP here), and that for raw MD5 para_2
was a lot better than para_3 (but the latter is better for MD5-crypt).
It did not test 4, so I've just tried it manually:

Benchmarking: FreeBSD MD5 [SSE2i 16x]... (8xOMP) DONE
Raw:    172126 c/s real, 21604 c/s virtual

(slight slowdown compared to 3).

Alexander