Message-ID: <20100517233248.GB9735@openwall.com>
Date: Tue, 18 May 2010 03:32:48 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: C compiler generated SSE2 code (was: clang benchmarks)

On Mon, May 17, 2010 at 11:55:18AM +0200, bartavelle@...quise.net wrote:
> Now with my MD5 implementation (uses SSE intrinsics):

Can you upload it to the wiki, please?

http://openwall.info/wiki/john/patches

> gcc   15696
> icc   32364
> clang 19644

It's a bit weird that gcc performs so poorly here.  I am getting
near-perfect SSE2 code for bitslice DES (an unreleased revision of the
source code) with gcc 4.5.0.  In fact, with properly tweaked compiler
options (primarily to control function inlining), it slightly
outperforms the hand-crafted SSE2 assembly code currently in JtR.

Also, I found gcc's "statement expressions" extension very handy for
mixing SSE2/MMX/native instructions (for virtual vector sizes of 192
and 256 bits) in expressions and function calls:

http://gcc.gnu.org/onlinedocs/gcc-4.5.0/gcc/Statement-Exprs.html

Surprisingly, this extension is supported by Sun Studio as well (I did
not check other compilers).  It let me define things such as:

#define x(p, q) ({ vtype t; vxor(t, *(vtype *)&e[p] ed, *(vtype *)&k[q] kd); t; })
#define y(p, q) ({ vtype t; vxor(t, *(vtype *)&b[p] bd, *(vtype *)&k[q] kd); t; })
#define z(r) ((vtype *)&b[r] bd)

and call the DES S-box functions (usually to be inlined) as:

	s1(x(0, 0), x(1, 1), x(2, 2), x(3, 3), x(4, 4), x(5, 5),
	    z(40), z(48), z(54), z(62));

vtype, vxor(), etc. could be defined as:

#elif DES_BS_VECTOR == 4
typedef struct { __m128i f; __m64 g; long h; } vtype;
#define vxor(dst, a, b) \
	(dst).f = _mm_xor_si128((a).f, (b).f); \
	(dst).g = _mm_xor_si64((a).g, (b).g); \
	(dst).h = (a).h ^ (b).h;

That's for 256-bit vectors.  A minor difficulty with 192-bit vectors
was that they needed to be 256-bit aligned (for the SSE2 portion to be
128-bit aligned), which required changes to other source files - yet I
got around this difficulty and tried those out as well (both kinds).

Overall, this did not provide a speedup (on most CPUs the code became
slower per-bit, although this could be different on future CPUs), but
I was pleased with the low cost (my time) of this experiment.  The
assembly code generated by gcc looked reasonable (a nice mix of
instructions, no obviously unneeded moves).  This approach could be of
more benefit for other hash types, where there's insufficient
parallelism otherwise (perhaps the DES S-boxes had sufficient
parallelism to almost fully exploit SSE2, which is why mixing in
64-bit instructions would slow things down per-bit most of the time).

> gcc version 4.3.2 (Debian 4.3.2-1.1)

You may want to try 4.5.0 (build it from source):

http://openwall.info/wiki/internal/gcc-local-build

On the other hand, with properly tuned source code using SSE2
intrinsics, even going from 4.5.0 to 3.4.5 (yes, this old!) resulted
in only a 10% slowdown on bitslice DES for me.  So perhaps there's
something to tweak in your source code to make it gcc-friendly.

One curious feature of gcc (not found in Sun Studio at least) is that
it is able to generate SSE2 instructions for 128-bit bitwise ops even
if the source code does not use intrinsics explicitly (that is, if it
uses the usual C operators for the bitwise ops, but the operands are
of 128-bit vector types).  gcc 4.5.0 generates almost(?) the same code
(near-perfect) whether you use intrinsics or not.  On the other hand,
without explicit use of intrinsics, gcc 3.4.5 generates awful code.
Sun Studio refuses to compile such source code (but it works fine when
the source code does use intrinsics).

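For illustration, here's a minimal sketch of what such no-intrinsics
source code can look like - a made-up standalone example using gcc's
vector_size extension, not code from JtR, and the names are mine.
Built with gcc -O2 -msse2, the ^ below should become an SSE2 pxor:

#include <stdio.h>

/* 128-bit vector of two 64-bit lanes, declared without any intrinsics */
typedef unsigned long long v128 __attribute__ ((vector_size (16)));

static v128 vxor128(v128 a, v128 b)
{
	return a ^ b;	/* usual C operator on a 128-bit vector type */
}

int main(void)
{
	v128 a = { 0x5555555555555555ULL, 0x5555555555555555ULL };
	v128 b = { 0x0f0f0f0f0f0f0f0fULL, 0x0f0f0f0f0f0f0f0fULL };
	union { v128 v; unsigned long long u[2]; } c;	/* for printing */

	c.v = vxor128(a, b);
	printf("%016llx %016llx\n", c.u[0], c.u[1]);	/* 5a5a... twice */
	return 0;
}

The same plain-operator style works for & and | on such integer vector
types as well.
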
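And going back to the statement-expression approach, here's a
stripped-down, self-contained sketch of the same idea - made-up names
and a 192-bit vtype of one SSE2 word plus one MMX word, not the actual
DES_bs code.  It builds with gcc -O2 -msse2 -mmmx (both on by default
for x86-64):

#include <stdio.h>
#include <mmintrin.h>	/* __m64, _mm_xor_si64(), _mm_empty() */
#include <emmintrin.h>	/* __m128i, _mm_xor_si128() */

typedef struct { __m128i f; __m64 g; } vtype;	/* 128 + 64 = 192 bits */

/* multi-statement macro: one virtual XOR done as two real ones */
#define vxor(dst, a, b) \
	(dst).f = _mm_xor_si128((a).f, (b).f); \
	(dst).g = _mm_xor_si64((a).g, (b).g);

/* gcc statement expression: the trailing "t" is the value it yields */
#define vxorf(a, b) ({ vtype t; vxor(t, (a), (b)); t; })

int main(void)
{
	vtype a, b, c;
	unsigned int out[4];

	a.f = _mm_set1_epi32(0x55555555);
	a.g = _mm_set1_pi32(0x55555555);
	b.f = _mm_set1_epi32(0x0f0f0f0f);
	b.g = _mm_set1_pi32(0x0f0f0f0f);

	c = vxorf(a, b);	/* used as an expression, not a statement */

	_mm_storeu_si128((__m128i *)out, c.f);
	printf("SSE2 part: %08x, MMX part: %08x\n",
	    out[0], (unsigned int)_mm_cvtsi64_si32(c.g));	/* 5a5a5a5a */

	_mm_empty();	/* leave MMX state before returning */
	return 0;
}

The point is that vxorf() can appear wherever a value is expected -
e.g. as an argument to an S-box function - even though vxor() itself
expands to several statements.
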
Alexander