|
Message-ID: <20150903151529.GA12605@openwall.com> Date: Thu, 3 Sep 2015 18:15:29 +0300 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Subject: MD5 I() (was: SHA-1 H()) On Wed, Sep 02, 2015 at 06:52:25PM +0300, Solar Designer wrote: > Attached to this message is a program I used to search for possible > optimized expressions like this. [...] I was hoping > it might find a 2 operation expression for MD5's I(), but no luck. I've enhanced the program, and had better luck today: $ ./search3 y n n n 2>&1 | cut -d' ' -f7- | sort -u 165 sel(ff, 55, 0f) = f5; 33 ^ f5 = c6; sel(ff, 55, 0f) = f5; f5 ^ 33 = c6; #define I(x, y, z) (bitselect(0xffffffffU, (x), (z)) ^ (y)) To remind, the original was: #define I(x, y, z) ((y) ^ ((x) | ~(z))) I think it's the first time MD5 I() has been shown to be implementable with only 2 operations on architectures without OR-NOT (and without XNOR, and of course also without "ternary" logic instructions such as Maxwell's or AVX-512's, which would turn this into 1 operation with no effort). Now that I think of it, the expression is actually very simple and I should have been able to arrive at it without a program. bitselect() with the all-ones constant is directly usable to implement OR-NOT. :-) > It doesn't yet test two bitselect()'s per expression, though - this is > worth adding and trying again (many possibilities to test there). It does now, and it also tests constants as you can see. New version attached. And here's a table produced with it: $ ./search3.sh SEL XNOR ORN ANDN COUNT MD5_I yes yes yes yes 190 yes yes yes yes no 190 yes yes yes no yes 190 yes yes yes no no 178 yes yes no yes yes 177 yes yes no yes no 177 yes yes no no yes 177 yes yes no no no 165 yes no yes yes yes 144 yes no yes yes no 114 yes no yes no yes 114 yes no yes no no 72 no no no yes yes 131 yes no no yes no 95 yes no no no yes 95 no no no no no 59 no This shows the number of different truth tables possible for 3 inputs with at most 2 operations on different architectures. (It is assumed that AND, OR, NOT and constants are always available, so only possible instruction set extensions are listed in the table.) The theoretical maximum (actually achieved with Maxwell's or AVX-512's "ternary" logic instructions) is 256. The last column shows whether MD5's I() is implementable with 2 operations or not. Similar tables can be generated for other expressions, such as those found in other MD4/MD5, SHA-1, and SHA-2 basic functions - but for those we readily knew seemingly optimal expressions using bitselect(), so I focused on MD5's I() for now. Unfortunately, the program became rather long. I'm sure it can be greatly simplified - and perhaps it should in fact be rewritten and tested against this version in order for us to have greater confidence it actually finds all possible truth tables rather than only a (very large) subset. As to machine code improvements from this change to I(), here's GCN assembly from our md5crypt kernel. Before the change: v_not_b32 v5, v7 // 00003DA8: 7E0A6F07 v_or_b32 v5, v1, v5 // 00003DAC: 380A0B01 v_xor_b32 v5, v2, v5 // 00003DB0: 3A0A0B02 v_add_i32 v4, vcc, v13, v4 // 00003DB4: 4A08090D v_add_i32 v4, vcc, v5, v4 // 00003DB8: 4A080905 v_add_i32 v4, vcc, 0xbd3af235, v4 // 00003DBC: 4A0808FF BD3AF235 v_alignbit_b32 v4, v4, v4, 22 // 00003DC4: D29C0004 025A0904 v_add_i32 v4, vcc, v1, v4 // 00003DCC: 4A080901 After the change (surprisingly, register allocation and offsets look unchanged): v_bfi_b32 v5, v7, v1, -1 // 00003DA8: D2940005 03060307 v_xor_b32 v5, v2, v5 // 00003DB0: 3A0A0B02 v_add_i32 v4, vcc, v13, v4 // 00003DB4: 4A08090D v_add_i32 v4, vcc, v5, v4 // 00003DB8: 4A080905 v_add_i32 v4, vcc, 0xbd3af235, v4 // 00003DBC: 4A0808FF BD3AF235 v_alignbit_b32 v4, v4, v4, 22 // 00003DC4: D29C0004 025A0904 v_add_i32 v4, vcc, v1, v4 // 00003DCC: 4A080901 That's 7 instructions instead of 8. However, there's no code size reduction in bytes, since v_bfi_b32 occupies 8 bytes (and it does so even when no immediate value operand is used). The effect of this will need to be tested across multiple OpenCL kernels and pieces of C+intrinsics code across multiple architectures. Besides GPUs, also on AMD CPUs with XOP. Alexander Download attachment "search3.sh" of type "application/x-sh" (455 bytes) View attachment "search3.c" of type "text/x-c" (6132 bytes)
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.