|
Message-ID: <20150903200659.GA16390@openwall.com> Date: Thu, 3 Sep 2015 23:06:59 +0300 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Subject: Re: MD5 I() (was: SHA-1 H()) On Thu, Sep 03, 2015 at 06:15:29PM +0300, Solar Designer wrote: > I've enhanced the program, and had better luck today: > > $ ./search3 y n n n 2>&1 | cut -d' ' -f7- | sort -u > 165 > sel(ff, 55, 0f) = f5; 33 ^ f5 = c6; > sel(ff, 55, 0f) = f5; f5 ^ 33 = c6; > > #define I(x, y, z) (bitselect(0xffffffffU, (x), (z)) ^ (y)) > > To remind, the original was: > > #define I(x, y, z) ((y) ^ ((x) | ~(z))) > > I think it's the first time MD5 I() has been shown to be implementable > with only 2 operations on architectures without OR-NOT atom confirmed that this wasn't in oclHashcat yet, but I guess now it will be: <solardiz> MD5_I(x, y, z) = y ^ (x | ~z) = y ^ bitselect(-1, x, z); Was this known? @hashcat Do you use it? 2 ops without OR-NOT http://www.openwall.com/lists/john-dev/2015/09/03/6 <@hashcat> @solardiz It's new to me. Surprisingly it does not have any effect on cracking speed on AMD GPU even if the instruction count is reduced <@hashcat> @solardiz It can be used for RIPEMD's J() as well <@hashcat> @solardiz OK, found the reason! oclHashcat reverses the MD5 I() instruction for single hashes and therefore it almost never reached I() <@hashcat> @solardiz For situations where we can not reverse I() there's an increase. For example with phpass 2932k->3038k @ 290x <@solardiz> @hashcat Thanks for confirming this. I'm getting 2233k->2310k for JtR's phpass-opencl on Tahiti 1050 MHz. > $ ./search3.sh > SEL XNOR ORN ANDN COUNT MD5_I > yes yes yes yes 190 yes > yes yes yes no 190 yes > yes yes no yes 190 yes > yes yes no no 178 yes > yes no yes yes 177 yes > yes no yes no 177 yes > yes no no yes 177 yes > yes no no no 165 yes > no yes yes yes 144 yes > no yes yes no 114 yes > no yes no yes 114 yes > no yes no no 72 no > no no yes yes 131 yes > no no yes no 95 yes > no no no yes 95 no > no no no no 59 no Note that MD5 I() is also implementable in 2 ops on archs with XNOR and ANDN, but no ORN and no SEL. Do these exist? (Usually if XNOR is available, then ORN is also available.) #define I(x, y, z) ((~(x) & (z)) ^ ~(y)) > This shows the number of different truth tables possible for 3 inputs > with at most 2 operations on different architectures. (It is assumed > that AND, OR, NOT and constants are always available, so only possible > instruction set extensions are listed in the table.) The theoretical > maximum (actually achieved with Maxwell's or AVX-512's "ternary" logic > instructions) is 256. Of course, I meant AND, OR, XOR, NOT there. (Forgot to list XOR.) > As to machine code improvements from this change to I(), here's GCN > assembly from our md5crypt kernel. Before the change: > > v_not_b32 v5, v7 // 00003DA8: 7E0A6F07 > v_or_b32 v5, v1, v5 // 00003DAC: 380A0B01 > v_xor_b32 v5, v2, v5 // 00003DB0: 3A0A0B02 > v_add_i32 v4, vcc, v13, v4 // 00003DB4: 4A08090D > v_add_i32 v4, vcc, v5, v4 // 00003DB8: 4A080905 > v_add_i32 v4, vcc, 0xbd3af235, v4 // 00003DBC: 4A0808FF BD3AF235 > v_alignbit_b32 v4, v4, v4, 22 // 00003DC4: D29C0004 025A0904 > v_add_i32 v4, vcc, v1, v4 // 00003DCC: 4A080901 > > After the change (surprisingly, register allocation and offsets look > unchanged): > > v_bfi_b32 v5, v7, v1, -1 // 00003DA8: D2940005 03060307 > v_xor_b32 v5, v2, v5 // 00003DB0: 3A0A0B02 > v_add_i32 v4, vcc, v13, v4 // 00003DB4: 4A08090D > v_add_i32 v4, vcc, v5, v4 // 00003DB8: 4A080905 > v_add_i32 v4, vcc, 0xbd3af235, v4 // 00003DBC: 4A0808FF BD3AF235 > v_alignbit_b32 v4, v4, v4, 22 // 00003DC4: D29C0004 025A0904 > v_add_i32 v4, vcc, v1, v4 // 00003DCC: 4A080901 > > That's 7 instructions instead of 8. However, there's no code size > reduction in bytes, since v_bfi_b32 occupies 8 bytes (and it does so > even when no immediate value operand is used). > > The effect of this will need to be tested across multiple OpenCL kernels > and pieces of C+intrinsics code across multiple architectures. Besides > GPUs, also on AMD CPUs with XOP. For md5crypt-opencl, we're getting some slowdown - but that kernel is weird (as discussed elsewhere, it currently uses global memory in its inner loop). We need to optimize it some more, then re-test this change. For phpass-opencl, there's decent speedup as mentioned above. I didn't try any others yet. Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.