Message-ID: <20150823072153.GA15333@openwall.com>
Date: Sun, 23 Aug 2015 10:21:54 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on GPU

On Sun, Aug 23, 2015 at 08:53:05AM +0300, Solar Designer wrote:
> There might also be room for improvement of Argon2 performance on GPUs
> through special handling of BLAKE2b's 64-bit operations.  See:
>
> http://hashcat.net/forum/archive/index.php?thread-3422.html
>
> "All the 64-bit based algorithms like SHA512, Keccak etc dropped in
> performance with each new driver a little bit. So it was hard to notice.
> GPUs instructions operate still on 32-bit only, so the 64-bit mode is
> emulated. But the way how it is emulated was somehow broken. I was
> able to pinpoint the problem where the biggest drop came from and I
> managed to workaround it. For NVidia it took me a little PTX hack, for
> AMD luckily there was no binary hack required."
>
> Unfortunately, atom doesn't go into further detail there (but we could
> try asking him).  I guess the approach amounts to explicitly building
> 64-bit addition out of 32-bit additions.  Maybe having it split like
> that right away (rather than only in the PTX or IL to ISA translation)
> is somehow friendlier to current compilers.

In PTX, we appear to be getting add.s64 now.  I guess it'd be more
optimal to get add.cc.u32 followed by addc.u32:

http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#extended-precision-integer-arithmetic-instructions

In source code, it should be something like:

    dst_lo = src1_lo + src2_lo;
    dst_hi = src1_hi + src2_hi + (dst_lo < src1_lo);

(where in our case these would likely be 32-bit elements of uint4), but
it'll probably take effort to reach the desired PTX code.

For bit rotates, we appear to be getting things like this:

    {
    .reg .b64 %lhs;
    .reg .b64 %rhs;
    shl.b64 %lhs, %rd12449, 1;
    shr.b64 %rhs, %rd12449, 63;
    add.u64 %rd12450, %lhs, %rhs;
    }

This probably translates to at least 6 native instructions.
There ought to be more efficient ways, such as involving bfe or/and bfi
instructions:

http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#integer-arithmetic-instructions-bfe

For 32-bit rotates, there also was the mad trick:

http://www.openwall.com/lists/john-dev/2012/03/22/7

For starters, we should try OpenCL's rotate() and see if it translates
to decent PTX code these days.  Right now, opencl_blake2.h: rotr64()
does not yet use rotate(), while opencl_blake2-round-no-msg.h does.  We
should switch both to use the same approach, at least to make reviewing
the generated PTX code easier.

Also, right now opencl_blake2-round-no-msg.h uses rotate() with negative
shift counts.  We should change this to use the corresponding positive
shift counts instead.

opencl_blake2.h uses the "__constant uchar blake2b_sigma" array to
simplify the source code.  Unfortunately, this actually gets into the
compiled code:

    ld.const.u8 %r13, [blake2b_sigma+1];
    mul.wide.u32 %rd84, %r13, 8;
    add.s64 %rd85, %rd2021, %rd84;
    ld.local.u64 %rd86, [%rd85];

We should optimize this in the source using cpp macros, or alternatively
those non-performance-critical uses of BLAKE2 may be kept on the host.

Alexander
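
To make the split-into-halves idea above concrete, here is a minimal
sketch in plain C (not OpenCL, and not the code in opencl_blake2.h).
The u64_halves type and the helper names split64(), join64(),
add64_split(), rotr64_split() and rotl64_split() are made up for this
illustration.  It shows the carry expression from the message and one
possible way to build the 64-bit rotates BLAKE2b needs (by 32, 24, 16
and 63) out of 32-bit operations.  Whether a compiler turns this into
add.cc.u32/addc.u32 or anything better than the 64-bit shifts is not
something this sketch can promise; the generated PTX would still need
to be checked.

```c
#include <stdint.h>

/* Hypothetical helper type (not from opencl_blake2.h): one 64-bit word
 * kept as two 32-bit halves, much like two elements of a uint4. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u64_halves;

static u64_halves split64(uint64_t x)
{
    u64_halves r;
    r.lo = (uint32_t)x;
    r.hi = (uint32_t)(x >> 32);
    return r;
}

static uint64_t join64(u64_halves x)
{
    return ((uint64_t)x.hi << 32) | x.lo;
}

/* 64-bit addition from two 32-bit additions plus a carry.
 * (sum.lo < a.lo) is 1 exactly when the low-half addition wrapped. */
static u64_halves add64_split(u64_halves a, u64_halves b)
{
    u64_halves sum;
    sum.lo = a.lo + b.lo;
    sum.hi = a.hi + b.hi + (sum.lo < a.lo);
    return sum;
}

/* 64-bit rotate right by n, where n must be in 0..63.
 * A rotate by 32 is a plain swap of the halves; larger counts are
 * handled as that swap followed by a rotate by n - 32. */
static u64_halves rotr64_split(u64_halves x, unsigned n)
{
    u64_halves r;
    if (n >= 32) {
        uint32_t t = x.lo;
        x.lo = x.hi;
        x.hi = t;
        n -= 32;
    }
    if (n == 0)
        return x; /* avoids an undefined 32-bit shift by 32 below */
    r.lo = (x.lo >> n) | (x.hi << (32 - n));
    r.hi = (x.hi >> n) | (x.lo << (32 - n));
    return r;
}

/* Rotate left by a positive count n in 1..63, expressed through the
 * rotate right: rotl(x, n) == rotr(x, 64 - n). */
static u64_halves rotl64_split(u64_halves x, unsigned n)
{
    return rotr64_split(x, 64 - n);
}
```

The last helper is the identity behind replacing negative shift counts
with positive ones: OpenCL's rotate() rotates left, so, assuming the
negative counts were standing in for right rotates by n, the positive
equivalent would be a left rotate by 64 - n.  For example, the
rotate-by-63 in BLAKE2b is the same operation as the left rotate by 1
seen in the PTX above.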