Message-ID: <20150829064848.GA30978@openwall.com>
Date: Sat, 29 Aug 2015 09:48:49 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on GPU

Agnieszka,

On Sat, Aug 29, 2015 at 08:29:53AM +0300, Solar Designer wrote:
> You could identify that loop's code size (including any functions it
> calls, if not inlined), and/or try to reduce it (e.g., cut down on the
> unrolling and inlining overall, or do it selectively).
>
> In fact, even if the most performance critical loop fits in cache, or if
> we make it fit eventually, the size of the full kernel also matters.
>
> For comparison, the size of our md5crypt kernel is under 8k PTX
> instructions total, and even at that size inlining of md5_digest() or
> partially unrolling the main 1000 iterations loop isn't always optimal.
> In my recent experiments, I ended up not inlining md5_digest(), but
> unrolling the loop 2x on AMD and 4x on NVIDIA. Greater unrolling slowed
> things down on our HD 7990's GPUs, so large kernel size might be a
> reason why your Argon2 kernels perform worse on the AMD GPUs.

Per this recent discussion, functions that are not inlined aren't
currently supported in AMD OpenCL:

https://community.amd.com/thread/170309

So I am puzzled why including or omitting the "inline" keyword on
md5_digest() appeared to make any performance difference. I'll need to
re-test this, preferably reviewing the generated code. When targeting
NVIDIA, I do get exactly the same PTX code whether or not I include the
"inline" keyword.

"realhet", who commented in that thread, wrote a GCN ISA assembler, so
he would know. It's one of the tools we have listed at:

http://openwall.info/wiki/john/development/GPU-low-level

And it seems I was wrong about the 8k PTX instructions - that might have
been for another kernel or something. Our md5crypt kernel is currently
at around 4k PTX instructions.

However, function calls in OpenCL do appear to be supported on NVIDIA,
as seen from reviewing the PTX code for your Argon2 kernels. You don't
have your functions explicitly marked "inline", but most are inlined
anyway - yet a few are not:

$ fgrep .func kernel.out
.func Initialize
.func blake2b_update(
.func blake2b_final(
.func blake2b(
.func Initialize(
$ fgrep -A1 call.uni kernel.out | head -8
	call.uni
	blake2b_update,
--
	call.uni
	blake2b_update,
--
	call.uni
	blake2b_update,

You may want to look into ways to make more of the infrequent function
calls actually be calls rather than getting inlined. Ideally, there
would be a keyword to prevent inlining, but I am not aware of one.
Maybe there's a compiler switch, and then explicit "inline" would start
to matter. Please look into this. (One possibility to try is sketched
below, after the unrolling notes.)

As to loop unrolling, there's "#pragma unroll N", and specifying N=1, as
in "#pragma unroll 1", I think prevents unrolling. As an experiment, I
tried adding "#pragma unroll 1" before all loops in argon2d_kernel.cl,
and the PTX instruction count went down - but not by a lot. With the
uses of the BLAKE2_ROUND_NO_MSG_V macro also put into loops:

#pragma unroll 1
	for (i = 0; i < 64; i += 8) {
		BLAKE2_ROUND_NO_MSG_V(state[i], state[i+1], state[i+2],
		    state[i+3], state[i+4], state[i+5], state[i+6],
		    state[i+7]);
	}

#pragma unroll 1
	for (i = 0; i < 8; i++) {
		BLAKE2_ROUND_NO_MSG_V(state[i], state[i+8], state[i+16],
		    state[i+24], state[i+32], state[i+40], state[i+48],
		    state[i+56]);
	}

I got the PTX instruction count down from ~100k to ~80k. No speedup,
though. (But not much slowdown either.) We need to figure out why it
doesn't get lower.
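On the inlining question above: one thing that might be worth trying is
the GCC/Clang-style __attribute__((noinline)). Whether either OpenCL
compiler actually honors it is an assumption to verify against the
generated code (and per the AMD thread above, AMD would presumably
inline regardless). A minimal sketch, with made-up function and kernel
names just for illustration:

/*
 * Untested sketch only: assumes the OpenCL compiler accepts
 * GCC/Clang-style attributes.  blake2b_round_helper and argon2d_demo
 * are made-up names, not anything from the actual Argon2 kernels.
 */
__attribute__((noinline))
void blake2b_round_helper(ulong *v)
{
	/* stand-in for a real round body */
	v[0] += v[4];
	v[7] ^= v[0];
}

__kernel void argon2d_demo(__global ulong *state)
{
	ulong v[8];
	int i;

	for (i = 0; i < 8; i++)
		v[i] = state[i];

	blake2b_round_helper(v);

	for (i = 0; i < 8; i++)
		state[i] = v[i];
}

Whether a real call survives can then be checked with the same fgrep on
.func and call.uni as above.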
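As for figuring out where the remaining ~80k comes from, a rough
per-opcode histogram of the PTX might help. An untested sketch, assuming
instruction statements contain a ';' while directives start with '.' and
comments with "//" (predicated instructions, which start with '@', and
multi-line call sequences are not handled and will skew it a little):

$ grep ';' kernel.out | grep -vE '^[[:blank:]]*(\.|//|@)' | \
      awk '{ print $1 }' | sort | uniq -c | sort -rn | head -20

If a handful of opcodes dominate, that should point at which macros or
loops to look at next.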
~80k is still a lot. Are there many inlined functions and unrolled loops
in the .h files? Maybe some pre- and/or post-processing should be kept
on the host to make the kernel simpler and smaller. This is bad in terms
of Amdahl's law, but it might help us figure things out initially.

BTW, it would be helpful to have some Perl scripts or such to analyze
the PTX code. Even counting the instructions is a bit tricky, since many
of the lines are not instructions. "sort -u ... | wc -l" gives an
estimate (and this is what I have been using): since a new virtual
register number is allocated each time, repeated uses of the same
instruction still appear as distinct lines and survive the
deduplication - which is what we want for counting.

Alexander
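Along those lines, a first rough refinement of that count could filter
out the directive and comment lines before the sort -u. An untested
sketch, assuming directives start with '.', comments with "//", and
instruction statements end with ';':

$ grep ';[[:blank:]]*$' kernel.out | grep -vE '^[[:blank:]]*(\.|//)' | sort -u | wc -l

This still relies on the fresh virtual register numbers keeping repeated
instructions distinct, as described above, so it remains an estimate
rather than an exact count.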