Message-ID: <20120405124401.GB24480@openwall.com>
Date: Thu, 5 Apr 2012 16:44:01 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: fast hashes on GPU

myrice -

On Thu, Apr 05, 2012 at 04:19:49PM +0800, myrice wrote:
> Yes, I'm using a much larger max_keys_per_crypt. (128*KEYS_PER_CRYPT,
> THREADS: 512, BLOCK: 32, sm_13)
> On GTX580, before:
> Benchmarking: Mac OS X 10.7+ salted SHA-512 [CUDA]... DONE
> Many salts:    70254K c/s real, 70964K c/s virtual
> Only one salt: 27262K c/s real, 27262K c/s virtual
>
> After:
> Benchmarking: Mac OS X 10.7+ salted SHA-512 [CUDA]... DONE
> Many salts:    72673K c/s real, 72673K c/s virtual
> Only one salt: 27439K c/s real, 27698K c/s virtual

The only change was an increase of max_keys_per_crypt by a factor of
128, correct?

> Enlarging either MAX_KEYS_PER_CRYPT or the BLOCK size reduces
> performance.

Are you saying that increasing these values _further_ (beyond 128x)
reduces performance?  If so, the reason is probably that the keys no
longer fit in the same cache level on the CPU.  This is also why the
salts trick discussed before has potential for a little bit of speedup
compared to an increased max_keys_per_crypt (with a lower
max_keys_per_crypt then, to fit in a smaller and faster CPU cache).
Similarly, the reduced working set size on GPU (N passwords + M salts
instead of N*M passwords for comparable total processing time per call)
may allow for the use of a faster memory type on the GPU.  So the salts
trick is worth revisiting once you feel you're done with the easier
optimizations.

> I think it is a lot of data transfer that takes time.

Of course transferring more data takes more time, but this is
irrelevant to the slowdown with a larger (too large) max_keys_per_crypt,
because the relative data transfer size per password hashed stays the
same.  I've given my guess as to the actual cause of the slowdown above.

What are you planning to do next, for XSHA512 and/or for other aspects
of the fast hashes project?  Here are some possible sub-tasks:

1. Work with magnum to get the code you've written so far into
magnum-jumbo.  (I strongly recommend that you take care of this first.)

2. Try to optimize XSHA512 further by optimizing the SHA-512 code in it
(e.g., get rid of the init/update/final functions).

3. Try to optimize XSHA512 further by storing/caching the salts on the
GPU as discussed.

4. Try to optimize XSHA512 further by storing/caching the loaded hashes
on the GPU, so that you can merge cmp_all()'s on-GPU code into
crypt_all()'s and avoid the call to the GPU in cmp_all().  With a very
large max_keys_per_crypt this won't matter much (the calls are
infrequent anyway), but the change may result in the optimal
max_keys_per_crypt value becoming a little lower (you'll need to
re-tune it), which may improve CPU cache usage (resulting in a little
bit of speedup).

5. Try to support longer passwords without a significant slowdown when
the actual passwords being tested happen to be short.  Right now, we
have to decrease PLAINTEXT_LENGTH to achieve decent speed for the
single-salt case, but this is a nasty limitation for actual use of such
a build of John (another build needs to be used when testing of longer
passwords is desired).  There are ways to keep the data transfers to
the GPU small most of the time while still allowing for longer
passwords in the rare cases when those are present.

6. For correctness, implement computation of bytes 9-64 of the hash
value in cmp_exact().  You'll keep BINARY_SIZE at 8, but in the rare
cases when the 8 bytes match and cmp_exact() is called, you re-decode
the ASCII string, compute the full 512-bit hash value, and compare
these two 512-bit values.  You may perform this computation on the CPU
(just use OpenSSL calls).  (A rough sketch follows after this list.)
7. Proceed to implement XSHA (similar to XSHA512, but based on SHA-1).
Since these hashes are a lot faster to compute on GPU, they will
highlight the various bottlenecks much better (and will make the need
for dealing with them more pressing).  (By the way, XSHA should be
added to your proposed GSoC timeline.)

8. Implement XSHA512 (and then XSHA) in OpenCL as well, and try it on
AMD GPUs.  This may also highlight the bottlenecks better (especially
if run on a card like the 7970, which should be about 3x faster than a
GTX 580 at SHA-1).  (Yes, it is desirable that you get a fast AMD card
as well.)

9. Since XSHA512 and XSHA use only a single instance of SHA-512 or
SHA-1, it should be possible to reverse the last few rounds of the
computation (perform the reversal in binary() on the CPU when loading
the hashes).  Then you have fewer rounds to compute at runtime (on the
GPU).  (BTW, this optimization will apply to CPU-only implementations
as well, so we'll need it there too, but we'll have to move away from
OpenSSL to our own SHA-512 and SHA-1 code first.  If you show how many
rounds may be reversed, and how, in your GPU code, someone else on our
team may do it for the CPU, or vice versa.)  (A small illustration of
the easiest step follows after this list.)
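To make item 6 a bit more concrete, here is a rough sketch of such a
cmp_exact() done on the CPU with OpenSSL.  The exact ciphertext layout
(a hex-encoded 4-byte salt followed by the 64-byte hash, with an
optional "$LION$" tag) and the get_key() helper are assumptions about
the surrounding format code, so treat it as an illustration rather
than a drop-in implementation:

#include <string.h>
#include <openssl/sha.h>

#define SALT_SIZE        4
#define FULL_BINARY_SIZE 64

/* Assumed to be provided by the same format file */
extern char *get_key(int index);

static unsigned char hexval(char c)
{
	if (c >= '0' && c <= '9') return c - '0';
	if (c >= 'a' && c <= 'f') return c - 'a' + 10;
	return c - 'A' + 10;
}

static int cmp_exact(char *source, int index)
{
	unsigned char salt[SALT_SIZE];
	unsigned char full[FULL_BINARY_SIZE], crypt[FULL_BINARY_SIZE];
	char *key = get_key(index);
	char *p = source;
	SHA512_CTX ctx;
	int i;

	if (!strncmp(p, "$LION$", 6)) /* assumed tag; match the real split() output */
		p += 6;

	/* Re-decode the full ASCII ciphertext: salt, then all 64 hash bytes */
	for (i = 0; i < SALT_SIZE; i++)
		salt[i] = (hexval(p[2 * i]) << 4) | hexval(p[2 * i + 1]);
	p += 2 * SALT_SIZE;
	for (i = 0; i < FULL_BINARY_SIZE; i++)
		full[i] = (hexval(p[2 * i]) << 4) | hexval(p[2 * i + 1]);

	/* Recompute the full 512-bit hash for this one candidate on the CPU */
	SHA512_Init(&ctx);
	SHA512_Update(&ctx, salt, SALT_SIZE);
	SHA512_Update(&ctx, key, strlen(key));
	SHA512_Final(crypt, &ctx);

	return !memcmp(crypt, full, FULL_BINARY_SIZE);
}

Since cmp_exact() only runs on the rare 64-bit partial matches, doing
this on the CPU costs practically nothing.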
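And here is a minimal illustration of the easiest piece of item 9:
undoing the final addition of the SHA-512 initial values.  With the
4-byte salt plus a password of reasonable length, the whole message
fits in a single SHA-512 block, so the last step of the computation is
just "state[i] += IV[i]".  If binary() subtracts the initial values
from the digest words decoded from the ciphertext, the kernel can
compare its working variables directly and skip those additions;
actual rounds can then be peeled back the same way by running the
round operations in reverse.  The decoded-digest layout (eight 64-bit
words) is an assumption here:

#include <stdint.h>

/* SHA-512 initial hash values (FIPS 180-4) */
static const uint64_t H0[8] = {
	0x6a09e667f3bcc908ULL, 0xbb67ae8584caa73bULL,
	0x3c6ef372fe94f82bULL, 0xa54ff53a5f1d36f1ULL,
	0x510e527fade682d1ULL, 0x9b05688c2b3e6c1fULL,
	0x1f83d9abfb41bd6bULL, 0x5be0cd19137e2179ULL
};

/* Called from binary() on the eight digest words loaded from the hash;
   the GPU kernel then omits its matching "state[i] += H0[i]" additions. */
static void reverse_final_add(uint64_t w[8])
{
	int i;

	for (i = 0; i < 8; i++)
		w[i] -= H0[i]; /* exact inverse of the mod-2^64 addition */
}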
Of course, there will be many more sub-tasks (for the summer, if we
proceed to work under a summer program), but the above are some that I
think you can work on now.

How does this sound to you?

Thanks,

Alexander