Message-ID: <20120518133908.GA22735@openwall.com>
Date: Fri, 18 May 2012 17:39:08 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: About Performance of Store Loaded Hashes/Salts on GPU

myrice -

On Fri, May 18, 2012 at 08:52:24PM +0800, myrice wrote:
> The performance gains from storing loaded hashes/salts on GPU is
> mainly from reducing hashes/salts transfer from CPU to GPU I think.

Not only that.  Consider the case of having multiple hashes per salt,
or having a saltless hash type (and multiple hashes).  Without that
optimization, you either talk to the GPU from every cmp_all() call,
which you need to make per-hash (with the current salt, if applicable),
or you have to transfer computed (partial) hashes back to the CPU such
that you can use get_hash*() and do the (indirect) comparisons on CPU.

When you have the loaded hashes stored on the GPU card, you may include
the comparisons inside crypt_all(), right after computation of the
hashes.  (And you may use your own bitmaps and hash table lookups there
if the number of loaded hashes for a given salt is large enough to
warrant that.)  Then cmp_all() becomes almost a no-op: you don't need
to talk to the GPU from it, you just return one int value that you
obtained from the GPU in crypt_all().

So with one loaded hash per salt, you save one call to the GPU.  With
10 hashes per salt, you save 10 calls to the GPU (you only do one call
in crypt_all(), but cmp_all() becomes a dummy).

(A planned enhancement to the formats interface will allow skipping the
dummy calls to cmp_all() in this case.)

> Here is a quick test I have done.
> Current xsha512-opencl/xsha512-cuda have cmp_all() on GPU. The
> hashes/salts are transferred to GPU when we invoke
> cmp_all()/crypt_all() respectively. So I commented out the hashes and
> salts copy code in cmp_all() and crypt_all(). Here is the result:
>
> Before:
> [11:36:43 myrice] run $ ./john -te=1 -fo=xsha512-cuda
> Benchmarking: Mac OS X 10.7+ salted SHA-512 [CUDA]... DONE
> Many salts:     65278K c/s real, 65278K c/s virtual
> Only one salt:  28973K c/s real, 28973K c/s virtual
>
> After:
> [11:36:43 myrice] run $ ./john -te=1 -fo=xsha512-cuda
> Benchmarking: Mac OS X 10.7+ salted SHA-512 [CUDA]... DONE
> Many salts:     65925K c/s real, 65925K c/s virtual
> Only one salt:  29491K c/s real, 29230K c/s virtual
>
> It seems these copies do not hurt performance a lot. Any ideas about this?

Yes.  I did not expect them to make much of a difference in the above
case.  You should see a slight additional improvement if you completely
eliminate interaction with the GPU in cmp_all(), though.  And this
improvement would be greater with multiple loaded hashes per salt (not
visible on --test, but visible on an actual cracking run with a proper
password file to expose this) and with saltless hashes.  Also, it'd be
greater with faster hashes (recall that SHA-512 on GPU is only
semi-fast), although you'd need to also deal with the password
generation bottleneck in order for the effect to become significant.

Thanks,

Alexander
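
The arrangement described above (comparisons done inside crypt_all(), with cmp_all() only returning a value obtained there) might be sketched roughly as follows. This is a host-only illustration in plain C, with the GPU simulated by ordinary functions and arrays. All names in it (sketch_load_hashes, sketch_crypt_all, sketch_cmp_all, toy_hash, gpu_calls, and the dev_* variables) are hypothetical and are not the actual John the Ripper formats interface; toy_hash is a stand-in and has nothing to do with SHA-512.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_LOADED 16

/* Stand-ins for device memory (assumption: the real thing lives on the GPU). */
uint32_t dev_loaded[MAX_LOADED];  /* loaded hashes kept "on the card" */
size_t dev_loaded_count;
int dev_any_match;                /* the one int obtained in crypt_all() */
unsigned gpu_calls;               /* counts simulated per-crypt GPU round trips */

/* Toy hash used only for illustration; NOT SHA-512. */
uint32_t toy_hash(uint32_t key, uint32_t salt)
{
    uint32_t h = (key * 2654435761u) ^ salt;
    h ^= h >> 16;
    return h;
}

/* One-time upload of the loaded hashes; extra entries beyond MAX_LOADED
 * are ignored in this sketch. */
void sketch_load_hashes(const uint32_t *hashes, size_t n)
{
    size_t i;

    dev_loaded_count = n < MAX_LOADED ? n : MAX_LOADED;
    for (i = 0; i < dev_loaded_count; i++)
        dev_loaded[i] = hashes[i];
}

/* One simulated GPU call: compute all candidate hashes and compare them
 * against every loaded hash right away, keeping only a match flag. */
void sketch_crypt_all(const uint32_t *keys, size_t count, uint32_t salt)
{
    size_t i, j;

    gpu_calls++;
    dev_any_match = 0;
    for (i = 0; i < count; i++) {
        uint32_t h = toy_hash(keys[i], salt);

        for (j = 0; j < dev_loaded_count; j++)
            if (h == dev_loaded[j])
                dev_any_match = 1;
    }
}

/* Almost a no-op: no GPU interaction, just return the stored flag. */
int sketch_cmp_all(void)
{
    return dev_any_match;
}
```

In this sketch, calling sketch_cmp_all() once per loaded hash adds no further simulated GPU calls; gpu_calls grows only with sketch_crypt_all(). A linear scan is used for the comparison; the bitmaps or hash table lookups mentioned above would replace the inner loop when many hashes are loaded per salt.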