john-dev - Re: About Performance of Store Loaded Hashes/Salts on GPU

Message-ID: <20120518133908.GA22735@openwall.com>
Date: Fri, 18 May 2012 17:39:08 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: About Performance of Store Loaded Hashes/Salts on GPU

myrice -

On Fri, May 18, 2012 at 08:52:24PM +0800, myrice wrote:
> The performance gains from storing loaded hashes/salts on GPU is
> mainly from reducing hashes/salts transfer from CPU to GPU I think.

Not only that.  Consider the case of having multiple hashes per salt, or
having a saltless hash type (and multiple hashes).  Without that
optimization, you either talk to the GPU from every cmp_all() call,
which you need to make per-hash (with the current salt, if applicable),
or you have to transfer computed (partial) hashes back to the CPU so
that you can use get_hash*() and do the (indirect) comparisons on the CPU.

When you have the loaded hashes stored on the GPU card, you may include
the comparisons inside crypt_all(), right after computation of the
hashes.  (And you may use your own bitmaps and hash table lookups there
if the number of loaded hashes for a given salt is large enough to
warrant that.)  Then cmp_all() becomes almost a no-op: you don't need to
talk to the GPU from it, you just return one int value that you obtained
from the GPU in crypt_all().
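
The arrangement described above can be sketched roughly as follows.  This
is only a host-side simulation under stated assumptions: the "GPU" is an
ordinary array, toy_hash() is a made-up stand-in for the real kernel (it
is not SHA-512), and load_hashes(), gpu_calls and the simplified
crypt_all()/cmp_all() signatures are hypothetical, not the actual formats
interface.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_KEYS   8
#define MAX_LOADED 16

/* Hypothetical stand-in for device memory: loaded hashes are copied once. */
static uint32_t dev_loaded[MAX_LOADED];
static size_t dev_loaded_count;

static uint32_t keys[MAX_KEYS]; /* candidate passwords, reduced to integers */
static int any_match;           /* the single int brought back from the "GPU" */
static unsigned gpu_calls;      /* counts simulated host<->GPU interactions */

/* Toy hash for illustration only; the real kernel would compute SHA-512. */
uint32_t toy_hash(uint32_t key, uint32_t salt)
{
    uint32_t h = (key ^ salt) * 2654435761u;
    return h ^ (h >> 16);
}

/* One-time upload of the loaded hashes (assumed helper, not a JtR method). */
void load_hashes(const uint32_t *hashes, size_t n)
{
    if (n > MAX_LOADED)
        n = MAX_LOADED;
    gpu_calls++;
    dev_loaded_count = n;
    for (size_t i = 0; i < n; i++)
        dev_loaded[i] = hashes[i];
}

/* Compute and compare in one "kernel launch"; keep only a match flag. */
void crypt_all(int count, uint32_t salt)
{
    if (count > MAX_KEYS)
        count = MAX_KEYS;
    gpu_calls++;
    any_match = 0;
    for (int k = 0; k < count; k++) {
        uint32_t h = toy_hash(keys[k], salt);
        for (size_t i = 0; i < dev_loaded_count; i++)
            if (h == dev_loaded[i])
                any_match = 1;
    }
}

/* Almost a no-op: no GPU interaction, just return the stored flag. */
int cmp_all(void)
{
    return any_match;
}
```

In this sketch, calling cmp_all() any number of times after one
crypt_all() leaves gpu_calls unchanged, which is the point of the
optimization.  The linear scan over dev_loaded is where a bitmap or hash
table lookup would go for large numbers of loaded hashes.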

So with one loaded hash per salt, you save one call to the GPU.  With 10
hashes per salt, you save 10 calls to the GPU (you only do one call in
crypt_all(), but cmp_all() becomes dummy).
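
As a rough model of that arithmetic (gpu_calls_per_salt() is a
hypothetical helper for illustration, assuming exactly one GPU call per
crypt_all() and, without the optimization, one per cmp_all()):

```c
/* Simulated GPU calls per crypt_all()/cmp_all() cycle for one salt. */
unsigned gpu_calls_per_salt(unsigned hashes_per_salt, int hashes_on_gpu)
{
    unsigned calls = 1;           /* crypt_all() always talks to the GPU once */

    if (!hashes_on_gpu)
        calls += hashes_per_salt; /* one cmp_all() GPU call per loaded hash */
    return calls;
}
```

The difference between the two modes is hashes_per_salt, so the saving
grows linearly with the number of loaded hashes sharing a salt.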

(A planned enhancement to the formats interface will allow skipping the
dummy calls to cmp_all() in this case.)

> Here is a quick test I have done.
> Current xsha512-opencl/xsha512-cuda have cmp_all() on GPU. The
> hashes/salts are transferred to GPU when we invoke
> cmp_all()/crypt_all() respectively. So I commented out the hashes and
> salts copy code in cmp_all() and crypt_all(). Here is the result:
> Before:
> [11:36:43 myrice] run $ ./john -te=1 -fo=xsha512-cuda
> Benchmarking: Mac OS X 10.7+ salted SHA-512 [CUDA]... DONE
> Many salts:     65278K c/s real, 65278K c/s virtual
> Only one salt:  28973K c/s real, 28973K c/s virtual
> 
> After:
> [11:36:43 myrice] run $ ./john -te=1 -fo=xsha512-cuda
> Benchmarking: Mac OS X 10.7+ salted SHA-512 [CUDA]... DONE
> Many salts:     65925K c/s real, 65925K c/s virtual
> Only one salt:  29491K c/s real, 29230K c/s virtual
> 
> It seems these copies do not hurt performance a lot. Any ideas about this?

Yes.  I did not expect them to make much of a difference in the above
case (your numbers differ by only about 1% to 2%).  You should see a
slight additional improvement if you completely
eliminate interaction with the GPU in cmp_all(), though.  And this
improvement would be greater with multiple loaded hashes per salt (not
visible on --test, but visible on an actual cracking run with a proper
password file to expose this) and with saltless hashes.  Also it'd be
greater with faster hashes (recall that SHA-512 on GPU is only
semi-fast), although you'd need to also deal with the password
generation bottleneck in order for the effect to become significant.

Thanks,

Alexander
