Message-ID: <20120623212611.GB1276@openwall.com>
Date: Sun, 24 Jun 2012 01:26:11 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: async key transfers to GPU (was: Weekly report 1)

myrice, Samuele -

On Tue, Apr 17, 2012 at 08:34:45PM +0400, Solar Designer wrote:
> On Tue, Apr 17, 2012 at 03:10:07PM +0800, myrice wrote:
> > 4. Tried async cpy on GPU but no performance gains, still keep tuning.
>
> IIRC, what you tried was not supposed to result in any speedup because
> your GPU code was invoked and required the data to be already available
> right after you started the async copy - so you had it waiting for data
> right at that point anyway.
>
> Lukas' code was different: IIRC, he split the buffered candidate
> passwords in three smaller chunks, where two of the three may in fact be
> transferred to the GPU asynchronously while the previous chunk is being
> processed. You may implement that too, and I suggest that you make the
> number of chunks to use configurable and try values larger than 3 (e.g.,
> 10 might be reasonable - letting you hide the latency for 9 out of 10
> transfers while hopefully not exceeding the size of a CPU data cache).

While thinking of formats interface enhancements to make this more
efficient, I realized that full efficiency may already be achieved by
splitting the set of keys into only two chunks and starting the transfer
of the first chunk when set_key() is called for
index == max_keys_per_crypt / 2 - 1.  Do it right from that set_key()
call.  Then crypt_all() will start by initiating the transfer of the
second chunk and the hashing of the first chunk (which may already be on
the GPU by that time), and then proceed to hash the second chunk (its
transfer to the GPU may complete while the first chunk is being hashed).

(You'll need to handle the special case when fewer than
max_keys_per_crypt or even fewer than max_keys_per_crypt / 2 keys are
tried per crypt_all() call.  There's no need to optimize for this case;
just make sure it works properly as well, without real async transfers
then.  This is not difficult.)

I will likely make this more explicit in the formats interface, which is
needed to support trickier things such as overlapping CPU/GPU
computation for WPA-PSK and MSCash2, but for now the hack suggested
above should just work to fully hide the latency of transfers to the GPU
as long as we're able to generate and transfer the candidate passwords
fast enough at all.

Can you please try it out?

Thanks,

Alexander
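
The two-chunk scheme described above can be sketched in plain C as follows. This is only an illustration under stated assumptions, not code from the JtR tree: transfer_async() and hash_chunk() are hypothetical stand-ins (a real format would use a non-blocking copy such as cudaMemcpyAsync() or clEnqueueWriteBuffer() and a GPU kernel launch), the buffer sizes are made-up small values, and set_key() and crypt_all() are given simplified signatures. It also assumes that set_key() is called with increasing index values starting at 0 for each batch.

```c
#include <assert.h>
#include <string.h>

#define MAX_KEYS_PER_CRYPT 8              /* assumption: tiny value for illustration */
#define HALF (MAX_KEYS_PER_CRYPT / 2)
#define KEY_LEN 16                        /* assumption: fixed-size key slots */

static char host_keys[MAX_KEYS_PER_CRYPT * KEY_LEN]; /* host-side key buffer */
static char dev_keys[MAX_KEYS_PER_CRYPT * KEY_LEN];  /* stand-in for GPU memory */
static unsigned hashes[MAX_KEYS_PER_CRYPT];          /* stand-in for GPU results */
static int first_chunk_sent;  /* nonzero once chunk 0 was sent from set_key() */
static int early_transfers;   /* how many transfers were started from set_key() */

/* Hypothetical stand-in for a non-blocking host-to-device copy.
 * Here it is a synchronous memcpy, so no real overlap takes place. */
static void transfer_async(int first, int count)
{
    if (count > 0)
        memcpy(dev_keys + first * KEY_LEN, host_keys + first * KEY_LEN,
               (size_t)count * KEY_LEN);
}

/* Hypothetical stand-in for the GPU kernel: a toy hash of each device key. */
static void hash_chunk(int first, int count)
{
    for (int i = first; i < first + count; i++) {
        unsigned h = 5381;
        for (const char *p = dev_keys + i * KEY_LEN; *p; p++)
            h = h * 33 + (unsigned char)*p;
        hashes[i] = h;
    }
}

void set_key(const char *key, int index)
{
    assert(index >= 0 && index < MAX_KEYS_PER_CRYPT);
    char *slot = host_keys + index * KEY_LEN;
    strncpy(slot, key, KEY_LEN - 1);      /* zero-pads the rest of the slot */
    slot[KEY_LEN - 1] = '\0';

    /* The first half of the buffer is full: start its transfer right away,
     * while the caller keeps generating keys for the second half. */
    if (index == HALF - 1) {
        transfer_async(0, HALF);
        first_chunk_sent = 1;
        early_transfers++;
    }
}

void crypt_all(int count)
{
    assert(count >= 0 && count <= MAX_KEYS_PER_CRYPT);
    int first_len = count < HALF ? count : HALF;
    int second_len = count - first_len;

    /* Special case: fewer than HALF keys, so nothing was sent early. */
    if (!first_chunk_sent)
        transfer_async(0, first_len);

    transfer_async(HALF, second_len);     /* start the second chunk's transfer */
    hash_chunk(0, first_len);             /* hash chunk 0 in the meantime */
    hash_chunk(HALF, second_len);         /* then hash chunk 1 */

    first_chunk_sent = 0;                 /* get ready for the next batch */
}
```

With a real asynchronous copy, the transfer started in set_key() would overlap with key generation for the second half, and the transfer started at the top of crypt_all() would overlap with hashing of the first half. A real implementation would also need to wait for each transfer to complete before hashing the corresponding chunk, which the synchronous stand-in above does implicitly.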