Message-ID: <20141031125556.GB7088@openwall.com>
Date: Fri, 31 Oct 2014 15:55:56 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: descrypt speed (was: "Failed copy data to gpu" when using fork with descrypt-opencl)

On Fri, Oct 31, 2014 at 03:31:12AM +0100, magnum wrote:
> On 2014-10-30 16:49, Royce Williams wrote:
> >>Using -fork=4 on a quadcore+HT and GTX980 I got over 82 Mc/s.
> >
> >On my 8-core AMD and GTX970, using fork=2 gets me 52 Mc/s, which is
> >much better than no fork (~35 Mc/s). fork=3 settles in around 54
> >Mc/s. Forking more than 3 doesn't materially increase the c/s rate.
>
> Solar, Sayantan, all,
>
> Why is this? This is bordering candidate generation bottleneck but
> that's not quite the problem, is it? So what is the bottleneck? Could we
> do something to make it faster without forking or *is* it just candidate
> generation?

Might be bandwidth - I just brought this up in another message.

Another idea is that we may introduce explicit buffering in global
memory, and async processing.

> Also, as far as I understand just from googling, Atom has yet to
> implement bitslicing. Yet his descrypt exceeds 100M c/s on a single
> Tahiti (according to
> https://twitter.com/hashcat/status/160488271267364864). How is that
> possible? Should we not beat him silly with our bitslicing version?

No, since oclHashcat generates candidate passwords on GPU and we do it
on host.

Also, as discussed before, bitslice DES works great on AMD GCN GPUs for
single DES cracking (and for LM hash cracking), but not so great for
descrypt. I think this has to do with L1 instruction cache hit rate:
for single DES, most fetches from L1i are reused by multiple wavefronts
executing code near the same addresses, whereas once iterations are
added they get too much out of sync after ~10 iterations or so (per
Sayantan's benchmarks when I was asking him to try hacked descrypt with
lower iteration counts). For descrypt, it's 25 iterations.

Perhaps a workaround is possible: e.g. compute partial hashes for 5
iterations, and do it 5 times - or is there some barrier forcing
wavefronts sync that we can put into the existing OpenCL code easily?

Anyway, we need candidate password generation on GPU before we can see
much of an effect from such changes.

Alexander
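
The "5 iterations, 5 times" idea above can be illustrated with a small host-side sketch. Everything in it is an assumption made for illustration: `toy_round`, `run_iterations`, and `run_chunked` are hypothetical names, and the round function is an arbitrary mixing step, not DES. Each call to `run_iterations` stands in for one kernel launch. The point is only that splitting the 25 iterations into chunks leaves the result unchanged, while each chunk boundary would give all wavefronts a common restart point.

```python
MASK64 = (1 << 64) - 1


def toy_round(state, salt):
    """Stand-in for one descrypt iteration (NOT real DES, illustration only)."""
    state = (state * 6364136223846793005 + salt) & MASK64
    return state ^ (state >> 29)


def run_iterations(states, salt, count):
    """Hypothetical 'kernel launch': advance every candidate by `count` iterations."""
    out = []
    for s in states:
        for _ in range(count):
            s = toy_round(s, salt)
        out.append(s)
    return out


def run_chunked(states, salt, total=25, chunk=5):
    """Run `total` iterations as several shorter launches of at most `chunk` each."""
    done = 0
    while done < total:
        step = min(chunk, total - done)
        # Launch boundary: every candidate has finished `done` iterations here,
        # so the next launch starts with all of them at the same point.
        states = run_iterations(states, salt, step)
        done += step
    return states


if __name__ == "__main__":
    candidates = [1, 2, 3, 4]
    # Chunked and single-pass runs must agree.
    assert run_chunked(candidates, salt=7) == run_iterations(candidates, 7, 25)
```

In a real kernel the intermediate state would presumably have to be kept in global memory between launches, so any gain from better instruction cache reuse would need to outweigh that extra traffic.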