|
Message-ID: <20130107175644.GA20530@openwall.com> Date: Mon, 7 Jan 2013 21:56:44 +0400 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Subject: Re: des-opencl Hi Sayantan - On Mon, Jan 07, 2013 at 02:51:04PM +0530, Sayantan Datta wrote: > On Mon, Jan 7, 2013 at 9:01 AM, Solar Designer <solar@...nwall.com> wrote: > > > Another curious detail about mysterymath's implementation is that it > > obviously exceeds instruction cache, yet is reasonably fast on GCN. > > We should try that too as it avoids pointer indirection and reduces > > register pressure due to hard-coding key bit indices (no key schedule > > array needed anymore). Lukas reported very poor performance on 5850, > > though. > > Yes I tried that too, but speed was far worse due to hardcoding. Although I > tried this by hard-coding only k=0 and k=96 , there was a huge performance > drop, probably due iCache overrun. To fully hard-code the kernel we need to > do that for k=0 to 720 with a gap of 48 which should be around 16 times > more code(instructions) than we currently have in the kernel. There's some confusion here. What exactly did you try? Fully unroll the DES rounds loop (all 16 rounds are sequential pieces of code, with only the 25-iterations loop around them), but then only substituted some fixed key bit indices, continuing to use the key schedule array for the rest (even though you no longer had to)? I am not sure we understand each other correctly here... Why 0 to 720? I think it's 0 to 767. Why gap? Why 16 times more code? We currently have 2 DES rounds unrolled (looping over them 8 times). Fully unrolled would be at most 8 times larger, and in practice less than that because the reduced pointer indirection also reduces code size - so maybe ~6x bigger code overall. Yes, it wouldn't fit in the instruction cache on current devices, but maybe that's acceptable when the number of waves is large enough (so each portion of code fetched into the cache is made use of multiple times by the different waves). > > The speed seen during actual cracking is a lot lower - about 1M c/s, > > not 44M c/s as --test would suggest (this is for many salts). Why is > > that? How to avoid this problem? > > If you check the speed just after it starts, it should be quite low. It > should increase rapidly after the compilation of all kernels with different > salt is finished. Oh, you're right. Somehow I thought this version of code didn't have that. We need a version of code that uses neither multiple kernels nor runtime binary patching - just with the E[] array as we had before. Yes, it'd be slower, but I think we should keep that as well anyway - to be used when the approach with specializing kernels for a salt value somehow fails or when the startup time needs to be reduced. I'm not sure how to keep both (or all three?) approaches in the same source tree best, though. 3 formats? Or a format with compile-time fallbacks (e.g., use binary patching when the target GPU type is one of those where we've tested this and it works, and fallback to E[] for other devices?) Perhaps we'll make a "final" determination on that at a later time, but for now we simply need to have these available. Meanwhile, more comments were posted to the reddit thread - including a comment by atom. He managed to get 6 billion keys/s with a revision of the OP's code - looks like we need to upgrade Catalyst and experiment with this some more, too. Thanks, Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.