Message-ID: <86b7b0e6d283a9d8bbd0926b42a20f29@smtp.hushmail.com>
Date: Thu, 26 Apr 2012 23:29:43 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: New RAR OpenCL kernel

On 04/26/2012 12:13 AM, Milen Rangelov wrote:
>> The funny thing is I got the same 4400 c/s anyway. It
>> got better in theory (less suggestions from the tool) but in practice it
>> stayed the same. For GTX580 I'm using 8192 now since higher figures
>> don't make any difference.
>
> It's not that funny (you probably have the same problem as me). I have that
> problem with my progress indicator eheh. The kernel is executing so slow, I
> don't know how do you manage it in JTR but I guess it's something similar.
> I have a thread that wakes up each 3 seconds and displays the speed based
> on candidates tried in that time interval. A kernel invocation takes
> usually 1-2 seconds and it tries say a ndrange of 128*128 candidates. Well
> speed usually goes around the same 128*128*(1 or 2) value. You can't even
> measure current speed in a *nice* way. That's why I am thinking of
> introducing an "average speed" for my program, it would be much more
> realistic for cases like the rar one :)

I got rid of an obstacle now: I was using four RawPsw buffers, one for
each alignment requirement, and no bit flogging inside the inner loop.
But that's an awful lot of registers, especially after I bumped max
length to 32 (which was good for other reasons). So I dropped the extra
buffers and now use just one aligned buffer, with all the bit-shift magic
in the inner loop. Now it continues to scale, but it quickly gets extreme
durations.

But with sane durations (like 3 secs @16K GWS) I now have ~4500 c/s on
GTX570/580 as well as HD 7970. It seems the latter runs fine with my
present code while all older AMDs just hate me (first impressions last,
lol). Still, I should be able to get 25% more out of both nvidia and ATI.
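
To make the "one aligned buffer plus bit shifts" idea concrete, here is a minimal host-side sketch in Python. It is not the actual kernel code: the function names `store_unaligned` and `words_to_bytes` are hypothetical, and the big-endian word layout is an assumption chosen because SHA-1 consumes its input as big-endian 32-bit words. It only shows how bytes can be written at any byte offset into a single word-aligned buffer by shifting and OR-ing, instead of keeping one pre-shifted copy per alignment.

```python
import struct

def store_unaligned(words, byte_offset, data):
    """OR the bytes of `data` into a list of 32-bit words, starting at an
    arbitrary byte offset. Assumes big-endian word layout and that the
    target bytes in `words` are still zero."""
    for i, b in enumerate(data):
        pos = byte_offset + i
        shift = 24 - 8 * (pos & 3)   # byte 0 of a word sits in the top 8 bits
        words[pos >> 2] |= b << shift
    return words

def words_to_bytes(words):
    """Serialise the word buffer back to bytes for inspection."""
    return struct.pack(">%dI" % len(words), *words)

buf = [0] * 4                        # one aligned 16-byte buffer
store_unaligned(buf, 3, b"abcde")    # write starts at a misaligned offset
print(words_to_bytes(buf))
```

The trade-off described above shows up here as well: there is only one buffer to hold, but every byte costs a shift and an OR on each write.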

I get +10% if I use a fixed-length kernel (without any optimisations
except the compiler's) so I'm now trying to figure out how to make the
host code "sort" the lengths in some effective way. I got a vague idea:
what if I launch all applicable kernels at once (the host code may decide
that we need kernels for e.g. lengths 4, 5 and 6), after the host code
sets up an array of pointers for each length accordingly? I think this
would be fairly cheap in this context.

But maybe I should leave the GPU for now and concentrate on the
AES/unpack/crc part instead: I have curiously noticed that cRARk just
does not get ANY performance drop from -p mode, even for large files.
This must mean he doesn't unpack much of it. I have 2-3 ideas that I will
investigate.

magnum
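
The host-side "sort by length" idea in the message above could be sketched roughly as follows. This is an illustration in Python, not code from John the Ripper: `bucket_by_length` and `plan_launches` are hypothetical names, and the sketch assumes that one fixed-length kernel exists per candidate length and that each kernel would be handed the list of indices for its length.

```python
from collections import defaultdict

def bucket_by_length(candidates):
    """Map each candidate length to the indices of candidates with that length."""
    buckets = defaultdict(list)
    for index, candidate in enumerate(candidates):
        buckets[len(candidate)].append(index)
    return dict(buckets)

def plan_launches(candidates):
    """Return (length, indices) pairs, one per hypothetical fixed-length kernel."""
    return sorted(bucket_by_length(candidates).items())

batch = [b"abcd", b"hello", b"test", b"secret", b"12345"]
for length, indices in plan_launches(batch):
    # In a real host program this is where the kernel for `length` would be enqueued.
    print("length %d: %d candidates, indices %s" % (length, len(indices), indices))
```

A single pass over the batch is enough to build the per-length index arrays, which fits the expectation that this step is cheap compared with the kernel run time.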