john-dev - Re: New RAR OpenCL kernel

Message-ID: <86b7b0e6d283a9d8bbd0926b42a20f29@smtp.hushmail.com>
Date: Thu, 26 Apr 2012 23:29:43 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: New RAR OpenCL kernel

On 04/26/2012 12:13 AM, Milen Rangelov wrote:
>> The funny thing is I got the same 4400 c/s anyway. It
>> got better in theory (less suggestions from the tool) but in practice it
>> stayed the same. For GTX580 I'm using 8192 now since higher figures
>> don't make any difference.
>
> It's not that funny (you probably have the same problem as me). I have that
> problem with my progress indicator eheh. The kernel is executing so slow, I
> don't know how do you manage it in JTR but I guess it's something similar.
> I have a thread that wakes up each 3 seconds and displays the speed based
> on candidates tried in that time interval. A kernel invocation takes
> usually 1-2 seconds and it tries say a ndrange of 128*128 candidates. Well
> speed usually goes around the same 128*128*(1 or 2) value. You can't even
> measure current speed in a *nice* way. That's why I am thinking of
> introducing an "average speed" for my program, it would be much more
> realistic for cases like the rar one :)

I got rid of an obstacle now: I was using four RawPsw buffers, one for
each alignment requirement and no bit flogging inside the inner loop.
But that's an awful lot of registers, especially after I bumped max
length to 32 (which was good for other reasons). So I dropped the extra
buffers and just use one aligned buffer and all the bitshift magic in
the inner loop. Now it continues to scale, but kernel durations quickly
get extreme.
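
To illustrate the single-buffer approach, here is a minimal sketch in
plain C (not the actual kernel code). The helper name put_bytes, its
parameters and the little-endian byte order within each word are
assumptions made for the example. It writes bytes at an arbitrary byte
offset into a buffer of 32-bit words using only word accesses and
shifts, which is roughly the work that moves into the inner loop once
the per-alignment buffers are gone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store `len` bytes from `src` at byte offset `pos`
 * in a buffer of 32-bit words, touching the buffer only through aligned
 * word accesses. Little-endian byte order inside each word is assumed. */
static void put_bytes(uint32_t *buf, size_t buf_words, size_t pos,
                      const uint8_t *src, size_t len)
{
    assert(pos + len <= buf_words * 4);

    for (size_t i = 0; i < len; i++, pos++) {
        size_t word = pos >> 2;                      /* which 32-bit word */
        unsigned shift = (unsigned)(pos & 3) << 3;   /* 0, 8, 16 or 24 bits */

        /* Clear the target byte lane, then OR in the new byte. */
        buf[word] = (buf[word] & ~((uint32_t)0xff << shift))
                  | ((uint32_t)src[i] << shift);
    }
}
```

The trade-off is the one described above: one buffer instead of four
saves registers, at the price of a shift and a mask per byte written.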

But with sane durations (like 3 secs @16K GWS) I now have ~4500 c/s on
GTX570/580 as well as HD 7970. It seems the latter runs fine with my
present code while all older AMD cards just hate me (first impressions last,
lol). Still, I should be able to get 25% more out of both nvidia and
ATI. I get +10% if I use a fixed-length kernel (without any
optimisations except the compiler's) so I'm now trying to figure out how
to make the host code "sort" the lengths in some efficient way. I have a
vague idea: what if I launch all applicable kernels at once (the host
code may decide that we need kernels for e.g. lengths 4, 5 and 6) after
the host code sets up an array of pointers for each length
accordingly? I think this would be fairly cheap in this context.
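
A rough host-side sketch of that grouping idea, assuming a counting
sort over candidate lengths. The function name group_by_length and the
order[]/start[] layout are made up for the example and are not existing
JtR code. After the call, each length owns one contiguous slice of
order[], so a fixed-length kernel could be launched once per non-empty
slice.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LEN 32   /* max password length mentioned above */

/* Hypothetical helper: group candidate indices by length.
 * order[] receives the indices, grouped by ascending length.
 * start[l] .. start[l + 1] - 1 is the slice of order[] that holds
 * the candidates of length l. */
static void group_by_length(const char *const *cand, size_t n,
                            size_t *order, size_t start[MAX_LEN + 2])
{
    size_t next[MAX_LEN + 1];

    memset(start, 0, (MAX_LEN + 2) * sizeof(size_t));

    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(cand[i]);
        assert(len <= MAX_LEN);
        start[len + 1]++;              /* count per length, shifted by one */
    }

    for (size_t l = 0; l <= MAX_LEN; l++)
        start[l + 1] += start[l];      /* prefix sum gives slice starts */

    memcpy(next, start, sizeof(next)); /* running write position per length */
    for (size_t i = 0; i < n; i++)
        order[next[strlen(cand[i])]++] = i;
}
```

This is two linear passes over the candidates, which should be cheap
next to kernel invocations that run for seconds.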

But maybe I should leave the GPU for now and concentrate on the
AES/unpack/crc part instead: curiously, I have noticed that cRARk just does
not get ANY performance drop from -p mode, even for large files. This
must mean he doesn't unpack much of it. I have 2-3 ideas that I will
investigate.

magnum
