Message-ID: <CABh=JRExD8yUVt4tvhmcfTBPArZ2mv_YP2YofGJ10tA5-cg0ow@mail.gmail.com>
Date: Thu, 26 Apr 2012 00:11:32 +0300
From: Milen Rangelov <gat3way@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: New RAR OpenCL kernel

Hello (and sorry for hijacking the thread).

The values for LWS (worksize) and KPC (ndrange) seem far from optimal to
me. A worksize of 256 is just not healthy for such a kernel, for a number of
reasons. I would stick to 64 (or 32 for nvidia), even hardcoded; there is no
need to customize that at all...but well, that's just my opinion :)

An NDRange of just 256 is plainly bad, though. It is not enough to keep the
CUs busy enough to 'hide' memory access latencies. It needs to be at least
several thousand. You are underusing the GPU that way.

OTOH, yes, I know a higher NDRange with the RAR kernel could be disastrous
and could even lead to ASIC hangs and so on.

IMO the RAR kernel is all about a (very fragile) balance between several
factors.
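
To make the sizing above concrete, here is a rough sketch in Python. It is
not taken from the JtR source: the function names are made up for
illustration, and the 4096 floor is only my assumption standing in for
"several thousand". It picks the hardcoded worksize (32 for nvidia, 64
otherwise) and rounds the NDRange up to a multiple of it.

```python
def default_lws(vendor: str) -> int:
    """Hardcoded local work size: 32 for NVIDIA, 64 for everything else."""
    return 32 if "nvidia" in vendor.lower() else 64


def global_work_size(candidates: int, lws: int, min_items: int = 4096) -> int:
    """Return an NDRange that is a multiple of lws and at least min_items.

    min_items is an assumed floor, not a measured optimum.
    """
    target = max(candidates, min_items)
    # Round up to the next multiple of the local work size.
    return ((target + lws - 1) // lws) * lws


if __name__ == "__main__":
    lws = default_lws("Advanced Micro Devices, Inc.")
    print(lws, global_work_size(256, lws))
```

The floor is only a lower bound; as said above, pushing the NDRange much
higher with this kernel can backfire, so the real value still has to be
found by profiling.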

Also, I just noticed the ALUBusy and ALUPacking expectations. I guess they
are too optimistic, especially the ALUBusy one. It would never get
anywhere near 100%; in fact, I think even 50-60% would be an excellent
achievement...yeah, that bad. Well, unless you think of some clever way to
reduce branching and/or loops, and some clever way to reduce GPR usage
without resorting to moving variables to __local memory. Writing a
well-performing RAR kernel is indeed a very complex task (I am not
exaggerating at all: it took me weeks of coding and profiling and I am
still not happy with the results). The only thing that comes close is
the sha512-crypt kernel for AMD, and RAR is still more complex.

On Wed, Apr 25, 2012 at 11:30 PM, magnum <john.magnum@...hmail.com> wrote:

> On 04/25/2012 10:26 PM, magnum wrote:
> > On 04/25/2012 02:38 PM, SAYANTAN DATTA wrote:
> >> I tested your rar format on my 4890.Here's the result:
> >>
> >> Local worksize (LWS) 256, Global worksize (KPC) 256
> >> Benchmarking: RAR3 (6 characters) [OpenCL]... DONE
> >> Raw:    64.2 c/s real, 64.2 c/s virtual
> >>
> >>  Is it okay to have KPC 256? Seems a bit low..
> >
> > I forgot that last question. No, I do not think 256 is too low. I can
> > get max speed on GTX580 using any LWS >= 64.
>
> Sorry I misread this. LWS 256 is OK but KPC at the same can't be good. I
> wonder how it ended up like that.
>
> Try explicitly saying KPC=0 for a benchmark output. Maybe also try
> setting a lower LWS (and KPC=0) and see what happens.
>
> magnum
>
>
