Message-ID: <97bd87792664c6c7ecee88f955192b3a@smtp.hushmail.com>
Date: Wed, 25 Apr 2012 21:15:35 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: New RAR OpenCL kernel

On 04/25/2012 06:05 AM, Claudio André wrote:
> 
>> That is odd. I saw similar things when experimenting with Loveland and
>> Cedar cards. Even using really low figures like 16, I never got rid of
>> register spill. I must have hit a per-thread max rather than a total max.
> 
> If you look at the profile output, you'll see 18 ScratchRegs used in
> the previous version. This should be 0.
> I saw this happen in my code once, but I did not understand why. Looking
> at what I did, it seems the compiler made some "optimization". The
> "optimization" could be better than my original code, but I'm not sure
> about it. To me it was a compiler decision triggered by code that was
> not a very good fit for the GPU architecture.

Here's a similar issue: Today I noticed that when I enable shared memory
on nvidia (supposedly decreasing GPR pressure by a whopping 40
registers), the final GPR usage *increases* by 2 instead, doh!

>> I think we should come up with a couple of -Ddefines that are
>> automagically added by common-opencl at (JIT-)build time, depending on
>> device. I think we could use these or more:
>>
>> -DAMD or -DNVIDIA for starters.
>> And perhaps -DGCN, -DFERMI, I'm not sure. I know Milen uses -DOLD_ATI for
>> 4xxx (btw I just re-read everything he ever wrote to this list and it
>> was well worth the time)
> 
> I'm not sure how these "defines" are going to be set, but they are going
> to be useful. AMD has some code from which you can get the GPU family, so
> we can take it and use/adapt it. See page 2-7 in [1]

Great, I'll have a look. For the simplest case (just AMD vs nvidia) I
realised today that I could just add this to my kernel:

  #ifdef cl_nv_pragma_unroll
+ #define NVIDIA
  #pragma OPENCL EXTENSION cl_nv_pragma_unroll : enable
  #endif

...simple as that :) Then further down I just use #ifdef NVIDIA ... #else
... #endif for the architecture-specific things.
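
Spelled out, the whole pattern is something like this (the branch
bodies here are just placeholders, not my actual code):

#ifdef cl_nv_pragma_unroll
#define NVIDIA
#pragma OPENCL EXTENSION cl_nv_pragma_unroll : enable
#endif

/* ...and further down, wherever the architectures differ: */
#ifdef NVIDIA
	/* code tuned for nvidia */
#else
	/* code tuned for AMD and anything else */
#endif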

> I tried my own find_best_KPC a lot (and it follows Samuele's ideas). It
> is not deterministic. If I get a clear, unused GPU (no X on it), I
> will try to improve it (if possible), but in a live environment the
> results were OK.
> 
> I haven't seen bad results from the find LWS and KPC routines, I've seen
> suboptimal ones. With suboptimal numbers in mind, I did some experiments
> and selected what was best for the tests I did. We could recommend
> something like this.
> For LWS, we can always start at the work group size multiple. Have you
> tried using this constraint?

Yes, I start at CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, double for
each pass, and end at CL_KERNEL_WORK_GROUP_SIZE (as opposed to
CL_DEVICE_MAX_WORK_GROUP_SIZE, which is useless). The problem is that for
the LWS enumeration to be correct, the global work size used in that
loop must suit the device: if I use too low a value, a powerful card will
not show any difference in speed between LWS 32 and 1024. But if I use
too high a value, a weak card will take minutes to go through the loop!
This is why I factored in the number of SPs, but it is not really the
perfect solution.
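
To show what I mean, the enumeration looks roughly like this in host
code (time_kernel() is a made-up profiling helper, and gws is the
global work size I derive from the number of SPs):

#include <CL/cl.h>

/* Made-up helper: run the kernel once at (gws, lws) on a queue created
   with CL_QUEUE_PROFILING_ENABLE, return elapsed time in ns */
extern cl_ulong time_kernel(cl_command_queue q, cl_kernel k,
                            size_t gws, size_t lws);

static size_t find_best_lws(cl_command_queue q, cl_kernel k,
                            cl_device_id dev, size_t gws)
{
	size_t lws, max_lws, best_lws;
	cl_ulong t, best_time = (cl_ulong)-1;

	/* start at the preferred multiple, end at the kernel's max */
	clGetKernelWorkGroupInfo(k, dev,
	    CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
	    sizeof(lws), &lws, NULL);
	clGetKernelWorkGroupInfo(k, dev, CL_KERNEL_WORK_GROUP_SIZE,
	    sizeof(max_lws), &max_lws, NULL);

	for (best_lws = lws; lws <= max_lws; lws *= 2) {
		t = time_kernel(q, k, gws, lws);
		if (t < best_time) {
			best_time = t;
			best_lws = lws;
		}
	}
	return best_lws;
}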

The KPC test is easier. It is correct for a given LWS, no problem there
except knowing when to stop. Also, to be really sure you pick the
absolute best, this loop should start at LWS, increase in steps of LWS,
and end when speed is going downhill. But to be faster, it's better to
start at LWS and just double. This is more of a design decision.
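
In other words something like this (crypt_speed() is a made-up
benchmark that returns c/s at a given KPC):

extern double crypt_speed(size_t kpc);	/* made-up: c/s at this KPC */

static size_t find_best_kpc(size_t lws)
{
	size_t kpc, best_kpc = lws;
	double speed, best_speed = 0.0;

	/* fast variant: start at LWS and double until speed drops */
	for (kpc = lws; ; kpc *= 2) {
		speed = crypt_speed(kpc);
		if (speed <= best_speed)
			break;	/* going downhill, stop here */
		best_speed = speed;
		best_kpc = kpc;
	}
	return best_kpc;
}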

> I'll email the developer. Something is wrong.
> 
> OpenCL GPU device #0 not found, CPU is used

As an alternative benchmark I tried oclHashcat today with the Cedar,
9600GT and GTX580, for raw SHA-1. The Cedar and 9600GT were almost equal,
and both were about 1/10 of the GTX580. Using my RAR kernel, the 9600GT
is also 1/10 of the GTX580, but the Cedar is less than 1/100. I think my
code is actually pretty decent for big and small nvidias (for a n00b's
first project) but it's nowhere near good enough for AMD.

magnum
