Message-ID: <4F97780E.9000705@gmail.com>
Date: Wed, 25 Apr 2012 01:05:34 -0300
From: Claudio André <claudioandre.br@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: New RAR OpenCL kernel
> That is odd. I saw similar things when experimenting with Loveland and
> Cedar cards. Even using really low figures like 16, I never got rid of
> register spill. I must have hit a per-thread max rather than a total max.
If you look at the profile output, you will see 18 ScratchRegs used in
the previous version. This should be 0.
I saw this happen in my code once, but I did not understand why. Looking
at what I did, it seems the compiler made some "optimization". The
"optimization" could be better than my original code, but I'm not sure
about it. To me it was a compiler decision, made because my code was not
a very good fit for the GPU architecture.
> Great! I did not really dare to hope for that. BTW there are some AMD
> alternatives commented out and/or defined out in the kernel, that in the
> future should be enabled when applicable, things like this:
>
> #if 0 // slower on GTX580
> #define F(x,y,z) bitselect(z, y, x)
> #else
> #define F(x,y,z) (z ^ (x& (y ^ z)))
> #endif
I started taking ideas from the bitcoin code and got about a 3% speedup.
At least for now, it is based only on things like bitselect, bit_align,
etc.
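
For reference, the two F variants in the quoted snippet compute the same value, so choosing between them is purely a performance decision. A minimal sketch in plain C, assuming the OpenCL semantics of bitselect(a, b, c), which takes each bit from b where c is 1 and from a otherwise. The helper names are hypothetical and not part of the kernel:

```c
#include <assert.h>
#include <stdint.h>

/* Plain-C model of OpenCL bitselect(a, b, c): each result bit comes from b
 * where the matching bit of c is 1, otherwise from a. */
static uint32_t bitselect32(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & ~c) | (b & c);
}

/* The two variants of F from the quoted kernel snippet. */
static uint32_t f_bitselect(uint32_t x, uint32_t y, uint32_t z)
{
    return bitselect32(z, y, x);
}

static uint32_t f_xor(uint32_t x, uint32_t y, uint32_t z)
{
    return z ^ (x & (y ^ z));
}

/* Hypothetical self-check: returns 1 if both variants agree on `rounds`
 * pseudo-random input triples, 0 otherwise. */
static int f_variants_agree(unsigned rounds)
{
    uint32_t s = 0x12345678u; /* xorshift32 state, arbitrary nonzero seed */

    assert(rounds > 0);
    for (unsigned i = 0; i < rounds; i++) {
        uint32_t v[3];
        for (int k = 0; k < 3; k++) {
            s ^= s << 13;
            s ^= s >> 17;
            s ^= s << 5;
            v[k] = s;
        }
        if (f_bitselect(v[0], v[1], v[2]) != f_xor(v[0], v[1], v[2]))
            return 0;
    }
    return 1;
}
```

Where x has a bit set, both forms yield the bit from y; where it is clear, both yield the bit from z.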
> I think we should come up with a couple of -Ddefines that are
> automagically added by common-opencl at (JIT-)build time, depending on
> device. I think we could use these or more:
>
> -DAMD or -DNVIDIA for starters.
> And perhaps -DGCN, -DFERMI, I'm not sure. I know Milen uses -DOLD_ATI for
> 4xxx (btw I just re-read everything he ever wrote to this list and it
> was well worth the time)
I'm not sure how these defines are going to be set, but they are going
to be useful. AMD has some code that gets the GPU family, so we can take
it and use/adapt it. See page 2-7 in [1].
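
One possible way to set such defines, as a rough sketch only: map the vendor string (as reported by clGetDeviceInfo with CL_DEVICE_VENDOR) to a -D option before building the kernel. The function name vendor_define is hypothetical, and the vendor substrings matched here are assumptions about what the drivers report:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: pick a vendor define from a device vendor string.
 * In real host code the string would come from
 * clGetDeviceInfo(dev, CL_DEVICE_VENDOR, ...); here it is just a parameter.
 * The substrings below are assumed, not taken from any driver documentation. */
static const char *vendor_define(const char *vendor)
{
    assert(vendor != NULL);

    if (strstr(vendor, "NVIDIA"))
        return "-DNVIDIA";
    if (strstr(vendor, "Advanced Micro Devices") || strstr(vendor, "AMD"))
        return "-DAMD";
    return ""; /* unknown vendor: add nothing, the kernel uses generic code */
}
```

The kernel could then select a code path with #ifdef AMD or #ifdef NVIDIA. Family-level defines such as -DGCN or -DFERMI would need more device information than the vendor string alone.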
> and maybe even things like
>
> -DLOC_MEM_SIZE=xxxx
>
> But we should not go overboard with this, just the minimum stuff needed
> for decent adaptation to the GPU. I'm not sure exactly how to pick what
> defines to send.
I have never faced a situation where I wanted to send memory info, so I
have nothing to say here.
> Another related thing is I'd like to send defines like PLAINTEXT_LENGTH,
> ROUNDS and LMEM_PER_THREAD from the host code to the kernel when
> building. Maybe another argument to opencl_init()? Or maybe I should
> just start using a rar.h file that's included by both host and kernel code.
This is the main reason I use a .h file. It is useful (while my code
is not finished).
The problem with pushing a lot of stuff through opencl_init is that I
don't trust the compiler; if you start to stress it, I bet it is going to
surprise us.
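
For comparison with the shared .h approach, passing the values as build options could look roughly like this. It is only a sketch: format_kernel_defines is a hypothetical helper, and how the string would reach opencl_init() is an open question here. In plain OpenCL it would be the options argument of clBuildProgram:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: format host-side constants as JIT build options.
 * Returns 0 on success, -1 if the output buffer is too small. */
static int format_kernel_defines(char *out, size_t len, int plaintext_length,
                                 int rounds, int lmem_per_thread)
{
    int n;

    assert(out != NULL && len > 0);
    n = snprintf(out, len,
                 "-DPLAINTEXT_LENGTH=%d -DROUNDS=%d -DLMEM_PER_THREAD=%d",
                 plaintext_length, rounds, lmem_per_thread);
    /* snprintf returns the length it wanted to write; n >= len means
     * the result was truncated. */
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With either approach the host and the kernel see the same constants. The header keeps them in one place in the source tree, while the options string lets them change without editing a file.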
>
>
> If you don't give any LWS and KPC, will it pick decent figures
> automatically? No matter how I do it, find_workgroup_size is suboptimal
> on some cards. The current code works fine on 9600GT and GTX580 but
> tends to pick a low LWS for GTX680 because I use number of SP's as a
> parameter but there's no way to tell how many cores each SP has.
I experimented a lot with my own find_best_KPC (it follows Samuele's
ideas). It is not deterministic. When I am using a clean, unused GPU (no
X on it), I will try to improve it (if possible), but in a live
environment the results were OK.
I haven't seen bad results from the find LWS and KPC routines, only
suboptimal ones. With the suboptimal numbers in mind, I ran some
experiments and selected what was best for the test I did. We could
recommend something like this.
For LWS, we can always start from the work group size multiple. Have you
tried using this constraint?
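
To illustrate the constraint, here is a minimal sketch that forces a candidate LWS to be a multiple of the preferred work group size multiple. The function name constrain_lws is hypothetical. In host code the multiple would come from clGetKernelWorkGroupInfo with CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and the maximum from CL_KERNEL_WORK_GROUP_SIZE; here both are plain parameters:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: clamp a candidate local work size to the kernel's
 * maximum and round it down to a multiple of the preferred multiple.
 * Never returns less than one multiple. */
static size_t constrain_lws(size_t candidate, size_t multiple, size_t max_lws)
{
    assert(multiple > 0 && max_lws >= multiple);

    if (candidate > max_lws)
        candidate = max_lws;
    candidate -= candidate % multiple; /* round down to a multiple */
    return candidate ? candidate : multiple;
}
```

A search routine could then test only LWS values produced this way, which shrinks the search space and avoids partially filled wavefronts or warps.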
>
> Create a -hp mode test archive:
>
> $ rar a -hppassword test.rar README
>
> Then benchmark just like this, using the OpenCL version of cRARk:
>
> $ ./crark-hp -b test.rar
>
> It takes half a minute. I'm fairly sure it will benchmark 6 characters.
> If it doesn't, add -l6 -g6
I'll email the developer. Something is wrong.
claudio@...udioandre-desktop:~/bin/crark$ ./crark-hp -b test.rar -l6 -g6
cRARk-HP (Encrypted Headers) OpenCL enabled v. 3.4e Freeware
Copyright 1995-2001, 2006-11 by P. Semjanov, http://www.crark.net
(c) PSW-soft Password Cracking Library PCL v. 2.0d by P. Semjanov
OpenCL GPU device #0 not found, CPU is used
Claudio
[1]
http://developer.amd.com/sdks/amdappsdk/assets/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf