|
|
Message-ID: <CAKGDhHUNezhzh3=-b0CF-xvEYkRUwO6Do2kpCmwwev3dU3d_wQ@mail.gmail.com>
Date: Thu, 7 May 2015 15:30:43 +0200
From: Agnieszka Bielec <bielecagnieszka8@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: [GSoC] John the Ripper support for PHC finalists
2015-05-05 20:00 GMT+02:00 Solar Designer <solar@...nwall.com>:
> On Mon, May 04, 2015 at 01:18:46AM +0200, Agnieszka Bielec wrote:
>> 2015-04-27 3:50 GMT+02:00 Solar Designer <solar@...nwall.com>:
>>
>> > BTW, bumping into total GPU global memory size may be realistic with
>> > these memory-hard hashes. Our TITAN's 6 GB was the performance
>> > limiting factor in some of the benchmarks here:
>> > http://www.openwall.com/lists/crypt-dev/2014/03/13/1
>>
>> I use only 128MB
>
> What happens if you increase GWS further? Does performance drop? What
> if you manually increase GWS even further? It might happen that the
> auto-tuning finds a local minimum, whereas a higher GWS is optimal.
the speed drops significantly when I make gws x2 bigger
>> for the experimenst i removed almost everything from my code except
>> the biggest bottleneck:
>
> That's a good experiment.
>
>> v1=vload4(0,S+sMAP(index_local)); \
>> v= v+(v1<<1); \
>> v1=v1+(v<<2); \
>> vstore4(v1,0,S+sMAP(index_local)); \
>> \
>> random_number = S[sMAP(i3)]; \
>
> BTW, can you explain why sMAP is as it is?
>
> #define sMAP(X) ((X)*GID+gid4)
>
> where:
>
> gid = get_global_id(0);
> GID = get_global_size(0);
> gid4 = gid * 4;
>
sMAP, MAP and cMAP stands for coalescing (IIRC it's around 5 (maybe
more) times faster with it)
> Also, I notice there are some if/else in G and H macros. Are they
> removed during loop unrolling, or do they translate to exec masks in the
> generated code?
I cached values from memory into variables and I must check if
i0==index_global and i0==index_local, it's faster with this. In F all
workitems execute the same if-else branch but not in H. I didn't
disassemble the code yet. I doubt
>> and the gws number with the memory usage were the same, I can nothing
>> to do with this bottleneck
>>
>> but If I remove everything from the code, GWS also doesn't differ
>
> "Everything"?
if I change my function into pomelo_crypt_kernel(args...) { nothing }
but sorry, this was a false positive, If i set manually gws in this
case everything looks normal
> Yes, caching the index_local portion of S[] in local memory (or
> you can also try private memory on NVIDIA) makes sense to me. A drawback
> is that for all writes to S[], you'd have to check if the index is low
> enough that the write needs to go to this cached copy (as well as
> possibly to the global copy, to avoid having to perform a similar check
> on global_index reads, or you can use the cache there as well - it's
> unclear which will run faster).
>
>> we could cache this segment but today graphic cards
>> rarely has 64KB of local memory, on super we have 48kB and 32KB
>> it is even worse because this is 64KB for the work-group
>>
>> we don't know how much the __private memory we have , we can only see
>> if the kernel compilation failed or not, but I'm not sure of this
>
> I think you should generally prefer to use local rather than private
> memory for this.
I think so too
> AMD GCN (dev=0 and dev=1 in super) has 64 KB of local memory per CU.
> See http://developer.amd.com/wordpress/media/2013/06/2620_final.pdf
> slide 10.
I checked local memory size using this code
clGetDeviceInfo(devices[gpu_id],CL_DEVICE_LOCAL_MEM_SIZE,sizeof(cl_ulong),&local_memory_size,NULL);
printf("mamy %llu\n",(unsigned long long) local_memory_size);
and I was getting 48 and 32 KB
thanks
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.