|
Message-ID: <CAKGDhHUNezhzh3=-b0CF-xvEYkRUwO6Do2kpCmwwev3dU3d_wQ@mail.gmail.com> Date: Thu, 7 May 2015 15:30:43 +0200 From: Agnieszka Bielec <bielecagnieszka8@...il.com> To: john-dev@...ts.openwall.com Subject: Re: [GSoC] John the Ripper support for PHC finalists 2015-05-05 20:00 GMT+02:00 Solar Designer <solar@...nwall.com>: > On Mon, May 04, 2015 at 01:18:46AM +0200, Agnieszka Bielec wrote: >> 2015-04-27 3:50 GMT+02:00 Solar Designer <solar@...nwall.com>: >> >> > BTW, bumping into total GPU global memory size may be realistic with >> > these memory-hard hashes. Our TITAN's 6 GB was the performance >> > limiting factor in some of the benchmarks here: >> > http://www.openwall.com/lists/crypt-dev/2014/03/13/1 >> >> I use only 128MB > > What happens if you increase GWS further? Does performance drop? What > if you manually increase GWS even further? It might happen that the > auto-tuning finds a local minimum, whereas a higher GWS is optimal. the speed drops significantly when I make gws x2 bigger >> for the experimenst i removed almost everything from my code except >> the biggest bottleneck: > > That's a good experiment. > >> v1=vload4(0,S+sMAP(index_local)); \ >> v= v+(v1<<1); \ >> v1=v1+(v<<2); \ >> vstore4(v1,0,S+sMAP(index_local)); \ >> \ >> random_number = S[sMAP(i3)]; \ > > BTW, can you explain why sMAP is as it is? > > #define sMAP(X) ((X)*GID+gid4) > > where: > > gid = get_global_id(0); > GID = get_global_size(0); > gid4 = gid * 4; > sMAP, MAP and cMAP stands for coalescing (IIRC it's around 5 (maybe more) times faster with it) > Also, I notice there are some if/else in G and H macros. Are they > removed during loop unrolling, or do they translate to exec masks in the > generated code? I cached values from memory into variables and I must check if i0==index_global and i0==index_local, it's faster with this. In F all workitems execute the same if-else branch but not in H. I didn't disassemble the code yet. I doubt >> and the gws number with the memory usage were the same, I can nothing >> to do with this bottleneck >> >> but If I remove everything from the code, GWS also doesn't differ > > "Everything"? if I change my function into pomelo_crypt_kernel(args...) { nothing } but sorry, this was a false positive, If i set manually gws in this case everything looks normal > Yes, caching the index_local portion of S[] in local memory (or > you can also try private memory on NVIDIA) makes sense to me. A drawback > is that for all writes to S[], you'd have to check if the index is low > enough that the write needs to go to this cached copy (as well as > possibly to the global copy, to avoid having to perform a similar check > on global_index reads, or you can use the cache there as well - it's > unclear which will run faster). > >> we could cache this segment but today graphic cards >> rarely has 64KB of local memory, on super we have 48kB and 32KB >> it is even worse because this is 64KB for the work-group >> >> we don't know how much the __private memory we have , we can only see >> if the kernel compilation failed or not, but I'm not sure of this > > I think you should generally prefer to use local rather than private > memory for this. I think so too > AMD GCN (dev=0 and dev=1 in super) has 64 KB of local memory per CU. > See http://developer.amd.com/wordpress/media/2013/06/2620_final.pdf > slide 10. I checked local memory size using this code clGetDeviceInfo(devices[gpu_id],CL_DEVICE_LOCAL_MEM_SIZE,sizeof(cl_ulong),&local_memory_size,NULL); printf("mamy %llu\n",(unsigned long long) local_memory_size); and I was getting 48 and 32 KB thanks
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.