Date: Thu, 7 May 2015 17:09:47 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: [GSoC] John the Ripper support for PHC finalists

Agnieszka,

On Thu, May 07, 2015 at 03:30:43PM +0200, Agnieszka Bielec wrote:
> 2015-05-05 20:00 GMT+02:00 Solar Designer <solar@...nwall.com>:
> > On Mon, May 04, 2015 at 01:18:46AM +0200, Agnieszka Bielec wrote:
> >> 2015-04-27 3:50 GMT+02:00 Solar Designer <solar@...nwall.com>:
> >>
> >> > BTW, bumping into total GPU global memory size may be realistic with
> >> > these memory-hard hashes.  Our TITAN's 6 GB was the performance
> >> > limiting factor in some of the benchmarks here:
> >> > http://www.openwall.com/lists/crypt-dev/2014/03/13/1
> >>
> >> I use only 128MB
> >
> > What happens if you increase GWS further?  Does performance drop?  What
> > if you manually increase GWS even further?  It might happen that the
> > auto-tuning finds a local minimum, whereas a higher GWS is optimal.
> 
> the speed drops significantly when I make gws x2 bigger

Can you try making it bigger yet anyway?  This probably won't help, but
it may be worth trying.

> > BTW, can you explain why sMAP is as it is?
> >
> > #define sMAP(X) ((X)*GID+gid4)
> >
> > where:
> >
> >         gid = get_global_id(0);
> >         GID = get_global_size(0);
> >         gid4 = gid * 4;
> >
> 
> sMAP, MAP and cMAP stand for coalescing (IIRC it's around 5 (maybe
> more) times faster with it)

Ideally, you'd literally explain these definitions in source code comments.

> > Also, I notice there are some if/else in G and H macros.  Are they
> > removed during loop unrolling, or do they translate to exec masks in the
> > generated code?
> 
> I cached values from memory into variables and I must check if
> i0==index_global and i0==index_local, it's faster with this. In F all
> workitems execute the same if-else branch but not in H. I didn't
> disassemble the code yet. I doubt

I don't understand.

What exactly have you cached?

Do you expect the "i0==index_local" and "i0==index_global" conditions to
often be true, or are these rare special cases?  I'd expect the latter,
but I don't see the purpose.

> >> and the gws number with the memory usage were the same, I can do
> >> nothing about this bottleneck
> >>
> >> but If I remove everything from the code, GWS also doesn't differ
> >
> > "Everything"?
> 
> if I change my function into pomelo_crypt_kernel(args...) { nothing  }
> but sorry, this was a false positive. If I set GWS manually in this
> case, everything looks normal

Does this suggest that GWS auto-tuning does not work correctly?

> > AMD GCN (dev=0 and dev=1 in super) has 64 KB of local memory per CU.
> > See http://developer.amd.com/wordpress/media/2013/06/2620_final.pdf
> > slide 10.
> 
> I checked local memory size using this code
> 
> clGetDeviceInfo(devices[gpu_id],CL_DEVICE_LOCAL_MEM_SIZE,sizeof(cl_ulong),&local_memory_size,NULL);
>     printf("mamy %llu\n",(unsigned long long) local_memory_size);
> 
> and I was getting 48 and 32 KB

Which devices do these correspond to?

Alexander
