john-dev - Re: [GSoC] John the Ripper support for PHC finalists



Message-ID: <20150507140947.GB11074@openwall.com>
Date: Thu, 7 May 2015 17:09:47 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: [GSoC] John the Ripper support for PHC finalists

Agnieszka,

On Thu, May 07, 2015 at 03:30:43PM +0200, Agnieszka Bielec wrote:
> 2015-05-05 20:00 GMT+02:00 Solar Designer <solar@...nwall.com>:
> > On Mon, May 04, 2015 at 01:18:46AM +0200, Agnieszka Bielec wrote:
> >> 2015-04-27 3:50 GMT+02:00 Solar Designer <solar@...nwall.com>:
> >>
> >> > BTW, bumping into total GPU global memory size may be realistic with
> >> > these memory-hard hashes.  Our TITAN's 6 GB was the performance
> >> > limiting factor in some of the benchmarks here:
> >> > http://www.openwall.com/lists/crypt-dev/2014/03/13/1
> >>
> >> I use only 128MB
> >
> > What happens if you increase GWS further?  Does performance drop?  What
> > if you manually increase GWS even further?  It might happen that the
> > auto-tuning settles on a local optimum, whereas a higher GWS is better.
> 
> The speed drops significantly when I make GWS twice as big.

Can you try making it bigger yet anyway?  This probably won't help, but
it may be worth trying.
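
One way to run this experiment systematically is to benchmark every
doubling of GWS up to a memory-imposed cap and keep the fastest, rather
than stopping at the first slowdown.  The sketch below is only an
illustration: sweep_gws(), bench_fn and toy_speed() are hypothetical
names, not part of John the Ripper's auto-tuning code, and the speed
figures are made up to show a dip followed by a higher peak.

```c
#include <stddef.h>

/* Hypothetical benchmark callback: returns the measured speed
 * (e.g. candidates per second) for a given global work size. */
typedef double (*bench_fn)(size_t gws, void *ctx);

/* Benchmark start, 2*start, 4*start, ... up to max_gws and return the
 * fastest GWS seen.  Unlike a search that stops at the first slowdown,
 * this keeps going past a dip, so it cannot get stuck on a local optimum
 * within the tested range.  The best speed is stored in *best_speed
 * (or -1.0 if nothing was benchmarked). */
size_t sweep_gws(size_t start, size_t max_gws, bench_fn bench, void *ctx,
                 double *best_speed)
{
    size_t best = start;
    double best_s = -1.0;

    for (size_t gws = start; gws != 0 && gws <= max_gws; gws *= 2) {
        double s = bench(gws, ctx);
        if (s > best_s) {
            best_s = s;
            best = gws;
        }
        if (gws > max_gws / 2)
            break; /* the next doubling would exceed max_gws */
    }
    if (best_speed)
        *best_speed = best_s;
    return best;
}

/* Made-up speed curve for illustration only: the speed drops when going
 * from 512 to 1024, then recovers and peaks at 4096. */
double toy_speed(size_t gws, void *ctx)
{
    (void)ctx;
    switch (gws) {
    case 256:  return 100.0;
    case 512:  return 180.0;
    case 1024: return 90.0;
    case 2048: return 150.0;
    case 4096: return 220.0;
    default:   return 50.0;
    }
}
```

With a real kernel, the callback would enqueue the kernel at the given
GWS and time it, and max_gws would follow from the per-candidate memory
requirement and the device's global memory size.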

> > BTW, can you explain why sMAP is as it is?
> >
> > #define sMAP(X) ((X)*GID+gid4)
> >
> > where:
> >
> >         gid = get_global_id(0);
> >         GID = get_global_size(0);
> >         gid4 = gid * 4;
> >
> 
> sMAP, MAP and cMAP are there for coalescing (IIRC it's around 5 times
> faster with it, maybe more)

Ideally, you'd literally explain these definitions in source code comments.
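
As an illustration of the kind of comment that would help, here is a
host-side C sketch of the same index arithmetic.  smap() is a
hypothetical stand-in for the macro, and the reading that X advances in
multiples of 4 is an assumption, not something confirmed in this thread.

```c
#include <stddef.h>

/* Same arithmetic as the quoted macro
 *
 *     #define sMAP(X) ((X)*GID+gid4)
 *
 * with gid and GID passed explicitly.  In the kernel they come from
 * get_global_id(0) and get_global_size(0), and gid4 = gid * 4.
 *
 * Assumption (not stated in the thread): X is always a multiple of 4.
 * Then sMAP(X) = 4 * ((X / 4) * GID + gid), which means:
 *   - for a fixed X, work-items gid and gid + 1 are 4 elements apart,
 *     so neighbouring work-items touch neighbouring groups of 4;
 *   - for a fixed gid, stepping X by 4 moves 4 * GID elements ahead.
 * If X is not a multiple of 4, different (X, gid) pairs can map to the
 * same index, e.g. (1, 0) and (0, 2) when GID == 8. */
size_t smap(size_t x, size_t gid, size_t GID)
{
    size_t gid4 = gid * 4;
    return x * GID + gid4;
}
```

For example, with GID == 8, smap(0, 0, 8) is 0, smap(0, 1, 8) is 4, and
smap(4, 0, 8) is 32.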

> > Also, I notice there are some if/else in G and H macros.  Are they
> > removed during loop unrolling, or do they translate to exec masks in the
> > generated code?
> 
> I cached values from memory in variables, and I must check whether
> i0==index_global and i0==index_local; it's faster this way. In F all
> work-items execute the same if-else branch, but not in H. I haven't
> disassembled the code yet. I doubt

I don't understand.

What exactly have you cached?

Do you expect the "i0==index_local" and "i0==index_global" conditions to
often be true, or are these rare special cases?  I'd expect the latter,
but I don't see the purpose.

> >> and the GWS number and the memory usage were the same; I can do
> >> nothing about this bottleneck
> >>
> >> but if I remove everything from the code, GWS also doesn't differ
> >
> > "Everything"?
> 
> If I change my function into pomelo_crypt_kernel(args...) { nothing }.
> But sorry, this was a false positive: if I set GWS manually in this
> case, everything looks normal.

Does this suggest that GWS auto-tuning does not work correctly?

> > AMD GCN (dev=0 and dev=1 in super) has 64 KB of local memory per CU.
> > See http://developer.amd.com/wordpress/media/2013/06/2620_final.pdf
> > slide 10.
> 
> I checked local memory size using this code
> 
> clGetDeviceInfo(devices[gpu_id],CL_DEVICE_LOCAL_MEM_SIZE,sizeof(cl_ulong),&local_memory_size,NULL);
>     printf("mamy %llu\n",(unsigned long long) local_memory_size);
> 
> and I was getting 48 and 32 KB

Which devices do these correspond to?

Alexander
