john-dev - Re: PHC: Argon2 on GPU

Message-ID: <20150820033010.GA24909@openwall.com>
Date: Thu, 20 Aug 2015 06:30:10 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on GPU

On Thu, Aug 20, 2015 at 04:53:55AM +0300, Solar Designer wrote:
> On Wed, Aug 19, 2015 at 07:41:02PM +0200, Agnieszka Bielec wrote:
> > ptxas info    : Function properties for FillSegment
> > ptxas         .     0 bytes stack frame, 17400 bytes spill stores, 19352 bytes spill loads
> > ptxas info    : Function properties for GenerateAddresses
> > ptxas         .     0 bytes stack frame, 7780 bytes spill stores, 11648 bytes spill loads
> 
> The spills in FillSegment and GenerateAddresses are pretty bad.  Where
> do they come from, and why so much?  In FillSegment you use 1 KB per
> work-item for addresses[], in GenerateAddresses you use 2 KB for two
> blocks.  GenerateAddresses is called from FillSegment, so adds its
> private memory needs on top of FillSegment's.

There's also 1 KB ref_block[] in ComputeBlock and in ComputeBlock_pgg.

On super's -dev=5, I was getting:

ptxas info    : Function properties for FillSegment
ptxas         .     8216 bytes stack frame, 9708 bytes spill stores, 7776 bytes spill loads
ptxas info    : Function properties for GenerateAddresses
ptxas         .     6104 bytes stack frame, 4056 bytes spill stores, 4124 bytes spill loads

I've optimized this to:

ptxas info    : Function properties for FillSegment
ptxas         .     4408 bytes stack frame, 5984 bytes spill stores, 4020 bytes spill loads
ptxas info    : Function properties for GenerateAddresses
ptxas         .     1304 bytes stack frame, 388 bytes spill stores, 400 bytes spill loads

with the attached patch.  As it is, it provides no speedup for me (in
fact, there's a very slight slowdown), but it should illustrate to you
what to optimize.  I expect that once you convert those uint operations
to work on ulong2 all the time, you'll see a slight speedup.  (The changes
in performance seen from these code changes are relatively minor because
GenerateAddresses corresponds to a relatively small part of the total
running time.  There is a significant reduction in global memory usage,
though, as seen via nvidia-smi.)

In fact, those typecasts between ulong2 and uint pointers are probably
disallowed, as they violate strict aliasing rules.  Also, your code
heavily depends on the architecture being little-endian (just like
Argon2's original code did, which is a known bug).  You should try to
avoid that as you proceed to optimize your OpenCL kernels.  You'll find
that avoiding endianness dependencies goes along with avoiding strict
aliasing violations and achieving better speed as well (since the kernel
would use its full allocated SIMD width all the time, rather than only
part of the time).
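
To illustrate the point about endianness and aliasing, here is a minimal
sketch in plain C (not OpenCL, and not taken from the attached patch).
The helper names load64_le, lo32 and hi32 are hypothetical.  The idea is
to build 64-bit words from bytes with shifts, and to extract 32-bit
halves with shifts rather than by casting a 64-bit pointer to a uint
pointer, so the result does not depend on host byte order:

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only; these names do not
 * appear in the patch or in Argon2's reference code. */

/* Assemble a 64-bit word from 8 bytes stored in little-endian order.
 * Only shifts and ORs are used, so the value is the same on
 * little-endian and big-endian hosts. */
static uint64_t load64_le(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Extract the low and high 32-bit halves by value instead of reading
 * through a uint32_t pointer that aliases the 64-bit object. */
static uint32_t lo32(uint64_t x)
{
    return (uint32_t)x;
}

static uint32_t hi32(uint64_t x)
{
    return (uint32_t)(x >> 32);
}
```

In an OpenCL kernel the same approach would apply to the components of a
ulong2 value, keeping the data in its vector type throughout.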

BTW, out_tmp[] in Initialize() appears to be twice as large as it needs
to be:

	ulong2 out_tmp[BLOCK_SIZE/8];

ulong2 is 16 bytes, but you divide by 8.  Or is this on purpose?  Why?
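
The size arithmetic can be checked with a small plain-C sketch.  It
assumes BLOCK_SIZE is a byte count of 1024 (consistent with the 1 KB
blocks mentioned earlier in the thread, but not confirmed from the
source) and uses a hypothetical ulong2_t struct as a stand-in for
OpenCL's ulong2:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: BLOCK_SIZE is expressed in bytes and equals 1024. */
#define BLOCK_SIZE 1024

/* Hypothetical stand-in for OpenCL's ulong2: two 64-bit lanes. */
typedef struct {
    uint64_t x, y;
} ulong2_t;

/* Bytes occupied by an array of `elems` ulong2-sized elements. */
static size_t bytes_for(size_t elems)
{
    return elems * sizeof(ulong2_t);
}
```

Under these assumptions, bytes_for(BLOCK_SIZE / 8) is 2048, i.e. two
blocks, while bytes_for(BLOCK_SIZE / sizeof(ulong2_t)) is 1024, i.e.
exactly one block.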

Alexander

View attachment "john-argon2i-opencl-opt1.diff" of type "text/plain" (3036 bytes)