Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150819044057.GA16484@openwall.com>
Date: Wed, 19 Aug 2015 07:40:57 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on GPU

Agnieszka,

On Wed, Aug 19, 2015 at 07:10:42AM +0300, Solar Designer wrote:
> I think the modulo division operations are causing a lot of latency:
> 
> [solar@...er opencl]$ fgrep % argon2*.cl
> argon2d_kernel.cl:              reference_block_offset = (phi % r);
> argon2i_kernel.cl:              uint reference_block_index = addresses[0] % r;
> argon2i_kernel.cl:              uint reference_block_index = addresses[i] % r;

You might also achieve speedup by moving these operations up in code, to
be performed as soon as their input data is available.  Maybe the
compiler already does it for you, or maybe not.

In 2d, you could compute an equivalent of "phi % r" inside ComputeBlock,
after having invoked 9 out of 16 BLAKE2 rounds.  (The remaining 7 only
affect state[] elements other than the one used for phi.)

In 2i, you could compute "addresses[...] % r" on writes to addresses[]
(thus, store the block indices instead of the original addresses).

"r" might not yet have the correct value at that point, but you should
be able to correctly predict what it would have been at the "%" time.

However, I doubt this will fully do the latency hiding trick on GPUs.
Parallel processing capabilities to perform the modulo operations
concurrently are simply not available at that level in the OpenCL
programming model:

For 2d, there's a chance you'd save the latency of one modulo operation
(and that's all you need) if the GPU has an instruction like this in
hardware (and microcode?) and it doesn't block further processing (until
there's a data dependency on the result).

For 2i, there's no way those 256 modulo operations would be run
concurrently from one work-item.  And besides, to run them concurrently
you'd need to provide storage for the results (you already have that
addresses[] array) and then it's no better than copying this data e.g.
from a larger array (holding all precomputed indices) in global memory.

Alexander

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.