Message-ID: <20150823084825.GA16692@openwall.com>
Date: Sun, 23 Aug 2015 11:48:25 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on GPU

On Wed, Aug 19, 2015 at 07:39:09PM +0300, Solar Designer wrote:
> Agnieszka,
> 
> As has just been mentioned on the PHC list, you need to try
> exploiting the parallelism inside ComputeBlock.  There are two groups of
> 8 BLAKE2 rounds.  In each of the groups, the 8 rounds may be computed in
> parallel.  When your kernel is working on ulong2, I think it won't fully
> exploit this parallelism, except that the parallelism may allow for
> better pipelining within those ulong2 lanes (further instructions need
> not stall, since their input data is separate and thus readily
> available).
> 
> I think you may try working on ulong16 or ulong8 instead.  I expect
> ulong8 to match the current GPU hardware best, but OTOH ulong16 makes
> more parallelism apparent to the OpenCL compiler and allocates it to one
> work-item.  So please try both and see which works best.
> 
> With this, you'd launch groups of 8 or 4 BLAKE2 rounds on those wider
> vectors, and then between the two groups of 8 in ComputeBlock you'd need
> to shuffle vector elements (moving them between two vectors of ulong8 if
> you use that type) instead of shuffling state[] elements like you do now
> (and like the original Argon2 code did).
> 
> The expectation is that a single kernel invocation will then make use of
> more SIMD width (2x512- or 512-bit instead of the current 128-bit), yet
> only the same amount of local and private memory as it does now.  So
> you'd pack as many of these kernels per GPU as you do now, but they will
> run faster (up to 8x faster) since they'd process 8 or 4 BLAKE2 rounds
> in parallel rather than sequentially.
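
For the archives, here's roughly the shape I had in mind, as an
untested sketch (blake2_round_v8, transpose_v8, and the lane layout
are all made up for illustration, and plain BLAKE2b G is shown where
Argon2 actually uses a modified G):

/*
 * Layout assumption: v[i] holds ulong i of the 16-ulong round state,
 * with the 8 independent rounds of a group spread across the 8 lanes,
 * so ulong8 v[16] covers the whole 1 KiB block.
 */
void G_v8(ulong8 *a, ulong8 *b, ulong8 *c, ulong8 *d)
{
    *a += *b; *d = rotate(*d ^ *a, (ulong8)32); /* rotr64 by 32 */
    *c += *d; *b = rotate(*b ^ *c, (ulong8)40); /* rotr64 by 24 */
    *a += *b; *d = rotate(*d ^ *a, (ulong8)48); /* rotr64 by 16 */
    *c += *d; *b = rotate(*b ^ *c, (ulong8)1);  /* rotr64 by 63 */
}

/* One vector pass == all 8 BLAKE2 rounds of one group at once */
void blake2_round_v8(ulong8 v[16])
{
    G_v8(&v[0], &v[4], &v[8],  &v[12]);
    G_v8(&v[1], &v[5], &v[9],  &v[13]);
    G_v8(&v[2], &v[6], &v[10], &v[14]);
    G_v8(&v[3], &v[7], &v[11], &v[15]);
    G_v8(&v[0], &v[5], &v[10], &v[15]);
    G_v8(&v[1], &v[6], &v[11], &v[12]);
    G_v8(&v[2], &v[7], &v[8],  &v[13]);
    G_v8(&v[3], &v[4], &v[9],  &v[14]);
}

/* The shuffle between the two groups: lane r of the "row" rounds
 * becomes elements 2r, 2r+1 of the "column" rounds.  Done naively
 * through a scalar temporary here; real code would use vector
 * shuffles.  The mapping is an involution, so it also shuffles back. */
void transpose_v8(ulong8 v[16])
{
    union { ulong8 v[16]; ulong s[128]; } t, u;
    for (int i = 0; i < 16; i++)
        t.v[i] = v[i];
    for (int r = 0; r < 8; r++)
        for (int c = 0; c < 8; c++) {
            u.s[(2 * r) * 8 + c] = t.s[(2 * c) * 8 + r];
            u.s[(2 * r + 1) * 8 + c] = t.s[(2 * c + 1) * 8 + r];
        }
    for (int i = 0; i < 16; i++)
        v[i] = u.v[i];
}

void ComputeBlock_v8(ulong8 v[16])
{
    blake2_round_v8(v); /* the first group of 8 rounds, in parallel */
    transpose_v8(v);    /* regroup elements between the two groups */
    blake2_round_v8(v); /* the second group of 8 rounds, in parallel */
    transpose_v8(v);    /* back to the original layout */
}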

I was totally wrong and naive in hoping that use of ulong2 (or wider)
would somehow give us a corresponding portion of the GPU hardware SIMD
vectors.  There are simply no such instructions.  We're instead given
32-bit elements in different registers.

I think vectorized kernels like that work as I had expected when
targeting CPUs with SIMD, but not when targeting GPUs.

So our only hope to exploit Argon2's ComputeBlock parallelism on GPUs is
through playing by the SIMT rules.
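
That would mean something along these lines -- again just an untested
sketch with made-up helpers and layout: each group of 8 independent
rounds maps onto 8 cooperating work-items, and the barriers sit
exactly at the boundary between the two groups (the surrounding
Argon2 logic, like the XOR with the reference block, is omitted):

void G_l(__local ulong *a, __local ulong *b,
    __local ulong *c, __local ulong *d)
{
    *a += *b; *d = rotate(*d ^ *a, 32UL); /* rotr64 by 32 */
    *c += *d; *b = rotate(*b ^ *c, 40UL); /* rotr64 by 24 */
    *a += *b; *d = rotate(*d ^ *a, 48UL); /* rotr64 by 16 */
    *c += *d; *b = rotate(*b ^ *c, 1UL);  /* rotr64 by 63 */
}

/* One BLAKE2 round over 16 ulongs picked out by pointers, so the
 * contiguous "row" rounds and strided "column" rounds share code */
void blake2_round_gather(__local ulong *m, uint base, uint stride)
{
    __local ulong *v[16];
    for (uint i = 0; i < 8; i++) {
        v[2 * i] = &m[base + i * stride];
        v[2 * i + 1] = &m[base + i * stride + 1];
    }
    G_l(v[0], v[4], v[8],  v[12]);
    G_l(v[1], v[5], v[9],  v[13]);
    G_l(v[2], v[6], v[10], v[14]);
    G_l(v[3], v[7], v[11], v[15]);
    G_l(v[0], v[5], v[10], v[15]);
    G_l(v[1], v[6], v[11], v[12]);
    G_l(v[2], v[7], v[8],  v[13]);
    G_l(v[3], v[4], v[9],  v[14]);
}

__kernel void compute_block_simt(__global ulong *blocks)
{
    __local ulong m[128];        /* one 1 KiB Argon2 block */
    uint lane = get_local_id(0); /* assumes a local size of 8 */
    __global ulong *in = blocks + get_group_id(0) * 128;

    /* Cooperative load: each lane fetches its 16 ulongs */
    for (uint i = 0; i < 16; i++)
        m[lane * 16 + i] = in[lane * 16 + i];
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Group 1: the 8 "row" rounds, one per work-item */
    blake2_round_gather(m, lane * 16, 2);
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Group 2: the 8 "column" rounds, again one per work-item */
    blake2_round_gather(m, lane * 2, 16);
    barrier(CLK_LOCAL_MEM_FENCE);

    /* Cooperative store */
    for (uint i = 0; i < 16; i++)
        in[lane * 16 + i] = m[lane * 16 + i];
}

A real kernel would pack several blocks per work-group and keep the
rest of Argon2 around this, but the round-per-work-item split with
barriers between the two groups is the part that matters here.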

Alexander
