john-dev - Re: PHC: Argon2 on GPU

Message-ID: <20150820203403.GA29081@openwall.com>
Date: Thu, 20 Aug 2015 23:34:03 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on GPU

On Thu, Aug 20, 2015 at 08:04:20PM +0200, Agnieszka Bielec wrote:
> 2015-08-19 18:39 GMT+02:00 Solar Designer <solar@...nwall.com>:
> > I think you may try working on ulong16 or ulong8 instead.  I expect
> > ulong8 to match the current GPU hardware best, but OTOH ulong16 makes
> > more parallelism apparent to the OpenCL compiler and allocates it to one
> > work-item.  So please try both and see which works best.
> 
> I created something using ulong8, it's almost not noticeable better
> speed in my laptop but worse on super both cards, no idea if this is
> what you wanted ( I think that not ), you can take a look on branch
> vector8

This is a step towards what I meant, but you're not quite there yet.

You need to convert more of the processing to ulong8 (or ulong16).  For
example, you still have "ulong2 ref_block[64];" in ComputeBlock_pgg(),
but it should become an array of ulong8 too.  And so on.  Yes, this
means that you either have to convert the callers to using this wider
vector type as well, or you have to convert between the vectors
somewhere (which likely results in performance loss).  You should also
use the wider vector type for the global memory references and in the
kernel parameter list.
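
To illustrate using one wide type end to end, here is a minimal host-side
sketch in plain C.  It is not the actual kernel code: the ulong8 typedef
relies on the GCC/Clang vector extension as a stand-in for OpenCL's
ulong8, and xor_block_ulong8() and demo_xor_block() are hypothetical
names made up for this example.  The point is only that a 1024-byte block
is addressed as 16 ulong8's everywhere, with no ulong2 view in between.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side stand-in for OpenCL's ulong8 (GCC/Clang vector extension). */
typedef uint64_t ulong8 __attribute__((vector_size(64)));

/* A 1024-byte Argon2 block is 128 64-bit words, i.e. 16 ulong8's. */
#define BLOCK_WORDS  128
#define BLOCK_ULONG8 (BLOCK_WORDS / 8)

/* Hypothetical helper: XOR a reference block into the state, one ulong8
 * at a time.  In an OpenCL kernel, ref would point into global memory
 * and use the same ulong8 type in the kernel parameter list. */
void xor_block_ulong8(ulong8 *state, const ulong8 *ref)
{
	for (int i = 0; i < BLOCK_ULONG8; i++)
		state[i] ^= ref[i];
}

/* Check the vector version against a plain word-by-word XOR. */
int demo_xor_block(void)
{
	ulong8 state[BLOCK_ULONG8], ref[BLOCK_ULONG8];
	uint64_t expected[BLOCK_WORDS];

	for (int k = 0; k < BLOCK_WORDS; k++) {
		uint64_t s = k, r = 0xff00ff00ff00ff00ULL + k;
		state[k / 8][k % 8] = s;
		ref[k / 8][k % 8] = r;
		expected[k] = s ^ r;
	}

	xor_block_ulong8(state, ref);

	for (int k = 0; k < BLOCK_WORDS; k++)
		assert(state[k / 8][k % 8] == expected[k]);
	return 1;
}
```

Calling demo_xor_block() from a main() is enough to run the check.
Whether the wider type actually helps on a given GPU still has to be
measured.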

The only shuffling of ulong2's inside/between ulong8's should be between
the two groups of 8 BLAKE2 rounds.  Right now, you also have conversion
from ulong2 to ulong8 before the first group of 8 BLAKE2 rounds - it
should go away when you optimize this code further as I suggested above.

Also, the shuffling can probably be optimized.  Right now, you keep the
full block in state[] and you also have 8 ulong8's storing half a block
at a time.  You may instead have 16 ulong8's storing the entire block.
Yes, the shuffling might require some temporary storage, but you don't
necessarily have to write the entire block to a temporary array of
ulong2's - perhaps there's a more efficient way for the specific kind of
shuffling that is being done.
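
As one possible direction, here is a host-side C sketch that keeps the
whole block in 16 ulong8's and does the shuffle with pairwise swaps, so
the only temporary is one 64-bit word at a time.  It rests on an
assumption that needs checking against the actual code: that the shuffle
between the two groups of rounds amounts to transposing the block viewed
as an 8x8 matrix of 16-byte units (ulong2's).  The ulong8 typedef uses
the GCC/Clang vector extension, and swap_reg(), transpose_8x8() and
demo_transpose() are hypothetical names.  On a GPU, vector components
would have to be selected at compile time, so this only models the data
movement, not the final code.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side stand-in for OpenCL's ulong8 (GCC/Clang vector extension). */
typedef uint64_t ulong8 __attribute__((vector_size(64)));

/* Swap 16-byte unit i with unit j.  A unit is a pair of adjacent 64-bit
 * words (one ulong2); the block is 64 units held in 16 ulong8's. */
static void swap_reg(ulong8 *blk, int i, int j)
{
	assert(i >= 0 && i < 64 && j >= 0 && j < 64);
	for (int w = 0; w < 2; w++) {
		int a = 2 * i + w, b = 2 * j + w;
		uint64_t t = blk[a / 8][a % 8];
		blk[a / 8][a % 8] = blk[b / 8][b % 8];
		blk[b / 8][b % 8] = t;
	}
}

/* In-place transpose of the 8x8 matrix of 16-byte units (assumed to be
 * the shuffle needed between the two groups of rounds). */
void transpose_8x8(ulong8 *blk)
{
	for (int r = 0; r < 8; r++)
		for (int c = r + 1; c < 8; c++)
			swap_reg(blk, 8 * r + c, 8 * c + r);
}

int demo_transpose(void)
{
	ulong8 blk[16];

	for (int k = 0; k < 128; k++)
		blk[k / 8][k % 8] = k;

	transpose_8x8(blk);
	/* Unit (row 1, col 2) now holds what was at (row 2, col 1),
	 * i.e. words 34 and 35 have moved to word positions 20 and 21. */
	assert(blk[2][4] == 34 && blk[2][5] == 35);

	/* Transposing twice restores the original layout. */
	transpose_8x8(blk);
	for (int k = 0; k < 128; k++)
		assert(blk[k / 8][k % 8] == (uint64_t)k);
	return 1;
}
```

If the assumption holds, the second group of rounds could run on the
rows of the transposed block, followed by a second transpose to restore
the layout.  Whether that beats the temporary ulong2 array is something
only the generated code and timings can tell.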

Also, we're optimizing this blindfolded, and that's wrong.  We should be
reviewing the generated code.  You may patch common-opencl.c:
opencl_build_kernel_opt() to invoke opencl_build() like this:

	opencl_build(sequential_id, opts, 1, "kernel.out");

instead of the current:

	opencl_build(sequential_id, opts, 0, NULL);

Then when targeting NVIDIA cards it dumps PTX assembly to the filename
specified there.  It looks something like this, just much larger:

http://arrayfire.com/demystifying-ptx-code/

You could start by experimenting with a kernel that is much simpler than
Argon2 yet similar in some ways: implement some trivial operation like
XOR on different vector widths and see whether/how this changes the
assembly.  Then make it slightly less trivial (just enough to prevent
the compiler from optimizing things out) and add uses of private or
local memory, and see if you can make it run faster by using wider
vectors for the same private or local memory usage.
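
A rough host-side analog of that first experiment, as a sketch only: the
same XOR over a 1024-byte buffer written for three vector widths.  The
vector typedefs use the GCC/Clang vector extension in place of the
OpenCL types, and DEFINE_XOR, union buf and demo_xor_widths() are names
invented for this example.  The real experiment would use OpenCL kernels
and the PTX dump described above; this only shows the shape of it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Host-side stand-ins for the OpenCL vector types (GCC/Clang extension). */
typedef uint64_t ulong2  __attribute__((vector_size(16)));
typedef uint64_t ulong8  __attribute__((vector_size(64)));
typedef uint64_t ulong16 __attribute__((vector_size(128)));

/* Generate xor_ulong2(), xor_ulong8() and xor_ulong16(): the same
 * trivial operation, differing only in vector width. */
#define DEFINE_XOR(T) \
	void xor_##T(T *dst, const T *src, size_t bytes) \
	{ \
		assert(bytes % sizeof(T) == 0); \
		for (size_t i = 0; i < bytes / sizeof(T); i++) \
			dst[i] ^= src[i]; \
	}

DEFINE_XOR(ulong2)
DEFINE_XOR(ulong8)
DEFINE_XOR(ulong16)

/* One 1024-byte buffer viewed at each width (and as plain words). */
union buf {
	uint64_t w[128];
	ulong2 v2[64];
	ulong8 v8[16];
	ulong16 v16[8];
};

/* All three widths must produce identical results. */
int demo_xor_widths(void)
{
	union buf a, b, c, src;

	for (int k = 0; k < 128; k++) {
		a.w[k] = b.w[k] = c.w[k] = k;
		src.w[k] = 0x5555555555555555ULL;
	}

	xor_ulong2(a.v2, src.v2, sizeof(a));
	xor_ulong8(b.v8, src.v8, sizeof(b));
	xor_ulong16(c.v16, src.v16, sizeof(c));

	for (int k = 0; k < 128; k++) {
		assert(a.w[k] == ((uint64_t)k ^ 0x5555555555555555ULL));
		assert(a.w[k] == b.w[k] && b.w[k] == c.w[k]);
	}
	return 1;
}
```

Compiling this with "gcc -O2 -S" shows how the host compiler treats each
width.  That says nothing about the GPU, but the corresponding OpenCL
kernels can be compared in the same way through their PTX.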

Alexander