john-dev - Re: PHC: Argon2 on GPU

Message-ID: <20150823072153.GA15333@openwall.com>
Date: Sun, 23 Aug 2015 10:21:54 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on GPU

On Sun, Aug 23, 2015 at 08:53:05AM +0300, Solar Designer wrote:
> There might also be room for improvement of Argon2 performance on GPUs
> through special handling of BLAKE2b's 64-bit operations.  See:
> 
> http://hashcat.net/forum/archive/index.php?thread-3422.html
> 
> "All the 64-bit based algorithms like SHA512, Keccak etc dropped in
> performance with each new driver a little bit.  So it was hard to notice.
> GPUs instructions operate still on 32-bit only, so the 64-bit mode is
> emulated.  But the way how it is emulated was somehow broken.  I was
> able to pinpoint the problem where the biggest drop came from and I
> managed to workaround it.  For NVidia it took me a little PTX hack, for
> AMD luckily there was no binary hack required."
> 
> Unfortunately, atom doesn't go into further detail there (but we could
> try asking him).  I guess the approach amounts to explicitly building
> 64-bit addition out of 32-bit additions.  Maybe having it split like
> that right away (rather than only in the PTX or IL to ISA translation)
> is somehow friendlier to current compilers.

In PTX, we appear to be getting add.s64 now.  I guess it'd be better
to get add.cc.u32 followed by addc.u32:

http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#extended-precision-integer-arithmetic-instructions

In source code, it should be something like:

dst_lo = src1_lo + src2_lo;
dst_hi = src1_hi + src2_hi + (dst_lo < src1_lo);

(where in our case these would likely be 32-bit elements of uint4), but
it'll probably take effort to reach the desired PTX code.
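
To make the carry handling above concrete, here is a minimal host-side C sketch of the same two-line addition, checked against native 64-bit addition.  The u32x2 type and the function names are made up for illustration (in the kernel these would more likely be elements of uint4), and this says nothing about which PTX a given compiler will actually emit for it.

```c
#include <stdint.h>

/* Hypothetical illustration type: a 64-bit value held as two 32-bit halves. */
typedef struct { uint32_t lo, hi; } u32x2;

static u32x2 split64(uint64_t x)
{
	u32x2 r;
	r.lo = (uint32_t)x;
	r.hi = (uint32_t)(x >> 32);
	return r;
}

static uint64_t join64(u32x2 x)
{
	return ((uint64_t)x.hi << 32) | x.lo;
}

/* 64-bit addition built out of 32-bit additions.  The low-half sum wraps
 * modulo 2^32, so a carry occurred exactly when the result is smaller than
 * one of the inputs. */
static u32x2 add64_split(u32x2 a, u32x2 b)
{
	u32x2 r;
	r.lo = a.lo + b.lo;
	r.hi = a.hi + b.hi + (r.lo < a.lo);
	return r;
}
```

The comparison (r.lo < a.lo) is what we'd hope the compiler turns into the carry flag consumed by addc.u32, but that is exactly the part that would need checking in the generated PTX.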

For bit rotates, we appear to be getting things like this:

        {
        .reg .b64 %lhs;
        .reg .b64 %rhs;
        shl.b64         %lhs, %rd12449, 1;
        shr.b64         %rhs, %rd12449, 63;
        add.u64         %rd12450, %lhs, %rhs;
        }

This probably translates to at least 6 native instructions.  There ought
to be more efficient ways, such as involving bfe and/or bfi instructions:

http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#integer-arithmetic-instructions-bfe
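
For reference, here is a rough C sketch of what a 64-bit right rotate looks like when written directly on 32-bit halves, which is the form we'd presumably want the compiler to end up with.  The type and function names are hypothetical, and the sketch uses plain shifts and ORs rather than bfe/bfi; whether this form actually yields better PTX is untested here.

```c
#include <stdint.h>

/* Hypothetical illustration type: a 64-bit value as two 32-bit halves. */
typedef struct { uint32_t lo, hi; } halves64;

static halves64 halves_from_u64(uint64_t x)
{
	halves64 r;
	r.lo = (uint32_t)x;
	r.hi = (uint32_t)(x >> 32);
	return r;
}

static uint64_t halves_to_u64(halves64 x)
{
	return ((uint64_t)x.hi << 32) | x.lo;
}

/* Rotate right by n bits using only 32-bit operations. */
static halves64 rotr64_halves(halves64 x, unsigned n)
{
	halves64 r;

	n &= 63;
	if (n >= 32) {
		/* A rotate by 32 is just a swap of the two halves. */
		uint32_t t = x.lo;
		x.lo = x.hi;
		x.hi = t;
		n -= 32;
	}
	if (n == 0)
		return x; /* avoid an undefined 32-bit shift by 32 below */

	/* Each output half takes its low bits from one input half and its
	 * high bits from the other. */
	r.lo = (x.lo >> n) | (x.hi << (32 - n));
	r.hi = (x.hi >> n) | (x.lo << (32 - n));
	return r;
}
```

With the rotate counts BLAKE2b uses (32, 24, 16, 63), the count is a compile-time constant, so the branches above would disappear: 32 becomes a pure register swap, and the rest become two shift/or pairs each.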

For 32-bit rotates, there also was the mad trick:

http://www.openwall.com/lists/john-dev/2012/03/22/7

For starters, we should try OpenCL's rotate() and see if it translates
to decent PTX code these days.  Right now, opencl_blake2.h: rotr64()
does not yet use rotate(), while opencl_blake2-round-no-msg.h does.  We
should switch both to use the same approach, at least to make reviewing
the generated PTX code easier.

Also, right now opencl_blake2-round-no-msg.h uses rotate() with negative
shift counts.  We should change this to use the corresponding positive
shift counts instead.
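
To illustrate the equivalence, here is a small C model, assuming (as I understand the OpenCL spec) that rotate() is a left rotate with the count taken modulo the bit width.  The rotl64()/rotr64() helpers are stand-ins written for this sketch, not the functions from our headers.

```c
#include <stdint.h>

/* Model of a left rotate whose count is reduced modulo 64, which is how I
 * understand OpenCL's rotate() to behave for ulong (an assumption here). */
static uint64_t rotl64(uint64_t x, uint64_t n)
{
	n &= 63;
	return (x << n) | (x >> ((64 - n) & 63));
}

/* Plain right rotate, for comparison. */
static uint64_t rotr64(uint64_t x, unsigned n)
{
	n &= 63;
	return (x >> n) | (x << ((64 - n) & 63));
}
```

Under that assumption, a count of (uint64_t)-24 reduces to 40, so BLAKE2b's right rotates by 32, 24, 16 and 63 correspond to left rotates by 32, 40, 48 and 1.  Writing those positive counts explicitly avoids relying on the modulo reduction.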

opencl_blake2.h uses the "__constant uchar blake2b_sigma" array to
simplify the source code.  Unfortunately, this actually gets into the
compiled code:

        ld.const.u8     %r13, [blake2b_sigma+1];
        mul.wide.u32    %rd84, %r13, 8;
        add.s64         %rd85, %rd2021, %rd84;
        ld.local.u64    %rd86, [%rd85];

We should optimize this in the source using cpp macros, or alternatively
those non-performance-critical uses of BLAKE2 may be kept on the host.
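
As a sketch of the cpp macro approach, the following C compares one BLAKE2b round written with a sigma table lookup against the same round with the message indices spelled out as literals.  The macro and function names are made up for this example, and only the second sigma row (as given in the BLAKE2 specification) is shown; a real change would cover all rounds.

```c
#include <stdint.h>

#define ROTR64(x, n) (((x) >> (n)) | ((x) << (64 - (n))))

/* BLAKE2b G function on state array v; mx and my are message words. */
#define G(v, a, b, c, d, mx, my) do { \
	v[a] = v[a] + v[b] + (mx); v[d] = ROTR64(v[d] ^ v[a], 32); \
	v[c] = v[c] + v[d];        v[b] = ROTR64(v[b] ^ v[c], 24); \
	v[a] = v[a] + v[b] + (my); v[d] = ROTR64(v[d] ^ v[a], 16); \
	v[c] = v[c] + v[d];        v[b] = ROTR64(v[b] ^ v[c], 63); \
} while (0)

/* Second row of BLAKE2b's sigma permutation. */
static const uint8_t sigma1[16] =
	{ 14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3 };

/* Table-driven form: message indices are loaded at run time. */
static void round1_table(uint64_t v[16], const uint64_t m[16])
{
	static const uint8_t lane[8][4] = {
		{ 0, 4, 8, 12 }, { 1, 5, 9, 13 }, { 2, 6, 10, 14 }, { 3, 7, 11, 15 },
		{ 0, 5, 10, 15 }, { 1, 6, 11, 12 }, { 2, 7, 8, 13 }, { 3, 4, 9, 14 }
	};
	int i;

	for (i = 0; i < 8; i++)
		G(v, lane[i][0], lane[i][1], lane[i][2], lane[i][3],
		    m[sigma1[2 * i]], m[sigma1[2 * i + 1]]);
}

/* Macro form: every index is a literal, so no table is needed. */
#define ROUND1(v, m) do { \
	G(v, 0, 4,  8, 12, m[14], m[10]); \
	G(v, 1, 5,  9, 13, m[ 4], m[ 8]); \
	G(v, 2, 6, 10, 14, m[ 9], m[15]); \
	G(v, 3, 7, 11, 15, m[13], m[ 6]); \
	G(v, 0, 5, 10, 15, m[ 1], m[12]); \
	G(v, 1, 6, 11, 12, m[ 0], m[ 2]); \
	G(v, 2, 7,  8, 13, m[11], m[ 7]); \
	G(v, 3, 4,  9, 14, m[ 5], m[ 3]); \
} while (0)

/* Returns 1 if both forms produce the same state from the same input. */
static int round1_forms_agree(void)
{
	uint64_t va[16], vb[16], m[16];
	int i;

	for (i = 0; i < 16; i++) {
		va[i] = vb[i] = 0x0123456789abcdefULL * (uint64_t)(i + 1);
		m[i] = 0xfedcba9876543210ULL ^ (uint64_t)i;
	}
	round1_table(va, m);
	ROUND1(vb, m);
	for (i = 0; i < 16; i++)
		if (va[i] != vb[i])
			return 0;
	return 1;
}
```

With literal indices the compiler has a chance to keep m[] in registers instead of emitting the ld.const.u8 plus ld.local.u64 pair per message word, though that would need confirming in the generated PTX.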

Alexander