Message-ID: <20150823072153.GA15333@openwall.com>
Date: Sun, 23 Aug 2015 10:21:54 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on GPU

On Sun, Aug 23, 2015 at 08:53:05AM +0300, Solar Designer wrote:
> There might also be room for improvement of Argon2 performance on GPUs
> through special handling of BLAKE2b's 64-bit operations.  See:
> 
> http://hashcat.net/forum/archive/index.php?thread-3422.html
> 
> "All the 64-bit based algorithms like SHA512, Keccak etc dropped in
> performance with each new driver a little bit.  So it was hard to notice.
> GPUs instructions operate still on 32-bit only, so the 64-bit mode is
> emulated.  But the way how it is emulated was somehow broken.  I was
> able to pinpoint the problem where the biggest drop came from and I
> managed to workaround it.  For NVidia it took me a little PTX hack, for
> AMD luckily there was no binary hack required."
> 
> Unfortunately, atom doesn't go into further detail there (but we could
> try asking him).  I guess the approach amounts to explicitly building
> 64-bit addition out of 32-bit additions.  Maybe having it split like
> that right away (rather than only in the PTX or IL to ISA translation)
> is somehow friendlier to current compilers.

In PTX, we currently appear to be getting add.s64.  I guess it'd be
better to get add.cc.u32 followed by addc.u32:

http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#extended-precision-integer-arithmetic-instructions

In source code, it should be something like:

dst_lo = src1_lo + src2_lo;
dst_hi = src1_hi + src2_hi + (dst_lo < src1_lo);

(where in our case these would likely be 32-bit elements of uint4), but
it'll probably take effort to reach the desired PTX code.
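
To make the carry logic above concrete, here is a minimal host-side C
sketch.  The type and function names (u64_split, add64_split, etc.) are
made up for illustration and are not from our tree, and whether a GPU
compiler turns this pattern into add.cc.u32 + addc.u32 is exactly the
open question, so treat this only as a check of the arithmetic:

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit value held as two 32-bit halves (hypothetical type name). */
typedef struct { uint32_t lo, hi; } u64_split;

u64_split split64(uint64_t x)
{
    u64_split r = { (uint32_t)x, (uint32_t)(x >> 32) };
    return r;
}

uint64_t join64(u64_split x)
{
    return ((uint64_t)x.hi << 32) | x.lo;
}

/* 64-bit addition built from two 32-bit additions plus a carry. */
u64_split add64_split(u64_split a, u64_split b)
{
    u64_split r;
    r.lo = a.lo + b.lo;                 /* wraps modulo 2^32 */
    r.hi = a.hi + b.hi + (r.lo < a.lo); /* carry is 1 iff the low sum wrapped */
    return r;
}

/* Cross-check against the native 64-bit addition. */
void check_add64_split(uint64_t a, uint64_t b)
{
    assert(join64(add64_split(split64(a), split64(b))) == a + b);
}
```

The comparison (r.lo < a.lo) is the usual way to recover the carry
without a flags register: an unsigned sum is smaller than an operand
exactly when it wrapped around.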

For bit rotates, we appear to be getting things like this:

        {
        .reg .b64 %lhs;
        .reg .b64 %rhs;
        shl.b64         %lhs, %rd12449, 1;
        shr.b64         %rhs, %rd12449, 63;
        add.u64         %rd12450, %lhs, %rhs;
        }

This probably translates to at least 6 native instructions.  There ought
to be more efficient ways, such as involving bfe and/or bfi instructions:

http://docs.nvidia.com/cuda/parallel-thread-execution/index.html#integer-arithmetic-instructions-bfe
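
For reference, here is what the 64-bit rotate looks like when written
directly on 32-bit halves, as a host-side C sketch.  The names
(u64_halves, rotl64_halves, rot64_by32) are hypothetical.  A rotate by
1 to 31 bits comes out as 4 shifts and 2 ORs, which may be where a count
of six comes from, while a rotate by exactly 32 is just a swap of the
halves and should cost nothing beyond register renaming.  This does not
use bfe/bfi; it only shows the baseline we'd be trying to beat:

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit value held as two 32-bit halves (hypothetical type name). */
typedef struct { uint32_t lo, hi; } u64_halves;

u64_halves halves_split(uint64_t x)
{
    u64_halves r = { (uint32_t)x, (uint32_t)(x >> 32) };
    return r;
}

uint64_t halves_join(u64_halves x)
{
    return ((uint64_t)x.hi << 32) | x.lo;
}

/* Rotate left by n bits, 0 < n < 32: 4 shifts and 2 ORs. */
u64_halves rotl64_halves(u64_halves x, unsigned n)
{
    u64_halves r;
    assert(n > 0 && n < 32);
    r.hi = (x.hi << n) | (x.lo >> (32 - n)); /* top bits of lo enter hi */
    r.lo = (x.lo << n) | (x.hi >> (32 - n)); /* top bits of hi wrap into lo */
    return r;
}

/* Rotate by exactly 32 bits (left or right): swap the halves. */
u64_halves rot64_by32(u64_halves x)
{
    u64_halves r = { x.hi, x.lo };
    return r;
}

/* Plain 64-bit reference, valid for 0 < n < 64. */
uint64_t rotl64_ref(uint64_t x, unsigned n)
{
    return (x << n) | (x >> (64 - n));
}
```

Counts of 33 to 63 can be handled as a swap followed by a rotate by
n - 32, so the rotate by 1 in the PTX above (a right rotate by 63) is
the only one of BLAKE2b's four rotates that needs the shift-and-OR form
at bit granularity.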

For 32-bit rotates, there also was the mad trick:

http://www.openwall.com/lists/john-dev/2012/03/22/7

For starters, we should try OpenCL's rotate() and see if it translates
to decent PTX code these days.  Right now, opencl_blake2.h: rotr64()
does not yet use rotate(), while opencl_blake2-round-no-msg.h does.  We
should switch both to use the same approach, at least to make reviewing
the generated PTX code easier.

Also, right now opencl_blake2-round-no-msg.h uses rotate() with negative
shift counts.  We should change this to use the corresponding positive
shift counts instead.
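
To spell out the mapping: OpenCL's rotate() rotates left and reduces the
count modulo the element width, so for 64-bit elements a count of -n
behaves like 64 - n.  Below is a small host-side C model of that
behavior (rotate64_model is a made-up name standing in for rotate() on
ulong, not the real built-in), just to confirm which positive counts
correspond to BLAKE2b's right rotates:

```c
#include <assert.h>
#include <stdint.h>

/* Model of a left rotate whose count is taken modulo 64. */
uint64_t rotate64_model(uint64_t v, uint64_t i)
{
    unsigned n = (unsigned)(i & 63); /* count modulo the 64-bit width */
    return n ? (v << n) | (v >> (64 - n)) : v;
}

/* Plain right rotate, valid for 0 < n < 64. */
uint64_t rotr64_ref(uint64_t v, unsigned n)
{
    return (v >> n) | (v << (64 - n));
}

/* A negative count and its positive equivalent give the same result. */
void check_rotate_counts(uint64_t v, unsigned n)
{
    assert(n > 0 && n < 64);
    assert(rotate64_model(v, (uint64_t)0 - n) == rotate64_model(v, 64 - n));
    assert(rotate64_model(v, 64 - n) == rotr64_ref(v, n));
}
```

So the right rotates by 32, 24, 16 and 63 would be written with left
rotate counts of 32, 40, 48 and 1.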

opencl_blake2.h uses the "__constant uchar blake2b_sigma" array to
simplify the source code.  Unfortunately, the table lookups actually
end up in the compiled code:

        ld.const.u8     %r13, [blake2b_sigma+1];
        mul.wide.u32    %rd84, %r13, 8;
        add.s64         %rd85, %rd2021, %rd84;
        ld.local.u64    %rd86, [%rd85];

We should optimize this in the source using cpp macros, so that the
message word indices become compile-time constants.  Alternatively,
those non-performance-critical uses of BLAKE2 may be kept on the host.
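
A minimal sketch of the macro approach, in plain C so it can be checked
on the host.  Only the second row of the BLAKE2b message schedule is
shown, and the per-word step is a deliberately trivial stand-in for the
real G function; the names SIGMA1_*, STEP_MACRO, absorb_table and
absorb_macros are all invented for this example.  The point is only that
with token pasting every index is a literal, so no ld.const of the
index and no address arithmetic on it should be needed:

```c
#include <assert.h>
#include <stdint.h>

/* Row 1 of the BLAKE2b sigma table, as a runtime lookup table. */
static const uint8_t sigma1_table[16] = {
    14, 10, 4, 8, 9, 15, 13, 6, 1, 12, 0, 2, 11, 7, 5, 3
};

/* The same row as preprocessor constants. */
#define SIGMA1_0  14
#define SIGMA1_1  10
#define SIGMA1_2  4
#define SIGMA1_3  8
#define SIGMA1_4  9
#define SIGMA1_5  15
#define SIGMA1_6  13
#define SIGMA1_7  6
#define SIGMA1_8  1
#define SIGMA1_9  12
#define SIGMA1_10 0
#define SIGMA1_11 2
#define SIGMA1_12 11
#define SIGMA1_13 7
#define SIGMA1_14 5
#define SIGMA1_15 3

/* Stand-in for a round step; m[] is indexed by a literal after pasting. */
#define STEP_MACRO(h, m, k) ((h) = (h) * 31u + (m)[SIGMA1_##k])

/* Table version: the index is loaded from memory at run time. */
uint64_t absorb_table(const uint64_t m[16])
{
    uint64_t h = 0;
    int k;
    for (k = 0; k < 16; k++)
        h = h * 31u + m[sigma1_table[k]];
    return h;
}

/* Macro version: fully unrolled, every index known at compile time. */
uint64_t absorb_macros(const uint64_t m[16])
{
    uint64_t h = 0;
    STEP_MACRO(h, m, 0);  STEP_MACRO(h, m, 1);
    STEP_MACRO(h, m, 2);  STEP_MACRO(h, m, 3);
    STEP_MACRO(h, m, 4);  STEP_MACRO(h, m, 5);
    STEP_MACRO(h, m, 6);  STEP_MACRO(h, m, 7);
    STEP_MACRO(h, m, 8);  STEP_MACRO(h, m, 9);
    STEP_MACRO(h, m, 10); STEP_MACRO(h, m, 11);
    STEP_MACRO(h, m, 12); STEP_MACRO(h, m, 13);
    STEP_MACRO(h, m, 14); STEP_MACRO(h, m, 15);
    return h;
}

/* Returns 1 if both versions visit the message words in the same order. */
int schedule_versions_agree(void)
{
    uint64_t m[16];
    int k;
    for (k = 0; k < 16; k++)
        m[k] = 0x0101010101010101ULL * (uint64_t)(k + 1);
    return absorb_table(m) == absorb_macros(m);
}
```

The real thing would need such constants for every row of the schedule
and would paste the round number as well, which is more typing but keeps
the message block addressable with constant offsets.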

Alexander
