john-dev - Re: PHC: Argon2 on GPU



Message-ID: <20150816215059.GA27142@openwall.com>
Date: Mon, 17 Aug 2015 00:51:00 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on GPU

On Sun, Aug 16, 2015 at 10:27:27PM +0200, Agnieszka Bielec wrote:
> 2015-08-16 16:09 GMT+02:00 Solar Designer <solar@...nwall.com>:
> > bi+i is used to index an array of 16-byte elements, so it needs to be
> > multiplied by 16 each time (unless the compiler manages to optimize
> > this, perhaps much like you had done manually in the first version).
> 
> If something is not supported, why do I see on my laptop the opposite
> of this slowdown on AMD?

It is possible that index scaling by 16 is not supported on AMD GCN, but
is supported on NVIDIA Maxwell (although I doubt it) - you'd need to
check the corresponding ISA manuals and/or the generated GPU ISA code.
It is also possible that one compiler happens to handle this better than
the other, optimizing out the need to scale the index.  Finally, it is
possible that extra instructions for the scaling by 16 are generated for
either GPU, but on one of them they end up actually helping e.g. through
avoiding a stall elsewhere.  (It does sometimes happen that even a NOP
introduced into code speeds it up.  In fact, some compilers generate
code with occasional NOPs in it in some cases - I've recently seen that
in code that icc generates for MIC.  Usually this is done to make the
next instruction more likely to be issued to a specific execution unit,
which may in turn benefit a later sequence of instructions by changing
which execution units are busy vs. available when that sequence starts.)
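
To make the scaling concrete, here is a minimal sketch of the address
arithmetic involved.  It is not the actual kernel code: the function
names are made up for illustration, and the "running byte offset"
variant is only my assumption of what a manual optimization could look
like.  Indexing an array of 16-byte elements means turning the index
into a byte offset, either with a multiply/shift per access or by
stepping a pre-scaled offset by 16.

```python
ELEM_SIZE = 16  # bytes per element, e.g. a 16-byte vector type


def offset_mul(index):
    """Byte offset of element `index`, computed with a multiply."""
    return index * ELEM_SIZE


def offset_shift(index):
    """Same offset computed with a shift, since 16 == 1 << 4."""
    return index << 4


def load_elem(buf, index):
    """Fetch the 16-byte element at `index` from a flat byte buffer."""
    off = offset_shift(index)
    return bytes(buf[off:off + ELEM_SIZE])


def prescaled_offsets(bi, count):
    """Hypothetical manual optimization: scale the base index bi once,
    then add 16 per step instead of rescaling bi+i on every access."""
    off = bi * ELEM_SIZE
    out = []
    for _ in range(count):
        out.append(off)
        off += ELEM_SIZE
    return out
```

Whether the hardware can fold the scaling into the memory access, or the
compiler turns the first form into the last one by itself, is exactly
what would need to be checked in the generated ISA code.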

> none@...e ~/Desktop/r/run $ ./john --test --format=argon2d-opencl
> Benchmarking: argon2d-opencl [Blake2 OpenCL]...
> memory per hash : 1.46 MB
> Device 0: GeForce GTX 960M
> using different password for benchmarking
> DONE
> Speed for cost 1 (t) of 1, cost 2 (m) of 1500, cost 3 (l) of 1
> Many salts:     4114 c/s real, 4114 c/s virtual
> Only one salt:  4114 c/s real, 4151 c/s virtual

BTW, these are impressively good speeds for your small GPU.  We need to
get a Titan X, and it'll outperform a CPU significantly.

What speeds are you getting on well's CPU for Argon2d at these settings,
with memory (de)allocation moved out of the loop, like we had for the
Lyra2 and yescrypt benchmarks?

Also, please set m=1536, so we'd have exactly 1.5 MiB.
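
The arithmetic behind this, as a quick sketch.  It assumes m is given in
KiB, which is consistent with the "1.46 MB" reported above for m=1500;
the helper name is made up for illustration.

```python
KIB_PER_MIB = 1024


def kib_to_mib(m_kib):
    """Convert a memory cost given in KiB (assumed unit of m) to MiB."""
    return m_kib / KIB_PER_MIB


current = kib_to_mib(1500)   # about 1.46 MiB, as in the benchmark output
proposed = kib_to_mib(1536)  # 1536 = 1024 + 512, i.e. exactly 1.5 MiB
print(f"m=1500: {current:.2f} MiB, m=1536: {proposed:.2f} MiB")
```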

Thanks,

Alexander
