Message-ID: <20150806140204.GD18936@openwall.com>
Date: Thu, 6 Aug 2015 17:02:04 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on CPU

Agnieszka,

On Sun, Aug 02, 2015 at 10:46:00PM +0200, Agnieszka Bielec wrote:
> hi, I have argon2i/d on CPU and GPU although I have not optimizations on GPU yet
> argon2i/d is in both versions: REF and OPT-SSE
> turned out that OPT-SSE after I removed SSE is faster than REF and my

In other words, you produced a SIMD-less version based on the SIMD
version's source code?

If so, you should keep these two faster versions (one for use on
SIMD-capable CPUs, the other on SIMD-less CPUs and on archs for which we
don't yet have SIMD intrinsics) in a JtR format (the CPU one), choosing
the faster one based on whether a suitable SIMD instruction set is
enabled for a given John build or not.

In fact, this is what we should do for all other formats as well (and
what we already do for many in core and jumbo trees, starting e.g. with
descrypt, which was the very first format JtR ever supported).

> GPU version bases on this

You're basing your Argon2 OpenCL code on SIMD-less CPU code? Why is
that? Wouldn't a vectorized OpenCL kernel likely run faster?

> results:
>
> OPT-SSE
> none@...e ~/Desktop/rr/run $ ./john --test --format=argon2i
> Will run 8 OpenMP threads
> Benchmarking: argon2i [AVX]... (8xOMP)
> memory per hash : 100.00 kB
> using different password for benchmarking
> DONE
> Speed for cost 1 (t) of 3, cost 2 (m) of 100
> Raw: 31232 c/s real, 3908 c/s virtual
>
> none@...e ~/Desktop/rr/run $ ./john --test --format=argon2d
> Will run 8 OpenMP threads
> Benchmarking: argon2d [AVX]... (8xOMP)
> memory per hash : 100.00 kB
> using different password for benchmarking
> DONE
> Speed for cost 1 (t) of 3, cost 2 (m) of 100
> Raw: 35328 c/s real, 4483 c/s virtual

I understand that you set these to the same parameters for a
straightforward comparison, but FWIW the minimum recommended t for
Argon2i is in fact 3, but for Argon2d it is 1. (This is for TMTO
resilience reasons.) Maybe we should be benchmarking them at t=3 and
t=1, respectively, going forward.

> REF
> none@...e ~/Desktop/rr/run $ ./john --test --format=argon2i
> Will run 8 OpenMP threads
> Benchmarking: argon2i [AVX]... (8xOMP)
> memory per hash : 100.00 kB
> using different password for benchmarking
> DONE
> Speed for cost 1 (t) of 3, cost 2 (m) of 100
> Raw: 9216 c/s real, 1160 c/s virtual
>
> none@...e ~/Desktop/rr/run $ ./john --test --format=argon2d
> Will run 8 OpenMP threads
> Benchmarking: argon2d [AVX]... (8xOMP)
> memory per hash : 100.00 kB
> using different password for benchmarking
> DONE
> Speed for cost 1 (t) of 3, cost 2 (m) of 100
> Raw: 10624 c/s real, 1336 c/s virtual
>
> OPT
>
> none@...e ~/Desktop/rr/run $ ./john --test --format=argon2i
> Will run 8 OpenMP threads
> Benchmarking: argon2i [AVX]... (8xOMP)
> memory per hash : 100.00 kB
> using different password for benchmarking
> DONE
> Speed for cost 1 (t) of 3, cost 2 (m) of 100
> Raw: 24064 c/s real, 3019 c/s virtual
>
> none@...e ~/Desktop/rr/run $ ./john --test --format=argon2d
> Will run 8 OpenMP threads
> Benchmarking: argon2d [AVX]... (8xOMP)
> memory per hash : 100.00 kB
> using different password for benchmarking
> DONE
> Speed for cost 1 (t) of 3, cost 2 (m) of 100
> Raw: 27008 c/s real, 3418 c/s virtual

Nice speeds for presumably SIMD-less code, but please note that both of
your benchmarks above (REF and OPT) say AVX. Are they lying?
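
The build-time choice between a SIMD and a SIMD-less code path discussed
above can be sketched roughly as follows. This is a minimal illustration
only, not JtR's actual format code: the names ALGORITHM_NAME, xor16,
xor16_simd, xor16_scalar and xor16_selftest are hypothetical, and a
16-byte XOR stands in for the real hashing work. The only things taken
as given are the compiler-predefined macros (__XOP__, __AVX__,
__SSSE3__, __SSE2__) and the SSE2 intrinsics from <emmintrin.h>.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical benchmark label, derived purely from build flags. */
#if defined(__XOP__)
#define ALGORITHM_NAME "XOP"
#elif defined(__AVX__)
#define ALGORITHM_NAME "AVX"
#elif defined(__SSSE3__)
#define ALGORITHM_NAME "SSSE3"
#elif defined(__SSE2__)
#define ALGORITHM_NAME "SSE2"
#else
#define ALGORITHM_NAME "scalar"
#endif

/* Portable fallback; always compiled so it can serve as a reference. */
static void xor16_scalar(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
    for (int i = 0; i < 16; i++)
        dst[i] = a[i] ^ b[i];
}

#if defined(__SSE2__)
#include <emmintrin.h>

/* SIMD path: one 128-bit XOR instead of 16 byte-wide XORs. */
static void xor16_simd(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_xor_si128(va, vb));
}
#define xor16 xor16_simd
#else
#define xor16 xor16_scalar
#endif

/* Checks that whichever path was selected agrees with the reference. */
static void xor16_selftest(void)
{
    uint8_t a[16], b[16], x[16], y[16];

    for (int i = 0; i < 16; i++) {
        a[i] = (uint8_t)i;
        b[i] = (uint8_t)(0xf0 ^ (i * 7));
    }
    xor16(x, a, b);
    xor16_scalar(y, a, b);
    assert(memcmp(x, y, 16) == 0);
}
```

In a sketch like this, the label only reports which instruction set the
build was targeted at. If the SIMD path were edited out by hand while
the compiler flags stayed the same, the label would still read "AVX",
so the label alone cannot confirm whether the code being timed uses
SIMD.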
> but I was testing these no-sse versions by modyfiyng my code, don't
> know if I can just turn-off simd (?), so I can't be sure of these
> results although I know that structure of REF is different than
> OPT-SSE one(maybe more) function was called a different number of time

I'm sorry, but I find your wording above confusing. So let me try to
ask a clarifying question:

Are you reviewing the generated assembly code? It's trivial to see if
the code is using SIMD or not.

And while we're at it: How are you obtaining the assembly code for
review? Do you replace gcc's "-c" option with "-S"? Or do you use
"objdump -d" on the .o file?

> GPU
>
> none@...e ~/Desktop/rr/run $ ./john --test --format=argon2i-opencl
> Benchmarking: argon2i-opencl [Blake2 OpenCL]...
> memory per hash : 100.00 kB
> Device 0: GeForce GTX 960M
> using different password for benchmarking
> DONE
> Speed for cost 1 (t) of 3, cost 2 (m) of 100, cost 3 (l) of 1
> Many salts: 11070 c/s real, 11145 c/s virtual
> Only one salt: 11299 c/s real, 11299 c/s virtual
>
> none@...e ~/Desktop/rr/run $ ./john --test --format=argon2d-opencl
> Benchmarking: argon2d-opencl [Blake2 OpenCL]...
> memory per hash : 100.00 kB
> Device 0: GeForce GTX 960M
> using different password for benchmarking
> DONE
> Speed for cost 1 (t) of 3, cost 2 (m) of 100, cost 3 (l) of 1
> Many salts: 13884 c/s real, 13884 c/s virtual
> Only one salt: 13884 c/s real, 13768 c/s virtual

I wonder if faster GPUs, such as those in "super", will outperform a CPU
at this. They should, as Argon2 isn't particularly GPU-resistant
(except for needing 1 KB of preferably local or private memory for the
state).
> this is version of argon2 from github
> https://github.com/khovratovich/Argon2 and I have some questions and
> comments
>
> here in version argon2i
> https://github.com/khovratovich/Argon2/blob/master/Argon2i/opt-sse/argon2i-opt-sse.cpp
> is text "Argon2d optimized implementation"
>
> and "SSE3" but it's SSSE3 in blake2
> https://github.com/khovratovich/Argon2/blob/master/Argon2i/opt-sse/blake2b.cpp

SSE3 is generally useless for crypto (adds nothing that we'd use and
that wasn't already in SSE2). So any mention of SSE3 in this context is
almost certainly a typo of SSSE3.

SSSE3 is useful in that it adds the PSHUFB instruction, available via
the _mm_shuffle_epi8() intrinsic. BLAKE2, during its design, had its
rotate counts deliberately adjusted such that this instruction would be
usable to implement them. (This change between BLAKE and BLAKE2 is
similar to an equivalent change between Salsa20 and ChaCha.)

In blake2b-round.h currently in jumbo, we actually see uses of SSSE3:

#ifndef __XOP__
#ifdef __SSSE3__
#define _mm_roti_epi64(x, c) \
    (-(c) == 32) ? _mm_shuffle_epi32((x), _MM_SHUFFLE(2,3,0,1)) \
    : (-(c) == 24) ? _mm_shuffle_epi8((x), r24) \
    : (-(c) == 16) ? _mm_shuffle_epi8((x), r16) \
    : (-(c) == 63) ? _mm_xor_si128(_mm_srli_epi64((x), -(c)), _mm_add_epi64((x), (x))) \
    : _mm_xor_si128(_mm_srli_epi64((x), -(c)), _mm_slli_epi64((x), 64-(-(c))))

This comes straight from BLAKE2 designers' code. Argon2's bundled
BLAKE2 code is essentially the same:

https://github.com/khovratovich/Argon2/blob/master/Argon2i/opt-sse/blake2b-round.h

So yes, it uses XOP when available, and when not then it uses SSSE3 when
available. (XOP is superior to SSSE3.)

> even SSE4_1 but I don't know if only one instruction
> blake2b-round.h:#define LOADU(p) _mm_loadu_si128( (__m128i *)(p) )
> can make that it's SSE4_1 version (?)

As magnum pointed out, this instruction doesn't require SSE4.1 at all.
My guess is that they use the unaligned load instructions only when
SSE4.1 is enabled because actual CPUs with SSE4.1 or better tend to
offer those instructions "for free" (no performance overhead), whereas
on many older CPUs they do carry a performance overhead. So it could be
a "just in case" thing - if S->buf somehow isn't naturally aligned, but
the build is for SSE4.1 or better, then it will work anyway. We could
ask Samuel Neves to clarify this.

> files .cpp are with header and I added that I modified these files but
> files .h are without header and I don't know what to do with these,
> even part of blake2 is without header
> https://github.com/khovratovich/Argon2/blob/master/Argon2i/ref/blake-round.h

As magnum said, we should have the PHC formats use a shared BLAKE2
implementation already in our tree, unless/until there are reasons to
include custom BLAKE2 code with the formats.

... and that reason could be the 1.2 revision of Argon2 using BlaMka, a
modification of the BLAKE2 round.

If a source file is lacking a comment on who wrote it, you may add a
comment like that describing where you got the file, under what license,
and what you changed. Preferably do that with a separate commit (first
commit the unmodified file).

Alexander
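
As an illustration of the point about BLAKE2's rotate counts, here is a
small portable C sketch (no intrinsics) showing why rotations by 16, 24
and 32 bits can be done with a byte shuffle, and why the rotation by 63
in the macro is written as a shift combined with x + x. The function
names (rotr64, rotr64_bytes, rotr64_63, rot_selftest) are hypothetical
and not taken from the BLAKE2 or Argon2 sources.

```c
#include <assert.h>
#include <stdint.h>

/* Generic rotate right, valid for 0 < c < 64. */
static uint64_t rotr64(uint64_t x, unsigned c)
{
    return (x >> c) | (x << (64 - c));
}

/*
 * Rotate right by a whole number of bytes by permuting bytes.
 * This is the kind of permutation a byte shuffle such as PSHUFB
 * performs on a vector register.
 */
static uint64_t rotr64_bytes(uint64_t x, unsigned nbytes)
{
    uint8_t in[8], out[8];
    uint64_t r = 0;

    for (int i = 0; i < 8; i++)
        in[i] = (uint8_t)(x >> (8 * i));      /* byte i, least significant first */
    for (int i = 0; i < 8; i++)
        out[i] = in[(i + nbytes) % 8];        /* result byte i comes from byte i+n */
    for (int i = 0; i < 8; i++)
        r |= (uint64_t)out[i] << (8 * i);
    return r;
}

/*
 * Rotate right by 63 is rotate left by 1: the top bit moves to the
 * bottom, and x + x is x shifted left by one bit (mod 2^64).
 */
static uint64_t rotr64_63(uint64_t x)
{
    return (x >> 63) ^ (x + x);
}

/* Checks the special cases against the generic rotate. */
static void rot_selftest(void)
{
    const uint64_t x = 0x0123456789abcdefULL;

    assert(rotr64_bytes(x, 2) == rotr64(x, 16));
    assert(rotr64_bytes(x, 3) == rotr64(x, 24));
    assert(rotr64_bytes(x, 4) == rotr64(x, 32));
    assert(rotr64_63(x) == rotr64(x, 63));
    assert(rotr64_63(0x8000000000000001ULL) == 3);
}
```

Rotations by 16, 24 and 32 bits are whole-byte rotations, so each one is
just a fixed reordering of the eight bytes of a 64-bit word. A rotation
by a count that is not a multiple of 8 has no such shortcut and needs
the two-shift form seen in the last branch of the macro.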