john-dev - Re: PHC: Argon2 on GPU

Message-ID: <CAKGDhHUh+9Wo4zUDP1uG5FaDrbSZF_SvfbnvA+RCYO05hqF_Vg@mail.gmail.com>
Date: Sun, 16 Aug 2015 14:01:38 +0200
From: Agnieszka Bielec <bielecagnieszka8@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on GPU

2015-08-16 0:21 GMT+02:00 Agnieszka Bielec <bielecagnieszka8@...il.com>:
> I added to crypt_all() time measurement and here are results:
>
> [a@...er run]$ ./john --test --format=argon2i-opencl --v=4
> Benchmarking: argon2i-opencl [Blake2 OpenCL]...
> memory per hash : 1.46 MB
> Device 0: Tahiti [AMD Radeon HD 7900 Series]
> Options used: -I ./kernels -cl-mad-enable -D__GPU__ -DDEVICE_INFO=138
> -DDEV_VER_MAJOR=1800 -DDEV_VER_MINOR=5 -D_OPENCL_COMPILER
> -DBINARY_SIZE=256 -DSALT_SIZE=64 -DPLAINTEXT_LENGTH=32
> Calculating best global worksize (GWS); max. 1s single kernel invocation.
> crypt all start, count=256, gws=256, lws=64
> crypt all end, time: 0.702250
> gws:       256         385 c/s         385 rounds/s 664.384ms per crypt_all()!
> crypt all start, count=512, gws=512, lws=64
> crypt all end, time: 0.738910
> gws:       512         719 c/s         719 rounds/s 711.666ms per crypt_all()+
> crypt all start, count=1024, gws=1024, lws=64
> crypt all end, time: 0.819439
> gws:      1024        1306 c/s        1306 rounds/s 783.545ms per crypt_all()+
> Local worksize (LWS) 64, global worksize (GWS) 1024
> crypt all start, count=1, gws=64, lws=64
> crypt all end, time: 0.982416
> crypt all start, count=2, gws=64, lws=64
> crypt all end, time: 0.642484
> crypt all start, count=3, gws=64, lws=64
> crypt all end, time: 0.675356
> crypt all start, count=4, gws=64, lws=64
> crypt all end, time: 0.677136
> crypt all start, count=5, gws=64, lws=64
> crypt all end, time: 0.057678
> crypt all start, count=7, gws=64, lws=64
> crypt all end, time: 0.057936
> crypt all start, count=10, gws=64, lws=64
> crypt all end, time: 0.042161
> crypt all start, count=14, gws=64, lws=64
> crypt all end, time: 0.054247
> crypt all start, count=1024, gws=1024, lws=64
> crypt all end, time: 2.615536
> using different password for benchmarking
> crypt all start, count=1024, gws=1024, lws=64
> crypt all end, time: 2.635043
> qqqqqqqqqqqqqqqqqqqqqqqqq
> real_time 263
> crypt all start, count=1024, gws=1024, lws=64
> crypt all end, time: 2.645786
> qqqqqqqqqqqqqqqqqqqqqqqqq
> real_time 265
> DONE
> Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
> ten int 1024
> clock : 263
> aaa Many salts: 389 c/s real, 102400 c/s virtual
> zzzzz Only one salt:    386 c/s real, 102400 c/s virtual
>
> [a@...er run]$ GWS=1024 ./john --test --format=argon2i-opencl --v=4
> Benchmarking: argon2i-opencl [Blake2 OpenCL]...
> memory per hash : 1.46 MB
> Device 0: Tahiti [AMD Radeon HD 7900 Series]
> Local worksize (LWS) 64, global worksize (GWS) 1024
> crypt all start, count=1, gws=64, lws=64
> crypt all end, time: 0.653867
> crypt all start, count=2, gws=64, lws=64
> crypt all end, time: 0.578068
> crypt all start, count=3, gws=64, lws=64
> crypt all end, time: 0.618967
> crypt all start, count=4, gws=64, lws=64
> crypt all end, time: 0.621076
> crypt all start, count=5, gws=64, lws=64
> crypt all end, time: 0.053851
> crypt all start, count=7, gws=64, lws=64
> crypt all end, time: 0.054477
> crypt all start, count=10, gws=64, lws=64
> crypt all end, time: 0.041921
> crypt all start, count=14, gws=64, lws=64
> crypt all end, time: 0.052137
> crypt all start, count=1024, gws=1024, lws=64
> crypt all end, time: 0.788093
> using different password for benchmarking
> crypt all start, count=1024, gws=1024, lws=64
> crypt all end, time: 0.788118
> crypt all start, count=1024, gws=1024, lws=64
> crypt all end, time: 0.789293
> qqqqqqqqqqqqqqqqqqqqqqqqq
> real_time 158
> crypt all start, count=1024, gws=1024, lws=64
> crypt all end, time: 0.788320
> crypt all start, count=1024, gws=1024, lws=64
> crypt all end, time: 0.787732
> qqqqqqqqqqqqqqqqqqqqqqqqq
> real_time 158
> DONE
> Speed for cost 1 (t) of 3, cost 2 (m) of 1500, cost 3 (l) of 1
> ten int 2048
> clock : 158
> aaa Many salts: 1296 c/s real, 204800 c/s virtual
> zzzzz Only one salt:    1296 c/s real, 102400 c/s virtual
>
>
> don't know how is this possible, this bug occurs only on super AMD
> (--dev=5 on super works after I cut plaintext length)
> also the same problem in cracking run - works faster when GWS=1024 is
> set, works slow when GWS is not set

Now I was digging in argon2d (I discovered that this bug appears after
commit 9e96f452350c0f2cae32b38e4a4cd1f83d51a367).
Before this commit the code was:

bi = prev_block_offset = ((prev_slice * lanes + pos.lane + 1) *
segment_length - 1) * BLOCK_SIZE;
for (i = 0; i < 64; i++)
{
       prev_block[i] = *(__global ulong2 *) (&memory[bi]);
       bi += 16;
}

The slowdown on AMD appeared when I changed this code to:

bi = prev_block_offset = ((prev_slice * lanes + pos.lane + 1) *
segment_length - 1) * BLOCK_SIZE / 16;
for (i = 0; i < 64; i++)
{
        prev_block[i] = ((__global ulong2*)memory)[bi+i];
}

Does anyone see some logic here, or is this just a bug on AMD?
I didn't gain speed anywhere from similar changes, so I can
just revert these changes.
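
Both loops should read the same 64 ulong2 values, so one way to rule out an indexing mistake before blaming the compiler is a host-side check in plain C. This is only a sketch under assumptions: BLOCK_SIZE = 1024 bytes, memory being byte-addressed, and the names NUM_BLOCKS, ulong2_t, prev_block_index, load_old, load_new and loads_match are all made up for illustration and are not taken from the kernel.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 1024u   /* assumption: block size in bytes (64 * 16) */
#define NUM_BLOCKS 4u      /* hypothetical tiny memory for the check */

/* Host-side stand-in for OpenCL's ulong2 (two 64-bit words, 16 bytes). */
typedef struct { uint64_t x, y; } ulong2_t;

/* Index of the last block of the previous slice, same formula as above. */
static size_t prev_block_index(size_t prev_slice, size_t lanes,
                               size_t lane, size_t segment_length)
{
    return (prev_slice * lanes + lane + 1) * segment_length - 1;
}

/* Old form: byte offset, advanced by 16 bytes per iteration. */
static void load_old(const unsigned char *memory, size_t block, ulong2_t *out)
{
    size_t bi = block * BLOCK_SIZE;
    for (int i = 0; i < 64; i++) {
        memcpy(&out[i], &memory[bi], sizeof out[i]);
        bi += 16;
    }
}

/* New form: offset in 16-byte units, indexed as bi + i. */
static void load_new(const unsigned char *memory, size_t block, ulong2_t *out)
{
    size_t bi = block * BLOCK_SIZE / 16;
    for (int i = 0; i < 64; i++)
        memcpy(&out[i], memory + (bi + i) * sizeof(ulong2_t), sizeof out[i]);
}

/* Returns 1 if both forms read identical data, 0 otherwise. */
int loads_match(void)
{
    static unsigned char memory[NUM_BLOCKS * BLOCK_SIZE];
    ulong2_t a[64], b[64];
    size_t block = prev_block_index(0, 1, 0, NUM_BLOCKS);

    for (size_t k = 0; k < sizeof memory; k++)
        memory[k] = (unsigned char)(k * 31u + 7u);  /* arbitrary pattern */

    load_old(memory, block, a);
    load_new(memory, block, b);
    return memcmp(a, b, sizeof a) == 0;
}
```

If loads_match() returns 1, the two forms are equivalent in what they read, and the difference on Tahiti would have to come from how the compiler generates the global loads, not from the addresses themselves.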