john-dev - Re: PHC: Argon2 on GPU

Message-ID: <CAKGDhHXLF-6JUzsZWAKFAt3yBAK_6yW=OtOsc=Z+VCnvZWYYww@mail.gmail.com>
Date: Sun, 30 Aug 2015 01:31:42 +0200
From: Agnieszka Bielec <bielecagnieszka8@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on GPU

2015-08-29 6:47 GMT+02:00 Solar Designer <solar@...nwall.com>:
> I've just briefly tried running your 2i and 2d kernels from your main
> branch (not the vector8 stuff) on Titan X - and the results are
> disappointing.  Performance is similar to what we saw on the old Titan,
> whereas the expectation was it'd be a multiple of what we saw on your
> 960M.

I ran some tests on the Titan X and updated the TITAN speeds from your
previous mails. The argon2i speed is better, but argon2d is slightly
worse. All speeds below are in c/s.

argon2i
CPU on well - 2480
GeForce GTX 960M - 1861
AMD Tahiti - 1288
GeForce GTX TITAN - 4292
GeForce GTX TITAN X - 6113
memory: 1.5 MB

argon2d
CPU on well - 7808
GeForce GTX 960M - 4227
AMD Tahiti - 2742
GeForce GTX TITAN - 6215
GeForce GTX TITAN X - 6525
memory: 1.5 MB
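
As a rough cross-check of the two tables above, the per-device ratios
against the 960M can be computed with a few lines of Python. This is only
a sketch: the dictionary and function names are illustrative, and the
numbers are copied by hand from the tables.

```python
# Speeds in c/s, copied from the summary tables above.
argon2i = {
    "CPU on well": 2480,
    "GeForce GTX 960M": 1861,
    "AMD Tahiti": 1288,
    "GeForce GTX TITAN": 4292,
    "GeForce GTX TITAN X": 6113,
}
argon2d = {
    "CPU on well": 7808,
    "GeForce GTX 960M": 4227,
    "AMD Tahiti": 2742,
    "GeForce GTX TITAN": 6215,
    "GeForce GTX TITAN X": 6525,
}

def ratios(speeds, baseline="GeForce GTX 960M"):
    """Return each device's speed divided by the baseline device's speed."""
    base = speeds[baseline]
    return {dev: round(s / base, 2) for dev, s in speeds.items()}

for name, table in (("argon2i", argon2i), ("argon2d", argon2d)):
    print(name)
    for dev, r in ratios(table).items():
        print(f"  {dev}: {r}x")
```

With these numbers the Titan X comes out at about 3.28x the 960M for
argon2i and about 1.54x for argon2d.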


[a@...er run]$ LWS=128 GWS=4096 ./john --test --format=argon2i-opencl
--dev=4 --v=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 4: GeForce GTX TITAN X
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__
-DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21
-D_OPENCL_COMPILER -DBINARY_SIZE=256 -DSALT_SIZE=64
-DPLAINTEXT_LENGTH=125
Local worksize (LWS) 128, global worksize (GWS) 4096
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts:     6068 c/s real, 6068 c/s virtual
Only one salt:  6113 c/s real, 6068 c/s virtual


[a@...er run]$ LWS=32 GWS=512 ./john --test --format=argon2d-opencl --dev=4
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 4: GeForce GTX TITAN X
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts:     6525 c/s real, 6525 c/s virtual
Only one salt:  6525 c/s real, 6525 c/s virtual

But there is again a difference between the results when GWS is set
explicitly and when it is not (I tested several times):

[a@...er run]$ ./john --test --format=argon2d-opencl --dev=4 --v=4
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 4: GeForce GTX TITAN X
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__
-DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21
-D_OPENCL_COMPILER -DBINARY_SIZE=256 -DSALT_SIZE=64
-DPLAINTEXT_LENGTH=125
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws:       256        2559 c/s        2559 rounds/s 100.027ms per crypt_all()!
gws:       512        3131 c/s        3131 rounds/s 163.523ms per crypt_all()+
gws:      1024        4574 c/s        4574 rounds/s 223.859ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts:     4740 c/s real, 4740 c/s virtual
Only one salt:  4740 c/s real, 4740 c/s virtual

[a@...er run]$ GWS=1024 ./john --test --format=argon2d-opencl --dev=4 --v=4
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 4: GeForce GTX TITAN X
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__
-DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21
-D_OPENCL_COMPILER -DBINARY_SIZE=256 -DSALT_SIZE=64
-DPLAINTEXT_LENGTH=125
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts:     4697 c/s real, 4697 c/s virtual
Only one salt:  4697 c/s real, 4654 c/s virtual
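
For reference, the GPU memory needed at each GWS can be estimated from
the "memory per hash" line in the output. The sketch below assumes that
every work-item holds one 1.5 MB hash state at the same time and that the
whole 12 GB of the Titan X is usable; both are simplifying assumptions,
and the function names are only illustrative.

```python
MB_PER_HASH = 1.5  # "memory per hash : 1.50 MB" from the benchmark output

def total_mb(gws, mb_per_hash=MB_PER_HASH):
    """Estimated memory (MB) if each of the gws work-items holds one hash state."""
    return gws * mb_per_hash

def max_gws(device_mb, mb_per_hash=MB_PER_HASH):
    """Largest GWS whose estimated memory fits in device_mb (assumes all of it is usable)."""
    return int(device_mb // mb_per_hash)

for gws in (256, 512, 1024, 4096):
    print(f"GWS={gws}: {total_mb(gws):.0f} MB")

# Assumption: 12 GB card, counted as 12 * 1024 MB.
print("max GWS for 12 GB:", max_gws(12 * 1024))
```

Under these assumptions GWS=1024 needs about 1.5 GB and GWS=4096 about
6 GB, so the auto-tuned GWS uses only a small part of the card.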

> Can you please experiment with this too, and try to use LWS and
> GWS settings directly scaled from those that you find performing good on
> your 960M (perhaps it means same LWS, but ~4.8x larger GWS)?  In your
> most recent full set of benchmark results, you didn't include the
> auto-tuning output (no -v=4), so I don't know what LWS and GWS you were
> using in the 960M benchmarks.

I set LWS and GWS as environment variables before ./john:
_______
TITAN X

[a@...er run]$ GWS=1024 ./john --test --format=argon2i-opencl --dev=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 4: GeForce GTX TITAN X
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts:     2625 c/s real, 2603 c/s virtual
Only one salt:  2648 c/s real, 2648 c/s virtual

[a@...er run]$ GWS=512 ./john --test --format=argon2d-opencl --dev=4
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 4: GeForce GTX TITAN X
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts:     3479 c/s real, 3479 c/s virtual
Only one salt:  3513 c/s real, 3513 c/s virtual

[a@...er run]$ LWS=32 GWS=512 ./john --test --format=argon2d-opencl --dev=4
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 4: GeForce GTX TITAN X
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts:     6525 c/s real, 6525 c/s virtual
Only one salt:  6525 c/s real, 6525 c/s virtual

__________________________
960M

none@...e ~/Desktop/r/run $ GWS=1024 ./john --test --format=argon2i-opencl
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts:     1878 c/s real, 1861 c/s virtual
Only one salt:  1861 c/s real, 1861 c/s virtual

none@...e ~/Desktop/r/run $ GWS=512 ./john --test --format=argon2d-opencl
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts:     3976 c/s real, 3938 c/s virtual
Only one salt:  3976 c/s real, 4015 c/s virtual

none@...e ~/Desktop/r/run $ LWS=32 GWS=512 ./john --test --format=argon2d-opencl
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts:     4266 c/s real, 4227 c/s virtual
Only one salt:  4227 c/s real, 4266 c/s virtual
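
The effect of forcing LWS=32 at GWS=512 on argon2d can be read off the
runs above. The short sketch below uses the "Only one salt" real speeds
from those runs; the dictionary layout and the function name are only
illustrative.

```python
# argon2d, GWS=512, "Only one salt" real c/s from the runs above.
runs = {
    "GeForce GTX TITAN X": {"LWS unset": 3513, "LWS=32": 6525},
    "GeForce GTX 960M": {"LWS unset": 3976, "LWS=32": 4227},
}

def lws_gain(device):
    """Speed with LWS=32 divided by speed with LWS left unset."""
    r = runs[device]
    return round(r["LWS=32"] / r["LWS unset"], 2)

for device in runs:
    print(f"{device}: {lws_gain(device)}x from setting LWS=32")
```

This gives about 1.86x on the Titan X and about 1.06x on the 960M, so
the Titan X is much more sensitive to the LWS choice in these runs.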


> Like I said, my initial results are not good, and I did try a few LWS
> and GWS combinations (up to using nearly the full 12 GB memory even).
> So I don't expect you would succeed either, but I'd like us to have a
> direct comparison of 960M vs. Titan X anyway, so that we can try to
> figure out what the bottleneck in scaling Argon2 between these two GPUs
> is.  And the next task might be to deal with the register spilling.

There is a tool, NVIDIA Visual Profiler, but unfortunately it doesn't
work on my laptop or on super.  nvprof is the same tool, but for the
command line.

[a@...er run]$ nvprof ./john --test --format=argon2i-opencl --dev=4 --v=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
==25405== NVPROF is profiling process 25405, command: ./john --test
--format=argon2i-opencl --dev=4 --v=4
Device 4: GeForce GTX TITAN X
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__
-DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21
-D_OPENCL_COMPILER -DBINARY_SIZE=256 -DSALT_SIZE=64
-DPLAINTEXT_LENGTH=125
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws:       256        1053 c/s        1053 rounds/s 242.977ms per crypt_all()!
gws:       512        2292 c/s        2292 rounds/s 223.306ms per crypt_all()!
gws:      1024        2643 c/s        2643 rounds/s 387.368ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts:     2648 c/s real, 2648 c/s virtual
Only one salt:  2625 c/s real, 2625 c/s virtual

==25405== Profiling application: ./john --test --format=argon2i-opencl
--dev=4 --v=4
==25405== Profiling result:
No kernels were profiled.

==25405== API calls:
No API activities were profiled.

none@...e ~/Desktop/morecopy/run $ nvprof ./john --test --format=argon2d-opencl
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
==6809== NVPROF is profiling process 6809, command: ./john --test
--format=argon2d-opencl
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts:     3976 c/s real, 3938 c/s virtual
Only one salt:  3976 c/s real, 4015 c/s virtual

==6809== Profiling application: ./john --test --format=argon2d-opencl
==6809== Warning: make sure cudaDeviceReset() is called before
application exit to flush profile data.
======== Error: CUDA profiling error.


> If things just don't fit into private memory, then we might prefer to
> explicitly move some into local or/and global than leave this up to the
> compiler and keep guessing what's going on.  For a start, we need to
> achieve the same performance as we do now, but without spills and with
> explicit use of other memory types.  And after that point, we could
> proceed to optimize our use of the different memory types.