Message-ID: <CAKGDhHXLF-6JUzsZWAKFAt3yBAK_6yW=OtOsc=Z+VCnvZWYYww@mail.gmail.com>
Date: Sun, 30 Aug 2015 01:31:42 +0200
From: Agnieszka Bielec <bielecagnieszka8@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on GPU

2015-08-29 6:47 GMT+02:00 Solar Designer <solar@...nwall.com>:
> I've just briefly tried running your 2i and 2d kernels from your main
> branch (not the vector8 stuff) on Titan X - and the results are
> disappointing.  Performance is similar to what we saw on the old Titan,
> whereas the expectation was it'd be a multiple of what we saw on your
> 960M.

I ran some tests on the Titan X and added the TITAN speeds from your
previous mails. The argon2i speed is better, but argon2d is slightly
worse.

argon2i
CPU on well - 2480
GeForce GTX 960M - 1861
AMD Tahiti - 1288
GeForce GTX TITAN - 4292
GeForce GTX TITAN X - 6113
memory: 1.5 MB

argon2d
CPU on well - 7808
GeForce GTX 960M - 4227
AMD Tahiti - 2742
GeForce GTX TITAN - 6215
GeForce GTX TITAN X - 6525
memory: 1.5 MB


[a@...er run]$ LWS=128 GWS=4096 ./john --test --format=argon2i-opencl
--dev=4 --v=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 4: GeForce GTX TITAN X
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__
-DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21
-D_OPENCL_COMPILER -DBINARY_SIZE=256 -DSALT_SIZE=64
-DPLAINTEXT_LENGTH=125
Local worksize (LWS) 128, global worksize (GWS) 4096
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts:     6068 c/s real, 6068 c/s virtual
Only one salt:  6113 c/s real, 6068 c/s virtual


[a@...er run]$ LWS=32 GWS=512 ./john --test --format=argon2d-opencl --dev=4
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 4: GeForce GTX TITAN X
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts:     6525 c/s real, 6525 c/s virtual
Only one salt:  6525 c/s real, 6525 c/s virtual

but there is again a difference between the results when GWS is set
explicitly and when it is not (I tested several times):

[a@...er run]$ ./john --test --format=argon2d-opencl --dev=4 --v=4
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 4: GeForce GTX TITAN X
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__
-DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21
-D_OPENCL_COMPILER -DBINARY_SIZE=256 -DSALT_SIZE=64
-DPLAINTEXT_LENGTH=125
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws:       256        2559 c/s        2559 rounds/s 100.027ms per crypt_all()!
gws:       512        3131 c/s        3131 rounds/s 163.523ms per crypt_all()+
gws:      1024        4574 c/s        4574 rounds/s 223.859ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts:     4740 c/s real, 4740 c/s virtual
Only one salt:  4740 c/s real, 4740 c/s virtual

[a@...er run]$ GWS=1024 ./john --test --format=argon2d-opencl --dev=4 --v=4
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 4: GeForce GTX TITAN X
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__
-DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21
-D_OPENCL_COMPILER -DBINARY_SIZE=256 -DSALT_SIZE=64
-DPLAINTEXT_LENGTH=125
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts:     4697 c/s real, 4697 c/s virtual
Only one salt:  4697 c/s real, 4654 c/s virtual

>Can you please experiment with this too, and try to use LWS and
> GWS settings directly scaled from those that you find performing good on
> your 960M (perhaps it means same LWS, but ~4.8x larger GWS)?  In your
> most recent full set of benchmark results, you didn't include the
> auto-tuning output (no -v=4), so I don't know what LWS and GWS you were
> using in the 960M benchmarks.

I was putting LWS and GWS before ./john (as environment variables).
_______
TITAN X

[a@...er run]$ GWS=1024 ./john --test --format=argon2i-opencl --dev=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 4: GeForce GTX TITAN X
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts:     2625 c/s real, 2603 c/s virtual
Only one salt:  2648 c/s real, 2648 c/s virtual

[a@...er run]$ GWS=512 ./john --test --format=argon2d-opencl --dev=4
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 4: GeForce GTX TITAN X
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts:     3479 c/s real, 3479 c/s virtual
Only one salt:  3513 c/s real, 3513 c/s virtual

[a@...er run]$ LWS=32 GWS=512 ./john --test --format=argon2d-opencl --dev=4
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 4: GeForce GTX TITAN X
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts:     6525 c/s real, 6525 c/s virtual
Only one salt:  6525 c/s real, 6525 c/s virtual

__________________________
960M

none@...e ~/Desktop/r/run $ GWS=1024 ./john --test --format=argon2i-opencl
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts:     1878 c/s real, 1861 c/s virtual
Only one salt:  1861 c/s real, 1861 c/s virtual

none@...e ~/Desktop/r/run $ GWS=512 ./john --test --format=argon2d-opencl
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts:     3976 c/s real, 3938 c/s virtual
Only one salt:  3976 c/s real, 4015 c/s virtual

none@...e ~/Desktop/r/run $ LWS=32 GWS=512 ./john --test --format=argon2d-opencl
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts:     4266 c/s real, 4227 c/s virtual
Only one salt:  4227 c/s real, 4266 c/s virtual


> Like I said, my initial results are not good, and I did try a few LWS
> and GWS combinations (up to using nearly the full 12 GB memory even).
> So I don't expect you would succeed either, but I'd like us to have a
> direct comparison of 960M vs. Titan X anyway, so that we can try to
> figure out what the bottleneck in scaling Argon2 between these two GPUs
> is.  And the next task might be to deal with the register spilling.

There is a tool, NVIDIA Visual Profiler, but unfortunately it doesn't
work on my laptop or on super. nvprof is the same tool, but for the
command line:

[a@...er run]$ nvprof ./john --test --format=argon2i-opencl --dev=4 --v=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
==25405== NVPROF is profiling process 25405, command: ./john --test
--format=argon2i-opencl --dev=4 --v=4
Device 4: GeForce GTX TITAN X
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__
-DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21
-D_OPENCL_COMPILER -DBINARY_SIZE=256 -DSALT_SIZE=64
-DPLAINTEXT_LENGTH=125
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws:       256        1053 c/s        1053 rounds/s 242.977ms per crypt_all()!
gws:       512        2292 c/s        2292 rounds/s 223.306ms per crypt_all()!
gws:      1024        2643 c/s        2643 rounds/s 387.368ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts:     2648 c/s real, 2648 c/s virtual
Only one salt:  2625 c/s real, 2625 c/s virtual

==25405== Profiling application: ./john --test --format=argon2i-opencl
--dev=4 --v=4
==25405== Profiling result:
No kernels were profiled.

==25405== API calls:
No API activities were profiled.

none@...e ~/Desktop/morecopy/run $ nvprof ./john --test --format=argon2d-opencl
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
==6809== NVPROF is profiling process 6809, command: ./john --test
--format=argon2d-opencl
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts:     3976 c/s real, 3938 c/s virtual
Only one salt:  3976 c/s real, 4015 c/s virtual

==6809== Profiling application: ./john --test --format=argon2d-opencl
==6809== Warning: make sure cudaDeviceReset() is called before
application exit to flush profile data.
======== Error: CUDA profiling error.
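
Since nvprof didn't record any kernels here, another way to look at the
register spilling is the ptxas statistics that -cl-nv-verbose already
produces: lines like "ptxas info: ... bytes spill stores / bytes spill
loads" end up in the program build log. A minimal sketch of dumping that
log (just an illustration, not the existing harness code; program and
device stand for whatever handles the host code already has):

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Print the OpenCL program build log; with -cl-nv-verbose it contains the
   ptxas register and spill statistics for each kernel. */
static void dump_build_log(cl_program program, cl_device_id device)
{
	size_t len = 0;
	char *log;

	clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
	                      0, NULL, &len);
	log = malloc(len + 1);
	if (!log)
		return;
	clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
	                      len, log, NULL);
	log[len] = '\0';
	fprintf(stderr, "%s\n", log);
	free(log);
}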


> If things just don't fit into private memory, then we might prefer to
> explicitly move some into local or/and global than leave this up to the
> compiler and keep guessing what's going on.  For a start, we need to
> achieve the same performance as we do now, but without spills and with
> explicit use of other memory types.  And after that point, we could
> proceed to optimize our use of the different memory types.
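
Just as an illustration of what the explicit version could look like (a
generic sketch, not the actual argon2 kernel; BLOCK_WORDS = 128 and a
work-group size of 32 are assumptions), a per-work-item scratch array can
be placed in local memory by hand instead of being left in private memory
for the compiler to spill:

#define BLOCK_WORDS 128 /* assumed: one 1024-byte block = 128 ulongs */
#define GROUP_SIZE   32 /* assumed: LWS = 32 */

__kernel void process_blocks(__global ulong *memory)
{
	/* one slice per work-item: 128 * 32 * 8 bytes = 32 KB of local memory */
	__local ulong scratch_all[BLOCK_WORDS * GROUP_SIZE];
	__local ulong *scratch = scratch_all + get_local_id(0) * BLOCK_WORDS;
	size_t base = get_global_id(0) * BLOCK_WORDS;

	for (uint i = 0; i < BLOCK_WORDS; i++)
		scratch[i] = memory[base + i];

	/* ... the per-block mixing would operate on scratch[] here ... */

	for (uint i = 0; i < BLOCK_WORDS; i++)
		memory[base + i] = scratch[i];
}

Local memory is shared by the whole work-group, so this trades local
memory (typically 48 KB per work-group on these NVIDIA GPUs) and
occupancy for register pressure, and the work-group size has to be known
at compile time (e.g. passed with -D) for the array size to be constant.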
