Message-ID: <CAKGDhHXLF-6JUzsZWAKFAt3yBAK_6yW=OtOsc=Z+VCnvZWYYww@mail.gmail.com>
Date: Sun, 30 Aug 2015 01:31:42 +0200
From: Agnieszka Bielec <bielecagnieszka8@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Argon2 on GPU

2015-08-29 6:47 GMT+02:00 Solar Designer <solar@...nwall.com>:
> I've just briefly tried running your 2i and 2d kernels from your main
> branch (not the vector8 stuff) on Titan X - and the results are
> disappointing. Performance is similar to what we saw on the old Titan,
> whereas the expectation was it'd be a multiple of what we saw on your
> 960M.

I ran some tests on the Titan X and updated the speeds for the TITAN from
your previous mails. The speed of argon2i is better, but argon2d is
slightly worse.

argon2i
CPU on well - 2480
GeForce GTX 960M - 1861
AMD Tahiti - 1288
GeForce GTX TITAN - 4292
GeForce GTX TITAN X - 6113
memory: 1.5 MB

argon2d
CPU on well - 7808
GeForce GTX 960M - 4227
AMD Tahiti - 2742
GeForce GTX TITAN - 6215
GeForce GTX TITAN X - 6525
memory: 1.5 MB

[a@...er run]$ LWS=128 GWS=4096 ./john --test --format=argon2i-opencl --dev=4 --v=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 4: GeForce GTX TITAN X
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__ -DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21 -D_OPENCL_COMPILER -DBINARY_SIZE=256 -DSALT_SIZE=64 -DPLAINTEXT_LENGTH=125
Local worksize (LWS) 128, global worksize (GWS) 4096
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 6068 c/s real, 6068 c/s virtual
Only one salt: 6113 c/s real, 6068 c/s virtual

[a@...er run]$ LWS=32 GWS=512 ./john --test --format=argon2d-opencl --dev=4
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 4: GeForce GTX TITAN X
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 6525 c/s real, 6525 c/s virtual
Only one salt: 6525 c/s real, 6525 c/s virtual

But there is again a difference between the results when GWS is set and
when it is not set (I tested several times):

[a@...er run]$ ./john --test --format=argon2d-opencl --dev=4 --v=4
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 4: GeForce GTX TITAN X
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__ -DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21 -D_OPENCL_COMPILER -DBINARY_SIZE=256 -DSALT_SIZE=64 -DPLAINTEXT_LENGTH=125
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws: 256 2559 c/s 2559 rounds/s 100.027ms per crypt_all()!
gws: 512 3131 c/s 3131 rounds/s 163.523ms per crypt_all()+
gws: 1024 4574 c/s 4574 rounds/s 223.859ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 4740 c/s real, 4740 c/s virtual
Only one salt: 4740 c/s real, 4740 c/s virtual

[a@...er run]$ GWS=1024 ./john --test --format=argon2d-opencl --dev=4 --v=4
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 4: GeForce GTX TITAN X
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__ -DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21 -D_OPENCL_COMPILER -DBINARY_SIZE=256 -DSALT_SIZE=64 -DPLAINTEXT_LENGTH=125
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 4697 c/s real, 4697 c/s virtual
Only one salt: 4697 c/s real, 4654 c/s virtual

> Can you please experiment with this too, and try to use LWS and
> GWS settings directly scaled from those that you find performing good on
> your 960M (perhaps it means same LWS, but ~4.8x larger GWS)? In your
> most recent full set of benchmark results, you didn't include the
> auto-tuning output (no -v=4), so I don't know what LWS and GWS you were
> using in the 960M benchmarks.

I was putting LWS and GWS before ./john

_______ TITAN X

[a@...er run]$ GWS=1024 ./john --test --format=argon2i-opencl --dev=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 4: GeForce GTX TITAN X
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 2625 c/s real, 2603 c/s virtual
Only one salt: 2648 c/s real, 2648 c/s virtual

[a@...er run]$ GWS=512 ./john --test --format=argon2d-opencl --dev=4
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 4: GeForce GTX TITAN X
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 3479 c/s real, 3479 c/s virtual
Only one salt: 3513 c/s real, 3513 c/s virtual

[a@...er run]$ LWS=32 GWS=512 ./john --test --format=argon2d-opencl --dev=4
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 4: GeForce GTX TITAN X
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 6525 c/s real, 6525 c/s virtual
Only one salt: 6525 c/s real, 6525 c/s virtual

__________________________ 960M

none@...e ~/Desktop/r/run $ GWS=1024 ./john --test --format=argon2i-opencl
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 1878 c/s real, 1861 c/s virtual
Only one salt: 1861 c/s real, 1861 c/s virtual

none@...e ~/Desktop/r/run $ GWS=512 ./john --test --format=argon2d-opencl
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 3976 c/s real, 3938 c/s virtual
Only one salt: 3976 c/s real, 4015 c/s virtual

none@...e ~/Desktop/r/run $ LWS=32 GWS=512 ./john --test --format=argon2d-opencl
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 4266 c/s real, 4227 c/s virtual
Only one salt: 4227 c/s real, 4266 c/s virtual

> Like I said, my initial results are not good, and I did try a few LWS
> and GWS combinations (up to using nearly the full 12 GB memory even).
> So I don't expect you would succeed either, but I'd like us to have a
> direct comparison of 960M vs. Titan X anyway, so that we can try to
> figure out what the bottleneck in scaling Argon2 between these two GPUs
> is. And the next task might be to deal with the register spilling.

There is a tool, NVIDIA Visual Profiler, but unfortunately it doesn't work
on my laptop or on super.
nvprof is the same tool, but on the command line:

[a@...er run]$ nvprof ./john --test --format=argon2i-opencl --dev=4 --v=4
Benchmarking: argon2i-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
==25405== NVPROF is profiling process 25405, command: ./john --test --format=argon2i-opencl --dev=4 --v=4
Device 4: GeForce GTX TITAN X
Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__ -DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21 -D_OPENCL_COMPILER -DBINARY_SIZE=256 -DSALT_SIZE=64 -DPLAINTEXT_LENGTH=125
Calculating best global worksize (GWS); max. 1s single kernel invocation.
gws: 256 1053 c/s 1053 rounds/s 242.977ms per crypt_all()!
gws: 512 2292 c/s 2292 rounds/s 223.306ms per crypt_all()!
gws: 1024 2643 c/s 2643 rounds/s 387.368ms per crypt_all()+
Local worksize (LWS) 64, global worksize (GWS) 1024
using different password for benchmarking
DONE
Speed for cost 1 (t) of 3, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 2648 c/s real, 2648 c/s virtual
Only one salt: 2625 c/s real, 2625 c/s virtual
==25405== Profiling application: ./john --test --format=argon2i-opencl --dev=4 --v=4
==25405== Profiling result:
No kernels were profiled.
==25405== API calls:
No API activities were profiled.

none@...e ~/Desktop/morecopy/run $ nvprof ./john --test --format=argon2d-opencl
Benchmarking: argon2d-opencl [Blake2 OpenCL]...
memory per hash : 1.50 MB
==6809== NVPROF is profiling process 6809, command: ./john --test --format=argon2d-opencl
Device 0: GeForce GTX 960M
using different password for benchmarking
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 1536, cost 3 (l) of 1
Many salts: 3976 c/s real, 3938 c/s virtual
Only one salt: 3976 c/s real, 4015 c/s virtual
==6809== Profiling application: ./john --test --format=argon2d-opencl
==6809== Warning: make sure cudaDeviceReset() is called before application exit to flush profile data.
======== Error: CUDA profiling error.

> If things just don't fit into private memory, then we might prefer to
> explicitly move some into local or/and global than leave this up to the
> compiler and keep guessing what's going on. For a start, we need to
> achieve the same performance as we do now, but without spills and with
> explicit use of other memory types. And after that point, we could
> proceed to optimize our use of the different memory types.
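
For the direct 960M vs. Titan X comparison asked for above, here is a minimal Python sketch that computes the measured speedup from the c/s figures in the summary lists of this message. The dictionary layout and the `speedup` helper are illustrative names, not part of John the Ripper. The 4.8x reference is the GWS scaling suggested in the quoted message, so treating it as an expected speedup is an assumption.

```python
# Speeds in c/s, copied from the argon2i / argon2d summary lists in this message.
speeds = {
    "argon2i": {"GeForce GTX 960M": 1861, "GeForce GTX TITAN X": 6113},
    "argon2d": {"GeForce GTX 960M": 4227, "GeForce GTX TITAN X": 6525},
}

# Assumption: the ~4.8x GWS scaling from the quoted message is used here
# only as a rough reference point for the speedup one might hope for.
REFERENCE_SCALE = 4.8


def speedup(variant, base="GeForce GTX 960M", target="GeForce GTX TITAN X"):
    """Return the target device's speed divided by the base device's speed."""
    return speeds[variant][target] / speeds[variant][base]


for variant in speeds:
    ratio = speedup(variant)
    share = ratio / REFERENCE_SCALE
    print(f"{variant}: {ratio:.2f}x measured, {share:.0%} of the {REFERENCE_SCALE}x reference")
```

With these figures argon2i scales noticeably better than argon2d, and both stay well below the reference factor.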