Message-ID: <20150704162208.GC23327@openwall.com>
Date: Sat, 4 Jul 2015 19:22:08 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Lyra2 on GPU

On Sat, Jul 04, 2015 at 05:08:29PM +0200, Agnieszka Bielec wrote:
> 2015-07-04 11:54 GMT+02:00 Solar Designer <solar@...nwall.com>:
> > On Sat, Jul 04, 2015 at 02:04:26AM +0200, Agnieszka Bielec wrote:
> >> I received results:
> >>
> >> [a@...er run]$ ./john --test --format=lyra2-opencl --dev=5
> >> Benchmarking: Lyra2-opencl, Lyra2 [Lyra2 Sponge OpenCL (inefficient,
> >> development use only)]... Device 5: GeForce GTX TITAN
> >> Local worksize (LWS) 64, global worksize (GWS) 2048
> >> DONE
> >> Speed for cost 1 (t) of 8, cost 2 (m) of 8, cost 3 (c) of 256, cost 4 (p) of 2
> >> Raw:    6023 c/s real, 5965 c/s virtual
> >>
> >> [a@...er run]$ ./john --test --format=lyra2-opencl
> >> Benchmarking: Lyra2-opencl, Lyra2 [Lyra2 Sponge OpenCL (inefficient,
> >> development use only)]... Device 0: Tahiti [AMD Radeon HD 7900 Series]
> >> Local worksize (LWS) 64, global worksize (GWS) 2048
> >> DONE
> >> Speed for cost 1 (t) of 8, cost 2 (m) of 8, cost 3 (c) of 256, cost 4 (p) of 2
> >> Raw:    7447 c/s real, 51200 c/s virtual
> >>
> >> before optimizations speed was equal to 1k
> >
> > Cool.  And these are much better than what you were getting with Lyra2
> > authors' CUDA code, right?
>
> yes, but they claimed that their implementation isn't optimal
>
> this is the best result I gained
>
> [a@...er run]$ ./john --test --format=lyra2-cuda
> Benchmarking: Lyra2-cuda, Lyra2 [Lyra2 CUDA]... DONE
> Speed for cost 1 (t) of 8, cost 2 (m) of 8
> Raw:    1914 c/s real, 1932 c/s virtual

OK.  And what's the best speed on CPU?

> > Is the "copying small portions of global memory into local buffers" like
> > prefetching?  Or are those small portions more frequently accessed than
> > the rest?  In other words, why is this optimization effective for Lyra2?
>
> I'm copying data in several separate for loops.  Only sometimes is one
> element accessed two times; mostly it's accessed once, but these portions
> are small anyway: 12 ulongs for one random pointer to global memory
> (4 is the max), so I decided to copy even if something is accessed only
> once.  I tried to copy bigger portions at once, but speed was worse;
> even if something is accessed only once, it's faster with copying on
> the AMD GPU.

This sounds like prefetching, then.

By the way, while your current choice of:

> >> Speed for cost 1 (t) of 8, cost 2 (m) of 8, cost 3 (c) of 256, cost 4 (p) of 2

is fine for testing, I think for all of the PHC finalists we need to
tune parameters to a level comparable with defensive use of bcrypt at
cost 5, using this as our baseline.  When used defensively and running
an efficient implementation, bcrypt at cost 5 achieves about 541*8 =
~4330 c/s on i7-4770K:

solar@...l:~/crypt_blowfish-1.2-notest$ ./crypt_test_threads
602.8 c/s real, 602.8 c/s virtual
0: 540.4 c/s real
1: 542.4 c/s real
2: 542.4 c/s real
3: 542.6 c/s real
4: 540.4 c/s real
5: 540.4 c/s real
6: 542.4 c/s real
7: 540.4 c/s real

So you'd need to tune the PHC finalists to achieve the same defensive-use
performance for their most optimal implementations on "well", and these
will be the settings you'd use for attacking them on GPU.  You'd set
t_cost to the lowest supported value, parallelism to 1 (no thread-level
parallelism within one instance), the rest of the parameters as
recommended by the PHC finalist designers, and tune m_cost to achieve
the defensive speed above.

As an extra test, you'd set t_cost higher and m_cost lower, still for
the same defensive speed.  But it's just an extra.  The main test should
use the lowest supported t_cost.

Alexander
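The tuning procedure described above can be sketched as a simple search:
fix t_cost at its minimum and parallelism at 1, then raise m_cost until
the measured defensive speed drops to the bcrypt cost-5 baseline of
roughly 541*8 c/s.  A minimal sketch, assuming a hypothetical
benchmark() callback (not part of any real tool) that returns c/s for a
given (t_cost, m_cost) pair and that speed decreases monotonically as
m_cost grows:

```python
# Sketch of tuning m_cost to match the bcrypt cost-5 defensive baseline.
# benchmark() is a hypothetical stand-in for actually timing the
# candidate's most optimal implementation on the defensive machine.

BCRYPT_COST5_BASELINE = 541 * 8  # ~4330 c/s aggregate on i7-4770K (8 threads)

def tune_m_cost(benchmark, t_cost, m_cost_min, m_cost_max):
    """Return the largest m_cost whose defensive speed still meets the
    bcrypt cost-5 baseline, by binary search over a monotone speed curve."""
    best = m_cost_min
    lo, hi = m_cost_min, m_cost_max
    while lo <= hi:
        mid = (lo + hi) // 2
        if benchmark(t_cost=t_cost, m_cost=mid) >= BCRYPT_COST5_BASELINE:
            best = mid          # still fast enough; try using more memory
            lo = mid + 1
        else:
            hi = mid - 1        # too slow; back off
    return best

if __name__ == "__main__":
    # Toy speed model (speed inversely proportional to m_cost), for
    # illustration only; real numbers come from measurement.
    toy = lambda t_cost, m_cost: 555_000 // m_cost
    print(tune_m_cost(toy, t_cost=1, m_cost_min=1, m_cost_max=4096))
```

The resulting m_cost (together with the lowest supported t_cost and
parallelism 1) would then be the setting used when attacking the scheme
on GPU.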