Message-ID: <20150704162208.GC23327@openwall.com>
Date: Sat, 4 Jul 2015 19:22:08 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Lyra2 on GPU

On Sat, Jul 04, 2015 at 05:08:29PM +0200, Agnieszka Bielec wrote:
> 2015-07-04 11:54 GMT+02:00 Solar Designer <solar@...nwall.com>:
> > On Sat, Jul 04, 2015 at 02:04:26AM +0200, Agnieszka Bielec wrote:
> >> I received results:
> >>
> >> [a@...er run]$ ./john --test --format=lyra2-opencl --dev=5
> >> Benchmarking: Lyra2-opencl, Lyra2 [Lyra2 Sponge OpenCL (inefficient,
> >> development use only)]... Device 5: GeForce GTX TITAN
> >> Local worksize (LWS) 64, global worksize (GWS) 2048
> >> DONE
> >> Speed for cost 1 (t) of 8, cost 2 (m) of 8, cost 3 (c) of 256, cost 4 (p) of 2
> >> Raw:    6023 c/s real, 5965 c/s virtual
> >>
> >> [a@...er run]$ ./john --test --format=lyra2-opencl
> >> Benchmarking: Lyra2-opencl, Lyra2 [Lyra2 Sponge OpenCL (inefficient,
> >> development use only)]... Device 0: Tahiti [AMD Radeon HD 7900 Series]
> >> Local worksize (LWS) 64, global worksize (GWS) 2048
> >> DONE
> >> Speed for cost 1 (t) of 8, cost 2 (m) of 8, cost 3 (c) of 256, cost 4 (p) of 2
> >> Raw:    7447 c/s real, 51200 c/s virtual
> >>
> >> before optimizations speed was equal to 1k
> >
> > Cool.  And these are much better than what you were getting with Lyra2
> > authors' CUDA code, right?
>
> yes, but they claimed that their implementation isn't optimal
>
> this is the best result I gained
>
> [a@...er run]$ ./john --test --format=lyra2-cuda
> Benchmarking: Lyra2-cuda, Lyra2 [Lyra2 CUDA]... DONE
> Speed for cost 1 (t) of 8, cost 2 (m) of 8
> Raw:    1914 c/s real, 1932 c/s virtual

OK.  And what's the best speed on CPU?

> > Is the "copying small portions of global memory into local buffers" like
> > prefetching?  Or are those small portions more frequently accessed than
> > the rest?  In other words, why is this optimization effective for Lyra2?
>
> I'm copying data in several separate for loops.  Only sometimes is one
> element accessed two times; mostly it's accessed once, but these portions
> are small anyway: 12 ulongs for one random pointer to global memory
> (4 is the max), so I decided to copy even if something is accessed only
> once.  I tried to copy bigger portions at once, but speed was worse;
> even if something is accessed only once, it's faster with copying on
> the AMD GPU.

This sounds like prefetching, then.

By the way, while your current choice of:

> >> Speed for cost 1 (t) of 8, cost 2 (m) of 8, cost 3 (c) of 256, cost 4 (p) of 2

is fine for testing, I think for all of the PHC finalists we need to
tune parameters to a level comparable with defensive use of bcrypt at
cost 5, using this as our baseline.  When used defensively and running
an efficient implementation, bcrypt at cost 5 achieves about 541*8 =
~4330 c/s on i7-4770K:

solar@...l:~/crypt_blowfish-1.2-notest$ ./crypt_test_threads
602.8 c/s real, 602.8 c/s virtual
0: 540.4 c/s real
1: 542.4 c/s real
2: 542.4 c/s real
3: 542.6 c/s real
4: 540.4 c/s real
5: 540.4 c/s real
6: 542.4 c/s real
7: 540.4 c/s real

So you'd need to tune the PHC finalists to achieve the same defensive-use
performance for their most optimal implementations on "well", and these
will be the settings you'd use for attacking them on GPU.  You'd set
t_cost to the lowest supported value, parallelism to 1 (no thread-level
parallelism within one instance), the rest of the parameters as
recommended by the PHC finalist designers, and tune m_cost to achieve
the defensive speed above.

As an extra test, you'd set t_cost higher and m_cost lower, still for
the same defensive speed.  But it's just an extra.  The main test should
use the lowest supported t_cost.

Alexander
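The tuning procedure described above can be sketched as a simple search:
fix t_cost at its minimum and parallelism at 1, then raise m_cost until
the measured defensive speed drops to the bcrypt cost-5 baseline of
roughly 541*8 c/s.  A minimal sketch, assuming a hypothetical
benchmark() callback (not part of any real tool) that returns c/s for a
given (t_cost, m_cost) pair and that speed decreases monotonically as
m_cost grows:

```python
# Sketch of tuning m_cost to match the bcrypt cost-5 defensive baseline.
# benchmark() is a hypothetical stand-in for actually timing the
# candidate's most optimal implementation on the defensive machine.

BCRYPT_COST5_BASELINE = 541 * 8  # ~4330 c/s aggregate on i7-4770K (8 threads)

def tune_m_cost(benchmark, t_cost, m_cost_min, m_cost_max):
    """Return the largest m_cost whose defensive speed still meets the
    bcrypt cost-5 baseline, by binary search over a monotone speed curve."""
    best = m_cost_min
    lo, hi = m_cost_min, m_cost_max
    while lo <= hi:
        mid = (lo + hi) // 2
        if benchmark(t_cost=t_cost, m_cost=mid) >= BCRYPT_COST5_BASELINE:
            best = mid          # still fast enough; try using more memory
            lo = mid + 1
        else:
            hi = mid - 1        # too slow; back off
    return best

if __name__ == "__main__":
    # Toy speed model (speed inversely proportional to m_cost), for
    # illustration only; real numbers come from measurement.
    toy = lambda t_cost, m_cost: 555_000 // m_cost
    print(tune_m_cost(toy, t_cost=1, m_cost_min=1, m_cost_max=4096))
```

The resulting m_cost (together with the lowest supported t_cost and
parallelism 1) would then be the setting used when attacking the scheme
on GPU.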