john-dev - Re: PHC: Lyra2 on GPU

Message-ID: <CAKGDhHVp2pE4=P+VnBuF068yxm=_Zk7SXe_h3wuVQhOKQw=2RA@mail.gmail.com>
Date: Mon, 13 Jul 2015 16:33:24 +0200
From: Agnieszka Bielec <bielecagnieszka8@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Lyra2 on GPU

2015-07-12 16:55 GMT+02:00 Solar Designer <solar@...nwall.com>:
> Agnieszka,
>
> On Mon, Jul 06, 2015 at 04:56:11PM +0200, Agnieszka Bielec wrote:
>> 2015-07-05 9:53 GMT+02:00 Solar Designer <solar@...nwall.com>:
>> > Please also try going in the opposite direction: keep more stuff in
>> > global memory, reduce use of local memory per instance to the point
>> > where you can use a lot higher GWS - like 20480 (10x higher than what's
>> > auto-tuned now) or even higher.  This may result in a speedup through
>> > hiding of global memory access latencies due to the greater concurrency.
>>
>> it's my first version, I'm including results for costs 16 16, 1 20 and
>> 1 28.
>
> Can you also try:
>
> t = 1
> m = 80
> c = 256
> p = 1
>
> This should be almost 2 MB.
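
For reference, the "almost 2 MB" figure can be checked with a little arithmetic. The sketch below assumes a Lyra2 block of 12 64-bit words (96 bytes) per column. That block size is an assumption, chosen because it reproduces the 1.88 MB and 384.00 kB values printed by the benchmarks in this thread and the 24 KB row size mentioned later. The function names are made up for illustration.

```python
# Sketch: size of the Lyra2 memory matrix for given m (rows) and c (columns).
# BLOCK_WORDS = 12 is an assumption that matches the sizes seen in this thread.

WORD_BYTES = 8    # one 64-bit word
BLOCK_WORDS = 12  # assumed number of words per block (one column cell)


def row_bytes(c):
    """Bytes in one row of c columns."""
    return c * BLOCK_WORDS * WORD_BYTES


def matrix_bytes(m, c):
    """Bytes in the whole matrix of m rows by c columns."""
    return m * row_bytes(c)


for m, c in [(80, 256), (16, 256)]:
    size = matrix_bytes(m, c)
    print(f"m={m} c={c}: {size} bytes, {size / 2**20} MiB")
```

Under that assumption, m=80 and c=256 give 1966080 bytes (1.875 MiB), and m=16 with c=256 gives 393216 bytes (384 KiB). The t cost does not enter the calculation.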

These tests are on the GTX 960M this time.

with lws=64

none@...e ~/Desktop/jajo/run $ ./john --test --format=lyra2-opencl
Benchmarking: Lyra2-opencl [Lyra2 OpenCL (inefficient, development use only)]...
Device 0: GeForce GTX 960M
memory per hash : 1.88 MB
Local worksize (LWS) 64, global worksize (GWS) 256
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 80, cost 3 (c) of 256, cost 4 (p) of 1
Raw:    406 c/s real, 403 c/s virtual

with lws=8 (because 8 was the best on CUDA)
none@...e ~/Desktop/jajo/run $ ./john --test --format=lyra2-cuda
Benchmarking: Lyra2-cuda [Lyra2 CUDA]... \
memory per hash : 1.88 MB
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 80, cost 3 (c) of 256, cost 4 (p) of 1
Raw:    363 c/s real, 360 c/s virtual

with lws=8
none@...e ~/Desktop/jajo/run $ LWS=8 ./john --test --format=lyra2-opencl
Benchmarking: Lyra2-opencl [Lyra2 OpenCL (inefficient, development use only)]...
Device 0: GeForce GTX 960M
memory per hash : 1.88 MB
Local worksize (LWS) 8, global worksize (GWS) 256
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 80, cost 3 (c) of 256, cost 4 (p) of 1
Raw:    506 c/s real, 506 c/s virtual

I have also just discovered that the best LWS differs between cost
settings, but it isn't auto-tuned (for the lowest costs the best is 4,
but LWS must be bigger than nThreads).

OpenCL, my previous version:

none@...e ~/Desktop/work_lyra2_dziala/run $ ./john --test --format=lyra2-opencl
Benchmarking: Lyra2-opencl, Lyra2 [Lyra2 Sponge OpenCL (inefficient,
development use only)]... Device 0: GeForce GTX 960M
Local worksize (LWS) 64, global worksize (GWS) 256
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 80, cost 3 (c) of 256, cost 4 (p) of 1
Raw:    275 c/s real, 276 c/s virtual

none@...e ~/Desktop/work_lyra2_dziala/run $ LWS=8 ./john --test --format=lyra2-opencl
Benchmarking: Lyra2-opencl, Lyra2 [Lyra2 Sponge OpenCL (inefficient,
development use only)]... Device 0: GeForce GTX 960M
Local worksize (LWS) 8, global worksize (GWS) 256
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 80, cost 3 (c) of 256, cost 4 (p) of 1
Raw:    360 c/s real, 358 c/s virtual


The speed of CUDA and of my old version, where I had everything in
__global memory, is about the same (363 vs. 360 c/s at LWS=8).

>
>> benchmarking doesn't work well in my old version and I'm setting
>> GWS manually. Note that I'm getting CL_INVALID_BUFFER_SIZE for
>> GWS=8192 and cost 16 16. It's 3 GB.
>
> You're right, the card's total memory size should become the limiting
> factor for this approach.

Now I know that the maximum size of a single buffer allocation is 1/4
of total memory (I read it somewhere). I can make 4 buffers, each
covering a different range of global_id, but the speed decreases at
this GWS anyway (IIRC).
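
The idea of splitting the per-work-item memory over several buffers can be sketched as follows. This is only an illustration in Python, not code from the format: the helper names are hypothetical, and the 3 GiB card size and the one-quarter cap are assumptions taken from the discussion above. On a real device the cap would come from clGetDeviceInfo() with CL_DEVICE_MAX_MEM_ALLOC_SIZE.

```python
# Sketch: spread per-work-item state over several buffers when a single
# allocation is capped. All helper names are hypothetical.


def items_per_buffer(max_alloc_bytes, bytes_per_item):
    """How many whole work-item regions fit in one buffer."""
    return max_alloc_bytes // bytes_per_item


def buffers_needed(gws, per_buffer):
    """Number of buffers required for gws work-items (ceiling division)."""
    return -(-gws // per_buffer)


def locate(global_id, per_buffer, bytes_per_item):
    """Map a global id to (buffer index, byte offset inside that buffer)."""
    return global_id // per_buffer, (global_id % per_buffer) * bytes_per_item


bytes_per_item = 384 * 1024  # "memory per hash" printed for cost 16:16
total_mem = 3 * 2**30        # assumed 3 GiB of device memory
max_alloc = total_mem // 4   # assumed cap: one quarter of total memory

per_buffer = items_per_buffer(max_alloc, bytes_per_item)
print("work-items per buffer:", per_buffer)
print("buffers for GWS=8192:", buffers_needed(8192, per_buffer))
print("global_id 5000 lives at:", locate(5000, per_buffer, bytes_per_item))
```

With these assumed numbers each buffer holds 2048 work-items, so GWS=8192 needs exactly 4 full buffers, which would leave no room for any other allocation on the card.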

>
>> I said that I'm using local memory but I meant __private,
>> sorry if that caused confusion
>
> OK.  I guess you're putting the current row (24 KB) in there?  And when
> you were using global memory before, you had the current row fetched
> from and sent to global memory each time?

It's not 24 KB. As I wrote, these are very small chunks, and when I
tried 2x, 3x and 5x bigger ones, the speed decreased.
But I'm sceptical about such a huge cache in local memory, because we
have e.g. 32 KB for all LWS threads together, and the speed would
decrease even if I only changed LWS from 64 to 1.
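
The concern about local memory can be put in numbers. A minimal sketch, taking the 32 KB figure from the paragraph above and the 24 KB row from the quoted question as given, and assuming local memory is split evenly between the work-items of one group (helper names are hypothetical):

```python
# Sketch: how much local memory each work-item gets if a fixed amount is
# shared evenly by the whole work-group. Helper names are hypothetical.


def local_bytes_per_item(local_mem_bytes, lws):
    """Even share of local memory per work-item at a given LWS."""
    return local_mem_bytes // lws


def max_lws_for(local_mem_bytes, bytes_needed_per_item):
    """Largest LWS at which every work-item still gets bytes_needed_per_item."""
    return local_mem_bytes // bytes_needed_per_item


LOCAL_MEM = 32 * 1024  # example figure from the message
ROW = 24 * 1024        # row size from the quoted question

for lws in (64, 8, 1):
    share = local_bytes_per_item(LOCAL_MEM, lws)
    print(f"LWS={lws}: {share} bytes per work-item, row fits: {share >= ROW}")

print("largest LWS that fits a full row:", max_lws_for(LOCAL_MEM, ROW))
```

At LWS=64 each work-item gets 512 bytes, and a full 24 KB row per work-item only fits when LWS is 1.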

>
>> [a@...er run]$ GWS=1024 ./john --test --format=lyra2-old-pencl
>> --cost=16:16,16:16
>> Benchmarking: Lyra2-old-pencl [Lyra2 OpenCL (inefficient, development
>> use only)]... Device 0: Tahiti [AMD Radeon HD 7900 Series]
>> memory per hash : 384.00 kB
>> Local worksize (LWS) 64, global worksize (GWS) 1024
>> DONE
>> Speed for cost 1 (t) of 16, cost 2 (m) of 16, cost 3 (c) of 256, cost 4 (p) of 2
>> Raw:    769 c/s real, 34133 c/s virtual
>>
>> GWS=8192 ./john --test --format=lyra2-old-pencl --cost=16:16,16:16
>> Benchmarking: Lyra2-old-pencl [Lyra2 OpenCL (inefficient, development
>> use only)]... Device 0: Tahiti [AMD Radeon HD 7900 Series]
>> memory per hash : 384.00 kB
>> OpenCL error (CL_INVALID_BUFFER_SIZE) in file
>> (opencl_lyra2_old_fmt_plug.c) at line (170) - (Error creating device
>> buffer)
>
> I guess you also tried slightly smaller values, like 7680?  So that
> you'd fit in 3 GB.

I didn't, and I can't test it now because I'm getting lags.
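
For whoever picks this test up later, the totals involved are easy to tabulate. The sketch below uses the 384 kB per hash printed above for cost 16:16. The helper names are hypothetical, and the budget passed to largest_gws() is a made-up example value, since the amount of memory actually free on the card is not known here.

```python
# Sketch: total device memory needed for a given GWS, and the largest GWS
# (a multiple of LWS) that fits a byte budget. Helper names are hypothetical.

PER_HASH = 384 * 1024  # bytes per hash for cost 16:16, as printed by john
GIB = 2**30


def total_bytes(gws, bytes_per_hash):
    """Memory needed when every work-item has its own matrix."""
    return gws * bytes_per_hash


def largest_gws(budget_bytes, bytes_per_hash, lws):
    """Largest multiple of lws whose total allocation fits in the budget."""
    return (budget_bytes // bytes_per_hash) // lws * lws


for gws in (1024, 7680, 8192):
    print(f"GWS={gws}: {total_bytes(gws, PER_HASH) / GIB} GiB")

# Example budget: 3 GiB minus 64 MiB reserved for everything else (assumed).
print(largest_gws(3 * GIB - 64 * 2**20, PER_HASH, 64))
```

GWS=8192 needs exactly 3 GiB and GWS=7680 needs 2.8125 GiB. If the per-buffer limit mentioned earlier in this message applies, a single buffer of that size may still be rejected even though the total fits.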