Message-ID: <CAKGDhHVp2pE4=P+VnBuF068yxm=_Zk7SXe_h3wuVQhOKQw=2RA@mail.gmail.com>
Date: Mon, 13 Jul 2015 16:33:24 +0200
From: Agnieszka Bielec <bielecagnieszka8@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Lyra2 on GPU

2015-07-12 16:55 GMT+02:00 Solar Designer <solar@...nwall.com>:
> Agnieszka,
>
> On Mon, Jul 06, 2015 at 04:56:11PM +0200, Agnieszka Bielec wrote:
>> 2015-07-05 9:53 GMT+02:00 Solar Designer <solar@...nwall.com>:
>> > Please also try going in the opposite direction: keep more stuff in
>> > global memory, reduce use of local memory per instance to the point
>> > where you can use a lot higher GWS - like 20480 (10x higher than what's
>> > auto-tuned now) or even higher. This may result in a speedup through
>> > hiding of global memory access latencies due to the greater concurrency.
>>
>> it's my first version, I'm including results for costs 16 16, 1 20 and
>> 1 28.
>
> Can you also try:
>
> t = 1
> m = 80
> c = 256
> p = 1
>
> This should be almost 2 MB.

these tests are for the 960M this time

with lws=64

none@...e ~/Desktop/jajo/run $ ./john --test --format=lyra2-opencl
Benchmarking: Lyra2-opencl [Lyra2 OpenCL (inefficient, development use only)]...
Device 0: GeForce GTX 960M
memory per hash : 1.88 MB
Local worksize (LWS) 64, global worksize (GWS) 256
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 80, cost 3 (c) of 256, cost 4 (p) of 1
Raw: 406 c/s real, 403 c/s virtual

with lws=8 (because 8 was the best on CUDA)

none@...e ~/Desktop/jajo/run $ ./john --test --format=lyra2-cuda
Benchmarking: Lyra2-cuda [Lyra2 CUDA]... \
memory per hash : 1.88 MB
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 80, cost 3 (c) of 256, cost 4 (p) of 1
Raw: 363 c/s real, 360 c/s virtual

with lws=8

none@...e ~/Desktop/jajo/run $ LWS=8 ./john --test --format=lyra2-opencl
Benchmarking: Lyra2-opencl [Lyra2 OpenCL (inefficient, development use only)]...
Device 0: GeForce GTX 960M
memory per hash : 1.88 MB
Local worksize (LWS) 8, global worksize (GWS) 256
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 80, cost 3 (c) of 256, cost 4 (p) of 1
Raw: 506 c/s real, 506 c/s virtual

and I discovered now that the best lws also differs for various costs,
but it isn't autotuned (for the lowest costs the best is 4, but lws must
be bigger than nThreads)

OpenCL, my previous version:

none@...e ~/Desktop/work_lyra2_dziala/run $ ./john --test --format=lyra2-opencl
Benchmarking: Lyra2-opencl, Lyra2 [Lyra2 Sponge OpenCL (inefficient, development use only)]...
Device 0: GeForce GTX 960M
Local worksize (LWS) 64, global worksize (GWS) 256
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 80, cost 3 (c) of 256, cost 4 (p) of 1
Raw: 275 c/s real, 276 c/s virtual

none@...e ~/Desktop/work_lyra2_dziala/run $ LWS=8 ./john --test --format=lyra2-opencl
Benchmarking: Lyra2-opencl, Lyra2 [Lyra2 Sponge OpenCL (inefficient, development use only)]...
Device 0: GeForce GTX 960M
Local worksize (LWS) 8, global worksize (GWS) 256
DONE
Speed for cost 1 (t) of 1, cost 2 (m) of 80, cost 3 (c) of 256, cost 4 (p) of 1
Raw: 360 c/s real, 358 c/s virtual

the speed of CUDA and of my old version, where I had everything in
__global memory, is the same

>
>> benchmarking doesn't work good in my old version and I'm setting
>> GWS manually, note that I'm getting CL_INVALID_BUFFER_SIZE for
>> GWS=8192 and cost 16 16. it's 3GB.
>
> You're right, the card's total memory size should become the limiting
> factor for this approach.

now I know that the maximum allocation size of one buffer is 1/4 of
total memory (I read it somewhere). I can make 4 buffers for different
ranges of global_id, but the speed is decreasing at this size of gws (IIRC)

>
>> I said that I'm using local memory but I wanted to say __private ,
>> sorry if caused confusion
>
> OK. I guess you're putting the current row (24 KB) in there? And when
> you were using global memory before, you had the current row fetched
> from and sent to global memory each time?

it's not 24KB. I wrote that these are very small chunks, and when I tried
2x, 3x, 5x bigger ones the speed decreased. but I'm sceptical about such
a huge cache in local memory, because we have e.g. 32KB shared by all
lws threads, and the speed will decrease even if I only change lws from
64 to 1

>
>> [a@...er run]$ GWS=1024 ./john --test --format=lyra2-old-pencl
>> --cost=16:16,16:16
>> Benchmarking: Lyra2-old-pencl [Lyra2 OpenCL (inefficient, development
>> use only)]... Device 0: Tahiti [AMD Radeon HD 7900 Series]
>> memory per hash : 384.00 kB
>> Local worksize (LWS) 64, global worksize (GWS) 1024
>> DONE
>> Speed for cost 1 (t) of 16, cost 2 (m) of 16, cost 3 (c) of 256, cost 4 (p) of 2
>> Raw: 769 c/s real, 34133 c/s virtual
>>
>> GWS=8192 ./john --test --format=lyra2-old-pencl --cost=16:16,16:16
>> Benchmarking: Lyra2-old-pencl [Lyra2 OpenCL (inefficient, development
>> use only)]... Device 0: Tahiti [AMD Radeon HD 7900 Series]
>> memory per hash : 384.00 kB
>> OpenCL error (CL_INVALID_BUFFER_SIZE) in file
>> (opencl_lyra2_old_fmt_plug.c) at line (170) - (Error creating device
>> buffer)
>
> I guess you also tried slightly smaller values, like 7680? So that
> you'd fit in 3 GB.

I didn't and I can't test it now because I have lags
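
The memory figures quoted in this thread can be checked with a little
arithmetic. The sketch below is only an illustration, not code from John
the Ripper: the function names are hypothetical, and the block size of
96 bytes (12 64-bit words per Lyra2 block) is an assumption that is not
stated in the message but does reproduce the reported "1.88 MB",
"384.00 kB" and "24 KB" row figures. The 768 MiB per-buffer limit is
simply 1/4 of an assumed 3 GiB card, following the rule of thumb
mentioned above; on a real device the limit would be queried through
CL_DEVICE_MAX_MEM_ALLOC_SIZE rather than assumed.

```python
# Assumption: one Lyra2 block is 12 64-bit words = 96 bytes.
BLOCK_BYTES = 96
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def bytes_per_hash(m_cost, n_cols, block_bytes=BLOCK_BYTES):
    """Memory matrix size for one hash: m rows x c columns x block size."""
    return m_cost * n_cols * block_bytes


def total_bytes(gws, m_cost, n_cols):
    """Memory needed when gws hashes are computed concurrently."""
    return gws * bytes_per_hash(m_cost, n_cols)


def buffers_needed(gws, m_cost, n_cols, max_alloc_bytes):
    """Number of device buffers needed if no buffer may exceed
    max_alloc_bytes and one hash's matrix never spans two buffers."""
    per_hash = bytes_per_hash(m_cost, n_cols)
    hashes_per_buffer = max_alloc_bytes // per_hash
    if hashes_per_buffer == 0:
        raise ValueError("a single hash does not fit in one buffer")
    return -(-gws // hashes_per_buffer)  # ceiling division


if __name__ == "__main__":
    print(bytes_per_hash(1, 256) / KIB)            # one row: 24.0 (KiB)
    print(bytes_per_hash(80, 256) / MIB)           # 1.875 (MiB)
    print(bytes_per_hash(16, 256) / KIB)           # 384.0 (KiB)
    print(total_bytes(8192, 16, 256) / GIB)        # 3.0 (GiB)
    print(buffers_needed(8192, 16, 256, 768 * MIB))  # 4
```

Under these assumptions GWS=8192 at cost 16:16 needs exactly 3 GiB, so
four 768 MiB buffers would occupy the whole of a 3 GiB card, leaving no
room for anything else.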