Message-ID: <CAKGDhHUn4tOT_A_yPLnyVyDYBC0kNuhAL1tw9z1w4856CciLJA@mail.gmail.com>
Date: Sat, 4 Jul 2015 17:08:29 +0200
From: Agnieszka Bielec <bielecagnieszka8@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Lyra2 on GPU

2015-07-04 11:54 GMT+02:00 Solar Designer <solar@...nwall.com>:
> On Sat, Jul 04, 2015 at 02:04:26AM +0200, Agnieszka Bielec wrote:
>> I received results:
>>
>> [a@...er run]$ ./john --test --format=lyra2-opencl --dev=5
>> Benchmarking: Lyra2-opencl, Lyra2 [Lyra2 Sponge OpenCL (inefficient,
>> development use only)]... Device 5: GeForce GTX TITAN
>> Local worksize (LWS) 64, global worksize (GWS) 2048
>> DONE
>> Speed for cost 1 (t) of 8, cost 2 (m) of 8, cost 3 (c) of 256, cost 4 (p) of 2
>> Raw: 6023 c/s real, 5965 c/s virtual
>>
>> [a@...er run]$ ./john --test --format=lyra2-opencl
>> Benchmarking: Lyra2-opencl, Lyra2 [Lyra2 Sponge OpenCL (inefficient,
>> development use only)]... Device 0: Tahiti [AMD Radeon HD 7900 Series]
>> Local worksize (LWS) 64, global worksize (GWS) 2048
>> DONE
>> Speed for cost 1 (t) of 8, cost 2 (m) of 8, cost 3 (c) of 256, cost 4 (p) of 2
>> Raw: 7447 c/s real, 51200 c/s virtual
>>
>> before optimizations speed was equal to 1k
>
> Cool. And these are much better than what you were getting with Lyra2
> authors' CUDA code, right?

Yes, but they said that their implementation isn't optimal.
This is the best result I got with it:

[a@...er run]$ ./john --test --format=lyra2-cuda
Benchmarking: Lyra2-cuda, Lyra2 [Lyra2 CUDA]... DONE
Speed for cost 1 (t) of 8, cost 2 (m) of 8
Raw: 1914 c/s real, 1932 c/s virtual

My first OpenCL version ran at more than 1k c/s, but I don't remember
the exact figure.

> Are these higher speeds reproducible on actual cracking runs? Please test.

What does "reproducible on actual cracking runs" mean?

>
>> my optimizations are based on transfer one table to local memory and
>> copying small portions of global memory into local buffers, I didn't
>> saw any sense i coalescing and I didn't tried it
>
> OK.
>
> Is the "copying small portions of global memory into local buffers" like
> prefetching? Or are those small portions more frequently accessed than
> the rest? In other words, why is this optimization effective for Lyra2?

I copy the data in several separate for loops. Only sometimes is an
element accessed twice; mostly it is accessed once. These portions are
small anyway: 12 ulongs for one random pointer to global memory (4 is
max). So I decided to copy even data that is accessed only once. I also
tried copying bigger portions at once, but the speed was worse. On the
AMD GPU, copying is faster even when something is accessed only once.
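
To make the copying pattern concrete, here is a minimal sketch in plain C.
It is not the actual kernel code: all function names are hypothetical, the
XOR step only stands in for the real sponge work, and the arrays are
ordinary host memory. In an OpenCL kernel, `matrix` would presumably live
in global memory and `staged` in a small local (or private) buffer. The
chunk size of 12 ulongs is the only number taken from the description
above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 12 ulongs per random pointer, as stated in the post. */
#define CHUNK_WORDS 12

/* Hypothetical stand-in for the per-chunk sponge work:
 * XOR one chunk into the state. */
static void absorb_chunk(uint64_t state[CHUNK_WORDS],
                         const uint64_t chunk[CHUNK_WORDS])
{
    for (size_t i = 0; i < CHUNK_WORDS; i++)
        state[i] ^= chunk[i];
}

/* Direct variant: operands are read straight from the large matrix
 * (the "global memory" in this simulation). */
void absorb_rows_direct(uint64_t state[CHUNK_WORDS],
                        const uint64_t *matrix,
                        const size_t *row_offsets, size_t n_rows)
{
    for (size_t r = 0; r < n_rows; r++)
        absorb_chunk(state, matrix + row_offsets[r]);
}

/* Staged variant: each 12-word chunk is first copied into a small
 * buffer, and the work then runs on the copy, even though the chunk
 * is used only once. */
void absorb_rows_staged(uint64_t state[CHUNK_WORDS],
                        const uint64_t *matrix,
                        const size_t *row_offsets, size_t n_rows)
{
    uint64_t staged[CHUNK_WORDS]; /* would be a local buffer on a GPU */

    for (size_t r = 0; r < n_rows; r++) {
        memcpy(staged, matrix + row_offsets[r], sizeof staged);
        absorb_chunk(state, staged);
    }
}
```

Both variants compute the same result; they differ only in where the
operands are read from. On a CPU this shows the data movement only and
says nothing about speed, which has to be measured on the GPU itself.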