Message-ID: <20150507195226.GA15044@openwall.com>
Date: Thu, 7 May 2015 22:52:26 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: bcrypt-opencl local vs. private memory

Sayantan, magnum -

I just realized that our bcrypt-opencl's bf_kernel.cl is using local
rather than private memory on all GPUs.  While this is right for AMD,
it might not be right for NVIDIA.

Here's what I am getting with unchanged bf_kernel.cl on super's GTX
TITAN:

$ ./john -te -form=bcrypt-opencl -dev=5
Device 5: GeForce GTX TITAN
Local worksize (LWS) 8, Global worksize (GWS) 128
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... DONE
Speed for cost 1 (iteration count) of 32
Raw:    487 c/s real, 487 c/s virtual

$ GWS=1024 ./john -te -form=bcrypt-opencl -dev=5
Device 5: GeForce GTX TITAN
Local worksize (LWS) 8, Global worksize (GWS) 1024
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... DONE
Speed for cost 1 (iteration count) of 32
Raw:    781 c/s real, 787 c/s virtual

BTW, it's unclear why the auto-tuning didn't go higher than GWS=128.

Here's the best speed we got for it before (IIRC, also with manual GWS):

http://www.openwall.com/presentations/Passwords14-Energy-Efficient-Cracking/slide-45.html

This says 813 c/s.

Changing "#define MAYBE_LOCAL" in bf_kernel.cl from __local to
__private, I got:

$ ./john -te -form=bcrypt-opencl -dev=5
Device 5: GeForce GTX TITAN
Local worksize (LWS) 8, Global worksize (GWS) 1024
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... DONE
Speed for cost 1 (iteration count) of 32
Raw:    860 c/s real, 853 c/s virtual

GWS=1024 right away, and the speed is slightly better.  BTW, 2048 fails:

$ GWS=2048 ./john -te -form=bcrypt-opencl -dev=5
Device 5: GeForce GTX TITAN
Local worksize (LWS) 8, Global worksize (GWS) 2048
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]...
Segmentation fault

I think we should test on other NVIDIA cards (such as on GTX 570 in
bull) and maybe make this the default for NVIDIA.  It may also make
sense to place some of the S-boxes in local and some in private.  Maybe
this will result in a higher optimal GWS.

Will you take this task from here, please?

BTW, on AMD this results in huge slowdown.

local:

$ ./john -te -form=bcrypt-opencl -dev=1
Device 1: Tahiti [AMD Radeon HD 7900 Series]
Local worksize (LWS) 4, Global worksize (GWS) 1024
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... DONE
Speed for cost 1 (iteration count) of 32
Raw:    4231 c/s real, 512000 c/s virtual

private:

$ ./john -te -form=bcrypt-opencl -dev=1
Device 1: Tahiti [AMD Radeon HD 7900 Series]
Local worksize (LWS) 4, Global worksize (GWS) 512
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... DONE
Speed for cost 1 (iteration count) of 32
Raw:    775 c/s real, 102400 c/s virtual

$ GWS=1024 ./john -te -form=bcrypt-opencl -dev=1
Device 1: Tahiti [AMD Radeon HD 7900 Series]
Local worksize (LWS) 4, Global worksize (GWS) 1024
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... DONE
Speed for cost 1 (iteration count) of 32
Raw:    775 c/s real, 102400 c/s virtual

Alexander
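
To put rough numbers on "slightly better" and "huge slowdown", the "real" c/s figures quoted in this message can be compared directly.  The snippet below is only an illustrative sketch: the constant and function names are made up for this note and are not part of John the Ripper, and single benchmark runs like these carry some noise.

```python
# "Raw ... c/s real" figures copied from the benchmark output in this message.
TITAN_LOCAL = 781        # GTX TITAN, __local, GWS=1024
TITAN_PRIVATE = 860      # GTX TITAN, __private, GWS=1024
TITAN_BEST_BEFORE = 813  # figure cited from the Passwords14 slide
TAHITI_LOCAL = 4231      # Tahiti, __local, GWS=1024
TAHITI_PRIVATE = 775     # Tahiti, __private, GWS=1024


def speed_ratio(new_cps, old_cps):
    """Return new speed divided by old speed (values above 1 mean faster)."""
    return new_cps / old_cps


def summarize():
    """Collect the three comparisons discussed in the message."""
    return {
        "TITAN private vs. local": speed_ratio(TITAN_PRIVATE, TITAN_LOCAL),
        "TITAN private vs. earlier best": speed_ratio(TITAN_PRIVATE, TITAN_BEST_BEFORE),
        "Tahiti private vs. local": speed_ratio(TAHITI_PRIVATE, TAHITI_LOCAL),
    }


if __name__ == "__main__":
    for label, ratio in summarize().items():
        print(f"{label}: {ratio:.2f}x")
```

By this arithmetic, private memory gains about 10% over local on the TITAN at the same GWS, while on Tahiti it runs at under a fifth of the local-memory speed.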