Message-ID: <613f4f8dfad890b4bff5527da0769358@smtp.hushmail.com>
Date: Mon, 22 Jun 2015 21:20:51 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: bcrypt-opencl local vs. private memory

On 2015-06-22 05:49, Solar Designer wrote:
> On Sun, Jun 21, 2015 at 01:30:52AM +0200, magnum wrote:
>> On 2015-06-20 23:04, Solar Designer wrote:
>>> magnum, can we possibly have this local vs. private bit autodetected
>>> along with GWS and LWS?
>>
>> Well, the bcrypt format could do so. That would be for Sayantan to
>> implement. However, I just committed a workaround for now, simply using
>> nvidia_sm_5x() instead of gpu_nvidia().
>
> This is based on testing on your Maxwell card? What speeds are you
> getting for local vs. private memory there? And what card is that?

I was confused; I had the idea your Titan was somehow sm_5x despite not
being Maxwell. But more on Maxwell below.

>> BTW, for my Kepler GPU I see no difference between using local or private.
>
> Note that I initially pointed this out for a Kepler - the TITAN that we
> have in super:
>
> http://www.openwall.com/lists/john-dev/2015/05/07/36

It seems I screwed up (again) when checking that. My little toy Kepler is
indeed faster using private. Unfortunately, the nvidia_sm* macros don't
work on OS X (they depend on proprietary OpenCL extensions which Apple
doesn't include, even for their nvidia drivers).

> So maybe the check should be:
>
> #if nvidia_sm_3x(DEVICE_INFO) || nvidia_sm_5x(DEVICE_INFO)

Actually, only sm_3x. I tested this on a Titan X today and local is much
better there:

Using private:

Device 0: GeForce GTX TITAN X
Local worksize (LWS) 8, Global worksize (GWS) 2048
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... DONE
Speed for cost 1 (iteration count) of 32
Raw:	790 c/s real, 787 c/s virtual

Using local:

Device 0: GeForce GTX TITAN X
Local worksize (LWS) 8, Global worksize (GWS) 4096
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... DONE
Speed for cost 1 (iteration count) of 32
Raw:	5354 c/s real, 5319 c/s virtual

BTW, I tested oclHashcat too and it does 11570 c/s; we don't even do half
of that :-/

Anyway, I have now committed a proper change (sm_3x gets private, all
others get local). I may try to find a workaround for OS X detection some
rainy day. For example, if CUDA is enabled we could fall back to CUDA
queries for that.

magnum
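[Editor's note: for readers unfamiliar with the mechanism being discussed,
below is a minimal sketch of the kind of compile-time selection described
above, where a nvidia_sm_3x() check decides between private and local
memory for the Blowfish S-boxes. The kernel and the BF_S_MEM macro name
are hypothetical and for illustration only; this is not the actual
bcrypt-opencl kernel source.]

/*
 * Illustrative sketch only -- not the actual bf_kernel.cl source.
 * nvidia_sm_3x() and DEVICE_INFO are the device-info macros discussed
 * in the thread (made available to the kernel at build time);
 * BF_S_MEM and bf_sbox_demo are made-up names.
 */
#if nvidia_sm_3x(DEVICE_INFO)
#define BF_S_MEM            /* sm_3x (Kepler): default private memory is faster */
#else
#define BF_S_MEM __local    /* Maxwell and everything else: local memory wins */
#endif

__kernel void bf_sbox_demo(__global uint *out)
{
	/*
	 * Placeholder for the Blowfish S-boxes. In a real kernel the
	 * __local variant would be sized and indexed per work-item
	 * within the work-group; this array only demonstrates the
	 * qualifier switch above.
	 */
	BF_S_MEM uint S[1024];

	for (uint i = 0; i < 1024; i++)
		S[i] = i * 0x9e3779b9U;

	out[get_global_id(0)] = S[get_global_id(0) & 1023];
}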