Message-ID: <613f4f8dfad890b4bff5527da0769358@smtp.hushmail.com>
Date: Mon, 22 Jun 2015 21:20:51 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: bcrypt-opencl local vs. private memory
On 2015-06-22 05:49, Solar Designer wrote:
> On Sun, Jun 21, 2015 at 01:30:52AM +0200, magnum wrote:
>> On 2015-06-20 23:04, Solar Designer wrote:
>>> magnum, can we possibly have
>>> this local vs. private bit autodetected along with GWS and LWS?
>>
>> Well, the bcrypt format could do so. That would be for Sayantan to
>> implement. However, I just committed a workaround for now, simply using
>> nvidia_sm_5x() instead of gpu_nvidia().
>
> This is based on testing on your Maxwell card? What speeds are you
> getting for local vs. private memory there? And what card is that?
I was confused: I had the idea your Titan was somehow sm_5x despite not
being Maxwell. But more on Maxwell below.
>> BTW for my Kepler GPU, I see no difference between using local or private.
>
> Note that I initially pointed this out for a Kepler - the TITAN that we
> have in super:
>
> http://www.openwall.com/lists/john-dev/2015/05/07/36
It seems I screwed up (again) when checking that. My little toy Kepler
is indeed faster using private. Unfortunately the nvidia_sm* macros
don't work on OSX (they depend on proprietary extensions to OpenCL which
Apple doesn't include even for their nvidia drivers).
> So maybe the check should be:
>
> #if nvidia_sm_3x(DEVICE_INFO) || nvidia_sm_5x(DEVICE_INFO)
Actually only sm_3x. I tested this on a Titan X today and local is much
better there:
Using private:
Device 0: GeForce GTX TITAN X
Local worksize (LWS) 8, Global worksize (GWS) 2048
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... DONE
Speed for cost 1 (iteration count) of 32
Raw: 790 c/s real, 787 c/s virtual
Using local:
Device 0: GeForce GTX TITAN X
Local worksize (LWS) 8, Global worksize (GWS) 4096
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... DONE
Speed for cost 1 (iteration count) of 32
Raw: 5354 c/s real, 5319 c/s virtual
BTW I tested oclHashcat too and it does 11570 c/s; we don't even do half
of that :-/
Anyway, I have now committed a proper change (sm_3x gets private, all
others get local). I may try to find a workaround for OSX detection some
rainy day. For example, if CUDA is enabled we could fall back to CUDA
queries for that.
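To make the rule concrete, here is a minimal sketch of the selection logic
(sm_3x gets private, everything else gets local). It is not the committed
code: the function names, the enum and the plain-integer arguments are
hypothetical stand-ins for the real nvidia_sm_3x(DEVICE_INFO) check, and I
assume an undetected compute capability shows up as 0.

```c
#include <stdio.h>

/* Hypothetical names for illustration only; the real format uses the
 * nvidia_sm_3x(DEVICE_INFO) macro inside an #if instead. */
enum sbox_memory { SBOX_LOCAL = 0, SBOX_PRIVATE = 1 };

/* Nonzero only for an NVIDIA device whose compute capability major
 * version is 3 (Kepler). Assumption: sm_major is 0 when the version
 * cannot be queried, e.g. where the vendor extension is missing. */
static int is_nvidia_sm_3x(int is_nvidia, int sm_major)
{
	return is_nvidia && sm_major == 3;
}

/* sm_3x gets private memory, every other device gets local memory. */
static enum sbox_memory choose_sbox_memory(int is_nvidia, int sm_major)
{
	return is_nvidia_sm_3x(is_nvidia, sm_major) ? SBOX_PRIVATE : SBOX_LOCAL;
}

/* Human-readable name for a choice, handy in benchmark logs. */
static const char *sbox_memory_name(enum sbox_memory m)
{
	return m == SBOX_PRIVATE ? "private" : "local";
}

/* Print the choice for one device description. */
static void report_choice(const char *label, int is_nvidia, int sm_major)
{
	printf("%s: %s\n", label,
	       sbox_memory_name(choose_sbox_memory(is_nvidia, sm_major)));
}
```

Under these assumptions, report_choice("Kepler", 1, 3) reports private and
report_choice("Maxwell", 1, 5) reports local. A Kepler whose version cannot
be detected (passed as 0 here) would also fall through to local, which is
the OSX case described above.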
magnum