Message-ID: <613f4f8dfad890b4bff5527da0769358@smtp.hushmail.com>
Date: Mon, 22 Jun 2015 21:20:51 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: bcrypt-opencl local vs. private memory

On 2015-06-22 05:49, Solar Designer wrote:
> On Sun, Jun 21, 2015 at 01:30:52AM +0200, magnum wrote:
>> On 2015-06-20 23:04, Solar Designer wrote:
>>> magnum, can we possibly have
>>> this local vs. private bit autodetected along with GWS and LWS?
>>
>> Well, the bcrypt format could do so. That would be for Sayantan to
>> implement. However, I just committed a workaround for now, simply using
>> nvidia_sm_5x() instead of gpu_nvidia().
>
> This is based on testing on your Maxwell card?  What speeds are you
> getting for local vs. private memory there?  And what card is that?

I was confused; I had the idea your Titan was somehow sm_5x despite not 
being Maxwell. But more on Maxwell below.

>> BTW for my Kepler GPU, I see no difference between using local or private.
>
> Note that I initially pointed this out for a Kepler - the TITAN that we
> have in super:
>
> http://www.openwall.com/lists/john-dev/2015/05/07/36

It seems I screwed up (again) when checking that. My little toy Kepler 
is indeed faster using private. Unfortunately, the nvidia_sm* macros 
don't work on OS X (they depend on proprietary OpenCL extensions that 
Apple doesn't include, even for its nvidia drivers).

> So maybe the check should be:
>
> #if nvidia_sm_3x(DEVICE_INFO) || nvidia_sm_5x(DEVICE_INFO)

Actually only sm_3x. I tested this on a Titan X today and local is much 
better there:

Using private:
Device 0: GeForce GTX TITAN X
Local worksize (LWS) 8, Global worksize (GWS) 2048
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... DONE
Speed for cost 1 (iteration count) of 32
Raw:    790 c/s real, 787 c/s virtual

Using local:
Device 0: GeForce GTX TITAN X
Local worksize (LWS) 8, Global worksize (GWS) 4096
Benchmarking: bcrypt-opencl ("$2a$05", 32 iterations) [Blowfish OpenCL]... DONE
Speed for cost 1 (iteration count) of 32
Raw:    5354 c/s real, 5319 c/s virtual

BTW, I tested oclHashcat too and it does 11570 c/s; we don't even do 
half of that :-/

Anyway, I have now committed a proper change (sm_3x gets private, all 
others get local). I may try to find a workaround for OSX detection some 
rainy day. For example, if CUDA is enabled we could fall back to CUDA 
queries for that.

magnum

