Message-ID: <63961a75e0728b716717c3c0cf074f85@smtp.hushmail.com>
Date: Mon, 24 Aug 2015 09:08:31 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: interleaving on GPUs

On 2015-08-24 04:44, Solar Designer wrote:
> On Sun, Aug 23, 2015 at 11:19:08PM +0200, magnum wrote:
>> On 2015-08-23 23:10, magnum wrote:
>>> On 2015-08-23 23:05, magnum wrote:
>>>> On 2015-08-23 07:08, Solar Designer wrote:
>>>>> Note that they explicitly mention "processing several data items
>>>>> concurrently per thread".  So it appears that when targeting Kepler, up
>>>>> to 2x interleaving at OpenCL kernel source level could make sense.
>>>>
>>>> Shouldn't simply using vectorized code (eg. using uint2) result in just
>>>> the interleaving we want (on nvidia)?
>
> With my current understanding of the extent to which we're stuck with
> the pure SIMT model, yes, uint2 should be similar to 2x interleaving.
>
>>> I tried PBKDF2-HMAC MD4, MD5 and SHA-1 but they all lost some performance.
>>
>> The loss I saw might have been because my laptop Kepler is too slow so
>> auto-tune doesn't let it run optimally.
>
> How much did they lose on your laptop?

MD4 and MD5 showed 6% and 11% regressions, respectively. Looking at 
autotune -verbose:5 output shows that for a given work size the vector 
code is faster, but since autotune currently caps the GWS at a maximum 
of 5 seconds of total crypt_all duration and/or at a total memory 
limit, the end result is a lower GWS and lower speed.

>> Here's super's Titan:
>>
>> $ ../run/john -test -dev=5 -form:pbkdf2-hmac-md4-opencl
>> Device 5: GeForce GTX TITAN
>> Benchmarking: PBKDF2-HMAC-MD4-opencl [PBKDF2-MD4 OpenCL]... DONE
>> Speed for cost 1 (iterations) of 1000
>> Raw:	2933K c/s real, 2892K c/s virtual
>>
>> $ ../run/john -test -dev=5 -form:pbkdf2-hmac-md4-opencl -force-vec=2
>> Device 5: GeForce GTX TITAN
>> Benchmarking: PBKDF2-HMAC-MD4-opencl [PBKDF2-MD4 OpenCL 2x]... DONE
>> Speed for cost 1 (iterations) of 1000
>> Raw:	3302K c/s real, 3201K c/s virtual
>>
>> $ ../run/john -test -dev=5 -form:pbkdf2-hmac-md5-opencl
>> Device 5: GeForce GTX TITAN
>> Benchmarking: PBKDF2-HMAC-MD5-opencl [PBKDF2-MD5 OpenCL]... DONE
>> Speed for cost 1 (iterations) of 1000
>> Raw:	1906K c/s real, 1872K c/s virtual
>>
>> $ ../run/john -test -dev=5 -form:pbkdf2-hmac-md5-opencl -force-vec=2
>> Device 5: GeForce GTX TITAN
>> Benchmarking: PBKDF2-HMAC-MD5-opencl [PBKDF2-MD5 OpenCL 2x]... DONE
>> Speed for cost 1 (iterations) of 1000
>> Raw:	2199K c/s real, 2169K c/s virtual
>>
>> $ ../run/john -test -dev=5 -form:pbkdf2-hmac-sha1-opencl
>> Device 5: GeForce GTX TITAN
>> Benchmarking: PBKDF2-HMAC-SHA1-opencl [PBKDF2-SHA1 OpenCL]... DONE
>> Speed for cost 1 (iterations) of 1000
>> Raw:	864804 c/s real, 859488 c/s virtual
>>
>> $ ../run/john -test -dev=5 -form:pbkdf2-hmac-sha1-opencl -force-vec=2
>> Device 5: GeForce GTX TITAN
>> Benchmarking: PBKDF2-HMAC-SHA1-opencl [PBKDF2-SHA1 OpenCL 2x]... DONE
>> Speed for cost 1 (iterations) of 1000
>> Raw:	718202 c/s real, 703742 c/s virtual
>>
>> So there is indeed a speedup for MD4 and MD5 but not for SHA-1 in this case.
>
> Cool!  If the loss on your laptop for MD4 and MD5 is less than the gain
> on TITAN, then can we make this the default?

I'll have a look at that. Since we see a significant loss for SHA-1, it 
will have to be decided per format. PBKDF2-HMAC-MD4/5 are contrived 
formats anyway; we should add vector support to more formats.

> -force-vec=2 doesn't appear to affect md5crypt-opencl.  Why not?  Does
> it require some per-format support?

Yes, and md5crypt is trickier because of varying-length plaintexts 
(although we could sort them by length).

magnum
