Message-ID: <63961a75e0728b716717c3c0cf074f85@smtp.hushmail.com>
Date: Mon, 24 Aug 2015 09:08:31 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: interleaving on GPUs

On 2015-08-24 04:44, Solar Designer wrote:
> On Sun, Aug 23, 2015 at 11:19:08PM +0200, magnum wrote:
>> On 2015-08-23 23:10, magnum wrote:
>>> On 2015-08-23 23:05, magnum wrote:
>>>> On 2015-08-23 07:08, Solar Designer wrote:
>>>>> Note that they explicitly mention "processing several data items
>>>>> concurrently per thread". So it appears that when targeting Kepler, up
>>>>> to 2x interleaving at OpenCL kernel source level could make sense.
>>>>
>>>> Shouldn't simply using vectorized code (eg. using uint2) result in just
>>>> the interleaving we want (on nvidia)?
>
> With my current understanding of the extent to which we're stuck with
> the pure SIMT model, yes, uint2 should be similar to 2x interleaving.
>
>>> I tried PBKDF2-HMAC MD4, MD5 and SHA-1 but they all lost some performance.
>>
>> The loss I saw might have been because my laptop Kepler is too slow so
>> auto-tune doesn't let it run optimally.
>
> How much did they lose on your laptop?

MD4 and MD5 got 6% and 11% regressions. Running auto-tune with -verbose:5
shows that for a given work size the vector code is faster, but since
auto-tune currently caps the GWS at a maximum of 5 seconds of total
crypt_all duration and/or at a total memory limit, the end result is a
lower GWS and lower speed.

>> Here's super's Titan:
>>
>> $ ../run/john -test -dev=5 -form:pbkdf2-hmac-md4-opencl
>> Device 5: GeForce GTX TITAN
>> Benchmarking: PBKDF2-HMAC-MD4-opencl [PBKDF2-MD4 OpenCL]... DONE
>> Speed for cost 1 (iterations) of 1000
>> Raw: 2933K c/s real, 2892K c/s virtual
>>
>> $ ../run/john -test -dev=5 -form:pbkdf2-hmac-md4-opencl -force-vec=2
>> Device 5: GeForce GTX TITAN
>> Benchmarking: PBKDF2-HMAC-MD4-opencl [PBKDF2-MD4 OpenCL 2x]... DONE
>> Speed for cost 1 (iterations) of 1000
>> Raw: 3302K c/s real, 3201K c/s virtual
>>
>> $ ../run/john -test -dev=5 -form:pbkdf2-hmac-md5-opencl
>> Device 5: GeForce GTX TITAN
>> Benchmarking: PBKDF2-HMAC-MD5-opencl [PBKDF2-MD5 OpenCL]... DONE
>> Speed for cost 1 (iterations) of 1000
>> Raw: 1906K c/s real, 1872K c/s virtual
>>
>> $ ../run/john -test -dev=5 -form:pbkdf2-hmac-md5-opencl -force-vec=2
>> Device 5: GeForce GTX TITAN
>> Benchmarking: PBKDF2-HMAC-MD5-opencl [PBKDF2-MD5 OpenCL 2x]... DONE
>> Speed for cost 1 (iterations) of 1000
>> Raw: 2199K c/s real, 2169K c/s virtual
>>
>> $ ../run/john -test -dev=5 -form:pbkdf2-hmac-sha1-opencl
>> Device 5: GeForce GTX TITAN
>> Benchmarking: PBKDF2-HMAC-SHA1-opencl [PBKDF2-SHA1 OpenCL]... DONE
>> Speed for cost 1 (iterations) of 1000
>> Raw: 864804 c/s real, 859488 c/s virtual
>>
>> $ ../run/john -test -dev=5 -form:pbkdf2-hmac-sha1-opencl -force-vec=2
>> Device 5: GeForce GTX TITAN
>> Benchmarking: PBKDF2-HMAC-SHA1-opencl [PBKDF2-SHA1 OpenCL 2x]... DONE
>> Speed for cost 1 (iterations) of 1000
>> Raw: 718202 c/s real, 703742 c/s virtual
>>
>> So there is indeed a speedup for MD4 and MD5 but not for SHA-1 in this case.
>
> Cool! If the loss on your laptop for MD4 and MD5 is less than the gain
> on TITAN, then can we make this the default?

I'll have a look at that. Since we see a significant loss for SHA-1, the
choice will have to be made per format. PBKDF2-HMAC-MD4/5 are contrived
ones; we should add vector support to more formats.

> -force-vec=2 doesn't appear to affect md5crypt-opencl. Why not? Does
> it require some per-format support?
Yes, and md5crypt is trickier with varying-length plaintexts (although we
could sort them).

magnum
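
For context, here is a minimal, hypothetical OpenCL sketch of the uint2
interleaving discussed above. It is not taken from JtR's kernels; the kernel
name, the toy mixing function and the V_WIDTH build define are made up for
illustration. The point is only that changing the working type from uint to
uint2 makes each work-item carry two independent candidates, which the
compiler lowers to two interleaved scalar instruction streams on NVIDIA's
SIMT hardware:

#ifndef V_WIDTH
#define V_WIDTH 2          /* hypothetical build option, e.g. -DV_WIDTH=2 */
#endif

#if V_WIDTH == 2
typedef uint2 vtype;       /* two candidates per work-item */
#else
typedef uint vtype;        /* one candidate per work-item */
#endif

/* Toy mixing step standing in for the real MD4/MD5/SHA-1 rounds. */
inline vtype toy_mix(vtype a, vtype b)
{
    return rotate(a ^ b, (vtype)7) + b;
}

__kernel void toy_crypt(__global const vtype *in, __global vtype *out,
                        uint iterations)
{
    uint gid = get_global_id(0);
    vtype a = in[2 * gid];
    vtype b = in[2 * gid + 1];

    /* With vtype == uint2, every operation below works on two
     * independent candidates; the compiler emits interleaved scalar
     * instructions for them, hiding some per-thread latency. */
    for (uint i = 0; i < iterations; i++)
        a = toy_mix(a, b);

    out[gid] = a;
}

Building the same source with -DV_WIDTH=1 or -DV_WIDTH=2 would correspond
roughly to the scalar and -force-vec=2 cases benchmarked above.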