Message-ID: <63961a75e0728b716717c3c0cf074f85@smtp.hushmail.com>
Date: Mon, 24 Aug 2015 09:08:31 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: interleaving on GPUs

On 2015-08-24 04:44, Solar Designer wrote:
> On Sun, Aug 23, 2015 at 11:19:08PM +0200, magnum wrote:
>> On 2015-08-23 23:10, magnum wrote:
>>> On 2015-08-23 23:05, magnum wrote:
>>>> On 2015-08-23 07:08, Solar Designer wrote:
>>>>> Note that they explicitly mention "processing several data items
>>>>> concurrently per thread". So it appears that when targeting Kepler, up
>>>>> to 2x interleaving at OpenCL kernel source level could make sense.
>>>>
>>>> Shouldn't simply using vectorized code (eg. using uint2) result in just
>>>> the interleaving we want (on nvidia)?
>
> With my current understanding of the extent to which we're stuck with
> the pure SIMT model, yes, uint2 should be similar to 2x interleaving.
>
>>> I tried PBKDF2-HMAC MD4, MD5 and SHA-1 but they all lost some performance.
>>
>> The loss I saw might have been because my laptop Kepler is too slow so
>> auto-tune doesn't let it run optimally.
>
> How much did they lose on your laptop?

MD4 and MD5 got 6% and 11% regressions. Running auto-tune with -verbose:5
shows that for a given work size the vector code is faster, but since
auto-tune currently caps the GWS at a maximum of 5 seconds of total
crypt_all duration and/or at a total memory limit, the end result is a
lower GWS and lower speed.

>> Here's super's Titan:
>>
>> $ ../run/john -test -dev=5 -form:pbkdf2-hmac-md4-opencl
>> Device 5: GeForce GTX TITAN
>> Benchmarking: PBKDF2-HMAC-MD4-opencl [PBKDF2-MD4 OpenCL]... DONE
>> Speed for cost 1 (iterations) of 1000
>> Raw: 2933K c/s real, 2892K c/s virtual
>>
>> $ ../run/john -test -dev=5 -form:pbkdf2-hmac-md4-opencl -force-vec=2
>> Device 5: GeForce GTX TITAN
>> Benchmarking: PBKDF2-HMAC-MD4-opencl [PBKDF2-MD4 OpenCL 2x]... DONE
>> Speed for cost 1 (iterations) of 1000
>> Raw: 3302K c/s real, 3201K c/s virtual
>>
>> $ ../run/john -test -dev=5 -form:pbkdf2-hmac-md5-opencl
>> Device 5: GeForce GTX TITAN
>> Benchmarking: PBKDF2-HMAC-MD5-opencl [PBKDF2-MD5 OpenCL]... DONE
>> Speed for cost 1 (iterations) of 1000
>> Raw: 1906K c/s real, 1872K c/s virtual
>>
>> $ ../run/john -test -dev=5 -form:pbkdf2-hmac-md5-opencl -force-vec=2
>> Device 5: GeForce GTX TITAN
>> Benchmarking: PBKDF2-HMAC-MD5-opencl [PBKDF2-MD5 OpenCL 2x]... DONE
>> Speed for cost 1 (iterations) of 1000
>> Raw: 2199K c/s real, 2169K c/s virtual
>>
>> $ ../run/john -test -dev=5 -form:pbkdf2-hmac-sha1-opencl
>> Device 5: GeForce GTX TITAN
>> Benchmarking: PBKDF2-HMAC-SHA1-opencl [PBKDF2-SHA1 OpenCL]... DONE
>> Speed for cost 1 (iterations) of 1000
>> Raw: 864804 c/s real, 859488 c/s virtual
>>
>> $ ../run/john -test -dev=5 -form:pbkdf2-hmac-sha1-opencl -force-vec=2
>> Device 5: GeForce GTX TITAN
>> Benchmarking: PBKDF2-HMAC-SHA1-opencl [PBKDF2-SHA1 OpenCL 2x]... DONE
>> Speed for cost 1 (iterations) of 1000
>> Raw: 718202 c/s real, 703742 c/s virtual
>>
>> So there is indeed a speedup for MD4 and MD5 but not for SHA-1 in this case.
>
> Cool! If the loss on your laptop for MD4 and MD5 is less than the gain
> on TITAN, then can we make this the default?

I'll have a look at that. Since we see a significant loss for SHA-1, the
choice will have to be made per format. PBKDF2-HMAC-MD4/5 are contrived
ones; we should add vector support to more formats.

> -force-vec=2 doesn't appear to affect md5crypt-opencl. Why not? Does
> it require some per-format support?
Yes, and md5crypt is trickier with varying-length plaintexts (although we
could sort them).

magnum
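
For context, here is a minimal, hypothetical OpenCL sketch of the uint2
interleaving discussed above. It is not taken from JtR's kernels; the kernel
name, the toy mixing function and the V_WIDTH build define are made up for
illustration. The point is only that changing the working type from uint to
uint2 makes each work-item carry two independent candidates, which the
compiler lowers to two interleaved scalar instruction streams on NVIDIA's
SIMT hardware:

#ifndef V_WIDTH
#define V_WIDTH 2          /* hypothetical build option, e.g. -DV_WIDTH=2 */
#endif

#if V_WIDTH == 2
typedef uint2 vtype;       /* two candidates per work-item */
#else
typedef uint vtype;        /* one candidate per work-item */
#endif

/* Toy mixing step standing in for the real MD4/MD5/SHA-1 rounds. */
inline vtype toy_mix(vtype a, vtype b)
{
    return rotate(a ^ b, (vtype)7) + b;
}

__kernel void toy_crypt(__global const vtype *in, __global vtype *out,
                        uint iterations)
{
    uint gid = get_global_id(0);
    vtype a = in[2 * gid];
    vtype b = in[2 * gid + 1];

    /* With vtype == uint2, every operation below works on two
     * independent candidates; the compiler emits interleaved scalar
     * instructions for them, hiding some per-thread latency. */
    for (uint i = 0; i < iterations; i++)
        a = toy_mix(a, b);

    out[gid] = a;
}

Building the same source with -DV_WIDTH=1 or -DV_WIDTH=2 would correspond
roughly to the scalar and -force-vec=2 cases benchmarked above.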