Message-ID: <228dbeb0672ae8088882707f4e18b564@smtp.hushmail.com>
Date: Sun, 23 Aug 2015 23:19:08 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: interleaving on GPUs

On 2015-08-23 23:10, magnum wrote:
> On 2015-08-23 23:05, magnum wrote:
>> On 2015-08-23 07:08, Solar Designer wrote:
>>> I just read this about NVIDIA's Kepler (such as the old GTX TITAN
>>> that we have in super):
>>>
>>> http://docs.nvidia.com/cuda/kepler-tuning-guide/index.html#device-utilization-and-occupancy
>>>
>>> "Also note that Kepler GPUs can utilize ILP in place of
>>> thread/warp-level parallelism (TLP) more readily than Fermi GPUs can.
>>> Furthermore, some degree of ILP in conjunction with TLP is required by
>>> Kepler GPUs in order to approach peak single-precision performance,
>>> since SMX's warp scheduler issues one or two independent instructions
>>> from each of four warps per clock. ILP can be increased by means of,
>>> for example, processing several data items concurrently per thread or
>>> unrolling loops in the device code, though note that either of these
>>> approaches may also increase register pressure."
>>>
>>> Note that they explicitly mention "processing several data items
>>> concurrently per thread". So it appears that when targeting Kepler,
>>> up to 2x interleaving at OpenCL kernel source level could make sense.
>>
>> Shouldn't simply using vectorized code (e.g. using uint2) result in
>> just the interleaving we want (on nvidia)? I tested this with some of
>> our formats that can optionally run vectorized but they don't seem to
>> gain from --force-vector=2.
>
> BTW here's a list of such formats:
>
> $ git grep -l v_width *fmt*c
> opencl_encfs_fmt_plug.c
> opencl_krb5pa-sha1_fmt_plug.c
> opencl_ntlmv2_fmt_plug.c
> opencl_office2007_fmt_plug.c
> opencl_office2010_fmt_plug.c
> opencl_office2013_fmt_plug.c
> opencl_pbkdf2_hmac_md4_fmt_plug.c
> opencl_pbkdf2_hmac_md5_fmt_plug.c
> opencl_pbkdf2_hmac_sha1_fmt_plug.c
> opencl_rakp_fmt_plug.c
> opencl_sha1crypt_fmt_plug.c
> opencl_wpapsk_fmt_plug.c
>
> I tried PBKDF2-HMAC MD4, MD5 and SHA-1 but they all lost some
> performance.

The loss I saw might have been because my laptop Kepler is too slow, so
auto-tune doesn't let it run optimally. Here's super's Titan:

$ ../run/john -test -dev=5 -form:pbkdf2-hmac-md4-opencl
Device 5: GeForce GTX TITAN
Benchmarking: PBKDF2-HMAC-MD4-opencl [PBKDF2-MD4 OpenCL]... DONE
Speed for cost 1 (iterations) of 1000
Raw:    2933K c/s real, 2892K c/s virtual

$ ../run/john -test -dev=5 -form:pbkdf2-hmac-md4-opencl -force-vec=2
Device 5: GeForce GTX TITAN
Benchmarking: PBKDF2-HMAC-MD4-opencl [PBKDF2-MD4 OpenCL 2x]... DONE
Speed for cost 1 (iterations) of 1000
Raw:    3302K c/s real, 3201K c/s virtual

$ ../run/john -test -dev=5 -form:pbkdf2-hmac-md5-opencl
Device 5: GeForce GTX TITAN
Benchmarking: PBKDF2-HMAC-MD5-opencl [PBKDF2-MD5 OpenCL]... DONE
Speed for cost 1 (iterations) of 1000
Raw:    1906K c/s real, 1872K c/s virtual

$ ../run/john -test -dev=5 -form:pbkdf2-hmac-md5-opencl -force-vec=2
Device 5: GeForce GTX TITAN
Benchmarking: PBKDF2-HMAC-MD5-opencl [PBKDF2-MD5 OpenCL 2x]... DONE
Speed for cost 1 (iterations) of 1000
Raw:    2199K c/s real, 2169K c/s virtual

$ ../run/john -test -dev=5 -form:pbkdf2-hmac-sha1-opencl
Device 5: GeForce GTX TITAN
Benchmarking: PBKDF2-HMAC-SHA1-opencl [PBKDF2-SHA1 OpenCL]... DONE
Speed for cost 1 (iterations) of 1000
Raw:    864804 c/s real, 859488 c/s virtual

$ ../run/john -test -dev=5 -form:pbkdf2-hmac-sha1-opencl -force-vec=2
Device 5: GeForce GTX TITAN
Benchmarking: PBKDF2-HMAC-SHA1-opencl [PBKDF2-SHA1 OpenCL 2x]... DONE
Speed for cost 1 (iterations) of 1000
Raw:    718202 c/s real, 703742 c/s virtual

So there is indeed a speedup for MD4 and MD5 but not for SHA-1 in this
case.
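
For reference, here's roughly what this looks like at kernel source
level. This is a minimal sketch, not our actual kernel code: VECTORIZE,
vec_t and md4_f_demo are made-up names for illustration. The point is
that with uint2, every arithmetic op below is really two independent
scalar ops, i.e. exactly the per-thread ILP the Kepler tuning guide
recommends.

    /*
     * Sketch only (made-up names). Built with -DVECTORIZE, each
     * work-item processes two candidates as the two lanes of a uint2.
     */
    #ifdef VECTORIZE
    typedef uint2 vec_t;        /* two candidates per work-item */
    #else
    typedef uint vec_t;         /* one candidate per work-item */
    #endif

    /* MD4 F(x,y,z) = (x & y) | (~x & z), expressed with bitselect() */
    #define F(x, y, z) bitselect((z), (y), (x))

    __kernel void md4_f_demo(__global const vec_t *in,
                             __global vec_t *out)
    {
        uint gid = get_global_id(0);
        vec_t a = in[4 * gid + 0];
        vec_t b = in[4 * gid + 1];
        vec_t c = in[4 * gid + 2];
        vec_t d = in[4 * gid + 3];

        /* One MD4-style step; with vec_t == uint2 the compiler gets
           two independent dependency chains it can dual-issue. */
        a = rotate(a + F(b, c, d), (vec_t)(3));

        out[gid] = a;
    }

The flip side is the register pressure NVIDIA mentions: the 2x build
needs roughly twice the registers per work-item, which might be why
SHA-1 regresses here while MD4 and MD5 gain.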
magnum