Message-ID: <20150825114227.GA31265@openwall.com>
Date: Tue, 25 Aug 2015 14:42:27 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: LWS and GWS auto-tuning
magnum, all -
On Tue, Aug 25, 2015 at 09:06:55AM +0300, Solar Designer wrote:
> We ought to do something about the auto-tuning. Here are some ideas:
>
> Maybe have a table of per card type likely optimal LWS (or multipliers
> for powers of 2).
Actually, this info can typically be queried, and we already had code to
do that - but it appeared mostly (or totally?) unused.
Specifically, there are opencl_find_best_workgroup() and
opencl_find_best_lws() functions in common-opencl.c. The attached patch
#if 0's opencl_find_best_workgroup() (perhaps we should drop it
completely, and remove it from common-opencl.h too), and revises and
makes use of opencl_find_best_lws().
The new logic is, when neither GWS nor LWS env vars are specified:
pre-tune GWS (with a lower than usual maximum), tune LWS, and finally
tune GWS with the tuned LWS and considering the queried number of
compute units. Obviously, this is far from perfect - we're trying to
find a maximum of a function of two variables, but are adjusting only
one at a time. Yet it appears to work much better than the current
approach of tuning GWS only.
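
To make the three passes concrete, here is a minimal sketch in C. It is not the code from the attached patch: `bench_fn`, `tune_gws()`, `tune_lws()`, `tune_both()` and `toy_bench()` are hypothetical names, the 1% acceptance threshold is an assumption, and `toy_bench()` is a synthetic stand-in for timing a real kernel.

```c
#include <stddef.h>

/* Throughput callback, higher is better. Hypothetical: real code would
 * time a kernel invocation on the device. */
typedef double (*bench_fn)(size_t gws, size_t lws);

struct sizes { size_t gws, lws; };

/* Double GWS from `start` up to `max_gws`, keeping the best result.
 * A larger GWS is accepted only if it beats the best so far by more
 * than 1% (an assumed threshold, not taken from the patch). */
static size_t tune_gws(bench_fn bench, size_t lws, size_t start, size_t max_gws)
{
    size_t best = start;
    double best_speed = bench(start, lws);

    for (size_t gws = start * 2; gws <= max_gws; gws *= 2) {
        double speed = bench(gws, lws);
        if (speed > best_speed * 1.01) {
            best_speed = speed;
            best = gws;
        }
    }
    return best;
}

/* Try every multiple of `step` up to `max_lws` at a roughly fixed GWS. */
static size_t tune_lws(bench_fn bench, size_t gws, size_t step, size_t max_lws)
{
    size_t best = step;
    double best_speed = 0.0;

    for (size_t lws = step; lws <= max_lws; lws += step) {
        size_t g = gws / lws * lws;  /* keep GWS a multiple of LWS */
        if (g == 0)
            break;
        double speed = bench(g, lws);
        if (speed > best_speed) {
            best_speed = speed;
            best = lws;
        }
    }
    return best;
}

/* Pass 1: pre-tune GWS with a lower cap. Pass 2: tune LWS at that GWS.
 * Pass 3: re-tune GWS with the tuned LWS, starting at LWS * compute units. */
static struct sizes tune_both(bench_fn bench, size_t initial_lws, size_t start_gws,
                              size_t lws_step, size_t max_lws, size_t compute_units,
                              size_t pretune_max_gws, size_t max_gws)
{
    struct sizes s;
    size_t gws = tune_gws(bench, initial_lws, start_gws, pretune_max_gws);

    s.lws = tune_lws(bench, gws, lws_step, max_lws);
    s.gws = tune_gws(bench, s.lws, s.lws * compute_units, max_gws);
    return s;
}

/* Synthetic model for trying the sketch without a GPU: throughput
 * saturates as GWS grows, and LWS 96 is arbitrarily made the best fit. */
static double toy_bench(size_t gws, size_t lws)
{
    double fill = (double)gws / ((double)gws + 4096.0);
    double fit = (lws == 96) ? 1.0 : 0.9;

    return fill * fit;
}
```

Each pass moves along one axis only, so the result depends on the starting point and can miss the true two-dimensional maximum, which is the limitation noted above.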
When either LWS or GWS is specified, then only the other is auto-tuned
(once). When both are specified, nothing is auto-tuned.
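
The four cases can be written down as a small decision function. This is only a sketch: `tune_plan` and `choose_plan()` are hypothetical names, and it assumes that any non-empty value counts as "specified", which may differ from how the real code interprets the variables.

```c
#include <stddef.h>

enum tune_plan { TUNE_NONE, TUNE_LWS_ONLY, TUNE_GWS_ONLY, TUNE_BOTH };

/* Takes the raw values of the LWS and GWS environment variables,
 * NULL when unset. Assumption: a non-empty string means "specified". */
static enum tune_plan choose_plan(const char *lws_env, const char *gws_env)
{
    int have_lws = lws_env != NULL && lws_env[0] != '\0';
    int have_gws = gws_env != NULL && gws_env[0] != '\0';

    if (have_lws && have_gws)
        return TUNE_NONE;       /* both given: nothing is auto-tuned */
    if (have_lws)
        return TUNE_GWS_ONLY;   /* LWS given: tune GWS once */
    if (have_gws)
        return TUNE_LWS_ONLY;   /* GWS given: tune LWS once */
    return TUNE_BOTH;           /* neither given: full three-pass tuning */
}
```

A caller would typically pass `getenv("LWS")` and `getenv("GWS")` from `<stdlib.h>`.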
For example, with md5crypt-opencl on GTX TITAN, where the previous
approach worked poorly:
[solar@...er run]$ ./john -test -form=md5crypt-opencl -dev=5
Device 5: GeForce GTX TITAN
Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]... DONE
Raw: 1984K c/s real, 1984K c/s virtual
[solar@...er run]$ time ./john -test -form=md5crypt-opencl -dev=5 -v=4
Device 5: GeForce GTX TITAN
Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]... Options used: -I ./kernels -cl-mad-enable -cl-nv-verbose -D__GPU__ -DDEVICE_INFO=131090 -DDEV_VER_MAJOR=352 -DDEV_VER_MINOR=21 -D_OPENCL_COMPILER -DPLAINTEXT_LENGTH=15
Calculating best global worksize (GWS); max. 250ms single kernel invocation.
gws: 1024 160637 c/s 160637000 rounds/s 6.374ms per crypt_all()!
gws: 2048 306444 c/s 306444000 rounds/s 6.683ms per crypt_all()+
gws: 4096 572829 c/s 572829000 rounds/s 7.150ms per crypt_all()+
gws: 8192 957582 c/s 957582000 rounds/s 8.554ms per crypt_all()+
gws: 16384 989299 c/s 989299000 rounds/s 16.561ms per crypt_all()+
gws: 32768 1225015 c/s 1225015000 rounds/s 26.749ms per crypt_all()+
gws: 65536 1402179 c/s 1402179000 rounds/s 46.738ms per crypt_all()+
Calculating best local worksize (LWS)
Testing GWS=65536 LWS=32 ... 190469952ns
Testing GWS=65536 LWS=64 ... 107994464ns
Testing GWS=65472 LWS=96 ... 93050272ns
Testing GWS=65536 LWS=128 ... 92955840ns
Testing GWS=65440 LWS=160 ... 94382368ns
Testing GWS=65472 LWS=192 ... 93250048ns
Testing GWS=65408 LWS=224 ... 95941952ns
Testing GWS=65536 LWS=256 ... 93266272ns
Testing GWS=65536 LWS=512 ... 93425312ns
Testing GWS=65536 LWS=1024 ... 106644352ns
Calculating best global worksize (GWS); max. 500ms single kernel invocation.
gws: 1344 247774 c/s 247774000 rounds/s 5.424ms per crypt_all()!
gws: 2688 465121 c/s 465121000 rounds/s 5.779ms per crypt_all()+
gws: 5376 811578 c/s 811578000 rounds/s 6.624ms per crypt_all()+
gws: 10752 1335447 c/s 1335447000 rounds/s 8.051ms per crypt_all()+
gws: 21504 1963838 c/s 1963838000 rounds/s 10.949ms per crypt_all()+
gws: 43008 1978725 c/s 1978725000 rounds/s 21.735ms per crypt_all()
gws: 86016 1985954 c/s 1985954000 rounds/s 43.312ms per crypt_all()+
gws: 172032 1993503 c/s 1993503000 rounds/s 86.296ms per crypt_all()
gws: 344064 1996328 c/s 1996328000 rounds/s 172.348ms per crypt_all()
gws: 688128 2002809 c/s 2002809000 rounds/s 343.581ms per crypt_all()
Local worksize (LWS) 96, global worksize (GWS) 86016
DONE
Raw: 1978K c/s real, 1978K c/s virtual
real 0m5.642s
user 0m3.445s
sys 0m2.111s
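
The sizes in the log above follow a simple pattern, sketched here in C. This is inferred from the output, not taken from the patch, and `round_gws()` and `lws_candidates()` are hypothetical names. The LWS values step by 32 up to 256 and then double, each tested GWS is 65536 rounded down to a multiple of the LWS, and the final GWS pass starts at 1344 = 96 * 14, which would match a device reporting 14 compute units.

```c
#include <stddef.h>

/* Round GWS down to the nearest multiple of LWS. */
static size_t round_gws(size_t gws, size_t lws)
{
    return gws / lws * lws;
}

/* Fill out[] with LWS candidates: multiples of `mult` up to `linear_max`,
 * then doubling up to `max_lws`. Returns the number of entries written.
 * The 32 / 256 / 1024 values used with the log are inferred, not queried. */
static size_t lws_candidates(size_t mult, size_t linear_max, size_t max_lws,
                             size_t *out, size_t cap)
{
    size_t n = 0;
    size_t lws;

    for (lws = mult; lws <= linear_max && lws <= max_lws && n < cap; lws += mult)
        out[n++] = lws;
    for (lws = linear_max * 2; lws <= max_lws && n < cap; lws *= 2)
        out[n++] = lws;
    return n;
}
```

With `mult` 32, `linear_max` 256 and `max_lws` 1024 this yields the ten LWS values tested above, and `round_gws(65536, 96)` gives the 65472 shown on the LWS=96 line.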
Some other formats show speedups as well. I didn't test all, though.
There might be regressions.
One known issue is that the LWS tuning probably needs a time limit, in
case the device supports a very high maximum LWS. This could be
implemented the same way as the GWS tuning's time limit.
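
One way such a limit might look is sketched below in C. It uses a total time budget across all LWS trials, which is only one possible reading; the GWS limit in the log is per kernel invocation. `time_fn`, `tune_lws_limited()`, `logged_ns()` and `logged_lws` are hypothetical names, and `logged_ns()` simply replays the timings from the LWS pass in the log above instead of running a kernel.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback: duration in ns of one trial at the given LWS. */
typedef uint64_t (*time_fn)(size_t lws);

/* Test candidates in order, but stop starting new trials once the
 * accumulated trial time reaches budget_ns. At least one trial always
 * runs. Returns the fastest LWS seen, or 0 if there are no candidates. */
static size_t tune_lws_limited(time_fn run, const size_t *cand, size_t n,
                               uint64_t budget_ns)
{
    size_t best = 0;
    uint64_t best_ns = UINT64_MAX;
    uint64_t spent = 0;

    for (size_t i = 0; i < n; i++) {
        if (i > 0 && spent >= budget_ns)
            break;
        uint64_t ns = run(cand[i]);
        spent += ns;
        if (ns < best_ns) {
            best_ns = ns;
            best = cand[i];
        }
    }
    return best;
}

/* LWS values from the log, in the order they were tested. */
static const size_t logged_lws[] = {
    32, 64, 96, 128, 160, 192, 224, 256, 512, 1024
};

/* Replays the per-LWS timings from the log. */
static uint64_t logged_ns(size_t lws)
{
    switch (lws) {
    case 32:   return 190469952u;
    case 64:   return 107994464u;
    case 96:   return 93050272u;
    case 128:  return 92955840u;
    case 160:  return 94382368u;
    case 192:  return 93250048u;
    case 224:  return 95941952u;
    case 256:  return 93266272u;
    case 512:  return 93425312u;
    case 1024: return 106644352u;
    default:   return 1000000000u;  /* placeholder for values not in the log */
    }
}
```

With a 350 ms budget the loop stops after the third trial and settles on LWS 96; with no effective limit it runs all ten and picks the fastest logged timing, LWS 128. A tight budget can therefore cut the search off before the best value is reached.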
Also, this code needs a cleanup. My patch is a hack on top of other hacks.
Many formats provide their own idea of their desired LWS and GWS; maybe
we should drop most of this, as I suspect they are often less optimal
than the new auto-tuning. Even md5crypt-opencl benchmarked above has a
boilerplate get_default_workgroup() in it, and the new auto-tuning
actually respects this initially (for the initial GWS tuning). Maybe we
should instead start right away with a device query to determine the
initial LWS. Those copies of get_default_workgroup() in multiple format
files look ridiculous.
Alexander
View attachment "john-opencl-auto2.diff" of type "text/plain" (9306 bytes)