Message-ID: <20150826224108.GA12389@openwall.com>
Date: Thu, 27 Aug 2015 01:41:08 +0300
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: LWS and GWS auto-tuning

On Wed, Aug 26, 2015 at 10:21:31PM +0200, magnum wrote:
> On 2015-08-26 21:37, Solar Designer wrote:
> >Unfortunately, LWS auto-tuning tries unreasonably high values (like
> >8192) and sometimes fails totally (results in an error from OpenCL and
> >program abort) for some formats when tested with one or the other OpenCL
> >SDK on "well". Can you look into this, and perhaps commit a fix?
>
> That's odd, can you name a format?

For example:

$ ./john -test -form=phpass-opencl -dev=0 -v=4
Device 0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
Benchmarking: phpass-opencl ($P$9 lengths 0 to 15) [MD5 OpenCL]...
Options used: -I ./kernels -cl-mad-enable -D__CPU__ -DDEVICE_INFO=33 -DDEV_VER_MAJOR=1 -DDEV_VER_MINOR=2 -D_OPENCL_COMPILER
Build log: Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <phpass> was not vectorized
Done.
Calculating best global worksize (GWS); max. 100ms single kernel invocation.
gws: 256 24569 c/s 24569 rounds/s 10.419ms per crypt_all()!
gws: 512 24150 c/s 24150 rounds/s 21.200ms per crypt_all()
gws: 1024 26315 c/s 26315 rounds/s 38.912ms per crypt_all()+
gws: 2048 26323 c/s 26323 rounds/s 77.800ms per crypt_all()
Calculating best local worksize (LWS)
Testing LWS=128 GWS=1024 ... 151.439ms+
Testing LWS=256 GWS=1024 ... 302.382ms
Testing LWS=512 GWS=1024 ... 604.730ms
Testing LWS=1024 GWS=1024 ... 1.209s
Testing LWS=2048 GWS=2048 ...Segmentation fault

and via the other SDK:

$ ./john -test -form=phpass-opencl -dev=4 -v=4
Device 4: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
Benchmarking: phpass-opencl ($P$9 lengths 0 to 15) [MD5 OpenCL]...
Options used: -I ./kernels -cl-mad-enable -D__CPU__ -DDEVICE_INFO=33 -DDEV_VER_MAJOR=1800 -DDEV_VER_MINOR=5 -D_OPENCL_COMPILER
Calculating best global worksize (GWS); max. 100ms single kernel invocation.
gws: 256 3982 c/s 3982 rounds/s 64.282ms per crypt_all()!
Calculating best local worksize (LWS)
Testing LWS=1 GWS=256 ... 12.615ms+
Testing LWS=2 GWS=256 ... 16.678ms
Testing LWS=3 GWS=255 ... 22.180ms
Testing LWS=4 GWS=256 ... 25.108ms
Testing LWS=5 GWS=255 ... 31.315ms
Testing LWS=6 GWS=252 ... 37.359ms
Testing LWS=7 GWS=252 ... 43.621ms
Testing LWS=8 GWS=256 ... 43.024ms
Testing LWS=16 GWS=256 ... 85.121ms
Testing LWS=32 GWS=256 ... 169.962ms
Testing LWS=64 GWS=256 ... 339.913ms
Testing LWS=128 GWS=256 ... 679.668ms
Testing LWS=256 GWS=256 ... 1.359s
Testing LWS=512 GWS=512 ... 2.718s
Testing LWS=1024 GWS=1024 ... 3.091s
Calculating best global worksize (GWS); max. 200ms single kernel invocation.
*** glibc detected *** ./john: corrupted double-linked list: 0x0000000003783740 ***

Also:

solar@...l:~/j/bleeding-jumbo/run$ ./john -test -form=md5crypt-opencl -dev=0 -v=4
Device 0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]...
Options used: -I ./kernels -cl-mad-enable -D__CPU__ -DDEVICE_INFO=33 -DDEV_VER_MAJOR=1 -DDEV_VER_MINOR=2 -D_OPENCL_COMPILER -DPLAINTEXT_LENGTH=15
Build log: Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <cryptmd5> was successfully vectorized (8)
Done.
Calculating best global worksize (GWS); max. 250ms single kernel invocation.
gws: 1024 176233 c/s 176233000 rounds/s 5.810ms per crypt_all()!
gws: 2048 195544 c/s 195544000 rounds/s 10.473ms per crypt_all()+
gws: 4096 215922 c/s 215922000 rounds/s 18.969ms per crypt_all()+
gws: 8192 216427 c/s 216427000 rounds/s 37.850ms per crypt_all()
gws: 16384 216549 c/s 216549000 rounds/s 75.659ms per crypt_all()
gws: 32768 216770 c/s 216770000 rounds/s 151.164ms per crypt_all()
Calculating best local worksize (LWS)
Testing LWS=128 GWS=4096 ... 68.128ms+
Testing LWS=256 GWS=4096 ... 68.043ms
Testing LWS=512 GWS=4096 ... 68.030ms
Testing LWS=1024 GWS=4096 ... 134.773ms
Testing LWS=2048 GWS=4096 ... 206.746ms
Testing LWS=4096 GWS=4096 ... 400.484ms
Testing LWS=8192 GWS=8192 ...OpenCL error (CL_INVALID_VALUE) in file (opencl_cryptmd5_fmt_plug.c) at line (381) - (Copy data back)

but after a few runs failing like the above, I got one that worked to
completion despite the weird LWS having been tested:

$ ./john -test -form=md5crypt-opencl -dev=0 -v=4
Device 0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]...
Options used: -I ./kernels -cl-mad-enable -D__CPU__ -DDEVICE_INFO=33 -DDEV_VER_MAJOR=1 -DDEV_VER_MINOR=2 -D_OPENCL_COMPILER -DPLAINTEXT_LENGTH=15
Build log: Compilation started
Compilation done
Linking started
Linking done
Device build started
Device build done
Kernel <cryptmd5> was successfully vectorized (8)
Done.
Calculating best global worksize (GWS); max. 250ms single kernel invocation.
gws: 1024 183765 c/s 183765000 rounds/s 5.572ms per crypt_all()!
gws: 2048 174702 c/s 174702000 rounds/s 11.722ms per crypt_all()
gws: 4096 201126 c/s 201126000 rounds/s 20.365ms per crypt_all()+
gws: 8192 208587 c/s 208587000 rounds/s 39.273ms per crypt_all()+
gws: 16384 216536 c/s 216536000 rounds/s 75.663ms per crypt_all()+
gws: 32768 216629 c/s 216629000 rounds/s 151.262ms per crypt_all()
Calculating best local worksize (LWS)
Testing LWS=128 GWS=16384 ... 163.140ms+
Testing LWS=256 GWS=16384 ... 169.023ms
Testing LWS=512 GWS=16384 ... 163.110ms
Testing LWS=1024 GWS=16384 ... 163.121ms
Testing LWS=2048 GWS=16384 ... 163.149ms
Testing LWS=4096 GWS=16384 ... 321.910ms
Testing LWS=8192 GWS=16384 ... 483.195ms
Calculating best global worksize (GWS); max. 500ms single kernel invocation.
gws: 1024 162439 c/s 162439000 rounds/s 6.303ms per crypt_all()!
gws: 2048 178187 c/s 178187000 rounds/s 11.493ms per crypt_all()+
gws: 4096 216142 c/s 216142000 rounds/s 18.950ms per crypt_all()+
gws: 8192 216313 c/s 216313000 rounds/s 37.870ms per crypt_all()
gws: 16384 205236 c/s 205236000 rounds/s 79.829ms per crypt_all()
gws: 32768 216487 c/s 216487000 rounds/s 151.362ms per crypt_all()
gws: 65536 216630 c/s 216630000 rounds/s 302.524ms per crypt_all()
Local worksize (LWS) 128, global worksize (GWS) 4096
DONE
Raw: 217088 c/s real, 27102 c/s virtual

I think the difference here is that it had tuned a higher GWS, so LWS of
8192 was no longer causing an increase in GWS. It even produced a
respectable speed for the CPU (even though with AVX2 intrinsics we can
be twice as fast).

AMD's SDK can't do that, failing to vectorize - it only gives 55k c/s,
after testing moderately weird LWS:

solar@...l:~/j/bleeding-jumbo/run$ ./john -test -form=md5crypt-opencl -dev=4 -v=4
Device 4: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
Benchmarking: md5crypt-opencl, crypt(3) $1$ [MD5 OpenCL]...
Options used: -I ./kernels -cl-mad-enable -D__CPU__ -DDEVICE_INFO=33 -DDEV_VER_MAJOR=1800 -DDEV_VER_MINOR=5 -D_OPENCL_COMPILER -DPLAINTEXT_LENGTH=15
Calculating best global worksize (GWS); max. 250ms single kernel invocation.
gws: 1024 55747 c/s 55747000 rounds/s 18.368ms per crypt_all()!
gws: 2048 55809 c/s 55809000 rounds/s 36.695ms per crypt_all()
gws: 4096 55825 c/s 55825000 rounds/s 73.371ms per crypt_all()
gws: 8192 55852 c/s 55852000 rounds/s 146.672ms per crypt_all()
Calculating best local worksize (LWS)
Testing LWS=1 GWS=1024 ... 91.343ms+
Testing LWS=2 GWS=1024 ... 91.265ms
Testing LWS=3 GWS=1023 ... 91.995ms
Testing LWS=4 GWS=1024 ... 91.217ms
Testing LWS=5 GWS=1020 ... 91.762ms
Testing LWS=6 GWS=1020 ... 93.013ms
Testing LWS=7 GWS=1022 ... 93.539ms
Testing LWS=8 GWS=1024 ... 91.054ms+
Testing LWS=9 GWS=1017 ... 95.011ms
Testing LWS=16 GWS=1024 ... 91.239ms
Testing LWS=32 GWS=1024 ... 91.115ms
Testing LWS=64 GWS=1024 ... 92.196ms
Testing LWS=128 GWS=1024 ... 91.160ms
Testing LWS=256 GWS=1024 ... 161.670ms
Testing LWS=512 GWS=1024 ... 315.758ms
Testing LWS=1024 GWS=1024 ... 614.760ms
Calculating best global worksize (GWS); max. 500ms single kernel invocation.
gws: 64 55350 c/s 55350000 rounds/s 1.156ms per crypt_all()!
gws: 128 55287 c/s 55287000 rounds/s 2.315ms per crypt_all()
gws: 256 55582 c/s 55582000 rounds/s 4.605ms per crypt_all()
gws: 512 55707 c/s 55707000 rounds/s 9.190ms per crypt_all()
gws: 1024 55751 c/s 55751000 rounds/s 18.367ms per crypt_all()
gws: 2048 55801 c/s 55801000 rounds/s 36.701ms per crypt_all()
gws: 4096 55848 c/s 55848000 rounds/s 73.341ms per crypt_all()
gws: 8192 55838 c/s 55838000 rounds/s 146.707ms per crypt_all()
gws: 16384 55846 c/s 55846000 rounds/s 293.374ms per crypt_all()
Local worksize (LWS) 8, global worksize (GWS) 64
DONE
Raw: 55040 c/s real, 6967 c/s virtual

Alexander
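
The GWS printed next to each tested LWS in the logs above looks like the tuned GWS rounded down to a multiple of that LWS, or raised to the LWS itself once the LWS exceeds the tuned GWS (which is where the failures appear). A minimal sketch of that rule in C follows. It is inferred from the log output only; adjust_gws() is a hypothetical name and this is not John the Ripper's actual auto-tune code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (an assumption, not taken from the john source):
 * return the GWS that the logs show next to a tested LWS, given the
 * GWS chosen by the preceding GWS tuning pass.
 */
size_t adjust_gws(size_t gws, size_t lws)
{
	assert(lws > 0);

	/* LWS larger than the tuned GWS: GWS grows to a single work-group. */
	if (lws > gws)
		return lws;

	/* Otherwise round GWS down to the nearest multiple of LWS. */
	return gws / lws * lws;
}
```

With a tuned GWS of 4096, adjust_gws(4096, 8192) gives 8192, matching the run that failed, while a tuned GWS of 16384 stays at 16384 for LWS=8192, matching the run that completed.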