john-dev - Re: LWS and GWS auto-tuning

Message-ID: <2d1b2bb4863dac064a0ee69ca8552b4e@smtp.hushmail.com>
Date: Thu, 27 Aug 2015 01:01:27 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: LWS and GWS auto-tuning

On 2015-08-27 00:41, Solar Designer wrote:
> On Wed, Aug 26, 2015 at 10:21:31PM +0200, magnum wrote:
>> On 2015-08-26 21:37, Solar Designer wrote:
>>> Unfortunately, LWS auto-tuning tries unreasonably high values (like
>>> 8192) and sometimes fails totally (results in an error from OpenCL and
>>> program abort) for some formats when tested with one or the other OpenCL
>>> SDK on "well".  Can you look into this, and perhaps commit a fix?
>>
>> That's odd, can you name a format?
>
> For example:
>
> $ ./john -test -form=phpass-opencl -dev=0 -v=4
> Device 0: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
> Benchmarking: phpass-opencl ($P$9 lengths 0 to 15) [MD5 OpenCL]... Options used: -I ./kernels -cl-mad-enable -D__CPU__ -DDEVICE_INFO=33 -DDEV_VER_MAJOR=1 -DDEV_VER_MINOR=2 -D_OPENCL_COMPILER
> Build log: Compilation started
> Compilation done
> Linking started
> Linking done
> Device build started
> Device build done
> Kernel <phpass> was not vectorized
> Done.
> Calculating best global worksize (GWS); max. 100ms single kernel invocation.
> gws:       256       24569 c/s       24569 rounds/s  10.419ms per crypt_all()!
> gws:       512       24150 c/s       24150 rounds/s  21.200ms per crypt_all()
> gws:      1024       26315 c/s       26315 rounds/s  38.912ms per crypt_all()+
> gws:      2048       26323 c/s       26323 rounds/s  77.800ms per crypt_all()
> Calculating best local worksize (LWS)
> Testing LWS=128 GWS=1024 ... 151.439ms+
> Testing LWS=256 GWS=1024 ... 302.382ms
> Testing LWS=512 GWS=1024 ... 604.730ms
> Testing LWS=1024 GWS=1024 ... 1.209s
> Testing LWS=2048 GWS=2048 ...Segmentation fault

The device actually supports 8192 per the queries, and that's why it is 
tried. This is also seen in our list output:

Platform version: OpenCL 1.2
	Device #0 (0) name:	Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
	Device vendor:		Intel(R) Corporation
	Device type:		CPU (LE)
	Device version:		OpenCL 1.2 (Build 9756)
	Driver version:		1.2.0.9756
	Native vector widths:	char 32, short 16, int 8, long 4
	Preferred vector width:	char 1, short 1, int 1, long 1
	Global Memory:		31.0 GB
	Global Memory Cache:	256.2 KB
	Local Memory:		32.0 KB (Global)
	Max memory alloc. size:	7.0 GB
	Max clock (MHz):	3500
	Profiling timer res.:	1 ns
	Max Work Group Size:	8192  <---- here!
	Parallel compute cores:	8

I do not think the de facto limit of 1024 we've been used to is an 
actual maximum per any specification. Also, when I tried this it ran 
just fine through the tests up to 8192 but picked a lower number as 
best. If it wasn't actually supported, we should get a 
CL_INVALID_WORK_GROUP_SIZE error and it would have been caught and 
handled properly.
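
To illustrate the behaviour described above, here is a minimal sketch in C. It is not the actual auto-tune code. It assumes "handled properly" means a rejected size is skipped and tuning continues. The names pick_best_lws, run_kernel_fn and demo_run are hypothetical, and the SKETCH_ constants only mirror the numeric values of CL_SUCCESS and CL_INVALID_WORK_GROUP_SIZE so the sketch builds without the OpenCL headers. In real code the callback would wrap clEnqueueNDRangeKernel and max_lws would come from the CL_DEVICE_MAX_WORK_GROUP_SIZE query.

```c
#include <assert.h>
#include <stddef.h>

/* Numeric values of the real OpenCL status codes, copied here so the
   sketch compiles without the OpenCL headers. */
#define SKETCH_CL_SUCCESS 0
#define SKETCH_CL_INVALID_WORK_GROUP_SIZE (-54)

/* Hypothetical callback: runs the kernel once with the given LWS/GWS,
   stores the elapsed time in *ms and returns an OpenCL-style status. */
typedef int (*run_kernel_fn)(size_t lws, size_t gws, double *ms);

/* Try LWS = start, 2*start, ... up to max_lws (the maximum work-group
   size the device reports). Sizes the runtime rejects are skipped.
   Returns the fastest accepted LWS, or 0 if none was accepted. */
size_t pick_best_lws(run_kernel_fn run, size_t start, size_t max_lws,
                     size_t gws)
{
    size_t best = 0;
    double best_ms = 0.0;

    assert(run != NULL && start > 0);

    for (size_t lws = start; lws <= max_lws; lws *= 2) {
        /* GWS must be a multiple of LWS, so round it up. */
        size_t g = (gws + lws - 1) / lws * lws;
        double ms = 0.0;
        int rc = run(lws, g, &ms);

        if (rc == SKETCH_CL_INVALID_WORK_GROUP_SIZE)
            continue;           /* size not usable: skip it */
        if (rc != SKETCH_CL_SUCCESS)
            break;              /* any other error ends the search */
        if (best == 0 || ms < best_ms) {
            best = lws;
            best_ms = ms;
        }
    }
    return best;
}

/* Stand-in for a real kernel launch: pretends that at most 1024
   work-items per group are accepted and that run time grows with LWS.
   The timing is made up, not a measurement. */
int demo_run(size_t lws, size_t gws, double *ms)
{
    (void)gws;
    if (lws > 1024)
        return SKETCH_CL_INVALID_WORK_GROUP_SIZE;
    *ms = (double)lws;
    return SKETCH_CL_SUCCESS;
}
```

Calling pick_best_lws(demo_run, 128, 8192, 1024) walks from 128 to 8192 without aborting even though the stand-in rejects everything above 1024, and it ends up picking a lower number as best.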

I presume your segfault was unrelated to the work size.

magnum
