Message-ID: <4F569205.7030801@linuxasylum.net>
Date: Tue, 06 Mar 2012 23:39:01 +0100
From: Samuele Giovanni Tonon <samu@...uxasylum.net>
To: john-dev@...ts.openwall.com
Subject: Re: OpenCL KPC and LWS

On 03/06/12 20:14, magnum wrote:
> On 02/21/2012 10:59 PM, Samuele Giovanni Tonon wrote:
>>> So the main issue is that auto KPC does not pick a good number. The LWS
>>> fluctuations might be due to normal variations between runs. I should
>>> have recorded the figures for KPC=2M and LWS=64 but I missed that.
>>
>> looks like a chicken-egg problem: when lws is tested i use the default
>> kpc=2M, when LWS is up i use the best LWS i just detected; luksas
>> already reported this kind of problem but i thought we were safe since
>> LWS usually is rather obvious.
>
> I realise I have a lot to catch up from you guys but here are a couple
> of things that seem to get good and FAST results on my gear, both GPU
> and CPU:
>
> Have you tried querying
> clGetKernelWorkGroupInfo() for
> CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE? On the CPU's and GPU's I
> have tried, it seems reliable (better than the current testing) and very
> fast.
> It's introduced in OpenCL 1.1 so I added a fallback like this:
>
>
> clEnqueueWriteBuffer(queue_prof, buffer_keys, CL_TRUE, 0,
> (PLAINTEXT_LENGTH) * SSHA_NUM_KEYS, saved_plain, 0, NULL, NULL);
>
> + // This is OpenCL 1.1, we catch CL_INVALID_VALUE and use a fallback
> + ret_code = clGetKernelWorkGroupInfo (crypt_kernel, devices[gpu_id],
> + CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
> + sizeof(best_multiple), &best_multiple, NULL);
> +
> + if (ret_code == CL_INVALID_VALUE) {
> + //printf("Can't get preferred LWS multiple, using 1\n");
> + best_multiple = 1;
> + } else {
> + HANDLE_CLERROR(ret_code, "Query preferred work group multiple");
> + //printf("preferred multiple: %zu\n", best_multiple);
> + }
> +
> // Find minimum time
> - for (my_work_group = 1; (int) my_work_group <= (int) max_group_size;
> + for (my_work_group = best_multiple; (int) my_work_group <= (int)
> max_group_size;
> my_work_group *= 2) {

nice one, i'm adding it to the code, thanks!

>
> Also, I seem to get good and very fast results with this loop in KPC
> enumeration:
>
> for( num=local_work_size; num <= SSHA_NUM_KEYS ; num<<=1)
>
>
> Is testing every 16K really of any use? I just see fluctuating numbers
> and a super slow test.

this was the idea i had in mind: by default you get SSHA_NUM_KEYS, which is the standard; if you set KPC=0 it means you want a deep benchmark to find the real best KPC. a step of 16384 seemed a reasonable tradeoff between a very long but detailed benchmark and 3-4 tests which could be misleading.

i'm still testing it: on the other formats the step is 4096 but *_NUM_KEYS is 1024*2048; on nsldaps i found that 1024*2048*4 gave higher speed on high-end cards, so i decided to keep a high value but increase the step so as not to die of boredom (as already happens).
i'm still looking for the best size for the steps: while doubling seems good for a quick benchmark, i still think many cards find their best KPC between 1024*1024 and 1024*1024*2, values that doubling misses.

Samuele