|
Message-ID: <a9f494abd537e92204ef29f3f2130a4e@smtp.hushmail.com>
Date: Thu, 04 Jun 2015 13:38:36 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: PHC: Parallel in OpenCL
On 2015-06-04 12:07, magnum wrote:
> On 2015-06-04 00:45, Lukas Odzioba wrote:
>> Agnieszka tried to implement an optimization that exploits the presence
>> of 0 bytes in the sha512 input, which happens in the "parallel loop".
>> We can't make such assumptions for all sha512 calls used in function
>> parallel, so implementing a slightly different SHA512 with this
>> optimization (while still keeping the normal version) increased code
>> size, which we think reduced performance because the code size exceeded
>> the L1 code cache on GCN; actual performance after this change dropped
>> from 45k to 28k c/s.
>> She also implemented a split kernel, which by itself also degraded
>> performance (from 28k to 27k c/s).
>
> Isn't the loop kernel using the "zeros" sha function (only)? And the
> other kernels use the full version? Then I can't see how code size would
> be larger for a given kernel.
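For reference, the "zeros" idea is simply that when message words are
known to be zero, the corresponding terms in the SHA-512 message schedule
and the per-round W[t] additions drop out. A rough stand-alone C
illustration of the schedule part, not the actual kernel code:

	#include <stdint.h>
	#include <stdio.h>

	/* SHA-512 small sigma functions */
	static uint64_t s0(uint64_t x) { return (x >> 1 | x << 63) ^ (x >> 8 | x << 56) ^ (x >> 7); }
	static uint64_t s1(uint64_t x) { return (x >> 19 | x << 45) ^ (x >> 61 | x << 3) ^ (x >> 6); }

	int main(void)
	{
		/* One padded block for a short input: only W[0] and W[15] are non-zero */
		uint64_t W[16] = { 0x6162638000000000ULL };
		W[15] = 24; /* message length in bits */

		/* Generic expansion for t == 16: W[t] = s1(W[t-2]) + W[t-7] + s0(W[t-15]) + W[t-16] */
		uint64_t generic = s1(W[14]) + W[9] + s0(W[1]) + W[0];

		/* Specialized version: W[14], W[9] and W[1] are known zeros, so their
		   rotates, shifts and adds can be dropped entirely */
		uint64_t zeros = W[0];

		printf("%d\n", generic == zeros); /* prints 1 */
		return 0;
	}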
I had a quick look at the code.
Agnieszka, you again based your code on a fast hash's code. While this
doesn't hurt performance, it does no good at all either; it just makes
everything more complicated and the code harder to read.
In order for the kernel to build on OSX, I had to apply the following
patch (a driver bug workaround; it should not affect the resulting code):
- unsigned long state[8] = {
-     0x6a09e667f3bcc908, 0xbb67ae8584caa73b, 0x3c6ef372fe94f82b, 0xa54ff53a5f1d36f1,
-     0x510e527fade682d1, 0x9b05688c2b3e6c1f, 0x1f83d9abfb41bd6b, 0x5be0cd19137e2179};
+ unsigned long state[8];
unsigned int left = length;
+ state[0] = 0x6a09e667f3bcc908UL;
+ state[1] = 0xbb67ae8584caa73bUL;
+ state[2] = 0x3c6ef372fe94f82bUL;
+ state[3] = 0xa54ff53a5f1d36f1UL;
+ state[4] = 0x510e527fade682d1UL;
+ state[5] = 0x9b05688c2b3e6c1fUL;
+ state[6] = 0x1f83d9abfb41bd6bUL;
+ state[7] = 0x5be0cd19137e2179UL;
+
My biggest concern is that, at the end of the day, you only call the
loop kernel once (for cost 0). Normally you'd call such a kernel 10 or
many more times, with less work being performed per call. As it stands,
the split only adds overhead, and that is why you see a performance
regression. It looks to me like each call is 3*5*128 rounds of SHA-512?
Maybe each loop kernel invocation should just be 128 rounds, and you'd
call it 3*5 times.
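Roughly like this on the host side (an untested sketch only; the
variable and helper names are placeholders, not the actual code):

	for (int i = 0; i < 3 * 5; i++) {
		/* each enqueue runs the loop kernel for 128 rounds of SHA-512 */
		clEnqueueNDRangeKernel(queue, loop_kernel, 1, NULL,
		                       &global_work_size, &local_work_size,
		                       0, NULL, NULL);
		clFinish(queue);
		opencl_process_event(); /* lets auto-tune/bench account for each call */
	}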
Also, your use of the shared auto-tune function is totally busted. You'd
be much better off not using auto-tune at all than configuring it wrong.
Attached is a patch that mostly fixes it.
Note these lines (after my patch):
opencl_init_auto_setup(SEED, 3*5*128*1, split_events,
warn, 4, self, create_clobj, release_clobj, BINARY_SIZE*3, 0);
autotune_run(self, 3*5*128*1, 0, 1000);
If you change the loop kernel to do only 128 rounds per call, you should
change that figure accordingly for opencl_init_auto_setup() but not for
autotune_run(): the latter is the total cost, the former is how much work
you do per call. If you change to a test vector with another cost, change
the *1 accordingly for both.
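With a 128-round loop kernel that would end up roughly like this (a
sketch, assuming the test vector's cost still corresponds to the *1
factor):

	opencl_init_auto_setup(SEED, 128*1, split_events,
		warn, 4, self, create_clobj, release_clobj, BINARY_SIZE*3, 0);
	...
	autotune_run(self, 3*5*128*1, 0, 1000);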
magnum
[Attachment: parallel.diff (text/plain, 6205 bytes)]