|
Message-ID: <20130522102213.GB6483@openwall.com> Date: Wed, 22 May 2013 14:22:13 +0400 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Subject: Re: cmp_all on gpu for descrypt-opencl On Wed, May 22, 2013 at 11:08:43AM +0530, Sayantan Datta wrote: > On Wed, May 22, 2013 at 12:21 AM, Solar Designer <solar@...nwall.com> wrote: > > In my testing, the optimal numbers were 20 iterations in the first loop > > and 44 in the second. This was not in our descrypt-opencl code, but it > > was in another bitslice DES implementation on GPU, which is similar in > > this respect. I think the same or similar numbers will be optimal for > > us as well. From the reddit comment: > > In our cmp_all() code we only have 32 iterations split into 4(full unroll) > and 28(2x unroll) iterations. Which loop are we talking about here ? This one same loop. On CPU, we currently split it into four pieces, of the following iteration counts: 2, 2, 28, and 32 - these last 32 are in cmp_one(). You're referring to the unroll counts, but these are a separate thing. Note that there are "if ... goto next_depth;" lines inside the unrolled loops. As to the unrolls, we should tune them as well. > > "[...] split this loop in two, only proceeding with the second half if > > the first indicates there still might be a match. These specific speed > > numbers are for 20 iterations in the first loop (and 44 in the second). > > The first loop's iteration count needs to be high enough that a branch > > will actually be made most of the time, letting all SIMDs in a CU fully > > skip the second loop." > > I understand what you mean. But if a branch is made in the first 20 > iterations , we skip the next loop with 44 iterations but we still have a > branch!! Yes, but this one branch acts as if it were not data-dependent most of the time. > How is that different from branching out of a 64 iter loop ? When you include the possibility of branching out of the loop early on, the resulting code (at ISA level) includes overhead associated with that to manage the execution masks and eventually actually branch out (when the condition is finally met for all vector elements in the current SIMD unit). Each loop's iteration for a given work-item becomes conditional upon completion of the previous iterations, so the compiler might have difficulty inter-mixing instructions for multiple iterations of the unrolled loop as would be desirable to hide their latencies. It'd think it needs to adjust the execution masks before proceeding with each iteration, thereby turning potentially parallel execution into sequential. (This aspect is seen on CPU as well, but to a much lesser extent. This is why I grouped the first 2 and the second 2 loop iterations together, with no check for possible (unrolled) loop exit between first and second, and between third and fourth iterations.) Additionally, even after a SIMD unit has actually branched out of the loop (and reached the end of kernel?), there's presumably some waiting time incurred if computation for other work-items does not complete as quickly, so there's no gain from exiting too early. This may be related to the OpenCL implementation (might be avoidable with lower-level programming), and I guess it's a per-kernel rather than per-loop synchronization issue. I am not familiar with this. I make this guess from the symptoms: I think the optimal first loop iteration count would be a lot lower than 20 otherwise. > How > is skipping the second loop going to improve performance (the next 44 iters > would be skipped even if the loop has 64 iters ) ? Yes, they would be skipped in the same cases anyway, but early loop iterations' cost would be higher if the loop could be exited on any iteration rather than only after 20 iterations. > If both the loops have > same instructions we are not going to see any improvement in terms of > i-cache perspective, right ? I was not focusing on I-cache usage, but since you brought this up: actually, the version of the first (unrolled) loop without early exit possibility probably has smaller code size (than similarly unrolled loop with possible exits after each iteration). Think of it this way: what's the probability that a SIMD unit will actually branch out after the first iteration? after the second iteration? after the third? It's quite small. Is it worth the overhead of checking and the (higher) overhead of avoiding parallel (or rather pipelined) computation of these early iterations? No. It takes quite some iterations for the probability to become high enough for the overhead to be worth it. > Unless what you mean is that there is no branch in first 20 iter loop > but somehow we are able to figure out if we need to skip the second loop > based on the first loop. We might add a jump statement after the first loop > and before second loop. This is what I mean. However, we can also have this check inside the second loop, because the probability of (not so) early exit is already high enough by the time we reach it and the probability of not exiting the loop by a given iteration number is halved with each iteration. We may also try some hybrid schemes, such as checking for (not so) early exit at every N iterations of the second loop (unrolled), testing values of N from 1 to 22 (if the second loop's iteration count is 44). Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.