Message-ID: <c769a4a2f37f3fb2bf6f52dfac9b34dc@smtp.hushmail.com>
Date: Sat, 29 Sep 2012 01:35:32 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Benchmarking Milen's RAR kernel in JtR (was: RAR early reject)

On 28 Sep, 2012, at 23:26, Milen Rangelov <gat3way@...il.com> wrote:

>>> My code is not at all optimal and with better SET_AB this idiotic Endian_Reverses can possibly be skipped. That's something I will investigate soon.
>>
>> That is very easy and gives a noticeable speedup. Just rewrite SET_AB for big endian. Do the endian swap when initializing d0, d1, ... and then skip all other endian swaps except when writing the serial.
>>
>> Then adding the counter values becomes more complex. However this shouldn't be a stopper. That's just a trivial macro.

I tried this, it was a 10-15% gain on a crappy 9600GT. Not sure how it would end up on a powerful AMD though.

>> I see lots of other minor things that can be skipped (maybe you already have). For example, this:
>>
>> w[GLI][0]=w[GLI][1]=w[GLI][2]=w[GLI][3]=w[GLI][4]=w[GLI][5]=w[GLI][6]=w[GLI][7]=w[GLI][8]=w[GLI][9]=w[GLI][10]=w[GLI][11]=w[GLI][12]=w[GLI][13]=w[GLI][14]=w[GLI][15]=0;
>> LOOP_BODY(16384*12);
>>
>> can be replaced by just
>>
>> w[GLI][0]=0;
>> LOOP_BODY(16384*12);
>>
>> because all the others are nulled in LOOP_BODY anyway. Not much of a boost though.
>
> I think the compiler would eliminate those anyway. This looks like an optimization that can be easily done in compile-time.

For non-OpenCL code I would agree, but even you said we can't trust a simple thing like #pragma unroll. If we can't trust that, I will not trust the compiler to do *any* decent optimizations. I have, though, confirmed that all your use of SHA-1 constants (using ints) indeed ends up as constants not using registers. OTOH & BTW, you use a define for get_global_id(0), which might be OK for AMD but it sure kills performance for nvidia - it benefits from storing gid (and lid) in a register.
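(For illustration, a minimal OpenCL sketch of the two styles; the kernels and the w buffer below are made up, with GLI standing for the define mentioned above and LLI an invented analogue for the local id.)

/* Hypothetical illustration only - not code from either RAR kernel. */

/* Style 1: every use of GLI/LLI expands to another builtin call. */
#define GLI get_global_id(0)
#define LLI get_local_id(0)

__kernel void demo_define(__global uint *w)
{
	w[GLI * 16 + 0] = LLI;
	w[GLI * 16 + 1] = LLI;
}

/* Style 2: fetch the ids once and keep them in private variables. */
__kernel void demo_register(__global uint *w)
{
	uint gid = get_global_id(0);
	uint lid = get_local_id(0);

	w[gid * 16 + 0] = lid;
	w[gid * 16 + 1] = lid;
}

Whether the compiler folds the repeated calls is its own business; the observation above is simply that nvidia did measurably better with the ids cached.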
>> BTW I have this idea:
>>
>> At init, create a buffer that holds "password.salt.000" four times in a row in local memory (already endian swapped of course). Regardless of password length, this buffer can be used in the inner loop for 32-bit aligned copy to the sha1 buffer. No bitshifts, no char macros. I just need to come up with some macros for finding the offset to copy and where to update the serials.
>
> I've spent some time thinking about that actually... the biggest problem is the serial number updates. It makes things complicated :(
>
>> Then in the inner loop, just build a whole 64-byte block at a time (i.e. think "blocks" instead of "iterations" - but it's tricky!), update the serials and call sha1_update(). If this can be cleverly implemented I think it should be very fast.
>
> Yes, that would spare you one branch at least. The bad thing is that I can't think of a way to update the counter values (serials) without branching, so no big win :(

I think it can be a win in the end. The serial updates are fixed with a fairly trivial (although not yet defined) macro. Everything else should be a win. Just as a thought, we could save that quad buffer a couple more times and not copy it at all - just run a non-destroying sha1_block() (the one I currently use thrashes the input buffer) on the proper offset. This starts to get hairy but I will try it.

>> I also feel an absolute need for splitting the kernel so each invocation is 100-200 ms (probably an inner loop kernel with 512 iterations). But this format has a lot of data needing to be kept in global memory, especially if implementing that quad buffer idea.
>
> You could at least avoid keeping the w[] state if you split it into chunks of the "right" iteration count, depending on plain len. This would make the last kernel a bit more complex though. Anyway, I don't like the idea of reading and writing to global memory all the time...

I'm still trying to resist fixed-length kernels. If I see a 25% boost I will give in... but I presume you are right. The last kernel should not be an issue though; it will be a walk in the park compared to the inner loop one. Anyway, when I did that split-kernel thing to office-opencl I was surprised it did not harm performance at all - despite it storing and loading global memory every 128 iterations out of 100,000.

magnum
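To make the split-kernel idea concrete, here is a minimal host-side sketch of running the inner loop in chunks; the kernel name, argument order and the state buffer are assumptions for illustration, not the actual JtR or hashkill interface.

/*
 * Hypothetical sketch only: kernel name, argument layout and the
 * state buffer are assumptions, not a real JtR or hashkill interface.
 */
#include <CL/cl.h>

#define ROUNDS 0x40000  /* RAR3's SHA-1 iteration count */
#define SPLIT  512      /* iterations per kernel invocation */

/* state: per-candidate SHA-1 context kept in global memory between calls */
static int run_inner_loop(cl_command_queue queue, cl_kernel inner_kernel,
                          cl_mem state, size_t gws, size_t lws)
{
	cl_uint base;
	cl_int err;

	if (clSetKernelArg(inner_kernel, 0, sizeof(cl_mem), &state) != CL_SUCCESS)
		return -1;

	for (base = 0; base < ROUNDS; base += SPLIT) {
		/* tell the kernel where in the iteration space to resume */
		if (clSetKernelArg(inner_kernel, 1, sizeof(cl_uint), &base) != CL_SUCCESS)
			return -1;

		err = clEnqueueNDRangeKernel(queue, inner_kernel, 1, NULL,
		                             &gws, &lws, 0, NULL, NULL);
		if (err != CL_SUCCESS)
			return -1;

		/* wait here so no single GPU batch runs longer than one chunk */
		if (clFinish(queue) != CL_SUCCESS)
			return -1;
	}
	return 0;
}

The "last kernel" mentioned above would then read the finished state from the same buffer, and a per-length choice of SPLIT could implement the "right iteration count" suggestion.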