Message-ID: <feaffe5ff6b415e27b002d28f95c690f@smtp.hushmail.com>
Date: Thu, 8 Nov 2012 19:12:40 +0100
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Split kernel for OpenCL WPA-PSK

On 7 Nov, 2012, at 18:12 , magnum <john.magnum@...hmail.com> wrote:
> On 7 Nov, 2012, at 17:43 , Lukas Odzioba <lukas.odzioba@...il.com> wrote:
>>> For some reason it segfaults on the Tahiti (but not on AMDAPP/CPU). On all other devices I've tried it works fine. If we can get this straight we should implement similar changes to a bunch of other OpenCL formats that use PBKDF2-HMAC-SHA1.
>>
>> On Tahiti it segfaults during selftest, testsuite or some real world cracking?
>
> It segfaults during self-test. The debugger ends up within the amdocl drivers. I suspect it's yet another driver bug. But I have tried it with 12.8 too and it did not help.

I nailed it. Definitely a driver bug but I found a way around it. This is the original code, in the loop kernel, that segfaults:

	for (i = 0; i < 5; i++) {
		W[i] = state[gid].W[i];
		ipad[i] = state[gid].ipad[i];
		opad[i] = state[gid].opad[i];
		out[i] = state[gid].out[i];
	}

I suspect the compiler optimizer tried to rearrange them (a really trivial task) to get coalesced reads from global memory, but screwed up royally. So I did the compiler's job, and this works fine:

	for (i = 0; i < 5; i++)
		W[i] = state[gid].W[i];
	for (i = 0; i < 5; i++)
		ipad[i] = state[gid].ipad[i];
	for (i = 0; i < 5; i++)
		opad[i] = state[gid].opad[i];
	for (i = 0; i < 5; i++)
		out[i] = state[gid].out[i];

>>> These are massive changes to both host code and kernel. Some 15-20% boost is gained too btw, and device auto-tuning is implemented.
>>
>> 15-20% just for nvidia or for amd too?
>
> I have no idea because of the segfaults but I implemented the usual bitselects, rotate and stuff so if anything, it should be faster. Also, the split kernels reduces register pressure. Some other changes I made released even more registers. On the other hand we depend on global memory between the loop calls. That does not stop office2007 from doing 2.1 billion SHA1/second though, although that one has smaller global memory footprint than this one.

Using device 0: Tahiti
Local worksize (LWS) 192, Global worksize (GWS) 196608
Benchmarking: WPA-PSK PBKDF2-HMAC-SHA-1 [OpenCL]... DONE
Raw:	66197 c/s real, 137970 c/s virtual

OK, the boost is over 50%... but a lot of the boost is merely from the new device auto-tuning. So the kernel is not much faster, but it /is/ faster and my primary goal was just to avoid a performance regression when splitting. BTW, I presume the old code would produce ASIC hangs on the Tahiti sooner or later. No risk for that now.

This code too does over 2.1 billion SHA1/second, but CPU post-processing nearly halves the speed (without OMP). So I'm in the process of moving all of that post-processing to GPU. It's just a couple HMACs more, so I hope to exceed 120K c/s with that in place.

For now, I committed the current code (with CPU post-processing). Please test.

magnum
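
For anyone who wants to convince themselves that the loop split described above is purely a reordering of loads, here is a minimal host-side sketch in plain C. It is not the actual kernel and cannot reproduce the driver bug; it only shows that the fused loop and the four split loops read the same values. The field names W, ipad, opad and out come from the post, while the struct layout, the type name wpapsk_state and the helpers fill_state, load_fused, load_split, loads_match and check_all are assumptions made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define WORDS 5  /* a SHA-1 state is five 32-bit words */

/* ASSUMED layout of the per-candidate state kept in global memory
 * between loop-kernel calls. Only the field names are from the post;
 * the type name is hypothetical. */
typedef struct {
    uint32_t W[WORDS];
    uint32_t ipad[WORDS];
    uint32_t opad[WORDS];
    uint32_t out[WORDS];
} wpapsk_state;

/* Hypothetical helper: fill one state entry with distinct values. */
void fill_state(wpapsk_state *s, uint32_t seed)
{
    for (int i = 0; i < WORDS; i++) {
        s->W[i]    = seed * 100u + (uint32_t)i;
        s->ipad[i] = seed * 100u + 10u + (uint32_t)i;
        s->opad[i] = seed * 100u + 20u + (uint32_t)i;
        s->out[i]  = seed * 100u + 30u + (uint32_t)i;
    }
}

/* Original form: one loop, four interleaved loads per iteration. */
void load_fused(const wpapsk_state *state, uint32_t gid, wpapsk_state *priv)
{
    for (int i = 0; i < WORDS; i++) {
        priv->W[i]    = state[gid].W[i];
        priv->ipad[i] = state[gid].ipad[i];
        priv->opad[i] = state[gid].opad[i];
        priv->out[i]  = state[gid].out[i];
    }
}

/* Workaround form: four loops, each reading one array contiguously. */
void load_split(const wpapsk_state *state, uint32_t gid, wpapsk_state *priv)
{
    int i;
    for (i = 0; i < WORDS; i++)
        priv->W[i] = state[gid].W[i];
    for (i = 0; i < WORDS; i++)
        priv->ipad[i] = state[gid].ipad[i];
    for (i = 0; i < WORDS; i++)
        priv->opad[i] = state[gid].opad[i];
    for (i = 0; i < WORDS; i++)
        priv->out[i] = state[gid].out[i];
}

/* Returns 1 if both forms produce identical private copies for gid. */
int loads_match(const wpapsk_state *state, uint32_t gid)
{
    wpapsk_state a, b;
    load_fused(state, gid, &a);
    load_split(state, gid, &b);
    return memcmp(&a, &b, sizeof a) == 0;
}

/* Checks every index in [0, n); aborts via assert on a mismatch. */
void check_all(const wpapsk_state *state, uint32_t n)
{
    for (uint32_t gid = 0; gid < n; gid++)
        assert(loads_match(state, gid));
}
```

Filling a small array with fill_state and calling check_all on it should pass silently. Since the two forms are equivalent, a crash that appears with only one of them is consistent with a compiler or driver problem rather than a bug in the kernel source.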