Message-ID: <AANLkTi=jayh53eb8cEiCwYYXRER_awKJmBphcXaQ3EFf@mail.gmail.com>
Date: Fri, 14 Jan 2011 11:17:13 +0200
From: Milen Rangelov <gat3way@...il.com>
To: john-users@...ts.openwall.com
Subject: Re: opencl sha1 jtr and others some experiments

Hello,

I did a quick review of your OpenCL code. I am not that well acquainted with the JtR code, but here are my remarks on the bottlenecks:

1) You are using global_work_size=1, and that is simply detrimental. You are severely underusing the GPU, and that is probably the biggest bottleneck in your code. To utilize all of the GPU's ALUs you need at least a couple of thousand work-items; to hide memory latency you need even more than that. Try running with global_work_size=20000, then 100000, then 500000, and so on, and experiment with that. The time needed to schedule your kernel greatly exceeds its execution time, so global_work_size=1 is a total waste of processing time.

1a) There are basically two high-performance patterns. One is to have an inner loop in the kernel which takes care of (partial) candidate generation and hashing, which is what most GPU crackers do. That is not the only way, though: the other is to increase the global work size, do no candidate generation on the host, and let OpenCL schedule your kernel. This may require a precalculated table of values to map candidates to the kernel's global id (which is what I do). Using shared memory also enables some nice tricks on GPUs that support it. Both approaches have advantages and disadvantages: the inner-loop one pays for control-flow divergence, the other pays for accessing the precomputed table in global memory and for more wavefront scheduling overhead.

2) You can almost completely get rid of the slow device-to-host PCIe transfers by doing the checks in the kernel code. That is rather easy if you check against a single hash; it gets much harder if you check against a number of hashes. In the latter case different tricks can be employed, but you will likely need to access slow global memory. If you want to write a fast GPU cracker you will have to get rid of the transfers anyway, because PCIe bandwidth limits the maximum speed: PCIe 2.0 has about 8 GB/s of bandwidth, or roughly 500 million MD5 c/s as a theoretical maximum, and even with pinned memory you are unlikely to get anywhere near that, while modern ATI GPUs such as the 5870 can do more than 3 billion MD5 ops/sec. For SHA-1 it gets even worse, as the digest size is somewhat larger.

3) On ATI GPUs you will benefit from vectorized code (e.g. using uint4 or uint8). I found that in some cases uint8 gives better VLIW utilization (thus faster cracking), but it may also mean more global memory reads/writes, which may push you back to uint4. On NVIDIA you should use scalars. Using vectors on ATI will likely bring about 2x-3x the performance of scalar code.

4) Your SHA-1 implementation can be optimized:

4a) On 5xxx and 6xxx ATI cards there is a single "rotate left" instruction, BITALIGN, which is accessible from OpenCL through the rotate() built-in. It is almost 3x faster than a generic hand-written rotate-left. You can simply use rotate() instead of your rotateleft function: on 4xxx it will be just as slow, while on 5xxx and above it will be much faster. You are actually already using rotate() for the round functions.

4b) Avoid code like this:

    for(i = 0; i < size; i++) { s[i] = (char) c; }

Unroll it; loops and branching are costly.
4c) The transformation function in round 3 (steps 41-60) can be replaced with (b&c)|(d&(b|c)), which saves you 20 bitwise operations.

4d) The final 5 additions / byte-order reversals can be precomputed in reverse in host code and omitted in kernel code; that makes more sense if you have a single hash to crack. Actually, the last 4 steps of SHA-1 can be partially reversed to speed it up a bit, but that depends on the design and I don't know whether it is possible with JtR.

Below are a few rough sketches illustrating points 1, 1a, 2, 3, 4b, 4c and 4d.
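For point 1, a minimal host-side sketch of launching over a large grid rather than a single work-item. "queue" and "kernel" are assumed to be set up elsewhere, and the sizes are just example numbers, not anything from JtR:

    #include <CL/cl.h>

    /* Hypothetical helper: enqueue one batch of candidates over a large grid. */
    cl_int launch_batch(cl_command_queue queue, cl_kernel kernel)
    {
        size_t global_work_size = 1048576;  /* e.g. 1M work-items per launch      */
        size_t local_work_size  = 256;      /* must evenly divide the global size */

        /* The scheduling overhead is now paid once per batch of candidates,
         * not once per candidate. */
        return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                      &global_work_size, &local_work_size,
                                      0, NULL, NULL);
    }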
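For point 1a, a sketch of the "no host-side generation" pattern: each work-item derives its own candidate from its global id. For brevity this uses plain division/modulo; a precomputed table in constant memory can replace the divisions, as described above. All names here are made up:

    __kernel void map_candidate(__constant uchar *charset, uint charset_len,
                                uint candidate_len, __global uchar *candidates)
    {
        uint gid = get_global_id(0);
        uint idx = gid;
        uint i;

        /* Decode the global id as a base-charset_len number, one character per
         * digit, so every work-item gets a distinct candidate. A real cracker
         * would hash the candidate right here instead of storing it. */
        for (i = 0; i < candidate_len; i++) {
            candidates[gid * candidate_len + i] = charset[idx % charset_len];
            idx /= charset_len;
        }
    }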
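For point 2, the single-hash case: keep the comparison on the device and only ship a tiny result word back over PCIe. The helper below is a sketch with made-up names; the digest is assumed to have just been computed in private memory by the caller:

    /* Compare a just-computed SHA-1 digest with the one target hash and record
     * the hit; the host only ever reads back the single "result" word. */
    inline void check_single_hash(const uint *digest,        /* 5 words, private */
                                  __constant uint *target,   /* 5 words          */
                                  __global uint *result,     /* 0 = no hit       */
                                  uint gid)
    {
        if (digest[0] == target[0] && digest[1] == target[1] &&
            digest[2] == target[2] && digest[3] == target[3] &&
            digest[4] == target[4])
            *result = gid + 1;   /* remember which candidate matched */
    }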
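For point 3, vectorization is mostly mechanical: swap uint for uint4 and each work-item processes four candidates in lockstep, since rotate() and the bitwise operators all work component-wise. One round of the first 20 SHA-1 steps might look like this sketch (not JtR's code):

    inline void sha1_round_f1_vec(uint4 *a, uint4 *b, uint4 *c, uint4 *d,
                                  uint4 *e, uint4 w)
    {
        /* F1 = (b & c) | (~b & d), K1 = 0x5A827999 for steps 1-20 */
        uint4 f   = (*b & *c) | (~*b & *d);
        uint4 tmp = rotate(*a, (uint4)5) + f + *e + w + (uint4)0x5A827999u;

        *e = *d;
        *d = *c;
        *c = rotate(*b, (uint4)30);
        *b = *a;
        *a = tmp;
    }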
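For point 4b, the byte-at-a-time fill can be replaced by a few unrolled word stores. The sketch assumes a fixed, word-aligned 16-byte region and uses made-up names:

    inline void fill16(uint *buf, uchar c)
    {
        uint word = (uint)c * 0x01010101u;  /* replicate the byte into all four lanes */

        /* No loop, no branch - just four straight-line stores. */
        buf[0] = word;
        buf[1] = word;
        buf[2] = word;
        buf[3] = word;
    }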
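For point 4c, the textbook majority function (b & c) | (b & d) | (c & d) used in steps 41-60 costs five bitwise operations per step; the equivalent form below costs four, which over 20 steps is the 20 operations mentioned above:

    inline uint f_maj(uint b, uint c, uint d)
    {
        /* Equivalent to (b & c) | (b & d) | (c & d), one operation cheaper. */
        return (b & c) | (d & (b | c));
    }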
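For point 4d, the idea with a single target hash is to move the final five additions out of the kernel: instead of adding the SHA-1 initial values to a..e on the device and comparing against the target digest, subtract them from the (already byte-swapped) target once on the host and let the kernel compare its raw a..e values directly. A rough host-side sketch with made-up names:

    /* Standard SHA-1 initial state. */
    static const unsigned int SHA1_IV[5] = {
        0x67452301u, 0xEFCDAB89u, 0x98BADCFEu, 0x10325476u, 0xC3D2E1F0u
    };

    /* Pre-subtract the initial state from the target digest so the kernel can
     * skip the five final additions and compare its raw a..e values. */
    void reverse_final_additions(unsigned int target[5])
    {
        int i;
        for (i = 0; i < 5; i++)
            target[i] -= SHA1_IV[i];
    }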