Message-ID: <AANLkTinDZAEJ1DPtU4Jnxj0KSxTsG8nahLZJr13f1JMx@mail.gmail.com>
Date: Mon, 21 Mar 2011 23:41:24 +0200
From: Milen Rangelov <gat3way@...il.com>
To: john-users@...ts.openwall.com
Subject: Re: GSOC - GPU for hashes

Sorry to hijack a thread here once again... but I am really interested in the development of open-source GPU cracking software :) I am developing my own open-source GPU cracker (much "younger" than JtR, heh) and it apparently follows a somewhat similar path: I started with OpenSSL, then did some SSE2 work, then SSE2-vectorized bitslice DES, until I finally got to the point where I had to start implementing GPU cracking. I guess sharing my experience would help you avoid repeating my mistakes (which cost me weeks of experimenting and a lot of headaches and nerves).

First, about choosing the platform. Prior to that, I had some experience with Brook+ and had spent some time experimenting with CAL/IL. My idea was to invest some time improving my CAL knowledge and to start learning CUDA, because that logically seemed the best choice in terms of performance. Switching to OpenCL was a rather spontaneous thing, initially a product of mere curiosity. I don't regret it... OK, except for one thing. OpenCL is rather easy compared to CAL, whose learning curve is rather steep, and much more powerful than Brook+. CUDA seems to be more powerful and flexible, yet OpenCL is probably the best way to get into GPGPU programming.

Next, about performance. There are articles and benchmarks maintaining that OpenCL is somehow slower than either CUDA or CAL. On the ATI side that's not correct, and I have proven it by achieving higher speeds than ighashgpu and whitepixel, both written in CAL/IL. Well-optimized OpenCL code can beat hand-written IL kernels, and ATI's OpenCL compiler is rather good at optimization. OTOH, as far as NVidia is concerned, I had lots of issues.
I just cannot reach the top speeds that some CUDA cracking software achieves, no matter how hard I try to optimize the kernels. I also had LOTS of issues with OpenCL on NVidia. For example, I create a number of threads that each operate in their own context and have their own queue. Until recently, NVidia's OpenCL implementation was not thread-safe, so I was limited to a single thread/context/queue. Sometimes clBuildProgram() crashed on NVidia for perfectly correct code. Some OpenCL functions like rotate() do not work at all on NVidia. You always get surprises, like one function not working properly or another returning wrong results. With OpenCL, code is portable across platforms but performance is not; there are hardware differences that forced me to write separate kernels for ATI and NVidia. The people developing oclHashcat went further and wrote a separate CUDA-based program for NVidia and a separate OpenCL-based program for ATI. I can well understand this - IMO it turns out to be the best approach in terms of performance.

About candidate generation and checking of the calculated hashes - I would advise that you offload this almost completely onto the GPU, even for slow hashes. Host-device transfers are the worst bottleneck; they just completely suck. After a lot of redesigning and coding, I got to the point where I invoke a kernel with an NDRange of, say, 500,000 (which means 500,000*vector_size ~ 2M candidates), check hashes almost completely in GPU code (direct comparison, or bitmaps in case multiple hashes are being cracked), then transfer just 4 bytes indicating whether I have a match or not. Only if some candidate passes the GPU checks do I transfer another 500,000*4 = 2 MB over PCIe, and then (the number of hashes that passed the check)*(digest size). On the host side, I do an indexed search on the hash list. Most of the time I just transfer 4 bytes after each kernel invocation, especially when cracking a single hash. This is fast... well, not quite.
I have profiled the code using AMD's Stream Kernel Analyzer. Guess what - transferring just 4 bytes over the PCIe bus takes about 1/25 of the time needed to calculate 2 million hashes on the GPU! At least on a Radeon HD6870. That means I waste about 4% of the total time transferring just 4 bytes from GPU to host. So minimize those transfers as much as possible, even for "slow" hashes. PCIe transfers are a real pain. Always transferring the calculated hashes to the host for comparison is pure madness; that would never get you to even 10% utilization of the GPU.

There is another big problem with what you call "slow" hashes. It comes from the fact that the concept of a call stack does not exist on GPUs. Yes, you can declare functions, but they are always inlined. For an iterated algorithm like MD5(Unix), this means inlining the MD5 code 1000 times... and, as you might have guessed, compilation time is high. It can take 20-30 minutes. The problem is the same for OpenCL, IL and, I guess, CUDA. The only solution is using precompiled binary kernels, and that's where the fun starts. With CUDA you've got to compile cubins for the different compute capability architectures; for ATI, binary kernels for literally every AMD model you need to support. Integrating this with the autoconf/automake stuff is again a bitch... I am still working on that. Anyway, I somehow got used to the thought that my program would build from source for hours, probably days. Distributing precompiled binaries is something I consider inevitable.

And finally, some algorithms just hate GPUs :) The worst of them is bcrypt, which I doubt will be implemented on GPUs in the coming years. But even DES is bad enough. I have spent some days trying to implement DES efficiently in OpenCL. The "standard" variant with S-box lookup tables stuffed into constant memory is just slow. OTOH, implementing bitslice DES on GPUs looks like a very big challenge, I guess harder than I can handle.
There are numerous problems with bitslice DES on GPUs, the first being the high GPR usage. The second is that it's hard to convert the candidate input into bit vectors, as there are no nice native functions like SSE2's _mm_movemask_epi8 :)

Anyway, good luck, and I hope my (blatantly off-topic) post was useful. I would really like to see more open-source GPU cracking software. I also hope you find better ways to deal with these challenges than I did, and don't lose time repeating my stupid mistakes :)