|
Message-ID: <20120905032240.GA29119@openwall.com> Date: Wed, 5 Sep 2012 07:22:40 +0400 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Subject: Re: bitslice DES on GPU Hi Sayantan, On Tue, Sep 04, 2012 at 11:04:45PM +0530, Sayantan Datta wrote: > How do I test the LM hashes for the original cpu implementation ? Is it > sufficient to set the LM parameter to 1 in the DES_bs_init() function ? You should be doing whatever LM_fmt.c does. That's probably not the kind of answer you wanted, but I'm afraid I don't have a better one. In short, yes, DES_bs_init() is called with 1 for LM, and proper functions are used after that point. Anyhow, I took a look at your code, and I tested it briefly to see where the bottlenecks are. Many things are done non-optimally, which I guess you're aware of: 1. DES_EXPAND = 1 is probably not an optimal choice (although it was/is OK to try it as well). 2. The expanded keys (and some other fields) should only exist on the GPU side, yet you have space reserved for them in opencl_DES_bs_combined on the CPU side as well. This probably does not directly cost any CPU time (unless I missed something), but it may result in less optimal use of caches on the CPU (potentially more cache tag conflicts between fields of opencl_DES_bs_combined that you do actually use). 3. You're copying too much data to/from GPU. For example, there's no point in copying B to GPU, and there's usually no point in copying K from GPU (although you do need to preserve the partially transposed and expanded key bits across multiple calls for the same salt somehow). There are other minor things to exclude from the copying as well. 4. A lot of time is spent on the CPU side. Some testing I did suggests that it's around 33% (when run on bull's CPU and HD 7970). You have some extra copying on the CPU side as well. 5. DES_bs_cmp_all() became slower than the usual CPU implementation because it now uses 32-bit rather than 64-bit vector elements. 6. You're using NVIDIA-friendly S-box expressions. This results in unneeded overhead when running on AMD GPUs. Also, you don't have vsel() defined to use bitselect() - you'll need to fix that when you switch to proper S-box expressions for AMD. You'll need to support both of these, so that we can run reasonable benchmarks on both GPUs. 7. You keep too many of the things in global memory. With DES_EXPAND = 1, you have to keep the expanded keys in global memory (but at least they're read sequentially). You don't have to keep B and E in global memory. BTW, what do you mean by trying to declare only some of the struct fields __global? I doubt that this is valid - either the entire struct is in global memory or the entire struct is in local memory - no? I'd expect that you'd need to split it into two structs if you have these two kinds of fields. I listed the above in arbitrary order. Of the above, you may want to start by fixing 6, 7, 3, 4 - maybe in this order. To see how much of the slowness is from the extra copying and such rather than from the inner loop being slow, I tried increasing the iteration count from 25 to 725 and using this test hash: {"..X8NBuQ4l6uQ", ""}, It's not a valid DES-based crypt(3) hash because those use 25 iterations (not configurable), but I generated it for my performance testing anyway. With this, I got roughly twice higher speed (after multiplying by 29). Thus, the "overhead" accounts for roughly 50% of the total running time of the current code. The code remains too slow even with the overhead mostly removed (due to the much higher iteration count like this). Last but not least, please add proper statements to the source files so that your revisions of them are not misattributed to me. Right now, some of your modified files say that they've been written by me, which is no longer entirely true. Thanks, Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.