|
Message-ID: <CA+TsHUBwTzgS8C6W3ppSBx3mLsN_M2=MM1sc_HWBe-rxzLFzNA@mail.gmail.com>
Date: Mon, 10 Sep 2012 13:12:54 +0530
From: Sayantan Datta <std2048@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: bitslice DES on GPU
On Mon, Sep 10, 2012 at 8:58 AM, Sayantan Datta <std2048@...il.com> wrote:
>
>
> On Wed, Sep 5, 2012 at 8:52 AM, Solar Designer <solar@...nwall.com> wrote:
>
>> Hi Sayantan,
>>
>> On Tue, Sep 04, 2012 at 11:04:45PM +0530, Sayantan Datta wrote:
>> > How do I test the LM hashes for the original cpu implementation ? Is it
>> > sufficient to set the LM parameter to 1 in the DES_bs_init() function ?
>>
>> You should be doing whatever LM_fmt.c does. That's probably not the
>> kind of answer you wanted, but I'm afraid I don't have a better one.
>> In short, yes, DES_bs_init() is called with 1 for LM, and proper
>> functions are used after that point.
>>
>> Anyhow, I took a look at your code, and I tested it briefly to see where
>> the bottlenecks are. Many things are done non-optimally, which I guess
>> you're aware of:
>>
>> 1. DES_EXPAND = 1 is probably not an optimal choice (although it was/is
>> OK to try it as well).
>>
>> 2. The expanded keys (and some other fields) should only exist on the
>> GPU side, yet you have space reserved for them in opencl_DES_bs_combined
>> on the CPU side as well. This probably does not directly cost any CPU
>> time (unless I missed something), but it may result in less optimal use
>> of caches on the CPU (potentially more cache tag conflicts between
>> fields of opencl_DES_bs_combined that you do actually use).
>>
>> 3. You're copying too much data to/from GPU. For example, there's no
>> point in copying B to GPU, and there's usually no point in copying K
>> from GPU (although you do need to preserve the partially transposed and
>> expanded key bits across multiple calls for the same salt somehow).
>> There are other minor things to exclude from the copying as well.
>>
>> 4. A lot of time is spent on the CPU side. Some testing I did suggests
>> that it's around 33% (when run on bull's CPU and HD 7970). You have
>> some extra copying on the CPU side as well.
>>
>> 5. DES_bs_cmp_all() became slower than the usual CPU implementation
>> because it now uses 32-bit rather than 64-bit vector elements.
>>
>> 6. You're using NVIDIA-friendly S-box expressions. This results in
>> unneeded overhead when running on AMD GPUs. Also, you don't have vsel()
>> defined to use bitselect() - you'll need to fix that when you switch to
>> proper S-box expressions for AMD. You'll need to support both of these,
>> so that we can run reasonable benchmarks on both GPUs.
>>
>> 7. You keep too many of the things in global memory. With DES_EXPAND = 1,
>> you have to keep the expanded keys in global memory (but at least
>> they're read sequentially). You don't have to keep B and E in global
>> memory. BTW, what do you mean by trying to declare only some of the
>> struct fields __global? I doubt that this is valid - either the entire
>> struct is in global memory or the entire struct is in local memory - no?
>> I'd expect that you'd need to split it into two structs if you have
>> these two kinds of fields.
>>
>> I listed the above in arbitrary order. Of the above, you may want to
>> start by fixing 6, 7, 3, 4 - maybe in this order.
>>
>> To see how much of the slowness is from the extra copying and such
>> rather than from the inner loop being slow, I tried increasing the
>> iteration count from 25 to 725 and using this test hash:
>>
>> {"..X8NBuQ4l6uQ", ""},
>>
>> It's not a valid DES-based crypt(3) hash because those use 25 iterations
>> (not configurable), but I generated it for my performance testing anyway.
>>
>> With this, I got roughly twice higher speed (after multiplying by 29).
>> Thus, the "overhead" accounts for roughly 50% of the total running time
>> of the current code. The code remains too slow even with the overhead
>> mostly removed (due to the much higher iteration count like this).
>>
>> Last but not least, please add proper statements to the source files so
>> that your revisions of them are not misattributed to me. Right now,
>> some of your modified files say that they've been written by me, which
>> is no longer entirely true.
>>
>> Thanks,
>>
>> Alexander
>>
>
> I've made some modifications to fix the issues mentioned in 2,3,6,7. Also
> using bitselect is causing performance drop on AMD GPU. I'll find out why.
>
> Regards
> Sayantan
>
Hi Alexander,
Here's the result of keeping B in local memory.
std2048@...l:~/magnum-jumbo-new/run$ ./john -te -fo=des-opencl -pla=1
OpenCL platform 1: AMD Accelerated Parallel Processing, 2 device(s).
Using device 0: Tahiti
Compilation log: "/tmp/OCLJ1fgkF.cl", line 375: warning: goto statement may
cause irreducible
control flow
goto finalize_keys;
^
"/tmp/OCLJ1fgkF.cl", line 423: warning: goto statement may cause irreducible
control flow
if (rounds_and_swapped == 0x100) goto next;
^
"/tmp/OCLJ1fgkF.cl", line 458: warning: goto statement may cause irreducible
control flow
if (--rounds_and_swapped) goto start;
^
"/tmp/OCLJ1fgkF.cl", line 461: warning: goto statement may cause irreducible
control flow
if (--iterations) goto swap;
^
"/tmp/OCLJ1fgkF.cl", line 474: warning: goto statement may cause irreducible
control flow
goto start;
^
"/tmp/OCLJ1fgkF.cl", line 481: warning: goto statement may cause irreducible
control flow
goto body;
^
Benchmarking: DES BS [128/128 BS XOP-16]... DONE
Many salts: 19942K c/s real, 130023K c/s virtual
Only one salt: 15617K c/s real, 45875K c/s virtual
I haven't posted the patch yet.
Regards,
Sayantan
Content of type "text/html" skipped
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.