Message-ID: <AANLkTi=5rkqjOx4XYCmuLu5A6O7DpRNNMLMPNB2wnvpD@mail.gmail.com>
Date: Mon, 28 Mar 2011 13:41:45 +0300
From: Milen Rangelov <gat3way@...il.com>
To: john-users@...ts.openwall.com
Subject: Re: GSOC - GPU for hashes

Hello,

> Debugging CL code is a real pain: forget useful printf, forget an
> easy-to-use IDE. When everything is syntactically correct and you still
> don't get the right result, good luck my friend, it's time to get pen and paper.
>
>
With ATI, there is a cl_amd_printf extension that allows you to use
printf() in kernels. Some format specifiers are not allowed, but generally it
does a good job... OK, unless you put it in a branch, then it does not work
correctly due to a bug in AMD's OpenCL implementation.
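
For illustration, enabling it in a kernel looks roughly like this (a minimal
sketch, not taken from any real code):

#pragma OPENCL EXTENSION cl_amd_printf : enable

__kernel void dump_values(__global const uint *data)
{
    uint gid = get_global_id(0);
    /* print one value per work-item; keep it out of divergent branches */
    printf("gid=%u data=%08x\n", gid, data[gid]);
}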


> This way, a first check to see whether 1/5 of the hash matches is done very
> fast; in case it matches, I get the remaining 4 blocks and compare them, and
> if everything is OK the password has been cracked.
>
> This is done because of the slow data transfer between GPU and CPU: as Milen
> pointed out, this is the real pain you will be working on.
>
>
That's good :)
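
For illustration, that early-out check could look roughly like this in a kernel
(all names here, including the compute_hash_word0() helper, are hypothetical,
not from the actual code):

/* Hypothetical sketch of the early-out check quoted above: test only
 * the first 32-bit word of each target hash in the hot loop, and do
 * the full comparison only on a hit. */
__kernel void check_candidates(__global const uint *first_words, /* word 0 of each target */
                               __global const uint4 *rest_words, /* remaining 4 words */
                               __global uint *found,
                               uint num_targets)
{
    uint gid = get_global_id(0);
    uint4 h_rest;
    uint h0 = compute_hash_word0(gid, &h_rest); /* hypothetical helper */

    for (uint i = 0; i < num_targets; i++) {
        if (h0 == first_words[i]) {            /* cheap 1/5 check */
            if (all(h_rest == rest_words[i]))  /* full compare only on a hit */
                found[i] = gid;                /* let the host verify/report */
        }
    }
}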



>
> As you can see, there's a difference in speed depending on whether you "upload"
> or "download" data on the GPU; you need to take that into account when you
> code. You also need to understand that different video cards will have
> different data transfer rates; this one is for an HD 6970, which I found at
> about 200 euro and which replaced my "old" 5750, whose "download" data rate
> was lower than the upload.
>
>
I've done benchmarks with various AMD and Nvidia hardware. It turns out that
clEnqueueMapBuffer()/clEnqueueUnmapMemObject() is fastest when smaller
amounts of memory are transferred, and
clEnqueueReadBuffer()/clEnqueueWriteBuffer() are better for larger buffers
when CL_MEM_USE_HOST_PTR is used.
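
A minimal sketch of the two transfer paths (error checking omitted; ctx,
queue, host_buf and size are assumed to exist already):

/* Buffer backed by host memory */
cl_int err;
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                            size, host_buf, &err);

/* Path 1: map/unmap - tends to win for small transfers */
void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                             0, size, 0, NULL, NULL, &err);
/* ... read results through p ... */
clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);

/* Path 2: explicit read - tends to win for large buffers */
clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, size, host_buf,
                    0, NULL, NULL);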



> Even with this monster I can "feel" something is wrong, because the video
> card is not stressed enough; compared to the times I run pyrit, which
> segfaults because of heat, with john running I can watch movies with no
> problem.
>

Best utilization is achieved when:

* NDRange (aka global work size) is large enough, at least 15000-20000
work-items (probably even more to effectively "hide" memory latencies,
depending on how much ALU work is being performed in the kernel)
* local work size is divisible by 64 (see the sketch after this list)
* the FetchUnitBusy value is as low as possible compared to ALUUnitBusy (use the
AMD Stream Kernel Analyzer for that purpose - a very useful profiling tool).
That means fewer __global memory accesses and more ALU work
* host-device transfers are kept to a minimum
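
As a rough illustration of the work-size points above (the numbers are
examples only, not tuned figures):

/* Plenty of work-items to hide memory latency */
size_t global_ws = 1048576;
/* Local size is a multiple of the 64-wide AMD wavefront */
size_t local_ws  = 64;

/* global size must be a multiple of the local size */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_ws, &local_ws, 0, NULL, NULL);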

In addition, I found out that creating two threads, each with its own context
and queue, on a single GPU device works more than 10% faster on "fast"
algos, due to the fact that the GPU is better utilized then (less time lost
between kernel invocations?).
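
A minimal sketch of that arrangement (the worker body is hypothetical; real
code would also build the program and launch kernels per thread):

#include <pthread.h>
#include <CL/cl.h>

/* Each worker gets its own context and queue on the same device. */
static void *worker(void *arg)
{
    cl_device_id dev = *(cl_device_id *)arg;
    cl_int err;
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

    /* ... build program, set args, enqueue kernels on q ... */

    clReleaseCommandQueue(q);
    clReleaseContext(ctx);
    return NULL;
}

/* Launch two workers against one GPU device. */
void run_two_workers(cl_device_id dev)
{
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, &dev);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
}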

P.S. - you may have a look at my (ATI) MD5 brute-force kernel:

http://hashkill.svn.sourceforge.net/viewvc/hashkill/src/kernels/amd_md5_long.cl?revision=109&content-type=text%2Fplain

The code is very ugly and you would probably not understand why I am doing lots
of things without looking at the host code, but it demonstrates several
things:

1) single-hash vs multihash optimizations based on a preprocessor define
passed when building the kernel (-DSINGLE_MODE)
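
The mechanism is just an #ifdef driven by the build options, roughly like this
(a sketch showing both sides, not the actual code):

/* Host side: pass the define when building the kernel */
clBuildProgram(program, 1, &dev, "-DSINGLE_MODE", NULL, NULL);

/* Kernel side: select the code path at compile time */
#ifdef SINGLE_MODE
    /* compare against the single target hash */
#else
    /* multi-hash path: bitmap / list lookup */
#endif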

2) 5xxx/6xxx vs 4xxx-optimized code (again, preprocessor define - OLD_ATI).
4xxx GPUs generally work better with 4-component vectors as opposed to
8-component ones, dunno why.
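
In kernel code that boils down to something like this (sketch):

/* Pick the vector width per GPU generation */
#ifdef OLD_ATI
typedef uint4 vec_t;   /* 4xxx: 4-component vectors */
#else
typedef uint8 vec_t;   /* 5xxx/6xxx: 8-component vectors */
#endif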

3) Bitmap checks for multi-hash cases in order to avoid lots of slow
host-device transfers.
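
The idea behind the bitmap check is roughly the following (the bitmap size and
all names here are assumptions, not taken from the actual kernel):

/* Reject most non-matching hashes on the GPU with a bitmap, so only
 * rare candidate hits get written back for the host to verify. */
uint bit = h0 & 0x7FFFFFu;                    /* low bits of hash word 0 */
if (bitmap[bit >> 5] & (1u << (bit & 31u))) { /* bitmap stored as uints */
    found[atomic_inc(found_count)] = get_global_id(0);
}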

4) Candidate generation based on a lookup table in global memory (ugh) plus
a couple of __private function arguments. Overall it's not bad, but it has
to be improved.

5) MD5 algorithm reversal (single-hash case) up to step 43 (about 20 MD5
steps skipped)

6) Using amd_bytealign() seems rather illogical, but I do this because I
patch the compiled binary kernel on-the-fly, replacing BYTEALIGN_INT with
BFI_INT. This brings a nice ~20% performance improvement on 5xxx/6xxx
cards, as BFI_INT is a single instruction that does the round 1 and round 2
transformations (F and G).
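
For context, the MD5 F and G functions are both three-operand bit selects,
which is the operation a single BFI_INT performs; in portable OpenCL the same
thing can also be written with bitselect(). This is just an illustration of
the operation, not the binary-patching trick itself:

/* MD5 round 1 / round 2 nonlinear functions as plain expressions ... */
#define F(x, y, z) (((x) & (y)) | (~(x) & (z)))
#define G(x, y, z) (((x) & (z)) | ((y) & ~(z)))

/* ... and as single three-operand bit selects: bitselect(a, b, c)
 * picks bits of b where c is 1 and bits of a where c is 0. */
#define F_SEL(x, y, z) bitselect((z), (y), (x))
#define G_SEL(x, y, z) bitselect((y), (x), (z))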

7) Additional optimization in case plaintexts are less than 8 bytes in size
(MAX8).

8) As for the "DOUBLE" thing - in my program that corresponds to a
command-line option. In effect, this increases speeds a bit, as global memory
reads are cut in half and more work is done in the kernel, so the ratio of
transfer time to kernel execution time is lower. OTOH, memory usage rises
2x. You may just ignore it, as is done in the 4xxx codepath (OLD_ATI).
