john-users - Re: opencl sha1 jtr and others some experiments

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <AANLkTi=jayh53eb8cEiCwYYXRER_awKJmBphcXaQ3EFf@mail.gmail.com>
Date: Fri, 14 Jan 2011 11:17:13 +0200
From: Milen Rangelov <gat3way@...il.com>
To: john-users@...ts.openwall.com
Subject: Re: opencl sha1 jtr and others some experiments

Hello, I did a quick review of your ocl code. I am not that acquainted with
the JtR code, but here are my remarks on the bottlenecks:

1) you are using global_work_size=1, that's just detrimental. You are
severely underusing the GPU and that's probably the biggest bottleneck in
your code. To utilize all the GPU's ALU units you need at least a couple
thousands of workitems. To hide memory latency, you need more than that. If
you can experiment, try running with global_work_size=20000, then 500000,
100000, etc. Experiment with that. You see - the time needed to schedule
your kernel greatly exceeds its execution time, having global_work_size=1 is
a total waste of processing time.

1a) There are basically two high-performance patterns: one is to have an
inner loop in the kernel which take care of (partial) candidate generation
and hashing - which is what most GPU crackers do. However this is not the
only way - another way is to increase the global work size, do not do any
candidate generation and let OpenCL schedule your kernel. This may require
using some precalculated table of values to map candidates to kernel's
global id (which is what I do). However, using shared memory enables some
nice tricks on GPUs that support it. Both have advantages and disadvantages
- the inner loop one has the penalty of control flow divergences, the other
has the penalty of accessing the precomputed table in global memory and more
wavefront scheduling overhead.


2) You can almost completely get rid of slow device-to-host PCIe transfers
by doing the checks in the kernel code. It is rather easy if you check a
single hash, it gets much harder if you check against a number of hashes. In
the latter case, different tricks can be employed, however you'll likely
need to access slow global memory. If you need to write a fast GPU cracker,
you will have to get rid of the transfers anyway as the PCIe bandwidth
limits the maximum speed (e.g PCIe 2 has 8GB/s  bandwidth or ~ 500 million
MD5 c/s as theoretical maximum, however even with pinned memory you're
unlikely to even get somewhere near that). And modern ATI GPUs like e.g 5870
are capable of doing more than 3 billion MD5 ops/sec). For SHA1 it gets even
worse as the digest size is somehow larger.

3) On ATI GPUs, you will benefit from vectorized code (e.g using uint4s or
uint8s). I found out that in some cases, using uint8 gets you better VLIW
utilization (thus faster cracking), however it may also mean more global
memory reads/writes which may get you back to uint4.

On NVidia, you should use scalars.

Using vectors on ATI will likely bring ~2x-3x performance as compared to
scalar code.

4) Your SHA-1 implementation can be optimized:

4a) On 5xxx and 6xxx ATI cards, there is a single "rotate left" instruction
called bitalign which is accessible from OpenCL by using rotate() function.
This is almost 3x faster than the generic "rotateleft". You may just use
rotate() instead of your rotateleft function - on 4xxx it will be as slow as
it, on 5xxx and above it will be much faster. Actually you're already using
rotate() for the round functions.

4b) Avoid stuff like that:

    for(i = 0; i < size; i++)
    {
        s[i] = (char) c;
    }

Unroll the loop, avoid loops and branching, it's costly.

4c) The transformation function in round 3(steps 49-60) can be replaced with
(b&c)|(d&(b|c)) which saves you 20 bitwise operations.

4d) The final 5 additions / byte order reversals can be precomputed in
reverse in host code and omitted in kernel code, that makes more sense if
you have a single hash to crack. Actually the last 4 steps of the SHA1 can
be partially reversed to speed it up a bit, but that depends on the design
and I don't know if it's possible with JTR.
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.