john-dev - Re: Lukas - status report #2

Message-ID: <20120501061416.GA10734@openwall.com>
Date: Tue, 1 May 2012 10:14:16 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Lukas - status report #2

Lukas -

On Tue, May 01, 2012 at 06:13:01AM +0200, Lukas Odzioba wrote:
> I would be happier to see 80-90k; previously (just the PMK calculation,
> the most time-consuming part) we had 90% of hashcat's speed. For now
> the difference will be even worse for super fast GPUs and slow CPUs.
> Besides, the CPU-side code utilizes only 1 core. Do you have any ideas
> to get around it other than MPI? On the other hand, we could move all
> the code to a second GPU kernel.

So you invoke the SHA-1 compression function about 20 times per key in
wpapsk_postprocess().  This is only about 1/400 of the total work, yet it
causes a significant slowdown when your GPU code is optimized, your CPU
code is not optimized, and you run these sequentially rather than in
parallel.
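
To put rough numbers on this, here is a minimal sketch of the arithmetic.
The function name and the speed ratio are made up for illustration; only
the 1/400 share comes from the estimate above.  It computes what fraction
of the GPU-only speed is left when the CPU part runs strictly after the
GPU part.

```c
#include <assert.h>

/*
 * Fraction of the GPU-only speed that remains when the CPU part runs
 * after the GPU part instead of alongside it.
 *
 * cpu_fraction: share of the total work left on the CPU (e.g. 1.0 / 400)
 * gpu_speedup:  how many times faster the GPU gets through a unit of
 *               work than the CPU code does (hypothetical input)
 */
double sequential_efficiency(double cpu_fraction, double gpu_speedup)
{
    double gpu_time, cpu_time;

    assert(cpu_fraction >= 0.0 && cpu_fraction < 1.0);
    assert(gpu_speedup > 0.0);

    /* Both times are in units of "GPU time for the whole job". */
    gpu_time = 1.0 - cpu_fraction;
    cpu_time = cpu_fraction * gpu_speedup;

    return gpu_time / (gpu_time + cpu_time);
}
```

With cpu_fraction = 1.0 / 400 and an assumed gpu_speedup of 100, this
gives about 0.80, that is, a 20% loss from 0.25% of the work.  At an
assumed ratio of 400 the speed is roughly halved.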

Besides the current "just use OpenMP" hack, you can try these approaches:

1. Move wpapsk_postprocess() into your GPU kernel.  I don't see why
you're mentioning a second kernel.  You already happen to have
wpapsk_kernel.cl separate from pbkdf2_kernel.cl (even though I think we
could have a shared PBKDF2 with HMAC-SHA1 kernel, if it were not for
this new WPA specific detail).  Ditto for CUDA.

2. Interleave the GPU and CPU code invocations by invoking the GPU
kernel multiple times from a single crypt_all() call, for different
subsets of the total set of keys.  The first GPU kernel invocation won't
overlap with any on-CPU work, and the last invocation of the on-CPU
postprocessing won't overlap with any on-GPU work - but the rest will.
So you'll need to keep the number of chunks large enough (e.g., 10) -
and have it tunable.

(Maybe we need to enhance the formats interface to allow for async
processing across crypt_all() call boundaries.)

3. Optimize the on-CPU postprocessing.  Replace the calls into OpenSSL
with uses of our SSE2+ intrinsics implementations of SHA-1 and MD5.
Implement HMAC on your own.
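
For approach 3, one possible shape of a hand-rolled HMAC-SHA1 is sketched
below in plain scalar C.  This is not the SSE2 intrinsics code from the
tree (which would process several keys at once), and the names
hmac_sha1_key, hmac_sha1_set_key() and hmac_sha1() are made up for
illustration.  It assumes keys of at most 64 bytes and uses the usual
trick of hashing the ipad and opad key blocks once per key, so every
later HMAC with the same key skips two compressions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ROL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

/* One SHA-1 compression: fold a 64-byte block into the state h. */
static void sha1_compress(uint32_t h[5], const uint8_t block[64])
{
    uint32_t w[80], a = h[0], b = h[1], c = h[2], d = h[3], e = h[4];
    int i;

    for (i = 0; i < 16; i++)
        w[i] = (uint32_t)block[4 * i] << 24 |
               (uint32_t)block[4 * i + 1] << 16 |
               (uint32_t)block[4 * i + 2] << 8 | block[4 * i + 3];
    for (i = 16; i < 80; i++)
        w[i] = ROL32(w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16], 1);
    for (i = 0; i < 80; i++) {
        uint32_t f, k, t;
        if (i < 20)      { f = (b & c) | (~b & d);          k = 0x5A827999; }
        else if (i < 40) { f = b ^ c ^ d;                   k = 0x6ED9EBA1; }
        else if (i < 60) { f = (b & c) | (b & d) | (c & d); k = 0x8F1BBCDC; }
        else             { f = b ^ c ^ d;                   k = 0xCA62C1D6; }
        t = ROL32(a, 5) + f + e + k + w[i];
        e = d; d = c; c = ROL32(b, 30); b = a; a = t;
    }
    h[0] += a; h[1] += b; h[2] += c; h[3] += d; h[4] += e;
}

/* Hash msg on top of a state that has already absorbed one 64-byte
 * key pad block, then apply SHA-1 padding and write the digest. */
static void sha1_finish(uint32_t h[5], const uint8_t *msg, size_t len,
                        uint8_t out[20])
{
    uint8_t block[64];
    uint64_t bits = (uint64_t)(64 + len) * 8; /* pad block + message */
    int i;

    for (; len >= 64; msg += 64, len -= 64)
        sha1_compress(h, msg);
    memset(block, 0, sizeof(block));
    memcpy(block, msg, len);
    block[len] = 0x80;
    if (len > 55) { /* no room left for the 8-byte length field */
        sha1_compress(h, block);
        memset(block, 0, sizeof(block));
    }
    for (i = 0; i < 8; i++)
        block[63 - i] = (uint8_t)(bits >> (8 * i));
    sha1_compress(h, block);
    for (i = 0; i < 20; i++)
        out[i] = (uint8_t)(h[i / 4] >> (24 - 8 * (i % 4)));
}

/* Per-key precomputed states (hypothetical type, for illustration). */
struct hmac_sha1_key { uint32_t inner[5], outer[5]; };

static const uint32_t sha1_iv[5] = {
    0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0
};

/* Two compressions per key; assumes key_len <= 64. */
void hmac_sha1_set_key(struct hmac_sha1_key *k, const uint8_t *key,
                       size_t key_len)
{
    uint8_t pad[64];
    size_t i;

    assert(key_len <= 64);
    memset(pad, 0x36, sizeof(pad));
    for (i = 0; i < key_len; i++)
        pad[i] ^= key[i];
    memcpy(k->inner, sha1_iv, sizeof(sha1_iv));
    sha1_compress(k->inner, pad);
    for (i = 0; i < 64; i++)
        pad[i] ^= 0x36 ^ 0x5c; /* turn the ipad block into the opad block */
    memcpy(k->outer, sha1_iv, sizeof(sha1_iv));
    sha1_compress(k->outer, pad);
}

/* HMAC of msg under a prepared key. */
void hmac_sha1(const struct hmac_sha1_key *k, const uint8_t *msg,
               size_t len, uint8_t mac[20])
{
    uint32_t h[5];
    uint8_t inner_digest[20];

    memcpy(h, k->inner, sizeof(h));
    sha1_finish(h, msg, len, inner_digest);
    memcpy(h, k->outer, sizeof(h));
    sha1_finish(h, inner_digest, sizeof(inner_digest), mac);
}
```

The same split (set the key once, then MAC many messages) should carry
over to HMAC-MD5 and to a SIMD version, where sha1_compress() would be
replaced by a multi-key compression function.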

#2 and #3 above may be combined.  But #1 is probably better.

#2 alone is probably the easiest to implement now.  You may keep the
OpenMP stuff too.
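
A minimal sketch of the call ordering for #2 follows.  Everything in it
is hypothetical: gpu_start_chunk(), gpu_wait_chunk() and
cpu_postprocess_chunk() are stubs that only log what would happen, and
crypt_all_interleaved() is not the real crypt_all() signature.  In real
code the start stub would be a non-blocking kernel enqueue and the wait
stub a blocking wait plus result readback.

```c
#include <assert.h>
#include <stddef.h>

/* Event log so the ordering can be inspected; stands in for real work. */
enum { EV_GPU_START, EV_GPU_WAIT, EV_CPU_POST };
struct event { int kind; int chunk; };
static struct event event_log[64];
static size_t event_count;

static void record(int kind, int chunk)
{
    assert(event_count < sizeof(event_log) / sizeof(event_log[0]));
    event_log[event_count].kind = kind;
    event_log[event_count].chunk = chunk;
    event_count++;
}

/* Hypothetical stubs: async launch, blocking wait, on-CPU postprocessing. */
static void gpu_start_chunk(int chunk)       { record(EV_GPU_START, chunk); }
static void gpu_wait_chunk(int chunk)        { record(EV_GPU_WAIT, chunk); }
static void cpu_postprocess_chunk(int chunk) { record(EV_CPU_POST, chunk); }

/* Process the keys in `chunks` pieces (tunable), overlapping GPU work
 * on chunk i + 1 with CPU postprocessing of chunk i. */
void crypt_all_interleaved(int chunks)
{
    int i;

    assert(chunks > 0);
    event_count = 0;

    gpu_start_chunk(0);              /* nothing on the CPU to overlap yet */
    for (i = 0; i < chunks; i++) {
        gpu_wait_chunk(i);
        if (i + 1 < chunks)
            gpu_start_chunk(i + 1);  /* GPU busy while the CPU works below */
        cpu_postprocess_chunk(i);    /* last one has no GPU work alongside */
    }
}
```

Only the first GPU launch and the last postprocessing step run alone, so
the non-overlapped share shrinks roughly as 1/chunks.  The OpenMP loop
can stay inside cpu_postprocess_chunk().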

Alexander