john-dev - Re: rawsha256.cu patch(using shared memory)


Message-ID: <20120328002849.GD19375@openwall.com>
Date: Wed, 28 Mar 2012 04:28:49 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: rawsha256.cu patch(using shared memory)

On Tue, Mar 27, 2012 at 11:26:58PM +0800, myrice wrote:
> I used shared memory in rawsha256.cu(Just as Lukas comments as to-do)
> There are still space for improvement. I think sha256 access patterns have
> bank conflict.
> Overall speedup by ~6% in sha256 and 8% in sha224
...
> =====Before===============
...
> Benchmarking: raw-sha256-cuda [SHA256]... DONE
> Raw: 1979K c/s real, 1998K c/s virtual
> Average: 1933.3 c/s real, 1965.6 c/s virtual
> 
> ============After=================
...
> Benchmarking: raw-sha256-cuda [SHA256]... DONE
> Raw: 2062K c/s real, 2085K c/s virtual
> Average: 2048.6 c/s real, 2080.0 c/s virtual
> 
> Speedup: ~6%

That's nice, but this is still awfully slow.  In fact, even the
benchmarks we have on the wiki somehow show higher speeds, even though
you have a faster card (GTX 580, right?):

    * C-01: i3 2100, 4GB 1333MHz, GeForce 9800GT, slackware 13.1 32bit
    * C-03: C2Duo P7350 2GHz,GF 9600m
    * C-04: 9800GTX
    * C-06: GTX 460 1024M

Benchmarking: SHA256CUDA [SHA256] DONE
john-1.7.6-sha256cuda-0.diff

    * C-01 : Raw: 5734K c/s real, 5745K c/s virtual
    * C-03 : Raw: 1795K c/s real, 1795K c/s virtual
    * C-04 : Raw: 4456K c/s real, 4412K c/s virtual
    * C-06 : Raw: 10443K c/s real, 10527K c/s virtual

(This is for an older revision of Lukas' code.)

Here's what I am getting on CPU with OpenSSL calls:

Benchmarking: Raw SHA-256 [32/64]... DONE
Raw:    1565K c/s real, 1565K c/s virtual

Benchmarking: Raw SHA-256 [32/64]... (8xOMP) DONE
Raw:    6342K c/s real, 791325 c/s virtual
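
To put these figures side by side, here is a minimal Python sketch that
computes the ratios between them.  The numbers are copied from the
benchmark output quoted in this message; the dictionary keys and
function names are made up for illustration and are not anything John
the Ripper uses.

```python
# "Raw: ... c/s real" figures quoted in this message, in thousands of c/s.
# The keys are illustrative labels only.
RAW_KCPS = {
    "gpu_before": 1979,    # raw-sha256-cuda before the shared memory patch
    "gpu_after": 2062,     # raw-sha256-cuda after the patch
    "cpu_1_thread": 1565,  # Raw SHA-256 [32/64] on CPU with OpenSSL calls
    "cpu_8x_omp": 6342,    # the same CPU build with 8xOMP
    "wiki_c06": 10443,     # C-06 (GTX 460) on the wiki, older code revision
}

# "Average: ... c/s real" figures from the same before/after runs.
AVERAGE_BEFORE = 1933.3
AVERAGE_AFTER = 2048.6


def ratio(a, b):
    """Return how many times larger a is than b."""
    return a / b


def speedup_percent(before, after):
    """Return the relative improvement of after over before, in percent."""
    return (after - before) / before * 100.0


# Patch speedup, computed from the Raw lines and from the Average lines.
print("Raw speedup: %.1f%%" % speedup_percent(RAW_KCPS["gpu_before"],
                                              RAW_KCPS["gpu_after"]))
print("Average speedup: %.1f%%" % speedup_percent(AVERAGE_BEFORE,
                                                  AVERAGE_AFTER))

# Patched GPU code relative to one CPU thread, then the other results
# relative to the patched GPU code.
print("GPU vs 1 CPU thread: %.2fx" % ratio(RAW_KCPS["gpu_after"],
                                           RAW_KCPS["cpu_1_thread"]))
print("8xOMP CPU vs GPU: %.2fx" % ratio(RAW_KCPS["cpu_8x_omp"],
                                        RAW_KCPS["gpu_after"]))
print("Wiki C-06 vs GPU: %.2fx" % ratio(RAW_KCPS["wiki_c06"],
                                        RAW_KCPS["gpu_after"]))
```

The "~6%" in the quoted report corresponds to the Average lines; the Raw
lines give a somewhat smaller improvement.  Either way, the patched CUDA
code is only modestly ahead of a single CPU thread and well behind both
the 8xOMP CPU run and the older wiki result for the GTX 460.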

The formats interface bottleneck is somewhere above 50M c/s.  Actually,
--format=dummy shows it at around 130M c/s on a Core i7-2600, which is
the CPU you said you use, but interfacing to the GPU takes additional
time.  With Samuele's fast hash implementations in OpenCL running on
GPU, we're getting close to 50M c/s, so you need to get close to that
as well.  This is a good thing for you to attempt.
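
As a rough way to reason about these two numbers, here is a minimal
Python sketch of a simple serial model.  The model itself is an
assumption, not something stated in this message: it treats the
host-side formats interface (about 130M c/s with --format=dummy) and the
GPU-side work, including transfers, as two stages that do not overlap,
so their per-candidate times add up.  The function names are
hypothetical.

```python
def combined_rate(interface_rate, gpu_rate):
    """Overall c/s when per-candidate times of two stages simply add up.

    Assumption: the host-side interface and the GPU-side work run one
    after the other, with no overlap between them.
    """
    return 1.0 / (1.0 / interface_rate + 1.0 / gpu_rate)


def required_gpu_rate(target_rate, interface_rate):
    """GPU-side c/s needed to reach target_rate overall under this model."""
    if target_rate >= interface_rate:
        raise ValueError("target must be below the interface ceiling")
    return 1.0 / (1.0 / target_rate - 1.0 / interface_rate)


INTERFACE_RATE = 130e6  # --format=dummy figure from this message
TARGET_RATE = 50e6      # overall speed to aim for, from this message

needed = required_gpu_rate(TARGET_RATE, INTERFACE_RATE)
print("GPU-side rate needed: %.2fM c/s" % (needed / 1e6))
print("Check, combined rate: %.2fM c/s"
      % (combined_rate(INTERFACE_RATE, needed) / 1e6))
```

Under this assumed model, reaching 50M c/s overall would take a GPU side
that by itself sustains roughly 81M c/s.  If the host and GPU work
overlap, the requirement is lower, so treat this as an upper-end
estimate only.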

(And once you get there, you'd need to somehow demonstrate that your
code would be even faster without the interface bottleneck - e.g., by
starting to implement candidate password generation and hash comparison
on GPU in whatever quick way you can for the demo.)

Thanks,

Alexander
