john-dev - Re: bf

Message-ID: <20120711080353.GA10824@openwall.com>
Date: Wed, 11 Jul 2012 12:03:53 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: bf_kernel.cl

On Tue, Jul 10, 2012 at 08:13:22PM +0530, Sayantan Datta wrote:
> I was looking at the IL generated on the 7970 using LDS only.  Each Encrypt
> call has approximately 540 instructions at the IL level.  However, according
> to your previous estimates, each Encrypt call has 16*16+4+5 = 275

Any idea why we have so many more instructions?  Is it possibly because
they're for two SIMD units?

> resulting in an estimated speed of 52K c/s.

That was a theoretical/optimistic estimate, assuming that results of
each instruction are available for use the very next cycle and that
scatter/gather addressing is used.

> Since the number of instructions is doubled, we
> should expect at least half of your previous estimate, say roughly 26K c/s.

You probably meant "at most".  However, this is not necessarily right -
we need to find the cause of the doubled instruction count first.
If it's because of explicit separate instructions for use of two SIMD
units, then this does not halve the estimated speed.
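
To make the two cases concrete, here is a rough sketch in Python using
only the figures quoted in this thread (275 estimated instructions,
about 540 observed, 52K c/s).  The helper name and the idea of dividing
the observed count by the number of SIMD units it covers are
illustrative assumptions, not something measured:

```python
# Figures quoted in this thread; the helper itself is hypothetical.
ESTIMATED_INSNS = 275      # earlier per-Encrypt instruction estimate
OBSERVED_INSNS = 540       # approximate IL instruction count on the 7970
ESTIMATED_SPEED = 52_000   # c/s, theoretical/optimistic


def rescaled_speed(observed_insns, simd_units_covered):
    """Scale the optimistic estimate by the per-unit instruction count."""
    insns_per_unit = observed_insns / simd_units_covered
    return ESTIMATED_SPEED * ESTIMATED_INSNS / insns_per_unit


# Case 1: all observed instructions serve a single SIMD unit.
serial = rescaled_speed(OBSERVED_INSNS, 1)
# Case 2: the listing spells out instructions for two SIMD units.
paired = rescaled_speed(OBSERVED_INSNS, 2)

print(round(serial), round(paired))
```

In the first case the estimate drops to roughly half; in the second it
stays about where it was.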

> But we are nowhere near that.  I guess your previous estimates were based
> on the assumption that each instruction takes 1 clock cycle to execute, is
> that right?  But it looks like not all instructions require the same number
> of clock cycles on the GPU.

I think all relevant instructions can execute in 1 cycle in terms of
throughput when there's sufficient parallelism available, but we do not
have sufficient parallelism here (because of limited LDS size), so we
incur stalls because of instruction latencies greater than 1 cycle.
This is no surprise.  My estimates were in fact theoretical/optimistic
rather than realistic.
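
As a toy illustration of throughput vs. latency (a deliberately
simplified model, not a description of the actual hardware scheduler):
assume each independent instruction stream must wait a fixed number of
cycles for its previous result.  The latency value of 8 below is purely
hypothetical:

```python
def issue_rate(independent_streams, latency_cycles):
    """Instructions issued per cycle in a toy pipeline model.

    Each stream can issue one instruction, then must wait latency_cycles
    before its result is usable.  With enough independent streams the
    pipeline stays full and the rate saturates at 1 per cycle.
    """
    return min(1.0, independent_streams / latency_cycles)


LATENCY = 8  # hypothetical latency in cycles, for illustration only

for streams in (1, 2, 4, 8, 16):
    print(streams, issue_rate(streams, LATENCY))
```

With too few streams in flight, the rate is limited by latency rather
than by the 1-per-cycle throughput.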

However, one thing to double-check is whether we're actually using gather
addressing for the S-box lookups.  If we're not, then we use at most one
vector element per SIMD unit, and a speedup should be possible from using
4 instead of 2 SIMD units.  If you
see no such speedup, then this suggests that we're probably using gather
just fine now, and not reaching the theoretical speeds (by far) is
merely/primarily because of instruction latencies.
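
For reference, this is what "gather" means for the S-box lookups,
sketched in Python rather than OpenCL.  Each lane has its own S-boxes
and its own index, so one vector step does a different lookup per lane.
The S-box contents here are dummy values (not Blowfish's real initial
values), and all function names are made up for this sketch:

```python
MASK32 = 0xFFFFFFFF


def make_sboxes(seed):
    """Dummy 4x256 S-boxes (NOT Blowfish's real initial values)."""
    return [[(seed * 2654435761 + box * 256 + i * 40503) & MASK32
             for i in range(256)] for box in range(4)]


def bf_f(s, x):
    """Blowfish F function for one instance with S-boxes s."""
    a = s[0][x >> 24]
    b = s[1][(x >> 16) & 0xFF]
    c = s[2][(x >> 8) & 0xFF]
    d = s[3][x & 0xFF]
    return ((((a + b) & MASK32) ^ c) + d) & MASK32


def bf_f_gather(sboxes, xs):
    """One 'vector' step: each lane indexes its own S-boxes."""
    return [bf_f(s, x) for s, x in zip(sboxes, xs)]


lanes = [make_sboxes(i) for i in range(4)]
xs = [0x01234567, 0x89ABCDEF, 0x00000000, 0xFFFFFFFF]
out = bf_f_gather(lanes, xs)
```

Without gather, only one of these lanes would do useful work per step.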

BTW, in the wiki page at http://openwall.info/wiki/john/GPU/bcrypt you
mention Bulldozer's L2 cache.  JFYI, this is irrelevant, since on CPUs
the S-boxes fit in L1 cache.  We only run 4 concurrent instances of
bcrypt per Bulldozer module (two per thread, and there's no need to run
more), so that's only 16 KB for the S-boxes.
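
The 16 KB figure follows from Blowfish's four S-boxes of 256 32-bit
entries each, i.e. 4 KB per bcrypt instance.  A quick check in Python
(the function name is just for this sketch):

```python
SBOX_COUNT = 4        # Blowfish uses four S-boxes
SBOX_ENTRIES = 256    # each is indexed by one byte
ENTRY_BYTES = 4       # entries are 32-bit words


def sbox_bytes(instances):
    """Total S-box storage for the given number of bcrypt instances."""
    return instances * SBOX_COUNT * SBOX_ENTRIES * ENTRY_BYTES


per_instance = sbox_bytes(1)   # one bcrypt instance
per_module = sbox_bytes(4)     # 4 concurrent instances per module
print(per_instance // 1024, "KB,", per_module // 1024, "KB")
```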

Alexander
