Message-ID: <20130713190947.GA25287@openwall.com>
Date: Sat, 13 Jul 2013 23:09:47 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Parallella: bcrypt

Katja,

On Sat, Jul 13, 2013 at 08:41:24PM +0200, Katja Malvoni wrote:
> My assembly code doesn't produce correct results, but it has the same
> execution time (20.7 ms for all cores and cost 5) as the one that
> produces correct results and uses the add instruction. When using
> fadd/iadd, the execution time is 22.6 ms.

Oh, these times are much better than what you reported before.  IIRC,
previously we were unable to get much better than 30 ms - and now you're
saying you're down to 20 ms already, which means almost 50 c/s per core.
Your 20.7 ms should give us 773 c/s on a 16-core Epiphany, which is
roughly the speed of one modern x86 CPU core - but at much lower power
usage.  The 64-core Epiphany is thus comparable to a Core 2 Quad from a
few years ago - but again, at much lower power usage.
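
For reference, here's that arithmetic written out (a quick sketch,
assuming one bcrypt instance per core and the 20.7 ms figure above;
these are projections, not measured c/s):

#include <stdio.h>

int main(void)
{
	double ms_per_hash = 20.7;	/* cost 5, per core */
	double per_core = 1000.0 / ms_per_hash;

	printf("per core: %.1f c/s\n", per_core);	/* ~48 c/s */
	printf("16 cores: %.0f c/s\n", per_core * 16);	/* ~773 c/s */
	printf("64 cores: %.0f c/s\n", per_core * 64);	/* ~3090 c/s */

	return 0;
}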

You just need to turn this into a working JtR format now, and do so in a
way that lets the implementation scale to at least 64 cores.
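
As a rough illustration of what "scaling to 64 cores" means for the
format, a hypothetical crypt_all()-style loop might look like this (the
load/start/wait helpers below are placeholders, not the real JtR or
Epiphany SDK interfaces):

#include <stdio.h>

/* Placeholder stubs standing in for the real host<->Epiphany transfers. */
static void load_keys_to_cores(int first, int n)     { (void)first; (void)n; }
static void start_cores(int n)                        { (void)n; }
static void wait_for_cores(int n)                     { (void)n; }
static void read_results_from_cores(int first, int n) { (void)first; (void)n; }

static int crypt_all_sketch(int count, int num_cores)
{
	int i;

	/* One candidate per core per pass; loop when there are more keys
	 * than cores, so nothing assumes a fixed core count. */
	for (i = 0; i < count; i += num_cores) {
		int batch = (count - i < num_cores) ? count - i : num_cores;

		load_keys_to_cores(i, batch);
		start_cores(batch);
		wait_for_cores(batch);
		read_results_from_cores(i, batch);
	}

	return count;
}

int main(void)
{
	printf("processed %d keys\n", crypt_all_sketch(128, 64));
	return 0;
}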

> Except for that, I don't see any other way to optimize BF_ROUND - the C
> code is close to optimal, almost every line corresponds to one
> instruction (the two that don't are "tmp3 += BF_INDEX(ctx->s.S[0],
> tmp4);" and "R ^= ctx->s.P[N + 1];").
> At higher cost, my code is much slower (2403.999 ms vs 2182.041 ms for
> cost 12 and 9603.889 ms vs 8716.686 ms for cost 14) - I didn't take
> care of the pipeline structure and hazards, and I don't think I'll be
> able to order the instructions better than the compiler did.

Using these numbers, I get:

2404/20.7/128 = 91% efficiency
8717/20.7/512 = 82% efficiency

These are somewhat inconsistent - I would have expected to see almost
the same efficiency percentage from these two, since cost 12 is already
high enough.

In this context, by efficiency I mean CPU time spent in bcrypt's inner
loop vs. total CPU time spent.  (Of course, there are more aspects as
well - such as the inner loop itself still allowing for optimizations -
but the purpose of this "high cost" test was to measure this aspect.)
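
Spelled out, the two figures above are just the high-cost time scaled
down by 2^(cost difference) and divided by the cost 5 time:

#include <stdio.h>

/* efficiency = (high-cost time / 2^(cost_hi - cost_lo)) / low-cost time;
 * the low-cost run includes all per-hash overhead outside the inner loop. */
static double efficiency(double t_hi_ms, int cost_hi,
    double t_lo_ms, int cost_lo)
{
	return t_hi_ms / (double)(1 << (cost_hi - cost_lo)) / t_lo_ms;
}

int main(void)
{
	printf("cost 12: %.0f%%\n", 100.0 * efficiency(2404.0, 12, 20.7, 5));
	printf("cost 14: %.0f%%\n", 100.0 * efficiency(8717.0, 14, 20.7, 5));
	return 0;
}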

> Yaniv, Epiphany Architecture Reference, p. 69: "The branch prediction
> mechanism used by the CPU assumes that the branch was not taken. There is
> no penalty for branches not taken. For branches that are taken, there is a
> three-cycle constant penalty." - is this true for loops as well? If so, in
> the case of BF_encrypt this means a penalty of (3 cycles * 1042) * 2^cost.
> Can that be avoided somehow?

While Yaniv correctly pointed out the hardware loop feature, I suggest
that we don't bother.  We're able to unroll the loop enough that this
penalty is negligible.
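
For concreteness, by unrolling I mean writing the 16 Blowfish rounds out
explicitly, as crypt_blowfish already does with its BF_ROUND macro, so
that the only taken branch left in BF_encrypt is the outer per-block
loop, and the 3-cycle penalty is paid once per 64-bit block rather than
once per round.  A simplified sketch (it mirrors the standard round, but
the context layout is simplified compared to the real code):

typedef unsigned int BF_word;

struct BF_ctx {
	BF_word S[4][256];
	BF_word P[18];
};

/* One Blowfish round; the same operations as the BF_ROUND quoted above,
 * just with a simplified ctx layout. */
#define BF_ROUND(ctx, L, R, N) \
	tmp1 = (L) & 0xFF; \
	tmp2 = ((L) >> 8) & 0xFF; \
	tmp3 = ((L) >> 16) & 0xFF; \
	tmp4 = (L) >> 24; \
	tmp1 = (ctx)->S[3][tmp1]; \
	tmp2 = (ctx)->S[2][tmp2]; \
	tmp3 = (ctx)->S[1][tmp3]; \
	tmp3 += (ctx)->S[0][tmp4]; \
	tmp3 ^= tmp2; \
	(R) ^= (ctx)->P[(N) + 1]; \
	tmp3 += tmp1; \
	(R) ^= tmp3;

/* Fully unrolled encryption of one 64-bit block: straight-line code,
 * no taken branches inside. */
static void BF_encrypt_block(struct BF_ctx *ctx, BF_word *Lp, BF_word *Rp)
{
	BF_word L = *Lp, R = *Rp;
	BF_word tmp1, tmp2, tmp3, tmp4;

	L ^= ctx->P[0];
	BF_ROUND(ctx, L, R, 0);
	BF_ROUND(ctx, R, L, 1);
	BF_ROUND(ctx, L, R, 2);
	BF_ROUND(ctx, R, L, 3);
	/* ... rounds 4 to 13 unrolled the same way ... */
	BF_ROUND(ctx, L, R, 14);
	BF_ROUND(ctx, R, L, 15);

	tmp4 = R;
	R = L ^ ctx->P[17];
	L = tmp4;

	*Lp = L;
	*Rp = R;
}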

> I'll switch to the other approach - interleaving two instances of
> bcrypt - and I'll try to use the Dual-Issue Scheduling Rules.

OK.
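
Just to illustrate the idea: since the two bcrypt instances are fully
independent, the second instance's instructions can fill the other's
load-use stalls and dual-issue slots.  A hypothetical interleaved round,
reusing the simplified BF_ctx layout from the sketch above (a1..a4
belong to instance 0, b1..b4 to instance 1):

#define BF_ROUND2(ctx0, ctx1, L0, R0, L1, R1, N) \
	a1 = (L0) & 0xFF;            b1 = (L1) & 0xFF; \
	a2 = ((L0) >> 8) & 0xFF;     b2 = ((L1) >> 8) & 0xFF; \
	a3 = ((L0) >> 16) & 0xFF;    b3 = ((L1) >> 16) & 0xFF; \
	a4 = (L0) >> 24;             b4 = (L1) >> 24; \
	a1 = (ctx0)->S[3][a1];       b1 = (ctx1)->S[3][b1]; \
	a2 = (ctx0)->S[2][a2];       b2 = (ctx1)->S[2][b2]; \
	a3 = (ctx0)->S[1][a3];       b3 = (ctx1)->S[1][b3]; \
	a3 += (ctx0)->S[0][a4];      b3 += (ctx1)->S[0][b4]; \
	a3 ^= a2;                    b3 ^= b2; \
	(R0) ^= (ctx0)->P[(N) + 1]; (R1) ^= (ctx1)->P[(N) + 1]; \
	a3 += a1;                    b3 += b1; \
	(R0) ^= a3;                  (R1) ^= b3;

The encryption function then declares a1..a4 and b1..b4 and calls
BF_ROUND2() in place of each BF_ROUND() pair, e.g.
BF_ROUND2(ctx0, ctx1, L0, R0, L1, R1, 0).  Whether the compiler or
hand-scheduled assembly gets better pairing out of this is exactly what
your experiment should show.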

Do you feel that getting the JtR integration working correctly is still
waiting on Yaniv's help?

Alexander
