|
Message-ID: <20130713190947.GA25287@openwall.com> Date: Sat, 13 Jul 2013 23:09:47 +0400 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Subject: Re: Parallella: bcrypt Katja, On Sat, Jul 13, 2013 at 08:41:24PM +0200, Katja Malvoni wrote: > My assembly code doesn't produce correct results but it has same execution > time (20.7 ms for all cores and cost 5) as the one that produces correct > result and it uses add instruction. When using fadd/iadd execution time is > 22.6 ms. Oh, these times are much better than what you reported before. IIRC, previously we were unable to get much better than 30 ms - and now you're saying you're down to 20 ms already, which means almost 50 c/s. Your 20.7 ms should give us 773 c/s on 16-core Epiphany, which is roughly the speed of one modern x86 CPU core - but at much lower power usage. The 64-core Epiphany is thus comparable to Core 2 Quad from a few years ago - but again, at much lower power usage. You just need to implement this into a working JtR format now, and do so in a way that would let the implementation scale at least to 64 cores. > Except that, I don't see any other way to optimize BF_ROUND - C code is > very optimal, almost every line corresponds to one instruction (two that > don't are "tmp3 += BF_INDEX(ctx->s.S[0], tmp4);" and "R ^= ctx->s.P[N + > 1];"). > In case of higher cost, my code is much slower (2403.999 ms vs > 2182.041 msfor cost 12 and 9603.889 ms vs 8716.686 > ms for cost 14) - I didn't take care about pipeline structure and hazards > and I don't think I'll be able to change order of instructions better than > compiler did it. Using these numbers, I get: 2404/20.7/128 = 91% efficiency 8717/20.7/512 = 82% efficiency These are somewhat inconsistent - I would have expected to see almost the same efficiency percentage from these two, since cost 12 is already high enough. In this context, by efficiency I mean CPU time spent in bcrypt's inner loop vs. total CPU time spent. (Of course, there are more aspects as well - such as the inner loop itself still allowing for optimizations - but the purpose of this "high cost" test was to measure this aspect.) > Yaniv, Epiphany Architecture Reference, p.69: "The branch prediction > mechanism used by the CPU assumes that the branch was not taken. There is > no penalty for branches not taken. For branches that are taken, there is a > three-cycle constant penalty." - is this true for loops as well? If yes, in > case of BF_encrypt this means penalty of (3 cycles * 1042) * 2^cost. Can > that be avoided somehow? While Yaniv correctly pointed out the hardware loop feature, I suggest that we don't bother. We're able to unroll the loop enough that this penalty is negligible. > I'll switch to other approach - interleaving two instances of bcrypt and > I'll try to use Dual-Issue Scheduling Rules. OK. Do you feel your work on getting JtR integration working correctly is still pending Yaniv's help? Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.