john-dev - Re: Parallella: bcrypt

Message-ID: <20130727153617.GI27483@openwall.com>
Date: Sat, 27 Jul 2013 19:36:17 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Parallella: bcrypt

Katja,

On Fri, Jul 26, 2013 at 01:55:00PM +0200, Katja Malvoni wrote:
> On Thu, Jul 25, 2013 at 5:35 PM, Solar Designer <solar@...nwall.com> wrote:
> 
> > Even if not, we can do a 32-bit store and we still save 1 cycle.  Right
> > now, we have these two right shifts and two ANDs.  We replace these four
> > with one 32-bit store (if we have to, but I think we don't - see above),
> > two LDRBs, and two IMULs or IMADDs (but these are free for us, since the
> > FPU would otherwise be idle).  So that's 3 (or 2) non-free instructions
> > instead of 4.
> 
> I implemented only one shift right (tmp2) and reordered instructions so
> that dual-issue is possible, but it's much slower (793 c/s) than lsr
> followed by and (976 c/s). In the Epiphany Architecture Reference, p. 68,
> Table 27 says that a Byte Internal Data Load stalls for 2 cycles regardless
> of the instruction sequence. It seems that we have a 2-cycle penalty for
> every LDRB. These two cycles don't fully explain the slowdown I get,
> though; it shouldn't be this big.

Ouch.  Yes, you're right about LDRB, and also that the slowdown you
observed is larger than that 2-cycle stall alone would explain.  This
makes my idea unusable.
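
For readers following the thread, the two ways of extracting the index bytes compared above can be sketched in C. This is only an illustration that both variants yield the same bytes, not the Epiphany assembly under discussion. The function names are hypothetical, and the byte-load variant assumes a little-endian memory layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Variant A: extract the four bytes with shifts and masks
 * (the "lsr followed by and" path). */
void bytes_by_shift(uint32_t x, uint8_t out[4])
{
    out[0] = (uint8_t)(x >> 24);
    out[1] = (uint8_t)((x >> 16) & 0xff);
    out[2] = (uint8_t)((x >> 8) & 0xff);
    out[3] = (uint8_t)(x & 0xff);
}

/* Variant B: one 32-bit store followed by byte loads (the LDRB path).
 * Assumption: little-endian layout, so mem[3] holds the top byte. */
void bytes_by_load(uint32_t x, uint8_t out[4])
{
    uint8_t mem[4];

    memcpy(mem, &x, sizeof(mem)); /* stands in for the 32-bit store */
    out[0] = mem[3];              /* each of these stands in for one LDRB */
    out[1] = mem[2];
    out[2] = mem[1];
    out[3] = mem[0];
}

/* Hypothetical helper: check that both variants agree for a given word. */
void check_extraction(uint32_t x)
{
    uint8_t a[4], b[4];

    bytes_by_shift(x, a);
    bytes_by_load(x, b);
    assert(memcmp(a, b, sizeof(a)) == 0);
}
```

The two variants are interchangeable in terms of results. The difference is purely in cost: per the table cited above, each byte load stalls for 2 cycles on Epiphany, which outweighs the saved shift and AND instructions.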

> On Thu, Jul 25, 2013 at 5:50 AM, Solar Designer <solar@...nwall.com> wrote:
> > On Thu, Jul 25, 2013 at 06:02:52AM +0400, Solar Designer wrote:
> > > |         "imadd r44, r44, r46\n" \
> >
> > With 3 in r46, this computes r44 + r44*3 = r44*4, i.e. it simply shifts
> > r44 left by 2 bits while "off-loading" this operation to the FPU (which
> > would otherwise be idle).  Good.
> > However, note that we could also make use of the addition: put 4 (not 3)
> > in r46 (or rather in a compiler-allocated register, as I pointed out in
> > another message) and use something like:
> >
> >         imadd r44, r20, r46
> >
> > (but with the compiler-allocated register).
> >
> > > |         "ldr r44, [r20, +r44]\n" \
> >
> > This would then become:
> >
> >         ldr r44, [r44]
> >
> > ... but I'd expect this version of code to run at the exact same speed
> > on current Epiphany, as I think there's no penalty for using the adder
> > in address calculation.  The power consumption may be slightly lower,
> > though, since we'd be keeping this adder idle during this one cycle.
> >
> 
> If I'm not mistaken, this would require moving the pointer to S[3] into
> r20 before every BF_ROUND. IMADD Rd, Rn, Rm does this: Rd = Rd + Rm*Rn. So
> the instruction would look like this: imadd r20, r44, r46
> And before the next use of this instruction it would be necessary to put
> S[3] back in r20.

I did not verify this, but you're probably right.

Maybe we should use IMUL (by 4) then?
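
To make the operand roles concrete, here is a small C model of the arithmetic, assuming the IMADD semantics quoted above (Rd = Rd + Rn*Rm) and assuming IMUL computes Rd = Rn*Rm. The function names are hypothetical, and this models only the integer results, not Epiphany timing or register allocation.

```c
#include <assert.h>
#include <stdint.h>

/* Model of IMADD Rd, Rn, Rm as described in the thread: Rd = Rd + Rn*Rm. */
uint32_t imadd(uint32_t rd, uint32_t rn, uint32_t rm)
{
    return rd + rn * rm;
}

/* Assumed model of IMUL Rd, Rn, Rm: Rd = Rn*Rm. */
uint32_t imul(uint32_t rn, uint32_t rm)
{
    return rn * rm;
}

/* imadd r44, r44, r46 with r46 = 3: idx + idx*3 = idx*4. */
uint32_t scale_with_imadd3(uint32_t idx)
{
    return imadd(idx, idx, 3);
}

/* imul by 4: the same scaled index, leaving the base register untouched. */
uint32_t scale_with_imul4(uint32_t idx)
{
    return imul(idx, 4);
}

/* imadd r20, r44, r46 with r46 = 4: base + idx*4, but the result lands in
 * the base register, so the base would have to be reloaded afterwards. */
uint32_t address_with_imadd4(uint32_t base, uint32_t idx)
{
    return imadd(base, idx, 4);
}

/* imadd r44, r20, r46 with r46 = 4: idx + base*4, not the wanted address. */
uint32_t swapped_operands(uint32_t base, uint32_t idx)
{
    return imadd(idx, base, 4);
}

/* Hypothetical helper: both scaling variants give the same byte offset. */
void check_scaling(uint32_t idx)
{
    assert(scale_with_imadd3(idx) == scale_with_imul4(idx));
}
```

Under these assumptions, IMUL by 4 gives the same scaled index as IMADD with 3, and the addition of the base can stay in the load's address calculation, so r20 does not need to be restored.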

Alexander
