Message-ID: <20130727153617.GI27483@openwall.com>
Date: Sat, 27 Jul 2013 19:36:17 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Parallella: bcrypt

Katja,

On Fri, Jul 26, 2013 at 01:55:00PM +0200, Katja Malvoni wrote:
> On Thu, Jul 25, 2013 at 5:35 PM, Solar Designer <solar@...nwall.com> wrote:
> 
> > Even if not, we can do a 32-bit store and we still save 1 cycle.  Right
> > now, we have these two right shifts and two ANDs.  We replace these four
> > with one 32-bit store (if we have to, but I think we don't - see above),
> > two LDRBs, and two IMULs or IMADDs (but these are free for us, since the
> > FPU would otherwise be idle).  So that's 3 (or 2) non-free instructions
> > instead of 4.
> 
> I implemented only one shift right (tmp2) and reordered instructions so
> that dual-issue is possible but it's much slower (793 c/s) than lsr
> followed by and (976 c/s). In Epiphany Architecture Reference, p. 68, Table
> 27 says that Byte Internal Data Load stalls for 2 cycles independently of
> instruction sequence. It seems that we have a 2-cycle penalty for every
> LDRB. These two cycles don't explain the slowdown I get, though; it shouldn't
> be this big.

Ouch.  Yes, you're right about LDRB, and also that the slowdown you
observed is larger than a 2-cycle stall per LDRB would explain.  Either
way, this makes my idea unusable.
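
For reference, the two ways of extracting an S-box index byte discussed
above can be sketched in plain C.  This is only an illustration, not the
actual Epiphany inline assembly; the function names are made up for this
sketch, and the byte-load variant assumes a little-endian target (which I
believe Epiphany is).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Current approach: one LSR plus one AND per S-box index byte. */
static unsigned byte_via_shift(uint32_t x, unsigned i)
{
    assert(i < 4);
    return (x >> (8 * i)) & 0xff;
}

/*
 * Abandoned idea: one 32-bit store, then a byte load (LDRB) per index.
 * Assumption: little-endian byte order, so byte i in memory is bits
 * 8*i .. 8*i+7 of the word.
 */
static unsigned byte_via_load(uint32_t x, unsigned i)
{
    unsigned char b[4];

    assert(i < 4);
    memcpy(b, &x, sizeof(b)); /* models the 32-bit store */
    return b[i];              /* models LDRB from the stored word */
}
```

Both functions return the same value for the same arguments on a
little-endian machine; the difference on Epiphany is only in cost, and
per Table 27 each LDRB stalls for 2 cycles, which is what kills the
second variant.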

> On Thu, Jul 25, 2013 at 5:50 AM, Solar Designer <solar@...nwall.com> wrote:
> > On Thu, Jul 25, 2013 at 06:02:52AM +0400, Solar Designer wrote:
> > > |         "imadd r44, r44, r46\n" \
> >
> > With 3 in r46, this simply shifts r44 left by 2 bits, but "off-loading"
> > this operation to the FPU (which would otherwise be idle).  Good.
> > However, note that we could also make use of the addition: put 4 (not 3)
> > in r46 (or rather in a compiler-allocated register, as I pointed out in
> > another message) and use something like:
> >
> >         imadd r44, r20, r46
> >
> > (but with the compiler-allocated register).
> >
> > > |         "ldr r44, [r20, +r44]\n" \
> >
> > This would then become:
> >
> >         ldr r44, [r44]
> >
> > ... but I'd expect this version of code to run at the exact same speed
> > on current Epiphany, as I think there's no penalty for using the adder
> > in address calculation.  The power consumption may be slightly lower,
> > though, since we'd be keeping this adder idle during this one cycle.
> >
> 
> If I'm not mistaken, this would require moving pointer to S[3] to r20
> before every BF_ROUND. IMADD Rd, Rn, Rm does this: Rd = Rd + Rm*Rn. So
> instruction would look like this: imadd r20, r44, r46
> And before next use of this instruction it would be necessary to put S[3]
> back in r20.

I did not verify this, but you're probably right.

Maybe we should use IMUL (by 4) then?
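
To spell out the three variants, here is a small C model of the
instruction semantics as described in this thread (IMADD Rd, Rn, Rm
computes Rd = Rd + Rn*Rm; IMUL Rd, Rn, Rm computes Rd = Rn*Rm).  The
function names are hypothetical and this is a sketch, not tested
Epiphany code.

```c
#include <stdint.h>

/* IMADD Rd, Rn, Rm: returns the new value of Rd, which is Rd + Rn * Rm. */
static uint32_t imadd(uint32_t rd, uint32_t rn, uint32_t rm)
{
    return rd + rn * rm;
}

/* IMUL Rd, Rn, Rm: returns the new value of Rd, which is Rn * Rm. */
static uint32_t imul(uint32_t rn, uint32_t rm)
{
    return rn * rm;
}

/* Current code: imadd r44, r44, r46 with 3 in r46 gives r44 * 4. */
static uint32_t offset_via_imadd3(uint32_t idx)
{
    return imadd(idx, idx, 3);
}

/*
 * Folding the base in: imadd r20, r44, r46 with 4 in r46 gives
 * base + idx * 4, but the result lands in the register that held the
 * base pointer, so the pointer has to be reloaded before the next round.
 */
static uint32_t addr_via_imadd4(uint32_t base, uint32_t idx)
{
    return imadd(base, idx, 4);
}

/* Alternative: imul by 4 leaves the base register untouched. */
static uint32_t offset_via_imul4(uint32_t idx)
{
    return imul(idx, 4);
}
```

With IMUL the scaled index still goes through "ldr r44, [r20, +r44]",
so r20 keeps pointing at the S-box across rounds.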

Alexander
