Message-ID: <20130727153617.GI27483@openwall.com>
Date: Sat, 27 Jul 2013 19:36:17 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Parallella: bcrypt

Katja,

On Fri, Jul 26, 2013 at 01:55:00PM +0200, Katja Malvoni wrote:
> On Thu, Jul 25, 2013 at 5:35 PM, Solar Designer <solar@...nwall.com> wrote:
>
> > Even if not, we can do a 32-bit store and we still save 1 cycle. Right
> > now, we have these two right shifts and two ANDs. We replace these four
> > with one 32-bit store (if we have to, but I think we don't - see above),
> > two LDRBs, and two IMULs or IMADDs (but these are free for us, since the
> > FPU would otherwise be idle). So that's 3 (or 2) non-free instructions
> > instead of 4.
>
> I implemented only one shift right (tmp2) and reordered instructions so
> that dual-issue is possible but it's much slower (793 c/s) than lsr
> followed by and (976 c/s). In Epiphany Architecture Reference, p. 68, Table
> 27 says that Byte Internal Data Load stalls for 2 cycles independently of
> instruction sequence. It seems that we have 2 cycles penalty for every
> LDRB. Although these two cycles don't explain slowdown I get, it shouldn't
> be this big.

Ouch. Yes, you're right about LDRB (as well as about the slowdown you
observed being excessive per that description). This makes my idea
unusable.

> On Thu, Jul 25, 2013 at 5:50 AM, Solar Designer <solar@...nwall.com> wrote:
> > On Thu, Jul 25, 2013 at 06:02:52AM +0400, Solar Designer wrote:
> > > |     "imadd r44, r44, r46\n" \
> >
> > With 3 in r46, this simply shifts r44 left by 2 bits, but "off-loading"
> > this operation to the FPU (which would otherwise be idle). Good.
> > However, note that we could also make use of the addition: put 4 (not 3)
> > in r46 (or rather in a compiler-allocated register, as I pointed out in
> > another message) and use something like:
> >
> >     imadd r44, r20, r46
> >
> > (but with the compiler-allocated register).
> >
> > > |     "ldr r44, [r20, +r44]\n" \
> >
> > This would then become:
> >
> >     ldr r44, [r44]
> >
> > ... but I'd expect this version of code to run at the exact same speed
> > on current Epiphany, as I think there's no penalty for using the adder
> > in address calculation. The power consumption may be slightly lower,
> > though, since we'd be keeping this adder idle during this one cycle.
> >
>
> If I'm not mistaken, this would require moving pointer to S[3] to r20
> before every BF_ROUND. IMADD Rd, Rn, Rm does this: Rd = Rd + Rm*Rn. So
> instruction would look like this: imadd r20, r44, r46
> And before next use of this instruction it would be necessary to put S[3]
> back in r20.

I did not verify this, but you're probably right. Maybe we should use
IMUL (by 4) then?

Alexander
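
To make the register arithmetic discussed above easier to follow, here is a
small C model of it. This is only a sketch under stated assumptions: it takes
the IMADD semantics exactly as quoted in this thread (Rd = Rd + Rm*Rn) and
assumes IMUL computes Rd = Rn * Rm. The function names and the base address
0x1000 are hypothetical, chosen for illustration; none of this is the actual
kernel code or verified against the hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Model of IMADD as described in the thread: Rd = Rd + Rn * Rm.
 * Takes the old Rd value and returns the new Rd value. */
uint32_t imadd(uint32_t rd, uint32_t rn, uint32_t rm)
{
    return rd + rn * rm;
}

/* Assumed model of IMUL: Rd = Rn * Rm. */
uint32_t imul(uint32_t rn, uint32_t rm)
{
    return rn * rm;
}

/* "imadd r44, r44, r46" with 3 in r46:
 * r44 = r44 + r44 * 3 = 4 * r44, i.e. a left shift by 2 bits. */
uint32_t index_to_offset_imadd(uint32_t index)
{
    return imadd(index, index, 3);
}

/* The IMUL (by 4) alternative: same result, no accumulator involved. */
uint32_t index_to_offset_imul(uint32_t index)
{
    return imul(index, 4);
}

/* "imadd r20, r44, r46" with 4 in r46:
 * r20 = r20 + r44 * 4. The address lands in r20, so the S-box base
 * pointer held there is overwritten and has to be reloaded before
 * the next use. */
uint32_t address_imadd(uint32_t base, uint32_t index)
{
    return imadd(base, index, 4);
}

/* Checks the identities above for every possible S-box index (0..255). */
void check_models(void)
{
    const uint32_t base = 0x1000u; /* hypothetical S-box base address */
    for (uint32_t i = 0; i < 256; i++) {
        assert(index_to_offset_imadd(i) == (i << 2));
        assert(index_to_offset_imul(i) == (i << 2));
        assert(address_imadd(base, i) == base + 4u * i);
    }
}
```

In this model the multiplier 3 works only because the accumulator and the
index are the same register. Once the base pointer is folded in, the
accumulator has to be the base register, which is why the result would
overwrite it.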