john-dev - Re: Parallella: bcrypt



Message-ID: <20130725153558.GA15090@openwall.com>
Date: Thu, 25 Jul 2013 19:35:58 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Parallella: bcrypt

Katja, Yaniv -

On Thu, Jul 25, 2013 at 08:06:48AM +0400, Solar Designer wrote:
> Maybe you'll come up with another clever/crazy idea on how to do right
> shifts with Epiphany's FPU instructions (like I mentioned, replacing one
> right shift with multiple FPU instructions is OK).

Here's an idea: use LDRB (load byte) instead of the right shifts by 6
and 14 bits, then use IMUL or IMADD to shift left by 2 bits (emulate the
non-existent index scaling on further loads, off-loading it to the FPU).

I think Epiphany's ISA registers are memory-mapped, so we can use LDRB
directly from the address of the register holding the L or R variable.
Yaniv - is this correct?

Even if not, we can do a 32-bit store to memory first and still save 1
cycle.  Right now, we have these two right shifts and two ANDs.  We
replace these four with one 32-bit store (only if we have to, and I
think we don't - see above), two LDRBs, and two IMULs or IMADDs (these
are free for us, since the FPU would otherwise be idle).  So that's 3
(or 2) non-free instructions instead of 4.
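
To illustrate the equivalence, here is a minimal C sketch (not the
actual JtR code).  It assumes the two ANDs use the mask 0x3FC, so that
each shift-and-AND pair yields one byte of the value pre-scaled by 4,
and it assumes a little-endian memory layout, which I believe Epiphany
uses.  The function names are made up for this example.

```c
#include <stdint.h>

/* Current form: right shift plus AND.  With shift 6 this gives byte 1
 * of x times 4, with shift 14 it gives byte 2 of x times 4.
 * The 0x3FC mask is an assumption of this sketch. */
static uint32_t index_shift_and(uint32_t x, unsigned shift)
{
    return (x >> shift) & 0x3FC;
}

/* Proposed form: byte load (LDRB) from the stored copy of x, then a
 * multiply by 4 (the part that would go to IMUL or IMADD).
 * Assumes little-endian byte order, so byte_no 1 is bits 8..15. */
static uint32_t index_byte_load(const uint32_t *stored, unsigned byte_no)
{
    const unsigned char *p = (const unsigned char *)stored;
    return (uint32_t)p[byte_no] * 4;
}
```

For any x, index_shift_and(x, 6) should match index_byte_load(&x, 1),
and index_shift_and(x, 14) should match index_byte_load(&x, 2).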

Yaniv - which is better: IMUL followed by LDR with index, or IMADD
followed by simple LDR (no index)?  In other words, is it better to use
the adder in the IALU or the adder on the FPU for our address
calculation, when we have the choice to use either?  I think the code will run at the
same speed either way, but maybe there's a difference in power usage and
heat production by the chip?
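
In C terms, the two ways of splitting the address calculation look
roughly like this.  This is only a sketch of the arithmetic, with
made-up function names, and it says nothing about how a compiler would
actually schedule the instructions.

```c
#include <stdint.h>

/* Scale only on the FPU (a multiply), and let the load's addressing
 * mode add the table base: the add happens in the IALU. */
static uint32_t lookup_scale_then_indexed(const uint32_t *sbox, uint32_t byte)
{
    uint32_t offset = byte * 4;
    return *(const uint32_t *)((const unsigned char *)sbox + offset);
}

/* Scale and add the table base in one multiply-add on the FPU, then
 * load from the finished address with no index. */
static uint32_t lookup_madd_then_plain(const uint32_t *sbox, uint32_t byte)
{
    uintptr_t addr = (uintptr_t)sbox + (uintptr_t)byte * 4;
    return *(const uint32_t *)addr;
}
```

Both return sbox[byte] for byte in 0..255, so the choice only affects
which unit performs the addition.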

Katja - I don't mention the right shift by 22 bits above, because this
one is easily replaced with right shift by 24 and IMUL (or IMADD) as I
pointed out in another message.  So we avoid the AND for this one even
without having to use LDRB.  You may try the LDRB approach for it as
well, but I think the right shift by 24 approach will result in either
the same or slightly better speed (I think loads have 1 cycle greater
latency than LSR, so they leave you a little less flexibility in
instruction scheduling).
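
A quick C sketch of why the AND becomes unnecessary for the top byte,
again assuming the 0x3FC mask and using made-up function names:

```c
#include <stdint.h>

/* Current form: right shift by 22, then AND to clear the low 2 bits. */
static uint32_t top_index_shift_and(uint32_t x)
{
    return (x >> 22) & 0x3FC;
}

/* Replacement: right shift by 24 leaves only the top byte, and the
 * multiply by 4 (IMUL or IMADD on the FPU) restores the scaling.
 * Nothing is left above bit 9, so no AND is needed. */
static uint32_t top_index_shift_mul(uint32_t x)
{
    return (x >> 24) * 4;
}
```

The two functions return the same value for every 32-bit x.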

Alexander
