john-dev - Re: Parallella: bcrypt

Follow @Openwall on Twitter for new release announcements and other news

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20130725035032.GA12466@openwall.com>
Date: Thu, 25 Jul 2013 07:50:32 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Parallella: bcrypt

Katja -

On Thu, Jul 25, 2013 at 06:02:52AM +0400, Solar Designer wrote:
> |         "imadd r44, r44, r46\n" \

With 3 in r46, this simply shifts r44 left by 2 bits, but "off-loading"
this operation to the FPU (which would otherwise be idle).  Good.
However, note that we could also make use of the addition: put 4 (not 3)
in r46 (or rather in a compiler-allocated register, as I pointed out in
another message) and use something like:

	imadd r44, r20, r46

(but with the compiler-allocated register).

> |         "ldr r44, [r20, +r44]\n" \

This would then become:

	ldr r44, [r44]

... but I'd expect this version of code to run at the exact same speed
on current Epiphany, as I think there's no penalty for using the adder
in address calculation.  The power consumption may be slightly lower,
though, since we'd be keeping this adder idle during this one cycle.

Now, how about trying to come up with a clever/crazy sequence of IMADD,
IMSUB, IMUL, IADD, or/and ISUB to replace those right shifts?  It is OK
to replace one LSR with multiple of these instructions because we're
otherwise mostly keeping the FPU idle.  However, we'll need to implement
interleaving in order to hide the increased latencies.

Can the right shift by 22:

	y = x >> 22;

be replaced with:

	y = (2 << 22) - x * 0xffbfffff;

where 0xffbfffff is the multiplicative inverse for ((2 << 22) - 1).
Would this work well enough for our purpose (where we only use 8 bits of
the result, so don't need precision beyond that)?  Note that, if needed,
we can apply the AND mask before rather than after the simulated right
shift - we have this flexibility.

http://www.hackersdelight.org/magic.htm

Another idea: rather than do:

	tmp4 = L >> 22; \
	tmp4 &= 0x3FC; \

we can shift by 24 bits, which eliminates the need for a mask (we shift
the 2 lower bits out, and the bits being shifted in from the left are
all zeroes), and we can shift back by 2 bits using IMUL or IMADD (which
we get for free, unlike the AND that we're saving).  I think this is
even better than the previous idea, and maybe you can still use the
"magic" approach for the right shift by 14 and/or by 6 (where we can't
avoid the AND)?

Of course, all of this assumes plenty of parallellism, which we'll need
to obtain by interleaving multiple bcrypt's.

Alexander

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.