Message-ID: <20140421224147.GA29679@openwall.com>
Date: Tue, 22 Apr 2014 02:41:47 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: ZedBoard: bcrypt

Katja,

On Mon, Apr 14, 2014 at 04:38:57PM +0400, Solar Designer wrote:
> On Mon, Apr 14, 2014 at 03:53:50PM +0400, Solar Designer wrote:
> > With 4 bcrypt instances per core, you need 20 reads per round.  With 2
> > cycles/round, that's 10 reads per cycle, needing 5 BRAMs.  Maybe you can
> > have:
> > 
> > Cycle 0:
> > initiate S0, S1 lookups for instances 0, 1 (total: 4 lookups)
> > initiate S2, S3 lookups for instances 2, 3 (total: 4 lookups)
> > initiate P lookups for instances 0, 1 (total: 2 lookups)
> > (total: 10 lookups)
> > Cycle 1:
> > initiate S2, S3 lookups for instances 0, 1 (total: 4 lookups)
> > initiate S0, S1 lookups for instances 2, 3 (total: 4 lookups)
> > initiate P lookups for instances 2, 3 (total: 2 lookups)
> > (total: 10 lookups)
> > 
> > with the computation also spread across the two cycles as appropriate
> > (and maybe you can reuse the same 32-bit adders across bcrypt instances,
> > although the cost of extra MUXes is likely to kill the advantage).
> 
> BTW, in the example above the data fits into different BRAMs in an
> intuitive manner: each one of the four bcrypt instances can have its
> S-boxes in its own BRAM block, with nothing else in it.  The four
> P-boxes all fit in a fifth BRAM block.  So this may be the winning
> approach in terms of simplicity and elegance.

It occurred to me (earlier today) that in the above approach we're
unnecessarily wasting half of the parallelism inherent to bcrypt: if we
can do all 5 lookups (4 S-box lookups and 1 P-box lookup)
simultaneously, then probably we should.  When we're artificially
splitting this across two cycles, we're then wasting half the BRAM space
to regain the equivalent amount of parallelism by bringing in more
instances.  We're probably also increasing communication delays as a
side-effect of having more instances, as you noted.
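
To make the port budget explicit, here's a trivial C model of the
arithmetic (purely an illustration, not code for the cores): each round
needs 4 S-box reads plus 1 P-box read per instance, and 5 dual-ported
BRAMs give 10 reads per cycle.

#include <stdio.h>

/* Back-of-the-envelope BRAM read port budget (illustration only) */
int main(void)
{
	const int reads_per_round = 4 + 1;	/* 4 S-box + 1 P-box */
	const int ports_per_cycle = 5 * 2;	/* 5 dual-ported BRAMs per core */
	int instances;

	for (instances = 2; instances <= 4; instances += 2)
		printf("%d instances/core: %d reads/round -> %d cycle(s)/round\n",
		    instances, instances * reads_per_round,
		    instances * reads_per_round / ports_per_cycle);

	return 0;
}

Either way the 10 read ports are fully used; the difference is in how
many instances it takes to use them.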

What we can probably have instead is a single-cycle implementation of a
bcrypt instance, so we'll fully use the BRAM ports with only 2
instances/core (a rough C sketch of one such round follows the list
below):

On each cycle:
	Compute 2x 4 S-box lookup indices from previous cycle's lookup results
	Initiate 2x 4 S-box lookups from 4 BRAMs
	Initiate 2x P-box lookups from 1 BRAM
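
In C terms, one such cycle amounts to one full Blowfish round (made-up
names, not meant as code for the cores; in hardware the two instances
would read S0..S3 through the two ports of BRAMs 0..3, with both P
arrays sharing the fifth BRAM):

#include <stdint.h>

/* One bcrypt (Blowfish) round, done in one clock cycle per instance in
 * the proposed design.  Illustration only. */
struct bcrypt_state {
	uint32_t S[4][256];	/* S-boxes (BRAM contents) */
	uint32_t P[18];		/* P array (fifth BRAM) */
	uint32_t L, R;		/* Feistel halves (flip-flops) */
};

void bcrypt_round(struct bcrypt_state *s, unsigned n)
{
	uint32_t x = s->L ^ s->P[n];		/* 1 P-box read */
	uint32_t f = ((s->S[0][x >> 24] +	/* 4 S-box reads, */
	    s->S[1][(x >> 16) & 0xff]) ^	/* then add, xor, add */
	    s->S[2][(x >> 8) & 0xff]) +
	    s->S[3][x & 0xff];
	s->L = s->R ^ f;	/* both halves latched at end of cycle */
	s->R = x;
}

In the 1 cycle/round version, everything from the BRAM outputs through
the adds and XORs to the next address inputs is combinatorial within
one cycle; that's the add/xor/add path discussed below.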

Now, the computation might be slightly slower: it's add, xor, add done
sequentially on the same cycle, right before the BRAM lookups.  However,
I think it's still fast enough to keep the clock rate sane, and on the
bright side we're keeping this circuit busy on every clock cycle, so we
need fewer circuits like this for the same total performance.

Overall, I expect a reduction in LUT utilization compared to the 4
instances/core approach, which might be handy in terms of power
consumption reduction on Zynq and/or in terms of getting things to fit
on Spartan-6 LX150 (not sure whether BRAMs or LUTs will turn out to be
the scarce resource for us there).

In terms of expected performance excluding communication and host
overhead and assuming the same clock rate, this should be exactly the
same as your current 28 cores, 112 instances.  We'll have 28 cores too,
but they will contain only 56 instances (and will use less space per
BRAM for now).  The reduction in instance count is fully compensated for
by the reduction from 2 cycles/round to 1 cycle/round.  What we gain is
needing to communicate with fewer instances, maybe reducing the
overhead.  We also gain the likely reduction in LUT utilization
mentioned above, and most importantly:

A possible next step is to introduce 4 instances/core on top of this,
gaining an extra cycle of latency that we can use to increase the clock
rate.  One way to use this extra cycle is to move the entire add/xor/add
thing onto this cycle (interleave the four instances such that when two
do lookups, the other two do computation, and vice versa).  Another way
to use it is to enable BRAMs' output registers.  Either of these may
allow for a higher clock rate (and it's unclear which one is faster).
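
As a sketch of the first of these two options, the same round can be
cut into a lookup half and a compute half with registers in between
(again just illustrative C with made-up names, not the actual core):

#include <stdint.h>

/* Round split for the 4 instances/core variant: on any given cycle two
 * instances run lookup_half while the other two run compute_half, and
 * the pairs swap roles every cycle.  Illustration only. */
struct bcrypt_state4 {
	uint32_t S[4][256];	/* S-boxes (BRAM contents) */
	uint32_t P[18];		/* P array (shared fifth BRAM) */
	uint32_t L, R;		/* Feistel halves */
	uint32_t x, s0, s1, s2, s3;	/* registers between the two halves */
	unsigned n;		/* round counter */
};

void lookup_half(struct bcrypt_state4 *s)
{
	s->x = s->L ^ s->P[s->n];	/* 1 P-box read */
	s->s0 = s->S[0][s->x >> 24];	/* 4 S-box reads; results latched */
	s->s1 = s->S[1][(s->x >> 16) & 0xff];
	s->s2 = s->S[2][(s->x >> 8) & 0xff];
	s->s3 = s->S[3][s->x & 0xff];
}

void compute_half(struct bcrypt_state4 *s)
{
	uint32_t f = ((s->s0 + s->s1) ^ s->s2) + s->s3;	/* add, xor, add */
	s->L = s->R ^ f;
	s->R = s->x;
	s->n++;
}

Each instance still completes one round per 2 cycles, but the BRAM
access and the add/xor/add no longer share a cycle, and with 4
instances per core every BRAM port stays busy on every cycle, same as
in the 1 cycle/round version.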

So that's another 3 code versions to try (56 1-cycle instances, and 112
2-cycle instances done in two different ways).

On the bright side, I think there's no longer a reason to try the 6/7
idea I had posted last week, which would require a clock rate increase
by a factor of more than 1.5*112/120 = 1.4 to achieve any gain.  This
new idea above lets us have the extra cycle and the associated clock
rate increase without any penalty in terms of throughput per cycle.

Please feel free to experiment with this on the Ztex board and/or
ZedBoard when convenient.

Thanks,

Alexander
