john-dev - Re: ZedBoard: bcrypt

Message-ID: <20140716220347.GA12854@openwall.com>
Date: Thu, 17 Jul 2014 02:03:47 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: ZedBoard: bcrypt

Hi Katja,

Thank you for posting this update.

On Wed, Jul 16, 2014 at 01:34:39PM +0200, Katja Malvoni wrote:
> I implemented 35 instances of bcrypt on ZedBoard where each instance uses 3
> BRAMs and computes single bf round in one clock cycle. 2 BRAMs are used to
> store S-boxes and 1 BRAM to store other data (P-box, expanded key, salt and
> cost). Clock frequency is 71 MHz. Utilization is:
> Number of Slice Registers:                     5,911 out of 106,400    5%
> Number of Slice LUTs:                         27,837 out of  53,200   52%
> Number of occupied Slices:                   10,584 out of  13,300   79%
> Number of RAMB36E1/FIFO36E1s:           105 out of     140   75%
> 
> Performance is 3346 c/s (40.58 c/s for cost 12). It is lower than expected
> (I expected it would be a bit more than 3754 c/s, which was achieved with 70
> instances, 2 clock cycles per bf round). If I measure time, communication
> time is 4 ms while computation time is a bit more than 6 ms. This gives ~3500
> c/s (1/((4+6)/1000/35)), which is more than 3255 c/s. I don't know where this
> other time is spent.

Yes, it's puzzling why this runs slower than 70 2-cycle instances do.
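
To put a number on the missing time, here is a rough sketch of the arithmetic in Python. It assumes each batch computes one hash per instance and that communication and computation do not overlap; the names are illustrative, and since computation takes "a bit more than" 6 ms, the expected rate is an upper bound.

```python
# Figures quoted above for the 35-instance design (times in milliseconds).
INSTANCES = 35
COMM_MS = 4.0          # measured communication time per batch
COMP_MS = 6.0          # measured computation time per batch (lower bound)
MEASURED_CPS = 3346.0  # reported benchmark result

def batch_rate(instances, batch_ms):
    """Hashes per second if one batch of `instances` hashes takes `batch_ms`."""
    return instances / batch_ms * 1000.0

def implied_batch_ms(instances, cps):
    """Batch duration in ms implied by an observed hash rate."""
    return instances / cps * 1000.0

expected_cps = batch_rate(INSTANCES, COMM_MS + COMP_MS)
actual_ms = implied_batch_ms(INSTANCES, MEASURED_CPS)
unaccounted_ms = actual_ms - (COMM_MS + COMP_MS)

print(f"expected from timing: {expected_cps:.0f} c/s")
print(f"batch time implied by {MEASURED_CPS:.0f} c/s: {actual_ms:.2f} ms")
print(f"unaccounted per batch: {unaccounted_ms:.2f} ms")
```

Under these assumptions the gap is a bit under half a millisecond per batch, some of which is the "bit more" in the computation time.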

> If I fully use BRAMs, 46 instances fit. Performance in that case is 4075
> c/s (with -te=50). For cost 12 it is 53.28 c/s. Clock frequency is 71 MHz,
> utilization is:
> Number of Slice Registers:                     6,825 out of 106,400    6%
> Number of Slice LUTs:                         35,877 out of  53,200   67%
> Number of occupied Slices:                  11,129 out of  13,300   83%
> Number of RAMB36E1/FIFO36E1s:           138 out of     140   98%
> (everything was tested on the "zed" system)

This looks good, but as you're aware we're now getting 64.66 c/s at cost 12
with your earlier 112 instances design, after I modded my ZedBoard
yesterday adding the extra wire and 3 capacitors. ;-)

> I'll implement 56 instances with 4 BRAMs per core and see if these will
> perform as expected.

Yes, please.
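
For reference, a quick per-instance comparison of the cost 12 figures in this thread, sketched in Python. The 56-instance projection assumes linear scaling at the same 71 MHz clock, which is only an assumption until it's measured; the labels are illustrative.

```python
# (instances, c/s at cost 12) as quoted in this thread.
designs = {
    "35 x 1-cycle": (35, 40.58),
    "46 x 1-cycle": (46, 53.28),
    "112 earlier design (modded board)": (112, 64.66),
}

# Throughput contributed by each instance, in c/s.
per_instance = {name: cps / n for name, (n, cps) in designs.items()}

for name, rate in per_instance.items():
    print(f"{name}: {rate:.3f} c/s per instance")

# Hypothetical projection: 56 instances at the per-instance rate seen with 46,
# assuming the clock rate and overheads stay the same.
projected_56 = 56 * per_instance["46 x 1-cycle"]
print(f"projected 56 x 1-cycle: {projected_56:.2f} c/s")
```

If the per-instance rate holds, 56 instances would land close to the current 112-instance figure.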

BTW, I just found this abstract:

https://labh-curien.univ-st-etienne.fr/cryptarchi/workshop14/abstracts/zimmermann.pdf

They got 24 bcrypt cores running at 200 MHz on ZedBoard, thus claiming a
4x improvement over your results from last year.  They don't mention how
many cycles per round.
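
Since the cycles per round are unknown, the comparison can only be parameterized. A rough Python sketch of raw Blowfish round throughput follows. It ignores communication and any non-round cycles, and it assumes the 46-instance design runs at 1 cycle per round like the 35-instance one; the function name is illustrative.

```python
def rounds_per_second(cores, mhz, cycles_per_round):
    """Aggregate Blowfish rounds per second across all cores (idealized)."""
    return cores * mhz * 1e6 / cycles_per_round

# 46 instances at 71 MHz, assumed 1 cycle per round.
ours = rounds_per_second(46, 71, 1)

# 24 cores at 200 MHz from the abstract; cycles per round is not stated,
# so try a few hypothetical values.
ratios = {}
for cycles in (1, 2, 4):
    theirs = rounds_per_second(24, 200, cycles)
    ratios[cycles] = theirs / ours
    print(f"{cycles} cycle(s)/round: {theirs:.3g} rounds/s, {ratios[cycles]:.2f}x ours")
```

So whether 24 cores at 200 MHz come out ahead in raw round throughput depends entirely on that missing number.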

While your new results have already improved by a factor of ~4 for cost 5
and a factor of ~8 for cost 12, I think you'll also need to try to
increase the clock rate.  As I suggested in April, after you get 56
1-cycle instances working, a next step would be to turn that design into
112 2-cycle instances but with higher latency of BRAM lookups.  And
moving initial S-boxes into the FPGA bitstream (into each core) will
probably help reduce not only communication overhead, but also the
longest path which is currently limiting the clock rate.  In fact,
perhaps this is something you'd need to do before going for higher
latency BRAM lookups.

Alexander