Message-ID: <20140716220347.GA12854@openwall.com>
Date: Thu, 17 Jul 2014 02:03:47 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: ZedBoard: bcrypt

Hi Katja,

Thank you for posting this update.

On Wed, Jul 16, 2014 at 01:34:39PM +0200, Katja Malvoni wrote:
> I implemented 35 instances of bcrypt on ZedBoard where each instance uses 3
> BRAMs and computes single bf round in one clock cycle. 2 BRAMs are used to
> store S-boxes and 1 BRAM to store other data (P-box, expanded key, salt and
> cost). Clock frequency is 71 MHz. Utilization is:
> Number of Slice Registers:         5,911 out of 106,400    5%
> Number of Slice LUTs:             27,837 out of  53,200   52%
> Number of occupied Slices:        10,584 out of  13,300   79%
> Number of RAMB36E1/FIFO36E1s:        105 out of     140   75%
>
> Performance is 3346 c/s (40.58 c/s for cost 12). It is lower than expected
> (I expected it will be bit more than 3754 c/s which was achieved with 70
> instances, 2 clock cycles per bf round). If I measure time, communication
> time is 4 ms while computation time is bit more than 6 ms. This gives ~3500
> c/s which is more than 3255 c/s (1/((4+6)/1000/35). I don't know where this
> other time is spent.

Yes, it's puzzling why this runs slower than 70 2-cycle instances do.

> If I fully use BRAMs, 46 instances fit. Performance in that case is 4075
> c/s (with -te=50). For cost 12 it is 53.28 c/s. Clock frequency is 71 Mhz,
> utilization is:
> Number of Slice Registers:         6,825 out of 106,400    6%
> Number of Slice LUTs:             35,877 out of  53,200   67%
> Number of occupied Slices:        11,129 out of  13,300   83%
> Number of RAMB36E1/FIFO36E1s:        138 out of     140   98%
> (everything was tested on the "zed" system)

This looks good, but as you're aware we're now getting 64.66 c/s at
cost 12 with your earlier 112 instances design, after I modded my
ZedBoard yesterday adding the extra wire and 3 capacitors. ;-)

> I'll implement 56 instances with 4 BRAMs per core and see if these will
> perform as expected.

Yes, please.

BTW, I just found this abstract:

https://labh-curien.univ-st-etienne.fr/cryptarchi/workshop14/abstracts/zimmermann.pdf

They got 24 bcrypt cores running at 200 MHz on ZedBoard, thus claiming
a 4x improvement over your results from last year.  They don't mention
how many cycles per round.

While your new results have already improved by a factor of ~4 for
cost 5 and a factor of ~8 for cost 12, I think you'll also need to try
to increase the clock rate.  As I suggested in April, after you get 56
1-cycle instances working, a next step would be to turn that design into
112 2-cycle instances but with higher latency of BRAM lookups.  And
moving initial S-boxes into the FPGA bitstream (into each core) will
probably help reduce not only communication overhead, but also the
longest path which is currently limiting the clock rate.  In fact,
perhaps this is something you'd need to do before going for higher
latency BRAM lookups.

Alexander
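
For reference, the timing estimate quoted above (35 instances, about 4 ms of communication plus about 6 ms of computation per batch) can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the function name `throughput` and the assumption that communication and computation are strictly serial (no overlap) are illustrative and do not come from the actual host code.

```python
def throughput(instances, comm_s, compute_s):
    """Hashes per second when one batch of `instances` hashes costs
    comm_s + compute_s seconds, with no overlap between the two phases
    (an assumption made for this sketch)."""
    return instances / (comm_s + compute_s)


# Figures quoted in the message: 35 instances, ~4 ms communication,
# "bit more than" 6 ms computation, 3346 c/s measured.
estimate = throughput(35, 0.004, 0.006)
measured = 3346.0

# Time per batch implied by the measured rate, and the part of it that
# the 4 ms + 6 ms breakdown does not account for.
batch_time_measured = 35 / measured
unexplained_ms = (batch_time_measured - 0.010) * 1e3

print(f"estimated: {estimate:.0f} c/s")
print(f"measured:  {measured:.0f} c/s")
print(f"time not accounted for per batch: {unexplained_ms:.2f} ms")
```

Since the computation time was reported only as "bit more than 6 ms", some of the remaining gap may simply be that rounding rather than a separate source of overhead.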