|
Message-ID: <20131030091740.GA25154@openwall.com> Date: Wed, 30 Oct 2013 13:17:40 +0400 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Subject: Re: ZedBoard: bcrypt Katja, On Tue, Oct 29, 2013 at 07:48:35PM +0100, Katja Malvoni wrote: > At the moment performance is 602 c/s, maximum frequency is 100 MHz. What has contributed to doubling the performance (since your previous report)? I guess it could be performing the 4 S-box lookups all at once, but then you're giving high numbers of cycles per round anyway: > I can't get one cycle delay block RAM to work. I also tried using RAM > module from http://openwall.info/wiki/crypt-dev/files but on Zynq it has > delay of 2 cycles. Same is with all the others variants I tried. > Currently one BF round takes 3 cycles - two for reading data from S-box > (I'm using two block RAMs so all 4 values are fetched in those 2 cycles) > and one to compute L and R when data is available. I'm not sure I understand how you're counting cycles here. Let's look at one Blowfish round on its own. Are you doing this? - Cycle 0: initiate 4 S-box lookups Cycle 1: wait Cycle 2: compute new R; swap L and R Cycle 3: ready to start next round (initiate 4 S-box lookups, etc.) If so, does anything prevent you from optimizing this to? - Cycle 0: compute new R; swap L and R; initiate 4 S-box lookups Cycle 1: wait ... except possibly for the special cases of the first and the last round? In the first round, bypass some of the logic. After the last round, invoke the same logic, but bypass the S-box lookups. Then, if LUTs and not BRAMs are somehow the scarce resource anyway, you could optimize this further to: Cycle 0: instance 0: compute new R; swap L and R; initiate 4 S-box lookups Cycle 1: instance 1: compute new R; swap L and R; initiate 4 S-box lookups That is, implement two instances of bcrypt per core, so that your logic would be in use on every cycle, and there would be no wait-only cycles. That way, you'd achieve at least the same performance as you would with 1-cycle lookups, without having to actually make them work (which you say you ran into issues with). I say "at least" because, as a bonus, this interleaved approach might allow for a higher clock rate. Also, the swapping of L and R may be avoided by implementing odd and even rounds separately. We do this on CPU and it is obviously beneficial there, but it may or may not result in overall savings on FPGA. You may try both approaches. > git clone https://github.com/kmalvoni/JohnTheRipper -b master I need to take a look at your actual code, indeed. I wrote the above based on your e-mails and my own thoughts only, not on code. Thanks, Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.