Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20131030091740.GA25154@openwall.com>
Date: Wed, 30 Oct 2013 13:17:40 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: ZedBoard: bcrypt

Katja,

On Tue, Oct 29, 2013 at 07:48:35PM +0100, Katja Malvoni wrote:
> At the moment performance is 602 c/s, maximum frequency is 100 MHz.

What has contributed to doubling the performance (since your previous
report)?  I guess it could be performing the 4 S-box lookups all at
once, but then you're giving high numbers of cycles per round anyway:

> I can't get one cycle delay block RAM to work. I also tried using RAM
> module from http://openwall.info/wiki/crypt-dev/files but on Zynq it has
> delay of 2 cycles. Same is with all the others variants I tried.
> Currently one BF round takes 3 cycles - two for reading data from S-box
> (I'm using two block RAMs so all 4 values are fetched in those 2 cycles)
> and one to compute L and R when data is available.

I'm not sure I understand how you're counting cycles here.  Let's look
at one Blowfish round on its own.  Are you doing this? -

Cycle 0: initiate 4 S-box lookups
Cycle 1: wait
Cycle 2: compute new R; swap L and R

Cycle 3: ready to start next round (initiate 4 S-box lookups, etc.)

If so, does anything prevent you from optimizing this to? -

Cycle 0: compute new R; swap L and R; initiate 4 S-box lookups
Cycle 1: wait

... except possibly for the special cases of the first and the last
round?  In the first round, bypass some of the logic.  After the last
round, invoke the same logic, but bypass the S-box lookups.

Then, if LUTs and not BRAMs are somehow the scarce resource anyway, you
could optimize this further to:

Cycle 0: instance 0: compute new R; swap L and R; initiate 4 S-box lookups
Cycle 1: instance 1: compute new R; swap L and R; initiate 4 S-box lookups

That is, implement two instances of bcrypt per core, so that your logic
would be in use on every cycle, and there would be no wait-only cycles.

That way, you'd achieve at least the same performance as you would with
1-cycle lookups, without having to actually make them work (which you
say you ran into issues with).  I say "at least" because, as a bonus,
this interleaved approach might allow for a higher clock rate.

Also, the swapping of L and R may be avoided by implementing odd and
even rounds separately.  We do this on CPU and it is obviously
beneficial there, but it may or may not result in overall savings on
FPGA.  You may try both approaches.

> git clone https://github.com/kmalvoni/JohnTheRipper -b master

I need to take a look at your actual code, indeed.  I wrote the above
based on your e-mails and my own thoughts only, not on code.

Thanks,

Alexander

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.