
Message-ID: <20110520011624.GA11672@openwall.com>
Date: Fri, 20 May 2011 05:16:24 +0400
From: Solar Designer <solar@...nwall.com>
To: crypt-dev@...ts.openwall.com
Subject: Re: alternative approach

Yuri, David -

On Thu, May 19, 2011 at 06:32:33PM -0300, Yuri Gonzaga wrote:
> I did the first attempt. I think I spent more time than I expected mainly
> because the adaptation to Xilinx dev environment.

These results are very helpful, and I am pleased that you did this soon
enough considering the dev environment change.  Thank you!

> Device Utilization Summary (estimated values)

I looked at the HTML version of your message to figure this out.
(Normally, I read text/plain sections only.)  It's four columns:

> Logic Utilization
> Used
> Available
> Utilization

which contain, respectively (and so on for other lines):

> Number of Slice Registers
> 35
> 93120
> 0%

> Number of Slice LUTs
> 131
> 46560
> 0%

I find this puzzling.  We have 272 bits of internal state (not including
auxiliary variables for loop control, etc.)  How do they fit in 35
registers?

http://www.1core.com/library/digital/fpgalogiccells/ says we have four
1-bit registers per slice in a Virtex-5.  Has something changed in
Virtex-6?  (I doubt it.)  Or is this a terminology difference?  (I find
this more likely.)  Yuri, David - can one of you explain this to me,
please?

s[] is declared as:

reg [7:0] s [31:0];

Shouldn't this consume 256 registers right away?

As to 131 LUTs, this is believable.

> Number of fully used LUT-FF pairs
> 35
> 131
> 26%

The same info repeated - 35 flip-flops, but 131 LUTs.  Since there's an
equal number of flip-flops (aka registers?) and LUTs per slice, this
means that we're only using 26% of the flip-flops that we could have.
...but I am puzzled why we need so few of them.  I'd expect that we'd
have unused LUTs, not unused flip-flops (because we'd need more of them
than LUTs).

> Number of bonded IOBs
> 14
> 240
> 5%
>
> Number of BUFG/BUFGCTRLs
> 2
> 32
> 6%

What are these?
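As a quick accounting check on the 272-bit state figure quoted above, here is a
small sketch; the 8-bit widths of l and r are my assumption (inferred from the
8-bit element width of s[]), not something stated in the thread:

```python
# Where the 272 bits of internal state could come from.
# Assumption (mine): l and r are 8 bits each, matching s[]'s element width.
S_ENTRIES, S_WIDTH = 32, 8       # reg [7:0] s [31:0];
s_bits = S_ENTRIES * S_WIDTH     # 256 bits for the S-box array
l_bits = r_bits = 8              # the two half-block registers

print(s_bits)                    # 256 -- s[] alone
print(s_bits + l_bits + r_bits)  # 272 -- total state
```

If that accounting is right, s[] by itself should occupy 256 flip-flops, which
makes the 35-register figure look too low.  (One possibility, not confirmed in
this thread: synthesis tools commonly map an array like s[] onto LUT-based
distributed RAM rather than onto flip-flops, which would explain both the low
register count and the LUT-heavy utilization.)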
Could we reasonably use more of them by redefining our Blowfish-like
algorithm slightly?

> Maximum frequency: 261.356MHz

Assuming that we're LUT-bound and we can make use of all LUTs with this
(which might not be true), this gives us:

46560/131*261.356 = 92891

That's 93 billion two-round iterations per second (assuming sufficient
parallelism, which we'll provide).

We need to compare this against a CPU implementation.  (And maybe also
against GPU, but that's trickier.)  For a quad-core CPU at 3.5 GHz
capable of 3 instructions per cycle (on each core) running a
non-bitslice implementation, assuming 30 instructions per 2 rounds (the
best gcc-generated code I was able to get for an unrolled loop):

3500*4*3/30 = 1400

That's 1.4 billion two-round iterations per second.  So the speedup is:

92891/1400 = 66.35

More optimal code might use 24 instructions per 2 rounds (with 3-operand
instructions if those are supported by the arch, which saves some
moves):

3500*4*3/24 = 1750

92891/1750 = 53.08

For DES, we were getting a number of around 40 here - but comparing
actual performance (both FPGA and CPU), not theoretical.  So we have
some (theoretical) improvement over DES, but it's not as good as I would
have liked it to be (I wanted to see 100+ here).  Some further
improvement is possible by adding bit permutations.

Is "Virtex-6 xc6vlx75t-3ff484" the same as the device that David
mentioned getting 22 billion of DES encryptions per second for, though?
If it's of a different size or speed grade, that could easily turn my
66x estimate into 100x or 50x (for the actual device).

> For NROUNDS = 4 (same target device):

This is also very helpful, thanks!

> Number of Slice Registers
> 35
> 93120
> 0%

No more registers needed.  Makes sense.  (But the low number is just as
puzzling.)

> Number of Slice LUTs
> 183
> 46560
> 0%

More LUTs, but not 2x more.  Also makes sense.  The relatively small
increase in LUT count means that with a 2-round implementation, we waste
most LUTs on the state machine, right?
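The 2-round throughput arithmetic earlier in this message can be replayed with
a short script; the numbers are the ones quoted above (estimates, not
measurements), wrapped in two hypothetical helper functions of my own naming:

```python
# Back-of-the-envelope throughput estimates from this thread.
# All inputs are figures quoted in the message; nothing here is measured.

def fpga_rate_m(luts_total, luts_per_core, fmax_mhz):
    """Two-round iterations per second (in millions), assuming the design
    is LUT-bound and the whole device can be tiled with identical cores."""
    return luts_total / luts_per_core * fmax_mhz

def cpu_rate_m(ghz, cores, ipc, insns_per_2rounds):
    """Two-round iterations per second (in millions) for a scalar
    (non-bitslice) software implementation."""
    return ghz * 1000 * cores * ipc / insns_per_2rounds

fpga_2r = fpga_rate_m(46560, 131, 261.356)  # 2-round core on the xc6vlx75t
cpu_gcc = cpu_rate_m(3.5, 4, 3, 30)         # best gcc-generated code
cpu_opt = cpu_rate_m(3.5, 4, 3, 24)         # hand-optimized estimate

print(round(fpga_2r))                # 92891, i.e. ~93 billion/second
print(round(fpga_2r / cpu_gcc, 2))   # 66.35x speedup vs. gcc code
print(round(fpga_2r / cpu_opt, 2))   # 53.08x speedup vs. optimal code
```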
> Maximum frequency: 147.432MHz

This gives:

46560/183*147.432*2 = 75021

75 billion of 2-round iterations per second.  Slightly slower than for
the 2-round implementation.  But this is worth another try with our
closer-to-final Blowfish-like algorithm, if we choose to use one.  On
the other hand, the numbers so far suggest that 2-round is likely
better, and that if we simplify the state machine logic (possible or
not?) it will be better yet.

> About the pipelining, how can we deal with the fact that there are
> dependencies between r, l and s in calculations?
> Will each stage have to store locally r, l and s?

Yes.  We'll need the full set of registers for these for each pipeline
stage.  If right now we're LUT- rather than register-bound (which is
puzzling), then this makes sense (for a pipeline with 4 stages or so).

> The verilog code is attached, including simulation.

Thank you!  Some comments on it:

> `define PCADD(a, b, mask) (((a) ^ (b)) + (((a) & (b) & (mask)) << 1))

Are you sure this is good enough for the compiler to do the smart thing,
realizing that we have an adder with some holes punched in it?

> s[0] = 8'ha6;
> s[1] = 8'hac;
> s[2] = 8'hdb;
> s[3] = 8'hb7;
...
> s[28] = 8'ha5;
> s[29] = 8'h29;
> s[30] = 8'h40;
> s[31] = 8'h3e;

Do these initial value assignments consume any logic?  Perhaps they do.
If so, perhaps you can save some logic by having the initial state of
s[] uploaded from the host?

Did you verify correctness of your implementation in any way?  (Perhaps
by making sure it produces the same outputs for the same inputs as the C
implementation does.)

Thanks again,

Alexander
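[Editor's note: to see what the PCADD macro quoted above computes, here is a
small Python model of it (the function name is mine); with an all-ones mask it
reduces to a plain adder via the identity a + b = (a ^ b) + ((a & b) << 1),
while a 0 bit in the mask discards the carry generated at that bit position.]

```python
WIDTH = 32
MASKW = (1 << WIDTH) - 1

def pcadd(a, b, mask):
    """Python model of the Verilog macro
    `define PCADD(a, b, mask) (((a) ^ (b)) + (((a) & (b) & (mask)) << 1))
    i.e. an adder whose carry generation is punched out wherever
    mask has a 0 bit."""
    return ((a ^ b) + ((a & b & mask) << 1)) & MASKW

# With mask = all ones this is the textbook carry-save identity
# a + b = (a ^ b) + ((a & b) << 1), i.e. a plain 32-bit adder:
assert pcadd(0x12345678, 0x9abcdef0, MASKW) == (0x12345678 + 0x9abcdef0) & MASKW

# Clearing a mask bit drops the carry that both operands generate at
# that position: 1 + 1 with bit 0 masked gives 0, not 2.
print(pcadd(1, 1, MASKW))       # 2
print(pcadd(1, 1, MASKW & ~1))  # 0
```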