crypt-dev - Re: alternative approach



Message-ID: <20110604220510.GB6422@openwall.com>
Date: Sun, 5 Jun 2011 02:05:10 +0400
From: Solar Designer <solar@...nwall.com>
To: crypt-dev@...ts.openwall.com
Subject: Re: alternative approach

On Mon, May 30, 2011 at 11:42:55PM -0300, Yuri Gonzaga wrote:
> I synthesized the bflike again.
> Now it was for Spartan-6 xc6slx45, the same device as the pico e101.
> I got the following results (This time I could generate a report with more
> details).
> 
> I could generate a schematic view as well (available on http://bit.ly/koMcxo),
> but it is very big and difficult to follow. I don't know if it will help.

Thanks!  This is what I wanted to see, but you're right - it's so large
that it's difficult to figure anything out from it.  I think we'll need
to generate such schematic views for much smaller pieces of Verilog
code, not for the entire thing at once.

> > Oh, here's a simpler test: try replacing pcadd() with a simple addition
> > or simple XOR.  If the synthesizer was smart enough, this should not
> > change the LUT count.  If this does reduce the LUT count, then perhaps
> > there was room for improvement.
> 
> I replaced pcadd() with a simpler "a ^ b ^ mask" and it really did reduce
> the LUT count.

How large was the reduction?  Can you also try simple "a ^ b" (no mask)
and simple "a + b", and report the LUT counts for all four (original
full pcadd(), your "a ^ b ^ mask", and these two I suggested)?  These
numbers might give us some hints.
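For checking simulation results against the four variants, a small software reference model may help. The sketch below is only that: the real pcadd() lives in the Verilog source, which is not shown in this message, so the pcadd() here is an assumed "partial-carry add" written as (a ^ b) + (((a & b) & mask) << 1), and the 32-bit word size and all function names are likewise assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF  # assumed word size: 32 bits


def pcadd(a, b, mask):
    """Hypothetical partial-carry add (an assumption, not the actual Verilog).

    The first-level carries (a & b) are kept only where mask has a 1 bit,
    then added back in one position to the left.
    """
    return ((a ^ b) + (((a & b) & mask) << 1)) & MASK32


def xor_mask(a, b, mask):
    """The simplified replacement quoted above: a ^ b ^ mask."""
    return (a ^ b ^ mask) & MASK32


def xor_only(a, b, mask=0):
    """Plain XOR; the mask argument is accepted but ignored."""
    return (a ^ b) & MASK32


def add_only(a, b, mask=0):
    """Plain modular addition; the mask argument is accepted but ignored."""
    return (a + b) & MASK32


VARIANTS = {
    "pcadd": pcadd,
    "a ^ b ^ mask": xor_mask,
    "a ^ b": xor_only,
    "a + b": add_only,
}


def compare(a, b, mask):
    """Return the output of every variant for one set of inputs."""
    return {name: fn(a, b, mask) for name, fn in VARIANTS.items()}


if __name__ == "__main__":
    for name, value in compare(0x89ABCDEF, 0x01234567, 0x0F0F0F0F).items():
        print(f"{name:14s} {value:08x}")
```

Under this assumed definition, pcadd() with an all-ones mask equals plain addition and with a zero mask equals plain XOR, so the two suggested variants bracket it from both sides. The model says nothing about LUT counts; it is only meant for comparing expected outputs in simulation.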

> However, I didn't figure out how to improve pcadd() to do the right
> thing using fewer LUTs.
> All my attempts got wrong results in simulation.

Can you possibly generate a schematic view for pcadd() alone, with no
or little other logic (just enough to make sure parts of pcadd() aren't
optimized out as unused)?  And for other variations of it as suggested
above.  Then try to compare those.

> Number of Slice LUTs                        105     27,288         1%

It was 131 LUTs for your old code synthesized for Virtex-6.  Why has
this reduced to 105 now?  Is this for your simplified pcadd() (which
doesn't actually do the right thing)?

> Maximum Frequency: 76.940MHz

Somehow this is half of what you reported for EksBlowfish.
Sure, we're doing two rounds at once here, but the rounds are roughly
half as complex as Blowfish's (two S-box lookups vs. four).  Are lookups
from distributed memory (in LUTs) slower than those from BlockRAM?  Or
does pcadd() as currently synthesized have higher propagation delays
than Blowfish's simple xor and add?  Or is it the reduction in
parallelism (our two rounds are sequential, whereas Blowfish's four
S-box lookups for one round could theoretically occur in parallel)?
(Perhaps you don't have answers to these questions.  I am just thinking
aloud.)
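To put the reported frequency in per-round terms, the arithmetic can be sketched as follows. This assumes that both rounds complete within a single clock cycle, so the cycle time splits evenly between them; that is an assumption for illustration, not something the synthesis report states, and the names are made up here.

```python
def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


f_max_mhz = 76.940      # maximum frequency from the quoted report
rounds_per_cycle = 2    # assumption: two sequential rounds per clock cycle

cycle_ns = period_ns(f_max_mhz)
per_round_ns = cycle_ns / rounds_per_cycle

if __name__ == "__main__":
    print(f"cycle: {cycle_ns:.2f} ns, per round: {per_round_ns:.2f} ns")
```

This works out to a cycle of about 13.0 ns, or about 6.5 ns per round under the stated assumption.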

Anyway, I'd expect a comparable LUT count for EksBlowfish as well -
perhaps just a few times higher.

Thanks,

Alexander
