Message-ID: <20110604220510.GB6422@openwall.com>
Date: Sun, 5 Jun 2011 02:05:10 +0400
From: Solar Designer <solar@...nwall.com>
To: crypt-dev@...ts.openwall.com
Subject: Re: alternative approach

On Mon, May 30, 2011 at 11:42:55PM -0300, Yuri Gonzaga wrote:
> I synthesized again the bflike.
> Now, it is was to Spartan-6 xc6slx45, same device of pico e101.
> I got the following results (This time I could generate a report with more
> details).
>
> I could generate a schematic view as well (available on http://bit.ly/koMcxo),
> but is very big and dificult to track. I don't know if it will help.

Thanks! This is what I wanted to see, but you're right - it's so large
that it's difficult to figure anything out from it. I think we'll need
to generate such schematic views for much smaller pieces of Verilog
code, not for the entire thing at once.

> > Oh, here's a simpler test: try replacing pcadd() with a simple addition
> > or simple XOR. If the synthesizer was smart enough, this should not
> > change the LUT count. If this does reduce the LUT count, then perhaps
> > there was room for improvement.
>
> I replaced pcadd() for simpler "a ^ b ^ mask" and it really reduced the LUT
> count.

How large was the reduction? Can you also try simple "a ^ b" (no mask)
and simple "a + b", and report the LUT counts for all four (original
full pcadd(), your "a ^ b ^ mask", and these two I suggested)? These
numbers might give us some hints.

> However, I didn't figure out how to improve the pcadd() to do the right
> thing using less LUTs.
> All my attempts got wrong results in simulation.

Can you possibly generate a schematic view for pcadd() alone, with no
or little other logic (just enough to make sure parts of pcadd() aren't
optimized out as unused)? And for the other variations of it suggested
above. Then try to compare those.

> Number of Slice LUTs          105      27,288      1%

It was 131 LUTs for your old code synthesized for Virtex-6. Why has
this reduced to 105 now? Is this for your simplified pcadd() (which
doesn't actually do the right thing)?

> Maximum Frequency: 76.940MHz

Somehow this is about half of what you reported for EksBlowfish. Sure,
we're doing two rounds at once here, but each round is roughly half as
complex as Blowfish's (two S-box lookups vs. four). Are lookups from
distributed memory (in LUTs) slower than those from BlockRAM? Or does
pcadd() as currently synthesized have higher propagation delays than
Blowfish's simple xor and add? Or is it the reduction in parallelism
(our two rounds are sequential, whereas Blowfish's four S-box lookups
for one round could theoretically occur in parallel)?

(Perhaps you don't have answers to these questions. I am just thinking
aloud.)

Anyway, I'd expect a comparable LUT count for EksBlowfish as well -
perhaps just a few times higher.

Thanks,

Alexander