Message-ID: <20110520011624.GA11672@openwall.com>
Date: Fri, 20 May 2011 05:16:24 +0400
From: Solar Designer <solar@...nwall.com>
To: crypt-dev@...ts.openwall.com
Subject: Re: alternative approach

Yuri, David -

On Thu, May 19, 2011 at 06:32:33PM -0300, Yuri Gonzaga wrote:
> I did the first attempt. I think I spent more time than I expected mainly
> because the adaptation to Xilinx dev environment.

These results are very helpful, and I am pleased that you got this done
so soon considering the dev environment change. Thank you!

> Device Utilization Summary (estimated values)

I looked at the HTML version of your message to figure this out.
(Normally, I read text/plain sections only.) It's four columns:

> Logic Utilization
> Used
> Available
> Utilization

which contain, respectively (and so on for the other lines):

> Number of Slice Registers
> 35
> 93120
> 0%

> Number of Slice LUTs
> 131
> 46560
> 0%

I find this puzzling. We have 272 bits of internal state (not counting
auxiliary variables for loop control, etc.). How do they fit in 35
registers?

http://www.1-core.com/library/digital/fpga-logic-cells/

says we have four 1-bit registers per slice in a Virtex-5. Has
something changed in Virtex-6? (I doubt it.) Or is this a terminology
difference? (I find this more likely.) Yuri, David - can one of you
explain this to me, please?

s[] is declared as:

    reg [7:0] s [31:0];

Shouldn't this consume 256 registers right away?

As to 131 LUTs, this is believable.

> Number of fully used LUT-FF pairs
> 35
> 131
> 26%

The same info repeated - 35 flip-flops, but 131 LUTs. Since there's an
equal number of flip-flops (aka registers?) and LUTs per slice, this
means that we're only using 26% of the flip-flops that we could have.
...but I am puzzled why we need so few of them. I'd expect us to have
unused LUTs, not unused flip-flops (because we'd need more flip-flops
than LUTs).

> Number of bonded IOBs
> 14
> 240
> 5%
>
> Number of BUFG/BUFGCTRLs
> 2
> 32
> 6%

What are these? Could we reasonably use more of them by redefining our
Blowfish-like algorithm slightly?

> Maximum frequency: 261.356MHz

Assuming that we're LUT-bound and that we can make use of all LUTs with
this design (which might not be true), this gives us:

46560/131*261.356 = 92891

That's 93 billion two-round iterations per second (assuming sufficient
parallelism, which we'll provide).

We need to compare this against a CPU implementation. (And maybe also
against GPU, but that's trickier.) For a quad-core CPU at 3.5 GHz,
capable of 3 instructions per cycle on each core, running a
non-bitslice implementation, assuming 30 instructions per 2 rounds (the
best gcc-generated code I was able to get for an unrolled loop):

3500*4*3/30 = 1400

That's 1.4 billion two-round iterations per second. So the speedup is:

92891/1400 = 66.35

More optimal code might use 24 instructions per 2 rounds (with
3-operand instructions, if those are supported by the arch, which saves
some moves):

3500*4*3/24 = 1750

92891/1750 = 53.08

For DES, we were getting a number of around 40 here - but comparing
actual performance (both FPGA and CPU), not theoretical.

So we have some (theoretical) improvement over DES, but it's not as
good as I would have liked it to be (I wanted to see 100+ here). Some
further improvement is possible by adding bit permutations.

Is "Virtex6 xc6vlx75t-3ff484" the same device that David mentioned
getting 22 billion DES encryptions per second for, though? If it's of
a different size or speed grade, that could easily turn my 66x estimate
into 100x or 50x (for the actual device).

> For NROUNDS = 4 (same target device):

This is also very helpful, thanks!

> Number of Slice Registers
> 35
> 93120
> 0%

No more registers needed. Makes sense. (But the low number is just as
puzzling.)

> Number of Slice LUTs
> 183
> 46560
> 0%

More LUTs, but not 2x more. Also makes sense. The relatively small
increase in LUT count means that with a 2-round implementation we waste
most of the LUTs on the state machine, right?

> Maximum frequency: 147.432MHz

This gives:

46560/183*147.432*2 = 75021

That's 75 billion two-round iterations per second - slightly slower
than the 2-round implementation. But this is worth another try with our
closer-to-final Blowfish-like algorithm, if we choose to use one. On
the other hand, the numbers so far suggest that 2-round is likely
better, and that if we simplify the state machine logic (possible or
not?) it will be better yet.

> About the pipelining, how can we deal with the fact that there are
> dependencies between r, l and s in calculations?
> Will each stage have to store locally r, l and s?

Yes. We'll need the full set of registers for these for each pipeline
stage. If right now we're LUT- rather than register-bound (which is
puzzling), then this makes sense (for a pipeline with 4 stages or so).
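To make this concrete, the per-stage storage could look roughly like
the untested sketch below - just the registers, with the round logic
(which would compute the *_next values) left out, and with made-up
names. I'm assuming 8-bit l and r here, which is what 272 = 256 + 2*8
bits of state suggests.

    // Rough sketch only: each pipeline stage keeps its own copy of
    // l, r and the whole s[], and hands the updated copy to the next
    // stage on every clock. The round logic computing the *_next
    // values is not shown.
    module stage_regs (
        input  wire         clk,
        input  wire [7:0]   l_next, r_next,
        input  wire [255:0] s_next,  // 32 x 8 bits of s[], flattened
        output reg  [7:0]   l_q, r_q,
        output reg  [255:0] s_q
    );
        always @(posedge clk) begin
            l_q <= l_next;
            r_q <= r_next;
            s_q <= s_next;
        end
    endmodule

With 4 stages, that's 4*272 = 1088 flip-flops for the state alone.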
> The verilog code is attached, including simulation.

Thank you! Some comments on it:

> `define PCADD(a, b, mask) (((a) ^ (b)) + (((a) & (b) & (mask)) << 1))

Are you sure this is good enough for the compiler to do the smart
thing, realizing that we have an adder with some holes punched in it?

> s[0] = 8'ha6;
> s[1] = 8'hac;
> s[2] = 8'hdb;
> s[3] = 8'hb7;
...
> s[28] = 8'ha5;
> s[29] = 8'h29;
> s[30] = 8'h40;
> s[31] = 8'h3e;

Do these initial value assignments consume any logic? Perhaps they do.
If so, perhaps you can save some logic by having the initial state of
s[] uploaded from the host?

Did you verify the correctness of your implementation in any way?
(Perhaps by making sure it produces the same outputs for the same
inputs as the C implementation does.)

Thanks again,

Alexander
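P.S. Regarding uploading the initial s[] from the host: here's a rough,
untested sketch of what I mean - a small write port in place of the 32
constant assignments. Signal names are made up, and the round logic's
own updates to s[] (and the handshaking with the host) are left out.

    module sbox (
        input  wire       clk,
        input  wire       wr_en,   // host asserts this while loading
        input  wire [4:0] wr_addr, // which of the 32 entries to write
        input  wire [7:0] wr_data, // new 8-bit value for that entry
        input  wire [4:0] rd_addr, // driven by the round logic
        output wire [7:0] rd_data
    );
        reg [7:0] s [31:0];

        // the host writes the starting values in before the core runs,
        // so no initial constant assignments are needed
        always @(posedge clk)
            if (wr_en)
                s[wr_addr] <= wr_data;

        assign rd_data = s[rd_addr];  // combinational read
    endmodule

Whether this actually saves anything depends on whether those constant
assignments consume logic at all, which is the question above.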