
Date: Fri, 20 May 2011 05:16:24 +0400 From: Solar Designer <solar@...nwall.com> To: cryptdev@...ts.openwall.com Subject: Re: alternative approach Yuri, David  On Thu, May 19, 2011 at 06:32:33PM 0300, Yuri Gonzaga wrote: > I did the first attempt. I think I spent more time than I expected mainly > because the adaptation to Xilinx dev environment. These results are very helpful, and I am pleased that you did this soon enough considering the dev environment change. Thank You! > Device Utilization Summary (estimated values) I looked at the HTML version of your message to figure this out. (Normally, I read text/plain sections only.) It's four columns: > Logic Utilization > Used > Available > Utilization which contain, respectively (and so on for other lines): > Number of Slice Registers > 35 > 93120 > 0% > Number of Slice LUTs > 131 > 46560 > 0% I find this puzzling. We have 272 bits of internal state (not including auxiliary variables for loop control, etc.) How do they fit in 35 registers? http://www.1core.com/library/digital/fpgalogiccells/ says we have four 1bit registers per slice in a Virtex5. Has something changed in Virtex6? (I doubt it.) Or is this a terminology difference? (I find this more likely.) Yuri, David  can one of you explain this to me, please? s[] is declared as: reg [7:0] s [31:0]; Shouldn't this consume 256 registers right away? As to 131 LUTs, this is believable. > Number of fully used LUTFF pairs > 35 > 131 > 26% The same info repeated  35 flipflops, but 131 LUTs. Since there's an equal number of flipflops (aka registers?) and LUTs per slice, this means that we're only using 26% of the flipflops that we could have. ...but I am puzzled why we need so few of them. I'd expect that we'd have unused LUTs, not unused flipflops (because we'd need more of them than LUTs). > Number of bonded IOBs > 14 > 240 > 5% > > Number of BUFG/BUFGCTRLs > 2 > 32 > 6% What are these? Could we reasonably use more of them by redefining our Blowfishlike algorithm slightly? > Maximum frequency: 261.356MHz Assuming that we're LUTbound and we can make use of all LUTs with this (which might not be true), this gives us: 46560/131*261.356 = 92891 That's 93 billion tworound iterations per second (assuming sufficient parallelism, which we'll provide). We need to compare this against a CPU implementation. (And maybe also against GPU, but that's trickier.) For a quadcore CPU at 3.5 GHz capable of 3 instructions per cycle (on each core) running a nonbitslice implementation, assuming 30 instructions per 2 rounds (the best gccgenerated code I was able to get for an unrolled loop): 3500*4*3/30 = 1400 That's 1.4 billion tworound iterations per second. So the speedup is: 92891/1400 = 66.35 More optimal code might use 24 instructions per 2 rounds (with 3operand instructions if those are supported by the arch, which saves some moves): 3500*4*3/24 = 1750 92891/1750 = 53.08 For DES, we were getting a number of around 40 here  but comparing actual performance (both FPGA and CPU), not theoretical. So we have some (theoretical) improvement over DES, but it's not as good as I would have liked it to be (I wanted to see 100+ here). Some further improvement is possible by adding bit permutations. Is "Virtex6 xc6vlx75t3ff484" the same as the device that David mentioned getting 22 billion of DES encryptions per second for, though? If it's of a different size or speed grade, that could easily turn my 66x estimate into 100x or 50x (for the actual device). > For NROUNDS = 4 (same target device): This is also very helpful, thanks! > Number of Slice Registers > 35 > 93120 > 0% No more registers needed. Makes sense. (But the low number is just as puzzling.) > Number of Slice LUTs > 183 > 46560 > 0% More LUTs, but not 2x more. Also makes sense. The relatively small increase in LUT count means that with a 2round implementation, we waste most LUTs on the state machine, right? > Maximum frequency: 147.432MHz This gives: 46560/183*147.432*2 = 75021 75 billion of 2round iterations per second. Slightly slower than for the 2round implementation. But this is worth another try with our closertofinal Blowfishlike algorithm, if we choose to use one. On the other hand, the numbers so far suggest that 2round is likely better, and that if we simplify the state machine logic (possible or not?) it will be better yet. > About the pipelining, how can we deal with the fact that there are > dependencies between r, l and s in calculations? > Will each stage have to store locally r, l and s? Yes. We'll need the full set of registers for these for each pipeline stage. If right now we're LUT rather than registerbound (which is puzzling), then this makes sense (for a pipeline with 4 stages or so). > The verilog code is attached, including simulation. Thank you! Some comments on it: > `define PCADD(a, b, mask) (((a) ^ (b)) + (((a) & (b) & (mask)) << 1)) Are you sure this is good enough for the compiler to do the smart thing, realizing that we have an adder with some holes punched in it? > s[0] = 8'ha6; > s[1] = 8'hac; > s[2] = 8'hdb; > s[3] = 8'hb7; ... > s[28] = 8'ha5; > s[29] = 8'h29; > s[30] = 8'h40; > s[31] = 8'h3e; Do these initial value assignments consume any logic? Perhaps they do. If so, perhaps you can save some logic by having the initial state of s[] uploaded from the host? Did you verify correctness of your implemenation in any way? (Perhaps by making sure it produces the same outputs for the same inputs as the C implementation does.) Thanks again, Alexander
Powered by blists  more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.