Message-ID: <20110520043450.GA12135@openwall.com>
Date: Fri, 20 May 2011 08:34:50 +0400
From: Solar Designer <solar@...nwall.com>
To: crypt-dev@...ts.openwall.com
Subject: Re: alternative approach

On Fri, May 20, 2011 at 12:07:05AM -0300, Yuri Gonzaga wrote:
> Even so, I did some research on Google.
> According to [1], in section "CLBs, Slices, and LUTs", one Virtex-6
> slice has 4 LUTs/8 flip-flops.
> It looks like one LUT has up to 2 flip-flops.

Yes.  This appears to be a 2x improvement over Virtex-5.

> Maybe, some of the registers were merged into logic in the 131 LUTs.

It appears so.  This "Virtex-6 Family Overview" you found says that
"between 25-50% of all slices can also use their LUTs as distributed
64-bit RAM".  Per the table on page 2, for the device you synthesized
for, this percentage appears to be at around 35.9%.

Also, it appears that you synthesized for the smallest Virtex-6 device,
with 11,640 slices.  The largest described in that overview is 6.4 times
larger.  If my math is right, assuming the 66x figure from my previous
e-mail, it could give us a 420x advantage over a quad-core CPU, or as
much as 2500x for Pico's 6-chip board.

Would it be possible to generate some sort of diagram, to easily see how
those 131 LUTs are used?

That table on page 2 also gives the number of 36 Kbit Block RAMs - from
156 to 1064.  This should let us implement this many full bcrypt cores.
I think we should try to do just that, and perhaps we'd have plenty of
LUTs left for smaller/other cores (bflike and/or DES).  So please
proceed to implement bcrypt's performance-critical loop (as discussed
previously) for Virtex-6 using Block RAM.

Finally, there are lots of DSP slices - so many that it feels wrong not
to use them - from 288 to 2016.  Each DSP can process 48-bit data at
600 MHz.  It is unclear from this overview how much memory the DSPs can
access, and by what means, though.
If they can access an equivalent of arbitrary 48-bit registers (many of
them), then 2016 of those DSPs could outperform a quad-core dual-issue
128-bit SIMD CPU at bitslice DES by a factor of 15.

> > The relatively small increase in LUT count means that with a 2-round
> > implementation, we waste most LUTs on the state machine, right?
>
> I am not sure.  Maybe doing some experiments could clarify this
> question.
> For example, remove part of the state machine and see the impact on
> the result.

Yes, you could try.

Another guess: many LUTs are wasted on the s[] initial values.  Those
values could theoretically be packed into just four LUTs, but that would
probably not allow for them to be read out quickly.  So perhaps more
LUTs are wasted for that reason.

> > Are you sure this is good enough for the compiler to do the smart
> > thing, realizing that we have an adder with some holes punched in
> > it?
>
> And what do you suggest?  Find an equivalent function but simpler?

No, I thought you'd show this partial carry addition bit-by-bit, without
the redundancy that you expect the synthesizer to detect and optimize
out.  But maybe this is not needed.  If we could see (and understand) a
diagram of the generated circuit, we'd know for sure.  Or you can try
both ways and compare the LUT counts.

Oh, here's a simpler test: try replacing pcadd() with a simple addition
or a simple XOR.  If the synthesizer is smart enough, this should not
change the LUT count.  If this does reduce the LUT count, then perhaps
there was room for improvement.

> In any case, the synthesizer already does optimizations in logic (or
> at least, tries to).

Yes, perhaps it does that well enough.

> > Do these initial value assignments consume any logic?
>
> Yes.

> > If so, perhaps you can save some logic by having the initial state
> > of s[] uploaded from the host?
>
> It is possible.  However, for each calculation it will be necessary to
> upload them.

And that's fine.
We'll need to be supplying them anyway - if not from the host, then from
a SHA-256 core (or the like) implemented in the FPGA.

Also, the Niter of 512 that I had in my code was just for testing.  In
actual use, the iteration counts would be a lot higher - perhaps about
2 million (to compute our hashes in 10 ms at 200 MHz).  Uploading some
data every 10 ms is fine.  We'd need to do it anyway; the only question
is how much data - only the input to a SHA-256 core implemented in the
FPGA (which would in turn feed those other components), or the output of
SHA-256 or SHA-512 running on the host (inputs to each of the thousands
of cores individually).  I think that either approach would work.

BTW, I did some testing of bflike.c with millions of iterations -
detected no state repeats.

> Unless we upload just once and maintain them stored in some registers.

No, you're missing the fact that we actually need to provide inputs to
those cores.  L and R values are not sufficient - they're only 16-bit
together, so there will be lots of repeated values of them across the
cores and between iterations.  We want our cores to be computing
different things, not the same thing many times (which an attacker would
optimize out).

Oh, if we used different initial values of s[] for each core, then we
could possibly send/receive just the 16-bit blocks (and accept having
some collisions).  Yet this feels worse than sending/receiving s[].

Regarding correctness testing:

> I did exactly what you said.

Great!

> [1]
> http://www.dinigroup.com/product/data/DN-DualV6-PCIe-4/files/Virtex6_Overview_v11.pdf

Thanks,

Alexander