Message-ID: <20130907224212.GA12946@openwall.com>
Date: Sun, 8 Sep 2013 02:42:12 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: ZedBoard / Parallella: bcrypt

Katja,

On Sat, Sep 07, 2013 at 11:47:49AM +0200, Katja Malvoni wrote:
> I have implementation of bcrypt's most costly loop and behavioral
> simulation gives correct results. But when I put it in user part of IP core
> generated by Create and import peripheral wizard it produces incorrect
> result (code is attached, user_logic.v).
> And I can't figure out why. If I write something to block RAM from PL, host
> reads correct data. If I don't modify the contents of memory, host reads
> the same unmodified data. If PL reads data from one location in block RAM
> and writes to another one, host reads expected value. My guess was that
> there is a problem with reading/writing to block RAM from PL but when there
> is no computation implemented in logic but only reads and writes it works
> as expected. Does anyone have an idea how to debug this?
> Changing portions of code to see what would happen isn't practical because
> bitstream generation takes 20 minutes.

I currently don't have an idea better than trying to bisect it - rather
than changing small portions of code, try to keep roughly half of the
computation in there (and do the same in a "reference" implementation in
C in order to have the expected correct outputs). That way, you may be
able to identify which "half" has the issue, or whether both do. If only
one produces incorrect results, then split it in "halves" again. That
way, you wouldn't need to generate the bitstream more than a few times
until you arrive at a fairly small piece of code that still has the
issue - which you might then be able to spot far more easily.
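
The bisection described above can be sketched as a binary search over how many stages of the computation are kept in the design. This is only an illustration, not code from the thread: `first_bad_stage` and `hardware_matches` are hypothetical names, and the sketch assumes that once a stage is wrong, every longer prefix of the computation is wrong too. In practice `hardware_matches(k)` would mean "build a bitstream with only the first k stages and compare its output with the reference implementation truncated the same way".

```python
def first_bad_stage(n_stages, hardware_matches):
    """Return the smallest k in 1..n_stages for which a design truncated to
    its first k stages disagrees with the reference, or None if the full
    design already agrees.

    hardware_matches(k) is a hypothetical callback: True when the hardware
    output for the first k stages equals the reference output.
    Assumption: a wrong stage makes every longer prefix wrong as well.
    """
    if hardware_matches(n_stages):
        return None  # nothing to bisect, the full computation is correct
    lo, hi = 0, n_stages  # prefix of lo stages is good, prefix of hi is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if hardware_matches(mid):
            lo = mid  # the fault is in a later stage
        else:
            hi = mid  # the fault is at or before mid
    return hi
```

Each call to the callback stands for one bitstream build, so a computation split into 16 stages needs about five builds (one for the full design plus log2(16) = 4 halvings) rather than one per stage.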

Also, to remind you, we already had Yuri's bcrypt on FPGA working
correctly (including on the actual Spartan-6 device) - so maybe you
could have started with his code - or you may use it as a reference.
I am fine with you starting from scratch, though.

http://openwall.info/wiki/crypt-dev/files

> Current implementation is attached (bcrypt_loop.v) and it's too slow -
> 7652913 clock cycles for cost 5. It comes mainly from memory latency. 3
> cycles are needed for a read from memory.

3 cycles is a lot. IIRC, it was 1 cycle on Spartan-6 and Virtex-6 when
we experimented with bcrypt on those, and it could be 2 cycles with
buffer registers if we wanted those (presumably for a higher clock
rate). Is Zynq worse in that aspect?

Anyhow, a way to deal with latencies is by interleaving of multiple
bcrypt instances per core. You do have to allocate separate block RAMs
per instance, but several instances (forming one core) can share most of
the rest of the logic.

http://www.openwall.com/lists/crypt-dev/2011/08/21/1

Which approach is optimal will depend on what ends up being the scarce
resource, limiting the overall speed per chip (with as many cores as
will fit). It can be block RAMs, or it can be LUTs, or it can be
something else.

> And I can use only one port because the other one is used by DMA.

Yes, when you mentioned having used a port to do DMA a while ago, this
felt wasteful to me - and now you confirm that it is. Perhaps you
should reconsider that? With DMA, you may be making data transfers
from/to host slightly faster, but you're probably almost halving the
computation speed by wasting half the block RAM ports.

Is it by any chance possible to use the same block RAM ports for both
DMA and PL access, at different times?

Alexander
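
To illustrate why interleaving several bcrypt instances per core hides read latency, here is a rough cycle-count model. It is a simplification and not taken from any of the attached code: `cycles_to_finish` is a hypothetical name, each instance is reduced to a chain of dependent reads from its own block RAM, and the shared logic is assumed to issue at most one read per clock cycle.

```python
def cycles_to_finish(instances, read_latency, reads_per_instance):
    """Cycles until every instance has received the data of its last read.

    Toy model (assumptions, not measured behavior):
    - each instance performs a chain of dependent reads, so it cannot issue
      its next read until the previous one returns after read_latency cycles;
    - the logic shared by the instances issues at most one read per cycle.
    """
    ready_at = [0] * instances            # cycle when each instance may issue
    remaining = [reads_per_instance] * instances
    cycle = 0
    finish = 0
    while any(remaining):
        for i in range(instances):
            if remaining[i] and ready_at[i] <= cycle:
                remaining[i] -= 1
                ready_at[i] = cycle + read_latency
                finish = max(finish, ready_at[i])
                break                     # one issue slot per cycle
        cycle += 1
    return finish


if __name__ == "__main__":
    for n in (1, 2, 3):
        total_reads = n * 1000
        cycles = cycles_to_finish(n, 3, 1000)
        print(n, "instance(s):", total_reads / cycles, "reads per cycle")
```

In this model, a single instance with a 3-cycle read completes one read every 3 cycles, while three interleaved instances approach one read per cycle in total, at the cost of three sets of block RAM. Adding instances beyond the read latency gives no further gain here, which is where the question of the scarce resource (block RAMs versus LUTs) comes in.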