Message-ID: <20110821163356.GA11209@openwall.com>
Date: Sun, 21 Aug 2011 20:33:56 +0400
From: Solar Designer <solar@...nwall.com>
To: crypt-dev@...ts.openwall.com
Subject: optimizing Eksblowfish in FPGA

Yuri, all -

I initially wrote some of the below in an off-list message to Yuri, but
I feel that it's of more use when posted to the list.  Yuri - I'll
start by repeating some of what I already sent to you off-list, then
add to it - so please scroll down to see the new stuff.

Achieving the desired performance for Eksblowfish will probably require
all of:

- Two interleaved Eksblowfish instances per core, for a throughput of
  two Blowfish rounds per two clock cycles (effectively 1 cycle per
  round).

- Lots of cores per Virtex-6 chip (on the order of 100).

- Clock rate of 150 MHz or so (preferably 200+ MHz, maybe using
  BlockRAMs' buffer registers, which you're not using yet).

Comparing this to a quad-core CPU at 3.0 GHz doing one Blowfish round
in 5 clock cycles (per core), we get:

Speedup from fewer cycles per round: 5x
Speedup from more cores (100 vs. 4): 25x
Slowdown from lower clock rate (150 MHz vs. 3.0 GHz): 20x

So this gives us a reasonable overall speedup of 5 * 25 / 20 = 6.25x.
But I understand it'd be tough to achieve.  Also, it assumes zero
communication delays, whereas in practice it'd be tough to minimize
those delays to an acceptable level (I guess you'd need to have data
transfers overlap with actual operation of the cores - e.g., core #1
should start processing as soon as its data is in FPGA, and it should
be the first one to query for results).

New stuff (not yet discussed via private e-mail):

Perhaps to use buffer registers, you'd need to interleave three and not
just two instances per core.  It'd be something like this (the memory
reads being the per-round S-box lookups - see the C sketch in the P.S.
below):

Next cycle:
    request memory reads for instance #1
    process previously requested data for instance #2
Next cycle:
    request memory reads for instance #2
    process previously requested data for instance #3
Next cycle:
    request memory reads for instance #3
    process previously requested data for instance #1

So you have a latency of 2 cycles on memory reads, which likely allows
for a higher clock rate.  A drawback may be that a smaller percentage
of LUTs will be in use on a given clock cycle, though.  So this may or
may not be an overall improvement - you'd need to experiment with it.

Then, to save on control logic, you may use even more instances per
core, but running them in parallel rather than interleaved.  So in the
example above, you'd have something like this on a given clock cycle,
all in the same core (same state machine), as modeled in the P.P.S.
below:

    request memory reads for instance #1
    request memory reads for instance #4
    request memory reads for instance #7
    process previously requested data for instance #2
    process previously requested data for instance #5
    process previously requested data for instance #8

These are just some thoughts that I'd like to have recorded for when
it's the right time to optimize things.

Alexander
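
P.S. For reference, here's roughly what one Blowfish round looks like
in C - a schematic sketch of the well-known round structure, not code
from anyone's actual tree (S-box and P-array contents are omitted).
The four S-box lookups in F() are exactly the memory reads a core has
to request from its BlockRAMs on each round:

#include <stdint.h>

static uint32_t S[4][256];	/* the four 1 KB S-boxes, held in BlockRAMs */
static uint32_t P[18];		/* the P-array */

/* The F function: four S-box reads, two additions, one XOR */
static uint32_t F(uint32_t x)
{
	return ((S[0][x >> 24] + S[1][(x >> 16) & 0xff]) ^
	    S[2][(x >> 8) & 0xff]) + S[3][x & 0xff];
}

/* One of the 16 rounds: XOR in a P-array word, feed one half through
 * F(), then swap the halves for the next round */
static void blowfish_round(uint32_t *xl, uint32_t *xr, int i)
{
	uint32_t t;

	*xl ^= P[i];
	*xr ^= F(*xl);
	t = *xl; *xl = *xr; *xr = t;
}

In Eksblowfish, the expensive key setup keeps re-running this same
round function to rewrite P[] and S[][], which is why per-round
throughput dominates the cost.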
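
P.P.S. And here's a toy C model of the three-way interleaved schedule,
just to make the cycle numbering explicit.  It computes nothing - it
only prints which instance does what on each cycle (instance numbering
as in the examples above):

#include <stdio.h>

#define INSTANCES 3	/* instances interleaved per core */

int main(void)
{
	int cycle;

	for (cycle = 0; cycle < 6; cycle++) {
		/* this instance presents its S-box read addresses now */
		int req = cycle % INSTANCES;
		/* this instance consumes data it requested 2 cycles ago */
		int proc = (cycle + INSTANCES - 2) % INSTANCES;

		printf("cycle %d: request reads for instance #%d, "
		    "process data for instance #%d\n",
		    cycle, req + 1, proc + 1);
	}

	return 0;
}

Read data only has to be valid two cycles after the address is
presented, which is what leaves room for the BlockRAM output buffer
register.  For the parallel variant, the same req/proc indices would
drive instances #1/#4/#7 and #2/#5/#8 (that is, index, index + 3,
index + 6) off the one state machine.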