|
Message-ID: <20110512095154.GA12455@openwall.com> Date: Thu, 12 May 2011 13:51:54 +0400 From: Solar Designer <solar@...nwall.com> To: crypt-dev@...ts.openwall.com Subject: Re: alternative approach David, Yuri - On Wed, May 11, 2011 at 12:31:00PM -0700, David Hulton wrote: > I think that DES would be our best bet as far as certified ciphers. I think that technically DES would be great for a component that we want to be optimal for both FPGA/ASIC and CPU, if we design our KDF such that bitslice implementations of DES are convenient. Even though DES was originally designed for hardware implementation, with bitslicing (when that is practical) it is also software-friendly. ...I just found some pseudo-code for a bitslice DES based crypt(3) like function, which I wrote in 1998 (according to the file timestamp). I'll post it separately. Some drawbacks of DES are: - A lot of people have heard that DES has been "broken", even though it hasn't - it's just that its 56-bit key size is insufficient. - DES has been formally obsoleted, in favor of AES. (As an excuse, we can build upon 3DES and say that we're using that...) - The key size (and also block size) is in fact inadequate; we'll need to use DES such that this does not affect our KDF's properties. - For the "alternative approach", DES is too software-friendly (but it is also extremely hardware-friendly). > We're able to do on the order of around 22 billion DES operations/sec > on our current M-501 modules (we currently are able to do over 1 > trillion keys/sec on our 48-board 5U clusters) That's impressive. http://picocomputing.com/m_series.html says "Virtex-6 LX FPGA", "Up to six M-501 modules per board". Is the 22 billion figure for one or for six FPGAs? > and definitely in the > billions range on the E-101 board that we're sending to Yuri. Sounds good. http://picocomputing.com/e_series.html says it's "Spartan-6 LX45 FPGA". Yuri - the numbers I posted in here before were for DES-based crypt(3), which includes 25 iterations of modified-DES. So you need to multiply them by 25 to compare against those posted by David above. Thus, if Core i7-2600 does 20.5M DES-based crypts per second (and it does that here), this means that it does 512M modified-DES encryptions per second. For plain DES (not modified), it would do approx. 550M per second in a similar loop (no key setup in the loop, just iterated DES encryption). More efficient parallelization over the multiple cores could provide another 10% speedup (my 20.5M figure is for thread-safe code and OpenMP, which has overhead). That's still 35-40 times slower than the 22 billion mentioned by David, which is good. > If we > combined PBKDF2-like iteration algorithm with DES/3DES we might be > able to get away with something that conforms close enough to the > standards out there to be usable, allows us to maintain a full > pipeline with the DES implementation (for efficient operation in > server mode) and give us a 10x+ advantage over GPU's/CPU's. This makes sense. As a bonus, since bitslice DES is also CPU-friendly, we could try to design our KDF such that we treat the FPGA and the CPU almost the same (at high level). However, if 10x (or even 40x?) is possible in this way, this leads me to think that we could achieve 100x+ with a truly CPU-unfriendly component for the FPGA. > I haven't > seen many benchmarks for GPU's and DES but the closest I've seen for > Lanman is still in the hundreds of millions at best. Disclaimer: I am not into GPU programming (yet). This is why I use words such as "apparently" below. Yes, DES is somewhat unfriendly to GPUs so far, but this is somewhat likely to change. Bitslice implementations of DES with the S-box expressions published so far need up to approx. 20 registers, which is apparently more than what typical GPUs have available per processing element. Also, apparently accessing a temporary variable spilled to memory is relatively slow (compared to accessing the L1 cache in a CPU). One expected change is that we (Openwall) are going to release more GPU-friendly DES S-box expressions in about a month from now. These match the instruction set of high-end ATI/AMD GPUs (as well as AMD XOP and PPC AltiVec, which is why they'd get into JtR this soon), and they will need fewer registers. Another change that may or may not happen is GPUs getting more registers or/and faster access to temporary variables. This sounds quite realistic to me, but I am not an expert in this area. Based on what I read (about the implementation difficulty for DES on GPUs), I wouldn't be surprised if a mere 2x increase in register count results in a 10x speedup for DES on GPUs. > I have a feeling that going with DES puts us on the right track and > one other benefit is that if this was ever ASICed we would immediately > have another 6x-10x advantage and major cost savings on larger scale > deployments. This makes sense. Wouldn't ASIC provide similar advantage for Blowfish-like constructs, though? > One option that comes to mind is doing something like 16 or 32 (or any > other higher convenient multiple of 16) number of PBKDF2 operations in > parallel (maybe each with different salts?) and then xor all of the > results together at the end. This would give us some degree of > parallelism (if we have a 16-stage DES pipeline we can have all of our > PBKDF2 instances running efficiently in server mode possibly across > multiple DES cores depending on the parallelism setting) and also > providing a tuneable tradeoff with the sequential running time. This makes sense. I think the degree of parallelism would need to be far greater, though. For the CPU, it will need to be at least 128 (since we'll do bitslice). But I also need to post my thoughts on us using a sequential memory-hard algorithm as defined in the scrypt paper. Maybe that is what will occupy the CPU. > Just some thoughts.. I think part of the problem is that most > mainstream algorithms designed in the past 20 years have been heavily > targeted to run efficiently in software rather than gates so Right. > unfortunately we may need to go to the past to take advantage of this. Yes, or modify/design something new. > The other other algorithms that I think would run much more > efficiently would be ones that do lots of smaller bit operations like > LFSR based algorithms These might be too weak. > or possibly FFT-like hashes... I am not familiar with these. I actually thought of modifying Blowfish in specific ways for this project. I am likely to write about this separately. Thank you! Alexander
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.