Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date: Wed, 11 May 2011 12:31:00 -0700
From: David Hulton <>
Subject: Re: alternative approach

Alexander & Yuri,

I think that DES would be our best bet as far as certified ciphers.
We're able to do on the order of around 22 billion DES operations/sec
on our current M-501 modules (we currently are able to do over 1
trillion keys/sec on our 48-board 5U clusters) and definitely in the
billions range on the E-101 board that we're sending to Yuri. If we
combined PBKDF2-like iteration algorithm with DES/3DES we might be
able to get away with something that conforms close enough to the
standards out there to be usable, allows us to maintain a full
pipeline with the DES implementation (for efficient operation in
server mode) and give us a 10x+ advantage over GPU's/CPU's. I haven't
seen many benchmarks for GPU's and DES but the closest I've seen for
Lanman is still in the hundreds of millions at best.

I have a feeling that going with DES puts us on the right track and
one other benefit is that if this was ever ASICed we would immediately
have another 6x-10x advantage and major cost savings on larger scale

One option that comes to mind is doing something like 16 or 32 (or any
other higher convenient multiple of 16) number of PBKDF2 operations in
parallel (maybe each with different salts?) and then xor all of the
results together at the end. This would give us some degree of
parallelism (if we have a 16-stage DES pipeline we can have all of our
PBKDF2 instances running efficiently in server mode possibly across
multiple DES cores depending on the parallelism setting) and also
providing a tuneable tradeoff with the sequential running time.

Just some thoughts.. I think part of the problem is that most
mainstream algorithms designed in the past 20 years have been heavily
targeted to run efficiently in software rather than gates so
unfortunately we may need to go to the past to take advantage of this.
The other other algorithms that I think would run much more
efficiently would be ones that do lots of smaller bit operations like
LFSR based algorithms or possibly FFT-like hashes...


On Wed, May 11, 2011 at 12:07 PM, Solar Designer <> wrote:
> Hi Yuri,
> Is my understanding of your reply correct that you agree it is a good
> idea to use the "alternative approach", but you propose that when doing
> so we limit ourselves to "well-known and acceptable hashes/ciphers" as
> our building blocks?
> On Wed, May 11, 2011 at 11:22:38AM -0300, Yuri Gonzaga wrote:
>> It is complicated to proposed a new method that doesn't use a combination of
>> well-known and acceptable hashes/ciphers because of the lack of
>> certification and acceptance since it take some time to happens, only after
>> cientific and comercial communities appreciation.
>> Thus, it is better to our project to use validated hashes/ciphers in order
>> to achieve acceptance.
> Right.  However, most modern cryptographic primitives were designed for
> efficient software implementation (as well as hardware implementation).
> So I'm afraid there isn't a truly software-unfriendly modern hash/cipher
> that has received much scrutiny and government certification.
> Blowfish (and perhaps Twofish) might be the closest we can get - it has
> received a lot of scrutiny, but no government certification.  It is
> friendly to 32-bit CPUs, but not to typical SIMD architectures (which
> lack vectorized array lookups, for good reasons).
> Thus, we may choose to use a component based on Blowfish (such as
> Eksblowfish or the full bcrypt) or even one based on our own Blowfish
> variation (more unfriendly to software implementation and requiring less
> logic in hardware), but include proof that cryptographic security of our
> entire KDF is not dependent on that of this component.  That is, even if
> a gaping cryptographic weakness is found in this component, the worst
> impact would be faster testing of a candidate key in a brute-force
> attack - partially or fully bypassing computation of this "non-crypto"
> component.  Of course, we wouldn't expect such bypass to be actually
> practical, which is why we'd include this component.
> (We may use a similar approach to gain acceptance for a variation of
> scrypt, formally relying on SHA-256 and PBKDF2 rather than on its
> Salsa20/8 component.)
>> Also, the time of GSoC will be better spent if we
>> focus in implement these blocks instead of proposing a new method.
> Maybe, or maybe not.  We definitely need to have something usable and
> useful by the end of summer.  We've already decided to include an FPGA
> implementation and an optional FPGA-based component in our KDF.  No
> matter what we choose, if we implement this at all and it works, it will
> be "usable".  As to it being "useful", it will need to be somehow better
> than what exists now, for some uses and in some aspects.  I see two ways
> to make it useful in this sense:
> 1. Have our KDF, when running on one CPU plus one FPGA, be significantly
> more resistant against offline brute-force attacks by just CPUs (no
> FPGAs) and/or by GPUs, compared to an existing KDF implemented on a CPU
> only.  By "significantly" I mean something like a 10x difference in
> attack speed (assuming fixed $ cost of the attacker's equipment), and
> preferably more than 10x.
> 2. Include and use a host-unreadable local parameter on the FPGA board.
> For this, we need to figure out if Pico Computing's boards are readily
> usable to achieve this task or not...  If not yet, then maybe we won't
> have this by the end of summer (although it'd be a pity).
> If we're 100% confident that we'd achieve one of these by the end of
> summer, then the other is optional (for the summer).  However, at least
> one of these is a must to achieve by the end of the summer.
> Now, while we could wish FPGAs were some magic high-performance beasts,
> definitely much faster than CPUs when properly programmed, they are not.
> If we pick a software-friendly cryptographic primitive, it could turn
> out that our optimized implementation of it on a typical server CPU
> achieves comparable performance to our optimized implementation on our
> FPGA board (one we could offer for installation into servers).
> Here are some performance numbers for DES, which was originally designed
> for hardware implementation, but then had efficient software
> implementations developed for it:
> The performance number given for "AMD64 3000+" is incorrect, please
> disregard it.  See the comments (some of them posted by me) for correct
> performance numbers for CPU implementations.
> This shows that XC3S1000-4 achieves a speed similar to a quad-core CPU
> of 2008.  Core i7-2600 (this year's quad-core CPU) is twice faster - it
> does over 20M of DES-based crypt(3) keys per second with my current code.
> Thus, apparently XC3S5000-4, which is reported to theoretically achieve
> 45 MKeys/sec in a table on that web page, is twice faster than a modern
> quad-core CPU.  However, for our project a mere 2x performance gain of
> FPGA vs. CPU is not good enough to achieve the #1 goal mentioned above.
> (I don't know how FPGAs on Pico Computing's boards meant for production
> use in servers compare to a XC3S5000-4.  I'd appreciate info on that.)
> Overall, the above makes me think that we may in fact need to design and
> implement a software-unfriendly component.  Definitely more than DES
> (which can be made to use arbitrarily wide SIMD vectors in software with
> a bitslice implementation), and perhaps more than Blowfish.  With
> original Blowfish, we might achieve a 4x "boost" in software
> unfriendliness vs. DES (since it only uses 32-bit operations, whereas
> modern CPUs can do 128-bit almost as fast), which might give us a total
> "FPGA boost" vs. CPUs of 8x (relying on speeds for XC3S5000-4 vs. Core
> i7-2600 mentioned above).  However, I'd prefer something better than 8x
> (although 8x is already reasonable for our project), and I am not
> certain of it (full/original Blowfish probably consumes more FPGA logic
> than DES does, which I did not take into account in these calculations,
> so we'd actually fit fewer Blowfish cores in our FPGA chip; also, there
> are differences in pipelining ability, which is nearly perfect for DES).
>> So, my suggestion is that you have to define 3 components: one for CPU,
>> other for GPU and another for FPGA.
> Maybe.  I suggest that we omit a GPU-specific component for now, for a
> variety of reasons.
>> The result of each component will be built in a final result. For example,
>> do a XOR operation using 3 results
> Yes, however XOR only works well when the components' outputs are of the
> same size and all are "random enough" - otherwise the combined output
> might reveal some bits of individual outputs, which would allow for
> faster brute-force attacks (on those revealed outputs separately).
> Thus, we'd want to feed the components' outputs into a generic KDF
> (relatively fast).  We may even include the original password and salt
> along with those component outputs.  This gives us proof that the
> security of our entire construct depends on that final KDF only.
>> or use one output as input in the next component.
> This reduces parallelism for actual use while not affecting parallelism
> available for attacks.  Not good.
>> Finally, as you are the expert in this area, I think you should establish
>> these details so I can go on with the project implementation.
> Yes, I intend to define the project more specifically, but I am trying
> to get you involved in the design decisions and to let you learn new
> things.  Also, some design decisions depend on what they'd mean for
> implementation.  So I might ask you to try implementing different things
> before we decide on which of those to use.
> Thanks,
> Alexander

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.