crypt-dev - Re: alternative approach

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20110512095154.GA12455@openwall.com>
Date: Thu, 12 May 2011 13:51:54 +0400
From: Solar Designer <solar@...nwall.com>
To: crypt-dev@...ts.openwall.com
Subject: Re: alternative approach

David, Yuri -

On Wed, May 11, 2011 at 12:31:00PM -0700, David Hulton wrote:
> I think that DES would be our best bet as far as certified ciphers.

I think that technically DES would be great for a component that we want
to be optimal for both FPGA/ASIC and CPU, if we design our KDF such that
bitslice implementations of DES are convenient.  Even though DES was
originally designed for hardware implementation, with bitslicing (when
that is practical) it is also software-friendly.

...I just found some pseudo-code for a bitslice DES based crypt(3) like
function, which I wrote in 1998 (according to the file timestamp).  I'll
post it separately.

Some drawbacks of DES are:

- A lot of people have heard that DES has been "broken", even though it
hasn't - it's just that its 56-bit key size is insufficient.

- DES has been formally obsoleted, in favor of AES.  (As an excuse, we
can build upon 3DES and say that we're using that...)

- The key size (and also block size) is in fact inadequate; we'll need
to use DES such that this does not affect our KDF's properties.

- For the "alternative approach", DES is too software-friendly (but it
is also extremely hardware-friendly).

> We're able to do on the order of around 22 billion DES operations/sec
> on our current M-501 modules (we currently are able to do over 1
> trillion keys/sec on our 48-board 5U clusters)

That's impressive.

http://picocomputing.com/m_series.html says "Virtex-6 LX FPGA", "Up to
six M-501 modules per board".  Is the 22 billion figure for one or for
six FPGAs?

> and definitely in the
> billions range on the E-101 board that we're sending to Yuri.

Sounds good.

http://picocomputing.com/e_series.html says it's "Spartan-6 LX45 FPGA".

Yuri - the numbers I posted in here before were for DES-based crypt(3),
which includes 25 iterations of modified-DES.  So you need to multiply
them by 25 to compare against those posted by David above.  Thus, if
Core i7-2600 does 20.5M DES-based crypts per second (and it does that
here), this means that it does 512M modified-DES encryptions per second.
For plain DES (not modified), it would do approx. 550M per second in a
similar loop (no key setup in the loop, just iterated DES encryption).
More efficient parallelization over the multiple cores could provide
another 10% speedup (my 20.5M figure is for thread-safe code and OpenMP,
which has overhead).

That's still 35-40 times slower than the 22 billion mentioned by David,
which is good.

> If we
> combined PBKDF2-like iteration algorithm with DES/3DES we might be
> able to get away with something that conforms close enough to the
> standards out there to be usable, allows us to maintain a full
> pipeline with the DES implementation (for efficient operation in
> server mode) and give us a 10x+ advantage over GPU's/CPU's.

This makes sense.  As a bonus, since bitslice DES is also CPU-friendly,
we could try to design our KDF such that we treat the FPGA and the CPU
almost the same (at high level).

However, if 10x (or even 40x?) is possible in this way, this leads me to
think that we could achieve 100x+ with a truly CPU-unfriendly component
for the FPGA.

> I haven't
> seen many benchmarks for GPU's and DES but the closest I've seen for
> Lanman is still in the hundreds of millions at best.

Disclaimer: I am not into GPU programming (yet).  This is why I use
words such as "apparently" below.

Yes, DES is somewhat unfriendly to GPUs so far, but this is somewhat
likely to change.  Bitslice implementations of DES with the S-box
expressions published so far need up to approx. 20 registers, which is
apparently more than what typical GPUs have available per processing
element.  Also, apparently accessing a temporary variable spilled to
memory is relatively slow (compared to accessing the L1 cache in a CPU).

One expected change is that we (Openwall) are going to release more
GPU-friendly DES S-box expressions in about a month from now.  These
match the instruction set of high-end ATI/AMD GPUs (as well as AMD XOP
and PPC AltiVec, which is why they'd get into JtR this soon), and they
will need fewer registers.

Another change that may or may not happen is GPUs getting more registers
or/and faster access to temporary variables.  This sounds quite
realistic to me, but I am not an expert in this area.  Based on what I
read (about the implementation difficulty for DES on GPUs), I wouldn't
be surprised if a mere 2x increase in register count results in a 10x
speedup for DES on GPUs.

> I have a feeling that going with DES puts us on the right track and
> one other benefit is that if this was ever ASICed we would immediately
> have another 6x-10x advantage and major cost savings on larger scale
> deployments.

This makes sense.  Wouldn't ASIC provide similar advantage for
Blowfish-like constructs, though?

> One option that comes to mind is doing something like 16 or 32 (or any
> other higher convenient multiple of 16) number of PBKDF2 operations in
> parallel (maybe each with different salts?) and then xor all of the
> results together at the end. This would give us some degree of
> parallelism (if we have a 16-stage DES pipeline we can have all of our
> PBKDF2 instances running efficiently in server mode possibly across
> multiple DES cores depending on the parallelism setting) and also
> providing a tuneable tradeoff with the sequential running time.

This makes sense.  I think the degree of parallelism would need to be
far greater, though.  For the CPU, it will need to be at least 128
(since we'll do bitslice).

But I also need to post my thoughts on us using a sequential memory-hard
algorithm as defined in the scrypt paper.  Maybe that is what will
occupy the CPU.

> Just some thoughts.. I think part of the problem is that most
> mainstream algorithms designed in the past 20 years have been heavily
> targeted to run efficiently in software rather than gates so

Right.

> unfortunately we may need to go to the past to take advantage of this.

Yes, or modify/design something new.

> The other other algorithms that I think would run much more
> efficiently would be ones that do lots of smaller bit operations like
> LFSR based algorithms

These might be too weak.

> or possibly FFT-like hashes...

I am not familiar with these.

I actually thought of modifying Blowfish in specific ways for this
project.  I am likely to write about this separately.

Thank you!

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.