crypt-dev - Re: Yuri's Status Report

Message-ID: <20110816100049.GA22647@openwall.com>
Date: Tue, 16 Aug 2011 14:00:49 +0400
From: Solar Designer <solar@...nwall.com>
To: crypt-dev@...ts.openwall.com
Subject: Re: Yuri's Status Report - #14 of 15

Hi Yuri,

On Tue, Aug 16, 2011 at 12:50:16AM -0300, Yuri Gonzaga wrote:
>    - Accomplishments:
>       - Back to Pico e101 to run 4 cores of eksblowfish loop in parallel.

That's great, but:

>          - The 4 cores running in parallel to compute the same benchmark
>          hash example wasted 99.312 s on my Windows XP virtual machine
>          as on the another examples;

If I understand correctly what the benchmark does, this is awfully slow -
over 1000 times slower than a 1 GHz CPU.  Also, it is unclear to me
how this corresponds to another benchmark result you mentioned before:

http://www.openwall.com/lists/crypt-dev/2011/07/05/1

"- This was executed 10 times sequentially and waste 234.938 seconds
of execution time"

So previously you had 23 seconds per Eksblowfish main loop invocation
(which is also unacceptably slow).  Now you report 99 seconds for
_parallel_ execution of 4 instances of the loop, yet you say you get
the same speed (or maybe I misinterpret what you wrote).  These numbers
just do not agree; for parallel execution, you should have seen the
same total runtime for 4 instances as you did for 1 instance, but you
got it to run 4 times longer now, as if your instances are running
sequentially rather than in parallel?
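
To make the mismatch concrete, here is the arithmetic as a small Python
sketch.  The two timings are the ones quoted above; the variable names
are only illustrative.

```python
# Timings quoted from the two status reports (seconds).
sequential_total = 234.938   # 10 invocations, run one after another
sequential_runs = 10
parallel_total = 99.312      # 4 cores, meant to run simultaneously

# Time for one Eksblowfish main loop invocation in the earlier report.
per_invocation = sequential_total / sequential_runs

# If the 4 cores really ran in parallel, the 4-core total should be
# close to the time of one invocation, so this ratio should be near 1.
# A ratio near 4 suggests the instances ran one after another.
ratio = parallel_total / per_invocation

print(f"per invocation: {per_invocation:.2f} s")
print(f"4-core total / single invocation: {ratio:.2f}")
```

The ratio comes out a little over 4, which is what sequential execution
of the 4 instances would look like.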

I asked you some questions in:

http://www.openwall.com/lists/crypt-dev/2011/07/05/2

to which you never replied.  Perhaps I should have insisted on a reply,
so we'd catch the unacceptable performance sooner.  Somehow I thought
that such execution times would apply to something like 1000 invocations
of the Eksblowfish loop (you did mention just 10, though...) or to a
much higher "cost" setting (but you appear to keep it at 5 in the code
that you uploaded to the wiki).  Also, I was hoping that JtR integration
would be ready soon and we'd see the actual performance from there.

The code in eksblowfish-loop-interface.zip and
4-eksblowfish-loop-cores-pico-e101.zip looks like you do just 10
invocations of Eksblowfish at cost=5 in the former and just 4 of them
in the latter (with an attempt to do them in parallel, which may or may
not be successful - we'll need to figure this out too).

So where does the extreme performance loss come from?  A ridiculously
low clock rate, like 1 MHz or below?  Or am I misreading things?

You need to use a reasonable clock rate, comparable to what we'd
actually use, to validate that your design works as intended in hardware.
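
As a rough way to answer the clock rate question, one can back out the
effective clock from the measured time.  The Python sketch below assumes
the standard structure of the Eksblowfish main loop (2^cost iterations,
two key expansions per iteration, 521 Blowfish block encryptions per key
expansion, 16 rounds per encryption) and ignores host-side and transfer
overhead, which may well account for part of the measured time.  The
function name is made up for illustration.

```python
COST = 5
ITERATIONS = 2 ** COST             # 32 iterations of the main loop
EXPANSIONS_PER_ITERATION = 2       # one with the key, one with the salt
ENCRYPTIONS_PER_EXPANSION = 521    # 9 for the P-array + 512 for the S-boxes
ROUNDS_PER_ENCRYPTION = 16

# Total Blowfish rounds in one main loop invocation at cost=5.
rounds = (ITERATIONS * EXPANSIONS_PER_ITERATION
          * ENCRYPTIONS_PER_EXPANSION * ROUNDS_PER_ENCRYPTION)

def implied_clock_hz(seconds_per_invocation, cycles_per_round):
    """Clock rate that would explain the measured time, assuming the
    core is busy for the whole duration (an assumption, not a fact)."""
    return rounds * cycles_per_round / seconds_per_invocation

# About 23.5 s per invocation, taken from the earlier report (234.938 s / 10).
for cycles in (1, 5):
    hz = implied_clock_hz(23.4938, cycles)
    print(f"{cycles} cycle(s)/round -> about {hz / 1e3:.0f} kHz")
```

Under these assumptions the implied clock is on the order of 100 kHz even
at 5 cycles per round, so either the clock is extremely slow or most of
the time is being spent outside the core.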

>          - Files available at http://openwall.info/wiki/crypt-dev/files

Thanks, I downloaded your latest and took a look.

>       - Priorities:
>       - JtR's eksblowfish loop multicore integration, compilation, execution
>       and comparison.

While this is blocked waiting for access to the remote machine, please
figure out what goes on with the horrible performance.

On a 1 GHz CPU, Eksblowfish runs at something between 100 and 200
invocations per second, at cost=5.  You're reporting about 23 seconds per
invocation, which is thus 2000 to 5000 times slower.  Indeed, the clock
rate is lower, but maybe only by a factor of 10 (I am assuming that you
run this at 100 MHz or so).  This is partially compensated by the
reduced number of clock cycles per Blowfish round.  On a CPU, it can be
something between 5 and 10 cycles per Blowfish round:

http://www.schneier.com/blowfish-speed.html

That page gives 9 for the original Pentium, and I am getting 5.5 for the
code in JtR on newer CPUs.  On an FPGA, you should have between 1 and 5 clock
cycles per Blowfish round.  IIRC, the first implementation with BlockRAMs
that you did was meant to provide 5 cycles per round, and we discussed
how to reduce that.  Even if we take 5 cycles per round for both the CPU
and the FPGA, this gives us only a 10x slowdown per core because of the
clock rate difference alone (an Eksblowfish core in an FPGA vs. code
running on a CPU core).  (And we'd compensate for that by having
something like 100 cores in a larger FPGA chip, as well as by reducing
the number of clock cycles per round, which doesn't have to be as bad as 5.)
But you're getting a ridiculous 2000 to 5000 times slowdown instead?
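
The comparison above can be written out as a short Python sketch.  The
100 MHz FPGA clock is the assumption stated earlier in this message, not
a measured value, and the variable names are only illustrative.

```python
# Figures from this message.
cpu_rate_low, cpu_rate_high = 100, 200   # invocations/s on a 1 GHz CPU, cost=5
fpga_seconds_per_invocation = 23         # reported for the FPGA core

# Observed slowdown: FPGA time per invocation times CPU invocations/s.
observed_low = fpga_seconds_per_invocation * cpu_rate_low
observed_high = fpga_seconds_per_invocation * cpu_rate_high

# Expected slowdown per core from the clock rate difference alone,
# taking 5 cycles per Blowfish round on both sides.
cpu_clock_hz = 1_000_000_000
fpga_clock_hz = 100_000_000              # assumed, see above
cpu_cycles_per_round = 5
fpga_cycles_per_round = 5
expected = (cpu_clock_hz / fpga_clock_hz) * (fpga_cycles_per_round / cpu_cycles_per_round)

print(f"observed slowdown: {observed_low}x to {observed_high}x")
print(f"expected slowdown: {expected:.0f}x")
print(f"unexplained factor: {observed_low / expected:.0f}x to {observed_high / expected:.0f}x")
```

That leaves a factor of a few hundred that neither the clock rate nor
the cycle count per round explains.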

As to LUT count, it appears that you'd fit only 7 cores in the E-101
board's Spartan-6.  That's obviously too few, but we'll target much
larger chips and we'll need to optimize the LUT count.  For initial
testing, even 4 or 7 cores will work, but we need to get sane performance
from them and we need them to run in parallel for real (as demonstrated
by total execution time that does not increase when you add more
invocations that are meant to run in parallel).

Please comment ASAP.

Thanks,

Alexander