Message-ID: <20110816100049.GA22647@openwall.com>
Date: Tue, 16 Aug 2011 14:00:49 +0400
From: Solar Designer <solar@...nwall.com>
To: crypt-dev@...ts.openwall.com
Subject: Re: Yuri's Status Report - #14 of 15

Hi Yuri,

On Tue, Aug 16, 2011 at 12:50:16AM -0300, Yuri Gonzaga wrote:
> - Accomplishments:
> - Back to Pico e101 to run 4 cores of eksblowfish loop in parallel.

That's great, but:

> - The 4 cores running in parallel to compute the same benchmark
> hash example wasted 99.312 s on my Windows XP virtual machine
> as on the
> another examples;

If I understand what the benchmark does correctly, this is awfully
slow - over 1000 times slower than a 1 GHz CPU.

Also, it is unclear to me how this corresponds to another benchmark
result you mentioned before:

http://www.openwall.com/lists/crypt-dev/2011/07/05/1

"- This was executed 10 times sequentially and waste 234.938 seconds
of execution time"

So previously you had 23 seconds per Eksblowfish main loop invocation
(which is also unacceptably slow). Now you report 99 seconds for
_parallel_ execution of 4 instances of the loop, yet you say you
receive the same speed (or maybe I misinterpret what you wrote).
These numbers just do not agree; for parallel execution, you should
have received the same total runtime for 4 instances as you did for
1 instance, but you got it to run 4 times longer now, as if your
instances are running sequentially rather than in parallel?

I asked you some questions in:

http://www.openwall.com/lists/crypt-dev/2011/07/05/2

to which you never replied. Perhaps I should have insisted on a
reply, so we'd catch the unacceptable performance sooner. Somehow I
thought that such execution times would apply to something like 1000
invocations of the Eksblowfish loop (you did mention just 10,
though...) or to a much higher "cost" setting (but you appear to keep
it at 5 in the code that you uploaded to the wiki).
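The timing comparison above can be checked with a few lines of
arithmetic. The following is only a minimal sketch: the two timings are
the ones quoted in this thread, while the function names and the
"more than half of N" threshold are hypothetical choices made for
illustration, not anything taken from the uploaded code.

```python
# Timings quoted in this thread, in seconds.
SEQ_TOTAL_S = 234.938  # 10 sequential invocations (report of 2011/07/05)
SEQ_COUNT = 10
PAR_TOTAL_S = 99.312   # 4 instances meant to run in parallel (report #14)
PAR_COUNT = 4


def per_invocation(total_s, count):
    """Average time of one invocation in a sequential run."""
    return total_s / count


def parallel_ratio(par_total_s, single_s):
    """Multi-instance runtime divided by single-instance runtime.

    A ratio near 1 is what truly parallel cores should give;
    a ratio near the instance count suggests sequential execution.
    """
    return par_total_s / single_s


single_s = per_invocation(SEQ_TOTAL_S, SEQ_COUNT)
ratio = parallel_ratio(PAR_TOTAL_S, single_s)

# Assumption: treat anything above half the instance count as "looks
# sequential". The threshold is arbitrary and only for illustration.
looks_sequential = ratio > PAR_COUNT / 2

print(f"one invocation:  {single_s:.3f} s")
print(f"4-instance run:  {PAR_TOTAL_S:.3f} s ({ratio:.2f}x one invocation)")
print(f"looks sequential: {looks_sequential}")
```

With the quoted numbers the ratio comes out slightly above 4, which is
the "4 times longer" mismatch described above.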
Also, I was hoping that JtR integration would be ready soon and we'd
see the actual performance from there.

The code in eksblowfish-loop-interface.zip and
4-eksblowfish-loop-cores-pico-e101.zip looks like you do just 10
invocations of Eksblowfish at cost=5 in the former and just 4 of them
in the latter (with an attempt to do them in parallel, which may or
may not be successful - we'll need to figure this out too).

So where does the extreme performance loss come from? A ridiculously
low clock rate, like 1 MHz or below? Or am I misreading things?

You need to use a reasonable clock rate, comparable to what we'd
actually use, to validate that your design works as intended in
hardware.

> - Files available at http://openwall.info/wiki/crypt-dev/files

Thanks, I downloaded your latest and took a look.

> - Priorities:
> - JtR's eksblowfish loop multicore integration, compilation, execution
> and comparison.

While this is blocked waiting for access to the remote machine,
please figure out what goes on with the horrible performance.

On a 1 GHz CPU, Eksblowfish runs at something between 100 and 200
invocations per second, at cost=5. You're reporting 23 seconds per
invocation, which is thus 2000 to 5000 times slower.

Indeed, the clock rate is lower, but maybe only by a factor of 10 (I
am assuming that you run this at 100 MHz or so). This is partially
compensated by the reduced number of clock cycles per Blowfish round.
On a CPU, it can be something between 5 and 10 cycles per Blowfish
round:

http://www.schneier.com/blowfish-speed.html

This gives 9 for the original Pentium; I am getting 5.5 for the code
in JtR on newer CPUs. On an FPGA, you should have between 1 and 5
clock cycles per Blowfish round. IIRC, the first implementation with
BlockRAMs that you did was meant to provide 5 cycles per round, and
we discussed how to reduce that.
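To make the gap concrete, the sketch below compares the observed
slowdown with the slowdown that clock rate and cycles per round would
predict. It is a rough model under stated assumptions: the 100 MHz FPGA
clock is the guess made above, not a measured value, the per-round
cycle counts are the figures quoted above, and the function names are
hypothetical.

```python
# Figures quoted above; the FPGA clock is an assumption, not a measurement.
CPU_HZ = 1e9                # 1 GHz reference CPU
FPGA_HZ_ASSUMED = 100e6     # "100 MHz or so" (assumed)
CPU_RATE_RANGE = (100.0, 200.0)  # invocations/s at cost=5 on the CPU
OBSERVED_S = 23.0           # reported seconds per invocation on the FPGA


def observed_slowdown(seconds_per_invocation, cpu_rate):
    """How many times slower than a CPU doing cpu_rate invocations/s."""
    return seconds_per_invocation * cpu_rate


def expected_slowdown(cpu_hz, fpga_hz, cpu_cycles, fpga_cycles):
    """Per-core slowdown predicted from clock rate and cycles per round."""
    return (cpu_hz / fpga_hz) * (fpga_cycles / cpu_cycles)


observed = [observed_slowdown(OBSERVED_S, r) for r in CPU_RATE_RANGE]

# Same cycles per round on both sides: clock rate difference alone.
clock_only = expected_slowdown(CPU_HZ, FPGA_HZ_ASSUMED, 5.0, 5.0)
# 5.5 cycles/round on the CPU (JtR figure) vs. 5 on the FPGA.
with_jtr_cpu = expected_slowdown(CPU_HZ, FPGA_HZ_ASSUMED, 5.5, 5.0)

print(f"observed slowdown: {observed[0]:.0f}x to {observed[1]:.0f}x")
print(f"expected, clock rate only: {clock_only:.1f}x")
print(f"expected, 5.5 vs 5 cycles/round: {with_jtr_cpu:.1f}x")
```

With 23 seconds per invocation the observed range works out to 2300x to
4600x, which the text rounds to "2000 to 5000 times", against roughly
10x predicted by this model.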
Even if we take 5 cycles per round for both the CPU and the FPGA,
this gives us only a 10x slowdown per core because of the clock rate
difference alone (an Eksblowfish core in an FPGA vs. code running on
a CPU core). (And we'd compensate for that by having something like
100 cores in a larger FPGA chip, as well as by reducing the number of
clock cycles per round, which doesn't have to be as bad as 5.) But
you're getting a ridiculous 2000 to 5000 times slowdown instead?

As to LUT count, it appears that you'd fit only 7 cores in the E-101
board's Spartan-6. That's obviously too few, but we'll target much
larger chips and we'll need to optimize the LUT count. For initial
testing, even 4 or 7 cores will work, but we need to get sane
performance from them and we need them to run in parallel for real
(as demonstrated by total execution time not increasing when you add
more invocations that are meant to run in parallel).

Please comment ASAP.

Thanks,

Alexander