Message-ID: <20130926030150.GA22958@openwall.com>
Date: Thu, 26 Sep 2013 07:01:50 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Katja's weekly report #15

Hi Katja,

On Mon, Sep 23, 2013 at 05:49:27PM +0200, Katja Malvoni wrote:
> I wasn't able to work yesterday and I won't be able to work today, I caught
> a flu.

Get well soon!

> Accomplishments:
> 1. Updated wiki page

Thanks!  As I had mentioned, we/you need to get the page at
http://openwall.info/wiki/john/development/Parallella linked from some
other wiki page(s), such as from john/development or/and from john.

> 2. Fixed bug so that bcrypt on FPGA doesn't fail self test on first run

Great.  What was the bug?

> 3. Partially optimized bcrypt on FPGA
> - using true dual port RAM for Sbox with two cycle latency. In
> simulation I have it with 1 cycle latency, 3 cycles per BF_ROUND and
> 1709766 cycles in total but it doesn't work on ZedBoard.

3 cycles per BF_ROUND sounds just right to me.  I assume it's one cycle
to fetch first two S-box elements, another cycle to fetch the other two,
and a third cycle to process these fetched values and compute the next
set of S-box indices, for the next round.  Correct?

Can you perhaps reduce this further, to two cycles per Blowfish round
(for most rounds), by fetching the next round's first two S-box elements
during the current round's "computation" cycle?  In other words, we can
and should be doing two S-box lookups from the block RAM on every cycle.
There's probably no good enough reason to waste a cycle on computation
alone, when we can also use this cycle to send two addresses to memory
and have the data ready the next cycle.  Yes, the maximum clock rate
might be a bit lower than with the 3-cycle approach, but probably by
very little.

And indeed, you need to get this working on the device, not just in
simulation.
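
The round structure and cycle counts discussed above can be sketched in C as a rough model, not as the actual FPGA design. The names bf_ctx, bf_round, bcrypt_rounds and bcrypt_cycles are hypothetical, the grouping of the four S-box reads into two "memory cycles" only illustrates what one dual-port block RAM could serve per cycle, and the count ignores any setup or I/O overhead. At bcrypt cost 5 (an assumption; the report does not state the cost) the model gives 544912 Blowfish rounds, so 1634736 cycles at 3 cycles per round and 1089824 at 2 cycles per round.

```c
#include <assert.h>
#include <stdint.h>

/* Blowfish state: 18-word P-array plus four 256-entry S-boxes. */
typedef struct {
    uint32_t P[18];
    uint32_t S[4][256];
} bf_ctx;

/* One Blowfish round. The four S-box reads are grouped in pairs to show
 * what a single dual-port block RAM could serve per cycle (assumed mapping,
 * not taken from the actual design). */
static void bf_round(const bf_ctx *ctx, uint32_t *L, uint32_t *R, int i)
{
    uint32_t x = *L ^ ctx->P[i];

    /* memory cycle A: two reads, one per RAM port */
    uint32_t s0 = ctx->S[0][x >> 24];
    uint32_t s1 = ctx->S[1][(x >> 16) & 0xff];

    /* memory cycle B: the other two reads */
    uint32_t s2 = ctx->S[2][(x >> 8) & 0xff];
    uint32_t s3 = ctx->S[3][x & 0xff];

    /* computation: combine the four values, then swap halves so that
     * the new *L determines the next round's S-box indices */
    uint32_t f = ((s0 + s1) ^ s2) + s3;
    *L = *R ^ f;
    *R = x;
}

/* Number of Blowfish rounds in one bcrypt computation at the given cost. */
static uint64_t bcrypt_rounds(unsigned cost)
{
    /* each key expansion rewrites 18 + 1024 words, two words per encryption */
    const uint64_t enc_per_expand = (18 + 4 * 256) / 2;   /* 521 */
    /* one salted expansion, then two expansions per cost-loop iteration */
    uint64_t expands = 1 + 2 * ((uint64_t)1 << cost);
    /* final stage: 64 iterations over three 64-bit blocks */
    const uint64_t final_enc = 64 * 3;

    return (expands * enc_per_expand + final_enc) * 16;
}

/* Idealized cycle count, ignoring setup, P-array and I/O overhead. */
static uint64_t bcrypt_cycles(unsigned cost, unsigned cycles_per_round)
{
    assert(cycles_per_round >= 1);
    return bcrypt_rounds(cost) * cycles_per_round;
}
```

Calling bcrypt_cycles(5, 3) and bcrypt_cycles(5, 2) reproduces the two figures quoted in the paragraph above. If the cost-5 assumption holds, the 3-cycle figure is within about 5% of the 1709766 cycles reported from simulation, with the remainder presumably being overhead outside the rounds themselves.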

If you do get the 2-cycle approach working, then it'd make sense to use
two block RAMs per core and do all four S-box lookups at once - which
means you can do one Blowfish round per cycle.  Yes, we'll waste half of
our block RAM capacity in this way, but the alternative of having twice
more cores (even if we can fit them in the device) that do one Blowfish
round per two cycles is possibly not any better.  Or is it?  Well, there
may be slight clock rate differences, as well as differences in
"overhead" related to first and/or last round.  We'd need to try both
approaches.

Does the above sound right to you?

> I will be moving to a new place this week and I won't be able to do much
> work but I will list here everything I can think of at the moment
>
> Priorities:
> 1. Finish optimization - it's about figuring out why having 1 cycle latency
> RAM doesn't produce correct result and figuring out clock problems
> 2. Implement multiple bcrypt cores in FPGA

Sounds good.

> 3. Replace mmap() calls in BF_fpga.c with proper drivers

What would those proper drivers be?  UIO, as I mentioned here? -

http://www.openwall.com/lists/john-dev/2013/06/04/2

> 4. Try to get bcrypt on 64-core Epiphany to work

Right.  I did not expect it'd be as tricky as it turned out to be, but I
am happy that you'd like to keep trying.

Thanks,

Alexander