Message-ID: <CA+EaD-aM80V_=8CTDoRtDbdMaRZhEmq=NVytaX3itCXA40DZXA@mail.gmail.com>
Date: Sat, 28 Sep 2013 09:30:24 +0200
From: Katja Malvoni <kmalvoni@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: Katja's weekly report #15
Hi Alexander,
I'm sorry about the delay. I moved to a new place and had some problems
with my internet connection, but it's all sorted out now.
On Thu, Sep 26, 2013 at 5:01 AM, Solar Designer <solar@...nwall.com> wrote:
> > Accomplishments:
> > 1. Updated wiki page
>
> Thanks! As I had mentioned, we/you need to get the page at
> http://openwall.info/wiki/john/development/Parallella linked from some
> other wiki page(s), such as from john/development or/and from john.
>
I added a link on john/development.
>
> > 2. Fixed bug so that bcrypt on FPGA doesn't fail self test on first run
>
> Great. What was the bug?
>
I should have said this differently: when I started using true dual-port
RAM for storing the S-boxes, the bug disappeared. I don't know what exactly
it was, and I changed a big portion of the code, so I can't point to a
specific part of it.
> > 3. Partially optimized bcrypt on FPGA
> > - using true dual port RAM for Sbox with two cycle latency. In
> > simulation I have it with 1 cycle latency, 3 cycles per BF_ROUND and
> > 1709766 cycles in total but it doesn't work on ZedBoard.
>
> 3 cycles per BF_ROUND sounds just right to me. I assume it's one cycle
> to fetch first two S-box elements, another cycle to fetch the other two,
> and a third cycle to process these fetched values and compute the next
> set of S-box indices, for the next round. Correct?
>
That is correct.
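
To make the three-step schedule concrete, here is a minimal C sketch of the Blowfish F function split the same way. The S-box formula is standard Blowfish; the assignment of lookups to cycles (S0 and S1 first, then S2 and S3) and the names bf_sboxes, bf_f_direct, bf_f_staged and bf_f_check are assumptions made for illustration, not taken from the FPGA code.

```c
#include <assert.h>
#include <stdint.h>

/* Four 256-entry Blowfish S-boxes (held in dual-port RAM in the FPGA design). */
typedef struct {
    uint32_t S[4][256];
} bf_sboxes;

/* Reference F function: all four lookups in a single expression. */
uint32_t bf_f_direct(const bf_sboxes *sb, uint32_t x)
{
    return ((sb->S[0][x >> 24] + sb->S[1][(x >> 16) & 0xff])
            ^ sb->S[2][(x >> 8) & 0xff]) + sb->S[3][x & 0xff];
}

/* Same function, written as the three steps of the schedule.
 * Which pair is fetched first is an assumption of this sketch. */
uint32_t bf_f_staged(const bf_sboxes *sb, uint32_t x)
{
    /* Step 1: two reads, one per RAM port. */
    uint32_t a = sb->S[0][x >> 24];
    uint32_t b = sb->S[1][(x >> 16) & 0xff];

    /* Step 2: the remaining two reads. */
    uint32_t c = sb->S[2][(x >> 8) & 0xff];
    uint32_t d = sb->S[3][x & 0xff];

    /* Step 3: combine the fetched values; the caller XORs the result
     * into the other half, which yields the next round's indices. */
    return ((a + b) ^ c) + d;
}

/* Sanity check that the staged form matches the reference. */
void bf_f_check(const bf_sboxes *sb, uint32_t x)
{
    assert(bf_f_staged(sb, x) == bf_f_direct(sb, x));
}
```

With only two read ports, the four lookups need two fetch steps, which is where the third cycle per round comes from.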
> Can you perhaps reduce this further, to two cycles per Blowfish round
> (for most rounds), by fetching the next round's first two S-box elements
> during the current round's "computation" cycle?
I think I can. I stopped working on optimizing it further when I noticed
that I can't get the current code working on the ZedBoard.
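
For a rough idea of what two cycles per round would buy, here is a small C cycle-count model. It is only a sketch: it counts Blowfish rounds and ignores all other overhead, it assumes that in the overlapped case only the first round of each 16-round block pays the extra fetch cycle, and the function names are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

enum { BF_ROUNDS = 16 };

/* Number of Blowfish block encryptions in one bcrypt hash:
 * one salted key expansion, 2^cost pairs of expansions, and
 * 64 encryptions of the three-block magic string. */
uint64_t bcrypt_blocks(unsigned cost)
{
    assert(cost < 32);
    /* Each expansion rewrites 18 P words and 4*256 S words, 2 words per block. */
    const uint64_t expand = (18 + 4 * 256) / 2; /* 521 */
    return expand + ((uint64_t)1 << cost) * 2 * expand + 64 * 3;
}

/* Current schedule: 3 cycles for every round. */
uint64_t cycles_sequential(uint64_t blocks)
{
    return blocks * BF_ROUNDS * 3;
}

/* Overlapped schedule (assumption): 2 cycles per round, plus 1 cycle
 * per block because the first fetch has nothing to overlap with. */
uint64_t cycles_overlapped(uint64_t blocks)
{
    return blocks * (BF_ROUNDS * 2 + 1);
}
```

Assuming cost 5, which this message does not state, the model gives 34057 blocks, 1634736 cycles at 3 cycles per round and 1123881 cycles with the overlap. The first figure is below the 1709766 cycles seen in simulation, as expected for a model that leaves out everything except the rounds.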
> [...]
>
> Does the above sound right to you?
>
It does. The only thing that worries me a bit is adding more bcrypt cores.
At the moment I have two ideas. The first is to connect all additional
cores to the same AXI bus and then use software registers to synchronise
reading and writing. I think this approach could have a large
communication overhead. Instead of software registers, additional logic
could be used to distribute data to the cores and start computation. I will
probably try both approaches and see which one performs better. The second
idea is to create one shared BRAM per core, but I don't think I can do that
without creating one DMA per core and a few AXI buses. That approach would
waste too many resources.
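
As an illustration of the software-register variant of the first idea, the host side could look roughly like the C sketch below. The register layout (one control and one status word per core), the CORE_START and CORE_DONE values, and the function names are all hypothetical; nothing here reflects the actual design.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CORE_START 1u /* hypothetical: written by the host to start a core */
#define CORE_DONE  1u /* hypothetical: set by the core when it has finished */

/* Hypothetical per-core register block as seen by the host over AXI. */
typedef struct {
    volatile uint32_t ctrl;
    volatile uint32_t status;
} core_regs;

/* Start every core: one bus write per core. */
void cores_start(core_regs *cores, size_t n)
{
    for (size_t i = 0; i < n; i++)
        cores[i].ctrl = CORE_START;
}

/* One polling pass: one bus read per core until a busy core is found. */
bool cores_all_done(const core_regs *cores, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if ((cores[i].status & CORE_DONE) == 0)
            return false;
    return true;
}
```

Every start and every polling pass costs at least one bus access per core, which is the communication overhead I am worried about.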
> > 3. Replace mmap() calls in BF_fpga.c with proper drivers
>
> What would those proper drivers be? UIO, as I mentioned here? -
>
> http://www.openwall.com/lists/john-dev/2013/06/04/2
>
My idea was to follow the example from Xilinx,
http://www.xilinx.com/support/documentation/sw_manuals/xilinx14_4/ug873-zynq-ctt.pdf
Chapter 9. In this manual they are using modules.
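
For comparison, the UIO route asked about above would still call mmap(), only on a /dev/uioN device node. A minimal C sketch follows, assuming the core were exported as a UIO device with its registers in the first memory map. The function name uio_map, the device path and the region size are placeholders, not something that exists in BF_fpga.c.

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map the first memory region (map0) of a UIO device.
 * Returns NULL on failure; on success *fd_out holds the open descriptor,
 * which can later be used with read() to wait for an interrupt. */
volatile uint32_t *uio_map(const char *dev_path, size_t size, int *fd_out)
{
    int fd = open(dev_path, O_RDWR | O_SYNC);
    if (fd < 0)
        return NULL;

    /* For UIO, map N is selected by offset N * page size; 0 selects map0. */
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return NULL;
    }

    *fd_out = fd;
    return (volatile uint32_t *)p;
}
```

A call might look like uio_map("/dev/uio0", 4096, &fd), with the path and the size depending on how the device is described to the kernel.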
Katja