Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20110607010659.GA17482@openwall.com>
Date: Tue, 7 Jun 2011 05:06:59 +0400
From: Solar Designer <solar@...nwall.com>
To: crypt-dev@...ts.openwall.com
Subject: Re: EskBlowFish with RAM results

Yuri -

On Mon, Jun 06, 2011 at 07:24:57PM -0300, Yuri Gonzaga wrote:
> Yes. It is going attached right now.

Even though I don't really know Verilog, I took a look at the code,
thanks.  It appears that you spend 5 clock cycles per Blowfish round:

1. L xor P[n]
2. read two S-box elements
3. process them, read two more S-box elements
4. process them
5. swap L and R, move to next round

I think you could reduce this to three (while still using two read ports
to BlockRAM only):

1. L xor P[n], read two S-box elements
2. process them, read two more S-box elements
3. process them, swap L and R, move to next round

or to 4 per two rounds:

1. L xor P[n], read two S-box elements
2. process them, read two more S-box elements
3. process them, R xor P[n], read two S-box elements
4. process them, read two more S-box elements, move to next round

I understand that you probably didn't optimize for speed yet, and maybe
I am missing something.

If it turns out that LUTs and not BlockRAMs are the scarce resource
(that is, if we fail to reduce the LUT count needed substantially), then
it'd make sense to waste half the BlockRAM space in order to use four
read ports per EksBlowfish core.  Then you should be able to do two
rounds per two cycles.

In fact, I am considering a larger EksBlowfish-like component, which
would fully use the BlockRAMs in one clock cycle - simply make the
S-boxes 9-to-36 (instead of Blowfish's 8-to-32).  So if you test the
above approach for the original EksBlowfish (which is directly useful
e.g. for JtR), we may reuse it for ours as well (for our new hashing
method).

> I don't know. Maybe, next experimentations could answer this question.
> I will try to reduce, but I don't know if I can achieve a factor of 10.

The first thing to do is exclude all code that is not in the performance
critical loop, as discussed before.

> I think it is 4 because I am storing initial constants in a ROM and the
> synthesizer implements this ROM with RAM.

Yes, this must be the case.

Thanks,

Alexander

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.