Message-ID: <20130527235834.GB31338@openwall.com>
Date: Tue, 28 May 2013 03:58:34 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Parallella: bcrypt

Katja, Yaniv -

On Sun, May 26, 2013 at 07:37:55PM -0400, Yaniv Sapir wrote:
> Here's a couple command line options you can try when compiling the code.
> Please look at the manual for further details.
> 
> -mfp-mode=int        # this sets the FPU mode to integer. However, please
> make sure that the generated code does not re-program the CONFIG register
> before every integer operation

Let's definitely try this.  I was afraid we'd have to resort to assembly
code to use the FPU in integer mode, so it's great news to me that we
apparently don't have to.

> -O3
> -Ofast               # e-gcc supports this level too

We can try these too, but I don't expect much/any advantage over -O2.

> -funroll-loops       # unroll the loops for better performance
> -falign-loops=8      # align the body of the loop to an 8-byte boundary
> -falign-functions=8  # same, but for functions entry point
> -ffast-math          # really, a FP option, but you may gain something here
> too

Of these, -funroll-loops shouldn't help since we're already
hand-unrolling the 16 rounds of Blowfish.

-falign-loops=8 and -falign-functions=8 are worth trying.  (The latter
should only make a measurable difference with the size-optimized
implementation, where a portion of code has been moved into a separate
function that is called from several places.)

-ffast-math should not matter (but it's OK to try anyway).

> If you find out that the code and data set won't fit in, try limiting the
> amount of loop unrolling and see the effect. Use:
> 
> -max-unroll-times
> -max-unrolled-insns
> -Os

These should have no effect since our Blowfish is hand-unrolled.  When
we go for 2 instances of bcrypt with interleaved instructions, I think
we may have to do one of two things: either base this on the
size-optimized implementation from musl (which I think will fit even
with 2 instances interleaved), or reduce the unrolling for Blowfish
(unroll 8 rounds out of 16 and loop over them twice, although this adds
some extra processing since the indices into P[] would no longer be
constant).
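
To make the unrolling trade-off concrete, here is a rough sketch in C of
the two variants: all 16 rounds unrolled versus 8 rounds unrolled and
looped over twice.  This is not the actual JtR or musl code.  The struct
layout, the names (BF_ctx, bf_encrypt_full, bf_encrypt_half,
bf_variants_agree) and the LCG-generated test data are assumptions made
up for illustration only.

```c
#include <stdint.h>

typedef uint32_t BF_word;

/* Illustrative state layout (an assumption, not the JtR structs):
 * 18 subkeys plus four 256-entry S-boxes. */
struct BF_ctx {
    BF_word P[18];
    BF_word S[4][256];
};

/* Standard Blowfish F function. */
static BF_word BF_F(const struct BF_ctx *c, BF_word x)
{
    return ((c->S[0][x >> 24] + c->S[1][(x >> 16) & 0xff]) ^
            c->S[2][(x >> 8) & 0xff]) + c->S[3][x & 0xff];
}

/* One round; n selects the subkey P[n + 1]. */
#define BF_ROUND(c, L, R, n) \
    do { (R) ^= BF_F((c), (L)) ^ (c)->P[(n) + 1]; } while (0)

/* All 16 rounds unrolled: every P[] index is a compile-time constant. */
static void bf_encrypt_full(const struct BF_ctx *c, BF_word lr[2])
{
    BF_word L = lr[0] ^ c->P[0], R = lr[1];

    BF_ROUND(c, L, R, 0);  BF_ROUND(c, R, L, 1);
    BF_ROUND(c, L, R, 2);  BF_ROUND(c, R, L, 3);
    BF_ROUND(c, L, R, 4);  BF_ROUND(c, R, L, 5);
    BF_ROUND(c, L, R, 6);  BF_ROUND(c, R, L, 7);
    BF_ROUND(c, L, R, 8);  BF_ROUND(c, R, L, 9);
    BF_ROUND(c, L, R, 10); BF_ROUND(c, R, L, 11);
    BF_ROUND(c, L, R, 12); BF_ROUND(c, R, L, 13);
    BF_ROUND(c, L, R, 14); BF_ROUND(c, R, L, 15);

    lr[0] = R ^ c->P[17];
    lr[1] = L;
}

/* 8 rounds unrolled, looped over twice: about half the round code,
 * but the P[] index now depends on the loop counter i. */
static void bf_encrypt_half(const struct BF_ctx *c, BF_word lr[2])
{
    BF_word L = lr[0] ^ c->P[0], R = lr[1];
    int i;

    for (i = 0; i < 16; i += 8) {
        BF_ROUND(c, L, R, i + 0); BF_ROUND(c, R, L, i + 1);
        BF_ROUND(c, L, R, i + 2); BF_ROUND(c, R, L, i + 3);
        BF_ROUND(c, L, R, i + 4); BF_ROUND(c, R, L, i + 5);
        BF_ROUND(c, L, R, i + 6); BF_ROUND(c, R, L, i + 7);
    }

    lr[0] = R ^ c->P[17];
    lr[1] = L;
}

/* Fill the state with arbitrary values from a simple LCG (test data
 * only, not a real key schedule) and compare the two variants on one
 * block.  Returns 1 if they produce the same output. */
static int bf_variants_agree(void)
{
    static struct BF_ctx c;
    BF_word seed = 1, a[2] = {0x01234567, 0x89abcdef}, b[2];
    int i, j;

    for (i = 0; i < 18; i++)
        c.P[i] = seed = seed * 1664525u + 1013904223u;
    for (i = 0; i < 4; i++)
        for (j = 0; j < 256; j++)
            c.S[i][j] = seed = seed * 1664525u + 1013904223u;

    b[0] = a[0];
    b[1] = a[1];
    bf_encrypt_full(&c, a);
    bf_encrypt_half(&c, b);
    return a[0] == b[0] && a[1] == b[1];
}
```

Since 8 is even, L and R are back in their original roles at the end of
each loop iteration, so the loop body needs no extra swap.  The cost is
the loop counter plus the run-time P[] indexing; whether that is cheaper
than the extra code size on Epiphany is something we'd have to measure.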

> Best strategy is to compile your functions in separate modules, then apply
> optimization switches that best match your needs. Usually when optimizing
> globally, you pay with unnecessarily bloated code.

I fully agree in general, but here we're dealing with a small piece of
code, and we're hand-unrolling just the right portions of it.  So I
think the separation of bcrypt implementation into modules would only
add complexity for no gain.

We may use this approach when importing larger and third-party code,
with no prior hand-unrolling.

Thanks,

Alexander
