Message-ID: <20130527235834.GB31338@openwall.com>
Date: Tue, 28 May 2013 03:58:34 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: Parallella: bcrypt

Katja, Yaniv -

On Sun, May 26, 2013 at 07:37:55PM -0400, Yaniv Sapir wrote:
> Here's a couple command line options you can try when compiling the code.
> Please look at the manual for further details.
>
> -mfp-mode=int # this sets the FPU mode to integer. However, please
> make sure that the generated code does not re-program the CONFIG register
> before every integer operation

Let's definitely try this. I was afraid we'd have to resort to assembly
code to use the FPU in integer mode - it's great news to me that we seem
not to have to.

> -O3
> -Ofast # e-gcc supports this level too

We can try these too, but I don't expect much/any advantage over -O2.

> -funroll-loops # unroll the loops for better performance
> -falign-loops=8 # align the body of the loop to an 8-byte boundary
> -falign-functions=8 # same, but for functions entry point
> -ffast-math # really, a FP option, but you may gain something here
> too

Of these, -funroll-loops shouldn't help since we're already
hand-unrolling the 16 rounds of Blowfish. -falign-loops=8 and
-falign-functions=8 are worth trying. (The latter should only make a
measurable difference with the size-optimized implementation, where a
portion of code has been moved into a separate function that is called
from several places.) -ffast-math should not matter (but it's OK to try
anyway).

> If you find out that the code and data set won't fit in, try limiting the
> amount of loop unrolling and see the effect. Use:
>
> -max-unroll-times
> -max-unrolled-insns
> -Os

These should have no effect since our Blowfish is hand-unrolled.
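
To make "hand-unrolled" concrete, here is a minimal C sketch. It is not the JtR or Parallella code. The tables are filled with arbitrary values rather than the real Blowfish constants, and the names toy_init, encrypt_rolled, encrypt_unrolled, BF_ROUND and check_equivalence are invented for this illustration. It only shows the shape of the idea: with all 16 rounds written out, every index into P[] is a compile-time constant and there is no loop left for -funroll-loops or the unroll limits to act on.

```c
#include <assert.h>
#include <stdint.h>

#define BF_N 16

/* Toy tables: NOT the real Blowfish constants and NOT a real key setup. */
static uint32_t P[BF_N + 2];
static uint32_t S[4][256];

/* Fill P and S with arbitrary values from a simple LCG (illustration only). */
static void toy_init(void)
{
    uint32_t x = 0x12345678u;
    int i, j;

    for (i = 0; i < BF_N + 2; i++)
        P[i] = x = x * 1664525u + 1013904223u;
    for (i = 0; i < 4; i++)
        for (j = 0; j < 256; j++)
            S[i][j] = x = x * 1664525u + 1013904223u;
}

/* Blowfish F function: four S-box lookups combined as ((a + b) ^ c) + d. */
static uint32_t F(uint32_t v)
{
    return ((S[0][v >> 24] + S[1][(v >> 16) & 0xff]) ^
            S[2][(v >> 8) & 0xff]) + S[3][v & 0xff];
}

/* Reference: a plain 16-iteration loop with a swap after every round. */
static void encrypt_rolled(uint32_t *L, uint32_t *R)
{
    uint32_t l = *L, r = *R, t;
    int i;

    for (i = 0; i < BF_N; i++) {
        l ^= P[i];
        r ^= F(l);
        t = l; l = r; r = t;
    }
    t = l; l = r; r = t;        /* undo the swap of the last round */
    r ^= P[BF_N];
    l ^= P[BF_N + 1];
    *L = l;
    *R = r;
}

/* One round; n is a literal constant, so P[n] is a fixed address. */
#define BF_ROUND(a, b, n) do { (a) ^= P[n]; (b) ^= F(a); } while (0)

/* Hand-unrolled: the per-round swap becomes alternating argument order. */
static void encrypt_unrolled(uint32_t *L, uint32_t *R)
{
    uint32_t l = *L, r = *R;

    BF_ROUND(l, r, 0);  BF_ROUND(r, l, 1);
    BF_ROUND(l, r, 2);  BF_ROUND(r, l, 3);
    BF_ROUND(l, r, 4);  BF_ROUND(r, l, 5);
    BF_ROUND(l, r, 6);  BF_ROUND(r, l, 7);
    BF_ROUND(l, r, 8);  BF_ROUND(r, l, 9);
    BF_ROUND(l, r, 10); BF_ROUND(r, l, 11);
    BF_ROUND(l, r, 12); BF_ROUND(r, l, 13);
    BF_ROUND(l, r, 14); BF_ROUND(r, l, 15);
    *L = r ^ P[BF_N + 1];
    *R = l ^ P[BF_N];
}

/* Aborts via assert() if the two variants disagree on the given block. */
static void check_equivalence(uint32_t l, uint32_t r)
{
    uint32_t l1 = l, r1 = r, l2 = l, r2 = r;

    encrypt_rolled(&l1, &r1);
    encrypt_unrolled(&l2, &r2);
    assert(l1 == l2 && r1 == r2);
}
```

Calling toy_init() and then check_equivalence() on a few blocks confirms that the unrolled form computes the same thing as the loop. The alignment options can still matter for such code because they affect where the function starts, not how the loop is expanded.
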

When we go for 2 instances of bcrypt with interleaved instructions, I
think we may have to either do this based on the size-optimized
implementation from musl (I think it will fit even with 2 instances
interleaved) or reduce the unrolling for Blowfish (unroll 8 rounds out
of 16 and loop over this two times, although this adds some extra
processing since the indices into P[] would no longer be constant).

> Best strategy is to compile your functions in separate modules, then apply
> optimization switches that best match your needs. Usually when optimizing
> globally, you pay with unnecessarily bloated code.

I fully agree in general, but here we're dealing with a small piece of
code, and we're hand-unrolling just the right portions of it. So I
think the separation of bcrypt implementation into modules would only
add complexity for no gain. We may use this approach when importing
larger and third-party code, with no prior hand-unrolling.

Thanks,

Alexander
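
As a footnote to the reduced-unrolling option mentioned in the message (unroll 8 rounds out of 16 and loop over them twice), here is a minimal C sketch of what that could look like. It is an assumption-laden illustration, not code from the thread: the tables hold arbitrary values rather than the real Blowfish constants, and toy_init, encrypt_rolled, encrypt_half_unrolled and BF_ROUND_AT are hypothetical names. The point it shows is that each P[] access becomes a run-time base pointer plus a constant offset instead of a fixed address, which is the "extra processing" referred to above. How much that costs on the Epiphany cores is not something this sketch measures.

```c
#include <assert.h>
#include <stdint.h>

#define BF_N 16

/* Toy tables: NOT the real Blowfish constants and NOT a real key setup. */
static uint32_t P[BF_N + 2];
static uint32_t S[4][256];

/* Fill P and S with arbitrary values from a simple LCG (illustration only). */
static void toy_init(void)
{
    uint32_t x = 0x9e3779b9u;
    int i, j;

    for (i = 0; i < BF_N + 2; i++)
        P[i] = x = x * 1664525u + 1013904223u;
    for (i = 0; i < 4; i++)
        for (j = 0; j < 256; j++)
            S[i][j] = x = x * 1664525u + 1013904223u;
}

/* Blowfish F function: four S-box lookups combined as ((a + b) ^ c) + d. */
static uint32_t F(uint32_t v)
{
    return ((S[0][v >> 24] + S[1][(v >> 16) & 0xff]) ^
            S[2][(v >> 8) & 0xff]) + S[3][v & 0xff];
}

/* Reference: a plain 16-iteration loop with a swap after every round. */
static void encrypt_rolled(uint32_t *L, uint32_t *R)
{
    uint32_t l = *L, r = *R, t;
    int i;

    for (i = 0; i < BF_N; i++) {
        l ^= P[i];
        r ^= F(l);
        t = l; l = r; r = t;
    }
    t = l; l = r; r = t;        /* undo the swap of the last round */
    r ^= P[BF_N];
    l ^= P[BF_N + 1];
    *L = l;
    *R = r;
}

/* One round using subkey p[n]: p varies at run time, n is a constant. */
#define BF_ROUND_AT(a, b, p, n) do { (a) ^= (p)[n]; (b) ^= F(a); } while (0)

/* Half-unrolled: 8 rounds written out, executed twice (p = P, then P + 8). */
static void encrypt_half_unrolled(uint32_t *L, uint32_t *R)
{
    uint32_t l = *L, r = *R;
    const uint32_t *p = P;
    int pass;

    for (pass = 0; pass < 2; pass++, p += 8) {
        /* 8 is even, so l and r are back in the same roles after each pass */
        BF_ROUND_AT(l, r, p, 0); BF_ROUND_AT(r, l, p, 1);
        BF_ROUND_AT(l, r, p, 2); BF_ROUND_AT(r, l, p, 3);
        BF_ROUND_AT(l, r, p, 4); BF_ROUND_AT(r, l, p, 5);
        BF_ROUND_AT(l, r, p, 6); BF_ROUND_AT(r, l, p, 7);
    }
    *L = r ^ P[BF_N + 1];
    *R = l ^ P[BF_N];
}

/* Aborts via assert() if the two variants disagree on the given block. */
static void check_equivalence(uint32_t l, uint32_t r)
{
    uint32_t l1 = l, r1 = r, l2 = l, r2 = r;

    encrypt_rolled(&l1, &r1);
    encrypt_half_unrolled(&l2, &r2);
    assert(l1 == l2 && r1 == r2);
}
```

The body of the loop is roughly half the code size of a fully unrolled version, which is the reason to consider it when two interleaved instances have to fit in local memory.
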