Message-ID: <20120810170435.GA29839@openwall.com>
Date: Fri, 10 Aug 2012 21:04:35 +0400
From: Solar Designer <solar@...nwall.com>
To: musl@...ts.openwall.com
Subject: Re: crypt* files in crypt directory

On Thu, Aug 09, 2012 at 07:21:32PM -0400, Rich Felker wrote:
> On Thu, Aug 09, 2012 at 01:58:12PM +0200, Szabolcs Nagy wrote:
> > > 	do {
> > > 		ptr += 2;
> > > 		L ^= ctx->s.P[0];
> > > 		BF_ROUND(L, R, 0);
[...]
> > > 		BF_ROUND(R, L, 15);
> > > 		tmp4 = R;
> > > 		R = L;
> > > 		L = tmp4 ^ ctx->s.P[BF_N + 1];
> > > 		*(ptr - 1) = R;
> > > 		*(ptr - 2) = L;
> > > 	} while (ptr < end);
> > 
> > why increase ptr at the beginning?
> > it seems the idiomatic way would be
> > 
> >  *ptr++ = L;
> >  *ptr++ = R;
> 
> For me, making this change makes it 5% faster. I suspect the
> difference comes from the fact that gcc is not smart enough to move
> the ptr+=2; across the rest of the loop body, and the fact that it
> gets spilled to the stack and reloaded for *both* points of usage
> rather than just one. The original version may perform better on
> machines with A LOT more registers, but I'm doubtful...

The spilling theory makes sense to me, but it does not fully explain the
5% difference - I think spilling ptr alone could account for 1% or so.
More likely the change affects register allocation across the whole
loop, not only for ptr - or something like it.
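
For anyone who wants to compare the generated code themselves, here is
a minimal sketch of the two loop shapes under discussion.  It is not
the crypt_blowfish code: mix() is a hypothetical stand-in for the 16
BF_ROUND steps and the P-box XORs, and the names fill_preinc() and
fill_postinc() are made up for illustration.  Both functions should
store the same words; only the point where ptr is advanced differs.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for the 16 BF_ROUND steps; this is NOT Blowfish. */
static void mix(uint32_t *L, uint32_t *R)
{
	uint32_t l = *L, r = *R;
	int i;

	for (i = 0; i < 16; i++) {
		uint32_t t;
		r ^= (l << 5 | l >> 27) + 0x9e3779b9u + i;
		t = l; l = r; r = t;
	}
	*L = l;
	*R = r;
}

/* Shape A (as in the quoted code): advance ptr first, then store
 * through negative offsets, so ptr is live across the whole body. */
static void fill_preinc(uint32_t *ptr, uint32_t *end)
{
	uint32_t L = 0, R = 0;

	do {
		ptr += 2;
		mix(&L, &R);
		*(ptr - 1) = R;
		*(ptr - 2) = L;
	} while (ptr < end);
}

/* Shape B (the suggested change): store, then advance, so ptr is
 * only touched at the end of the body. */
static void fill_postinc(uint32_t *ptr, uint32_t *end)
{
	uint32_t L = 0, R = 0;

	do {
		mix(&L, &R);
		*ptr++ = L;
		*ptr++ = R;
	} while (ptr < end);
}
```

Compiling both with something like "gcc -O2 -S" for the target in
question and comparing the loads and stores of ptr inside each loop
would show whether the spill is really there.  With a stand-in this
small the register pressure is much lower than in the real loop, so the
result may well differ from what the full code does.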

Anyhow, this does not match my own test results so far, across
different revisions of this code.  Which compiler, options,
architecture, and CPU did you use?

As written, this code does in fact want more registers than 32-bit x86
has - it needs one more register for the context pointer, which
crypt_blowfish (as opposed to JtR) introduced for thread-safety.  In
crypt_blowfish, I addressed this by some magic in the asm code, and
assumed that other common archs do have more than 8 registers.  With
the asm code dropped, maybe this piece of C does need to be optimized
for 32-bit x86 more - although it performs as well as the asm code on
CPUs newer than the original Pentium (where the asm code is a lot
faster) and other than Atom (where users reported the asm code being
significantly faster).

Alexander
