Message-ID: <20120809232132.GX27715@brightrain.aerifal.cx>
Date: Thu, 9 Aug 2012 19:21:32 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: crypt* files in crypt directory

On Thu, Aug 09, 2012 at 01:58:12PM +0200, Szabolcs Nagy wrote:
> > #define BF_ROUND(L, R, N) \
> > 	tmp1 = L & 0xFF; \
> > 	tmp2 = L >> 8; \
> > 	tmp2 &= 0xFF; \
> > 	tmp3 = L >> 16; \
> > 	tmp3 &= 0xFF; \
> > 	tmp4 = L >> 24; \
> > 	tmp1 = ctx->s.S[3][tmp1]; \
> > 	tmp2 = ctx->s.S[2][tmp2]; \
> > 	tmp3 = ctx->s.S[1][tmp3]; \
> > 	tmp3 += ctx->s.S[0][tmp4]; \
> > 	tmp3 ^= tmp2; \
> > 	R ^= ctx->s.P[N + 1]; \
> > 	tmp3 += tmp1; \
> > 	R ^= tmp3;
> 
> i guess this is performance critical, but
> i wouldn't spread those expressions over
> several lines
> 
> tmp1 = ctx->S[3][L & 0xff];
> tmp2 = ctx->S[2][L>>8 & 0xff];
> tmp3 = ctx->S[1][L>>16 & 0xff];
> tmp4 = ctx->S[0][L>>24 & 0xff];
> R ^= ctx->P[N+1];
> R ^= ((tmp3 + tmp4) ^ tmp2) + tmp1;

My first modified version, which removes the manual scheduling, is
significantly slower than the hand-scheduled version. I haven't tried
your version here yet, but it looks nicer, and I think it would be
reasonable to compare the two and see if it's better.
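
Before timing anything, it is worth confirming that the two forms of the
round compute the same value. What follows is only a rough sketch: the
struct bf_ctx layout, bf_fill(), and the function names are made up for
illustration, and the tables hold arbitrary values rather than the real
Blowfish constants. It says nothing about speed, only about equivalence.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the context; not the real crypt_blowfish layout. */
struct bf_ctx {
	uint32_t P[18];
	uint32_t S[4][256];
};

/* Fill the tables with arbitrary deterministic values (NOT Blowfish constants). */
static void bf_fill(struct bf_ctx *ctx)
{
	uint32_t x = 0x12345678u;
	for (int i = 0; i < 18; i++)
		ctx->P[i] = x = x * 1664525u + 1013904223u;
	for (int i = 0; i < 4; i++)
		for (int j = 0; j < 256; j++)
			ctx->S[i][j] = x = x * 1664525u + 1013904223u;
}

/* Hand-scheduled form, following the quoted macro step by step. */
static uint32_t round_scheduled(const struct bf_ctx *ctx, uint32_t L, uint32_t R, int N)
{
	uint32_t tmp1, tmp2, tmp3, tmp4;
	assert(N >= 0 && N + 1 < 18);
	tmp1 = L & 0xFF;
	tmp2 = L >> 8;
	tmp2 &= 0xFF;
	tmp3 = L >> 16;
	tmp3 &= 0xFF;
	tmp4 = L >> 24;
	tmp1 = ctx->S[3][tmp1];
	tmp2 = ctx->S[2][tmp2];
	tmp3 = ctx->S[1][tmp3];
	tmp3 += ctx->S[0][tmp4];
	tmp3 ^= tmp2;
	R ^= ctx->P[N + 1];
	tmp3 += tmp1;
	R ^= tmp3;
	return R;
}

/* Compact form, following the suggested rewrite. */
static uint32_t round_compact(const struct bf_ctx *ctx, uint32_t L, uint32_t R, int N)
{
	assert(N >= 0 && N + 1 < 18);
	uint32_t tmp1 = ctx->S[3][L & 0xff];
	uint32_t tmp2 = ctx->S[2][L >> 8 & 0xff];
	uint32_t tmp3 = ctx->S[1][L >> 16 & 0xff];
	uint32_t tmp4 = ctx->S[0][L >> 24 & 0xff];
	R ^= ctx->P[N + 1];
	R ^= ((tmp3 + tmp4) ^ tmp2) + tmp1;
	return R;
}

/* Returns 1 if both forms agree on a chain of 1000 rounds, 0 otherwise. */
static int rounds_agree(void)
{
	static struct bf_ctx ctx;
	uint32_t L = 0xdeadbeefu, R = 0x01234567u;
	bf_fill(&ctx);
	for (int i = 0; i < 1000; i++) {
		int N = i % 16;
		uint32_t a = round_scheduled(&ctx, L, R, N);
		uint32_t b = round_compact(&ctx, L, R, N);
		if (a != b)
			return 0;
		R = L;	/* swap halves, as alternating BF_ROUND calls do */
		L = a;
	}
	return 1;
}
```

Since both forms do the same unsigned 32-bit arithmetic, any difference
in speed would come from how the compiler schedules them, which has to
be measured per compiler and target.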

> > 	do {
> > 		ptr += 2;
> > 		L ^= ctx->s.P[0];
> > 		BF_ROUND(L, R, 0);
> > 		BF_ROUND(R, L, 1);
> > 		BF_ROUND(L, R, 2);
> > 		BF_ROUND(R, L, 3);
> > 		BF_ROUND(L, R, 4);
> > 		BF_ROUND(R, L, 5);
> > 		BF_ROUND(L, R, 6);
> > 		BF_ROUND(R, L, 7);
> > 		BF_ROUND(L, R, 8);
> > 		BF_ROUND(R, L, 9);
> > 		BF_ROUND(L, R, 10);
> > 		BF_ROUND(R, L, 11);
> > 		BF_ROUND(L, R, 12);
> > 		BF_ROUND(R, L, 13);
> > 		BF_ROUND(L, R, 14);
> > 		BF_ROUND(R, L, 15);
> > 		tmp4 = R;
> > 		R = L;
> > 		L = tmp4 ^ ctx->s.P[BF_N + 1];
> > 		*(ptr - 1) = R;
> > 		*(ptr - 2) = L;
> > 	} while (ptr < end);
> 
> why increase ptr at the beginning?
> it seems the idiomatic way would be
> 
>  *ptr++ = L;
>  *ptr++ = R;

For me, this change makes it 5% faster. I suspect the difference comes
from gcc not being smart enough to move the ptr+=2; across the rest of
the loop body, and from ptr being spilled to the stack and reloaded at
*both* points of use rather than just one. The original version may
perform better on machines with A LOT more registers, but I'm
doubtful...
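
To make the two loop shapes concrete, here is a small sketch. mix() is
a made-up placeholder for the sixteen BF_ROUND calls (it is not
Blowfish), and fill_early(), fill_late() and fills_agree() are names
invented for this example. Both loops assume a nonempty buffer with an
even number of words. The sketch only shows that the stores land in the
same places; whether one shape is actually faster depends on the
compiler and target and needs to be timed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical placeholder for the 16 rounds; NOT the Blowfish function. */
static void mix(uint32_t *L, uint32_t *R)
{
	*L ^= (*R << 5) | (*R >> 27);
	*R += *L * 2654435761u;
}

/* Original shape: ptr is bumped first, stores use negative offsets,
 * so ptr is live at two separated points in the loop body. */
static void fill_early(uint32_t *ptr, uint32_t *end)
{
	uint32_t L = 0, R = 0;
	do {
		ptr += 2;
		mix(&L, &R);
		*(ptr - 1) = R;
		*(ptr - 2) = L;
	} while (ptr < end);
}

/* Suggested shape: ptr is only touched at the stores. */
static void fill_late(uint32_t *ptr, uint32_t *end)
{
	uint32_t L = 0, R = 0;
	do {
		mix(&L, &R);
		*ptr++ = L;
		*ptr++ = R;
	} while (ptr < end);
}

/* Returns 1 if both loops write identical buffers, 0 otherwise. */
static int fills_agree(void)
{
	uint32_t a[64], b[64];
	fill_early(a, a + 64);
	fill_late(b, b + 64);
	for (size_t i = 0; i < 64; i++)
		if (a[i] != b[i])
			return 0;
	return 1;
}
```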

Rich
