musl - Re: crypt* files in crypt directory

Follow @Openwall on Twitter for new release announcements and other news

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20120809232132.GX27715@brightrain.aerifal.cx>
Date: Thu, 9 Aug 2012 19:21:32 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: crypt* files in crypt directory

On Thu, Aug 09, 2012 at 01:58:12PM +0200, Szabolcs Nagy wrote:
> > #define BF_ROUND(L, R, N) \
> > 	tmp1 = L & 0xFF; \
> > 	tmp2 = L >> 8; \
> > 	tmp2 &= 0xFF; \
> > 	tmp3 = L >> 16; \
> > 	tmp3 &= 0xFF; \
> > 	tmp4 = L >> 24; \
> > 	tmp1 = ctx->s.S[3][tmp1]; \
> > 	tmp2 = ctx->s.S[2][tmp2]; \
> > 	tmp3 = ctx->s.S[1][tmp3]; \
> > 	tmp3 += ctx->s.S[0][tmp4]; \
> > 	tmp3 ^= tmp2; \
> > 	R ^= ctx->s.P[N + 1]; \
> > 	tmp3 += tmp1; \
> > 	R ^= tmp3;
> 
> i guess this is performance critical, but
> i wouldn't spread those expressions over
> several lines
> 
> tmp1 = ctx->S[3][L & 0xff];
> tmp2 = ctx->S[2][L>>8 & 0xff];
> tmp3 = ctx->S[1][L>>16 & 0xff];
> tmp4 = ctx->S[0][L>>24 & 0xff];
> R ^= ctx->P[N+1];
> R ^= ((tmp3 + tmp4) ^ tmp2) + tmp1;

My first modified version to remove the manual scheduling is
significantly slower than the hand-scheduled version. I haven't tried
your version here yet, but it looks nicer and I think it would be
reasonable to compare and see if it's better.

> > 	do {
> > 		ptr += 2;
> > 		L ^= ctx->s.P[0];
> > 		BF_ROUND(L, R, 0);
> > 		BF_ROUND(R, L, 1);
> > 		BF_ROUND(L, R, 2);
> > 		BF_ROUND(R, L, 3);
> > 		BF_ROUND(L, R, 4);
> > 		BF_ROUND(R, L, 5);
> > 		BF_ROUND(L, R, 6);
> > 		BF_ROUND(R, L, 7);
> > 		BF_ROUND(L, R, 8);
> > 		BF_ROUND(R, L, 9);
> > 		BF_ROUND(L, R, 10);
> > 		BF_ROUND(R, L, 11);
> > 		BF_ROUND(L, R, 12);
> > 		BF_ROUND(R, L, 13);
> > 		BF_ROUND(L, R, 14);
> > 		BF_ROUND(R, L, 15);
> > 		tmp4 = R;
> > 		R = L;
> > 		L = tmp4 ^ ctx->s.P[BF_N + 1];
> > 		*(ptr - 1) = R;
> > 		*(ptr - 2) = L;
> > 	} while (ptr < end);
> 
> why increase ptr at the begining?
> it seems the idiomatic way would be
> 
>  *ptr++ = L;
>  *ptr++ = R;

For me, making this change makes it 5% faster. I suspect the
difference comes from the fact that gcc is not smart enough to move
the ptr+=2; across the rest of the loop body, and the fact that it
gets spilled to the stack and reloaded for *both* points of usage
rather than just one. The original version may perform better on
machines with A LOT more registers, but I'm doubtful...

Rich

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.