Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20120809164355.GA28398@openwall.com>
Date: Thu, 9 Aug 2012 20:43:55 +0400
From: Solar Designer <solar@...nwall.com>
To: musl@...ts.openwall.com
Subject: Re: crypt* files in crypt directory

On Thu, Aug 09, 2012 at 01:58:12PM +0200, Szabolcs Nagy wrote:
> it's scary how these crypto codes mix various
> signed and unsigned integer types

Yes, there's room for improvement here.  I'd probably write this
differently if I were writing it from scratch now.  Perhaps a cleanup is
in order, similarly to what we did for the FreeSec code recently.
Specifically, we could have a higher-level wrapper function accept char *
like crypt() is defined to accept, but pass these to a function accepting
unsigned char * and go with that from this point.

> i added review comments
> the style comments are subjective of course

Thanks.  Note that I deliberately did not merge Rich's cleanups yet in
order to be able to continue testing this as part of my tree, with its
wrapper.c still intact, etc.  As discussed with Rich on IRC, I left
merging those changes for musl's use to Rich.

> * Solar Designer <solar@...nwall.com> [2012-08-09 14:53:48 +0400]:
> > typedef unsigned int BF_word;
> > typedef signed int BF_word_signed;
> > 
> > /* Number of Blowfish rounds, this is also hardcoded into a few places */
> > #define BF_N				16
> > 
> > typedef BF_word BF_key[BF_N + 2];
> 
> i don't like these typedefs
> it seems the code assumes 32bit unsigned BF_word

Yes.  As you suggested while we were discussed FreeSec, in musl's
revision of the code we should use <stdint.h> types here.  I made those
changes to that code, but not yet to crypt_blowfish code - I think Rich
will do that.

> > #define BF_safe_atoi64(dst, src) \
> > { \
> > 	tmp = (unsigned char)(src); \
> > 	if ((unsigned int)(tmp -= 0x20) >= 0x60) return -1; \
> 
> tmp is already unsigned int

Rich is dropping BF_safe_atoi64() anyway, in favor of different code.
Thanks for the comment, though.  I might look into this when cleaning
this up for non-musl use.

> > #define BF_ROUND(L, R, N) \
> > 	tmp1 = L & 0xFF; \
> > 	tmp2 = L >> 8; \
> > 	tmp2 &= 0xFF; \
> > 	tmp3 = L >> 16; \
> > 	tmp3 &= 0xFF; \
> > 	tmp4 = L >> 24; \
> > 	tmp1 = ctx->s.S[3][tmp1]; \
> > 	tmp2 = ctx->s.S[2][tmp2]; \
> > 	tmp3 = ctx->s.S[1][tmp3]; \
> > 	tmp3 += ctx->s.S[0][tmp4]; \
> > 	tmp3 ^= tmp2; \
> > 	R ^= ctx->s.P[N + 1]; \
> > 	tmp3 += tmp1; \
> > 	R ^= tmp3;
> 
> i guess this is performance critical, but
> i wouldn't spread those expressions over
> several lines
> 
> tmp1 = ctx->S[3][L & 0xff];
> tmp2 = ctx->S[2][L>>8 & 0xff];
> tmp3 = ctx->S[1][L>>16 & 0xff];
> tmp4 = ctx->S[0][L>>24 & 0xff];
> R ^= ctx->P[N+1];
> R ^= ((tmp3 + tmp4) ^ tmp2) + tmp1;

I think this rewrite is likely to cause a slowdown with some compilers.

My revision of BF_ROUND makes the available parallelism explicit - for
example, that L & 0xFF may be computed early, but the corresponding
array lookup postponed until after other indices are also computed -
which leaves more time for the mask and address generation latency.

Sure, an optimizing compiler is supposed to figure this out and do
proper instruction scheduling on its own.  However, there exist cases
when some optimizations of a compiler have to be disabled - e.g., an
Apple person suggested disabling clang/llvm's instruction scheduling
when compiling this very code (well, a two hashes at a time revision of
it) a couple of years ago, as a workaround for the compiler doing it
poorly.  My hand-written instruction scheduling as above turned out to
work a lot better:

http://www.openwall.com/lists/john-users/2010/08/08/4

(and two follow-ups - click "thread-next").  It would be curious to see
what the same compiler would do with your "unoptimized" revision of the
source code, but I doubt that it would perform better.

> > 	do {
> > 		ptr += 2;
> > 		L ^= ctx->s.P[0];
> > 		BF_ROUND(L, R, 0);
> > 		BF_ROUND(R, L, 1);
> > 		BF_ROUND(L, R, 2);
> > 		BF_ROUND(R, L, 3);
> > 		BF_ROUND(L, R, 4);
> > 		BF_ROUND(R, L, 5);
> > 		BF_ROUND(L, R, 6);
> > 		BF_ROUND(R, L, 7);
> > 		BF_ROUND(L, R, 8);
> > 		BF_ROUND(R, L, 9);
> > 		BF_ROUND(L, R, 10);
> > 		BF_ROUND(R, L, 11);
> > 		BF_ROUND(L, R, 12);
> > 		BF_ROUND(R, L, 13);
> > 		BF_ROUND(L, R, 14);
> > 		BF_ROUND(R, L, 15);
> > 		tmp4 = R;
> > 		R = L;
> > 		L = tmp4 ^ ctx->s.P[BF_N + 1];
> > 		*(ptr - 1) = R;
> > 		*(ptr - 2) = L;
> > 	} while (ptr < end);
> 
> why increase ptr at the begining?

To hide the latency.  Indeed, a smart compiler should do this on its
own, but not all do it well.

> it seems the idiomatic way would be
> 
>  *ptr++ = L;
>  *ptr++ = R;

Yes, but note that the writes are made in the other order, for a reason:
R is computed first, so it is available for the write first.

All of this is a result of experiments, some on old systems many years
ago, some recent.

> > 			tmp[1] |= (BF_word_signed)(signed char)*ptr; /* bug */
> 
> i don't think the BF_word_signed cast helps here
> signed char is promoted to int and then
> it will be converted to BF_word

Yes, I just like it explicit to show what exactly is happening, because
it matters a lot.

> > 	if (setting[0] != '$' ||
> > 	    setting[1] != '2' ||
> > 	    setting[2] < 'a' || setting[2] > 'z' ||
> > 	    !flags_by_subtype[(unsigned int)(unsigned char)setting[2] - 'a'] ||
> 
> setting[2] is already range checked so
> the casts are not necessary
> (assuming 'a' < 'z')

Casting chars to (unsigned int)(unsigned char) for array indices is an
idiom for me.  Now, we have a promotion to (int) due to "- 'a'", but if
this is ever compiled with a C++ compiler, the 'a' constant would be of
type char rather than int, IIRC.  So better safe than sorry.

Yes, the range checking should have helped in this case, but I recall
seeing weird behavior with in-range (implies non-negative) char
subscripts on Alpha in 1990s before I started using this idiom.  IIRC,
instructions were generated such that high bytes of the array index
register were not cleared before the lookup.  Compiler bug?  But it was
seen with different compilers.

> i'd use for (;;) for infinite loops

I saw Rich do that, but I kept the infinite loops in this file I posted
consistent with those I use elsewhere for now.  Rich can musl'ify this
code in this way if desired.

> i like for better
> 
> for (done = 0; ; done++) {
> 	...
> 	if (done)
> 		break;
> 	...
> }

I might try that and see if it's actually cleaner and as fast.  Instead
of "done++", I'd use "done = 1".

> > 		count = 64;
> > 		do {
> > 			L = BF_encrypt(&data.ctx, L, LR[1],
> > 			    &LR[0], &LR[0]);
> > 		} while (--count);
> 
> for (count = 0; count < 64; count++)
> 
> i assume it will be optimized to the same thing
> (count is not used later)

I hope/expect that it will be optimized to the same thing, but I am not
sure it'd happen with all relevant compilers all the time.  I like
writing performance-critical code closer to the desired assembly code.

This one loop is only moderately performance relevant, though.

> > 	output[7 + 22 - 1] = BF_itoa64[(int)
> > 		BF_atoi64[(int)setting[7 + 22 - 1] - 0x20] & 0x30];
> 
> is setting[28] range checked at this point?

Yes, as part of the salt decoding at function start.

> the int casts do not help

Yes, there's promotion to int here anyway.

> > 	_crypt_output_magic(setting, output, size);
> > 	retval = BF_crypt(key, setting, output, size, 16);
> > 	save_errno = errno;
> 
> why save errno before the self test?
> 
> if the self test fails then errno can be anything anyway

Yes, if the test fails, we set errno explicitly to EINVAL after this
point.

> if it does not fail then errno is unchanged i guess

Yes, this should be the case currently.  I just did not want to have a
pitfall in there in case of future changes to BF_crypt() where errno
could potentially be altered on a successful call.

> > 	memcpy(buf.s, test_setting, sizeof(buf.s));
> 
> sizeof buf.s
> without ()

I am consistently using sizeof() with the braces.  Maybe musl's coding
style is different.  If so, this may be changed indeed.

> > 	memset(buf.o, 0x55, sizeof(buf.o));
> > 	buf.o[sizeof(buf.o) - 1] = 0;
> > 	p = BF_crypt(test_key, buf.s, buf.o, sizeof(buf.o) - (1 + 1), 1);
> 
> ditto
> 
> i'd write (1 + 1) as 2

I'm not sure which is cleaner.  It can be said that both the '\x55' and
the NUL after it are canary bytes and thus they're the same kind of
thing and we can talk of a canary of size 2.  Then we'd need to use this
"2" in three places, not just here.  However, in the way we specify the
test_hash[] strings, the NUL is implicit, which kind of makes it
separate from the '\x55'.

> > 	    test_hash[(unsigned int)(unsigned char)buf.s[2] & 1],
> 
> the extra (unsigned int) cast is unnecessary

Yes, we could break the idiom and rely on the promotion to int here.
(Without such promotion or cast, IIRC, things would sometimes fail on
Alpha.)

Thanks,

Alexander

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.