john-users - Re: DIGEST-MD5, dominosec optimization

Message-ID: <43F64AA9.5030906@o2.pl>
Date: Fri, 17 Feb 2006 23:14:01 +0100
From: Michal Luczaj <regenrecht@...pl>
To:  john-users@...ts.openwall.com
Subject: Re: DIGEST-MD5, dominosec optimization

Solar Designer wrote:
> I don't think you should be optimizing the memcpy() and memcmp() now
>  (although you can get back to them later).  If you do, you might
> find MMX faster than SSE - especially on other than Intel P4.

You are probably right. I did that because, as a simple-minded human
being, I thought: "for a whole 16-byte memcpy - one load, one store - that
*must* be faster than anything". Hell yeah, it wasn't :]

> More importantly, you must make sure that your arrays are naturally 
> aligned - so simply declaring them as arrays of "char" won't do - and
>  also the stack might not be sufficiently aligned for that.

Sure. I was using MOVAPS, not MOVUPS, and I did take care of that.

> gcc 2.95+ attempts to align the stack for SSE by default, though - 
> and this even has a (small) performance and size impact for code 
> which does not need that, so I am usually disabling this feature. The
> OS must cooperate, too.  The best thing you can do is simply not 
> place variables requiring an alignment larger than the architecture's
>  natural word size - which is 4 bytes for x86 - on the stack.

I had no problems with stack alignment. And I'm pretty sure it was
16-byte aligned, otherwise I would have gotten a segfault. From what I
recall, I only changed ARCH_ALLOWS_UNALIGNED to 0 and set
-mpreferred-stack-boundary=4. And maybe added some
__attribute__ ((aligned (16)))...? I just don't remember now.
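
For illustration only, here is a minimal sketch of the kind of alignment
setup described above. It is not the code I actually used: the names
state_buf, is_aligned16 and require_aligned16 are made up for this
example, and it assumes gcc (or a compatible compiler) for the
__attribute__ syntax. The buffer sits at file scope, so its alignment
does not depend on the stack at all.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16-byte buffer. It lives at file scope, so its
 * alignment does not rely on how the stack happens to be aligned. */
static unsigned char state_buf[16] __attribute__ ((aligned (16)));

/* Return 1 if p is 16-byte aligned, which is what MOVAPS requires
 * of its memory operand. */
static int is_aligned16(const void *p)
{
	return ((uintptr_t)p & 15) == 0;
}

/* Debug-build check to call before an aligned SSE load or store. */
static void require_aligned16(const void *p)
{
	assert(is_aligned16(p));
}
```

A call such as require_aligned16(state_buf) in a debug build would catch
a misaligned buffer before MOVAPS faults on it.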

Let me explain: I had no SSE specific problems (segfaults and all that
memory related stuff). Everything worked fine. I just wonder why it
turned out to be sooo slooow.

> As it relates to the XOR'ing, you definitely want to apply it to 
> quantities larger than "char" whenever possible.  (But this does not
> appear to be possible in your inner loop.)

There is something like this:

	for (i = 0; i < 16; ++i)
		x[i] = state[i] ^ block[i];

And I switched it to:

	_mm_store_ps(
		(float*)x,
		_mm_xor_ps(
			_mm_load_ps((float*)state),
			_mm_load_ps((float*)block)
			)
		);

Maybe this piece of code is just not important enough... But still, even
if it doesn't give a boost, why-oh-why does it slow the whole thing down?
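
For comparison, a middle ground between the byte loop and the SSE
version would be XOR'ing in 32-bit words, which is how I read the
"quantities larger than char" advice. This is only a sketch: the
function name xor16_words is made up, the memcpy() calls are there just
to avoid alignment and aliasing assumptions, and I am not claiming it is
faster than either variant above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* XOR two 16-byte blocks into x, four 32-bit words at a time.
 * The small fixed-size memcpy() calls keep this safe for any
 * alignment; compilers typically turn them into plain loads and
 * stores. */
static void xor16_words(unsigned char *x,
	const unsigned char *state, const unsigned char *block)
{
	uint32_t a, b;
	int i;

	assert(x != NULL && state != NULL && block != NULL);

	for (i = 0; i < 16; i += 4) {
		memcpy(&a, state + i, sizeof(a));
		memcpy(&b, block + i, sizeof(b));
		a ^= b;
		memcpy(x + i, &a, sizeof(a));
	}
}
```

Since XOR works on each bit independently, the result is byte-for-byte
the same as the original loop, regardless of endianness.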

And the answer might be: because I'm a lousy programmer ;)

Cheers!
Michal
