Message-ID: <43F64AA9.5030906@o2.pl>
Date: Fri, 17 Feb 2006 23:14:01 +0100
From: Michal Luczaj <regenrecht@...pl>
To: john-users@...ts.openwall.com
Subject: Re: DIGEST-MD5, dominosec optimization

Solar Designer wrote:
> I don't think you should be optimizing the memcpy() and memcmp() now
> (although you can get back to them later). If you do, you might
> find MMX faster than SSE - especially on other than Intel P4.

You are probably right. I did that because as a simple-minded human
being I thought: "for a whole 16-byte memcpy - one load, one store -
that *must* be faster than anything". Hell yeah, it wasn't :]

> More importantly, you must make sure that your arrays are naturally
> aligned - so simply declaring them as arrays of "char" won't do - and
> also the stack might not be sufficiently aligned for that.

Sure. I was using MOVAPS, not MOVUPS, and I did take care of that.

> gcc 2.95+ attempts to align the stack for SSE by default, though -
> and this even has a (small) performance and size impact for code
> which does not need that, so I am usually disabling this feature. The
> OS must cooperate, too. The best thing you can do is simply not
> place variables requiring an alignment larger than the architecture's
> natural word size - which is 4 bytes for x86 - on the stack.

I had no problems with stack alignment. And I'm pretty sure it was
16-byte aligned, otherwise I would have gotten a segfault. From what I
recall I only changed ARCH_ALLOWS_UNALIGNED to 0 and set
-mpreferred-stack-boundary=4. And maybe added some
__attribute__ ((aligned (16)))...? I just don't remember now.

Let me explain: I had no SSE-specific problems (segfaults and all that
memory-related stuff). Everything worked fine. I just wonder why it
turned out to be sooo slooow.

> As it relates to the XOR'ing, you definitely want to apply it to
> quantities larger than "char" whenever possible. (But this does not
> appear to be possible in your inner loop.)

There is something like this:

    for (i = 0; i < 16; ++i)
        x[i] = state[i] ^ block[i];

And I switched it to:

    _mm_store_ps((float *)x,
                 _mm_xor_ps(_mm_load_ps((float *)state),
                            _mm_load_ps((float *)block)));

Maybe this piece of code is just not important enough... But still,
even if it doesn't give a boost, why-oh-why does it slow the whole
thing down? And the answer might be: because I'm a lousy programmer ;)

Cheers!
Michal
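
Editor's sketch, not part of the original message: the 16-byte XOR discussed above can be written three ways for comparison. This is a minimal illustration under stated assumptions. The helper names (xor16_bytes, xor16_words, xor16_sse2) are hypothetical and do not come from John the Ripper. The SSE2 variant uses the integer-domain intrinsics from <emmintrin.h> rather than the float-domain ones in the message, is only compiled when the compiler defines __SSE2__, and assumes all three pointers are 16-byte aligned. No claim is made about which variant is faster; that would have to be measured in the actual inner loop.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#ifdef __SSE2__
#include <emmintrin.h>   /* SSE2 integer intrinsics */
#endif

/* Hypothetical helper: the byte-at-a-time loop from the message. */
static void xor16_bytes(unsigned char *x, const unsigned char *state,
                        const unsigned char *block)
{
    int i;
    for (i = 0; i < 16; ++i)
        x[i] = state[i] ^ block[i];
}

/* Hypothetical helper: the same XOR done on two 64-bit words.
 * memcpy() is used so that no alignment or aliasing assumptions
 * are needed; compilers commonly lower it to plain loads/stores. */
static void xor16_words(unsigned char *x, const unsigned char *state,
                        const unsigned char *block)
{
    uint64_t s[2], b[2];
    memcpy(s, state, 16);
    memcpy(b, block, 16);
    s[0] ^= b[0];
    s[1] ^= b[1];
    memcpy(x, s, 16);
}

#ifdef __SSE2__
/* Hypothetical helper: one aligned 128-bit load per input, one XOR,
 * one aligned store. All pointers MUST be 16-byte aligned, or the
 * aligned load/store instructions will fault. */
static void xor16_sse2(unsigned char *x, const unsigned char *state,
                       const unsigned char *block)
{
    _mm_store_si128((__m128i *)x,
                    _mm_xor_si128(_mm_load_si128((const __m128i *)state),
                                  _mm_load_si128((const __m128i *)block)));
}
#endif
```

With gcc, buffers passed to xor16_sse2 could be declared as, for example, "unsigned char x[16] __attribute__ ((aligned (16)));", preferably as static or global objects rather than on the stack, in line with the advice quoted above.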