Message-ID: <20060217224002.GA7177@openwall.com>
Date: Sat, 18 Feb 2006 01:40:02 +0300
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: DIGEST-MD5, dominosec optimization

On Fri, Feb 17, 2006 at 11:14:01PM +0100, Michal Luczaj wrote:
> Sure. I'm was using MOVAPS not MOVUPS and I did take care of that.
[...]
> I had no problems with stack alignment. And I'm pretty sure it was
> 16-byte align, otherwise I would get segfault.

I hope so - unless your CPU or OS actually supports unaligned accesses
even with MOVAPS.

> From what I recall I only changed ARCH_ALLOWS_UNALIGNED to 0

This was not needed. It would only slow down a few things. x86 _can_
do unaligned 32-bit accesses with penalties that are not too bad, so
ARCH_ALLOWS_UNALIGNED should be set to 1. Your use of SSE in a few
places does not change that.

> and set -mpreferred-stack-boundary=4.

That's the default with recent gcc.

> And maybe some __attribute__ ((aligned (16)))...? I just don't remember now.

Yes, you would need that attribute - or you would need to use a data
type that is this large anyway.

> Let me explain: I had no SSE specific problems (segfaults and all that
> memory related stuff). Everything worked fine. I just wonder why it
> turned out to be sooo slooow.

I am not sure either. I would suspect unaligned accesses.

> > As it relates to the XOR'ing, you definitely want to apply it to
> > quantities larger than "char" whenever possible. (But this does not
> > appear to be possible in your inner loop.)
>
> There is something like this:
>
> for (i = 0; i < 16; ++i)
> 	x[i] = state[i] ^ block[i];

Yes, but it's not your inner loop.

> And I switched it to:
>
> _mm_store_ps(
> 	(float*)x,
> 	_mm_xor_ps(
> 		_mm_load_ps((float*)state),
> 		_mm_load_ps((float*)block)
> 	)
> )
>
> Maybe this piece of code is just not important enough...

Exactly.

> But still, even
> if it don't give a boost, why-oh-why it slow the whole thing down?

Well, another guess would be that gcc generates code different from
what you would expect. Did you review the assembly output
("gcc -S ...")?

Anyway, this has little to do with John the Ripper and with optimizing
your patches to JtR. If you want to achieve a significant speedup, you
should concentrate on trying multiple passwords in parallel and taking
advantage of that in your inner loop. For dominosec, it does not
appear to be possible to reasonably use SSE there - so don't.

-- 
Alexander Peslyak <solar at openwall.com>
GPG key ID: B35D3598 fp: 6429 0D7E F130 C13E C929 6447 73C3 A290 B35D 3598
http://www.openwall.com - bringing security into open computing environments

Was I helpful? Please give your feedback here: http://rate.affero.net/solar
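
For reference, the XOR variants discussed in this thread can be put side
by side in a minimal sketch. This is not code from JtR or from the patch
in question. The block16 type and the xor_* function names are
hypothetical, the aligned attribute uses gcc syntax, and the 128-bit
version uses the integer SSE2 intrinsics (_mm_load_si128, _mm_xor_si128,
_mm_store_si128) in place of the float-typed _mm_*_ps calls quoted above.
It falls back to the 32-bit version when SSE2 is not enabled.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define BLOCK_SIZE 16

/* Hypothetical buffer type: the union allows byte or 32-bit word access,
 * and the attribute (gcc syntax) forces 16-byte alignment so that the
 * aligned SSE loads and stores below are valid. */
typedef union {
	unsigned char b[BLOCK_SIZE];
	uint32_t w[BLOCK_SIZE / 4];
} __attribute__ ((aligned (16))) block16;

/* Baseline: sixteen 8-bit XORs, as in the quoted loop. */
static void xor_bytes(block16 *x, const block16 *state, const block16 *block)
{
	int i;

	for (i = 0; i < BLOCK_SIZE; i++)
		x->b[i] = state->b[i] ^ block->b[i];
}

/* Quantities larger than "char": four 32-bit XORs. */
static void xor_words(block16 *x, const block16 *state, const block16 *block)
{
	int i;

	for (i = 0; i < BLOCK_SIZE / 4; i++)
		x->w[i] = state->w[i] ^ block->w[i];
}

/* One 128-bit XOR using aligned loads and an aligned store. */
static void xor_sse2(block16 *x, const block16 *state, const block16 *block)
{
	/* Aligned SSE accesses fault on addresses that are not 16-byte aligned. */
	assert(((uintptr_t)x & 15) == 0);
	assert(((uintptr_t)state & 15) == 0);
	assert(((uintptr_t)block & 15) == 0);
#if defined(__SSE2__)
	_mm_store_si128((__m128i *)x->b,
	    _mm_xor_si128(
	        _mm_load_si128((const __m128i *)state->b),
	        _mm_load_si128((const __m128i *)block->b)));
#else
	xor_words(x, state, block); /* portable fallback without SSE2 */
#endif
}
```

All three functions produce the same 16 output bytes. Whether any of them
is faster in practice depends on the code the compiler actually emits,
which is why reviewing the "gcc -S" output is worthwhile.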