Message-ID: <4BF38020.3070100@banquise.net>
Date: Wed, 19 May 2010 08:07:28 +0200
From: Simon Marechal <simon@...quise.net>
To: john-users@...ts.openwall.com
Subject: Re: C compiler generated SSE2 code

On 19/05/2010 00:38, Solar Designer wrote:
>> This does speak for itself :) The icc does disentangle the whole stuff,
>> but is still faster with 3 loops (only 2 in the sample).
>
> I think you need to disentangle the source code rather than leave that
> for the compiler. Specifically, I'd remove the "unneeded" MD5_PARA_DO
> loops. Instead, I'd define macros around primitives such as xor, which
> would perform the required number of instances of the operation. They
> would use constants for the array indices - or, if that does not work
> well enough, even use individual local variables instead of array
> elements. This is more similar to what I have in MD5_std.c, where I use
> separate local variables for the two instances of MD5:
>
> MD5_word a0, b0 = Cb, c0 = Cc, d0;
> MD5_word a1, b1, c1, d1;
> MD5_word u, v;
>
> I understand that you like to be able to easily adjust the number of
> instances that you mix, but you'll have to achieve that by defining your
> xor, etc. macros differently for common instance counts (say, 2 vs. 3).

With 3 instances, IIRC, icc doesn't disentangle the code and builds
something far more efficient than with 2. I noticed that when looking at
the compiled code of BarsWF and wondered how the author got such good
register scheduling.

Without the "PARA_DO" stuff, the code might become friendlier to gcc.

I still believe that shipping an efficient .S file, generated with icc
for example, would be better. But I'm not sure about licensing issues
with the free icc version ...
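
For illustration, the macro approach described in the quoted text (one macro per primitive, defined differently for 2 and 3 instances, operating on individual local variables) could look roughly like the sketch below. This is not code from John the Ripper: the PARA_* macro names, the md5_f_para function and the MD5_PARA knob as used here are hypothetical, and plain 32-bit words stand in for the SSE2 vectors that the real code would use.

```c
#include <stdint.h>

typedef uint32_t MD5_word;

/* Hypothetical knob: number of interleaved MD5 instances (2 or 3). */
#ifndef MD5_PARA
#define MD5_PARA 2
#endif

/*
 * Each primitive is written out once per instance, using separate local
 * variables (v0, v1, v2) built by token pasting, so no loop is left for
 * the compiler to disentangle.
 */
#if MD5_PARA == 2
#define PARA_DECL(v)      MD5_word v##0, v##1
#define PARA_LOAD(v, p)   do { v##0 = (p)[0]; v##1 = (p)[1]; } while (0)
#define PARA_STORE(p, v)  do { (p)[0] = v##0; (p)[1] = v##1; } while (0)
#define PARA_XOR(d, a, b) do { d##0 = a##0 ^ b##0; d##1 = a##1 ^ b##1; } while (0)
#define PARA_AND(d, a, b) do { d##0 = a##0 & b##0; d##1 = a##1 & b##1; } while (0)
#elif MD5_PARA == 3
#define PARA_DECL(v)      MD5_word v##0, v##1, v##2
#define PARA_LOAD(v, p)   do { v##0 = (p)[0]; v##1 = (p)[1]; v##2 = (p)[2]; } while (0)
#define PARA_STORE(p, v)  do { (p)[0] = v##0; (p)[1] = v##1; (p)[2] = v##2; } while (0)
#define PARA_XOR(d, a, b) do { d##0 = a##0 ^ b##0; d##1 = a##1 ^ b##1; \
                               d##2 = a##2 ^ b##2; } while (0)
#define PARA_AND(d, a, b) do { d##0 = a##0 & b##0; d##1 = a##1 & b##1; \
                               d##2 = a##2 & b##2; } while (0)
#else
#error "this sketch only defines the macros for MD5_PARA == 2 or 3"
#endif

/*
 * MD5 round-1 function F(x, y, z) = z ^ (x & (y ^ z)), evaluated for all
 * MD5_PARA instances at once. Each array holds one word per instance.
 */
void md5_f_para(MD5_word *out, const MD5_word *x,
                const MD5_word *y, const MD5_word *z)
{
    PARA_DECL(b);
    PARA_DECL(c);
    PARA_DECL(d);
    PARA_DECL(t);

    PARA_LOAD(b, x);
    PARA_LOAD(c, y);
    PARA_LOAD(d, z);

    PARA_XOR(t, c, d);   /* t = y ^ z             */
    PARA_AND(t, b, t);   /* t = x & (y ^ z)       */
    PARA_XOR(t, d, t);   /* t = z ^ (x & (y ^ z)) */

    PARA_STORE(out, t);
}
```

Building with -DMD5_PARA=3 switches every primitive to three instances without touching md5_f_para itself. In an SSE2 build the same macros would presumably wrap intrinsics such as _mm_xor_si128 and _mm_and_si128 on __m128i variables instead of the scalar operators used here.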