Message-ID: <CAKtfLcu4ZRPS_8DPXhh=J4OEcN2Mm2CScK6k77L94i0yKahcMA@mail.gmail.com>
Date: Tue, 21 May 2013 18:32:48 -0400
From: Alain Espinosa <alainesp@...il.com>
To: john-dev@...ts.openwall.com
Subject: Re: 5x intrinsics?

On 5/21/13, magnum <john.magnum@...hmail.com> wrote:
> I see Alain's NT format is "5x" for 32-bit SSE2 builds, ie. it does 4x in
> SSE2 plus 1x in non-SSE. I presume these are interleaved for hiding latency
> so doing that extra 1x more or less for free. Would this be theoretically
> and practically worthwhile for the intrinsics? Maybe it'd just get very
> messy. I can't remember any discussion on this matter...

In my testing with a Pentium 4, this gives a very small speedup. With
faster SSE engines (beginning with Core 2 Duo), the 32-bit implementation
will 'probably' be slower than an SSE2-only implementation.

In 64-bit builds we interleave two SSE2 streams (2*4x), which results in a
good speedup. I tried a 3*4x SSE2 implementation and there wasn't any
performance gain (I tried this on Core 2 Duos). Again, with more vector
ports in recent CPUs, we may test this again.

An improvement over the 64-bit SSE2 implementation is the use of
non-destructive source operands with AVX. Also worth considering for
upcoming Intel CPUs is an AVX2 implementation with 4*8x (using
non-destructive source operands and some temporary memory use for
rotating). It will probably provide a speedup, given that those CPUs have
more ports and a better memory engine.

saludos,
alain
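
To make the 2*4x idea concrete, here is a minimal sketch of what interleaving two independent SSE2 streams might look like for a single MD4-style round-1 step (NT hashes are MD4-based). This is not code from John the Ripper. The function names (md4_f, md4_step_2x4, md4_step_scalar, md4_step_2x4_matches_scalar), the fixed shift of 3, and the test data are assumptions chosen for illustration. The point is only that stream 0 and stream 1 have no data dependency on each other, so the CPU can overlap their latencies.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* MD4 round-1 boolean function F(b, c, d) = (b & c) | (~b & d), on 4 lanes. */
static inline __m128i md4_f(__m128i b, __m128i c, __m128i d)
{
    /* _mm_andnot_si128(b, d) computes (~b) & d */
    return _mm_or_si128(_mm_and_si128(b, c), _mm_andnot_si128(b, d));
}

/* Rotate each 32-bit lane left by 3 (SSE2 has no rotate instruction). */
#define ROTL3(v) _mm_or_si128(_mm_slli_epi32((v), 3), _mm_srli_epi32((v), 29))

/* Scalar reference: a = rotl(a + F(b, c, d) + x, 3). */
static uint32_t md4_step_scalar(uint32_t a, uint32_t b, uint32_t c,
                                uint32_t d, uint32_t x)
{
    uint32_t t = a + ((b & c) | (~b & d)) + x;
    return (t << 3) | (t >> 29);
}

/*
 * The same step for 8 candidates at once: two independent 4x streams
 * (suffix 0 and 1) whose operations are written interleaved, so one
 * stream's instructions can issue while the other's are still in flight.
 */
static void md4_step_2x4(uint32_t a[8], const uint32_t b[8],
                         const uint32_t c[8], const uint32_t d[8],
                         const uint32_t x[8])
{
    __m128i a0 = _mm_loadu_si128((const __m128i *)a);
    __m128i a1 = _mm_loadu_si128((const __m128i *)(a + 4));
    __m128i b0 = _mm_loadu_si128((const __m128i *)b);
    __m128i b1 = _mm_loadu_si128((const __m128i *)(b + 4));
    __m128i c0 = _mm_loadu_si128((const __m128i *)c);
    __m128i c1 = _mm_loadu_si128((const __m128i *)(c + 4));
    __m128i d0 = _mm_loadu_si128((const __m128i *)d);
    __m128i d1 = _mm_loadu_si128((const __m128i *)(d + 4));
    __m128i x0 = _mm_loadu_si128((const __m128i *)x);
    __m128i x1 = _mm_loadu_si128((const __m128i *)(x + 4));

    __m128i f0 = md4_f(b0, c0, d0);
    __m128i f1 = md4_f(b1, c1, d1);
    a0 = _mm_add_epi32(a0, f0);
    a1 = _mm_add_epi32(a1, f1);
    a0 = _mm_add_epi32(a0, x0);
    a1 = _mm_add_epi32(a1, x1);
    a0 = ROTL3(a0);
    a1 = ROTL3(a1);

    _mm_storeu_si128((__m128i *)a, a0);
    _mm_storeu_si128((__m128i *)(a + 4), a1);
}

/* Returns 1 if the interleaved version agrees with the scalar reference. */
static int md4_step_2x4_matches_scalar(void)
{
    uint32_t a[8], b[8], c[8], d[8], x[8], expected[8];
    uint32_t seed = 12345u;  /* arbitrary test data from a small LCG */
    int i;

    for (i = 0; i < 8; i++) {
        seed = seed * 1664525u + 1013904223u; a[i] = seed;
        seed = seed * 1664525u + 1013904223u; b[i] = seed;
        seed = seed * 1664525u + 1013904223u; c[i] = seed;
        seed = seed * 1664525u + 1013904223u; d[i] = seed;
        seed = seed * 1664525u + 1013904223u; x[i] = seed;
        expected[i] = md4_step_scalar(a[i], b[i], c[i], d[i], x[i]);
    }

    md4_step_2x4(a, b, c, d, x);

    for (i = 0; i < 8; i++)
        if (a[i] != expected[i])
            return 0;
    return 1;
}
```

Note that the two-operand SSE2 forms overwrite one of their inputs, so a compiler targeting plain SSE2 has to insert register copies wherever an input is reused. Building the same intrinsics for AVX lets it use the three-operand (non-destructive) encodings instead. Whether a third interleaved stream helps depends on register pressure and the number of vector ports, which fits the 3*4x result reported above.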