Message-ID: <a605aa1dd15b84b05e318a085a505255@smtp.hushmail.com>
Date: Mon, 28 Apr 2014 21:59:26 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Re: mmap()

On 2014-04-27 22:47, magnum wrote:
> I'm experimenting with using SSE *with* mmap (not Atom's code) but
> since most words are shorter than 16 bytes it seems to be better using
> 32-bit or even 8-bit stuff.

The mmap stuff is now committed to bleeding-jumbo. The "problem" with SSE described above is gone: if the word is shorter than 16 bytes we still copy 16 bytes, but we then leave the loop knowing where to put the null byte. So it's now very fast for any length.

I have another problem though, and I'm calling for help or knowledge: the SSE2 version is a fine boost on Linux; I've tested it on a couple of Intels and an AMD. But when I run it on OSX with an i7 mobile, it *halves* the speed. At first I thought it was something with poor handling of unaligned SSE, but that seemed unlikely for this CPU. Now I have booted Linux on the Macbook and could confirm the SSE code runs just fine there, with a 6-7% boost over the 64-bit alternative code. In both cases it was compiled with gcc 4.7-ish.

How the heck can SSE intrinsics end up that different? The OS should have absolutely nothing to do with it!?

For now I disable the SSE2 code path for __APPLE__ but I really think this is weird. I'll try peeking at the assembler output from the compiler.

magnum
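
For readers following the thread, here is a minimal sketch of the "copy 16 bytes, then find where the null byte goes" idea described above. It is not the code committed to bleeding-jumbo. The function name copy_word is hypothetical, and the sketch assumes that words are newline-terminated, that at least 16 bytes are readable from any chunk start in src (as with a padded mapping), and that dst has room for the word rounded up to a multiple of 16. __builtin_ctz is a gcc/clang builtin. A plain byte loop stands in when SSE2 is not available at compile time.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/*
 * Hypothetical helper: copy one newline-terminated word from src to dst,
 * null-terminate it, and return its length.
 * Assumptions: src is readable for 16 bytes from every chunk start, and
 * dst can hold the word length rounded up to a multiple of 16.
 */
static size_t copy_word(char *dst, const char *src)
{
    size_t off = 0;

#if defined(__SSE2__)
    const __m128i nl = _mm_set1_epi8('\n');

    for (;;) {
        /* Always move a full 16-byte chunk, even if the word is shorter. */
        __m128i chunk = _mm_loadu_si128((const __m128i *)(src + off));
        _mm_storeu_si128((__m128i *)(dst + off), chunk);

        /* One bit per byte: set where the chunk contains a newline. */
        unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, nl));
        if (mask) {
            /* Lowest set bit is the first newline in this chunk. */
            off += (size_t)__builtin_ctz(mask);
            break;
        }
        off += 16;
    }
#else
    /* Scalar fallback: copy byte by byte up to the newline. */
    while (src[off] != '\n') {
        dst[off] = src[off];
        off++;
    }
#endif

    dst[off] = '\0';  /* overwrite the newline position with the terminator */
    return off;
}
```

The point of the design is that the loop body does no per-byte branching: short words cost one unaligned load, one store and one compare, and the over-copied bytes past the terminator are simply ignored.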