Message-ID: <20130712041609.GV29800@brightrain.aerifal.cx>
Date: Fri, 12 Jul 2013 00:16:09 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: Thinking about release

On Fri, Jul 12, 2013 at 03:36:42PM +1200, Andre Renaud wrote:
> > I was unable to measure any difference in performance of your version
> > with the prefetch hack versus simply:
> >
> >         __asm__ __volatile__(
> >                 "ldmia %1!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
> >                 "stmia %0!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
> >                 : "+r"(d), "+r"(s) :
> >                 : "a4", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "memory");
>
> What kind of machine were you using? I see a change of 115MB/s ->

It's a combined ARM Cortex-A9 & FPGA chip from Xilinx. Supposedly the
timings match the Cortex-A9 in other ARM chips.

> 105MB/s when I drop the prefetch, even using the code that you
> suggested. This is on an Atmel AT91sam9g45 (ARM926ejs @ 400MHz). I'm
> assuming this is some subtlety about how the cache is operating?

Perhaps so. By the way, I also did some tests with misaligning the
src/dest with respect to cache lines, and the timing did change, but
not in any way I could make sense of...

It may turn out to be that the issues are sufficiently complex that we
won't get ideal performance without either copying the BSD code you
suggested or fully understanding what it's doing, and other ARM
performance issues, and developing something new based on that
understanding... In that case copying/adapting the BSD code might turn
out to be the right solution for now.

Rich