Message-ID: <20130712031615.GS29800@brightrain.aerifal.cx>
Date: Thu, 11 Jul 2013 23:16:15 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: Thinking about release

On Fri, Jul 12, 2013 at 10:34:31AM +1200, Andre Renaud wrote:
> I've rejiggled it a bit, and it appears to be working. I wasn't
> entirely sure what you meant about the proper constraints. There is an
> additional reason why 8*4 was used for the align - to force the whole
> loop to work in cache-line blocks. I've now done this explicitly on
> the lead-in by doing the first few copies as 32-bit, then going to the
> full cache-line asm. This has the same performance as the fully native
> assembler. However to get that I had to use the same trick that the
> native assembler uses - doing a load of the next block prior to
> storing this one. I'm a bit concerned that this would mean we'd be
> doing a read that was out of bounds, and I can't entirely see why this
> wouldn't be happening with the existing assembler (but I'm presuming
> it doesn't). Any comments on this side of it?

I was unable to measure any difference in performance of your version
with the prefetch hack versus simply:

	__asm__ __volatile__(
		"ldmia %1!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
		"stmia %0!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
		: "+r"(d), "+r"(s)
		:
		: "a4", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "memory");

in the inner loop.

Rich
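
For readers following the thread, here is a minimal sketch of how the
ldmia/stmia pair quoted above might sit inside a complete copy loop. It
is not the code from either message: the names sketch_memcpy and
copy_block32 are hypothetical, the lead-in uses byte copies rather than
the 32-bit copies Andre describes, and a plain C fallback is assumed for
non-ARM (or Thumb) builds. Each 32-byte load is issued only after
checking that at least 32 bytes remain, so this form does not read past
the end of the source.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy one 32-byte block and advance both pointers. */
static inline void copy_block32(unsigned char **d, const unsigned char **s)
{
#if defined(__arm__) && !defined(__thumb__)
	unsigned char *dp = *d;
	const unsigned char *sp = *s;
	/* Load eight words from the source, then store them, with writeback. */
	__asm__ __volatile__(
		"ldmia %1!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
		"stmia %0!,{a4,v1,v2,v3,v4,v5,v6,v7}\n\t"
		: "+r"(dp), "+r"(sp)
		:
		: "a4", "v1", "v2", "v3", "v4", "v5", "v6", "v7", "memory");
	*d = dp;
	*s = sp;
#else
	/* Assumed portable fallback so the sketch builds on other targets. */
	for (int i = 0; i < 32; i++)
		(*d)[i] = (*s)[i];
	*d += 32;
	*s += 32;
#endif
}

/* Hypothetical wrapper, not musl's memcpy. */
void *sketch_memcpy(void *restrict dest, const void *restrict src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	/* Lead-in: byte copies until the destination is 32-byte aligned. */
	while (n && ((uintptr_t)d & 31)) {
		*d++ = *s++;
		n--;
	}

	/* Block loop: taken only when the source is word-aligned, since
	 * ldmia needs that. The n >= 32 check precedes every load. */
	if (((uintptr_t)s & 3) == 0) {
		while (n >= 32) {
			copy_block32(&d, &s);
			n -= 32;
		}
	}

	/* Tail (or the whole copy, if the source was misaligned). */
	for (; n; n--)
		*d++ = *s++;

	return dest;
}
```

A real implementation would handle a misaligned source with shifted word
copies instead of falling back to bytes; that part is left out here to
keep the block loop in focus.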