Message-ID: <20130808151502.GP221@brightrain.aerifal.cx>
Date: Thu, 8 Aug 2013 11:15:02 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: Optimized C memcpy

On Thu, Aug 08, 2013 at 09:03:51AM -0400, Andrew Bradford wrote:
> > > This is not a replacement for the ARM asm (which is still better), but
> > > it's a step towards avoiding the need to have written-by-hand assembly
> > > for every single new arch we add as a prerequisite for tolerable
> > > performance.
> >
> > Sorry if this has been discussed before but Google isn't much help. Why
> > is 32 bytes chosen as the block size over other sizes?
> >
> > It seems that the code would be fewer lines if blocks were 4 bytes,
>
> Sorry, I now see why 4 byte blocks won't work due to the misalignment,
> but 8 or 16 seem like they should be possible.
> Is it just the evaluation of the for loop being expensive that's trying
> to be avoided?

It's purely empirical reasons. 8 is the smallest that would work
without extra logic to shuffle w/x. 16 runs 50% slower than the ARM
asm. 32 runs only 25% slower than the ARM asm.

Rich
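
For readers without the patch at hand, here is a minimal sketch of what an
8-byte block could look like. It is not the code under discussion: the name
sketch_memcpy, the DROP/FILL macros and the variable k are all made up for
illustration, and the may_alias typedef and __BYTE_ORDER__ test assume GCC or
Clang. The idea is that once the source is word-aligned and the destination
is not, each output word has to be stitched together from two neighbouring
source words. With two words per iteration the loads can alternate between w
and x, so w ends each pass already holding the carry-over word. A 4-byte
block would presumably need an extra "w = x" step instead.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension (assumed): access arbitrary bytes through a 32-bit type. */
typedef uint32_t __attribute__((__may_alias__)) u32;

/* Hypothetical helpers; the shift direction depends on byte order. */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define DROP(v, bits) ((v) >> (bits))  /* discard bytes already stored */
#define FILL(v, bits) ((v) << (bits))  /* move leading bytes into the gap */
#else
#define DROP(v, bits) ((v) << (bits))
#define FILL(v, bits) ((v) >> (bits))
#endif

void *sketch_memcpy(void *restrict dest, const void *restrict src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    /* Byte-copy until the source is 4-byte aligned. */
    for (; (uintptr_t)s % 4 && n; n--) *d++ = *s++;

    if ((uintptr_t)d % 4 == 0) {
        /* Both pointers aligned: plain word copy. */
        for (; n >= 4; s += 4, d += 4, n -= 4)
            *(u32 *)d = *(const u32 *)s;
    } else if (n >= 12) {
        size_t k = 4 - (uintptr_t)d % 4;  /* bytes needed to align d: 1..3 */
        unsigned drop = 8 * k, fill = 32 - drop;
        u32 w = *(const u32 *)s, x;

        /* Align the destination; w still holds 4-k bytes not yet stored. */
        for (size_t i = 0; i < k; i++) *d++ = *s++;
        n -= k;

        /* 8-byte block: two aligned loads, two aligned stores. The last
         * byte read is s[11-k], hence the n >= 12-k bound. */
        for (; n >= 12 - k; s += 8, d += 8, n -= 8) {
            x = *(const u32 *)(s + 4 - k);
            *(u32 *)d = DROP(w, drop) | FILL(x, fill);
            w = *(const u32 *)(s + 8 - k);
            *(u32 *)(d + 4) = DROP(x, drop) | FILL(w, fill);
        }
    }

    /* Whatever is left goes byte by byte. */
    for (; n; n--) *d++ = *s++;
    return dest;
}
```

Going to 16 or 32 bytes per block would just repeat the load/store pairs
inside the loop body and raise the bound accordingly; whether that pays off
is the empirical question answered above.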