Message-ID: <20160917021730.GV15995@brightrain.aerifal.cx>
Date: Fri, 16 Sep 2016 22:17:30 -0400
From: Rich Felker <dalias@...c.org>
To: Rob Landley <rob@...dley.net>
Cc: "j-core@...ore.org" <j-core@...ore.org>, musl@...ts.openwall.com
Subject: Re: Re: [J-core] Aligned copies and cacheline conflicts?

On Fri, Sep 16, 2016 at 08:40:05PM -0500, Rob Landley wrote:
> On 09/16/2016 05:16 PM, Rich Felker wrote:
> > Attached is a draft memcpy I'm considering for musl. Compared to the
> > current one, it:
> >
> > 1. Works on 32 bytes per iteration, and adds barriers between the load
> >    phase and store phase to preclude cache line aliasing between src
> >    and dest with a direct-mapped cache.
> >
> > 2. Equally unrolls the misaligned src/dest cases.
> >
> > 3. Adjusts the offsets used in the misaligned src/dest loops to all be
> >    multiples of 4, with the adjustments to make that work outside the
> >    loops. This helps compilers generate indexed addressing modes (e.g.
> >    @(4,Rm)) rather than having to resort to arithmetic.
> >
> > 4. Factors the misaligned cases into a common inline function to
> >    reduce code duplication.
> >
> > Comments welcome.
>
> Superficial comments first:
>
> I know the compiler's probably smart enough to convert %4 into &3, but
> given that the point is performance optimization I'd have thought you'd
> be explicit about what the machine should be doing?

Generally this is the style preferred in musl. If a compiler fails that
bad...

> Both chunks of code have their own 8 register read and 8 register write
> (one is 0-7, one is 1-8).

Yes, the shifted one originally had the 'w' variable named t0. It works
with 9 words (0-8) rather than 8 because the destination blocks span
partial source words. Maybe I should bring back the t0 name or rename
them though.

> Design comments:
>
> Instead of optimized per-target assembly, you have an #ifdef gnuc

The code is inside a __GNUC__ conditional because it cannot be written
in plain C; it needs the __may_alias__ attribute to be able to access
arbitrary-type buffers as uint32_t arrays. This is no change from the
existing memcpy.c in musl.

> wrapped around just under 70 lines of C code with an __asm__
> __volatile__ blob in the middle, calling a 20 line C function. Because
> presumably on sh this will produce roughly the same workaround for the
> primitive cache architecture, only now you're doing it indirectly and
> applying it to everybody.

I don't see it as a "workaround" but rather a memcpy with fewer
assumptions -- it removes the assumption that writes to the destination
don't interfere with caching of the source.

> The motivation for this is that j2 has a more primitive cache
> architecture than normal these days, so it needs an optimization most
> other chips don't. This is "generic" so it'll be built on
> register-constrained 32 bit x86, and on 64 bit systems where it should
> presumably be using u64 not u32.

musl has optimized asm for a few first-class archs where I've been
completely unable to get the compiler to reliably generate a suitable
memcpy -- presently that's x86[_64] and arm. Up until recently, x86_64
was the only 64-bit arch we had, and it had its own memcpy. Now that
there's aarch64, mips64, powerpc64, and soon s390x, I think it would be
a good idea to revisit making the generic memcpy.c 64-bit friendly. But
that's a separate task and does not conflict with any work being done
now.
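For anyone not familiar with the aliasing issue, the pattern is roughly
the following -- an illustrative sketch, not the actual musl source; the
typedef and function names are made up. Swapping the 32-bit word type
for a native/64-bit one is essentially what "64-bit friendly" would
mean here:

    #include <stdint.h>
    #include <stddef.h>

    #ifdef __GNUC__
    /* A word type that is allowed to alias any other type, so the char
     * buffers passed to memcpy can be read and written a word at a time
     * without violating the C aliasing rules. */
    typedef uint32_t __attribute__((__may_alias__)) word_t;

    /* Aligned bulk copy, word at a time; the caller is assumed to have
     * already handled head/tail bytes and the misaligned cases. */
    static void copy_aligned(unsigned char *d, const unsigned char *s, size_t n)
    {
        word_t *wd = (word_t *)d;
        const word_t *ws = (const word_t *)s;
        for (; n >= sizeof(word_t); n -= sizeof(word_t)) *wd++ = *ws++;
    }
    #endif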
So -- back to the point -- musl's generic C implementation of memcpy is
intended to make minimal assumptions about the cpu architecture while
still producing something very fast, and it generally does. Its
performance characteristics are mostly oriented towards lower-end risc
archs where you have a decent number of registers, no support for
misaligned loads/stores, and where shift-into-place with word-sized
loads/stores is the proper strategy.

> And of course gcc inlines its own version unless you hit it with a
> brick anyway, so solving it in musl is of questionable utility.

Only for cases where the size is constant and various other conditions
are met. The fact that it does this even for large sizes is a bug and it
will be fixed. Almost all archs admit a faster bulk memcpy than what you
can reasonably inline, so gcc simply should not be doing this, and
fixing it is necessary anyway if we want to allow vdso-provided dma
memcpy.

> I'm not sure you're focusing on the right problem?

Yes, for sh/j2, but there are other open problems for improving musl's
generic memcpy now that we have a wider variety of archs.

Regarding the barrier that makes memcpy safe against direct-mapped
caches like the j2 one: it could be turned on/off with a simple macro
definition provided by the arch files if it harms performance on other
archs, but I don't think it does. We could likewise add such a macro for
"arch allows misaligned loads/stores" to optimize out the
shift-into-place code on archs that don't need it and allow a faster
direct load/store (the "mod4=0" case in the current code) to be used
for all alignments on such archs.

One of the changes made in this update, moving the shifted block copy
into its own (inline) function, makes it easier to move to using native
word size rather than hard-coding 32-bit. The main reason the current
code in musl hard-coded 32-bit was that I got tired of writing the
(written out in duplicate) cases for 8 different alignments mod 8. But
now it's just a few lines per case with little opportunity for
introducing errors. In terms of code size, "twice as many cases but
half the number of loads/stores per case" should be fairly neutral.

Rich
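P.S. For concreteness, the load/barrier/store structure I'm describing
for the aligned case looks roughly like this -- again an illustrative
sketch with made-up names, not the actual draft:

    #include <stdint.h>
    #include <stddef.h>

    #ifdef __GNUC__
    typedef uint32_t __attribute__((__may_alias__)) word_t;

    /* Aligned-case inner loop, 32 bytes per iteration: load all 8 words
     * of a block into locals, then a compiler barrier, then store them.
     * On a direct-mapped cache this keeps the stores to dest from
     * evicting src lines that the same block still needs to read. */
    static void copy_blocks(word_t *wd, const word_t *ws, size_t n)
    {
        for (; n >= 32; n -= 32, ws += 8, wd += 8) {
            word_t w0 = ws[0], w1 = ws[1], w2 = ws[2], w3 = ws[3];
            word_t w4 = ws[4], w5 = ws[5], w6 = ws[6], w7 = ws[7];
            /* Barrier between the load phase and the store phase. */
            __asm__ __volatile__ ("" : : : "memory");
            wd[0] = w0; wd[1] = w1; wd[2] = w2; wd[3] = w3;
            wd[4] = w4; wd[5] = w5; wd[6] = w6; wd[7] = w7;
        }
    }
    #endif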