Message-ID: <20150521041237.GC17573@brightrain.aerifal.cx> Date: Thu, 21 May 2015 00:12:37 -0400 From: Rich Felker <dalias@...c.org> To: musl@...ts.openwall.com Subject: Re: Refactoring atomics as llsc? On Wed, May 20, 2015 at 01:11:08AM -0400, Rich Felker wrote: > In the inline sh4a atomics thread, I discussed an idea of refactoring > atomics for llsc-style archs so that the arch would just need to > provide inline asm ll() and sc() functions (with more namespace-clean > names of course), and a shared atomic.h could build the a_* functions > on top of that, as in: > > static inline int a_cas(volatile int *p, int t, int s) > { > int old; > do old = ll(p); > while (old == t && !sc(p, s)); > return old; > } > > (Note: I've omitted barriers for simplicity; they could be in the ll > and sc functions but it would probably make more sense to have them > outside the loop.) This is coming along really well so far. Here's the ARMv7 code generated for a sample external x_swap function that calls a_swap: x_swap: mov r3, r0 dmb ish .L3: ldrex r0, [r3] strex r2,r1,[r3] cmp r2, #0 bne .L3 dmb ish bx lr The code that's producing this is the arm atomic_arch.h (so far only supports inline atomics for v7+): #define a_ll a_ll static inline int a_ll(volatile int *p) { int v; __asm__ __volatile__ ("ldrex %0, %1" : "=r"(v) : "Q"(*p)); return v; } #define a_sc a_sc static inline int a_sc(volatile int *p, int v) { int r; __asm__ __volatile__ ("strex %0,%1,%2" : "=r"(r) : "r"(v), "Q"(*p) : "memory"); return !r; } #define a_barrier a_barrier static inline void a_barrier() { __asm__ __volatile__ ("dmb ish" : : : "memory"); } #define a_pre_llsc a_barrier #define a_post_llsc a_barrier And the relevant part of the generic atomic.h: #ifndef a_swap #define a_swap a_swap static inline int a_swap(volatile int *p, int v) { int old; a_pre_llsc(); do old = a_ll(p); while (!a_sc(p, v)); a_post_llsc(); return old; } #endif The a_pre_llsc and a_post_llsc functions/macros are provided to allow the atomic_arch.h to put the first barrier inside the ll op if needed, and to allow different pre/post barriers if the arch has more efficient variants that can be used. So far I'm really happy with how tiny atomic_arch.h is and how it can give fully optimized versions of a_*. Of course for arm we need new fallback code for pre-v7 versions, which will complicate things, and sh also needs that kind of fallback. I still need to think about how to best do these. Right now arm has suboptimal CAS-loop implementations on pre-v7 even though v6 can do nice optimized llsc versions. On the other hand, sh has optimized GUSA versions of all the individual ops we have now. I'm thinking perhaps the best solution is to have the generic llsc implementations start off with a call to a fallback-check macro, which would basically look like: if (need_fallback_a_swap) return fallback_a_swap(p, v); This would cover pre-v6 arm, but for v6 we still want to use the llsc code but with a different barrier. For that, we would want a_barrier to look like: #define a_barrier a_barrier static inline void a_barrier() { if (v6_compat) __asm__ __volatile__ ( "mcr p15,0,r0,c7,c10,5" : : : "memory"); else __asm__ __volatile__ ("dmb ish" : : : "memory"); } Or else: #define a_barrier a_barrier static inline void a_barrier() { __asm__ __volatile__ ( "blx %0" : : "r"(barrier_func_ptr) : "memory", "lr"); } The asm is there so we can define a custom calling convention that doesn't clobber any registers. 
Unfortunately there's a nasty snag: global objects like
need_fallback_a_swap, v6_compat, or barrier_func_ptr will be re-read
over and over in functions using atomics because the "memory" clobbers
in the asm invalidate any value the compiler may have cached.

Fortunately, there seems to be a clean solution: load them via asm
that looks like

static inline int v6_compat()
{
	int r;
	__asm__ ( "..." : "=r"(r) );
	return r;
}

where the "..." is asm to perform the load. Since this asm is not
volatile and has no inputs, it can be CSE'd and treated like an
attribute-const function.

Strictly speaking this doesn't prevent reordering to the very
beginning of program execution, before the runtime atomic selection is
initialized, but I don't think that's a serious practical concern.
It's certainly not a concern with dynamic linking since nothing can be
reordered back into dynamic-linker-time, and the atomics would be
initialized there. For static-linking LTO this may require some more
thought for formal correctness.

It should be noted that the current arm atomics fallback code is a
very neat hack that's no longer needed thanks to the dynamic linker
overhaul, so we have a lot more flexibility if we redo them.

Rich