Message-ID: <20150521041237.GC17573@brightrain.aerifal.cx> Date: Thu, 21 May 2015 00:12:37 -0400 From: Rich Felker <dalias@...c.org> To: musl@...ts.openwall.com Subject: Re: Refactoring atomics as llsc? On Wed, May 20, 2015 at 01:11:08AM -0400, Rich Felker wrote: > In the inline sh4a atomics thread, I discussed an idea of refactoring > atomics for llsc-style archs so that the arch would just need to > provide inline asm ll() and sc() functions (with more namespace-clean > names of course), and a shared atomic.h could build the a_* functions > on top of that, as in: > > static inline int a_cas(volatile int *p, int t, int s) > { > int old; > do old = ll(p); > while (old == t && !sc(p, s)); > return old; > } > > (Note: I've omitted barriers for simplicity; they could be in the ll > and sc functions but it would probably make more sense to have them > outside the loop.) This is coming along really well so far. Here's the ARMv7 code generated for a sample external x_swap function that calls a_swap: x_swap: mov r3, r0 dmb ish .L3: ldrex r0, [r3] strex r2,r1,[r3] cmp r2, #0 bne .L3 dmb ish bx lr The code that's producing this is the arm atomic_arch.h (so far only supports inline atomics for v7+): #define a_ll a_ll static inline int a_ll(volatile int *p) { int v; __asm__ __volatile__ ("ldrex %0, %1" : "=r"(v) : "Q"(*p)); return v; } #define a_sc a_sc static inline int a_sc(volatile int *p, int v) { int r; __asm__ __volatile__ ("strex %0,%1,%2" : "=r"(r) : "r"(v), "Q"(*p) : "memory"); return !r; } #define a_barrier a_barrier static inline void a_barrier() { __asm__ __volatile__ ("dmb ish" : : : "memory"); } #define a_pre_llsc a_barrier #define a_post_llsc a_barrier And the relevant part of the generic atomic.h: #ifndef a_swap #define a_swap a_swap static inline int a_swap(volatile int *p, int v) { int old; a_pre_llsc(); do old = a_ll(p); while (!a_sc(p, v)); a_post_llsc(); return old; } #endif The a_pre_llsc and a_post_llsc functions/macros are provided to allow the atomic_arch.h to put the first barrier inside the ll op if needed, and to allow different pre/post barriers if the arch has more efficient variants that can be used. So far I'm really happy with how tiny atomic_arch.h is and how it can give fully optimized versions of a_*. Of course for arm we need new fallback code for pre-v7 versions, which will complicate things, and sh also needs that kind of fallback. I still need to think about how to best do these. Right now arm has suboptimal CAS-loop implementations on pre-v7 even though v6 can do nice optimized llsc versions. On the other hand, sh has optimized GUSA versions of all the individual ops we have now. I'm thinking perhaps the best solution is to have the generic llsc implementations start off with a call to a fallback-check macro, which would basically look like: if (need_fallback_a_swap) return fallback_a_swap(p, v); This would cover pre-v6 arm, but for v6 we still want to use the llsc code but with a different barrier. For that, we would want a_barrier to look like: #define a_barrier a_barrier static inline void a_barrier() { if (v6_compat) __asm__ __volatile__ ( "mcr p15,0,r0,c7,c10,5" : : : "memory"); else __asm__ __volatile__ ("dmb ish" : : : "memory"); } Or else: #define a_barrier a_barrier static inline void a_barrier() { __asm__ __volatile__ ( "blx %0" : : "r"(barrier_func_ptr) : "memory", "lr"); } The asm is there so we can define a custom calling convention that doesn't clobber any registers. 
Unfortunately there's a nasty snag: global objects like
need_fallback_a_swap, v6_compat, or barrier_func_ptr will be re-read
over and over in functions using atomics because the "memory" clobbers
in the asm invalidate any value the compiler may have cached.

Fortunately, there seems to be a clean solution: load them via asm
that looks like

static inline int v6_compat()
{
	int r;
	__asm__ ( "..." : "=r"(r) );
	return r;
}

where the "..." is asm to perform the load. Since this asm is not
volatile and has no inputs, it can be CSE'd and treated like an
attribute-const function.

Strictly speaking this doesn't prevent reordering to the very
beginning of program execution, before the runtime atomic selection is
initialized, but I don't think that's a serious practical concern.
It's certainly not a concern with dynamic linking since nothing can be
reordered back into dynamic-linker-time, and the atomics would be
initialized there. For static-linking LTO this may require some more
thought for formal correctness.

It should be noted that the current arm atomics fallback code is a
very neat hack that's no longer needed thanks to the dynamic linker
overhaul, so we have a lot more flexibility if we redo them.

Rich