Message-ID: <20150519003045.GC17573@brightrain.aerifal.cx>
Date: Mon, 18 May 2015 20:30:45 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: [PATCH] inline llsc atomics when compiling for sh4a

On Mon, May 18, 2015 at 05:56:18PM -0500, Bobby Bingham wrote:
> On Sun, May 17, 2015 at 10:34:02PM -0400, Rich Felker wrote:
> > On Sun, May 17, 2015 at 01:55:16PM -0500, Bobby Bingham wrote:
> > > If we're building for sh4a, the compiler is already free to use
> > > instructions only available on sh4a, so we can do the same and inline the
> > > llsc atomics. If we're building for an older processor, we still do the
> > > same runtime atomics selection as before.
> >
> > Thanks! I think it's ok for commit as-is, but based on re-reading this
> > code I have some ideas for improving it that are orthogonal to this
> > change. See comments inline:
> 
> Would you prefer I resend this patch to remove the LLSC_* macros at the
> same time, or another patch to remove them separately?

No, let's do that separately. I like keeping independent changes
separate in commits. I've tested the current patch already and it
seems to be fine. If we do the second LLSC_* removal patch it
shouldn't affect the generated binaries, which is easy to verify.

> 
> >
> > > ---
> > >  arch/sh/atomic.h     |  83 +++++++++++++++++++++++++++++++
> > >  arch/sh/src/atomic.c | 135 +++++++++++++++++----------------------------------
> > >  2 files changed, 128 insertions(+), 90 deletions(-)
> > >
> > > diff --git a/arch/sh/atomic.h b/arch/sh/atomic.h
> > > index a1d22e4..f2e6dac 100644
> > > --- a/arch/sh/atomic.h
> > > +++ b/arch/sh/atomic.h
> > > @@ -22,6 +22,88 @@ static inline int a_ctz_64(uint64_t x)
> > >  	return a_ctz_l(y);
> > >  }
> > >
> > > +#define LLSC_CLOBBERS "r0", "t", "memory"
> > > +#define LLSC_START(mem) "synco\n"  \
> > > +	"0:	movli.l @" mem ", r0\n"
> > > +#define LLSC_END(mem)              \
> > > +	"1:	movco.l r0, @" mem "\n"    \
> > > +	"	bf 0b\n"                   \
> > > +	"	synco\n"
> > > +
> > > +static inline int __sh_cas_llsc(volatile int *p, int t, int s)
> > > +{
> > > +	int old;
> > > +	__asm__ __volatile__(
> > > +		LLSC_START("%1")
> > > +		"	mov r0, %0\n"
> > > +		"	cmp/eq %0, %2\n"
> > > +		"	bf 1f\n"
> > > +		"	mov %3, r0\n"
> > > +		LLSC_END("%1")
> > > +		: "=&r"(old) : "r"(p), "r"(t), "r"(s) : LLSC_CLOBBERS);
> > > +	return old;
> > > +}
> >
> > The mov from r0 to %0 seems unnecessary here. Presumably it's because
> > you didn't have a constraint to force old to come out via r0. Could
> > you do the following?
> 
> No, because the "mov %3, r0" a couple instructions down clobbers r0, and
> this is necessary because movco.l only accepts r0 as the source operand.

Oh, I see. We lose the old value that the function needs to return. So
a version of cas that just returns success/failure rather than the old
value would be able to omit it, but musl's can't, right? I should keep
that in mind, since at some point, if I can determine that the old
value isn't important anywhere, I might consider changing the a_cas
API to just have success/failure as its result. This is mildly cheaper
on some other archs too, I think.
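
For sh it would look something like the following untested sketch
(function name and details hypothetical); note it still needs a movt to
get the T bit into a register, so the savings on sh itself would be
marginal:

static inline int __sh_cas_bool_llsc(volatile int *p, int t, int s)
{
	int ok;
	__asm__ __volatile__(
		"	synco\n"
		"0:	movli.l @%1, r0\n"
		"	cmp/eq r0, %2\n"
		"	bf 1f\n"		/* mismatch: skip the store, T is already 0 */
		"	mov %3, r0\n"
		"	movco.l r0, @%1\n"
		"	bf 0b\n"		/* store-conditional failed: retry */
		"1:	movt %0\n"		/* result = T: 1 on success, 0 on mismatch */
		"	synco\n"
		: "=&r"(ok) : "r"(p), "r"(t), "r"(s) : "r0", "t", "memory");
	return ok;
}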

> > I've actually always wondered about the value of having the LLSC_*
> > macros. I usually prefer for the whole asm to be written out
> > explicitly and readable unless there's a compelling reason to wrap it
> > up in macros. Then it would look like:
> 
> I think it made a bigger difference for the gusa version, so I mostly
> did it with llsc for consistency.  And before this patch, when the gusa
> and llsc versions were side by side in the same function, it made it
> easier for me to verify both versions were doing the same thing as I
> wrote them.
> 
> Now that the llsc version is moving, I'm less attached to the LLSC_*
> macros.  I do think the gusa stuff is ugly and magical enough that I'd
> still prefer to keep it hidden away, if you don't object.

Yeah, I don't mind.

> > static inline int __sh_cas_llsc(volatile int *p, int t, int s)
> > {
> > 	register int old __asm__("r0");
> > 	__asm__ __volatile__(
> > 		"	synco\n"
> > 		"0:	movli.l @%1, r0\n"
> > 		"	cmp/eq r0, %2\n"
> > 		"	bf 1f\n"
> > 		"	mov %3, r0\n"
> > 		"1:	movco.l r0, @%1\n"
> > 		"	bf 0b\n"
> > 		"	synco\n"
> > 		: "=&r"(old) : "r"(p), "r"(t), "r"(s) : "t", "memory");
> > 	return old;
> > }
> >
> > and similar for other functions. Part of the motivation of not hiding
> > the outer logic in macros is that it might make it possible to
> > fold/simplify some special cases like I did above for CAS.
> 
> I don't mind in principle, but I think the fact that movco.l requires
> its input be in r0 is going to mean there's not actually any
> simplification you can do.

I think you may be right.

> > Another idea is letting the compiler simplify, with something like the
> > following, which could actually be used cross-platform for all
> > llsc-type archs:
> >
> > static inline int __sh_cas_llsc(volatile int *p, int t, int s)
> > {
> > 	do old = llsc_start(p);
> > 	while (*p == t && !llsc_end(p, s));
> > 	return old;
> > }
> >
> > Here llsc_start and llsc_end would be inline functions using asm with
> > appropriate constraints. Unfortunately I don't see a way to model
> > using the value of the truth flag "t" as the output of the asm for
> > llsc_end, though. I suspect this would be a problem on a number of
> > other archs too; the asm would have to waste an instruction (or
> > several) converting the flag to an integer. Unless there's a solution
> > to that problem, it makes an approach like this less appealing.
> 
> I agree this would be even nicer if we could make it work.

Yes. FWIW the above code has a bug. It should be:

static inline int __sh_cas_llsc(volatile int *p, int t, int s)
{
	int old;
	do old = llsc_start(p);
	while (old == t && !llsc_end(p, s));
	return old;
}

I think there would need to be a barrier before the return statement
too. Embedding the barrier in llsc_end would not work because (1) it
won't get executed on old!=t, and (2) it should be avoided when
llsc_end fails since llsc_start will do a barrier anyway, but if we
put the conditional inside llsc_end's asm then the condition would get
branched-on twice. Issue (2) may be solvable by using a C conditional
inside llsc_end, but that still leaves issue (1), so I think something
like this would be needed:

static inline int __sh_cas_llsc(volatile int *p, int t, int s)
{
	int old;
	do old = llsc_start(p);
	while (old == t && !llsc_end(p, s));
	a_barrier();
	return old;
}

But that's suboptimal on archs where the 'sc' part of llsc has an
implicit barrier (microblaze and or1k, IIRC). So perhaps the ideal
general version would be:

static inline int __sh_cas_llsc(volatile int *p, int t, int s)
{
	int old;
	do old = llsc_start(p);
	while (old == t ? !llsc_end(p, s) : (a_barrier(), 0));
	return old;
}

with a barrier in the success path for llsc_end. Alternatively we
could aim to always do the sc:

static inline int __sh_cas_llsc(volatile int *p, int t, int s)
{
	int old;
	do old = llsc_start(p);
	while (!llsc_end(p, old==t ? s : old));
	return old;
}

This version is structurally analogous to the non-CAS atomics, but
perhaps more costly in the old!=t case.
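
For comparison, a non-CAS op in this style would just be something like
the following (untested sketch, using the same hypothetical
llsc_start/llsc_end helpers):

static inline int __sh_fetch_add_llsc(volatile int *p, int v)
{
	int old;
	/* retry until the store-conditional inside llsc_end succeeds */
	do old = llsc_start(p);
	while (!llsc_end(p, old + v));
	return old;
}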

Anyway, at this point I don't see an efficient way to do the
conditional, so it's mostly a theoretical topic.
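
Just for reference, though, the helpers might look roughly like this on
sh (untested sketch; it also assumes splitting the movli.l/movco.l pair
across two asm blocks doesn't let the compiler break the reservation in
between). The movt in llsc_end is exactly the wasted flag-to-integer
conversion mentioned above:

static inline int llsc_start(volatile int *p)
{
	int v;
	__asm__ __volatile__(
		"	synco\n"
		"	movli.l @%1, r0\n"
		"	mov r0, %0\n"
		: "=&r"(v) : "r"(p) : "r0", "memory");
	return v;
}

static inline int llsc_end(volatile int *p, int v)
{
	int ok;
	__asm__ __volatile__(
		"	mov %2, r0\n"
		"	movco.l r0, @%1\n"
		"	movt %0\n"		/* wasted insn: T flag -> int result */
		"	synco\n"
		: "=&r"(ok) : "r"(p), "r"(v) : "r0", "t", "memory");
	return ok;
}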

> > For the GUSA stuff, do you really need the ODD/EVEN macros? I think
> > you could just add appropriate .align inside that would cause the
> > assembler to insert a nop if necessary.
> 
> Should be possible.  I'll work on it and send another patch.

Thanks!

Rich
