musl - Re: Illegal killlock skipping when transitioning to single-threaded state

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20221004155713.GN29905@brightrain.aerifal.cx>
Date: Tue, 4 Oct 2022 11:57:14 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Illegal killlock skipping when transitioning to
 single-threaded state

On Tue, Oct 04, 2022 at 06:43:49PM +0300, Alexey Izbyshev wrote:
> On 2022-10-04 17:19, Rich Felker wrote:
> >On Tue, Oct 04, 2022 at 10:12:42AM -0400, Rich Felker wrote:
> >>On Tue, Oct 04, 2022 at 04:50:00PM +0300, Alexey Izbyshev wrote:
> >>> On 2022-10-04 02:05, Rich Felker wrote:
> >>> >On Mon, Oct 03, 2022 at 06:54:17PM -0400, Rich Felker wrote:
> >>> >>On Mon, Oct 03, 2022 at 11:27:05PM +0200, Szabolcs Nagy wrote:
> >>> >>> * Szabolcs Nagy <nsz@...t70.net> [2022-10-03 15:26:15 +0200]:
> >>> >>>
> >>> >>> > * Alexey Izbyshev <izbyshev@...ras.ru> [2022-10-03 09:16:03 +0300]:
> >>> >>> > > On 2022-09-19 18:29, Rich Felker wrote:
> >>> >>> > > > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
> >>> >>> > ...
> >>> >>> > > > > Reordering the "libc.need_locks = -1" assignment and
> >>> >>> > > > > UNLOCK(E->killlock) and providing a store barrier between them
> >>> >>> > > > > should fix the issue.
> >>> >>> > > >
> >>> >>> > > > I think this all sounds correct. I'm not sure what you mean by a store
> >>> >>> > > > barrier between them, since all lock and unlock operations are already
> >>> >>> > > > full barriers.
> >>> >>> > > >
> >>> >>> > >
> >>> >>> > > Before sending the report I tried to infer the intended ordering semantics
> >>> >>> > > of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
> >>> >>> > > see why they would provide a full barrier (my reasoning is below), so I
> >>> >>> > > concluded that probably acquire/release semantics was intended in general
> >>> >>> > > and suggested an extra store barrier to prevent hoisting of "libc.need_locks
> >>> >>> > > = -1" store spelled after UNLOCK(E->killlock) back into the critical
> >>> >>> > > section.
> >>> >>> > >
> >>> >>> > > UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
> >>> >>> > > a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
> >>> >>> > > via load-acquire/store-release instructions. Therefore, if we consider a
> >>> >>> > > LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
> >>> >>> > > such memory access can be reordered with the initial ldaxr in UNLOCK, and
> >>> >>> > > (b) any plain load following UNLOCK can be reordered with stlxr (assuming
> >>> >>> > > the processor predicts that stlxr succeeds), and further, due to (a), with
> >>> >>> > > any memory access inside the critical section. Therefore, UNLOCK is not full
> >>> >>> > > barrier. Is this right?
> >>> >>> >
> >>> >>> > i dont think this is right.
> >>> >>>
> >>> >>>
> >>> >>> i think i was wrong and you are right.
> >>> >>>
> >>> >>> so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
> >>> >>> starting with 'something == 0' the exiting E and remaining R threads:
> >>> >>>
> >>> >>> E:something=1      // protected by killlock
> >>> >>> E:UNLOCK(killlock)
> >>> >>> E:need_locks=-1
> >>> >>>
> >>> >>> R:LOCK(unrelated)  // reads need_locks == -1
> >>> >>> R:need_locks=0
> >>> >>> R:UNLOCK(unrelated)
> >>> >>> R:LOCK(killlock)   // does not lock
> >>> >>> R:read something   // can it be 0 ?
> >>> >>>
> >>> >>> and here something can be 0 (ie. not protected by killlock) on aarch64
> >>> >>> because
> >>> >>>
> >>> >>> T1
> >>> >>> 	something=1
> >>> >>> 	ldaxr ... killlock
> >>> >>> 	stlxr ... killlock
> >>> >>> 	need_locks=-1
> >>> >>>
> >>> >>> T2
> >>> >>> 	x=need_locks
> >>> >>> 	ldaxr ... unrelated
> >>> >>> 	stlxr ... unrelated
> >>> >>> 	y=something
> >>> >>>
> >>> >>> can end with x==-1 and y==0.
> >>> >>>
> >>> >>> and to fix it, both a_fetch_add and a_cas need an a_barrier.
> >>> >>>
> >>> >>> i need to think how to support such lock usage on aarch64
> >>> >>> without adding too many dmb.
> >>> >>
> >>> >>I don't really understand this, but FWIW gcc emits
> >>> >>
> >>> >>    ldxr
> >>> >>    ...
> >>> >>    stlxr
> >>> >>    ...
> >>> >>    dmb ish
> >>> >>
> >>> >>for __sync_val_compare_and_swap. So this is probably the right thing
> >>> >>we should have. And it seems to match what the kernel folks discussed
> >>> >>here:
> >>> >>
> >>> >>http://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229588.html
> >>> >>
> >>> >>I wondered if there are similar issues for any others archs which need
> >>> >>review, but it looks like all the other llsc archs have explicit
> >>> >>pre/post barriers defined.
> >>> >
> >>> >Actually I don't understand what's going on with cmpxchg there. The
> >>> >patch I linked has it using ldxr/stxr (not stlxr) for cmpxchg. There's
> >>> >some follow-up in the thread I don't understand, about the case where
> >>> >the cas fails, but we already handle that by doing an explicit barrier
> >>> >in that case.
> >>> >
> >>> I think in that follow-up[1] they mean the following case (in musl
> >>> terms):
> >>>
> >>> volatile int x, flag;
> >>>
> >>> T1:
> >>>     x = 1;
> >>>     a_store(&flag, 1);
> >>>
> >>> T2:
> >>>     while (!flag);
> >>>     a_cas(&x, 0, 1); // can this fail?
> >>>
> >>> They want it to never fail. But if a_cas() is implemented as
> >>> ldrx/stlrx/dmb, this is not guaranteed because ldxr can be reordered
> >>> with the load of flag.
> >>>
> >>> Note that musl does *not* handle this now, because a_barrier() in
> >>> the failure path is after a_ll().
> >>>
> >>> [1] https://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229693.html
> >>
> >>OK, then indeed this too needs to be fixed -- the a_cas is failing to
> >>synchronize with the a_store. How do we do that? Based on my
> >>understanding, my proposed patch doesn't fix it.
> >>
> >>Do we need ldarx/stlrx/dmb? Or perhaps relegate the extra
> >>synchronization to a retry in the case where the comparison fails?
> >>
> ldarx will not work for the same reason as ldrx doesn't work (it can
> still be reordered with the load of flag).
> 
> >>If this is actually the case, it's disturbing that GCC does not seem
> >>to be getting it right either...
> >
> This is indeed disturbing, considering that comment[1] claims that
> __sync RMWs were once defined as:
> 
>   __sync_synchronize();
>   operation...
>   __sync_synchronize();
> 
> which clearly doesn't hold for the current implementation of
> __sync_val_compare_and_swap() for AArch64.
> 
> I can speculate that some of confusion is due to the fact that
> apparently each __sync builtin originally mapped to a single Itanium
> instruction, so the issues with ordering of "internal" LL/SC
> accesses simply didn't exist.
> 
> [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697#c24
> 
> >One stupid (or maybe not stupid) option would be to always stlrx back
> >the old value on comparison failure, so we can rely on the stlrx to
> >fail if anything was reordered wrong. I think this is actually kinda
> >how the x86 CAS works, or how it historically worked, so maybe it's
> >not a bad choice. It's also analogous to how the non-CAS operations
> >work: no branching out of the ll/sc loop, and just defining the result
> >as old == t ? s : old:
> >
> >static inline int a_cas(volatile int *p, int t, int s)
> >{
> >	int old;
> >	a_pre_llsc();
> >	do old = a_ll(p);
> >	while (!a_sc(p, old == t ? s : old));
> >	a_post_llsc();
> >	return old;
> >}
> >
> I can't comment on whether always doing a store (or even many
> stores, until "stabilization") on a failed CAS is better than an
> extra dmb. Maybe if CAS failure normally happens only on contention,
> and we optimize for the non-contended case, it's indeed so. As for
> correctness, I agree that it should work.

OK. I think the alternative is something like:

static inline int a_cas(volatile int *p, int t, int s)
{
	int old;
	a_pre_llsc();
	do {
		old = a_ll(p);
		if (old != t) {
			a_barrier();
			if (*p != old) continue;
			break;
		}
	} while (!a_sc(p, s));
	a_post_llsc();
	return old;
}

This is more/messier code, and still has the possibility of unbounded
looping waiting for consistency under contention. It doesn't perform a
store to the atomic object during such contention, which might be less
of a burden on other threads, but on the other hand it performs full
barriers during this contention, which might be more of a burden.

Conceptually I think I like the version that always stores for
symmetry with the other atomic operations that makes it "obviously
correct if they're correct". But I'm not sure if there are cost
reasons we should prefer a different choice. At least it "shouldn't be
worse" than the other atomic ops are if we do it that way.

> >In fact if we use this as the general pattern for a_cas (in
> >src/internal/atomic.h) I'm not even sure we'd need a_pre_llsc() at
> >all, on any archs -- but that probably depends on the ISA's SC
> >semantics...
> >
> This pattern works on AArch64 without a_pre_llsc() only because
> a_sc() is a store-release. If it were a plain store-exclusive,
> preceding memory accesses could be reordered with it.

Yes, that sounds right.

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.