Message-ID: <e4b4f0fbb0bb7d168a4c9c2f50fdb48a@ispras.ru>
Date: Tue, 04 Oct 2022 21:15:12 +0300
From: Alexey Izbyshev <izbyshev@...ras.ru>
To: musl@...ts.openwall.com
Subject: Re: Illegal killlock skipping when transitioning to single-threaded state

On 2022-10-04 18:57, Rich Felker wrote:
> On Tue, Oct 04, 2022 at 06:43:49PM +0300, Alexey Izbyshev wrote:
>> On 2022-10-04 17:19, Rich Felker wrote:
>> >On Tue, Oct 04, 2022 at 10:12:42AM -0400, Rich Felker wrote:
>> >>On Tue, Oct 04, 2022 at 04:50:00PM +0300, Alexey Izbyshev wrote:
>> >>> On 2022-10-04 02:05, Rich Felker wrote:
>> >>> >On Mon, Oct 03, 2022 at 06:54:17PM -0400, Rich Felker wrote:
>> >>> >>On Mon, Oct 03, 2022 at 11:27:05PM +0200, Szabolcs Nagy wrote:
>> >>> >>> * Szabolcs Nagy <nsz@...t70.net> [2022-10-03 15:26:15 +0200]:
>> >>> >>>
>> >>> >>> > * Alexey Izbyshev <izbyshev@...ras.ru> [2022-10-03 09:16:03 +0300]:
>> >>> >>> > > On 2022-09-19 18:29, Rich Felker wrote:
>> >>> >>> > > > On Wed, Sep 07, 2022 at 03:46:53AM +0300, Alexey Izbyshev wrote:
>> >>> >>> > ...
>> >>> >>> > > > > Reordering the "libc.need_locks = -1" assignment and
>> >>> >>> > > > > UNLOCK(E->killlock) and providing a store barrier between them
>> >>> >>> > > > > should fix the issue.
>> >>> >>> > > >
>> >>> >>> > > > I think this all sounds correct. I'm not sure what you mean by a store
>> >>> >>> > > > barrier between them, since all lock and unlock operations are already
>> >>> >>> > > > full barriers.
>> >>> >>> > > >
>> >>> >>> > >
>> >>> >>> > > Before sending the report I tried to infer the intended ordering semantics
>> >>> >>> > > of LOCK/UNLOCK by looking at their implementations. For AArch64, I didn't
>> >>> >>> > > see why they would provide a full barrier (my reasoning is below), so I
>> >>> >>> > > concluded that probably acquire/release semantics was intended in general
>> >>> >>> > > and suggested an extra store barrier to prevent hoisting of "libc.need_locks
>> >>> >>> > > = -1" store spelled after UNLOCK(E->killlock) back into the critical
>> >>> >>> > > section.
>> >>> >>> > >
>> >>> >>> > > UNLOCK is implemented via a_fetch_add(). On AArch64, it is a simple
>> >>> >>> > > a_ll()/a_sc() loop without extra barriers, and a_ll()/a_sc() are implemented
>> >>> >>> > > via load-acquire/store-release instructions. Therefore, if we consider a
>> >>> >>> > > LOCK/UNLOCK critical section containing only plain loads and stores, (a) any
>> >>> >>> > > such memory access can be reordered with the initial ldaxr in UNLOCK, and
>> >>> >>> > > (b) any plain load following UNLOCK can be reordered with stlxr (assuming
>> >>> >>> > > the processor predicts that stlxr succeeds), and further, due to (a), with
>> >>> >>> > > any memory access inside the critical section. Therefore, UNLOCK is not full
>> >>> >>> > > barrier. Is this right?
>> >>> >>> >
>> >>> >>> > i dont think this is right.
>> >>> >>>
>> >>> >>>
>> >>> >>> i think i was wrong and you are right.
>> >>> >>>
>> >>> >>> so with your suggested swap of UNLOCK(killlock) and need_locks=-1 and
>> >>> >>> starting with 'something == 0' the exiting E and remaining R threads:
>> >>> >>>
>> >>> >>> E:something=1 // protected by killlock
>> >>> >>> E:UNLOCK(killlock)
>> >>> >>> E:need_locks=-1
>> >>> >>>
>> >>> >>> R:LOCK(unrelated) // reads need_locks == -1
>> >>> >>> R:need_locks=0
>> >>> >>> R:UNLOCK(unrelated)
>> >>> >>> R:LOCK(killlock) // does not lock
>> >>> >>> R:read something // can it be 0 ?
>> >>> >>>
>> >>> >>> and here something can be 0 (ie. not protected by killlock) on aarch64
>> >>> >>> because
>> >>> >>>
>> >>> >>> T1
>> >>> >>> something=1
>> >>> >>> ldaxr ... killlock
>> >>> >>> stlxr ... killlock
>> >>> >>> need_locks=-1
>> >>> >>>
>> >>> >>> T2
>> >>> >>> x=need_locks
>> >>> >>> ldaxr ... unrelated
>> >>> >>> stlxr ... unrelated
>> >>> >>> y=something
>> >>> >>>
>> >>> >>> can end with x==-1 and y==0.
>> >>> >>>
>> >>> >>> and to fix it, both a_fetch_add and a_cas need an a_barrier.
>> >>> >>>
>> >>> >>> i need to think how to support such lock usage on aarch64
>> >>> >>> without adding too many dmb.
>> >>> >>
>> >>> >>I don't really understand this, but FWIW gcc emits
>> >>> >>
>> >>> >>	ldxr
>> >>> >>	...
>> >>> >>	stlxr
>> >>> >>	...
>> >>> >>	dmb ish
>> >>> >>
>> >>> >>for __sync_val_compare_and_swap. So this is probably the right thing
>> >>> >>we should have. And it seems to match what the kernel folks discussed
>> >>> >>here:
>> >>> >>
>> >>> >>http://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229588.html
>> >>> >>
>> >>> >>I wondered if there are similar issues for any others archs which need
>> >>> >>review, but it looks like all the other llsc archs have explicit
>> >>> >>pre/post barriers defined.
>> >>> >
>> >>> >Actually I don't understand what's going on with cmpxchg there. The
>> >>> >patch I linked has it using ldxr/stxr (not stlxr) for cmpxchg. There's
>> >>> >some follow-up in the thread I don't understand, about the case where
>> >>> >the cas fails, but we already handle that by doing an explicit barrier
>> >>> >in that case.
>> >>> >
>> >>> I think in that follow-up[1] they mean the following case (in musl
>> >>> terms):
>> >>>
>> >>> volatile int x, flag;
>> >>>
>> >>> T1:
>> >>> x = 1;
>> >>> a_store(&flag, 1);
>> >>>
>> >>> T2:
>> >>> while (!flag);
>> >>> a_cas(&x, 0, 1); // can this fail?
>> >>>
>> >>> They want it to never fail. But if a_cas() is implemented as
>> >>> ldrx/stlrx/dmb, this is not guaranteed because ldxr can be reordered
>> >>> with the load of flag.
>> >>>
>> >>> Note that musl does *not* handle this now, because a_barrier() in
>> >>> the failure path is after a_ll().
>> >>>
>> >>> [1] https://lists.infradead.org/pipermail/linux-arm-kernel/2014-February/229693.html
>> >>
>> >>OK, then indeed this too needs to be fixed -- the a_cas is failing to
>> >>synchronize with the a_store. How do we do that? Based on my
>> >>understanding, my proposed patch doesn't fix it.
>> >>
>> >>Do we need ldarx/stlrx/dmb? Or perhaps relegate the extra
>> >>synchronization to a retry in the case where the comparison fails?
>> >>
>> ldarx will not work for the same reason as ldrx doesn't work (it can
>> still be reordered with the load of flag).
>>
>> >>If this is actually the case, it's disturbing that GCC does not seem
>> >>to be getting it right either...
>> >
>> This is indeed disturbing, considering that comment[1] claims that
>> __sync RMWs were once defined as:
>>
>> __sync_synchronize();
>> operation...
>> __sync_synchronize();
>>
>> which clearly doesn't hold for the current implementation of
>> __sync_val_compare_and_swap() for AArch64.
>>
>> I can speculate that some of confusion is due to the fact that
>> apparently each __sync builtin originally mapped to a single Itanium
>> instruction, so the issues with ordering of "internal" LL/SC
>> accesses simply didn't exist.
>>
>> [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697#c24
>>
>> >One stupid (or maybe not stupid) option would be to always stlrx back
>> >the old value on comparison failure, so we can rely on the stlrx to
>> >fail if anything was reordered wrong. I think this is actually kinda
>> >how the x86 CAS works, or how it historically worked, so maybe it's
>> >not a bad choice. It's also analogous to how the non-CAS operations
>> >work: no branching out of the ll/sc loop, and just defining the result
>> >as old == t ? s : old:
>> >
>> >static inline int a_cas(volatile int *p, int t, int s)
>> >{
>> >	int old;
>> >	a_pre_llsc();
>> >	do old = a_ll(p);
>> >	while (!a_sc(p, old == t ? s : old));
>> >	a_post_llsc();
>> >	return old;
>> >}
>> >
>> I can't comment on whether always doing a store (or even many
>> stores, until "stabilization") on a failed CAS is better than an
>> extra dmb. Maybe if CAS failure normally happens only on contention,
>> and we optimize for the non-contended case, it's indeed so. As for
>> correctness, I agree that it should work.
>
> OK. I think the alternative is something like:
>
> static inline int a_cas(volatile int *p, int t, int s)
> {
> 	int old;
> 	a_pre_llsc();
> 	do {
> 		old = a_ll(p);
> 		if (old != t) {
> 			a_barrier();
> 			if (*p != old) continue;
> 			break;
> 		}
> 	} while (!a_sc(p, s));
> 	a_post_llsc();
> 	return old;
> }
>
This also looks correct to me.

> This is more/messier code, and still has the possibility of unbounded
> looping waiting for consistency under contention. It doesn't perform a
> store to the atomic object during such contention, which might be less
> of a burden on other threads, but on the other hand it performs full
> barriers during this contention, which might be more of a burden.
>
> Conceptually I think I like the version that always stores for
> symmetry with the other atomic operations that makes it "obviously
> correct if they're correct". But I'm not sure if there are cost
> reasons we should prefer a different choice. At least it "shouldn't be
> worse" than the other atomic ops are if we do it that way.
>
Symmetry is indeed nice, but I'm not sure it's really appropriate
because at least in my mind a_cas() is different from other atomic RMW
ops by making the "W" part conditional. I do agree on the correctness
part though.

Interestingly, the kernel actually removed the full barrier semantics
from a failed atomic_cmpxchg(), with the reasoning that "nobody needs
it":

https://lists.infradead.org/pipermail/linux-arm-kernel/2015-July/355981.html

Alexey
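For anyone who wants to poke at the E/R scenario from Szabolcs's message
above on real hardware, a stress harness along the following lines can be
used. It is only a sketch: my_ll, my_sc and rmw_unlock are hypothetical
stand-ins written with the ldaxr/stlxr-without-dmb structure the thread
attributes to musl's a_ll()/a_sc()-based UNLOCK, not musl's actual code,
and the "memory" clobbers are there only to pin compiler ordering so that
any reordering observed comes from the CPU. A run that never reports the
x==-1 && y==0 outcome proves nothing; the architecture permits it but does
not guarantee it will show up.

/* cas_litmus.c: AArch64-only stress test for the E/R reordering above.
 * Build on AArch64: cc -O2 -pthread cas_litmus.c
 * All helper names are invented for this harness. */
#include <pthread.h>
#include <stdio.h>

#define ITERS 1000000

static pthread_barrier_t bar;
static volatile int killlock, unrelated, need_locks, something;

/* Acquire/release LL/SC pair, no dmb, roughly as described in the thread. */
static inline int my_ll(volatile int *p)
{
	int v;
	__asm__ __volatile__ ("ldaxr %w0,%1" : "=r"(v) : "Q"(*p) : "memory");
	return v;
}

static inline int my_sc(volatile int *p, int v)
{
	int r;
	__asm__ __volatile__ ("stlxr %w0,%w2,%1"
		: "=&r"(r), "=Q"(*p) : "r"(v) : "memory");
	return !r;
}

/* One ll/sc RMW on a lock word with no barrier: models UNLOCK via
 * a_fetch_add() in thread E and the unrelated lock's RMW in thread R. */
static inline void rmw_unlock(volatile int *p)
{
	int old;
	do old = my_ll(p);
	while (!my_sc(p, old + 1));
}

static void *thread_e(void *arg)
{
	for (int i = 0; i < ITERS; i++) {
		something = 0;          /* reset state for this round */
		need_locks = 0;
		pthread_barrier_wait(&bar);
		something = 1;          /* store "protected" by killlock */
		rmw_unlock(&killlock);  /* E:UNLOCK(killlock) */
		need_locks = -1;        /* E:need_locks=-1 */
		pthread_barrier_wait(&bar);
	}
	return 0;
}

static void *thread_r(void *arg)
{
	long bad = 0;
	for (int i = 0; i < ITERS; i++) {
		pthread_barrier_wait(&bar);
		int x = need_locks;      /* R: reads need_locks in LOCK(unrelated) */
		rmw_unlock(&unrelated);  /* R: ll/sc pair on the unrelated lock word */
		int y = something;       /* R: read that killlock should protect */
		pthread_barrier_wait(&bar);
		if (x == -1 && y == 0) bad++;
	}
	printf("x==-1 && y==0 observed %ld times out of %d\n", bad, ITERS);
	return 0;
}

int main(void)
{
	pthread_t e, r;
	pthread_barrier_init(&bar, 0, 2);
	pthread_create(&e, 0, thread_e, 0);
	pthread_create(&r, 0, thread_r, 0);
	pthread_join(e, 0);
	pthread_join(r, 0);
	return 0;
}

With a full barrier in the unlock path (for example a dmb ish added to
rmw_unlock), the x==-1 && y==0 count should stay at zero; with the bare
ldaxr/stlxr pair it is allowed to become nonzero on some cores.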