Message-ID: <20230523194138.GX3630668@port70.net>
Date: Tue, 23 May 2023 21:41:38 +0200
From: Szabolcs Nagy <nsz@...t70.net>
To: 847567161 <847567161@...com>
Cc: musl <musl@...ts.openwall.com>
Subject: Re: Re:Re: Re: Question：Why musl call a_barrier in __pthread_once?

* 847567161 <847567161@...com> [2023-05-22 09:53:05 +0800]:
> >Besides it doesn't help your case. You wanted to remove the "dmb"
> >instruction right? Well, that code adds it if the compiler thinks it is
> >necessary, and GCC trunk for ARM does so: https://godbolt.org/z/WcrfTdTx5
> 
> I compared the implementations of musl and bionic in assembly code.
> I see bionic doesn't generate 'dmb' with clang; you can also check here:
> https://godbolt.org/z/hroY3cc4d
> So it has better performance in the common case, where the init function is already done.
> 
> So can we do any optimization here? 

we went over the options upthread, but to summarize:

1) arm != aarch64. clang generates dmb too on armv7-a.

2) the clang you use there is wrong for aarch64: it generates code for
the latest arch, which would not run on baseline -march=armv8-a cpu.
(ldapr is armv8.3-a and not implemented on most cpus currently in use)

3) ldar (or ldapr) are atomic instructions that have an acquire barrier
internally, so while it's not as strong as a dmb it is still a barrier.
(x86 is a different story, there normal loads are already acquire mo)

4) acquire barrier is not enough, posix requires 'synchronize memory'
for the first call in each thread. (bionic implements weak 'synchronize
memory', which arguably breaks some valid posix code, so musl is
conservative).

5) even if musl allowed weak 'synchronize memory', adding weak atomics
to musl would either be a maintenance burden (to implement on all
targets and update the synchronization code) or a regression if
compiler builtins were used for them (old compilers would not work,
even recent gcc/llvm had atomics bugs affecting abi on various targets).

6) the TLS solution would avoid the barrier completely and it guarantees
that in each thread for each once_flag there is a synchronize memory
either at the first pthread_once call *or* somewhere between that and
the once_function completion. (the *or*... bit is not posix conforming,
but it is iso c conforming, and it is faster than the bionic code in
the common case)
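
to make points 3) and 4) concrete, here is a minimal sketch of the two
fast paths being compared, modeled with c11 atomics. this is an
illustration only: once_done and the function names are hypothetical,
and musl uses its own internal a_barrier rather than <stdatomic.h>.
on aarch64 the seq_cst fence typically compiles to dmb ish and the
acquire load to ldar (or ldapr when targeting armv8.3-a or later); on
x86 the acquire load is a plain load.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical flag: nonzero once the init function has completed. */
static atomic_int once_done;

/* Conservative fast path: plain (relaxed) load followed by a full
 * barrier, in the spirit of "synchronize memory". */
int done_full_barrier(void)
{
    int d = atomic_load_explicit(&once_done, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return d;
}

/* Weaker fast path: an acquire load. Still a barrier, but only a
 * one-way one, so it is cheaper than the full fence above. */
int done_acquire(void)
{
    return atomic_load_explicit(&once_done, memory_order_acquire);
}

/* Single-threaded sanity check: both variants observe the same value. */
void check_agree(void)
{
    assert(done_full_barrier() == done_acquire());
}
```

the two functions return the same value; they differ only in the
ordering guarantees they give to the memory accesses around them.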

so call_once can be optimized to have no barrier or atomics at all
in the fast path (apart from relaxed mo atomics). to do the same for
pthread_once, at least an accepted austingroupbugs report is needed to
relax the requirement. a posix relaxation is needed for using an acquire
barrier as well (though most libc implementations already assume this
relaxation).
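
one way such a TLS-based fast path could look is sketched below. this
is an assumption for illustration, not the design discussed upthread
and not musl code: every name (sketch_once_t, sketch_epoch,
sketch_seen, ...) is hypothetical. each completed once gets a sequence
number from a global counter, and each thread remembers in TLS the
newest sequence number it has synchronized with, so the fast path is
one relaxed load plus a compare against a thread-local variable. the
sketch ignores counter wraparound and spins instead of sleeping.

```c
#include <assert.h>
#include <stdatomic.h>

/* state: 0 = not started, 1 = init running, >= 2 = done (sequence number) */
typedef struct { atomic_uint state; } sketch_once_t;
#define SKETCH_ONCE_INIT { 0 }

/* Sequence number handed to the most recently completed once. */
static atomic_uint sketch_epoch = 1;
/* Newest sequence number this thread has synchronized with. */
static _Thread_local unsigned sketch_seen;

static void sketch_once_slow(sketch_once_t *o, void (*init)(void))
{
    unsigned expected = 0;
    if (atomic_compare_exchange_strong(&o->state, &expected, 1)) {
        init();
        /* Take a fresh sequence number, then publish completion. */
        unsigned seq = atomic_fetch_add(&sketch_epoch, 1) + 1;
        atomic_store(&o->state, seq);
    } else {
        /* Someone else ran or is running init; a real libc would
         * sleep on a futex here instead of spinning. */
        while (atomic_load(&o->state) < 2)
            ;
    }
    assert(atomic_load(&o->state) >= 2);
    /* One full barrier, then record how far this thread is synced. */
    atomic_thread_fence(memory_order_seq_cst);
    sketch_seen = atomic_load(&sketch_epoch);
}

void sketch_call_once(sketch_once_t *o, void (*init)(void))
{
    /* Fast path: relaxed load and a TLS compare, no barrier. */
    unsigned s = atomic_load_explicit(&o->state, memory_order_relaxed);
    if (s >= 2 && s <= sketch_seen)
        return;
    sketch_once_slow(o, init);
}

/* Usage example (hypothetical init function and flag). */
static int demo_calls;
static void demo_init(void) { demo_calls++; }
static sketch_once_t demo_flag = SKETCH_ONCE_INIT;
```

calling sketch_call_once(&demo_flag, demo_init) twice from one thread
runs demo_init once; the second call returns from the fast path. a
thread whose sketch_seen is already at or past a flag's sequence number
never takes the slow path for that flag, which is where the "barrier
somewhere between the once_function completion and the first call"
behaviour comes from.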