Message-ID: <20230523194138.GX3630668@port70.net>
Date: Tue, 23 May 2023 21:41:38 +0200
From: Szabolcs Nagy <nsz@...t70.net>
To: 847567161 <847567161@...com>
Cc: musl <musl@...ts.openwall.com>
Subject: Re: Re:Re: Re: Question:Why musl call a_barrier in __pthread_once?

* 847567161 <847567161@...com> [2023-05-22 09:53:05 +0800]:
> >Besides it doesn't help your case. You wanted to remove the "dmb"
> >instruction right? Well, that code adds it if the compiler thinks it is
> >necessary, and GCC trunk for ARM does so: https://godbolt.org/z/WcrfTdTx5
>
> I compared the implement between musl and bionic in assembly code,
> I see bionic don't generate 'dmb' with clang, you can also check in here,
> https://godbolt.org/z/hroY3cc4d
> So it has better performance in common case which init function is done.
>
> So can we do any optimization here?

we went over the options up thread but

1) arm != aarch64. clang generates dmb too on armv7-a.

2) the clang you use there is wrong for aarch64: it generates code
   for the latest arch, which would not run on baseline -march=armv8-a
   cpu. (ldapr is armv8.3-a and not implemented on most cpus currently
   in use)

3) ldar (or ldapr) are atomic instructions that have an acquire barrier
   internally, so while it's not as strong as a dmb it is still a
   barrier. (x86 is a different story, there normal loads are already
   acquire mo)

4) acquire barrier is not enough, posix requires 'synchronize memory'
   for the first call in each thread. (bionic implements weak
   'synchronize memory', which arguably breaks some valid posix code,
   so musl is conservative).

5) even if musl allowed weak 'synchronize memory', adding weak atomics
   to musl would either be a maintenance burden (to implement on all
   targets and update the synchronization code) or a regression if
   compiler builtins were used for them (old compilers would not work,
   even recent gcc/llvm had atomics bugs affecting abi on various
   targets).

6) the TLS solution would avoid the barrier completely and it
   guarantees that in each thread for each once_flag there is a
   synchronize memory either at the first pthread_once call *or*
   somewhere between that and the once_function completion.
   (the *or*... bit is not posix conform, but it is iso c conform.
   faster than the bionic code in the common case)

so call_once can be optimized to have no barrier or atomics at all
in the fast path (apart from relaxed mo atomics), to do the same
for pthread_once at least an accepted austingroupbugs report is
needed to relax the requirement. a posix relaxation is needed for
using acquire barrier as well (though most libc implementations
already assume this relaxation).
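
to illustrate only the fast-path idea in point 6, here is a minimal
application-level sketch in C. it is not the design discussed up thread
and not musl code: it assumes a single, statically known once flag, and
the names once_seen, get_config, init_count, config_value and
run_threads are hypothetical. each thread makes one real pthread_once
call (with whatever synchronization the libc provides) and remembers
that in a thread-local variable, so later calls in the same thread use
neither a barrier nor an atomic.

```c
#include <assert.h>
#include <pthread.h>

static pthread_once_t once = PTHREAD_ONCE_INIT;

/* hypothetical per-thread cache: nonzero once this thread has gone
   through pthread_once and so has synchronized with init() */
static _Thread_local int once_seen;

static int init_count;   /* written only inside init() */
static int config_value; /* data published by init() */

static void init(void)
{
    init_count++;
    config_value = 42;
}

static int get_config(void)
{
    if (!once_seen) {
        /* slow path: taken once per thread, fully synchronized */
        pthread_once(&once, init);
        once_seen = 1;
    }
    /* fast path: plain TLS load above, no barrier, no atomic */
    return config_value;
}

static void *worker(void *arg)
{
    int *ok = arg;
    for (int i = 0; i < 1000; i++)
        if (get_config() != 42)
            return 0;
    *ok = 1;
    return 0;
}

/* start n threads (n <= 16) and return how many of them only ever
   saw the initialized value */
static int run_threads(int n)
{
    pthread_t t[16];
    int ok[16] = {0};
    int good = 0;

    assert(n >= 0 && n <= 16);
    for (int i = 0; i < n; i++) {
        int r = pthread_create(&t[i], 0, worker, &ok[i]);
        assert(r == 0);
        (void)r;
    }
    for (int i = 0; i < n; i++) {
        pthread_join(t[i], 0);
        good += ok[i];
    }
    return good;
}
```

the trade-off is one thread-local variable per once flag, which is easy
for an application with a known flag but is the hard part for a libc
that must handle arbitrary once_flag objects.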