Message-ID: <1438602469.2248.2.camel@inria.fr>
Date: Mon, 03 Aug 2015 13:47:49 +0200
From: Jens Gustedt <jens.gustedt@...ia.fr>
To: musl@...ts.openwall.com
Subject: stdatomic library, v2

Hello,

here is the second version of my stdatomic implementation. The main
difference from the previous one is a handcrafted lock. This lock is
there for a very particular situation, namely critical sections that
are extremely short. It is not yet clear whether the same strategy
could or should apply to other parts of musl, maybe as a replacement
for the __lock/__unlock pair. The situation is special here because
every cycle counts.

- On the fast path, the shorter it gets, the fewer conflicts there are
  with other lockers.
- On the mild-conflict path, we spin only once on average, so here too
  fewer cycles mean less interaction with other lockers.

This lock is already implemented with stdatomic semantics. First,
because it is nicer and we finally have it, here :) More seriously,
the assembler produced is much better than with the inline assembler
macros, simply because the compiler can integrate everything more
smoothly and take advantage of particular properties of the
instructions. E.g. on x86 the cmpxchg instruction already places the
result of the operation in a flag, so there is no need to retest the
condition. And since every cycle counts (see above), using stdatomic
alone seems to have benefits over __lock/__unlock.

The gain of this new lock is that on my machine my benchmark (list
element allocation-insertion or removal-deallocation) goes from about
700000 up to about 1000000 operations per second.

I also tried to benchmark the individual parts of the new algorithm to
get a handle on where the problem spots are, but exact measurement
doesn't seem possible due to Heisenberg effects.

As before, please find the "implementation" part of my notes below; I
also attach the full .org file.

Jens

* The implementation

** Requirements

*** Compilers

You should be able to compile this implementation with any modern
version of gcc or clang. (Exact versions are hard to pin down; gcc
should work from 4.1 on.) The quality of the resulting binary will
depend on the implementation of atomic support by the compiler.

There are three different implementations: one each for modern clang
and gcc, and one for compilers that only support the =__sync_=
built-ins. They are only tested with clang and gcc, but might work
with other compilers that implement one of these sets of built-ins and
are otherwise compatible with some gcc extensions:

- compound expressions with =({ })=
- =__attribute__= with =__alias__= and =__unused__=
- =__builtin_choose_expr= for the =__sync= version, as a precursor of
  C11's =_Generic=

There are some heuristics in place to decide at compile time which
case applies, namely =__clang__= to detect clang and the
=__ATOMIC_...= macros to detect the C11 versions of the built-ins.
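A rough sketch of what such a compile-time dispatch can look like is
shown below; the =U_IMPL_BACKEND_...= names are purely illustrative
and are not the macros the library actually defines:

#+begin_src c
/* Illustrative compile-time selection of the atomic backend. */
#if defined(__ATOMIC_RELAXED) && defined(__clang__)
/* clang with support for the C11-style atomic built-ins */
# define U_IMPL_BACKEND_CLANG 1
#elif defined(__ATOMIC_RELAXED)
/* gcc 4.7 or newer with the __atomic_* built-ins */
# define U_IMPL_BACKEND_GCC_ATOMIC 1
#else
/* older compilers: fall back to the __sync_* built-ins */
# define U_IMPL_BACKEND_SYNC 1
#endif
#+end_src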
*** OS or C library support

The library may work with different lock constructs. Currently we
implement two: a simple generic approach that only uses spinning, and
a mixed approach that uses Linux' =futex= as an inactive sleep
strategy of last resort. The latter has been tested with the =musl= C
library.

This locking strategy can be a performance bottleneck for applications
with strong congestion on one particular atomic object, e.g. code that
inserts list elements through a centralized list head. If this list
head cannot be realized with a lock-free atomic, the critical section
that modifies it is protected by our lock. Such code has very
particular properties:

- Since the critical section usually is really short compared to a
  scheduling interval of the OS, the probability that the lock can be
  taken immediately is high. So the fast path for taking the lock must
  be *really fast*. Our implementation essentially has one
  =atomic_compare_exchange_strong_explicit= here; one memory
  instruction on the fast path must be enough.
- If locking fails at the first try, the probability is still very
  high that it will succeed soon after. This is because only scheduled
  threads compete here, so there are never more threads in play than
  we have processors. Therefore, as a second strategy, we spin for a
  while until we get the lock. In our experiments, on average a single
  round of spinning was enough.
- A third, exceptional case may occur when the thread that is holding
  the lock is descheduled in the middle of the critical section. That
  event is quite rare (0.1 % in our experiments), but it does occur.
  If it does, the world changes drastically: a herd of threads all
  have to wait for a long time (until the locker is rescheduled) to
  have any chance of obtaining the lock. Active waiting here is
  counterproductive. On the contrary, by going into an inactive OS
  sleep, the chance for the locker to regain an execution slot
  increases.

We implement this strategy a bit differently than classical locks with
wait counters would. We just have a single =unsigned= value that holds
both the lock bit (the HO bit) and a counter. That counter is not
viewed as a count of the threads that are in a kernel wait, but simply
counts the number of threads inside the critical section. This has the
following advantages (a simplified code sketch follows the list):

- An update to the counter part is relatively rare, so we save memory
  bandwidth and avoid too much interaction between the different
  threads that compete for the lock.
- The fast path occurs when the value is initially =0=. It sets the HO
  bit (the lock bit) and the LO bit (for a counter value of =1=) in
  one go. The resulting value is =UINT_MAX/2u+2u=.
- If the fast path fails, the counter is atomically incremented by
  one, and we enter a spin loop to set the HO bit as well.
- After having spun for some time, we assume that we are in the bad
  situation and go into a =futex_wait=. Going into the =futex_wait=
  may fail if the value changes. Since additional threads only change
  the counter when they arrive, this can't happen too often and the
  thread goes to sleep eventually.
- Unlocking is a very simple operation. The locker has contributed
  =UINT_MAX/2u+2u= to the value, so it just has to decrement the value
  atomically by that amount. By doing so, the thread also notices
  whether other threads are still inside the critical section and
  wakes one of them.
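The following is a much simplified sketch of that scheme, not the code
from the attached patch; =SPIN_MAX=, =__impl_futex_wait= and
=__impl_futex_wake_one= are placeholder names for the spin bound and
the futex syscall wrappers:

#+begin_src c
#include <stdatomic.h>
#include <limits.h>

#define LOCKBIT  (UINT_MAX/2u + 1u)  /* the HO bit, 0x80000000 for 32 bit unsigned */
#define CONTRIB  (UINT_MAX/2u + 2u)  /* lock bit plus a counter contribution of 1 */
#define SPIN_MAX 10                  /* illustrative spin bound */

/* placeholders for the futex syscall wrappers */
void __impl_futex_wait(_Atomic(unsigned)* loc, unsigned val);
void __impl_futex_wake_one(_Atomic(unsigned)* loc);

void lock(_Atomic(unsigned)* loc) {
  unsigned cur = 0;
  /* fast path: set the lock bit and a counter of 1 with a single CAS */
  if (atomic_compare_exchange_strong_explicit(loc, &cur, CONTRIB,
          memory_order_acquire, memory_order_relaxed))
    return;
  /* slow path: register in the counter once, then try to grab the lock bit */
  cur = 1u + atomic_fetch_add_explicit(loc, 1u, memory_order_relaxed);
  for (unsigned spin = 0; ; ++spin) {
    if (!(cur & LOCKBIT)) {
      /* lock bit is free: try to set it, keeping our counter contribution */
      if (atomic_compare_exchange_strong_explicit(loc, &cur, cur | LOCKBIT,
              memory_order_acquire, memory_order_relaxed))
        return;
    } else if (spin < SPIN_MAX) {
      /* the holder is probably still running: just reread and spin */
      cur = atomic_load_explicit(loc, memory_order_relaxed);
    } else {
      /* the holder was probably descheduled: sleep until the value changes */
      __impl_futex_wait(loc, cur);
      cur = atomic_load_explicit(loc, memory_order_relaxed);
    }
  }
}

void unlock(_Atomic(unsigned)* loc) {
  /* give back our whole contribution in one atomic subtraction */
  if (atomic_fetch_sub_explicit(loc, CONTRIB, memory_order_release) != CONTRIB)
    /* other threads are still inside the critical section: wake one */
    __impl_futex_wake_one(loc);
}
#+end_src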
** Caveats

*** Symbol renaming

There is one important difficulty when compiling this. The original
=__atomic= library interface was developed with C++ in mind, not C.
Therefore it freely uses function overloading for the built-ins versus
the library interface. Since we also use the library functions as
fallbacks in the implementation of some of the =_X= variants, this
naming scheme is not supportable with a C compiler.

We get away with it by using internal names, prefixed with =__impl_=,
for all functions, and then renaming the symbols to the intended ones
using =objcopy=.

- The current integration into musl does this with a *script* that you
  have to run *manually* after compilation.
- You then have to launch =make= a *second time* to do the final link.

This technique is certainly not ideal and subject to improvement.

*** Support of 16 byte atomic instructions

The main relevant difference between modern processors here is whether
they support 16 byte atomic instructions or not. There is no
difficulty in detecting this at compile time, but if the library is
used with code that is compiled with a different compiler, or just
with different compiler options, incompatible binary code may be
produced.

My plan is to freeze that feature at compile time of the library and
reflect the capacity in the =<stdatomic.h>= that is provided. This may
result in code that is a bit less optimized than it could be, but that
is compatible.

- If the library is *not* compiled with direct 16 byte support, the
  application may not use it either, and thus uses a memory
  implementation for such operations.
- If the library *is* compiled with direct 16 byte support but the
  application compiler doesn't support it, the user code should fall
  back to library calls, which in turn use the atomic instructions.
  Such a variant has a call overhead and cannot inline the atomics in
  the user binary.

All of this is not yet done, though. Be careful when using this
preliminary version.
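As an illustration only (this is not the mechanism from the patch),
such a freeze could be recorded as a configuration macro when the
library itself is built; =U_IMPL_HAVE_CAS16= is a made-up name:

#+begin_src c
/* Hypothetical probe, evaluated once when the library is compiled. */
#if defined(__SIZEOF_INT128__) && defined(__GCC_HAVE_SYNC_COMPARE_AND_SWAP_16)
# define U_IMPL_HAVE_CAS16 1  /* 16 byte compare-and-swap can be inlined */
#else
# define U_IMPL_HAVE_CAS16 0  /* route 16 byte operations through library calls */
#endif
#+end_src

The generated =<stdatomic.h>= could then dispatch 16 byte operations
either to the inlined built-ins or to the =__impl_= library calls,
depending on that macro.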
** Leftovers

There are some leftovers that will hopefully disappear.

- There are several hash functions and an instrumentation
  infrastructure for the hashes. I didn't have enough test cases yet
  to see what would be best here.
- There is optional instrumentation for the lock functions. Switching
  it on changes overall performance substantially, so I'd expect a
  noticeable Heisenberg effect. These counters can give qualitative
  information about what happens, but you shouldn't take the figures
  literally.

--
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536 ::
:: :::::::::::::::::::::: gsm France : +33 651400183 ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::