Message-ID: <1439728953.9803.30.camel@inria.fr>
Date: Sun, 16 Aug 2015 14:42:33 +0200
From: Jens Gustedt <jens.gustedt@...ia.fr>
To: musl@...ts.openwall.com
Subject: Re: [PATCH] replace a mfence instruction by an xchg instruction

Hello,

On Saturday, 15 Aug 2015 at 19:28 -0400, Rich Felker wrote:
> On Sat, Aug 15, 2015 at 11:01:40PM +0200, Jens Gustedt wrote:
> > On Saturday, 15 Aug 2015 at 16:17 -0400, Rich Felker wrote:
> > > On Sat, Aug 15, 2015 at 08:51:41AM +0200, Jens Gustedt wrote:
> > > > according to the wisdom of the Internet, e.g.
> > > >
> > > > https://peeterjoot.wordpress.com/2009/12/04/intel-memory-ordering-fence-instructions-and-atomic-operations/
> > > >
> > > > an mfence instruction is about 3 times slower than an xchg
> > > > instruction.
> > >
> > > Uhg, then why does this instruction even exist if it does less and
> > > does it slower?
> >
> > Because they do different things :)
> >
> > mfence is to synchronize all memory; xchg, at least at first glance,
> > only one word.
>
> No, any lock-prefixed instruction, or xchg which has a builtin lock,
> fully orders all memory accesses. Essentially it contains a builtin
> mfence.

Hm, I think mfence does a bit more than that. The three fence
instructions were introduced when they invented the asynchronous
("non-temporal") move instructions that came with SSE. I don't think
that "lock" instructions synchronize with these asynchronous moves, so
the two (lock instructions and fences) are just different types of
animals. And this perhaps answers your question up-thread, why there
is actually something like mfence.

> This is why I find it odd that a lone mfence is 3x slower.
>
> > But I also read that the relative performance of these instructions
> > depends a lot on the actual die you are dealing with.
>
> Did you measure mfence being slower here?

Yes, I attach two graphs that show this (and other things). This is my
benchmark for testing generic atomics, here using __lock/__unlock as a
primitive.
If you just look at the performance difference for one thread, it is
dramatic. Just by changing the a_store instruction, the application
throughput is 40% better. The difference comes not only from the fact
that the performance of __lock/__unlock changes, but also because the
whole rest of musl changes performance; in particular, I think,
because the __lock/__unlock clones in malloc are affected by that
change, too. (That benchmark also does a lot of malloc/free.)

What I didn't notice before running these tests systematically this
night is that the performance of that benchmark also changes for the
congestion case, but there it is the other way around.

> > > > Here we not only had mfence but also the mov instruction that
> > > > was to be protected by the fence. Replace all that by a native
> > > > atomic instruction that gives all the ordering guarantees that
> > > > we need.
> > > >
> > > > This a_store function is performance critical for the __lock
> > > > primitive. In my benchmarks to test my stdatomic implementation
> > > > I have a substantial performance increase (more than 10%), just
> > > > because malloc does better with it.
> > >
> > > Is there a reason you're not using the same approach as on i386?
> > > It was faster than xchg for me, and in principle it "should be
> > > faster".
> >
> > I discovered your approach for i386 after I experimented with
> > "xchg" for x86_64. I guess the "lock orl" instruction is a
> > replacement for "mfence" because that one is not implemented for
> > all variants of i386?
> >
> > Exactly why a "mov" followed by a read-modify-write operation to
> > some random address (here the stack pointer) should be faster than
> > a read-modify-write operation on exactly the address you want to
> > deal with looks weird.
>
> xchg on the atomic address in principle reads a cache line (the write
> destination) that's known to be shared with other threads despite not
> needing it, then modifies it.
> mov on the other hand just appends the write to the local store
> buffer; reading the target cache line is not necessary. The
> subsequent barrier then takes care of ordering issues (just the one
> case x86 doesn't guarantee already).
>
> At least that's what happens in theory. It seemed to be slightly
> faster in practice for me too, which is why I went with it for now
> (theoretically faster + weak empirical evidence that it's faster =
> reasonable basis for tentative conclusion, IMO) but I'm open to
> further study of the different approaches and changing if we find
> something else is better.

Yes, that theory probably isn't enough to describe things. If you look
at the latency of a read-modify-write instruction you have something
like

   L = I + 2 M

where I is the instruction latency itself, and M is the memory
latency. M varies with the platform, chipset, whatever, but it should
be the same for all lock instructions on the same machine. Just to
give an order of magnitude, M should be around 100 cycles. "I" in turn
can be very different for different lock instructions, and in addition
varies for different processor variants. If I read the tables
correctly, it can be from about 10 to 100 cycles, so something really
noticeable compared to M.

What you'd like to have for a store instruction is

   L_{store} = I_{store} + M

avoiding the second M for the return from memory. (The second M can be
hidden until we hit the next load instruction.)

Your approach with orl has

   I_{mov} + I_{or} + M

the original one has

   I_{mov} + I_{mfence} + 2 M

which gives you an extra M; the xchg approach has

   I_{xchg} + 2 M

From the graphs we can deduce that the orl strategy is in fact more
stable than the other two, but still the 20% gap for the "normal"
situation worries me.

> Ultimately I think I'd like to get rid of a_store at some point
> anyway since I think we can do better without it.

At least in this context, yes.
From what I see at the moment, the best strategy for the initial part
of a locking primitive is to use only cmpxchg instructions. This
instruction sets "cc", so if it is integrated properly by the
compiler, it can be used to jump conditionally without even inspecting
the value that had previously been in memory. Currently within musl
such a close integration is difficult, because you can't easily jump
from within __asm__. When using my stdatomic implementation (building
only upon the compiler's builtin support for atomics) the generated
code is much better.

Jens

-- 
:: INRIA Nancy Grand Est ::: Camus ::::::: ICube/ICPS :::
:: ::::::::::::::: office Strasbourg : +33 368854536   ::
:: :::::::::::::::::::::: gsm France : +33 651400183   ::
:: ::::::::::::::: gsm international : +49 15737185122 ::
:: http://icube-icps.unistra.fr/index.php/Jens_Gustedt ::

Download attachment "test-musl-all.eps" of type "image/x-eps" (21344 bytes)
Download attachment "test-musl-relative.eps" of type "image/x-eps" (21498 bytes)
Download attachment "signature.asc" of type "application/pgp-signature" (182 bytes)