|
Message-ID: <20230211092249.GC1903@voyager> Date: Sat, 11 Feb 2023 10:22:49 +0100 From: Markus Wichmann <nullplan@....net> To: musl@...ts.openwall.com Subject: Re: Re:Re: qsort On Mon, Jan 30, 2023 at 06:04:39PM +0800, David Wang wrote: > On average, qsort with musl is significantly slower than c++ sort, > while with glibc the average performance of qsort is bettern than c++ > sort. Getting back to this, with the benchmarks recently performed, I notice one area where C++ is able to significantly outperform C, and that is in the area of generic functions. C++ std::sort is a template function, and the compiler can optimize the code for the situation. For example, in your case, you only ever need to swap two elements by way of copying four bytes, which the compiler can just inline. In C, the same code can only call memcpy(), and a call to cycle() will involve many calls to memcpy(), and they cannot be inlined since the compiler will not know the copy size at the time it is compiling the calls. We end up with tons of calls to memcpy() with a size of four, when memcpy() is optimized for large sizes. Merge sort, as glibc implements it, allows at least a few memcpy() calls with larger sizes. Glibc also avoids calling memcpy by duplicating the code for common sizes, only resorting to memcpy() for the default case. That may be an avenue worth pursuing, at least for cycle(). I'm not quite certain why glibc chose the sizes they chose for msort_with_tmp(), but they are branching out for 32-bit integers, 64-bit integers, longs (which in a System V ABI are always one of the former two) and void* (which in a System V ABI are always the same as a long). I think special casing 32-bit and 64-bit integers ought to get us most of the way there, since those are the most common sizes that can still be copied simply inline. Also note that glibc further reduces the need for memcpy() by sorting indirectly if the size of a single element exceeds 32 bytes. That is of course something they can do, since they already are allocating memory and have a fall-back strategy in place. Not sure if this is worthwhile for musl. At least in the static linking case, it would suddenly pull malloc() and free() into executables that previously did not have those. That last optimization would also not have any bearing on the current batch of benchmarks, since in those, a size of four is set in stone. Ciao, Markus
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.