Message-Id: <CQFM48UU024L.3F72QJSEDJMQ@sumire>
Date: Sat, 11 Feb 2023 10:06:02 +0100
From: "alice" <alice@...ya.dev>
To: <musl@...ts.openwall.com>
Subject: Re: Re:Re: Re:Re: Re:Re: Re:Re: qsort

On Sat Feb 11, 2023 at 9:39 AM CET, Joakim Sindholt wrote:
> On Sat, 11 Feb 2023 06:44:29 +0100, "alice" <alice@...ya.dev> wrote:
> > based on the glibc profiling, glibc also has their natively-loaded-cpu-specific
> > optimisations, the _avx_ functions in your case. musl doesn't implement any
> > SIMD optimisations, so this is a bit apples-to-oranges unless musl implements
> > the same kind of native per-arch optimisation.
> >
> > you should rerun these with GLIBC_TUNABLES, from something in:
> > https://www.gnu.org/software/libc/manual/html_node/Hardware-Capability-Tunables.html
> > which should let you disable them all (if you just want to compare C to C code).
> >
> > ( unrelated, but has there been some historic discussion of implementing
> > something similar in musl? i feel like i might be forgetting something. )
>
> There already are arch-specific asm implementations of functions like
> memcpy.

apologies, i wasn't quite clear- the difference between
src/string/x86_64/memcpy.s and the glibc fiesta is that the latter utilises
subarch-specific SIMD (as you explain below), e.g. AVX like in the above
benchmarks. a baseline x86_64 asm is more fair-game if the difference is as
significant as it is for memcpy :)

i wonder if anyone has tried such baseline-asm for str*, or for non
i386/x86_64 by now. there seems to only be x86 and mips asm in the tree
currently (base platform support aside).

(purely out of interest of course- i don't have the ability to write such
things (yet), and maybe there are some gains more significant than "2.2%"
possible with just sse2 for instance.)

> As I see it there are 3 issues standing between musl and the
> glibc approach of writing a new function every time Intel or AMD
> releases a new core design:
> 1) ifunc resolvers don't work on statically linked binaries.
> 2) If they did it would mean shipping 12 different implementations of
> each optimized function, making the binary huge for, for the most
> part, no good reason.
> 3) The esoteric bug is no longer in memcpy but in either memcpy_c,
> memcpy_mmx, memcpy_3dnow, memcpy_sse2, memcpy_sse3, memcpy_ssse3,
> memcpy_sse41, memcpy_sse42, memcpy_avx, memcpy_avx2, memcpy_avx512,
> or memcpy_amx or whatever else is added in the future in a
> never-ending spiral of implementations piling up.

3) is admittedly the worst effect- niche esoteric debugging is worse than "disk
space", and having so many implementations is certainly hard to maintain.

> It is my opinion that musl should remain small and concise to allow it
> to effectively serve both the "small" and "gotta go fast" markets. I say
> both because you can always haul in libreallyreallyfastsort.a/so but you
> can't take the 47 qsort/memcpy implementations out of libc.

yes, i generally find myself having the same opinion :)
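
For readers unfamiliar with the dispatch pattern discussed above, the following is a minimal sketch in C of choosing an implementation at run time through a plain function pointer instead of an ifunc resolver. It is an illustration under stated assumptions, not how musl or glibc do it: every name in it (copy_fn, copy_bytes, copy_chunks, pick_copy, my_copy) is hypothetical, the "fast" variant is ordinary C standing in for a SIMD routine, and the CPU check relies on the GCC/Clang x86 builtins __builtin_cpu_init and __builtin_cpu_supports.

```c
/* Sketch only: run-time selection of a copy routine via a function pointer.
 * All identifiers are hypothetical; nothing here is musl or glibc code. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Baseline variant: plain byte loop, valid on every CPU. */
static void *copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Stand-in for a CPU-specific variant. A real one would use SIMD; this
 * one just copies in 8-byte chunks so the sketch stays portable. */
static void *copy_chunks(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (; n >= 8; n -= 8, d += 8, s += 8)
        memcpy(d, s, 8);
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Plays the role of a resolver: inspect the CPU, return one variant. */
static copy_fn pick_copy(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))
        return copy_chunks;
#endif
    return copy_bytes;
}

/* Public entry point: resolves once on first use, then calls through.
 * The lazy initialisation is not synchronised; a real implementation
 * would need to handle concurrent first calls. */
void *my_copy(void *dst, const void *src, size_t n)
{
    static copy_fn impl;
    if (!impl)
        impl = pick_copy();
    assert(impl == copy_bytes || impl == copy_chunks);
    return impl(dst, src, n);
}
```

A caller uses my_copy(dst, src, n) exactly like memcpy. Even with only two variants, a bug report against my_copy now has to say which variant was selected, which is the maintenance cost point 3) describes, and each extra variant adds code to every binary that links it, as in point 2).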