Message-ID: <20230211133532.GD4163@brightrain.aerifal.cx>
Date: Sat, 11 Feb 2023 08:35:33 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: qsort

On Sat, Feb 11, 2023 at 10:06:02AM +0100, alice wrote:
> On Sat Feb 11, 2023 at 9:39 AM CET, Joakim Sindholt wrote:
> > On Sat, 11 Feb 2023 06:44:29 +0100, "alice" <alice@...ya.dev> wrote:
> > > based on the glibc profiling, glibc also has their natively-loaded-cpu-specific
> > > optimisations, the _avx_ functions in your case. musl doesn't implement any
> > > SIMD optimisations, so this is a bit apples-to-oranges unless musl implements
> > > the same kind of native per-arch optimisation.
> > > 
> > > you should rerun these with GLIBC_TUNABLES, from something in:
> > > https://www.gnu.org/software/libc/manual/html_node/Hardware-Capability-Tunables.html
> > > which should let you disable them all (if you just want to compare C to C code).
> > > 
> > > ( unrelated, but has there been some historic discussion of implementing
> > >   something similar in musl? i feel like i might be forgetting something. )
> >
> > There already are arch-specific asm implementations of functions like
> > memcpy.
> 
> apologies, i wasn't quite clear- the difference
> between src/string/x86_64/memcpy.s and the glibc fiesta is that the latter
> utilises subarch-specific SIMD (as you explain below), e.g. AVX like in the
> above benchmarks. a baseline x86_64 asm is more fair-game if the difference is
> as significant as it is for memcpy :)

Folks are missing the point here. It has nothing to do with AVX or
even glibc's memcpy making glibc faster here. Rather, it's that glibc
is *not calling memcpy* at all for 4-byte (and likely a bunch of other
specialized) element sizes. Either they manually special-case them, or
the compiler (due to lack of -ffreestanding, and likely -O3 or
something) is inlining the memcpy.

Based on the profiling data, I would predict an instant 2x speed boost
from special-casing small element sizes to swap directly with no
memcpy call.
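[Editorial sketch: the special-casing described above might look roughly
like the following. The helper name and the exact set of widths are
illustrative assumptions, not musl's or glibc's actual internals; the
point is only that dispatching once on the element width turns the
common small swaps into direct loads/stores instead of out-of-line
memcpy calls.]

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical qsort swap helper: special-case the common small
 * element widths.  The fixed-size memcpy calls below are lowered by
 * any modern compiler to plain register loads/stores (memcpy is used
 * instead of pointer casts to stay safe for unaligned elements), so
 * no library call is made on these paths. */
static void swap_elems(unsigned char *a, unsigned char *b, size_t width)
{
    if (width == 4) {
        uint32_t t;
        memcpy(&t, a, 4);
        memcpy(a, b, 4);
        memcpy(b, &t, 4);
    } else if (width == 8) {
        uint64_t t;
        memcpy(&t, a, 8);
        memcpy(a, b, 8);
        memcpy(b, &t, 8);
    } else {
        /* generic fallback: byte-at-a-time swap, still no libcall */
        while (width--) {
            unsigned char t = *a;
            *a++ = *b;
            *b++ = t;
        }
    }
}
```

[A sort loop over `int` elements would hit only the `width == 4`
branch, which is the case the profiling above is dominated by.]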

Incidentally, our memcpy is almost surely at least as fast as glibc's
for 4-byte copies. It's very large sizes where performance is likely
to diverge.

Rich
