musl - Re: x86 fma with run-time switch?

Date: Sat, 16 Mar 2024 04:37:29 +0100
From: Markus Wichmann <nullplan@....net>
To: musl@...ts.openwall.com
Subject: Re: x86 fma with run-time switch?

On Fri, Mar 15, 2024 at 05:36:23PM -0400, Rich Felker wrote:
> You're making it too complicated. Just
>
> #define fma __soft_fma
> #include "../fma.c"
>
> or similar.
>

But that leaves a __soft_fma() symbol with external linkage in there.
OK, I suppose this is still less of a hack.

> My expectation was that you would just use __hwcap, whereby it would
> be a hidden global access no more expensive than accessing a function
> pointer for indirect branch, and likely cheaper to do local direct
> branches based on testing bits of it, something like:
>
> 	if (__hwcap & WHATEVER) {
> 		__asm__(...);
> 		return ...;
> 	} else return __soft_fma(...);
>
> However, it looks like x86_64 lacks usable __hwcap.
>

My understanding was that __hwcap is a contract between kernel and
userspace about what ISA extensions both kernel and CPU support, whereas
FMA on x86 bypasses the kernel.

On other architectures, such as PowerPC, userspace has no way to
negotiate ISA version with the CPU, so __hwcap does that too, but that
is only by necessity.
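
To illustrate the run-time switch under discussion, here is a minimal
sketch, not the musl implementation. It assumes GCC or Clang with
<cpuid.h> on x86_64. The names fma_dispatch, cpu_has_fma and
soft_fma_placeholder are hypothetical, and the placeholder only stands
in for the generic software fma(): it is not correctly rounded. The
sketch tests only the CPUID feature bit (leaf 1, ECX bit 12), as
described above, and it queries CPUID on every call, which real code
would avoid by caching the result.

```c
#if defined(__x86_64__)
#include <cpuid.h>
#endif

/* Hypothetical stand-in for the generic software fma().
 * NOT a correctly rounded fma; it only keeps the sketch self-contained. */
static double soft_fma_placeholder(double x, double y, double z)
{
	return (double)((long double)x * y + z);
}

/* Returns nonzero if CPUID reports FMA (leaf 1, ECX bit 12). */
static int cpu_has_fma(void)
{
#if defined(__x86_64__)
	unsigned eax, ebx, ecx, edx;
	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return 0;
	return (ecx >> 12) & 1;
#else
	return 0; /* assumption: no hardware path on other targets */
#endif
}

/* Hypothetical dispatcher: hardware FMA if available, else fallback. */
double fma_dispatch(double x, double y, double z)
{
#if defined(__x86_64__)
	if (cpu_has_fma()) {
		/* x = x * y + z in a single instruction */
		__asm__ ("vfmadd132sd %1, %2, %0" : "+x"(x) : "x"(y), "x"(z));
		return x;
	}
#endif
	return soft_fma_placeholder(x, y, z);
}
```

Calling fma_dispatch(2.0, 3.0, 1.0) yields 7.0 on either path, since
that result is exactly representable.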

> We already do that.
>

My point was that I'd try to continue doing that with a minimum of
repeated code.

> > Doing the same for i386 requires also verifying kernel SSE support in
> > hwcap (that implies CPUID support in the CPU, since the baseline 80486
> > does not necessarily have that, but all CPUs with SSE have it) and also
> > support for extended CPUID in case of fma4.
>
> That seems a lot less likely to be worthwhile since it involves
> shuffling data back and forth between x87 and sse registers, but maybe
> it's still a big enough win to want to do it?
>

Right. On i386, FP parameters are passed on the stack, but the result is
returned in %st(0). So the calling code would likely have the numbers in
the FPU, then spill them to memory for the call, then we load them into
SSE registers, spill the result back to memory, load it into the FPU,
and return. However, just a glance at the generic fma() code makes me
doubt that a bunch of spilling to memory and running a single SSE
instruction can ever be slower than all of what is happening in there.

> Regardless, I wonder if we should have the x86_64 startup code store
> cpuid result somewhere we can use, so that we don't have to do nasty
> atomic stuff determining it late, and could just branch on the bits of
> a runtime-constant like we would with __hwcap. This would set the
> stage for being able to do with more-impactful things like mem* too.
> (That's a big project that involves designing the system for how archs
> define the large-block-ops the generic C functions would use, so not
> immediately applicable, but useful to be working towards it.)
>
> Rich

The problem with that idea is that CPUID returns an enormous amount of
information, but right now we only care about two bits on x86_64 (namely
FMA and FMA4 support). So we could have some internal word such as
__cpuid, that basically contains a digest of the CPUID information,
namely only the bits we care about. That would be extensible for up to
64 bits we want to look at, but it would require complex control flow.
And I thought you were against complex control flow in assembler.
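
As a rough sketch of what such a digest word could look like, here is
the logic in C. Everything named here is hypothetical: cpu_digest,
CPU_DIGEST_FMA, CPU_DIGEST_FMA4, compute_cpu_digest and init_cpu_digest
are not musl symbols, and the bit assignments inside the digest are
arbitrary. It assumes GCC or Clang with <cpuid.h>. The startup code
would have to do the equivalent in assembler, which is where the
control flow concern comes from; C is used only to show the branches
involved.

```c
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

/* Hypothetical digest bits; the positions are arbitrary. */
#define CPU_DIGEST_FMA  (UINT64_C(1) << 0)
#define CPU_DIGEST_FMA4 (UINT64_C(1) << 1)

/* Hypothetical run-time constant, written once at startup. */
static uint64_t cpu_digest;

/* Collect only the CPUID bits we care about into one word. */
static uint64_t compute_cpu_digest(void)
{
	uint64_t d = 0;
#if defined(__x86_64__) || defined(__i386__)
	unsigned eax, ebx, ecx, edx;

	/* FMA: basic leaf 1, ECX bit 12 */
	if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 12)))
		d |= CPU_DIGEST_FMA;

	/* FMA4: extended leaf 0x80000001, ECX bit 16.
	 * __get_cpuid() returns 0 if the extended leaf is unsupported. */
	if (__get_cpuid(0x80000001, &eax, &ebx, &ecx, &edx) && (ecx & (1u << 16)))
		d |= CPU_DIGEST_FMA4;
#endif
	return d;
}

/* Would be called once from startup code in this sketch. */
void init_cpu_digest(void)
{
	cpu_digest = compute_cpu_digest();
}
```

After init_cpu_digest() has run, a caller would only need a test such
as (cpu_digest & CPU_DIGEST_FMA) to pick a code path.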

I am also unsure whether you need much more than __hwcap (and possibly
__hwcap2) for the string functions. I don't know what optimizations you
have in mind for those, but if you need AVX support, for example, then
CPUID information does not help you. You need confirmation from the
kernel that it supports AVX, or else you will catch a SIGILL in the
attempt to use it. Very few ISA extensions can make do without kernel
support, after all.

Ciao,
Markus