musl - Re: x86 fma with run-time switch?

Date: Fri, 15 Mar 2024 17:36:23 -0400
From: Rich Felker <dalias@...c.org>
To: Markus Wichmann <nullplan@....net>
Cc: musl@...ts.openwall.com
Subject: Re: x86 fma with run-time switch?

On Fri, Mar 15, 2024 at 06:53:59PM +0100, Markus Wichmann wrote:
> Hi all,
> 
> in the message of commit e9016138, Szabolcs wrote that we really
> should be using the single-instruction versions if possible, and we
> should be switching at run time. I have an idea for how to do that
> without losing all of the history of the generic fma.c:
> 
> - Rename src/math/fma.c to src/math/fma-soft.h. Rename the fma function
>   inside to fma_soft and make it static (inline?).
> - Create a new src/math/fma.c that includes fma-soft.h and just calls
>   fma_soft().
> - In src/math/x86_64/fma.c: Unconditionally define fma_fma() and
>   fma_fma4() (which are the current assembler versions) and include
>   fma-soft.h. Create a dispatcher to figure out which version to call,
>   and call that from fma().

You're making it too complicated. Just

#define fma __soft_fma
#include "../fma.c"

or similar.

> Yeah, I know, the header file with stuff in it that takes memory is not
> exactly great, but I can't think of another way to define the generic
> version such that it is accessible to the arch-specific versions under a
> different name and linkage. The file must not be a .c file, or else it
> will confuse the build system.
> 
> The first question, right out of the gate, is whether this would be
> interesting to the group. The second is whether it is better to run
> cpuid every time fma() is called, or to use a function pointer? I am
> partial to the dispatcher pattern myself. In that case, the function
> pointer is initialized at load time to point to the dispatcher, which
> then selects the best implementation and updates the function pointer.
> The main function only unconditionally calls the function pointer.
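
For comparison, the self-updating function pointer described in the
quoted paragraph might look roughly like the sketch below. All names
are hypothetical, the placeholder implementation is not a correctly
rounded fma, and the C11 relaxed atomics are one assumed way to keep
the late pointer update free of data races.

```c
#include <stdatomic.h>

typedef double fma_fn(double, double, double);

static double fma_resolve(double, double, double);

/* Starts out pointing at the resolver; replaced on the first call. */
static _Atomic(fma_fn *) fma_impl = fma_resolve;

/* Placeholder for whichever implementation gets selected.
 * NOT a correctly rounded fma; it only keeps the sketch runnable. */
static double fma_placeholder(double x, double y, double z)
{
	return x * y + z;
}

static double fma_resolve(double x, double y, double z)
{
	/* Real code would pick fma3, fma4 or soft based on CPUID here. */
	fma_fn *best = fma_placeholder;
	atomic_store_explicit(&fma_impl, best, memory_order_relaxed);
	return best(x, y, z);
}

/* The public entry point only ever makes the indirect call. */
double dispatched_fma(double x, double y, double z)
{
	fma_fn *f = atomic_load_explicit(&fma_impl, memory_order_relaxed);
	return f(x, y, z);
}
```

Every call pays for one load and one indirect branch, and the pointer
is written after threads may already exist.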

My expectation was that you would just use __hwcap, whereby it would
be a hidden global access no more expensive than accessing a function
pointer for indirect branch, and likely cheaper to do local direct
branches based on testing bits of it, something like:

	if (__hwcap & WHATEVER) {
		__asm__(...);
		return ...;
	} else return __soft_fma(...);

However, it looks like x86_64 lacks usable __hwcap.
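
Spelled out a bit more, and only as a sketch: cpu_caps and CAP_FMA3
below are made-up stand-ins for a capability word that x86_64 does not
currently provide, and soft_fma is a placeholder for the generic fma.c
built under another name (it is not correctly rounded). The asm
statement is the FMA3 single-instruction form.

```c
/* Hypothetical names; musl has no such variable or bit on x86_64. */
#define CAP_FMA3 (1UL << 0)
static unsigned long cpu_caps;	/* would be filled in once, before use */

/* Placeholder for the generic fma.c built under another name.
 * NOT correctly rounded; it only keeps the sketch self-contained. */
static double soft_fma(double x, double y, double z)
{
	return x * y + z;
}

double fma_switch(double x, double y, double z)
{
#if defined(__x86_64__) && defined(__GNUC__)
	if (cpu_caps & CAP_FMA3) {
		/* x = x*y + z with a single rounding (FMA3 encoding) */
		__asm__ ("vfmadd132sd %1, %2, %0"
			: "+x"(x) : "x"(y), "x"(z));
		return x;
	}
#endif
	return soft_fma(x, y, z);
}
```

With cpu_caps left at zero, every call takes the soft path, so the
hardware instruction is only reached once something has set the bit.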

> With a bit of preprocessor magic, I can also ensure that if __FMA__ or
> __FMA4__ are set, the dispatcher is not included, and only the given
> function is called. Although that may incur a warning of an unused
> static function. I suppose that is a problem that can be fixed with more
> preprocessor magic.

We already do that.

> From my preliminary research, the fma3 and fma4 ISA extensions require
> no kernel support, so this will be the first time a CPUID call is
> needed. fma3 support is signalled with bit 12 of ECX in CPUID function
> 1. fma4 support is signalled with bit 16 of ECX in CPUID function
> 0x80000001 - on AMD CPUs. Intel has the bit reserved, so to be extra
> safe, the CPU vendor ought to be checked, too.
> 
> Doing the same for i386 requires also verifying kernel SSE support in
> hwcap (that implies CPUID support in the CPU, since the baseline 80486
> does not necessarily have that, but all CPUs with SSE have it) and also
> support for extended CPUID in case of fma4.

That seems a lot less likely to be worthwhile since it involves
shuffling data back and forth between x87 and sse registers, but maybe
it's still a big enough win to want to do it?

The logic for how you determine availability seems right.
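
As a rough illustration of those checks, here is a sketch using the
__get_cpuid helper from the compiler's <cpuid.h> (GCC and clang). The
function names are hypothetical. One assumption goes beyond what the
thread states: Intel's documented detection sequence for FMA also
checks OSXSAVE and XCR0, because VEX-encoded instructions fault unless
the OS has enabled XMM and YMM state, so the sketch includes that test.

```c
#include <string.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>

/* Assumption beyond the thread: require OS-enabled XMM and YMM state. */
static int avx_state_enabled(unsigned leaf1_ecx)
{
	unsigned lo, hi;
	if (!(leaf1_ecx & (1u << 27)))		/* OSXSAVE */
		return 0;
	__asm__ volatile ("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
	(void)hi;
	return (lo & 6) == 6;			/* XCR0 bits 1 and 2 */
}

/* Vendor string is returned in EBX, EDX, ECX order by leaf 0. */
static int cpu_is_amd(void)
{
	unsigned a, b, c, d;
	char vendor[12];
	if (!__get_cpuid(0, &a, &b, &c, &d))
		return 0;
	memcpy(vendor, &b, 4);
	memcpy(vendor + 4, &d, 4);
	memcpy(vendor + 8, &c, 4);
	return memcmp(vendor, "AuthenticAMD", 12) == 0;
}

/* fma3: CPUID function 1, ECX bit 12 */
int cpu_has_fma3(void)
{
	unsigned a, b, c, d;
	if (!__get_cpuid(1, &a, &b, &c, &d))
		return 0;
	return ((c >> 12) & 1) && avx_state_enabled(c);
}

/* fma4: CPUID function 0x80000001, ECX bit 16, AMD only */
int cpu_has_fma4(void)
{
	unsigned a, b, c, d, c1;
	if (!cpu_is_amd())
		return 0;
	if (!__get_cpuid(1, &a, &b, &c1, &d))
		return 0;
	if (!__get_cpuid(0x80000001, &a, &b, &c, &d))
		return 0;
	return ((c >> 16) & 1) && avx_state_enabled(c1);
}
#else
/* Other targets: report no support. */
int cpu_has_fma3(void) { return 0; }
int cpu_has_fma4(void) { return 0; }
#endif
```

__get_cpuid returns 0 when the requested leaf is above the maximum the
CPU reports, which covers the extended-CPUID check mentioned for fma4.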

> Since the CPUID challenges would be shared between fma and fmaf, I would
> like to put them into a new header file in src/include (maybe create
> src/include/x86_64? Or should it be added to arch/x86_64?)
> 
> So what are your thoughts on this?

I'm somewhat skeptical of what value there is to doing this
particularly for fma. There're probably a lot more places we don't do
any runtime-conditional optimized code that have higher returns
(memcpy etc. being the most obvious) and it seems likely that programs
that care about fma performance are themselves compiled with the right
ISA levels and using the compiler builtin, never calling the function
at all.

Regardless, I wonder if we should have the x86_64 startup code store
cpuid result somewhere we can use, so that we don't have to do nasty
atomic stuff determining it late, and could just branch on the bits of
a runtime-constant like we would with __hwcap. This would set the
stage for being able to do more-impactful things like mem* too.
(That's a big project that involves designing the system for how archs
define the large-block-ops the generic C functions would use, so not
immediately applicable, but useful to be working towards it.)
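
To make the idea concrete, a minimal sketch of what such a stored
result could look like. The layout and every name here are invented
for illustration; nothing like this exists in musl today. The init
function takes the raw CPUID words as arguments so the sketch stays
independent of how startup code would obtain them.

```c
/* Hypothetical layout and names, for illustration only. */
enum { FEAT_LEAF1_ECX, FEAT_EXT1_ECX, FEAT_WORDS };
static unsigned cpu_feat[FEAT_WORDS];

/* Would be called once from early startup code, before any threads
 * exist, so plain stores suffice and readers need no atomics. */
static void cpu_feat_init(unsigned leaf1_ecx, unsigned ext1_ecx)
{
	cpu_feat[FEAT_LEAF1_ECX] = leaf1_ecx;
	cpu_feat[FEAT_EXT1_ECX] = ext1_ecx;
}

/* Callers branch on bits of what is by then a runtime constant. */
#define CPU_HAS(word, bit) ((cpu_feat[word] >> (bit)) & 1u)
#define HAS_FMA3() CPU_HAS(FEAT_LEAF1_ECX, 12)
#define HAS_FMA4() CPU_HAS(FEAT_EXT1_ECX, 16)
```

Startup code would presumably clear bits that are unusable (wrong
vendor, OS state not enabled) before storing, so that callers need
only a single bit test.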

Rich