musl - Re: roundf() (and round(), and ...)



Message-ID: <19361e3c-8d13-af73-7896-bc4665e9788f@esi.com.au>
Date: Tue, 25 Jun 2024 13:41:45 +1000 (AEST)
From: Damian McGuckin <damianm@....com.au>
To: MUSL <musl@...ts.openwall.com>
Subject: Re: roundf() (and round(), and ...)

On Sun, 23 Jun 2024, Rich Felker wrote:

> I think this is not a good tradeoff. It seems unlikely anyone would be
> using round in a performance-critical context with numbers that are in
> a range where an inline compare would be far better, and the costs are
> increased code size, decreased performance in useful cases, and
> exacerbating timing dependency on data (which is generally a negative
> thing).

I was actually playing devil's advocate. I do not like the tradeoff either 
but I was looking for some words of wisdom (as per yours above) to justify
it.

Anyway... I am not convinced of the benefits of changing the existing 
rounding routines in MUSL to reflect any modification I might have made 
to the internal algorithms. And before any changes, I think there are 
other questions which need to be answered. So I might park it for now.

Here is a summary of the work for those who are interested. The changes
reduce the number of lines of GCC-11 generated assembler code by 33%. I am
sure that others have thought of (and used) the same incremental changes
long ago but never published them. Only the C 'float' routines are in
good shape. The project itself was actually done in another language. My
modifications assume that a call like 'fabsf' is inlined to assembler,
which I hope we can assume is the rule these days.
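
To illustrate the kind of routine that leans on an inlined 'fabsf', here
is a minimal sketch. It is not the revised roundf() summarised in this
message (that code was not posted), and the name sketch_roundf is
hypothetical. It assumes IEEE-754 binary32, FLT_EVAL_METHOD == 0 (no
excess precision) and a compiler that is not reordering floating point
arithmetic (no -ffast-math). The well-known add-and-subtract 2^23 trick
rounds to an integer in the current rounding mode, and the result is then
nudged to round halfway cases away from zero.

```c
#include <float.h>
#include <math.h>

/* Hypothetical illustration only; not the routine measured in this thread. */
static float sketch_roundf(float x)
{
    const float toint = 1.0f / FLT_EPSILON;   /* 2^23 */
    float a = fabsf(x);                       /* expected to inline to a sign-bit mask */
    float r, d;

    /* |x| >= 2^23 is already an integer; NaN and infinity also land here. */
    if (!(a < toint))
        return x;

    /* Adding and subtracting 2^23 rounds a to an integer in the current mode. */
    r = (a + toint) - toint;
    d = r - a;

    /* Correct the cases where the mode rounded the wrong way for round(). */
    if (d <= -0.5f)
        r += 1.0f;
    else if (d > 0.5f)
        r -= 1.0f;

    /* Restore the sign, which also keeps -0.0 and small negatives as -0.0. */
    return copysignf(r, x);
}
```

For example, sketch_roundf(2.5f) gives 3.0f and sketch_roundf(-2.5f) gives
-3.0f, whereas rintf() in the default rounding mode would give 2.0f and
-2.0f.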

This is the code size in GCC11 assembler instructions for C 'float' 
routines:

 	rintf....	32/16 - metric was 5% faster
 	truncf...	19/11 - metric was 10% faster
 	roundf...	47/30 - no speed difference
 	ceilf....	38/29 - no speed difference
 	floorf...	38/29 - no speed difference

This line count does not include the startproc and endproc lines, but
does include any line with just a label. The first number is that of
the MUSL 1.2.5 routine, and the second is that of the revised routine.
Overall, 174 lines down to 115 lines or a >33% reduction.

The performance metric was the computation and comparison of

 	GLibc routine + MUSL routine
or
 	Glibc routine + Revised routine

The GLIBC routine is there for comparison.  Where this showed the
revised algorithm was faster by 5% or more, I have noted it. Note
that no revised algorithm is slower. The CPU was a Xeon E5-2650v4.

A 'roundeven' routine exists and it is 44 lines of assembler long.

CLANG's line count is slightly longer.

I can recode the revised roundf() to be >5% faster than MUSL's roundf() 
but then I run the risk that I have a timing dependency on data, in 
this case, the set of all the 32-bit floating point numbers.

- Damian
