Message-ID: <1F3569BD7D6E45889B7518DC9BE5004B@H270>
Date: Sun, 15 Aug 2021 17:19:05 +0200
From: "Stefan Kanthak" <stefan.kanthak@...go.de>
To: "Szabolcs Nagy" <nsz@...t70.net>
Cc: <musl@...ts.openwall.com>
Subject: Re: [PATCH #2] Properly simplified nextafter()

Szabolcs Nagy <nsz@...t70.net> wrote:

> * Stefan Kanthak <stefan.kanthak@...go.de> [2021-08-15 09:04:55 +0200]:
>> Szabolcs Nagy <nsz@...t70.net> wrote:
>>> you should benchmark, but the second best is to look
>>> at the longest dependency chain in the hot path and
>>> add up the instruction latencies.
>> 
>> 1 billion calls to nextafter(), with random from, and to either 0 or +INF:
>> run 1 against glibc,                         8.58 ns/call
>> run 2 against musl original,                 3.59
>> run 3 against musl patched,                  0.52
>> run 4 the pure floating-point variant from   0.72
>>       my initial post in this thread,
>> run 5 the assembly variant I posted.         0.28 ns/call
>
> thanks for the numbers. it's not the best measurement

IF YOU DON'T LIKE IT, PERFORM YOUR OWN MEASUREMENT!
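
For anyone who wants to repeat such a measurement, a minimal harness
could look like the sketch below. It is NOT the program that produced
the numbers quoted above; the names binary_fn, xorshift64, baseline and
bench_ns_per_call are hypothetical, the PRNG is an arbitrary choice, and
"random from" is assumed here to mean random 64-bit patterns (which
include NaNs and subnormals). Calling through a function pointer also
prevents inlining, so absolute numbers will differ.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Signature shared by nextafter() and the baseline below. */
typedef double (*binary_fn)(double, double);

/* xorshift64: a small PRNG, chosen for this sketch only. */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Does no work; measures the overhead of the loop and the call. */
static double baseline(double from, double to)
{
    (void)to;
    return from;
}

/* Calls fn(from, to) `calls` times, with `from` a random bit pattern
   and `to` alternating between 0 and +INF. Returns nanoseconds of CPU
   time per call. The XOR of all result bits is stored in *sink so the
   compiler cannot drop the calls. */
static double bench_ns_per_call(binary_fn fn, uint64_t calls, uint64_t *sink)
{
    uint64_t state = 0x9E3779B97F4A7C15ULL; /* any nonzero seed */
    uint64_t acc = 0;

    assert(fn != NULL && sink != NULL && calls > 0);

    clock_t start = clock();
    for (uint64_t i = 0; i < calls; i++) {
        uint64_t bits = xorshift64(&state);
        double from, result;
        memcpy(&from, &bits, sizeof from);
        result = fn(from, (i & 1) ? INFINITY : 0.0);
        memcpy(&bits, &result, sizeof bits);
        acc ^= bits;
    }
    clock_t stop = clock();

    *sink = acc;
    return (double)(stop - start) * 1e9 / CLOCKS_PER_SEC / (double)calls;
}
```

Passing nextafter as fn (many systems need -lm for that) and
subtracting the result for baseline gives a rough per-call cost of the
function itself.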

> but shows some interesting effects.

It clearly shows that musl's current implementation SUCKS, at least
on AMD64.

>> 
>> Now hurry up and patch your slowmotion code!
>> 
>> Stefan
>> 
>> PS: I cheated a very tiny little bit: the isnan() macro of musl patched is
>> 
>> #ifdef PATCH
>> #define isnan(x) ( \
>> sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) << 1) > 0xff000000U : \
>> sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) << 1) > 0xffe0000000000000ULL : \
>> __fpclassifyl(x) == FP_NAN)
>> #else
>> #define isnan(x) ( \
>> sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) & 0x7fffffff) > 0x7f800000 : \
>> sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) & -1ULL>>1) > 0x7ffULL<<52 : \
>> __fpclassifyl(x) == FP_NAN)
>> #endif // PATCH
>
> i think on x86 this only changes an and to an add
> (or nothing at all if the compiler is smart)

BETTER THINK TWICE: where does the mask needed for the and come from?
Does it need an extra register?
How do you (for example) build it on ARM?

> if this is measurable that's an uarch issue of your cpu.

ARGH: it's not the and that makes the difference!

JFTR: movabs $0x7ff0000000000000, %r*x is a 10 byte instruction
      I recommend reading Intel's and AMD's processor optimisation
      manuals and learning just a little bit!
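
The two isnan() formulations from the quoted macro can be checked
against each other with a small stand-alone sketch. The helper names
below (double_bits, float_bits, isnan_mask, isnan_shift, and so on) are
hypothetical stand-ins for musl's __DOUBLE_BITS and __FLOAT_BITS, not
musl's code. The mask variant uses two wide constants (the sign mask and
the +INF pattern), the shift variant only one; how many of them a
compiler actually materialises in registers depends on the compiler and
the target.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret a double as its 64-bit IEEE 754 bit pattern. */
static uint64_t double_bits(double x)
{
    uint64_t u;
    assert(sizeof u == sizeof x);
    memcpy(&u, &x, sizeof u);
    return u;
}

/* Reinterpret a float as its 32-bit IEEE 754 bit pattern. */
static uint32_t float_bits(float x)
{
    uint32_t u;
    assert(sizeof u == sizeof x);
    memcpy(&u, &x, sizeof u);
    return u;
}

/* Mask variant: clear the sign bit, compare with the +INF pattern. */
static int isnan_mask(double x)
{
    return (double_bits(x) & (-1ULL >> 1)) > (0x7ffULL << 52);
}

static int isnan_mask_f(float x)
{
    return (float_bits(x) & 0x7fffffffU) > 0x7f800000U;
}

/* Shift variant: shift the sign bit out, compare with the +INF
   pattern shifted left by one. */
static int isnan_shift(double x)
{
    return (double_bits(x) << 1) > 0xffe0000000000000ULL;
}

static int isnan_shift_f(float x)
{
    return (float_bits(x) << 1) > 0xff000000U;
}

/* Counts the samples for which the variants disagree with each other
   or with the standard isnan() macro. */
static int count_disagreements(void)
{
    const double samples[] = { 0.0, -0.0, 1.0, -1.5, 1e-310,
                               INFINITY, -INFINITY, NAN, -NAN };
    int bad = 0;

    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        double d = samples[i];
        float f = (float)d;
        int ref_d = isnan(d) != 0;
        int ref_f = isnan(f) != 0;

        if (isnan_mask(d) != ref_d || isnan_shift(d) != ref_d)
            bad++;
        if (isnan_mask_f(f) != ref_f || isnan_shift_f(f) != ref_f)
            bad++;
    }
    return bad;
}
```

Compiling isnan_mask and isnan_shift with optimisation and comparing
the generated assembly for the target of interest shows which constants
end up in the instruction stream.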

[braindead fullquote removed]

Stefan
