Message-ID: <1F3569BD7D6E45889B7518DC9BE5004B@H270>
Date: Sun, 15 Aug 2021 17:19:05 +0200
From: "Stefan Kanthak" <stefan.kanthak@...go.de>
To: "Szabolcs Nagy" <nsz@...t70.net>
Cc: <musl@...ts.openwall.com>
Subject: Re: [PATCH #2] Properly simplified nextafter()

Szabolcs Nagy <nsz@...t70.net> wrote:

> * Stefan Kanthak <stefan.kanthak@...go.de> [2021-08-15 09:04:55 +0200]:
>> Szabolcs Nagy <nsz@...t70.net> wrote:
>>> you should benchmark, but the second best is to look
>>> at the longest dependency chain in the hot path and
>>> add up the instruction latencies.
>>
>> 1 billion calls to nextafter(), with random from, and to either 0 or +INF:
>> run 1 against glibc,                                8.58 ns/call
>> run 2 against musl original,                        3.59 ns/call
>> run 3 against musl patched,                         0.52 ns/call
>> run 4 the pure floating-point variant from
>>       my initial post in this thread,               0.72 ns/call
>> run 5 the assembly variant I posted.                0.28 ns/call
>
> thanks for the numbers. it's not the best measurement

IF YOU DON'T LIKE IT, PERFORM YOUR OWN MEASUREMENT!

> but shows some interesting effects.

It clearly shows that musl's current implementation SUCKS, at least on AMD64.

>>
>> Now hurry up and patch your slow-motion code!
>>
>> Stefan
>>
>> PS: I cheated a very tiny little bit: the isnan() macro of musl patched is
>>
>> #ifdef PATCH
>> #define isnan(x) ( \
>>     sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) << 1) > 0xff000000U : \
>>     sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) << 1) > 0xffe0000000000000ULL : \
>>     __fpclassifyl(x) == FP_NAN)
>> #else
>> #define isnan(x) ( \
>>     sizeof(x) == sizeof(float) ? (__FLOAT_BITS(x) & 0x7fffffff) > 0x7f800000 : \
>>     sizeof(x) == sizeof(double) ? (__DOUBLE_BITS(x) & -1ULL>>1) > 0x7ffULL<<52 : \
>>     __fpclassifyl(x) == FP_NAN)
>> #endif // PATCH
>
> i think on x86 this only changes an and to an add
> (or nothing at all if the compiler is smart)

BETTER THINK TWICE: where does the mask needed for the AND come from?
Does it need an extra register? How would you build it on ARM, for example?
> if this is measurable that's an uarch issue of your cpu.

ARGH: it's not the AND that makes the difference!
JFTR: "movabs $0x7ff0000000000000, %r*x" is a 10-byte instruction.

I recommend reading Intel's and AMD's processor optimisation manuals and learning just a little bit!

[braindead fullquote removed]

Stefan