Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <alpine.LRH.2.02.2001181011130.11425@key0.esi.com.au>
Date: Sat, 18 Jan 2020 15:45:27 +1100 (AEDT)
From: Damian McGuckin <damianm@....com.au>
To: musl@...ts.openwall.com
Subject: Re: Considering x86-64 fenv.s to C

On Fri, 17 Jan 2020, Rich Felker wrote:

>> arm (bare)
>>
>> *	Empty
>
> I think you missed that there's fenv-hf.S for armhf. The default ABI
> does not have fpu/fenv.

Oops. Thanks for that.

>> *	The fldenv instruction to update the status registers has a serious
>> 	overhead which cannot be avoided in 'feraiseexcept'. No attempt is
>> 	made to optimize any unnecessary usage (as occurs in feclearexcept).
>
> One thing we could do in C is write feraiseexcept portably to raise
> the exceptions via performing appropriate floating point operations,
> rather than directly modifying status word. This would probably be
> faster on most if not all archs.

To confirm your use of 'probably' on (say) Ivy Bridge

Most places in MUSL use FORCE_EVAL for just that reason.  Just looking at 
the example within 'trunc', to force INEXACT there is an ADDSD and a MOVSD 
with memory accesses in both so 1 cycle each so 2 in total. Done inline, 
the STMXCSR and LDMXCSR are 3 cycles each so 6 in total. But in another 
example to force UNDERFLOW within 'nextafter' using FORCE_EVAL needs 4 
cycles total against the same 6. So a bit closer

Most of the time I prefer FORCE_EVAL but 'feraiseexcept' should probably 
be treated as a fall back as use the current method. My 2c is that we 
should retain 'modifying the register' concept of the 'feraiseexcept' 
routine.

.... Your call.

On a related note:

For the non-zero bits in the first 'setexceptflag' argument, doesn't this 
routine do the same thing as 'feraiseexcept' for those exceptions which 
correspond to those bits.

> Note that the high level C could avoid any action when the flags to be
> cleared are already clear.

Agreed. Nobody does this currently to keep the routine size small.

>> *	What is the best way to query '__hwcap' from inline __asm__ statement,
>
>> From *inline* asm you don't have to do it at all. You just write the
> branch in C. This is one of the reasons to prefer C.
>
>> 	specifically to verify if SSE instructions have to be supported

I have never done this in my life? On BSD there is a system call but that 
is a bit heavy handed.

> It's not to verify that they have to be supported, rather to verify
> that they *can* be used. If the bit is not set in hwcap, the
> instructions are not there and will fault if executed.

I understand this.

> C specifies it as:
>
>    "The feraiseexcept function returns zero if the excepts argument
>    is zero or if all the specified exceptions were successfully
>    raised. Otherwise, it returns a nonzero value."

Pretty vague. This is not why the M68K routines return (-1).

No routine currently checks that the exceptions were successfully raised. 
They assume that a write to the status register works. If we are going to 
check each instruction such as storing into a register works, we have a 
lot of work to do.

> It's not 100% clear to me that this is supposed to apply to invalid
> arguments rather than just some kind of failure to raise valid
> arguments, but I'd err on the side of assuming it does apply if we're
> overhauling this code in C. It's a minor issue though.

Most routines simply mask away anything invalid and then proceed.

>> powerpc
>>
>> *	All assembler
>>
>> *	I think that this architecture has more exception bits than IEEE 754
>> 	specifies. It has lots of specific cases of FE_INVALID. This needs
>> 	to be considered when dealing with FE_INVALID.
>
> I'm trying to remember -- does the hardware lack a unified FE_INVALID
> bit, so that you have to emulate it by aggregating the individual
> ones? I think that could be hidden inside a primitive that
> loads/stores the status word.

Not as far as I understand but others may wish to correct me. When they 
raise FE_INVALID, they concurrently raise a 'reason'. So if we are going 
to manually raise FE_INVALID, the mask MUST include a 'reason'. There is a 
catch-all 'miscellaneous' reason, what is call Software-Defined Condition, 
that is currently used. Other reasons are SNan, Infinity-Infinite, 
Infinity/Infinity, Zero/Zero, Infinite*Zero, Invalid Compare, Invalid 
Integer Convert and Invalid Square Root.

Reasons are a nice ideas but these 'explanations' have others chips have 
never picked up on the concept (which has been around a while).  Sadly, a 
lot of great ideas have never been commercial successes and humanity is 
the poorer for it. I digress. Sorry.

>> powerpc64

>> 	double to a union and then extract the data as a long.
>>
>> 		return (union {double f; long i;}) {get_fpscr_f()}.i;
>>
>> 	Is this style of coding universally accepted within MUSL? From my
>> 	reading of other routines, it is normally done as
>>
>> 		union {double f; long i;} f = { get_fpscr_f() };
>>
>> 		return f.i;
>>
>> 	Just curious.
>
> Yes, the compound literal form is preferred since it avoids a
> gratuitous named variable.

I would humbly suggest initially using the longer form, and then once the 
architecture of the routine is complete, we revert to the compound literal 
form where the code is simple enough to make it possible.

>> sh (SuperH??)
>>
>> *	In assembler
>>
>> *	I know zero about this assembler
>>
>> *	There is some pecularity about updating the environment. I have no
>> 	idea what is going on here. Anybody clear to elaborate?
>
> The comment about preserving precision bit is just incorrect as long
> as this is an external function. The function call ABI has precision
> set to double on function entry. If it's inline asm we might have to
> think about ensuring it's safe. I'm not sure how to constrain the
> precision on __asm__ statement entry but I assume there must be a way
> or you couldn't write inline asm using floating point operands.

Interesting. Way beyond my knowledge. Interesting chip.

>> x86_64
>>
>> *	In assembler
>>
>> *	Why does 'feclearexcept' disrespect the flags by clearing ALL x86 bits?
>
> As you said above, updating x87 status register is expensive because
> the only way to write it is to read-modify-write the whole fenv. But
> since we know on x86_64 we have sse registers, we can just move all
> the flags to the sse register, then use fnclex to clear the x87 one
> inexpensively, and the effective set of raised flags remains the same.

Thanks for the explanation. Neat.

> I think we could do this on i386 too with a couple tricks:
>
> 1. Do the same thing if sse is available (hwcap check).
>
> 2. If sse is not available, clear all flags then re-raise the desired
> set via arithmetic operations.

Simple. I like it. But more code.

Also, playing devil's advocate for a minute ....

Are we, or should we, be aiming to have

 	fetestexcept(int excepts)

and (even also)

 	feraiseexcept(int excepts)

being expanded inline so their use does not compromise optimization?  How 
feasible it this? How much will compilers get confused by the fact that 
there are side-effects of floating point operations that affect floating 
point status registers.

Regards - Damian

Pacific Engineering Systems International, 277-279 Broadway, Glebe NSW 2037
Ph:+61-2-8571-0847 .. Fx:+61-2-9692-9623 | unsolicited email not wanted here
Views & opinions here are mine and not those of any past or present employer

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.