Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <1587002854.f0slo0111r.astroid@bobo.none>
Date: Thu, 16 Apr 2020 12:24:16 +1000
From: Nicholas Piggin <npiggin@...il.com>
To: Rich Felker <dalias@...c.org>
Cc: libc-alpha@...rceware.org, libc-dev@...ts.llvm.org,
	linuxppc-dev@...ts.ozlabs.org, musl@...ts.openwall.com, Segher Boessenkool
	<segher@...nel.crashing.org>
Subject: Re: Powerpc Linux 'scv' system call ABI proposal take 2

Excerpts from Rich Felker's message of April 16, 2020 10:48 am:
> On Thu, Apr 16, 2020 at 10:16:54AM +1000, Nicholas Piggin wrote:
>> Excerpts from Rich Felker's message of April 16, 2020 8:55 am:
>> > On Thu, Apr 16, 2020 at 07:45:09AM +1000, Nicholas Piggin wrote:
>> >> I would like to enable Linux support for the powerpc 'scv' instruction,
>> >> as a faster system call instruction.
>> >> 
>> >> This requires two things to be defined: Firstly a way to advertise to 
>> >> userspace that kernel supports scv, and a way to allocate and advertise
>> >> support for individual scv vectors. Secondly, a calling convention ABI
>> >> for this new instruction.
>> >> 
>> >> Thanks to those who commented last time, since then I have removed my
>> >> answered questions and unpopular alternatives but you can find them
>> >> here
>> >> 
>> >> https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-January/203545.html
>> >> 
>> >> Let me try one more with a wider cc list, and then we'll get something
>> >> merged. Any questions or counter-opinions are welcome.
>> >> 
>> >> System Call Vectored (scv) ABI
>> >> ==============================
>> >> 
>> >> The scv instruction is introduced with POWER9 / ISA3, it comes with an
>> >> rfscv counter-part. The benefit of these instructions is performance
>> >> (trading slower SRR0/1 with faster LR/CTR registers, and entering the
>> >> kernel with MSR[EE] and MSR[RI] left enabled, which can reduce MSR 
>> >> updates. The scv instruction has 128 interrupt entry points (not enough 
>> >> to cover the Linux system call space).
>> >> 
>> >> The proposal is to assign scv numbers very conservatively and allocate 
>> >> them as individual HWCAP features as we add support for more. The zero 
>> >> vector ('scv 0') will be used for normal system calls, equivalent to 'sc'.
>> >> 
>> >> Advertisement
>> >> 
>> >> Linux has not enabled FSCR[SCV] yet, so the instruction will cause a
>> >> SIGILL in current environments. Linux has defined a HWCAP2 bit 
>> >> PPC_FEATURE2_SCV for SCV support, but does not set it.
>> >> 
>> >> When scv instruction support and the scv 0 vector for system calls are 
>> >> added, PPC_FEATURE2_SCV will indicate support for these. Other vectors 
>> >> should not be used without future HWCAP bits indicating support, which is
>> >> how we will allocate them. (Should unallocated ones generate SIGILL, or
>> >> return -ENOSYS in r3?)
>> >> 
>> >> Calling convention
>> >> 
>> >> The proposal is for scv 0 to provide the standard Linux system call ABI 
>> >> with the following differences from sc convention[1]:
>> >> 
>> >> - LR is to be volatile across scv calls. This is necessary because the 
>> >>   scv instruction clobbers LR. From previous discussion, this should be 
>> >>   possible to deal with in GCC clobbers and CFI.
>> >> 
>> >> - CR1 and CR5-CR7 are volatile. This matches the C ABI and would allow the
>> >>   kernel system call exit to avoid restoring the CR register (although 
>> >>   we probably still would anyway to avoid information leak).
>> >> 
>> >> - Error handling: I think the consensus has been to move to using negative
>> >>   return value in r3 rather than CR0[SO]=1 to indicate error, which matches
>> >>   most other architectures and is closer to a function call.
>> >> 
>> >> The number of scratch registers (r9-r12) at kernel entry seems 
>> >> sufficient that we don't have any costly spilling, patch is here[2].  
>> >> 
>> >> [1] https://github.com/torvalds/linux/blob/master/Documentation/powerpc/syscall64-abi.rst
>> >> [2] https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-February/204840..html
>> > 
>> > My preference would be that it work just like the i386 AT_SYSINFO
>> > where you just replace "int $128" with "call *%%gs:16" and the kernel
>> > provides a stub in the vdso that performs either scv or the old
>> > mechanism with the same calling convention. Then if the kernel doesn't
>> > provide it (because the kernel is too old) libc would have to provide
>> > its own stub that uses the legacy method and matches the calling
>> > convention of the one the kernel is expected to provide.
>> 
>> I'm not sure if that's necessary. That's done on x86-32 because they
>> select different sequences to use based on the CPU running and if the host
>> kernel is 32 or 64 bit. Sure they could in theory have a bunch of HWCAP
>> bits and select the right sequence in libc as well I suppose.
> 
> It's not just a HWCAP. It's a contract between the kernel and
> userspace to support a particular calling convention that's not
> exposed except as the public entry point the kernel exports via
> AT_SYSINFO.

Right.

>> > Note that any libc that actually makes use of the new functionality is
>> > not going to be able to make clobbers conditional on support for it;
>> > branching around different clobbers is going to defeat any gains vs
>> > always just treating anything clobbered by either method as clobbered.
>> 
>> Well it would have to test HWCAP and patch in or branch to two 
>> completely different sequences including register save/restores yes.
>> You could have the same asm and matching clobbers to put the sequence
>> inline and then you could patch the one sc/scv instruction I suppose.
>> 
>> A bit of logic to select between them doesn't defeat gains though,
>> it's about 90 cycle improvement which is a handful of branch mispredicts 
>> so it really is an improvement. Eventually userspace will stop 
>> supporting the old variant too.
> 
> Oh, I didn't mean it would neutralize the benefit of svc. Rather, I
> meant it would be worse to do:
> 
> 	if (hwcap & X) {
> 		__asm__(... with some clobbers);
> 	} else {
> 		__asm__(... with different clobbers);
> 	}
> 
> instead of just
> 
> 	__asm__("indirect call" ... with common clobbers);

Ah okay. Well that's debatable but if you didn't have an indirect call,
rather a runtime-patched sequence, then yes saving the LR clobber or
whatever wouldn't be worth a branch.

> where the indirect call is to an address ideally provided like on
> i386, or otherwise initialized to one of two or more code addresses in
> libc based on hwcap bits.

Right, I'm just skeptical we need the indirect call or need to provide
it in the vdso. The "clever" reason to add it on x86-32 was because of
the bugs and different combinations needed, that doesn't really apply
to scv 0 and was not necessarily a great choice.

> 
>> > Likewise, it's not useful to have different error return mechanisms
>> > because the caller just has to branch to support both (or the
>> > kernel-provided stub just has to emulate one for it; that could work
>> > if you really want to change the bad existing convention).
>> > 
>> > Thoughts?
>> 
>> The existing convention has to change somewhat because of the clobbers,
>> so I thought we could change the error return at the same time. I'm
>> open to not changing it and using CR0[SO], but others liked the idea.
>> Pro: it matches sc and vsyscall. Con: it's different from other common
>> archs. Performnce-wise it would really be a wash -- cost of conditional
>> branch is not the cmp but the mispredict.
> 
> If you do the branch on hwcap at each syscall, then you significantly
> increase code size of every syscall point, likely turning a bunch of
> trivial functions that didn't need stack frames into ones that do. You
> also potentially make them need a TOC pointer. Making them all just do
> an indirect call unconditionally (with pointer in TLS like i386?) is a
> lot more efficient in code size and at least as good for performance.

I disagree. Doing the long vdso indirect call *necessarily* requires
touching a new icache line, and even a new TLB entry. Indirect branches
also tend to be more costly and/or less accurate to predict than
direct even without spectre (generally fewer indirect predictor entries
than direct, far branches in particular require a lot of bits for 
target). And with spectre we're flushing the indirect predictors
on context switch or even disabling indirect prediction or flushing
across privilege domains in the same context.

And finally, the HWCAP test can eventually go away in future. A vdso
call can not.

If you really want to select with an indirect branch rather than
direct conditional, you can do that all within the library.

Thanks,
Nick

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.