Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJgzZopv1Pap2MHZ7-+3M8+4XE_RVsR8_kRa1=2fEcpXHtSBaw@mail.gmail.com>
Date: Wed, 14 Feb 2024 16:22:05 -0800
From: enh <enh@...gle.com>
To: Rich Felker <dalias@...c.org>, William Roberts <bill.c.roberts@...il.com>, enh <enh@...gle.com>, 
	musl@...ts.openwall.com
Subject: Re: PAC/BTI Support on aarch64

On Wed, Feb 14, 2024 at 4:03 PM Szabolcs Nagy <nsz@...t70.net> wrote:
>
> * Rich Felker <dalias@...c.org> [2024-02-12 21:08:34 -0500]:
> > On Mon, Feb 12, 2024 at 05:18:22PM -0600, William Roberts wrote:
> > > On Mon, Feb 12, 2024 at 5:05 PM enh <enh@...gle.com> wrote:
> > > > On Mon, Feb 12, 2024 at 2:46 PM Rich Felker <dalias@...c.org> wrote:
> > > > > On Mon, Feb 12, 2024 at 03:25:48PM -0600, William Roberts wrote:
> > > > > > It's a matter of building with -mbranch-protection=standard
> > > > > >
> > > > > > Just the ASM labels need the first instruction to be a BTI. They're in
> > > > > > the NOP space
> > > > > > so they are backwards compatible, older hardware will just NOP it.
>
> not quite that simple. sorry long brain dump follows:
>
> tl;dr: i think the main issues are asm handling (property notes
> and cfi for pac-ret), property note handling in ld.so, perf
> overhead and possible compat issues of pac-ret (not possible
> to disable per process) and testing (ensuring the code works
> when pac/bti is not nop).
>
> - asm code needs manual marking.
>
> GNU_PROPERTY note in asm is ugly and error prone, see
> https://github.com/ARM-software/optimized-routines/blob/master/string/aarch64/asmdefs.h#L23
>
> i.e. no equivalent to -Wa,--noexecstack we use for GNU_STACK
> (this was an oversight, llvm added an option, binutils gas has none).

what's the option? (since Android only supports llvm, that might be
worth considering as a slight cleanup for us...)

> - if asm code does indirect tailcall it should use x16 or x17.
>
> bti c is compatible with the indirect branch that way.
>
> - only functions that *maybe called indirectly* need bti c.
>
> this turned out to be trickier than expected: lld/llvm and bfd.ld/gcc
> currently disagree about the definition in case of linker inserted
> veneers: gcc assumes ld handles the case if an inserted veneer
> indirectly branches to a non-bti location, llvm emits bti in all
> functions (including local ones without their address taken), just
> in case ld inserts an indirect veneer.
> https://github.com/ARM-software/abi-aa/issues/196
>
> there is some difference to how PLTs are emitted between the
> linkers, but i don't think that causes compat issues (might cause
> trouble for tools that try to interpret the PLT).
>
> - dynamic linker has to figure out when to enable it.
>
> systemd MDWE (memory deny write exec) feature used to seccomp
> filter mprotect(PROT_EXEC) so even if the underlying mapping was
> already PROT_EXEC and it just added a PROT_BTI on top, mprotect
> would fail. this was fixed by adding an MDWE prctl to linux for
> systemd to make it stop using that filter, but there may be other
> similar seccomp filters and old kernels without MDWE so glibc
> re-mmaps the exec segment (which systemd happily accepts).
>
> note: the bit that tells if a load segment needs to be mapped as
> PROT_BTI is in the load segment (usually program headers are in
> the executable segment) so first mmap cannot get it right, unless
> a quick read of the prog headers are done before mmap but that has
> a lot of failure modes (the size can be unbounded) and does not
> gain much compared to just mmap twice. except when the exe is mapped
> by the kernel, then we cannot mmap since there is no fd and mprotect
> may fail).
> MDWE prctl: https://lwn.net/Articles/937315/
>
> another detail is that static-exe / ld.so / vdso marking is handled
> by the kernel, while other dsos are PROT_BTI marked by ld.so. an
> interesting case is dynamic linked exe which was originally handled
> by ld.so, but after the MDWE fiasco the kernel started loading it
> with PROT_BTI (i.e. now all binaries mapped by the kernel are BTI
> protected by the kernel).
>
> ld.so should also take care to gracefully handle BTI protection
> failure (or invalid notes) in dlopen. (although invalid note is
> more of an x86 thing.)
>
> - special functions may need to return indirectly instead of ret
>
> the only really nasty one is swapcontext which is not supported by
> musl so that's good (and it is only a problem if bti is used with
> a shadow-stack-like feature).
>
> for returns_twice functions (like setjmp) the compiler emits bti j
> at the call site so the second return can use an indirect branch.
> return from unwinder to an exception handling landing pad via
> indirect branch is supported too. otherwise return via indirect
> branch is not supported (no bti j at call sites).
>
> - a64fx (hpc core) implemented hint nops in a slow way
>
> so glibc only adds bti if glibc is configured for bti.
> https://sourceware.org/pipermail/libc-alpha/2021-May/125784.html
>
> > > > > I think it's a little more elaborate than that. Those asm instructions
> > > > > need to be added (probably as .instr or .word or something, unless
> > > > > there's a way to spell this particular nop that existing tooling will
> > > > > understand).
> > >
> > > You just use the hint <immediate> instructions, they are understood by old
> > > toolchains. But you can only support a subset of the BTI/PAC instructions
>
> yes 'hint <imm>' is armv8.0-a, part of the base isa.
>
> i think the non-hint pac instructions are not relevant to musl.
>
> > > but it's been enough for most projects that follow the normal ABI conventions
> > > like OpenSSL/BoringSSL,etc, but not enough for libffi for example.
>
> as far as i know openssl is the reason android does not enable pac:
> they added the hint instructions incorrectly at first so there are
> binaries that fail if the hw enables pac.

that was one reason why "android does not enable pac" _by default in
the android target triples for app developers_, yes --- though i think
we're at the point where we think we should flip that default (not
least because the number of users whose devices would actually
_benefit_ from the extra instructions is a lot larger now!):
https://github.com/android/ndk/issues/1914

> this reveals another issue with pac/bti: since they are nops on most
> existing hw, they are not properly tested (e.g. by distro QA) so a
> binary can look ok until you move it to a newer machine. (but recent
> amazon graviton 4 has pac/bti so we may get more coverage soon.)

exactly --- that was one of our key concerns, that app developers
would _think_ they've tested with pac/bti but unknowingly used a
device without. (or even an x86-64 emulator!)

> > > > > I also wondered if [sig]setjmp/longjmp would be affected, but probably
> > > > > not.
> > > >
> > > > bionic does use PAC, but i think glibc has its own "pointer mangling" thing?
> > >
> > > You need it, as the first instruction from a branch (where longjmp returns to)
> > > needs to be a BTI instruction.
> >
> > Is that different from a normal function return?
> >
> > Note that in the case of sigsetjmp, (sig)longjmp returns to a point
> > inside the sigsetjmp asm, so that point needs the annotation I think.
>
> setjmp/sigsetjmp has to decide how to protect the longjmp return.
>
> with a shadow-stack-like bw-edge-cfi, longjmp cannot return with
> ret (first ret from setjmp consumes the return address from the
> shadow stack), it must use indirect branch. (there is now gcs
> which is aarch64 shadow stack, linux support is in progress).
>
> since jmpbuf does not expose the return address representation a
> libc specific protection can be applied (mangling or pac) and then
> longjmp can use ret and remain protected. (but e.g. setcontext
> exposes the pc/lr so those cannot be mangled in memory).
>
> compilers emit bti j at the call site of returns_twice functions
> to allow both ret and indirect branch. musl sigsetjmp can decide
> what it does.
>
> > > > > > It's been done for many projects, glibc and bionic have it. The
> > > > > > problem with BTI is that when one item in the link
> > > > > > list doesn't support BTI the loader/linker turns it off. So when it's
> > > > > > something like a libc that is fundamental in the link chain,
> > > > > > it turns it off for everything.
> > > > >
> > > > > This presumably requires some kind of machinery for how dynamic
> > > > > linking will work, and possibly turning it off if a library without it
> > > > > is dlopened?
> > > > >
> > > > > My understanding doing some brief searches though was that you can
> > > > > individually mprotect it off in certain regions. So maybe it's
> > > > > possible to just enable only for DSOs that support it?
> > > >
> > > > correct.
> >
> > OK, that's good to know. So which direction is it? Do DSOs that
> > support BTI need it explicitly turned on via mprotect/mmap flags? Or
> > is there some process-global flag to turn it on, and then ones that
> > don't support it need it turned off?
>
> per dso marking and explicit PROT_BTI setting.
>
> > I suspect it's possible to first enable BTI for third-party libraries
> > as a feature of the dynamic linker, and add BTI support for libc
> > itself as a separate thing. That might be a nice factoring to make
> > changes minimal and easy for ppl to read.
>
> for third-party, it is enough to fix crt*.o and handle markings
> in ld.so.
>
> (but of course for bti to be effective the libc must have it too,
> otherwise there are likely enough gadgets to do whatever)
>
> > The changes in dynlink.c should be as arch-agnostic as possible. If
> > there's a corresponding feature on other archs, it should use the same
> > basic code, with arch-specific headers (arch/$ARCH/reloc.h) defining
> > the mechanisms for evaluating if an ELF file is compatible, how to do
> > the mprotect, etc.
>
> generic code should work.
>
> but e.g. see above about who handles the marking (kernel vs ld.so),
> turns out x86 (and likely aarch64, riscv) shadow stack uses different
> rules: always libc handles the marking. so there are caveats.
>
> > > > > > The initial scope of code changes would be what's reported when
> > > > > > LDFLAGS=-Wl,-zforce-bti,--fatal-warnings
> > > > >
> > > > > Is there a way to disable these warnings so that every asm file does
> > > > > not need to be cluttered with annotations?
> > > >
> > > > well, that's the ELF note stuff i was talking about, and if you don't
> > > > have it you'll fall foul of the static linker saying "not all this
> > > > code is BTI-enabled, therefore this .so isn't", and the dynamic linker
> > > > doing nothing because the static linker effectively tells it not to.
> > >
> > > Yep, well said ENH. It's been since Android since we crossed paths :-).
> > >
> > > It's not that hard to annotate an asm file :-p I forget what project
> > > (I think it was gnutls, but they just use openssl's code for the asm)
> > > but I just put it in a header file and by virtue of #include'ing it you get the
> > > notes added.
> >
> > Yes, we generally don't do that. There are no "asm headers" in musl;
> > all asm files are self-contained and readable standalone. So if
> > there's no way to tell the assembler/linker from the command line that
> > files are BTI-compatible without generating a huge load of warning
> > spam, I guess it's a mess of copy-and-paste...
>
> currently there is only the ugly asm directives, see above.
>
> final notes:
>
> bti is fairly deployable (iirc x86 ibt failed because it is not per
> dso, so dlopen does not really work, but aarch64 does not have that
> problem), not strong security (the final binary is littered with
> bti j/c so plenty opportunity to misdirect an indirect jump/call),
> but at least it has minimal impact (minimal compatibility issues and
> on a modern core even when bti is enabled it should not be slower).
>
> for pac every function can independently decide if it uses pac-ret
> (aka return address signing), no need for per dso marking. however
> it has bigger compat as well as performance impact:
>
> there are custom unwinders, pac-ret uses a new dwarf cfi (to mark
> the code regions where the return address is signed), custom
> unwinders may not understand this and that's a runtime crash.
>
> some code looks at or modifies the return address (various hacks), such
> code needs to be updated or not use pac-ret in relevant functions, but
> such issues are hard to discover without hw.
>
> pac is a per-boot system-wide setting, not per-process, so if there is
> any issue or bug there is no way to disable it for one broken binary
> (nowadays there is a disable prctl, but it is documented to slow the
> system down, so not suitable for working around perf issues).
>
> on simple cores pac can be slow (can add latency to non-leaf functions)
>
> with 48bit va space, there is only 7bit pac, i.e. limited protection.

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.