musl - Re: Musl on Cortex-M Devices

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CALqyXLgnCgNOJSA-HtpfMnPNPDBvnKJQv=VDJaCBkageTx-x3A@mail.gmail.com>
Date: Mon, 21 Dec 2020 18:58:47 -0500
From: Jesse DeGuire <jesse.a.deguire@...il.com>
To: Rich Felker <dalias@...c.org>
Cc: musl@...ts.openwall.com
Subject: Re: Musl on Cortex-M Devices

On Fri, Dec 18, 2020 at 12:30 PM Rich Felker <dalias@...c.org> wrote:
>
> On Fri, Dec 18, 2020 at 02:15:20AM -0500, Jesse DeGuire wrote:
> > > > A number of errors were simply the result of
> > > > ARMv6-M requiring one to use the "S" variant of an instruction that
> > > > sets status flags (such as "ADDS" vs "ADD" or "MOVS" vs "MOV"). A few
> > > > files I had to change from a "lower case s" to a "capital-S" file so
> > > > that I could use macros to check for either the Thumb1 ISA
> > > > ("__thumb2__ || !__thumb__") or for an M-Profile device
> > > > ("!__ARM_ARCH_ISA_ARM").
> > >
> > > Is __ARM_ARCH_ISA_ARM universally available (even on old compilers)?
> > > If not this may need an alternate detection. But I'd like to drop as
> > > much as possible and just make the code compatible rather than having
> > > 2 versions of it. I don't think there are any places where the
> > > performance or size is at all relevant.
> >
> > I hadn't thought enough about that until your reply, so I used
> > Compiler Explorer to at least get a general idea of when that macro
> > was supported. Clang 3.5 supports it, but Clang 3.4.1 does not. GCC
> > 5.4 supports it, but 4.64 (the previous version available on Compiler
> > Explorer) does not. With that, I tried to come up with something to
> > detect M-Profile devices and have updated my changes with it.
> >
> >    #if __ARM_ARCH_6M__ || __ARM_ARCH_7M__ || __ARM_ARCH_7EM__ \
> >      || (__ARM_ARCH > 7 && !__ARM_ARCH_ISA_ARM)
>
> I think you can just use presence of __ARM_ARCH as a proxy for
> "checking __ARM_ARCH_ISA_ARM is meaningful" if you can verify that
> they were added at the same time. Then you don't need to special-case
> each M variant. Or am I missing something?

Nah, I was overly cautious I think. I did search through the commit
histories of GCC and LLVM to find that __ARM_ARCH and
__ARM_ARCH_PROFILE were added at the same time in both toolchains, but
__ARM_ARCH_ISA_ARM appears to have been added later in Clang.
Therefore, doing a check for "__ARM_ARCH_PROFILE == 'M'" should be
sufficient.

> > > > The changes under
> > > > "src/thread/arm/__set_thread_area.c" are different in that I made
> > > > those because I don't believe Cortex-M devices could handle what was
> > > > there (no M-Profile device has Coprocessor 15, for example) and so I
> > >
> > > Unless this is an ISA level that can't be executed on a normal (non-M)
> > > ARM profile, it still needs all the backends that might be needed and
> > > runtime selection of which to use. This is okay. I believe Linux for
> > > nommu ARM has a syscall for get_tp, which is rather awful but probably
> > > needs to be added as a backend. The right way to do this would have
> > > been with trap-and-emulate (of cp15) I think...
> >
> > Out of curiosity, what makes that syscall awful?
>
> Well it precludes just using the same inline code as long as the ISA
> level is >= a minimum; now levels supporting Cortex-M have to use the
> runtime selection. I'm not sure which performs better. I would think
> they'd be comparable.

I see, thanks. I've been doing most of my work with MIPS32r2-based
micros for the past few years and from what I've seen so far ARM has
much more variety to deal with, for better or worse.

> > > Further comments inline below:
> > >
> > > > diff --git a/arch/arm/atomic_arch.h b/arch/arm/atomic_arch.h
> > > > index 9e3937cc..21f102ad 100644
> > > > --- a/arch/arm/atomic_arch.h
> > > > +++ b/arch/arm/atomic_arch.h
> > > > @@ -64,11 +64,18 @@ static inline int a_cas(volatile int *p, int t, int s)
> > > >
> > > >  #ifndef a_barrier
> > > >  #define a_barrier a_barrier
> > > > +#if __ARM_ARCH_ISA_ARM
> > > >  static inline void a_barrier()
> > > >  {
> > > >       register uintptr_t ip __asm__("ip") = __a_barrier_ptr;
> > > >       __asm__ __volatile__( BLX " ip" : "+r"(ip) : : "memory", "cc", "lr" );
> > > >  }
> > > > +#else
> > > > +static inline void a_barrier()
> > > > +{
> > > > +     __asm__ __volatile__ ("dmb" : : : "memory");
> > > > +}
> > > > +#endif
> > > >  #endif
> > >
> > > There's a condition above with this code already; you just need to
> > > make it used for an additional case.
> >
> > Fixed. The original code was nested inside an ifdef that also would
> > have used LDREX and STREX--neither of which ARMv6-M supports--so I
> > moved that out and added the extra check.
>
> If it lacks LDREX and STREX how do you implement atomic? I don't see
> where you're adding any alternative, so is v6-m support
> non-functional? That rather defeats the purpose of doing anything to
> support it...

Correct, I haven't yet added an alternative. Arm's answer--and what we
generally do in the embedded world--is to disable interrupts using
"cpsid", do your thing, then re-enable interrupts with "cpsie". This
could be done with a new "__a_cas_v6m" variant that I'd add to
atomics.s. This still won't work for Linux because the "cps(ie|id)"
instruction is effectively a no-op if it is executed in an
unprivileged context (meaning you can't trap and emulate it). You'd be
looking at another system call if you really wanted v6-m Linux. That
said, this could let Musl work on v6-m in a bare metal or RTOS
environment, which I think Musl would be great for, and so I'd still
work on adding support for it. Also, not all v6-m devices support a
privilege model and run as though everything is privileged.
ARMv8-M.base is similar to v6-m with LDREX and STREX and so that could
have full support.

> > > > diff --git a/arch/arm/pthread_arch.h b/arch/arm/pthread_arch.h
> > > > index e689ea21..a89fb554 100644
> > > > --- a/arch/arm/pthread_arch.h
> > > > +++ b/arch/arm/pthread_arch.h
> > > > @@ -1,5 +1,5 @@
> > > > -#if ((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
> > > > - || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7
> > > > +#if __ARM_ARCH_ISA_ARM && (((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
> > > > + || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7)
> > >
> > > That's probably not the right condition.
> >
> > Can you provide more detail as I'm missing what you mean? The intent
> > was to prevent this check from succeeding for any M-Profile device so
> > that they fall back to the __a gettp_ptr solution. I did update this
> > check to what I described at the top of this reply.
>
> I think it's only the __ARM_ARCH >= 7 branch that needs any further
> condition. The rest of the branches are already specific to non-M.

I see, thanks.

> > > >  static inline pthread_t __pthread_self()
> > > >  {
> > > > diff --git a/configure b/configure
> > > > old mode 100755
> > > > new mode 100644
> > > > diff --git a/crt/arm/crtn.S b/crt/arm/crtn.S
> > > > index dc020f92..83e20da2 100644
> > > > --- a/crt/arm/crtn.S
> > > > +++ b/crt/arm/crtn.S
> > > > @@ -1,9 +1,17 @@
> > > >  .syntax unified
> > > >
> > > >  .section .init
> > > > +#if __ARM_ARCH_ISA_ARM
> > > >       pop {r0,lr}
> > > >       bx lr
> > > > +#else
> > > > +     pop {r0,pc}
> > > > +#endif
> > > >
> > > >  .section .fini
> > > > +#if __ARM_ARCH_ISA_ARM
> > > >       pop {r0,lr}
> > > >       bx lr
> > > > +#else
> > > > +     pop {r0,pc}
> > > > +#endif
> > >
> > > Ideally the linker could handle this.
> >
> > The issue here is that ARMv6-M (and I believe v8-M.base) cannot POP
> > into LR. Can this sequence replace LR with R1? My current version POPs
> > into R1 and moves that into LR because I wasn't sure if that was
> > important for the caller.
>
> I don't see any reason you couldn't just use a different
> call-clobbered register.

That's what I figured, but I wanted to make sure since I'm still
learning how all these pieces fit together. R1 it is, then.

> > > > diff --git a/src/ldso/arm/tlsdesc.S b/src/ldso/arm/tlsdesc.S
> > > > index 3ae133c9..1843d97d 100644
> > > > --- a/src/ldso/arm/tlsdesc.S
> > > > +++ b/src/ldso/arm/tlsdesc.S
> > > > @@ -12,13 +12,19 @@ __tlsdesc_static:
> > > >  .hidden __tlsdesc_dynamic
> > > >  .type __tlsdesc_dynamic,%function
> > > >  __tlsdesc_dynamic:
> > > > +#if defined(__thumb2__)  ||  !defined(__thumb__)
> > > >       push {r2,r3,ip,lr}
> > > > +#else
> > > > +     push {r2-r4,lr}
> > > > +     mov r2,ip
> > > > +     push {r2}
> > > > +#endif
> > >
> > > AFAICT ip is not special here. If it's a pain to push/pop, just use a
> > > different register.
> >
> > Gotcha. It looks like there isn't really a good reason for the
> > original to use IP, either, so I could change that and merge a couple
> > of those alternate code paths. If you don't mind a few extra
> > instructions, I think I can get rid of all of the (__thumb2__ ||
> > !__thumb__) checks in there.
>
> This is an extreme hot path so it should probably be optimized with
> preprocessor where it helps, but that doesn't mean it should have
> gratuitous PP branches just to pick which register to use when a
> common one that works everywhere can be chosen.

Perhaps using ip in the original code is to save pushing another
register to memory so I won't mess with it since this is a hot path.

Does this function need to push r2 and r3 and pop them at the end? My
understanding is that the Arm procedure call standard does not require
subroutines to preserve r0-r3 unless their contents are needed across
a function call that may clobber them. The r3 register isn't used here
until after the call to __a_gettp_ptr. The r2 register is used across
the call, but code does not make an attempt to preserve it before the
call or restore it after. __a_getttp_ptr shouldn't mess with it
anyway. The default version just returns a CP15 register and the
Cortex-M version will trigger a syscall, which is treated like other
interrupts and r0-r3+ip are automatically stacked and unstacked on
Cortex-M interrupts.

> > > >  .hidden __sigsetjmp_tail
> > > >       b __sigsetjmp_tail
> > > > diff --git a/src/string/arm/memcpy.S b/src/string/arm/memcpy.S
> > > > index 869e3448..c5d17533 100644
> > > > --- a/src/string/arm/memcpy.S
> > > > +++ b/src/string/arm/memcpy.S
> > > > @@ -43,6 +43,8 @@
> > > >   * building as thumb 2 and big-endian.
> > > >   */
> > > >
> > > > +#if defined(__thumb2__)  ||  !defined(__thumb__)
> > > > +
> > > >  .syntax unified
> > > >
> > > >  .global memcpy
> > > > @@ -477,3 +479,4 @@ copy_last_3_and_return:
> > > >       ldmfd   sp!, {r0, r4, lr}
> > > >       bx      lr
> > > >
> > > > +#endif // defined(__thumb2__)  ||  !defined(__thumb__)
> > >
> > > The logic to enable the C memcpy seems to be missing.
> >
> > I wasn't actually sure of the best way to do this. In my current
> > version, I sort of followed the lead of the stuff in src/math/arm and
> > made it a C file that will either conditionally have the ARM version
> > as inline assembly or include the generic C version. Is there a better
> > way? The diff didn't capture this because of the rename; should I
> > attach it separately?
>
> Don't move giant asm functions into a .c file. You can just add a new
> file that's conditionally empty. See how it was before commit
> 9dce93ac7f7a76978b70581c6f073f671b583347.

Will do. I'm not sure why I didn't think of that.

> > > > diff --git a/src/thread/arm/__set_thread_area.c b/src/thread/arm/__set_thread_area.c
> > > > index 09de65aa..cb5d757d 100644
> > > > --- a/src/thread/arm/__set_thread_area.c
> > > > +++ b/src/thread/arm/__set_thread_area.c
> > > > @@ -26,7 +26,11 @@ extern hidden uintptr_t __a_barrier_ptr, __a_cas_ptr, __a_gettp_ptr;
> > > >
> > > >  int __set_thread_area(void *p)
> > > >  {
> > > > -#if !__ARM_ARCH_7A__ && !__ARM_ARCH_7R__ && __ARM_ARCH < 7
> > > > +#if !__ARM_ARCH_ISA_ARM
> > > > +     __a_gettp_ptr = __a_gettp_kuser;
> > > > +     __a_cas_ptr = __a_cas_kuser;
> > > > +     __a_barrier_ptr = __a_barrier_kuser;
> > > > +#elif !__ARM_ARCH_7A__ && !__ARM_ARCH_7R__ && __ARM_ARCH < 7
> > > >       if (__hwcap & HWCAP_TLS) {
> > > >               size_t *aux;
> > > >               __a_cas_ptr = __a_cas_v7;
> > >
> > > HWCAP needs to be honored here. You also need to add the new logic to
> > > use the syscall for getting thread pointer, I think.
> >
> > I updated this to use the "get_tls" syscall (0xf0006) for M-Profile
> > devices. The barrier and cas pointers point to the default dummy
> > functions.
> >
> > I'm not entirely sure how I should handle HWCAP. What should I do if
> > TLS is or isn't set? If it isn't set, the original code uses those
> > 'kuser' addresses that you pointed out aren't usable and will bail out
> > if the version at the special address isn't correct. Right now, I have
> > mine set to call a_crash() if TLS is not set in HWCAP because that
> > seems to be the closest I can come to the current functionality.
>
> The current HWCAP_TLS handling is for *all* pre-v7 ISA levels. By
> convention its meaning is presence of the cp15 thread pointer, but
> presence of this also implies atomics can be used directly (and
> absence of this means the kernel must be providing kuser_helper, so we
> just use it for atomics too).
>
> With M profile support, though, AIUI it's possible that you have the
> atomics but not the thread pointer. You should not assume that lack of
> HWCAP_TLS implies lack of the atomics; rather you just have to assume
> the atomics are present, I think, or use some other means of detection
> with fallback to interrupt masking (assuming these devices have no
> kernel/user distinction that prevents you from masking interrupts).
> HWCAP_TLS should probably be probed just so you don't assume the
> syscall exists in case a system omits it and does trap-and-emulate for
> the instruction instead.

I think I'm starting to understand this, which is good because it's
looking like my startup code for the micros will need to properly set
HWCAP before Musl can be used. I assume I'll need to set that
'aux{"AT_PLATFORM"}' to "v6" or "v7" as well to make this runtime
detection work properly. I'll have to figure out if "v6m" and "v7m"
are supported values for the platform. I may have more questions in
the future as I try actually implementing something.

Like I mentioned above, I can make a routine to temporarily mask
interrupts that could be used for v6-m or as a fallback/default for
other M-Profile devices; however, this only works properly if the
application is running in a privileged mode like one might do on bare
metal or an RTOS (not all v6-m devices have an MPU and so not all have
a concept of privilege anyway). The original code looks like it needs
to handle the case of one running a Musl built for ARMv6 on something
newer like ARMv7-A (I assume that's why there's "v7" versions of
functions in atomics.s). I'll try to follow that as well.

> > > > diff --git a/src/thread/arm/__unmapself.s b/src/thread/arm/__unmapself.s
> > > > index 29c2d07b..fe0406e3 100644
> > > > --- a/src/thread/arm/__unmapself.s
> > > > +++ b/src/thread/arm/__unmapself.s
> > > > @@ -3,7 +3,7 @@
> > > >  .global __unmapself
> > > >  .type   __unmapself,%function
> > > >  __unmapself:
> > > > -     mov r7,#91
> > > > +     movs r7,#91
> > > >       svc 0
> > > > -     mov r7,#1
> > > > +     movs r7,#1
> > > >       svc 0
> > >
> > > OK, but this file can't be used on nommu anyway. Instead the generic C
> > > version has to be used.
> >
> > LIke memcpy above (and the stuffi in src/math/arm), I made this a C
> > file that will either have this code as inline assembly or will
> > include the generic C version. The diff didn't capture this, should I
> > attach this separately?
>
> You can see how it's done in src/thread/sh for nommu support. Although
> I'd actually prefer moving the asm to inline in a .c file. If you do
> this, use a real C function with inline asm (note: you can't break it
> into multiple asm blocks since there's no stack after the first
> syscall and thus C code can't run) rather than stuffing the asm
> function in a top-level __asm__ statement.

Gotcha.

> > > > diff --git a/src/thread/arm/atomics.S b/src/thread/arm/atomics.S
> > > > index da50508d..d49809d9 100644
> > > > --- a/src/thread/arm/atomics.S
> > > > +++ b/src/thread/arm/atomics.S
> > > > @@ -1,3 +1,5 @@
> > > > +#if __ARM_ARCH_ISA_ARM
> > > > +
> > > >  .syntax unified
> > > >  .text
> > > >
> > > > @@ -104,3 +106,5 @@ __a_cas_ptr:
> > > >  .hidden __a_gettp_ptr
> > > >  __a_gettp_ptr:
> > > >       .word __a_gettp_cp15
> > > > +
> > > > +#endif
> > >
> > > This file should not be conditional. If it can't be assembled in
> > > default mode it just needs ISA level override directives. Which
> > > functions are actually called is selected at runtime.
> >
> > I removed the condition so that everything in here is always built and
> > added directives as needed.
> >
> > I did run into an issue, though. Is there a way to restore the default
> > architecture after setting it using ".arch <arch>"? Something like
> > ".set push" and ".set pop" in MIPS assembly? I didn't realize what the
> > "scope" of those .arch directives were until it occurred to me that
> > one of the functions shouldn't have built for ARMv6-M. I rearranged
> > the functions so that the ones without .arch directives come first,
> > which did take care of the problem for now. Trying ".arch" by itself
> > gets me an error from Clang.
>
> Look in setjmp.s at the .fpu and .eabi_attribute usage. I'm not sure
> it's entirely the same but should be analogous. Basically you just
> want to override whatever ISA-level/ABI tags the assembler tries to
> put on the file.
>
> Rich

I get it now, thanks. I'll probably have a third attempt at a patch
after the holidays.

-Jesse
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.