Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20201218173051.GR534@brightrain.aerifal.cx>
Date: Fri, 18 Dec 2020 12:30:51 -0500
From: Rich Felker <dalias@...c.org>
To: Jesse DeGuire <jesse.a.deguire@...il.com>
Cc: musl@...ts.openwall.com
Subject: Re: Musl on Cortex-M Devices

On Fri, Dec 18, 2020 at 02:15:20AM -0500, Jesse DeGuire wrote:
> > > A number of errors were simply the result of
> > > ARMv6-M requiring one to use the "S" variant of an instruction that
> > > sets status flags (such as "ADDS" vs "ADD" or "MOVS" vs "MOV"). A few
> > > files I had to change from a "lower case s" to a "capital-S" file so
> > > that I could use macros to check for either the Thumb1 ISA
> > > ("__thumb2__ || !__thumb__") or for an M-Profile device
> > > ("!__ARM_ARCH_ISA_ARM").
> >
> > Is __ARM_ARCH_ISA_ARM universally available (even on old compilers)?
> > If not this may need an alternate detection. But I'd like to drop as
> > much as possible and just make the code compatible rather than having
> > 2 versions of it. I don't think there are any places where the
> > performance or size is at all relevant.
> 
> I hadn't thought enough about that until your reply, so I used
> Compiler Explorer to at least get a general idea of when that macro
> was supported. Clang 3.5 supports it, but Clang 3.4.1 does not. GCC
> 5.4 supports it, but 4.64 (the previous version available on Compiler
> Explorer) does not. With that, I tried to come up with something to
> detect M-Profile devices and have updated my changes with it.
> 
>    #if __ARM_ARCH_6M__ || __ARM_ARCH_7M__ || __ARM_ARCH_7EM__ \
>      || (__ARM_ARCH > 7 && !__ARM_ARCH_ISA_ARM)

I think you can just use presence of __ARM_ARCH as a proxy for
"checking __ARM_ARCH_ISA_ARM is meaningful" if you can verify that
they were added at the same time. Then you don't need to special-case
each M variant. Or am I missing something?

> > > The changes under
> > > "src/thread/arm/__set_thread_area.c" are different in that I made
> > > those because I don't believe Cortex-M devices could handle what was
> > > there (no M-Profile device has Coprocessor 15, for example) and so I
> >
> > Unless this is an ISA level that can't be executed on a normal (non-M)
> > ARM profile, it still needs all the backends that might be needed and
> > runtime selection of which to use. This is okay. I believe Linux for
> > nommu ARM has a syscall for get_tp, which is rather awful but probably
> > needs to be added as a backend. The right way to do this would have
> > been with trap-and-emulate (of cp15) I think...
> 
> Out of curiosity, what makes that syscall awful?

Well it precludes just using the same inline code as long as the ISA
level is >= a minimum; now levels supporting Cortex-M have to use the
runtime selection. I'm not sure which performs better. I would think
they'd be comparable.

> > Further comments inline below:
> >
> > > diff --git a/arch/arm/atomic_arch.h b/arch/arm/atomic_arch.h
> > > index 9e3937cc..21f102ad 100644
> > > --- a/arch/arm/atomic_arch.h
> > > +++ b/arch/arm/atomic_arch.h
> > > @@ -64,11 +64,18 @@ static inline int a_cas(volatile int *p, int t, int s)
> > >
> > >  #ifndef a_barrier
> > >  #define a_barrier a_barrier
> > > +#if __ARM_ARCH_ISA_ARM
> > >  static inline void a_barrier()
> > >  {
> > >       register uintptr_t ip __asm__("ip") = __a_barrier_ptr;
> > >       __asm__ __volatile__( BLX " ip" : "+r"(ip) : : "memory", "cc", "lr" );
> > >  }
> > > +#else
> > > +static inline void a_barrier()
> > > +{
> > > +     __asm__ __volatile__ ("dmb" : : : "memory");
> > > +}
> > > +#endif
> > >  #endif
> >
> > There's a condition above with this code already; you just need to
> > make it used for an additional case.
> 
> Fixed. The original code was nested inside an ifdef that also would
> have used LDREX and STREX--neither of which ARMv6-M supports--so I
> moved that out and added the extra check.

If it lacks LDREX and STREX how do you implement atomic? I don't see
where you're adding any alternative, so is v6-m support
non-functional? That rather defeats the purpose of doing anything to
support it...

> > > diff --git a/arch/arm/pthread_arch.h b/arch/arm/pthread_arch.h
> > > index e689ea21..a89fb554 100644
> > > --- a/arch/arm/pthread_arch.h
> > > +++ b/arch/arm/pthread_arch.h
> > > @@ -1,5 +1,5 @@
> > > -#if ((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
> > > - || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7
> > > +#if __ARM_ARCH_ISA_ARM && (((__ARM_ARCH_6K__ || __ARM_ARCH_6KZ__ || __ARM_ARCH_6ZK__) && !__thumb__) \
> > > + || __ARM_ARCH_7A__ || __ARM_ARCH_7R__ || __ARM_ARCH >= 7)
> >
> > That's probably not the right condition.
> 
> Can you provide more detail as I'm missing what you mean? The intent
> was to prevent this check from succeeding for any M-Profile device so
> that they fall back to the __a gettp_ptr solution. I did update this
> check to what I described at the top of this reply.

I think it's only the __ARM_ARCH >= 7 branch that needs any further
condition. The rest of the branches are already specific to non-M.

> > >  static inline pthread_t __pthread_self()
> > >  {
> > > diff --git a/configure b/configure
> > > old mode 100755
> > > new mode 100644
> > > diff --git a/crt/arm/crtn.S b/crt/arm/crtn.S
> > > index dc020f92..83e20da2 100644
> > > --- a/crt/arm/crtn.S
> > > +++ b/crt/arm/crtn.S
> > > @@ -1,9 +1,17 @@
> > >  .syntax unified
> > >
> > >  .section .init
> > > +#if __ARM_ARCH_ISA_ARM
> > >       pop {r0,lr}
> > >       bx lr
> > > +#else
> > > +     pop {r0,pc}
> > > +#endif
> > >
> > >  .section .fini
> > > +#if __ARM_ARCH_ISA_ARM
> > >       pop {r0,lr}
> > >       bx lr
> > > +#else
> > > +     pop {r0,pc}
> > > +#endif
> >
> > Ideally the linker could handle this.
> 
> The issue here is that ARMv6-M (and I believe v8-M.base) cannot POP
> into LR. Can this sequence replace LR with R1? My current version POPs
> into R1 and moves that into LR because I wasn't sure if that was
> important for the caller.

I don't see any reason you couldn't just use a different
call-clobbered register.

> > > diff --git a/src/ldso/arm/tlsdesc.S b/src/ldso/arm/tlsdesc.S
> > > index 3ae133c9..1843d97d 100644
> > > --- a/src/ldso/arm/tlsdesc.S
> > > +++ b/src/ldso/arm/tlsdesc.S
> > > @@ -12,13 +12,19 @@ __tlsdesc_static:
> > >  .hidden __tlsdesc_dynamic
> > >  .type __tlsdesc_dynamic,%function
> > >  __tlsdesc_dynamic:
> > > +#if defined(__thumb2__)  ||  !defined(__thumb__)
> > >       push {r2,r3,ip,lr}
> > > +#else
> > > +     push {r2-r4,lr}
> > > +     mov r2,ip
> > > +     push {r2}
> > > +#endif
> >
> > AFAICT ip is not special here. If it's a pain to push/pop, just use a
> > different register.
> 
> Gotcha. It looks like there isn't really a good reason for the
> original to use IP, either, so I could change that and merge a couple
> of those alternate code paths. If you don't mind a few extra
> instructions, I think I can get rid of all of the (__thumb2__ ||
> !__thumb__) checks in there.

This is an extreme hot path so it should probably be optimized with
preprocessor where it helps, but that doesn't mean it should have
gratuitous PP branches just to pick which register to use when a
common one that works everywhere can be chosen.

> > >  .hidden __sigsetjmp_tail
> > >       b __sigsetjmp_tail
> > > diff --git a/src/string/arm/memcpy.S b/src/string/arm/memcpy.S
> > > index 869e3448..c5d17533 100644
> > > --- a/src/string/arm/memcpy.S
> > > +++ b/src/string/arm/memcpy.S
> > > @@ -43,6 +43,8 @@
> > >   * building as thumb 2 and big-endian.
> > >   */
> > >
> > > +#if defined(__thumb2__)  ||  !defined(__thumb__)
> > > +
> > >  .syntax unified
> > >
> > >  .global memcpy
> > > @@ -477,3 +479,4 @@ copy_last_3_and_return:
> > >       ldmfd   sp!, {r0, r4, lr}
> > >       bx      lr
> > >
> > > +#endif // defined(__thumb2__)  ||  !defined(__thumb__)
> >
> > The logic to enable the C memcpy seems to be missing.
> 
> I wasn't actually sure of the best way to do this. In my current
> version, I sort of followed the lead of the stuff in src/math/arm and
> made it a C file that will either conditionally have the ARM version
> as inline assembly or include the generic C version. Is there a better
> way? The diff didn't capture this because of the rename; should I
> attach it separately?

Don't move giant asm functions into a .c file. You can just add a new
file that's conditionally empty. See how it was before commit
9dce93ac7f7a76978b70581c6f073f671b583347.

> > > diff --git a/src/thread/arm/__set_thread_area.c b/src/thread/arm/__set_thread_area.c
> > > index 09de65aa..cb5d757d 100644
> > > --- a/src/thread/arm/__set_thread_area.c
> > > +++ b/src/thread/arm/__set_thread_area.c
> > > @@ -26,7 +26,11 @@ extern hidden uintptr_t __a_barrier_ptr, __a_cas_ptr, __a_gettp_ptr;
> > >
> > >  int __set_thread_area(void *p)
> > >  {
> > > -#if !__ARM_ARCH_7A__ && !__ARM_ARCH_7R__ && __ARM_ARCH < 7
> > > +#if !__ARM_ARCH_ISA_ARM
> > > +     __a_gettp_ptr = __a_gettp_kuser;
> > > +     __a_cas_ptr = __a_cas_kuser;
> > > +     __a_barrier_ptr = __a_barrier_kuser;
> > > +#elif !__ARM_ARCH_7A__ && !__ARM_ARCH_7R__ && __ARM_ARCH < 7
> > >       if (__hwcap & HWCAP_TLS) {
> > >               size_t *aux;
> > >               __a_cas_ptr = __a_cas_v7;
> >
> > HWCAP needs to be honored here. You also need to add the new logic to
> > use the syscall for getting thread pointer, I think.
> 
> I updated this to use the "get_tls" syscall (0xf0006) for M-Profile
> devices. The barrier and cas pointers point to the default dummy
> functions.
> 
> I'm not entirely sure how I should handle HWCAP. What should I do if
> TLS is or isn't set? If it isn't set, the original code uses those
> 'kuser' addresses that you pointed out aren't usable and will bail out
> if the version at the special address isn't correct. Right now, I have
> mine set to call a_crash() if TLS is not set in HWCAP because that
> seems to be the closest I can come to the current functionality.

The current HWCAP_TLS handling is for *all* pre-v7 ISA levels. By
convention its meaning is presence of the cp15 thread pointer, but
presence of this also implies atomics can be used directly (and
absence of this means the kernel must be providing kuser_helper, so we
just use it for atomics too).

With M profile support, though, AIUI it's possible that you have the
atomics but not the thread pointer. You should not assume that lack of
HWCAP_TLS implies lack of the atomics; rather you just have to assume
the atomics are present, I think, or use some other means of detection
with fallback to interrupt masking (assuming these devices have no
kernel/user distinction that prevents you from masking interrupts).
HWCAP_TLS should probably be probed just so you don't assume the
syscall exists in case a system omits it and does trap-and-emulate for
the instruction instead.

> > > diff --git a/src/thread/arm/__unmapself.s b/src/thread/arm/__unmapself.s
> > > index 29c2d07b..fe0406e3 100644
> > > --- a/src/thread/arm/__unmapself.s
> > > +++ b/src/thread/arm/__unmapself.s
> > > @@ -3,7 +3,7 @@
> > >  .global __unmapself
> > >  .type   __unmapself,%function
> > >  __unmapself:
> > > -     mov r7,#91
> > > +     movs r7,#91
> > >       svc 0
> > > -     mov r7,#1
> > > +     movs r7,#1
> > >       svc 0
> >
> > OK, but this file can't be used on nommu anyway. Instead the generic C
> > version has to be used.
> 
> LIke memcpy above (and the stuffi in src/math/arm), I made this a C
> file that will either have this code as inline assembly or will
> include the generic C version. The diff didn't capture this, should I
> attach this separately?

You can see how it's done in src/thread/sh for nommu support. Although
I'd actually prefer moving the asm to inline in a .c file. If you do
this, use a real C function with inline asm (note: you can't break it
into multiple asm blocks since there's no stack after the first
syscall and thus C code can't run) rather than stuffing the asm
function in a top-level __asm__ statement.

> > > diff --git a/src/thread/arm/atomics.S b/src/thread/arm/atomics.S
> > > index da50508d..d49809d9 100644
> > > --- a/src/thread/arm/atomics.S
> > > +++ b/src/thread/arm/atomics.S
> > > @@ -1,3 +1,5 @@
> > > +#if __ARM_ARCH_ISA_ARM
> > > +
> > >  .syntax unified
> > >  .text
> > >
> > > @@ -104,3 +106,5 @@ __a_cas_ptr:
> > >  .hidden __a_gettp_ptr
> > >  __a_gettp_ptr:
> > >       .word __a_gettp_cp15
> > > +
> > > +#endif
> >
> > This file should not be conditional. If it can't be assembled in
> > default mode it just needs ISA level override directives. Which
> > functions are actually called is selected at runtime.
> 
> I removed the condition so that everything in here is always built and
> added directives as needed.
> 
> I did run into an issue, though. Is there a way to restore the default
> architecture after setting it using ".arch <arch>"? Something like
> ".set push" and ".set pop" in MIPS assembly? I didn't realize what the
> "scope" of those .arch directives were until it occurred to me that
> one of the functions shouldn't have built for ARMv6-M. I rearranged
> the functions so that the ones without .arch directives come first,
> which did take care of the problem for now. Trying ".arch" by itself
> gets me an error from Clang.

Look in setjmp.s at the .fpu and .eabi_attribute usage. I'm not sure
it's entirely the same but should be analogous. Basically you just
want to override whatever ISA-level/ABI tags the assembler tries to
put on the file.

Rich

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.