musl - Re: TLS issue on aarch64

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180601005200.GO1392@brightrain.aerifal.cx>
Date: Thu, 31 May 2018 20:52:00 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: TLS issue on aarch64

On Fri, Jun 01, 2018 at 02:11:02AM +0200, Szabolcs Nagy wrote:
> * Szabolcs Nagy <nsz@...t70.net> [2018-05-29 08:33:17 +0200]:
> > * Rich Felker <dalias@...c.org> [2018-05-28 18:15:21 -0400]:
> > > On Mon, May 28, 2018 at 10:47:31PM +0200, Szabolcs Nagy wrote:
> > > > another issue with the patch is that if tp is aligned then pthread_t
> > > > may not get aligned:
> > > > 
> > > > tp == self + sizeof(pthread_t) - reserved
> > > 
> > > This expression does not look correct; it would have tp point
> > > somewhere inside of the pthread structure rather than just past the
> > > end of it.
> > > 
> > 
> > this is the current tp setup on
> > aarch64, arm and sh4, see TP_ADJ
> > 
> > we can just make tp = self+sizeof(pthread_t)
> > but then there will be a small unused hole
> > 
> > > Maybe your code is doing this to avoid wasted padding, but if so I
> > > think that's a bad idea. It violates the invariant that the pthread
> > > structure is at a constant offset from tp, which is needed for
> > > efficient pthread_self and for access to fixed slots at the end of it
> > > (like canary or dtv copy).
> > > 
> > > > so sizeof(pthread_t) - reserved must be divisible with
> > > > gcd(alignment of tp, alignof(pthread_t)) to be able to make both
> > > > self and tp aligned.
> > > > 
> > > > this is not an issue on current targets with current pthread_t,
> > > > but we may want to decouple internal struct pthread alignment
> > > > details and the abi reserved tls size, i.e. tp_adj could be like
> > > > 
> > > > tp == alignup(self + sizeof(pthread_t) - reserved, alignof(pthread_t))
> > > > 
> > > > or we add a static assert that reserved and alignof(pthread_t)
> > > > are not conflicting.
> > > 
> > > Maybe I'm misunderstanding what "reserved" is, since you're talking
> > > about a static assert...?
> > > 
> > 
> > it is the abi reserved space after tp
> > (the bfd linker calls it TCB_SIZE)
> 
> did some digging into the bfd linker code, i dump here my understanding:
> 
> tls variant 1 has two sub variants: ppc (mips is the same) and
> aarch64 (arm, sh are the same just s/16/8/ for the reserved space)

This is very helpful.

> base: tls segment base address in the elf obj
> align: tls segment alignment requirement in the elf obj
> addr: address of a tls var in the tls segment of the elf obj
> tp: thread pointer at runtime
> ptr: address of tls var at runtime
> 
> local-exec (off: offset from tp in the code):
> 	code: ptr = tp + off
> 
> 	ppc: off = addr - (base + 0x7000)
> 	aarch64: off = addr - (base - alignup(16, align))
> 
> initial-exec (got,add: REL_TPOFF got entry and value/addend):
> [...]
> local-dynamic (non-tlsdesc, got,add: REL_DTPOFF got entry and value, off: offset in the code):
> [...]
> general-dynamic (non-tlsdesc, got,add: REL_DTPOFF got entry and value):
> [...]

None but local-exec are really linker ABI, except kinda in the case of
static linking if the linker fails to perform relaxations. A more
precise way to say this would be that only TLS which is accessible via
local-exec model has any constraints imposed on it by the linker as to
where it lies relative to TP. For initial-exec, the address relative
to TP must be constant across all threads, but it's not set by the
linker (except in the case of failed relaxation, in which case it's
just the local-exec model with the offset stored out-of-band). For
other models, there are no constraints at all; it's entirely up to the
dynamic linker implementation how it arranges for things to work.

> ptr has correct alignment if the tls segment is mapped to an aligned address,
> i.e. ptr - (addr - base) must be aligned, working backwards from this
> the requirement is
> 
> for local-exec to work:
> 	ppc: tp - 0x7000 must be aligned

This should be fine, since musl arranges for self+sizeof(struct
pthread) to be aligned, and that's exactly what tp-0x7000 is.

> 	aarch64: tp + alignup(16, align) must be aligned == tp must be aligned

OK, I see two possible solutions here:

1. tp==self+sizeof(struct pthread). In this case we'll waste some
space (vs the current approach) when no extra alignment is needed, but
it's simple and clean because all the alignments match up naturally.

2. tp==self+sizeof(struct pthread)-16 (or rather -reserved in
general). This preserves the current memory usage, but requires
complex new alignment logic since self will no longer be aligned mod
tls_align when tls_align>reserved.

I pretty strongly prefer option 1.

In either case, the main_tls.offset/app.tls.offset value needs to
correctly reflect the offset of the TLS from TP, so it either needs to
be alignup(reserved,tls_align) or alignup(reserved,tls_align)-reserved
depending on option 1 or 2. After that change is made, we need to make
sure the storage needs (libc.tls_size) are computed correctly and
account for the extra space due to the initial positive offset.

No change is then needed in __copy_tls.

Changes to TP_ADJ and __pthread_self are needed to get reserved out of
them, and the value of reserved needs to be provided somewhere else
for computing main_tls.offset.

> for initial-exec to work:
> 	tp + *got - add must be aligned (i.e. *got has to be set up to meet
> 	the alignment requirement of the module, this does not seem to require
> 	realignment of tp so even runtime loading of initial-exec tls should
> 	be possible assuming there is enough space etc...)

There's never space so it's not even a question, but even if there
were, no, it can't be done because tp will not be aligned mod some
possibly-larger alignment than the alignment in effect at the time the
thread was created.

> i can come up with a new patch, but it's not clear if on aarch64 we want
> to allow tp to point inside pthread_t or strictly point at the end (then
> the 16 reserved bytes are unused), note that in glibc the 16 byte after tp
> contains dtv and an unused ptr and it is expected that additional magic tls
> data the compiler needs to access (e.g. stackend for split stack support or
> stack canary) comes before tls (so the layout is pthread_t, some magic tls
> data, tp, 16 byte with dtv, tls), for us only the dtv is important to access
> with fixed offset from tp (even that's not necessary if we generate the load
> offset in the tlsdesc asm based on sizeof(pthread_t) etc)
> 
> i haven't looked into how gdb tries to handle tls (without threaddb)
> there might be further abi requirements for gdb to work well.

I don't think gdb is working reasonably at all right now for accessing
TLS on musl. It really should just be enhanced to inject calls to
__tls_get_addr and/or dlsym.

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.