Message-ID: <20190411150959.GW23599@brightrain.aerifal.cx>
Date: Thu, 11 Apr 2019 11:09:59 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Thread-local memory for thread structures

On Thu, Apr 11, 2019 at 12:12:46PM +0100, Raphael Cohn wrote:
> Dear List,
> 
> I'm playing around with allocating 100s of bytes of TLS memory for
> various purposes. Something I noticed in the code for creating the
> mmap'd memory for TLS is that it does not (quite reasonably) assign it
> a NUMA memory policy.
> 
> I'd like to assign a NUMA memory policy to the memory used for
> managing a thread. Is there anything 'underhanded' I can do to find
> out its location and size? I realize anything is likely to be brittle.
> Ideally, what I'd like is a way to 'set the NUMA memory policy of
> this thread's mmap'd management memory to the local NUMA node' [once
> I've scheduled it to run on a particular set of CPUs].
> 
> Any suggestions?

This is an interesting question.

First, keep in mind that the thread structure and all TLS must be
accessible from all threads of the process. These objects have
addresses which can be taken and passed around, and the thread
structure will be touched by other threads for things like
cancellation, linking/unlinking new/exiting threads from the thread
list, joining, etc. So whatever you do, it needs to preserve
accessibility and just tweak what's efficient. Ideally the scheduler
on a NUMA kernel would do this for you based on access patterns or
such.

Now, on to "how you'd do it": At first I thought pthread_getattr_np
would give you the info via stack size, but no, it's only the actual
stack, not the TLS or thread structure area. Normally these areas are
contiguous with the stack, unless you manually allocated a stack with
pthread_attr_setstack, in which case pthread_create will allocate
separate space for the thread structure and TLS unless they're under
both a certain absolute size and a certain percentage of the stack
size (basically, to guarantee that large TLS doesn't leave you with a
significantly smaller stack than you expected). So for now, if you're
not doing custom stacks, TLS and the thread structure will be in the
same mapping as the stack (you could find its extents via
/proc/self/maps or something).
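
For illustration, an untested sketch of that lookup (find_mapping and
show_thread_mapping are just names made up here, and error handling
is minimal):

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>

/* Hypothetical helper: find the /proc/self/maps entry containing p. */
static int find_mapping(void *p, unsigned long *lo, unsigned long *hi)
{
	FILE *f = fopen("/proc/self/maps", "r");
	char line[256];
	if (!f) return -1;
	while (fgets(line, sizeof line, f)) {
		unsigned long a, b;
		if (sscanf(line, "%lx-%lx", &a, &b) == 2
		    && a <= (unsigned long)p && (unsigned long)p < b) {
			*lo = a; *hi = b;
			fclose(f);
			return 0;
		}
	}
	fclose(f);
	return -1;
}

void show_thread_mapping(void)
{
	pthread_attr_t a;
	void *stk; size_t len;
	unsigned long lo, hi;
	if (pthread_getattr_np(pthread_self(), &a)) return;
	pthread_attr_getstack(&a, &stk, &len);
	pthread_attr_destroy(&a);
	/* For a non-custom stack, this mapping should currently also
	 * contain the TLS and thread structure. */
	if (!find_mapping(stk, &lo, &hi))
		printf("mapping %lx-%lx, stack %p len %zu\n",
		       lo, hi, stk, len);
}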

Now, that's probably a bad idea to rely on, because at some point we
might add an extra guard page between the stack and the TLS/thread
structure for hardening, so that stack-based overflows can't clobber
TLS or thread structure.

It's also not true for the main thread, where TLS and thread structure
will be in .bss or mmap-allocated memory (depending on size) separate
from the main thread's stack.

One dumb idea would be taking &errno and looking for the map in
/proc/self/maps that contains it. This would cover all static TLS and
the thread structure, since ABI constrains them to be contiguous. It
won't cover dynamic TLS.
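
If you did go that route, the mechanical part might look something
like this (untested; assumes libnuma's <numaif.h> for the mbind()
wrapper -- you could make the raw syscall instead -- and
bind_tls_mapping is a made-up name):

#include <errno.h>
#include <stdio.h>
#include <numaif.h> /* mbind, MPOL_* -- assumes libnuma is installed */

/* Hypothetical: bind the mapping containing static TLS and the
 * thread structure to a single NUMA node. */
int bind_tls_mapping(unsigned node)
{
	void *p = &errno; /* lands in static TLS/thread struct area */
	char line[256];
	FILE *f = fopen("/proc/self/maps", "r");
	if (!f) return -1;
	while (fgets(line, sizeof line, f)) {
		unsigned long lo, hi;
		if (sscanf(line, "%lx-%lx", &lo, &hi) == 2
		    && lo <= (unsigned long)p && (unsigned long)p < hi) {
			unsigned long mask = 1UL << node;
			fclose(f);
			/* Already-touched pages stay where they are
			 * unless you also pass MPOL_MF_MOVE. */
			return mbind((void *)lo, hi - lo, MPOL_BIND,
			             &mask, 8*sizeof mask, 0);
		}
	}
	fclose(f);
	return -1;
}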

Another idea would be calling __tls_get_addr for each module, using
the module id's provided by dl_iterate_phdr. This will be offset by an
arch-dependent adjustment you need to be aware of, however. It looks
like the dlpi_tls_data field of the dl_iterate_phdr callback structure
is also supposed to contain a pointer to the calling thread's TLS
region for the module (pointer to beginning? end?), but we actually
seem to have this wrong in musl right now -- we're giving a pointer to
the module's TLS image used for instantiating new threads' TLS. Note
that you can also obtain the size from dl_iterate_phdr by using the
PT_TLS program header.
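
Roughly, the enumeration could look like this (untested sketch; the
two-size_t tls_index layout passed to __tls_get_addr is a common-arch
assumption, e.g. x86_64, the returned pointer may be biased by the
arch-dependent adjustment mentioned above, and cb/list_tls are
made-up names):

#define _GNU_SOURCE
#include <link.h>
#include <stdio.h>

/* Common (e.g. x86_64) calling layout; the real type is
 * arch-specific. */
extern void *__tls_get_addr(size_t *);

static int cb(struct dl_phdr_info *info, size_t size, void *data)
{
	for (size_t i = 0; i < info->dlpi_phnum; i++) {
		if (info->dlpi_phdr[i].p_type != PT_TLS) continue;
		size_t ti[2] = { info->dlpi_tls_modid, 0 };
		/* Pointer may carry an arch-dependent DTP offset. */
		void *p = __tls_get_addr(ti);
		printf("%s: modid %zu, tls size %zu, at %p\n",
		       info->dlpi_name, info->dlpi_tls_modid,
		       (size_t)info->dlpi_phdr[i].p_memsz, p);
	}
	return 0;
}

void list_tls(void) { dl_iterate_phdr(cb, 0); }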

Now, this is going to be a lot less useful than you'd think, because
dynamic TLS tends to be small and is allocated by malloc, not mmap, so
it won't be page-aligned or per-thread. In fact, right now, it's
allocated contiguously *by library*, not *by thread*, which is pretty
awful for NUMA. Fortunately this only applies to threads that already
existed when dlopen was called; new threads get all existing TLS
allocated as if it were static. Since 1.1.22 changed how dynamic TLS
is installed, I do intend to change the point and strategy of
allocation, and I'll keep NUMA in mind when I do (i.e. allocate each
thread's new DTLS as a unit rather than allocating each library's).

Rich
