libc-coord - Re: Thread properties API

Message-ID: <20210523193521.3xlotg5yhx335qgx@gmail.com>
Date: Sun, 23 May 2021 12:35:21 -0700
From: Fangrui Song <i@...kray.me>
To: Florian Weimer <fweimer@...hat.com>
Cc: Kostya Serebryany <kcc@...gle.com>,
	Evgenii Stepanov <eugenis@...gle.com>, enh <enh@...gle.com>,
	libc-coord@...ts.openwall.com
Subject: Re: Thread properties API

Thanks for proposing this.

On 2021-05-21, Florian Weimer wrote:
>In glibc, we have a historic gap when it comes to access to per-thread
>data that may contain application data.  Such functionality is required
>to support memory leak detectors and conservative garbage collectors
>that need to discover pointer values in TLS data.  The sanitizers
>currently hard-code glibc implementation details (that are not really
>unchanging in practice), and we want to switch to a defined interface
>eventually.

Yes, the GetTls-related functions in
llvm-project/compiler-rt/lib/sanitizer_common/sanitizer_linux_libcdep.cpp
are very difficult to maintain (and they don't work well with non-x86 glibc).

>We provide some access to per-thread data using libthread_db, but that
>functionality is incomplete in several ways and not readily consumable
>inside a process.
>
>There is a completely different proposal here:
>
>  <https://sourceware.org/glibc/wiki/ThreadPropertiesAPI>
>
>The interfaces that follow below avoid callback functions that must be
>invoked with internal locks held because that is prone to lead to
>deadlocks.  They also avoid encoding a specific, unchanging layout for
>the static TLS area (which can be extended, but not moved) or internal
>pthread_setspecific tables, or whether any such per-thread data
>structures are allocated by malloc.
>
>I'd appreciate comments on this proposal.  It's very early and I'm not
>entirely convinced yet that it is actually implementable. 8-)
>
>void pthread_retain_np (pthread_t thread);
>
>pthread_retain_np marks THREAD for retention.  A retained thread
>identifier remains valid after the thread has exited and (if joinable)
>has been joined, but all subsequent operations on the thread ID, except
>for pthread_release_np, pthread_getattr_np and pthread_tls_areas_get_np,
>will fail.
>
>void pthread_release_np (pthread_t thread);
>
>pthread_release_np undoes the effect of a previous pthread_retain_np
>call (which can be implied by pthread_all_threads_np).  Every
>pthread_retain_np call must eventually be paired with a
>pthread_release_np call, or otherwise there is a resource leak.
>
>Once the number of pthread_release_np calls is equal to the number of
>pthread_retain_np calls for a particular thread ID (including such calls
>implied by pthread_all_threads_np), the thread ID is again only valid
>while the thread is running or joinable.  If the number of calls is
>equal, it is undefined to call pthread_release_np.
>
>size_t pthread_all_threads_np (pthread_t *result, size_t length);
>
>pthread_all_threads_np returns the number of all currently running or
>joinable threads in the process.  The identifiers of the first LENGTH
>such threads are written to the array starting at RESULT.  For those
>thread identifiers, pthread_retain_np is invoked; this happens in such
>a way that the thread is running or joinable at this point.
>
>Applications need to call pthread_release_np on all the thread
>identifiers that pthread_all_threads_np has written to RESULT.  This
>also applies to the case where pthread_all_threads_np is called in a
>loop that grows the RESULT array to the size required to store all
>thread IDs in the current process.
>
>Due to the lack of synchronization, an unspecified time can pass between
>the termination of a detached thread and the time its thread ID no
>longer appears among the thread IDs provided by pthread_all_threads_np.
>It is unspecified whether threads that are neither running nor joinable,
>but have been retained, appear among those thread IDs.
>
>struct pthread_tls_area_np { const void *start; size_t length; };
>size_t pthread_tls_areas_get_np (pthread_t thread,
>  struct pthread_tls_area_np *areas, size_t length);
>size_t pthread_tls_areas_release_np (pthread_t thread,
>  const struct pthread_tls_area_np *areas, size_t count);
>
>pthread_tls_areas_get_np returns the number of TLS areas currently
>allocated for THREAD.  This number may be zero if THREAD refers to a
>thread that is not running.  The location and size of up to LENGTH of
>these areas are written to the array starting at AREAS, followed by zero
>elements until LENGTH array elements have been written.  An application
>can detect that the provided array is too small by checking the return
>value against LENGTH.
>
>The application may inspect the pointers and memory areas identified by
>the array elements (up to the return value of pthread_tls_areas_get_np).  At
>this point, it is guaranteed that the memory locations remain valid for
>access.  After inspecting the TLS areas, the application must call
>pthread_tls_areas_release_np, passing the same THREAD, AREAS and LENGTH
>arguments that were used in the pthread_tls_areas_get_np.
>
>For example, if it has been previously determined that a __thread
>variable is at address P for a particular THREAD, and if the LENGTH
>argument to pthread_tls_areas_get_np is sufficiently large to hold all TLS
>areas for THREAD, then P will be contained within one of the TLS areas.
>If the implementation supports access to __thread variables from other
>threads, it is safe to access *P, subject to the usual constraints
>regarding data races, until pthread_tls_areas_release_np is called.
>Similarly, pointer values stored by pthread_setspecific will appear in
>the TLS areas at unspecified locations, and the values will be current
>in the sense that if a pthread_setspecific call for a key happens-before
>the pthread_tls_areas_get_np call and the access to the TLS areas
>happens-before the next pthread_setspecific call for that key on the
>thread, then the pointer value stored by the first pthread_setspecific
>call will appear in one of the TLS areas listed by
>pthread_tls_areas_get_np.  However, writes to that pointer value may not
>be reflected in future pthread_getspecific calls (even with
>synchronization).
>
>The distribution of various TLS data structures among the AREAS array is
>unspecified.  Some of the areas may be allocated on the heap using
>malloc, or part of such heap allocations.  It is unspecified whether
>previously allocated TLS areas are returned for a thread that is no
>longer running.  If any per-thread data is allocated by the
>implementation in such a way that it will be deallocated using free,
>pointers to such allocations should appear among the areas returned by
>pthread_tls_areas_get_np, so that internal allocations made by the
>implementation are not falsely flagged as leaked.
>
>Thanks,
>Florian
>

Currently msan and tsan intercept __tls_get_addr so that newly allocated TLS
can be tracked and immediately acted upon. See my attached notes for
details.
(Note: aarch64 uses TLSDESC by default and there is no interposable symbol.)

Is there an observer API for newly allocated TLS areas?
(msan's hook is async-signal-safe (even though lazy TLS allocation
isn't). It performs a very simple unpoison operation.)



[I am attaching my notes about the sanitizer runtimes' usage of TLS
boundaries. I hope they are useful.]


## Why does compiler-rt need to know TLS blocks?

### AddressSanitizer "asan" (`-fsanitize=address`)

The main task of AddressSanitizer is to detect addressability problems. If a regular memory byte is not addressable (i.e. accesses should be UB), it is said to be poisoned and the associated shadow encodes the addressability information (all unpoisoned/all poisoned/partly poisoned).
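
To make the three shadow states concrete, here is a toy model in C. It uses asan's usual 8-to-1 encoding (a shadow byte of 0 means all 8 bytes are addressable, 1 to 7 means only the first k bytes are addressable, a negative value means the whole granule is poisoned), but the arena, the array and the `toy_*` names are made up for illustration and are not compiler-rt's real layout or API.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: a tiny arena addressed by offset, with one shadow byte
   per 8-byte granule. All names here are hypothetical. */
enum { GRANULE = 8, ARENA_SIZE = 256 };
static int8_t toy_shadow[ARENA_SIZE / GRANULE];

/* Poison every granule that overlaps [off, off + len). */
static void toy_poison(size_t off, size_t len) {
  size_t last = (off + len + GRANULE - 1) / GRANULE;
  for (size_t g = off / GRANULE; g < last; g++)
    toy_shadow[g] = -1;
}

/* Unpoison [off, off + len); off must be 8-byte aligned. A trailing partial
   granule records how many of its leading bytes are addressable. */
static void toy_unpoison(size_t off, size_t len) {
  size_t g = off / GRANULE;
  for (; len >= GRANULE; len -= GRANULE)
    toy_shadow[g++] = 0;
  if (len != 0)
    toy_shadow[g] = (int8_t)len;
}

/* Check whether a one-byte access at offset off is allowed. */
static bool toy_byte_ok(size_t off) {
  int8_t s = toy_shadow[off / GRANULE];
  if (s == 0) return true;            /* all unpoisoned */
  if (s < 0) return false;            /* all poisoned */
  return (off % GRANULE) < (size_t)s; /* partly poisoned */
}
```

Unpoisoning the thread stack and static TLS blocks on thread creation corresponds to a `toy_unpoison` call over those ranges in this model.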

On thread creation, the runtime should unpoison the thread stack and static TLS blocks to allow accesses. (`test/asan/TestCases/Linux/unpoison_tls.cpp`; introduced in https://github.com/llvm/llvm-project/commit/09886cd17ab8e5e601fda0e2aa21ff28c1a8fa63 "[asan] Make ASan report the correct thread address ranges to LSan.")
The runtime additionally unpoisons the thread stack and TLS blocks on thread exit to allow accesses from later TSD destructors.

Note: if the allocation is rtld/libc internal and not intercepted, there is no need to unpoison the range. The associated shadow is supposed to be zeros.
However, if the allocation is intercepted, the runtime should unpoison the range in case the range reuses a previous allocation which happens to contain poisoned bytes.

In glibc, `_dl_allocate_tls` and `_dl_deallocate_tls` call malloc/free functions which are internal and not intercepted, so the allocations are opaque to the runtime and the shadow bytes are all zeroes.

### Hardware-assisted AddressSanitizer "hwasan" (`-fsanitize=hwaddress`)

Its `ClearShadowForThreadStackAndTLS` is similar to asan's.

### LeakSanitizer "lsan" (`-fsanitize=leak`)

LeakSanitizer detects memory leaks. On many targets, it is integrated (and enabled by default) in AddressSanitizer, but it can be used standalone.
The checker is triggered by an `atexit` hook (the default options are `LSAN_OPTIONS=detect_leaks=1:leak_check_at_exit=1`), but it can also be invoked via `__lsan_do_leak_check`.

Each supported platform provides an entry point: `StopTheWorld` (e.g. Linux [1]), which does the following:

* Invoke the clone syscall to create a new process which shares the address space with the calling process.
* In the new process, list threads by iterating over `/proc/$pid/task/`.
* In the new process, call `SuspendThread` (ptrace `PTRACE_ATTACH`) to suspend a thread.

`StopTheWorld` returns. The runtime performs mark-and-sweep, reports leaks, and then calls `ResumeAllThreads` (ptrace `PTRACE_DETACH`).
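
The thread-listing step above can be sketched as follows. This is only an illustration: it enumerates the calling process via `/proc/self/task` and uses libc (`opendir`, `readdir`, `strtol`) for brevity, which the real runtime cannot do, and it leaves out the clone and ptrace parts entirely. The name `list_task_ids` is made up.

```c
#include <dirent.h>
#include <stdlib.h>
#include <sys/types.h>

/* Collect up to max thread IDs of the calling process from /proc/self/task.
   Returns the number of threads found (which may exceed max), or -1 if the
   directory cannot be opened. Hypothetical helper, not compiler-rt code. */
static int list_task_ids(pid_t *out, int max) {
  DIR *d = opendir("/proc/self/task");
  if (d == NULL)
    return -1;
  int n = 0;
  struct dirent *e;
  while ((e = readdir(d)) != NULL) {
    char *end;
    long tid = strtol(e->d_name, &end, 10);
    if (end == e->d_name || *end != '\0')
      continue; /* not a number: skips "." and ".." */
    if (n < max)
      out[n] = (pid_t)tid;
    n++;
  }
  closedir(d);
  return n;
}
```

A tracer would typically repeat the listing until no new thread shows up, since threads can be created while it iterates.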

Note: the implementation cannot call libc functions. It does not perform code injection. The root set includes static/dynamic TLS blocks for each thread.

(The `pthread_create` interceptor calls `AdjustStackSize` which computes a minimum stack size with `GetTlsSize`. https://code.woboq.org/llvm/compiler-rt/lib/sanitizer_common/sanitizer_posix_libcdep.cpp.html#411 I am not sure musl needs this.)

The current lsan implementation places further requirements on `GetTls`: it does not intercept `pthread_setspecific`.
Instead, it expects the range returned by `GetTls` to include pointers to `pthread_setspecific` regions; otherwise there would be false positive leak reports.

In addition, lsan gets the static TLS boundaries at `pthread_create` time and expects the boundaries to include TLS blocks of dynamically loaded modules.
This means that the range returned by `GetTls` needs to include the static TLS surplus.
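
To illustrate why the reported range matters, here is a minimal sketch of a conservative scan, assuming a leak checker that keeps a table of live heap chunks. `struct toy_chunk` and `toy_scan_range` are hypothetical names, not lsan's data structures. A chunk is only marked reachable if some word inside a scanned range points into it, so a chunk referenced only from TLS memory outside the reported range would be reported as leaked.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record of one heap allocation known to the leak checker. */
struct toy_chunk {
  uintptr_t begin;
  size_t size;
  bool reachable;
};

/* Conservatively scan [start, start + len) one word at a time. Any word
   whose value falls inside a tracked chunk marks that chunk reachable. */
static void toy_scan_range(const void *start, size_t len,
                           struct toy_chunk *chunks, size_t nchunks) {
  const unsigned char *p = start;
  for (size_t off = 0; off + sizeof(uintptr_t) <= len;
       off += sizeof(uintptr_t)) {
    uintptr_t word;
    memcpy(&word, p + off, sizeof word); /* avoids alignment assumptions */
    for (size_t i = 0; i < nchunks; i++) {
      if (word >= chunks[i].begin &&
          word - chunks[i].begin < chunks[i].size)
        chunks[i].reachable = true;
    }
  }
}
```

In this model, the static TLS block, the static TLS surplus and the `pthread_setspecific` tables would each have to be passed to `toy_scan_range` as part of the root set.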

(
You might ask: the thread control block has the dtv pointer, so why can't lsan track the referenced allocations?
Well, for threads, rtld/libc implementations typically allocate the static TLS blocks as part of the thread stack. The runtime does not see these allocations, so it does not know about them.
)

On glibc, the range returned by `GetTls` includes `pthread::{specific_1stblock,specific}` for thread-specific data keys.
There is currently a hack to ignore allocations from ld.so-allocated dynamic TLS blocks.
Note: if the `pthread::{specific_1stblock,specific}` pointers are encrypted, lsan cannot track the allocation.

[1]: https://code.woboq.org/llvm/compiler-rt/lib/sanitizer_common/sanitizer_stoptheworld_linux_libcdep.cpp.html#144

### MemorySanitizer "msan" (`-fsanitize=memory`)

MemorySanitizer detects uses of uninitialized memory. If a regular memory byte has uninitialized (poisoned) bits, the corresponding bits in its associated shadow byte are set to one.

The thread handling is similar to asan's.
On thread creation, the runtime should unpoison the thread stack and static TLS blocks to allow accesses. (`test/msan/tls_reuse.cpp`)
The runtime additionally unpoisons the thread stack and TLS blocks on thread exit to allow accesses from TSD destructors.

msan needs to do more than asan: the `__tls_get_addr` interceptor (`DTLS_on_tls_get_addr`) detects new dynamic TLS blocks and unpoisons the shadow.
Otherwise, if a dynamic TLS block reuses a previous allocation with poison, there may be false positives.
One way to semi-reliably trigger this is (`test/msan/dtls_test.cpp` https://github.com/google/sanitizers/issues/547):

* in a thread, write an uninitialized (poisoned) value to a dynamic TLS block
* destroy the thread
* create a new thread
* try making the new thread reuse the poisoned dynamic TLS block.
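
The steps above can be modeled with a toy shadow in C. This is a sketch under simplifying assumptions: one fixed buffer stands in for the reused dynamic TLS block, one shadow byte is kept per data byte, and all `toy_*` names are invented. It is not msan's implementation. The point is that the shadow of the reused block stays stale until something like the `__tls_get_addr` hook clears it.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { BLOCK_SIZE = 16 };

/* Stand-in for a dynamic TLS block whose memory gets reused by a later
   thread, plus its shadow (nonzero shadow bits = uninitialized bits). */
static unsigned char toy_block[BLOCK_SIZE];
static unsigned char toy_block_shadow[BLOCK_SIZE];

/* An instrumented store copies the shadow of the stored value along. */
static void toy_store(size_t off, unsigned char value,
                      unsigned char value_shadow) {
  toy_block[off] = value;
  toy_block_shadow[off] = value_shadow;
}

/* An instrumented load would report if any shadow bit is set. */
static bool toy_load_is_clean(size_t off) {
  return toy_block_shadow[off] == 0;
}

/* What a hook could do when it detects a new dynamic TLS block:
   clear the stale shadow left behind by the previous owner. */
static void toy_on_new_dtls_block(void) {
  memset(toy_block_shadow, 0, sizeof toy_block_shadow);
}
```

Calling `toy_store(3, 0, 0xff)` in the first "thread" and then `toy_load_is_clean(3)` in the second returns false (the false positive); after `toy_on_new_dtls_block()` it returns true.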

Note: aarch64 uses TLSDESC by default and there is no interposable symbol.

During the development of glibc 2.19, [commit 1f33d36a8a9e78c81bed59b47f260723f56bb7e6](https://sourceware.org/git/?p=glibc.git;a=commit;h=1f33d36a8a9e78c81bed59b47f260723f56bb7e6) ("Patch 2/4 of the effort to make TLS access async-signal-safe.") was checked in.
`DTLS_on_tls_get_addr` detects the `__signal_safe_memalign` header and considers it a dynamic TLS block if the block is not within the static TLS boundaries.
[commit dd654bf9ba1848bf9ed250f8ebaa5097c383dcf8](https://sourceware.org/git/?p=glibc.git;a=commit;h=dd654bf9ba1848bf9ed250f8ebaa5097c383dcf8) ("Revert "Patch 2/4 of the effort to make TLS access async-signal-safe."") reverted `__signal_safe_memalign`, but the implementation remains in grte branches.

See also [Re: glibc 2.19 - asyn-signal safe TLS and ASan.](https://groups.google.com/g/address-sanitizer/c/BfwYD8HMxTM)

Similar to lsan: the `pthread_create` interceptor calls `AdjustStackSize` which computes a minimum stack size with `GetTlsSize`.

### ThreadSanitizer "tsan" (`-fsanitize=thread`)

Similar to lsan: the `pthread_create` interceptor calls `AdjustStackSize` which computes a minimum stack size with `GetTlsSize`.

Similar to msan, the runtime unpoisons TLS blocks to avoid false positives.
Tested by `test/tsan/dtls.c` (D20927).
tsan also needs to intercept `__tls_get_addr`. The problem that aarch64 TLSDESC does not have an interposable symbol also applies.

I wrongly thought <https://reviews.llvm.org/D93866> was a workaround. <https://sourceware.org/pipermail/libc-alpha/2021-January/121352.html> explained that the code has not materially changed since 2012.

For dynamic TLS blocks, older glibc (e.g. 2.23) calls `__libc_memalign`, which is intercepted (`tsan/rtl/tsan_interceptors_posix.cpp`); since BZ #17730, newer glibc (e.g. 2.32) calls `malloc`.