|
|
Message-ID: <20230602141655.GO4163@brightrain.aerifal.cx>
Date: Fri, 2 Jun 2023 10:16:55 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: [PATCH] [RFC] make clone() usable
On Fri, Jun 02, 2023 at 10:28:09AM +0300, Alexey Izbyshev wrote:
> On 2023-06-01 23:08, Rich Felker wrote:
> >On Thu, Jun 01, 2023 at 01:12:57PM +0300, Alexey Izbyshev wrote:
> >>On 2023-05-31 02:35, Rich Felker wrote:
> >>>As discussed before (see the 2021 thread "Incorrect thread TID
> >>>caching") clone() has been effectively unusable because it produces a
> >>>child process in invalid state, with wrong tid in its thread
> >>>structure, among other problems.
> >>>
> >>>The attached proposed patch attempts to make clone() usable by having
> >>>it share the _Fork logic for establishing a consistent process state
> >>>after forking, and also blocks use of flags which produce invalid
> >>>state.
> >>>
> >>Tangentially, while thinking about this, I realized that because the
> >>kernel clears the address of the exit futex on clone and _Fork
> >>doesn't re-establish it, the following blocks forever:
> >>
> >>void *thr(void *arg) {
> >> // Blocks in __tl_sync because __thread_list_lock is never
> >>unlocked
> >> pthread_join(*(pthread_t *)arg, NULL));
> >> return NULL;
> >>}
> >>
> >>int main() {
> >> if (!fork()) {
> >> static pthread_t pt;
> >> pt = pthread_self();
> >> pthread_create(&(pthread_t){0}, NULL, thr, &pt);
> >> pthread_exit(NULL);
> >> }
> >> wait(NULL);
> >>}
> >>
> >>This could be fixed by the following change in _Fork.c:
> >>
> >>- self->tid = __syscall(SYS_gettid);
> >>+ self->tid = __syscall(SYS_set_tid_address,
> >>&__thread_list_lock);
> >
> >Wow, I had no idea Linux's fork cleared the exit futex address. That's
> >a big bug we've had around for nobody to have noticed...
> >
> >>>With CLONE_VM, the old behavior is kept, with a caveat that the child
> >>>context is extremely restrictive, ala vfork.
> >>>
> >>>It was raised previously (pardon the pun) that raise() should perhaps
> >>>be modified to make a SYS_gettid syscall rather than using the thread
> >>>structure tid, so that it's possible to call raise() in a vfork
> >>>context without signaling the wrong process. It might make sense to do
> >>>that now too.
> >>>
> >>Yes, it would be nice to have raise and abort (which also reads the
> >>cached tid separately) working in a vfork child.
> >>
> >>>diff --git a/src/linux/clone.c b/src/linux/clone.c
> >>>index 8c1af7d3..e8f76bbc 100644
> >>>--- a/src/linux/clone.c
> >>>+++ b/src/linux/clone.c
> >>>@@ -4,18 +4,55 @@
> >>> #include <sched.h>
> >>> #include "pthread_impl.h"
> >>> #include "syscall.h"
> >>>+#include "lock.h"
> >>>+#include "fork_impl.h"
> >>>+
> >>>+struct clone_start_args {
> >>>+ int (*func)(void *);
> >>>+ void *arg;
> >>>+ sigset_t sigmask;
> >>>+};
> >>>+
> >>>+static int clone_start(void *arg)
> >>>+{
> >>>+ struct clone_start_args *csa = arg;
> >>>+ __post_Fork(0);
> >>>+ __restore_sigs(&csa->sigmask);
> >>>+ return csa->func(csa->arg);
> >>>+}
> >>>
> >>> int clone(int (*func)(void *), void *stack, int flags, void *arg, ...)
> >>> {
> >>>+ struct clone_start_args csa;
> >>> va_list ap;
> >>>- pid_t *ptid, *ctid;
> >>>- void *tls;
> >>>+ pid_t *ptid = 0, *ctid = 0;
> >>>+ void *tls = 0;
> >>>+
> >>>+ /* Flags that produce an invalid thread/TLS state are disallowed. */
> >>>+ int badflags = CLONE_THREAD | CLONE_SETTLS | CLONE_CHILD_CLEARTID;
> >>>+ if (flags & badflags)
> >>>+ return __syscall_ret(-EINVAL);
> >>>
> >>Maybe also check for NULL stack, as proposed in
> >>https://www.openwall.com/lists/musl/2022/08/02/1 ?
> >
> >I guess that can be done. Normally we don't check for null pointers,
> >but I think the syscall would silently keep the existing stack pointer
> >if passed null rather than blowing up, which we probably don't want --
> >even without CLONE_VM, where it'd blow up, we'd be giving
> >appears-to-work behavior without a contract for it to work.
> >
> Yes, and this is not only about CLONE_VM. On some arches (e.g. i386,
> x86_64, aarch64) musl's __clone stores things on "stack" and hence
> segfaults if "stack" is NULL, but on other arches it doesn't (e.g.
> arm). Exposing this is inconsistent and confusing. If support for
> passing NULL "stack" is ever desired, __clone should be changed
> first.
OK. Crashing here would actually be okay, but having it appear to work
on some archs while crashing on others is not a good situation. So I
think just reporting it as an error is okay for now. We can revisit
this later if it would be useful.
> >>> va_start(ap, arg);
> >>>- ptid = va_arg(ap, pid_t *);
> >>>- tls = va_arg(ap, void *);
> >>>- ctid = va_arg(ap, pid_t *);
> >>>+ if (flags & (CLONE_PIDFD | CLONE_PARENT_SETTID | CLONE_CHILD_SETTID))
> >>>+ ptid = va_arg(ap, pid_t *);
> >>>+ if (flags & CLONE_CHILD_SETTID) {
> >>>+ tls = va_arg(ap, void *);
> >>>+ ctid = va_arg(ap, pid_t *);
> >>>+ }
> >>> va_end(ap);
> >>>
> >>>- return __syscall_ret(__clone(func, stack, flags, arg, ptid, tls,
> >>>ctid));
> >>>+ if (flags & CLONE_VM)
> >>>+ return __syscall_ret(__clone(func, stack, flags, arg, ptid,
> >>>tls, ctid));
> >>>+
> >>>+ __block_all_sigs(&csa.sigmask);
> >>>+ LOCK(__abort_lock);
> >>>+
> >>>+ csa.func = func;
> >>>+ csa.arg = arg;
> >>>+ int ret = __clone(clone_start, stack, flags, &csa, ptid, tls, ctid);
> >>>+
> >>>+ __post_Fork(ret);
> >>>+ __restore_sigs(&csa.sigmask);
> >>>+ return __syscall_ret(ret);
> >>> }
> >>
> >>Another thing that might be somewhat relevant if we expect to see
> >>wider usage of clone is that syscall (the public function) currently
> >>blindly extracts six arguments via va_arg, which on some
> >>architectures requires sufficient stack space not to crash. So on
> >>i386 the following silly function will crash if passed to clone,
> >>provided that "stack" points to the beginning of unmapped space past
> >>the stack page end.
> >
> >That's changed in the patch. It only calls va_arg for arguments whose
> >existence the flags implies.
> >
> I was talking about va_arg usage in syscall() (in misc/syscall.c).
> __clone only aligns stack on 16 bytes (at least on arches that I
> checked), but doesn't reserve space. So a sufficiently unlucky
> user-provided function will crash when called from __clone in the
> child, as in my example.
OK, this is really a bug in the syscall() function, albeit one that
doesn't seem to admit any reasonable fix. Note that this function is
not used in libc; it's only present for use by applications.
Rich
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.