musl - Re: Running code on all other threads (for sandboxing)

Message-ID: <20250824021820.GI1827@brightrain.aerifal.cx>
Date: Sat, 23 Aug 2025 22:18:20 -0400
From: Rich Felker <dalias@...c.org>
To: Demi Marie Obenour <demiobenour@...il.com>
Cc: musl@...ts.openwall.com, libc-alpha@...rceware.org
Subject: Re: Running code on all other threads (for sandboxing)

On Fri, Aug 22, 2025 at 09:34:55PM -0400, Demi Marie Obenour wrote:
> There are cases where it is highly desirable for a process
> to start out with full user rights (or at least close to them),
> initialize, and then drop these privileges using Linux kernel
> features like seccomp.  Unfortunately, this breaks if the
> process uses third-party libraries that create threads during
> initialization.  In particular, Mesa can do this, and there is
> no realistic alternative to it as Mesa is ~2 million lines of
> GPU compiler and driver code.  Loading Mesa later is undesirable
> as it prevents removing all filesystem access.
> 
> There are two ways to fix this problem:
> 
> 1. Fix the problem in the Linux kernel.
> 2. Work around it in userspace, as is already done for setuid()
>    and friends.
> 
> For the second, it should be sufficient to provide a function
> that runs a caller-provided function on each thread, while
> ensuring that the process is atomic with respect to other
> threads in the process.  This function only needs to make
> system calls and crashes the process if there is an error.
> If the function uses anything that isn't a syscall or
> compiler builtin, it gets to keep both pieces.
> 
> Is this something that would make sense to implement?  I know
> that this problem has been an issue for Chromium on Linux.

I'm not sure what the right solution to this specific problem is, but
I don't think exposing a "run arbitrary code in each thread" function
as a public API is a good choice. Such code would run in a context which is
worse/more-restrictive even than "async signal" context, making it
really difficult to define any reasonable class of "what you're
allowed to do here". I know you said "syscalls", but even that
requires defining what you mean by syscalls (raw via asm? via
syscall()? any function that's "traditionally just a syscall"?) and
further specifying which syscalls are actually allowed (any which
break the __synccall context assumptions would need to be forbidden).

I think there are potentially semi-portable solutions to your problem
that don't require such a big hammer as arbitrary __synccall.

One that comes to mind is installing a SECCOMP_RET_USER_NOTIF or
SECCOMP_RET_TRAP filter before loading Mesa. This could allow the
filesystem access to load Mesa libraries only until you set a flag
that loading has finished, then cause filesystem access syscalls to
fail once the flag has been set.
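
A minimal sketch of the SECCOMP_RET_USER_NOTIF variant, to make the
idea concrete. Everything named here (build_open_filter,
install_open_filter, fill_response, FILTER_ARCH) is hypothetical, only
openat is intercepted (a real filter would need open, openat2, stat and
friends as well), and it assumes kernel headers and a kernel new enough
for user notifications (5.0) and the CONTINUE flag (5.5). The
supervisor that reads notifications from the returned fd would have to
be a thread or process not subject to the filter.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE (1UL << 0)
#endif

/* Assumption: only these two architectures are handled in this sketch. */
#if defined(__x86_64__)
#define FILTER_ARCH AUDIT_ARCH_X86_64
#elif defined(__aarch64__)
#define FILTER_ARCH AUDIT_ARCH_AARCH64
#else
#error "add the AUDIT_ARCH_* value for this architecture"
#endif

/* Kill on unexpected arch, notify the supervisor on openat, allow the rest. */
static struct sock_filter open_filter[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, FILTER_ARCH, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_openat, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

struct sock_fprog build_open_filter(void)
{
    struct sock_fprog prog;
    prog.len = sizeof open_filter / sizeof open_filter[0];
    prog.filter = open_filter;
    return prog;
}

/* Install the filter in the calling thread (threads it creates later
 * inherit it). Returns the notification fd, or -1 on error. */
int install_open_filter(void)
{
    struct sock_fprog prog = build_open_filter();
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    return (int)syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                        SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);
}

/* Supervisor policy: let the syscall proceed while loading is in
 * progress, fail it with EACCES once loading_done is set. */
void fill_response(struct seccomp_notif_resp *resp, __u64 id, int loading_done)
{
    memset(resp, 0, sizeof *resp);
    resp->id = id;
    if (loading_done)
        resp->error = -EACCES;
    else
        resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}
```

The supervisor loop itself (SECCOMP_IOCTL_NOTIF_RECV, then
fill_response, then SECCOMP_IOCTL_NOTIF_SEND) is left out. Since the
decision depends only on the flag and not on inspecting syscall
arguments, continuing the syscall should not run into the usual
argument-rewriting race, but I have not tested this against Mesa.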

Another approach is doing what I'd call "manual __synccall" with your
own signal, which is better than exposing actual __synccall because
the application code does not run in an invalid-libc context, but this
would only work if Mesa's hidden threads don't mask signals. A library
creating its own threads behind the scenes *should* be masking all
signals, so this probably doesn't work. Even if Mesa botched it, you
wouldn't want to preclude them fixing it.
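
For illustration, a rough sketch of what such a "manual __synccall"
could look like. The names (signal_all_threads, sync_handler) are
hypothetical, the signal number is assumed to be unused by anything
else in the process, and the /proc/self/task walk races with threads
being created or exiting. A thread that has the signal masked simply
never runs the handler and is reported as a non-responder after the
timeout, which is the limitation described above.

```c
#define _GNU_SOURCE
#include <dirent.h>
#include <signal.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

static atomic_int handled_count;   /* threads that have run the handler */

/* Runs in each signaled thread. Only async-signal-safe operations are
 * allowed here; the per-thread action (e.g. installing a seccomp
 * filter via a raw syscall) would go before the counter update. */
static void sync_handler(int sig)
{
    (void)sig;
    atomic_fetch_add(&handled_count, 1);
}

/* Send signo to every thread of the process and wait up to timeout_ms
 * for all of them to run the handler. Returns the number of threads
 * that did not respond (0 means all ran it), or -1 on setup failure. */
int signal_all_threads(int signo, int timeout_ms)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = sync_handler;
    sa.sa_flags = SA_RESTART;      /* avoid spurious EINTR in other threads */
    sigemptyset(&sa.sa_mask);
    if (sigaction(signo, &sa, NULL) != 0)
        return -1;
    atomic_store(&handled_count, 0);

    DIR *dir = opendir("/proc/self/task");
    if (!dir)
        return -1;
    pid_t pid = getpid();
    int sent = 0;
    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        if (ent->d_name[0] < '0' || ent->d_name[0] > '9')
            continue;              /* skip "." and ".." */
        pid_t tid = (pid_t)atoi(ent->d_name);
        if (syscall(SYS_tgkill, pid, tid, signo) == 0)
            sent++;                /* thread may have exited; then not counted */
    }
    closedir(dir);

    /* Poll in 1 ms steps until every signaled thread has responded. */
    for (int waited = 0;
         atomic_load(&handled_count) < sent && waited < timeout_ms;
         waited++) {
        struct timespec ts = { 0, 1000000 };
        nanosleep(&ts, NULL);
    }
    return sent - atomic_load(&handled_count);
}
```

A caller would do something like signal_all_threads(SIGRTMIN + 1, 1000)
after initialization and treat any nonzero result as failure.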

There is probably also a way to do this with ptrace, which blocked
signals wouldn't interfere with, but that gets really nasty really
quickly.

Unfortunately there don't seem to be any ways to inject new seccomp
filters into another task (even a thread of your own process)
directly. This is what Linux really should be offering here.

Rich