Message-ID: <9dd123f1-0053-4f92-83bd-b443a78cfeb5@gmail.com>
Date: Sun, 24 Aug 2025 08:11:30 -0400
From: Demi Marie Obenour <demiobenour@...il.com>
To: Rich Felker <dalias@...c.org>
Cc: musl@...ts.openwall.com, libc-alpha@...rceware.org
Subject: Re: Running code on all other threads (for sandboxing)

On 8/23/25 22:18, Rich Felker wrote:
> On Fri, Aug 22, 2025 at 09:34:55PM -0400, Demi Marie Obenour wrote:
>> There are cases where it is highly desirable for a process
>> to start out with full user rights (or at least close to them),
>> initialize, and then drop these privileges using Linux kernel
>> features like seccomp.  Unfortunately, this breaks if the
>> process uses third-party libraries that create threads during
>> initialization.  In particular, Mesa can do this, and there is
>> no realistic alternative to it as Mesa is ~2 million lines of
>> GPU compiler and driver code.  Loading Mesa later is undesirable
>> as it prevents removing all filesystem access.
>>
>> There are two ways to fix this problem:
>>
>> 1. Fix the problem in the Linux kernel.
>> 2. Work around it in userspace, as is already done for setuid()
>>    and friends.
>>
>> For the second, it should be sufficient to provide a function
>> that runs a caller-provided function on each thread, while
>> ensuring that the process is atomic with respect to other
>> threads in the process.  This function only needs to make
>> system calls and crashes the process if there is an error.
>> If the function uses anything that isn't a syscall or
>> compiler builtin, it gets to keep both pieces.
>>
>> Is this something that would make sense to implement?  I know
>> that this problem has been an issue for Chromium on Linux.
> 
> I'm not sure what the right solution to this specific problem is, but
> I don't think exposing a "run arbitrary code in each thread" as a
> public API is a good choice. Such code would run in a context which is
> worse/more-restrictive even than "async signal" context, making it
> really difficult to define any reasonable class of "what you're
> allowed to do here". I know you said "syscalls", but even that
> requires defining what you mean by syscalls (raw via asm? via
> syscall()? any function that's "traditionally just a syscall"?) and
> further specifying which syscalls are actually allowed (any which
> break the __synccall context assumptions would need to be forbidden).

I think it would be enough to allow just seccomp() and compiler-inserted
calls to functions like memcpy().  memcpy() should only depend on a
valid stack (which *is* guaranteed unless I am greatly mistaken) and
seccomp() is just a wrapper around syscall().

> I think there are potentially semi-portable solutions to your problem
> that don't require such a big hammer as arbitrary __synccall.
> 
> One that comes to mind is installing a SECCOMP_RET_USER_NOTIF or
> SECCOMP_RET_TRAP filter before loading Mesa. This could allow the
> filesystem access to load Mesa libraries only until you set a flag
> that loading has finished, then cause filesystem access syscalls to
> fail once the flag has been set.

Would this involve emulating all the filesystem syscalls?  The problem
is that the flag would need to be set in a way that it can’t be unset.

> Another approach is doing what I'd call "manual __synccall" with your
> own signal, which is better than exposing actual __synccall because
> the application code does not run in an invalid-libc context, but this
> would only work if Mesa's hidden threads don't mask signals. A library
> creating its own threads behind the scenes *should* be masking all
> signals, so this probably doesn't work. Even if Mesa botched it, you
> wouldn't want to preclude them fixing it.

Also, from my reading of past mailing list posts, this is inherently
racy against thread creation.

> There is probably also a way to do this with ptrace, which blocked
> signals wouldn't interfere with, but that gets really nasty really
> quick.
> 
> Unfortunately there don't seem to be any ways to inject new seccomp
> filters into another task (even a thread of your own process)
> directly. This is what Linux really should be offering here.

Actually, it already supports this (SECCOMP_FILTER_FLAG_TSYNC).
I don't think this is supported for Landlock, though.
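
For illustration, a minimal sketch of the TSYNC approach, assuming
Linux kernel headers are available.  The names deny_openat_filter,
deny_openat_prog and install_filter_all_threads are hypothetical,
chosen only for this example.  The filter is deliberately simplistic:
it only fails openat() with EPERM and does not check the syscall
architecture, so it is not a usable sandbox policy as written.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Illustrative filter: fail openat() with EPERM, allow everything else.
 * Assumption: a real policy would also check seccomp_data.arch and
 * cover the other filesystem syscalls. */
struct sock_filter deny_openat_filter[] = {
    /* Load the syscall number. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* If it is openat, fall through to the ERRNO return; else skip it. */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_openat, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

const struct sock_fprog deny_openat_prog = {
    .len = sizeof(deny_openat_filter) / sizeof(deny_openat_filter[0]),
    .filter = deny_openat_filter,
};

/* Hypothetical helper: install prog on every thread of the process.
 * Returns 0 on success, -1 with errno set on error, or (with TSYNC)
 * the positive TID of a thread that could not be synchronized. */
long install_filter_all_threads(const struct sock_fprog *prog)
{
    /* Required for unprivileged callers before installing a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
        return -1;
    /* Raw syscall, since not every libc provides a seccomp() wrapper. */
    return syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER,
                   SECCOMP_FILTER_FLAG_TSYNC, prog);
}
```

A caller would invoke install_filter_all_threads(&deny_openat_prog)
once initialization is finished and treat any nonzero return as fatal.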
-- 
Sincerely,
Demi Marie Obenour (she/her/hers)
