Message-ID: <20250211113827.GB10433@brightrain.aerifal.cx>
Date: Tue, 11 Feb 2025 06:38:27 -0500
From: Rich Felker <dalias@...c.org>
To: Daniele Personal <d.dario76@...il.com>
Cc: Florian Weimer <fweimer@...hat.com>, musl@...ts.openwall.com
Subject: Re: pthread_mutex_t shared between processes with different
 pid namespaces

On Tue, Feb 11, 2025 at 10:34:30AM +0100, Daniele Personal wrote:
> On Mon, 2025-02-10 at 13:14 -0500, Rich Felker wrote:
> > On Mon, Feb 10, 2025 at 05:12:52PM +0100, Daniele Personal wrote:
> > > On Sat, 2025-02-08 at 09:52 -0500, Rich Felker wrote:
> > > > On Sat, Feb 08, 2025 at 03:40:18PM +0100, Daniele Dario wrote:
> > > > > On Sat, Feb 8, 2025 at 13:39, Rich Felker <dalias@...c.org>
> > > > > wrote:
> > > > > 
> > > > > > On Sat, Feb 08, 2025 at 10:20:45AM +0100, Daniele Dario
> > > > > > wrote:
> > > > > > > But wouldn't this mean that robust mutex functionality is
> > > > > > > totally incompatible with pid namespaces?
> > > > > > 
> > > > > > No, only with trying to synchronize *across* different pid
> > > > > > namespaces.
> > > > > > 
> > > > > > > If the kernel relies on the tid stored in memory by the
> > > > > > > process, this always lacks the information about the pid
> > > > > > > namespace the tid belongs to.
> > > > > > 
> > > > > > It's necessarily within the same pid namespace as the process
> > > > > > itself.
> > > > > > 
> > > > > > Functionally, you should consider different pid namespaces as
> > > > > > different systems that happen to be capable of sharing some
> > > > > > resources.
> > > > > > 
> > > > > > Rich
> > > > > > 
> > > > > 
> > > > > Yes, I'm just saying that sharing pthread_mutex_t instances
> > > > > across processes within the same pid namespace, but on a system
> > > > > with more than one pid namespace, could lead to issues anyway if
> > > > > the stored tid value is used by the kernel to decide whom to
> > > > > contact without knowing which pid namespace it belongs to.
> > > > > 
> > > > > I'm not saying this is true; I'm trying to understand and, if
> > > > > possible, improve things.
> > > > 
> > > > That's not a problem. The stored tid is used only in the context
> > > > of a process exiting, where the kernel code knows the relevant
> > > > pid namespace (the one the exiting process is in) and uses the
> > > > tid relative to that. If it didn't work this way, it would be a
> > > > fatal bug in the pid namespace implementation, which is supposed
> > > > to allow essentially transparent containerization (which includes
> > > > processes in the ns being able to use their tids as they could if
> > > > they were outside of any container/in global ns).
> > > > 
> > > > Rich
> > > > 
> > > 
> > > So, IIUC, the problem of sharing robust pthread_mutex_t instances
> > > across different pid namespaces is on the user space side, which is
> > > not able to distinguish clashes on TIDs. In particular, problems
> > > could arise when:
> > 
> > No, it is not "on the user side". The user side can be modified
> > arbitrarily, and, modulo some cost, could surely be made to work for
> > non-robust process-shared mutexes. The problem is that the kernel --
> > the part which makes them robust -- has to honor the protocol, and
> > the protocol does not admit distinguishing "pid N in ns X" from
> > "pid N in ns Y".
> 
> Ah, I thought your previous sentence was saying that the kernel is able
> to make this distinction.

No, it's able to make the *assumption* that the namespace the tid is
relative to is that of the dying process. That's what lets it work
(and a large part of why namespaces were practical to add to Linux to
begin with -- all of the existing interfaces that use pids/tids need
to know which namespace you're talking about, but they work because
the kernel can assume "same namespace as the executing task").
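
For reference, here is a minimal sketch (not from this thread; the
function names are purely illustrative) of what the user-space side of
that looks like through the pthread API: a process-shared robust mutex
placed in shared memory, where a lock attempt after the owner died
returns EOWNERDEAD and the new owner repairs the state and marks the
mutex consistent. The robust-list bookkeeping underneath is handled by
libc and the kernel.

#include <pthread.h>
#include <errno.h>

/* One-time setup of a mutex living in a mapping shared by the processes. */
int init_robust(pthread_mutex_t *mtx)
{
    pthread_mutexattr_t a;
    pthread_mutexattr_init(&a);
    pthread_mutexattr_setpshared(&a, PTHREAD_PROCESS_SHARED);
    pthread_mutexattr_setrobust(&a, PTHREAD_MUTEX_ROBUST);
    int r = pthread_mutex_init(mtx, &a);
    pthread_mutexattr_destroy(&a);
    return r;
}

/* Lock, recovering if the previous owner died while holding the mutex. */
int lock_robust(pthread_mutex_t *mtx)
{
    int r = pthread_mutex_lock(mtx);
    if (r == EOWNERDEAD) {
        /* repair the protected data here, then mark it usable again */
        pthread_mutex_consistent(mtx);
        r = 0;
    }
    return r;
}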

> Unfortunately it is not possible to say which variables need cross-ns
> locking and which do not. This means that we should treat them all the
> same way and so replace all the mutexes with sysv semaphores, but this
> has some costs: locking sysv semaphores always requires syscalls and a
> context switch between user and kernel space even when there's no
> contention, and moreover they imply the presence of accessible files.
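
To make the cost being weighed here concrete, a rough sketch (purely
illustrative, not part of the original mail) of a SysV-semaphore-based
lock: every lock and unlock is a semop() syscall even when there is no
contention, whereas an uncontended futex-based mutex stays entirely in
user space.

#include <sys/ipc.h>
#include <sys/sem.h>

/* semid is assumed to refer to a one-element semaphore set created
   elsewhere with semget() and initialized to 1. */
void sem_lock(int semid)
{
    struct sembuf op = { .sem_num = 0, .sem_op = -1, .sem_flg = SEM_UNDO };
    semop(semid, &op, 1);   /* always traps into the kernel */
}

void sem_unlock(int semid)
{
    struct sembuf op = { .sem_num = 0, .sem_op = 1, .sem_flg = SEM_UNDO };
    semop(semid, &op, 1);
}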
> 
> We basically use a chunk of shared memory as storage where variables
> can be added/read/written by the various applications. Since the
> mutexes used to protect the variables are embedded in the same chunk
> of shared memory, applications only need an mmap to access the
> storage.
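
For readers following along, a rough sketch of that kind of layout (all
names here are hypothetical, just to illustrate the description above):
the locks are embedded next to the data in one shared mapping, so a
single mmap() gives an application both the variables and their
mutexes, which would then be initialized once as process-shared (and,
in this discussion, robust) as in the earlier snippet.

#include <pthread.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>

struct shared_var {
    pthread_mutex_t lock;   /* process-shared (robust) mutex, embedded */
    long value;
};

struct storage {
    struct shared_var vars[64];
};

/* Map the storage region; "/storage-example" is a made-up name. */
struct storage *map_storage(void)
{
    int fd = shm_open("/storage-example", O_RDWR | O_CREAT, 0600);
    if (fd < 0) return NULL;
    if (ftruncate(fd, sizeof(struct storage)) < 0) { close(fd); return NULL; }
    void *p = mmap(NULL, sizeof(struct storage), PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);
    return p == MAP_FAILED ? NULL : p;
}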
> 
> Up to now, applications were running in the same pid namespace. Now,
> for some products, we needed to integrate a 3rd party application, and
> this requires a certain degree of isolation, so we opted to
> containerize this application; that is why I asked for clarifications.
> 
> I get your point when you say that sharing robust pthread_mutex_t
> instances violates pid namespace isolation, but you choose the degree
> of isolation by balancing the risks and the benefits. Even if you have
> a new mount namespace, you can decide to bind mount some parts of the
> filesystem to allow access to parts of the host flash, for instance;
> the same could happen with the network.
> 
> Long story short, I'm bringing grist to my own mill here, but I think
> it's not bad to have POSIX robust shared mutexes working across
> different pid namespaces. It would allow users to use a really
> powerful tool (again, grist to my own mill) with containerized
> applications which need it.

Generally we implement nonstandard functionality only on the basis of
strong historical precedent, need by multiple major real-world
applications, lack of cost imposed onto everyone else who doesn't
want/need the functionality, and other similar conditions. On all of
these axes, the thing you're asking for is completely in the opposite
direction.

> If there's any idea on how to achieve this, I'd really work on it:
> limiting the maximum number of pids which can run in a pid namespace,
> for instance, so that some bits of the tid stored in the robust list
> can be used for the ns?

This is something where you're on your own, either writing it or hiring
someone to do so, and maintaining your forks of musl and the kernel.
There is just no way this kind of hack ever belongs upstream.

Rich
