musl - Re: pthread_mutex_t shared between processes with different pid namespaces

Message-ID: <433bc1021c8bdcc1c2b17c5fd58d6e19ec144624.camel@gmail.com>
Date: Tue, 11 Feb 2025 14:53:22 +0100
From: Daniele Personal <d.dario76@...il.com>
To: Rich Felker <dalias@...c.org>
Cc: Florian Weimer <fweimer@...hat.com>, musl@...ts.openwall.com
Subject: Re: pthread_mutex_t shared between processes with different
 pid namespaces

On Tue, 2025-02-11 at 06:38 -0500, Rich Felker wrote:
> On Tue, Feb 11, 2025 at 10:34:30AM +0100, Daniele Personal wrote:
> > On Mon, 2025-02-10 at 13:14 -0500, Rich Felker wrote:
> > > On Mon, Feb 10, 2025 at 05:12:52PM +0100, Daniele Personal wrote:
> > > > On Sat, 2025-02-08 at 09:52 -0500, Rich Felker wrote:
> > > > > On Sat, Feb 08, 2025 at 03:40:18PM +0100, Daniele Dario
> > > > > wrote:
> > > > > > Il sab 8 feb 2025, 13:39 Rich Felker <dalias@...c.org> ha
> > > > > > scritto:
> > > > > > 
> > > > > > > On Sat, Feb 08, 2025 at 10:20:45AM +0100, Daniele Dario
> > > > > > > wrote:
> > > > > > > > But wouldn't this mean that robust mutexes
> > > > > > > > functionality is
> > > > > > > > totally
> > > > > > > > incompatible with pid namespaces?
> > > > > > > 
> > > > > > > No, only with trying to synchronize *across* different
> > > > > > > pid
> > > > > > > namespaces.
> > > > > > > 
> > > > > > > > If the kernel relies on tid stored in memory by the
> > > > > > > > process
> > > > > > > > this always
> > > > > > > > lacks the information about the pid namespace the tid
> > > > > > > > belongs
> > > > > > > > to.
> > > > > > > 
> > > > > > > It's necessarily within the same pid namespace as the
> > > > > > > process
> > > > > > > itself.
> > > > > > > 
> > > > > > > Functionally, you should consider different pid
> > > > > > > namespaces as
> > > > > > > different systems that happen to be capable of sharing
> > > > > > > some
> > > > > > > resources.
> > > > > > > 
> > > > > > > Rich
> > > > > > > 
> > > > > > 
> > > > > > Yes, I'm just saying that sharing pthread_mutex_t instances
> > > > > > across
> > > > > > processes within the same pid namespace but on a system
> > > > > > with
> > > > > > more
> > > > > > than a
> > > > > > pid namespace could lead to issues anyway if the stored tid
> > > > > > value
> > > > > > is used
> > > > > > by the kernel as who to contact without the knowledge of on
> > > > > > which
> > > > > > pid
> > > > > > namespace.
> > > > > > 
> > > > > > > I'm not saying this is true, I'm trying to understand and if
> > > > > > possible,
> > > > > > improve things.
> > > > > 
> > > > > That's not a problem. The stored tid is used only in the
> > > > > context
> > > > > of a
> > > > > process exiting, where the kernel code knows the relevant pid
> > > > > namespace (the one the exiting process is in) and uses the
> > > > > tid
> > > > > relative to that. If it didn't work this way, it would be a
> > > > > fatal
> > > > > bug
> > > > > in the pid namespace implementation, which is supposed to
> > > > > allow
> > > > > essentially transparent containerization (which includes
> > > > > processes in
> > > > > the ns being able to use their tids as they could if they
> > > > > were
> > > > > outside
> > > > > of any container/in global ns).
> > > > > 
> > > > > Rich
> > > > > 
> > > > 
> > > > So, IIUC, the problem of sharing robust pthread_mutex_t
> > > > instances
> > > > across different pid namespaces is on the user space side which
> > > > is
> > > > not
> > > > able to distinguish clashes on TIDs. In particular, problems
> > > > could
> > > > arise when:
> > > 
> > > No, it is not "on the user side". The user side can be modified
> > > arbitrarily, and, modulo some cost, could surely be made to work
> > > for
> > > non-robust process-shared mutexes. The problem is that the kernel
> > > --
> > > the part which makes them robust -- has to honor the protocol,
> > > and
> > > the
> > > protocol does not admit distinguishing "pid N in ns X" from "pid
> > > N in
> > > ns Y".
> > 
> > Ah, I thought your previous sentence was saying that the kernel is
> > able
> > to make this distinction.
> 
> No, it's able to make the *assumption* that the namespace the tid is
> relative to is that of the dying process. That's what lets it work
> (and a large part of why namespaces were practical to add to Linux to
> begin with -- all of the existing interfaces that use pids/tids need
> to know which namespace you're talking about, but they work because
> the kernel can assume "same namespace as the executing task").
> 
> > Unfortunately it is not possible to say which variables need cross-
> > ns
> > locking and which not. This means that we should treat all in the
> > same
> > way and so replace all the mutexes with sysv semaphores but this
> > has
> > some costs: locking sysv semaphores always requires syscalls and
> > context
> > switch between user/kernel spaces even if there's no contention and
> > moreover, they imply the presence of accessible files.
> > 
> > We basically use a chunk of shared memory as a storage where
> > variables
> > could be added/read/written by the various applications. Since
> > mutexes
> > used to protect the variables are embedded in the same chunk of
> > shared
> > memory, there is only an mmap needed in order to access the storage
> > by
> > applications.
> > 
> > Up to now, applications were running in the same pid namespace but
> > now,
> > for some products, we needed to integrate a 3rd party application
> > and
> > this requires a certain degree of isolation so we opted to
> > containerize
> > this application and here we come to why I asked for
> > clarifications.
> > 
> > I get your point when you say that sharing robust pthread_mutex_t
> > instances violates the pid namespace isolation but you choose the
> > degree of isolation balancing the risks and the benefits. Even if
> > you
> > have a new mount namespace you can decide to bind mount some parts
> > of
> > the filesystem to allow access to parts of the host flash for
> > instance,
> > same could happen with network.
> > 
> > Long story short, I'm pulling water to my mill, but I think that
> > it's
> > not bad to have posix robust shared mutexes working across
> > different
> > pid namespaces. It will allow users to use a really powerful tool
> > also
> > with containerized applications (again pulling water to my mill)
> > which
> > need it.
> 
> Generally we implement nonstandard functionality only on the basis of
> strong historical precedent, need by multiple major real-world
> applications, lack of cost imposed onto everyone else who doesn't
> want/need the functionality, and other similar conditions. On all of
> these axes, the thing you're asking for is completely in the opposite
> direction.
> 
> > If there's any idea on how to gain this I'd really work on it:
> > limiting
> > the max number of pids which could run on a pid namespace to allow
> > the
> > use of some bits for the ns in the tid stored in the robust list
> > for
> > instance?
> 
> This is something where you're on your own either writing it or
> hiring
> someone to do so and maintaining your forks of musl and the kernel.
> There is just no way this kind of hack ever belongs upstream.
> 
> Rich

Thanks for the time you spent on this, I really appreciate it.

Daniele.
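
For readers following the thread, here is a minimal sketch of the setup under discussion: a robust, process-shared pthread_mutex_t embedded in a chunk of shared memory, with an owner that dies while holding the lock. It is only an illustration under stated assumptions, not code from the thread: `struct shared_store` and `demo_owner_died` are hypothetical names, an anonymous mapping stands in for the real shared storage, and both processes live in the same pid namespace, which is the case that works. Sharing the same mutex across pid namespaces is the case described above as unsupported.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <pthread.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical layout: the mutex sits in the same shared chunk as the data. */
struct shared_store {
    pthread_mutex_t lock;
    int value;
};

/*
 * Returns 1 if the parent saw EOWNERDEAD and recovered the mutex,
 * 0 if the lock was acquired normally, -1 on a setup error.
 */
int demo_owner_died(void)
{
    struct shared_store *s = mmap(NULL, sizeof *s, PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (s == MAP_FAILED)
        return -1;

    pthread_mutexattr_t attr;
    if (pthread_mutexattr_init(&attr) != 0) {
        munmap(s, sizeof *s);
        return -1;
    }
    /* Process-shared and robust: the kernel marks the lock when its owner dies. */
    if (pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED) != 0 ||
        pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST) != 0 ||
        pthread_mutex_init(&s->lock, &attr) != 0) {
        pthread_mutexattr_destroy(&attr);
        munmap(s, sizeof *s);
        return -1;
    }
    pthread_mutexattr_destroy(&attr);
    s->value = 0;

    pid_t pid = fork();
    if (pid < 0) {
        pthread_mutex_destroy(&s->lock);
        munmap(s, sizeof *s);
        return -1;
    }
    if (pid == 0) {
        /* Child: take the lock and exit without releasing it. */
        pthread_mutex_lock(&s->lock);
        s->value = 42;
        _exit(0);
    }

    /* Parent: wait until the child is gone, then try to take the lock. */
    waitpid(pid, NULL, 0);

    int result = -1;
    int rc = pthread_mutex_lock(&s->lock);
    if (rc == EOWNERDEAD) {
        /* Previous owner died; the protected data would be checked here. */
        pthread_mutex_consistent(&s->lock);
        result = 1;
    } else if (rc == 0) {
        result = 0;
    }
    if (result >= 0)
        pthread_mutex_unlock(&s->lock);

    pthread_mutex_destroy(&s->lock);
    munmap(s, sizeof *s);
    return result;
}
```

The owner recorded in the lock word is a plain tid, which the kernel interprets relative to the exiting task's own pid namespace. That is why this recovery path cannot tell "tid N in ns X" from "tid N in ns Y" once the mutex is shared across namespaces.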