kernel-hardening - Re: [PATCH] nsproxy: attach to namespaces via pidfds

Message-ID: <CAG48ez2D36QZU0djiXGbirCgcFeAWA02s8PCk6SWEY5MoKg_kg@mail.gmail.com>
Date: Mon, 27 Apr 2020 21:41:20 +0200
From: Jann Horn <jannh@...gle.com>
To: Christian Brauner <christian.brauner@...ntu.com>
Cc: kernel list <linux-kernel@...r.kernel.org>, Alexander Viro <viro@...iv.linux.org.uk>, 
	Stéphane Graber <stgraber@...ntu.com>, 
	Linux Containers <containers@...ts.linux-foundation.org>, 
	"Eric W . Biederman" <ebiederm@...ssion.com>, Serge Hallyn <serge@...lyn.com>, 
	Aleksa Sarai <cyphar@...har.com>, 
	linux-security-module <linux-security-module@...r.kernel.org>, 
	Kernel Hardening <kernel-hardening@...ts.openwall.com>, Linux API <linux-api@...r.kernel.org>
Subject: Re: [PATCH] nsproxy: attach to namespaces via pidfds

On Mon, Apr 27, 2020 at 8:15 PM Christian Brauner
<christian.brauner@...ntu.com> wrote:
> On Mon, Apr 27, 2020 at 07:28:56PM +0200, Jann Horn wrote:
> > On Mon, Apr 27, 2020 at 4:47 PM Christian Brauner
> > <christian.brauner@...ntu.com> wrote:
[...]
> > > That means
> > > setns(nsfd, CLONE_NEWNET) equals setns(pidfd, CLONE_NEWNET). However,
> > > when a pidfd is passed, multiple namespace flags can be specified in the
> > > second setns() argument and setns() will attach the caller to all the
> > > specified namespaces all at once or to none of them. If 0 is specified
> > > together with a pidfd then setns() will interpret it the same way 0 is
> > > interpreted together with a nsfd argument, i.e. attach to any/all
> > > namespaces.
> > [...]
> > > Apart from significantly reducing the number of syscalls from double
> > > digit to single digit which is a decent reason post-spectre/meltdown
> > > this also allows to switch to a set of namespaces atomically, i.e.
> > > either attaching to all the specified namespaces succeeds or we fail.
> >
> > Apart from the issues I've pointed out below, I think it's worth
> > calling out explicitly that with the current design, the switch will
> > not, in fact, be fully atomic - the process will temporarily be in
> > intermediate stages where the switches to some namespaces have
> > completed while the switches to other namespaces are still pending;
> > and while there will be fewer of these intermediate stages than before,
> > it also means that they will be less explicit to userspace.
>
> Right, that can be fixed by switching to the unshare model of getting a
> new set of credentials and committing it after the nsproxy has been
> installed? Then there shouldn't be an intermediate state anymore or
> rather an intermediate stage where we can still fail somehow.

It still wouldn't be atomic (in the sense of parallelism, not in the
sense of intermediate error handling) though; for example, if task B
does setns(<pidfd_of_task_a>, 0) and task C concurrently does
setns(<pidfd_of_task_b>, 0), then task C may end up with the new mount
namespace of task B but the old user namespace, or something like
that. If C is more privileged than B, that may cause C to have more
privileges through its configuration of namespaces than B does (e.g.
by running in the &init_user_ns but with a mount namespace owned by an
unprivileged user), which C may not expect. Same thing for racing
between unshare() and setns().
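
For context, the userspace side of the call being discussed could look
roughly like the sketch below. It assumes a kernel that accepts a pidfd
as the first setns() argument, as the patch proposes. join_namespaces()
is a hypothetical helper name, and the fallback syscall number is an
assumption that holds for the generic syscall table only.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
/* Assumption: generic syscall number; a few architectures differ. */
#define SYS_pidfd_open 434
#endif

/*
 * Hypothetical helper: join the namespaces of `pid` that are selected
 * by `flags` (e.g. CLONE_NEWNS | CLONE_NEWNET) with one setns() call.
 * Returns 0 on success or a negative errno value.
 */
static int join_namespaces(pid_t pid, int flags)
{
	int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
	int ret = 0;

	if (pidfd < 0)
		return -errno;

	/* One syscall for all requested namespaces instead of one per nsfd. */
	if (setns(pidfd, flags) < 0)
		ret = -errno;

	close(pidfd);
	return ret;
}
```

A caller would use something like join_namespaces(target_pid,
CLONE_NEWNS | CLONE_NEWNET). Nothing in this sketch protects against
the target changing its own namespaces concurrently, which is the race
described above.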

[...]
> > > +               put_user_ns(user_ns);
> > > +       }
> > > +#else
> > > +       if (flags & CLONE_NEWUSER)
> > > +               ret = -EINVAL;
> > > +#endif
> > > +
> > > +       if (!ret && wants_ns(flags, CLONE_NEWNS))
> > > +               ret = __ns_install(nsproxy, mnt_ns_to_common(nsp->mnt_ns));
> >
> > And this one might be even worse, because the mount namespace change
> > itself is only stored in the nsproxy at this point, but the cwd and
> > root paths have already been overwritten on the task's fs_struct.
> >
> > To actually make sys_set_ns() atomic, I think you'd need some
> > moderately complicated prep work, splitting the ->install handlers up
> > into prep work and a commit phase that can't fail.
>
> Wouldn't it be sufficient to move to an unshare like model, i.e.
> creating a new set of creds, and passing the new user_ns to
> create_new_namespaces() as well as having a temporary new_fs struct?
> That should get rid of all intermediate stages.

Ah, good point, I didn't realize that that already exists for unshare().
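
The "build everything on temporaries, then commit" idea can be
illustrated with a toy userspace model. This is only a sketch of the
pattern, not kernel code: struct toy_task, toy_prepare() and
toy_setns_all() are hypothetical names, and the integers stand in for
namespace references.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy stand-in for a task's namespace state; not a kernel structure. */
struct toy_task {
	int user_ns;
	int mnt_ns;
	int net_ns;
};

/*
 * Prepare phase: everything that can fail happens here, and only the
 * staged copy is modified. `allow_mnt` simulates a permission check.
 */
static int toy_prepare(struct toy_task *staged,
		       const struct toy_task *target, bool allow_mnt)
{
	staged->user_ns = target->user_ns;
	if (!allow_mnt)
		return -EPERM; /* simulated failure after a partial update */
	staged->mnt_ns = target->mnt_ns;
	staged->net_ns = target->net_ns;
	return 0;
}

/* Switch all namespaces or none: stage first, then commit. */
static int toy_setns_all(struct toy_task *task,
			 const struct toy_task *target, bool allow_mnt)
{
	struct toy_task staged = *task; /* temporary copy */
	int ret = toy_prepare(&staged, target, allow_mnt);

	if (ret)
		return ret; /* the live task was never touched */

	*task = staged; /* commit: plain assignment, cannot fail */
	return 0;
}
```

This only models atomicity in the error-handling sense. It does nothing
about another task observing or copying the state concurrently, which
is the separate parallelism concern raised earlier in this mail.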