kernel-hardening - Re: [PATCH 7/7] proc: Ensure we see the exit of each process tid exactly once

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <87mu9a4obu.fsf@x220.int.ebiederm.org>
Date: Sat, 22 Feb 2020 09:46:29 -0600
From: ebiederm@...ssion.com (Eric W. Biederman)
To: Oleg Nesterov <oleg@...hat.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,  Al Viro <viro@...iv.linux.org.uk>,  LKML <linux-kernel@...r.kernel.org>,  Kernel Hardening <kernel-hardening@...ts.openwall.com>,  Linux API <linux-api@...r.kernel.org>,  Linux FS Devel <linux-fsdevel@...r.kernel.org>,  Linux Security Module <linux-security-module@...r.kernel.org>,  Akinobu Mita <akinobu.mita@...il.com>,  Alexey Dobriyan <adobriyan@...il.com>,  Andrew Morton <akpm@...ux-foundation.org>,  Andy Lutomirski <luto@...nel.org>,  Daniel Micay <danielmicay@...il.com>,  Djalal Harouni <tixxdz@...il.com>,  "Dmitry V . Levin" <ldv@...linux.org>,  Greg Kroah-Hartman <gregkh@...uxfoundation.org>,  Ingo Molnar <mingo@...nel.org>,  "J . Bruce Fields" <bfields@...ldses.org>,  Jeff Layton <jlayton@...chiereds.net>,  Jonathan Corbet <corbet@....net>,  Kees Cook <keescook@...omium.org>,  Solar Designer <solar@...nwall.com>
Subject: Re: [PATCH 7/7] proc: Ensure we see the exit of each process tid exactly once

Oleg Nesterov <oleg@...hat.com> writes:

> On 02/20, Eric W. Biederman wrote:
>>
>> +void exchange_tids(struct task_struct *ntask, struct task_struct *otask)
>> +{
>> +	/* pid_links[PIDTYPE_PID].next is always NULL */
>> +	struct pid *npid = READ_ONCE(ntask->thread_pid);
>> +	struct pid *opid = READ_ONCE(otask->thread_pid);
>> +
>> +	rcu_assign_pointer(opid->tasks[PIDTYPE_PID].first, &ntask->pid_links[PIDTYPE_PID]);
>> +	rcu_assign_pointer(npid->tasks[PIDTYPE_PID].first, &otask->pid_links[PIDTYPE_PID]);
>> +	rcu_assign_pointer(ntask->thread_pid, opid);
>> +	rcu_assign_pointer(otask->thread_pid, npid);
>
> this breaks has_group_leader_pid()...
>
> proc_pid_readdir() can miss a process doing mt-exec but this looks fixable,
> just we need to update ntask->thread_pid before updating ->first.
>
> The more problematic case is __exit_signal() which does
> 		
> 		if (unlikely(has_group_leader_pid(tsk)))
> 			posix_cpu_timers_exit_group(tsk);

Along with the comment:
		/*
		 * This can only happen if the caller is de_thread().
		 * FIXME: this is the temporary hack, we should teach
		 * posix-cpu-timers to handle this case correctly.
		 */
So I suspect this is fixable and the above fix might be part of that.

Hmm looking at your commit:

commit e0a70217107e6f9844628120412cb27bb4cea194
Author: Oleg Nesterov <oleg@...hat.com>
Date:   Fri Nov 5 16:53:42 2010 +0100

    posix-cpu-timers: workaround to suppress the problems with mt exec
    
    posix-cpu-timers.c correctly assumes that the dying process does
    posix_cpu_timers_exit_group() and removes all !CPUCLOCK_PERTHREAD
    timers from signal->cpu_timers list.
    
    But, it also assumes that timer->it.cpu.task is always the group
    leader, and thus the dead ->task means the dead thread group.
    
    This is obviously not true after de_thread() changes the leader.
    After that almost every posix_cpu_timer_ method has problems.
    
    It is not simple to fix this bug correctly. First of all, I think
    that timer->it.cpu should use struct pid instead of task_struct.
    Also, the locking should be reworked completely. In particular,
    tasklist_lock should not be used at all. This all needs a lot of
    nontrivial and hard-to-test changes.
    
    Change __exit_signal() to do posix_cpu_timers_exit_group() when
    the old leader dies during exec. This is not the fix, just the
    temporary hack to hide the problem for 2.6.37 and stable. IOW,
    this is obviously wrong but this is what we currently have anyway:
    cpu timers do not work after mt exec.
    
    In theory this change adds another race. The exiting leader can
    detach the timers which were attached to the new leader. However,
    the window between de_thread() and release_task() is small, we
    can pretend that sys_timer_create() was called before de_thread().
    
    Signed-off-by: Oleg Nesterov <oleg@...hat.com>
    Signed-off-by: Linus Torvalds <torvalds@...ux-foundation.org>

It looks like the data structures need fixing.  Possibly to use struct
pid.  Possibly to move the group data to signal struct.

I think I played with some of that awhile ago.

I am going to move this change to another patchset.  So I don't wind up
playing shift the bug around.  I thought I would need this to get the
other code working but it turns out we remain bug compatible without
this.

Hopefully I can get something out in the next week or so that addresses
the issues you have pointed out.

Eric
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.