musl - Re: __synccall: deadlock and reliance on racy /proc/self/task

Message-ID: <20190210005250.GZ23599@brightrain.aerifal.cx>
Date: Sat, 9 Feb 2019 19:52:50 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com, Alexey Izbyshev <izbyshev@...ras.ru>
Subject: Re: __synccall: deadlock and reliance on racy /proc/self/task

On Sat, Feb 09, 2019 at 10:40:45PM +0100, Szabolcs Nagy wrote:
> * Alexey Izbyshev <izbyshev@...ras.ru> [2019-02-09 21:33:32 +0300]:
> > On 2019-02-09 19:21, Szabolcs Nagy wrote:
> > > * Rich Felker <dalias@...c.org> [2019-02-08 13:33:57 -0500]:
> > > > On Fri, Feb 08, 2019 at 09:14:48PM +0300, Alexey Izbyshev wrote:
> > > > > On 2/7/19 9:36 PM, Rich Felker wrote:
> > > > > >Does it work if we force two iterations of the readdir loop with no
> > > > > >tasks missed, rather than just one, to catch the case of missed
> > > > > >concurrent additions? I'm not sure. But all this makes me really
> > > > > >uncomfortable with the current approach.
> > > > >
> > > > > I've tested with 0, 1, 2 and 3 retries of the main loop if miss_cnt
> > > > > == 0. The test eventually failed in all cases, with 0 retries
> > > > > requiring only a handful of iterations, 1 -- on the order of 100, 2
> > > > > -- on the order of 10000 and 3 -- on the order of 100000.
> > > > 
> > > > Do you have a theory on the mechanism of failure here? I'm guessing
> > > > it's something like this: there's a thread that goes unseen in the
> > > > first round, and during the second round, it creates a new thread and
> > > > exits itself. The exit gets seen (again, it doesn't show up in the
> > > > dirents) but the new thread it created still doesn't. Is that right?
> > > > 
> > > > In any case, it looks like the whole mechanism we're using is
> > > > unreliable, so something needs to be done. My leaning is to go with
> > > > the global thread list and atomicity of list-unlock with exit.
> > > 
> > > yes that sounds possible, i added some instrumentation to musl
> > > and the trace shows situations like that before the deadlock,
> > > exiting threads can even cause old (previously seen) entries to
> > > disappear from the dir.
> > > 
> > Thanks for the thorough instrumentation! Your traces confirm both my theory
> > about the deadlock and unreliability of /proc/self/task.
> > 
> > I'd also done a very light instrumentation just before I got your email, but
> > it took me a while to understand the output I got (see below).
> 
> the attached patch fixes the issue on my machine.
> i don't know if this is just luck.
> 
> the assumption is that if /proc/self/task is read twice such that
> all tids in it seem to be active and caught, then all the active
> threads of the process are caught (no new threads that are already
> started but not visible there yet)

I'm skeptical of whether this should work in principle. If the first
scan of /proc/self/task misses tid J, and during the next scan, tid J
creates tid K then exits, it seems like we could see the same set of
tids on both scans.
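
To make that scenario concrete, here is a toy model in C. It is only a
sketch and not musl code: the struct, the tids 100-103, and all of the
toy_* names are made up for illustration, and "visible" simply stands
in for whether a /proc/self/task scan happens to report an entry.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_TIDS 8

/* Toy model (hypothetical, not musl): a thread can be live and still
 * be absent from what a directory scan reports. */
struct toy_thread {
    int tid;
    bool live;
    bool visible;
};

/* Copy the tids a scan would report into out[]; return how many. */
static size_t toy_scan(const struct toy_thread *t, size_t n, int *out)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++)
        if (t[i].live && t[i].visible)
            out[k++] = t[i].tid;
    return k;
}

/* True if two scan results list the same tids in the same order. */
static bool toy_same(const int *a, size_t na, const int *b, size_t nb)
{
    if (na != nb) return false;
    for (size_t i = 0; i < na; i++)
        if (a[i] != b[i]) return false;
    return true;
}

/* Run the scenario; returns true if both scans agree even though a
 * live thread (K) was reported by neither. */
static bool toy_scenario_hides_thread(void)
{
    struct toy_thread t[MAX_TIDS] = {
        { 100, true,  true  },  /* main thread */
        { 101, true,  true  },  /* ordinary worker */
        { 102, true,  false },  /* tid J: live, but missed by scan 1 */
        { 103, false, false },  /* tid K: not created yet */
    };
    int s1[MAX_TIDS], s2[MAX_TIDS];
    size_t n1 = toy_scan(t, 4, s1);

    /* Between the scans: J creates K, then exits. K is assumed not
     * to be visible to the second scan. */
    t[3].live = true;
    t[2].live = false;

    size_t n2 = toy_scan(t, 4, s2);

    bool k_reported = false;
    for (size_t i = 0; i < n2; i++)
        if (s2[i] == 103) k_reported = true;

    return toy_same(s1, n1, s2, n2) && t[3].live && !k_reported;
}
```

In this model both scans report exactly {100, 101}, so a "two
identical scans" check passes while tid 103 is live and uncaught.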

Maybe it's salvageable though. Since __block_new_threads is true, in
order for this to happen, tid J must have been between the
__block_new_threads check in pthread_create and the clone syscall at
the time __synccall started. The number of threads in such a state
seems to be bounded by some small constant (like 2) times
libc.threads_minus_1+1, computed at any point after
__block_new_threads is set to true, so sufficiently heavy presignaling
(heavier than we have now) might suffice to guarantee that all are
captured. 
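
As a rough sketch of the arithmetic only: presignal_bound and its
factor parameter are hypothetical names, not anything in musl, and the
value 2 is just the "like 2" guess above, not a proven constant.

```c
#include <limits.h>

/* Hypothetical helper, not part of musl: upper bound on the number of
 * threads that could be between the __block_new_threads check and the
 * clone syscall, taken as factor * (threads_minus_1 + 1). Negative
 * inputs give 0; the result saturates at INT_MAX. */
static int presignal_bound(int threads_minus_1, int factor)
{
    if (threads_minus_1 < 0 || factor < 0)
        return 0;
    long long n = (long long)factor * ((long long)threads_minus_1 + 1);
    return n > INT_MAX ? INT_MAX : (int)n;
}
```

With factor 2, a process that has 4 threads (threads_minus_1 == 3)
would get a bound of 8 under this assumption.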

Rich