Message-ID: <20190210005250.GZ23599@brightrain.aerifal.cx>
Date: Sat, 9 Feb 2019 19:52:50 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com, Alexey Izbyshev <izbyshev@...ras.ru>
Subject: Re: __synccall: deadlock and reliance on racy /proc/self/task

On Sat, Feb 09, 2019 at 10:40:45PM +0100, Szabolcs Nagy wrote:
> * Alexey Izbyshev <izbyshev@...ras.ru> [2019-02-09 21:33:32 +0300]:
> > On 2019-02-09 19:21, Szabolcs Nagy wrote:
> > > * Rich Felker <dalias@...c.org> [2019-02-08 13:33:57 -0500]:
> > > > On Fri, Feb 08, 2019 at 09:14:48PM +0300, Alexey Izbyshev wrote:
> > > > > On 2/7/19 9:36 PM, Rich Felker wrote:
> > > > > >Does it work if we force two iterations of the readdir loop with no
> > > > > >tasks missed, rather than just one, to catch the case of missed
> > > > > >concurrent additions? I'm not sure. But all this makes me really
> > > > > >uncomfortable with the current approach.
> > > > >
> > > > > I've tested with 0, 1, 2 and 3 retries of the main loop if miss_cnt
> > > > > == 0. The test eventually failed in all cases, with 0 retries
> > > > > requiring only a handful of iterations, 1 -- on the order of 100, 2
> > > > > -- on the order of 10000 and 3 -- on the order of 100000.
> > > >
> > > > Do you have a theory on the mechanism of failure here? I'm guessing
> > > > it's something like this: there's a thread that goes unseen in the
> > > > first round, and during the second round, it creates a new thread and
> > > > exits itself. The exit gets seen (again, it doesn't show up in the
> > > > dirents) but the new thread it created still doesn't. Is that right?
> > > >
> > > > In any case, it looks like the whole mechanism we're using is
> > > > unreliable, so something needs to be done. My leaning is to go with
> > > > the global thread list and atomicity of list-unlock with exit.
> > >
> > > yes that sounds possible, i added some instrumentation to musl
> > > and the trace shows situations like that before the deadlock,
> > > exiting threads can even cause old (previously seen) entries to
> > > disappear from the dir.
> > >
> > Thanks for the thorough instrumentation! Your traces confirm both my theory
> > about the deadlock and unreliability of /proc/self/task.
> >
> > I'd also done a very light instrumentation just before I got your email, but
> > it took me a while to understand the output I got (see below).
>
> the attached patch fixes the issue on my machine.
> i don't know if this is just luck.
>
> the assumption is that if /proc/self/task is read twice such that
> all tids in it seem to be active and caught, then all the active
> threads of the process are caught (no new threads that are already
> started but not visible there yet)

I'm skeptical of whether this should work in principle. If the first
scan of /proc/self/task misses tid J, and during the next scan, tid J
creates tid K then exits, it seems like we could see the same set of
tids on both scans.

Maybe it's salvageable though. Since __block_new_threads is true, in
order for this to happen, tid J must have been between the
__block_new_threads check in pthread_create and the clone syscall at
the time __synccall started. The number of threads in such a state
seems to be bounded by some small constant (like 2) times
libc.threads_minus_1+1, computed at any point after
__block_new_threads is set to true, so sufficiently heavy presignaling
(heavier than we have now) might suffice to guarantee that all are
captured.

Rich
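
The "read /proc/self/task twice" check discussed above can be illustrated
with a minimal sketch, assuming a Linux system with /proc mounted. This is
not musl's __synccall code; the names scan_tids, same_tid_set and cmp_int
are hypothetical and exist only for illustration. The sketch takes two
scans of the task directory and compares the resulting tid sets. The
comparison only sees what each readdir pass reported, so a tid J absent
from both scans, or a tid K created by J but not yet visible, leaves the
two sets identical, which is the concern raised above.

```c
#include <assert.h>
#include <dirent.h>
#include <stdlib.h>
#include <string.h>

/* Comparison callback for qsort over int tids. */
static int cmp_int(const void *a, const void *b)
{
	int x = *(const int *)a, y = *(const int *)b;
	return (x > y) - (x < y);
}

/* Hypothetical helper: one pass over /proc/self/task. Stores up to max
 * tids in out, sorted ascending. Returns the number stored, or -1 if
 * the directory cannot be opened. */
static int scan_tids(int *out, int max)
{
	assert(out && max > 0);
	DIR *d = opendir("/proc/self/task");
	if (!d) return -1;
	int n = 0;
	struct dirent *de;
	while ((de = readdir(d)) != NULL) {
		/* Skip "." and ".."; task entries are decimal tids. */
		if (de->d_name[0] < '0' || de->d_name[0] > '9') continue;
		if (n < max) out[n++] = atoi(de->d_name);
	}
	closedir(d);
	qsort(out, n, sizeof *out, cmp_int);
	return n;
}

/* Hypothetical helper: nonzero if two sorted tid arrays are equal.
 * Equal sets do NOT prove that every live thread was listed. */
static int same_tid_set(const int *a, int na, const int *b, int nb)
{
	return na == nb && memcmp(a, b, (size_t)na * sizeof *a) == 0;
}
```

A caller would invoke scan_tids twice and pass both results to
same_tid_set; a nonzero result is exactly the condition the patch
assumes to be sufficient, and the condition the J/K scenario can satisfy
while still missing a thread.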