musl - Re: bug in pthread_cond

Message-ID: <20140813123416.GJ12888@brightrain.aerifal.cx>
Date: Wed, 13 Aug 2014 08:34:16 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: bug in pthread_cond_broadcast

On Wed, Aug 13, 2014 at 09:00:56AM +0200, Jens Gustedt wrote:
> On Tuesday, 12.08.2014, at 20:30 -0400, Rich Felker wrote:
> > On Wed, Aug 13, 2014 at 12:50:19AM +0200, Jens Gustedt wrote:
> > > The signalling or broadcasting thread (waker) should do most of the
> > > bookkeeping on the waiter counts. This might be done by
> > > 
> > >  - lock _c_lock
> > > 
> > >  - if there are no waiters, unlock _c_lock and quit
> > > 
> > >  - requeue the wanted number of threads (1 or everybody) from the cnd
> > >    to the mtx. requeue tells us how many threads have been requeued,
> > >    and this lets us deduce the number of threads that have been woken
> > >    up.
> > 
> > If you requeue here, where does any wake happen?
> > 
> > >  - verify that all wanted waiters are in, otherwise repeat the requeue
> > >    operation. (this should be a rare event)
> > 
> > This step is not possible. One or more waiters could be in signal
> > handlers which interrupted the wait,
> 
> yes, but only one waiter at a time can be in the initial phase of
> the wait; waiters always hold the mutex in question. So the waiters
> you are talking about are basically the ones that already released the
> mutex and are going into the futex-wait. There should be no signal
> handler waiting for an event coming from such a thread.

I mean signal handler in the sense of signal.h. The only way to
guarantee this would be to block signals during this interval, but
there's no way to atomically unblock them before going into the futex
wait, where they need to be unblocked, since the wait could last
arbitrarily long. Anyway the likely case is that the signal arrives
_while_ in the futex wait and thereby causes the wait to be
interrupted and restarted later.

Technically there is unbounded time between the interruption and
restart, but it's reasonable for one thread that's stuck in a signal
handler that's interrupted a non-AS-safe function to block forward
progress in other threads, so on further consideration I don't think
your retry-loop idea is invalid.
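
For concreteness, here is a rough sketch of what a waker-side retry loop
might look like. It is not musl code. The function name, the max_tries
bound, and the use of the cond word's current value as the compare value
are all assumptions made for illustration; only the FUTEX_CMP_REQUEUE
call itself is the real Linux interface (GCC/Clang atomics assumed).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <sched.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: try to move `want` waiters from cond_word to
 * mutex_word without waking any of them. Gives up after max_tries
 * attempts so the sketch cannot spin forever. Returns the number of
 * waiters moved so far, or -1 on an unexpected error. */
static int requeue_with_retry(int *cond_word, int *mutex_word,
                              int want, int max_tries)
{
    assert(cond_word && mutex_word && want >= 0);
    int moved = 0;
    for (int i = 0; i < max_tries && moved < want; i++) {
        int seen = __atomic_load_n(cond_word, __ATOMIC_SEQ_CST);
        /* nr_wake = 0, nr_requeue = want - moved. For this op the
         * timeout argument slot carries nr_requeue. The return value
         * is the number of waiters woken plus requeued. */
        long r = syscall(SYS_futex, cond_word,
                         FUTEX_CMP_REQUEUE | FUTEX_PRIVATE_FLAG,
                         0L, (long)(want - moved), mutex_word, (long)seen);
        if (r < 0) {
            if (errno == EAGAIN)
                continue;   /* cond word changed under us; reload it */
            return -1;
        }
        moved += (int)r;
        /* A waiter sitting in a signal handler is not on the futex
         * queue, so it cannot be requeued yet. Yield and try again. */
        if (moved < want)
            sched_yield();
    }
    return moved;
}
```

A real implementation would hold _c_lock around this and would have no
max_tries bound, which is exactly where the unbounded delay comes from.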

> So basically you can assume that waiters have done their part of the
> bookkeeping when you are in that situation.

It would be possible to ensure that they have finished all their
bookkeeping (although mildly expensive, via syscalls to block signals)
but it's not possible to ensure that they are actually in the futex
wait syscall and able to receive requeues or wakes.
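
A minimal sketch of the waiter side, to show where the gap is. The
struct and function names are made up for illustration and this is not
how musl lays out its condition variable. sigprocmask is used to keep
the example free of extra link dependencies; it affects only the calling
thread on Linux, and a threaded program would normally use
pthread_sigmask.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <signal.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical waiter state, for illustration only. */
struct waiter_state {
    int waiters;  /* bookkeeping count maintained by waiters */
    int seq;      /* futex word the waiter sleeps on */
};

/* Returns 0 if woken, otherwise the errno from the futex wait. */
static int wait_sketch(struct waiter_state *s, int seq_seen)
{
    assert(s != NULL);
    sigset_t all, old;

    /* Block signals so the bookkeeping cannot be interrupted. */
    sigfillset(&all);
    sigprocmask(SIG_BLOCK, &all, &old);
    __atomic_fetch_add(&s->waiters, 1, __ATOMIC_SEQ_CST);
    /* Signals must be unblocked before sleeping, since the wait may
     * last arbitrarily long. */
    sigprocmask(SIG_SETMASK, &old, NULL);

    /* Window: a handler can run right here. The bookkeeping says this
     * thread is a waiter, but it is not yet on the futex queue, so a
     * requeue or wake issued now will not find it. */

    long r = syscall(SYS_futex, &s->seq,
                     FUTEX_WAIT | FUTEX_PRIVATE_FLAG,
                     (long)seq_seen, NULL, NULL, 0L);
    return r == 0 ? 0 : errno;
}
```

If seq has already changed by the time the wait is issued, the call
returns EAGAIN immediately rather than sleeping.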

BTW I'm not sure what happens when a signal interrupts a wait that's
been requeued. It could be one of three things:

- Restarting the wait on the original futex address, which the
  application would necessarily have to arrange to contain a new value
  so that it fails with EAGAIN.

- Restarting the wait on the requeued address via poking at syscall
  argument values or use of a "restart block" containing the state for
  the interrupted syscall.

- EINTR and letting the application handle it.

Which one of these happens seems like it could make a big difference
to what usage patterns are valid, and I fear the behavior may differ
between kernel versions...
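
Whichever of these the kernel does, a caller can at least classify the
result of each wait and re-check its own state afterwards. The sketch
below shows that defensive pattern. The enum and function names are
assumptions for illustration, and the code does not determine which of
the three behaviors a given kernel implements.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical classification of one FUTEX_WAIT attempt. */
enum wait_outcome { WOKEN, VALUE_CHANGED, INTERRUPTED, FAILED };

static enum wait_outcome futex_wait_once(int *addr, int expected)
{
    assert(addr != NULL);
    long r = syscall(SYS_futex, addr, FUTEX_WAIT | FUTEX_PRIVATE_FLAG,
                     (long)expected, NULL, NULL, 0L);
    if (r == 0)
        return WOKEN;
    if (errno == EAGAIN)
        return VALUE_CHANGED;  /* *addr != expected at wait time */
    if (errno == EINTR)
        return INTERRUPTED;    /* a signal cut the wait short */
    return FAILED;
}

/* Sleep until *addr no longer holds `expected`. Returns 0 on success
 * and -1 on an unexpected error. */
static int wait_until_changed(int *addr, int expected)
{
    for (;;) {
        switch (futex_wait_once(addr, expected)) {
        case VALUE_CHANGED:
            return 0;
        case WOKEN:
        case INTERRUPTED:
            /* Either may be spurious from the caller's point of view,
             * so re-check the word before deciding. */
            if (__atomic_load_n(addr, __ATOMIC_SEQ_CST) != expected)
                return 0;
            break;
        case FAILED:
            return -1;
        }
    }
}
```

Note that this loop always re-waits on the original address, so it only
helps if the waker changes that word before requeueing.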

Rich