musl - Re: Possible PEBKAC or Bug in musl Semaphores and/or Mutexes

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220311202151.GW7074@brightrain.aerifal.cx>
Date: Fri, 11 Mar 2022 15:21:52 -0500
From: Rich Felker <dalias@...c.org>
To: Gavin Howard <gavin.d.howard@...il.com>
Cc: musl@...ts.openwall.com
Subject: Re: Possible PEBKAC or Bug in musl Semaphores and/or Mutexes

On Fri, Mar 11, 2022 at 12:41:05PM -0700, Gavin Howard wrote:
> Hello,
> 
> If this email seems rushed, it's because I was told musl is close to
> being released and that you all would want to know about this bug
> quickly.
> 
> All of this was tested on musl 1.2.2.
> 
> I'm writing a multi-threaded build system, and I've run into an
> interesting error when I compile with musl (the musl-gcc generated by
> building musl from source).
> 
> Background:
> 
> I have two main threads and as many worker threads as desired.
> 
> Worker Threads: Responsible for actually building targets. Once they are
> spawned, they wait on a semaphore for targets and grab them (with a
> mutex) off of a queue. When worker threads run a child process, they do
> the following:
> 
> * Run the process (whether through posix_spawn or fork()+exec()).
> * Check the return value to make sure the child process was created.
> * Take a mutex.
> * Add the child pid and an fd to communicate with the thread that is the
>   parent of the child to a map.
> * Post a semaphore.
> * Unlock the mutex.
> * Continue executing the target and other targets.
> 
> Main Thread #1: Spawns all of the other threads, waits on the worker
> threads to join, and once they do, does the following:
> 
> * Locks the mutex mentioned above.
> * Sets a flag.
> * Posts the semaphore mentioned above.
> * Unlocks the mutex.
> * Calls pthread_join() on Main Thread #2.
> 
> Main Thread #2: Responsible for sending info about all reaped children
> to the threads that spawned them. Once it is started, it does the
> following:
> 
> * Waits on the semaphore mentioned above.
> * Locks the mutex mentioned above.
> * Checks the flag mentioned above. If it is set, it unlocks the mutex
>   and does a pthread_exit().
> * If the flag is not set, it continues and:
> * Unlocks the mutex.
> * Does a waitpid().
> * Checks the return value of waitpid(). If it's ECHILD, it causes an
>   abort() (for the purposes of debugging this issue).
> * Locks the mutex again.
> * Sends the child status info to the right thread using the map
>   mentioned above.
> * Unlocks the mutex.
> * Repeat the process.
> 
> When running my code under musl and glibc, the abort() mentioned above
> is triggered every so often at the end. This was because I was using an
> atomic variable with the semaphore. I could trigger the problem every
> 10-20 minutes by continually bootstrapping the build system, and it
> always happened at the end of the build, when the semaphore was posted
> to have Main Thread #2 exit.
> 
> The reason the atomic variable did not work is (I suspect) because there
> is no guarantee of compiler or processor ordering with pure,
> memory_order_relaxed atomic variables. I suspect that the processor was
> reordering the instructions such that the semaphore was signalled by
> Main Thread #1 before it set the atomic variable, even though the atomic
> store to the variable was before the semaphore post call in the source.
> The setting of the flag could then come *after* Main Thread #2 did an
> atomic load on it.
> 
> So I replaced it with the scheme I laid out above, and on glibc, the
> issue still triggers; it just takes about 4 hours to do it.
> 
> However, on musl, I can still consistently trigger the issue in less
> than 10 minutes.
> 
> Nevertheless, I still think it is more correct, and here's why: the
> mutex should enforce ordering in the processor. By holding the mutex
> when the semaphore is posted, the post cannot be reordered to before the
> taking of the mutex lock, or after releasing it. Then, because the flag
> is also set while holding the mutex lock, this means that setting the
> flag cannot be reordered to before the lock. This means that the setting
> of the flag always comes before the checking of the flag, even if Main
> Thread #2's wait on the semaphore ends *before* the lock is released.
> 
> This, in turn, means that when Main Thread #2 takes the lock after
> waiting on the semaphore, the flag *will* be set if it has been set at
> all.
> 
> Likewise, since a thread adding a process to the map holds the lock
> while adding the process to the map, and posts the semaphore *while
> holding the lock*, then if Main Thread #2's wait on the semaphore ends
> before the lock is released, it is still forced to wait to take the
> lock. This taking of the lock also forces the waitpid() call to happen
> *after* the releasing of the lock by the thread that is adding a process
> to the map.
> 
> In like manner, the taking and releasing of the lock by the worker
> thread means that the creation of the child process must happen before
> the release of the lock, so I believe that there should *never* be a
> situation where waitpid() is called before a child process is properly
> created.
> 
> I believe all of that together means that there should *never* be a
> situation where waitpid() returns ECHILD. Yet musl does, consistently.
> 
> I've checked the musl source for waitpid(), and it's just a syscall, as
> it should be, so I don't think the bug is in waitpid().
> 
> So after eliminating other possibilities, it seems to me there may be a
> bug in either musl's semaphores, its mutexes, or both.
> 
> Now, I fully admit that I could have done my analysis wrong and still
> have a bug in my code. I admit that this could still be PEBKAC. But even
> in that case, I need the help of the musl community to figure out where
> I did go wrong.
> 
> Thank you for your time.

I don't think there's enough detail to debug your issue based on the
high-level description you gave. The bug is almost surely in something
we can't see. As long as all modification to and reading from the flag
is done with the mutex held, you don't need to worry about data races
on it. 

You should probably look at the strace of your program to see what's
going on with the ECHILD. It's not clear whether you're calling
waidpid with an argument of -1 or a specific pid, but in the latter
case you could look for whether the child was already reaped and
whether the pid was ever valid (from the strace log). Also, be aware
that ignoring SIGCHLD may break waiting for it, in case you're doing
that.

As an aside, overall, the whole setup looks like you were looking for
condition variables but skipped over them for some reason.

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.