Message-ID: <20180530005009.GM1392@brightrain.aerifal.cx>
Date: Tue, 29 May 2018 20:50:09 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Re: pthread cancel cleanup and pthread_mutex_lock

On Wed, May 30, 2018 at 10:06:17AM +1000, Patrick Oppenlander wrote:
> I accidentally hit send before I finished typing..
> 
> > I've recently been running some of the open posix testsuite tests from
> > the linux test project.
> >
> > One particular test has been giving me headaches:
> > https://github.com/linux-test-project/ltp/blob/master/testcases/open_posix_testsuite/conformance/interfaces/pthread_mutex_init/1-2.c
> >
> > There are a couple of different tests in there but the most
> > interesting one is the deadlock test which does the following:
> >
> > Thread A:          Thread B:
> > pthread_create
> >                    pthread_cleanup_push(...)
> >                    pthread_mutex_lock(M)
> >                    pthread_setcanceltype(ASYNC)
> >                    pthread_setcancelstate(ENABLE)
>                      pthread_mutex_lock(M) <-- blocks here
>   pthread_cancel(B)
>   pthread_join(B)
> 
> The test then expects the cleanup handler to run and unlock mutex M
> allowing thread B to run to completion and the join to succeed.

This test is invalid. pthread_mutex_lock is not async-cancel-safe and
cannot legally be called while cancel type is async.

FYI something like 50% of the "Open POSIX Test Suite" tests are
invalid; in the majority of cases they're testing some property after
undefined behavior has already been invoked, as is the case here.

> I've run this test with musl, glibc and on some different platforms
> with varying results:
> 
> x86_64 linux 4.16.11, glibc: test runs to completion
> x86_64 linux 4.16.11, musl: deadlock (cleanup handler doesn't run)
> arm linux 4.16.5, musl: test runs to completion

The test is invalid in other ways too, involving races. It attempts to
use sched_yield to ensure that the test thread enters
pthread_mutex_lock a second time, but there's no reason to expect that
to work, especially if there are at least as many cores as running
threads. I suspect the different behaviors come down to different
scheduling properties due to performance differences, or something
like that. Naively, I would expect the test to "work" despite being
invalid.

> I'm not even sure that this test is valid -- I can't find any
> documentation which says that pthread_mutex_lock is a cancellation
> point, or that you're allowed to call pthread_mutex_unlock from an
> async cancel handler.

You can call anything you want from an async cancel handler, but you
can't call any libc functions except the ones controlling cancellation
(pthread_cancel, pthread_setcancelstate, pthread_setcanceltype) while
cancel type is async. Basically, all you can do while cancel type is
async is pure computation.
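
For illustration, here is a minimal sketch of a pattern that stays within
those rules. It is not taken from the LTP test or from musl; the names
worker, unlock_handler, run_cancel_demo and the semaphore "ready" are
hypothetical, and it assumes a Linux-style system where unnamed POSIX
semaphores (sem_init) are available. The mutex is locked while cancel type
is still deferred, async cancellation is enabled only around a loop that
calls no libc functions, and a semaphore (rather than sched_yield) tells
the main thread when the worker has taken the lock.

```c
#include <pthread.h>
#include <semaphore.h>

static pthread_mutex_t M = PTHREAD_MUTEX_INITIALIZER;
static sem_t ready;        /* posted once the worker holds M */
static int handler_ran;    /* read by the main thread only after pthread_join */

/* Cleanup handler: runs in the worker when it is cancelled. */
static void unlock_handler(void *arg)
{
    handler_ran = 1;
    pthread_mutex_unlock(arg);
}

static void *worker(void *unused)
{
    volatile unsigned long spins = 0;
    int old;

    (void)unused;

    /* Cancel type is still the default (deferred) here, so calling
     * pthread_mutex_lock and sem_post is allowed. Neither is a
     * cancellation point, so a pending cancel cannot act yet. */
    pthread_mutex_lock(&M);
    pthread_cleanup_push(unlock_handler, &M);
    sem_post(&ready);

    /* From here on, only cancel-control functions may be called.
     * A cancel request that is already pending is acted on now. */
    pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, &old);

    for (;;)
        spins++;           /* pure computation, no libc calls */

    /* Not reached; pop must pair lexically with push. */
    pthread_cleanup_pop(1);
    return NULL;
}

/* Returns 0 if the worker was cancelled, the handler ran, and M was
 * left unlocked; nonzero otherwise. */
int run_cancel_demo(void)
{
    pthread_t t;
    void *res;

    handler_ran = 0;
    if (sem_init(&ready, 0, 0) != 0)
        return -1;
    if (pthread_create(&t, NULL, worker, NULL) != 0)
        return -1;

    sem_wait(&ready);      /* worker now holds M, handler is pushed */
    pthread_cancel(t);
    if (pthread_join(t, &res) != 0)
        return -1;
    sem_destroy(&ready);

    if (res != PTHREAD_CANCELED || !handler_ran)
        return 1;
    if (pthread_mutex_trylock(&M) != 0)
        return 2;          /* handler failed to release M */
    pthread_mutex_unlock(&M);
    return 0;
}
```

Calling run_cancel_demo() from main (built with something like
cc -pthread) should return 0 on a conforming implementation. Unlike the
LTP test, the outcome does not depend on scheduling: the cancel request is
either already pending when the worker switches to async, or it arrives
during the loop.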

> However, it's still concerning to see different results on different platforms.
> 
> What's the expected behaviour here?

Nothing meaningful.

Rich
