musl - Re: Re: FYI: some observations when testing next-gen malloc

Message-ID: <82b69741-72e6-ab53-c523-ce4e1e7dc98e@wwcom.ch>
Date: Mon, 9 Mar 2020 19:14:59 +0100
From: Pirmin Walthert <pirmin.walthert@...om.ch>
To: musl@...ts.openwall.com
Subject: Re: Re: FYI: some observations when testing next-gen malloc

On 09.03.20 at 18:12, Rich Felker wrote:
> On Mon, Mar 09, 2020 at 05:49:02PM +0100, Pirmin Walthert wrote:
>> Dear Rich,
>>
>> First of all many thanks for your brilliant C library.
>>
>> As I do not know whether the musl mailing list is already the right
>> place to discuss the next-gen malloc module, I decided to send you
>> my observations directly.
> It is, so I'm cc'ing the list now.
>
>> I'd like to mention that I am not yet entirely sure whether the
>> following is a problem with the new malloc code or with asterisk
>> itself but maybe you can already keep the following in the back of
>> your head if someone else is reporting similar behavior with a
>> different application:
>>
>> We use asterisk (16.7) in a musl libc based distribution and for
>> some operations asterisk forks (in a thread) the main process to
>> execute a system command. When using libmallocng.so (newest version
>> with "fix race condition in lock-free path of free" applied, but
>> also seen before that change) some of these forked child processes
>> will hang during a call to pthread_mutex_unlock.
>>
>> Unfortunately the backtrace is not of much help I guess, but the
>> child process always seems to hang on pthread_mutex_unlock. So
>> something seems to happen with the mutex on fork:
>>
>> #0  0x00007f2152a20092 in pthread_mutex_unlock () from
>> /lib/ld-musl-x86_64.so.1
>> No symbol table info available.
>> #1  0x0000000000000008 in ?? ()
>> No symbol table info available.
>> #2  0x0000000000000000 in ?? ()
>> No symbol table info available.
>>
>> I will for sure try to dig into this further. For the moment the
>> only thing I know is that I did not yet observe this on any of the
>> several hundred systems with musl 1.1.23 (same asterisk version),
>> not on any of the around 5 with 1.2.0 (same asterisk version, old
>> malloc) but quite frequently on the two systems with 1.1.24 and
>> libmallocng.so.
> This is completely expected and should happen with old or new malloc.
> I'm surprised you haven't hit it before. After a multithreaded process
> calls fork, the child inherits a state where locks may be permanently
> held. See https://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html
>
>      - A process shall be created with a single thread. If a
>        multi-threaded process calls fork(), the new process shall
>        contain a replica of the calling thread and its entire address
>        space, possibly including the states of mutexes and other
>        resources. Consequently, to avoid errors, the child process may
>        only execute async-signal-safe operations until such time as one
>        of the exec functions is called.
>
> It's not described very rigorously, but effectively it's in an async
> signal context and can only call functions which are AS-safe.
>
> A future version of the standard is expected to drop the requirement
> that fork itself be async-signal-safe, and may thereby add
> requirements to synchronize against some or all internal locks so that
> the child can inherit a working context. But the right solution here is
> always to stop using fork without exec.
>
> Rich

Well, I have now changed the code a bit to make sure that no
async-signal-unsafe function is called before execl. I removed the
calls to cap_from_text, cap_set_proc and cap_free, as well as the
call to sched_setscheduler. Now the only thing executed before execl
in the child process is closefrom().
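
For illustration, a minimal sketch of a child that restricts itself to
async-signal-safe calls between fork and execl might look like the
following. The helper name run_shell_command and the max_fd parameter
are hypothetical and not taken from asterisk. closefrom() is not
specified by POSIX, so whether it is safe here depends on the
implementation; the sketch uses a plain close() loop instead, with the
upper bound computed by the caller before fork.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `cmd` via /bin/sh and return its exit status,
 * or -1 on failure. max_fd is an assumed upper bound on open descriptors,
 * determined by the caller BEFORE fork. */
static int run_shell_command(const char *cmd, int max_fd)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child of a possibly multithreaded parent: only
         * async-signal-safe calls (close, execl, _exit) from here on.
         * No malloc, no stdio, no locks. */
        for (int fd = 3; fd < max_fd; fd++)
            close(fd);
        execl("/bin/sh", "sh", "-c", cmd, (char *)NULL);
        _exit(127); /* reached only if execl failed */
    }

    /* Parent: wait for the child and report its exit status. */
    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With this sketch, run_shell_command("exit 7", 64) would return 7 in
the parent. The child never touches state that another thread of the
parent could have locked at the time of fork.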

However I got a hanging process again:

(gdb) bt full
#0  0x00007f42f649c6da in __syscall_cp_c () from /lib/ld-musl-x86_64.so.1
No symbol table info available.
#1  0x0000000000000000 in ?? ()
No symbol table info available.

Best regards,

Pirmin