musl - Re: SIGSEGV/stack overflow in pthread_create

Message-ID: <20240913152522.GA10433@brightrain.aerifal.cx>
Date: Fri, 13 Sep 2024 11:25:23 -0400
From: Rich Felker <dalias@...c.org>
To: Lukas Zeller <luz@...n44.ch>
Cc: musl@...ts.openwall.com
Subject: Re: SIGSEGV/stack overflow in pthread_create - race condition?

On Fri, Sep 13, 2024 at 01:30:00PM +0200, Lukas Zeller wrote:
> Hello list,
> 
> I hope this is the right place to post the following.
> 
> Using OpenWrt 22.03 with musl 1.2.3, *some* times, on *some* RPi devices (the faster, the more likely) I get the following:
> 
> > Thread 2 "debugtarget" received signal SIGSEGV, Segmentation fault.
> > [Switching to Thread 4993.5022]
> > 0xb6ec42f0 in pk_parse_kite_request () from /Volumes/CaseSens/openwrt-2/scripts/../staging_dir/target-arm_cortex-a7+neon-vfpv4_musl_eabi/root-bcm27xx/usr/lib/libpagekite.so.1
> > (gdb) bt
> > #0  0xb6ec42f0 in pk_parse_kite_request ()
> >    from /Volumes/CaseSens/openwrt-2/scripts/../staging_dir/target-arm_cortex-a7+neon-vfpv4_musl_eabi/root-bcm27xx/usr/lib/libpagekite.so.1
> > #1  0xb6ec457c in pk_parse_pagekite_response ()
> >    from /Volumes/CaseSens/openwrt-2/scripts/../staging_dir/target-arm_cortex-a7+neon-vfpv4_musl_eabi/root-bcm27xx/usr/lib/libpagekite.so.1
> > #2  0xb6ec4b1c in pk_connect_ai ()
> >    from /Volumes/CaseSens/openwrt-2/scripts/../staging_dir/target-arm_cortex-a7+neon-vfpv4_musl_eabi/root-bcm27xx/usr/lib/libpagekite.so.1
> > #3  0xb6ec8494 in pkm_reconnect_all ()
> >    from /Volumes/CaseSens/openwrt-2/scripts/../staging_dir/target-arm_cortex-a7+neon-vfpv4_musl_eabi/root-bcm27xx/usr/lib/libpagekite.so.1
> > #4  0xb6ec79d4 in pkb_check_tunnels ()
> >    from /Volumes/CaseSens/openwrt-2/scripts/../staging_dir/target-arm_cortex-a7+neon-vfpv4_musl_eabi/root-bcm27xx/usr/lib/libpagekite.so.1
> > #5  0xb6ec7b94 in pkb_run_blocker ()
> >    from /Volumes/CaseSens/openwrt-2/scripts/../staging_dir/target-arm_cortex-a7+neon-vfpv4_musl_eabi/root-bcm27xx/usr/lib/libpagekite.so.1
> > #6  0xb6fd0af4 in start (p=0xb6adfd68) at src/thread/pthread_create.c:203
> > #7  0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > #8  0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > #9  0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > #10 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > #11 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > #12 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > #13 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > #14 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > #15 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > #16 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > #17 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > #18 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > #19 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > #20 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > #21 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > #22 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > #23 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > #24 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > #25 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > #26 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > [... thousands of iterations ...]
> 
> Searching the internet, I found that this is not specific to my
> setup, OpenWrt or libpagekite, but happens in different, otherwise
> completely unrelated setups, such as
> https://github.com/mikebrady/shairport-sync/issues/388 or
> https://github.com/void-linux/void-packages/issues/980.
> 
> I could not spot any conclusive findings - in the second example,
> apparently they just made the stack bigger to "solve" it, which
> indicates that maybe the race can come to a benign end eventually
> and unwind the stack before it explodes.

Why do you expect this is a race condition? The backtrace is not
sufficient to show one. My default assumption would be that this is
simply a stack overflow in the application code, i.e. allocating too
much on the stack (local variables with automatic storage).

You can increase the default thread stack size at link time with
-Wl,-z,stack-size=N, where N is the size you want in bytes (the
default is 128k, so increase from there), or make the program
explicitly request the amount of space it needs with the pthread
attribute functions.

Rich
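
The second option mentioned above (requesting stack space through the
pthread attribute functions) could look roughly like the following.
This is only a minimal sketch, not code from libpagekite or musl: the
names worker, run_worker_with_stack and WORKER_STACK_SIZE, the 1 MiB
figure and the 256 KiB buffer are all assumptions made for
illustration. Only pthread_attr_init, pthread_attr_setstacksize,
pthread_create, pthread_join and pthread_attr_destroy are standard
POSIX calls.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical value: 1 MiB, chosen only for illustration. */
#define WORKER_STACK_SIZE ((size_t)1024 * 1024)

/* Hypothetical thread function with a large automatic buffer, the kind
 * of allocation that can overflow a 128k default thread stack. */
static void *worker(void *arg)
{
    volatile char buf[256 * 1024];
    size_t i;

    /* Touch one byte per 4 KiB so the stack memory is really used. */
    for (i = 0; i < sizeof buf; i += 4096)
        buf[i] = 1;
    buf[sizeof buf - 1] = 1;

    /* Report success back to the creator through the argument. */
    *(int *)arg = (buf[0] == 1 && buf[sizeof buf - 1] == 1);
    return NULL;
}

/* Start worker() on a thread with an explicitly requested stack size.
 * Returns 1 if the worker ran to completion, -1 on any pthread error. */
int run_worker_with_stack(size_t stack_size)
{
    pthread_attr_t attr;
    pthread_t tid;
    int ok = 0;
    int err;

    if (pthread_attr_init(&attr) != 0)
        return -1;

    /* Request the stack size instead of relying on the default. */
    err = pthread_attr_setstacksize(&attr, stack_size);
    if (err == 0)
        err = pthread_create(&tid, &attr, worker, &ok);

    /* The attribute object is no longer needed once the thread exists. */
    pthread_attr_destroy(&attr);
    if (err != 0)
        return -1;

    if (pthread_join(tid, NULL) != 0)
        return -1;
    return ok;
}
```

A caller would invoke run_worker_with_stack(WORKER_STACK_SIZE) and
check for a return value of 1. The size to request depends on how much
automatic storage the thread's call chain really needs, which the
backtrace in this thread does not reveal.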