Message-Id: <D459LAEJFNZF.2X4U737GVY2P6@ayaya.dev>
Date: Fri, 13 Sep 2024 17:34:01 +0200
From: "alice" <alice@...ya.dev>
To: <musl@...ts.openwall.com>, "Lukas Zeller" <luz@...n44.ch>
Subject: Re: SIGSEGV/stack overflow in pthread_create - race
 condition?

On Fri Sep 13, 2024 at 5:25 PM CEST, Rich Felker wrote:
> On Fri, Sep 13, 2024 at 01:30:00PM +0200, Lukas Zeller wrote:
> > Hello list,
> > 
> > I hope this is the right place to post the following.
> > 
> > Using OpenWrt 22.03 with musl 1.2.3, *some* times, on *some* RPi devices (the faster, the more likely) I get the following:
> > 
> > > Thread 2 "debugtarget" received signal SIGSEGV, Segmentation fault.
> > > [Switching to Thread 4993.5022]
> > > 0xb6ec42f0 in pk_parse_kite_request () from /Volumes/CaseSens/openwrt-2/scripts/../staging_dir/target-arm_cortex-a7+neon-vfpv4_musl_eabi/root-bcm27xx/usr/lib/libpagekite.so.1
> > > (gdb) bt
> > > #0  0xb6ec42f0 in pk_parse_kite_request ()
> > >    from /Volumes/CaseSens/openwrt-2/scripts/../staging_dir/target-arm_cortex-a7+neon-vfpv4_musl_eabi/root-bcm27xx/usr/lib/libpagekite.so.1
> > > #1  0xb6ec457c in pk_parse_pagekite_response ()
> > >    from /Volumes/CaseSens/openwrt-2/scripts/../staging_dir/target-arm_cortex-a7+neon-vfpv4_musl_eabi/root-bcm27xx/usr/lib/libpagekite.so.1
> > > #2  0xb6ec4b1c in pk_connect_ai ()
> > >    from /Volumes/CaseSens/openwrt-2/scripts/../staging_dir/target-arm_cortex-a7+neon-vfpv4_musl_eabi/root-bcm27xx/usr/lib/libpagekite.so.1
> > > #3  0xb6ec8494 in pkm_reconnect_all ()
> > >    from /Volumes/CaseSens/openwrt-2/scripts/../staging_dir/target-arm_cortex-a7+neon-vfpv4_musl_eabi/root-bcm27xx/usr/lib/libpagekite.so.1
> > > #4  0xb6ec79d4 in pkb_check_tunnels ()
> > >    from /Volumes/CaseSens/openwrt-2/scripts/../staging_dir/target-arm_cortex-a7+neon-vfpv4_musl_eabi/root-bcm27xx/usr/lib/libpagekite.so.1
> > > #5  0xb6ec7b94 in pkb_run_blocker ()
> > >    from /Volumes/CaseSens/openwrt-2/scripts/../staging_dir/target-arm_cortex-a7+neon-vfpv4_musl_eabi/root-bcm27xx/usr/lib/libpagekite.so.1
> > > #6  0xb6fd0af4 in start (p=0xb6adfd68) at src/thread/pthread_create.c:203
> > > #7  0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > > #8  0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > > #9  0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > > #10 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > > #11 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > > #12 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > > #13 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > > #14 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > > #15 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > > #16 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > > #17 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > > #18 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > > #19 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > > #20 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > > #21 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > > #22 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > > #23 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > > #24 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > > #25 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > > #26 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > > [... thousands of iterations ...]
> > 
> > Searching the internet i found that this is not specific to my
> > setup, OpenWrt or libpagekite, but happens in different, otherwise
> > completely unrelated setups, such as
> > https://github.com/mikebrady/shairport-sync/issues/388 or
> > https://github.com/void-linux/void-packages/issues/980.
> > 
> > I could not spot any conclusive findings - in the second example,
> > apparently they just made the stack bigger to "solve" it, which
> > indicates that maybe the race can come to a benign end eventually
> > and unwind the stack before it explodes.
>
> Why do you expect this is a race condition? The backtrace is not
> sufficient to show it, but my default assumption would just be that
> this is just a stack overflow in the application code, i.e. allocating
> too much on the stack (in automatic storage local variables).
>
> You can increase the default stack size at link time with
> -Wl,-z,stack-size=N where N is the size you want (default 128k so
> increase from there), or make the program explicitly request the
> amount of space it needs with pthread attribute functions.
>
> Rich

something that is probably leading to confusion here is the infinite backtrace
in clone. iirc this was related to the lack of cfi directives on arm for gdb to
unwind correctly? or old debugger, it's been a long time..

the actual code does not loop there forever; all the actual stack use is in the
frames above in the application. it's just an unwinder shortcoming.

something you can check in this case (from memory with lldb on a coredump or the
application when it crashes):

f 0 #crashed frame
reg read $sp #stack pointer
f 7 #clone entry
reg read $sp #stack pointer

and then subtract the two values you get for $sp. you'll see something like
'131000' or bigger (128k is 131072 bytes); and that means the code overflowed
the 128k default musl stack size.
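
to make the arithmetic concrete, here is a minimal sketch in C. the helper names and the addresses in the usage note are made up for illustration (they are not taken from the backtrace above); the only fact it relies on is that the stack grows downwards on ARM, so the sp at thread entry is the higher of the two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 128k default thread stack size, as mentioned in this thread */
#define DEFAULT_THREAD_STACK (128u * 1024u)

/* Hypothetical helper: bytes of stack between the thread entry frame and
 * the crashed frame. The stack grows down, so sp_entry >= sp_crash. */
static size_t stack_used(uintptr_t sp_entry, uintptr_t sp_crash)
{
    assert(sp_entry >= sp_crash);
    return (size_t)(sp_entry - sp_crash);
}

/* Hypothetical helper: nonzero if the usage reaches the given limit. */
static int reaches_limit(uintptr_t sp_entry, uintptr_t sp_crash, size_t limit)
{
    return stack_used(sp_entry, sp_crash) >= limit;
}
```

for example, with the invented values sp_entry = 0xb6ae0000 and sp_crash = 0xb6ac0000, stack_used() gives 0x20000 = 131072 bytes, which is exactly 128k, so reaches_limit(..., DEFAULT_THREAD_STACK) is true. the usable stack is a bit smaller than the nominal size in practice, so a value slightly under 131072 can already mean an overflow.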

if you compile the project with gcc -fstack-usage, you can probably look at the
generated .su files and find a function that takes like 400k of stack on entry
(and causes this overflow)
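
if the big function turns out to be legitimate, the "pthread attribute functions" route Rich mentioned could look roughly like the sketch below. run_with_stack and big_frame_worker are hypothetical names for illustration, not anything from libpagekite; the pthread_attr_* calls are the standard POSIX ones. the 256k buffer just stands in for whatever large locals the .su files point at.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical helper: run fn(arg) on a new thread with an explicit stack
 * size and wait for it. Returns 0 on success or a pthread error number. */
static int run_with_stack(void *(*fn)(void *), void *arg, size_t stack_size)
{
    pthread_attr_t attr;
    pthread_t thread;
    int err;

    assert(fn != NULL);

    err = pthread_attr_init(&attr);
    if (err)
        return err;

    /* Request the stack size explicitly instead of relying on the default. */
    err = pthread_attr_setstacksize(&attr, stack_size);
    if (!err)
        err = pthread_create(&thread, &attr, fn, arg);
    pthread_attr_destroy(&attr);
    if (err)
        return err;

    return pthread_join(thread, NULL);
}

/* Hypothetical worker with a large automatic buffer (256k), more than a
 * 128k default stack could hold. */
static void *big_frame_worker(void *arg)
{
    volatile char buf[256 * 1024];

    buf[0] = 1;
    buf[sizeof buf - 1] = 1;
    return arg;
}
```

calling run_with_stack(big_frame_worker, NULL, 1024 * 1024) gives the worker 1M of stack, which leaves room for the 256k buffer. the size to pick is an assumption here; in a real program it should come from the actual usage you measured.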
