Message-ID: <20240913193233.GC10433@brightrain.aerifal.cx>
Date: Fri, 13 Sep 2024 15:32:33 -0400
From: Rich Felker <dalias@...c.org>
To: Lukas Zeller <luz@...n44.ch>
Cc: alice <alice@...ya.dev>, musl@...ts.openwall.com
Subject: Re: SIGSEGV/stack overflow in pthread_create - race condition?

On Fri, Sep 13, 2024 at 09:26:27PM +0200, Lukas Zeller wrote:
> Hi Alice,
> 
> > On 13 Sep 2024, at 17:34, alice <alice@...ya.dev> wrote:
> > 
> > something that is probably leading to confusion here is the infinite backtrace
> > in clone.
> 
> Confusing indeed ;-)
> 
> > iirc this was related to the lack of cfi directives on arm for gdb to
> > unwind correctly? or old debugger, it's been a long time..
> 
> Oh, thanks - so that might just be a gdb display artifact?
> I am using the toolset of OpenWrt 22.03, which is gcc-11.2.0. So not really old, but definitely not latest.
> 
> > the actual code does not loop there forever;
> 
> Not forever, it obviously eventually exits from clone() into the thread function pkb_run_blocker() via the start() wrapper.
> Which then hits the stack limit. 
> 
> > all the actual stack use is in the
> > frames above in the application. it's just an unwinder shortcoming.
> 
> Maybe, but I'm not sure, see below
> 
> > something you can check in this case (from memory with lldb on a coredump or the
> > application when it crashes):
> > 
> > f 0 #crashed frame
> > reg read $sp #stack pointer
> > f 7 #clone entry
> > reg read $sp #stack pointer
> > 
> > and then subtract the two values you get for $sp. you'll see something like
> > '131000' or bigger; and that means the code overflowed the 128k default musl
> > stack size.
> 
> Ok, here's what I got:
> 
> > (gdb) f 0
> > #0  0xb6ec42f0 in pk_parse_kite_request ()
> >    from /Volumes/CaseSens/openwrt-2/scripts/../staging_dir/target-arm_cortex-a7+neon-vfpv4_musl_eabi/root-bcm27xx/usr/lib/libpagekite.so.1
> > (gdb) info reg $sp
> > sp             0xb6adbbd0          0xb6adbbd0
> > (gdb) f 7
> > #7  0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > 23 bl 3f
> > (gdb) info reg $sp
> > sp             0xb6adfd60          0xb6adfd60
> 
> 0xb6adfd60 - 0xb6adbbd0 = 0x4190 = 16784
> 
> So that child thread has put only 16k on the stack between starting and when it crashes.

That is only going to work if gdb correctly recovered the frame state
for __clone.

> Interestingly, however, the previous (bogus?) stack frames each seem to have consumed 16 bytes.
> This is what I would expect from "stmfd sp!,{r4,r5,r6,r7}" on line 7 of clone.s
> 
> > (gdb) f 8
> > #8  0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > 23 bl 3f
> > (gdb) info reg $sp
> > sp             0xb6adfd70          0xb6adfd70
> > (gdb) f 9
> > #9  0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > 23 bl 3f
> > (gdb) info reg $sp
> > sp             0xb6adfd80          0xb6adfd80
> > (gdb) 
> > 
> > ...
> > 
> > (gdb) f 1000
> > #1000 0xb6fcf22c in __clone () at src/thread/arm/clone.s:23
> > 23 bl 3f
> > (gdb) info reg $sp
> > sp             0xb6ae3b70          0xb6ae3b70
> > (gdb) 
> 
> 0xb6ae3b70 - 0xb6adfd80 = 0x3DF0 -> ~16k consumed by 1000 "frames"
> 
> Unless the shown $sp values are gdb display artifacts as well, these
> "iterations" DO consume stack space and very much look like a real
> recursion happening within __clone().

They are not. It is most definitely gdb misinterpreting the process
state and making up these stack offsets to agree with its
interpretation of the call stack.

If you want to see the original stack pointer for the thread, it will
be very close to the value of the thread pointer and you could look
that up.

Rich
