musl - Re: Crash in kill(..., SIGHUP) when using SA_ONSTACK

Message-ID: <3201c36ee287e6d38e0f3805440a507de8fb52bf.camel@postmarketos.org>
Date: Thu, 30 May 2024 12:17:59 +0200
From: Pablo Correa Gomez <pabloyoyoista@...tmarketos.org>
To: Rich Felker <dalias@...c.org>
Cc: musl@...ts.openwall.com
Subject: Re: Crash in kill(..., SIGHUP) when using SA_ONSTACK

Hi Rich, thanks a lot for your reply.

On Wed, 29-05-2024 at 09:15 -0400, Rich Felker wrote:
> On Wed, May 29, 2024 at 02:04:25PM +0200, Pablo Correa Gomez wrote:
> > Hi everybody,
> > 
> > I am responsible for musl CI in GNOME's GLib, and we have recently
> > bumped into a crash that I have been unable to resolve. 
> > 
> > https://gitlab.gnome.org/GNOME/glib/-/commit/137db219a7266300ffde1aa75d781284fb0807cb
> > introduced in GLib an alternate stack by setting the signal action
> > SA_ONSTACK if available. However, the tests that were introduced, and
> > that pass in most other libc's (there's CI for a lot more than just
> > glibc and musl) crash in my alpine linux edge installation with SIGSEGV
> > (stack trace below) while doing: kill (getpid(), SIGHUP)
> > 
> > I have verified that not adding SA_ONSTACK fixes the crash. Would
> > anybody have some pointers of what could possibly be going wrong? If
> > anybody is really interested, the public issue is
> > https://gitlab.gnome.org/GNOME/glib/-/issues/3315
> > 
> > Stack trace
> > ------------
> > 
> > Thread 1 "unix" received signal SIGSEGV, Segmentation fault.
> > 0x00007ffff7fa96e8 in __syscall2 (a2=1, a1=17483, n=62) at
> > ../arch/x86_64/syscall_arch.h:21
> > warning: 21     ./arch/x86_64/syscall_arch.h: No such file or
> > directory
> > (gdb) bt
> > #0  0x00007ffff7fa96e8 in __syscall2 (a2=1, a1=17483, n=62) at
> > ../arch/x86_64/syscall_arch.h:21
> > #1  kill (pid=17483, sig=sig@...ry=1) at src/signal/kill.c:6
> > #2  0x0000555555556e96 in test_signal (signum=signum@...ry=1) at
> > .../glib/tests/unix.c:534
> > #3  0x0000555555557200 in test_signal_alternate_stack (signal=1) at
> > .../glib/tests/unix.c:590
> > #4  0x00007ffff7e8f364 in test_case_run (path=<optimized out>,
> > test_run_name=0x55555555d3f0 "/glib-unix/sighup/alternate-stack",
> > tc=0x55555555db60) at ../glib/gtestutils.c:2988
> > #5  g_test_run_suite_internal (suite=suite@...ry=0x55555555da70,
> > path=path@...ry=0x0) at ../glib/gtestutils.c:3090
> > #6  0x00007ffff7e8f2db in g_test_run_suite_internal
> > (suite=suite@...ry=0x7ffff7ffee20, path=path@...ry=0x0) at
> > .../glib/gtestutils.c:3109
> > #7  0x00007ffff7e8f2db in g_test_run_suite_internal
> > (suite=suite@...ry=0x7ffff7ffede0, path=path@...ry=0x0) at
> > .../glib/gtestutils.c:3109
> > #8  0x00007ffff7e8f86a in g_test_run_suite
> > (suite=suite@...ry=0x7ffff7ffede0) at ../glib/gtestutils.c:3189
> > #9  0x00007ffff7e8f8ea in g_test_run () at
> > ../glib/gtestutils.c:2275
> > #10 0x00005555555561f7 in main (argc=<optimized out>,
> > argv=<optimized
> > out>) at ../glib/tests/unix.c:910
> 
> Can you get a disassembly and register dump at the point of crash?

(gdb) layout asm

 0x7ffff7fa96f9 <kill+7>     movslq %esi,%rsi                         
 0x7ffff7fa96fc <kill+10>    mov    $0x3e,%eax                        
 0x7ffff7fa9701 <kill+15>    syscall                                  
>0x7ffff7fa9703 <kill+17>    mov    %rax,%rdi                         
 0x7ffff7fa9706 <kill+20>    call  0x7ffff7f7afb7 <__syscall_ret>     
 0x7ffff7fa970b <kill+25>    add    $0x8,%rsp                         
 0x7ffff7fa970f <kill+29>    ret                                      
 0x7ffff7fa9710 <killpg>     test   %edi,%edi                         
 0x7ffff7fa9712 <killpg+2>   js     0x7ffff7fa971b <killpg+11>        
 0x7ffff7fa9714 <killpg+4>   neg    %edi                              
 0x7ffff7fa9716 <killpg+6>   jmp    0x7ffff7fa96f2 <kill>             
 0x7ffff7fa971b <killpg+11>  sub    $0x8,%rsp                         
 0x7ffff7fa971f <killpg+15>  call   0x7ffff7f78bae <__errno_location> 
 0x7ffff7fa9724 <killpg+20>  movl   $0x16,(%rax)                      
 0x7ffff7fa972a <killpg+26>  mov    $0xffffffff,%eax                  
 0x7ffff7fa972f <killpg+31>  add    $0x8,%rsp                         
 0x7ffff7fa9733 <killpg+35>  ret                                      
 0x7ffff7fa9734 <psiginfo>   mov    (%rdi),%edi                       
 0x7ffff7fa9736 <psiginfo+2> jmp    0x7ffff7fa973b <psignal>          
 0x7ffff7fa973b <psignal>    push   %r15    
 0x7ffff7fa973d <psignal+2>  push   %r14                              
 0x7ffff7fa973f <psignal+4>  push   %r13                              
 0x7ffff7fa9741 <psignal+6>  lea    0x51938(%rip),%r13        # 0x7ffff7ffb080 <__stderr_FILE>
 0x7ffff7fa9748 <psignal+13> push   %r12     
 0x7ffff7fa974a <psignal+15> xor    %r12d,%r12d                       
 0x7ffff7fa974d <psignal+18> push   %rbp                              
 0x7ffff7fa974e <psignal+19> push   %rbx                              
 0x7ffff7fa974f <psignal+20> mov    %rsi,%rbx                         
 0x7ffff7fa9752 <psignal+23> sub    $0x18,%rsp                        
 0x7ffff7fa9756 <psignal+27> call   0x7ffff7fb5780 <strsignal>   

(gdb) info registers
rax            0x0                 0
rbx            0x7ffff7f55c30      140737353440304
rcx            0x7ffff7fa9703      140737353783043
rdx            0x0                 0
rsi            0x1                 1
rdi            0x525e              21086
rbp            0x1                 0x1
rsp            0x7fffffffd5d0      0x7fffffffd5d0
r8             0x0                 0
r9             0x80                128
r10            0x8                 8
r11            0x202               514
r12            0x7ffff7ffdb5c      140737354128220
r13            0x1                 1
r14            0x7fffffffd6d0      140737488344784
r15            0x7fffffffd6f0      140737488344816
rip            0x7ffff7fa9703      0x7ffff7fa9703 <kill+17>
eflags         0x202               [ IF ]
cs             0x33                51
ss             0x2b                43
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
gs             0x0                 0
fs_base        0x7ffff7ffdb28      140737354128168
gs_base        0x0                 0

Does this tell you anything?
       
> I'm not sure if the crashing code is running on the signal stack or
> main stack, but here's a thought: is it possible the CI machines are
> running on a cpu/kernel with some monster AVX512 or whatever extension
> enabled with register file that doesn't fit in MINSIGSTKSZ?

That might be the case. It would explain why I could not reproduce the
crash on the 9-year-old laptop I was using last month, but can reproduce
it now on a new machine with a 13th Gen Intel(R) Core(TM) i7-1360P.

>  If so,
> using sysconf(_SC_MINSIGSTKSZ) (conditional on _SC_MINSIGSTKSZ being
> defined) to allocate the alt stack should mitigate the problem. If
> doing this, it should probably be allocated by mmap or malloc, since
> in principle it could be too large for the caller's stack.
> 

I'll forward this to the maintainers; let's see if we can come up with
a solution. Thanks a lot for your feedback!
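
For reference, a minimal sketch of the approach Rich describes above: size
the alternate stack from sysconf(_SC_MINSIGSTKSZ) when that name is defined,
and allocate it with mmap instead of placing it on the caller's stack. This
is not GLib's actual fix. The function names, the SIGHUP handler, and the
16 KiB of extra headroom are assumptions made for illustration only.

```c
#define _GNU_SOURCE /* for MAP_ANONYMOUS and sigaltstack under strict -std= modes */
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Set by the handler so the caller can tell that it ran. */
static volatile sig_atomic_t got_sighup = 0;

static void on_sighup(int signum)
{
    (void)signum;
    got_sighup = 1;
}

/* Hypothetical helper: choose an alternate stack size.
 * Prefer the runtime minimum reported by the system, fall back to the
 * compile-time MINSIGSTKSZ, then add headroom for the handler itself. */
size_t alt_stack_size(void)
{
    long min = -1;
#ifdef _SC_MINSIGSTKSZ
    min = sysconf(_SC_MINSIGSTKSZ); /* -1 if the system cannot report it */
#endif
    if (min < (long)MINSIGSTKSZ)
        min = (long)MINSIGSTKSZ;
    return (size_t)min + 16384; /* 16 KiB headroom: an arbitrary assumption */
}

/* Hypothetical helper: install a SIGHUP handler that runs on an
 * mmap-allocated alternate stack. Returns 0 on success, -1 on failure. */
int install_sighup_on_alt_stack(void)
{
    stack_t ss;
    struct sigaction sa;
    size_t size = alt_stack_size();
    void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (mem == MAP_FAILED)
        return -1;

    ss.ss_sp = mem;
    ss.ss_size = size;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) != 0) {
        munmap(mem, size);
        return -1;
    }

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sighup;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_ONSTACK; /* deliver SIGHUP on the alternate stack */
    if (sigaction(SIGHUP, &sa, NULL) != 0)
        return -1;

    return 0;
}
```

In a single-threaded program, calling install_sighup_on_alt_stack() and then
kill(getpid(), SIGHUP) should leave got_sighup set to 1 once kill returns.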

> It's also possible that the kernel may have some weird behavior
> deciding if the task is already "running on the alt stack" when the
> alt stack is embedded in the normal stack like this. Just getting rid
> of that might be worth trying. If so, whether the problem manifests
> could be subject to timing of signal delivery (although I would not
> expect that for synchronously generated signals like here).
> 
> Rich