Date: Thu, 5 Oct 2023 08:39:03 -0400
From: Rich Felker <dalias@...c.org>
To: Markus Wichmann <nullplan@....net>
Cc: musl@...ts.openwall.com
Subject: Re: Hung processes with althttpd web server

On Thu, Oct 05, 2023 at 05:37:41AM +0200, Markus Wichmann wrote:
> On Wed, Oct 04, 2023 at 09:41:41PM -0400, Carl Chave wrote:
> > Hello, I'm running the althttpd web server on Alpine Linux using a Ramnode VPS.
> >
> > I've been having issues for quite a while with "hung" processes. There
> > is a long lived parent process and then a short lived forked process
> > for each http request. What I've been seeing is that the forked
> > processes will sometimes get stuck:
> >
> > sod01:/srv/www/log$ sudo strace -p 11329
> > strace: Process 11329 attached
> > futex(0x7f5bdcd77900, FUTEX_WAIT_PRIVATE, 4294967295, NULL
> >
> 
> I often see this system call hung when signal handlers are doing
> signal-unsafe things. Looking at the source code, that is exactly what
> happens if the process catches a signal at the wrong time. Try removing
> all calls to signal(); that should do what the designers intended
> better (namely quit the process). If you want to log when a process dies
> of unnatural causes, that's something the parent process can do.
> 
> The signal handler will call MakeLogEntry(), and that will do
> signal-unsafe things such as call free(), localtime(), or fopen(). If
> the main process is currently using malloc() when that happens, you will
> get precisely this hang.
> 
> 
> > Please see this forum thread for additional information:
> > https://sqlite.org/althttpd/forumpost/4dc31619341ce947
> >
> 
> Seems like they haven't yet found the trail of the signal handler.

OK, this is almost surely the source of the problem. It would still be
interesting to know which lock is being hit here, since for the most
part, locks are skipped in single-threaded processes. But even if the
lock were skipped, the invalid calls to async-signal-unsafe functions
from async-signal context would be corrupting the state those locks
were meant to protect. That's probably what's happening on glibc
(meaning this code only appears to work there, but is likely behaving
dangerously).

Rich
