oss-security - Re: backtrace_symbols() misuse by Ceph and its supposedly-safe use



Message-ID: <ZpJ4imxZbVpeHijv@remnant.pseudorandom.co.uk>
Date: Sat, 13 Jul 2024 13:52:26 +0100
From: Simon McVittie <smcv@...ian.org>
To: oss-security@...ts.openwall.com
Subject: Re: backtrace_symbols() misuse by Ceph and its
 supposedly-safe use

On Fri, 12 Jul 2024 at 17:37:59 +0800, Alexander Patrakov wrote:
> Ceph daemons, however, have a signal handler that catches SIGABRT and
> SIGSEGV and tries to format and log a backtrace.
...
> What would be a good solution (as in: something that does not convert
> crashes into deadlocks) here? I understand that, after memory
> corruption, we are already in the UB territory, but is there anything
> better possible than what is implemented?

Let it crash, and have a kernel core-dump collection hook collect it and
do post-mortem analysis? systemd-coredump and corekeeper are the
implementations of this that I've used myself, but I'm sure there are
plenty more available.

This has the additional benefit that it works for every daemon your
system might be relying on, not just Ceph itself (I don't know how
self-contained Ceph is).
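To make the "collection hook" idea concrete, here is a minimal sketch (Linux-specific, not taken from Ceph or from either tool named above) of how a program could check whether the kernel is configured to pipe core dumps to a helper. On Linux, a core_pattern value beginning with '|' means the core is piped to the program named after the '|'. The function names are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: true if the given core_pattern string tells the
 * kernel to pipe core dumps to a userspace program (leading '|'). */
static bool core_pattern_is_pipe(const char *pattern)
{
    return pattern != NULL && pattern[0] == '|';
}

/* Hypothetical helper: read the current pattern into buf.
 * Returns 0 on success, -1 if the file cannot be read. */
static int read_core_pattern(char *buf, size_t buflen)
{
    FILE *f = fopen("/proc/sys/kernel/core_pattern", "r");

    if (f == NULL)
        return -1;
    if (fgets(buf, (int) buflen, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return 0;
}
```

A daemon could call read_core_pattern() at startup and log a warning if core_pattern_is_pipe() is false, since in that case no OS-level collector is registered through this mechanism.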

The other way to do this is to go to heroic efforts to avoid heap
allocations in the crash handler, like Google's Breakpad does:
https://chromium.googlesource.com/breakpad/breakpad/+/HEAD/docs/client_design.md#exception-basics
This is necessary because Breakpad is typically used by leaf applications
(Chrome, games, etc.) that want to be able to report crashes to their
vendor, independent of how the underlying OS is set up. Of course, by the
time you're in UB territory, literally anything could be happening (for
example, memory corruption could conceivably have overwritten the stack
of Breakpad's crash-handler thread, if you're spectacularly unlucky),
but this is more about "pragmatic compromises that usually work" than
being 100% correct.
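As a rough illustration of the "no heap allocations" style (a sketch under my own assumptions, not Breakpad's or Ceph's actual code), a handler can restrict itself to a stack buffer, write(2), and raise(3), and then hand the process back to the kernel so that a core dump is still produced. All function names below are hypothetical, and the usual caveat applies: after memory corruption even this may not work.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Write value as lowercase hex into buf without printf or malloc.
 * Returns the number of characters written (no NUL terminator). */
static size_t format_hex(uintptr_t value, char *buf, size_t buflen)
{
    static const char digits[] = "0123456789abcdef";
    char tmp[2 * sizeof(uintptr_t)];
    size_t n = 0, out = 0;

    do {
        tmp[n++] = digits[value & 0xf];
        value >>= 4;
    } while (value != 0 && n < sizeof(tmp));

    while (n > 0 && out < buflen)
        buf[out++] = tmp[--n];
    return out;
}

/* Append a NUL-terminated string to buf at pos; returns the new pos. */
static size_t append_str(char *buf, size_t pos, size_t buflen, const char *s)
{
    while (*s != '\0' && pos < buflen)
        buf[pos++] = *s++;
    return pos;
}

/* Uses only a stack buffer, write() and raise(): no heap, no stdio. */
static void crash_handler(int signo, siginfo_t *info, void *context)
{
    char line[96];
    size_t pos = 0;
    ssize_t ignored;

    (void) context;
    pos = append_str(line, pos, sizeof(line), "fatal signal 0x");
    pos += format_hex((uintptr_t) signo, line + pos, sizeof(line) - pos);
    /* si_addr is only meaningful for fault signals such as SIGSEGV. */
    pos = append_str(line, pos, sizeof(line), ", si_addr 0x");
    pos += format_hex((uintptr_t) info->si_addr, line + pos,
                      sizeof(line) - pos);
    pos = append_str(line, pos, sizeof(line), "\n");
    ignored = write(STDERR_FILENO, line, pos);
    (void) ignored;

    /* SA_RESETHAND already restored the default action; the re-raised
     * signal is delivered when this handler returns, so the kernel
     * terminates the process and can still dump core. */
    raise(signo);
}

/* Install the handler for SIGSEGV and SIGABRT; returns 0 on success. */
static int install_crash_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = crash_handler;
    sa.sa_flags = SA_SIGINFO | SA_RESETHAND;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;
    return sigaction(SIGABRT, &sa, NULL);
}
```

Calling install_crash_handler() early in main() would be enough to try this out. Note that it deliberately does not attempt a symbolized backtrace inside the handler; that work is left to whatever processes the core dump afterwards.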

But if you control the machine at OS level (as you typically would for
a server) it seems more reliable to let the daemon crash and dump core,
and let a trusted OS-level component that is not already in an undefined
state process the core dump.

This seems to apply even more strongly if you suspect that the crash
might be caused by a malicious actor who is manipulating the memory
corruption to their benefit, rather than by an accident.

    smcv
