oss-security - backtrace_symbols() misuse by Ceph and its supposedly-safe use

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAN_LGv0tC7ef+0BV=m7i_9caO2g+B-SkqENT=w-=uif7ntsqjw@mail.gmail.com>
Date: Fri, 12 Jul 2024 17:37:59 +0800
From: Alexander Patrakov <patrakov@...il.com>
To: oss-security@...ts.openwall.com
Subject: backtrace_symbols() misuse by Ceph and its supposedly-safe use

Hi,

Given the recent discussions regarding the calls to
non-async-signal-safe functions from signal handlers, let me add my
two pesos to the discussion, even though I don't think there is
anything security-related in this email.

A few months ago, I handled a customer report about a deadlock in Ceph
RADOS Gateway. The Ceph version is 17.2.7. In that case, I attached
gdb to the deadlocked process and applied the backtrace to all
threads. One of the threads had this backtrace:

```
Thread 594 (Thread 0x7f41ac610700 (LWP 870073) "radosgw"):
#0  0x00007f43ca050d0e in __lll_lock_wait_private () from
target:/lib64/libc.so.6
#1  0x00007f43ca0568fa in malloc () from target:/lib64/libc.so.6
#2  0x00007f43ca0df884 in backtrace_symbols () from target:/lib64/libc.so.6
#3  0x00007f43cb24485a in handle_oneshot_fatal_signal(int) () from
target:/usr/lib64/libradosgw.so.2
#4  <signal handler called>
#5  0x00007f43ca003d2b in raise () from target:/lib64/libc.so.6
#6  0x00007f43ca0053e5 in abort () from target:/lib64/libc.so.6
#7  0x00007f43ca049c87 in __libc_message () from target:/lib64/libc.so.6
#8  0x00007f43ca051d2a in malloc_printerr () from target:/lib64/libc.so.6
#9  0x00007f43ca0525f6 in unlink_chunk.isra () from target:/lib64/libc.so.6
#10 0x00007f43ca0555c0 in _int_malloc () from target:/lib64/libc.so.6
#11 0x00007f43ca0567d8 in malloc () from target:/lib64/libc.so.6
#12 0x00007f43ca28e3fc in operator new(unsigned long) () from
target:/usr/lib64/libstdc++.so.6
#13 0x00007f43cae3b42f in std::map<rgw_obj, RGWObjState,
std::less<rgw_obj>, std::allocator<std::pair<rgw_obj const,
RGWObjState> > >::operator[](rgw_obj const&) () from
target:/usr/lib64/libradosgw.so.2
#14 0x00007f43caddb5a5 in RGWObjectCtx::set_atomic(rgw_obj const&) ()
from target:/usr/lib64/libradosgw.so.2
#15 0x00007f43caf189c0 in
rgw::sal::RadosObject::set_atomic(RGWObjectCtx*) const () from
target:/usr/lib64/libradosgw.so.2
#16 0x00007f43cad8319b in
RGWDeleteMultiObj::handle_individual_object(rgw_obj_key const&,
optional_yield,
boost::asio::basic_deadline_timer<boost::posix_time::ptime,
boost::asio::time_traits<boost::posix_time::ptime>,
boost::asio::executor>*) () from target:/usr/lib64/libradosgw.so.2
#17 0x00007f43cad83b99 in void
boost::context::detail::context_entry<boost::context::detail::record<boost::context::continuation,
boost::context::basic_fixedsize_stack<boost::context::stack_traits>,
spawn::detail::spawn_helper<boost::asio::executor_binder<void (*)(),
boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>,
0u> > >, RGWDeleteMultiObj::execute(optional_yield)::{lambda(spawn::basic_yield_context<boost::asio::executor_binder<void
(*)(), boost::asio::strand<boost::asio::io_context::basic_executor_type<std::allocator<void>,
0u> > > >)#3}, boost::context::basic_fixedsize_stack<boost::context::stack_traits>
>::operator()()::{lambda(boost::context::continuation&&)#1}>
>(boost::context::detail::transfer_t) () from
target:/usr/lib64/libradosgw.so.2
#18 0x00007f43cb293c1f in make_fcontext () from
target:/usr/lib64/libradosgw.so.2
```

Here is my attempt to make sense of it from the bottom up.

In frame #13, something was written into a C++ std::map; this has led
to a memory allocation via malloc() in frame #11. However, the
internal logic of the glibc implementation of malloc() detected some
corruption, and there was, indeed this message in the log:

Jun 21 04:52:35 abc-osd08 radosgw[868945]: corrupted size vs. prev_size
Jun 21 04:52:36 abc-osd08 radosgw[868945]: *** Caught signal (Aborted) **
Jun 21 04:52:36 abc-osd08 radosgw[868945]:  in thread 7f41ac610700
thread_name:radosgw

After printing this message, glibc tried to safely abort the process
to prevent further damage/exploitation, see frame #6.

Ceph daemons, however, have a signal handler that catches SIGABRT and
SIGSEGV and tries to format and log a backtrace. Their systemd units
are also set to automatically restart the daemons. Without the
handler, this is what would have happened. However, within the
handler, radosgw calls backtrace_symbols(), which, in turn, calls
malloc() in frame #1, which is not async-signal-safe. So, here is a
usage bug by definition, and it converted a restartable crash into a
classical A-A deadlock because the outer malloc() call from frame #11
already holds a lock.

In Ceph public bug tracker, there is also a similar report with MDS
instead of the RADOS gateway: https://tracker.ceph.com/issues/65039

The question here is whether this is only a Ceph bug or whether
backtrace_symbols() is an unusable piece of API a-la gets() because it
is calling malloc() and is documented to do so.

The same manual page documents backtrace_symbols_fd(), but, as-is, it
is not good enough because Ceph also wants to demangle C++ symbols, to
log the result, and to post the crash information into the designated
directory.

I acknowledge that there is a todo comment in src/global/signal_handler.cc:

  // TODO: don't use an ostringstream here. It could call malloc(), which we
  // don't want inside a signal handler.
  // Also fix the backtrace code not to allocate memory.
  ClibBackTrace bt(1);
  ostringstream oss;
  bt.print(oss);
  dout_emergency(oss.str());

Apparently, the authors don't realize that the use of ostringstream is
not the only cause of possible malloc() calls.

What would be a good solution (as in: something that does not convert
crashes into deadlocks) here? I understand that, after memory
corruption, we are already in the UB territory, but is there anything
better possible than what is implemented?

-- 
Alexander Patrakov
Please check out the Open Source Software Security Wiki, which is counterpart to this mailing list.
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.