Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <Y83j89CsGRHIfqWC@atmark-techno.com>
Date: Mon, 23 Jan 2023 10:33:39 +0900
From: Dominique MARTINET <dominique.martinet@...ark-techno.com>
To: musl@...ts.openwall.com
Subject: infinite loop in mallocng's try_avail

Hi,

A coworker of mine has been observing hangs during shutdown on our ARM
alpine/openrc board, where the openrc shutdown process is stuck in
userspace after having stopped almost everything once in a blue moon
(running reboot in a loop, this is reproducible roughly one time every
three days)

It turns out it's stuck in malloc() as follow:
(gdb) bt
#0  try_avail (pm=0xb6f692e8 <__malloc_context+48>) at src/malloc/mallocng/malloc.c:156
#1  alloc_slot (sc=sc@...ry=0, req=req@...ry=8) at src/malloc/mallocng/malloc.c:288
#2  0xb6efdc94 in __libc_malloc_impl (n=8) at src/malloc/mallocng/malloc.c:369
#3  0xb6e9d532 in xmalloc (size=size@...ry=8) at ../src/shared/helpers.h:64
#4  0xb6e9d598 in rc_stringlist_new () at ../src/librc/librc-stringlist.c:24
#5  0xb6e9b91e in get_provided (depinfo=depinfo@...ry=0xb6f585d0, runlevel=runlevel@...ry=0xbe940f99 "shutdown", options=options@...ry=3)
    at ../src/librc/librc-depend.c:269
#6  0xb6e9bce8 in visit_service (options=3, runlevel=0xbe940f99 "shutdown", depinfo=0xb6f5ee30, visited=0xb6f57b00, sorted=0xb6f57af0, types=0xb6f575a0, 
    deptree=0xb6ed8ec0) at ../src/librc/librc-depend.c:419
#7  visit_service (deptree=0xb6ed8ec0, types=0xb6f575a0, sorted=0xb6f57af0, visited=0xb6f57b00, depinfo=0xb6f5ee30, runlevel=0xbe940f99 "shutdown", options=3)
    at ../src/librc/librc-depend.c:357
#8  0xb6e9bcb2 in visit_service (options=3, runlevel=0xbe940f99 "shutdown", depinfo=0xb6f62f50, visited=0xb6f57b00, sorted=0xb6f57af0, types=0xb6f575a0, 
    deptree=0xb6ed8ec0) at ../src/librc/librc-depend.c:405
#9  visit_service (deptree=0xb6ed8ec0, types=0xb6f575a0, sorted=0xb6f57af0, visited=0xb6f57b00, depinfo=0xb6f62f50, runlevel=0xbe940f99 "shutdown", options=3)
    at ../src/librc/librc-depend.c:357
#10 0xb6e9bcb2 in visit_service (options=3, runlevel=0xbe940f99 "shutdown", depinfo=0xb6ed85a0, visited=0xb6f57b00, sorted=0xb6f57af0, types=0xb6f575a0, 
    deptree=0xb6ed8ec0) at ../src/librc/librc-depend.c:405
#11 visit_service (deptree=0xb6ed8ec0, types=0xb6f575a0, sorted=0xb6f57af0, visited=0xb6f57b00, depinfo=0xb6ed85a0, runlevel=0xbe940f99 "shutdown", options=3)
    at ../src/librc/librc-depend.c:357
#12 0xb6e9bffa in rc_deptree_depends (deptree=deptree@...ry=0xb6ed8ec0, types=types@...ry=0xb6f575a0, services=services@...ry=0xb6f57ab0, 
    runlevel=0xbe940f99 "shutdown", options=options@...ry=3) at ../src/librc/librc-depend.c:482
#13 0x004226ea in do_stop_services (going_down=true, parallel=true, newlevel=0xbe940f99 "shutdown", deptree=0xb6ed8ec0, stop_services=<optimized out>, 
    start_services=0xb6f57d70, types_nw=0xb6f575a0) at ../src/openrc/rc.c:592
#14 main (argc=<optimized out>, argv=<optimized out>) at ../src/openrc/rc.c:1022

(this is musl 1.2.4 with a couple of patchs, none around malloc:
https://gitlab.alpinelinux.org/alpine/aports/-/tree/3.17-stable/main/musl
)

For convenience, I've copied the incriminated loop here:
        int cnt = m->mem->active_idx + 2;
        int size = size_classes[m->sizeclass]*UNIT;
        int span = UNIT + size*cnt;
        // activate up to next 4k boundary
        while ((span^(span+size-1)) < 4096) {
                cnt++;
                span += size;
        }

Here, we have UNIT=16 (constant), we're considering a malloc(8) so
sizeclass is 0 -> size_classes[0] = 1 -> size = 16, so we're looping
on this condition:
  while (span ^ (span + 15) < 4096)
with span being a multiple of 16.

I'm not good at maths, but I'm fairly confindent this never gets false:
0x0ff0 ^ 0x0fff = f
0x1000 ^ 0x100f = f
It could become true without the -1 (span+16), is that was intended ?
(It works for other sizes because of the UNIT offset in size, so
e.g. for size=32 we have 0x0ff0 ^ 0x100f which is 8191 alright, but
that'd also work without the -1 as far as I can tell)


Anyway, I'm not ruling out a problem in openrc yet -- rc_parallel is
known to be buggy -- but looking at the mallocng ctx I can't see any
obvious problem so I'm asking here first:
Is that supposed to work? Now I've gotten it down so far it might be
possible to build a reproducer; unfortunately I do not know how ctx was
when it entered try_avail (the function entered the if (!mask) so the
first entry was dequeued), but here's it's current state (please ask if
you need anything else)
------
(gdb) p __malloc_context            
$94 = {
  secret = 15756413639004407235,
  init_done = 1,
  mmap_counter = 135,
  free_meta_head = 0x0,
  avail_meta = 0x18a3f70,
  avail_meta_count = 6,
  avail_meta_area_count = 0,
  meta_alloc_shift = 0,
  meta_area_head = 0x18a3000,
  meta_area_tail = 0x18a3000,
  avail_meta_areas = 0x18a4000 <error: Cannot access memory at address 0x18a4000>,
  active = {0x18a3e98, 0x18a3eb0, 0x18a3208, 0x18a3280, 0x0, 0x0, 0x0, 0x18a31c0, 0x0, 0x0, 0x0, 0x18a3148, 0x0, 0x0, 0x0, 0x18a3dd8, 0x0, 0x0, 0x0, 0x18a3d90, 0x0, 
    0x18a31f0, 0x0, 0x18a3b68, 0x0, 0x18a3f28, 0x0, 0x0, 0x0, 0x18a3238, 0x0 <repeats 18 times>},
  usage_by_class = {2580, 600, 10, 7, 0 <repeats 11 times>, 96, 0, 0, 0, 20, 0, 3, 0, 8, 0, 3, 0, 0, 0, 3, 0 <repeats 18 times>},
  unmap_seq = '\000' <repeats 31 times>,
  bounces = '\000' <repeats 18 times>, "w", '\000' <repeats 12 times>,
  seq = 1 '\001',
  brk = 25837568
}
(gdb) p *__malloc_context->active[0]
$95 = {
  prev = 0x18a3f40,
  next = 0x18a3e80,
  mem = 0xb6f57b30,
  avail_mask = 1073741822,
  freed_mask = 0,
  last_idx = 29,
  freeable = 1,
  sizeclass = 0,
  maplen = 0
}
(gdb) p *__malloc_context->active[0]->mem
$97 = {
  meta = 0x18a3e98,
  active_idx = 29 '\035',
  pad = "\000\000\000\000\000\000\000\000\377\000",
  storage = 0xb6f57b40 ""
}
------


I'm leaving that process stuck in limbo so I can get more infos for a
few days.

Meanwhile we'll also be trying to see if the following patch helps --
given we do not have a good reproducer there won't be any feedback on
this side for a while.
----
>From f46bbe3117674becb3c5e4ffb7c8f4e4776084db Mon Sep 17 00:00:00 2001
From: Dominique Martinet <dominique.martinet@...ark-techno.com>
Date: Mon, 23 Jan 2023 10:12:55 +0900
Subject: [PATCH] mallocng: try_avail: fix infinite loop

sizeclass=0 has span being a multiple of size = 16, which will
never stop having "(span^(span+size-1)) < 4096".
Removing the -1 makes that ending condition reachable for all
class sizes.
---
 src/malloc/mallocng/malloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/malloc/mallocng/malloc.c b/src/malloc/mallocng/malloc.c
index d695ab8ec982..9ef0b176c31e 100644
--- a/src/malloc/mallocng/malloc.c
+++ b/src/malloc/mallocng/malloc.c
@@ -151,7 +151,7 @@ static uint32_t try_avail(struct meta **pm)
 				int size = size_classes[m->sizeclass]*UNIT;
 				int span = UNIT + size*cnt;
 				// activate up to next 4k boundary
-				while ((span^(span+size-1)) < 4096) {
+				while ((span^(span+size)) < 4096) {
 					cnt++;
 					span += size;
 				}
----

Thank you,
-- 
Dominique Martinet

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.