Message-ID: <Y83j89CsGRHIfqWC@atmark-techno.com>
Date: Mon, 23 Jan 2023 10:33:39 +0900
From: Dominique MARTINET <dominique.martinet@...ark-techno.com>
To: musl@...ts.openwall.com
Subject: infinite loop in mallocng's try_avail

Hi,

A coworker of mine has been observing hangs during shutdown on our ARM
alpine/openrc board: once in a blue moon, the openrc shutdown process
gets stuck in userspace after having stopped almost everything (running
reboot in a loop, this reproduces roughly once every three days).

It turns out it is stuck in malloc(), as follows:

(gdb) bt
#0  try_avail (pm=0xb6f692e8 <__malloc_context+48>) at src/malloc/mallocng/malloc.c:156
#1  alloc_slot (sc=sc@...ry=0, req=req@...ry=8) at src/malloc/mallocng/malloc.c:288
#2  0xb6efdc94 in __libc_malloc_impl (n=8) at src/malloc/mallocng/malloc.c:369
#3  0xb6e9d532 in xmalloc (size=size@...ry=8) at ../src/shared/helpers.h:64
#4  0xb6e9d598 in rc_stringlist_new () at ../src/librc/librc-stringlist.c:24
#5  0xb6e9b91e in get_provided (depinfo=depinfo@...ry=0xb6f585d0, runlevel=runlevel@...ry=0xbe940f99 "shutdown", options=options@...ry=3) at ../src/librc/librc-depend.c:269
#6  0xb6e9bce8 in visit_service (options=3, runlevel=0xbe940f99 "shutdown", depinfo=0xb6f5ee30, visited=0xb6f57b00, sorted=0xb6f57af0, types=0xb6f575a0, deptree=0xb6ed8ec0) at ../src/librc/librc-depend.c:419
#7  visit_service (deptree=0xb6ed8ec0, types=0xb6f575a0, sorted=0xb6f57af0, visited=0xb6f57b00, depinfo=0xb6f5ee30, runlevel=0xbe940f99 "shutdown", options=3) at ../src/librc/librc-depend.c:357
#8  0xb6e9bcb2 in visit_service (options=3, runlevel=0xbe940f99 "shutdown", depinfo=0xb6f62f50, visited=0xb6f57b00, sorted=0xb6f57af0, types=0xb6f575a0, deptree=0xb6ed8ec0) at ../src/librc/librc-depend.c:405
#9  visit_service (deptree=0xb6ed8ec0, types=0xb6f575a0, sorted=0xb6f57af0, visited=0xb6f57b00, depinfo=0xb6f62f50, runlevel=0xbe940f99 "shutdown", options=3) at ../src/librc/librc-depend.c:357
#10 0xb6e9bcb2 in visit_service (options=3, runlevel=0xbe940f99 "shutdown", depinfo=0xb6ed85a0, visited=0xb6f57b00, sorted=0xb6f57af0, types=0xb6f575a0, deptree=0xb6ed8ec0) at ../src/librc/librc-depend.c:405
#11 visit_service (deptree=0xb6ed8ec0, types=0xb6f575a0, sorted=0xb6f57af0, visited=0xb6f57b00, depinfo=0xb6ed85a0, runlevel=0xbe940f99 "shutdown", options=3) at ../src/librc/librc-depend.c:357
#12 0xb6e9bffa in rc_deptree_depends (deptree=deptree@...ry=0xb6ed8ec0, types=types@...ry=0xb6f575a0, services=services@...ry=0xb6f57ab0, runlevel=0xbe940f99 "shutdown", options=options@...ry=3) at ../src/librc/librc-depend.c:482
#13 0x004226ea in do_stop_services (going_down=true, parallel=true, newlevel=0xbe940f99 "shutdown", deptree=0xb6ed8ec0, stop_services=<optimized out>, start_services=0xb6f57d70, types_nw=0xb6f575a0) at ../src/openrc/rc.c:592
#14 main (argc=<optimized out>, argv=<optimized out>) at ../src/openrc/rc.c:1022

(This is musl 1.2.4 with a couple of patches, none around malloc:
https://gitlab.alpinelinux.org/alpine/aports/-/tree/3.17-stable/main/musl )

For convenience, I've copied the incriminated loop here:

	int cnt = m->mem->active_idx + 2;
	int size = size_classes[m->sizeclass]*UNIT;
	int span = UNIT + size*cnt;
	// activate up to next 4k boundary
	while ((span^(span+size-1)) < 4096) {
		cnt++;
		span += size;
	}

Here, we have UNIT=16 (constant) and we're considering a malloc(8), so
sizeclass is 0 -> size_classes[0] = 1 -> size = 16, which means we're
looping on this condition:

	while (span ^ (span + 15) < 4096)

with span being a multiple of 16. I'm not good at maths, but I'm fairly
confident this never becomes false:

	0x0ff0 ^ 0x0fff = 0xf
	0x1000 ^ 0x100f = 0xf

It could become true without the -1 (span+16); is that what was
intended? (It works for other sizes because of the UNIT offset in span,
so e.g. for size=32 we have 0x0ff0 ^ 0x100f, which is 8191 alright, but
that'd also work without the -1 as far as I can tell.)

Anyway, I'm not ruling out a problem in openrc yet -- rc_parallel is
known to be buggy -- but looking at the mallocng ctx I can't see any
obvious problem, so I'm asking here first: is that supposed to work?

Now that I've gotten it down this far, it might be possible to build a
reproducer; unfortunately I do not know what ctx looked like when it
entered try_avail (the function entered the if (!mask) branch, so the
first entry was dequeued), but here's its current state (please ask if
you need anything else):

------
(gdb) p __malloc_context
$94 = {
  secret = 15756413639004407235,
  init_done = 1,
  mmap_counter = 135,
  free_meta_head = 0x0,
  avail_meta = 0x18a3f70,
  avail_meta_count = 6,
  avail_meta_area_count = 0,
  meta_alloc_shift = 0,
  meta_area_head = 0x18a3000,
  meta_area_tail = 0x18a3000,
  avail_meta_areas = 0x18a4000 <error: Cannot access memory at address 0x18a4000>,
  active = {0x18a3e98, 0x18a3eb0, 0x18a3208, 0x18a3280, 0x0, 0x0, 0x0,
    0x18a31c0, 0x0, 0x0, 0x0, 0x18a3148, 0x0, 0x0, 0x0, 0x18a3dd8, 0x0,
    0x0, 0x0, 0x18a3d90, 0x0, 0x18a31f0, 0x0, 0x18a3b68, 0x0, 0x18a3f28,
    0x0, 0x0, 0x0, 0x18a3238, 0x0 <repeats 18 times>},
  usage_by_class = {2580, 600, 10, 7, 0 <repeats 11 times>, 96, 0, 0, 0,
    20, 0, 3, 0, 8, 0, 3, 0, 0, 0, 3, 0 <repeats 18 times>},
  unmap_seq = '\000' <repeats 31 times>,
  bounces = '\000' <repeats 18 times>, "w", '\000' <repeats 12 times>,
  seq = 1 '\001',
  brk = 25837568
}
(gdb) p *__malloc_context->active[0]
$95 = {
  prev = 0x18a3f40,
  next = 0x18a3e80,
  mem = 0xb6f57b30,
  avail_mask = 1073741822,
  freed_mask = 0,
  last_idx = 29,
  freeable = 1,
  sizeclass = 0,
  maplen = 0
}
(gdb) p *__malloc_context->active[0]->mem
$97 = {
  meta = 0x18a3e98,
  active_idx = 29 '\035',
  pad = "\000\000\000\000\000\000\000\000\377\000",
  storage = 0xb6f57b40 ""
}
------

I'm leaving that process stuck in limbo so I can get more info for a
few days.
Meanwhile, we'll also be trying to see if the following patch helps --
given we do not have a good reproducer, there won't be any feedback on
this side for a while.

----
>From f46bbe3117674becb3c5e4ffb7c8f4e4776084db Mon Sep 17 00:00:00 2001
From: Dominique Martinet <dominique.martinet@...ark-techno.com>
Date: Mon, 23 Jan 2023 10:12:55 +0900
Subject: [PATCH] mallocng: try_avail: fix infinite loop

For sizeclass 0, span is always a multiple of size = 16, so
(span^(span+size-1)) < 4096 never becomes false. Removing the -1 makes
the termination condition reachable for all size classes.
---
 src/malloc/mallocng/malloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/malloc/mallocng/malloc.c b/src/malloc/mallocng/malloc.c
index d695ab8ec982..9ef0b176c31e 100644
--- a/src/malloc/mallocng/malloc.c
+++ b/src/malloc/mallocng/malloc.c
@@ -151,7 +151,7 @@ static uint32_t try_avail(struct meta **pm)
 	int size = size_classes[m->sizeclass]*UNIT;
 	int span = UNIT + size*cnt;
 	// activate up to next 4k boundary
-	while ((span^(span+size-1)) < 4096) {
+	while ((span^(span+size)) < 4096) {
 		cnt++;
 		span += size;
 	}
----

Thank you,
--
Dominique Martinet