musl - Re: infinite loop in mallocng's try

Message-ID: <20230125055323.GK4163@brightrain.aerifal.cx>
Date: Wed, 25 Jan 2023 00:53:23 -0500
From: Rich Felker <dalias@...c.org>
To: Dominique MARTINET <dominique.martinet@...ark-techno.com>
Cc: musl@...ts.openwall.com
Subject: Re: infinite loop in mallocng's try_avail

On Wed, Jan 25, 2023 at 09:33:52AM +0900, Dominique MARTINET wrote:
> > If this code is being reached, either the allocator state has been
> > corrupted by some UB in the application, or there's a logic bug in
> > mallocng. The sequence of events that seems to have to happen to get
> > there is:
> > 
> > 1. Previously active group has no more available slots (line 120).
> 
> Right, that one has already likely been dequeued (or at least
> traversed), so I do not see how to look at it but that sounds possible.
> 
> > 2. Freed mask of newly activating group (line 131 or 138) is either
> >    zero (line 145) or the active_idx (read from in-band memory
> >    susceptible to application buffer overflows etc) is wrong and
> >    produces zero when its bits are anded with the freed mask (line
> >    145).
> 
> m->freed_mask looks like it is zero from values below; I cannot tell if
> that comes from a corruption outside of musl or not.
> 
> > > (gdb) p __malloc_context            
> > > $94 = {
> > >   secret = 15756413639004407235,
> > >   init_done = 1,
> > >   mmap_counter = 135,
> > >   free_meta_head = 0x0,
> > >   avail_meta = 0x18a3f70,
> > >   avail_meta_count = 6,
> > >   avail_meta_area_count = 0,
> > >   meta_alloc_shift = 0,
> > >   meta_area_head = 0x18a3000,
> > >   meta_area_tail = 0x18a3000,
> > >   avail_meta_areas = 0x18a4000 <error: Cannot access memory at address 0x18a4000>,
> > >   active = {0x18a3e98, 0x18a3eb0, 0x18a3208, 0x18a3280, 0x0, 0x0, 0x0, 0x18a31c0, 0x0, 0x0, 0x0, 0x18a3148, 0x0, 0x0, 0x0, 0x18a3dd8, 0x0, 0x0, 0x0, 0x18a3d90, 0x0, 
> > >     0x18a31f0, 0x0, 0x18a3b68, 0x0, 0x18a3f28, 0x0, 0x0, 0x0, 0x18a3238, 0x0 <repeats 18 times>},
> > >   usage_by_class = {2580, 600, 10, 7, 0 <repeats 11 times>, 96, 0, 0, 0, 20, 0, 3, 0, 8, 0, 3, 0, 0, 0, 3, 0 <repeats 18 times>},
> > >   unmap_seq = '\000' <repeats 31 times>,
> > >   bounces = '\000' <repeats 18 times>, "w", '\000' <repeats 12 times>,
> > >   seq = 1 '\001',
> > >   brk = 25837568
> > > }
> > > (gdb) p *__malloc_context->active[0]
> > > $95 = {
> > >   prev = 0x18a3f40,
> > >   next = 0x18a3e80,
> > >   mem = 0xb6f57b30,
> > >   avail_mask = 1073741822,
> > >   freed_mask = 0,
> > >   last_idx = 29,
> > >   freeable = 1,
> > >   sizeclass = 0,
> > >   maplen = 0
> > > }
> > > (gdb) p *__malloc_context->active[0]->mem
> > > $97 = {
> > >   meta = 0x18a3e98,
> > >   active_idx = 29 '\035',
> > >   pad = "\000\000\000\000\000\000\000\000\377\000",
> > >   storage = 0xb6f57b40 ""
> > > }
> > 
> > This is really weird, because at the point of the infinite loop, the
> > new group should not yet be activated (line 163), so
> > __malloc_context->active[0] should still point to the old active
> > group. But its avail_mask has all bits set and active_idx is not
> > corrupted, so try_avail should just have obtained an available slot
> > from it without ever entering the block at line 120. So I'm confused
> > how it got to the loop.
> 
> try_avail's pm is `__malloc_context->active[0]`, which is overwritten by
> either dequeue(pm, m) or *pm = m (lines 123,128), so the original
> m->avail_mask could have been zero, with the next element having a zero
> freed mask?

No, avail_mask is only supposed to be able to be nonzero after
activate_group, which is only called on the head of an active list
(free.c:86 or malloc.c:163) and which atomically pulls bits off
freed_mask to move them to avail_mask. If we're observing avail_mask
nonzero at the point you saw it, some invariant seems to have been
violated.
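
To make the invariant concrete, here is a minimal sketch of the
mask handoff described above. It is not the mallocng source: the
names toy_meta and toy_activate are hypothetical, C11 atomics stand
in for musl's internal atomic primitives, and active_idx is stored
directly in the struct rather than read from in-band group memory.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a group's metadata. */
struct toy_meta {
    _Atomic uint32_t freed_mask; /* slots freed but not yet reusable */
    uint32_t avail_mask;         /* slots ready to be handed out */
    int active_idx;              /* highest slot index in use (assumed field) */
};

/* Move the freed bits covered by active_idx over to avail_mask.
 * Precondition (the invariant under discussion): avail_mask is zero. */
static uint32_t toy_activate(struct toy_meta *m)
{
    assert(m->avail_mask == 0);

    /* Bits 0..active_idx; unsigned wraparound covers active_idx == 31. */
    uint32_t act = (2u << m->active_idx) - 1;

    /* Retry until the freed bits are cleared without losing concurrent
     * updates; a failed CAS reloads the current value into mask. */
    uint32_t mask = atomic_load(&m->freed_mask);
    while (!atomic_compare_exchange_weak(&m->freed_mask, &mask, mask & ~act))
        ;

    return m->avail_mask = mask & act;
}
```

Under these assumptions, a group whose freed_mask is zero yields an
avail_mask of zero, and avail_mask only becomes nonzero as the result
of such a call.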

> > One odd thing I noticed is that the backtrace pm=0xb6f692e8 does not
> > match the __malloc_context->active[0] address. Were these from
> > different runs?
> 
> These were from the same run, I've only observed this single occurrence
> first-hand.
> 
> pm is &__malloc_context->active[0], so it's not 0x18a3e98 (first value
> of active) but its address (e.g. __malloc_context+48 as per gdb symbol
> resolution in the backtrace)
> I didn't print __malloc_context but I don't see why gdb would have
> gotten that wrong.

Ah, I forgot I was looking at an additional level of indirection here.
It would be nice to know if m is the same active[0] as at entry; that
would help figure out where things went wrong...
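
For illustration, a toy sketch of that extra level of indirection.
All names here (toy_group, toy_ctx, toy_advance_head) are
hypothetical, and the array size is arbitrary; this is not the
mallocng code, only a model of a callee that holds the address of a
list head and overwrites it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a group's metadata and the global context. */
struct toy_group { struct toy_group *prev, *next; };
struct toy_ctx { struct toy_group *active[4]; };

/* pm is the address of a list-head slot, not the head itself.
 * Writing through pm changes what active[i] points to, so a later
 * dump of active[i] may show a different group than the one seen
 * at entry. */
static struct toy_group *toy_advance_head(struct toy_group **pm)
{
    struct toy_group *m = *pm; /* group seen at entry */
    *pm = m->next;             /* head slot now names the next group */
    return *pm;
}
```

Saving *pm at entry and comparing it with the head afterwards is
one way to tell whether the group being inspected is the same one.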

Rich