Message-ID: <20191106034605.GB16318@brightrain.aerifal.cx>
Date: Tue, 5 Nov 2019 22:46:05 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Non-candidate allocator designs, things to learn from them

On Tue, Oct 22, 2019 at 01:40:51PM -0400, Rich Felker wrote:
> Alongside time64 and other enhancements, over the next few months I'll
> be working on a major project to replace musl's malloc, which has
> known fundamental problems that can't really be fixed without
> significant redesign.
>
> [...]
>
> I'll follow up soon with some thoughts/findings so far on designs that
> might work for us, or that don't work but have interesting properties
> that suggest they may lead to workable designs.

The following was written incrementally over the past week or so, and
as I indicated before, it's not so much about a design that works as
about identifying properties that will motivate a good design.

The current dlmalloc-style split/merge of free chunks is problematic
in a lot of ways (complex interaction with fragmentation, hard to make
efficient with threads because access to neighbors being merged has to
be synchronized, etc.), so I want to first explore some simple designs
that completely lack it.

1. Extending a bump allocator

The simplest of all is a bump allocator with free stacks. This is just
one step beyond musl's current simple_malloc (used for static linking
without free). The only changes are that you round up requests to
discrete size classes, and then implement free as pushing onto a free
stack, with one stack per size class. Because it's a stack, a singly
linked list suffices, and you can push/pop solely with atomics. (A
rough sketch of this scheme is appended at the end of this mail.)

One big pro is that there is no progressive fragmentation. If you
successfully make a complex sequence of allocations, free some subset,
then allocate the previously-freed sizes again in a different order,
it's guaranteed to succeed. This is a really desirable property for
long-lived processes in certain kinds of memory-constrained systems.

Of course, the permanence of how memory is split up can also be a con.
After successfully performing very large numbers of small allocations
and freeing them, future large allocations may not be possible, and
vice versa.

Stack order is also undesirable for catching or mitigating double-free
and use-after-free; LRU order will give the best results there.
Switching to LRU/FIFO order is possible, but it no longer admits
simple atomics for alloc/free.

2. Zoned size classes

The above bump allocators with free essentially partition address
space into separate memory usable for each size class, in a way that's
(one-time-per-run) adaptive and completely unorganized. Another simple
type of allocator partitions (virtual) address space *statically*,
reserving for each size class a large range of addresses (PROT_NONE,
later mprotect-able to writable storage) usable as slabs/object pools
for objects of that particular size. This type of design appears in
GrapheneOS's hardened_malloc:

https://github.com/GrapheneOS/hardened_malloc

At first this seems pretty magical -- you automatically get a complete
lack of fragmentation... of *virtual* memory. It's possible to
fragment the zones themselves, requiring far more physical memory than
the logical amount allocated, but the physical memory usage by a given
size class is bounded by the peak usage in that size class at any
point in the program's lifetime. So that's pretty good.
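To make the mechanism concrete, here is a minimal sketch of the
reservation scheme, under simplifying assumptions of my own -- it is
not hardened_malloc's actual code. The 64GB-per-class reservation, the
names, and the bump-only carving of objects are placeholders, and
free-slot tracking, locking, and most error handling are omitted:

/* Illustrative only: one large PROT_NONE reservation per size class,
 * with pages committed via mprotect() as the zone grows. 64-bit only. */
#include <stddef.h>
#include <sys/mman.h>

#define ZONE_SIZE (1ULL<<36)  /* hypothetical 64GB of VA per size class */
#define PGSZ 4096

struct zone {
    unsigned char *base; /* start of the PROT_NONE reservation */
    size_t committed;    /* bytes already made writable */
    size_t used;         /* bytes handed out to the application */
    size_t objsize;      /* object size for this class */
};

/* Reserve address space for one size class without committing memory. */
static int zone_init(struct zone *z, size_t objsize)
{
    void *p = mmap(0, ZONE_SIZE, PROT_NONE,
        MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) return -1;
    z->base = p;
    z->committed = 0;
    z->used = 0;
    z->objsize = objsize;
    return 0;
}

/* Carve the next object out of the zone, committing pages as needed. */
static void *zone_alloc(struct zone *z)
{
    while (z->used + z->objsize > z->committed) {
        if (z->committed + PGSZ > ZONE_SIZE) return 0;
        if (mprotect(z->base + z->committed, PGSZ,
            PROT_READ|PROT_WRITE) < 0) return 0;
        z->committed += PGSZ;
    }
    void *p = z->base + z->used;
    z->used += z->objsize;
    return p;
}

The point of the sketch is just that the PROT_NONE mapping costs only
address space; physical memory gets committed page by page as each
size class actually grows.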
In order to be able to do this, however, your virtual address space
has to be vastly larger than the physical memory available. Ideally,
each size's zone should be larger than the total amount of physical
memory (ram+swap) you might have, so that you're not imposing an
arbitrary limit on what the application can do with it. Figuring
around 50 size classes and leaving at least half of virtual address
space free for mmap/large allocations, I figure your virtual address
space should be at least 100x larger than physical memory availability
in order for this to be viable. For 32-bit, that means it's not viable
beyond around 30 MB of physical memory (roughly 3 GB of usable address
space divided by 100). Ouch. Indeed, GrapheneOS's hardened_malloc is
explicitly only for 64-bit archs, and this is why.

Aside from that, this kind of design also suffers from extreme
over-allocation in small programs, and it's not compatible with nommu
at all (since it relies on the mmu to cheat fragmentation). For
example, if you allocate just one object in each of 50 size classes,
you'll be using at least 50 pages (200k) already.

One advantage of this approach is that it admits nice out-of-band
metadata approaches, which let you make strong guarantees about the
consistency of the allocator state under heap corruption/overflows.

So... Neither of these approaches is appropriate for the future of
musl's malloc. They're not sufficiently general-purpose/generalizable
to all environments and archs musl supports. But they do suggest
properties worth trying to achieve in an alternative. I've made some
good progress on this, which I'll follow up on in a separate post.
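Appendix: the free-stack sketch referenced in section 1 above. Again,
this is only an illustration under stated assumptions, not proposed
code: the size class table is a toy, the bump arena setup is not
shown, free takes the size explicitly instead of recovering it from
metadata, and the pop loop ignores the ABA problem a real lock-free
stack has to deal with.

/* Minimal sketch: bump allocation plus one LIFO free stack per size
 * class, manipulated only with C11 atomics. */
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define NCLASSES 8
static const size_t class_size[NCLASSES] =
    { 16, 32, 48, 64, 96, 128, 192, 256 };  /* toy table */

struct freenode { struct freenode *next; };

static _Atomic(struct freenode *) free_stack[NCLASSES];

/* Assume startup code mapped a contiguous arena and set these to its
 * bounds; that part is not shown. */
static _Atomic(uintptr_t) bump_cur;
static uintptr_t bump_end;

static int size_to_class(size_t n)
{
    for (int i=0; i<NCLASSES; i++)
        if (n <= class_size[i]) return i;
    return -1; /* too big for this toy allocator */
}

static void *alloc(size_t n)
{
    int c = size_to_class(n);
    if (c < 0) return 0;

    /* First try to pop a previously-freed chunk of this class;
     * on CAS failure h is reloaded with the current head. */
    struct freenode *h = atomic_load(&free_stack[c]);
    while (h && !atomic_compare_exchange_weak(&free_stack[c], &h, h->next));
    if (h) return h;

    /* Otherwise bump-allocate a fresh chunk of the rounded-up size. */
    uintptr_t p = atomic_fetch_add(&bump_cur, class_size[c]);
    if (p + class_size[c] > bump_end) return 0; /* arena exhausted */
    return (void *)p;
}

/* Assumes p/n came from a successful alloc; the caller passes the size
 * only to keep the sketch short -- a real allocator would recover the
 * size class from out-of-band metadata or a header. */
static void dealloc(void *p, size_t n)
{
    int c = size_to_class(n);
    struct freenode *node = p;
    node->next = atomic_load(&free_stack[c]);
    /* Push onto the free stack; on CAS failure node->next is reloaded
     * with the current head and we retry. */
    while (!atomic_compare_exchange_weak(&free_stack[c], &node->next, node));
}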