|
Message-ID: <20220124144234.GG7074@brightrain.aerifal.cx> Date: Mon, 24 Jan 2022 09:42:36 -0500 From: Rich Felker <dalias@...c.org> To: Yoan Picchi <yoan.picchi@...s.arm.com> Cc: musl@...ts.openwall.com, yoan.picchi@....com Subject: Re: [PATCH 1/1] Add documentation for mallocng On Mon, Jan 24, 2022 at 01:25:01PM +0000, Yoan Picchi wrote: > Hello. > Having received no review nor acknowledgement for the last few month > I'm sending this reminder. > This patch is about adding some documentation for mallocng as the > current lack of thereof makes > it time consuming to work with. Hi. I'm sorry for not replying sooner. Documentation is intentionally not maintained in the musl tree itself. There is a long-term official doc project that's stalled, but also it's expected only to document public contracts. The immediate-benefit place to put documentation of internals, musl-specific behaviors, etc. is on the wiki. If there doesn't seem to be a good place for this document on the wiki now, myself or someone else can add one. Thanks for your interest/work in documenting this. Rich > From: yoan.picchi@...s.arm.com <yoan.picchi@...s.arm.com> > > Sent: 05 November 2021 13:36 > To: musl@...ts.openwall.com > Cc: Yoan Picchi <Yoan.Picchi@....com> > Subject: [PATCH 1/1] Add documentation for mallocng > > From: Yoan Picchi <yoan.picchi@....com> > > This is an attempt at describing the various data structures of > mallocng and how they interract together. > > Signed-off-by: Yoan Picchi <yoan.picchi@....com> > Change-Id: I37659bbbc6fb0c23dbd93c9eba27e51bfeb4715d > --- > src/malloc/mallocng/readme.txt | 284 +++++++++++++++++++++++++++++++++ > 1 file changed, 284 insertions(+) > create mode 100644 src/malloc/mallocng/readme.txt > > diff --git a/src/malloc/mallocng/readme.txt > b/src/malloc/mallocng/readme.txt new file mode 100644 index > 00000000..c0c04847 > --- /dev/null > +++ b/src/malloc/mallocng/readme.txt > @@ -0,0 +1,284 @@ > +This doc is intended to give the reader a high level view of how musl's > +malloc works through explaining its data structures, and is targeted > +for whomever wants to work on malloc or is just curious about how it works. > + > +You can find an alternative unofficial shorter explanation here: > +https://gist.github.com/MaskRay/ac54b26d72452ac77ac578f2e625369f > +And some other information can be found in the mailing list archive: > +https://www.openwall.com/lists/musl/2021/10/04/2 > + > + > + > + GROUP > + > +The group is the structure that will actually hold the user data. Given > +that this structure will be required for any size of allocation, it > +must be small to reduce the memory waste for tiny allocations, yet be > +able to hold a large amount of data. There are two slightly different > +usages depending on the size of the data to hold. > + > + For small allocations (<128KiB): > +A group has several slots, up to 32. The header part of the group is > +16B long on 64 bits platforms. This is defined by the UNIT constant, > +which also represent the minimal alignment inside a group. Then > start the slots. > + > + Small group: > ++---------------------+ > +| meta* meta | \ > +| int number_active:5 | } UNIT (16B) > +| char padding[7] | / > +| | > +| +---------------+ | \ > +| | slot | | } sizeclass x UNIT > +| +---------------+ | / > +| | slot | | > +| +---------------+ | > +| ... x32 | > +| +---------------+ | > +| | slot | | > +| +---------------+ | > ++---------------------+ > + > +Note that a slot uses some memory before the slot's start to store the > +in-band data (IB). This is safe because a slot's last 4B are always > +unused. For the first slot, "padding" should be at least 4 Bytes long > +to fit the IB data. This IB data stores the information required to > find the various fields in the slot. > +IB is not defined by any struct, and is accessed solely through pointer > +arithmetic. If it had a structure, it would be like so: > + > + Basic slot: IB: > + +---------------------+ <- before the slot +----------------------+ > + | IB (4B) | | char need_big_offset | > + +---------------------+ | char index:5 | > ++-------------------------+ <- slot's start | char reserved:3 | > +| +---------------------+ | | uint16_t offset | > +| | | | +----------------------+ > +| | user data | | > +| | | | > +| +---------------------+ | \ > +| | canary (1B) | | \ > +| +---------------------+ | \ > +| | slack | | \ > +| +---------------------+ | } size = reserved > +| | footer size (4B) | | / > +| | (exist only if | | / > +| | reserved == 5) | | / > +| +---------------------+ | / > +| | unused (4B) | | > +| +---------------------+ | > ++-------------------------+ > + > +"need_big_offset" is always null for multislot groups. > +"index" is the slot's index. It is used to find the slot's end once we > +have the meta (explained later on). It might be redundant as one can > +calculate it on the fly from the slot's address, the group's address > +and the sizeclass, but by keeping the number of slots to 32, those 5 > +index bits save us the aforenoted calculation. > +"reserved" specifies where the user data actually ends compared to the > +end of the slot. This way a canary can be placed to detect overflows > +right at the end of the user's data. If "reserved" is 0, then we have > +no space for the canary and so, we don't use any. Being only 3 bits > +long, we deal with larger footers by writing the size of the footer > +directly in the slot when we have enough space. "offset" is the > offset between this slot's start, and the "storage" > +field in group (the start of the first slot). Given that all group > +slots are contiguous, this is used to access the metadata stored at the > +start of the group. Note that this offset is only a 16 bit value. To > +increase the range, it gets multiplied by UNIT, which is currently set > +to 16 and is the reason why 16 appears in a lot of places. This offset > +of 16 bits explains why the largest multislot groups will only have > up to 8 slots: 8*8191 = 2^16-8. > +This is the meaning of the basic IB data, and is enough to run malloc, > +but it can't support aligned_alloc yet. The group metadata being 16 > +Bytes long, what if one needs an alignment of 64B ? > + > +The slots being a fixed size means that most malloc calls actually > +won't fill it entirely and that we have some wiggle room as to where to > +place the user data in that slot: this is the "slack". But if the user > +data starts further back in the slot, how to find the IB ? The solution > +is to move the IB data, as well. This would mean though that once we > +return from the malloc call, we no longer know where the user data > +starts. This is no issue though, because the original IB slot has > been repurposed. In the modified IB, the "offset_size" > +field gives us the offset between the start of the slot and the start > +of the user data. To make sure one doesn't use this IB as a "normal" > +IB, the "reserved" slot is set to 7, which triggers an assert if used > +to get the metadata. > + > +This offset is not only used in aligned_alloc, it is used in nearly > all allocs. > +By cycling through every valid offset in a slot upon free and re-use, > +we increase the time to reuse of the same user address. This makes it > +easier to catch a double free. > + > + Complex slot: Modified IB: > + +---------------------+ +----------------------+ > + | modified IB (4B) | | need_big_offset = 0 | > + +---------------------+ | index:5 = undefined | > ++-------------------------+ <- slot's start | reserved:3 = 7 | > +| +---------------------+ | | uint16_t > +| +---------------------+ | offset_size | > +| | optional offset | | +----------------------+ > +| +---------------------+ | > +| +---------------------+ | > +| | IB (4B) | | > +| +---------------------+ | > +| +---------------------+ | > +| | | | > +| | user data | | > +| | | | > +| +---------------------+ | > +| | canary (1B) | | > +| +---------------------+ | > +| | slack | | > +| +---------------------+ | > +| | footer size (4B) | | > +| +---------------------+ | > +| | unused (4B) | | > +| +---------------------+ | > ++-------------------------+ > + > + > + For large allocations (≥128KiB): > +A group will only hold a single slot. It is not tracked and reused the > +same way as small groups are. It is mmapped upon allocation, and > munmapped when freed. > +Given that this group has no maximum size, it is possible that one > +needs it to be aligned to a few MiB using aligned_alloc. This would be > +above what the 16 bit offset (and offset_size) fields can hold. In > this case, "need_big_offset" > +is set to 1 and an additional 4B is added to the IB to hold a 32 > bits offset. > +Note that it limits alignment to 64GiB (offset*UNIT). > +It may look like we have a problem given that there is only 7 Bytes of > +padding, but given that there is only one slot, we never use > +"number_active", so we can afford overwriting it with the extra IB. > + > +Also note that in most case there is some slack in large groups as well > +because mmap allocates at a page granularity. Internally, musl works > +with 4KiB page and simply considers larger pages as a bunch of 4KiB > +pages. As such, the size of the slot is a multiple of those 4KiB pages > +and not a multiple of the actual page size. > + > + Large group: > ++---------------------+ > +| meta* meta | \ > +| int number_active:5 | } UNIT (16B) > +| char padding[7] | / > +| | > +| +---------------+ | > +| | extra IB (4B) | | > +| +---------------+ | \ > +| | | | \ > +| | slot | | } maplen x 4KiB > +| | | | / > +| +---------------+ | / > ++---------------------+ <- end at a 4KiB boundary > + > +Other than those few changes, large groups behave like small ones. > + > + > + > + META > + > +Groups only have the first part of the metadata. The second part is the > +bound meta object. There's one for each group and they each have a > +pointer to each other. > + > + Meta: > ++--------------------+ > +| meta* prev, next | <- make a circular buffer > +| group* mem | > +| int avail_mask | \ define 3 states for each slot > +| int freed_mask | / > +| size_t last_idx:5 | > +| size_t freeable:1 | > +| size_t sizeclass:6 | <- size if small group > +| size_t maplen:52 | <- size if large group > ++--------------------+ > + > +This is where the size of the slots is defined, using the "sizeclass" > +field. It is linked to a global array with 48 fixed size ranging from 1 > +to 8191. These sizes are then multiplied by UNIT to get the actual size > +in memory, hence the small/large group threshold at 128KiB. The > +"sizeclasses" [48-62] are unused and a "sizeclass" of 63 means it's > a large group. > +Given the limitation of offset to 16 bits, there is a limitation of the > +number of slots for the largest of the small groups. It is set in > "last_idx". > +The "avail_mask" and "freed_mask" are both bitmaps to hold the status > +of every slot in the group. The former holds whether the slot is > +available and never used, the latter whether the slot has been used but > +has been freed, or is currently inactive. When one calls malloc, it'll > +first attempt to allocate into an "avail_mask" slot. Failing this, > +it'll use a freed slot or activate some new slots. This allocation > +procedure is used to keep memory pages clean. A group can easily span > +several pages. As such, it's better to fill in slots in pages that have > +already been dirtied. This is what the "active_idx" field in group is > +used for. When a new slot is alloc-ed in a group, all the slots > that are in the same page get activated. > +"maplen" is where the size of a large group is stored. It is the number > +of 4KiB pages the group spans. > +Then there is "freeable", which can be a bit more confusing. While it > +obviously specifies whether the group is freeable or not, currently a > +group is not freeable only when a new meta is allocated on zeroed memory. > +And last, the "prev" and "next" fields. This is so we can make a > +circular buffer out of those meta objects. This way, when one requires > +a slot, malloc can easily go through the various metas until it finds a > +satisfactory one. In practice it doesn't search for the optimal one but > +just uses some heuristics like "no slot available, need to activate a > +new slot ? take the next meta instead". > + > + > + > + META_AREA > + > +We need some way to hold the memory where the meta will be allocated. > +That's the purpose of the meta_areas. > + > + Meta_area: > ++--------------------+ > +| int64_t check | > +| meta_area* next | <- form a linked list > +| int nslot | > +| meta slots[] | <- slots to store meta objects > ++--------------------+ > + > +These work like a simple linked list where each link is bigger than the > +previous one. Growing the size exponentially ensures there will only be > +a small amount of links and mmap calls. "nslot" gives the number of > +slots where the meta object can stored in this link (≠ slots in > +groups). Though, in practice, "nslot" seems to be currently unused and > +the slots are tracked in the malloc_context object instead. > + > + > + > + MALLOC_CONTEXT > + > +And finally, the context. This is a global singleton mainly used to > +manage the meta objects. Some of the fields have been omitted for brevity. > + > + Malloc_context: > ++------------------------------+ > +| int64_t secret | > +| meta* free_meta_head | <- head of the free meta > +| meta* avail_meta | <- next meta in meta_area > +| size_t avail_meta_count | > +| size_t avail_meta_area_count | > +| meta_area head, tail | > +| meta* active[48] | <- circular buffer for each sizeclass > +| size_t usage_by_class[48] | <- number of slot for each sizeclass > +| uint8_t unmap_seq[32] | \ balance fragmentation/slot reuse > +| uint8_t bounce[32] | / > ++------------------------------+ > + > +"head" and "tail" allow easy access to the meta_area. Given that meta > +slots are not tracked, we only need access to the tail of the list. > +There is some kind of memory protection trick with the meta area. When > +a new meta_area is created, only the first 4KiB page is made available > +for the slots. The number of slots fitting in that page is written down > +in "avail_meta_count". When "avail_meta_count" runs out of slots, > +another 4KiB worth of slots are added and "avail_meta_area_count" is > +decremented once. When it reaches 0, a new meta_area is mmap-ed. The > +point of doing so is that the pages originally mapped are initialized > +with no right to read nor write, and those rights are given back just > +as the malloc needs them. "avail_meta" and "free_meta_head" are the two > +sources of meta objects. The first is a pointer to the first free > slot in the tail meta_area. The second is a pool of all the metas > that have been freed. > +This is important because once again, the meta slots are not tracked in > +the meta area. If we weren't retaining the freed meta in this list, > +they would be lost and it would be a memory leak. This way, we > +prioritize reusing freed meta instead of initializing new ones. For > +easy access, there are a circular buffer for each sizeclass. Along with > +those there are counters of how many slots exist for any given > sizeclass: "usage_by_class". > +The "bounce" and "unmap_seq" seems to be used for balancing > +fragmentation and address reuse. > + > -- > 2.27.0
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.