Message-Id: <20211105135323.36524-2-yoan.picchi@foss.arm.com>
Date: Fri, 5 Nov 2021 13:53:23 +0000
From: yoan.picchi@...s.arm.com
To: musl@...ts.openwall.com
Cc: Yoan Picchi <yoan.picchi@....com>
Subject: [PATCH 1/1] Add documentation for mallocng

From: Yoan Picchi <yoan.picchi@....com>

This is an attempt at describing the various data structures of mallocng
and how they interact with one another.

Signed-off-by: Yoan Picchi <yoan.picchi@....com>
Change-Id: I37659bbbc6fb0c23dbd93c9eba27e51bfeb4715d
---
 src/malloc/mallocng/readme.txt | 284 +++++++++++++++++++++++++++++++++
 1 file changed, 284 insertions(+)
 create mode 100644 src/malloc/mallocng/readme.txt

diff --git a/src/malloc/mallocng/readme.txt b/src/malloc/mallocng/readme.txt
new file mode 100644
index 00000000..c0c04847
--- /dev/null
+++ b/src/malloc/mallocng/readme.txt
@@ -0,0 +1,284 @@
+This doc is intended to give the reader a high-level view of how musl's malloc
+works by explaining its data structures. It is aimed at whoever wants to work
+on malloc or is just curious about how it works.
+
+You can find an alternative, unofficial, shorter explanation here:
+https://gist.github.com/MaskRay/ac54b26d72452ac77ac578f2e625369f
+And some other information can be found in the mailing list archive:
+https://www.openwall.com/lists/musl/2021/10/04/2
+
+
+
+	GROUP
+
+The group is the structure that actually holds the user data. Given that
+this structure is required for every size of allocation, it must be small
+enough to limit the memory waste on tiny allocations, yet able to hold a
+large amount of data. There are two slightly different usages depending on
+the size of the data to hold.
+
+	For small allocations (<128KiB):
+A group has several slots, up to 32. The header part of the group is 16B long
+on 64-bit platforms. This is defined by the UNIT constant, which also
+represents the minimal alignment inside a group. The slots start right after
+the header.
+	Small group:
++---------------------+
+| meta* meta          |  \
+| char active_idx:5   |  } UNIT (16B)
+| char padding[7]     |  /
+|                     |
+| +---------------+   |  \
+| | slot          |   |  } sizeclass x UNIT
+| +---------------+   |  /
+| | slot          |   |
+| +---------------+   |
+|     ... x32         |
+| +---------------+   |
+| | slot          |   |
+| +---------------+   |
++---------------------+
+
+Note that a slot uses some memory before the slot's start to store the in-band
+data (IB). This is safe because a slot's last 4B are always unused. For the
+first slot, "padding" must be at least 4 bytes long to fit the IB data. The
+IB data stores the information required to find the various fields in the
+slot. IB is not defined by any struct, and is accessed solely through pointer
+arithmetic. If it were defined by a struct, it would look like this:
+
+	Basic slot:                                 IB:
+  +---------------------+   <- before the slot  +----------------------+
+  | IB (4B)             |                       | char need_big_offset |
+  +---------------------+                       | char index:5         |
++-------------------------+ <- slot's start     | char reserved:3      |
+| +---------------------+ |                     | uint16_t offset      |
+| |                     | |                     +----------------------+
+| |      user data      | |
+| |                     | |
+| +---------------------+ |  \
+| | canary (1B)         | |   \
+| +---------------------+ |    \
+| | slack               | |     \
+| +---------------------+ |      } size = reserved
+| | footer size (4B)    | |     /
+| | (exists only if     | |    /
+| | reserved == 5)      | |   /
+| +---------------------+ |  /
+| | unused (4B)         | |
+| +---------------------+ |
++-------------------------+
+
+"need_big_offset" is always zero for multislot groups.
+"index" is the slot's index. It is used to find the slot's end once we have
+the meta (explained later on). It might be redundant, since one can compute
+it on the fly from the slot's address, the group's address and the sizeclass,
+but by capping the number of slots at 32, those 5 index bits save us that
+calculation.
+"reserved" specifies how far before the slot's end the user data actually
+ends.
This way a canary can be placed to detect overflows right at the end
+of the user's data. If "reserved" is 0, there is no space for the canary,
+so none is used. Since "reserved" is only 3 bits long, larger footers are
+handled by writing the footer's size directly in the slot when there is
+enough space.
+"offset" is the offset between this slot's start and the "storage" field in
+the group (the start of the first slot). Given that all of a group's slots
+are contiguous, this is used to access the metadata stored at the start of
+the group. Note that this offset is only a 16-bit value. To increase the
+range, it gets multiplied by UNIT, which is currently set to 16 and is the
+reason why 16 appears in a lot of places. This 16-bit offset explains why
+the largest multislot groups will only have up to 8 slots: 8*8191 = 2^16-8.
+This is the meaning of the basic IB data, and it is enough to run malloc,
+but it can't support aligned_alloc yet. The group header being 16 bytes
+long, what if one needs an alignment of 64B?
+The slots being of fixed size means that most malloc calls won't fill a slot
+entirely, so we have some wiggle room as to where to place the user data
+within the slot: this is the "slack". But if the user data starts further
+into the slot, how do we find the IB? The solution is to move the IB data as
+well, so that it still sits right before the user data. This would seem to
+mean that, once malloc returns, we no longer know where the user data
+starts. It is not a problem though, because the IB at the slot's original
+position is repurposed: in this modified IB, the "offset_size" field gives
+the offset between the start of the slot and the start of the user data. To
+make sure no one uses this modified IB as a "normal" IB, its "reserved"
+field is set to 7, which triggers an assert if it is used to get the
+metadata.
+
+This offset is not only used in aligned_alloc; it is used in nearly all
+allocations.
+By cycling through every valid offset in a slot upon free and reuse, we
+increase the time before the same user address gets reused. This makes it
+easier to catch a double free.
+
+	Complex slot:                               Modified IB:
+  +---------------------+                       +----------------------+
+  | modified IB (4B)    |                       | need_big_offset = 0  |
+  +---------------------+                       | index:5 = undefined  |
++-------------------------+ <- slot's start     | reserved:3 = 7       |
+| +---------------------+ |                     | uint16_t offset_size |
+| | optional offset     | |                     +----------------------+
+| +---------------------+ |
+| +---------------------+ |
+| | IB (4B)             | |
+| +---------------------+ |
+| +---------------------+ |
+| |                     | |
+| |      user data      | |
+| |                     | |
+| +---------------------+ |
+| | canary (1B)         | |
+| +---------------------+ |
+| | slack               | |
+| +---------------------+ |
+| | footer size (4B)    | |
+| +---------------------+ |
+| | unused (4B)         | |
+| +---------------------+ |
++-------------------------+
+
+	For large allocations (≥128KiB):
+A group only holds a single slot. It is not tracked and reused the same way
+small groups are: it is mmapped upon allocation and munmapped when freed.
+Given that this group has no maximum size, one may need it to be aligned to
+a few MiB using aligned_alloc. This would be above what the 16-bit offset
+(and offset_size) fields can hold. In this case, "need_big_offset" is set
+to 1 and an additional 4B is added to the IB to hold a 32-bit offset. Note
+that this still limits alignment to 64GiB (offset*UNIT).
+It may look like there is a problem given that there are only 7 bytes of
+padding, but since there is only one slot, "active_idx" is never used, so we
+can afford to overwrite it with the extra IB.
+
+Also note that in most cases there is some slack in large groups as well,
+because mmap allocates at page granularity. Internally, musl works with 4KiB
+pages and simply treats larger pages as a bunch of 4KiB pages.
As such, the size of
+the slot is a multiple of those 4KiB pages, not a multiple of the actual
+page size.
+
+	Large group:
++---------------------+
+| meta* meta          |  \
+| char active_idx:5   |  } UNIT (16B)
+| char padding[7]     |  /
+|                     |
+| +---------------+   |
+| | extra IB (4B) |   |
+| +---------------+   |  \
+| |               |   |   \
+| |     slot      |   |   } maplen x 4KiB
+| |               |   |   /
+| +---------------+   |  /
++---------------------+ <- ends at a 4KiB boundary
+
+Other than those few changes, large groups behave like small ones.
+
+
+
+	META
+
+Groups only hold the first part of the metadata. The second part is the
+associated meta object. There is one meta per group, and each one holds a
+pointer to the other.
+
+	Meta:
++--------------------+
+| meta* prev, next   | <- form a circular list
+| group* mem         |
+| int avail_mask     |  \ define 3 states for each slot
+| int freed_mask     |  /
+| size_t last_idx:5  |
+| size_t freeable:1  |
+| size_t sizeclass:6 | <- size if small group
+| size_t maplen:52   | <- size if large group
++--------------------+
+
+This is where the size of the slots is defined, through the "sizeclass"
+field. It indexes a global array of 48 fixed sizes ranging from 1 to 8191.
+These sizes are then multiplied by UNIT to get the actual size in memory,
+hence the small/large group threshold at 128KiB. Sizeclasses [48-62] are
+unused, and a "sizeclass" of 63 means it's a large group.
+Because of the 16-bit limit on offset, the number of slots is limited for
+the largest of the small groups. That limit is recorded in "last_idx".
+"avail_mask" and "freed_mask" are both bitmaps holding the status of every
+slot in the group. The former records whether a slot is available and has
+never been used; the latter, whether the slot has been used and then freed,
+or is currently inactive. When one calls malloc, it first attempts to
+allocate into an "avail_mask" slot. Failing that, it uses a freed slot or
+activates some new slots.
This allocation order helps keep memory pages clean. A group
+can easily span several pages, so it is better to fill in slots in pages
+that have already been dirtied. This is what the "active_idx" field in the
+group is used for: when a new slot is allocated in a group, all the slots
+that lie in the same page get activated.
+"maplen" is where the size of a large group is stored. It is the number of
+4KiB pages the group spans.
+Then there is "freeable", which can be a bit more confusing. While it
+obviously specifies whether the group is freeable or not, currently the only
+case where a group is not freeable is when a new meta is allocated on zeroed
+memory.
+And last, the "prev" and "next" fields. These link the meta objects into a
+circular list, so that when one requires a slot, malloc can easily go
+through the various metas until it finds a satisfactory one. In practice it
+doesn't search for the optimal one, but just uses some heuristics such as
+"no slot available and a new slot would need activating? take the next meta
+instead".
+
+
+
+	META_AREA
+
+We need some way to hold the memory where the metas will be allocated.
+That's the purpose of the meta_areas.
+
+	Meta_area:
++--------------------+
+| int64_t check      |
+| meta_area* next    | <- forms a linked list
+| int nslot          |
+| meta slots[]       | <- slots to store meta objects
++--------------------+
+
+Meta_areas work like a simple linked list where each link is bigger than the
+previous one. Growing the size exponentially ensures there will only be a
+small number of links and mmap calls. "nslot" gives the number of slots in
+this link where meta objects can be stored (not to be confused with the
+slots in groups). In practice, though, "nslot" seems to be currently unused,
+and the slots are tracked in the malloc_context object instead.
+
+
+
+	MALLOC_CONTEXT
+
+And finally, the context. This is a global singleton mainly used to manage
+the meta objects. Some of the fields have been omitted for brevity.
+	Malloc_context:
++------------------------------+
+| int64_t secret               |
+| meta* free_meta_head         | <- head of the freed-meta list
+| meta* avail_meta             | <- next meta in meta_area
+| size_t avail_meta_count      |
+| size_t avail_meta_area_count |
+| meta_area head, tail         |
+| meta* active[48]             | <- circular list for each sizeclass
+| size_t usage_by_class[48]    | <- number of slots for each sizeclass
+| uint8_t unmap_seq[32]        |  \ balance fragmentation/slot reuse
+| uint8_t bounce[32]           |  /
++------------------------------+
+
+"head" and "tail" allow easy access to the meta_areas. Given that meta slots
+are not tracked, we only need access to the tail of the list. There is some
+kind of memory protection trick with the meta_area. When a new meta_area is
+created, only the first 4KiB page is made available for the slots. The
+number of slots fitting in that page is written down in "avail_meta_count".
+When "avail_meta_count" runs out of slots, another 4KiB worth of slots is
+added and "avail_meta_area_count" is decremented once. When it reaches 0, a
+new meta_area is mmap-ed. The point of doing this is that the pages
+originally mapped are initialized with neither read nor write permission,
+and those permissions are granted back only as malloc needs them.
+"avail_meta" and "free_meta_head" are the two sources of meta objects. The
+first is a pointer to the first free slot in the tail meta_area. The second
+is a pool of all the metas that have been freed. This matters because, once
+again, the meta slots are not tracked in the meta_area: if we weren't
+retaining the freed metas in this list, they would be lost, resulting in a
+memory leak. This way, we prioritize reusing freed metas over initializing
+new ones. For easy access, there is one circular list of metas per
+sizeclass: "active". Alongside it there are counters of how many slots exist
+for any given sizeclass: "usage_by_class".
+"bounce" and "unmap_seq" seem to be used for balancing fragmentation against
+address reuse.
-- 
2.27.0