Message-ID: <20180731074929.GB22386@voyager>
Date: Tue, 31 Jul 2018 09:49:30 +0200
From: Markus Wichmann <nullplan@....net>
To: musl@...ts.openwall.com
Subject: Re: malloc implementation survey: omalloc

On Mon, Jul 30, 2018 at 11:09:05AM -0400, Christopher Friedt wrote:
> I haven't looked at omalloc, but I wrote a deadly simple buddy allocator
> for Bionic some time ago with support for malloc(3), calloc(3), realloc(3),
> free(3), and posix_memalign(3). It would obviously also support
> aligned_alloc(3) for C11.
>

Most allocators can fit these requirements. omalloc is capable of
delivering naturally aligned chunks of memory for everything up to a
page; beyond that, we'd need to trick Linux's virtual allocator into
giving us the alignment we need, typically via the old "allocate
size+align bytes" trick (or that many pages, in this case).

> It ran well on everything from a arm cortex-m0 to an intel core i7. The
> code compiled to 504 .text bytes, on cortex-m0, iirc.
>
> I wrote it originally for using on the kernel-side of an rtos, but it could
> easily be extended to make a syscall when a process runs out of ram.
>
> Obviously, a shortcoming is that the memory blocks must be PoT and there is
> the potential for fragmentation. Otherwise though, the meta-information is
> intrinsic within the pointer, which obliviates the need for external
> storage.
>

That does sound interesting, but I don't really follow. The meta-info we
need to save is the length of the returned object. How can you encode
that in the pointer without any external storage? Using internal
storage? That is where we are right now, and it means there is only ever
one buffer overflow between us and oblivion.

Or do you essentially have one large block of memory, with one area for
the 16-byte objects, one for the 32-byte ones, &c.? But then what do you
do if a given area runs out and the application still wants more?
Just so we don't talk past each other: external storage in this case
means the meta-data is saved away from the returned memory; ideally,
away from all user-visible memory, so that only a rampaging memory
corruption issue can destroy it. Internal storage means the information
is saved close to the returned memory.

> Every call takes approximately the same average time which is about log2(
> depth ) for binary trees.
>
> Objectively speaking though, in terms of choosing a new malloc
> implementation, the best one should be based on algorithmic complexity, but
> also size and speed metrics.
>
> C

That much is clear as well. As I understood it, though, Rich didn't
merely want to improve the existing malloc; he wanted to find a new
implementation, which had best have some desirable properties, namely
resilience against buffer overflows and fine-grained locking for good
multi-threaded performance. omalloc has one of these, and speed can
always be optimized.

Reading omalloc is complicated by the tons of customizability in the
algorithm. For some reason, its authors thought it would be a good idea
to have a file, /etc/malloc.conf, be a symlink that does not point to a
file, but rather has the configuration string as its target. Ugly, but
it means they can read the configuration in a single readlink() syscall.

I still have no idea how many calls to the RNG are for security, and
how many are necessary to the algorithm. For instance, when a chunk has
to be handed out, they don't return the first free chunk, they return a
random free chunk: draw a random number between 0 and the number of
free chunks, then look for that many 0-bits in the bitmap. Now, is that
done to reduce the predictability of the returned address, to reduce
memory fragmentation, or for some other reason? I really can't tell.

Now, the applications that tax the allocator more than any others are
OO applications that constantly allocate and free small objects:
firefox and the like.
But those also use multiple threads, so having an arena allocator would
be nice; and in fact, including libtcmalloc did speed up firefox a lot.
libtcmalloc essentially puts a thread-local cache in front of a central
allocator. We could modify omalloc like this as well. Not sure how,
though. Have multiple hash tables? But then how do we locate the one
that contains the object we're supposed to be freeing? Iterate over all
of them? That's going to be a drag on run-time. And how do we extend
the list of hash tables? And, more importantly, how do we shrink it
when a thread exits?

What I am getting at is this: we can't just benchmark a few algorithms
and stick with the fastest one. Nor is a simple algorithmic analysis
enough, since often enough, simple lists can be replaced with more
complicated containers, yielding better performance for some loads.
It's complicated...

Ciao,
Markus