|
Message-ID: <20220919134319.GN9709@brightrain.aerifal.cx> Date: Mon, 19 Sep 2022 09:43:19 -0400 From: Rich Felker <dalias@...c.org> To: baiyang <baiyang@...il.com> Cc: musl <musl@...ts.openwall.com> Subject: Re: The heap memory performance (malloc/free/realloc) is significantly degraded in musl 1.2 (compared to 1.1) On Mon, Sep 19, 2022 at 03:53:30PM +0800, baiyang wrote: > Hi there, > > As we have discussed at > https://github.com/openwrt/openwrt/issues/10752. The > malloc_usable_size() function in musl 1.2 (mallocng) seems to have > some performance issues. > > It caused realloc and free spends too long time for get the chunk size. > > As we mentioned in the discussion, tcmalloc and some other > allocators can also accurately obtain the size class corresponding > to a memory block and its precise size, and it is also very fast at > the same time. > > Can we make some improvements to the existing malloc_usable_size > algorithm in mallocng? This should significantly improve the > performance of existing algorithms. Can you please start from a point of identifying the real-world case in which you're hitting a performance degredation? Made-up tests are generally not helpful and will almost always lead to focusing on the wrong problem. For now I'm going to focus on some things from the linked thread: > > Considering that realloc itself contains a complete > > malloc_usable_size (refer to here and here), So actually most > > (66.7%) of the realloc time is spent doing malloc_usable_size. In your test that increments the realloc size by one each iteration, only one in every PAGESIZE calls has any real work to do. The rest do nothing but set_size after obtaining the metadata on the object they're acting on. It's completely expected that the runtime of these will be dominated by obtaining the metadata; this isn't evidence of anything wrong. And, moreover, it's almost surely a lot more than 66.7%. Most of the 0.8s difference is likely spent on the 2560 mmap syscalls and page faults accessing the new pages they produce. > > In implementations such as: glibc, tcmalloc, msw crt (_msize), mac > > os x (malloc_size), and musl 1.1, even on low-end embedded > > processors, the consumption of malloc_usable_size per 10 million > > calls is mostly not more than a few hundred milliseconds. It looks like mallocng's malloc_usable_size is taking around 150 ns per call on your system, vs maybe 30-50 for others? > > In addition, this very slow slab size acquisition algorithm also > > needs to be called every time free (see here). So we believe it > > should be the main reason for malloc/free and realloc performance > > degradation in version 1.2. Unless you have an application that's explicitly using malloc_usable_size all over the place, it's highly unlikely that this is the cause of your real-world performance problems. The vast majority of reported problems with malloc performance have been in multithreaded applications, where the dominating time cost is fundamental: synchronization cost of having global consistency. There you'll expect to find very similar performance figures from any other allocator with global consistency, such as hardened_malloc. If you're really having single-threaded performance problems that aren't just in made-up benchmarks, please see if you can narrow down the cause empirically rather than speculatively. For example, running the program under perf and looking at where the time is being spent. > > If we can improve its speed and make it close to implementations > > like tcmalloc (tcmalloc can also accurately return the size of the > > size class to which the chunk belongs), it should significantly > > improve the performance of mallocng (at least in single-threaded > > scenarios) . tcmalloc is fast by not having global consistency, not being hardened against memory errors like double-free and use-after-free, and not avoiding fragmentation and excessive memory usage. Likewise for most of the others. The run time costs in mallocng for looking up the out-of-band metadata are largely fundamental to it being out-of-band (not subject to direct falsification via typically exploitable application bugs), size-efficient, 32-bit-compatible, nommu-compatible, etc. Other approaches like in hardened_malloc can be moderately more efficient to access the metadata, at the price of not being at all amenable to small systems, which are a core goal of musl we can't really disregard. I can't say for sure there's not any room for optimization in the metadata fetching though. Looking at the assembly output might be informative, to see if we're doing anything that's making the compiler emit gratuitously inefficient code. Rich
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.