Message-ID: <20220921175817.GW9709@brightrain.aerifal.cx>
Date: Wed, 21 Sep 2022 13:58:17 -0400
From: Rich Felker <dalias@...c.org>
To: 王志强 <00107082@....com>
Cc: musl@...ts.openwall.com, Quentin Rameau <quinq@...th.space>,
	Florian Weimer <fweimer@...hat.com>
Subject: Re: Re:Re: The heap memory performance (malloc/free/realloc) is
 significantly degraded in musl 1.2 (compared to 1.1)

On Wed, Sep 21, 2022 at 01:15:35PM -0400, Rich Felker wrote:
> On Wed, Sep 21, 2022 at 06:15:02PM +0800, 王志强 wrote:
> > Hi Rich,
> > 
> > I am quite interested in the topic, and made a comparison between
> > glibc and musl with the following code:
> > 
> > ```
> > #include <stdlib.h>
> > 
> > #define MAXF 4096
> > void* tobefree[MAXF];
> > 
> > int main() {
> >     long long i;
> >     int v, k;
> >     size_t s, c=0;
> >     char *p;
> >     for (i=0; i<100000000L; i++) {
> >         v = rand();
> >         s = ((v%256)+1)*1024;
> >         p = (char*) malloc(s);
> >         p[1023]=0;
> >         if (c>=MAXF) {
> >             k = v%c;
> >             free(tobefree[k]);
> >             tobefree[k]=tobefree[--c];
> >         }
> >         tobefree[c++]=p;
> >     }
> >     return 0;
> > }
> > ```
> > 
> > The results show a significant difference.
> > 
> > With glibc (running within a debian docker image):
> > 
> > # gcc -o m.debian -O0 app_malloc.c
> > # time ./m.debian
> > real 0m37.529s
> > user 0m36.677s
> > sys 0m0.771s
> > 
> > With musl (running within an alpine3.15 docker image):
> > 
> > # gcc -o m.alpine -O0 app_malloc.c
> > # time ./m.alpine
> > real 6m 30.51s
> > user 1m 36.67s
> > sys 4m 53.31s
> > 
> > musl seems to spend way too much time in the kernel, while glibc
> > keeps most of the work in userspace.
> > 
> > I used perf_event_open to profile those programs. The musl profile
> > (302899 samples total) shows that the malloc/free sequence spends
> > lots of time dealing with pagefault/munmap/madvise/mmap:
> > 
> > munmap(30.858% 93469/302899)
> > _init?(22.583% 68404/302899)
> >     aligned_alloc?(89.290% 61078/68404)
> >         asm_exc_page_fault(45.961% 28072/61078)
> >     main(9.001% 6157/68404)
> >         asm_exc_page_fault(29.170% 1796/6157)
> >     rand(1.266% 866/68404)
> > aligned_alloc?(20.437% 61904/302899)
> >     asm_exc_page_fault(56.038% 34690/61904)
> > madvise(13.275% 40209/302899)
> > mmap64(11.125% 33698/302899)
> > 
> > But the glibc profile (29072 samples total) is much lighter: page
> > faults are the biggest cost, while glibc spends significant time in
> > "free":
> > 
> > pthread_attr_setschedparam?(82.021% 23845/29072)
> >     asm_exc_page_fault(1.657% 395/23845)
> > _dl_catch_error?(16.714% 4859/29072)
> >     __libc_start_main(100.000% 4859/4859)
> >         cfree(58.839% 2859/4859)
> >         main(31.138% 1513/4859)
> >             asm_exc_page_fault(2.115% 32/1513)
> >         pthread_attr_setschedparam?(3.725% 181/4859)
> >         random(2.099% 102/4859)
> >         random_r(1.832% 89/4859)
> >         __libc_malloc(1.420% 69/4859)
> > 
> > It seems to me that glibc makes heavy use of memory cached from the
> > kernel and so avoids most of the page faults and syscalls.
> > 
> > Should this performance difference concern real-world applications?
> > On average, musl actually spends about 3~4µs per malloc/free, which
> > is quite acceptable in real-world applications, I think.
> > 
> > (It seems to me that the performance difference has nothing to do
> > with malloc_usable_size, which may indeed be just a speculative
> > guess without any basis.)
> 
> Indeed this has nothing to do with it. What you're seeing is just that
> musl/mallocng return freed memory, and glibc, basically, doesn't
> (modulo the special case of a large contiguous free block at the 'top'
> of the heap). This inherently has a time cost.
> 
> mallocng does make significant efforts to avoid hammering mmap/munmap
> under repeated malloc/free, at least in cases where it can reasonably
> be deemed to matter. However, this is best-effort, and always a
> tradeoff on (potential) large unwanted memory usage vs performance.
> More on this later.
> 
> Your test case, with the completely random size distribution across
> various large sizes, is likely a worst case. The mean size you're
> allocating is 128k, which is the threshold for direct mmap/munmap of
> each allocation, so at least half of the allocations you're making
> can *never* be reused, and will always be immediately unmapped on
> free. It might be interesting to change the scaling factor from 1k to
> 256 bytes so that basically all of the allocation sizes are in the
> malloc-managed range.

One observation if this change is made: it looks like at least 70% of
the time is spent performing madvise(MADV_FREE), and that a large
portion of the rest (just looking at strace) seems to be repeatedly
mapping and freeing a 17-page (68k) block, probably because this size
happens to be at the boundary of some threshold where bounce
protection isn't happening.

I think we should look at both of these in more detail, since they
both suggest opportunities for large performance improvements at low
cost.

Rich
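
For concreteness, here is a sketch of the quoted test with the scaling
factor changed from 1k to 256 bytes as suggested above, so every request
stays below the ~128k direct-mmap threshold. The include, the comments,
and the p[s - 1] write (the original p[1023] would overflow the smaller
sizes produced here) are illustrative additions, not part of either
message:

```
#include <stdlib.h>

#define MAXF 4096
static void *tobefree[MAXF];

int main(void)
{
    long long i;
    int v, k;
    size_t s, c = 0;
    char *p;

    for (i = 0; i < 100000000L; i++) {
        v = rand();
        /* scale by 256 bytes instead of 1k: sizes 256B..64k, so all
         * requests stay in the allocator-managed range */
        s = ((v % 256) + 1) * 256;
        p = malloc(s);
        if (!p)
            return 1;
        /* touch the allocation; the original p[1023]=0 would be out of
         * bounds for the smaller sizes generated here */
        p[s - 1] = 0;
        if (c >= MAXF) {
            k = v % c;
            free(tobefree[k]);
            tobefree[k] = tobefree[--c];
        }
        tobefree[c++] = p;
    }
    return 0;
}
```

Comparing `time` and `strace -c` output for this variant against the
original should give a rough split between syscall traffic
(mmap/munmap/madvise) and page faults in the remaining gap.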
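
The 17-page observation could be probed in isolation with a loop like
the one below; the 68k size and the iteration count are illustrative
guesses, not values taken from the strace output mentioned above.
Running it under `strace -c` should show whether a block of that size
bounces through mmap/munmap on each free:

```
#include <stdlib.h>

int main(void)
{
    /* Repeatedly allocate and free a single ~17-page (68k) block and
     * count the resulting mmap/munmap/madvise calls with `strace -c`.
     * Size and iteration count are illustrative guesses. */
    for (long i = 0; i < 1000000; i++) {
        char *p = malloc(68 * 1024);
        if (!p)
            return 1;
        p[0] = 0; /* touch the block so it is actually faulted in */
        free(p);
    }
    return 0;
}
```

If the mmap/munmap counts grow with the iteration count, that size is
falling outside whatever bounce protection applies; if they stay flat,
the behavior seen in the original test presumably needs the interleaved
random sizes to trigger it.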