Message-ID: <10dbd851.a99.1863ee385b5.Coremail.00107082@163.com>
Date: Sat, 11 Feb 2023 13:12:23 +0800 (CST)
From: "David Wang" <00107082@....com>
To: "Rich Felker" <dalias@...c.org>
Cc: musl@...ts.openwall.com, "Markus Wichmann" <nullplan@....net>
Subject: Re:Re: Re:Re: Re:Re: Re:Re: qsort

At 2023-02-10 22:19:55, "Rich Felker" <dalias@...c.org> wrote:
>On Fri, Feb 10, 2023 at 09:45:12PM +0800, David Wang wrote:
>> About wrapper_cmp, in my last profiling there were 931387 samples
>> collected in total; 257403 of them contain the callchain
>> ->wrapper_cmp, and among those 257403 samples, 167410 contain the
>> callchain ->wrapper_cmp->mycmp. That is why I think there is extra
>> overhead in wrapper_cmp. Maybe compiler optimization would change
>> the result, and I will make further checks.
>
>Yes. On i386 here, -O0 takes wrapper_cmp from 1 instruction to 10
>instructions.
>
>Rich

With optimized binary code it is very hard to collect an intact call
chain from the kernel via perf_event_open:PERF_SAMPLE_CALLCHAIN. But to
profile qsort a call chain may not be necessary: sampling only the IP
register is enough to identify which parts take the most CPU cycles.
So I changed strategy; instead of PERF_SAMPLE_CALLCHAIN I now just use
PERF_SAMPLE_IP.

This is what I got:

+-------------------+---------------+
| func              | count         |
+-------------------+---------------+
| Total             |        423488 |
| memcpy            | 48.76% 206496 |
| sift              | 16.29%  68989 |
| mycmp             | 14.57%  61714 |
| trinkle           |  8.90%  37690 |
| cycle             |  5.45%  23061 |
| shr               |  2.19%   9293 |
| __qsort_r         |  1.77%   7505 |
| main              |  1.04%   4391 |
| shl               |  0.55%   2325 |
| wrapper_cmp       |  0.42%   1779 |
| rand              |  0.05%    229 |
| __set_thread_area |  0.00%     16 |
+-------------------+---------------+

(Note that in this report I count only samples whose IP falls directly
within a function's own body; samples inside a sub-function do not
contribute to any of its parent functions.)

And you're right: with optimization the impact of wrapper_cmp is very
low, only 0.42%.

memcpy stands out above. I used a uprobe
(perf_event_open:PERF_SAMPLE_REGS_USER) to collect statistics on the
size argument of memcpy (the 3rd parameter, passed in the RDX
register), and all of those memcpy calls copy just 4 bytes. According
to the source code, the memcpy size is the size of the item being
sorted, which is int32 in my test case. Maybe something could be
improved here.
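Just to illustrate the direction I have in mind (this is only a
sketch, not musl's actual code, and copy_elem is a name I made up):
if the element copy dispatched on the width once, the common 4-byte
and 8-byte cases would have compile-time-constant lengths and could be
lowered to plain load/store pairs instead of calls into the general
memcpy:

-------------------
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not musl's code: copy one element of 'width'
 * bytes.  In the fixed-size branches the memcpy length is a constant,
 * so the compiler can inline it to a single load/store pair. */
static void copy_elem(void *dst, const void *src, size_t width)
{
	switch (width) {
	case 4: {
		uint32_t v;
		memcpy(&v, src, 4);
		memcpy(dst, &v, 4);
		break;
	}
	case 8: {
		uint64_t v;
		memcpy(&v, src, 8);
		memcpy(dst, &v, 8);
		break;
	}
	default:
		memcpy(dst, src, width);
		break;
	}
}

int main(void)
{
	int a = 1, b = 2;
	copy_elem(&a, &b, sizeof a); /* takes the 4-byte path */
	return a == 2 ? 0 : 1;
}
-------------------

Whether this actually pays off would of course have to be measured
against the real qsort.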
I also ran the same profiling against glibc:

+-----------------------------+---------------+
| func                        | count         |
+-----------------------------+---------------+
| Total                       |        640880 |
| msort_with_tmp.part.0       | 73.99% 474176 |  <--- merge sort?
| mycmp                       | 11.76%  75392 |
| main                        |  6.45%  41306 |
| __memcpy_avx_unaligned_erms |  4.58%  29339 |
| random                      |  0.86%   5525 |
| __memcpy_avx_unaligned      |  0.83%   5293 |
| random_r                    |  0.76%   4882 |
| rand                        |  0.45%   2897 |
| _init                       |  0.31%   1975 |
| _fini                       |  0.01%     80 |
| __free                      |  0.00%      5 |
| _int_malloc                 |  0.00%      5 |
| malloc                      |  0.00%      2 |
| __qsort_r                   |  0.00%      1 |
| _int_free                   |  0.00%      1 |
+-----------------------------+---------------+

Test code:
-------------------
#include <stdio.h>
#include <stdlib.h>

int mycmp(const void *a, const void *b)
{
	return *(const int *)a - *(const int *)b;
}

#define MAXN (1<<20)
int vs[MAXN];

int main()
{
	int i, j, k, n, t;
	for (k = 0; k < 1024; k++) {
		/* fill with 0..MAXN-1, then Fisher-Yates shuffle */
		for (i = 0; i < MAXN; i++) vs[i] = i;
		for (n = MAXN; n > 1; n--) {
			i = n-1;
			j = rand()%n;
			if (i != j) { t = vs[i]; vs[i] = vs[j]; vs[j] = t; }
		}
		qsort(vs, MAXN, sizeof(vs[0]), mycmp);
	}
	return 0;
}
-------------------
gcc test.c -O2 -static

With musl-libc:
$ time ./a.out
real	9m 5.10s
user	9m 5.09s
sys	0m 0.00s

With glibc:
$ time ./a.out
real	1m56.287s
user	1m56.270s
sys	0m0.004s

To sum up, optimizing those memcpy calls and reducing comparisons to
the minimum could bring a significant performance improvement, but I
doubt it could account for a factor-of-4 difference.

FYI
David