Message-ID: <c0682e6.1a.186417c1d3a.Coremail.00107082@163.com>
Date: Sun, 12 Feb 2023 01:18:18 +0800 (CST)
From: "David Wang" <00107082@....com>
To: musl@...ts.openwall.com
Subject: Re:Re: qsort

At 2023-02-11 21:35:33, "Rich Felker" <dalias@...c.org> wrote:
>Based on the profiling data, I would predict an instant 2x speed boost
>special-casing small sizes to swap directly with no memcpy call.
>

I made some experimental changes, using a different cycle function for each of the widths 4, 8, and 16:

--- a/src/stdlib/qsort.c
+++ b/src/stdlib/qsort.c
...
+static void cyclex1(unsigned char* ar[], int n)
+{
+	unsigned char tmp[32];
+	int i;
+	int *p1, *p2;
+	if(n < 2) {
+		return;
+	}
+	ar[n] = tmp;
+	p1 = (int*)ar[n];
+	p2 = (int*)ar[0];
+	*p1 = *p2;
+	for(i = 0; i < n; i++) {
+		p1 = (int*)ar[i];
+		p2 = (int*)ar[i+1];
+		p1[0] = p2[0];
+	}
+}
+static void cyclex2(unsigned char* ar[], int n)
+{
+	unsigned char tmp[32];
+	int i;
+	long long *p1, *p2;
+	if(n < 2) {
+		return;
+	}
+	ar[n] = tmp;
+	p1 = (long long*)ar[n];
+	p2 = (long long*)ar[0];
+	*p1 = *p2;
+	for(i = 0; i < n; i++) {
+		p1 = (long long*)ar[i];
+		p2 = (long long*)ar[i+1];
+		p1[0] = p2[0];
+	}
+}
+static void cyclex4(unsigned char* ar[], int n)
+{
+	unsigned char tmp[32];
+	int i;
+	long long *p1, *p2;
+	if(n < 2) {
+		return;
+	}
+	ar[n] = tmp;
+	p1 = (long long*)ar[n];
+	p2 = (long long*)ar[0];
+	*p1++ = *p2++;
+	*p1++ = *p2++;
+	for(i = 0; i < n; i++) {
+		p1 = (long long*)ar[i];
+		p2 = (long long*)ar[i+1];
+		p1[0] = p2[0];
+		p1[1] = p2[1];
+	}
+}
+
-	cycle(width, ar, i);
+	if (width==4) cyclex1(ar, i);
+	else if (width==8) cyclex2(ar, i);
+	else if (width==16) cyclex4(ar, i);
+	else cycle(width, ar, i);
---

I am not skilled in writing high-performance code; the above is what I can think of for now. A rough timing report follows:

+-------------------------+-----------+----------+-----------+
| item size               |   glibc   |   musl   | opt musl  |
+-------------------------+-----------+----------+-----------+
| 4  int                  | 0m15.794s | 1m 7.52s | 0m 37.27s |
| 8  long                 | 0m16.351s | 1m 2.92s | 0m 45.12s |
| 16 struct{ long k, v; } | 0m23.262s | 1m 9.74s | 0m 55.07s |
+-------------------------+-----------+----------+-----------+
(128 rounds of qsort on 1<<20 random items)

The test code for the 16-byte case:

#include <stdio.h>
#include <stdlib.h>

typedef struct { long long k, v; } VNode;

int mycmp(const void *a, const void *b)
{
	long long d = ((const VNode*)a)->v - ((const VNode*)b)->v;
	if (d > 0) return 1;
	else if (d < 0) return -1;
	return 0;
}

#define MAXN (1<<20)

VNode vs[MAXN];

int main()
{
	int i, j, k, n;
	long long t;
	for (k = 0; k < 128; k++) {
		/* fill with sorted keys, then shuffle (Fisher-Yates) */
		for (i = 0; i < MAXN; i++) vs[i].v = i;
		for (n = MAXN; n > 1; n--) {
			i = n-1; j = rand()%n;
			if (i != j) { t = vs[i].v; vs[i].v = vs[j].v; vs[j].v = t; }
		}
		qsort(vs, MAXN, sizeof(vs[0]), mycmp);
		for (i = 0; i < MAXN; i++) if (vs[i].v != i) { printf("error\n"); return 1; }
	}
	return 0;
}

The biggest improvement is for sorting int32; as the item size increases, the impact of the memcpy call overhead decreases.

>Incidentally, our memcpy is almost surely at least as fast as glibc's
>for 4-byte copies. It's very large sizes where performance is likely
>to diverge.
>
>Rich
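
For comparison, here is a minimal sketch (not from the patch above, and not musl API; the name cycle4 is illustrative) of the same width special-casing built on constant-size memcpy. Because the copy size is a compile-time constant, compilers can typically expand these memcpy calls inline as plain loads/stores, which avoids the call overhead without the alignment and strict-aliasing assumptions of the int/long long pointer casts.

#include <string.h>

/* Rotate ar[0]..ar[n-1] down one slot (old ar[0] ends up in ar[n-1]),
 * the same net effect as musl's cycle(), for a fixed 4-byte element.
 * The 8- and 16-byte variants would differ only in the constant. */
static void cycle4(unsigned char *ar[], int n)
{
	unsigned char tmp[4];
	int i;
	if (n < 2) return;
	memcpy(tmp, ar[0], 4);              /* save the first element */
	for (i = 0; i < n-1; i++)
		memcpy(ar[i], ar[i+1], 4);  /* shift each element down */
	memcpy(ar[n-1], tmp, 4);            /* close the cycle */
}

The dispatch site would then look much like the patch above: if (width==4) cycle4(ar, i); else ... falling back to the generic cycle() for other widths.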