Message-ID: <20230421133034.GS3630668@port70.net>
Date: Fri, 21 Apr 2023 15:30:34 +0200
From: Szabolcs Nagy <nsz@...t70.net>
To: 张飞 <zhangfei@...iscas.ac.cn>
Cc: musl@...ts.openwall.com
Subject: Re: Re: Re: memset_riscv64

* 张飞 <zhangfei@...iscas.ac.cn> [2023-04-20 16:17:10 +0800]:
> Hi!
> I followed your suggestions and referred to string.c in Musl's test set (libc-bench),
> and then modified the test cases. Since BUFLEN is a fixed value in strlen.c, I changed
> it into a variable in my own test case and pass it to the memset function as a parameter.
> I increased LOOP_TIMES to 500, sorted the running times, and recorded only the middle
> 300 runs.
>
> I ran the two programs alternately on the SiFive chip, three times each, and the results
> are shown below.
> First run result
> --------------------------------------------------------------------------------
> length(byte)   C language implementation(s)   Basic instruction implementation(s)
> --------------------------------------------------------------------------------
> 100            0.002208102                     0.002304056
> 200            0.005053208                     0.004629598
> 400            0.008666684                     0.007739176
> 800            0.014065196                     0.012372702
> 1600           0.023377685                     0.020090966
> 3200           0.040221849                     0.034059631
> 6400           0.072095377                     0.060028906
> 12800          0.134040475                     0.110039387
> 25600          0.257426806                     0.210710952
> 51200          1.173755160                     1.121833227
> 102400         3.693170402                     3.637194098
> 204800         8.919975455                     8.865504460
> 409600         19.410922418                    19.360956493
> --------------------------------------------------------------------------------
>
> Second run result
> --------------------------------------------------------------------------------
> length(byte)   C language implementation(s)   Basic instruction implementation(s)
> --------------------------------------------------------------------------------
> 100            0.002208109                     0.002293857
> 200            0.005057374                     0.004640669
> 400            0.008674218                     0.007760795
> 800            0.014068582                     0.012417084
> 1600           0.023381095                     0.020124496
> 3200           0.040225138                     0.034093181
> 6400           0.072098744                     0.060069574
> 12800          0.134043954                     0.110088141
> 25600          0.256453187                     0.208578633
> 51200          1.166602505                     1.118972796
> 102400         3.684957231                     3.635116808
> 204800         8.916302592                     8.861590734
> 409600         19.411057216                    19.358777670
> --------------------------------------------------------------------------------
>
> Third run result
> --------------------------------------------------------------------------------
> length(byte)   C language implementation(s)   Basic instruction implementation(s)
> --------------------------------------------------------------------------------
> 100            0.002208111                     0.002293227
> 200            0.005056101                     0.004628539
> 400            0.008677756                     0.007748687
> 800            0.014085242                     0.012404443
> 1600           0.023397782                     0.020115710
> 3200           0.040242985                     0.034084435
> 6400           0.072116665                     0.060063767
> 12800          0.134060262                     0.110082427
> 25600          0.257865186                     0.209101754
> 51200          1.174257177                     1.117753408
> 102400         3.696518162                     3.635417503
> 204800         8.929357747                     8.858765915
> 409600         19.426520562                    19.356515671
> --------------------------------------------------------------------------------
>
> From the test results, it can be seen that the memset implemented in basic instruction
> set assembly generally runs faster than the one implemented in C.
> May I ask if the test results are convincing?
small sizes are much more common than large sizes. memsets can be
distributed such that the size ranges [0,100), [100,1000) and [1000,inf)
each account for 1/3 of all memsets (not by call count, but by the amount
of bytes memset using such sizes), i.e. if you speed up the [100,1000) and
[1000,inf) cases by 10% but regress the [0,100) case by 20% then the
overall performance roughly stays the same. (of course this is very
workload dependent, but across a system this is what i'd expect, probably
even more skewed to smaller sizes.)

so we need to know what happens in the [0,100) range. what i see is a ~4%
regression there, while there is a ~10% improvement in the [100,1000) case
and a ~15% improvement in the [1000,inf) case. (it would be nice to know
why the 25k case is so much faster and why that speedup only applies to
that size; we don't want to optimize for some obscure cpu bug that will go
away next year.)

on practical workloads i would expect < 10% speedup overall from the asm
code (but we need more data in the [0,100) range to tell). this may not be
enough to justify the asm code. rich already said he prefers a different
style of implementation (where the body of the function is in c but the
inner loop is in asm if that helps, e.g. via simd).

here is an example of a benchmark that takes the input distribution of a
workload into account:

https://github.com/ARM-software/optimized-routines/blob/master/string/bench/memset.c#L53

> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <time.h>
>
> #define BUFLEN 500000
> #define LOOP_TIMES 500
>
> int cmp(const void *a, const void *b) {
> 	double x = *(double *)a;
> 	double y = *(double *)b;
> 	if (x < y) return -1;
> 	if (x > y) return 1;
> 	return 0;
> }
>
> int main(){
> 	char *buf = malloc(BUFLEN);
> 	double *arr = malloc(sizeof(double) * LOOP_TIMES);
> 	size_t i,j,k;
> 	struct timespec tv0,tv;
> 	double times;
>
> 	for(j=100; j<BUFLEN; j*=2){
> 		for(k=0; k<LOOP_TIMES; k++){
> 			for (i=0; i<100; i++)
> 				memset(buf+i, i, j-i);
> 		}
> 	}
>
> 	for(j=100; j<BUFLEN; j*=2){
> 		for(k=0; k<LOOP_TIMES; k++){
> 			clock_gettime(CLOCK_REALTIME, &tv0);
> 			for (i=0; i<100; i++)
> 				memset(buf+i, i, j-i);

alignment only matters up to 64 byte alignment, and usually inputs are at
least 8 byte aligned. the value is almost always 0. (we probably don't even
need to test the non-0 case: a 0 check is correctly predicted in practice.)
i think the length should have a small variation, just enough to add a
penalty to the small-size checks where implementations may use many
branches. so something like this may be better (made-up off,al numbers):

	buf = malloc((1<<16)+32);
	size_t sz[] = {16, 32, 48, 64, 96, 200, 300, 400, 600, 1<<10, 1<<11,
	               1<<12, 1<<13, 1<<14, 1<<15, 1<<16, 0};
	size_t off[16] = {0, 0, 0, -8, 8, 16, 0, 0, -16, -12, 0, 4, -4, 0, 0, 12};
	size_t al[16] = {0, 0, 8, 4, 8, 0, 8, 16, 8, 16, 4, 2, 1, 8, 16, 1};
	for (j=0; sz[j]; j++)
		for (k=0; k<20; k++) {
			t0 = tic();
			// large loop count is important for small sizes
			for (i=0; i<256; i++)
				memset(buf + al[i%16], 0, sz[j] + off[i%16]);
			t1 = tic();
			tmin = min(tmin,t1-t0);
		}

large memsets (>=1k) can be tested separately (no need to add the off,al
variation then, and a smaller inner loop count is enough, but it should not
hurt to include them here).

> 			clock_gettime(CLOCK_REALTIME, &tv);
> 			tv.tv_sec -= tv0.tv_sec;
> 			if ((tv.tv_nsec -= tv0.tv_nsec) < 0) {
> 				tv.tv_nsec += 1000000000;
> 				tv.tv_sec--;
> 			}
> 			arr[k] = tv.tv_sec + (double)tv.tv_nsec/1000000000;
> 		}
> 		qsort(arr, 500, sizeof(double), cmp);

just take the minimum: we want to know the fastest execution.
>
> 		for (int m = 100; m < LOOP_TIMES - 100; m++) {
> 			times += arr[m];
> 		}
> 		printf("len: %ld time: %.9lf\n", j, times);

you can also print GB/s, which is 256*sz[j]/tmin in my example.

> 	}
> 	free(buf);
> 	return 0;
> }
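For illustration, here is a minimal, self-contained C version of the benchmark
sketched above. The tic() helper, the choice of 20 outer repetitions and the
GB/s reporting are filled-in assumptions; the sz/off/al tables are copied from
the sketch (and are still made-up numbers), so treat this as a starting point
rather than a finished benchmark.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* monotonic time in nanoseconds; stands in for the tic() used in the sketch */
static uint64_t tic(void)
{
	struct timespec ts;
	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000u + ts.tv_nsec;
}

int main(void)
{
	char *buf = malloc((1<<16) + 32);
	if (!buf) return 1;

	/* sizes, length offsets and alignments copied from the sketch (made-up numbers) */
	size_t sz[] = {16, 32, 48, 64, 96, 200, 300, 400, 600, 1<<10, 1<<11,
	               1<<12, 1<<13, 1<<14, 1<<15, 1<<16, 0};
	ptrdiff_t off[16] = {0, 0, 0, -8, 8, 16, 0, 0, -16, -12, 0, 4, -4, 0, 0, 12};
	size_t al[16] = {0, 0, 8, 4, 8, 0, 8, 16, 8, 16, 4, 2, 1, 8, 16, 1};

	for (size_t j = 0; sz[j]; j++) {
		uint64_t tmin = UINT64_MAX;
		for (int k = 0; k < 20; k++) {
			uint64_t t0 = tic();
			/* large inner loop count is important for small sizes */
			for (size_t i = 0; i < 256; i++)
				memset(buf + al[i%16], 0, sz[j] + off[i%16]);
			uint64_t t1 = tic();
			if (t1 - t0 < tmin)
				tmin = t1 - t0;
		}
		if (tmin == 0) tmin = 1; /* guard against a too-coarse clock */
		/* fastest run only: time per inner loop and throughput in GB/s */
		printf("len: %6zu  min: %8llu ns  %7.3f GB/s\n",
		       sz[j], (unsigned long long)tmin, 256.0 * sz[j] / tmin);
	}

	free(buf);
	return 0;
}

Each size keeps only the fastest of its repetitions, matching the advice above
to take the minimum, and the throughput is 256*sz[j]/tmin (bytes per
nanosecond, i.e. GB/s).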