Message-Id: <20230607100710.4286-1-zhang_fei_0403@163.com>
Date: Wed, 7 Jun 2023 18:07:07 +0800
From: zhangfei <zhang_fei_0403@....com>
To: dalias@...c.org, musl@...ts.openwall.com
Cc: zhangfei <zhangfei@...iscas.ac.cn>
Subject: [PATCH 0/3] RISC-V: Optimize memset, memcpy and memmove

From: zhangfei <zhangfei@...iscas.ac.cn>

Hi,

Currently, the RISC-V architecture code in the kernel source uses assembly
implementations of memset, memcpy, and memmove, as shown in the links below:
[1] https://github.com/torvalds/linux/blob/master/arch/riscv/lib/memset.S
[2] https://github.com/torvalds/linux/blob/master/arch/riscv/lib/memcpy.S
[3] https://github.com/torvalds/linux/blob/master/arch/riscv/lib/memmove.S

I have modified them into a form that can be compiled in musl. I also noticed
that aarch64 and x86 in musl already have assembly implementations of these
functions, so I hope these patches can be integrated into musl.

memset.S follows the handling of sizes smaller than 8 bytes in
musl/src/string/memset.c, and modifies the byte stores to fill the head and
tail with minimal branching.

The original memcpy.S in the kernel falls back to a byte-wise copy if src and
dst are not co-aligned, which is not efficient enough. Therefore, the patches
linked below were used to optimize the kernel's memcpy.S:
[4] https://lore.kernel.org/all/20210216225555.4976-1-gary@garyguo.net/
[5] https://lore.kernel.org/all/20210513084618.2161331-1-bmeng.cn@gmail.com/

memmove.S needed few modifications: it was only made independent of the
kernel's header files so that it can be compiled separately in musl.

The test platform is a RISC-V SiFive U74. I used the code linked below for
performance testing:
[6] https://github.com/ARM-software/optimized-routines/blob/master/string/bench/

Comparing the performance of the C implementations in musl with the assembly
implementations, the test results are as follows:

memset.c in musl:
---------------------
Random memset (bytes/ns):
  memset_call 32K: 0.36 64K: 0.29 128K: 0.25 256K: 0.23 512K: 0.22 1024K: 0.21 avg 0.25

Medium memset (bytes/ns):
  memset_call 8B: 0.28 16B: 0.30 32B: 0.48 64B: 0.86 128B: 1.55 256B: 2.60 512B: 3.72

Large memset (bytes/ns):
  memset_call 1K: 4.83 2K: 5.40 4K: 5.85 8K: 6.09 16K: 6.22 32K: 6.15 64K: 1.39

memset.S:
---------------------
Random memset (bytes/ns):
  memset_call 32K: 0.46 64K: 0.35 128K: 0.30 256K: 0.28 512K: 0.27 1024K: 0.25 avg 0.31

Medium memset (bytes/ns):
  memset_call 8B: 0.27 16B: 0.48 32B: 0.91 64B: 1.63 128B: 2.71 256B: 4.40 512B: 5.67

Large memset (bytes/ns):
  memset_call 1K: 6.62 2K: 7.03 4K: 7.46 8K: 7.71 16K: 7.83 32K: 7.57 64K: 1.39

memcpy.c in musl:
---------------------
Random memcpy (bytes/ns):
  memcpy_call 32K: 0.24 64K: 0.20 128K: 0.18 256K: 0.17 512K: 0.16 1024K: 0.15 avg 0.18

Aligned medium memcpy (bytes/ns):
  memcpy_call 8B: 0.18 16B: 0.31 32B: 0.50 64B: 0.72 128B: 0.94 256B: 1.10 512B: 1.19

Unaligned medium memcpy (bytes/ns):
  memcpy_call 8B: 0.12 16B: 0.17 32B: 0.23 64B: 0.47 128B: 0.65 256B: 0.79 512B: 0.91

Large memcpy (bytes/ns):
  memcpy_call 1K: 1.25 2K: 1.29 4K: 1.31 8K: 1.31 16K: 1.28 32K: 0.62 64K: 0.56

memcpy.S:
---------------------
Random memcpy (bytes/ns):
  memcpy_call 32K: 0.29 64K: 0.24 128K: 0.21 256K: 0.20 512K: 0.20 1024K: 0.17 avg 0.21

Aligned medium memcpy (bytes/ns):
  memcpy_call 8B: 0.15 16B: 0.56 32B: 0.91 64B: 1.17 128B: 2.36 256B: 2.90 512B: 3.27

Unaligned medium memcpy (bytes/ns):
  memcpy_call 8B: 0.15 16B: 0.27 32B: 0.45 64B: 0.67 128B: 0.90 256B: 1.03 512B: 1.16

Large memcpy (bytes/ns):
  memcpy_call 1K: 3.49 2K: 3.55 4K: 3.65 8K: 3.69 16K: 3.54 32K: 0.87 64K: 0.75

memmove.c in musl:
---------------------
Unaligned forwards memmove (bytes/ns):
  memmove 1K: 0.22 2K: 0.22 4K: 0.22 8K: 0.23 16K: 0.23 32K: 0.22 64K: 0.20

Unaligned backwards memmove (bytes/ns):
  memmove 1K: 0.28 2K: 0.28 4K: 0.28 8K: 0.28 16K: 0.28 32K: 0.28 64K: 0.24

memmove.S:
---------------------
Unaligned forwards memmove (bytes/ns):
  memmove 1K: 1.74 2K: 1.85 4K: 1.89 8K: 1.91 16K: 1.92 32K: 1.83 64K: 0.81

Unaligned backwards memmove (bytes/ns):
  memmove 1K: 1.70 2K: 1.81 4K: 1.87 8K: 1.89 16K: 1.91 32K: 1.84 64K: 0.83

The results show that the basic-instruction assembly implementations of
memset, memcpy, and memmove perform better than the C implementations in musl.
Please review the code.

Thanks,
Zhang Fei

zhangfei (3):
  RISC-V: Optimize memset
  RISC-V: Optimize memcpy
  RISC-V: Optimize memmove

 src/string/riscv64/memset.S  | 136 ++++++++++++++++++++++++++++++++++++
 src/string/riscv64/memcpy.S  | 159 ++++++++++++++++++++++++++++++++++++
 src/string/riscv64/memmove.S | 315 +++++++++++++++++++++++++++++++++++
 3 files changed, 610 insertions(+)
 create mode 100644 src/string/riscv64/memset.S
 create mode 100644 src/string/riscv64/memcpy.S
 create mode 100644 src/string/riscv64/memmove.S
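
For reference, the head/tail fill mentioned for memset.S can be sketched in C
roughly as follows. This is only an illustration modeled on the small-size
handling in musl/src/string/memset.c, not the code in the patch. The names
fill_sketch and fill_sketch_selfcheck are hypothetical, and the byte loop for
larger sizes is a simplification (an optimized routine would presumably use
aligned word stores there).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, for illustration only: fill n bytes at dest with c.
 * Sizes up to 8 are covered by overlapping stores from both ends, so only
 * a few size comparisons are needed and there is no per-byte loop. */
static void *fill_sketch(void *dest, int c, size_t n)
{
	unsigned char *s = dest;
	unsigned char b = (unsigned char)c;

	if (n == 0) return dest;
	s[0] = b;
	s[n-1] = b;
	if (n <= 2) return dest;
	s[1] = b;
	s[2] = b;
	s[n-2] = b;
	s[n-3] = b;
	if (n <= 6) return dest;
	s[3] = b;
	s[n-4] = b;
	if (n <= 8) return dest;

	/* Assumption: plain byte loop for the middle part, to keep the
	 * sketch short. The first and last four bytes are already set. */
	for (size_t i = 4; i < n - 4; i++)
		s[i] = b;
	return dest;
}

/* Compare against the C library memset for sizes 0..32, with guard
 * bytes on both sides to catch out-of-range stores. */
static void fill_sketch_selfcheck(void)
{
	unsigned char got[48], want[48];

	for (size_t n = 0; n <= 32; n++) {
		memset(got, 0xEE, sizeof got);
		memset(want, 0xEE, sizeof want);
		fill_sketch(got + 8, 0x5A, n);
		memset(want + 8, 0x5A, n);
		assert(memcmp(got, want, sizeof got) == 0);
	}
}
```

Some stores overlap for small n (for n = 3, s[1] is written twice), which is
harmless because every store writes the same value. The overlap is what lets
three size checks cover all sizes from 1 to 8.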