Message-ID: <484084b3.20c61.18798649e40.Coremail.zhangfei@nj.iscas.ac.cn>
Date: Wed, 19 Apr 2023 15:22:23 +0800 (GMT+08:00)
From: 张飞 <zhangfei@...iscas.ac.cn>
To: musl@...ts.openwall.com
Subject: Re: Re: Re: [PATCH]Implementation of strlen function in riscv64 architecture

I did replace the C strlen code with a slower one, except when musl is built for the "#ifdef __riscv_vector" ISA extension. So I referred to the C strlen code and reimplemented it with the base instruction set; the performance of the two is basically the same.

The reason for implementing two versions is so that the strlen implemented with the base instruction set works on all RISC-V CPUs, while the vector version accelerates hardware that supports the vector extension. When the compiler is configured with --with-arch=rv64gcv, __riscv_vector is defined by default. Similar macros are common in RISC-V code; for example, setjmp/riscv64/setjmp.S in musl uses the __riscv_float_abi_soft macro.

At present, the RISC-V vector extension is frozen and the instruction set is stable. Other open source libraries, such as OpenSSL and OpenCV, already have RISC-V vector optimizations. The assembly generated by the compiler is often not the most efficient, and the cases that automatic vectorization can handle are limited, so we need to optimize the function by manual vectorization. For RISC-V, compiler automatic vectorization is still in its infancy.

I ran tests on different data sizes and compared the performance of strlen implemented in C, in the base instruction set, and in the vector instruction set. The test case is test_strlen.c.

The C implementation and the base instruction set assembly implementation were compared on a SiFive chip (RISC-V SiFive U74 Dual Core 64 Bit RV64GC ISA Chip Platform).
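A timing loop of the kind such a comparison needs can be sketched as follows. This is not the attached test_strlen.c; the names strlen_fn, make_string, time_strlen and run_sweep are hypothetical, the repetition count is arbitrary, and clock() is assumed to be precise enough for a rough per-call figure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Signature shared by every strlen candidate under test (hypothetical name). */
typedef size_t (*strlen_fn)(const char *);

/* Build a heap string of exactly len non-NUL bytes; the caller frees it. */
char *make_string(size_t len)
{
    char *s = malloc(len + 1);
    if (!s)
        return NULL;
    memset(s, 'a', len);
    s[len] = '\0';
    return s;
}

/* Average CPU seconds per call of fn on a string of length len,
 * or -1.0 if the string could not be allocated. */
double time_strlen(strlen_fn fn, size_t len, int reps)
{
    assert(reps > 0);
    char *s = make_string(len);
    if (!s)
        return -1.0;

    /* Calling through a volatile pointer stops the compiler from
     * hoisting or folding the repeated calls. */
    strlen_fn volatile f = fn;
    size_t total = 0;

    clock_t t0 = clock();
    for (int i = 0; i < reps; i++)
        total += f(s);
    clock_t t1 = clock();

    /* Every call must have returned the known length. */
    assert(total == len * (size_t)reps);
    free(s);
    return (double)(t1 - t0) / CLOCKS_PER_SEC / reps;
}

/* Print one row per length, doubling from 2 to 1048576 bytes. */
void run_sweep(strlen_fn fn, FILE *out)
{
    for (size_t len = 2; len <= 1048576; len *= 2)
        fprintf(out, "%-10zu %.9f\n", len, time_strlen(fn, len, 1000));
}
```

Calling run_sweep(strlen, stdout) from main prints one row per length for the libc strlen; passing another function with the same signature gives the second column.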
The test results are as follows. Because the two use the same algorithm, there is basically no difference in performance.

--------------------------------------------------------------------------------
length (bytes)   C implementation (s)   Base instruction implementation (s)
--------------------------------------------------------------------------------
2                0.00000528             0.000005441
4                0.00000544             0.000005437
8                0.00000464             0.00000496
16               0.00000544             0.00000512
32               0.0000064              0.00000592
64               0.000007994            0.000007841
128              0.000012               0.000012
256              0.000020321            0.000020481
512              0.000037282            0.000037762
1024             0.000069924            0.000070244
2048             0.000135046            0.000135528
4096             0.000264491            0.000264816
8192             0.000524342            0.000525631
16384            0.001069965            0.001047742
32768            0.002180252            0.002142207
65536            0.005921251            0.005883868
131072           0.012508934            0.012392895
262144           0.02503915             0.024896995
524288           0.049879091            0.049821832
1048576          0.09973658             0.099969603
--------------------------------------------------------------------------------

Because I do not have a chip that supports the vector extension, I compared the C and vector implementations of strlen on the Spike simulator, which has some reference value. The vector implementation is clearly more efficient than the C implementation, with an average performance improvement of over 800%.
--------------------------------------------------------------------------------
length (bytes)   C implementation (s)   Vector instruction implementation (s)
--------------------------------------------------------------------------------
2                0.000003639            0.000003339
4                0.000004239            0.000003339
8                0.000003639            0.000003339
16               0.000004339            0.000003339
32               0.000005739            0.000003339
64               0.000008539            0.000003339
128              0.000014139            0.000004039
256              0.000025339            0.000004739
512              0.000047739            0.000006139
1024             0.000092539            0.000008939
2048             0.000182139            0.000014539
4096             0.000361339            0.000025739
8192             0.000719739            0.000048139
16384            0.001436539            0.000092939
32768            0.002870139            0.000182539
65536            0.005737339            0.000361739
131072           0.011471739            0.000720139
262144           0.022940539            0.001436939
524288           0.045878139            0.002870539
1048576          0.091753339            0.005737739
--------------------------------------------------------------------------------

So I hope to select on __riscv_vector: hardware that does not support the vector extension executes the base instruction set implementation of strlen, which has the same performance as the C implementation. On hardware that supports the vector extension, the vector implementation of strlen is executed for acceleration.

Fei Zhang

> -----Original Message-----
> From: "Szabolcs Nagy" <nsz@...t70.net>
> Sent: 2023-04-11 20:48:22 (Tuesday)
> To: "张飞" <zhangfei@...iscas.ac.cn>
> Cc: musl@...ts.openwall.com
> Subject: Re: Re: [musl] [PATCH]Implementation of strlen function in riscv64 architecture
>
> * 张飞 <zhangfei@...iscas.ac.cn> [2023-04-10 13:59:22 +0800]:
> > I have made modifications to the assembly implementation of the riscv64 strlen function, mainly
> > focusing on address alignment processing to avoid the problem of data crossing
> > pages during vector instruction memory access.
> >
> > I think the assembly implementation of strlen is necessary. In glibc,
>
> if the c definition is not correct then you have to explain why.
> if it's very slow then please tell us so.
>
> > X86_64, aarch64, alpha, and others all have assembly implementations of this function,
> > while for riscv64, it is blank.
> > I have also analyzed the test sets of Spec2006 and Spec2017, and the strlen function is also a hot topic.
>
> an asm implementation has significant maintenance cost so you should
> provide some benchmark data or other evidence/reasoning for us to
> decide if it's worth the cost.
>
> it seems you replaced the c strlen code with a slower one except when
> musl is built for "#ifdef __riscv_vector" isa extension. what cpus
> does this affect? are linux distros expected to use this as baseline?
> do different riscv cpus have similar simd performance properties? who
> will tweak the asm if not?
>
> in principle what you did can be done by the compiler auto vectorizer
> so maybe contributing to the compiler is more useful.
>
> note that glibc has cpu specific implementations that it can select
> at runtime, but musl uses one generic implementation for all cpus.

Download attachment "strlen_riscv64.patch" of type "application/octet-stream" (2007 bytes)

View attachment "test_strlen.c" of type "text/plain" (1035 bytes)
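
The compile-time selection discussed in this thread can be illustrated in C. The sketch below is not the attached patch: strlen_impl_name and wordwise_strlen are hypothetical names, the vector branch is only named rather than implemented, and the word-at-a-time loop is assumed to resemble the technique that the C code and the base instruction set version share.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A toolchain targeting the V extension predefines __riscv_vector,
 * so the choice is fixed when the library is built, not at run time. */
#if defined(__riscv_vector)
#define STRLEN_IMPL "vector"
#else
#define STRLEN_IMPL "base"
#endif

/* Hypothetical helper: reports which variant this build selects. */
const char *strlen_impl_name(void)
{
    return STRLEN_IMPL;
}

#define ONES  ((size_t)-1 / UCHAR_MAX)            /* 0x0101...01 */
#define HIGHS (ONES * (UCHAR_MAX / 2 + 1))        /* 0x8080...80 */
/* Nonzero exactly when at least one byte of x is zero. */
#define HASZERO(x) (((x) - ONES) & ~(x) & HIGHS)

/* Word-at-a-time strlen (hypothetical name). */
size_t wordwise_strlen(const char *s)
{
    assert(s != NULL);
    const char *a = s;

    /* Step bytewise until s is aligned to the word size. */
    for (; (uintptr_t)s % sizeof(size_t); s++)
        if (!*s)
            return (size_t)(s - a);

    /* Scan one aligned word per iteration. The load may read bytes
     * past the terminator but stays inside one aligned word, so it
     * cannot cross a page; strictly portable C does not sanction
     * this over-read, which is one reason to treat this as a sketch. */
    for (;;) {
        size_t w;
        memcpy(&w, s, sizeof w);
        if (HASZERO(w))
            break;
        s += sizeof w;
    }

    /* Locate the terminator inside the final word. */
    for (; *s; s++)
        ;
    return (size_t)(s - a);
}
```

Because the macro is tested by the preprocessor, a build without the vector extension contains only the base variant, which matches the single-implementation model of musl described in the quoted reply.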