Message-ID: <CAJgzZooxj+5AKPg1O=kFW6pNQmY5sHuosG3LonOEyOd9aQBK7A@mail.gmail.com>
Date: Wed, 19 Apr 2023 15:39:06 -0700
From: enh <enh@...gle.com>
To: musl@...ts.openwall.com
Subject: Re: Re: Re: [PATCH]Implementation of strlen function in
riscv64 architecture
On Wed, Apr 19, 2023 at 12:22 AM 张飞 <zhangfei@...iscas.ac.cn> wrote:
> I did replace the C strlen code with a slower one, except when musl is
> built with the vector ISA extension ("#ifdef __riscv_vector"). So I
> referred to the C strlen code and reimplemented it with the base
> instruction set; the performance of the two is basically the same.
>
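For reference, the word-at-a-time technique that a generic C strlen typically uses can be sketched as below. This is only an illustration of the approach, not the code from the patch under discussion: the name `wordwise_strlen` is hypothetical, and the `__may_alias__` attribute assumes GCC or Clang.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define ONES  ((size_t)-1 / UCHAR_MAX)       /* 0x0101...01 */
#define HIGHS (ONES * (UCHAR_MAX / 2 + 1))   /* 0x8080...80 */
/* Nonzero if and only if some byte of x is zero. */
#define HASZERO(x) (((x) - ONES) & ~(x) & HIGHS)

/* Hypothetical name; a sketch of the technique, not the patch itself. */
size_t wordwise_strlen(const char *s)
{
    const char *start = s;
    typedef size_t __attribute__((__may_alias__)) word;
    const word *w;

    /* Byte loop until s is aligned to the word size. */
    for (; (uintptr_t)s % sizeof(size_t); s++)
        if (!*s)
            return (size_t)(s - start);

    /* Word loop: an aligned word load never crosses a page boundary. */
    for (w = (const void *)s; !HASZERO(*w); w++)
        ;

    /* Find the exact zero byte inside the final word. */
    for (s = (const void *)w; *s; s++)
        ;
    return (size_t)(s - start);
}
```

A base-instruction assembly version that follows the same three loops would be expected to perform about the same as this C code, which matches the claim above.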
> The reason for implementing two versions is that the strlen implemented
> with the base instruction set can run on all RISC-V CPUs, while the
> vector version can accelerate hardware that supports the vector
> extension. When the compiler enables the vector extension through
> --with-arch=rv64gcv, __riscv_vector is also defined by default. Similar
> macros are common on riscv; for example, setjmp/riscv64/setjmp.S in musl
> uses the __riscv_float_abi_soft macro.
>
> At present, the riscv vector extension is frozen and the instruction set
> is stable. Other open source libraries, such as OpenSSL and OpenCV,
> already have riscv vector optimizations.
is that actually checked in to openssl? the linux kernel patches to
save/restore vector state still haven't been merged to linux-next afaik,
and there's still no hwcaps support for V either. or are they using
`__riscv_vector` too, and not detecting V at runtime? (the kernel's own use
of V and Zb* seems to be based on an internal-only hwcap mechanism for now.)
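To make the compile-time versus runtime distinction concrete, here is a minimal sketch of selection keyed on `__riscv_vector`. The names `strlen_generic`, `strlen_rvv`, `select_strlen` and `built_with_vector` are hypothetical, and `strlen_rvv` stands in for a hand-written vector routine that is assumed to live in a separate assembly file.

```c
#include <stddef.h>

/* Portable byte-at-a-time fallback (hypothetical name). */
static size_t strlen_generic(const char *s)
{
    const char *p = s;
    while (*p)
        p++;
    return (size_t)(p - s);
}

#ifdef __riscv_vector
/* Hypothetical hand-written RVV routine, assumed to be defined in a .S file. */
size_t strlen_rvv(const char *s);
#endif

/* The choice is fixed by the compiler flags; nothing is probed at runtime. */
size_t select_strlen(const char *s)
{
#ifdef __riscv_vector
    return strlen_rvv(s);
#else
    return strlen_generic(s);
#endif
}

/* Reports which path was compiled in: 1 = vector, 0 = generic. */
int built_with_vector(void)
{
#ifdef __riscv_vector
    return 1;
#else
    return 0;
#endif
}
```

With this shape, a libc built for rv64gcv executes vector instructions unconditionally, so it only works on systems where the hardware and kernel both support V.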
> We know that the assembly generated by the compiler is often not the most
> efficient, and the cases that automatic vectorization can handle are
> limited, so we need to optimize the function by manual vectorization.
> For riscv, compiler automatic vectorization is still in its infancy.
>
have you tried sifive's autovectorization patches? do they help for this
code?
> I ran tests on different data sizes and compared the performance of
> strlen implemented in C, in the base instruction set, and in the vector
> instruction set. The test case is test_strlen.c.
>
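test_strlen.c is not included in the message. A timing helper of the kind such a test might use could look like the sketch below; the name `time_strlen`, the fill byte, and the use of `clock()` (CPU time) are all assumptions, not details taken from the actual test case.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Time a strlen-style function on a string of `len` bytes.
 * Returns the average CPU seconds per call, or -1.0 on failure.
 * Hypothetical helper; not taken from test_strlen.c. */
double time_strlen(size_t (*fn)(const char *), size_t len, int reps)
{
    char *buf = malloc(len + 1);
    if (!buf || reps <= 0) {
        free(buf);
        return -1.0;
    }
    memset(buf, 'a', len);
    buf[len] = '\0';

    /* volatile keeps the compiler from discarding the calls. */
    volatile size_t sink = 0;
    clock_t t0 = clock();
    for (int i = 0; i < reps; i++)
        sink += fn(buf);
    clock_t t1 = clock();

    /* Sanity check: every call must have returned len. */
    int ok = (sink == len * (size_t)reps);
    free(buf);
    if (!ok)
        return -1.0;
    return (double)(t1 - t0) / CLOCKS_PER_SEC / reps;
}
```

Calling `time_strlen(strlen, n, reps)` for each length n from 2 to 1048576, once per implementation, would produce a table of the same shape as the ones below.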
> The comparison between the C implementation and the base-instruction
> assembly implementation was run on a SiFive chip (RISC-V SiFive U74 Dual
> Core 64 Bit RV64GC ISA Chip Platform).
>
> The test results are as follows. Because the two use the same algorithm,
> there is basically no difference in performance.
>
>
> --------------------------------------------------------------------------------
> length (bytes)  C implementation (s)  Basic instruction implementation (s)
> --------------------------------------------------------------------------------
> 2               0.00000528            0.000005441
> 4               0.00000544            0.000005437
> 8               0.00000464            0.00000496
> 16              0.00000544            0.00000512
> 32              0.0000064             0.00000592
> 64              0.000007994           0.000007841
> 128             0.000012              0.000012
> 256             0.000020321           0.000020481
> 512             0.000037282           0.000037762
> 1024            0.000069924           0.000070244
> 2048            0.000135046           0.000135528
> 4096            0.000264491           0.000264816
> 8192            0.000524342           0.000525631
> 16384           0.001069965           0.001047742
> 32768           0.002180252           0.002142207
> 65536           0.005921251           0.005883868
> 131072          0.012508934           0.012392895
> 262144          0.02503915            0.024896995
> 524288          0.049879091           0.049821832
> 1048576         0.09973658            0.099969603
> --------------------------------------------------------------------------------
>
> Since I have no chip that supports the vector extension, I compared the
> C and vector implementations of strlen on the Spike simulator, which has
> some reference value. The vector implementation is clearly more
> efficient than the C implementation, with an average performance
> improvement of over 800%.
>
>
> --------------------------------------------------------------------------------
> length (bytes)  C implementation (s)  Vector instruction implementation (s)
> --------------------------------------------------------------------------------
> 2               0.000003639           0.000003339
> 4               0.000004239           0.000003339
> 8               0.000003639           0.000003339
> 16              0.000004339           0.000003339
> 32              0.000005739           0.000003339
> 64              0.000008539           0.000003339
> 128             0.000014139           0.000004039
> 256             0.000025339           0.000004739
> 512             0.000047739           0.000006139
> 1024            0.000092539           0.000008939
> 2048            0.000182139           0.000014539
> 4096            0.000361339           0.000025739
> 8192            0.000719739           0.000048139
> 16384           0.001436539           0.000092939
> 32768           0.002870139           0.000182539
> 65536           0.005737339           0.000361739
> 131072          0.011471739           0.000720139
> 262144          0.022940539           0.001436939
> 524288          0.045878139           0.002870539
> 1048576         0.091753339           0.005737739
> --------------------------------------------------------------------------------
>
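The message does not say how the "over 800%" average was computed. One possible reading is the arithmetic mean of the per-length ratios (C time divided by vector time) from the Spike table above. The sketch below computes that under this assumption; the function names are hypothetical.

```c
#include <stddef.h>

/* Timings in seconds, copied from the Spike table (lengths 2 .. 1048576). */
static const double c_time[] = {
    0.000003639, 0.000004239, 0.000003639, 0.000004339, 0.000005739,
    0.000008539, 0.000014139, 0.000025339, 0.000047739, 0.000092539,
    0.000182139, 0.000361339, 0.000719739, 0.001436539, 0.002870139,
    0.005737339, 0.011471739, 0.022940539, 0.045878139, 0.091753339,
};
static const double v_time[] = {
    0.000003339, 0.000003339, 0.000003339, 0.000003339, 0.000003339,
    0.000003339, 0.000004039, 0.000004739, 0.000006139, 0.000008939,
    0.000014539, 0.000025739, 0.000048139, 0.000092939, 0.000182539,
    0.000361739, 0.000720139, 0.001436939, 0.002870539, 0.005737739,
};

#define N_ROWS (sizeof c_time / sizeof c_time[0])

/* Speedup of the vector version for table row i (0 = shortest length). */
double speedup_at(size_t i)
{
    return c_time[i] / v_time[i];
}

/* Arithmetic mean of the per-length speedups (an assumed definition). */
double mean_speedup(void)
{
    double sum = 0.0;
    for (size_t i = 0; i < N_ROWS; i++)
        sum += speedup_at(i);
    return sum / (double)N_ROWS;
}
```

Under this reading the mean lands a little above 9x, but the per-length ratios vary widely: close to 1x for the shortest strings and approaching 16x for the longest.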
> So I hope to key the choice on __riscv_vector: hardware that does not
> support the vector extension runs the base-instruction implementation of
> strlen, which performs the same as the C implementation, while hardware
> that supports the vector extension runs the vector implementation of
> strlen for acceleration.
>
> Fei Zhang
>
> > -----Original Message-----
> > From: "Szabolcs Nagy" <nsz@...t70.net>
> > Sent: 2023-04-11 20:48:22 (Tuesday)
> > To: "张飞" <zhangfei@...iscas.ac.cn>
> > Cc: musl@...ts.openwall.com
> > Subject: Re: Re: [musl] [PATCH]Implementation of strlen function in
> > riscv64 architecture
> >
> > * 张飞 <zhangfei@...iscas.ac.cn> [2023-04-10 13:59:22 +0800]:
> > > I have modified the assembly implementation of the riscv64 strlen
> > > function, mainly the address alignment handling, to avoid vector
> > > memory accesses crossing page boundaries.
> > >
> > > I think the assembly implementation of strlen is necessary. In glibc,
> >
> > if the c definition is not correct then you have to explain why.
> > if it's very slow then please tell us so.
> >
> > > X86_64, aarch64, alpha, and others all have assembly implementations
> > > of this function, while for riscv64 there is none.
> > > I have also analyzed the Spec2006 and Spec2017 test suites, and the
> > > strlen function is a hot spot there as well.
> >
> > an asm implementation has significant maintenance cost so you should
> > provide some benchmark data or other evidence/reasoning for us to
> > decide if it's worth the cost.
> >
> > it seems you replaced the c strlen code with a slower one except when
> > musl is built for "#ifdef __riscv_vector" isa extension. what cpus
> > does this affect? are linux distros expected to use this as baseline?
> > do different riscv cpus have similar simd performance properties? who
> > will tweak the asm if not?
> >
> > in principle what you did can be done by the compiler auto vectorizer
> > so maybe contributing to the compiler is more useful.
> >
> > note that glibc has cpu specific implementations that it can select
> > at runtime, but musl uses one generic implementation for all cpus.