Message-ID: <CAFUsyfKRicqg3zGdReTRgT-4LJv5Asbfn1+YE7tKvXLb4fuJLw@mail.gmail.com>
Date: Fri, 21 Jan 2022 15:50:26 -0600
From: Noah Goldstein <goldstein.w.n@...il.com>
To: Joerg Sonnenberger <joerg@....de>
Cc: libc-coord@...ts.openwall.com, Richard Biener via Gcc <gcc@....gnu.org>, GNU C Library <libc-alpha@...rceware.org>
Subject: Re: Add new ABIs '__strcmpeq', '__strncmpeq', '__wcscmpeq' and '__wcsncmpeq' to libc

On Fri, Jan 21, 2022 at 12:51 PM Joerg Sonnenberger <joerg@....de> wrote:
>
> On Thu, Jan 20, 2022 at 04:56:59PM -0600, Noah Goldstein wrote:
> > The goal is that the new interfaces will be usable as an optimization
> > by compilers if a program uses the return value of the non "eq"
> > variant as a boolean.
>
> So I'm curious, but can you demonstrate that it can be implemented
> noticeably faster than regular strcmp? Unlike for memcmp, I don't see an
> obvious way to save any operations.

Strong point! I had been somewhat assuming we could make the same
optimizations as with `__memcmpeq`, but there still needs to be some logic
that tracks which comes first: the mismatch or the null terminator. The
saving is not quite as large as for `memcmp` vs `__memcmpeq`, but we can
still save some work.

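To make the "which comes first" requirement concrete, here is a rough scalar
model of one 32-byte block. It is only a sketch, not glibc code: the function
name `strcmpeq_block_model`, the `more` out-parameter and the fixed 32-byte
block are assumptions made for illustration, and both inputs must have at
least 32 readable bytes.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK 32

/* Hypothetical scalar model of one 32-byte block (not glibc code).
   Returns 0 if the strings are equal within the block and nonzero if they
   differ.  Sets *more to 1 when neither a mismatch nor a null terminator
   was found, i.e. the caller would have to look at the next block.  */
static uint32_t
strcmpeq_block_model (const unsigned char *a, const unsigned char *b,
                      int *more)
{
  uint32_t eq = 0;      /* bit i set: a[i] == b[i] */
  uint32_t nonzero = 0; /* bit i set: a[i] != 0 */

  for (size_t i = 0; i < BLOCK; i++)
    {
      eq |= (uint32_t) (a[i] == b[i]) << i;
      nonzero |= (uint32_t) (a[i] != 0) << i;
    }

  /* Bytes that match and are not the terminator.  */
  uint32_t go = eq & nonzero;
  /* Adding 1 carries through the low run of set bits, so the lowest set
     bit of the sum marks the first mismatch or terminator.  It wraps to 0
     when all 32 bytes match and are nonzero.  */
  uint32_t stop = go + 1u;

  *more = (stop == 0);
  if (stop == 0)
    return 0;

  /* Isolate the lowest set bit.  */
  uint32_t first = stop & (0u - stop);
  /* Nonzero only if the first stopping byte is a mismatch rather than a
     terminator shared by both strings.  */
  return first & ~eq;
}
```

Bytes after the first stopping position never influence the result, so
whatever follows a shared terminator is ignored, and no byte index has to be
computed to produce the boolean answer.
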
Using the x86_64 AVX2 optimized implementation as reference:
https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/strcmp-avx2.S;h=9c73b5899d55a72b292f21b52593284cd513d2a3;hb=HEAD

We can convert the general return method of checking equals + strlen from:
```
    VMOVU       (%rdi), %ymm0
    VPCMPEQ     (%rsi), %ymm0, %ymm1        /* ymm1 = equal bytes */
    VPCMPEQ     %ymm0, %ymmZERO, %ymm2      /* ymm2 = null bytes */
    vpandn      %ymm1, %ymm2, %ymm1         /* equal and not null */
    vpmovmskb   %ymm1, %ecx
    incl        %ecx                        /* 0 if all 32 bytes pass */
    jz          L(keep_going)
    tzcntl      %ecx, %ecx                  /* index of first stop byte */
    movzbl      (%rdi, %rcx), %eax
    movzbl      (%rsi, %rcx), %ecx
    subl        %ecx, %eax
    vzeroupper
    ret
```

To:
```
    VMOVU       (%rdi), %ymm0
    VPCMPEQ     (%rsi), %ymm0, %ymm1        /* ymm1 = equal bytes */
    VPCMPEQ     %ymm0, %ymmZERO, %ymm2      /* ymm2 = null bytes */
    vpandn      %ymm1, %ymm2, %ymm2         /* equal and not null */
    vpmovmskb   %ymm2, %ecx
    incl        %ecx                        /* 0 if all 32 bytes pass */
    jz          L(keep_going)
    vpmovmskb   %ymm1, %eax                 /* eax = equal mask */
    blsi        %ecx, %ecx                  /* bit of first stop byte */
    andn        %ecx, %eax, %eax            /* eax = ecx & ~eax: nonzero iff mismatch */
    vzeroupper
    ret
```

Testing this with comparisons where the mismatch or the null terminator
falls within the first 32 bytes (the common case), throughput is about the
same but latency drops by ~20%.

Another benefit is that we can reuse this exact return logic throughout, as
the memory offset is no longer required. This simplifies the page-cross
logic a great deal and will net us some serious code size reduction for the
common usage of strcmp.

I think, though, that I was a bit overoptimistic about the performance
benefits, as I was using `memcmp` vs `__memcmpeq` as a reference. I'll put
together a patch for just `__strcmpeq` and post the results here. I think
the wide-character versions have more expensive return value checks, so if
the character versions show a benefit we can expect it to translate.

>
> Joerg