musl - Re: musl libc, memcpy

Message-ID: <20120801042722.GB544@brightrain.aerifal.cx>
Date: Wed, 1 Aug 2012 00:27:22 -0400
From: Rich Felker <dalias@...ifal.cx>
To: Kim Walisch <kim.walisch@...il.com>
Cc: musl@...ts.openwall.com
Subject: Re: musl libc, memcpy

On Tue, Jul 31, 2012 at 12:19:13AM +0200, Kim Walisch wrote:
> > I'd like to know what block sizes you were looking at, because for
> > memcpy that makes all the difference in the world:
> 
> I copied blocks of 16 kilobytes.

OK, that sounds (off-hand) like a good size for testing.

> > I don't think this is necessary or useful. If we want better
> > performance on these archs, a tiny asm file that does almost nothing
> > but "rep movsd" is known to be the fastest solution on 32-bit x86, and
> > is at least the second-fastest on 64-bit, with the faster solutions
> > not being available on all cpus. On pretty much all other archs,
> > unaligned access is illegal.
> 
> My point is that your code uses byte (char) copying for unaligned data
> but on x86 this is not necessary. Using a simple macro in your memcpy
> implementation that always uses the size_t copying path for x86 speeds
> up your memcpy implementation by about 500% for unaligned data on my
> PC (Intel i5-670 3.46GHz, gcc-4.7, SL Linux 6.2 x86_64). You can also
> use a separate asm file with "rep movsd" for x86, I guess it will run
> at the same speed as my macro solution.

I'm attaching a (possibly buggy; not heavily tested) rep-movsd-based
version. I'd be interested in hearing how it performs.
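
For readers without the attachment, the general idea can be sketched as
below. This is only an illustration, not the attached file: the name
movs_memcpy is made up, it relies on GCC-style inline asm, it assumes the
direction flag is clear (as the x86 ABIs require on function entry), and it
falls back to a plain byte loop on anything that is not x86.

```c
#include <stddef.h>

/* Hypothetical sketch of a "rep movsd"-style copy; not the attached file. */
void *movs_memcpy(void *dest, const void *src, size_t n)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    void *d = dest;
    const void *s = src;
    size_t words = n / 4;   /* number of 32-bit units for movsl */
    size_t tail = n % 4;    /* leftover bytes for movsb */

    /* Copy 32-bit units; "movsl" is the AT&T spelling of "movsd". */
    __asm__ volatile ("rep movsl"
                      : "+D"(d), "+S"(s), "+c"(words)
                      :
                      : "memory");
    /* Copy the remaining 0..3 bytes. */
    __asm__ volatile ("rep movsb"
                      : "+D"(d), "+S"(s), "+c"(tail)
                      :
                      : "memory");
#else
    /* Portable fallback: byte-at-a-time copy. */
    unsigned char *d = dest;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
#endif
    return dest;
}
```

It is called exactly like memcpy, e.g. movs_memcpy(dst, src, 16384) for the
16 KB blocks discussed above. No alignment handling is attempted, since x86
tolerates unaligned string moves.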

> Another interesting thing to mention is that gcc-4.5 vectorizes the 3
> copying loops of your memcpy implementation if it is compiled with the
> -ftree-vectorize flag (add -ftree-vectorizer-verbose=1 for
> vectorization report) but not if simply compiled with -O2 or -O3. With

Odd, the gcc manual claims -ftree-vectorize is included in -O3:

http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

> $ gcc -O2 -ftree-vectorize -ftree-vectorizer-verbose=1 memcpy.c main.c -o memcpy
> 
> memcpy.c:25: note: created 1 versioning for alias checks.
> memcpy.c:25: note: LOOP VECTORIZED.
> memcpy.c:21: note: created 1 versioning for alias checks.
> memcpy.c:21: note: LOOP VECTORIZED.
> memcpy.c:9: note: vectorized 2 loops in function.

From the sound of those notes, I suspect duplicate code (and wasteful
conditional branches) are getting generated to handle the possibility
that the source and destination pointers might alias. I think this
means it would be a good idea to add proper use of "restrict" pointers
(per C99 requirements) in musl sooner rather than later; it might both
reduce code size and improve performance.
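
To make the aliasing point concrete, here is a minimal sketch of the same
copy loop with and without "restrict". The function names are hypothetical
and this is not musl's code. Whether a given gcc version actually drops the
runtime alias check for the second form depends on the version and flags, so
treat it as something to verify with the vectorizer report rather than a
guarantee.

```c
#include <stddef.h>

/* Without restrict: the compiler must allow for dst and src overlapping,
 * so a vectorized version may need a runtime alias check plus a scalar
 * fallback copy of the loop. */
void copy_may_alias(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* With C99 restrict: the caller promises the two regions do not overlap,
 * which permits the compiler to omit the check and the duplicate loop. */
void copy_no_alias(unsigned char *restrict dst,
                   const unsigned char *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

Both functions give the same result for non-overlapping buffers; calling
copy_no_alias with overlapping buffers is undefined behavior, which matches
the contract memcpy already has.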

Rich
