musl - Re: ARM memcpy post-0.9.12-release thread

Message-ID: <20130731061337.GC221@brightrain.aerifal.cx>
Date: Wed, 31 Jul 2013 02:13:37 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: ARM memcpy post-0.9.12-release thread

On Wed, Jul 31, 2013 at 06:18:58AM +0200, Harald Becker wrote:
> Hi Rich !
> 
> 30-07-2013 23:23 Rich Felker <dalias@...ifal.cx>:
> 
> > > misaligned case happens mostly due to working with strings,
> > > and those are usually short. Can't we consider other
> > > misaligned cases violation of the programmer or code
> > > generator? If so, I would prefer the best-attempt inline asm
> > > versions of code or even best attempt C code over arch
> > > specific asm versions ... and add
> > 
> > Part of the problem discussed on #musl was that I was having to
> > be really careful with "best attempt C" since GCC will
> > _generate_ calls to memcpy for some code, even when
> > -ffreestanding is used. The folks on #gcc claim this is not a
> > bug. So, if compilers deem themselves at liberty to make this
> > kind of transformation, any C implementation of memcpy that's
> > not intentionally crippled (e.g. using volatile temps and 20x
> > slower than it should be) is a time-bomb that might blow up on
> > us with the next GCC version...
> 
> I never deal with the details of this type of gcc code
> generation, but doesn't this only happen on small and structure
> copies? Structure copies which shall usually be aligned? So if
> they are aligned the simpler version saves code space.

I'm sorry, I don't think I was clear. The issue is that GCC recognizes
certain patterns and generates calls to memcpy rather than doing the
work inline. If it does this in memcpy.c, you end up with a version of
memcpy that invokes infinite recursion and is thereby unusable.

The issue I hit was that GCC was generating memcpy calls for copying
struct { char block[32]; }, which has no alignment requirement. This
technique was probably the best bet at getting the compiler to
generate an efficient memcpy (in fact, it works quite well on some
other archs), but on ARM it blew away the stack.
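For readers who have not seen the technique, here is a minimal sketch
of the struct-assignment pattern described above. It is not the code
from musl's tree; the names struct block32 and copy_by_blocks are
hypothetical. Whether the struct assignment is expanded inline or
turned into a call to memcpy is entirely up to the compiler, which is
exactly the hazard if a function like this were itself named memcpy.

```c
#include <stddef.h>

/* Hypothetical helper type: 32 bytes of char, so it needs no
 * alignment beyond that of char. */
struct block32 { char block[32]; };

/* Illustrative only: copies n bytes, 32 at a time where possible.
 * The compiler may lower the struct assignment to inline loads and
 * stores, or to a call to memcpy. */
void *copy_by_blocks(void *restrict dest, const void *restrict src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    for (; n >= sizeof(struct block32); n -= sizeof(struct block32)) {
        *(struct block32 *)d = *(const struct block32 *)s;
        d += sizeof(struct block32);
        s += sizeof(struct block32);
    }
    /* Copy the remaining tail (fewer than 32 bytes) bytewise. */
    while (n--) *d++ = *s++;
    return dest;
}
```

Under a different name the function is harmless, since any generated
memcpy call resolves to the real one. Inspecting the generated asm is
the only way to see which lowering a given compiler chose.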

When looking for a solution, however, I came across this:

http://gcc.gnu.org/bugzilla//show_bug.cgi?id=56888

It looks to me like the situation is that, as compilers get smarter
and smarter, it's going to become increasingly difficult to ensure
that memcpy doesn't get compiled to a call to memcpy. So, my long term
plan (this is still open to discussion) is to do something like this:

Have one or more C memcpy implementations on-hand that empirically
generate good code. For important archs, have hand-optimized asm; this
is both smaller and better-performing than anything decent we can
achieve with C. For archs where we don't yet have asm, generate asm
from whichever C implementation works best. Then, instead of having
the performance-oriented C in the source tree, have a fail-safe C
version that the compiler can't possibly mess up; this ensures that
future ports can get started without having to worry about whether the
compiler breaks memcpy.
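As a rough illustration of what such a fail-safe version might look
like, here is one possible sketch. The use of volatile-qualified
accesses is my assumption for this example, not a settled design, and
the name failsafe_memcpy is hypothetical. It is slow on purpose: each
byte access must be performed as written, so the loop should not be
recognizable as something to replace with a memcpy call.

```c
#include <stddef.h>

/* Hypothetical fail-safe copy: one volatile byte load and one
 * volatile byte store per iteration. Assumption: volatile accesses
 * keep the compiler from merging the loop into a library call. */
void *failsafe_memcpy(void *restrict dest, const void *restrict src, size_t n)
{
    volatile unsigned char *d = dest;
    const volatile unsigned char *s = src;

    while (n--) *d++ = *s++;
    return dest;
}
```

This is only meant as a bootstrap fallback for a new port; any arch
that matters would override it with asm.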

> > This makes asm (either inline or standalone) a lot more
> > appealing for memcpy than it otherwise would be.
> 
> Optimization is always a question of decision, which I consider
> the hard part of the job ... :(
>  
> > > a warning for performance lose on misaligned data in
> > > documentation, with giving a rough percentage of this lose.
> > 
> > You'd prefer video processing being 4 to 5 times slower?
> 
> No, definitely not, but video processing is one of the cases I
> consider candidate for optimized processing. So such projects
> shall include an optimized version of low level processing
> functions (including memcpy, but not only - candidate for
> library with optimized functions?). 

Are you aware that redefining functions with the standard names
invokes undefined behavior? Yes you could write your own memcpy by
another name, but then it can't get used by things like stdio (where,
if it's slow, it's likely a large portion of time spent on file io),
TLS image copying (per-thread startup cost), etc.

Of all the functions in libc, memcpy is definitely the most
performance-critical to the most applications. The other things that
matter are malloc/free, math (sometimes), stdio, qsort, and
searching/matching functions like regex, strstr, etc.

> > Video typically consists of single-byte samples (planar YUV) and
> > operations like cropping to a non-multiple-of-4 size, motion
> > compensation, etc. all involve misaligned memcpy. Same goes for
> > image transformations in gimp, image blitting in web browsers
> > (not necessarily aligned to multiple-of-4 boundaries unless
> > you're using 32bpp), etc...
> 
> You are all right, but the programmer shall know of this and
> consider to use appropriate functions. You can write the code for

The programmer should write asm for 20 different archs? Most people
have better things to do with their time.

Back to the point, musl is not dietlibc. If you want the
smallest, lowest-quality imaginable libc, there's dietlibc you can
use. musl's aim is to be a robust general-purpose libc. "Switch from
Bionic to musl and make your apps run five times slower" is not
appealing to anybody.

If the choice were between having fully general, clean C code that
runs 5-10% slower or giant gobs of asm with a heavy maintenance
burden that runs 5-10% faster, I would probably agree and just figure
the people who really need that last 5-10% can drop in some fancy asm.
But that's not the situation we're in. The current code is half the
speed of a decent (still probably not even the fastest) implementation
for aligned copies and nearly five times slower for misaligned copies.
That's well outside the range of "special interest" and into the range
of "our implementation sucks".

Moreover, the choice here is not between clean C and dirty asm. It's
between dirty C and, well, whatever you think of the asm. The only
"clean" C memcpy is:

    while (n--) *d++ = *s++;

Our C memcpy depends on implementation-defined behavior (casting
pointers to integers to inspect their alignment) as well as undefined
behavior (aliasing violations to copy as size_t units). The latter
cannot be detected by a compiler that's not performing LTO/whole
program optimization, so it's "safe" for the most part, but it's still
wrong. So from a standpoint of clean code, getting decent asm on all
the archs and then possibly replacing the C with something more naive
would probably be a step forward.
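To make the two problems concrete, here is a simplified sketch of the
"dirty" word-at-a-time pattern. It is not the actual musl source, and
the name word_memcpy is hypothetical. Both questionable constructs are
marked in the comments, and the aliasing violation is there
deliberately, to show what is being discussed.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: copy in size_t units when both pointers are
 * suitably aligned, otherwise (and for the tail) copy bytewise. */
void *word_memcpy(void *restrict dest, const void *restrict src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    /* Implementation-defined: converting pointers to integers in
     * order to inspect their low bits. */
    if ((((uintptr_t)d | (uintptr_t)s) % sizeof(size_t)) == 0) {
        for (; n >= sizeof(size_t); n -= sizeof(size_t)) {
            /* Undefined behavior in general: the bytes are accessed
             * as size_t regardless of the objects' real type. */
            *(size_t *)d = *(const size_t *)s;
            d += sizeof(size_t);
            s += sizeof(size_t);
        }
    }
    /* Tail bytes, or the whole copy in the misaligned case. */
    while (n--) *d++ = *s++;
    return dest;
}
```

Note that the misaligned case falls all the way back to the byte loop
here, which is the slow path the numbers above are about.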

Rich