musl - Re: ARM memcpy post-0.9.12-release thread

Message-ID: <20130731061858.07c30257@ralda.gmx.de>
Date: Wed, 31 Jul 2013 06:18:58 +0200
From: Harald Becker <ralda@....de>
To: Rich Felker <dalias@...ifal.cx>
Cc: musl@...ts.openwall.com
Subject: Re: ARM memcpy post-0.9.12-release thread

Hi Rich !

30-07-2013 23:23 Rich Felker <dalias@...ifal.cx>:

> > misaligned case happens mostly due to working with strings,
> > and those are usually short. Can't we consider other
> > misaligned cases violation of the programmer or code
> > generator? If so, I would prefer the best-attempt inline asm
> > versions of code or even best attempt C code over arch
> > specific asm versions ... and add
> 
> Part of the problem discussed on #musl was that I was having to
> be really careful with "best attempt C" since GCC will
> _generate_ calls to memcpy for some code, even when
> -ffreestanding is used. The folks on #gcc claim this is not a
> bug. So, if compilers deem themselves at liberty to make this
> kind of transformation, any C implementation of memcpy that's
> not intentionally crippled (e.g. using volatile temps and 20x
> slower than it should be) is a time-bomb that might blow up on
> us with the next GCC version...
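The hazard described above can be sketched with a plain byte-loop copy of the kind "best attempt C" would produce. This is only an illustration: the name `naive_memcpy` is hypothetical, and whether a given compiler version actually rewrites this loop is not something the thread states. GCC's loop-idiom recognition (controlled by `-ftree-loop-distribute-patterns`) is the pass usually named in this context; that detail is an addition here, not from the message.

```c
#include <stddef.h>

/* Hypothetical name for illustration; a libc would export this as memcpy. */
void *naive_memcpy(void *restrict dst, const void *restrict src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /*
     * An optimizer may recognize this loop as "a memcpy" and replace it
     * with a call to memcpy. If this function were itself named memcpy,
     * that replacement would make it call itself.
     */
    while (n--)
        *d++ = *s++;

    return dst;
}
```

Under its hypothetical name the function is harmless; the problem only arises once the same body is compiled as the library's own memcpy.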

I have never dealt with the details of this kind of gcc code
generation, but doesn't this only happen for small copies and
structure copies? And structure copies should usually be aligned,
shouldn't they? So if they are aligned, the simpler version saves
code space.
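A minimal sketch of what such a "simpler version" might look like, assuming 4-byte words and a GCC/Clang-style `may_alias` attribute: word copies when both pointers are aligned, a byte loop otherwise. The name `simple_copy` and the `word_t` typedef are hypothetical, and this is not the code under discussion in the thread.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension (an assumption of this sketch): accesses through
 * this type are exempt from strict-aliasing rules. */
typedef uint32_t __attribute__((__may_alias__)) word_t;

/* Hypothetical name for illustration. */
void *simple_copy(void *restrict dst, const void *restrict src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Fast path: both pointers are 4-byte aligned, so copy whole words. */
    if ((((uintptr_t)d | (uintptr_t)s) & 3) == 0) {
        for (; n >= 4; n -= 4, d += 4, s += 4)
            *(word_t *)d = *(const word_t *)s;
    }

    /* Tail bytes, or the entire buffer when either pointer is misaligned. */
    while (n--)
        *d++ = *s++;

    return dst;
}
```

The code-space saving comes from leaving out the shift-and-merge path for mutually misaligned buffers; the cost is that every misaligned copy runs byte by byte.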

> This makes asm (either inline or standalone) a lot more
> appealing for memcpy than it otherwise would be.

Optimization is always a matter of trade-offs, and deciding between
them is what I consider the hard part of the job ... :(

> > a warning about the performance loss on misaligned data in the
> > documentation, giving a rough percentage of this loss.
> 
> You'd prefer video processing being 4 to 5 times slower?

No, definitely not, but video processing is one of the cases I
consider a candidate for optimized processing. So such projects
should include optimized versions of their low-level processing
functions (including memcpy, but not only that; perhaps a
candidate for a library of optimized functions?).

> Video typically consists of single-byte samples (planar YUV) and
> operations like cropping to a non-multiple-of-4 size, motion
> compensation, etc. all involve misaligned memcpy. Same goes for
> image transformations in gimp, image blitting in web browsers
> (not necessarily aligned to multiple-of-4 boundaries unless
> you're using 32bpp), etc...
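To illustrate why cropping produces misaligned copies, here is a small sketch assuming an 8-bit plane stored row by row. The helper `crop_plane` and its parameters are hypothetical and not taken from any real video library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy a w x h window starting at column x, row y
 * out of an 8-bit plane whose rows are src_stride bytes apart. */
void crop_plane(unsigned char *dst, const unsigned char *src,
                size_t src_stride, size_t x, size_t y, size_t w, size_t h)
{
    for (size_t row = 0; row < h; row++) {
        /*
         * The source of each row copy is base + (y + row) * stride + x.
         * Unless x and the stride are both multiples of 4, this address
         * is not 4-byte aligned relative to the plane, and if w is not a
         * multiple of 4 the destination rows drift out of alignment too.
         */
        memcpy(dst + row * w, src + (y + row) * src_stride + x, w);
    }
}
```

For example, cropping at x = 3 with w = 5 gives every row copy a source offset that is odd relative to the start of its row, and destination rows that start at offsets 0, 5, 10, and so on.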

You are absolutely right, but the programmer should know about this
and consider using appropriate functions. You can write the parts
that need the speed so that they call optimized functions, in a way
that usually does not conflict with the calls gcc inserts on its
own. So these compiler-inserted calls usually hit the aligned case;
if they do not, it is the programmer who did not behave well (not
the compiler).

--
Harald
