musl - Re: ARM memcpy post-0.9.12-release thread


Message-ID: <20130731032315.GA221@brightrain.aerifal.cx>
Date: Tue, 30 Jul 2013 23:23:15 -0400
From: Rich Felker <dalias@...ifal.cx>
To: Harald Becker <ralda@....de>
Cc: musl@...ts.openwall.com
Subject: Re: ARM memcpy post-0.9.12-release thread

On Wed, Jul 31, 2013 at 05:13:47AM +0200, Harald Becker wrote:
> Hi Rich !
> 
> 30-07-2013 22:26 Rich Felker <dalias@...ifal.cx>:
> 
> > Some rough times (128k copy repeated 10000 times):
> > 
> > Aligned case:
> > Current C code: 1.2s
> > My best-attempt C code: 0.75s
> > My best-attempt inline asm: 0.57s
> > Bionic asm: 0.63s
> > Bionic asm without prefetch: 0.57s
> > 
> > Misaligned case:
> > Current C code: 4.7s
> > My best-attempt inline asm: 2.9s
> > Bionic asm: 1.1s
> 
> I'd like to throw in a question, as my two cents on this topic:
> 
> Don't modern C compilers try to align all data types? If so,
> in most cases aligned data structures are used, and copying
> them around usually hits the aligned case. The

Yes, but these are small anyway, and the compiler will be generating
inline code to copy them with ldmia/stmia.
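
To illustrate the point about small aligned objects, here is a minimal
sketch. The struct and function names are hypothetical. On ARM, a compiler
will typically expand a plain struct assignment like this into a few
load/store-multiple instructions rather than a call to memcpy, although the
exact output depends on the compiler, its version, and the optimization
flags.

```c
#include <assert.h>

/* Hypothetical 16-byte aggregate; every member is naturally aligned. */
struct vec4 {
    int x, y, z, w;
};

/* A struct is always at least as strictly aligned as its members. */
static_assert(_Alignof(struct vec4) >= _Alignof(int),
              "struct alignment covers its int members");

/* Plain struct assignment: the compiler knows both the size and the
 * alignment at compile time, so it is free to emit the copy inline. */
void copy_vec4(struct vec4 *dst, const struct vec4 *src)
{
    *dst = *src;
}
```

Inspecting the output of something like `gcc -O2 -S` for an ARM target is
the way to confirm what a particular compiler actually emits here.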

> misaligned case happens mostly when working with strings, and
> those are usually short. Can't we consider other misaligned cases
> a violation by the programmer or code generator? If so, I would
> prefer the best-attempt inline asm versions of the code, or even
> best-attempt C code, over arch-specific asm versions ... and add

Part of the problem discussed on #musl was that I was having to be
really careful with "best attempt C" since GCC will _generate_ calls
to memcpy for some code, even when -ffreestanding is used. The folks
on #gcc claim this is not a bug. So, if compilers deem themselves at
liberty to make this kind of transformation, any C implementation of
memcpy that's not intentionally crippled (e.g. using volatile temps
and 20x slower than it should be) is a time-bomb that might blow up on
us with the next GCC version...

This makes asm (either inline or standalone) a lot more appealing for
memcpy than it otherwise would be.
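
A minimal sketch of the hazard described above, with hypothetical function
names. The first loop is the kind of code an optimizer may recognize as a
copy idiom and replace with a call to memcpy; if the function being compiled
is memcpy itself, that call is unbounded recursion. Whether a given GCC
version actually does this depends on the version and flags (GCC has a
-fno-tree-loop-distribute-patterns option that disables this kind of
pattern replacement). The second function is one reading of "volatile
temps", which is an assumption here: every byte goes through a
volatile-qualified access, so the compiler must perform the accesses as
written, at a large cost in speed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name; a libc would export this as memcpy. */
void *sketch_copy_plain(void *restrict dest, const void *restrict src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    assert((d != NULL && s != NULL) || n == 0);

    /* A simple byte loop: an optimizer that recognizes the copy idiom
     * may turn this into "call memcpy". */
    while (n--)
        *d++ = *s++;
    return dest;
}

/* Intentionally crippled variant: volatile accesses cannot be merged,
 * widened, or replaced by a library call. */
void *sketch_copy_volatile(void *restrict dest, const void *restrict src, size_t n)
{
    volatile unsigned char *d = dest;
    const volatile unsigned char *s = src;

    assert((d != NULL && s != NULL) || n == 0);

    while (n--)
        *d++ = *s++;
    return dest;
}
```

Both functions produce the same result; the difference is only in what the
compiler is allowed to do with the loop.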

> a warning about the performance loss on misaligned data to the
> documentation, giving a rough percentage of this loss.

You'd prefer video processing being 4 to 5 times slower? Video
typically consists of single-byte samples (planar YUV) and operations
like cropping to a non-multiple-of-4 size, motion compensation, etc.
all involve misaligned memcpy. Same goes for image transformations in
gimp, image blitting in web browsers (not necessarily aligned to
multiple-of-4 boundaries unless you're using 32bpp), etc...
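
As a rough sketch of why such code hits the misaligned case, here is a crop
of one 8-bit plane. The helper name and parameters are hypothetical. Each
row is copied with memcpy from an offset of x bytes into the source row, so
the source pointer is misaligned whenever x is not a multiple of 4, and the
destination rows are misaligned whenever w is not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy a w-by-h window whose top-left corner is
 * (x, y) out of an 8-bit plane with row pitch src_stride into a tightly
 * packed destination buffer of w * h bytes. */
void crop_plane(uint8_t *dst, const uint8_t *src, size_t src_stride,
                size_t x, size_t y, size_t w, size_t h)
{
    assert(x + w <= src_stride);

    for (size_t row = 0; row < h; row++) {
        /* Source row starts x bytes into the line; destination row
         * starts at row * w. Neither is 4-byte aligned in general. */
        memcpy(dst + row * w, src + (y + row) * src_stride + x, w);
    }
}
```

Cropping a 5-pixel-wide window at x = 3, for example, gives a misaligned
source on every row and a misaligned destination on most rows.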

Rich
