musl - Re: ARM memcpy post-0.9.12-release thread

Message-ID: <20130802204146.GO221@brightrain.aerifal.cx>
Date: Fri, 2 Aug 2013 16:41:47 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Cc: Andre Renaud <andre@...ewatersys.com>
Subject: Re: ARM memcpy post-0.9.12-release thread

Andre, do you have any input on this? (Cc'ing)

Rich


On Tue, Jul 30, 2013 at 10:26:31PM -0400, Rich Felker wrote:
> Hi all (especially Andre),
> 
> I've been doing some experimenting with ARM memcpy, and I have not
> found any way to beat the Bionic asm file for misaligned copies. The
> best I could do with simple inline asm (reading multiple words at once
> and writing a byte at a time, or vice versa) cut the run time by nearly
> 40% compared to musl's current code, but it still ran at less than half
> the speed of the Bionic asm.
> 
> For the aligned case, however, as I've said before, the Bionic code
> runs 10% slower for me than the C-with-inline-asm I posted to the
> list. Commenting out the prefetch code in the Bionic version brings
> the performance up to the same as my version.
> 
> I also found that the Bionic code was mysteriously crashing on the
> real system I test on (it worked on my toolchain with qemu). On
> further investigation, the test system's toolchain had -mthumb (with
> thumb2) as the default; adding -marm made it work. Either way the asm
> was being assembled as ARM; what broke it was the *calling* code
> being Thumb. The solution was adding .type memcpy,%function
> to the asm file. Without that, the linker cannot know that the symbol
> it's resolving is a function and thus that it has to adjust the
> low bit of the relocated address as a flag for whether the code is ARM
> or Thumb. With that change, the code now seems to work reliably.
> 
> Sizes so far:
> Current C code: 260 bytes
> My best-attempt inline asm: 352 bytes
> Bionic (with prefetch removed): 764 bytes
> 
> Obviously the Bionic code is a bit larger than the others and than I'd
> like it to be, but it looks really hard to trim it down without
> ruining performance for misaligned copies; roughly half of the asm
> covers the misaligned case, which is expensive in code size because it
> needs three separate code paths, one for each way the relative
> alignment can be off mod 4.
> 
> One other issue we have to consider if we go with the Bionic code is
> that we'd need to add sub-arch asm dirs to use it. As-is, the code is
> hard-coded for little endian. It will shuffle the byte order badly
> when copying on a big endian machine.
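
The two points above (three code paths for the misaligned case, and the little-endian assumption) can be illustrated with a rough C sketch. This is not the Bionic asm and not the code posted to the list; `sketch_memcpy_le`, `SHIFT_LOOP`, and the `u32` typedef are hypothetical names, and the `__may_alias__` attribute is a GCC/Clang extension. The idea is to do only aligned word loads and stores and to stitch each output word together from two neighbouring source words with shifts. The shift directions are hard-coded for little endian, which is exactly what would scramble the byte order on a big-endian machine.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension: a 32-bit type allowed to alias byte buffers. */
typedef uint32_t __attribute__((__may_alias__)) u32;

/* Word loop for one fixed misalignment. RS and LS are the bit shifts
 * applied to the previous and the next source word. These directions
 * are only correct on little endian. Requires d to be 4-byte aligned
 * and s + LS/8 to be 4-byte aligned; w holds the previous source word. */
#define SHIFT_LOOP(RS, LS)                                        \
    for (; n >= (size_t)(4 + (LS) / 8); s += 4, d += 4, n -= 4) { \
        u32 x = *(const u32 *)(s + (LS) / 8);                     \
        *(u32 *)d = (w >> (RS)) | (x << (LS));                    \
        w = x;                                                    \
    }

/* Hypothetical illustration only; assumes a little-endian target. */
void *sketch_memcpy_le(void *restrict dest, const void *restrict src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    /* Byte-copy until the source is word aligned. */
    for (; n && (uintptr_t)s % 4; n--) *d++ = *s++;

    /* How far the destination is off, mod 4, relative to the source. */
    unsigned k = (unsigned)((uintptr_t)d % 4);

    if (k == 0) {
        /* Mutually aligned: plain word copy. */
        for (; n >= 4; s += 4, d += 4, n -= 4)
            *(u32 *)d = *(const u32 *)s;
    } else if (n >= 8) {
        /* Keep the first aligned source word, then byte-copy 4 - k
         * bytes so that the destination becomes word aligned. */
        u32 w = *(const u32 *)s;
        for (unsigned i = 0; i < 4 - k; i++) *d++ = *s++;
        n -= 4 - k;

        /* One code path per nonzero misalignment. */
        switch (k) {
        case 1: SHIFT_LOOP(24, 8);  break;
        case 2: SHIFT_LOOP(16, 16); break;
        case 3: SHIFT_LOOP(8, 24);  break;
        }
    }

    /* Tail: whatever is left goes byte by byte. */
    for (; n; n--) *d++ = *s++;
    return dest;
}
```

Each `case` expands to its own loop with constant shifts, which is where the code size goes. A big-endian build would need the shift directions swapped in every path.
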
> 
> Some rough times (128k copy repeated 10000 times):
> 
> Aligned case:
> Current C code: 1.2s
> My best-attempt C code: 0.75s
> My best-attempt inline asm: 0.57s
> Bionic asm: 0.63s
> Bionic asm without prefetch: 0.57s
> 
> Misaligned case:
> Current C code: 4.7s
> My best-attempt inline asm: 2.9s
> Bionic asm: 1.1s
> 
> Rich
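
For anyone wanting to reproduce timings of the kind quoted above, a minimal harness might look like the following. The actual test program was not posted, so this is only a sketch under assumptions: `time_copies`, `copy_fn`, and `sink` are hypothetical names, the destination comes straight from `malloc` (so it is word aligned), and misalignment is introduced by offsetting the source by `src_off` bytes. It uses `clock()`, so it reports CPU time rather than wall-clock time.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Any function with memcpy's signature can be timed. */
typedef void *(*copy_fn)(void *restrict, const void *restrict, size_t);

/* Written after every copy so the compiler cannot drop the copies. */
static volatile unsigned char sink;

/* Returns CPU seconds spent on `reps` copies of `len` bytes, with the
 * source offset by `src_off` bytes from an aligned allocation.
 * Returns -1.0 on bad arguments or allocation failure. */
double time_copies(copy_fn fn, size_t len, size_t src_off, int reps)
{
    if (len == 0 || reps <= 0) return -1.0;

    unsigned char *src = malloc(len + src_off);
    unsigned char *dst = malloc(len);
    if (!src || !dst) { free(src); free(dst); return -1.0; }
    memset(src, 0x5a, len + src_off);

    clock_t t0 = clock();
    for (int i = 0; i < reps; i++) {
        fn(dst, src + src_off, len);
        sink = dst[(size_t)i % len];
    }
    clock_t t1 = clock();

    free(src);
    free(dst);
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

A call such as `time_copies(memcpy, 128 * 1024, 0, 10000)` would correspond to the aligned case, and a `src_off` of 1, 2, or 3 to a misaligned one.
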
