musl - ARM memcpy post-0.9.12-release thread

Follow @Openwall on Twitter for new release announcements and other news

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <20130731022631.GA6655@brightrain.aerifal.cx>
Date: Tue, 30 Jul 2013 22:26:31 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: ARM memcpy post-0.9.12-release thread

Hi all (especially Andre),

I've been doing some experimenting with ARM memcpy, and I have not
found any way to beat the Bionic asm file for misaligned copies. The
best I could do with simple inline asm (reading multi-words and
writing byte-at-a-time or vice versa) improved the performance nearly
40% compared to musl's current code, but it was still worse than half
the speed of the Bionic asm.

For the aligned case, however, as I've said before, the Bionic code
runs 10% slower for me than the C-with-inline-asm I posted to the
list. Commenting out the prefetch code in the Bionic version brings
the performance up to the same as my version.

I also found that the Bionic code was mysteriously crashing on the
real system I test on (it worked on my toolchain with qemu). On
further investigation, the test system's toolchain had -mthumb (with
thumb2) as the default; adding -marm made it work. Both ways the asm
was being interpreted as arm; the problem was that the *calling* code
being thumb broke it. The solution was adding .type memcpy,%function
to the asm file. Without that, the linker cannot know that the symbol
it's resolving is a function name and thus that it has to adjust the
low bit of the relocated address as a flag for whether the code is arm
or thumb. I've now got the code working reliably it seems.

Sizes so far:
Current C code: 260 bytes
My best-attempt inline asm: 352 bytes
Bionic (with prefetch removed): 764 bytes

Obviously the Bionic code is a bit larger than the others and than I'd
like it to be, but it looks really hard to trim it down without
ruining performance for misaligned copies; roughly half of the asm
covers the misaligned case, which is expensive because you have three
different code paths for different ways it can be off mod 4.

One other issue we have to consider if we go with the Bionic code is
that we'd need to add sub-arch asm dirs to use it. As-is, the code is
hard-coded for little endian. It will shuffle the byte order badly
when copying on a big endian machine.

Some rough times (128k copy repeated 10000 times):

Aligned case:
Current C code: 1.2s
My best-attempt C code: 0.75s
My best-attempt inline asm: 0.57s
Bionic asm: 0.63s
Bionic asm without prefetch: 0.57s

Misaligned case:
Current C code: 4.7s
My best-attempt inline asm: 2.9s
Bionic asm: 1.1s

Rich

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.