Message-ID: <20130731022631.GA6655@brightrain.aerifal.cx>
Date: Tue, 30 Jul 2013 22:26:31 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: ARM memcpy post-0.9.12-release thread

Hi all (especially Andre),

I've been doing some experimenting with ARM memcpy, and I have not
found any way to beat the Bionic asm file for misaligned copies. The
best I could do with simple inline asm (reading multi-words and
writing byte-at-a-time or vice versa) improved the performance nearly
40% compared to musl's current code, but it was still worse than half
the speed of the Bionic asm.

For the aligned case, however, as I've said before, the Bionic code
runs 10% slower for me than the C-with-inline-asm I posted to the
list. Commenting out the prefetch code in the Bionic version brings
the performance up to the same as my version.

I also found that the Bionic code was mysteriously crashing on the
real system I test on (it worked on my toolchain with qemu). On
further investigation, the test system's toolchain had -mthumb (with
thumb2) as the default; adding -marm made it work. Both ways the asm
was being interpreted as arm; the problem was that the *calling* code
being thumb broke it. The solution was adding

    .type memcpy,%function

to the asm file. Without that, the linker cannot know that the symbol
it's resolving is a function name and thus that it has to adjust the
low bit of the relocated address as a flag for whether the code is arm
or thumb. I've now got the code working reliably it seems.

Sizes so far:

Current C code: 260 bytes
My best-attempt inline asm: 352 bytes
Bionic (with prefetch removed): 764 bytes

Obviously the Bionic code is a bit larger than the others and than I'd
like it to be, but it looks really hard to trim it down without
ruining performance for misaligned copies; roughly half of the asm
covers the misaligned case, which is expensive because you have three
different code paths for different ways it can be off mod 4.

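For illustration only, here is a minimal C sketch of the shift-and-merge
idea behind a misaligned word copy. It is not the Bionic code and not
anything from musl; the names copy_sketch, load32 and store32 are made
up for this example. It assumes a little endian target, since each
output word is built from the high bytes of one aligned source word and
the low bytes of the next. The source offset k (1, 2 or 3 mod 4) is a
runtime variable here, so one loop covers all three cases; an asm
version would presumably use a separate loop per offset so that the
shift counts can be constants, which is where the three code paths come
from.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Word load/store through memcpy: avoids aliasing problems, and a
   compiler can reduce it to one load/store when p is aligned. */
static uint32_t load32(const unsigned char *p)
{
    uint32_t w;
    memcpy(&w, p, 4);
    return w;
}

static void store32(unsigned char *p, uint32_t w)
{
    memcpy(p, &w, 4);
}

/* Hypothetical helper, LITTLE ENDIAN ONLY (assumption of this sketch). */
void *copy_sketch(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    /* Copy bytes until the destination is word-aligned. */
    while (n && ((uintptr_t)d & 3)) {
        *d++ = *s++;
        n--;
    }

    unsigned k = (uintptr_t)s & 3;  /* source offset mod 4 */

    if (k == 0) {
        /* Mutually aligned: plain word copy. */
        for (; n >= 4; n -= 4, s += 4, d += 4)
            store32(d, load32(s));
    } else if (n >= 8) {
        unsigned rs = 8 * k;        /* shift for the previous word */
        unsigned ls = 32 - rs;      /* shift for the next word */

        /* Gather the 4-k source bytes before the next word boundary
           into the high end of w, where an aligned load would have
           put them, without reading anything before s. */
        uint32_t w = 0;
        for (unsigned i = 0; i < 4 - k; i++)
            w |= (uint32_t)s[i] << (rs + 8 * i);

        const unsigned char *a = s + (4 - k);  /* aligned source */

        /* a[0..3] must lie inside the source, hence n >= 8-k. */
        while (n >= 8 - k) {
            uint32_t x = load32(a);
            store32(d, (w >> rs) | (x << ls));
            w = x;
            a += 4;
            s += 4;
            d += 4;
            n -= 4;
        }
    }

    /* Remaining tail bytes. */
    while (n--)
        *d++ = *s++;
    return dest;
}
```

On a big endian machine the two shifts would have to be swapped (left
shift for the previous word, right shift for the next), so the sketch
as written would put the bytes in the wrong order there.
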
One other issue we have to consider if we go with the Bionic code is
that we'd need to add sub-arch asm dirs to use it. As-is, the code is
hard-coded for little endian. It will shuffle the byte order badly
when copying on a big endian machine.

Some rough times (128k copy repeated 10000 times):

Aligned case:
Current C code: 1.2s
My best-attempt C code: 0.75s
My best-attempt inline asm: 0.57s
Bionic asm: 0.63s
Bionic asm without prefetch: 0.57s

Misaligned case:
Current C code: 4.7s
My best-attempt inline asm: 2.9s
Bionic asm: 1.1s

Rich