Message-ID: <20130709053711.GO29800@brightrain.aerifal.cx>
Date: Tue, 9 Jul 2013 01:37:12 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: Thinking about release

On Tue, Jul 09, 2013 at 05:06:21PM +1200, Andre Renaud wrote:
> Hi Rich,
> > I think the first step should be benchmarking on real machines.
> > Somebody tried the asm that was posted and claimed it was no faster
> > than musl's C code; I don't know the specific hardware they were
> > using and I don't even recall right off who made the claim or where
> > it was reported, but I think before we start writing or importing
> > code we need to have a good idea how the current C code compares in
> > performance to other "optimized" implementations.
>
> In the interests of furthering this discussion (and because I'd like
> to start using musl as the basis for some of our projects, but the
> current speed degradation is noticeable), I've created some patches

Then it needs to be fixed. :-)

> that enable memcmp, memcpy & memmove ARM optimisations. I've ignored
> the str* functions, as these are generally not used on the same bulk
> data as the mem* functions, and as such the performance issue is less
> noticeable.

I think that's a reasonable place to begin. I do mildly question the
relevance of memmove to performance, so if we end up having to do a
lot of review or changes to get the asm committed, it might make sense
to leave memmove for later.

> Using a fairly rudimentary test application, I've benchmarked it as
> having the following speed improvements (this is all on an actual
> ARM board - 400MHz arm926ejs):
> memcpy: 160%
> memmove: 162%
> memcmp: 272%
> These numbers bring musl in line with glibc (at least on ARMv5).
> memcmp in particular seems to be faster (90MB/s vs 75MB/s on my
> platform).
> I haven't looked at using the __hwcap feature at this stage to swap
> between these implementations and neon optimised versions. I assume
> this can come later.
>
> From a code size point of view (this is all with -O3), memcpy goes
> from 1996 to 1680 bytes, memmove goes from 2592 to 2088 bytes, and
> memcmp goes from 1040 to 1452, for a total increase of 224 bytes.
>
> The code is from NetBSD and Android (essentially unmodified), and it
> is all BSD 2-clause licensed.

At first glance, this looks like a clear improvement, but have you
compared it to much more naive optimizations? My _general_ experience
with optimized memcpy asm that's complex like this, and that goes out
of its way to deal explicitly with cache lines and such, is that it's
no faster than just naively moving large blocks at a time. Of course
this may or may not be the case for ARM, but I'd like to know if
you've done any tests.

The basic principle in my mind here is that a complex solution is not
necessarily wrong if it's a big win in other ways, but that a complex
solution which is at most 1-2% faster than a much simpler solution is
probably not the best choice.

I also have access to a good test system now, by the way, so I could
do some tests too.

> The git tree is available here:
> https://github.com/AndreRenaud/musl/commit/713023e7320cf45b116d1c29b6155ece28904e69

It's an open question whether it's better to sync something like this
with an 'upstream' or adapt it to musl coding conventions. Generally
musl uses explicit instructions rather than pseudo-instructions/macros
for prologue and epilogue, and does not use named labels.
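Going back to the naive-optimization question: what I have in mind by
"just naively moving large blocks at a time" is roughly the sketch
below. To be clear, this is purely illustrative; it's not the code in
musl and not the code in your patches, the function name is made up,
and the 32-bytes-per-iteration unroll is arbitrary. It also assumes
src and dst are mutually word-aligned, and a real version would need
to dodge the aliasing rules (e.g. with may_alias types) and handle the
mismatched-alignment case:

#include <stddef.h>
#include <stdint.h>

/* Sketch only: copy 32 bytes per iteration as aligned 32-bit words,
 * with byte copies for the unaligned head and the tail. Assumes dst
 * and src have the same alignment mod 4. */
static void *block_copy(void *restrict dst, const void *restrict src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	/* byte-copy until d (and, by assumption, s) is word-aligned */
	while (n && ((uintptr_t)d & 3)) {
		*d++ = *s++;
		n--;
	}

	uint32_t *dw = (uint32_t *)d;
	const uint32_t *sw = (const uint32_t *)s;

	/* main loop: 8 words (32 bytes) per iteration */
	while (n >= 32) {
		dw[0] = sw[0]; dw[1] = sw[1]; dw[2] = sw[2]; dw[3] = sw[3];
		dw[4] = sw[4]; dw[5] = sw[5]; dw[6] = sw[6]; dw[7] = sw[7];
		dw += 8; sw += 8; n -= 32;
	}

	/* byte-copy whatever is left */
	d = (unsigned char *)dw;
	s = (const unsigned char *)sw;
	while (n--) *d++ = *s++;

	return dst;
}

Something along those lines, built with the same CFLAGS, is the kind
of baseline I'd want the imported asm measured against before taking
on the maintenance cost.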
> Does anyone have any comments on the suitability of this code, or what

If nothing else, it fails to be armv4 compatible. Fixing that should
not be hard, but it would require a bit of an audit. The return
sequences are the obvious issue, but there may be other instructions
in use that are not available on armv4 or maybe not even on armv5...?

> kind of more rigorous testing could be applied?

See above. What also might be worth testing is whether GCC can
compete if you just give it a naive loop (not the fancy
pseudo-vectorized stuff currently in musl) and good CFLAGS. I know on
x86 I was able to beat the fanciest asm strlen I could come up with
simply by writing the naive loop in C and unrolling it a lot. The
only reason musl isn't already using that version is that I suspect
it hurts branch prediction in the caller....

Rich
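P.S. By "the naive loop in C, unrolled a lot" I mean something in the
spirit of the sketch below. It is only an illustration, not the actual
version I tested and not what musl ships, and the unroll factor of 8
is arbitrary:

#include <stddef.h>

/* Sketch only: no word-at-a-time tricks, just a byte loop unrolled so
 * the compiler gets a long run of straight-line compares to schedule. */
static size_t naive_strlen(const char *s)
{
	const char *p = s;
	for (;;) {
		if (!p[0]) return p - s;
		if (!p[1]) return p - s + 1;
		if (!p[2]) return p - s + 2;
		if (!p[3]) return p - s + 3;
		if (!p[4]) return p - s + 4;
		if (!p[5]) return p - s + 5;
		if (!p[6]) return p - s + 6;
		if (!p[7]) return p - s + 7;
		p += 8;
	}
}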