Message-ID: <CAPfzE3YDFjqHxRaZFeiy0CvbYWYGKzgDGEp-71xSz-03GhNTxw@mail.gmail.com>
Date: Thu, 11 Jul 2013 10:44:16 +1200
From: Andre Renaud <andre@...ewatersys.com>
To: Andre Renaud <andre@...ewatersys.com>
Cc: musl@...ts.openwall.com
Subject: Re: Thinking about release

> This results in 95MB/s on my platform (up from 65MB/s for the existing
> memcpy.c, and down from 105MB/s with the asm optimised version). It is
> essentially identically readable to the existing memcpy.c. I'm not
> really familiar with any other cpu architectures, so I'm not sure if
> this would improve, or hurt, performance on other platforms.

Reviewing the assembler that is produced, it appears that GCC will
never generate an ldm/stm instruction (load/store multiple) that
transfers more than 4 registers, whereas the optimised assembler uses
ones that transfer 8 (i.e. 8 * 32-bit reads in a single instruction).
I've tried various tricks/optimisations with the C code, and can't
convince GCC to do more than 4. I assume that this probably accounts
for the remaining 10MB/s difference between these two variants.

Rich - do you have any comments on whether either the C or assembler
variants of memcpy might be suitable for inclusion in musl?

Regards,
Andre
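
For illustration only, here is a minimal sketch of the kind of C loop
being discussed: it moves 8 32-bit words per iteration through local
variables, which is the access pattern an 8-register ldm/stm pair would
cover. This is not the code from this thread and not musl's memcpy.c;
the names sketch_memcpy and word_t are hypothetical, and the may_alias
attribute is a GCC/Clang extension. Whether a given compiler merges the
loads and stores into ldm/stm at all depends on the compiler version
and target.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension: a 32-bit type allowed to alias any object.
 * Assumption for this sketch; not taken from the original message. */
typedef uint32_t __attribute__((__may_alias__)) word_t;

/* Hypothetical name. Copies n bytes from src to dest (no overlap). */
void *sketch_memcpy(void *restrict dest, const void *restrict src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    /* Word copies are only used when both pointers can be brought to
     * 4-byte alignment together, i.e. they share the same low 2 bits. */
    if ((((uintptr_t)d ^ (uintptr_t)s) & 3) == 0) {
        /* Copy leading bytes until the pointers are word aligned. */
        while (n && ((uintptr_t)d & 3)) {
            *d++ = *s++;
            n--;
        }

        word_t *wd = (word_t *)d;
        const word_t *ws = (const word_t *)s;

        /* Main loop: 8 loads, then 8 stores, 32 bytes per iteration. */
        while (n >= 32) {
            word_t a = ws[0], b = ws[1], c = ws[2], e = ws[3];
            word_t f = ws[4], g = ws[5], h = ws[6], i = ws[7];
            wd[0] = a; wd[1] = b; wd[2] = c; wd[3] = e;
            wd[4] = f; wd[5] = g; wd[6] = h; wd[7] = i;
            ws += 8;
            wd += 8;
            n -= 32;
        }

        /* Remaining whole words. */
        while (n >= 4) {
            *wd++ = *ws++;
            n -= 4;
        }

        d = (unsigned char *)wd;
        s = (const unsigned char *)ws;
    }

    /* Trailing bytes, or the whole copy if alignments differ. */
    for (; n; n--)
        *d++ = *s++;

    return dest;
}
```

Compiling something like this with -O2 -S for an ARM target and
inspecting the output is one way to see how many registers the
generated ldm/stm instructions (if any) actually use.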