Message-ID: <CAPfzE3ZsMpC9d4VDZyHabhKOffOQW0dnG7Nwpm8EqVBLUXNZKg@mail.gmail.com>
Date: Wed, 10 Jul 2013 10:26:46 +1200
From: Andre Renaud <andre@...ewatersys.com>
To: musl@...ts.openwall.com
Subject: Re: Thinking about release

Replying to myself

> Certainly if there was a more straight forward C implementation that
> achieved similar results that would be superior. However the existing
> musl C memcpy code is already optimised to some degree (doing 32-bit
> rather than 8-bit copies), and it is difficult to convince gcc to use
> the load-multiple & store-multiple instructions via C code I've found,
> without resorting to pretty horrible C code. It may still be
> preferable to the assembler though. At this stage I haven't
> benchmarked this - I'll see if I can come up with something.

As a comparison, the existing memcpy.c implementation tries to copy
sizeof(size_t) bytes at a time, which on ARM is 4. This ends up being a
standard load/store. However, GCC is smart enough to know that it can
use ldm/stm instructions for copying structures larger than 4 bytes. So
if we change memcpy.c to use a structure larger than 4 bytes (e.g. 16
bytes) instead of size_t as its basic copy unit, we do see some
improvement:

    #include <stddef.h>
    #include <stdint.h>

    /* Copy unit: four machine words (16 bytes on 32-bit ARM). */
    typedef struct multiple_size_t {
        size_t d[4];
    } multiple_size_t;

    #define SS (sizeof(multiple_size_t))
    #define ALIGN (sizeof(multiple_size_t)-1)

    void *my_memcpy(void * restrict dest, const void * restrict src, size_t n)
    {
        unsigned char *d = dest;
        const unsigned char *s = src;

        /* Block copies need src and dest at the same offset within a unit. */
        if (((uintptr_t)d & ALIGN) != ((uintptr_t)s & ALIGN))
            goto misaligned;

        /* Copy single bytes until dest reaches a unit boundary. */
        for (; ((uintptr_t)d & ALIGN) && n; n--)
            *d++ = *s++;
        if (n) {
            multiple_size_t *wd = (void *)d;
            const struct multiple_size_t *ws = (const void *)s;

            /* Copy whole units; GCC can emit ldm/stm for these. */
            for (; n>=SS; n-=SS)
                *wd++ = *ws++;
            d = (void *)wd;
            s = (const void *)ws;
    misaligned:
            /* Copy the remaining bytes one at a time. */
            for (; n; n--)
                *d++ = *s++;
        }
        return dest;
    }

This results in 95MB/s on my platform (up from 65MB/s for the existing
memcpy.c, and down from 105MB/s with the asm optimised version).
It is essentially as readable as the existing memcpy.c. I'm not really
familiar with any other CPU architectures, so I'm not sure whether this
would improve or hurt performance on other platforms.

Any comments on using something like this for memcpy instead? Obviously
this gives you a higher penalty if the size of the area to be copied is
between sizeof(size_t) and sizeof(multiple_size_t).

Regards,
Andre
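
To make the small-copy penalty mentioned above concrete, here is a minimal sketch that only does the arithmetic: for a given destination address, source address, length and copy-unit size, it reports how many bytes would go through the byte loops and how many unit-sized block copies would happen, following the control flow of my_memcpy. The names copy_split, split_copy and byte_ops are hypothetical helpers for illustration and are not part of musl; the addresses are plain integers, so nothing is actually copied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type (not part of musl): how an n-byte copy is
 * divided up by a memcpy structured like my_memcpy. */
struct copy_split {
    size_t head;    /* single-byte copies needed to reach a unit boundary */
    size_t blocks;  /* number of unit-sized block copies */
    size_t tail;    /* single-byte copies after the block loop */
};

/* d and s are the destination and source addresses, n the length in
 * bytes, and unit the block size in bytes (must be a power of two). */
static struct copy_split split_copy(uintptr_t d, uintptr_t s,
                                    size_t n, size_t unit)
{
    struct copy_split r = {0, 0, 0};
    uintptr_t mask = unit - 1;

    assert(unit != 0 && (unit & (unit - 1)) == 0);

    /* Source and destination sit at different offsets within a unit:
     * the whole copy falls back to the byte loop. */
    if ((d & mask) != (s & mask)) {
        r.tail = n;
        return r;
    }

    /* Bytes needed to bring d up to the next unit boundary (0 if it is
     * already aligned), limited by the total length. */
    size_t head = (size_t)((unit - (d & mask)) & mask);
    if (head > n)
        head = n;
    r.head = head;
    n -= head;

    r.blocks = n / unit;
    r.tail = n % unit;
    return r;
}

/* Total number of single-byte copies for a given split. */
static size_t byte_ops(struct copy_split r)
{
    return r.head + r.tail;
}
```

For a 12-byte copy between 16-byte-aligned buffers, split_copy with unit 16 gives 12 single-byte copies and no block copies, whereas unit 4 gives three word copies and no byte copies. The same helper shows a second cost of the larger unit: addresses such as 0x1004 and 0x2008 agree modulo 4 but not modulo 16, so the 16-byte version takes the byte-only path where the 4-byte version would not.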