Message-ID: <20120730204100.GY544@brightrain.aerifal.cx>
Date: Mon, 30 Jul 2012 16:41:00 -0400
From: Rich Felker <dalias@...ifal.cx>
To: Kim Walisch <kim.walisch@...il.com>
Cc: musl@...ts.openwall.com
Subject: Re: musl libc, memcpy

Hi,

I'm replying with the list CC'd so others can comment too. Sorry I
haven't gotten a chance to try this code or review it in detail yet.
What follows is a short initial commentary, but I'll give it more
attention soon.

On Sun, Jul 29, 2012 at 11:41:47AM +0200, Kim Walisch wrote:
> Hi,
>
> I have been reading through several libc implementations on the
> internet for the past days and for fun I have written a fast yet
> portable memcpy implementation. It uses more code than your
> implementation but I do not think it is bloated. Some quick benchmarks
> that I ran on my Intel Core i5-670 3.46GHz (Red Hat 6.2 x86_64)
> indicate that my implementation runs about 50 percent faster than
> yours for aligned data and up to 10 times faster for unaligned data
> using gcc-4.7. The Intel C compiler even vectorizes the main copying
> loop using SSE instructions (if compiled with icc -O2 -xHost), which
> gives performance better than glibc's memcpy on my system. I would be
> happy to hear your opinion about my memcpy implementation.

I'd like to know what block sizes you were looking at, because for
memcpy that makes all the difference in the world:

For very small blocks (down to 1 byte), performance will be dominated
by the conditional branches picking what to do.

For very large blocks (larger than cache), performance will be
memory-bound, and even byte-at-a-time copying might be competitive.

Theoretically, there's only a fairly small range of sizes where the
algorithm used matters a lot.

> /* CPU architectures that support fast unaligned memory access */
> #if defined(__i386) || defined(__x86_64)
> # define UNALIGNED_MEMORY_ACCESS
> #endif

I don't think this is necessary or useful. If we want better
performance on these archs, a tiny asm file that does almost nothing
but "rep movsd" is known to be the fastest solution on 32-bit x86, and
is at least the second-fastest on 64-bit, with the faster solutions not
being available on all cpus. On pretty much all other archs, unaligned
access is illegal.

> static void *internal_memcpy_uintptr(void *dest, const void *src, size_t n)
> {
>     char *d = (char*) dest;
>     const char *s = (const char*) src;
>     size_t bytes_iteration = sizeof(uintptr_t) * 8;
>
>     while (n >= bytes_iteration)
>     {
>         ((uintptr_t*)d)[0] = ((const uintptr_t*)s)[0];
>         ((uintptr_t*)d)[1] = ((const uintptr_t*)s)[1];
>         ((uintptr_t*)d)[2] = ((const uintptr_t*)s)[2];
>         ((uintptr_t*)d)[3] = ((const uintptr_t*)s)[3];
>         ((uintptr_t*)d)[4] = ((const uintptr_t*)s)[4];
>         ((uintptr_t*)d)[5] = ((const uintptr_t*)s)[5];
>         ((uintptr_t*)d)[6] = ((const uintptr_t*)s)[6];
>         ((uintptr_t*)d)[7] = ((const uintptr_t*)s)[7];
>         d += bytes_iteration;
>         s += bytes_iteration;
>         n -= bytes_iteration;
>     }

This is just manual loop unrolling, no? GCC should do the equivalent if
you ask it to aggressively unroll loops, including the vectorization;
if not, that seems like a GCC bug.

Rich
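
[Editor's note: for illustration, a minimal sketch of the "rep movs"
approach mentioned above, written as GNU C inline asm rather than the
separate asm file Rich describes, and using the byte-granularity
"rep movsb" form to keep it short. The function name and framing are
illustrative, not musl's actual code.]

#include <stddef.h>

static void *memcpy_repmovs(void *dest, const void *src, size_t n)
{
	void *d = dest;
	/* rep movsb copies (E/R)CX bytes from [(E/R)SI] to [(E/R)DI];
	 * the CPU's string-move machinery handles alignment itself. */
	__asm__ __volatile__ ("rep movsb"
		: "+D"(d), "+S"(src), "+c"(n)
		: : "memory");
	return dest;
}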
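[Editor's note: likewise, a sketch of what the quoted inner loop looks
like without the manual unrolling; the point of Rich's last remark is
that a compiler invocation along the lines of "gcc -O3 -funroll-loops"
is expected to unroll (and on x86, vectorize) a loop of this shape on
its own. The name copy_words is illustrative, and like the quoted code
it assumes suitably aligned pointers.]

#include <stddef.h>
#include <stdint.h>

static void *copy_words(void *dest, const void *src, size_t n)
{
	unsigned char *d = dest;
	const unsigned char *s = src;

	/* one word per iteration; the compiler can unroll/vectorize this */
	while (n >= sizeof(uintptr_t)) {
		*(uintptr_t *)d = *(const uintptr_t *)s;
		d += sizeof(uintptr_t);
		s += sizeof(uintptr_t);
		n -= sizeof(uintptr_t);
	}
	/* remaining tail bytes */
	while (n--) *d++ = *s++;
	return dest;
}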