|
Message-ID: <CAPfzE3aKG7JYE_u3oVDfkF2xDSdhzdrY3ui-H0bUduQXUOQ6Vg@mail.gmail.com> Date: Fri, 12 Jul 2013 10:34:31 +1200 From: Andre Renaud <andre@...ewatersys.com> To: musl@...ts.openwall.com Subject: Re: Thinking about release Hi Rich, > You need both instructions in the same asm block, and proper > constraints. As it is, whether the registers keep their values between > the two separate asm blocks is up to the compiler's whims. > > With the proper constraints ("+r" type), the s+=SS and d+=SS are > unnecessary, as a bonus. Also there's no reason to force alignment to > SS for this loop; that will simply prevent it from being used as much > for smaller copies. I would use SS==sizeof(size_t) and then write 8*SS > in the for loop. > > Last night I was in the process of writing something very similar, but > I put the for loop in asm too and didn't finish it. If it performs > just as well with the loop in C, I like your version better. I've rejiggled it a bit, and it appears to be working. I wasn't entirely sure what you meant about the proper constraints. There is an additional reason why 8*4 was used for the align - to force the whole loop to work in cache-line blocks. I've now done this explicitly on the lead-in by doing the first few copies as 32-bit, then going to the full cache-line asm. This has the same performance as the fully native assembler. However to get that I had to use the same trick that the native assembler uses - doing a load of the next block prior to storing this one. I'm a bit concerned that this would mean we'd be doing a read that was out of bounds, and I can't entirely see why this wouldn't be happening with the existing assembler (but I'm presuming it doesn't). Any comments on this side of it? #define SS sizeof(size_t) #define ALIGN (SS - 1) void * noinline my_asm_memcpy(void * restrict dest, const void * restrict src, size_t n) { unsigned char *d = dest; const unsigned char *s = src; if (((uintptr_t)d & ALIGN) != ((uintptr_t)s & ALIGN)) goto misaligned; /* ARM has 32-byte cache lines, so get us aligned to that */ for (; ((uintptr_t)d & ((8 * SS) - 1)) && n; n-=SS) { *(size_t *)d = *(size_t *)s; d += SS; s+= SS; } /* Do full cache line read/writes */ if (n) { for (; n>=(8 * SS); n-= (8 * SS)) { __asm__ ( "ldmia %0, {r4-r11}\n" "add %0, %0, %4\n" "bic r12, %0, %5\n" "ldrhi r12, [%0]\n" "stmia %1, {r4-r11}\n" "add %1, %1, %4" : "=r"(s), "=r"(d) : "0"(s), "1"(d), "i"(8 * SS), "i"((8 * SS) - 1) : "r4", "r5", "r6", "r7", "r8", "r9", "r10", "r11", "r12"); } misaligned: for (; n; n--) *d++ = *s++; } return dest; } Regards, Andre
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.