Message-ID: <20150210204342.GJ23507@brightrain.aerifal.cx>
Date: Tue, 10 Feb 2015 15:43:42 -0500
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: Re: [PATCH] x86_64/memset: simple optimizations

On Tue, Feb 10, 2015 at 09:27:17PM +0100, Denys Vlasenko wrote:
> On Sat, Feb 7, 2015 at 2:06 PM, Rich Felker <dalias@...ifal.cx> wrote:
> > On Sat, Feb 07, 2015 at 01:49:43PM +0100, Denys Vlasenko wrote:
> >> On Sat, Feb 7, 2015 at 1:35 AM, Rich Felker <dalias@...ifal.cx> wrote:
> >> What speedups?
> >> In particular:
> >> - perform pre-alignment if dst is unaligned
> >
> > For the rep stosq path? Does it help? I don't recall the details but I
> > seem to remember both docs and measurements showing no reliable
> > benefit from alignment for this instruction, and we had people trying
> > things on several different cpu models. I'm open to hearing evidence
> > to the contrary though.
>
> size:20k buf:0x7f38656e2100
> stos:25978 ns (times 32), 25.227500 bytes/ns
> stos+1:31395 ns (times 32), 20.874662 bytes/ns
> stos+4:31396 ns (times 32), 20.873997 bytes/ns
> stos+8:24446 ns (times 32), 26.808476 bytes/ns
>
> size:50k buf:0x7fbca1dc9100
> stos:68149 ns (times 32), 24.041439 bytes/ns
> stos+1:85762 ns (times 32), 19.104032 bytes/ns
> stos+4:85762 ns (times 32), 19.104032 bytes/ns
> stos+8:68204 ns (times 32), 24.022051 bytes/ns
>
> size:1024k buf:0x7fa3036a5100
> stos:1632285 ns (times 32), 20.556724 bytes/ns
> stos+1:1891092 ns (times 32), 17.743416 bytes/ns
> stos+4:1891089 ns (times 32), 17.743444 bytes/ns
> stos+8:1632181 ns (times 32), 20.558034 bytes/ns
>
> size:5000k buf:0x7fdf5cd6b100
> stos:15592138 ns (times 32), 10.558298 bytes/ns
> stos+1:15501841 ns (times 32), 10.619799 bytes/ns
> stos+4:15507773 ns (times 32), 10.615737 bytes/ns
> stos+8:15589617 ns (times 32), 10.560005 bytes/ns
>
> The source is attached.

OK. This looks sufficiently significant (despite unaligned memsets
being rare) that it would be nice to optimize it.
Could we just write an initial possibly-misaligned word then increment
the start address and round it up before using rep stos?

> #define _GNU_SOURCE
> #include <sys/types.h>
> #include <sys/time.h>
> #include <sys/syscall.h>
> #include <time.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <string.h>
> /* Old glibc (< 2.3.4) does not provide this constant. We use syscall
>  * directly so this definition is safe. */
> #ifndef CLOCK_MONOTONIC
> #define CLOCK_MONOTONIC 1
> #endif
>
> /* libc has incredibly messy way of doing this,
>  * typically requiring -lrt. We just skip all this mess */
> static void get_mono(struct timespec *ts)
> {
>     syscall(__NR_clock_gettime, CLOCK_MONOTONIC, ts);
> }

FWIW, this is a bad idea; you get syscall overhead in your
measurements. If you just use clock_gettime (the function) you'll get
vdso results (no syscall). Using the syscall directly is also sketchy
in that x32 has an incorrect kernel-side definition for struct
timespec, but I think it will only matter if aarch64-ILP32 copies this
problem from x32 and you're using a big-endian system.

Rich
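
To make the "initial possibly-misaligned word, then round up" idea
above concrete, here is a rough C sketch. It is not musl's memset and
not the patch under discussion. The names memset_head_aligned and
fill_is_exact are hypothetical, and the 16-byte cutoff is an assumption
chosen only so that the head and tail stores always fit inside the
buffer. On x86_64 with GCC or Clang the aligned bulk uses rep stosq via
inline asm; elsewhere it falls back to a plain loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: fill n bytes at dst with byte c, making sure the
 * bulk store starts on an 8-byte boundary. */
static void *memset_head_aligned(void *dst, int c, size_t n)
{
    unsigned char *d = dst;
    unsigned char *end = d + n;
    uint64_t w = 0x0101010101010101ULL * (unsigned char)c;

    /* Assumption of this sketch: short fills take a trivial byte loop. */
    if (n < 16) {
        for (size_t i = 0; i < n; i++)
            d[i] = (unsigned char)c;
        return dst;
    }

    /* Possibly-misaligned 8-byte stores covering the head and the tail. */
    memcpy(d, &w, 8);
    memcpy(end - 8, &w, 8);

    /* Advance past the head store and round to an 8-byte boundary.
     * The step is 1..8 bytes, so no gap is left after the head. */
    unsigned char *p = d + (8 - ((uintptr_t)d & 7));
    size_t words = (size_t)(end - p) / 8;

#if defined(__x86_64__) && defined(__GNUC__) && !defined(__ILP32__)
    /* Aligned bulk fill; the ABI guarantees the direction flag is clear. */
    __asm__ volatile ("rep stosq"
                      : "+D"(p), "+c"(words)
                      : "a"(w)
                      : "memory");
#else
    /* Portable fallback standing in for rep stosq. */
    for (size_t i = 0; i < words; i++)
        memcpy(p + 8 * i, &w, 8);
#endif
    /* Any leftover (< 8 bytes) was already covered by the tail store. */
    return dst;
}

/* Returns 1 if exactly the bytes [off, off+len) were overwritten. */
static int fill_is_exact(size_t off, size_t len)
{
    unsigned char buf[256];

    memset(buf, 0xAA, sizeof buf);
    memset_head_aligned(buf + off, 0x5C, len);
    for (size_t i = 0; i < sizeof buf; i++) {
        unsigned char want = (i >= off && i < off + len) ? 0x5C : 0xAA;
        if (buf[i] != want)
            return 0;
    }
    return 1;
}
```

Calling fill_is_exact over a range of offsets and lengths is a quick
way to check that the overlapping head and tail stores never touch
bytes outside the requested range. Whether this actually recovers the
aligned throughput seen in the stos+8 rows would still have to be
measured.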