Message-ID: <20150224010952.GA10683@brightrain.aerifal.cx>
Date: Mon, 23 Feb 2015 20:09:52 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Cc: Denys Vlasenko <vda.linux@...glemail.com>
Subject: Draft of improved memset.s for i386
Here's a draft of an improved i386 memset.s based on the principles
Denys Vlasenko and I discussed on his and my x86_64 versions. Compared
to the current code, it reduces entry/exit overhead, increases the
length supported in the non-rep-stosl path, and aligns the rep-stosl.
My tests don't measure the misalignment penalty, but even in the
aligned case the rep-stosl path is slightly faster (~5 cycles per run,
out of at least 64 cycles), and the non-rep-stosl path is significantly
faster (e.g. 33 vs 51 cycles at size 16 and 40 vs 57 at size 32).
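
For readers without the attachment, here is one way "aligning the
rep-stosl" can be sketched in C. This is not the code from
memset-draft.s. It assumes the alignment is done by covering the first
and last 4 bytes with unaligned stores and then running the bulk fill
from the next 4-byte boundary. The name aligned_fill and the n >= 8
precondition are hypothetical, and the word loop only stands in for
rep stosl.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: fill n >= 8 bytes so that the bulk loop
 * (standing in for rep stosl) starts on a 4-byte boundary. */
static void *aligned_fill(void *dest, int c, size_t n)
{
    unsigned char *d = dest;
    uint32_t w = (unsigned char)c * 0x01010101u;
    size_t skip, i;

    assert(n >= 8);

    /* Unaligned head and tail stores cover the bytes that the
     * aligned word loop below does not reach. */
    memcpy(d, &w, 4);
    memcpy(d + n - 4, &w, 4);

    /* Distance (0..3 bytes) to the next 4-byte boundary. */
    skip = (size_t)((0 - (uintptr_t)d) & 3);
    d += skip;
    n -= skip;

    /* Aligned bulk fill; n % 4 trailing bytes were written above. */
    for (i = 0; i < n / 4; i++)
        memcpy(d + 4 * i, &w, 4);

    return dest;
}
```

The two extra stores replace separate byte loops for the misaligned
head and the leftover tail, at the cost of writing a few bytes twice.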
Empirically the byte-register-access/left-shift method of extending
the fill value to a word performs better than imul for me, but the
margin is very small (at most 1 cycle). Since we support much older
cpus (like actual 486) where imul could be really slow, I think this
is the right approach in principle too. I used imul in the rep-stosl
path but haven't tested whether it's faster there.
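
As an illustration of the two ways to extend the fill byte to a word,
here is a small C sketch. The function names are hypothetical, and the
shift version only approximates the byte-register/left-shift sequence
in the assembly (C has no direct equivalent of writing %ah). Both
produce the same 32-bit value; the difference discussed above is purely
in instruction cost.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply method: the value an imul by 0x01010101 computes. */
static uint32_t fill_word_mul(unsigned char c)
{
    return (uint32_t)c * 0x01010101u;
}

/* Shift method: duplicate the byte into the low 16 bits, then
 * duplicate those 16 bits into the high half. */
static uint32_t fill_word_shift(unsigned char c)
{
    uint32_t w = c;
    w |= w << 8;   /* 0x000000cc -> 0x0000cccc */
    w |= w << 16;  /* 0x0000cccc -> 0xcccccccc */
    return w;
}

/* Returns 1 if both methods agree for every byte value. */
static int fill_word_methods_agree(void)
{
    unsigned c;
    for (c = 0; c < 256; c++)
        if (fill_word_mul((unsigned char)c) !=
            fill_word_shift((unsigned char)c))
            return 0;
    return 1;
}
```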
The non-rep-stosl path only goes up to size 62. I think sizes up to
126 could benefit from it, but the string of stores was getting really
long.
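
A rough C sketch of a branch-light small-size path follows, to show
why the store sequence grows quickly. It assumes the path writes
progressively wider groups of stores from both ends of the buffer and
lets them overlap, which is an assumption here and not taken from the
attachment. Under that assumption each end covers 1+2+4+8+16 = 31
bytes, giving the 62-byte limit, and one more 32-byte group per end
would give 126. The names small_fill, store2 and store4 are
hypothetical, and the final loop stands in for a straight run of
stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: unaligned 2- and 4-byte stores. */
static void store2(unsigned char *p, uint16_t v) { memcpy(p, &v, 2); }
static void store4(unsigned char *p, uint32_t v) { memcpy(p, &v, 4); }

/* Fill n <= 62 bytes using overlapping stores from both ends. */
static void *small_fill(void *dest, int c, size_t n)
{
    unsigned char *s = dest;      /* start of buffer */
    unsigned char *e = s + n;     /* one past the end */
    unsigned char b = (unsigned char)c;
    uint16_t h = (uint16_t)(b * 0x0101u);
    uint32_t w = b * 0x01010101u;
    int i;

    assert(n <= 62);
    if (n == 0) return dest;

    s[0] = b; e[-1] = b;                    /* enough for n <= 2 */
    if (n <= 2) return dest;

    store2(s + 1, h); store2(e - 3, h);     /* enough for n <= 6 */
    if (n <= 6) return dest;

    store4(s + 3, w); store4(e - 7, w);     /* enough for n <= 14 */
    if (n <= 14) return dest;

    store4(s + 7, w);  store4(s + 11, w);
    store4(e - 15, w); store4(e - 11, w);   /* enough for n <= 30 */
    if (n <= 30) return dest;

    for (i = 0; i < 4; i++) {               /* enough for n <= 62 */
        store4(s + 15 + 4 * i, w);
        store4(e - 31 + 4 * i, w);
    }
    return dest;
}
```

Each stage doubles the number of bytes stored per end, so going from
62 to 126 would add sixteen more word stores in this scheme.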
Correctness has not been tested so there may be stupid bugs.
Rich
View attachment "memset-draft.s" of type "text/plain" (1092 bytes)