Message-ID: <20150225203712.GA2302@brightrain.aerifal.cx>
Date: Wed, 25 Feb 2015 15:37:12 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Updated draft of improved memset.s for i386
Here's a new version of the improved i386 memset.s. The main changes
are:
- Alignment to 16-byte boundary rather than 4-byte for rep stosl.
- Preserving existing over-alignment via rounding up instead of adding
16 then rounding down.
- Special-casing the already-aligned case (saves a few cycles, maybe
5-10% of total run time at sizes just above the rep stosl cutoff,
such as 64).
- Keeping the rep stosl run-length as long as possible rather than
trying to avoid duplicate stores. This helps a lot (>2x improvement)
at size 1024 on Atom and shouldn't hurt in general.
At this point I think it should be a net improvement on nearly any x86
system.
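To illustrate the rounding change in the list above, here is a minimal C sketch of the two ways to compute an aligned start address. It is not taken from the attached memset-draft3.s; the helper names round_up_16 and add_16_round_down are hypothetical and only model the address arithmetic.

```c
#include <stdint.h>

/* Hypothetical helper: round up to the next multiple of 16.
 * An address that is already 16-byte aligned is returned unchanged,
 * so any existing over-alignment (32, 64, ...) is preserved. */
static uintptr_t round_up_16(uintptr_t p)
{
    return (p + 15) & ~(uintptr_t)15;
}

/* Hypothetical helper: add 16, then round down to a multiple of 16.
 * An already-aligned address is pushed forward by a full 16 bytes,
 * which can destroy alignment to a larger power of two. */
static uintptr_t add_16_round_down(uintptr_t p)
{
    return (p + 16) & ~(uintptr_t)15;
}
```

For a 64-byte-aligned address such as 0x1000, round_up_16 returns 0x1000 while add_16_round_down returns 0x1010, which is only 16-byte aligned. For any unaligned address the two agree.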
I've checked and it passes the current tests in libc-test. I'm not
entirely sure the tests cover all the cases we need though. For the
32-bit version, tests need to cover:
- All sizes 0-62; alignment doesn't matter.
- Sufficiently many sizes >=63 to get all alignments mod 16 for both
the length and the base pointer.
For the 64-bit versions (either Denys's latest or mine) we also need
coverage for all sizes 63-126 (alignment doesn't matter) and
sufficiently many past that to test all alignments mod 16 for both
length and base. For the sake of robustness and future-proofing, we
should probably be testing all base and length alignments mod 32 or
more up to size 256 or larger.
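The coverage described above could be sketched roughly as follows. This is not the libc-test code. The function name check_memset_coverage and the constants GUARD, MAX_LEN and ALIGN_RANGE are assumptions chosen to match the "mod 32, up to size 256" suggestion. It checks every length 0..256 at every base offset 0..31, and verifies the filled range, the bytes on both sides of it, and the return value.

```c
#include <stddef.h>
#include <string.h>

#define GUARD       64   /* assumed: untouched bytes kept on each side */
#define MAX_LEN     256  /* assumed: largest length tested */
#define ALIGN_RANGE 32   /* assumed: base offsets 0..31 hit every residue mod 32 */

/* Call through a volatile pointer so the compiler cannot inline or
 * elide the call; the library's memset is what actually runs. */
static void *(*volatile memset_under_test)(void *, int, size_t) = memset;

/* Returns the number of (offset, length) pairs that failed. */
static unsigned long check_memset_coverage(void)
{
    static unsigned char buf[GUARD + ALIGN_RANGE + MAX_LEN + GUARD];
    unsigned long failures = 0;

    for (size_t off = 0; off < ALIGN_RANGE; off++) {
        for (size_t len = 0; len <= MAX_LEN; len++) {
            unsigned char *dst = buf + GUARD + off;

            /* Fill the background with a plain loop so the setup does
             * not depend on the function being tested. */
            for (size_t i = 0; i < sizeof buf; i++)
                buf[i] = 0xAA;

            void *ret = memset_under_test(dst, 0x55, len);
            int ok = (ret == dst);

            /* Bytes inside [dst, dst+len) must be 0x55; all others,
             * including both guard regions, must still be 0xAA. */
            for (size_t i = 0; i < sizeof buf; i++) {
                int inside = (i >= GUARD + off && i < GUARD + off + len);
                if (buf[i] != (inside ? 0x55 : 0xAA))
                    ok = 0;
            }
            if (!ok)
                failures++;
        }
    }
    return failures;
}
```

Because the offsets run over 32 consecutive addresses, every base residue mod 32 is covered regardless of how buf itself happens to be aligned. A return value of 0 means no mismatch was found.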
Rich
View attachment "memset-draft3.s" of type "text/plain" (1171 bytes)