Message-ID: <20151105025433.GW8645@brightrain.aerifal.cx>
Date: Wed, 4 Nov 2015 21:54:33 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: [PATCH 2/3] i386/memset: do not fetch fill char from
 memory again

On Mon, Oct 12, 2015 at 08:30:33PM +0200, Denys Vlasenko wrote:
>  shl $16,%edx
>  mov 8(%esp),%dl
>  mov 8(%esp),%dh
> 
> The above code has two register merge stalls, and it goes to load unit
> to fetch the data. I don't know what's worse. Both are not pleasant.

Do you have measurements to back this?

> Replace them with IMUL. It has ~3 cycle latency, but no stalls.

While we probably don't need to care about ancient chips like the 486
or the original Pentium for performance purposes (although maybe
Quark?), I'd rather not do anything that would make performance
catastrophically worse on them unless it actually has a significant
(measurable) benefit for modern systems. The code as it stands was
written to be non-hostile to systems where imul has some nontrivial cost.
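
For reference, the two ways of spreading the fill byte across a 32-bit
register can be sketched in C. This is only an illustration, not the
code from the patch: the function names are made up, and the multiplier
0x01010101 is an assumption about what the imul variant would use. The
shift-and-merge version only loosely mirrors the shl/mov sequence quoted
above.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-and-merge: build the 32-bit pattern piece by piece.
   Hypothetical helper; loosely mirrors the shl / mov sequence. */
static uint32_t broadcast_merge(unsigned char c)
{
    uint32_t x = c;   /* 0x000000cc */
    x |= x << 8;      /* 0x0000cccc */
    x |= x << 16;     /* 0xcccccccc */
    return x;
}

/* Multiply: a single multiplication replicates the byte into all four
   byte lanes. The constant is an assumption, not taken from the patch. */
static uint32_t broadcast_mul(unsigned char c)
{
    return (uint32_t)c * UINT32_C(0x01010101);
}

/* Check that both approaches agree for every possible fill byte. */
static void broadcast_selfcheck(void)
{
    for (unsigned i = 0; i < 256; i++)
        assert(broadcast_merge((unsigned char)i) ==
               broadcast_mul((unsigned char)i));
}
```

Both produce the same pattern; which one is cheaper depends on the cost
of the multiply versus the partial-register merges on the CPU in
question, which is exactly what would need measuring.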

> Move it a bit up to hide its latency.

Moving it up puts it before the branch that exits early, which is
probably a huge performance loss on old CPUs.

Of course, even better than evidence that your code helps a lot on
modern CPUs would be evidence that it doesn't hurt at all on old ones.
Does anyone have a 486 or 586 lying around to run timings on? I suppose
I could see if my old K6 still boots...

Rich
