Message-ID: <55C29025.8070507@kernel.org>
Date: Wed, 5 Aug 2015 15:37:25 -0700
From: Andy Lutomirski <luto@...nel.org>
To: musl@...ts.openwall.com, Rich Felker <dalias@...c.org>
Subject: Re: Further dynamic linker optimizations

On 07/07/2015 10:48 PM, Timo Teras wrote:
> On Tue, 7 Jul 2015 14:55:05 -0400
> Rich Felker <dalias@...c.org> wrote:
>
>> On Tue, Jul 07, 2015 at 09:39:09PM +0300, Alexander Monakov wrote:
>>> On Tue, 30 Jun 2015, Rich Felker wrote:
>>>
>>>> Discussion on #musl with Timo Teräs has produced the following
>>>> results:
>>>>
>>>> - Moving bloom filter size to struct dso gives 5% improvement in
>>>> clang (built as 110 .so's) start time, simply because of a
>>>> reduction of number of instructions in the hot path. So I think
>>>> we should apply that patch.
>>>
>>> I think most of the improvement here actually comes from fewer
>>> cache misses. As a result, I think we should take this idea further
>>> and shuffle struct dso a little bit so that fields accessed in the
>>> hot find_sym loop are packed together, if possible.
>>
>> I'm not entirely convinced; the 5% seems consistent with the number of
>> instructions in the code path. Can you confirm this with cache miss
>> measurements? Or just by obtaining better timings reordering data for
>> cache locality? Note that the head of struct dso has to remain fixed
>> (it's gdb ABI :/) but the rest is free to change.
>
> I used cachegrind and callgrind to benchmark. In my case there was no
> change in the number of cache misses - the speedup came purely from
> running fewer instructions on the hot path.
>
> Though, I ran this on an i7 with a lot of cache. Cache misses could
> become an issue on smaller CPUs. But I suspect the bloom filter is doing
> a good enough job to keep cache usage at sensible levels.
>
>>>> - The whole outer for loop in find_sym is the hot path for
>>>> performance. As such, eliminating the lazy calculation of
>>>> gnu_hash and simply doing it before the loop should be a
>>>> measurable win, just by removing the if (!ghm) branch.
>>>
>>> On a related note, it's possible to avoid calculating sysv hash, if
>>> gnu-hash is enabled system-wide, by not setting 'global' flag on
>>> the vdso item (as mentioned on IRC in your conversation with Timo).
>>
>> Yes, and I think this sounds like a worthwhile approach. Seeing
>> timings for it would be great. :-)
>
> I reported them earlier on IRC. On the same i7 box, running "clang
> --version", which has 100+ DT_NEEDED entries, removing the vdso and thus
> sysv hashing made a difference on the order of tens of milliseconds. (I
> wonder how it'd perform if we calculated both sysv and gnu hashes at
> the same time.)
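
For reference, a minimal sketch of what computing both hashes in a single pass over the symbol name, once before the lookup loop, could look like. This is not musl's code: the names hash_both and struct sym_hashes are made up for illustration, and the sysv hash is written in the textbook ELF form rather than the way any particular loader spells it.

```c
#include <stdint.h>

/* Hypothetical container for the two hashes of one symbol name. */
struct sym_hashes {
	uint32_t sysv; /* classic ELF (DT_HASH) hash */
	uint32_t gnu;  /* DT_GNU_HASH hash */
};

/* Walk the name once and update both hash states per character. */
static struct sym_hashes hash_both(const char *name)
{
	const unsigned char *s = (const unsigned char *)name;
	uint32_t h = 0, g, gh = 5381;

	for (; *s; s++) {
		/* sysv: shift in 4 bits, fold the top nibble back down */
		h = (h << 4) + *s;
		g = h & 0xf0000000u;
		h ^= g >> 24;
		h &= ~g;
		/* gnu: h = h*33 + c, truncated to 32 bits */
		gh = gh * 33 + *s;
	}
	return (struct sym_hashes){ h, gh };
}
```

Whether this beats skipping the sysv hash entirely would need measuring; the sketch only shows that the two loops fuse without sharing any state.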

/me dons vdso maintainer hat.

I can add a GNU hash to the vdso quite easily (for Linux 4.3).  Would 
that be helpful?
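
To illustrate what a GNU hash section buys the loader, here is a minimal sketch of the bloom filter test that a DT_GNU_HASH table makes possible before any bucket or chain is touched. It assumes 64-bit bloom words (ELFCLASS64) and a power-of-two word count; bloom_add and bloom_may_contain are hypothetical names, and bloom_add only stands in for what the static linker does when it builds the table.

```c
#include <stdint.h>

#define BLOOM_BITS 64u /* assumption: ELFCLASS64, 64-bit bloom words */

/* GNU symbol hash: h = h*33 + c, starting from 5381. */
static uint32_t gnu_hash(const char *name)
{
	const unsigned char *s = (const unsigned char *)name;
	uint32_t h = 5381;
	for (; *s; s++)
		h = h * 33 + *s;
	return h;
}

/* Hypothetical helper: set the two filter bits for one hash value.
 * nwords must be a power of two. */
static void bloom_add(uint64_t *bloom, uint32_t nwords,
                      uint32_t shift, uint32_t h)
{
	uint64_t *w = &bloom[(h / BLOOM_BITS) & (nwords - 1)];
	*w |= (uint64_t)1 << (h % BLOOM_BITS);
	*w |= (uint64_t)1 << ((h >> shift) % BLOOM_BITS);
}

/* Returns 0 if the symbol is definitely absent, 1 if it may be present. */
static int bloom_may_contain(const uint64_t *bloom, uint32_t nwords,
                             uint32_t shift, uint32_t h)
{
	uint64_t w = bloom[(h / BLOOM_BITS) & (nwords - 1)];
	uint64_t mask = ((uint64_t)1 << (h % BLOOM_BITS))
	              | ((uint64_t)1 << ((h >> shift) % BLOOM_BITS));
	return (w & mask) == mask;
}
```

A lookup that fails this test can skip the object after a single word load, which is the cheap rejection path the discussion above relies on.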

--Andy
