musl - Re: Re: Further dynamic linker optimizations


Message-ID: <20150806043252.GB1900@localhost>
Date: Wed, 5 Aug 2015 21:32:53 -0700
From: Isaac Dunham <ibid.ag@...il.com>
To: musl@...ts.openwall.com
Cc: Rich Felker <dalias@...c.org>
Subject: Re: Re: Further dynamic linker optimizations

On Wed, Aug 05, 2015 at 03:37:25PM -0700, Andy Lutomirski wrote:
> On 07/07/2015 10:48 PM, Timo Teras wrote:
> >On Tue, 7 Jul 2015 14:55:05 -0400
> >Rich Felker <dalias@...c.org> wrote:
> >
> >>On Tue, Jul 07, 2015 at 09:39:09PM +0300, Alexander Monakov wrote:
> >>>On Tue, 30 Jun 2015, Rich Felker wrote:
> >>>
> >>>>Discussion on #musl with Timo Teräs has produced the following
> >>>>results:
> >>>>
> >>>>- Moving bloom filter size to struct dso gives 5% improvement in
> >>>>clang (built as 110 .so's) start time, simply because of a
> >>>>reduction of number of instructions in the hot path. So I think
> >>>>we should apply that patch.
> >>>
> >>>I think most of the improvement here actually comes from fewer
> >>>cache misses. As a result, I think we should take this idea further
> >>>and shuffle struct dso a little bit so that fields accessed in the
> >>>hot find_sym loop are packed together, if possible.
> >>
> >>I'm not entirely convinced; the 5% seems consistent with the number of
> >>instructions in the code path. Can you confirm this with cache miss
> >>measurements? Or just by obtaining better timings reordering data for
> >>cache locality? Note that the head of struct dso has to remain fixed
> >>(it's gdb ABI :/) but the rest is free to change.
> >
> >I used cachegrind and callgrind to benchmark. In my case there was no
> >change in cache miss number - the speed up was purely based on running
> >less instructions on the hot path.
> >
> >Though, I ran this on i7 with lot of cache. Cache misses could become
> >issue on smaller cpus. But I suspect the bloom filter is doing good
> >enough job to keep cache usage on sensible levels.
> >
> >>>>- The whole outer for loop in find_sym is the hot path for
> >>>>performance. As such, eliminating the lazy calculation of
> >>>>gnu_hash and simply doing it before the loop should be a
> >>>>measurable win, just by removing the if (!ghm) branch.
> >>>
> >>>On a related note, it's possible to avoid calculating sysv hash, if
> >>>gnu-hash is enabled system-wide, by not setting 'global' flag on
> >>>the vdso item (as mentioned on IRC in your conversation with Timo).
> >>
> >>Yes, and I think this sounds like a worthwhile approach. Seeing
> >>timings for it would be great. :-)
> >
> >I told them earlier in IRC. But on the same i7 box and running "clang
> >--version" which has 100+ DT_NEEDED... removing vdso and thus sysv
> >hashing had magnitude of tens of milliseconds. (I wonder how it'd
> >perform if we calculated both sysv and gnu hashes at same time.)
> 
> /me dons vdso maintainer hat.
> 
> I can add a GNU hash to the vdso quite easily (for Linux 4.3).  Would that
> be helpful?

Would this require a binutils version that supports GNU hashes?
And if so, would it be a hard build-time requirement?

Thanks,
Isaac Dunham
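
For reference, the aside quoted above about calculating both the sysv and
gnu hashes at the same time could look roughly like the sketch below. It
is only an illustration and is not taken from musl: the names
struct sym_hashes and hash_both are hypothetical, and the two hash
recurrences are the standard ELF sysv and GNU ones as I understand them.

```c
#include <stdint.h>

/* Hypothetical container for both hash values of one symbol name. */
struct sym_hashes {
	uint32_t sysv;
	uint32_t gnu;
};

/* Compute the ELF sysv hash and the GNU hash in a single pass over
 * the name, instead of walking the string twice. */
static struct sym_hashes hash_both(const char *name)
{
	const unsigned char *s = (const unsigned char *)name;
	uint32_t sysv = 0;
	uint32_t gnu = 5381;	/* GNU hash starts from 5381 */

	for (; *s; s++) {
		/* sysv: shift in 4 bits, fold the top nibble back in */
		sysv = 16*sysv + *s;
		sysv ^= sysv>>24 & 0xf0;
		/* gnu: h = h*33 + c, modulo 2^32 */
		gnu = gnu*33 + *s;
	}

	struct sym_hashes r = { sysv & 0xfffffff, gnu };
	return r;
}
```

Whether this would beat skipping the sysv hash entirely is something only
measurement on the find_sym path could show.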
