Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150624063923.GR1173@brightrain.aerifal.cx>
Date: Wed, 24 Jun 2015 02:39:23 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: [PATCH 2/5] dynlink.c: compute modulus via magic
 multiplication

On Wed, Jun 24, 2015 at 09:08:15AM +0300, Alexander Monakov wrote:
> > > How can I do the last step, 'x-v*div' without it?
> > 
> > Ah yes. Would it be preferable to have that in struct udiv then,
> > though? Then the caller never has to care about the divisor, and in
> > the case where umod doesn't get inlined, the call will be more
> > efficient (fewer args).
> 
> I doubt that; the caller needs that value ('nbuckets') anyway.

OK.

> [...]
> > Using the post-mul add rather than the saturated increment would make
> > this work for 0xffffffff too.
> 
> post-mul add is problematic on 32-bit

Maybe. It's not clear to me which is cheaper, a saturating multiply
(well-predicted branch) or an adc. IMO the only place the latter is
clearly more costly is on archs with no proper carry mechanism.

> > Another obvious solution is not using the +32 offset so that the right
> > shift can just be 31, but that pessimizes the code on 32-bit archs
> > quite a bit.
> 
> I agree that a special-case for a power-of-two divisor will fire too rarely,
> so figuring a replacement out would be nice.
> 
> If you (or anyone) want to play with ideas, I'm attaching my test driver for
> magic division that tests 2^32-1 inputs for a divisor given on the command
> line (the main loop looks odd, but my goal was to have gcc vectorize it).

Nice.

> > > p->s1 check is for this reason: shift are relatively costly, and s1 is rarely
> > > non-zero, so try to skip the shift if possible; in the rare case the pre-shift
> > > is non-zero, the check allows to skip the saturating increment operation.
> > 
> > Shifts are costly? Are we talking about P4-era junk? ;-)
> 
> I primarily had modern Intel cores in mind where as far as I can see shifts by
> %ecx have latency 2.  But even ignoring that, there's the second part of my
> argument.

With a latency of 2, I can't imagine there being any benefit to
attempting to avoid it with a conditional branch.

Rich

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.