|
Message-ID: <20150624051312.GP1173@brightrain.aerifal.cx> Date: Wed, 24 Jun 2015 01:13:12 -0400 From: Rich Felker <dalias@...c.org> To: musl@...ts.openwall.com Subject: Re: [PATCH 2/5] dynlink.c: compute modulus via magic multiplication On Wed, Jun 24, 2015 at 07:32:56AM +0300, Alexander Monakov wrote: > On Wed, 24 Jun 2015, Rich Felker wrote: > > > On Wed, Jun 24, 2015 at 02:24:52AM +0300, Alexander Monakov wrote: > > > +static uint32_t umod(uint32_t x, uint32_t div, struct udiv *p) > > > +{ > > > + if (!(div&(div-1))) > > > + return x&(div-1); > > > + uint32_t v = x; > > > + if (p->s1) v >>= p->s1; > > > + else if (v != ~0u) v += p->inc; > > > + int s32=32, s2=p->s2; > > > + if (sizeof(long) == 8) > > > + s32+=s2, s2=0; > > > + v = (1ull * v * p->mul) >> s32; > > > + v >>= s2; > > > + return x-v*div; > > > +} > > > > I think having the div argument here is a pessimization. > > How can I do the last step, 'x-v*div' without it? Ah yes. Would it be preferable to have that in struct udiv then, though? Then the caller never has to care about the divisor, and in the case where umod doesn't get inlined, the call will be more efficient (fewer args). > > Power-of-two sizes are unlikely to be used, > > probably so, but... > > > and it's possible for precompute_udiv > > to setup the struct udiv to work perfectly fine for powers of two, > > I gave it a thought, and I didn't find how to do that either. Originally I thought this would work: p->s1 = ctz(div); p->inc = 0; p->mul = 1; p->s2 = 0; But the >>32 of course breaks it. What about: p->s1 = 0 p->inc = 0; p->mul = 0x80000000; p->s2 = ctz(div)-1; This should handle all values except 1, which probably isn't needed, but maybe there's a special case that works for 1 too...? I think the following works for all values except 0xffffffff: p->s1 = 0 p->inc = 1; p->mul = 0xffffffff; p->s2 = 0; Using the post-mul add rather than the saturated increment would make this work for 0xffffffff too. Another obvious solution is not using the +32 offset so that the right shift can just be 31, but that pessimizes the code on 32-bit archs quite a bit. > > so that there is a single code path with no branches. The p->s1 branch > > should not be needed either. > > p->s1 check is for this reason: shift are relatively costly, and s1 is rarely > non-zero, so try to skip the shift if possible; in the rare case the pre-shift > is non-zero, the check allows to skip the saturating increment operation. Shifts are costly? Are we talking about P4-era junk? ;-) Rich
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.