musl - Re: Improvements to qsort

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20210710002015.GT13220@brightrain.aerifal.cx>
Date: Fri, 9 Jul 2021 20:20:15 -0400
From: Rich Felker <dalias@...c.org>
To: jason <jason@...omnia247.nl>
Cc: musl@...ts.openwall.com
Subject: Re: Improvements to qsort

On Fri, Jul 09, 2021 at 01:26:28PM +0200, jason wrote:
> I've been trying to understand ngmalloc, and it's a struggle, so for
> relaxation I had a look at qsort, and have a few suggestions.
> 
> To avoid burying the lead, the following reduces comparisons by
> about 4% with no downside:
> 
> --- a/qsort.c
> +++ b/qsort.c
> @@ -100,18 +100,16 @@ static void sift(unsigned char *head, size_t width, cmpfun cmp, int pshift, size
>  		rt = head - width;
>  		lf = head - width - lp[pshift - 2];
>  
> -		if((*cmp)(ar[0], lf) >= 0 && (*cmp)(ar[0], rt) >= 0) {
> -			break;
> -		}
>  		if((*cmp)(lf, rt) >= 0) {
> -			ar[i++] = lf;
>  			head = lf;
>  			pshift -= 1;
>  		} else {
> -			ar[i++] = rt;
>  			head = rt;
>  			pshift -= 2;
>  		}
> +		if((*cmp)(ar[0], head) >= 0)
> +			break;
> +		ar[i++] = head;
>  	}
>  	cycle(width, ar, i);
>  }
> 
> Basically, instead of doing two comparisons to see if ar[0] is greater
> than both the left and right children and then comparing the children
> with each other to find the greater one, find the greater child and
> compare ar[0] with that one.

I think this (and addition of comments) are the most worthwhile to
look at first. In particular, cmp is a caller-provided function that
can be expensive, so reducing the number of calls is very helpful.

As an aside, it may be nice to get qsort_r in before making changes so
that qsort_r is available for backport without also backporting
bleeding edge improvements.

> I have several other less-significant changes that I wonder if musl is
> interested in:
> 
> * I notice than pntz() is always used in the following context:
> 	trail = pntz(p);
> 	shr(p, trail);
> 	pshift += trail;
>   both pntz() and shr() have branches conditional on whether the shift
>   is more or less than 8*sizeof(size_t) bits.  I have replaced this
>   by a single function, which allows the conditional branch to be
>   shared:
> 
> /*
>  * Re-normalize p[].  Ignore the low-order bit (which must be set!),
>  * find the position of the next-least-significant bit, and shift p[]
>  * down by that much.  Return the number of bits shifted.
>  *
>  * There must be a penultimate trailing bit; i.e. p[] must not be {1, 0}.
>  */
> static inline int normalize(size_t p[2])
> {
> 	int s;
> 	size_t t = p[0] - 1;
> 
> 	if (t == 0) {
> 		t = p[1];
> 		s = ntz(t);
> 		p[0] = t >> s;
> 		p[1] = 0;
> 		s += 8*sizeof(size_t);
> 	} else {
> 		s = ntz(t);
> 		p[0] = t >> s | p[1] << (8*sizeof(size_t) - s);
> 		p[1] >>= s;
> 	}
> 	return s;
> }
> 
>   which is used simply as:
> 	pshift += normalize(p);

I'm unclear whether it's an improvement open-coding shr rather than
just calling it here. Seems like it introduced new duplication of that
logic while eliminating other duplication. If you think a change is
really better here, it might make sense to factor it as first moving
the repeated pntz/shr/pshift+=... to a function without refactoring,
then expanding pntz out into it (since pntz is not called anywhere
else) in a second change, but without expanding out shr.

> * As part of my learning about the code, I have added a lot of comments,
>   particularly large block comments about the smoothsort algorithm
>   in general, the meaning of p[] and pindex, what a "stepson" is,
>   when it's okay to use sift() instead of trinkle(), what the "trusty"
>   flag means, how cycle() is used, etc.  Would these be worth cleaning
>   up and submitting?

This sounds good. If you can submit them, I'll see if myself and the
original author can review that they make sense.

> * In my private copy, I have changed the type of "trusty" to bool,
>   a data type I'm quite fond of as it concisely documents the limited
>   range of values.  I notice, however, that musl is completely devoid of
>   any use of <stdbool.h>, although it uses other C99 features.  Is this
>   a conscious decision?

Yes, the _Bool type isn't used. That's just a style choice (mainly
because it has some mildly non-obvious behaviors and doesn't really
help anything) but one I'd like to remain consistent on.

> * Likewise, I notice that most musl code has a space in "keyword (condition)"
>   like "if (foo != 0)", "while (bar < N)" and "for (i=0; i < N; i++)".
>   But the qsort() code, with one single exception, omits the space.

The qsort code was written with mildly different spacing style here
than what I usually use and ask others to use, but I don't think it
makes any difference. I'd rather be consistent, but I'd also rather
not have "let's gratuitously change spaces" commits.

>   Likewise, most msul code is quite happy to "if (condition) break;"
>   on a single line, with no braces, but the original sift() function
>   above puts the break on a second line and a closing brace on a third.

Really there is no strong preference here, except possibly to put it
on its own line if it's parallel to an else clause that needs to be on
its own line (e.g. for line length), or if there's a special risk of
it being overlooked if it's on the same line. Mostly I'd aim for
*local consistency* here. There is no compelling reason to need global
consistency across the codebase on how this is done.

>   The musl code base is inconsistent enough that it's hard to infer code
>   formatting conventions.  Are these discrepancies in qsort a problem
>   worth fixing?

No. If someone wrote code for musl with mildly different style but
that wasn't deemed offensively bad when it was contributed, there's no
need to change it now/later.

> * The way the heap-building loop maintains its loop invariants is a trifle
>   peculiar.  It begins with the element at *head included in p[] and pindex,
>   but excluded from the heap ordering invariants.  Then it establishes
>   the ordering invariants (with sift() or trinkle()) and immediately adds
>   another (un-ordered) element to p[] and pindex.
> 
>   After the main loop exits, a standalone call to trinkle() places the
>   final element in heap order.
> 
>   I have a more conventional (add, then sift) ordering working which
>   seems to simplify the code:
>  	while(head < high) {
> +		bool notrinkle = false;
> +
> +		/* Append an element to the heap */
> +		head += width;
>  		if((p[0] & 3) == 3) {
> -			sift(head, width, cmp, pshift, lp);
> +			/* Two trees of adjacent sizes: merge with head */
>  			shr(p, 2);
>  			pshift += 2;
> -		} else {
> -			if(lp[pshift - 1] >= high - head) {
> -				trinkle(head, width, cmp, p, pshift, 0, lp);
> -			} else {
> +			if((size_t)(high - head) > lp[pshift - 1]) {
>  				sift(head, width, cmp, pshift, lp);
> +				notrinkle = true;
>  			}
> -			
> -			if(pshift == 1) {
> -				shl(p, 1);
> -				pshift = 0;
> -			} else {
> -				shl(p, pshift - 1);
> -				pshift = 1;
> -			}
> +		} else if(pshift == 1) {
> +			/* order-1 exists; new tree is order-0 */
> +			shl(p, 1);
> +			pshift = 0;
> +			notrinkle = (high != head);
> +		} else {
> +			/* new tree is order-1 */
> +			shl(p, pshift - 1);
> +			pshift = 1;
> +			notrinkle = ((size_t)(high - head) > lp[0]);
>  		}
> -		
>  		p[0] |= 1;
> -		head += width;
> +
> +		/* Re-establish the heap using trinkle */
> +		if(!notrinkle)
> +			trinkle(head, width, cmp, p, pshift, false, lp);
>  	}
>  
> -	trinkle(head, width, cmp, p, pshift, 0, lp);
> 
>   If the original ordering is wanted, then the loop initialization
>   can be changed to start with two elements (p[0] = 3, pindex = 0,
>   head = base + width), since two elements automatically fulfil the
>   required loop invariants.
> 
>   Does anyone have a preference?

I'll need to read this in a lot more detail to comment on it.

> * Since the first test in qsort() excludes width == 0, the loop over
>   width in cycle() can be changed to "do { ... } while (width);".

I really doubt this makes any difference.

> * The ar[] array in sift() only has to be as large as the lp[]
>   array in qsort(), not 1/6 larger.
> 
> * The ar[] array in trinkle() only has to be half as large, plus one.

These observations may be true, but are very conspicuous high-risk
changes from the perspective of someone not familiar with the code.
Unless there is a compelling reason I would not touch them. Someone
reviewing musl commits for new security bugs (aka code suspicious of
being a backdoor) would sink a lot of time into analyzing that sort of
change, and I do not want to waste their time.

> * The pindex > 1 case in the heap-consuming loop contains some
>   redundant left-then-right shifting:
> 			shl(p, 2);
> 			pshift -= 2;
> 			p[0] ^= 7;
> 			shr(p, 1);
>   This can be optimized.

I think this has been mentioned before by others, and at the time we
were not sure why it was written this way either. It can/should
probably be changed.

> * There are several cases where variables can be moved into
>   smaller local scopes, particularly the "stepson", "rt" and "lf"
>   variables in trinkle().  This is another change I've made privately
>   as I think it makes the code more reasonable.  Worth submitting?

Probably not. There is no consistent rule about how this should be
done, and as mentioned above i don't like "style conformance" commits
unless there's something egregiously wrong.

> * GCC emits a couple of signed/unsigned comparison warnings with
>   -W -Wall.  Worth fixing?

No, the set of warnings musl targets is in configure. Intentional use
of signed/unsigned promotion semantics, including comparison, is used
all over musl.
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.