|
|
Message-ID: <20250629035154.GT1827@brightrain.aerifal.cx>
Date: Sat, 28 Jun 2025 23:51:56 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Paintcans for reverse iterating strings
On Sat, Jun 28, 2025 at 08:08:43PM -0400, Rich Felker wrote:
> One thing we're going to need for LC_COLLATE in locales where
> second-level weights are applied in reverse order (diacritic marks
> later in the string weigh more than earlier ones) is the ability to
> traverse (& live transform to NFD) the input string in reverse.
>
> In order not to take quadratic time doing this, we need a strategy for
> laying down a logarithmic (thus bounded by a small constant) number of
> waypoints ("paintcans") we can resume forward-processing at to
> simulate reverse-order random access.
>
> I'm posting here a proposal for comments on whether this seems fairly
> optimal or has any obvious improvements:
>
>
> Maintain an array of up to N waypoints, initially populated just by
> the halfway point.
>
> When requesting a position, first remove any waypoints past the
> requested position from the array, then start from the last remaining
> waypoint and work forward. When crossing the halfway point between the
> last waypoint and the desired position, if the waypoint array is not
> already full, add the halfway position as a waypoint.
>
>
> With unlimited N, I'm pretty sure this gives O(N log N) total
> traversal time. In a finite real-machine model, N=8*sizeof(size_t)
> then gives the same. And if we know ahead of time that the size of the
> input is way less than the maximum possible length (note: total length
> is known from a previous pass), we can pre-populate linear-spaced
> early waypoints or space them differently (like 1/4 instead of 1/2 the
> way to the target) so that less time is spent backtracking (I haven't
> yet thought much about which is best).
>
> Note that N=64 is no problem in terms of storage space; it's under 4k.
> And the majority would not even get used/touched except on
> pathologically long inputs to strcoll/strxfrm, and only in the case of
> a locale with reversed second-level weights.
At least empirically this seems to work. I'm getting expected output
and for a 21-character string it performs the forward-iterate
operation 63 times, which is not bad. This is without any logic to
emit extra waypoints when log(len) is small and only a few slots are
needed for the halfway steps.
Rich
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.