Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250629000843.GA19826@brightrain.aerifal.cx>
Date: Sat, 28 Jun 2025 20:08:43 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Paintcans for reverse iterating strings

One thing we're going to need for LC_COLLATE in locales where
second-level weights are applied in reverse order (diacritic marks
later in the string weigh more than earlier ones) is the ability to
traverse (& live transform to NFD) the input string in reverse.

In order not to take quadratic time doing this, we need a strategy for
laying down a logarithmic (thus bounded by a small constant) number of
waypoints ("paintcans") we can resume forward-processing at to
simulate reverse-order random access.

I'm posting here a proposal for comments on whether this seems fairly
optimal or has any obvious improvements:


Maintain an array of up to N waypoints, initially populated just by
the halfway point.

When requesting a position, first remove any waypoints past the
requested position from the array, then start from the last remaining
waypoint and work forward. When crossing the halfway point between the
last waypoint and the desired position, if the waypoint array is not
already full, add the halfway position as a waypoint.


With unlimited N, I'm pretty sure this gives O(N log N) total
traversal time. In a finite real-machine model, N=8*sizeof(size_t)
then gives the same. And if we know ahead of time that the size of the
input is way less than the maximum possible length (note: total length
is known from a previous pass), we can pre-populate linear-spaced
early waypoints or space them differently (like 1/4 instead of 1/2 the
way to the target) so that less time is spent backtracking (I haven't
yet thought much about which is best).

Note that N=64 is no problem in terms of storage space; it's under 4k.
And the majority would not even get used/touched except on
pathologically long inputs to strcoll/strxfrm, and only in the case of
a locale with reversed second-level weights.

Rich

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.