musl - Re: Collation weight length frequencies

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20260523210218.GB27423@brightrain.aerifal.cx>
Date: Sat, 23 May 2026 17:02:18 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Collation weight length frequencies

On Sat, May 23, 2026 at 01:32:31AM -0400, Rich Felker wrote:
> This makes the patterns to represent efficiently, in order of
> importance (* = elided default weight):
> 
> 3 * *
> 3 1 1
> 2 * *
> 2 1 1
> 1 * *
> 1 1 1
> 0 0 0
> 0 2 1
> 0 1 1

The root collation data (FractionalUCA.txt) is currently specified
with 1-4 bytes variable length for primary weight, 1-2 bytes variable
length for secondary and tertiary. In practice it uses 1-3 bytes for
primary, 1-2 for secondary, and 1 for tertiary.

Secondary-ignorables do not occur at all in the root data.

Length-2 secondary weights only occur in primary ignoreables (0 length
primary weight).

So the above list is not just the important length patterns, but is
comprehensive for the root data. Locale-specific tailorings may
however cause more to be used, depending on the algorithm used to
apply them and the complexity of the rules.

This suggests a bit packing that could allow all of the above plus
extending primary length to 4 bytes, secondary and tertiary up to 3
bytes:

7 65 43 210
c tt ss ppp

ppp = length of primary, 0-4, with other values reserved as special
ss = length of secondary, 1-3, or common *
tt = length of tertiary, 1-3, or common *
c = continuation bit

where any reserved values could represent secondary or fully
ignoreables.


However, what it also suggests is maybe having a data-defined
dictionary of header byte values. which would avoid locking in any
assumptions about the FractionalUCA.txt implementation of root data
and instead allow any assignment (e.g. not even necessarily
variable-length/fractional) of weights as non-null byte sequences
compatible with strxfrm.

>From an immediate practical standpoint, this would facilitate eliding
not just one common secondary/tertiary byte value (05), but basically
all of the common secondary/tertiary weight bytes. So that, instead of
most entries in the table being one of (4-6 bytes):

- hh pp pp pp ss tt
- hh pp pp ss tt
- hh pp ss tt

most would be (2-4 bytes):

- hh pp pp pp
- hh pp pp
- hh pp

We could probably take this even further and let header byte represent
a lead primary byte too. This would drop the 87k ideographic collation
elements from 4 bytes each to 3 bytes each.

Going too far probably leads to an over-engineered tailored
compression, but a single dictionary lookup for common properties
seems reasonable.

In such a system, a single value should be reserved for representing
arbitrary lengths. Rather than any fancy encoding for them, I think
I'd have the 3 levels be null-terminated. Where rr is the reserved
header value, and pp ss and tt are bytes of the respective weight
levels:

rr pp pp ... pp 00 ss ... ss 00 tt .. tt 00

This of course is not maximally efficient, but it's roughly only 1
byte worse than storing 3 lengths with sufficient range, and would
only be intended for exceptional cases.


If I go with this, for FractionalUCA-based tooling, the tooling for
now would probably just hard-code a dictionary to emit with all of the
reasonable common weight elisions and combinations of lengths. But it
would give us the option to try different things and improve in the
future without breaking compatibility with existing static binaries.

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.