musl - Re: Collation weight length frequencies

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20260523053229.GA27423@brightrain.aerifal.cx>
Date: Sat, 23 May 2026 01:32:31 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Collation weight length frequencies

On Wed, Apr 08, 2026 at 01:55:07PM -0400, Rich Felker wrote:
> This is a followup to working out the details of the runtime format
> for collation data designed around the Unicode CLDR FractionalUCA
> (byte-based rather than legacy 16-bit-unit-based) format. This is part
> of the locale support overhaul project, funded by NLnet and the NGI
> Zero Core Fund.
> 
> 
> 
> Attached is an approximate table of frequencies of weight length
> combinations (at levels 1, 2, and 3) for collation mappings from the
> default order in FractionalUCA.txt. It's missing a small number of
> rules defined in terms of other rules which my parser isn't
> processing, and I think 2 prefix-context rules, but otherwise it
> should be complete.
> 
> What it's showing is that the vast majority of rules are 1 byte in the
> nonprimary levels and 3, 2, or 1 byte at the primary level. A
> plurality of those that remain are the full ignoreables (0 bytes at
> all levels). So, from a standpoint of coding efficiency for the
> mmappable runtime locale file, we only really need a compact
> (single-byte) way to represent (1,1,1), (2,1,1), (3,1,1), and (0,0,0).
> Anything else could be represented with explicit lengths with low
> cost relative to the overall data size.
> 
> Rules that are more than one byte in nonprimary levels all arise from
> mappings of one character or a sequence of characters to more than one
> collation element. Patterns like (4,2,2) and (6,2,2) are actually
> pairs of (2,1,1) and (3,1,1) collation elements, We don't need this
> distinction for strcoll/strxfrm usage, but it may be useful to somehow
> delimit them or flag them as multi-collation-element entries if we
> will at some point have regex bracket support for collation elements.
> 
> 
> 
> So, where does this leave us?
> 
> One obvious choice we could make is just reserving 4 special header
> byte values for (0,0,0), (1,1,1), (2,1,1), and (3,1,1), and explicitly
> encoding anything else. This avoids the need for any significant
> "unpacking" and keeps plenty of encoding space available, but it does
> assume the distribution will remain similar after tailoring.
> 
> Another obvious choice is using 2 bits for the length at each level,
> or 3 at the primary level and 2 at each of the two remaining levels.
> With 3 bits for primary, there would only be around 150 rules using
> any longer explicit encoding, and it would be unlikely that tailorings
> add significantly more.
> 
> This second options actually leaves a fair bit of encoding space free,
> too. Aside from the space with the last 1-2 bits set, the space with
> nonzero length at primary level but 0 length at any of the remaining
> levels is not valid for weights. I doubt we need this, though.
> 
> Here is a draft of a possible encoding:
> 
> 0 | ppp | ss | tt   - 3 bit primary, 2 bit secondary, 2 bit tertiary
> 1 | 1..127          - primary length 1..127, then 1 byte each for s/t
> 1 | 0               - vlc for each of p/s/t
> 
> The third case would never be used in practice but is just there for
> completeness/support of gratuitously bad weight byte assignments. It
> probably shouldn't even be implemented, just left as theoretical
> encoding space.
> 
> This doesn't account for storing any knowledge of whether the mapping
> is a multi-collation-element one or any sort of delimiters for
> multiple CEs. If we want to be able to store that, we could require
> that the form with the high bit set (explicit lengths) be used
> whenever it's multi-CE, drop the range to 1..63, and use the
> additional recovered bit to flag that it's multi-CE (and optionally
> store delimiter info in this case).
> 
> 
> With any of these encodings, the size of the default collation table,
> with implicit codepoint-order rules for ideographs, is looking to be
> about 200k. I'd expect a little over double that with modern
> stroke-radical order for ideographs.
> 
> 
> One thing all of this would be throwing away is that there is very low
> information content in the secondary and tertiary levels -- not just
> the lengths but the actual weights. It might make more sense to
> consider an encoding like the first option above that I didn't
> elaborate on as much, but where the header byte encodes not just the
> lengths but an index into a table of common values for
> secondary/tertiary level. This would shave 50k off the default table
> (more with stroke-radical order). Another way to accomplish this might
> be using the (otherwise invalid) length-0 for secondary/tertiary to
> compress common values. However, it might just be the case that the
> relative size here is sufficiently low that we don't want to bother
> with complexity.
> 
> 
> Rich

>    1860  0 0 0
>       1  0 0 2
>     393  0 1 0
>     273  0 1 1
>       5  0 2 0
>     378  0 2 1
>       1  0 2 2
>       1  0 4 0
>       9  0 4 2
>    2125  1 1 1
>      28  1 2 1
>     408  1 2 2
>       2  1 3 2
>     116  1 3 3
>    5866  2 1 1
>     466  2 2 2
>     104  2 3 2
>     109  2 3 3
>      36  2 4 4
>       2  2 5 3
>   26014  3 1 1
>      70  3 2 2
>       6  3 3 2
>      61  3 3 3
>     878  4 2 2
>      45  4 3 2
>      33  4 3 3
>       2  4 4 3
>      10  4 4 4
>      50  5 2 2
>      62  5 3 3
>       3  5 4 4
>     370  6 2 2
>     179  6 3 3
>      25  6 4 4
>       3  6 5 5
>       1  6 6 6
>       1  7 6 6
>       1  7 7 7
>      38  8 4 4
>       8  8 5 5
>       1  8 6 6
>       1  8 8 8
>       5  10 5 5
>      10  10 6 6
>       2  12 6 6
>       1  14 7 7
>       1  15 8 8
>       1  33 18 18

Some further findings:

Only 6900 out of something like 38k explicit collation mappings plus
87k ideographic characters in the default CLDR data, have non-default
secondary or tertiary weights.

So what's really important is not just making it efficient to
represent the above frequently-occurring lengths, but to avoid storing
secondary and tertiary weights at all when they're the default ones.

Some of the numbers above are wrong, like the ones with tertiary
length 0, as a result of the counting program misparsing lines that
reference other characters. The number of such lines are few enough
they don't affect overall conclusions though.

I made the above table to look at overall lengths after concatenating
multiple collation elements, but the mappings to multiple collation
elements are not common enough to really matter to the encoding
efficiency/density.

The only length patterns that really appear in significant numbers for
single collation elements are:

0 0 0
0 1 1
0 2 1
1 1 1
2 1 1
3 1 1

Out of the 26k "3 1 1" instances counted above, about 19k are of the
form "3 * *" where * is a default secondary/tertiary weight. There are
also 87k ideographic characters with the "3 * *" default weight
pattern.

This makes the patterns to represent efficiently, in order of
importance (* = elided default weight):

3 * *
3 1 1
2 * *
2 1 1
1 * *
1 1 1
0 0 0
0 2 1
0 1 1

The "0 0 0" pattern is not actually all that important, since multiple
mappings can all collapse to a reference to the same definition (since
there are no data bits that can vary), but I think it's obviously
nicest to just make it efficient anyway.

If we stick with the header byte representing the length of all the
concatenated collation elements for the mapping, we probably want
something like:

3 bits for primary length
2 bits for secondary length, with a special value for elided default
2 bits for tertiary length, with a special value for elided default
1 bit to represent larger values with multiple explicit fields

This is the same as the quoted draft above except for the elided
default cases.

However, since this encoding does not really cover the needs of
mappings that expand to multiple CEs except by using a longer 3+ byte
form (see quoted text above), I wonder if it makes more sense not to
concatenate multiple CEs, but instead let each have its own header
byte.

If we do that, any reasonably-assigned CE only needs a 1-byte header,
taking values pretty much entirely from the set of 9 possibilities
above. This gives a smaller encoding for 2-CE mappings, especially
since it lets us encode the elision of default secondary/tertiary
weights in multi-CE mappings, and avoids trying to cram too much in 8
bits based on expectations of likely patterns.

As an example, the mapping of U+2033 with the concatenated encoding
would be at least 1 header byte followed by:

    0A BA 0A BA 05 05 20 20

With per-CE headers (h1 and h2 each 1 byte):

    h1 0A BA 20 h2 0A BA 20

(the 05 is the elided common default weight).

For another, U+320E (long enough to need 3 header bytes):

    hh hh hh 09 8E 79 06 79 67 09 90 05 05 05 05 17 10 10 17

vs

    h1 09 8E 17 h2 79 06 10 h3 79 67 10 h4 09 90 17

It looks like the only time the concatenated form wins is for very
very long sequences of CEs, like U+FDFA expanding to 18 CEs, or
U+1F1A9 expanding to 8. And these are so few that the size doesn't
matter.

If each CE does have its own header byte, we then need a way to know
how many there are. Since we need so few bits to encode lengths this
way, having a continuation bit seems like the obvious choice.

I'll try to work out a specific proposed header layout like this in
the next day or so and see how it works.

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.