|
|
Message-ID: <20260523210218.GB27423@brightrain.aerifal.cx> Date: Sat, 23 May 2026 17:02:18 -0400 From: Rich Felker <dalias@...c.org> To: musl@...ts.openwall.com Subject: Re: Collation weight length frequencies On Sat, May 23, 2026 at 01:32:31AM -0400, Rich Felker wrote: > This makes the patterns to represent efficiently, in order of > importance (* = elided default weight): > > 3 * * > 3 1 1 > 2 * * > 2 1 1 > 1 * * > 1 1 1 > 0 0 0 > 0 2 1 > 0 1 1 The root collation data (FractionalUCA.txt) is currently specified with 1-4 bytes variable length for primary weight, 1-2 bytes variable length for secondary and tertiary. In practice it uses 1-3 bytes for primary, 1-2 for secondary, and 1 for tertiary. Secondary-ignorables do not occur at all in the root data. Length-2 secondary weights only occur in primary ignoreables (0 length primary weight). So the above list is not just the important length patterns, but is comprehensive for the root data. Locale-specific tailorings may however cause more to be used, depending on the algorithm used to apply them and the complexity of the rules. This suggests a bit packing that could allow all of the above plus extending primary length to 4 bytes, secondary and tertiary up to 3 bytes: 7 65 43 210 c tt ss ppp ppp = length of primary, 0-4, with other values reserved as special ss = length of secondary, 1-3, or common * tt = length of tertiary, 1-3, or common * c = continuation bit where any reserved values could represent secondary or fully ignoreables. However, what it also suggests is maybe having a data-defined dictionary of header byte values. which would avoid locking in any assumptions about the FractionalUCA.txt implementation of root data and instead allow any assignment (e.g. not even necessarily variable-length/fractional) of weights as non-null byte sequences compatible with strxfrm. >From an immediate practical standpoint, this would facilitate eliding not just one common secondary/tertiary byte value (05), but basically all of the common secondary/tertiary weight bytes. So that, instead of most entries in the table being one of (4-6 bytes): - hh pp pp pp ss tt - hh pp pp ss tt - hh pp ss tt most would be (2-4 bytes): - hh pp pp pp - hh pp pp - hh pp We could probably take this even further and let header byte represent a lead primary byte too. This would drop the 87k ideographic collation elements from 4 bytes each to 3 bytes each. Going too far probably leads to an over-engineered tailored compression, but a single dictionary lookup for common properties seems reasonable. In such a system, a single value should be reserved for representing arbitrary lengths. Rather than any fancy encoding for them, I think I'd have the 3 levels be null-terminated. Where rr is the reserved header value, and pp ss and tt are bytes of the respective weight levels: rr pp pp ... pp 00 ss ... ss 00 tt .. tt 00 This of course is not maximally efficient, but it's roughly only 1 byte worse than storing 3 lengths with sufficient range, and would only be intended for exceptional cases. If I go with this, for FractionalUCA-based tooling, the tooling for now would probably just hard-code a dictionary to emit with all of the reasonable common weight elisions and combinations of lengths. But it would give us the option to try different things and improve in the future without breaking compatibility with existing static binaries. Rich
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.