musl - Re: musl multi-level table format for binary locale images

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20260515013533.GQ1827@brightrain.aerifal.cx>
Date: Thu, 14 May 2026 21:35:33 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: musl multi-level table format for binary locale images

On Wed, May 13, 2026 at 12:43:29PM -0400, Rich Felker wrote:
> On Tue, May 12, 2026 at 07:09:32PM -0400, Rich Felker wrote:
> > On Sat, May 09, 2026 at 11:04:13PM -0400, Rich Felker wrote:
> > > On Fri, May 08, 2026 at 11:22:28PM -0400, Rich Felker wrote:
> > > > The concepts here have been presented before; what follows is an
> > > > informal spec of the actual mappable image format that has emerged
> > > > from earlier design proposals and discussion and from implementation
> > > > of draft tooling.
> > > > 
> > > > The multi-level tables used here are in some sense a data-driven
> > > > generalization of the multi-level tables used elsewhere in musl for
> > > > character data, with headers at each level defining the range covered
> > > > and bits examined at the level covered.
> > > > 
> > > > Some of the details here are not yet matched by the draft tooling, and
> > > > all may still be subject to further change until integration and
> > > > release.
> > > > 
> > > > This is part of the locale support overhaul project, funded by NLnet
> > > > and the NGI Zero Core Fund.
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > FILE FORMAT
> > > > [...]
> > > 
> > > Give me another day or so to clean this up and post it, but I've now
> > > implemented both the generation and lookup for the binary locale
> > > images. Generation does not make use of shift!=0 (multi-level)
> > > functionality, which is presently unneeded outside of collation data;
> > > this will be added later, and should be easy to do as a
> > > transformationon the in-memory representation prior to serialization.
> > > 
> > > Right now I have the code bolted haphazardly to the existing
> > > parselocale.c to build the image then query back the items in
> > > parselocale.c's table to check that they round-trip. I'll probably
> > > clean this up a little bit to be reasonable to commit into the
> > > development history, then go on to write a simple localedef(1) entry
> > > point.
> > > 
> > > This leaves details of collation and musl integration as the main
> > > remaining parts of the locale project.
> > 
> > An integration of the parser, binary table generation, and
> > localedef(1) frontend is up at:
> > 
> > https://codeberg.org/dalias/musl-locale-tools-draft
> > 
> > Current version at the time of this email:
> > 
> > https://codeberg.org/dalias/musl-locale-tools-draft/src/commit/19bf9dc524353232b03735f410490895248ee5b1
> > 
> > The code to perform lookups is not yet merged much less hooked up to
> > any test framework, but I'm attaching a draft to this email. It needs
> > to be pointed at the start of the actual table (after the 16-byte file
> > header).
> 
> Lookup code is now included in the above draft repo, with bugs fixed
> and adjustment to length/count field meaning applied, and hooked up to
> an extractlocale utility that can pull the text-based source locale
> format out of binary files.
> 
> Current revision as of this mail is:
> 
> https://codeberg.org/dalias/musl-locale-tools-draft/src/commit/653135e636c1e0a3d8a34278079c424afa7d6639
> 
> As the data structures are not self-describing but require the
> consuming process to be aware of the layout, the same table used by
> parselocale to populate the binary table is used to pull data back out
> of it.

Next step on this code is probably going to be getting it ready for
representing collation elements. This requires the ability to actually
generate a table broken down into multiple levels. There's a dumb way
to do that just making an interface to hard-code the desired shift
breakdowns, but I think I'd like to actually take advantage of the
flexibility to use different shifts and generate good ones
computationally.

Here's a rough sketch of the idea I have in mind:

When emitting a table whose entry count would be greater than some
threshold (this is necessary if it will be >64k, since that's not
representable in a single level, but probably should be done at a much
smaller threshold), start off with a candidate shift keeping the upper
16 bits of the range or half of the remaining bits, whichever is
smaller. But examine whether there are large gaps with no entries when
using this shift, and if so, increase the shift and reevalute,
stopping based on a cost function.

At this point, make a new table with the selected shift, add subtables
for each slot with entries, move those entries to their new subtable,
and replace the original table with the new one.


One thing this approach does not yet do is take advantage of
redundancy. In theory if you have 10 subtables that all say "every
index under here has an implicit ideographic collation weight" then
the same copy of that binary subtable could be used all 10 times.

However, I'm not sure we really should care about this kind of
deduplication. It's much easier to deduplicate within a leaf subtable.
So for example if you have a subtable covering 1k codepoints that all
have implicit ideographic weights or that are all complete ignorables,
that subtable itself can have 1-byte (scale=0) offsets for each slot,
and all of them can be 1, pointing at a single copy of the collation
weight encoding. There's still the cost of these offsets -- should we
perhaps have a scale=-1 that means "all keys have the same value and
no offsets at all are stored?

Doing this kind of deduplication shouldn't be difficult. Each subtable
just needs to keep a searchable list of the data items already stored
in it (tsearch can do this too :) so adding duplicates can mark them
as references to the already-present copy. This functionality would
actually be testable now, with all of the duplicated "No error
information" strings we have in the dumplocale output.

If I switched back to flattening the tables destructively in-memory,
rather than serializing them directly to an output buffer, we could
actually get deduplication of whole subtables this way too. But I
suspect it has limited usefulness and without more complexity would
only deduplicate direct sibling subtables, not all copies of the same
subtable. So I'll probably leave consideration of whether to try to
optimize the tables futher like this as a consideration for after the
rest of the locale project is done.

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.