Message-ID: <20260429200826.GA18431@brightrain.aerifal.cx>
Date: Wed, 29 Apr 2026 16:08:27 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Start of localedef tool for locales project

Based on previous proposal & discussion of the new locale source
format (subset of POSIX localedef, to be documented, with extensions
for error strings) and the concepts for the proposed binary runtime
format, I've put together a simple parser that reads localedef-format
input and emits what amounts to a sequence of insertions into the
multi-level binary table format. It's very minimal, mostly
table-driven. At the moment it's producing output as text as a
test/proof-of-concept, but it's ready to be switched over to any
variant of the draft binary format.
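As an illustration only (this is not taken from the attached parselocale.c), a table-driven pass of this kind could look roughly like the sketch below. The rule table, the numeric key paths, and the "category.item = value" text output are all hypothetical; the real key assignments and output format are whatever the attached source defines.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical rule: a localedef keyword mapped to an integer key path.
 * The numbers are illustrative, not musl's actual assignments. */
struct rule {
    const char *keyword;
    int category;
    int item;
};

static const struct rule rules[] = {
    { "decimal_point", 1, 0 },
    { "thousands_sep", 1, 1 },
    { "grouping",      1, 2 },  /* e.g. variant for CHAR_MAX == 127 */
    { "grouping",      1, 3 },  /* e.g. variant for CHAR_MAX == 255 */
};

/* Apply every rule whose keyword equals the first token of `line` and
 * write one text "insertion" per match to `out`.
 * Returns the number of rules that matched. */
static int emit_line(const char *line, FILE *out)
{
    size_t klen = strcspn(line, " \t");   /* length of the keyword token */
    const char *val = line + klen;
    int matches = 0;

    val += strspn(val, " \t");            /* skip blanks before the value */
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        if (strlen(rules[i].keyword) != klen ||
            strncmp(line, rules[i].keyword, klen) != 0)
            continue;
        fprintf(out, "%d.%d = %s\n", rules[i].category, rules[i].item, val);
        matches++;                        /* no break: keep scanning */
    }
    return matches;
}
```

The loop deliberately has no break after a match, so a keyword listed twice in the table produces two insertions from one input line.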

This is part of the locale support overhaul project, funded by NLnet
and the NGI Zero Core Fund.

A small todo list of things I'd like to change:
- Right now, I'm defining all of the musl-ABI (mostly matching
definitions from Linux kernel and glibc) item keys inline in the
source in order to avoid a dependency on the host matching. This
depends on not including the corresponding host headers that would
clash. These macros should probably be moved to a prefixed namespace
where they don't clash, and a header that can also be used in musl
itself where needed (like for remapping mips and ppc error codes).
- Some of the integer key path assignments depend on sparse
multi-level table support not to be gigantic. Particularly,
LC_MESSAGES stuff. I think these should be rearranged so that we
don't actually have to generate sparse multi-level stuff
gratuitously. It's more code complexity and has a slightly higher
runtime cost. Outside of collation tables where it's necessary, I'd
like to reserve the sparseness support as a last resort for
expansion, not a necessary feature to use from day one.
- Error handling is minimal. For a polished tool for translators to
use, it needs to report and error out on malformed input, and
possibly warn about missing data too.
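To make the cost mentioned in the second item concrete, here is a generic two-level sparse lookup. It is only a sketch of the general technique, not the draft binary format: the struct layout, the block size, and the convention that 0 means "absent" are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_BITS 4
#define BLOCK_SIZE (1u << BLOCK_BITS)

/* Hypothetical in-memory layout: a top-level index selects a block of
 * BLOCK_SIZE values; index entry 0 means "no block for this range". */
struct sparse_table {
    const uint16_t *top;                   /* 1-based block numbers */
    size_t ntop;                           /* number of entries in top[] */
    const uint32_t (*blocks)[BLOCK_SIZE];  /* second-level blocks */
};

/* Look up `key`; returns 0 when the key falls in an absent range. */
static uint32_t sparse_get(const struct sparse_table *t, uint32_t key)
{
    size_t hi = key >> BLOCK_BITS;

    if (hi >= t->ntop || t->top[hi] == 0)
        return 0;
    return t->blocks[t->top[hi] - 1][key & (BLOCK_SIZE - 1)];
}
```

Compared with a dense array, every lookup pays for a bounds check, an extra load, and a branch, which is the kind of overhead that rearranging the key assignments would avoid outside of collation.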

Some features to note:
- The parser does not stop on the first match for a line, but applies
all matching rules. This is actually used for the "grouping" and
"mon_grouping" localeconv items, where CHAR_MAX has special meaning.
Since CHAR_MAX can be 127 or 255 depending on arch, an arch-agnostic
locale file needs two versions of the strings, to be used
conditionally based on the arch's value of CHAR_MAX. In theory it
could also be used for RADIXCHAR and THOUSEP, which have both
nl_langinfo and localeconv versions of the same field, but I think
it's safer not to allow the binary format to represent them
inconsistently and instead to derive one from the other at runtime.
- The same is not done for the inline localeconv bytes, only for the
strings pointed to by the first 10 members of struct localeconv.
Unlike the strings, the inline bytes are fixed size and necessarily
in contiguous memory with pointer members which have to be relocated
to the process's address space, so we can (and must) memcpy them
from the mmapped locale into locale-local runtime memory in libc.
This means we can just replace '\xff' with '\x7f' during the
copying. This conditionally takes place in exactly the same code
that will select which versions of "grouping" and "mon_grouping" to
point to.
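A minimal sketch of the two CHAR_MAX-dependent steps described above, assuming the locale file stores the "not available" marker as '\xff' in the inline bytes and carries two variants of each grouping string. The function names and the explicit char_max parameter are hypothetical, chosen so that both cases can be exercised on any host; they are not musl's actual code.

```c
#include <stddef.h>
#include <string.h>

/* Copy the fixed-size inline localeconv bytes out of the mapped locale
 * image. When the target's CHAR_MAX is 127, rewrite the stored 0xff
 * marker to 0x7f so it compares equal to CHAR_MAX. */
static void copy_lconv_bytes(unsigned char *dst, const unsigned char *src,
                             size_t n, int char_max)
{
    memcpy(dst, src, n);
    if (char_max == 127) {
        for (size_t i = 0; i < n; i++)
            if (dst[i] == 0xff)
                dst[i] = 0x7f;
    }
}

/* Choose which stored variant of "grouping" or "mon_grouping" the
 * runtime struct lconv should point to. */
static const char *pick_grouping(const char *for_127, const char *for_255,
                                 int char_max)
{
    return char_max == 127 ? for_127 : for_255;
}
```

In real use both functions would be called with CHAR_MAX from <limits.h>, so the choice is fixed per arch and the byte fixup and the string selection share one condition.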

At present, neither monetary nor non-monetary grouping is supported,
so none of the above actually matters. Adding missing strfmon
functionality is possibly in-scope for this project, time permitting,
but non-monetary grouping (' modifier to printf) has traditionally
been something we intentionally did not support, and I don't know if
that will change. In any case, though, I wanted to make sure the data
format can represent the full range of possible groupings for both, so
that we're not locked out of future support (future-proofing being a
primary goal of this project).

Attached are the current source and output for the out-edit.txt file
(previous output of dumplocale.c).

Rich

View attachment "parselocale.c" of type "text/plain" (14303 bytes)
View attachment "compiled.txt" of type "text/plain" (7783 bytes)
View attachment "out-edit.txt" of type "text/plain" (6396 bytes)