Message-ID: <20260429200826.GA18431@brightrain.aerifal.cx>
Date: Wed, 29 Apr 2026 16:08:27 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Start of localedef tool for locales project

Based on the previous proposal and discussion of the new locale source
format (a subset of POSIX localedef, to be documented, with extensions
for error strings) and the concepts for the proposed binary runtime
format, I've put together a simple parser that reads localedef-format
input and emits what amounts to a sequence of insertions into the
multi-level binary table format. It's very minimal and mostly
table-driven. At the moment it produces its output as text, as a
test/proof-of-concept, but it's ready to be switched over to any
variant of the draft binary format.
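
To illustrate the general shape of such a parser, here is a minimal
sketch. It is not the attached parselocale.c: the rule table, the
category/item numbers, and the "insert" text format are all
hypothetical, chosen only to show a keyword table driving the emission
of insertion records, with every matching rule applied rather than
just the first.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical rule: a localedef keyword mapped to an integer key path.
 * The numbers are illustrative and are not musl's real key values. */
struct rule {
	const char *keyword;
	int category;
	int item;
};

static const struct rule rules[] = {
	{ "decimal_point", 1, 0 },
	{ "thousands_sep", 1, 1 },
	{ "grouping",      1, 2 },  /* assumed slot for CHAR_MAX == 127 */
	{ "grouping",      1, 3 },  /* assumed slot for CHAR_MAX == 255 */
};

/* Apply every rule whose keyword equals the first token of line.
 * Writes one text "insert" record per match and returns the number
 * of rules that matched. */
static int apply_rules(const char *line, FILE *out)
{
	size_t kwlen = strcspn(line, " \t");
	const char *val = line + kwlen;
	int n = 0;

	val += strspn(val, " \t");  /* skip blanks before the operand */
	for (size_t i = 0; i < sizeof rules / sizeof *rules; i++) {
		if (strlen(rules[i].keyword) != kwlen) continue;
		if (strncmp(line, rules[i].keyword, kwlen)) continue;
		fprintf(out, "insert %d.%d = %s\n",
			rules[i].category, rules[i].item, val);
		n++;
	}
	return n;
}
```

Calling apply_rules("grouping 3;3", stdout) in this sketch emits two
records, one per "grouping" rule, while an unknown keyword emits none.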

This is part of the locale support overhaul project, funded by NLnet
and the NGI Zero Core Fund.

A small todo list of things I'd like to change:

- Right now, I'm defining all of the musl-ABI (mostly matching
  definitions from Linux kernel and glibc) item keys inline in the
  source in order to avoid a dependency on the host matching. This
  depends on not including the corresponding host headers that would
  clash. These macros should probably be moved to a prefixed namespace
  where they don't clash, and into a header that can also be used in
  musl itself where needed (like for remapping mips and ppc error codes).

- Some of the integer key path assignments rely on sparse
  multi-level table support to keep the tables from being gigantic,
  particularly the LC_MESSAGES ones. I think these should be
  rearranged so that we don't actually have to generate sparse
  multi-level stuff
  gratuitously. It's more code complexity and has a slightly higher
  runtime cost. Outside of collation tables where it's necessary, I'd
  like to reserve the sparseness support as a last resort for
  expansion, not a necessary feature to use from day one.

- Error handling is minimal. For a polished tool for translators to
  use, it needs to report and error out on malformed input, and
  possibly warn about missing data too.
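
For readers unfamiliar with the sparseness concern in the second item,
the following is a generic sketch of a two-level sparse lookup. It is
not the draft binary format: the struct names, the 16-entry leaf size,
and the use of in-memory pointers are assumptions made purely for
illustration. It shows where the extra complexity and runtime cost come
from: each lookup takes an extra indirection and a presence check.

```c
#include <stddef.h>

#define L2_BITS 4
#define L2_SIZE (1u << L2_BITS)

/* Hypothetical leaf holding 16 consecutive keys. */
struct leaf {
	const char *val[L2_SIZE];
};

/* Hypothetical top level: one pointer per 16-key range, with a null
 * pointer for ranges that have no data, so they take no leaf space. */
struct sparse2 {
	const struct leaf *const *top;
	size_t ntop;
};

/* Look up key; returns NULL if the key or its whole range is absent. */
static const char *sparse2_get(const struct sparse2 *t, unsigned key)
{
	size_t hi = key >> L2_BITS;

	if (hi >= t->ntop || !t->top[hi]) return NULL;
	return t->top[hi]->val[key & (L2_SIZE - 1)];
}
```

If the keys were assigned densely instead, a single flat array indexed
by the key would do, which is the simplification the item above argues
for outside of collation tables.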

Some features to note:

- The parser does not stop on the first match for a line, but applies
  all matching rules. This is actually used for the "grouping" and
  "mon_grouping" localeconv items, where CHAR_MAX has special meaning.
  Since CHAR_MAX can be 127 or 255 depending on arch, an arch-agnostic
  locale file needs two versions of the strings, to be used
  conditionally based on the arch's value of CHAR_MAX. In theory it
  could also be used for RADIXCHAR and THOUSEP, which have both
  nl_langinfo and localeconv versions of the same field, but I think
  it's safer not to allow the binary format to represent them
  inconsistently and instead to derive one from the other at runtime.

- The same is not done for the inline localeconv bytes, only for the
  strings pointed to by the first 10 members of struct localeconv.
  Unlike the strings, the inline bytes are fixed size and necessarily
  in contiguous memory with pointer members which have to be relocated
  to the process's address space, so we can (and must) memcpy them
  from the mmapped locale into locale-local runtime memory in libc.
  This means we can just replace '\xff' with '\x7f' during the
  copying. This conditionally takes place in exactly the same code
  that will select which versions of "grouping" and "mon_grouping" to
  point to.
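
The copy-time fixup described above could look roughly like the
following sketch. The function names and the idea of passing the
arch's CHAR_MAX as a parameter (so that both cases can be exercised on
one host) are assumptions for illustration; real libc code would
presumably use CHAR_MAX from <limits.h> directly.

```c
#include <stddef.h>

/* Copy n inline localeconv bytes from the mapped locale file into
 * runtime memory. Assumes the file stores "no value" as 0xff; when
 * the target's CHAR_MAX is 127, those bytes are rewritten to 0x7f. */
static void copy_lconv_bytes(unsigned char *dst, const unsigned char *src,
	size_t n, int char_max)
{
	for (size_t i = 0; i < n; i++) {
		unsigned char c = src[i];
		if (char_max == 127 && c == 0xff) c = 0x7f;
		dst[i] = c;
	}
}

/* Choose which stored variant of a grouping string to point to,
 * using the same condition as the byte fixup above. */
static const char *select_grouping(const char *g127, const char *g255,
	int char_max)
{
	return char_max == 127 ? g127 : g255;
}
```

Keeping both decisions keyed on the same value in one place is what
lets the locale file itself stay arch-agnostic.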

At present, neither monetary nor non-monetary grouping is supported,
so none of the above actually matters. Adding missing strfmon
functionality is possibly in-scope for this project, time permitting,
but non-monetary grouping (' modifier to printf) has traditionally
been something we intentionally did not support, and I don't know if
that will change. In any case, though, I wanted to make sure the data
format can represent the full range of possible groupings for both, so
that we're not locked out of future support (future-proofing being a
primary goal of this project).

Attached are the current source and output for the out-edit.txt file
(previous output of dumplocale.c).

Rich

View attachment "parselocale.c" of type "text/plain" (14303 bytes)

View attachment "compiled.txt" of type "text/plain" (7783 bytes)

View attachment "out-edit.txt" of type "text/plain" (6396 bytes)
