Message-ID: <20250618192847.GA1827@brightrain.aerifal.cx>
Date: Wed, 18 Jun 2025 15:28:47 -0400
From: Rich Felker <dalias@...c.org>
To: Pablo Correa Gomez <pabloyoyoista@...tmarketos.org>
Cc: musl@...ts.openwall.com
Subject: Re: Planned locale work and community thoughts

On Mon, Jun 02, 2025 at 07:37:51PM +0200, Pablo Correa Gomez wrote:
> Hi everybody,
> 
> I am Pablo Correa Gomez, a member of postmarketOS Core Contributors,
> working on the collation and locale overhaul project
> (https://www.openwall.com/lists/musl/2025/05/05/5) together with Rich.
> 
> We now have more details on the planned locale work that was earlier
> announced. The current musl locales experience is sub-par compared to
> other platforms, and we plan to use this project to fix that. 
> 
> The main and biggest issue that we aim to solve is the representation
> format of the locale strings. The initial implementation used English
> strings as keys to look up translations. This had a major issue
> where "May" would represent both the abbreviated and non-abbreviated
> forms of the month, making it untranslatable in languages where May has
> more than 3 letters. However, there are other issues that we are
> also aiming to solve in this project:

The main decision to be made here is how we key items that need
localization: whether by fixing the string-based keying (e.g. using
the macro names like "ABMON_5" as the keys) with the gettext-type
lookup we have now, or switching to assigned integer indices as the
keying for a more catgets-like system (likely using the values from
the macros in langinfo.h as the indices), or something else.
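
To make the ambiguity concrete, here is a minimal sketch (not musl
code) that checks whether two distinct langinfo items collapse to the
same key under English-string keying. The helper name
same_english_key is made up for illustration; nl_item, MON_5 and
ABMON_5 are the standard langinfo.h names.

```c
#define _POSIX_C_SOURCE 200809L  /* expose POSIX declarations */
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: returns 1 if two different langinfo items would
 * collide under English-string keying, i.e. their C-locale strings are
 * byte-for-byte equal. Note that it switches the process to the C locale. */
static int same_english_key(nl_item a, nl_item b)
{
    char first[64];

    assert(a != b);              /* only meaningful for distinct items */
    setlocale(LC_ALL, "C");      /* the English strings come from the C locale */

    /* nl_langinfo may reuse its buffer, so copy the first result. */
    strncpy(first, nl_langinfo(a), sizeof first - 1);
    first[sizeof first - 1] = 0;

    return strcmp(first, nl_langinfo(b)) == 0;
}
```

With this, same_english_key(MON_5, ABMON_5) reports a collision while
same_english_key(MON_1, ABMON_1) does not. Keying by macro name or by
the nl_item value keeps all of these items distinct.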

> * Implement RADIXCHAR so that "." is not the only possible separator.
> THOUSEP will in principle not be implemented, because it breaks quite
> a few assumptions and is less critical for users.

To give some background on this: from the start I was largely opposed
to having the radix char be localizable at all, as this has been a
source of perpetual problems for parsing and generating text-based
data formats intended for interchange, and I didn't really think there
was any modern demand for it.

However, in past discussions of the topic, it's come up that some
people do want it, and I don't want us to be the bad guys who are
being stubborn dismissing someone else's cultural expectations, so the
tentative plan has been to offer this with a 1-bit degree of freedom,
with '.' and ',' as the only choices.

I've been made aware that, at least historically prior to use in
computer systems, there have been other notations for the radix point,
but it's not clear if there's any modern expectation to be able to use
them. What I think would be a useful next step is to grep the Unicode
CLDR for whether there are non-'.' non-',' radix chars in any locale
definitions. If there are none, I think that already settles it. If
there are any, we should attempt to figure out whether there are
real-world systems that support them and precedent for users to expect
they work.

Note that supporting basically anything plausible other than '.' and
',' as radix characters has major technical issues that may introduce
vulnerabilities into programs not expecting it, so in the absence of
both strong evidence of necessity and research into what would break
and whether unsafe breakage is unlikely, I want to just say no to this.

It may however make sense for the on-disk data format to allow for the
possibility, and for musl to just treat anything but "," as if it were
"." for the foreseeable future.
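
A minimal sketch of that policy, assuming the locale file hands back
the stored radix as a NUL-terminated string. The function name
effective_radix is hypothetical and is not an existing musl interface.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical policy helper: the on-disk format may carry any radix
 * string, but only "," is honored; every other value is treated as ".". */
static const char *effective_radix(const char *stored)
{
    assert(stored != NULL);
    if (strcmp(stored, ",") == 0)
        return ",";
    return ".";
}
```

The result is always one of two single-byte strings, so callers
formatting or parsing numbers never have to deal with a multibyte or
otherwise unexpected radix character.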

> * Implement LC_MONETARY so that we can get properly localized currency
> representation.

This is fairly straightforward, but does need a reasonable data format
that translates well into "struct lconv" form (what localeconv()
returns). The lconv fields that are strings need to be directly usable
from the memory-mapped locale file, so that we don't need to allocate
variable-sized storage for them. One complication of this is that
"grouping" and "mon_grouping" are *arch-specific* because the encoding
uses CHAR_MAX with a special meaning, and CHAR_MAX could be 127 or
255. This means both versions of the string (one for signed-plain-char
archs and one for unsigned-plain-char archs) need to be stored in the
on-disk format.
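
As a rough sketch of what storing both versions could look like: the
struct and function names below are invented for illustration and say
nothing about the eventual file layout. The only standard facts relied
on are that a grouping byte equal to CHAR_MAX means "no further
grouping" and that CHAR_MAX is 127 or 255 depending on whether plain
char is signed.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical record holding both encodings of one grouping rule. */
struct grouping_pair {
    const char *for_signed_char;    /* sentinel byte stored as 127 */
    const char *for_unsigned_char;  /* sentinel byte stored as 255 */
};

/* Pick the variant whose sentinel byte equals this arch's CHAR_MAX. */
static const char *select_grouping(const struct grouping_pair *g)
{
    assert(g != NULL);
#if CHAR_MAX == 127
    return g->for_signed_char;
#else
    return g->for_unsigned_char;
#endif
}

/* Example rule: one group of three digits, then no further grouping. */
static const struct grouping_pair example_grouping = {
    "\3\x7f",
    "\3\xff",
};
```

Because the choice is made at compile time, the returned pointer can
refer straight into the mapped file and be placed in struct lconv
without copying or rewriting any bytes.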

> * Make sure that every function that accepts a locale actually uses it
> for the translation.
> 
> To be able to prepare for the technical work, there are some things for
> which we would like community input:
> 
> 1. We need to figure out an alternative representation for the
> translatable strings derived from[1] to avoid the "May issue". A simple
> solution would be to use those constants (or an abbreviation of them)
> as keys for the lookup. Hopefully that would be both unambiguous and
> self-explanatory and as a bonus, it's already documented. Does somebody
> have other/better ideas?

See above for the options I'm aware of.

> 2. Regardless of the representation we choose, we need to decide on a
> workflow for translators. Currently, people can just copy the .pot
> file[2] with a hard-coded representation that might include other
> things to translate. That seems good enough if we choose the
> representation directly from [1], but might not be possible if we
> decide on something different.

Long-term, the workflow should probably be deriving the data from the
Unicode CLDR with the possibility of overrides, with tooling to do that.
I'm not sure if we want to prepare such tooling now.

At least for collation, I think we need some level of tooling now in
order to be able to test/evaluate it. I'm presently trying to find the
relevant tooling other systems use. ICU has something that converts
the base weights table to a possibly-reasonable binary form but I
haven't located the tooling to apply locale-specific modifications
from CLDR data to the table.

> 3. Right now, other translatable strings coming from different sources
> (another email with a detailed analysis will follow up) are also part
> of the musl locales project. Those are also just translated directly as
> strings. However, some also appear in different contexts, like "out of
> memory" for regex and for getting network address info. Should these be
> split, receive a different representation, and thus provide additional
> context information to translators? I personally believe that most
> high-level applications should hide these messages coming directly from
> libc, and thus they should only be rarely available to users, like in
> CLI applications, where users are generally expected to have a basic
> knowledge of English. I would be fine with leaving these strings
> represented just by their own English string names, even if that means
> a bit of context is lost in some languages.

I think it's expected that they're translated (don't set LC_MESSAGES
if you don't want that) but again the mechanism is open to change
while we're making a major overhaul here.

Do we want the messages keyed by the English strings, as now, or do we
want them keyed by identity of the error, whether that's the names of
the errno E*/REG_*/EAI_*/etc. macros or some assigned integer codes as
in the option for LC_TIME stuff above?
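
For illustration only, keying by error identity could look roughly
like the sketch below. The domain enum, the table layout and the
placeholder texts are all assumptions; REG_ESPACE and EAI_MEMORY are
the standard codes behind the two "out of memory" messages mentioned
in the quoted text.

```c
#define _POSIX_C_SOURCE 200809L  /* expose POSIX declarations */
#include <assert.h>
#include <netdb.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical message domains, one per family of error codes. */
enum msg_domain { DOM_REGEX, DOM_EAI };

struct msg_entry {
    enum msg_domain domain;
    int code;
    const char *text;
};

/* Placeholder table: the two errors share one English string today,
 * but as separate keys each can carry its own translation. */
static const struct msg_entry example_msgs[] = {
    { DOM_REGEX, REG_ESPACE, "<translation for REG_ESPACE>" },
    { DOM_EAI,   EAI_MEMORY, "<translation for EAI_MEMORY>" },
};

/* Linear lookup by (domain, code); returns NULL when there is no entry,
 * in which case a caller would fall back to the built-in English text. */
static const char *lookup_msg(const struct msg_entry *tab, size_t n,
                              enum msg_domain d, int code)
{
    assert(tab != NULL);
    for (size_t i = 0; i < n; i++)
        if (tab[i].domain == d && tab[i].code == code)
            return tab[i].text;
    return NULL;
}
```

The tradeoff against English-string keys is that translators see an
identifier rather than the sentence itself, so the tooling would have
to show the English text alongside each key.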

> 4. Choose a default locale placement, so that we can get translations
> without needing to parse an envvar in [3]. In Alpine/pmOS the location
> is currently /usr/share/i18n/locales/musl/. I do not think that's a
> great place, but the FHS does not seem to provide an obvious place for
> it to live, since AFAIU locales for the libc should not be mixed with
> LC_MESSAGES from other applications. Are there other suggestions?

My main concern, especially if we want them to be usable by suid
binaries, is that they should be in a location we can rely on to
belong to root. While I don't think they should be *stored* in /etc,
reaching them via a path component (intended to be a symlink) in /etc
is probably the best way to both ensure that and allow the actual
files to be placed wherever distro policy wants them to be placed.
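
A sketch of what the lookup side might look like under that scheme.
The path component below is purely hypothetical (no name has been
chosen), and the name checks are an assumption about what a
suid-safe lookup would want rather than a description of current
musl behavior.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* HYPOTHETICAL path component in /etc, intended to be a symlink to
 * wherever distro policy stores the locale files. */
#define LOCALE_LINK "/etc/musl-locale-example"

/* Build "<LOCALE_LINK>/<name>" into out. Returns 0 on success, or -1
 * if the name is unsafe or the result does not fit in outsz bytes. */
static int locale_path(char *out, size_t outsz, const char *name)
{
    int n;

    assert(out != NULL && name != NULL);

    /* Reject empty names, names starting with '.', and anything
     * containing '/', so the result stays inside the fixed prefix. */
    if (!name[0] || name[0] == '.' || strchr(name, '/'))
        return -1;

    n = snprintf(out, outsz, "%s/%s", LOCALE_LINK, name);
    return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}
```

Since the prefix is a compile-time constant under /etc, no environment
variable can redirect the lookup, and only root can change where the
symlink points.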

> 5. So far, although the musl-locales project exists, it has been kept
> apart from the main musl project, and not really sanctioned as
> "official". It would be great if we could have discussions related to
> the musl-locales project directly on this mailing list, and if there
> could be a synchronized copy of it in https://git.musl-libc.org/cgit
> next to the musl repository. Is anybody against this?

I think having the discussion on the main mailing list should be fine.

> 6. Given that good localization is critical at postmarketOS,
> I would be very happy if we could fork the current project, host it in
> our gitlab, and use it as the place to synchronize with
> https://git.musl-libc.org/cgit. If somebody has other ideas, or
> moving it is considered disruptive, then it would be great if somebody
> from our team could also get access, so we can increase the maintenance
> it has seen lately.

I don't have a strong opinion on this yet, but I do agree that we
should have it sync'd to the main musl git server, regardless of where
actual devel takes place, so that it presents as "official".

> 7. If a locale is missing in musl, setlocale currently "fakes" that
> support exists by copying the C data to the said locale. This has the
> benefit that apps which are translated in a locale missing in musl
> still show up as translated for the application-related messages. The
> problem with this is that the UX is then inconsistent, since users get
> things mixed and matched in different languages. This is also generally
> a step against the musl philosophy of being strictly correct. A previous
> discussion[4] had a pretty good proposal[5] that I fully support. As I
> said in the thread, as long as we have some time to adapt, the behavior
> change should be acceptable.

I need to review this but I recall the proposal being acceptable.

> 8. Finally, if you want to be involved in testing in a language for
> which we don't yet have a volunteer signed up in [6], feel free to
> report yourself. We might have some small funding available, for which
> please send me a private email with the details specified in there.
> 
> We hope that at the end of this work, we have a setup for musl locales
> that is able to fit the needs of most users. If you believe there is
> something missing, please let us know.
> 
> This work is possible thanks to a grant from NLnet and the NGI Zero
> Core Fund. Thank you for supporting us!
> 
> [1]
> https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/langinfo.h.html
> [2]
> https://git.adelielinux.org/adelie/musl-locales/-/blob/main/musl.pot?ref_type=heads
> [3]
> https://git.musl-libc.org/cgit/musl/tree/src/locale/locale_map.c#n66
> [4] https://www.openwall.com/lists/musl/2023/08/10/3
> [5]
> https://gist.github.com/al45tair/15c3ade52b09d0cad67074176ad43e4a#proposed-behaviour
> [6]
> https://gitlab.postmarketos.org/postmarketOS/postmarketos/-/issues/65
