musl - Re: gettext LC_MESSAGES differences from other libc

Message-ID: <20250112045105.GI10433@brightrain.aerifal.cx>
Date: Sat, 11 Jan 2025 23:51:05 -0500
From: Rich Felker <dalias@...c.org>
To: Gavin Smith <gavinsmith0123@...il.com>
Cc: musl@...ts.openwall.com
Subject: Re: gettext LC_MESSAGES differences from other libc

On Sat, Jan 11, 2025 at 06:13:25PM +0000, Gavin Smith wrote:
> (Please CC me in any replies as I am not subscribed to the list.)
> 
> As you know, the gettext function in musl does not behave exactly like
> the function in glibc and some other libc implementations.  Specifically,
> it does not obey the LANGUAGE variable which can be used to specify that
> translated strings should be in a certain language.
> 
> In 2014, you discussed the rationale for not supporting LANGUAGE.  There
> were issues with threads and caching:
> 
> Rich Felker, Thu, 31 Jul 2014, "How should $LANGUAGE work in our gettext?"
> https://www.openwall.com/lists/musl/2014/07/31/2
> 
> Recently in the Texinfo project, we found this incompatibility with musl
> for translations of strings to be placed in output files.  The gettext
> API (whether in musl or in glibc/other) is not a perfect match for Texinfo
> needs, as it assumes that the target language is that of the user, of
> the person sitting in front of the computer, whereas the appropriate
> translation language is that of the input document.  For example, somebody
> could be generating documentation in Italian to be posted to a website,
> while they don't speak Italian themselves and do not have an Italian
> locale installed.

This sounds like locale is not the right tool for processing it.

> The only way we can support this with glibc is to set LC_MESSAGES and/or
> LC_ALL to a locale that is not "C" or "POSIX", and then to set the LANGUAGE
> variable for the actual target language.  This is a nuisance, as sometimes
> it is a struggle to actually find such a locale.  The assumption when this
> API was designed was that a user with only a "C" locale does not need
> translations, but this is false when they are generating them for somebody
> else.  libc appears to offer no way just to open an arbitrary .mo file (the
> file with the translated strings in it) to get the translations, forcing
> you to go through the locale system.

If you just want to process .mo files without going through the locale
system, the necessary code in musl is about 42 source lines (329 bytes
of machine code), it's MIT-licensed, and you're free to copy it. This
probably makes the most sense.
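
To illustrate the scale of the task, here is a minimal sketch of a
standalone .mo lookup. It is not musl's actual code: the names
mo_lookup, mo_word and mo_string are hypothetical, and it assumes the
standard GNU .mo layout (magic 0x950412de, string count at byte 8,
original-string table offset at byte 12, translation table offset at
byte 16, originals sorted so that a binary search works). Plural forms
and msgctxt are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MO_MAGIC 0x950412deu

/* Read the 32-bit word at byte offset off, byte-swapping it when the
 * file was written on a machine of the opposite endianness.
 * The caller guarantees off + 4 <= size of the buffer. */
static uint32_t mo_word(const unsigned char *p, size_t off, int swap)
{
	uint32_t x;
	memcpy(&x, p + off, 4);
	if (swap)
		x = x>>24 | (x>>8 & 0xff00) | (x<<8 & 0xff0000) | x<<24;
	return x;
}

/* Return the string described by entry i of a (length, offset) table,
 * or NULL if it is out of bounds or not NUL-terminated. */
static const char *mo_string(const unsigned char *p, size_t size,
                             uint32_t table, uint32_t i, int swap)
{
	uint32_t len = mo_word(p, table + 8*(size_t)i, swap);
	uint32_t off = mo_word(p, table + 8*(size_t)i + 4, swap);
	if (off >= size || len >= size - off || p[off + len] != 0)
		return NULL;
	return (const char *)p + off;
}

/* Look up msgid in a .mo image of the given size held in memory.
 * Returns the translation, or NULL if absent or the image is invalid. */
const char *mo_lookup(const void *map, size_t size, const char *msgid)
{
	const unsigned char *p = map;
	uint32_t n, orig, trans, lo, hi;
	int swap;

	assert(map != NULL && msgid != NULL);
	if (size < 28)
		return NULL;
	if (mo_word(p, 0, 0) == MO_MAGIC)
		swap = 0;
	else if (mo_word(p, 0, 1) == MO_MAGIC)
		swap = 1;
	else
		return NULL;

	n = mo_word(p, 8, swap);
	orig = mo_word(p, 12, swap);
	trans = mo_word(p, 16, swap);
	/* Both tables (n entries of 8 bytes each) must fit in the image. */
	if (n > size/8 || orig > size - 8*(size_t)n || trans > size - 8*(size_t)n)
		return NULL;

	/* Binary search over the sorted original strings. */
	lo = 0;
	hi = n;
	while (lo < hi) {
		uint32_t mid = lo + (hi - lo)/2;
		const char *s = mo_string(p, size, orig, mid, swap);
		int c;
		if (!s)
			return NULL;
		c = strcmp(msgid, s);
		if (c == 0)
			return mo_string(p, size, trans, mid, swap);
		if (c < 0)
			hi = mid;
		else
			lo = mid + 1;
	}
	return NULL;
}
```

A caller would read or mmap the .mo file for the document's language
into memory, pass the buffer and its size to mo_lookup, and fall back
to the untranslated msgid when the result is NULL. No locale needs to
be installed or selected for this to work.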

> musl supports setting LC_MESSAGES to an arbitrary value that is not
> a locale, so can access arbitrary translation files in a different way.
> However, we didn't think it was worth having a special case in the code
> just for musl:
> https://lists.gnu.org/archive/html/bug-texinfo/2024-12/msg00035.html
> 
> You also discussed changing how LC_MESSAGES worked in a post in
> 2017, but as far as I am aware nothing came of it:
> 
> Rich Felker, Wed, 8 Nov 2017, "Re: setlocale behavior with 'missing' locales"
> 
>   One notable issue is that, right now, we rely on being able to set
>   LC_MESSAGES to an arbitrary name even if there's no libc locale
>   definition for it; this is because gettext() relies on the name of the
>   current LC_MESSAGES locale to find (application-specific) translation
>   files that might exist even without a libc translation. I'm not sure
>   how we would best keep this working under changes similar to the
>   above.
> https://www.openwall.com/lists/musl/2017/11/08/2

There's currently a proposal to partly remove this behavior, because it
prevents applications from being able to detect if there's actually a
meaningful locale installed for a specific locale name. The specifics
have not been worked out, and this is an area I'd really like input
from affected parties on.
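
For concreteness, the kind of probe an application might use looks
roughly like the sketch below. The helper name
messages_locale_accepted is hypothetical. The assumption, based on the
behavior described above, is that with musl today this reports success
for any name, installed or not, so the result says nothing about
whether a real locale exists; other implementations may reject
unknown names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>

/* Hypothetical helper: returns nonzero if the C library accepts
 * `name` as an LC_MESSAGES locale. Uses newlocale() so the
 * process-wide locale is left untouched. */
int messages_locale_accepted(const char *name)
{
	locale_t loc;

	assert(name != NULL);
	loc = newlocale(LC_MESSAGES_MASK, name, (locale_t)0);
	if (!loc)
		return 0;
	freelocale(loc);
	return 1;
}
```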

The hard constraint from my perspective is that setlocale(x,"") can't
be allowed to fail (user stuck with no Unicode because of unsupported
locale name in environment), but both the current behavior of making a
virtual locale by the requested name, and replacing the name by
C.UTF-8 in this case, are options. It's plausible that only
LC_MESSAGES could keep the current behavior if this turns out to be
the most helpful.

Depending on how LC_MESSAGES is to be handled, it's plausible that we
could integrate support for LANGUAGE at the same time, maybe having
the synthesized locale for "" also storing/encoding the value of
LANGUAGE, or some other mechanism to achieve the same thing. But I'm
not sure it's a good idea. There are many reasons already discussed
why the LANGUAGE model is broken, and I'm not sure we can fix it in a
way that's consistent with user expectations.

I'll probably open a new thread on this specific topic soon.

But I suspect your problem is best solved by not using locale for
non-user-language data processing.

Rich