Message-ID: <Z4K0xWcQ6tP30CZc@beigestar>
Date: Sat, 11 Jan 2025 18:13:25 +0000
From: Gavin Smith <gavinsmith0123@...il.com>
To: musl@...ts.openwall.com
Subject: gettext LC_MESSAGES differences from other libc

(Please CC me in any replies as I am not subscribed to the list.)

As you know, the gettext function in musl does not behave exactly like
the function in glibc and some other libc implementations.  Specifically,
it does not obey the LANGUAGE variable which can be used to specify that
translated strings should be in a certain language.

In 2014, you discussed the rationale for not supporting LANGUAGE.  There
were issues with threads and caching:

Rich Felker, Thu, 31 Jul 2014, "How should $LANGUAGE work in our gettext?"
https://www.openwall.com/lists/musl/2014/07/31/2

Recently in the Texinfo project, we ran into this incompatibility with
musl when translating strings that are placed in output files.  The
gettext API (whether in musl, glibc or elsewhere) is not a perfect match
for Texinfo's needs, as it assumes that the target language is that of the
user, of the person sitting in front of the computer, whereas the
appropriate translation language is that of the input document.  For
example, somebody
could be generating documentation in Italian to be posted to a website,
while they don't speak Italian themselves and do not have an Italian
locale installed.

The only way we can support this with glibc is to set LC_MESSAGES and/or
LC_ALL to a locale that is not "C" or "POSIX", and then to set the LANGUAGE
variable for the actual target language.  This is a nuisance, as sometimes
it is a struggle to actually find such a locale.  The assumption when this
API was designed was that a user with only a "C" locale does not need
translations, but this is false when they are generating them for somebody
else.  libc appears to offer no way just to open an arbitrary .mo file (the
file with the translated strings in it) to get the translations, forcing
you to go through the locale system.
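
To illustrate the workaround described above, here is a minimal sketch.
The function name and the list of candidate locale names are made up for
this example (the candidates are guesses, and may not exist on a given
system); only setenv, setlocale and dgettext are real interfaces.

```c
#include <libintl.h>
#include <locale.h>
#include <stdlib.h>

/* Names to try for a locale other than "C"/"POSIX".  These are guesses;
   which locales are installed varies from system to system. */
static const char *const fallback_locales[] = {
  "en_US.UTF-8", "en_US.utf8", "en_GB.UTF-8"
};

/* Look up MSGID in DOMAIN for language LANG (e.g. "it") by way of the
   LANGUAGE variable.  Returns MSGID itself if no translation is found.
   Not thread-safe: it changes the environment and the global locale. */
static const char *
translate_with_language_var (const char *domain, const char *lang,
                             const char *msgid)
{
  size_t i;

  /* Set LANGUAGE before calling setlocale, in case the library caches
     earlier lookups and only rechecks after a locale change. */
  if (setenv ("LANGUAGE", lang, 1) != 0)
    return msgid;

  /* LANGUAGE is only consulted when LC_MESSAGES is not "C"/"POSIX", so
     try to find any other locale that is actually installed. */
  for (i = 0; i < sizeof fallback_locales / sizeof fallback_locales[0]; i++)
    if (setlocale (LC_MESSAGES, fallback_locales[i]) != NULL)
      break;

  return dgettext (domain, msgid);
}
```

A caller would first use bindtextdomain to say where the .mo files for
the domain live.  If none of the candidate locales is installed, the loop
falls through and the lookup most likely returns the untranslated string,
which is exactly the nuisance described above.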

musl supports setting LC_MESSAGES to an arbitrary value that is not
a locale, so it can access arbitrary translation files in a different way.
However, we didn't think it was worth having a special case in the code
just for musl:
https://lists.gnu.org/archive/html/bug-texinfo/2024-12/msg00035.html
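
For comparison, the musl-specific route would look roughly like the
sketch below.  The function name is invented for this example, and the
behaviour noted in the comment is my understanding of musl rather than
something I have verified in detail.

```c
#include <libintl.h>
#include <locale.h>

/* Look up MSGID in DOMAIN for language LANG (e.g. "it") by setting
   LC_MESSAGES directly.  Returns MSGID itself if the locale name is
   rejected or no translation is found.  Changes the global locale. */
static const char *
translate_with_lc_messages (const char *domain, const char *lang,
                            const char *msgid)
{
  /* On musl this is expected to succeed even when no libc locale named
     LANG exists.  Other libc implementations will typically return NULL
     here unless such a locale is installed. */
  if (setlocale (LC_MESSAGES, lang) == NULL)
    return msgid;

  return dgettext (domain, msgid);
}
```

No LANGUAGE variable is involved here, which is why this would have to be
a separate code path from the one used with glibc.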

You also discussed changing how LC_MESSAGES works in a post in 2017,
but as far as I am aware nothing came of it:

Rich Felker, Wed, 8 Nov 2017, "Re: setlocale behavior with 'missing' locales"

  One notable issue is that, right now, we rely on being able to set
  LC_MESSAGES to an arbitrary name even if there's no libc locale
  definition for it; this is because gettext() relies on the name of the
  current LC_MESSAGES locale to find (application-specific) translation
  files that might exist even without a libc translation. I'm not sure
  how we would best keep this working under changes similar to the
  above.
https://www.openwall.com/lists/musl/2017/11/08/2

Could there be a possibility of a new extension to the gettext API that
works with musl, glibc and other libc implementations, and that could be
used for arbitrary languages, not just those with installed locales?

I mention the possibility because I found an old proposal (from 2016) to
extend the glibc API for translation languages, which could be of interest:

Bruno Haible, 2016-05-10
"Re: [bug-gettext] RFC: move LANGUAGE check out of gettext()"
https://lists.gnu.org/archive/html/bug-gettext/2016-05/msg00009.html

> Why is this being reported for the LANGUAGE environment variable but not
> for the LANG and LC_ALL environment variables? Because for LANG and LC_*
> we have an architecture composed of three functionalities:
> 
>   (A) environment variables: getenv(), setenv()
> 
>   (B) locales: setlocale(), newlocale(), uselocale().
> 
>   (C) gettext() and friends.
> 
> (A) is the bottom-most layer. But it has the limitation that multi-threaded
> programs must not call setenv().
> 
> (B) is a layer that fetches the initial values from (A), and that allows
> mutators (setlocale(), uselocale()) in multi-threaded programs.
> So that multi-threaded applications can modify the program's locale after
> startup, there is the setlocale() function.
> So that multi-threaded programs can have a locale per thread, there is a
> uselocale() function.
> 
> (C) is an application layer that happens to be in Glibc for convenience
> reasons. It is based on the layer (B).
> 
> 
> Back to the LANGUAGE environment variable. The problem is that here we
> have the layers (A) and (C), but (B) is missing. The solution ought to
> be to introduce a layer (B) for LANGUAGE. LANGUAGE is not specified by
> POSIX and does not perfectly fit into the locale system, therefore I
> believe it is best treated separately.

This was also raised in the glibc bug tracker:

Daiki Ueno, 2016-05-31
"API for language priority list"
https://sourceware.org/bugzilla/show_bug.cgi?id=20184

It was proposed that a language preference list could be set on a
thread-specific basis, without involving environment variables.
This accords with point 2 in Rich Felker's 2014 commentary:

  2. The $LANGUAGE variable conflicts with uselocale and thread-local
     locales. For instance if the caller has called uselocale to request
     language Y despite the process-wide locale being language X, where
     language X is based on the user's preferences in the environment
     and language Y is based on data, it's wrong to present messages
     based on the environment ($LANGUAGE) rather than the requested
     language Y.


I hope this possibility is interesting to you although I don't fully
understand all the issues involved.

