musl - Re: gettext LC_MESSAGES differences from other libc

Message-ID: <Z5AG5IM8lCLNwILn@beigestar>
Date: Tue, 21 Jan 2025 20:43:16 +0000
From: Gavin Smith <gavinsmith0123@...il.com>
To: Rich Felker <dalias@...c.org>
Cc: musl@...ts.openwall.com, Patrice Dumas <pertusus@...e.fr>
Subject: Re: gettext LC_MESSAGES differences from other libc

On Sat, Jan 11, 2025 at 11:51:05PM -0500, Rich Felker wrote:
> > Recently in the Texinfo project, we found this incompatibility with musl
> > for translations of strings to be placed in output files.  The gettext
> > API (whether in musl or in glibc and others) is not a perfect match for
> > Texinfo needs, as it assumes that the target language is that of the user, of
> > the person sitting in front of the computer, whereas the appropriate
> > translation language is that of the input document.  For example, somebody
> > could be generating documentation in Italian to be posted to a website,
> > while they don't speak Italian themselves and do not have an Italian
> > locale installed.
> 
> This sounds like locale is not the right tool for processing it.
> 
> > The only way we can support this with glibc is to set LC_MESSAGES and/or
> > LC_ALL to a locale that is not "C" or "POSIX", and then to set the LANGUAGE
> > variable for the actual target language.  This is a nuisance, as sometimes
> > it is a struggle to actually find such a locale.  The assumption when this
> > API was designed was that a user with only a "C" locale does not need
> > translations, but this is false when they are generating them for somebody
> > else.  libc appears to offer no way just to open an arbitrary .mo file (the
> > file with the translated strings in it) to get the translations, forcing
> > you to go through the locale system.
> 
> If you just want to process .mo files without going thru the locale
> system, the necessary code is about 42 source lines/329 machine code
> bytes that's MIT-licensed in musl that you're free to copy. This
> probably makes the most sense.

Thanks for the suggestion.  It is possible that we will end up doing
this, if the current approach has more problems.
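
For reference, the current approach (a non-"C" LC_MESSAGES locale plus
the LANGUAGE variable) amounts to roughly the following.  This is only
a sketch: the function name select_document_language is made up for
illustration, and the caller has to supply the name of some locale that
is actually installed.

```c
#define _POSIX_C_SOURCE 200809L  /* for setenv() */
#include <locale.h>
#include <stdlib.h>

/* Hypothetical helper, not part of Texinfo or any libc.
 * locale_name: any installed locale other than "C"/"POSIX" (glibc
 *              ignores LANGUAGE while LC_MESSAGES is "C").
 * language:    the language of the document, e.g. "it".
 * Returns 0 on success, -1 if either step fails. */
int select_document_language(const char *locale_name, const char *language)
{
    /* LANGUAGE selects which message catalog gettext() will look for. */
    if (setenv("LANGUAGE", language, 1) != 0)
        return -1;

    /* setlocale() returns NULL if the requested locale is unavailable. */
    if (setlocale(LC_MESSAGES, locale_name) == NULL)
        return -1;

    return 0;
}
```

The usual bindtextdomain() and textdomain() calls are still needed
afterwards.  The awkward part is the locale_name argument, since there
is no portable way to know which non-"C" locale exists on the system.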

I noticed that your implementation at:

https://git.musl-libc.org/cgit/musl/tree/src/locale/__mo_lookup.c

does not refer to the hashing table section of the .mo file.  This
could make lookups slower than with a hash-based search.
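
To illustrate what a lookup without the hashing table involves, here is
a minimal sketch based on the header layout given in the GNU gettext
manual (magic number, revision, string count, then the offsets of the
original and translation tables).  It is not the musl code; the names
rd32 and mo_find are my own, and it relies on the original strings
being sorted, which is what makes a binary search possible.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit word stored in the file's byte order. */
static uint32_t rd32(const unsigned char *p, int big)
{
    return big
        ? (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 | (uint32_t)p[2] << 8 | p[3]
        : (uint32_t)p[3] << 24 | (uint32_t)p[2] << 16 | (uint32_t)p[1] << 8 | p[0];
}

/* Look up msgid in a .mo file image held in memory.
 * Returns the translation, or NULL if absent or the file is malformed. */
const char *mo_find(const void *image, size_t size, const char *msgid)
{
    const unsigned char *p = image;
    int big;

    if (size < 20) return NULL;
    if (rd32(p, 0) == 0x950412deu) big = 0;       /* little-endian file */
    else if (rd32(p, 1) == 0x950412deu) big = 1;  /* big-endian file */
    else return NULL;

    size_t n = rd32(p + 8, big);   /* number of strings */
    size_t o = rd32(p + 12, big);  /* offset of original-string table */
    size_t t = rd32(p + 16, big);  /* offset of translation table */
    if (o > size || t > size || n > (size - o) / 8 || n > (size - t) / 8)
        return NULL;

    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        /* Each table entry is a (length, offset) pair of 32-bit words. */
        size_t olen = rd32(p + o + 8 * mid, big);
        size_t ooff = rd32(p + o + 8 * mid + 4, big);
        if (ooff >= size || olen >= size - ooff || p[ooff + olen] != 0)
            return NULL;

        int c = strcmp(msgid, (const char *)p + ooff);
        if (c == 0) {
            size_t tlen = rd32(p + t + 8 * mid, big);
            size_t toff = rd32(p + t + 8 * mid + 4, big);
            if (toff >= size || tlen >= size - toff || p[toff + tlen] != 0)
                return NULL;
            return (const char *)p + toff;
        }
        if (c < 0) hi = mid; else lo = mid + 1;
    }
    return NULL;
}
```

Each lookup then costs about log2(N) string comparisons for a catalog
of N messages, rather than the roughly constant cost of a hash probe.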

I'm not sure if there is a relevant standard for the format for .mo
files.

At https://pubs.opengroup.org/onlinepubs/9799919799/utilities/msgfmt.html, it
says:

"The format of the created messages object files is unspecified."

The GNU gettext manual gives some documentation on the file format,
but does not document the format of the hashing table:

"The precise hashing algorithm used is fairly dependent on GNU gettext
code, and is not documented here."
https://www.gnu.org/software/gettext/manual/html_node/MO-Files.html

Apart from the hashing table issue, using libc gettext handles some
other things that we would have to recreate ourselves, on top of the .mo
file format.  Ones I can think of are translation contexts, character
encodings, and regional language variants (e.g. pt and pt_BR).
Another issue could be plural variants of translations.  We could probably
reimplement all of this without a huge amount of difficulty if we
really wanted to.  However, as the code is already in the libc, it would
seem simpler if we could access it.
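
As a rough idea of what two of these pieces would look like on top of a
raw .mo lookup: the GNU gettext manual describes contexts as stored by
joining the context and the original string with an EOT byte, and the
plural forms of a translation as stored one after another separated by
NUL bytes.  The helpers below are a sketch under those assumptions; the
names mo_context_key and mo_plural_form are hypothetical.  Choosing the
plural index from a number (by evaluating the Plural-Forms header) is
not shown, nor are character encodings or fallback from pt_BR to pt.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Build the lookup key "context\004msgid" in buf.
 * Returns 0 on success, -1 if buf is too small. */
int mo_context_key(char *buf, size_t bufsize,
                   const char *context, const char *msgid)
{
    int n = snprintf(buf, bufsize, "%s\004%s", context, msgid);
    return (n < 0 || (size_t)n >= bufsize) ? -1 : 0;
}

/* Pick plural form number `index` out of a translation of `len` bytes
 * (not counting the final NUL) holding NUL-separated forms.
 * Returns NULL if the translation has fewer forms than that. */
const char *mo_plural_form(const char *trans, size_t len, unsigned index)
{
    const char *end = trans + len;
    while (index--) {
        /* Skip past the NUL that terminates the current form. */
        const char *nul = memchr(trans, 0, (size_t)(end - trans));
        if (nul == NULL)
            return NULL;
        trans = nul + 1;
    }
    return trans;
}
```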