Message-ID: <20160511232614.GJ21636@brightrain.aerifal.cx>
Date: Wed, 11 May 2016 19:26:14 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: gettext and locale names

On Mon, May 09, 2016 at 09:46:50PM +0900, Masanori Ogino wrote:
> 2016-05-05 6:39 GMT+09:00 Rich Felker <dalias@...c.org>:
> > On Wed, May 04, 2016 at 10:05:28PM +0900, Masanori Ogino wrote:
> >> Hello,
> >>
> >> When I played with gettext API, I found that musl searches .mo files
> >> with a directory named as current *full* locale names, e.g.
> >> en_US.UTF-8. However, we often use shortened names too. Here is a list
> >> of those names from those of my machine in /usr/share/locale: de,
> >> en_GB, ru_UA.koi8u, sr@...in, etc.
> >>
> >> Due to this mismatch, we can't get translations with musl's gettext
> >> API for applications in wild. Thus, I'm considering to implement
> >> locale searching with shortening. Does it make sense?
> >
> > Yes, I think this makes sense. Before spending time on the code though
> > it makes sense to discuss the proposed logic here. What level would
> > the search/shortening happen at? __get_locale in locale_map.c? In
> > dcngettext.c?
> 
> Sure. I doubt that shortening in __get_locale might be insufficient
> since some code may want the full locale name even if there is no
> locale data for it. I will dig into the code.
> 
> Another problem is the preference of shortened locales. Obviously, the
> full locale itself has the highest priority and language-only locales
> (e.g. en, de, etc.) do the lowest one. However, which is the preferred
> locale, en_GB@...o or en_GB.UTF-8, when the code receives
> en_GB.UTF-8@...o?
> 
> I am unsure whether someone actually uses such locale, but I think it
> is necessary to discuss such corner cases.

Conceptually there are two sets of names the locale names need to lead
us to: libc locales in MUSL_LOCPATH, and gettext translation files in
directories provided to bindtextdomain. If/when we add non-stub
catgets support, the locale name is also relevant to NLSPATH
processing where %L expands to the whole locale name, and %l, %t, and
%c expand to the language, territory, and codeset parts of it,
respectively.
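
As a rough illustration of the pieces those NLSPATH conversions would
expand to, here is a minimal sketch that splits a name of the form
ll_TT.codeset@mod. The struct locale_parts type and the split_locale
function are hypothetical names made up for this example, not musl
code, and the fixed 32-byte buffers are an arbitrary assumption
(longer parts are silently truncated).

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for what %l, %t and %c would expand to. */
struct locale_parts {
    char lang[32];
    char terr[32];
    char codeset[32];
};

/* Split "ll_TT.codeset@mod"; parts that are absent become "". */
static void split_locale(const char *name, struct locale_parts *p)
{
    size_t end = strcspn(name, "@");    /* the modifier is ignored here */
    size_t dot = strcspn(name, ".@");   /* end of the "ll_TT" portion */
    size_t us  = strcspn(name, "_.@");  /* end of the "ll" portion */

    snprintf(p->lang, sizeof p->lang, "%.*s", (int)us, name);
    p->terr[0] = p->codeset[0] = 0;
    if (name[us] == '_')
        snprintf(p->terr, sizeof p->terr, "%.*s",
                 (int)(dot - us - 1), name + us + 1);
    if (name[dot] == '.')
        snprintf(p->codeset, sizeof p->codeset, "%.*s",
                 (int)(end - dot - 1), name + dot + 1);
}
```

With this sketch, "ru_UA.koi8u" splits into language "ru", territory
"UA" and codeset "koi8u", while a bare "de" leaves the territory and
codeset empty.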

From musl's standpoint all locales are UTF-8-encoded, so the codeset
portion of the locale name is at best redundant. The official musl
locale files, once we have such a thing, should not have ".UTF-8" in
their names, but a spurious ".UTF-8" component in the locale name
string should be accepted (and ignored) for compatibility with
glibc-based systems where the specifier may be necessary for
glibc-linked programs to distinguish from legacy versions of the
locales.

In principle we could implement this by stripping the ".UTF-8" at
setlocale time (in __get_locale from locale_map.c) but I don't see a
major advantage in doing that versus keeping the full string and just
stripping it when constructing filenames to try opening. On the other
hand there are advantages to keeping it: some users/distros may want
to put a spurious ".UTF-8" in the locale name to trick broken programs
that use strstr on the locale name, rather than nl_langinfo(CODESET),
to determine that they're in a UTF-8 environment.
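
To make the difference concrete, the two kinds of check look roughly
like this. Both function names are invented for the example. The
second one assumes the caller has already called setlocale, and that
the libc reports the codeset with the exact spelling "UTF-8", which
is an assumption and not something POSIX guarantees.

```c
#include <langinfo.h>
#include <string.h>

/* Fragile: depends on how the locale name happens to be spelled. */
static int is_utf8_by_name(const char *locale_name)
{
    return strstr(locale_name, "UTF-8") != NULL;
}

/* Preferable: ask libc for the codeset of the current locale.
 * Assumes setlocale(LC_CTYPE, ...) has already been called. */
static int is_utf8_by_codeset(void)
{
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

A program using the first form sees "en_US" as non-UTF-8 even on a
system where every locale is UTF-8, which is why keeping the spurious
".UTF-8" in the name can help.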

For gettext translations, I haven't seen ".UTF-8" used either. My
$prefix/share/locale directories have under them directories only of
the forms "ll", "ll_TT", and "ll_TT@mod". If I'm not mistaken, modern
gettext-based programs always store UTF-8 in their message catalogs
and legacy locales are expected to convert the contents when loading.

Based on all this, the search order we perform should probably be
something like this: First, take the input locale name and strip any
codeset identifier. Then, iterate over 4 steps:

1. Try full name.
2. If a modifier (@mod) is present, try with modifier removed.
3. If a territory (_TT) is present, try with territory removed.
4. If both modifier and territory are present, try with both removed.

At worst this yields 4 file-open attempts, and only in the case where
a user has requested a ll_TT@mod type locale but either the @mod or
_TT does not exist. For ll_TT type locales, it yields at most 2
attempts. For ll type locales, or locale names that don't fit the
standard pattern, there should be at most one attempt.
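
The four steps above could be sketched as follows. This is only an
illustration of the proposed order, not code from musl: the function
name locale_candidates, the LOC_MAX limit of 64 bytes, and the
fixed-size output array are all assumptions made for the example. It
generates the candidate names only; trying to open a file for each one
is left to the caller.

```c
#include <stdio.h>
#include <string.h>

#define LOC_MAX 64  /* arbitrary limit chosen for this sketch */

/* Fill out[] with candidate names in the proposed search order and
 * return how many there are (1 to 4), or 0 if the name is empty or
 * too long. Hypothetical helper, not part of musl. */
static int locale_candidates(const char *name, char out[4][LOC_MAX])
{
    size_t len = strlen(name);
    if (!len || len >= LOC_MAX) return 0;

    const char *mod = strchr(name, '@');             /* "@mod" or NULL */
    size_t head = mod ? (size_t)(mod - name) : len;  /* length before '@' */
    const char *dot = memchr(name, '.', head);       /* codeset or NULL */
    size_t base = dot ? (size_t)(dot - name) : head; /* "ll" or "ll_TT" */
    const char *us = memchr(name, '_', base);        /* territory or NULL */
    size_t lang = us ? (size_t)(us - name) : base;   /* length of "ll" */
    int n = 0;

    if (!mod) mod = "";

    /* 1. Full name, with any codeset already stripped. */
    snprintf(out[n++], LOC_MAX, "%.*s%s", (int)base, name, mod);
    /* 2. Modifier removed. */
    if (*mod) snprintf(out[n++], LOC_MAX, "%.*s", (int)base, name);
    /* 3. Territory removed. */
    if (us) snprintf(out[n++], LOC_MAX, "%.*s%s", (int)lang, name, mod);
    /* 4. Both modifier and territory removed. */
    if (*mod && us) snprintf(out[n++], LOC_MAX, "%.*s", (int)lang, name);
    return n;
}
```

Taking "@euro" purely as an example modifier, the input
"en_GB.UTF-8@euro" yields en_GB@euro, en_GB, en@euro, en in that
order, while "en_US.UTF-8" yields en_US, en and a plain "de" yields
only itself.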

From an implementation side, note that, presently, dcngettext uses the
full pathname of the message catalog file as the key to look up the
memory-mapped image. This lookup needs to happen without touching the
filesystem, and the only reason a pathname is used is that it
encompasses all the necessary key components (bound directory, locale
name, category name, domain name). So if we go with my above proposal,
the "pathname" used as a key should still contain the full locale
name, not the particular fallback it resolved to, and thus might not
actually be a valid pathname anymore. Because of this it's plausible
that the same catalog file could end up getting mapped more than once
(e.g. as both /usr/share/locale/en_US/LC_MESSAGES/foo and
/usr/share/locale/en/LC_MESSAGES/foo) but this doesn't incur any major
cost and I don't think it's worth trying to detect and avoid.
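
A minimal sketch of that keying idea follows, under several
assumptions: the names cat_entry, make_key, cache_find and cache_add
are invented here, the key layout "dir/locale/category/domain.mo"
(including the ".mo" suffix) is assumed for illustration, and a plain
singly linked list without any locking stands in for whatever
dcngettext actually does. The point is only that the key is built from
the requested locale name, so a lookup never needs the filesystem.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* One cached catalog; the key holds the *requested* locale name. */
struct cat_entry {
    struct cat_entry *next;
    const void *map;   /* stand-in for the mapped catalog image */
    char key[];        /* assumed layout: dir/locale/category/domain.mo */
};

static struct cat_entry *cats;  /* not thread-safe; sketch only */

/* Build the lookup key; returns 1 on success, 0 if it did not fit. */
static int make_key(char *buf, size_t n, const char *dir,
                    const char *loc, const char *cat, const char *dom)
{
    int r = snprintf(buf, n, "%s/%s/%s/%s.mo", dir, loc, cat, dom);
    return r >= 0 && (size_t)r < n;
}

/* Pure in-memory lookup: no open, stat or similar calls. */
static const void *cache_find(const char *key)
{
    for (struct cat_entry *e = cats; e; e = e->next)
        if (!strcmp(e->key, key)) return e->map;
    return NULL;
}

/* Record a mapping under the requested-locale key. */
static int cache_add(const char *key, const void *map)
{
    struct cat_entry *e = malloc(sizeof *e + strlen(key) + 1);
    if (!e) return -1;
    strcpy(e->key, key);
    e->map = map;
    e->next = cats;
    cats = e;
    return 0;
}
```

In this sketch a request for en_US that fell back to the en catalog is
stored under the en_US key, so a later request for en misses the cache
and maps the same file a second time, which is the duplication
described above.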

Does this all make sense? Does it sound reasonable?

Rich
