Message-ID: <20171108052715.GM1627@brightrain.aerifal.cx>
Date: Wed, 8 Nov 2017 00:27:15 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: setlocale behavior with 'missing' locales

On Wed, Nov 08, 2017 at 12:03:38AM -0500, Rich Felker wrote:
> Unfortunately this turns out to have been something of a tradeoff,
> since there's no way for applications (and, as it turns out,
> especially tests/test suites) to query whether a particular locale is
> "really" available. I've been asked to change the behavior to fail on
> unknown locale names, but of course that's not a working option in
> light of the above.
> 
> I think there may be a solution that makes everyone happy, but I'm not
> sure yet. I'm going to follow up with a description and analysis of
> whether it's valid/conforming.

So here's the possible solution. ISO C leaves the locale selected by
setlocale(cat,"") implementation-defined. POSIX, however, defines it
in terms of the LANG and LC_* environment variables. See the CX text
in:

http://pubs.opengroup.org/onlinepubs/9699919799/functions/setlocale.html

  "Setting all of the categories of the global locale is similar to
  successively setting each individual category of the global locale,
  except that all error checking is done before any actions are
  performed. To set all the categories of the global locale,
  setlocale() can be invoked as:

  setlocale(LC_ALL, "");

  In this case, setlocale() shall first verify that the values of all
  the environment variables it needs according to the precedence rules
  (described in XBD Environment Variables) indicate supported locales.
  If the value of any of these environment variable searches yields a
  locale that is not supported (and non-null), setlocale() shall
  return a null pointer and the global locale shall not be changed. If
  all environment variables name supported locales, setlocale() shall
  proceed as if it had been called for each category, using the
  appropriate value from the associated environment variable or from
  the implementation-defined default if there is no such value."

and the Environment Variables text in XBD 8.2:

http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08_02

The former seems to tie our hands: unless the locales determined by
the environment variables all exist, setlocale is required to fail and
leave us in the (unacceptable) "C" locale where UTF-8 doesn't work.
However, the latter seems to offer us a way out. After describing how
the precedence of the variables works, how locale pathnames work if
localedef is supported (musl doesn't support it), and how
implementation-provided/defined locale names work, it specifies:

  "If the locale value is not recognized by the implementation, the
  behavior is unspecified."

My optimistic reading of this is that, in the event the locale name
provided does not correspond to something we recognize, we're free to
define how it's interpreted, and always interpret it as C.UTF-8.
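
To make the precedence rules and the proposed fallback concrete, here
is a minimal sketch in C. It is not musl's code. The names
env_locale_name, resolve_env_locale, and known_builtin_only are
hypothetical, the is_known predicate stands in for "a locale
definition file by that name exists", and using C.UTF-8 when no
variable is set at all is an assumption about the
implementation-defined default.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns the locale name the environment selects
 * for one category, using the XBD 8.2 precedence
 * LC_ALL, then LC_<category>, then LANG. catvar is the category's
 * variable name, e.g. "LC_CTYPE". Unset and empty variables are skipped.
 * Returns NULL when nothing applies (implementation-defined default). */
const char *env_locale_name(const char *catvar)
{
    const char *v;
    if ((v = getenv("LC_ALL")) && *v) return v;
    if ((v = getenv(catvar)) && *v) return v;
    if ((v = getenv("LANG")) && *v) return v;
    return NULL;
}

/* Stand-in predicate for illustration only: pretends that just "C" and
 * "POSIX" have definitions. A real implementation would look for a
 * locale definition file instead. */
int known_builtin_only(const char *name)
{
    return !strcmp(name, "C") || !strcmp(name, "POSIX");
}

/* Sketch of the proposed reading of "behavior is unspecified": a name
 * the implementation does not recognize is interpreted as C.UTF-8
 * rather than causing setlocale(cat, "") to fail. */
const char *resolve_env_locale(const char *catvar,
                               int (*is_known)(const char *))
{
    const char *name = env_locale_name(catvar);
    if (!name || !is_known(name)) return "C.UTF-8";
    return name;
}
```

With LC_ALL unset or empty, LC_CTYPE=xx_YY, and the stand-in predicate,
resolve_env_locale("LC_CTYPE", known_builtin_only) yields "C.UTF-8",
which is the case the unspecified-behavior clause would cover.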

What this would achieve is the following:

1. setlocale(cat, explicit_locale_name) - succeeds if the locale
   actually has a definition file, fails and returns a null pointer
   otherwise.

2. setlocale(cat, "") - always succeeds, honoring the environment
   variable for the category if a locale definition file by that name
   exists, but otherwise (the unspecified behavior) treating it as if
   it were C.UTF-8.

This way, applications that probe for specific locale names can do so
and determine if they exist, but applications that just want to use
the default locale the user configured will still avoid catastrophic
breakage (failure to support UTF-8) even if they encounter "bad" LC_*
variables.
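
From the application side, probing could look roughly like the sketch
below. It uses only standard setlocale() calls; the helper name
locale_available is hypothetical. It is meant to be called with an
explicit name, not "". Note that the result is only meaningful as an
existence test under the proposed behavior, where an explicit name
without a definition makes setlocale() return a null pointer.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: reports whether setlocale() accepts the explicit
 * name `name` for LC_CTYPE, then restores the previous LC_CTYPE locale.
 * Returns 1 if accepted, 0 otherwise (or if the state cannot be saved). */
int locale_available(const char *name)
{
    const char *cur = setlocale(LC_CTYPE, NULL);
    char *saved;
    int ok;

    if (!cur) return 0;

    /* Copy the current name: the string returned by setlocale() may be
     * overwritten by a later call. */
    saved = malloc(strlen(cur) + 1);
    if (!saved) return 0;
    strcpy(saved, cur);

    ok = setlocale(LC_CTYPE, name) != NULL;

    /* Put things back regardless of the probe result. */
    setlocale(LC_CTYPE, saved);
    free(saved);
    return ok;
}
```

A test suite could call locale_available("de_DE.UTF-8") and skip
locale-dependent tests on a zero result, while ordinary applications
keep calling setlocale(LC_ALL, "") and never see a failure.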

Does this approach sound acceptable? I'm fairly content with
interpreting it as conforming to the standard; I'm mainly concerned
about whether there might be unforeseen breakage.

One notable issue is that, right now, we rely on being able to set
LC_MESSAGES to an arbitrary name even if there's no libc locale
definition for it; this is because gettext() relies on the name of the
current LC_MESSAGES locale to find (application-specific) translation
files that might exist even without a libc translation. I'm not sure
how we would best keep this working under changes similar to the
above.
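
To illustrate why the name matters, here is a simplified sketch of how
a gettext-style lookup can derive a catalog path from the LC_MESSAGES
locale name, using the conventional <dir>/<locale>/LC_MESSAGES/<domain>.mo
layout. The function name catalog_path is hypothetical, and this is
not musl's gettext code; real implementations typically also try
reduced forms of the name.

```c
#include <stdio.h>

/* Illustrative only: builds <dir>/<locale>/LC_MESSAGES/<domain>.mo into
 * buf. `locale` is the name of the current LC_MESSAGES locale, which is
 * why that name has to be preserved even when libc itself has no
 * definition for it. Returns 0 on success, -1 on error or truncation. */
int catalog_path(char *buf, size_t len, const char *dir,
                 const char *locale, const char *domain)
{
    int n = snprintf(buf, len, "%s/%s/LC_MESSAGES/%s.mo",
                     dir, locale, domain);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

If setlocale() replaced an unrecognized LC_MESSAGES name with C.UTF-8,
the `locale` component here would no longer point at the application's
translation directory.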

Rich
