musl - Re: Possible bug in setlocale upon invalid LC

Message-ID: <20160402040914.GD21636@brightrain.aerifal.cx>
Date: Sat, 2 Apr 2016 00:09:14 -0400
From: Rich Felker <dalias@...c.org>
To: Assaf Gordon <assafgordon@...il.com>
Cc: musl@...ts.openwall.com
Subject: Re: Possible bug in setlocale upon invalid LC_ALL value

On Fri, Apr 01, 2016 at 10:46:25PM -0400, Assaf Gordon wrote:
> Hello Rich,
> 
> thank you for the prompt and detailed response.
> 
> > On Apr 1, 2016, at 20:58, Rich Felker <dalias@...c.org> wrote:
> > 
> > On Fri, Apr 01, 2016 at 08:47:01PM -0400, Assaf Gordon wrote:
> >> I think I've encountered a problem in musl, where using setlocale with invalid locale name returns the invalid locale instead of a known locale.
> > 
> > This is intentional. All locale names are valid under musl, and those
> > which don't have any particular definition are just aliases for
> > C.UTF-8.
> 
> I will suggest a minor fix to GNU coreutils to accommodate the
> current implementation.

I think any 'fix' would be inconsistent with both the specified
behavior and the intended behavior. See below:

> > The alternative would be that UTF-8 support breaks whenever
> > LC_* vars are set but locales are not installed/configured, which
> > would pretty much _always_ be the case when running a static-linked
> > standalone binary on a non-musl-based system (where LC_* are probably
> > set to something the main host libc recognizes).
> > 
> > One possibility if this behavior is problematic would be to only
> > consider names without their own definitions as aliases for C.UTF-8
> > when MUSL_LOCPATH is not set. However I think we'd need to see a
> > strong motivation for doing that, since it seems like it would be
> > worse behavior in some ways, especially when using LC_MESSAGES set to
> > a language for which you don't have a locale installed.
> 
> I'm not an expert about locales to argue one way or the other.
> 
> Naively, I would think that this is somewhat problematic, because a
> well-behaved program (one that checks setlocale's return value for
> errors) has no way to warn the user that he/she used an invalid
> locale.

Well the intent is that it _is_ valid.

> Perhaps a work-around would be to handle it this way:
> if an invalid (non-existing) locale is given in LC_* env vars,
> setlocale(LC_ALL,"") should return NULL (indicating an error), then
> all other invocations of setlocale(LC_*,NULL) would return the
> "C.UTF-8" indicator. This would allow detecting the error, but not
> affect further processing (if invalid locales are already an alias
> to C.UTF-8). This seems to match other OSes/libcs which return fixed
> "C" in such cases.

This is non-conforming. If setlocale returns NULL it is required not
to have modified the locale. This, combined with the fact that prior
to calling setlocale successfully, the locale is in an unusable
(single-byte, non-UTF-8-handling) state, is the whole motivation for
musl's treatment of locale names that don't have definitions.
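
To make the contract concrete, here is a rough sketch (not musl code) of
how a caller could observe it. The helper name try_locale and its return
convention are made up for illustration; only setlocale() itself is a
real interface.

```c
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: attempt to switch LC_ALL to `name`.
 * Returns  1 if setlocale() succeeded,
 *          0 if it failed and the active locale string is unchanged
 *            (what the standard requires on failure),
 *         -1 if something unexpected happened (the locale string changed
 *            despite the failure, or the snapshot could not be taken). */
int try_locale(const char *name)
{
    /* Snapshot the current locale string; a later setlocale() call may
     * overwrite the buffer it points to, so copy it. */
    const char *cur = setlocale(LC_ALL, NULL);
    if (!cur)
        return -1;
    char *before = malloc(strlen(cur) + 1);
    if (!before)
        return -1;
    strcpy(before, cur);

    int result;
    if (setlocale(LC_ALL, name) != NULL) {
        result = 1;
    } else {
        cur = setlocale(LC_ALL, NULL);
        result = (cur && strcmp(before, cur) == 0) ? 0 : -1;
    }
    free(before);
    return result;
}
```

Calling try_locale("") with LC_ALL=foobar in the environment would be
expected to return 1 on musl (the name is accepted) and, going by the
glibc behavior described in this thread, 0 on glibc. A return of -1
would indicate the non-conforming "fail but still change the locale"
behavior proposed above.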

> The reason for such check is that it is common user mistake to
> specify non-existing locales, then be confused by the seemingly
> incorrect results. Allowing a program to detect incorrect locales is
> a good mitigation.
> 
> I'll side-step the non-UTF-8 locales (which would be a problem in
> the current musl auto-aliasing to UTF-8), and show one possible case
> where silent aliasing leads to incorrect results.

musl does not support non-UTF-8 encodings at all, so that's not a very
interesting case anyway.

> consider the following UTF-8 string:
>    N Ñ O P Y Z Å Ä Ö
> (which includes Spanish eñe and the last three letters in the Swedish alphabet).
> When sorting with locale-aware programs, different locales should
> give different collation orders (e.g. es_ES.UTF-8 vs sv_FI.UTF-8).
> 
> To reproduce:
>   A='\116\n\303\221\n\117\n\120\n\131\n\132\n\303\205\n\303\204\n\303\226\n'
>   printf "$A" | LC_ALL=sv_FI.UTF-8 sort
>   printf "$A" | LC_ALL=es_ES.UTF-8 sort
> 
> If a user has a typo in the locale name (e.g. sv_SV.UTF-8), there's
> no way for a program to detect it, and the user will get unexpectedly
> ordered results.

But how is this any different from having a typo that results in
another defined locale being selected?

> GNU coreutils' 'sort' program added a --debug option to help user diagnose such issues.
> On Linux with glibc, this will be the output:
> 
>   $ printf "$A" | LC_ALL=es_ES.UTF-8 sort --debug > /dev/null
>   sort: using ‘es_ES.UTF-8’ sorting rules
> 
>   $ printf "$A" | LC_ALL=sv_FI.UTF-8 sort --debug > /dev/null
>   sort: using ‘sv_FI.UTF-8’ sorting rules
> 
>   $ printf "$A" | LC_ALL=sv_SV.UTF-8 sort --debug > /dev/null
>   sort: using simple byte comparison
> 
>   $ printf "$A" | LC_ALL=foobar sort --debug > /dev/null
>   sort: using simple byte comparison
> 
> The last two messages ("simple byte") are the hint that the locale is
> invalid and that sort will not use it.
> 
> On Alpine (linux + musl), there's no way to detect such a case:
> 
>   $ printf "$A" | LC_ALL=sv_FI.UTF-8 gsort --debug > /dev/null
>   gsort: using ‘sv_FI.UTF-8’ sorting rules
> 
>   $ printf "$A" | LC_ALL=sv_SV.UTF-8 gsort --debug > /dev/null
>   gsort: using ‘sv_SV.UTF-8’ sorting rules
> 
>   $ printf "$A" | LC_ALL=foobar gsort --debug > /dev/null
>   gsort: using ‘foobar’ sorting rules

It might help if this resulted in:

	gsort: using ‘C.UTF-8’ sorting rules

This is what used to happen ("hard resolving" the alias to a different
name, rather than "soft resolving" it), but now we save the actual
requested name so that it can be used for loading messages if dcgettext
is used with a category other than LC_MESSAGES. This is actually a
very rarely used feature, which could probably be sacrificed for
categories other than LC_MESSAGES if there's a strong benefit to doing
so.
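
For reference, the rarely used feature in question looks roughly like
this from the application side. This is only a sketch: the domain name
"example-domain" and the helper time_label are hypothetical, and where
(or whether) a catalog is found depends on the implementation and on
how the text domain has been bound.

```c
#include <libintl.h>
#include <locale.h>

/* Hypothetical helper: look up a translation for `msgid` using the
 * locale name recorded for LC_TIME rather than LC_MESSAGES. If no
 * catalog is found for the domain, dcgettext() returns msgid itself. */
const char *time_label(const char *msgid)
{
    return dcgettext("example-domain", msgid, LC_TIME);
}
```

The lookup is keyed on the name stored for the given category, which is
why the originally requested name has to be kept around instead of
being replaced by C.UTF-8.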

Note that musl does not have any collation support at all right now,
nor any official locale files. That gives us some flexibility to
change things without impacting users, but the changes still can't
impact standards conformance/API contracts. I do hope to add collation
in the near future, as part of the goals for "1.2".
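
As a sketch of what this means for a program like sort: locale-aware
sorting is normally done through strcoll(), and the result is only as
locale-specific as the collation data behind it. The names cmp_coll and
sort_lines below are made up for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* qsort() comparator that orders strings by the current LC_COLLATE
 * rules. In the "C" locale this is plain byte order. */
static int cmp_coll(const void *a, const void *b)
{
    return strcoll(*(const char *const *)a, *(const char *const *)b);
}

/* Hypothetical helper: sort an array of n strings in place using the
 * collation order of whatever locale is currently active. */
void sort_lines(const char **lines, size_t n)
{
    qsort(lines, n, sizeof *lines, cmp_coll);
}
```

Without collation data behind strcoll(), the order produced here does
not depend on which locale name was selected, so es_ES.UTF-8,
sv_FI.UTF-8 and a mistyped name would all sort identically.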

Rich