musl - Draft proposed locale changes

Message-ID: <20180305183950.GA17616@brightrain.aerifal.cx>
Date: Mon, 5 Mar 2018 13:39:50 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Draft proposed locale changes


localeconv/LC_NUMERIC/LC_MONETARY

Each loaded locale needs an immutable lconv structure to represent
this data. It needs to be allocated with the locale (at locale loading
time) since localeconv() has no provision for failure, but we can wait
to populate it lazily, and we can put the code to populate it in
localeconv.c so that static-linked programs that don't use this
rarely-used interface don't have to pay for it. We could also omit
even allocating it (56/96 bytes) if localeconv.o is not linked, but
it's probably not worth the special-casing code to do that.
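
A minimal sketch of the allocate-early, populate-lazily idea follows.
The struct and function names are hypothetical stand-ins, not musl's
actual definitions, and the filler simply copies the C locale defaults
rather than reading a locale file:

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical stand-in for struct __locale_map; the field names are
 * illustrative and not musl's actual layout. */
struct map_sketch {
	const void *map;      /* memory-mapped locale file */
	size_t map_size;
	struct lconv lc;      /* storage reserved when the locale is loaded */
	int lc_ready;         /* nonzero once lc has been filled in */
};

/* Hypothetical filler. A real one would read keys such as
 * "int_frac_digits" from the mapped file; this one just copies the
 * current (C locale) defaults so the sketch is runnable. */
static void fill_lconv(struct map_sketch *m)
{
	m->lc = *localeconv();
}

/* Cannot fail: the storage already exists, only the fill is deferred.
 * Real code would need synchronization around the lazy fill. */
struct lconv *localeconv_sketch(struct map_sketch *m)
{
	if (!m->lc_ready) {
		fill_lconv(m);
		m->lc_ready = 1;
	}
	return &m->lc;
}
```

Keeping the filler in the same translation unit as the accessor is
what would let a static link drop it when localeconv is never called.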

The lconv structure should be part of struct __locale_map, not
struct __locale_struct, since it's a pure function of the data in the
memory-mapped locale file and not a function of how that data is
linked to a specific locale category. Putting it in __locale_struct
would just complicate setlocale and newlocale.

The obvious (but not terribly efficient) form for the data in the
locale file is to have each lconv field as a mo-level key, as in:

	msgid "int_frac_digits"
	msgstr "2"

A more compact form could pack them all into one, but then the order
becomes a hidden locale-file interface boundary/ABI.

For the string fields it's necessary that they each be in-place
strings in the mo file. grouping and mon_grouping also have the
special constraint that they need to vary by whether the arch uses
signed or unsigned plain-char (since CHAR_MAX has special meaning) so
the mo file needs to store both versions. That's ugly but I don't see
any good way around it. We can probably punt on this for now just by
not supporting grouping (i.e. only supporting locale definitions that
don't do grouping), since it's not implemented anyway.
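
To illustrate the plain-char problem: in a grouping string, an element
equal to CHAR_MAX means "no further grouping", and CHAR_MAX is 127
where plain char is signed but 255 where it is unsigned. The sketch
below, with hypothetical names, carries both encodings of "one group
of 3, then stop" and picks one at compile time:

```c
#include <limits.h>

/* Both encodings of "one group of 3 digits, then no further
 * grouping", as a locale file might have to carry them. */
const char grouping_schar[] = "\003\177"; /* for CHAR_MAX == 127 */
const char grouping_uchar[] = "\003\377"; /* for CHAR_MAX == 255 */

/* Hypothetical selector: return the variant matching this arch's
 * plain-char signedness. */
const char *pick_grouping(void)
{
	return CHAR_MAX == SCHAR_MAX ? grouping_schar : grouping_uchar;
}
```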

If we support decimal_point, it should not go through the localeconv
mechanism since it would always be needed by printf and strtod.
Instead __get_locale should probe it right away and set a 1-bit flag
in the __locale_map structure for these functions to consume. One bit
suffices, based on previous research that '.' and ',' are the only
values in use.
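
A rough sketch of the flag approach, assuming the probe receives the
decimal_point value as a string (or a null pointer if the locale file
has no such key). All names here are hypothetical:

```c
/* Hypothetical stand-in for the relevant part of struct __locale_map. */
struct dp_map {
	unsigned decimal_comma : 1; /* 1 if the radix character is ',' */
};

/* Would run once at locale load time. Anything other than exactly ","
 * leaves the default '.'. */
void probe_decimal_point(struct dp_map *m, const char *val)
{
	m->decimal_comma = val && val[0] == ',' && !val[1];
}

/* What printf/strtod-style code would consult. A null map stands for
 * the C locale. */
char radix_char(const struct dp_map *m)
{
	return m && m->decimal_comma ? ',' : '.';
}
```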



nl_langinfo/LC_TIME/etc.

Eliminate the currently-present wrong values for ERA* and related
LC_TIME stuff; that gets rid of all ambiguous translation keys except
"May". Bikeshed up some alternate key for May.



strerror/LC_MESSAGES

Not sure yet. One radical idea I kinda like is removing all the
English-phrase messages from libc core and just having strerror
produce strings like "ENOENT", "EPERM", etc. in the C locale. This
seems to be the only option that wouldn't either moderately increase
libc size or require translation files to match the exact current text
in the builtin English libc messages. Users who want the current
messages would then need an "en" locale with contents like:

	msgid "ENOENT"
	msgstr "No such file or directory"

If we don't want this, the possible solutions look like one of:

1. Prepending the error code and a null byte (e.g. "ENOENT\0") to all
the existing error strings, then skipping past it if the translation
was not found.
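
Option 1 could look roughly like the following. The lookup hook and
the toy catalog are assumptions made up for illustration (musl's real
translation lookup has a different interface), and the translated text
is only sample data:

```c
#include <string.h>

/* Error name, a null byte, then the existing English text. */
const char enoent_msg[] = "ENOENT\0No such file or directory";
const char eperm_msg[] = "EPERM\0Operation not permitted";

/* Hypothetical translation hook: returns the translation for key, or
 * a null pointer if there is none. */
typedef const char *(*lookup_fn)(const char *key);

/* Toy catalog that only knows ENOENT. */
const char *toy_lookup(const char *key)
{
	if (!strcmp(key, "ENOENT"))
		return "Datei oder Verzeichnis nicht gefunden";
	return 0;
}

/* Use the E* name as the translation key; if nothing is found, skip
 * past the name and its terminator to reach the builtin text. */
const char *errmsg_sketch(const char *packed, lookup_fn lookup)
{
	const char *t = lookup ? lookup(packed) : 0;
	if (t) return t;
	return packed + strlen(packed) + 1;
}
```

The translation key is then stable even if the English wording in libc
changes, at the cost of the extra name bytes in every message.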

2. Putting a second version of strerror in locale_map.c with the E*
names in it, so it's only linked if you use locale. I strongly dislike
this approach because it greatly increases the marginal size cost of
doing the right thing (calling setlocale) and imposes the cost even if
you don't use strerror at all (only setlocale).

3. Accepting that translations need to match (and perpetually be
updated to match) error strings in musl __strerror.h. I don't like
this much either.

So I think the choice is between option 1 and option "zero", the
radical idea above of returning only E* names in the C locale. Option
zero decreases the size of libc by nearly 1k (removing messages) but
changes the behavior. Option 1 increases the size of libc by about 1k.



LC_COLLATE

No specific proposal yet. We need a data structure to map characters
and sequences of characters to collating elements. Obviously the mo
file's lookups could be used directly (O(log n), improved avg case if
we ever add hash table support) but they might be heavier than we
want. The alternative would be having a gigantic string in the mo file
that's just "compiled" collation table data, but unless it's
well-designed that seems like an undesirable permanent interface
boundary.
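
For a sense of what the O(log n) lookup amounts to, here is a sketch
of a binary search over a sorted table mapping character sequences to
collating elements. The table layout, names, and weights are invented
for illustration and are not a proposed format:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical entry: a character sequence and its collating weight. */
struct coll_entry {
	const char *seq;
	unsigned weight;
};

/* Toy table, sorted by strcmp order of seq. "ch" stands in for a
 * multi-character collating element. */
const struct coll_entry toy_table[] = {
	{ "a", 10 }, { "b", 20 }, { "c", 30 }, { "ch", 35 }, { "d", 40 },
};

/* Binary search; returns a null pointer if seq has no entry. */
const struct coll_entry *coll_find(const struct coll_entry *t, size_t n,
                                   const char *seq)
{
	size_t lo = 0, hi = n;
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		int c = strcmp(seq, t[mid].seq);
		if (!c) return &t[mid];
		if (c < 0) hi = mid;
		else lo = mid + 1;
	}
	return 0;
}
```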