Message-ID: <20140628012521.GU179@brightrain.aerifal.cx>
Date: Fri, 27 Jun 2014 21:25:21 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Locale framework RFC - more on proposed implementation

I realize I left out some details about how setlocale/uselocale/etc.
will actually work, which should be included:


General behavior, implementation-defined behaviors:

Per C, when a locale of "" is requested, the "default locale" is used.
Per POSIX, the default locale is determined by applying the LC_* and LANG
env vars in the usual order (LC_ALL overrides all, LC_* are for
individual categories, LANG is the fallback if LC_* are not set) with
an implementation-defined default in the case where none of the vars
are defined. musl's implementation-defined default will be the current
"C.UTF-8".

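The lookup order above can be sketched as follows. This is only an illustration, not musl's actual code: default_locale_name is a hypothetical helper, and it assumes (as POSIX specifies) that a variable set to the empty string is treated as unset.

```c
#include <stdlib.h>

/* Hypothetical helper: resolve the name that a request for the ""
 * locale maps to for one category. cat_var is the name of that
 * category's environment variable, e.g. "LC_TIME". */
static const char *default_locale_name(const char *cat_var)
{
	const char *val;

	/* LC_ALL overrides everything else. */
	if ((val = getenv("LC_ALL")) && *val) return val;
	/* Next, the variable for the individual category. */
	if ((val = getenv(cat_var)) && *val) return val;
	/* LANG is the fallback when neither of the above is set. */
	if ((val = getenv("LANG")) && *val) return val;
	/* Implementation-defined default when none of the vars is set. */
	return "C.UTF-8";
}
```

With LANG=de_DE and LC_TIME=fr_FR in the environment, this returns "fr_FR" for "LC_TIME" and "de_DE" for any other category variable.
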
Here are the values for different locale categories that will be
meaningful to musl:

All categories: C.UTF-8, suppresses any possible filesystem access
looking for locale files.

LC_CTYPE - C or POSIX has a special meaning: byte-based, non-UTF-8.

LC_MESSAGES - Language name to be used in pathname components for
translated message files. Kept and used regardless of whether such
directories exist since applications may have their own translated
messages in languages libc is not aware of.

LC_TIME - Language/culture name to be used for loading a file that
will control time formatting. This _might_ simply be a .mo file with
message translations for the English names, or a catgets-format file
using the nl_langinfo item tags as keys. At first it won't even be
implemented so it doesn't matter.

LC_COLLATE - Language/culture name to be used for loading a collation
file, probably in some compiled version of Unicode collation algorithm
format rather than the POSIX localedef format. But for now this will
also remain unimplemented.

LC_MONETARY - Language/culture name to be used for loading a file with
custom strfmon parameters. Since strfmon does not even work properly
now, this is not a priority, and it will remain unimplemented for the
time being.

LC_NUMERIC - None.

Unrecognized locale names will also be accepted as aliases for
C.UTF-8; this both facilitates easy use of message translation files
of which libc is not aware (i.e. setting LANG to such a language
won't cause setlocale(LC_ALL, "") to fail) and avoids the unfortunate
possibility of an accidental bogus environment setting causing
programs to regress from UTF-8 to byte-based mode, which would happen
if setlocale were to fail on unknown arguments.
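
As a rough sketch of the consequence for LC_CTYPE: only the exact names C and POSIX select byte-based mode, and every other name, recognized or not, selects UTF-8. The helper name ctype_is_utf8 is made up for this illustration.

```c
#include <string.h>

/* Hypothetical helper: 0 means byte-based mode, 1 means UTF-8.
 * Unknown names fall through to UTF-8 rather than being rejected. */
static int ctype_is_utf8(const char *name)
{
	return strcmp(name, "C") && strcmp(name, "POSIX");
}
```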

When setlocale(LC_ALL, 0) is called, it needs to return a string that
encodes all of the locale categories and which can be used to set the
locale back to the same state. I don't think the format for this
string needs to be documented/specified, but it will probably just be
a delimited list of the settings for each configurable category, in
numeric order.
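
A minimal sketch of such a round-trippable string, under assumptions that are not part of the proposal: ';' as the delimiter, six categories, and the helper names encode_locale and decode_locale. The real format may differ.

```c
#include <stdio.h>
#include <string.h>

#define NCATS 6 /* assumption: six categories, indexed in numeric order */

/* Join the per-category names with ';' into buf.
 * Returns buf, or NULL if the result does not fit. */
static char *encode_locale(char *buf, size_t size,
                           const char *const names[NCATS])
{
	size_t pos = 0;
	for (int i = 0; i < NCATS; i++) {
		int n = snprintf(buf + pos, size - pos, "%s%s",
		                 i ? ";" : "", names[i]);
		if (n < 0 || (size_t)n >= size - pos) return NULL;
		pos += (size_t)n;
	}
	return buf;
}

/* Split an encoded string back into per-category names, in place.
 * Returns 0 on success, -1 if the number of fields is wrong. */
static int decode_locale(char *s, char *names[NCATS])
{
	for (int i = 0; i < NCATS; i++) {
		char *sep = strchr(s, ';');
		names[i] = s;
		if (i < NCATS - 1) {
			if (!sep) return -1;
			*sep = 0;
			s = sep + 1;
		} else if (sep) {
			return -1;
		}
	}
	return 0;
}
```

This scheme only works if locale names never contain the delimiter, which a real implementation would have to enforce.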


Implementation:

Atomicity:

The standards are less than ideal with regard to what happens when the
locale is changed out from under interfaces whose behavior depends on
locale. Rather than worry about this, I'd rather everything just be
safe. So aside from thread-local locale structures used by uselocale,
pretty much all of the data the locale system works with should be
immutable -- once allocated/mapped, it should never be freed.
Otherwise expensive locking is required on every use. To avoid
excessive costs/memory leaks, each locale resource loaded should only
be loaded once, and reused if requested again, much like the way
dlopen/dlclose work in musl.
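
The load-once rule might look roughly like this. The struct layout, the 24-byte name limit and the name get_locale_map are all assumptions for the sketch. A real version would map a file instead of using a placeholder and would serialize list updates with a lock; readers of already-published entries would still need no locking.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical record for one loaded locale resource. */
struct locale_map {
	const void *data;        /* never modified once the entry is listed */
	char name[24];
	struct locale_map *next;
};

static struct locale_map *loaded; /* entries are never freed */

/* Return the entry for name, creating it on first request.
 * Single-threaded sketch: no locking around the list update. */
static const struct locale_map *get_locale_map(const char *name)
{
	struct locale_map *p;

	if (strlen(name) >= sizeof p->name) return 0;
	/* Reuse an existing entry if this resource was loaded before. */
	for (p = loaded; p; p = p->next)
		if (!strcmp(p->name, name)) return p;
	p = malloc(sizeof *p);
	if (!p) return 0;
	strcpy(p->name, name);
	p->data = p->name;       /* placeholder for mapped file contents */
	p->next = loaded;
	loaded = p;
	return p;
}
```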

Locale structure:

The locale structure, which represents either a global locale setting
or a locale allocated by newlocale for use with uselocale, needs to
reflect each of the individual locale categories. With setlocale, it's
possible to change any of the categories for the global locale
independently, and the confusingly-named newlocale actually changes
just one category at a time for the input locale it modifies.

Here is a possible structure:

struct __locale_struct {
	int ctype_utf8;
	char *name[4];
	void *cat[4];
};

Since the only property of the LC_CTYPE category which varies is
encoding (binary vs UTF-8), a single atomically-written int suffices
to implement LC_CTYPE.

For the remaining four categories which can vary, a name and a pointer
are stored.

The names are only used internally by the locale system (e.g. for
constructing the return string setlocale produces) so they do not need
any atomicity. They could be stored in dynamically allocated storage
(makes sense for newlocale/uselocale) or static storage (makes sense
for setlocale where we don't want to link free).

The category data pointers need to be replaced atomically such that
another thread accessing the category's data sees either the old value
or the new value; both point to valid data since all data is
immutable. The specifics of the data pointed to will be defined later.
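
To illustrate the pointer replacement, here is a sketch using C11 <stdatomic.h> as a stand-in; musl has its own internal atomic primitives, and the names locale_sketch, set_category and get_category are invented for this example.

```c
#include <stdatomic.h>

/* Variant of the structure above with the atomic members made explicit. */
struct locale_sketch {
	atomic_int ctype_utf8;
	const char *name[4];          /* internal use only, not atomic */
	_Atomic(const void *) cat[4]; /* replaced atomically */
};

/* Publish new immutable data for category slot i. */
static void set_category(struct locale_sketch *loc, int i,
                         const char *name, const void *data)
{
	loc->name[i] = name;
	/* Release ordering: the data's contents are visible to any
	 * thread that subsequently loads the new pointer. */
	atomic_store_explicit(&loc->cat[i], data, memory_order_release);
}

/* Readers see either the old or the new pointer, never a torn value. */
static const void *get_category(struct locale_sketch *loc, int i)
{
	return atomic_load_explicit(&loc->cat[i], memory_order_acquire);
}
```

Because the old data is never freed, a reader that loaded the old pointer just before the store can keep using it safely.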

Storage:

I had previously suggested adding a pointer to the global locale to
the libc structure, but I think it may make more sense to actually
embed the global locale structure there. Either approach would work,
but if the locale structure isn't embedded, a few places where it's
used would need to check whether the pointer is null and have suitable
fallbacks.

For uselocale, __pthread_self()->locale should just point to whatever
locale is in use, possibly the global locale or one created by
newlocale. With this design there's no requirement (as in the previous
document) for checking whether __pthread_self()->locale is NULL and
falling back to the global locale. Instead, when libc.uselocale_cnt is
nonzero, __pthread_self()->locale can just be used directly.
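
A sketch of that lookup, with several stand-ins that are assumptions rather than musl code: a _Thread_local variable plays the role of __pthread_self()->locale, plain statics play the role of the libc structure's members, and uselocale_cnt is treated simply as a counter that becomes nonzero once uselocale has been called with a non-null argument. LC_GLOBAL_LOCALE handling and atomic updates of the counter are left out.

```c
struct loc_sketch { int ctype_utf8; };

/* Stand-ins for the embedded global locale and libc.uselocale_cnt. */
static struct loc_sketch global_locale = { 1 };
static int uselocale_cnt;

/* Stand-in for __pthread_self()->locale. It always points at a valid
 * locale, so no NULL check is needed when it is used. */
static _Thread_local struct loc_sketch *thread_locale = &global_locale;

/* Locale that locale-dependent functions should consult. */
static struct loc_sketch *current_locale(void)
{
	return uselocale_cnt ? thread_locale : &global_locale;
}

/* Simplified uselocale: install new if non-null, return the old one. */
static struct loc_sketch *use_locale(struct loc_sketch *new)
{
	struct loc_sketch *old = thread_locale;
	if (new) {
		thread_locale = new;
		uselocale_cnt++;
	}
	return old;
}
```

Programs that never call uselocale keep uselocale_cnt at zero and so never touch the per-thread pointer.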
