Message-ID: <20140628012521.GU179@brightrain.aerifal.cx>
Date: Fri, 27 Jun 2014 21:25:21 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Locale framework RFC - more on proposed implementation

I realize I left out some details about how setlocale/uselocale/etc.
will actually work, which should be included:


General behavior, implementation-defined behaviors:

Per C, when a locale of "" is requested, the "default locale" is used.
Per POSIX, the default locale is determined by applying the LC_* and
LANG env vars in the usual order (LC_ALL overrides all, LC_* are for
individual categories, LANG is the fallback if LC_* are not set) with
an implementation-defined default in the case where none of the vars
are defined. musl's implementation-defined default will be the current
"C.UTF-8".

Here are the values for different locale categories that will be
meaningful to musl:

All categories: C.UTF-8, suppresses any possible filesystem access
looking for locale files.

LC_CTYPE - C or POSIX have a special meaning: byte-based, non-UTF-8.

LC_MESSAGES - Language name to be used in pathname components for
translated message files. Kept and used regardless of whether such
directories exist since applications may have their own translated
messages in languages libc is not aware of.

LC_TIME - Language/culture name to be used for loading a file that
will control time formatting. This _might_ simply be a .mo file with
message translations for the English names, or a catgets-format file
using the nl_langinfo item tags as keys. At first it won't even be
implemented so it doesn't matter.

LC_COLLATE - Language/culture name to be used for loading a collation
file, probably in some compiled version of Unicode collation algorithm
format rather than the POSIX localedef format. But for now this will
also remain unimplemented.

LC_MONETARY - Language/culture name to be used for loading a file with
custom strfmon parameters.
Since strfmon does not even work properly now, this is not a priority,
and it will remain unimplemented for the time being.

LC_NUMERIC - None.

Unrecognized locale names will also be accepted as aliases for
C.UTF-8; this both facilitates easy use of message translation files
of which libc is not aware (i.e. setting LANG to such a language won't
cause setlocale(LC_ALL, "") to fail) and avoids the unfortunate
possibility of an accidental bogus environment setting causing
programs to regress from UTF-8 to byte-based mode, which would happen
if setlocale were to fail on unknown arguments.

When setlocale(LC_ALL, 0) is called, it needs to return a string that
encodes all of the locale categories and which can be used to set the
locale back to the same state. I don't think the format for this
string needs to be documented/specified, but it will probably just be
a delimited list of the settings for each configurable category, in
numeric order.


Implementation:

Atomicity:

The standards are less than ideal with regard to what happens when the
locale is changed out from under interfaces whose behavior depends on
locale. Rather than worry about this, I'd rather everything just be
safe. So aside from thread-local locale structures used by uselocale,
pretty much all of the data the locale system works with should be
immutable -- once allocated/mapped, it should never be freed.
Otherwise expensive locking is required on every use. To avoid
excessive costs/memory leaks, each locale resource loaded should only
be loaded once, and reused if requested again, much like the way
dlopen/dlclose work in musl.

Locale structure:

The locale structure, which represents either a global locale setting
or a locale allocated by newlocale for use with uselocale, needs to
reflect each of the individual locale categories.
With setlocale, it's possible to change any of the categories for the
global locale independently, and the confusingly-named newlocale
actually changes just one category at a time for the input locale it
modifies. Here is a possible structure:

	struct __locale_struct {
		int ctype_utf8;
		char *name[4];
		void *cat[4];
	};

Since the only property of the LC_CTYPE category which varies is
encoding (binary vs UTF-8), a single atomically-written int suffices
to implement LC_CTYPE.

For the remaining four categories which can vary, a name and a pointer
are stored. The names are only used internally by the locale system
(e.g. for constructing the return string setlocale produces) so they
do not need any atomicity. They could be stored in dynamically
allocated storage (makes sense for newlocale/uselocale) or static
storage (makes sense for setlocale where we don't want to link free).

The category data pointers need to be replaced atomically such that
another thread accessing the category's data sees either the old value
or the new value; both point to valid data since all data is
immutable. The specifics of the data pointed to will be defined later.

Storage:

I had previously suggested adding a pointer to the global locale to
the libc structure, but I think it may make more sense to actually
embed the global locale structure there. Either approach would work,
but if the locale structure isn't embedded, a few places where it's
used would need to check whether the pointer is null and have suitable
fallbacks.

For uselocale, __pthread_self()->locale should just point to whatever
locale is in use, possibly the global locale or one created by
newlocale. With this design there's no requirement (as in the previous
document) for checking whether __pthread_self()->locale is NULL and
falling back to the global locale. Instead, when libc.uselocale_cnt is
nonzero, __pthread_self()->locale can just be used directly.
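
For illustration only, the structure and the atomic replacement of
category pointers described above might be sketched as follows. This
is not musl code: the names locale_sketch, set_category and
get_category are hypothetical, and C11 <stdatomic.h> stands in for
whatever atomic primitives libc would really use internally. It relies
on the stated rule that category data is immutable and never freed, so
a reader that loads either the old or the new pointer always sees
valid data.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the proposed struct __locale_struct.
 * C11 atomics are an assumption made for this sketch. */
struct locale_sketch {
	atomic_int ctype_utf8;          /* 1 = UTF-8, 0 = byte-based C/POSIX */
	const char *name[4];            /* used only by the locale system itself */
	_Atomic(const void *) cat[4];   /* immutable, never-freed category data */
};

/* Replace one category. The name is stored without atomicity, as the
 * proposal allows; the data pointer is published with release ordering
 * so a concurrent reader sees either the old or the new pointer. */
static void set_category(struct locale_sketch *loc, int idx,
                         const char *name, const void *data)
{
	loc->name[idx] = name;
	atomic_store_explicit(&loc->cat[idx], data, memory_order_release);
}

/* Fetch the current data pointer for one category. No lock is needed
 * because the pointed-to data is never modified or freed. */
static const void *get_category(struct locale_sketch *loc, int idx)
{
	return atomic_load_explicit(&loc->cat[idx], memory_order_acquire);
}
```

A caller would load the replacement category data once, keep it alive
forever, and then call set_category; readers call get_category on
every use and never take a lock.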
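
Going back to the general behavior at the top of this message, the
environment lookup for a locale of "" and the LC_CTYPE rule could be
sketched roughly like this. The helper names default_locale_name and
ctype_is_utf8 are hypothetical and not part of any proposed interface.
Treating an empty variable the same as an unset one follows POSIX.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: choose the locale name for one category when
 * the caller requested "". cat_var is the category's own variable,
 * e.g. "LC_CTYPE". Empty variables are treated as unset. */
static const char *default_locale_name(const char *cat_var)
{
	const char *s;
	if ((s = getenv("LC_ALL")) && *s) return s;   /* overrides all */
	if ((s = getenv(cat_var)) && *s) return s;    /* per-category setting */
	if ((s = getenv("LANG")) && *s) return s;     /* fallback */
	return "C.UTF-8";                             /* proposed default */
}

/* Hypothetical helper: only "C" and "POSIX" select byte-based mode.
 * Any other name, including unrecognized ones, is handled as UTF-8,
 * matching the "aliases for C.UTF-8" rule. */
static int ctype_is_utf8(const char *name)
{
	return strcmp(name, "C") != 0 && strcmp(name, "POSIX") != 0;
}
```

With helpers like these, setlocale(LC_CTYPE, "") would amount to
storing ctype_is_utf8(default_locale_name("LC_CTYPE")) into
ctype_utf8, and it could never fail on an unknown name.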