Message-ID: <20140627190412.GA13087@brightrain.aerifal.cx>
Date: Fri, 27 Jun 2014 15:04:12 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Locale framework RFC

Background:

One of the agenda items for this release cycle is the locale framework.
This is needed both to support a byte-based POSIX locale (the
unfortunate fallout of Austin Group issue #663) and for legitimate
locale purposes like collation, localized time formats, etc.

Note that, at present, musl's "everything is UTF-8" C locale is already
non-conforming to the requirements of ISO C, because it places extra
characters in the C locale's character classes like alpha, etc. beyond
what the standard allows. This could be fixed by making the definitions
of the character classes locale-dependent, but if we just accept the
(bad, wrong, backwards, etc.) new POSIX requirements for the C/POSIX
locale, we get a fix for free: it doesn't matter if iswalpha(0xc0)
returns true in the C locale if the wchar_t value 0xc0 can never be
generated in the C locale.

My proposed solution is to provide a byte-based C locale where bytes
0x80 through 0xff are interpreted as abstract bytes that decode to
wchar_t values which are either invalid Unicode or PUA codepoints. The
latter is probably preferable, since generating invalid codepoints may,
strictly speaking, make it wrong to define __STDC_ISO_10646__.

How does this affect real programs? Not much at all. A program that
hasn't called setlocale() can't expect to be able to use the multibyte
interfaces reasonably anyway, so it doesn't matter that they default to
byte mode when the program starts up. And if the program does call
setlocale correctly (with an argument of "", which means to use the
'default locale', defined by POSIX with the LC_* env vars), it will get
a proper UTF-8-based locale anyway unless the user has explicitly
overridden that by setting LC_CTYPE=C or LC_ALL=C. So really all that's
seriously affected are scripts using LC_CTYPE=C or LC_ALL=C to do
byte-based processing with the standard utilities, and the behavior of
these is "improved".

Implementation:

Three new fields in the libc structure:

1. locale_t global_locale;

This is the locale presently selected by setlocale(); it affects all
threads which have not called uselocale(), or which called uselocale()
with LC_GLOBAL_LOCALE as the argument.

2. int uselocale_cnt;

uselocale_cnt is the current number of threads with a thread-local
locale. It's incremented/decremented (atomically) by the uselocale
function when transitioning from LC_GLOBAL_LOCALE to a thread-local
locale or vice versa, respectively, and also decremented (atomically)
in pthread_exit if the exiting thread has a thread-local locale. The
purpose of having uselocale_cnt is that, whenever uselocale_cnt is
zero, libc.global_locale can be used directly with no TLS access to
determine if the current thread has a thread-local locale.

3. int bytelocale_cnt_minus_1;

This is a second atomic counter which behaves similarly to
uselocale_cnt, except that it is only incremented/decremented when the
thread-local locale being activated/deactivated is non-UTF-8
(byte-based). The global locale set by setlocale is also tracked in the
count, and the result is offset by -1. Initially at program startup
(when setlocale has not been called), the value of
bytelocale_cnt_minus_1 is zero. Setting any locale but "C" or "POSIX"
for LC_CTYPE with setlocale will enable UTF-8 and thus decrement the
value to -1. Setting any thread-local locale to "C" or "POSIX" for
LC_CTYPE will increment the value to something non-negative.
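For concreteness, here is a rough sketch of what this could look like.
The member names, the utf8 flag in the locale object, and the use of
plain volatile ints are all illustrative placeholders, not a committed
layout:

    /* Illustrative only -- not a committed layout. */
    struct __locale_struct {
        int utf8;   /* nonzero for UTF-8-based locales, 0 for the
                     * byte-based C/POSIX locale */
        /* ... per-category data (ctype, collation, messages, ...) ... */
    };
    typedef struct __locale_struct *locale_t;

    struct __libc {
        /* ... existing members ... */
        locale_t global_locale;              /* set by setlocale() */
        volatile int uselocale_cnt;          /* threads with a
                                              * thread-local locale */
        volatile int bytelocale_cnt_minus_1; /* byte-based locales in
                                              * use, minus one */
    };
    extern struct __libc libc;

The two counters would presumably be updated with the existing internal
atomic primitives rather than plain stores, per the rules above.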
All functions which are optimized for the sane case of all data being
UTF-8 therefore have a trivial fast path: if libc.bytelocale_cnt_minus_1
is negative, they can immediately assume UTF-8 with no further tests.
Otherwise, checking libc.uselocale_cnt is necessary to determine whether
to inspect libc.global_locale or __pthread_self()->locale, which in turn
determines whether to decode UTF-8 or treat bytes as abstract bytes.

Per earlier testing I did when Austin Group issue #663 was being
discussed, a single access and conditional jump based on data in the
libc structure does not yield measurable performance cost in UTF-8
decoding. For encoding (wc[r]tomb) there may be a small performance
cost added on archs that need a GOT pointer for GOT-relative accesses
(vs. direct PC-relative), since the current code has no GOT pointer.
Fortunately decoding, not encoding, is the performance-critical
operation.

Code which uses locale:

The basic idiom for getting the locale will be:

    locale_t loc = libc.uselocale_cnt && __pthread_self()->locale
        ? __pthread_self()->locale : libc.global_locale;

And if all that's needed is a UTF-8 flag:

    int is_utf8 = libc.bytelocale_cnt_minus_1 < 0 || loc->utf8;

where "loc" is the result of the previous expression. This test looks
fairly expensive, but the only cases with any cost are when there's at
least one thread with a non-UTF-8 locale. Even when uselocale is in
heavy use, as long as it's not being used to turn off UTF-8, there's no
performance penalty.
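Since these expressions will end up repeated in a number of places,
they would probably be wrapped in small internal helpers rather than
open-coded everywhere. Roughly something like the following; the names
and the choice of static inline functions over macros are placeholders,
not a decided interface:

    /* Rough sketch; builds on the proposed libc fields above and the
     * (also proposed) locale pointer in the pthread structure. */
    #include "libc.h"
    #include "pthread_impl.h"  /* for __pthread_self() */

    static inline locale_t __current_locale(void)
    {
        /* TLS access only when some thread has a thread-local locale */
        return (libc.uselocale_cnt && __pthread_self()->locale)
            ? __pthread_self()->locale : libc.global_locale;
    }

    static inline int __current_utf8(void)
    {
        /* fast path: no byte-based locales exist anywhere */
        if (libc.bytelocale_cnt_minus_1 < 0) return 1;
        return __current_locale()->utf8;
    }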
Components affected:

1. Multibyte functions: must use the above tests to see whether to
process UTF-8 or behave as dumb byte functions. Note that the
restartable multibyte functions (those which use mbstate_t) can skip
the check when the state is not the initial state, since use of the
state after changing locale is UB.

2. Character class functions: should not be affected at all, but we
need to make sure they all return false for the characters decoded from
high bytes in byte-locale mode.

3. Stdio wide mode: it's required to bind to the character encoding in
effect at the time the FILE goes into wide mode, rather than at the
time of the IO operation. So rather than using mbrtowc or wcrtomb, it
needs to store the state at the time of entering wide mode and use a
conversion that's conditional on this saved flag rather than on the
locale.

4. Code which uses mbtowc and/or wctomb assuming they always process
UTF-8: aside from the above-mentioned use in stdio, this is probably
just iconv. To fix this, I propose adding new functions which don't
check the locale but always process UTF-8. These could also be used for
stdio wide mode, and they could use a different API than the standard
functions in order to be more efficient (e.g. returning the decoded
character, or negative for errors, rather than storing the result via a
pointer argument).

5. MB_CUR_MAX macro: it needs to expand to a function call rather than
an integer constant expression, since it has to be 1 for the new POSIX
locale. The function can in turn use the is_utf8 pattern above to
determine the right return value (see the sketch at the end of this
message).

6. setlocale, uselocale, and related functions: these need to implement
the locale switching and the atomic counter logic described above.

7. pthread_exit: needs to decrement the relevant atomic counters.

8. nl_langinfo and nl_langinfo_l: at present, the only item they need
to support on a per-thread basis is CODESET. For the byte-based C
locale, this could be "8BIT", "BINARY", "ASCII+8BIT", or similar. Here
it needs to be decided whether nl_langinfo should be responsible for
determining the locale_t to pass to nl_langinfo_l, or whether
nl_langinfo_l should accept (locale_t)0 and do its own determination.
This issue will also affect other */*_l pairs that need non-trivial
implementations later.

9. iconv: in addition to the above issue in item 4, iconv should
support whatever value nl_langinfo(CODESET) returns for the C locale as
a from/to argument, even if it's largely useless.

Overall Impact:

Should be near-zero for programs that don't use locale-related
features: a few bytes in the global libc struct and a couple of extra
lines in pthread_exit.
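As an illustration of item 5, MB_CUR_MAX could be mapped to a tiny
function built on the same test. The function name and exact definition
here are made up for the sketch and not a decided interface:

    /* Sketch only -- the public macro would be redefined to expand to
     * a call, e.g. #define MB_CUR_MAX (__mb_cur_max()), name TBD. */
    #include <stddef.h>
    #include "libc.h"          /* proposed fields sketched above */
    #include "pthread_impl.h"  /* for __pthread_self() */

    size_t __mb_cur_max(void)
    {
        locale_t loc = libc.uselocale_cnt && __pthread_self()->locale
            ? __pthread_self()->locale : libc.global_locale;
        /* up to 4 bytes per character in UTF-8, 1 in the byte-based
         * C/POSIX locale */
        return (libc.bytelocale_cnt_minus_1 < 0 || loc->utf8) ? 4 : 1;
    }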