Message-ID: <20171108050338.GL1627@brightrain.aerifal.cx>
Date: Wed, 8 Nov 2017 00:03:38 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: setlocale behavior with 'missing' locales

One of the primary concerns when the byte-based C locale was added(*) was not to introduce regressions in the property that musl is "always UTF-8" except when the user or application has explicitly requested a byte-based ("C"/"POSIX") locale.

First, some background: In order for the standard libc interfaces to honor character encoding, a portable program has always needed to call setlocale(LC_CTYPE, "") or setlocale(LC_ALL, ""). Addition of the byte-based C locale "disabled UTF-8" in any application which wasn't calling setlocale, but that was deemed acceptable since such applications were not portable and would not work on other systems anyway.

The other important case to consider was failure of setlocale. Prior to the addition of the byte-based C locale, setlocale was essentially a no-op, and from a practical standpoint it didn't matter if it succeeded or failed because the preexisting "C" locale at program entry already provided UTF-8. But afterwards, if setlocale failed for some reason, applications that were trying to do the right thing would suffer regression.

We ruled out spurious failure for resource exhaustion reasons by making a statically allocated C.UTF-8 locale object. But the other possible source of failure would have been having LC_* variables in the environment (perhaps as a result of ssh'ing from another system or running a musl-linked binary on a glibc-based system) with no corresponding locale files for musl. If we treated that as an error, UTF-8 would have suddenly broken in all sorts of real-world situations, and one of the core original design goals/values of musl would have been broken.
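The portable pattern referred to above (call setlocale(LC_CTYPE, "") once at startup, then let the multibyte functions follow the selected encoding) can be sketched roughly as follows. The helper names init_ctype_from_env and count_chars are made up for illustration and are not part of musl or of any standard; only setlocale, mbrlen, strlen and memset are real library calls.

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: adopt the character encoding named by the
 * environment (LC_ALL, LC_CTYPE, LANG). Returns 1 on success and 0 if
 * setlocale failed, in which case the previous locale stays in effect. */
int init_ctype_from_env(void)
{
    return setlocale(LC_CTYPE, "") != NULL;
}

/* Hypothetical helper: count the characters in s as decoded under the
 * current LC_CTYPE locale. Returns (size_t)-1 if s contains an invalid
 * or truncated multibyte sequence. */
size_t count_chars(const char *s)
{
    mbstate_t st;
    size_t len = strlen(s);
    size_t n = 0;

    memset(&st, 0, sizeof st);          /* start in the initial shift state */
    while (len > 0) {
        size_t k = mbrlen(s, len, &st); /* bytes used by the next character */
        if (k == (size_t)-1 || k == (size_t)-2)
            return (size_t)-1;          /* invalid or incomplete sequence */
        if (k == 0)
            break;                      /* null character; not reachable here */
        s += k;
        len -= k;
        n++;
    }
    return n;
}
```

A program would call init_ctype_from_env() once before any multibyte processing. If that call is skipped or fails, count_chars sees whatever locale was already active, which is exactly the situation where a byte-based "C" locale changes the result for non-ASCII input.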
The choice I made at the time to avoid this was to declare that all locale names are valid locales, and if there's no actual file defining the locale, it's simply a clone of C.UTF-8. So for example if you run with LC_ALL=fr_FR but no fr_FR translation file, you get a locale named fr_FR (that's what setlocale reports as the active locale) but with no translated messages/dates/etc., just UTF-8 character encoding (so you're still able to access all characters properly and use localized or multilingual data).

Unfortunately this turns out to have been something of a tradeoff, since there's no way for applications (and, as it turns out, especially tests/test suites) to query whether a particular locale is "really" available. I've been asked to change the behavior to fail on unknown locale names, but of course that's not a working option in light of the above.

I think there may be a solution that makes everyone happy, but I'm not sure yet. I'm going to follow up with a description and analysis of whether it's valid/conforming.

Rich

(*) References on byte-based C locale:

Subject: [musl] Possible bytelocale patch
Message-ID: <20140703071318.GA10117@...ghtrain.aerifal.cx>

Subject: [musl] Revisiting byte-based C locale
Message-ID: <20150522022203.GA26651@...ghtrain.aerifal.cx>

Subject: [musl] [PATCH] Byte-based C locale, draft 1
Message-ID: <20150606214007.GA17398@...ghtrain.aerifal.cx>

commit 1507ebf837334e9e07cfab1ca1c2e88449069a80
byte-based C locale, phase 1: multibyte character handling functions

commit 16f18d036d9a7bf590ee6eb86785c0a9658220b6
byte-based C locale, phase 2: stdio and iconv (multibyte callers)
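For readers who want to observe the fr_FR behavior described in this message, a small probe along the following lines may help. It is only a sketch: describe_locale is a hypothetical helper name, and the output depends on the libc in use and on which locale files are installed.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: request locale `name` for all categories and
 * write a one-line summary into buf. Returns 1 if the implementation
 * accepted the name, 0 if setlocale rejected it. */
int describe_locale(const char *name, char *buf, size_t buflen)
{
    const char *active = setlocale(LC_ALL, name);

    if (active == NULL) {
        snprintf(buf, buflen, "%s: rejected", name);
        return 0;
    }
    /* nl_langinfo(CODESET) names the character encoding of the locale
     * that is now active. */
    snprintf(buf, buflen, "%s: active=%s codeset=%s",
             name, active, nl_langinfo(CODESET));
    return 1;
}
```

Under the behavior described in the message, calling describe_locale("fr_FR", ...) on musl without an fr_FR locale file would be expected to succeed and report the name fr_FR with a UTF-8 codeset. An implementation that rejects unknown names would return 0 instead. In the accepting case the caller cannot tell from this output whether translations really exist, which is the tradeoff noted above.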