Message-ID: <20140703061107.GA2716@brightrain.aerifal.cx>
Date: Thu, 3 Jul 2014 02:11:07 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: State of LC_CTYPE conformance

Aside from actually providing useful features later on, one of the motivations for adding the locale framework is to address conformance issues in musl's current LC_CTYPE behavior for the C locale. However, I'm not so sure those exist. My analysis of the current situation follows:

ISO C places requirements not on the character class sets for the locale (it doesn't really have such a concept) but rather simply on the behaviors of the ctype.h and wctype.h functions in the C locale:

- islower: exactly the 26 lowercase ASCII letters
- isupper: exactly the 26 uppercase ASCII letters
- isalpha: exactly the union of the above two sets
- isblank: exactly space and tab
- ispunct: exactly printable characters for which neither isspace nor isalnum is true
- isspace: exactly the standard 6 space characters

However, the wide functions are much less restricted; aside from iswblank, which returns true exactly for space and tab, none of these functions have C-locale-specific restrictions.

Thus, as far as I can tell, musl's current behavior is 100% conforming to the requirements of ISO C. The ctype.h functions behave identically to ASCII (since the high bytes are invalid) and the wctype.h functions are free to do full Unicode support.

POSIX, on the other hand, has more restrictive locale requirements. There is a well-defined set of characters for each class, and the ctype.h and wctype.h functions for each class are specified in terms of membership in that set. Thus, for example, POSIX forbids a Latin-1 C locale where the ctype.h functions only reflect ASCII (as required by ISO C) but the wctype.h functions return true as appropriate for characters in the range U+0080 to U+00FF.

As far as I can tell, however, POSIX does not forbid musl's C locale, since the ctype.h functions cannot reflect the presence of multibyte characters in the set being tested against, and thus remain consistent with the wctype.h functions and with the ISO C requirement not to return true for "extra" members.

Does this analysis sound correct? If so, I may hold off on actually adding byte-based LC_CTYPE for the time being and focus on more constructive use of the new locale framework.

Rich
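
The ISO C requirements listed in the message for the ctype.h functions in the C locale can be checked mechanically. What follows is a minimal sketch and not part of the original message. The helper names matches_exactly, punct_is_consistent, and check_c_locale_ctype are hypothetical, and the sketch assumes a hosted C99 implementation in which the program has not called setlocale, so the C locale is still in effect.

```c
#include <ctype.h>
#include <limits.h>
#include <string.h>

/* Hypothetical helper: returns 1 if pred() is true for exactly the
 * characters listed in members (a string of ASCII characters) across
 * the whole unsigned char range, and 0 otherwise. */
static int matches_exactly(int (*pred)(int), const char *members)
{
    if (pred(0))
        return 0; /* NUL belongs to none of the classes tested here */
    for (int c = 1; c <= UCHAR_MAX; c++) {
        /* Only ASCII values are looked up; high bytes must never match. */
        int expected = c < 128 && strchr(members, c) != NULL;
        if ((pred(c) != 0) != expected)
            return 0;
    }
    return 1;
}

/* Hypothetical helper: ispunct must hold for exactly the printable
 * characters for which neither isspace nor isalnum is true. */
static int punct_is_consistent(void)
{
    for (int c = 0; c <= UCHAR_MAX; c++) {
        int expected = isprint(c) && !isspace(c) && !isalnum(c);
        if ((ispunct(c) != 0) != expected)
            return 0;
    }
    return 1;
}

/* Returns the number of failed checks; 0 means every check passed. */
static int check_c_locale_ctype(void)
{
    static const char lower[] = "abcdefghijklmnopqrstuvwxyz";
    static const char upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    static const char alpha[] = "abcdefghijklmnopqrstuvwxyz"
                                "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    int failures = 0;

    failures += !matches_exactly(islower, lower);
    failures += !matches_exactly(isupper, upper);
    failures += !matches_exactly(isalpha, alpha);
    failures += !matches_exactly(isblank, " \t");
    failures += !matches_exactly(isspace, " \t\n\v\f\r");
    failures += !punct_is_consistent();
    return failures;
}
```

Calling check_c_locale_ctype() from main before any setlocale call exercises the C locale, since every program starts in it. The sketch covers only the byte-based ctype.h functions; it says nothing about the wctype.h side or about the POSIX class-membership requirements discussed in the message.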