musl - Re: Revisiting byte-based C locale

Message-ID: <CAMAJcuCK=nh9=Kvsh5h5Et3tw5XDQv5XGy4OozyZOLyft9LgVA@mail.gmail.com>
Date: Thu, 21 May 2015 23:04:47 -0500
From: Josiah Worcester <josiahw@...il.com>
To: musl@...ts.openwall.com
Subject: Re: Revisiting byte-based C locale

On Thu, May 21, 2015 at 9:22 PM, Rich Felker <dalias@...c.org> wrote:
> The last time the byte-based C locale topic was visited ("Possible
> bytelocale patch", http://www.openwall.com/lists/musl/2014/07/03/2),
> it was a rather ugly patch introducing lots of code duplication. Now,
> I believe the callers of multibyte/wide char functions which need to
> always work in UTF-8 mode (iconv) or need to match a previously-saved
> mode (stdio wide functions, which save the encoding in the FILE when
> it becomes wide-oriented) can simply swap __pthread_self()->locale
> back and forth. There is no longer a possibility that the thread
> pointer may be uninitialized, nor a heavy synchronization cost of
> switching thread-local locales from the atomics in uselocale -- commit
> 68630b55c0c7219fe9df70dc28ffbf9efc8021d8 removed all that.
>
> Thus, I think we're at a point where we can evaluate the choice to
> support or not to support a byte-based C locale on the basis of things
> like standards conformance and impact on users and on software
> compatibility without having to weigh implementation costs (which
> would have contributed to "impact on users").
>
> Since last year, the issue of byte-based C locale has come up a few
> more times as a stumbling point for users on the IRC channel and/or
> mailing list (I forget which and haven't gone back to look it up yet).
> In particular, broken configure tests passing binary data to grep
> failed, and I believe one or more language interpreters loading source
> files in the C locale errored out due to a Latin-1 encoded "©"
> character in source comments. Personally I'm in favor of getting the
> broken stuff fixed, but I can see both sides.
>
> There are also minor conformance reasons to consider the byte-based C
> locale even without accepting the resolution to Austin Group issue 663
> (which is supposedly imposing the requirement, someday). In
> particular, the C standard seems to allow the current behavior of
> musl, where the C locale has extra characters for which isw*() return
> true, as long as the non-wide is*() functions don't have such extra
> characters. C doesn't even define abstract character classes that
> these functions report, just loose requirements on their behavior. But
> POSIX specifies LC_CTYPE in terms of character classes which have
> members, and does not leave room for extra characters in the C locale
> as far as I can tell. This could affect real-world usage cases where
> an application intentionally running in the C locale expects the
> regex/fnmatch bracket [[:alpha:]] not to match anything but ASCII
> letters. As mentioned several times in the past, this non-conformance
> could be addressed by changes in the isw*() functions (making them
> locale-aware) rather than by adding the byte-based C locale, but if
> there are other motivations to support the byte-based C locale, it
> may make sense to solve both issues with one change.
>
> Any new opinions on the topic? Or interest in re-emphasizing a
> previously stated opinion? :)
>
> Rich

Given the POSIX rules on LC_CTYPE character classes affecting
[[:alpha:]], it seems to me now that the clear intent (if not the
explicit statement) is in fact for a byte-based C locale. Though
perhaps unfortunate, that does seem to be the most conformant way of
doing it, and conforming looks to have little cost now.