Message-ID: <20140629021423.GV179@brightrain.aerifal.cx>
Date: Sat, 28 Jun 2014 22:14:23 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Locale framework RFC

On Fri, Jun 27, 2014 at 03:04:12PM -0400, Rich Felker wrote:
> Components affected:
> [...]
> 
> 3. Stdio wide mode: It's required to bind to the character encoding in
> effect at the time the FILE goes into wide mode, rather than at the
> time of the IO operation. So rather than using mbrtowc or wcrtomb, it
> needs to store the state at the time of entering wide mode and use a
> conversion that's conditional on this saved flag rather than on the
> locale.
> 
> 4. Code which uses mbtowc and/or wctomb assuming they always process
> UTF-8: Aside from the above-mentioned use in stdio, this is probably
> just iconv. To fix this, I propose adding new functions which don't
> check the locale but always process UTF-8. These could also be used
> for stdio wide mode, and they could use a different API than the
> standard functions in order to be more efficient (e.g. returning the
> decoded character, or negative for errors, rather than storing the
> result via a pointer argument).

These two items are turning out to be something of a pain, in
particular the need for non-locale-sensitive UTF-8 encoding and
decoding functions. One fix is to duplicate mbrtowc.c (and likewise
wcrtomb.c) as a file that is identical except that it omits the locale
check being added, but that's rather ugly. Another would be to process
the first byte in the caller so that the mbstate_t is already
non-initial by the time mbrtowc is called. That would force mbrtowc to
handle the sequence as UTF-8, but it also spreads the logic out into
places I'd rather it not be.
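
For illustration only, here is a minimal sketch of the kind of
non-locale-sensitive decoder item 4 describes: it returns the decoded
character, or a negative value on error, instead of storing the result
through a pointer. The name utf8_decode, the -1/-2 error codes, and the
len out-parameter are all assumptions made for this example; none of
this is existing musl code or a proposed interface.

```c
#include <stddef.h>

/* Hypothetical helper, not a musl function: decode one UTF-8 sequence
 * from s (at most n bytes) without consulting the locale.
 * Returns the code point (>= 0) and stores the sequence length in *len,
 * returns -1 for an invalid sequence, or -2 if the input ends before
 * the sequence is complete. */
static long utf8_decode(const unsigned char *s, size_t n, size_t *len)
{
    size_t need, i;
    long wc, min;
    unsigned c;

    if (n == 0) return -2;
    c = s[0];
    if (c < 0x80) { *len = 1; return c; }

    /* The lead byte fixes the number of continuation bytes and the
     * smallest value that may legally use that length. */
    if (c >= 0xc2 && c <= 0xdf)      { need = 1; wc = c & 0x1f; min = 0x80; }
    else if (c >= 0xe0 && c <= 0xef) { need = 2; wc = c & 0x0f; min = 0x800; }
    else if (c >= 0xf0 && c <= 0xf4) { need = 3; wc = c & 0x07; min = 0x10000; }
    else return -1;

    for (i = 1; i <= need; i++) {
        if (i >= n) return -2;
        if ((s[i] & 0xc0) != 0x80) return -1;
        wc = wc << 6 | (s[i] & 0x3f);
    }

    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (wc < min || wc > 0x10ffff || (wc >= 0xd800 && wc <= 0xdfff))
        return -1;

    *len = need + 1;
    return wc;
}
```

A caller such as iconv could loop on this, advancing by *len on success
and treating -2 as "need more input". Because no locale is checked,
the result is the same regardless of what setlocale has been given.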

Eventually when I do the iconv overhaul, I'd probably like to inline
UTF-8 processing anyway and make it a good deal faster, operating on a
larger intermediate buffer when possible rather than working
character-by-character. However, I don't want the current locale work
to be dependent on future iconv work for correct behavior, so a decent
short-term solution is needed too. And of course the stdio wide
functions need a solution.
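
To make the stdio requirement from item 3 concrete, the following is a
rough sketch of binding the encoding when the stream enters wide mode
and then converting based on that saved flag rather than the current
locale. Everything here is hypothetical: struct wide_stream stands in
for FILE, the function names are invented, and the non-UTF-8 branch
simply assumes a single-byte encoding that passes only ASCII through.

```c
#include <stddef.h>

/* Hypothetical stand-in for the relevant parts of FILE. */
struct wide_stream {
    int orientation;  /* 0 = not yet set, > 0 = wide */
    int utf8;         /* encoding captured when wide mode was entered */
};

/* Record the encoding once, the first time the stream goes wide.
 * Later locale changes do not alter the saved flag. */
static int enter_wide_mode(struct wide_stream *f, int locale_is_utf8)
{
    if (f->orientation == 0) {
        f->orientation = 1;
        f->utf8 = locale_is_utf8;
    }
    return f->orientation;
}

/* Encode wc into out (room for 4 bytes) according to the saved flag.
 * Returns the number of bytes written, or 0 if wc is not encodable. */
static size_t stream_encode(const struct wide_stream *f,
                            unsigned long wc, unsigned char *out)
{
    if (wc < 0x80) { out[0] = (unsigned char)wc; return 1; }
    if (!f->utf8) return 0;  /* assumption: non-UTF-8 means ASCII only */
    if (wc < 0x800) {
        out[0] = (unsigned char)(0xc0 | wc >> 6);
        out[1] = (unsigned char)(0x80 | (wc & 0x3f));
        return 2;
    }
    if (wc < 0x10000) {
        if (wc >= 0xd800 && wc <= 0xdfff) return 0;  /* surrogates */
        out[0] = (unsigned char)(0xe0 | wc >> 12);
        out[1] = (unsigned char)(0x80 | (wc >> 6 & 0x3f));
        out[2] = (unsigned char)(0x80 | (wc & 0x3f));
        return 3;
    }
    if (wc > 0x10ffff) return 0;
    out[0] = (unsigned char)(0xf0 | wc >> 18);
    out[1] = (unsigned char)(0x80 | (wc >> 12 & 0x3f));
    out[2] = (unsigned char)(0x80 | (wc >> 6 & 0x3f));
    out[3] = (unsigned char)(0x80 | (wc & 0x3f));
    return 4;
}
```

The point of the sketch is only that the decision is made from f->utf8,
so a stream that went wide under a UTF-8 locale keeps producing UTF-8
even if the locale is changed afterwards.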

Rich
