musl - Re: [ Guidance ] Potential New Routines; Requesting Help

Message-ID: <CANHA4OgZDm64Tx7aSyTLvku_+7Myu7AuLrP_XEk00YyV12EyiA@mail.gmail.com>
Date: Mon, 30 Dec 2019 22:58:27 -0500
From: JeanHeyd Meneide <phdofthehouse@...il.com>
To: Rich Felker <dalias@...c.org>
Cc: musl@...ts.openwall.com
Subject: Re: [ Guidance ] Potential New Routines; Requesting Help

On Mon, Dec 30, 2019 at 2:57 PM Rich Felker <dalias@...c.org> wrote:
> I don't think these interfaces gives you an "out" in a way that's
> fully conforming. The C model is that there's a set of characters
> supported in the current locale, and each of them has one or more
> multibyte representations (possibly involving shift states) and a
> single wide character representation. Converting between UTF-16 or
> UTF-32 and wchar_t outside the scope of characters that exist in the
> current locale isn't presently a meaningful concept, and wouldn't
> enable you to get meaningful results from wctype.h functions, etc.
> (Would you propose having a second set of such functions for char32_t
> to handle that? Really it sounds like what you want is an out to
> deprecate wchar_t and use char32_t in its place, which wouldn't be a
> bad idea...)

     This is actually something I am extremely interested in tackling.
But I need to make sure everyone can get their data in current
applications from mb/wide characters to the char32_t. Then a potential
<uctype.h> can be worked on that takes case mapping, case folding, and
all of the other useful things Unicode has brought to the table and
work with Unicode Code Points. One of the things I saw before is that
there was a previous proposal to extend wctype.h with other functions
that was very large, and despite being well motivated it did not
succeed in WG14.

     Also on my list of things is the fact that char16_t and char32_t
do not necessarily have to be Unicode (__STDC_UTF_32__ and friends).
This means that if we settle on char32_t for these interfaces, we may
set a potential trap for users who migrate and then try to port to
platforms where c16 does not mean UTF-16, and c32 does not mean
UTF-32. A few static analysis vendors I have been coordinating with,
who cover a very large range of C and C++ compiler implementations,
have reportedly not yet found a compiler where char16_t/char32_t is not
UTF-16/UTF-32 (some platforms forget to define the macros but still use
those encodings). I hope that in the future a paper can be brought to
WG14 to make those encodings required for char16/32_t, rather than
checking the macro and leaving users hung out to dry. Right now everything
de-facto works, but I worry...
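
     To make the macro check concrete, here is a minimal sketch of what
portable code has to do today. The helper names and the ENC_* flags are
hypothetical, made up for illustration; only the __STDC_UTF_16__ and
__STDC_UTF_32__ macros come from the Standard.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical flags, for illustration only. */
enum { ENC_UTF16 = 1, ENC_UTF32 = 2 };

/* Reports which Unicode guarantees the implementation advertises.
 * A missing macro means "no promise", not "definitely not Unicode". */
static int uchar_unicode_guarantees(void)
{
    int flags = 0;
#ifdef __STDC_UTF_16__
    flags |= ENC_UTF16;   /* char16_t values are UTF-16 */
#endif
#ifdef __STDC_UTF_32__
    flags |= ENC_UTF32;   /* char32_t values are UTF-32 */
#endif
    return flags;
}

/* Writes a human-readable summary into buf; returns snprintf's result. */
static int describe_uchar_guarantees(char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
    int flags = uchar_unicode_guarantees();
    return snprintf(buf, size, "char16_t: %s, char32_t: %s",
                    (flags & ENC_UTF16) ? "UTF-16" : "unspecified",
                    (flags & ENC_UTF32) ? "UTF-32" : "unspecified");
}
```

On a platform that uses UTF-16/32 but forgets the macros, this reports
"unspecified", which is exactly the trap described above.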

     Still. I want to introduce each logical piece of functionality in
its own paper, with its own scope and motivation. This, in my opinion,
seems to work much better. Work on transition and replacement, then
deprecate the things which we know from experience are bad. I don't
know if my plan is going to work, but having nobody vote against my
first ever WG14 proposal is a good start and I want to be careful to
not get stuck in Committee on mega-proposals that scare people.

> Solving these problems for implementations burdened by a legacy *wrong
> choice* of definition of wchar_t is not possible by adding more
> interfaces alone; it requires a lot of changes to the underlying
> abstract model of what a character is in C. I'm not really in favor of
> such changes. They complicate and burden existing working
> implementations for the sake of ones that made bad choices. Windows in
> particular *can* and *should* fix wchar_t to be 32-bit. The Windows
> API uses WCHAR, not wchar_t, anyway, so that a change in wchar_t is
> really not a big deal for interface compatibility, and has conformance
> problems like wprintf treating %s/%ls incorrectly that require
> breaking changes to fix. Good stdlib implementations on Windows
> already fix these things.

     They should, absolutely. Still, I think that preventing lossy
conversions for wchar_t usage on platforms where the wide character is
used to interface with the system is a worthwhile endeavor. I don't
think it is feasible (or would ever fly in WG14) to change what
wchar_t is and how it behaves: but I would rather invest time in
implementing interfaces that can offer better and more complete
functionality. I'm trying to keep my changes well-scoped, motivated,
and small.

> The __STDC_ISO_10646__ macro is the way to determine that the encoding
> of wchar_t is Unicode (or some subset if WCHAR_MAX doesn't admit the
> full range). Otherwise it's not something you can meaningfully work
> with except as an abstract number, but in that case you just want to
> avoid it as much as possible and convert directly between multibyte
> characters and char16_t/char32_t. I don't see how converting directly
> between wchar_t and char16_t/char32_t is more useful, even if it is a
> prettier factorization of the code.

     It is an abstract number with no meaning to the developer, but
the platform (e.g., IBM using various GB encodings for wchar_t on
certain platforms where __STDC_ISO_10646__ is not defined) knows that
meaning. My intention is that by letting the Standard Library and
platform handle it, you can get from a blob of abstract numbers to
meaningful text in a Standard way. Not only for wchar_t, but for mb
strings too.
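
     As a rough sketch of "let the library handle it" for the mb case,
the existing C11 mbrtoc32 already lets a program go from locale-encoded
bytes to char32_t without knowing the encoding. The helper name
mbs_to_c32 is hypothetical, and the output is only guaranteed to be
UTF-32 where __STDC_UTF_32__ is defined.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: decodes the NUL-terminated multibyte string src
 * (in the current locale's encoding) into at most cap char32_t values.
 * Returns the number of values written, or (size_t)-1 on an encoding
 * error, a truncated sequence, or insufficient space. */
static size_t mbs_to_c32(const char *src, char32_t *dst, size_t cap)
{
    assert(src != NULL && dst != NULL);
    mbstate_t st;
    memset(&st, 0, sizeof st);          /* initial conversion state */
    size_t left = strlen(src);
    size_t count = 0;

    while (left > 0) {
        char32_t c;
        size_t r = mbrtoc32(&c, src, left, &st);
        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;          /* invalid or incomplete input */
        if (r == 0)
            break;                      /* decoded a null character */
        if (count == cap)
            return (size_t)-1;          /* output buffer too small */
        dst[count++] = c;
        if (r == (size_t)-3)
            continue;                   /* output only, no bytes consumed */
        src += r;
        left -= r;
    }
    return count;
}
```

The same shape does not exist for wchar_t input today, which is the gap
the proposed routines are meant to fill.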

> A far more useful thing to know than wchar_t encoding is the multibyte
> encoding. POSIX gives you this in nl_langinfo(CODESET) but plain C has
> no equivalent. I'd actually like to see WG14 adopt this into plain C.

     This is actually something I am considering! There are a few
sister papers related to this percolating through another Standards
Committee right now; I want to see how that goes before bringing it to
WG14. But, I think that functionality should come in addition to - not
instead of - additional conversion functions. Platforms own wchar_t
and multibyte char encodings: if the user has to write conversion
routines themselves after checking the equivalent of nl_langinfo, we
may end up with incomplete or half-done support for encodings in many
programs!
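
     For reference, a minimal sketch of the POSIX-only check being
discussed. nl_langinfo(CODESET) is the real POSIX interface; the two
helper names are hypothetical, and the loose name matching is my own
assumption, since codeset spellings such as "UTF-8" and "utf8" vary
between platforms.

```c
#include <assert.h>
#include <ctype.h>
#include <langinfo.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if a codeset name spells UTF-8,
 * ignoring case and '-' / '_' separators, otherwise 0. */
static int codeset_is_utf8(const char *name)
{
    assert(name != NULL);
    static const char want[] = "utf8";
    size_t i = 0;

    for (; *name != '\0'; name++) {
        if (*name == '-' || *name == '_')
            continue;                               /* skip separators */
        if (want[i] == '\0' ||
            tolower((unsigned char)*name) != want[i])
            return 0;                               /* mismatch or too long */
        i++;
    }
    return want[i] == '\0';                         /* matched all of "utf8" */
}

/* Asks the platform for the multibyte encoding of the current locale. */
static int current_locale_is_utf8(void)
{
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

Knowing the name is only half the job: anything other than a "yes"
here still leaves the program to find or write a converter itself.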

> On musl (where I'm familiar with performance properties),
> byte-at-a-time conversion is roughly half the speed of bulk, which
> looks big but is diminishingly so if you're actually doing something
> with the result (just converting to wchar_t for its own sake is not
> very useful). Character-at-a-time is probably somewhat less slow than
> byte-at-a-time. When I wrote this I put in heavy effort to make
> byte/character-at-a-time not horribly slow, because it's normally the
> natural programming model. Wide character strings are not an idiomatic
> type to work with in C.

      If it is still okay, I will put my best effort into making sure
the character-at-a-time and similar functions are something you and
other musl contributors can be happy with!
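
     For readers following along, the two programming models being
compared look roughly like this with today's functions. This is only a
sketch: the helper names are hypothetical, and it says nothing about
relative speed on any particular libc.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Bulk model: one call converts the whole string. */
static size_t convert_bulk(const char *src, wchar_t *dst, size_t cap)
{
    assert(src != NULL && dst != NULL);
    return mbstowcs(dst, src, cap);
}

/* Character-at-a-time model: one mbrtowc call per decoded character.
 * Returns the number of wide characters written, or (size_t)-1 on an
 * invalid or truncated sequence. Does not write a terminator. */
static size_t convert_per_char(const char *src, wchar_t *dst, size_t cap)
{
    assert(src != NULL && dst != NULL);
    mbstate_t st;
    memset(&st, 0, sizeof st);          /* initial conversion state */
    size_t left = strlen(src);
    size_t count = 0;

    while (left > 0 && count < cap) {
        size_t r = mbrtowc(&dst[count], src, left, &st);
        if (r == (size_t)-1 || r == (size_t)-2)
            return (size_t)-1;
        if (r == 0)
            break;                      /* decoded a null character */
        src += r;
        left -= r;
        count++;
    }
    return count;
}
```

Both produce the same wide characters for valid input; the difference
is only in how often the caller crosses into the library.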

Sincerely,
JeanHeyd