Message-ID: <CANHA4OhdNZ7wEn7Ntnbd8VY=b0mM-NzYsrwZpcQML8239BJYmA@mail.gmail.com>
Date: Mon, 30 Dec 2019 13:53:45 -0500
From: JeanHeyd Meneide <phdofthehouse@...il.com>
To: Rich Felker <dalias@...c.org>
Cc: Florian Weimer <fw@...eb.enyo.de>, musl@...ts.openwall.com
Subject: Re: [ Guidance ] Potential New Routines; Requesting Help

On Mon, Dec 30, 2019 at 12:28 PM Rich Felker <dalias@...c.org> wrote:
> I think you misunderstood my remarks here. I was not talking about
> invention of new charsets (which we seem to agree should not happen),
> but making it possible to use existing legacy charsets which were
> previously not usable as a locale's encoding due to limitations of the
> C APIs. I see making that possible as counter-productive. It does not
> serve to let users keep doing something they were already doing
> (compatibility), only do to something newly backwards.

My goal is to allow developers to go from an encoding they do not
fully control (the multibyte encoding) to an encoding they know and
can reason about in their program (c8, for example). This is why I am
providing the mb -> cNN and wc -> cNN functions in both
single-character and string forms. The hope is to make it easy to go
from a statically known encoding (modulo difficulties from
__STDC_UTF_16__/__STDC_UTF_32__ not being defined) to the platform
encoding, and vice versa, using functions in the same style as
mb(s)(r)towc(s) and wc(s)(r)tomb(s).

> > ... I will, however, note that the paper
> > specifically wants to add the Restartable versions of "single unit" wc
> > and mb to/from functions.
>
> I don't follow. mbrtowc and wcrtomb already exist and have since at
> least C99.

Apologies, I meant doing wc <-> cNN and mb <-> cNN!

> > ...
> >
> > This means that while wcto* and *towc functions are broken, the
>
> I don't see them as broken. They support every encoding that has ever
> worked in the past as the encoding for a locale (tautologically). The
> only way they're "broken" is if you want to add new locale encodings
> that weren't previously supportable.

Apologies; this was in reference to wide characters being given a
non-UTF-32 interpretation on certain platforms, like Windows and
certain flavors of IBM. They chose 16 bits, which can't accommodate
Unicode without needing multiple wchar_t. Unfortunately, this means
that they were really out of luck before DR488 was accepted: they had
no means to return multiple wchar_t for characters outside the 16-bit
maximum. With DR488, restartable functions have the potential to
convert out properly (albeit the DR was only applied to the char16_t
functions, so while I have a hope and a wish that we can fix it for
their platforms, it might not work out for the wcto* and *towc
functions anyway). The char16_t functions, though, should offer those
platforms a better way out (though not a perfect one: they'll need to
rely on platform knowledge and perform some casts).

> ...
>
> Conversion of arbitrary encodings other than the one in use by the
> locale requires a different API that takes encodings by name or some
> other identifier. The standard (POSIX) API for this is iconv, which
> has plenty of limitations of its own, some the same as what you've
> identified.

Absolutely agreed! I just want the ones that the platform controls
(wide character and multibyte character encodings) to have correct,
simple paths to static encodings that can be used for more rigorous
text processing.

Sincerely,
JeanHeyd Meneide
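
To illustrate the 16-bit limitation discussed above: a code point
outside the Basic Multilingual Plane needs two 16-bit units, so a
function that returns one unit per call has to be restartable and
remember the second unit between calls. The following is only a
minimal sketch of that idea, not the interface proposed in the paper
and not any library's implementation; demo_state and demo_c32_to_c16
are hypothetical names invented for this example.

```c
#include <stdint.h>

/* Hypothetical conversion state: a low surrogate waiting to be
   returned by the next call (0 means nothing is pending). */
typedef struct {
    uint16_t pending;
} demo_state;

/* Hypothetical helper, not a standard function. Writes one UTF-16
   code unit for code point cp into *out per call.
   Returns  1 if the code point is now fully written,
            0 if one more call is needed to fetch the second unit,
           -1 if cp is not a valid Unicode scalar value. */
static int demo_c32_to_c16(uint16_t *out, uint32_t cp, demo_state *st)
{
    if (st->pending != 0) {
        /* Second call for a supplementary code point:
           hand out the stored low surrogate and clear the state. */
        *out = st->pending;
        st->pending = 0;
        return 1;
    }
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return -1;                      /* out of range or a lone surrogate */
    if (cp < 0x10000) {
        *out = (uint16_t)cp;            /* fits in a single 16-bit unit */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    *out = (uint16_t)(0xD800 | (cp >> 10));            /* high surrogate */
    st->pending = (uint16_t)(0xDC00 | (cp & 0x3FF));   /* low surrogate, saved */
    return 0;
}
```

Called in a loop with a zero-initialized demo_state, U+1F600 yields
0xD83D on the first call (return value 0) and 0xDE00 on the second
(return value 1). A non-restartable single-unit function has nowhere
to keep that second unit, which is the situation a 16-bit wchar_t was
in before DR488.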