Message-ID: <CANHA4OgQGzxkg9X8z8m5iKHdCuStEdbnbk0JrNwHV+m8Qf=XoQ@mail.gmail.com>
Date: Mon, 30 Dec 2019 13:39:10 -0500
From: JeanHeyd Meneide <phdofthehouse@...il.com>
To: Rich Felker <dalias@...c.org>
Cc: musl@...ts.openwall.com
Subject: Re: [ Guidance ] Potential New Routines; Requesting Help

On Mon, Dec 30, 2019 at 12:31 PM Rich Felker <dalias@...c.org> wrote:
>
> ...
> This is interesting, but I'm trying to understand the motivation.
>
> If __STDC_ISO_10646__ is defined, wchar_t is UTF-32/UCS-4, and the
> proposed functions are just the identity (for the c32 ones) and
> UTF-16/32 conversion.
>
> If it's not defined, you have the same problem as the current mb/cNN
> functions: there's no reason to believe arbitrary Unicode characters
> can round-trip through wchar_t any better than they can through
> multibyte characters. In fact on such implementations it's likely
> that wchar_t meanings are locale-dependent and just a remapping of
> the byte/multibyte characters.

I'm sorry, I'll try to phrase it as best as I can. The issue I and
others have with the lack of cNNtowc is that, if we are to write
standards-compliant C, the only way to do that transformation from,
for example, char16_t data to wchar_t portably is:

     c16rtomb -> multibyte data -> mbrtowc

The problem with such a conversion sequence is that there are many
legacy encodings, and this causes bugs on many users' machines. Text
representable in both char16_t and wchar_t is lost in the middle
because the multibyte encoding cannot handle it, leaving us in a
place where data is lost going between wchar_t and char16_t. This has
been frustrating for a number of users who try to rely on the
standard, only to have to write the above conversion sequence and
fail. Thus, providing a direct function with no intermediates results
in a better Standard C experience.

A minor but still helpful secondary motivation is giving people on
certain long-standing platforms a way out. By definition, UTF-16 does
not work with wchar_t, so wchar_t on a platform like e.g. Windows is
stuck as UCS-2 (the single-unit predecessor of UTF-16, deprecated a
while ago), and I am explicitly told that I am wrong to use the
Standard Library if I want real Unicode support. Library developers
tell me to rely on platform-specific APIs. The "use
MultiByteToWideChar" or "use ICU" or "use this AIX-specific function"
advice makes it much less of a Standard way to handle text: hence,
the paper to the WG14 C Committee. The restartable versions of the
single-character functions and the bulk conversion functions give
implementations locked into behaving like the deprecated 16-bit,
single-unit UCS-2 encoding a way out, and also allow us to have
lossless data conversion.

This reasoning might be a little bit "overdone" for libraries like
musl and glibc, which got wchar_t right (thank you!), but part of
standardizing these things means I have to account for
implementations that have been around longer than I have been
alive. :)

Does that make sense?

> What situation do you envision where the proposed functions let you
> reliably do something that's not already possible?

My understanding is that libraries such as musl are "blessed" as
distributions of the Standard Library, and that they can access
system information that makes it possible for them to know what the
current "wchar_t encoding" is in a way normal, regular developers
cannot.
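As a concrete illustration of the conversion chain above, here is a
minimal sketch of what strictly portable code has to do today, using
only the standard <uchar.h>/<wchar.h> interfaces. The helper name is
mine, and for brevity it sidesteps surrogate pairs, which would need
two calls sharing one conversion state:

    #include <uchar.h>   /* c16rtomb, char16_t */
    #include <wchar.h>   /* mbrtowc, mbstate_t */
    #include <limits.h>  /* MB_LEN_MAX */

    /* char16_t -> wchar_t through the locale's multibyte encoding.
       If that encoding cannot represent the character (e.g. a CJK
       character under an ISO-8859-1 locale), c16rtomb fails and the
       text is lost, even though both char16_t and wchar_t could
       represent it. */
    static int c16_to_wc_today(wchar_t *dst, char16_t c16)
    {
        char mb[MB_LEN_MAX];
        mbstate_t to_mb = {0}, to_wc = {0};
        size_t n = c16rtomb(mb, c16, &to_mb);
        if (n == 0 || n == (size_t)-1)
            return -1; /* surrogate half (unhandled here), or
                          unrepresentable in the locale encoding */
        size_t r = mbrtowc(dst, mb, n, &to_wc);
        if (r == (size_t)-1 || r == (size_t)-2)
            return -1;
        return 0;
    }

A direct c16-to-wc conversion never has to leave the space both
encodings can represent, so nothing gets dropped in the middle.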
To be specific about that platform knowledge: in the generic external
implementation I have been working on, I have a number of #ifdefs to
check for, say, IBM machines, and then to check whether they are
specifically under zh/tw or even jp locales, because they deploy a
wchar_t in those scenarios that is neither UTF-16 nor UTF-32 (but
instead a flavor of one of the GB or Japanese encodings); otherwise,
IBM uses UTF-16/UCS-2 for wchar_t on i686 and UTF-32 for wchar_t on
x86_64 on certain machines. I also check for what happens on Windows
under various settings as well. Doing this in an external library is
hard, because there is no way I can reliably control those knobs, but
a Standard Library distribution has access to that information (since
it is providing such functions already).

So, for example, musl -- being the C library -- controls how wchar_t
should behave (modulo compiler intervention) for its wide character
functions. Similarly, glibc would know what to do for its platforms,
and IBM would know what to do for its platforms, and so on and so
forth. Each distribution would provide behavior in coordination with
its platform.

Is this incorrect? Am I assuming a level of standard library <->
vendor relation/cooperation that does not exist?

> > While I have a basic implementation, I would like to use some
> > processor and compiler intrinsics to make it faster and make sure
> > my first contribution meets both quality and speed standards for
> > a C library.
> >
> > Is there a place in the codebase I can look to for guidance on
> > how to handle intrinsics properly within musl libc? If there is
> > already infrastructure and common idioms in place, I would rather
> > use that than starting to spin up my own.
>
> I'm not sure what you mean by intrinsics or why you're looking for
> them but I guess you're thinking of something as a performance
> optimization? musl favors having code in straight simple C except
> when there's a strong reason (known bottleneck in existing
> real-world software -- things like memcpy, strlen, etc.) to do
> otherwise. The existing mb/wc code is slightly "vectorized" (see
> mbsrtowcs) but doing so was probably a mistake. The motivation came
> along with one of the early motivations for musl: not making UTF-8 a
> major performance regression like it was in glibc. But it turned out
> the bigger issue was the performance of character-at-a-time and
> byte-at-a-time conversions, not bulk conversion.

My experience so far is that the character-at-a-time functions can
cause severe performance penalties for external users, especially if
the library is dynamically linked. If the C standard provides the
bulk-conversion functions, performance would increase drastically for
users desiring bulk conversion, because they would no longer have to
write a loop around a dynamically-loaded function call to do
conversions one at a time (a sketch of that caller-side loop is
below). I am glad that musl has had a similar experience, and I would
like to make the bulk functions available in musl too!

My asking about intrinsics and such was because I have some
optimizations using hand-vectorized instructions for some bulk cases.
I will be more than happy to just contribute regular and readable
plain C, though, and then revisit those functions if vectorization
with SIMD and other platform-specific instructions turns out to be
worth it. My initial hunch is that it is, but I'm more than happy to
focus on correctness first, extreme performance (maybe) later.
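For reference, the caller-side loop I mean looks roughly like the
sketch below; the bulk function named in the comment (mbsrtoc16s) is
hypothetical, my shorthand for the kind of interface the paper is
after. With dynamic linking, every pass through this loop pays for a
call through the PLT, and the library never sees a long enough run of
data to batch or vectorize anything:

    #include <uchar.h>   /* mbrtoc16, char16_t, mbstate_t */
    #include <string.h>  /* strlen */

    /* Convert a multibyte string to char16_t one character at a
       time -- one shared-library call per character. A bulk function
       (a hypothetical mbsrtoc16s) could run this whole loop inside
       the library in a single call. Sketch only: no special handling
       if dst fills up in the middle of a surrogate pair. */
    static size_t mbs_to_c16s_today(char16_t *dst, size_t dstlen,
                                    const char *src)
    {
        mbstate_t st = {0};
        size_t n = strlen(src), total = 0;
        while (n && total < dstlen) {
            size_t r = mbrtoc16(&dst[total], src, n, &st);
            if (r == (size_t)-1)
                return (size_t)-1; /* invalid sequence */
            if (r == (size_t)-2)
                break;             /* incomplete trailing sequence */
            if (r == (size_t)-3) {
                total++;           /* low half of a surrogate pair,
                                      no input consumed */
                continue;
            }
            src += r;
            n -= r;
            total++;
        }
        return total;
    }

Inside the library, the same loop can batch validation and use wide
loads, which is where my hand-vectorized variants come in.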
> If we do adopt these functions, the right way to do it would be
> using them to refactor the existing c16/c32 functions. Basically,
> for example, the bulk of c16rtomb would become c16rtowc, and
> c16rtomb would be replaced with a call to c16rtowc followed by
> wctomb. And the string ones can all be simple loop wrappers.

I would be more than happy to write the implementation that way! (A
quick sketch of my reading of your refactor is in the P.S. below.)
Most of the wchar_t functions will be very easy, since musl and glibc
chose the right wchar_t. (Talking to other vendors is going to be a
much, much more difficult conversation...)

Best Wishes,
JeanHeyd Meneide
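P.S. Here is my reading of that refactor, as a sketch. The c16rtowc
signature and its return convention (1 for a produced wchar_t, 0 for
a buffered high surrogate, (size_t)-1 on error) are my guesses at the
proposed interface, not settled wording, and a real implementation
would want its own state handling rather than blindly sharing one
mbstate_t between the two halves:

    #include <uchar.h>
    #include <wchar.h>

    /* Proposed single-character conversion (guessed signature). */
    size_t c16rtowc(wchar_t *pwc, char16_t c16, mbstate_t *ps);

    /* The existing c16rtomb then shrinks to a thin wrapper:
       char16_t -> wchar_t, then wchar_t -> multibyte. */
    size_t c16rtomb(char *s, char16_t c16, mbstate_t *ps)
    {
        wchar_t wc;
        size_t r = c16rtowc(&wc, c16, ps);
        if (r == 0 || r == (size_t)-1)
            return r; /* surrogate half buffered, or error */
        return wcrtomb(s, wc, ps);
    }

And the bulk/string functions then really are just simple loops
around the single-character ones, as you say.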