Message-ID: <20191230173106.GI30412@brightrain.aerifal.cx>
Date: Mon, 30 Dec 2019 12:31:06 -0500
From: Rich Felker <dalias@...c.org>
To: JeanHeyd Meneide <phdofthehouse@...il.com>
Cc: musl@...ts.openwall.com
Subject: Re: [ Guidance ] Potential New Routines; Requesting Help

On Tue, Dec 24, 2019 at 06:06:50PM -0500, JeanHeyd Meneide wrote:
> Dear musl Maintainers and Contributors,
>
> I hope this e-mail finds you doing well this Holiday Season! I am
> interested in developing a few fast routines for text encoding for
> musl after the positive reception of a paper for the C Standard
> related to fast conversion routines:
>
> https://thephd.github.io/vendor/future_cxx/papers/source/C%20-%20Efficient%20Character%20Conversions.html

This is interesting, but I'm trying to understand the motivation.

If __STDC_ISO_10646__ is defined, wchar_t is UTF-32/UCS-4, and the
proposed functions are just the identity (for the c32 ones) and
UTF-16/32 conversion.

If it's not defined, you have the same problem as the current mb/cNN
functions: there's no reason to believe arbitrary Unicode characters
can round-trip through wchar_t any better than they can through
multibyte characters. In fact on such implementations it's likely that
wchar_t meanings are locale-dependent and just a remapping of the
byte/multibyte characters.

What situation do you envision where the proposed functions let you
reliably do something that's not already possible?

> While I have a basic implementation, I would like to use some
> processor and compiler intrinsics to make it faster and make sure my
> first contribution meets both quality and speed standards for a C
> library.
>
> Is there a place in the codebase I can look to for guidance on
> how to handle intrinsics properly within musl libc? If there is
> already infrastructure and common idioms in place, I would rather use
> that then starting to spin up my own.

I'm not sure what you mean by intrinsics or why you're looking for
them but I guess you're thinking of something as a performance
optimization? musl favors having code in straight simple C except when
there's a strong reason (known bottleneck in existing real-world
software -- things like memcpy, strlen, etc.) to do otherwise.

The existing mb/wc code is slightly "vectorized" (see mbsrtowcs) but
doing so was probably a mistake. The motivation came along with one of
the early motivations for musl: not making UTF-8 a major performance
regression like it was in glibc. But it turned out the bigger issue
was the performance of character-at-a-time and byte-at-a-time
conversions, not bulk conversion.

If we do adopt these functions, the right way to do it would be using
them to refactor the existing c16/c32 functions. Basically, for
example, the bulk of c16rtomb would become c16rtowc, and c16rtomb
would be replaced with a call to c16rtowc followed by wctomb. And the
string ones can all be simple loop wrappers.

Rich
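
[Editorial sketch, not part of the original message.] The refactoring
described in the last paragraph could look roughly like the following.
This is a minimal illustration only, not musl's code and not the
interface from the paper: the names sketch_c16rtowc, sketch_c16rtomb
and struct c16_state are hypothetical, the state is a plain struct
rather than mbstate_t, and wchar_t is assumed to hold UCS-4 values
(the __STDC_ISO_10646__ case discussed above).

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <uchar.h>
#include <wchar.h>

#if WCHAR_MAX < 0x10FFFF
#error "this sketch assumes wchar_t can hold any Unicode scalar value"
#endif

/* Hypothetical conversion state: a pending high surrogate, or 0. */
struct c16_state { char16_t pending_high; };

/* Hypothetical c16rtowc-style core (assumed name and signature).
 * Returns 1 if *wc was written, 0 if a high surrogate was stored and
 * more input is needed, -1 (errno = EILSEQ) on an invalid sequence. */
int sketch_c16rtowc(wchar_t *wc, char16_t c16, struct c16_state *st)
{
    assert(wc != NULL && st != NULL);
    if (st->pending_high) {
        /* A high surrogate must be followed by a low surrogate. */
        if (c16 < 0xDC00 || c16 > 0xDFFF) {
            st->pending_high = 0;
            errno = EILSEQ;
            return -1;
        }
        uint32_t hi = (uint32_t)st->pending_high - 0xD800u;
        uint32_t lo = (uint32_t)c16 - 0xDC00u;
        *wc = (wchar_t)(0x10000u + (hi << 10) + lo);
        st->pending_high = 0;
        return 1;
    }
    if (c16 >= 0xD800 && c16 <= 0xDBFF) {   /* high surrogate: wait */
        st->pending_high = c16;
        return 0;
    }
    if (c16 >= 0xDC00 && c16 <= 0xDFFF) {   /* lone low surrogate */
        errno = EILSEQ;
        return -1;
    }
    *wc = (wchar_t)c16;                     /* BMP code point */
    return 1;
}

/* The c16rtomb-style wrapper then reduces to the core plus wctomb.
 * Returns the number of bytes written, 0 if more input is needed,
 * or (size_t)-1 on error. s must have room for MB_LEN_MAX bytes. */
size_t sketch_c16rtomb(char *s, char16_t c16, struct c16_state *st)
{
    wchar_t wc;
    int r = sketch_c16rtowc(&wc, c16, st);
    if (r < 0) return (size_t)-1;
    if (r == 0) return 0;
    int n = wctomb(s, wc);                  /* locale-dependent step */
    return n < 0 ? (size_t)-1 : (size_t)n;
}
```

Whether the wctomb step succeeds for a non-ASCII character depends on
the current locale's encoding, which is the round-trip concern raised
earlier in the message. A string variant would be a loop calling
sketch_c16rtomb once per input unit.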