Message-ID: <20180124214539.GY1627@brightrain.aerifal.cx>
Date: Wed, 24 Jan 2018 16:45:39 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Updating Unicode support

On Tue, Jan 23, 2018 at 10:26:02PM -0800, Eric Pruitt wrote:
> On Tue, Jan 23, 2018 at 04:51:33PM -0800, Eric Pruitt wrote:
> > On Tue, Jan 23, 2018 at 06:38:57PM -0500, Rich Felker wrote:
> > > OK. With this in mind, I hope you're also aware that musl's Unicode
> > > tables are all highly optimized for size and (aside from case mapping)
> > > very good speed relative to their size, and are generated mechanically
> > > from the UCD files via some ugly code here:
> > >
> > > https://github.com/richfelker/musl-chartable-tools
>
> I updated my copy of musl to 1.1.18 then recompiled it with and without
> my utf8proc changes using GCC 6.3.0 "-O3" targeting Linux 4.9.0 /
> x86_64:
>
> - Original implementation: 2,762,774B (musl-1.1.18/lib/libc.a)
> - utf8proc implementation: 3,055,954B (musl-1.1.18/lib/libc.a)
> - The utf8proc implementation is ~11% larger. I didn't do any
>   performance comparisons.
>
> > > Do you have an example of characters that caused the problem? I'd like
> > > to better understand how it came up. Maybe glibc is already doing
> > > something different than what I think they're doing.
> >
> > I'll follow-up on this later. I need to recompile a few things before I
> > can give you some concrete examples. I wrote a program for an unrelated
> > project that I can use to compare the width data of glibc, musl libc and
> > my utf8proc-based wcwidth(3), and I'll include that, too.
> >
> > [...]
> >
> > Either I overlooked musl-chartable-tools when I was trying to figure out
> > how to update musl's Unicode tables or they hadn't been posted to the
> > wiki when I last checked. As mentioned above, I'll do some comparisons
> > and get back to you.
>
> I'm using Debian 9, and the version of glibc it ships with (2.24) uses
> Unicode 9. Since musl-1.1.18 uses Unicode 10 data, I'll have to rebuild
> the character tables to do proper comparisons. The text files in
> musl-chartable-tools appear to be out of date:
>
>     data$ head -n5 *.txt
>     ==> DerivedCoreProperties.txt <==
>     # DerivedCoreProperties-6.1.0.txt
>     # Date: 2011-12-11, 18:26:55 GMT [MD]
>     #
>     # Unicode Character Database
>     # Copyright (c) 1991-2011 Unicode, Inc.
>
>     ==> EastAsianWidth.txt <==
>     # EastAsianWidth-6.1.0.txt
>     # Date: 2011-09-19, 18:46:00 GMT [KW]
>     #
>     # East Asian Width Properties
>     #
>
> I know the updated versions of the text files can be downloaded from
> <https://www.unicode.org/Public/10.0.0/ucd/>. Could you please verify
> whether the version of the code that was used to create
> <https://git.musl-libc.org/cgit/musl/commit/?id=c72c1c5> and
> <https://git.musl-libc.org/cgit/musl/commit/?id=54941ed> has been pushed
> to <https://github.com/richfelker/musl-chartable-tools>?

Indeed, it wasn't pushed -- sorry. Done now.

> > I finished reviewing the Unicode Collation Algorithm, and it looks like
> > utf8proc doesn't include the necessary collation information. This is
> > understandable since different locales have different collation rules,
> > but I'm going to propose adding DUCET, the Default Unicode Collation
> > Element Table, on their issue tracker since it doesn't look like it's
> > been discussed yet.
>
> I opened https://github.com/JuliaLang/utf8proc earlier today.

You mentioned it earlier, and yes, collation is also an open problem
for musl. I want to do it based on UCA, not the POSIX localedef form
of collation tables.

Rich