Message-ID: <20180124005133.pdcypbus23yrikgg@sinister.lan.codevat.com>
Date: Tue, 23 Jan 2018 16:51:33 -0800
From: Eric Pruitt <eric.pruitt@...il.com>
To: musl@...ts.openwall.com
Subject: Re: Updating Unicode support

On Tue, Jan 23, 2018 at 06:38:57PM -0500, Rich Felker wrote:
> OK. With this in mind, I hope you're also aware that musl's Unicode
> tables are all highly optimized for size and (aside from case mapping)
> very good speed relative to their size, and are generated mechanically
> from the UCD files via some ugly code here:
>
> https://github.com/richfelker/musl-chartable-tools

The utf8proc library also uses optimized tables for property lookups.
For example, retrieving properties for an individual character is done
using a 2-stage lookup:

    // utf8proc.c:223 at commit 3a10df6
    static const utf8proc_property_t *unsafe_get_property(utf8proc_int32_t uc) {
      /* ASSERT: uc >= 0 && uc < 0x110000 */
      return utf8proc_properties + (
        utf8proc_stage2table[
          utf8proc_stage1table[uc >> 8] + (uc & 0xFF)
        ]
      );
    }

See <https://github.com/JuliaLang/utf8proc/tree/95fc75b/data> for the
gory details. It's on my TODO list to compare the size of the object
files generated using utf8proc against musl's built-in tables. I'll
post the results once I get around to it. It's not an issue for me
personally because I don't use musl on any resource-constrained
systems, but I do appreciate and understand that this is a priority for
you, which is why I suggested making utf8proc an optional feature.

> If you mean that emoji should be considered double-width, I agree with
> that in principle, but everything has to *agree* upon widths in order
> for them to work. If not, terminal contents just get corrupted when
> programs or systems that disagree try to communicate. It would take a
> coordinated effort with glibc, third-party libraries, and programs
> like screen that ship their own wcwidth-equivalent tables to redefine
> them as double-width, and ideally there should probably be some
> Unicode recommendation to document the change.

Hence the ability to compile utf8proc-wcwidth.c as a shared library
that can be used with LD_PRELOAD. Initially I thought everything would
work out once all my applications used the same Unicode release, but I
still noticed inconsistencies and rendering glitches. The final
solution was using LD_PRELOAD to override wcwidth(3) and wcswidth(3) in
applications that either I don't build myself (notably Mutt and M.O.C)
or that I dynamically link -- currently just my graphical terminal
emulator, because I have no interest in trying to statically link
against X11. My other frequently used CLI applications like Bash, GNU
Awk, and tmux are compiled statically using musl libc with my utf8proc
changes. Long story short, I control the entire rendering stack by
building applications I care about myself or using LD_PRELOAD to bend
the ones I don't to my will. I don't think I've had any rendering
problems since I started doing things this way.

> Do you have an example of characters that caused the problem? I'd like
> to better understand how it came up. Maybe glibc is already doing
> something different than what I think they're doing.

I'll follow up on this later. I need to recompile a few things before I
can give you some concrete examples. I wrote a program for an unrelated
project that I can use to compare the width data of glibc, musl libc
and my utf8proc-based wcwidth(3), and I'll include that, too.

> Thanks for pointing out this library -- it looks like something we
> might should add to the wiki as a recommended lib, and seems to
> implement a lot of Unicode functionality that's otherwise only
> available in gigantic bloated libraries like ICU. I'd like to take a
> closer look at it when I get time.

I've been pretty happy with utf8proc so far. My only qualms with it are
the lack of pre-existing implementations of common POSIX functions and
the relatively heavy toolchain used to generate its property tables;
updating the property tables requires Julia, Ruby and FontForge. These
programs are readily available for popular Linux distributions, but
they aren't something I normally have installed on my hosts.

I finished reviewing the Unicode Collation Algorithm, and it looks like
utf8proc doesn't include the necessary collation information. This is
understandable since different locales have different collation rules,
but I'm going to propose adding DUCET, the Default Unicode Collation
Element Table, on their issue tracker since it doesn't look like it's
been discussed yet.

> If someone wants to make local changes or upgrade to newer Unicode
> before it's upstream in musl, these tools generally provide the best
> way to do it.
>
> [...]
>
> Of course it's possible to drop it in to musl's tree locally like you
> did as a hack, but this isn't something musl can really do due to both
> namespace considerations (wcwidth depending on symbols not in reserved
> namespace) and policy about not introducing config switches. But if
> the table contents in utf8proc do differ from musl, you can always use
> the chartable tools package to generate matching tables to drop into
> musl.

Either I overlooked musl-chartable-tools when I was trying to figure
out how to update musl's Unicode tables or they hadn't been posted to
the wiki when I last checked. As mentioned above, I'll do some
comparisons and get back to you.

Eric
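
As a rough illustration of why the 2-stage lookup quoted earlier keeps
the tables small, here is a toy sketch. It is not utf8proc's actual
generator or data: the names toy_properties, toy_stage1, toy_stage2,
classify and build_tables are all made up for this example, and the two
ranges in classify are only a stand-in for real UCD input. The idea it
tries to show is that identical 256-entry blocks are stored once in
stage 2 and shared through stage 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK   256      /* code points covered by one stage-2 block */
#define NBLOCKS 0x1100   /* 0x110000 code points / 256 */

typedef struct { int8_t width; } toy_property_t;

/* Hypothetical property records; index 0 is the default record. */
static const toy_property_t toy_properties[] = { {1}, {2}, {0} };

static uint16_t toy_stage1[NBLOCKS];   /* block number -> offset into stage 2 */
static uint8_t  toy_stage2[8 * BLOCK]; /* offset + low byte -> property index */
static size_t   toy_stage2_len;        /* entries of toy_stage2 in use */

/* Toy classifier standing in for real UCD data (assumption). */
static uint8_t classify(uint32_t uc) {
    if (uc >= 0x0300 && uc <= 0x036F) return 2; /* combining marks: width 0 */
    if (uc >= 0x4E00 && uc <= 0x9FFF) return 1; /* CJK ideographs: width 2 */
    return 0;                                   /* everything else: width 1 */
}

/* Fill both stages, storing each distinct block only once.
 * Returns 0 on success, -1 if toy_stage2 is too small. */
static int build_tables(void) {
    uint8_t block[BLOCK];
    toy_stage2_len = 0;
    for (uint32_t b = 0; b < NBLOCKS; b++) {
        for (uint32_t i = 0; i < BLOCK; i++)
            block[i] = classify(b * BLOCK + i);

        /* Look for an identical block that is already stored. */
        size_t off;
        for (off = 0; off < toy_stage2_len; off += BLOCK)
            if (memcmp(toy_stage2 + off, block, BLOCK) == 0)
                break;

        if (off == toy_stage2_len) {            /* not found: append it */
            if (toy_stage2_len + BLOCK > sizeof toy_stage2)
                return -1;
            memcpy(toy_stage2 + off, block, BLOCK);
            toy_stage2_len += BLOCK;
        }
        toy_stage1[b] = (uint16_t)off;
    }
    return 0;
}

/* Same indexing shape as the quoted unsafe_get_property(). */
static const toy_property_t *toy_get_property(uint32_t uc) {
    assert(uc < 0x110000);
    return toy_properties + toy_stage2[toy_stage1[uc >> 8] + (uc & 0xFF)];
}
```

After calling build_tables() once, toy_get_property(0x4E2D)->width
should be 2 and toy_get_property(0x0301)->width should be 0 with these
toy ranges. Only three distinct blocks end up in stage 2, compared with
the 4352 blocks a flat per-code-point table would need.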