musl - Re: Updating Unicode support

Message-ID: <20180124005133.pdcypbus23yrikgg@sinister.lan.codevat.com>
Date: Tue, 23 Jan 2018 16:51:33 -0800
From: Eric Pruitt <eric.pruitt@...il.com>
To: musl@...ts.openwall.com
Subject: Re: Updating Unicode support

On Tue, Jan 23, 2018 at 06:38:57PM -0500, Rich Felker wrote:
> OK. With this in mind, I hope you're also aware that musl's Unicode
> tables are all highly optimized for size and (aside from case mapping)
> very good speed relative to their size, and are generated mechanically
> from the UCD files via some ugly code here:
>
> https://github.com/richfelker/musl-chartable-tools

The utf8proc library also uses optimized tables for property lookups.
For example, retrieving properties for an individual character is done
using a 2-stage lookup:

    // utf8proc.c:223 at commit 3a10df6
    static const utf8proc_property_t *unsafe_get_property(utf8proc_int32_t uc) {
      /* ASSERT: uc >= 0 && uc < 0x110000 */
      return utf8proc_properties + (
        utf8proc_stage2table[
          utf8proc_stage1table[uc >> 8] + (uc & 0xFF)
        ]
      );
    }

See <https://github.com/JuliaLang/utf8proc/tree/95fc75b/data> for the
gory details. It's on my TODO list to compare the size of the object
files generated using utf8proc against those built with musl's built-in
tables. I'll post the results once I get around to it. It's not an issue
for me personally because I don't use musl on any resource-constrained
systems, but I do appreciate and understand that this is a priority for
you, which is why I suggested making utf8proc an optional feature.
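
As a minimal sketch of the two-stage idea, the lookup can be reproduced
with toy data. Everything below (the toy_* names, the 0x400 limit, and
the table contents) is hypothetical and is not utf8proc's real data; the
point is only that blocks with identical contents can share one stage-2
block, which is where the size savings come from.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy tables covering only U+0000..U+03FF. */
enum { TOY_BLOCK = 256, TOY_LIMIT = 0x400 };

/* Stage 1: one entry per 256-code-point block, holding that block's
 * offset into stage 2. Blocks 0, 1 and 3 are identical (all zero), so
 * they share the stage-2 block at offset 0; only block 2 is unique. */
static const uint16_t toy_stage1[TOY_LIMIT / TOY_BLOCK] = { 0, 0, 256, 0 };

/* Stage 2: per-code-point property indices for each distinct block.
 * Only U+0250 and U+0251 get non-default values in this toy data. */
static const uint8_t toy_stage2[2 * TOY_BLOCK] = {
    [256 + 0x50] = 1,
    [256 + 0x51] = 2,
};

/* Same shape as the utf8proc lookup quoted above: the high bits select
 * a block, the low 8 bits select an entry within it. */
static int toy_property_index(int32_t uc) {
    assert(uc >= 0 && uc < TOY_LIMIT);
    return toy_stage2[toy_stage1[uc >> 8] + (uc & 0xFF)];
}
```

Here four blocks of code points are served by two stage-2 blocks, so the
tables hold 512 entries plus a 4-entry index instead of 1024 entries.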

> If you mean that emoji should be considered double-width, I agree with
> that in principle, but everything has to *agree* upon widths in order
> for them to work. If not, terminal contents just get corrupted when
> programs or systems that disagree try to communicate. It would take a
> coordinated effort with glibc, third-party libraries, and programs
> like screen that ship their own wcwidth-equivalent tables to redefine
> them as double-width, and ideally there should probably be some
> Unicode recommendation to document the change.

Hence the ability to compile utf8proc-wcwidth.c as a shared library
that can be used with LD_PRELOAD. Initially I thought everything would
work out once all my applications used the same Unicode release, but I
still noticed inconsistencies and rendering glitches. The final solution
was using LD_PRELOAD to override wcwidth(3) and wcswidth(3) in
applications that I either don't build myself (notably Mutt and M.O.C)
or link dynamically -- currently just my graphical terminal
emulator, simply because I have no interest in trying to statically link
against X11.
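
The general shape of such an LD_PRELOAD shim can be sketched as follows.
This is not the contents of utf8proc-wcwidth.c: shim_charwidth() is a
hypothetical stand-in with made-up ranges so that the example has no
external dependency, and a real shim would presumably consult a proper
width source (utf8proc provides utf8proc_charwidth(), for instance).

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700
#endif
#include <stddef.h>
#include <wchar.h>

/* Hypothetical width source; the ranges are illustrative only. */
static int shim_charwidth(wchar_t wc) {
    if (wc == 0)
        return 0;                            /* NUL occupies no columns */
    if (wc < 0x20 || (wc >= 0x7F && wc < 0xA0))
        return -1;                           /* C0/C1 controls: not printable */
    if (wc >= 0x1F300 && wc <= 0x1FAFF)
        return 2;                            /* assumed double-width emoji range */
    return 1;
}

/* Defining wcwidth() in a preloaded object makes the dynamic linker
 * resolve the application's calls here instead of in libc. */
int wcwidth(wchar_t wc) {
    return shim_charwidth(wc);
}

/* wcswidth() must be overridden too; otherwise libc's version would
 * keep using libc's own tables. */
int wcswidth(const wchar_t *s, size_t n) {
    int total = 0;
    for (; n > 0 && *s != L'\0'; s++, n--) {
        int w = wcwidth(*s);
        if (w < 0)
            return -1;                       /* any non-printable => -1 */
        total += w;
    }
    return total;
}
```

Assuming the file is saved as shim.c (a hypothetical name), something
like "cc -shared -fPIC shim.c -o shim.so" followed by
"LD_PRELOAD=./shim.so mutt" would apply it to a dynamically linked
program. It has no effect on statically linked binaries.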

My other frequently used CLI applications like Bash, GNU Awk, and tmux
are compiled statically using musl libc with my utf8proc changes. Long
story short, I control the entire rendering stack by building
applications I care about myself or using LD_PRELOAD to bend the ones I
don't to my will. I don't think I've had any rendering problems since I
started doing things this way.

> Do you have an example of characters that caused the problem? I'd like
> to better understand how it came up. Maybe glibc is already doing
> something different than what I think they're doing.

I'll follow up on this later. I need to recompile a few things before I
can give you some concrete examples. I wrote a program for an unrelated
project that I can use to compare the width data of glibc, musl libc and
my utf8proc-based wcwidth(3), and I'll include that, too.
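
For illustration only, a width comparison of this kind might look like
the sketch below. It is not the program mentioned above; diff_widths()
and the two candidate functions are hypothetical placeholders. Since
glibc and musl cannot both supply wcwidth(3) to one process, a real
comparison would more likely dump widths from separate builds and diff
the output.

```c
#include <stdio.h>
#include <wchar.h>

typedef int (*width_fn)(wchar_t);

/* Hypothetical candidate A: every code point is one column wide. */
static int width_narrow(wchar_t wc) {
    (void)wc;
    return 1;
}

/* Hypothetical candidate B: U+1F600..U+1F64F are double-width. */
static int width_emoji_wide(wchar_t wc) {
    return (wc >= 0x1F600 && wc <= 0x1F64F) ? 2 : 1;
}

/* Walk [first, last], count code points where the two functions
 * disagree, and optionally print each disagreement to `out`. */
static long diff_widths(width_fn a, width_fn b,
                        long first, long last, FILE *out) {
    long mismatches = 0;
    for (long cp = first; cp <= last; cp++) {
        int wa = a((wchar_t)cp);
        int wb = b((wchar_t)cp);
        if (wa != wb) {
            mismatches++;
            if (out != NULL)
                fprintf(out, "U+%04lX: %d vs %d\n",
                        (unsigned long)cp, wa, wb);
        }
    }
    return mismatches;
}
```

Calling diff_widths(width_narrow, width_emoji_wide, 0, 0x10FFFF, stdout)
would list every code point on which the two candidates disagree.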

> Thanks for pointing out this library -- it looks like something we
> might should add to the wiki as a recommended lib, and seems to
> implement a lot of Unicode functionality that's otherwise only
> available in gigantic bloated libraries like ICU. I'd like to take a
> closer look at it when I get time.

I've been pretty happy with utf8proc so far. My only qualms with it are
the lack of pre-existing implementations of common POSIX functions and
the relatively heavy toolchain used to generate its property tables;
updating the property tables requires Julia, Ruby and FontForge. These
programs are readily available for popular Linux distributions, but
they aren't something I normally have installed on my hosts.

I finished reviewing the Unicode Collation Algorithm, and it looks like
utf8proc doesn't include the necessary collation information. This is
understandable since different locales have different collation rules,
but I'm going to propose adding DUCET, the Default Unicode Collation
Element Table, on their issue tracker since it doesn't look like it's
been discussed yet.

> If someone wants to make local changes or upgrade to newer Unicode
> before it's upstream in musl, these tools generally provide the best
> way to do it.
>
> [...]
>
> Of course it's possible to drop it in to musl's tree locally like you
> did as a hack, but this isn't something musl can really do due to both
> namespace considerations (wcwidth depending on symbols not in reserved
> namespace) and policy about not introducing config switches. But if
> the table contents in utf8proc do differ from musl, you can always use
> the chartable tools package to generate matching tables to drop into
> musl.

Either I overlooked musl-chartable-tools when I was trying to figure out
how to update musl's Unicode tables or they hadn't been posted to the
wiki when I last checked. As mentioned above, I'll do some comparisons
and get back to you.

Eric