Message-ID: <20180124062602.3nn7xiwo4mgor57y@sinister.lan.codevat.com>
Date: Tue, 23 Jan 2018 22:26:02 -0800
From: Eric Pruitt <eric.pruitt@...il.com>
To: musl@...ts.openwall.com
Subject: Re: Updating Unicode support

On Tue, Jan 23, 2018 at 04:51:33PM -0800, Eric Pruitt wrote:
> On Tue, Jan 23, 2018 at 06:38:57PM -0500, Rich Felker wrote:
> > OK. With this in mind, I hope you're also aware that musl's Unicode
> > tables are all highly optimized for size and (aside from case mapping)
> > very good speed relative to their size, and are generated mechanically
> > from the UCD files via some ugly code here:
> >
> > https://github.com/richfelker/musl-chartable-tools

I updated my copy of musl to 1.1.18 then recompiled it with and without
my utf8proc changes using GCC 6.3.0 "-O3" targeting Linux 4.9.0 /
x86_64:

- Original implementation: 2,762,774B (musl-1.1.18/lib/libc.a)
- utf8proc implementation: 3,055,954B (musl-1.1.18/lib/libc.a)
- The utf8proc implementation is ~11% larger. I didn't do any
  performance comparisons.
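
For anyone who wants to check the arithmetic, here is a minimal sketch of how the percentage follows from the two archive sizes. The constant and function names are made up for illustration; only the byte counts come from the figures above. The exact ratio is closer to 10.6%, which rounds to the ~11% quoted.

```python
# Archive sizes in bytes, copied from the figures above.
ORIGINAL_SIZE = 2_762_774
UTF8PROC_SIZE = 3_055_954


def relative_increase(old: int, new: int) -> float:
    """Return the growth from old to new as a percentage of old."""
    return (new - old) / old * 100.0


growth = relative_increase(ORIGINAL_SIZE, UTF8PROC_SIZE)
print(f"absolute: {UTF8PROC_SIZE - ORIGINAL_SIZE:,} B, relative: {growth:.1f}%")
```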

> > Do you have an example of characters that caused the problem? I'd like
> > to better understand how it came up. Maybe glibc is already doing
> > something different than what I think they're doing.
>
> I'll follow-up on this later. I need to recompile a few things before I
> can give you some concrete examples. I wrote a program for an unrelated
> project that I can use to compare the width data of glibc, musl libc and
> my utf8proc-based wcwidth(3), and I'll include that, too.
>
> [...]
>
> Either I overlooked musl-chartable-tools when I was trying to figure out
> how to update musl's Unicode tables or they hadn't been posted to the
> wiki when I last checked. As mentioned above, I'll do some comparisons
> and get back to you.
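
To give a rough idea of what a per-character width comparison involves, here is a small sketch. It is not the comparison program mentioned above and it does not reproduce the rules of glibc, musl, or utf8proc; it only derives an approximate column width from the Unicode data bundled with Python, under the simplifying assumption that combining and format characters take 0 columns and Wide/Fullwidth characters take 2. The function name `approx_width` is hypothetical.

```python
import unicodedata


def approx_width(ch: str) -> int:
    """Approximate the terminal column width of a single character.

    Assumption for illustration only: real wcwidth(3) implementations
    use more detailed tables and handle control characters differently.
    """
    # Combining marks and format characters occupy no column of their own.
    if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
        return 0
    # East Asian Wide and Fullwidth characters occupy two columns.
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


for ch in ("A", "\u4e2d", "\u0301"):
    print(f"U+{ord(ch):04X} width={approx_width(ch)}")
```

Running the same code points through each libc's wcwidth(3) and diffing the results against a table like this is one way to spot where the implementations disagree.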

I'm using Debian 9, and the version of glibc it ships with (2.24) uses
Unicode 9. Since musl-1.1.18 uses Unicode 10 data, I'll have to rebuild
the character tables to do proper comparisons. The text files in
musl-chartable-tools appear to be out of date:

    data$ head -n5 *.txt
    ==> DerivedCoreProperties.txt <==
    # DerivedCoreProperties-6.1.0.txt
    # Date: 2011-12-11, 18:26:55 GMT [MD]
    #
    # Unicode Character Database
    # Copyright (c) 1991-2011 Unicode, Inc.

    ==> EastAsianWidth.txt <==
    # EastAsianWidth-6.1.0.txt
    # Date: 2011-09-19, 18:46:00 GMT [KW]
    #
    # East Asian Width Properties
    #
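
The same check can be scripted. The sketch below assumes that the first line of each UCD file has the form `# Name-X.Y.Z.txt`, as in the two headers shown above; the regular expression and the function names are illustrative and not part of musl-chartable-tools.

```python
import re

# Matches a first line such as "# EastAsianWidth-6.1.0.txt".
HEADER_RE = re.compile(
    r"^#\s*(?P<name>[A-Za-z]+)-(?P<version>\d+\.\d+\.\d+)\.txt\s*$"
)


def ucd_header_version(first_line: str):
    """Return (file name, version string), or None if the line does not match."""
    match = HEADER_RE.match(first_line)
    if match is None:
        return None
    return match.group("name"), match.group("version")


def is_older_than(version: str, wanted: str) -> bool:
    """Compare dotted versions numerically, so that 6.1.0 sorts before 10.0.0."""
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(version) < as_tuple(wanted)


name, version = ucd_header_version("# EastAsianWidth-6.1.0.txt")
print(name, version, "outdated" if is_older_than(version, "10.0.0") else "current")
```

The versions are compared as integer tuples because a plain string comparison would rank "6.1.0" above "10.0.0".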

I know the updated versions of the text files can be downloaded from
<https://www.unicode.org/Public/10.0.0/ucd/>. Could you please verify
whether the version of the code that was used to create
<https://git.musl-libc.org/cgit/musl/commit/?id=c72c1c5> and
<https://git.musl-libc.org/cgit/musl/commit/?id=54941ed> has been pushed
to <https://github.com/richfelker/musl-chartable-tools>?

> I finished reviewing the Unicode Collation Algorithm, and it looks like
> utf8proc doesn't include the necessary collation information. This is
> understandable since different locales have different collation rules,
> but I'm going to propose adding DUCET, the Default Unicode Collation
> Element Table, on their issue tracker since it doesn't look like it's
> been discussed yet.

I opened https://github.com/JuliaLang/utf8proc earlier today.

Eric
