musl - Re: Updating Unicode support



Message-ID: <20180124225318.edwuzdu53c7f2sts@sinister.lan.codevat.com>
Date: Wed, 24 Jan 2018 14:53:18 -0800
From: Eric Pruitt <eric.pruitt@...il.com>
To: musl@...ts.openwall.com
Subject: Re: Updating Unicode support

On Wed, Jan 24, 2018 at 02:25:06PM -0800, Eric Pruitt wrote:
> On Wed, Jan 24, 2018 at 04:48:53PM -0500, Rich Felker wrote:
> > > I updated my copy of musl to 1.1.18 then recompiled it with and without
> > > my utf8proc changes using GCC 6.3.0 "-O3" targeting Linux 4.9.0 /
> > > x86_64:
> > >
> > > - Original implementation: 2,762,774B (musl-1.1.18/lib/libc.a)
> > > - utf8proc implementation: 3,055,954B (musl-1.1.18/lib/libc.a)
> > > - The utf8proc implementation is ~11% larger. I didn't do any
> > >   performance comparisons.
> >
> > You're comparing the whole library, not character tables. If you
> > compare against all of ctype, it's a 15x size increase. If you compare
> > against just wcwidth, it's a 69x increase.
>
> That was intentional. I have no clue what the common case is for other
> people that use musl, but most applications **I** use make use of
> various parts of musl, so I did the comparison on the library as a
> whole.

If the size of the utf8proc tables is a problem, I'm not sure how you'd
implement the UCA efficiently without them. The UCA requires normalizing
the Unicode strings, and it also needs character property data to
determine which sequence of characters in one string is compared against
which sequence of characters in the other. Perhaps you could compromise
by simply ignoring certain characters and not doing normalization at
all.
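
To illustrate why skipping normalization matters for comparison, here is
a minimal sketch in Python. It is not utf8proc or musl code and it is not
the UCA; the helper names naive_key and normalized_key are made up for
this example, and it only relies on the standard unicodedata module. It
compares two canonically equivalent spellings of the same character by
raw code points, with and without NFD normalization:

```python
import unicodedata

def naive_key(s):
    # Hypothetical helper: compare by raw code points, no normalization.
    return [ord(c) for c in s]

def normalized_key(s):
    # Hypothetical helper: decompose to NFD first, then compare code points.
    # A real UCA implementation would go on to map these to collation
    # elements; that step is omitted here.
    return [ord(c) for c in unicodedata.normalize("NFD", s)]

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Without normalization the two spellings look like different strings.
print(naive_key(precomposed) == naive_key(decomposed))            # False

# After NFD both spellings produce the same code point sequence.
print(normalized_key(precomposed) == normalized_key(decomposed))  # True
```

Without the normalization step, a comparison would treat these two
strings as unequal even though they render identically, which is the
trade-off of the compromise described above.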

Since the utf8proc maintainer seems receptive to my proposed change, I'm
going to implement the collation feature in utf8proc, and if you decide
that utf8proc is worth the bloat, you'll get collation logic for "free."

Eric
