musl - Re: Help establishing wctype character classes

Message-ID: <20120423014414.GK14673@brightrain.aerifal.cx>
Date: Sun, 22 Apr 2012 21:44:14 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: Help establishing wctype character classes

It seems glibc defines these via localedata/gen-unicode-ctype.c in the
following ways:

- Alphabetic: It has complex special-casing for some particular
  characters based on reported errors in Unicode, but basically it
  amounts to all of categories L*, Nl, Nd, and members of category So
  which have "LETTER" in their name.

- Blank: Tab and all of category Zs without <noBreak>.

- Space: The ASCII space class, plus all of Zs, Zl, and Zp without
  <noBreak>.

- Control: Anything named <control>, or in category Zl or Zp.

- Graphic: Any non-control, non-space. 

- Printable: Any non-control.

- Punctuation: Any non-alphanumeric graphic. They cite this as "the
  traditional POSIX definition of punctuation", so I'm inclined to
  think they have a good idea here.

Source:
http://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/gen-unicode-ctype.c

Note that wherever I've said "any", it's actually quantified only over
defined characters; thus any characters added to Unicode later than
the version glibc is sync'd with are reported as non-printable but
also non-control. This seems highly undesirable to me; it's the reason
"less" refuses to show new characters and instead prints <U+xxxx>.

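To make the gap concrete, here is a rough C sketch of the derived classes as summarized above. It is not glibc's code: struct uchar and the g_* functions are made-up names, and the alnum and space fields are assumed to be precomputed by the rules listed earlier. Because every predicate is gated on "defined", a codepoint missing from the data file comes out as neither control nor printable.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical record for one codepoint, as if parsed from UnicodeData.txt. */
struct uchar {
    bool defined;         /* false if the codepoint is absent from the data file */
    const char *name;     /* e.g. "<control>" or "LATIN SMALL LETTER A" */
    const char *category; /* general category, e.g. "Ll", "Zs", "Zl" */
    bool alnum;           /* assumed precomputed from the alphabetic/digit rules */
    bool space;           /* assumed precomputed from the space rule */
};

/* Control: named <control>, or category Zl/Zp. */
static bool g_cntrl(const struct uchar *u)
{
    return u->defined && (strcmp(u->name, "<control>") == 0 ||
                          strcmp(u->category, "Zl") == 0 ||
                          strcmp(u->category, "Zp") == 0);
}

/* Printable: any defined non-control. */
static bool g_print(const struct uchar *u)
{
    return u->defined && !g_cntrl(u);
}

/* Graphic: any printable non-space. */
static bool g_graph(const struct uchar *u)
{
    return g_print(u) && !u->space;
}

/* Punctuation: any non-alphanumeric graphic. */
static bool g_punct(const struct uchar *u)
{
    return g_graph(u) && !u->alnum;
}
```
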
I would rather have every valid _codepoint_ be either control or
printable, and all the non-space printable codepoints be graphic, but
then among the graphic codepoints, only define alphanumeric or
punctuation class for those codepoints assigned to characters in
Unicode. That is, the hierarchy would break down as:

1. All valid codepoints are either control or printable.
2. All printable codepoints are either ASCII space or graphic.
3. All graphic codepoints are either assigned or unassigned.
4. All assigned graphic codepoints (graphic _characters_) are either
   alphanumeric or punctuation.
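
The four-step breakdown can be sketched as a single decision chain in C. Everything here is illustrative: the cp_* names are hypothetical, surrogates are assumed to be excluded from "valid", the control set (C0, DEL, C1, plus U+2028/U+2029) is an assumption, and cp_assigned/cp_alnum are toy stand-ins that only know ASCII where real tables generated from the Unicode data would go.

```c
#include <stdbool.h>
#include <stdint.h>

enum cp_class {
    CP_INVALID, CP_CONTROL, CP_SPACE,
    CP_UNASSIGNED_GRAPHIC, CP_ALNUM, CP_PUNCT
};

/* Assumption: valid means a Unicode scalar value (no surrogates). */
static bool cp_valid(uint32_t c)
{
    return c <= 0x10FFFF && (c < 0xD800 || c > 0xDFFF);
}

/* Assumption: C0, DEL, C1, and the Zl/Zp separators count as control. */
static bool cp_control(uint32_t c)
{
    return c < 0x20 || (c >= 0x7F && c <= 0x9F) ||
           c == 0x2028 || c == 0x2029;
}

/* Toy stand-in: pretend only ASCII is assigned. */
static bool cp_assigned(uint32_t c)
{
    return c < 0x80;
}

/* Toy stand-in: ASCII letters and digits only. */
static bool cp_alnum(uint32_t c)
{
    return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
           (c >= 'a' && c <= 'z');
}

static enum cp_class classify(uint32_t c)
{
    if (!cp_valid(c))    return CP_INVALID;
    if (cp_control(c))   return CP_CONTROL;            /* step 1 */
    if (c == 0x20)       return CP_SPACE;              /* step 2 */
    if (!cp_assigned(c)) return CP_UNASSIGNED_GRAPHIC; /* step 3 */
    return cp_alnum(c) ? CP_ALNUM : CP_PUNCT;          /* step 4 */
}
```

The point of the ordering is that a codepoint unknown to the tables still lands in a printable, graphic class instead of falling out of every class.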

It seems the only arbitrary decision left for us to make is how to
divide the graphic characters between alphanumeric and punctuation.
And this can be done by an explicit definition for either one, in
terms of which the other will be implicitly defined.

Here's a possible definition for alphanumeric:

- All characters with Unicode Alphabetic property (includes L*, Nl,
  and special cases (Other_Alphabetic) defined in PropList.txt).
- All characters with category Nd (digit).

And possibly also:

- Some or all characters with category No (other numeric - this
  includes things like superscripts, vulgar fractions, and
  script-specific numerical notations that are anything other than a
  direct copy of the ten decimal digits). Note that most of these in
  the Latin blocks are traditionally considered punctuation on Unixy
  systems.
- Some or all characters with category So (other symbol) with LETTER
  in their names (just U+2129 turned Greek small letter iota and a
  bunch of useless circled/parenthesized letters).
- Excluding a few special cases like glibc does (2 Thai characters
  that are actually not letters but punctuation, according to
  Theppitak Karoonboonyanan).
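
As a rough illustration of how the mandatory and optional rules would combine, here is a C sketch. struct props, struct policy, and is_alnum are hypothetical names; the property values are assumed to be supplied by tables generated from the Unicode data files, and the "excluded" flag stands for a hand-maintained exception list.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical per-character properties. */
struct props {
    bool alphabetic;      /* Unicode Alphabetic property */
    const char *category; /* general category, e.g. "Nd", "No", "So" */
    const char *name;     /* character name */
    bool excluded;        /* on the special-case exception list */
};

/* The two open questions, expressed as switches. */
struct policy {
    bool include_no;        /* count category No as alphanumeric? */
    bool include_so_letter; /* count So characters named ...LETTER...? */
};

static bool is_alnum(const struct props *p, const struct policy *pol)
{
    if (p->excluded)
        return false;
    if (p->alphabetic || strcmp(p->category, "Nd") == 0)
        return true;
    if (pol->include_no && strcmp(p->category, "No") == 0)
        return true;
    if (pol->include_so_letter && strcmp(p->category, "So") == 0 &&
        strstr(p->name, "LETTER") != NULL)
        return true;
    return false;
}
```

Punctuation then needs no table of its own: it is whatever assigned graphic character is_alnum rejects.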

Does this sound like a reasonable plan? Any tweaks needed?

Rich