Message-ID: <20120423014414.GK14673@brightrain.aerifal.cx>
Date: Sun, 22 Apr 2012 21:44:14 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: Help establishing wctype character classes

It seems glibc defines these via localedata/gen-unicode-ctype.c in the following ways:

- Alphabetic: It has complex special-casing for some particular characters based on reported errors in Unicode, but basically it amounts to all of categories L*, Nl, Nd, and members of category So which have "LETTER" in their name.

- Blank: Tab and all of category Zs without <noBreak>.

- Space: The ASCII space class, plus all of Zs, Zl, and Zp without <noBreak>.

- Control: Anything with <control> as its name or category Zl/Zp.

- Graphic: Any non-control, non-space.

- Printable: Any non-control.

- Punctuation: Any non-alphanumeric graphic. They cite this as "the traditional POSIX definition of punctuation", so I'm inclined to think they have a good idea here.

Source: http://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/gen-unicode-ctype.c

Note that wherever I've said "any", it's actually quantified only over defined characters; thus any characters added to Unicode later than the version glibc is sync'd with are reported as non-printable but also non-control. This seems highly undesirable to me; it's the reason "less" refuses to show new characters and instead prints <U+xxxx>.

I would rather have every valid _codepoint_ be either control or printable, and all the non-space printable codepoints be graphic, but then among the graphic codepoints, only define alphanumeric or punctuation class for those codepoints assigned to characters in Unicode. That is, the hierarchy would break down as:

1. All valid codepoints are either control or printable.
2. All printable codepoints are either ASCII space or graphic.
3. All graphic codepoints are either assigned or unassigned.
4. All assigned graphic codepoints (graphic _characters_) are either alphanumeric or punctuation.

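The four-level hierarchy above could be sketched roughly as follows. This is an illustration only, not glibc's or musl's code. It uses Python's unicodedata module, whose Unicode version depends on the Python build. The helper names, the treatment of private-use codepoints as unassigned, and the placeholder alphanumeric test (L*, Nl, Nd) are all assumptions made for the example.

```python
import unicodedata

MAX_CODEPOINT = 0x10FFFF


def is_valid(cp: int) -> bool:
    """Valid here means in range and not a surrogate (an assumption)."""
    return 0 <= cp <= MAX_CODEPOINT and not (0xD800 <= cp <= 0xDFFF)


def category(cp: int) -> str:
    """General category according to the Unicode data bundled with Python."""
    return unicodedata.category(chr(cp))


def is_control(cp: int) -> bool:
    # Assumption: follow glibc's rule as summarized above. Characters named
    # <control> are exactly category Cc; Zl/Zp are added on top.
    return category(cp) in ("Cc", "Zl", "Zp")


def is_assigned(cp: int) -> bool:
    # Assumption: Cn (unassigned) and Co (private use) both count as
    # "not assigned to a character".
    return category(cp) not in ("Cn", "Co")


def is_alnum_placeholder(cp: int) -> bool:
    # Placeholder only: the real alphanumeric definition is the open question.
    cat = category(cp)
    return cat.startswith("L") or cat in ("Nl", "Nd")


def classify(cp: int) -> str:
    """Walk the hierarchy top-down and return the leaf class as a string."""
    if not is_valid(cp):
        return "invalid"
    if is_control(cp):            # level 1: control vs. printable
        return "control"
    if cp == 0x20:                # level 2: ASCII space vs. graphic (read literally)
        return "space"
    if not is_assigned(cp):       # level 3: assigned vs. unassigned
        return "unassigned graphic"
    # level 4: alphanumeric vs. punctuation, punctuation defined implicitly
    return "alnum" if is_alnum_placeholder(cp) else "punct"
```

For example, classify(0x0A) gives "control" and classify(ord("!")) gives "punct". A codepoint that the bundled Unicode data does not know yet lands in "unassigned graphic", so it is still printable rather than being reported as neither printable nor control.
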
It seems the only arbitrary decision left for us to make is how to divide the graphic characters between alphanumeric and punctuation. And this can be done by an explicit definition for either one, in terms of which the other will be implicitly defined. Here's a possible definition for alphanumeric:

- All characters with Unicode Alphabetic property (includes L*, Nl, and special cases (Other_Alphabetic) defined in PropList.txt).

- All characters with category Nd (digit).

And possibly also:

- Some or all characters with category No (other numeric - this includes things like superscripts, vulgar fractions, and script-specific numerical notations that are anything other than a direct copy of the ten decimal digits). Note that most of these in the Latin blocks are traditionally considered punctuation on Unixy systems.

- Some or all characters with category So (other symbol) with LETTER in their names (just U+2129 turned Greek small letter iota and a bunch of useless circled/parenthesized letters).

- Excluding a few special cases like glibc does (2 Thai characters that are actually not letters but punctuation, according to Theppitak Karoonboonyanan).

Does this sound like a reasonable plan? Any tweaks needed?

Rich