Message-ID: <20130808043035.GK221@brightrain.aerifal.cx>
Date: Thu, 8 Aug 2013 00:30:35 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: Re: Status of Big5 and extensions

On Thu, Aug 08, 2013 at 05:53:21AM +0200, Szabolcs Nagy wrote:
> * Rich Felker <dalias@...ifal.cx> [2013-08-07 22:11:19 -0400]:
> > Since you mentioned Big5-2003, I've been looking into it, and it seems
> > like it should be part of our base Big5 mapping. Diffing moztw's
> > version of it against CP950.TXT (after cleaning up both), I get:
> 
> i checked an other source for big5-2003 and it is bug compatible
> with the moztw one (so it might not be mozilla's fault)
> http://www.csie.ntu.edu.tw/~r92030/project/big5/
> 
> this source maps C255 to 5F5E instead of 5F5D
> (also observed in the icu version of cp950)

Unfortunately this conflicts with the normative Unihan.txt, which says
U+5F5D corresponds to the historical Big5 character C255. If
Unihan.txt is the one that's buggy, we need at least some
justification before making the change.

> > These are all part of ETEN omitted from CP950, and should definitely
> > be in Big5 base.
> > 
> > +0xC6A1 0x2460
> > +0xC6A2 0x2461
> > +0xC6A3 0x2462
> > +0xC6A4 0x2463
> > +...
> > +0xC7F1 0x30F5
> > +0xC7F2 0x30F6
> > 
> > These are also from ETEN. Notably, the Cyrillic block that immediately
> > follows these is still omitted in Big5-2003, for reasons that appear
> > political. Since ETEN, UAO, and HKSCS all have it, I see no reason not
> > to add the Cyrillic block back in here.
> > 
> 
> the C6BF-C6D9 part is incompatible in hkscs and big5-2003
> hkscs == uao != big5-2003 for these codes
> icu agrees with the old hkscs pua codes so this might be
> just a bug in the big5-2003 source

I believe I've dug up the story on this here:

http://lists.whatwg.org/htdig.cgi/whatwg-whatwg.org/2012-April/035389.html

In short, the Big5-2003 mappings from moztw.org are wrong. The "KANGXI
RADICAL" characters in Unicode are compatibility characters. This
means they have compatibility equivalents which should be used in
Unicode documents in place of them, much like the Greek letter μ
should be used in place of the Latin-1 MICRO SIGN (µ). They exist only
for round-trip compatibility with legacy character sets which encode
the character twice. Since Big5 (unlike CNS 11643) does not encode the
Kangxi radicals twice, using them in a mapping to Unicode is incorrect
use of Unicode, regardless of what the mapping table from the
standards body says. Thus, I have no problem with going with the
UAO/HKSCS way.
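
The compatibility relationship can be checked against the Unicode
character database. A minimal sketch in Python, using only the
standard unicodedata module; the helper name compat_target is just
for illustration and has nothing to do with any Big5 table:

```python
import unicodedata

def compat_target(ch):
    """Return the NFKC form of ch, i.e. its compatibility equivalent."""
    return unicodedata.normalize("NFKC", ch)

# U+2F33 is the Kangxi radical; U+00B5 is the Latin-1 MICRO SIGN.
for ch in ("\u2f33", "\u00b5"):
    target = compat_target(ch)
    print("U+%04X %s -> U+%04X %s" % (
        ord(ch), unicodedata.name(ch),
        ord(target), unicodedata.name(target)))
```

Both characters normalize to the ordinary character (U+5E7A and
U+03BC respectively), which is the one a Big5 mapping should target.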

According to the above link, however, HKSCS has introduced a problem.
They encode U+5E7A twice, at C6CD and at FBF4. Since the FBF4 slot
already maps to U+5E7A, they map the one in the C6CD slot to U+2F33
instead. I'm not sure what the right solution to this is; since we're
not interested in round-trip, it might make the most sense to just
ignore it and map them both to the (same) proper character.
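
A sketch of what mapping both slots to the same character would look
like, in Python for illustration only. The two-entry table is a
hypothetical excerpt, and picking the lowest code when encoding is an
arbitrary assumption made for this example, not a decision about what
musl should do:

```python
# Hypothetical excerpt: both slots decode to the same proper character.
DECODE = {
    0xC6CD: 0x5E7A,  # HKSCS maps this slot to U+2F33 instead
    0xFBF4: 0x5E7A,
}

# Assumption: when encoding, keep the lowest Big5 code for each
# character. Any fixed tie-break would do for this sketch.
ENCODE = {}
for code in sorted(DECODE):
    ENCODE.setdefault(DECODE[code], code)

def decode(code):
    """Big5 code -> Unicode code point."""
    return DECODE[code]

def encode(ucs):
    """Unicode code point -> one Big5 code."""
    return ENCODE[ucs]
```

With this, decoding is many-to-one and encode(decode(0xFBF4)) comes
back as C6CD, which is exactly the round-trip loss being accepted.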

> > -0xF9FA 0x256D
> > -0xF9FB 0x256E
> > -0xF9FC 0x2570
> > -0xF9FD 0x256F
> > +0xF9FA 0x2554
> > +0xF9FB 0x2557
> > +0xF9FC 0x255A
> > +0xF9FD 0x255D
> > 
> > This looks like pure Mozilla cruft. Is there any justification for
> > these sorts of changes?
> > 
> 
> these are box drawing chars (like A2A4-A2A7 above),
> the diff is double vs light lines
> 
> cp950 == hkscs == uao != big-2003 (and missing from icu)
> 
> hkscs maps F9FE to FFED instead of 2593 (cp950,uao,icu)

I don't really care so much about these anyway since they do not
affect linguistic content, just warez .nfo files. ;-)
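
For reference, the two sides of that diff can be told apart by their
Unicode character names. A quick sketch in Python with the standard
unicodedata module; the code points are copied from the diff quoted
above, and the names "minus" and "plus" only refer to the diff
markers, not to a particular table:

```python
import unicodedata

# (Big5 code, code point on the "-" line, code point on the "+" line)
DIFF = [
    (0xF9FA, 0x256D, 0x2554),
    (0xF9FB, 0x256E, 0x2557),
    (0xF9FC, 0x2570, 0x255A),
    (0xF9FD, 0x256F, 0x255D),
]

def describe(cp):
    """Return the Unicode character name for a code point."""
    return unicodedata.name(chr(cp))

for big5, minus, plus in DIFF:
    print("%04X: - %s / + %s" % (big5, describe(minus), describe(plus)))
```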

> > Does the above analysis look correct? If so I will go ahead and merge
> > the above changes to Big5 support into musl.
> > 
> > BTW, the only non-PUA part of UAO within the standard Big5 range
> > (89x157 grid) that won't be mapped with these changes is the stuff
> > right after the Cyrillic block. This part does not conflict with
> > current HKSCS, so if I had good sources from both the Taiwan and HK
> > sides supporting the position that these mappings will not conflict
> > with other extensions in current use or with future expansion of
> > HKSCS, we could consider including that part of UAO in the base Big5
> > mapping. At this point this is only an idea for consideration, but we
> > can keep it in mind.
> 
> note that
> C87A, C87C, C8A4 are mapped to 2xxxx in hkscs
> (old hkscs pua codes agree with uao)

OK, so is this non-conflicting?

Rich
