musl - iconv Korean and Traditional Chinese research so far

Message-ID: <20130804165152.GA32076@brightrain.aerifal.cx>
Date: Sun, 4 Aug 2013 12:51:52 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: iconv Korean and Traditional Chinese research so far

OK, so here's what I've found so far. Both legacy Korean and legacy
Traditional Chinese encodings have essentially a single base character
set:

Korean:
KS X 1001 (previously known as KS C 5601)
93 x 94 DBCS grid (A1-FD A1-FE)
All characters in BMP
17484 bytes table space

Traditional Chinese:
Big5 (CP950)
89 x (63+94) DBCS grid (A1-F9 40-7E,A1-FE)
All characters in BMP
27946 bytes table space
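
For reference, the table-space figures above are just lead count x tail count x 2 bytes per 16-bit entry. A minimal sketch of that arithmetic, assuming a dense grid with one uint16_t per cell (the helper names range_len and dbcs_table_bytes are made up for illustration and are not musl code):

```c
#include <stddef.h>
#include <stdint.h>

/* Number of byte values in the inclusive range [lo, hi]. */
static unsigned range_len(unsigned lo, unsigned hi)
{
    return hi - lo + 1;
}

/* Bytes needed for a dense lead x tail grid of 16-bit (BMP) entries.
 * Assumption: every cell is stored, including unassigned ones. */
static size_t dbcs_table_bytes(unsigned lead_count, unsigned tail_count)
{
    return (size_t)lead_count * tail_count * sizeof(uint16_t);
}
```

With these, KS X 1001 is dbcs_table_bytes(range_len(0xA1, 0xFD), range_len(0xA1, 0xFE)) and Big5 is dbcs_table_bytes(range_len(0xA1, 0xF9), range_len(0x40, 0x7E) + range_len(0xA1, 0xFE)).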

Both of these have various minor extensions, but the main extensions
of any relevance seem to be:

Korean:
CP949
Lead byte range is extended to 81-FD (125)
Tail byte range is extended to 41-5A,61-7A,81-FE (26+26+126)
44500 bytes table space

Traditional Chinese:
HKSCS (CP951)
Lead byte range is extended to 88-FE (119)
1651 characters outside BMP
37366 bytes table space for 16-bit mapping table, plus extra mapping
needed for characters outside BMP
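
To make the CP949 figure concrete, here is one possible way to flatten its lead/tail byte ranges into a dense table index. This is only a sketch under the assumption of a single 125 x 178 grid of 16-bit entries; the names cp949_tail_col and cp949_index are hypothetical and this is not how musl necessarily lays out its tables:

```c
#define CP949_LEAD_LO 0x81
#define CP949_LEAD_HI 0xFD
#define CP949_TAILS   178   /* 26 + 26 + 126 */

/* Map a tail byte to a column 0..177, or -1 if it is outside
 * the ranges 41-5A, 61-7A, 81-FE. */
static int cp949_tail_col(unsigned char t)
{
    if (t >= 0x41 && t <= 0x5A) return t - 0x41;        /* cols 0..25   */
    if (t >= 0x61 && t <= 0x7A) return t - 0x61 + 26;   /* cols 26..51  */
    if (t >= 0x81 && t <= 0xFE) return t - 0x81 + 52;   /* cols 52..177 */
    return -1;
}

/* Map a (lead, tail) pair to an index into a dense table,
 * or -1 if either byte is out of range. */
static int cp949_index(unsigned char lead, unsigned char tail)
{
    int col = cp949_tail_col(tail);
    if (lead < CP949_LEAD_LO || lead > CP949_LEAD_HI || col < 0)
        return -1;
    return (lead - CP949_LEAD_LO) * CP949_TAILS + col;
}
```

The last valid pair (FD, FE) lands at index 22249, so the table has 22250 entries, which at 2 bytes each gives the 44500 bytes quoted above.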

The big remaining questions are:

1. How important are these extensions? I would guess the answer is
"fairly important", especially for HKSCS where I believe the
additional characters are needed for encoding Cantonese words, but
it's less clear to me whether the Korean extensions are useful (they
seem to mainly be for the sake of completeness representing most/all
possible theoretical syllables that don't actually occur in words, but
this may be a naive misunderstanding on my part).

2. Are there patterns to exploit? For Korean, ALL of the Hangul
characters are actually combinations of several base letters. Unicode
encodes them all sequentially in a pattern where the conversion to
their constituent letters is purely algorithmic, but there seems to
be no clean pattern in the legacy encodings, as the encodings started
out just encoding the "important" ones then adding less important
combinations in separate ranges.
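
For illustration, the algorithmic pattern on the Unicode side looks roughly like this. The constants are the standard ones for the precomposed Hangul syllable block (U+AC00..U+D7A3) and the conjoining jamo; the function name hangul_decompose is just a name chosen for this sketch:

```c
#include <stdint.h>

enum {
    S_BASE = 0xAC00,  /* first precomposed syllable */
    L_BASE = 0x1100,  /* first leading consonant    */
    V_BASE = 0x1161,  /* first vowel                */
    T_BASE = 0x11A7,  /* one before first trailing consonant */
    L_COUNT = 19, V_COUNT = 21, T_COUNT = 28,
    N_COUNT = V_COUNT * T_COUNT,   /* 588   */
    S_COUNT = L_COUNT * N_COUNT    /* 11172 */
};

/* Decompose a precomposed Hangul syllable into its jamo.
 * Returns the number of jamo written to out (2 or 3),
 * or 0 if s is not a precomposed syllable. */
static int hangul_decompose(uint32_t s, uint32_t out[3])
{
    if (s < S_BASE || s >= S_BASE + S_COUNT)
        return 0;
    uint32_t i = s - S_BASE;
    out[0] = L_BASE + i / N_COUNT;
    out[1] = V_BASE + (i % N_COUNT) / T_COUNT;
    uint32_t t = i % T_COUNT;
    if (t == 0)
        return 2;            /* no trailing consonant */
    out[2] = T_BASE + t;
    return 3;
}
```

For example, U+D55C decomposes to U+1112, U+1161, U+11AB. Nothing comparable falls out of the legacy byte sequences, which is why this does not directly help shrink the tables.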

Worst-case, adding Korean and Traditional Chinese tables will roughly
double the size of iconv.o to around 150k. This will noticeably enlarge
libc.so, but will make no difference to static-linked programs except
those using iconv. I'm hoping we can make these additions less
expensive, but I don't see a good way yet.

At some point, especially if the cost is not reduced, I will probably
add build-time options to exclude a configurable subset of the
supported character encodings. This would not be extremely
fine-grained, and the choices to exclude would probably be just:
Japanese, Simplified Chinese, Traditional Chinese, and Korean. Legacy
8-bit might also be an option but these are so small I can't think of
cases where it would be beneficial to omit them (5k for the tables on
top of the 2k of actual code in iconv). Perhaps if there are cases
where iconv is needed purely for conversion between different Unicode
forms, but no legacy charsets, on tiny embedded devices, dropping the
8-bit tables and all of the support code could be useful; the
resulting iconv would be around 1k, I think.
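
One way such coarse build-time options could look is plain preprocessor guards around each family's entries and tables. This is purely a sketch: the macro names (ICONV_NO_JAPANESE etc.), the struct, and the charset list are all made up for illustration and do not correspond to anything in musl:

```c
#include <string.h>

/* Hypothetical switches; define any of these at build time to
 * drop that family, e.g. -DICONV_NO_KOREAN. */

struct charset { const char *name; };

static const struct charset charsets[] = {
    /* Unicode forms are always present. */
    { "UTF-8" }, { "UTF-16LE" }, { "UTF-32LE" },
#ifndef ICONV_NO_LEGACY8
    { "ISO-8859-1" },
#endif
#ifndef ICONV_NO_JAPANESE
    { "EUC-JP" },
#endif
#ifndef ICONV_NO_SIMPLIFIED_CHINESE
    { "GBK" },
#endif
#ifndef ICONV_NO_TRADITIONAL_CHINESE
    { "BIG5" },
#endif
#ifndef ICONV_NO_KOREAN
    { "EUC-KR" },
#endif
};

/* Returns 1 if the named charset was compiled in, 0 otherwise. */
static int charset_supported(const char *name)
{
    size_t n = sizeof charsets / sizeof charsets[0];
    for (size_t i = 0; i < n; i++)
        if (!strcmp(charsets[i].name, name))
            return 1;
    return 0;
}
```

In a real implementation the same guards would also have to wrap the mapping tables themselves, since those are where nearly all of the size goes.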

Rich