Message-ID: <20130805154344.GJ221@brightrain.aerifal.cx>
Date: Mon, 5 Aug 2013 11:43:45 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: Re: iconv Korean and Traditional Chinese research so far

On Mon, Aug 05, 2013 at 04:28:32PM +0800, Roy wrote:
> Since I'm a Traditional Chinese and Japanese legacy encoding user, I
> think I can say something here.

Great, thanks for joining in with some constructive input! :)

> >Traditional Chinese:
> >HKSCS (CP951)
> >Lead byte range is extended to 88-FE (119)
> >1651 characters outside BMP
> >37366 bytes table space for 16-bit mapping table, plus extra mapping
> >needed for characters outside BMP
> 
> There is another Big5 extension called Big5-UAO, which is used on
> the world's largest telnet-based BBS, "ptt.cc".
> 
> It has two tables, one for Big5-UAO to Unicode, another one is
> Unicode to Big5-UAO.
> http://moztw.org/docs/big5/table/uao250-b2u.txt
> http://moztw.org/docs/big5/table/uao250-u2b.txt
> 
> It extends the DBCS lead byte range down to 0x81.

Is it a superset of HKSCS or does it assign different characters to
the range covered by HKSCS?
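
For reference, the 37366-byte figure quoted above for HKSCS is consistent with a dense 16-bit table covering 119 lead bytes (0x88-0xFE) times the 157 Big5 trail bytes (0x40-0x7E and 0xA1-0xFE). A minimal sketch of that arithmetic and the matching index computation follows. The flat row-major layout and all function names are assumptions made for illustration, not a description of how musl's iconv stores its tables.

```c
#include <stddef.h>
#include <stdint.h>

/* Lead byte range quoted for HKSCS; trail bytes are the usual Big5 set. */
enum { LEAD_MIN = 0x88, LEAD_MAX = 0xFE, TRAILS = 63 + 94 };

/* Map a trail byte to 0..156, or -1 if it is not a valid Big5 trail byte. */
int big5_trail_index(unsigned trail)
{
    if (trail >= 0x40 && trail <= 0x7E) return (int)(trail - 0x40);
    if (trail >= 0xA1 && trail <= 0xFE) return (int)(trail - 0xA1) + 63;
    return -1;
}

/* Index into a hypothetical flat, row-major table; -1 if out of range. */
long big5_table_index(unsigned lead, unsigned trail)
{
    int ti = big5_trail_index(trail);
    if (lead < LEAD_MIN || lead > LEAD_MAX || ti < 0) return -1;
    return (long)(lead - LEAD_MIN) * TRAILS + ti;
}

/* Size in bytes of such a table with one 16-bit entry per slot. */
size_t big5_table_bytes(void)
{
    return (size_t)(LEAD_MAX - LEAD_MIN + 1) * TRAILS * sizeof(uint16_t);
}
```

Lowering the lead byte to 0x81, as Big5-UAO does, would add 7 more rows of 157 entries to a table laid out this way.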

> In EUC-KR (MS-CP949), there are Hanja characters (i.e. Kanji
> characters in Japanese) and Japanese Katakana/Hiragana in addition
> to Hangul characters.

Yes, I'm aware of these. However, it looks to me like the only
characters outside the standard 94x94 grid zone are Hangul syllables,
and they appear in codepoint order. If so, even if there's not a good
pattern to where they're located, merely knowing that the ones that
are missing from the 94x94 grid are placed in order in the expanded
space is sufficient to perform algorithmic (albeit inefficient)
conversion. Does this sound correct?
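
To make the idea concrete, here is a rough sketch of the "algorithmic (albeit inefficient)" approach, assuming the premise above holds: the Nth slot of the expanded space holds the Nth Hangul syllable (in Unicode order, U+AC00..U+D7A3) that is absent from the 94x94 grid. The predicate type in_grid_fn, both function names, and toy_in_grid are hypothetical. A real implementation would test membership against the actual KS X 1001 table and would also need to map the slot index to and from byte pairs, which is not shown.

```c
#include <assert.h>
#include <stdint.h>

#define HANGUL_FIRST 0xAC00u
#define HANGUL_LAST  0xD7A3u

/* Hypothetical predicate: nonzero if the syllable has a 94x94 grid slot. */
typedef int (*in_grid_fn)(uint32_t cp);

/* Extended-space slot n (0-based) -> Unicode, by linear scan; 0 if none. */
uint32_t ext_index_to_unicode(unsigned n, in_grid_fn in_grid)
{
    assert(in_grid);
    for (uint32_t cp = HANGUL_FIRST; cp <= HANGUL_LAST; cp++) {
        if (in_grid(cp)) continue;      /* already encoded in the grid */
        if (n-- == 0) return cp;        /* nth syllable missing from grid */
    }
    return 0;
}

/* Unicode -> extended-space slot, or -1 if cp is not in that space. */
long unicode_to_ext_index(uint32_t cp, in_grid_fn in_grid)
{
    assert(in_grid);
    if (cp < HANGUL_FIRST || cp > HANGUL_LAST || in_grid(cp)) return -1;
    long n = 0;
    for (uint32_t c = HANGUL_FIRST; c < cp; c++)
        if (!in_grid(c)) n++;           /* count earlier missing syllables */
    return n;
}

/* Toy stand-in for testing only: NOT real KS X 1001 membership data. */
int toy_in_grid(uint32_t cp)
{
    return (cp - HANGUL_FIRST) % 4 == 0;
}
```

Each lookup scans up to 11172 syllables, so this trades speed for not storing a separate table for the expanded space.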

> >Worst-case, adding Korean and Traditional Chinese tables will roughly
> >double the size of iconv.o to around 150k. This will noticeably enlarge
> >libc.so, but will make no difference to static-linked programs except
> >those using iconv. I'm hoping we can make these additions less
> >expensive, but I don't see a good way yet.
> 
> For static linking, can we have conditional linking like QT does?

My feeling is that it's a tradeoff, and probably has more pros than
cons. Unlike QT, musl's iconv is extremely small. Even with all the
above, the size of iconv.o will be under 130k, maybe closer to 110k.
If you actually use iconv in your program, this is a small price to
pay for having it fully functional. On the other hand, if linking it
is conditional, you have to consider who makes the decision, and when.
If it's at link time for each application, that's probably too much of
a musl-specific version. If it's at build time for musl, then is it
your device vendor deciding for you what languages you need? One of
the biggest headaches of uClibc-based systems is finding that the
system libc was built with important options you need turned off and
that you need to hack in a replacement to get something working...

I think the cost of getting stuck with broken binaries where charsets
were omitted is sufficiently greater than the cost of adding a few
tens of kb to static binaries using iconv, that we should only
consider a build time option if embedded users are actively reporting
size problems.

Rich
