Message-ID: <20130802203157.GA14682@brightrain.aerifal.cx>
Date: Fri, 2 Aug 2013 16:31:58 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Request for help/info for Korean iconv task

Hi all,

One of the big goals for the next release cycle is getting Korean
legacy encoding support into iconv. On quick inspection, I see
KS-C-5601 and KS-X-1001, the latter of which seems to be a subset of
the former, as the base character sets involved. These appear to be
much like JIS0208 for Japanese: not directly usable, since they
overlap with ASCII, and requiring an encoding that maps them to usable
byte values. Further, it seems all the encodings in real-world use
(covering Unix, Windows, Mac) are based on the EUC scheme, which is
the least insane of the legacy DBCS designs. (An ISO-2022 based form
is also possibly used in email and on IRC; supporting this would be
dependent on stateful iconv, which is a separate agenda item.)
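
For reference, the EUC scheme mentioned above can be sketched roughly as
follows: a character at row r, column c (each in 1..94) of the 94x94 set
is sent as the two bytes 0xA0+r and 0xA0+c, so both bytes have the high
bit set and cannot collide with ASCII. This is only an illustration in
Python, not proposed iconv code. The function names are made up for the
example, and the cross-check assumes Python's bundled euc_kr codec agrees
with the KS X 1001 table.

```python
def euc_encode(row, col):
    """Map a 94x94 cell (1-based row and column) to its two EUC bytes."""
    if not (1 <= row <= 94 and 1 <= col <= 94):
        raise ValueError("row and column must be in 1..94")
    return bytes([0xA0 + row, 0xA0 + col])


def euc_decode(pair):
    """Map two EUC bytes back to the (row, column) cell they encode."""
    lead, trail = pair
    if not (0xA1 <= lead <= 0xFE and 0xA1 <= trail <= 0xFE):
        raise ValueError("not a two-byte EUC sequence")
    return lead - 0xA0, trail - 0xA0


# Cross-check against Python's codec: U+AC00 sits at row 16, column 1.
assert euc_encode(16, 1) == "\uac00".encode("euc_kr")
assert euc_decode("\uac00".encode("euc_kr")) == (16, 1)
```

Since a byte below 0x80 is always ASCII, a decoder can tell from each
lead byte whether it starts a one-byte or a two-byte character, without
carrying shift state the way an ISO-2022 form does.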

There are several issues I could use some help getting the full story
on:

1. Despite the big ones all being EUC-based, there seem to be several
variants: EUC-KR, Windows-949, and maybe others. It's not immediately
clear to me whether they differ in significant ways we would need to
support.

2. Long runs of the character table are Hangul syllables (obviously),
but it's unclear to me whether they are sufficiently organized in
patterns that knowing the pattern would enable us to elide large
enough parts of the table to be worthwhile.

3. There's some "Johab" encoding which may or may not be important,
and I'm not sure how it's related to the others.

4. Are there perhaps any stats on overall usage of different charsets
on the internet, on which we could base some judgements on relevance?
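
On point 2, a quick way to look at the structure of the Hangul block is
sketched below. Unicode lays out all 11172 precomposed syllables
arithmetically (19 initials x 21 medials x 28 finals, starting at U+AC00),
whereas the legacy set carries only a 2350-syllable selection in rows 16
to 40. The sketch assumes Python's euc_kr codec matches the KS X 1001
table; the helper names are hypothetical. It only checks ordering and gap
sizes and says nothing about which table encoding would be smallest.

```python
HANGUL_BASE = 0xAC00  # first precomposed Hangul syllable in Unicode


def ksx1001_hangul():
    """Return the Unicode code points of rows 16..40, in table order."""
    cps = []
    for lead in range(0xB0, 0xC9):        # rows 16..40
        for trail in range(0xA1, 0xFF):   # columns 1..94
            cps.append(ord(bytes([lead, trail]).decode("euc_kr")))
    return cps


def decompose(cp):
    """Split a syllable into (initial, medial, final) indices."""
    index = cp - HANGUL_BASE
    return index // 588, (index % 588) // 28, index % 28


cps = ksx1001_hangul()
gaps = [b - a for a, b in zip(cps, cps[1:])]
print("syllables:", len(cps))
print("in Unicode order:", cps == sorted(cps))
print("distinct gap sizes:", len(set(gaps)))
```

If the table turns out to be strictly increasing with irregular gaps, as
this check is meant to reveal, then no closed formula reproduces it, but
storing the gaps (or a bitmap over the Unicode range) rather than full
16-bit values might still save space. That trade-off would need measuring.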

Here are the Unicode tables:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP949.TXT
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSC5601.TXT
etc.
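
These files are plain text. Assuming the usual layout of the unicode.org
mapping tables (a hex legacy code, a hex Unicode code point, then a '#'
comment, separated by whitespace), a loader could look like the sketch
below. The sample lines are illustrative and not copied from the files,
and the function name is made up. Entries with no Unicode column, such as
lead-byte markers, are skipped.

```python
def parse_mapping(text):
    """Parse mapping-table text into a {legacy_code: unicode_cp} dict."""
    table = {}
    for line in text.splitlines():
        # Drop the trailing comment, then split the remaining columns.
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue  # blank line, comment-only line, or unmapped entry
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


# Illustrative sample in the assumed format (not verbatim file contents).
SAMPLE = """\
# illustrative sample
0x41\t0x0041\t# ASCII letter
0x81\t\t# lead byte, no mapping of its own
0xB0A1\t0xAC00\t# first Hangul syllable in the EUC range
"""

table = parse_mapping(SAMPLE)
print({hex(k): hex(v) for k, v in table.items()})
```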

This also looks like a useful document, but I have not yet had time to
make sense of all of it:

http://stason.org/TULARC/languages/korean/8-What-are-KS-X-1001-KS-C-5601-and-other-Hangul-codes.html#.UfwXr6omNmk

Rich
