Message-ID: <20130802203157.GA14682@brightrain.aerifal.cx>
Date: Fri, 2 Aug 2013 16:31:58 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Request for help/info for Korean iconv task

Hi all,

One of the big goals for the next release cycle is getting Korean
legacy encoding support into iconv. On quick inspection, I see
KS-C-5601 and KS-X-1001, the latter of which seems to be a subset of
the former, as the base character sets involved. These appear to be
much like JIS0208 for Japanese: not directly usable, since they
overlap with ASCII, and requiring an encoding that maps them to usable
byte values.

Further, it seems all the encodings in real-world use (covering Unix,
Windows, Mac) are based on the EUC scheme, which is the least insane
of the legacy DBCS designs. (An ISO-2022 based form is also possibly
used in email and on IRC; supporting this would be dependent on
stateful iconv, which is a separate agenda item.)

There are several issues I could use some help getting the full story
on:

1. Despite the big ones all being EUC-based, there seem to be several
   variants: EUC-KR, Windows-949, and maybe others. It's not
   immediately clear to me whether they differ in significant ways we
   would need to support.

2. Long runs of the character table are Hangul syllables (obviously),
   but it's unclear to me whether they are sufficiently organized in
   patterns that knowing the pattern would enable us to elide large
   enough parts of the table to be worthwhile.

3. There's some "Johab" encoding which may or may not be important,
   and I'm not sure how it's related to the others.

4. Are there perhaps any stats on overall usage of different charsets
   on the internet, on which we could base some judgements on
   relevance?
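As a rough illustration of the EUC scheme mentioned above: a
double-byte character is two bytes, each in the range 0xA1 to 0xFE,
and each selecting one of 94 rows or 94 columns of the base character
set. The sketch below is only one way a lookup index might be
computed. The name euc_index is hypothetical, not an existing musl
function, and the code deliberately ignores any extended byte ranges
that variants such as Windows-949 may add.

```c
/* Hypothetical helper (not part of musl): turn a plain EUC two-byte
 * sequence into a zero-based index into a 94x94 table.
 * Returns -1 if either byte falls outside 0xA1..0xFE. */
static int euc_index(unsigned char lead, unsigned char trail)
{
	if (lead < 0xA1 || lead > 0xFE) return -1;
	if (trail < 0xA1 || trail > 0xFE) return -1;
	/* Row comes from the lead byte, column from the trail byte. */
	return (lead - 0xA1) * 94 + (trail - 0xA1);
}
```

With this layout, a full 94x94 table has 8836 slots, and a byte below
0xA1 in either position is rejected rather than mapped.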

Here are the Unicode tables:

http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP949.TXT
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSX1001.TXT
http://www.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/KSC/KSC5601.TXT
etc.

This also seems like a useful document, but I have not had time to
make sense of it all yet:

http://stason.org/TULARC/languages/korean/8-What-are-KS-X-1001-KS-C-5601-and-other-Hangul-codes.html#.UfwXr6omNmk

Rich
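
One possible way to probe question 2 against tables like the ones
linked above is to count how many runs of consecutive Unicode values a
table breaks into: few long runs would suggest large parts could be
elided, many short runs would suggest not. The sketch below assumes
each data line has the form "0xCODE<whitespace>0xUNICODE" followed by
an optional comment, which should be checked against each file before
relying on it. struct mapping, parse_line and count_runs are
hypothetical names introduced only for illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical table entry: one (legacy code, Unicode) pair. */
struct mapping {
	unsigned code; /* legacy double-byte code */
	unsigned uni;  /* Unicode code point */
};

/* Parse one line, assuming two leading hex fields. Returns 1 on
 * success, 0 for comment lines or lines lacking both fields. */
static int parse_line(const char *line, struct mapping *m)
{
	/* %x accepts an optional 0x prefix, like strtoul with base 16. */
	return sscanf(line, "%x %x", &m->code, &m->uni) == 2;
}

/* Count maximal runs in which each entry's Unicode value is exactly
 * one more than the previous entry's. Entries are assumed to be
 * sorted by legacy code already. */
static size_t count_runs(const struct mapping *m, size_t n)
{
	size_t runs = 0;
	for (size_t i = 0; i < n; i++)
		if (i == 0 || m[i].uni != m[i-1].uni + 1)
			runs++;
	return runs;
}
```

Comparing the result of count_runs to the total number of entries
gives a crude measure of how much a run-based representation could
save; it says nothing about other kinds of patterns.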