musl - Re: Re: iconv Korean and Traditional Chinese research so far


Message-ID: <20130805173144.GL221@brightrain.aerifal.cx>
Date: Mon, 5 Aug 2013 13:31:45 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: Re: iconv Korean and Traditional Chinese research so far

On Mon, Aug 05, 2013 at 11:43:45AM -0400, Rich Felker wrote:
> > In EUC-KR (MS-CP949), there are Hanja characters (i.e. Kanji
> > characters in Japanese) and Japanese Katakana/Hiragana in addition
> > to Hangul characters.
> 
> Yes, I'm aware of these. However, it looks to me like the only
> characters outside the standard 94x94 grid zone are Hangul syllables,
> and they appear in codepoint order. If so, even if there's not a good
> pattern to where they're located, merely knowing that the ones that
> are missing from the 94x94 grid are placed in order in the expanded
> space is sufficient to perform algorithmic (albeit inefficient)
> conversion. Does this sound correct?

I've verified that this is correct and committed an implementation of
Korean conversion based on this principle, which I basically copied
from my current implementation of GB18030's support for arbitrary
Unicode codepoints. It has not been heavily tested, but I did test it
casually with all the important boundary values and it seems correct.
Tests should probably be added to the test suite.
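
For illustration, the principle can be sketched roughly as follows.
This is a minimal sketch and not the committed code: the table
in_grid[] is hypothetical toy data standing in for the sorted list of
Hangul syllables that do have a slot in the 94x94 grid, and the
function names are made up for this example. The idea is only that a
syllable missing from the grid has a position in the expanded space
equal to its rank among all missing syllables, so that rank can be
computed by counting instead of being stored. Turning the rank into
actual lead/trail bytes is left out here.

```c
#include <stddef.h>

#define HANGUL_BASE 0xAC00u  /* first precomposed Hangul syllable */
#define HANGUL_END  0xD7A3u  /* last precomposed Hangul syllable */

/* Hypothetical toy data: sorted codepoints that are present in the
 * 94x94 grid. A real table would list every in-grid syllable. */
static const unsigned in_grid[] = { 0xAC00, 0xAC01, 0xAC04, 0xAC07 };
#define N_IN_GRID (sizeof in_grid / sizeof in_grid[0])

/* Number of in-grid syllables strictly below c (binary search). */
static size_t grid_count_below(unsigned c)
{
	size_t lo = 0, hi = N_IN_GRID;
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		if (in_grid[mid] < c) lo = mid + 1;
		else hi = mid;
	}
	return lo;
}

/* Encoding direction: rank of c among the syllables missing from the
 * grid, or -1 if c is not a Hangul syllable or is in the grid. */
long ext_index_from_cp(unsigned c)
{
	size_t below;
	if (c < HANGUL_BASE || c > HANGUL_END) return -1;
	below = grid_count_below(c);
	if (below < N_IN_GRID && in_grid[below] == c) return -1;
	return (long)(c - HANGUL_BASE) - (long)below;
}

/* Decoding direction: codepoint of the idx-th missing syllable, or 0
 * if idx is out of range. Start at BASE+idx and keep advancing by the
 * number of in-grid syllables skipped so far until it stabilizes. */
unsigned cp_from_ext_index(long idx)
{
	unsigned c, next;
	if (idx < 0 || idx > (long)(HANGUL_END - HANGUL_BASE)) return 0;
	c = HANGUL_BASE + (unsigned)idx;
	for (;;) {
		/* in-grid syllables at or below the current guess */
		size_t skipped = grid_count_below(c + 1);
		next = HANGUL_BASE + (unsigned)idx + (unsigned)skipped;
		if (next == c) break;
		c = next;
	}
	return c <= HANGUL_END ? c : 0;
}
```

With the toy table above, the missing syllables in order are U+AC02,
U+AC03, U+AC05, U+AC06, U+AC08, and so on, so U+AC05 has rank 2 and
rank 2 decodes back to U+AC05. The two functions are inverses on the
missing syllables, which makes round-trip checks at the boundary
values a natural fit for the test suite.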

Rich
