john-dev - Re: Markov UTF-8 magic

Message-ID: <8ea32632c1c8be83853b95088bf67112@smtp.hushmail.com>
Date: Sun, 6 Jan 2013 13:10:02 +0100
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Markov UTF-8 magic

On 6 Jan, 2013, at 11:32 , Frank Dittrich <frank_dittrich@...mail.com> wrote:
> Creating a really good UTF-8 validity checker is even somewhat more
> complicated, since you have to exclude illegal overlong sequences as
> well as invalid Unicode code points.
> 
> See the discussion here (just one example):
> http://stackoverflow.com/questions/1031645/how-to-detect-utf-8-in-plain-c
> 
> BTW: Here's a perl expression which checks for valid UTF-8, just in case
> we'll need one:
> http://www.w3.org/International/questions/qa-forms-utf-8
> 
> Maybe we should google for a well-tested free C implementation which we
> can use.

I'm pretty sure the original lib I got our Unicode support from had a validity checker; I'll have a look at that. Writing one is fairly trivial, but if we try to reinvent the wheel we'll probably end up overlooking something.
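
For reference, here is a rough sketch of the checks Frank lists (overlong sequences and invalid code points), so we know what to look for in whatever implementation we pick. It is not the checker from the lib mentioned above, and the name is_valid_utf8 is made up for illustration. It assumes the RFC 3629 definition of UTF-8: at most four bytes per character, no surrogates (U+D800..U+DFFF), nothing above U+10FFFF.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not taken from any existing library.
 * Returns 1 if the len bytes at s are well-formed UTF-8, else 0.
 */
int is_valid_utf8(const unsigned char *s, size_t len)
{
	size_t i = 0;

	while (i < len) {
		unsigned char c = s[i];
		unsigned long cp, min;
		size_t n, k;	/* n = number of continuation bytes expected */

		if (c < 0x80) {			/* plain ASCII */
			i++;
			continue;
		} else if ((c & 0xE0) == 0xC0) {	/* 2-byte sequence */
			n = 1; cp = c & 0x1F; min = 0x80;
		} else if ((c & 0xF0) == 0xE0) {	/* 3-byte sequence */
			n = 2; cp = c & 0x0F; min = 0x800;
		} else if ((c & 0xF8) == 0xF0) {	/* 4-byte sequence */
			n = 3; cp = c & 0x07; min = 0x10000;
		} else {
			return 0;	/* stray continuation byte, or 0xF8..0xFF */
		}

		if (len - i <= n)
			return 0;	/* sequence is truncated */

		for (k = 1; k <= n; k++) {
			if ((s[i + k] & 0xC0) != 0x80)
				return 0;	/* not a continuation byte */
			cp = (cp << 6) | (s[i + k] & 0x3F);
		}

		if (cp < min)
			return 0;	/* overlong encoding */
		if (cp >= 0xD800 && cp <= 0xDFFF)
			return 0;	/* UTF-16 surrogate */
		if (cp > 0x10FFFF)
			return 0;	/* beyond the Unicode range */

		i += n + 1;
	}
	return 1;
}
```

With this sketch, "\xC3\xA9" (U+00E9) passes, while the overlong "\xC0\xAF", the surrogate "\xED\xA0\x80" and the out-of-range "\xF4\x90\x80\x80" are all rejected. It decodes each code point and then range-checks it, which is easier to audit than a byte-range table but probably not the fastest way to do it.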

magnum