john-dev - Re: episerver UTF-8



Message-ID: <55CD84F6.3060501@mailbox.org>
Date: Fri, 14 Aug 2015 08:04:38 +0200
From: Frank Dittrich <frank.dittrich@...lbox.org>
To: john-dev@...ts.openwall.com
Subject: Re: episerver UTF-8

On 08/14/2015 03:07 AM, jfoug@....net wrote:
> On Thu, 13 Aug 2015 19:35:57 -0500, Lei Zhang <zhanglei.april@...il.com>
> wrote:
>> BTW, I think 3*PLAINTEXT_LENGTH means that we assume
> 
> Yes, this is an 'assumption'
> 
>> each UTF8 char to be no larger than 3 bytes. Is that assumption true?
>> Or 4-byte UTF8 chars are too rare to be considered?
> 
> In real world, they are somewhat rare.  But your point is valid.  There
> could certainly be a string of X 4 byte utf8 (there are even 5 byte utf8
> characters) which cause something that should handle 25 characters to
> not be able to handle a string of 25 4 (or 5) byte utf8. But we simply
> have drawn a line in the sand where reality vs theoretical limits come
> into play.

For applications that use UTF-16 with surrogates internally, the above
assumption is OK. If you enter characters that require more than three
bytes when converted to UTF-8, the maximum number of characters will be
reduced accordingly.
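
To illustrate the point above, here is a minimal Python sketch (not code
from John the Ripper). It counts UTF-8 bytes and UTF-16 code units per
character. PLAINTEXT_LENGTH = 25 is only an illustrative value taken from
the example in the quoted text, and the helper names are hypothetical.

```python
# Illustrative value only; the real limit depends on the format.
PLAINTEXT_LENGTH = 25  # assumed limit, counted in UTF-16 code units


def utf8_len(ch: str) -> int:
    """Number of bytes the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def utf16_units(ch: str) -> int:
    """Number of 16-bit code units the character occupies in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


def max_chars_that_fit(ch: str, unit_limit: int) -> int:
    """How many copies of ch fit into unit_limit UTF-16 code units."""
    return unit_limit // utf16_units(ch)


# One BMP character (3 UTF-8 bytes) and one character outside the BMP
# (4 UTF-8 bytes, stored as a surrogate pair in UTF-16).
for ch in ("\u20ac", "\U0001F600"):
    count = max_chars_that_fit(ch, PLAINTEXT_LENGTH)
    print(
        f"U+{ord(ch):04X}: {utf8_len(ch)} UTF-8 bytes, "
        f"{utf16_units(ch)} UTF-16 unit(s), "
        f"{count} chars fit, {count * utf8_len(ch)} UTF-8 bytes needed, "
        f"buffer is {3 * PLAINTEXT_LENGTH}"
    )
```

Under these assumptions, a character outside the BMP takes 4 UTF-8 bytes
but 2 UTF-16 code units, so it stays within 3 bytes per code unit. A
limit of 25 code units then holds at most 12 such characters.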

Frank
