Message-ID: <d53ef8e70b5b09e23824e604a2f92e7f@smtp.hushmail.com>
Date: Fri, 14 Aug 2015 11:18:40 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: episerver UTF-8

On 2015-08-14 08:04, Frank Dittrich wrote:
> On 08/14/2015 03:07 AM, jfoug@....net wrote:
>> On Thu, 13 Aug 2015 19:35:57 -0500, Lei Zhang <zhanglei.april@...il.com>
>> wrote:
>>> BTW, I think 3*PLAINTEXT_LENGTH means that we assume
>>
>> Yes, this is an 'assumption'

No, it is not. It's always correct.

>>> each UTF8 char to be no larger than 3 bytes. Is that assumption true?
>>> Or 4-byte UTF8 chars are too rare to be considered?
>>
>> In real world, they are somewhat rare. But your point is valid. There
>> could certainly be a string of X 4 byte utf8 (there are even 5 byte utf8
>> characters) which cause something that should handle 25 characters to
>> not be able to handle a string of 25 4 (or 5) byte utf8. But we simply
>> have drawn a line in the sand where reality vs theoretical limits come
>> into play.
>
> For applications that use UTF-16 with surrogates internally, the above
> assumption is OK. If you enter characters that require more than three
> bytes when converted to utf-8, the max. number of characters will be
> reduced accordingly.

Exactly.

1 byte of UTF-8  = 2 octets of UTF-16
2 bytes of UTF-8 = 2 octets of UTF-16
3 bytes of UTF-8 = 2 octets of UTF-16
4 bytes of UTF-8 = 4 octets of UTF-16

For all of the above, the UTF-16 is *one* character.

magnum
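
The size relationship discussed above can be checked with a small sketch. This is only an illustration in Python, not code from John the Ripper: the helper names (utf8_len, utf16_units, worst_case_utf8_bytes) are made up for this example, and the value 25 for PLAINTEXT_LENGTH is assumed from the "25 characters" mentioned in the thread, not taken from the episerver format source. It assumes the limit is counted in 16-bit UTF-16 units, as for an application that uses UTF-16 with surrogates internally.

```python
def utf8_len(cp: int) -> int:
    """Bytes needed to encode code point cp in UTF-8."""
    if cp < 0x80:
        return 1
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4  # code points above the BMP, up to U+10FFFF


def utf16_units(cp: int) -> int:
    """16-bit units needed in UTF-16: 2 (a surrogate pair) above the BMP."""
    return 1 if cp < 0x10000 else 2


def worst_case_utf8_bytes(max_utf16_units: int) -> int:
    """Upper bound on UTF-8 bytes for text limited to max_utf16_units units."""
    return 3 * max_utf16_units


# Assumed value, for illustration only.
PLAINTEXT_LENGTH = 25

# One sample per UTF-8 length: 'A', e-acute, euro sign, an emoji.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    cp = ord(ch)
    # Cross-check the helpers against Python's own encoders.
    assert utf8_len(cp) == len(ch.encode("utf-8"))
    assert utf16_units(cp) == len(ch.encode("utf-16-le")) // 2
    # No code point needs more than 3 UTF-8 bytes per UTF-16 unit.
    assert utf8_len(cp) <= 3 * utf16_units(cp)

# A string of 4-byte characters that fills the UTF-16 limit as far as it can:
# 12 emoji take 24 units and 48 UTF-8 bytes, well inside the 75-byte bound.
text = "\U0001F600" * (PLAINTEXT_LENGTH // 2)
assert len(text.encode("utf-8")) <= worst_case_utf8_bytes(PLAINTEXT_LENGTH)
```

A 4-byte UTF-8 sequence costs two UTF-16 units, so it averages 2 bytes per unit and never exceeds the 3-bytes-per-unit bound; the price is that fewer such characters fit within the limit.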