Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <DB6P192MB0101984E844CB3983214220CFBEA0@DB6P192MB0101.EURP192.PROD.OUTLOOK.COM>
Date: Thu, 4 May 2017 14:49:15 +0000
From: "maksis ." <maksis@...enaline-network.com>
To: "musl@...ts.openwall.com" <musl@...ts.openwall.com>
Subject: RE: Issues with iconv conversions (UTF-8 -> cp*)

Thanks! Based on quick testing with a few different charsets, the results are now correct in both apps.

From: Rich Felker<mailto:dalias@...c.org>
Sent: torstai 4. toukokuuta 2017 17.18
To: musl@...ts.openwall.com<mailto:musl@...ts.openwall.com>
Subject: Re: [musl] Issues with iconv conversions (UTF-8 -> cp*)

On Thu, May 04, 2017 at 09:23:18AM +0000, maksis . wrote:
> Seems like it’s better, but certain characters are converted
> incorrectly by the linked test app (such as “€” -> “¬”). I also have
> different characters failing in my actual C++ app.

I can't reproduce this. With the patch applied I have both “€” and “¬”
successfully round-tripping from UTF-8 to CP1250 and back...

..however that makes sense because the bug was out-of-bounds memory
reads, whose behavior would vary by build/system/etc. (UB). Try this
v2 patch.

Rich


> From: Rich Felker<mailto:dalias@...c.org>
> Sent: torstai 4. toukokuuta 2017 0.00
> To: musl@...ts.openwall.com<mailto:musl@...ts.openwall.com>
> Subject: Re: [musl] Issues with iconv conversions (UTF-8 -> cp*)
>
> On Wed, May 03, 2017 at 01:53:19PM -0400, Rich Felker wrote:
> > On Wed, May 03, 2017 at 05:20:47PM +0000, maksis . wrote:
> > > Hi,
> > >
> > > I’m experiencing issues with iconv conversions from UTF-8 to Windows codepages (cp* -> UTF-8 seems to be working fine).
> > >
> > >
> > > Test program: https://gist.github.com/maksis/ef6562b43c94a6a29dc21b987ec4c0cf
> > >
> > > gcc output:  UTF-8 -> cp1250 -> UTF-8: ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚„…†‡‰Š‹ŚŤŽŹ‘’“”•–—™š›śťžźˇ˘Ł¤Ą¦§¨©Ş«¬®Ż°±˛ł´µ¶·¸ąş»Ľ˝ľżŔÁÂĂÄĹĆÇČÉĘËĚÍÎĎĐŃŇÓÔŐÖ×ŘŮÚŰÜÝŢßŕáâăäĺćçčéęëěíîďđńňóôőö÷řůúűüýţ
> > >
> > > musl-gcc output:  UTF-8 -> cp1250 -> UTF-8: ABCDEFGHIJKLMNOPQRSTUVWXYZ[]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚„…†‡‰Š‹ŚŤŽŹ‘’“”•–—™š›śťžźˇ˘Ł*Ą****Ş***Ż**˛ł*****ąş*Ľ˝ľżŔ**Ă*ĹĆ*Č*Ę*Ě**ĎĐŃŇ**Ő**ŘŮ*Ű**Ţ*ŕ**ă*ĺć*č*ę*ě**ďđńň**ő**řů*ű**ţ
> > >
> > > Tested with GCC 6.3.1 and musl 1.1.16
> >
> > Thanks. I've confirmed that this is a bug in conversion to (not from,
> > only to) legacy 8bit codepages; there's missing logic for the case
> > (which is handled specially in the tables) where a unicode codepoint
> > is represented by its own value (that fits in 8 bits). I'll work on a
> > patch to fix it and follow up when it's either committed or ready for
> > testing.
>
> Does the attached patch work for you? Anyone see problems with it? The
> functional change is at:
>
> -                       if (c < 128+totype) {
> +                       if (c < 128+totype || c==legacy_map(tomap, c)) {
>
> The rest is just refactoring to put the shared logic in a new static
> function legacy_map rather than repeating it two (and now three, after
> the fix) times. Further simplification via refactoring is actually
> possible but I tried to keep this patch short and easy to review.
>
> Rich
>


Content of type "text/html" skipped

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.