Date: Sat, 2 Jun 2018 22:26:35 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: iconv UTF-8 <--> CP1255 roundtrip possible bug?

On Wed, May 16, 2018 at 08:48:08PM -0500, Will Dietz wrote:
> On Wed, May 16, 2018 at 6:04 PM, Rich Felker <dalias@...c.org> wrote:
> > On Wed, May 16, 2018 at 12:22:36PM -0500, Will Dietz wrote:
> >> I admit to being a bit unsure, but the behavior shown below doesn't
> >> seem obviously right --LMK if I'm missing something :).
> >>
> >> Input file attached for inspection without relying on it getting
> >> through byte-identical to what I have--
> >> indeed I'm not sure copy+paste into this is working correctly (the
> >> characters look different in my terminal :)).  Anyway:
> >>
> >> $ cat cp1255-snippet.xxd
> >> 00000000: efac b3d6 b8d7 9d0a                      ........
> >> $ xxd -r cp1255-snippet.xxd
> >> דָּם
> >>
> >> Attempt to round-trip this from UTF-8 to CP1255 and back,
> >> first with glibc's iconv (2.26):
> >>
> >> $ xxd -r cp1255-snippet.xxd|iconv -f UTF-8 -t CP1255|iconv -f CP1255 -t UTF-8 | xxd
> >> 00000000: efac b3d6 b8d7 9d0a
> >>
> >> Looks good, same as what was sent in.
> >>
> >> Using musl-based iconv utility (1.1.19):
> >> $ xxd -r cp1255-snippet.xxd|$ICONV -f UTF-8 -t CP1255|$ICONV -f CP1255 -t UTF-8 | xxd
> >> 00000000: 2ad6 b8d7 9d0a                           *.....
> >>
> >> Indeed, the result looks different than what was started with:
> >>
> >> *ָם
> >>
> >> (again apologies if that doesn't survive mailing and such)
> >>
> >> This input was taken from gnu libiconv's test suite, in particular the
> >> first line of tests/CP1255-snippet.UTF-8.  Since it's 2 characters,
> >> and test data, I hope there's no problem re:licensing O:).
> >>
> >> I've reproduced the same behavior using iconv() directly, I can share
> >> that if that would be preferable. It's the same code from earlier
> >> iconv threads on the ML.
> >
> > No need; it's easy to reproduce, and I'm leaning towards saying the
> > test is invalid. U+FB33 is a precomposed ligature form (from the
> > Alphabetic Presentation Forms block), roughly equivalent in status to
> > stuff like "fi" (U+FB01). An iconv implementation could perform an
> > approximate conversion for such characters, returning a positive value
> > indicating the number of such substitutions made, but silently
> > converting it in a lossy way is not conforming, and there's
> > apparently no lossless way to convert it since CP1255 has no dedicated
> > character slot for it (at least based on the definition of the
> > codepage I'm using).
> 
> Thanks for looking into this and for the great information!
> I'll investigate more tomorrow, but wanted to respond to your inquiry
> since it's easy to produce and might help explain things :).

Any further findings?

> > Do you know how/why they expect it to round-trip? What does glibc do
> > when converting it -- can you show the intermediate (CP1255) form as a
> > hexdump?
> 
> Sure!
> 
> Here's the intermediates for libiconv first, then w/musl:
> 
> $ cat libiconv-cp1255.xxd
> 00000000: e3cc c8ed 0a                             .....
> $ cat musl-iconv-cp1255.xxd
> 00000000: 2ac8 ed0a                                *...

This is a plausible/reasonable conversion GNU iconv is doing...

> Here's what happens when each of these is fed through both:
> 
> ---- using libiconv's intermediate:
> $ xxd -r ./libiconv-cp1255.xxd|iconv -f CP1255 -t UTF-8
> דָּם
> $ xxd -r ./libiconv-cp1255.xxd|iconv -f CP1255 -t UTF-8|xxd
> 00000000: efac b3d6 b8d7 9d0a                      ........
> $ xxd -r ./libiconv-cp1255.xxd|$ICONV -f CP1255 -t UTF-8
> דָּם
> $ xxd -r ./libiconv-cp1255.xxd|$ICONV -f CP1255 -t UTF-8|xxd
> 00000000: d793 d6bc d6b8 d79d 0a                   .........

...but the GNU iconv behavior here is completely unreasonable/wrong.
The first character it outputs is a presentation form for a ligature.
There is no reason iconv should be doing this kind of renormalization
when the original representation as two separate characters is
available in the dest charset.
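
To make the byte sequences above easier to check, here is a minimal
sketch in Python. It assumes Python's built-in "cp1255" codec and
unicodedata module, which are independent of both musl's and GNU's
iconv, so it only illustrates the code points involved and says
nothing about either implementation's internals. The variable names
are made up for this example.

```python
import unicodedata

# Byte sequences copied from the hexdumps quoted in this thread.
original_utf8 = bytes.fromhex("efacb3d6b8d79d0a")      # test input (UTF-8)
cp1255_intermediate = bytes.fromhex("e3ccc8ed0a")      # libiconv's CP1255 output
musl_utf8 = bytes.fromhex("d793d6bcd6b8d79d0a")        # musl's CP1255 -> UTF-8 result

original = original_utf8.decode("utf-8")
decoded = cp1255_intermediate.decode("cp1255")


def describe(text):
    """Return 'U+XXXX NAME' for each code point (control chars get a placeholder)."""
    return [f"U+{ord(c):04X} {unicodedata.name(c, '<control>')}" for c in text]


print("original input:")
for line in describe(original):
    print("  ", line)

print("CP1255 intermediate decoded one byte at a time:")
for line in describe(decoded):
    print("  ", line)

# Decoding byte-for-byte yields the two-character form (dalet + dagesh),
# which matches the UTF-8 that musl produced.
same_as_musl = decoded.encode("utf-8") == musl_utf8

# U+FB33 has a canonical decomposition to U+05D3 U+05BC ...
decomposed = unicodedata.normalize("NFD", "\ufb33")
# ... but NFC does not compose the pair back into U+FB33.
recomposed = unicodedata.normalize("NFC", "\u05d3\u05bc")

# Python's cp1255 codec has no slot for U+FB33 and refuses to encode it.
try:
    "\ufb33".encode("cp1255")
    fb33_encodable = True
except UnicodeEncodeError:
    fb33_encodable = False

print("byte-for-byte decode matches musl output:", same_as_musl)
print("NFD(U+FB33):", describe(decomposed))
print("NFC(U+05D3 U+05BC):", describe(recomposed))
print("U+FB33 encodable in cp1255:", fb33_encodable)
```

The two strings are canonically equivalent (they compare equal after
NFD), but they are not the same code point sequence, so a byte-exact
round trip of the test input would require the decoder to emit the
presentation form instead of the two separate characters.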

Rich
