Message-ID: <CAKGWAO8EUJokpuQmX_mzc=VVfcpkEW1_=m4JFnjoy3xD+EHwmQ@mail.gmail.com>
Date: Thu, 14 Jun 2018 14:37:48 -0500
From: Will Dietz <w@...z.org>
To: musl@...ts.openwall.com
Subject: Re: iconv UTF-8 <--> CP1255 roundtrip possible bug?

Nothing yet, I've not been able to spend more time on this lately sorry :).

I'll let you know if I do find anything, and at least I'll be trying
the latest changes you've pushed.
(yay all issues I've run into appear fixed :D)

Thanks!

~Will

On Sat, Jun 2, 2018 at 9:26 PM, Rich Felker <dalias@...c.org> wrote:
> On Wed, May 16, 2018 at 08:48:08PM -0500, Will Dietz wrote:
>> On Wed, May 16, 2018 at 6:04 PM, Rich Felker <dalias@...c.org> wrote:
>> > On Wed, May 16, 2018 at 12:22:36PM -0500, Will Dietz wrote:
>> >> I admit to being a bit unsure, but the behavior shown below doesn't
>> >> seem obviously right --LMK if I'm missing something :).
>> >>
>> >> Input file attached for inspection without relying on it getting
>> >> through byte-identical to what I have--
>> >> indeed I'm not sure copy+paste into this is working correctly (the
>> >> characters look different in my terminal :)). Anyway:
>> >>
>> >> $ cat cp1255-snippet.xxd
>> >> 00000000: efac b3d6 b8d7 9d0a ........
>> >> $ xxd -r cp1255-snippet.xxd
>> >> דָּם
>> >>
>> >> Attempt to round-trip this from UTF-8 to CP1255 and back,
>> >> first with glibc's iconv (2.26):
>> >>
>> >> $ xxd -r cp1255-snippet.xxd|iconv -f UTF-8 -t CP1255|iconv -f CP1255
>> >> -t UTF-8 | xxd
>> >> 00000000: efac b3d6 b8d7 9d0a
>> >>
>> >> Looks good, same as what was sent in.
>> >>
>> >> Using musl-based iconv utility (1.1.19):
>> >> $ xxd -r cp1255-snippet.xxd|$ICONV -f UTF-8 -t CP1255|$ICONV -f CP1255
>> >> -t UTF-8 | xxd
>> >> 00000000: 2ad6 b8d7 9d0a *.....
>> >>
>> >> Indeed, the result looks different than what was started with:
>> >>
>> >> *ָם
>> >>
>> >> (again apologies if that doesn't survive mailing and such)
>> >>
>> >> This input was taken from gnu libiconv's test suite, in particular the
>> >> first line of tests/CP1255-snippet.UTF-8. Since it's 2 characters,
>> >> and test data, I hope there's no problem re:licensing O:).
>> >>
>> >> I've reproduced the same behavior using iconv() directly, I can share
>> >> that if that would be preferable. It's the same code from earlier
>> >> iconv threads on the ML.
>> >
>> > No need; it's easy to reproduce, and I'm leaning towards saying the
>> > test is invalid. U+FB33 is a precomposed ligature form (from the
>> > Alphabetic Presentation Forms block), roughly equivalent in status to
>> > stuff like "ﬁ" (U+FB01). An iconv implementation could perform an
>> > approximate conversion for such characters, returning a positive value
>> > indicating the number of such substitutions made, but silently
>> > converting it in a lossy way is not conforming, and of course there's
>> > apparently no lossless way to convert it since CP1255 has no dedicated
>> > character slot for it (at least based on the definition of the
>> > codepage I'm using).
>>
>> Thanks for looking into this and for the great information!
>> I'll investigate more tomorrow, but wanted to respond to your inquiry
>> since it's easy to produce and might help explain things :).
>
> Any further findings?
>
>> > Do you know how/why they expect it to round-trip? What does glibc do
>> > when converting it -- can you show the intermediate (CP1255) form as a
>> > hexdump?
>>
>> Sure!
>>
>> Here's the intermediates for libiconv first, then w/musl:
>>
>> $ cat libiconv-cp1255.xxd
>> 00000000: e3cc c8ed 0a .....
>> $ cat musl-iconv-cp1255.xxd
>> 00000000: 2ac8 ed0a *...
>
> This is a plausible/reasonable conversion GNU iconv is doing...
>
>> Here's what happens when each of these are fed through both:
>>
>> ---- using libiconv's intermediate:
>> $ xxd -r ./libiconv-cp1255.xxd|iconv -f CP1255 -t UTF-8
>> דָּם
>> $ xxd -r ./libiconv-cp1255.xxd|iconv -f CP1255 -t UTF-8|xxd
>> 00000000: efac b3d6 b8d7 9d0a ........
>> $ xxd -r ./libiconv-cp1255.xxd|$ICONV -f CP1255 -t UTF-8
>> דָּם
>> $ xxd -r ./libiconv-cp1255.xxd|$ICONV -f CP1255 -t UTF-8|xxd
>> 00000000: d793 d6bc d6b8 d79d 0a .........
>
> ...but the GNU iconv behavior here is completely unreasonable/wrong.
> The first character it outputs is a presentation form for a ligature.
> There is no reason iconv should be doing this kind of renormalization
> when the original representation as two separate characters is
> available in the dest charset.
>
> Rich
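
The byte sequences quoted in this thread can be cross-checked without either iconv implementation. What follows is a small Python sketch, assuming Python's built-in `cp1255` codec agrees with the codepage definition discussed above. It is not what glibc, libiconv, or musl actually run. The helper names `decompose_per_char` and `strict_cp1255` are made up for illustration, and the per-character canonical decomposition is only a guess at how a converter might arrive at the `e3 cc` pair for U+FB33.

```python
import unicodedata

# The test input from the thread: U+FB33 (dalet with dagesh, a presentation
# form), U+05B8 (qamats), U+05DD (final mem), newline.
SNIPPET = "\ufb33\u05b8\u05dd\n"


def decompose_per_char(text):
    """Hypothetical helper: canonically decompose each code point on its own.

    Normalizing the whole string at once would also reorder the combining
    marks, so each character is handled separately here.
    """
    return "".join(unicodedata.normalize("NFD", ch) for ch in text)


def strict_cp1255(text):
    """Encode to CP1255, returning None if any character has no slot."""
    try:
        return text.encode("cp1255")
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    # UTF-8 bytes of the original input.
    print(SNIPPET.encode("utf-8").hex())
    # Strict encoding of the original input (no substitution attempted).
    print(strict_cp1255(SNIPPET))
    # CP1255 bytes after decomposing U+FB33 into dalet + dagesh.
    intermediate = strict_cp1255(decompose_per_char(SNIPPET))
    print(intermediate.hex())
    # Decoding those bytes back yields the two-character form in UTF-8.
    print(intermediate.decode("cp1255").encode("utf-8").hex())
```

Under those assumptions, the strict encode of the original input is refused (Python raises where musl substitutes `*`), the decomposed input encodes to the `e3 cc c8 ed 0a` intermediate shown for libiconv, and decoding that intermediate gives the `d7 93 d6 bc d6 b8 d7 9d 0a` form that musl printed.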