Message-ID: <20180227173426.GQ1436@brightrain.aerifal.cx>
Date: Tue, 27 Feb 2018 12:34:26 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: iconv failure (ISO-2022-JP) since musl update on
 AlpineLinux

On Tue, Feb 27, 2018 at 05:57:04PM +0100, Steffen Nurpmeso wrote:
> Hello.
> 
> After updating to musl-1.1.19-r0 there i saw test failures for the
> MUA i maintain, namely regarding the mentioned charset.  I will
> attach a file to reproduce.  (Am not subscribed.)
> Ciao!
> 
>   #?0[steffen@...on steffen]$ cksum in.utf 
>   1259742080 686 in.utf
>   #?0[steffen@...on steffen]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum
>   2184132317 536
>   #?0[steffen@...on steffen]$ iconv --version
>   iconv (GNU libiconv 1.11)
> ...
>   #?0[steffen@...ex tmp]$ cksum in.utf 
>   1259742080 686 in.utf
>   #?0[steffen@...ex tmp]$ iconv -f utf8 -t iso-2022-jp <in.utf|cksum 
>   209789743 1736
>   #?0[steffen@...ex tmp]$ apk info --who-owns /usr/bin/iconv 
>   /usr/bin/iconv is owned by musl-utils-1.1.19-r0

Does the data round-trip correctly? I don't think you can expect
bitwise match between outputs of different ISO-2022-JP converters,
unless perhaps they both guarantee minimality, because the ISO-2022-JP
representation of a string is highly nonunique.
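
To illustrate the non-uniqueness, here is a minimal sketch of a decoder
for a restricted subset (ASCII and JIS X 0208 only). It is not musl's
code; decode_units and same_text are hypothetical names made up for
this example. It shows two different byte strings that carry the same
text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Decode a restricted ISO-2022-JP subset (ASCII and JIS X 0208 only).
 * ASCII bytes are stored as-is; a JIS X 0208 pair is stored as
 * 0x8080 | hi<<8 | lo so it cannot collide with ASCII.
 * Returns the number of units, or (size_t)-1 on malformed input. */
size_t decode_units(const unsigned char *in, size_t n,
                    unsigned short *out, size_t cap)
{
    int kanji = 0;      /* 0 = ASCII state, 1 = JIS X 0208 state */
    size_t i = 0, k = 0;

    assert(in && out);
    while (i < n) {
        if (in[i] == 0x1b) {
            /* Only ESC $ B and ESC ( B are handled in this sketch. */
            if (n - i < 3) return (size_t)-1;
            if (in[i+1] == '$' && in[i+2] == 'B') kanji = 1;
            else if (in[i+1] == '(' && in[i+2] == 'B') kanji = 0;
            else return (size_t)-1;
            i += 3;
            continue;
        }
        if (k == cap) return (size_t)-1;
        if (kanji) {
            if (n - i < 2) return (size_t)-1;
            out[k++] = (unsigned short)(0x8080 | in[i] << 8 | in[i+1]);
            i += 2;
        } else {
            out[k++] = in[i++];
        }
    }
    return k;
}

/* Nonzero if both strings decode to the same sequence of characters. */
int same_text(const char *a, const char *b)
{
    unsigned short ua[64], ub[64];
    size_t na = decode_units((const unsigned char *)a, strlen(a), ua, 64);
    size_t nb = decode_units((const unsigned char *)b, strlen(b), ub, 64);

    return na != (size_t)-1 && na == nb
        && memcmp(ua, ub, na * sizeof *ua) == 0;
}
```

For example, the two-character string with JIS X 0208 codes 0x467C
0x4B5C can be written as "\x1b$BF|K\\\x1b(B" (10 bytes) or as
"\x1b$BF|\x1b(B\x1b$BK\\\x1b(B" (16 bytes); same_text() reports them as
equal even though a checksum would not.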

In particular, musl's to-ISO-2022-JP converter is stateless and always
emits shift-in/shift-out sequences around every non-ASCII character. Of course
this is highly suboptimal, but in the worst case (where the caller
calls iconv one character at a time) the iconv API can't do any better
because strings are required to end in the unshifted state, and the
iconv API doesn't have any method to "finalize" a conversion. This
implies that every time iconv returns with non-ASCII as the most
recent output character, it must be followed by a shift back to the
initial (ASCII) state.
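
A minimal sketch of what "stateless" means here, assuming input that
has already been mapped to ASCII or JIS X 0208 codes. This is an
illustration only, not musl's source; encode_stateless is a
hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stateless encoder. Each input value is either an ASCII code
 * (< 0x80) or a JIS X 0208 code 0xHHLL. Every non-ASCII character is
 * wrapped in its own ESC $ B ... ESC ( B pair, so the output is in the
 * ASCII state after every character.
 * Returns bytes written, or (size_t)-1 if out is too small. */
size_t encode_stateless(const unsigned *code, size_t n,
                        unsigned char *out, size_t cap)
{
    size_t k = 0;

    assert(code && out);
    for (size_t i = 0; i < n; i++) {
        if (code[i] < 0x80) {
            if (cap - k < 1) return (size_t)-1;
            out[k++] = (unsigned char)code[i];
        } else {
            if (cap - k < 8) return (size_t)-1;
            out[k++] = 0x1b; out[k++] = '$'; out[k++] = 'B';
            out[k++] = (unsigned char)(code[i] >> 8);
            out[k++] = (unsigned char)(code[i] & 0xff);
            out[k++] = 0x1b; out[k++] = '('; out[k++] = 'B';
        }
    }
    return k;
}
```

With this scheme each non-ASCII character costs 8 bytes instead of 2,
which is why the output can be several times larger than a minimal
encoding of the same text while still being valid.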

We could improve this in the case of batch conversions by overwriting
the previous shift-back-to-initial and skipping the next shift if the
character set of the next character to output matches the previous
one, but that only works within a single batch call, since iconv can't
write outside the buffer passed to it for the current call. This is an
improvement I think I want to make, since it would improve typical
output size a lot, but the cost is output determinism under different
chunking by the caller.
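
A rough sketch of that idea, under the same simplifying assumptions
(ASCII and JIS X 0208 codes only). encode_batch and encode_chunked are
hypothetical names, and this is not a proposed patch; it only shows the
overwrite-and-skip step and the resulting dependence on chunking.

```c
#include <assert.h>
#include <stddef.h>

/* One "iconv call": merges adjacent JIS X 0208 characters within this
 * call by overwriting the previous ESC ( B, but always returns with
 * the output in the ASCII state.
 * Returns bytes written, or (size_t)-1 if out is too small. */
size_t encode_batch(const unsigned *code, size_t n,
                    unsigned char *out, size_t cap)
{
    size_t k = 0;
    int prev_kanji = 0;  /* last char written in THIS call was JIS X 0208 */

    for (size_t i = 0; i < n; i++) {
        if (code[i] < 0x80) {
            if (cap - k < 1) return (size_t)-1;
            out[k++] = (unsigned char)code[i];
            prev_kanji = 0;
            continue;
        }
        if (prev_kanji) {
            k -= 3;      /* back up over the previous ESC ( B */
        } else {
            if (cap - k < 3) return (size_t)-1;
            out[k++] = 0x1b; out[k++] = '$'; out[k++] = 'B';
        }
        if (cap - k < 5) return (size_t)-1;
        out[k++] = (unsigned char)(code[i] >> 8);
        out[k++] = (unsigned char)(code[i] & 0xff);
        out[k++] = 0x1b; out[k++] = '('; out[k++] = 'B';
        prev_kanji = 1;
    }
    return k;
}

/* Simulates a caller feeding the input in pieces of `chunk` characters,
 * one encode_batch call per piece. */
size_t encode_chunked(const unsigned *code, size_t n, size_t chunk,
                      unsigned char *out, size_t cap)
{
    size_t k = 0;

    assert(chunk > 0);
    for (size_t i = 0; i < n; i += chunk) {
        size_t m = n - i < chunk ? n - i : chunk;
        size_t r = encode_batch(code + i, m, out + k, cap - k);
        if (r == (size_t)-1) return r;
        k += r;
    }
    return k;
}
```

For the two codes 0x467C 0x4B5C, a single call produces 10 bytes, while
feeding them one at a time produces 16 bytes. Both decode to the same
text, but the bytes differ depending on how the caller chunked the
input.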

Rich
