Message-ID: <79808844.bqqDOferBU@omega>
Date: Thu, 30 Jul 2020 11:39:43 +0200
From: Bruno Haible <bruno@...sp.org>
To: bug-gnulib@....org
Cc: Florian Weimer <fweimer@...hat.com>,
 Arjun Shankar <ashankar@...hat.com>, Rich Felker <dalias@...c.org>,
 "A. Wilcox" <awilfox@...lielinux.org>, musl@...ts.openwall.com
Subject: Re: iconv replacements

[Dropping bug-bison from CC]

> > Yes and no. The code is not making assumptions about a particular iconv()
> > implementation. But it needs to distinguish two categories of replacements
> > done by iconv():
> >   - those that are harmless (for example when replacing a Unicode TAG
> >     character U+E00xx with an empty output),
> >   - those that are better not presented to the user, if the programmer has
> >     specified a fallback (for example, replacing all non-ASCII characters
> >     with NUL, '?', or '*').
> >
> > The standards don't help in making the distinction.
> >
> > Therefore whether you consider said glibc and libiconv behaviour as
> > "non-conforming" or not is irrelevant.
>
> Could you sketch briefly what you need? We have identified some issues
> with the existing iconv interface. If we add an enhancement, it would
> make sense to cover these requirements.

POSIX [1] says:

  "If iconv() encounters a character in the input buffer that is valid,
   but for which an identical character does not exist in the target
   codeset, iconv() shall perform an implementation-defined conversion
   on this character."

  "The iconv() function shall ... return the number of non-identical
   conversions performed."

This is sufficient for detecting that iconv() did something that the
application might or might not like.

For decent application behaviour in UTF-8, legacy 8-bit, and ASCII locales
I wrote a module 'unicodeio' that accepts an ASCII fallback given by the
programmer. For example, for the string "François Pinard" a fallback
"Francois Pinard" can be given, and for the string "•" a fallback "." can
be given.

In this code, it needs to analyze what iconv() actually did and distinguish
replacements that are OK (no need to activate the ASCII fallback) and those
that are worse than the ASCII fallback. For example:
  - Replacing 'ç' with '?' (NetBSD, Solaris 11) or '*' (musl) or NUL (IRIX)
    is worse than the ASCII fallback.
  - Replacing a Unicode tag character with an empty string is OK.
  - Replacing GREEK SMALL LETTER MU with MICRO SIGN is OK.
  - Replacing FULLWIDTH COLON with ':' is OK (most likely equivalent to the
    ASCII fallback).

That's my requirement from the application side.

I don't know whether an iconv() implementation can help here, given the
limited interface of iconv. Maybe there could be an alternative to
//TRANSLIT in the iconv_open() argument, that would specify e.g. that
tag characters and <compat> and <wide> replacements in UnicodeData.txt
are OK but other replacements are not OK? Where either
  - OK means a conversion that does not increment the return value,
  - "not OK" means a conversion that increments the return value,
or
  - OK means a conversion that increments the return value,
  - "not OK" means an error return (-1 / EILSEQ).

Bruno

[1] https://pubs.opengroup.org/onlinepubs/9699919799/functions/iconv.html
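
To make the return-value check discussed above concrete, here is a minimal C sketch. It is not the code of the 'unicodeio' module: the helper name convert_or_fallback and its interface are hypothetical, the input is assumed to be UTF-8, and the logic is deliberately coarser than what the message asks for. It treats every non-identical conversion reported by iconv(), as well as every error, as "worse than the ASCII fallback", because the return value alone cannot tell a harmless replacement from a harmful one.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (an assumption for illustration, not a gnulib API).
   Converts the UTF-8 string TEXT to the encoding TOCODE and stores the
   NUL-terminated result in OUT (capacity OUTSIZE bytes).
   If iconv() fails or reports at least one non-identical conversion,
   ASCII_FALLBACK is copied to OUT instead.
   Returns 1 if the converted text was used, 0 if the fallback was used,
   -1 if the fallback does not fit into OUT either.  */
int
convert_or_fallback (const char *text, const char *ascii_fallback,
                     const char *tocode, char *out, size_t outsize)
{
  int converted = 0;
  iconv_t cd = iconv_open (tocode, "UTF-8");

  if (cd != (iconv_t) -1)
    {
      char *inptr = (char *) text;      /* iconv() does not modify the input */
      size_t inleft = strlen (text);
      char *outptr = out;
      size_t outleft = outsize > 0 ? outsize - 1 : 0;  /* room for NUL */

      /* On success, the return value is the number of non-identical
         conversions performed; on error it is (size_t) -1.  */
      size_t res = iconv (cd, &inptr, &inleft, &outptr, &outleft);

      /* Flush a possible shift state for stateful target encodings.  */
      if (res != (size_t) -1
          && iconv (cd, NULL, NULL, &outptr, &outleft) == (size_t) -1)
        res = (size_t) -1;

      iconv_close (cd);

      /* Accept the result only if every character converted identically.  */
      if (res == 0 && inleft == 0 && outsize > 0)
        {
          *outptr = '\0';
          converted = 1;
        }
    }

  if (converted)
    return 1;

  /* Error or non-identical conversion: use the programmer's fallback.  */
  if (strlen (ascii_fallback) >= outsize)
    return -1;
  strcpy (out, ascii_fallback);
  return 0;
}
```

Called with the text "François Pinard" (in UTF-8), the fallback "Francois Pinard", and the target encoding "ASCII", this sketch ends up with the fallback whether the implementation returns -1 / EILSEQ for 'ç' or substitutes '?' or '*' and counts it in the return value. It would also discard the replacements listed as OK above, such as dropping a tag character, whenever the implementation counts them as non-identical; that is the distinction the plain return value cannot express.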