Message-ID: <20130805004915.GA221@brightrain.aerifal.cx>
Date: Sun, 4 Aug 2013 20:49:15 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: iconv Korean and Traditional Chinese research so far

On Mon, Aug 05, 2013 at 12:39:43AM +0200, Harald Becker wrote:
> Hi Rich !
>
> > Worst-case, adding Korean and Traditional Chinese tables will
> > roughly double the size of iconv.o to around 150k. This will
> > noticably enlarge libc.so, but will make no difference to
> > static-linked programs except those using iconv. I'm hoping we
> > can make these additions less expensive, but I don't see a good
> > way yet.
>
> Oh nooo, do you really want to add this statically to the iconv
> version?

Do I want to add that size? No, of course not, and that's why I'm
hoping (but not optimistic) that there may be a way to elide a good
part of the table based on patterns in the Hangul syllables or the
possibility that the giant extensions are unimportant.

Do I want to give users who have large volumes of legacy text in their
languages stored in these encodings the same respect and dignity as
users of other legacy encodings we already support? Yes.

> Why cant we have all this character conversions on a state driven
> machine which loads its information from a external configuration
> file? This way we can have any kind of conversion someone likes,
> by just adding the configuration file for the required Unicode to
> X and X to Unicode conversions.

This issue was discussed a long time ago and the consensus among users
of static linking was that static linking is most valuable when it
makes the binary completely "portable" to arbitrary Linux systems for
the same cpu arch, without any dependency on having files in
particular locations on the system aside from the minimum required by
POSIX (things like /dev/null), the standard Linux /proc mountpoint,
and universal config files like /etc/resolv.conf (even that is not
necessary, BTW, if you have a DNS on localhost). Having iconv not work
without external character tables is essentially a form of dynamic
linking, and carries with it issues like where the files are to be
found (you can override that with an environment variable, but that
can't be permitted for setuid binaries), what happens if the format
needs to change and the format on the target machine is not compatible
with the libc version your binary was built with, etc. This is also
the main reason musl does not support something like nss.

Another side benefit of the current implementation is that it's fully
self-contained and independent of any system facilities. It's pure C
and can be taken out of musl and dropped in to any program on any C
implementation, including freestanding (non-hosted) implementations.
If it depended on the filesystem, adapting it for such usage would be
a lot more work.

> State driven fsm interpreters are really small and fast and may
> read it's complete configuration from a file ... architecture
> independent file, so we may have same character conversion files
> for all architectures.

A fsm implementation would be several times larger than the
implementations in iconv.c. It's possible that we could, at some time
in the future, support loading of user-defined character conversion
files as an added feature, but this should only be for really
special-purpose things like custom encodings used for games or
obsolete systems (old Mac, console games, IBM mainframes, etc.).
In terms of the criteria for what to include in musl itself, my idea
is that if you have a mail client or web browser based on iconv for
its character set handling, you should be able to read the bulk of
content in any language.

Rich
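
Editorial note on "patterns in the Hangul syllables": the message does
not say which patterns are meant, and the sketch below is not taken
from musl's iconv.c. It only illustrates one well-known regularity that
makes table elision plausible in principle: Unicode lays out its 11172
precomposed Hangul syllables arithmetically from U+AC00, so the
leading-consonant, vowel, and trailing-consonant indices can be
computed rather than stored. The names hangul_decompose and
hangul_compose are hypothetical helpers introduced for this
illustration.

```c
/* Illustrative only: arithmetic layout of the Unicode Hangul
 * Syllables block (U+AC00..U+D7A3). Not code from musl. */
enum {
    S_BASE  = 0xAC00,                      /* first precomposed syllable */
    L_COUNT = 19,                          /* leading consonants */
    V_COUNT = 21,                          /* vowels */
    T_COUNT = 28,                          /* trailing consonants, 0 = none */
    N_COUNT = V_COUNT * T_COUNT,           /* 588 syllables per leading consonant */
    S_COUNT = L_COUNT * N_COUNT            /* 11172 syllables in total */
};

/* Hypothetical helper: split a precomposed syllable into its three
 * jamo indices. Returns 1 on success, 0 if cp is not in the block. */
static int hangul_decompose(unsigned cp, int *l, int *v, int *t)
{
    if (cp < S_BASE || cp >= S_BASE + S_COUNT)
        return 0;
    unsigned s = cp - S_BASE;
    *l = (int)(s / N_COUNT);
    *v = (int)(s % N_COUNT / T_COUNT);
    *t = (int)(s % T_COUNT);
    return 1;
}

/* Hypothetical helper: inverse of hangul_decompose. The caller must
 * pass indices within the ranges given by the counts above. */
static unsigned hangul_compose(int l, int v, int t)
{
    return S_BASE + (unsigned)((l * V_COUNT + v) * T_COUNT + t);
}
```

For example, U+D55C decomposes to indices (18, 0, 4), and composing
those indices gives U+D55C back. Whether a regularity of this kind can
actually shrink the legacy Korean tables depends on how the legacy
encoding orders its syllables, which the message leaves open.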