Message-ID: <20130805143540.GH221@brightrain.aerifal.cx>
Date: Mon, 5 Aug 2013 10:35:40 -0400
From: Rich Felker <dalias@...ifal.cx>
To: musl@...ts.openwall.com
Subject: Re: iconv Korean and Traditional Chinese research so far

On Mon, Aug 05, 2013 at 09:53:32AM +0200, Harald Becker wrote:
> Hi Rich !
>
> > iconv is not something that needs to be extensible. There is a
> > finite set of legacy encodings that's relevant to the world,
> > and their relevance is going to go down and down with time, not
> > up.
>
> Oh! So you consider Japanese, Chinese, Korean, etc. languages
> relevant for programs sitting on my machines? How can you decide

I don't decide what's relevant for you. Rather, I don't have the
authority to declare it irrelevant-by-default. This is true even for
things like crypt algorithms (does anybody really want to use md5??)
but especially for anything that would preclude somebody from being
able to receive data in their native language.

Simple multilingual support via UTF-8 with conversion from legacy data
has been near top priority, if not top, since the conception of musl.
If history has shown us anything, it's that universal support for all
languages must be default and turning off some support to save space
(which is rarely if ever actually needed) needs to be a conscious
decision.

I'm no Apple fan by any means, but just look at the situation on iOS:
you can turn on a new iPhone or iPad and read data in any language
(including having the relevant fonts!) and even add a keyboard and
type in almost any language, without having to buy a special localized
version or install add-ons. This is very different from the situation
on Android right now.

musl's intended applicability is broad. From industrial control to
settop boxes, in-car entertainment, initramfs images for desktop
machines, phones, tablets, plug computers that run your private home
or office webmail server, full desktops, VE LAMP stacks, hosts for
VEs, etc.
Some of these usages have a real need for human-language text; others
don't. But if we have the power to make it such that, if someone uses
musl to implement a plug computer for webmail, it naturally supports
all languages unless the maker of the device goes and actively rips
that support out, then we have a responsibility to do so. Or, said
differently, it's OUR FAULT for making broken-by-default software if
language support is missing unless you go to the effort of learning
musl-specific ways to enable it.

> this? Why being so ignorant and trying to write an standard
> conform library and then pick out a list of char sets of your
> choice which may be possible on iconv, neglecting wishes and
> need of any musl user.

If I were to just accept your demands, it would essentially mean: (1)
discarding the opinions of everybody else who discussed this issue in
the past and decided that static linking should mean real static
binaries that work the same without needing extra files in the
filesystem, and (2) discarding the informed decisions I made based on
said discussions.

> .... or in other words, if you really be this ignorant and
> insist on including those charsets fixed in musl, musl is never
> more for me :( ... I don't need to bring in any part of mine into
> musl, but I don't consider a lib usable for my needs, which
> include several char set files in statical build and neglects to
> load seldom used charset definitions from extern in any way.

Name the extra "seldom used charset definitions" you're interested in.
They're probably already supported. We are not discussing adding some
new giant subsystem to musl. We are discussing adding the last two
missing major legacy charsets to an existing framework that's existed
for a long time.

> > > > Do I want to give users who have large volumes of legacy
> > > > text in their languages stored in these encodings the same
> > > > respect and dignity as users of other legacy encodings we
> > > > already support? Yes.
> > >
> > > Of course. I won't dictate others which conversions they want
> > > to use. I only hate to have plenty of conversion tables on my
> > > system when I really know I never use such kind of
> > > conversions.
> >
> > And your table for just Chinese is as large as all our tables
> > combined...
>
> How can you tell this. I don't think so.

You're welcome to implement it and see. Thanks to the way static
linking works, if you add -lyouriconv when static linking, the iconv
in musl will be completely omitted from the binary and yours will be
used instead. Of course the iconv in musl will be completely omitted
anyway except in the small number of programs that actually use iconv.
This is not glibc where stdio and locale depend on iconv. iconv is
purely iconv.

> Such conversion codes
> may be very compact. Size is mainly required for translation
> tables, that is when code points of the char sets does not match
> Unicode character order, but you always need the space for those
> translations. The rest won't be much.

That's all the size. The VAST majority of the table size is for 4
major character encoding families, those based on:

- JIS 0208
- GB 18030
- KS X 1001
- Big5

As for legacy 8-bit encodings, musl's approach to them is also more
efficient than you could easily be with a state machine. The fact that
the number of codepoints that ever appear in an 8-bit encoding is
less than 1024 is used to store the mappings as 10-bit-per-entry
packed arrays of indices into the legacy_chars table. This reduces the
marginal cost of individual 8-bit encodings by 25% (versus 16-bit
entries). The ASCII range and any span upward into the high range that
maps directly to Unicode codepoints is also elided from the table
(which reduces ISO-8859-* by another 62.5%). In short, what we have is
about the smallest possible representation you can get without
applying LZMA or something (and thereby needing all the code to
decompress and dirty pages to store the decompressed version).
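
To make the 10-bit packing concrete, here is a minimal sketch in C. It
is not musl's actual source: the names put10, get10 and decode8, the
little-endian bit order, and the shape of the shared code point table
are all assumptions chosen for illustration. It only shows how indices
below 1024 can be stored at 10 bits each, and how an identity-mapped
low range can be left out of the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers, not musl's code. Entry i occupies bits
 * [10*i, 10*i + 10) of the byte array, least significant bit first. */

/* Store the 10-bit value v as entry i, one bit at a time. */
static void put10(unsigned char *tab, size_t i, unsigned v)
{
    assert(v < 1024);
    size_t bit = i * 10;
    for (int k = 0; k < 10; k++, bit++) {
        unsigned char mask = (unsigned char)(1u << (bit % 8));
        if ((v >> k) & 1)
            tab[bit / 8] |= mask;
        else
            tab[bit / 8] &= (unsigned char)~mask;
    }
}

/* Load entry i. The starting bit offset within a byte is always
 * 0, 2, 4 or 6, so a 10-bit field spans exactly two bytes. */
static unsigned get10(const unsigned char *tab, size_t i)
{
    size_t bit = i * 10;
    size_t byte = bit / 8;
    unsigned shift = (unsigned)(bit % 8);
    unsigned x = tab[byte] | ((unsigned)tab[byte + 1] << 8);
    return (x >> shift) & 0x3ff;
}

/* Hypothetical decoder for one 8-bit charset. Bytes below `first`
 * map to themselves (the elided identity range); every other byte
 * looks up a 10-bit index into a shared table of code points. */
static unsigned decode8(const unsigned char *tab, unsigned first,
                        const uint16_t *chars, unsigned char c)
{
    return c < first ? c : chars[get10(tab, c - first)];
}
```

Under these assumptions, a charset whose first 0xA0 bytes map to
themselves needs 96 packed entries, which is 96 * 10 / 8 = 120 bytes,
plus its share of the common code point table.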
It's hard to beat.

By the way, if you really want to save the space they take, you could
just delete this email thread from your mail folder. It's larger than
musl's iconv already. :-)

> > I agree you can make iconv smaller than musl's in the case
> > where _no_ legacy DBCS are installed. But if you have just one,
> > you'll be just as large or larger than musl with them all.
>
> .... musl with them all? I don't consider them smaller than an
> optimized byte code interpreter ... not when you are going to
> include DBCS char sets fixed into musl. At least if you do all
> the required translations.

I may have been exaggerating a little bit, but I doubt you can get
your bytecode GB18030 support smaller than about 110k once you count
the bytecode and the interpreter binary. I'm even more doubtful that
you can get it smaller than the current 71k in musl.

> > compare the size of musl's tables to glibc's converters. I've
> > worked hard to make them as small as reasonably possible
> > without doing hideous hacks like decompression into an
> > in-memory buffer, which would actually increase bloat.
>
> Are you now going to build a lib for startup purpose and embedded
> systems only or are you trying to write a general purpose
> library?

General-purpose. Have you not read the website?

    Originally in the 1990s, Linux-based systems used a fork of the
    GNU C library (glibc) version 1, which existed in various versions
    (libc4, libc5). Later, distributions adopted the more mature
    version 2 of glibc, and denoted it libc6. Since then, other
    specialized C library implementations such as uClibc and dietlibc
    have emerged as well.

    musl is a new general-purpose implementation of the C library. It
    is lightweight, fast, simple, free, and aims to be correct in the
    sense of standards-conformance and safety.

If you're using it for startup purposes or embedded systems that don't
communicate with humans in human language, you won't be running
applications that call iconv() and thus it's irrelevant.

> On one hand you say "use dietlibc" if you need small statical
> programs and on the other hand you want to include many charset
> definitions into a statical build to avoid dynamic loading of
> tables, required only on embedded systems.

Where did I say "use dietlibc"? If I did (I don't really remember) it
was not a serious recommendation but a sarcastic remark to make a
point that musl is not about being "smallest-at-all-costs" (and
thereby broken) like dietlibc is.

> > have been over in 1992, when Pike and Thompson made them
> > obsolete, but it's really over now.
>
> So why are you adding Japanese, Chinese and Korean charsets to an
> iconv conversion in musl? Why not just using UTF-8? Whenever you
> use iconv you want the flexibility to do all required charset
> conversions. Which means you need to statically link in many
> charset definitions or you need to dynamically load what is
> required.

The time of creating charsets is over. That does not magically make
the data created in those charsets in the past go away or convert
itself to UTF-8. It doesn't even magically stop people from making new
data in those charsets. All it means is that governments, vendors,
etc. have stopped the madness of making new charsets.

> > Then dynamic link it. If you want an extensible binary, you use
> > dynamic linking.
>
> Dynamic linking of mail client, ok and where go the charset
> definition files? Are they all packed into your libc.so? That is
> a very big file? Why do I need to have Asian language definition
> on my disk, when I do not want?

Because any other solution would be larger, would defeat the purpose
of static linking, and would contribute to the problem of poor
multilingual support.
Why are you upset about these tables and not other tables like crypto
sboxes, wcwidth, character classes, bits of 2/pi and pi/2, etc.? By
the way, math/*.o are also fairly large, on the same order of
magnitude as iconv; would you also suggest we move it all out to
bytecode loaded at runtime even in static binaries?

> It is your decision, but please state clear what purpose you are
> building musl. Here it looks you are mixing things and stepping in
> a direction I will never like.

This has all been documented all along. I'm sorry you don't understand
the goals of the project. Perhaps your misunderstanding is what
"general purpose" means. It does not mean we omit anything that could
offend anyone by wasting a few bytes on their hard drive. It means we
don't cut corners that break important usage cases. Having a complete
iconv linked whenever you link a program using iconv() does not break
your usage case unless you have less than 100k of disk/ssd/rom storage
to spare, and in that case, you probably shouldn't be using iconv.

If anyone ever does have a practical difficulty because of this,
rather than theoretical complaints based on anglocentrism,
eurocentrism, and/or xenophobia, I am not entirely opposed to making a
build option to omit iconv tables, but it has to be well-motivated.

Rich