Message-ID: <20130805024651.6d7e1b7e@ralda.gmx.de>
Date: Mon, 5 Aug 2013 02:46:51 +0200
From: Harald Becker <ralda@....de>
Cc: musl@...ts.openwall.com, dalias@...ifal.cx
Subject: Re: iconv Korean and Traditional Chinese research so far

Hi Rich,

in addition to my previous message, to clarify some things:

04-08-2013 12:51 Rich Felker <dalias@...ifal.cx>:

> Worst-case, adding Korean and Traditional Chinese tables will
> roughly double the size of iconv.o to around 150k. This will
> noticably enlarge libc.so, but will make no difference to
> static-linked programs except those using iconv. I'm hoping we
> can make these additions less expensive, but I don't see a good
> way yet.

I would write iconv as an interpreter for a very simple byte
code virtual machine. The byte code (program) of the virtual
machine is just an array of unsigned bytes, and the virtual
machine only contains instructions to read the next byte and
assemble a Unicode value, or to receive a Unicode value and
produce multi-byte character output. The virtual machine code
itself works like a finite state machine to handle multi-byte
character sets.

That way iconv consists of a small byte code interpreter that
implements the virtual machine. It then maps the byte code for
any required character set from an external file. This byte code
consists of virtual machine instructions and conversion tables.

As this virtual machine shall be optimized for conversion,
converting a character requires interpreting only a few virtual
instructions (for simple character sets; big ones may need a few
more). This is usually very fast, as not much data is involved
and the instructions are tailored to the conversion operation.

The virtual machine works with a data space of only a few bytes
(less than 256), some of which must be preserved from one
conversion call to the next. That is, conversion needs a
conversion context of a few bytes (8..16).

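To make the idea concrete, here is a minimal sketch of such an
interpreter in C, for the decoding direction only. The opcodes
(OP_READ, OP_JLT, OP_TAB16, OP_EMIT, OP_FAIL), the VM_* return
codes, vm_decode(), the 64-step budget and the toy_charset program
are all assumptions made for illustration; nothing here is an
existing instruction set or musl code. The step budget is one
simple way to keep broken byte code from looping forever.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical instruction set, invented for this sketch. */
enum {
    OP_READ,  /* acc = next input byte */
    OP_JLT,   /* operands: imm, target; jump to target if acc < imm */
    OP_TAB16, /* operands: first, off; acc = big-endian 16-bit entry
                 number (acc - first) of the table at offset off */
    OP_EMIT,  /* finish: acc is the decoded Unicode code point */
    OP_FAIL   /* finish: the input byte sequence is invalid */
};

enum { VM_OK = 0, VM_NEED_INPUT = -1, VM_ILSEQ = -2, VM_BADCODE = -3 };

/* Assumed per-character step budget against looping byte code. */
#define VM_MAX_STEPS 64

/* Decode one character from *in. Input is consumed only on VM_OK. */
int vm_decode(const unsigned char *code, size_t codelen,
              const unsigned char **in, size_t *inlen, uint32_t *out)
{
    assert(code && in && *in && inlen && out);
    const unsigned char *p = *in;
    size_t n = *inlen, pc = 0;
    uint32_t acc = 0;

    for (int steps = 0; steps < VM_MAX_STEPS; steps++) {
        if (pc >= codelen) return VM_BADCODE;
        switch (code[pc++]) {
        case OP_READ:
            if (n == 0) return VM_NEED_INPUT;
            acc = *p++;
            n--;
            break;
        case OP_JLT:
            if (codelen - pc < 2) return VM_BADCODE;
            pc = acc < code[pc] ? code[pc + 1] : pc + 2;
            break;
        case OP_TAB16: {
            if (codelen - pc < 2) return VM_BADCODE;
            if (acc < code[pc]) return VM_ILSEQ;
            size_t idx = code[pc + 1] + 2 * (size_t)(acc - code[pc]);
            if (idx + 1 >= codelen) return VM_BADCODE;
            acc = (uint32_t)code[idx] << 8 | code[idx + 1];
            pc += 2;
            break;
        }
        case OP_EMIT:
            *in = p;
            *inlen = n;
            *out = acc;
            return VM_OK;
        case OP_FAIL:
            return VM_ILSEQ;
        default:
            return VM_BADCODE;
        }
    }
    return VM_BADCODE; /* step budget exhausted */
}

/* Toy charset: bytes below 0x80 map to themselves, 0x80 and 0x81 go
   through a two-entry table, anything else is invalid. The leading
   comments give the byte offset of each instruction. */
static const unsigned char toy_charset[] = {
    /*  0 */ OP_READ,
    /*  1 */ OP_JLT, 0x80, 11,
    /*  4 */ OP_JLT, 0x82, 8,
    /*  7 */ OP_FAIL,
    /*  8 */ OP_TAB16, 0x80, 12,
    /* 11 */ OP_EMIT,
    /* 12 */ 0x20, 0xAC, /* 0x80 -> U+20AC */
    /* 14 */ 0x00, 0xE4  /* 0x81 -> U+00E4 */
};
```

This sketch keeps no state between calls: on VM_NEED_INPUT nothing
is consumed and the caller simply retries with more input. A
stateful charset would additionally need the small saved context
described above.
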
Independently of any character set conversion you want to add,
you only need a single byte code interpreter for iconv, which
will not increase in size. Only the external byte code and
conversion tables for the charsets vary in size. Simple charsets,
like the Latin ones, consist of only a few bytes of byte code;
big charsets like Japanese, Chinese and Korean need some more
byte code and maybe some bigger translation tables ... but those
tables are only loaded if iconv needs to access such a charset.

iconv itself doesn't need to keep a table of available charsets;
it only converts the charset name into a filename and opens the
corresponding charset translation file. A header and version
check on the charset file shall handle possible installation
conflicts.

For any conversion request the virtual machine interpreter runs
through the byte code of the requested charset and returns the
conversion result. As the virtual machine shall not contain
operations that can affect the rest of the system, this shall
not break system security. At worst the byte code is so
misbehaved that it runs forever, without producing an error or
any output. The machine then just hangs in an infinite loop
during conversion until the process is terminated (a simple
counter may limit the number of executed instructions and bail
out in case of such looping).

> At some point, especially if the cost is not reduced, I will
> probably add build-time options to exclude a configurable
> subset of the supported character encodings. This would not be
> extremely fine-grained, and the choices to exclude would
> probably be just: Japanese, Simplified Chinese, Traditional
> Chinese, and Korean. Legacy 8-bit might also be an option but
> these are so small I can't think of cases where it would be
> beneficial to omit them (5k for the tables on top of the 2k of
> actual code in iconv).
> Perhaps if there are cases where iconv
> is needed purely for conversion between different Unicode
> forms, but no legacy charsets, on tiny embedded devices,
> dropping the 8-bit tables and all of the support code could be
> useful; the resulting iconv would be around 1k, I think.

You may skip all this if iconv is constructed as a virtual
machine interpreter and all character conversions are loaded
from an external file. As a fallback the library may compile in
the byte code for some small charset conversions, like ASCII,
Latin-1 and UTF-8. All other charset conversions are loaded from
external resources, which may be installed or not depending on
the admin's decision, and can simply be added later if required.

--
Harald
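The lookup described in this message (charset name to filename,
then a header and version check on the file) might be sketched as
follows. Every name here is hypothetical: the directory
/usr/share/iconv/, the .bc suffix, the "ICVM" magic, the version
number 1 and the two functions are assumptions chosen only for
illustration, since the message fixes no file layout.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout; none of these values come from the proposal. */
#define CHARSET_DIR     "/usr/share/iconv/"
#define CHARSET_SUFFIX  ".bc"
#define CHARSET_MAGIC   "ICVM"
#define CHARSET_VERSION 1

/* Build the file name for a charset. Returns 0 on success, -1 if the
   name is empty, does not fit, or contains a character outside
   [A-Za-z0-9_-] (which also rules out '/' and ".."). */
int charset_path(const char *name, char *buf, size_t buflen)
{
    assert(name && buf);
    size_t dirlen = strlen(CHARSET_DIR);
    size_t namelen = strlen(name);
    size_t suflen = strlen(CHARSET_SUFFIX);

    if (namelen == 0 || buflen < dirlen + suflen + 1 ||
        namelen > buflen - dirlen - suflen - 1)
        return -1;
    memcpy(buf, CHARSET_DIR, dirlen);
    for (size_t i = 0; i < namelen; i++) {
        unsigned char c = (unsigned char)name[i];
        if (!isalnum(c) && c != '-' && c != '_') return -1;
        buf[dirlen + i] = (char)tolower(c); /* names are case-insensitive */
    }
    /* Copy the suffix together with its terminating NUL. */
    memcpy(buf + dirlen + namelen, CHARSET_SUFFIX, suflen + 1);
    return 0;
}

/* Check the assumed 5-byte header: 4 magic bytes plus a version byte. */
int charset_header_ok(const unsigned char *data, size_t len)
{
    return len >= 5 && memcmp(data, CHARSET_MAGIC, 4) == 0 &&
           data[4] == CHARSET_VERSION;
}
```

With these assumptions, a request for "EUC-KR" would resolve to
/usr/share/iconv/euc-kr.bc, and a file with a wrong magic or
version would be rejected before any byte code is interpreted.
Restricting the allowed characters keeps a charset name from
escaping the charset directory.
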