musl - Re: iconv Korean and Traditional Chinese research so far

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20130805024651.6d7e1b7e@ralda.gmx.de>
Date: Mon, 5 Aug 2013 02:46:51 +0200
From: Harald Becker <ralda@....de>
Cc: musl@...ts.openwall.com, dalias@...ifal.cx
Subject: Re: iconv Korean and Traditional Chinese research so far

Hi Rich,

in addition to my previous message to clarify some things:

04-08-2013 12:51 Rich Felker <dalias@...ifal.cx>:

> Worst-case, adding Korean and Traditional Chinese tables will
> roughly double the size of iconv.o to around 150k. This will
> noticably enlarge libc.so, but will make no difference to
> static-linked programs except those using iconv. I'm hoping we
> can make these additions less expensive, but I don't see a good
> way yet.

I would write iconv as a virtual machine interpreter for a very
simple byte code machine. The byte code (program) of the virtual
machine is just an array of unsigned bytes and the virtual
machine only contains the instructions to read next byte and
assemble a Unicode value or to receive a Unicode value and to
produce multi byte character output. The virtual machine code
itself works like a finite state machine to handle multi byte
character sets.

That way iconv consist of a small byte code interpreter to build
the virtual machine. Then it maps the byte code from an external
file for any required character set. This byte code from external
file consist of virtual machine instructions and conversion
tables. As this virtual machine shall be optimized for the
conversion purposes, conversion operations require only
interpretation of a view virtual instructions per converted
character (for simple character sets, big ones may need a few
more instructions). This operation is usually very fast, as not
much data is involved and instructions are highly optimized for
conversion operation.

The virtual machine works with a data space of only a few bytes
(less than 256), where some bytes need to preserve from one
conversion call to next. That is conversion needs a conversion
context of a few bytes (8..16). 

Independently from any character set conversion you want to add,
you only need a single byte code interpreter for iconv, which
will not increase in size. Only the external byte code /
conversion table for the charsets may vary in size. Simple char
sets, like Latins, consist of only a few bytes of byte code, big
charsets like Japanese, Chinese and Korean, need some more byte
code and may be some bigger translation tables ... but those
tables are only loaded if iconv needs to access such a charset.

iconv itself doesn't need to handle table of available charsets,
it only converts the charset name into a filename and opens the
corresponding charset translation file. On the charset file some
header and version check shall handle possible installation
conflicts. For any conversion request the virtual machine
interpreter runs through the byte code of the requested charset
and returns the conversion result. As the virtual machine shall
not contain operations to violate the remainder of the system,
this shall not break system security. At most the byte code is so
misbehaved that it runs forever, without producing an error or
any output. So the machine hangs just in an infinite loop during
conversion, until the process is terminated (a simple counter may
limit number of executed instructions and bail out in case of such
looping).

> At some point, especially if the cost is not reduced, I will
> probably add build-time options to exclude a configurable
> subset of the supported character encodings. This would not be
> extremely fine-grained, and the choices to exclude would
> probably be just: Japanese, Simplified Chinese, Traditional
> Chinese, and Korean. Legacy 8-bit might also be an option but
> these are so small I can't think of cases where it would be
> beneficial to omit them (5k for the tables on top of the 2k of
> actual code in iconv). Perhaps if there are cases where iconv
> is needed purely for conversion between different Unicode
> forms, but no legacy charsets, on tiny embedded devices,
> dropping the 8-bit tables and all of the support code could be
> useful; the resulting iconv would be around 1k, I think.

You may skip all this, if iconv is constructed as a virtual
machine interpreter and all character conversions are loaded from
an external file.

As a fallback the library may compile in the byte code for some
small charset conversions, like ASCII, Latin-1, UTF-8. All other
charset conversions are loaded from external resources, which may
be installed or not depending on admins decision. And just
added if required later.  

--
Harald
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.