Message-ID: <20150608003315.GD17573@brightrain.aerifal.cx>
Date: Sun, 7 Jun 2015 20:33:15 -0400
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Build option to disable locale [was: Byte-based C locale, draft 1]

On Mon, Jun 08, 2015 at 01:59:35AM +0200, Harald Becker wrote:
> On 07.06.2015 02:24, Rich Felker wrote:
> >It's somewhat more clear what you're talking about, but I'm still not
> >sure what specific pieces of code you would want to omit from libc.so.
> >Which of the following would you want to remove or keep?
>
> I did not look into all the details ...
>
> In general: Keep the API, but add stubs with minimal operation or
> fail for none C locale (etc.).
>
> >- UTF-8 encoding and decoding
>
> May be of use to keep, if on bare minimum.

This is roughly 3k of code, and is mandatory if you want to say you
"support UTF-8" at all. I'll note the other parts that fundamentally
depend on it.

> >- Character properties
> > - Case mappings
>
> Keep ASCII, map all none ASCII to a single value.

I assume by "map to a single value" you mean uniform properties for
all non-ASCII Unicode characters, e.g. just printable but nothing
else. Case-mapping everything down to one character would not be a
good idea. :-)

Character properties are roughly 11k of code. Case mappings are 1k of
code. Note that while some of the properties are arguably not very
useful (the wctype system does not give you enough information to do
serious text processing with them), without the wcwidth property, you
cannot properly display non-ASCII text on a terminal. So at least this
one, which takes 3k, is pretty critical to "UTF-8 support".

> >- Internal message translation (nl_langinfo strings, errors, etc.)
> > - Message translation API (gettext)
>
> No translation at all, keep the English messages (as short as possible).

The internal translation support is about 2k. The gettext system is
roughly another 2k on top of that (and depends on the former). I agree
this is completely non-mandatory for "UTF-8 support" and that's why
musl originally didn't have it.

> >- Charset conversion (iconv)
>
> Copy ASCII / UTF-8, but fail for all other.

iconv is big. About 128k. The ability to selectively omit some or all
legacy charsets from iconv is a long-term goal. Of course if you have
an actual need for character set conversion, e.g. reading email in
mutt, then your alternative to musl's 128k iconv is GNU libiconv
weighing in at several MB...

> >- Non-ASCII characters in regex and fnmatch patterns/brackers
>
> May be the question to allow for UTF-8, but only those, no other
> charsets (should allow to do some optimization and avoid all the
> extended overhead).

That's how it is now.

> fnmatch: Match None ASCII just 1:1, no other special operation.
>
> regex: Don't have the experience on the internals of this topic. In
> general allow for 1:1 matching of none ASCII characters, but
> otherwise behave as C locale (e.g. equivalence classes).

For both fnmatch and regex, the single-character-match (? or .
respectively) matches characters, not bytes. Likewise bracket
expressions match characters. In order for this to work at all, you
need UTF-8 decoding (see above). There's no directly measurable code
size cost for these items; the savings from not doing UTF-8 would come
from completely different code that doesn't now exist in musl for
bypassing mbtowc and just working directly on input bytes.

So aside from iconv, the above seem to total around 19k, and at least
6k of that is mandatory if you want to be able to claim to support
UTF-8. So the topic at hand seems to be whether you can save <13k of
libc.so size by hacking out character handling/locale related features
that are non-essential to basic UTF-8 support...

Rich
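
To make the fnmatch/regex point concrete, here is a minimal sketch of why a single-character match needs UTF-8 decoding. This is not musl's fnmatch code: utf8_seq_len and qmark_match are hypothetical names invented for this illustration, the pattern language is reduced to literal bytes plus '?', and the decoder is simplified (it checks lead and continuation bytes but does not reject every overlong encoding or surrogate).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length in bytes of the UTF-8 sequence starting at s,
 * or 0 if s does not start a plausible sequence. Simplified on purpose:
 * only the lead byte range and the continuation bytes are checked. */
static size_t utf8_seq_len(const unsigned char *s)
{
    size_t n, i;

    if (s[0] < 0x80) n = 1;                        /* ASCII */
    else if (s[0] >= 0xC2 && s[0] <= 0xDF) n = 2;  /* 2-byte lead */
    else if (s[0] >= 0xE0 && s[0] <= 0xEF) n = 3;  /* 3-byte lead */
    else if (s[0] >= 0xF0 && s[0] <= 0xF4) n = 4;  /* 4-byte lead */
    else return 0;                                 /* stray or invalid byte */

    /* A terminating NUL fails this check, so we never read past the end. */
    for (i = 1; i < n; i++)
        if ((s[i] & 0xC0) != 0x80) return 0;
    return n;
}

/* Hypothetical matcher: '?' in pat matches exactly one unit of str, every
 * other pattern byte matches itself. With by_char != 0 the unit is one
 * UTF-8 character; with by_char == 0 it is one byte.
 * Returns 1 on a full match, 0 otherwise. */
static int qmark_match(const char *pat, const char *str, int by_char)
{
    assert(pat && str);

    while (*pat) {
        if (*pat == '?') {
            size_t n = 1;
            if (!*str) return 0;           /* nothing left to consume */
            if (by_char) {
                n = utf8_seq_len((const unsigned char *)str);
                if (!n) return 0;          /* invalid input never matches */
            }
            str += n;
            pat++;
        } else {
            if (*pat != *str) return 0;    /* literal byte comparison */
            pat++;
            str++;
        }
    }
    return *str == 0;                      /* pattern must consume all input */
}
```

With these definitions, qmark_match("caf?", "caf\xc3\xa9", 1) succeeds, while the same call with by_char set to 0 fails, because the byte-wise '?' consumes only the first of the two bytes that encode "é".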