|
Message-ID: <20180307154420.GM1436@brightrain.aerifal.cx> Date: Wed, 7 Mar 2018 10:44:20 -0500 From: Rich Felker <dalias@...c.org> To: musl@...ts.openwall.com Subject: Re: Efficient case mapping On Tue, Mar 06, 2018 at 03:27:41PM -0500, Rich Felker wrote: > Replacing the compact but slow linear-search-based case mapping tables > with something comparably compact but fast has been on the hidden > roadmap for a while now, and I think I have a good solution: > > Analysis of Unicode blocks (256-char) shows that most blocks have > either no case mappings at all, or just a single common delta between > lower and upper case for the whole block. There are about 11 blocks > that break this rule and have more. The most unique deltas with > nontrivial frequency (same delta used by more than a single-digit > number of characters) in any block is 3, appearing only in block 04. > > As such, a single-direction case mapping can be achieved as follows. > Use two multi-level bit tables (same format as ctype bits) to assign a > 2-bit number to each codepoint, and 2 integer deltas (the two most > common) to each block. The meanings are: > > 00 - no case mapping > 01 - map via block's first delta > 10 - map via block's second delta > 11 - exception; binary search tiny table of exceptions > > Obviously this can be extended to two directions via a second pair of > tables, but I think we can do better, just 3 bits total: > > 000 - no case mappings > 001 - is lowercase; map to uppercase via block's first toupper delta > 010 - is uppercase; map to lowercase via block's first tolower delta > 011 - exception; binary search tiny table of exceptions > 100 - is lowercase; map to uppercase via block's second toupper delta > 101 - is lowercase; map to uppercase via block's third toupper delta > 110 - is uppercase; map to lowercase via block's second tolower delta > 111 - is uppercase; map to lowercase via block's third tolower delta > > Most blocks will use no bits at all (no case mappings). A few blocks > that only contain lowercase with a single delta will use just one bit. > Many blocks with case mappings will use 2 bits (only one frequent > delta for each direction). The remaining few will use 3 bits, and now > they can code not just 2 but 3 frequent deltas in each direction, > getting rid of a lot of exceptions. > > Some quick size estimates (hope I didn't mess this up): > > 11 complex blocks -- 11 blocks * 256/8 bytes * 3 bits > 7 simple blocks -- 7 blocks * 256/8 bytes * 1 bit > Top-level tables -- 512 blocks * 1 byte * 3 tables > Exceptions -- ~96 pairs * 4 bytes * 2 sortings > > Looks like it comes to about 3k total. Note that this was a significant underestimate. Of course it omitted any accounting for storage of per-block deltas (up to 6 of them for each block) like my first total for the improved scheme, but there seem to be other significant errors too. The counts for "complex" and "simple" blocks were based on just one direction of case mapping, upper to lower. Once you bring in both directions, very little is simple and the number of blocks increases quite a bit. > Further optimizing: We're not using all the data we already have. Only > alphabetic characters can have case mappings, and we already have an > iswalpha table; the above tables are highly correlated with it. > > If we short-circuit out the whole operation with iswalpha, then the > "000 - no case mappings" value is no longer common. Instead, "zero is > one of the common deltas" can be included or not on a per-block basis. An interesting find after implementing this (and reporting the results, which had a minor bug, in the previous reply to the quoted email): ahort-circuiting with iswalpha barely helps at all. There are only two blocks 04 and 24, that have fewer rules after folding the nonalphabetic characters into another rule as "dontcare". Of those, 04 has loads of exceptional cases and saving a rule has minor benefit. Only 24 actually goes from needing 2 bits down to needing only 1 bit. So the "further optimizing" really has only marginal benefit, saving about 72 bytes of table at the (size and time) expense of a call to iswalpha. Arguably a better optimization might be to optimize iswalpha (3.8k) as iswlower||iswupper||small_table_of_caseless_alpha. What this really comes down to is that having a multi-level bit table per-property for character properties is always going to be locally cheaper (for static linking when you just want one property) than a table mapping characters to a bit vector representing all their properties and just picking out the property you want, but if you have a variable-width table like what I'm doing with case mappings, where the width (and thus number/set of possible output combinations) varies per block, it may be more efficient to combine all properties in to a single variable-width-bit-vector table (mainly/especially if the properties are correlated). If so, it may make sense to make such a change eventually -- to have a single musl-internal character properties lookup function that returns a bit vector representing: - wcwidth: -1, 0, 1, or 2 - type: lower[delta], upper[delta], casing-exception, nocase-alpha, punct, control, or space Then iswalpha would be represented as reading the table and masking for lower||upper||casing-exception||nocase-alpha, etc. The key to how this is different, and much more efficient, than glibc and other implementations is the multilevel aspect of the table and the variable-width of the encoding, whereby blocks with only a few possible outputs only need 1 or 2 bits. At 128 blocks needing 3 bits on average (a number I made up, assuming over half of the 512 blocks are CJK or unassigned and some others are simpler) this would be 13k of tables, which is comparable to the current size of the character type data (iswalpha+iswpunct+wcwidth) without any case mappings. So it looks like a minor win at best, but maybe something to keep in mind for the future. Rich
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.