musl - Re: Efficient case mapping

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180307154420.GM1436@brightrain.aerifal.cx>
Date: Wed, 7 Mar 2018 10:44:20 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Efficient case mapping

On Tue, Mar 06, 2018 at 03:27:41PM -0500, Rich Felker wrote:
> Replacing the compact but slow linear-search-based case mapping tables
> with something comparably compact but fast has been on the hidden
> roadmap for a while now, and I think I have a good solution:
> 
> Analysis of Unicode blocks (256-char) shows that most blocks have
> either no case mappings at all, or just a single common delta between
> lower and upper case for the whole block. There are about 11 blocks
> that break this rule and have more. The most unique deltas with
> nontrivial frequency (same delta used by more than a single-digit
> number of characters) in any block is 3, appearing only in block 04.
> 
> As such, a single-direction case mapping can be achieved as follows.
> Use two multi-level bit tables (same format as ctype bits) to assign a
> 2-bit number to each codepoint, and 2 integer deltas (the two most
> common) to each block. The meanings are:
> 
> 00 - no case mapping
> 01 - map via block's first delta
> 10 - map via block's second delta
> 11 - exception; binary search tiny table of exceptions
> 
> Obviously this can be extended to two directions via a second pair of
> tables, but I think we can do better, just 3 bits total:
> 
> 000 - no case mappings
> 001 - is lowercase; map to uppercase via block's first toupper delta
> 010 - is uppercase; map to lowercase via block's first tolower delta
> 011 - exception; binary search tiny table of exceptions
> 100 - is lowercase; map to uppercase via block's second toupper delta
> 101 - is lowercase; map to uppercase via block's third toupper delta
> 110 - is uppercase; map to lowercase via block's second tolower delta
> 111 - is uppercase; map to lowercase via block's third tolower delta
> 
> Most blocks will use no bits at all (no case mappings). A few blocks
> that only contain lowercase with a single delta will use just one bit.
> Many blocks with case mappings will use 2 bits (only one frequent
> delta for each direction). The remaining few will use 3 bits, and now
> they can code not just 2 but 3 frequent deltas in each direction,
> getting rid of a lot of exceptions.
> 
> Some quick size estimates (hope I didn't mess this up):
> 
> 11 complex blocks -- 11 blocks * 256/8 bytes * 3 bits
> 7 simple blocks -- 7 blocks * 256/8 bytes * 1 bit
> Top-level tables -- 512 blocks * 1 byte * 3 tables
> Exceptions -- ~96 pairs * 4 bytes * 2 sortings
> 
> Looks like it comes to about 3k total.

Note that this was a significant underestimate. Of course it omitted
any accounting for storage of per-block deltas (up to 6 of them for
each block) like my first total for the improved scheme, but there
seem to be other significant errors too. The counts for "complex" and
"simple" blocks were based on just one direction of case mapping,
upper to lower. Once you bring in both directions, very little is
simple and the number of blocks increases quite a bit.

> Further optimizing: We're not using all the data we already have. Only
> alphabetic characters can have case mappings, and we already have an
> iswalpha table; the above tables are highly correlated with it.
> 
> If we short-circuit out the whole operation with iswalpha, then the
> "000 - no case mappings" value is no longer common. Instead, "zero is
> one of the common deltas" can be included or not on a per-block basis.

An interesting find after implementing this (and reporting the
results, which had a minor bug, in the previous reply to the quoted
email): ahort-circuiting with iswalpha barely helps at all. There are
only two blocks 04 and 24, that have fewer rules after folding the
nonalphabetic characters into another rule as "dontcare". Of those, 04
has loads of exceptional cases and saving a rule has minor benefit.
Only 24 actually goes from needing 2 bits down to needing only 1 bit.
So the "further optimizing" really has only marginal benefit, saving
about 72 bytes of table at the (size and time) expense of a call to
iswalpha.

Arguably a better optimization might be to optimize iswalpha (3.8k) as
iswlower||iswupper||small_table_of_caseless_alpha.

What this really comes down to is that having a multi-level bit table
per-property for character properties is always going to be locally
cheaper (for static linking when you just want one property) than a
table mapping characters to a bit vector representing all their
properties and just picking out the property you want, but if you have
a variable-width table like what I'm doing with case mappings, where
the width (and thus number/set of possible output combinations) varies
per block, it may be more efficient to combine all properties in to a
single variable-width-bit-vector table (mainly/especially if the
properties are correlated). If so, it may make sense to make such a
change eventually -- to have a single musl-internal character
properties lookup function that returns a bit vector representing:

- wcwidth: -1, 0, 1, or 2
- type: lower[delta], upper[delta], casing-exception, nocase-alpha,
  punct, control, or space

Then iswalpha would be represented as reading the table and masking
for lower||upper||casing-exception||nocase-alpha, etc.

The key to how this is different, and much more efficient, than glibc
and other implementations is the multilevel aspect of the table and
the variable-width of the encoding, whereby blocks with only a few
possible outputs only need 1 or 2 bits.

At 128 blocks needing 3 bits on average (a number I made up, assuming
over half of the 512 blocks are CJK or unassigned and some others are
simpler) this would be 13k of tables, which is comparable to the
current size of the character type data (iswalpha+iswpunct+wcwidth)
without any case mappings. So it looks like a minor win at best, but
maybe something to keep in mind for the future.

Rich
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.