musl - Re: Efficient case mapping

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180307182536.GO1436@brightrain.aerifal.cx>
Date: Wed, 7 Mar 2018 13:25:36 -0500
From: Rich Felker <dalias@...c.org>
To: musl@...ts.openwall.com
Subject: Re: Efficient case mapping

On Tue, Mar 06, 2018 at 10:48:52PM -0500, Rich Felker wrote:
> On Tue, Mar 06, 2018 at 03:27:41PM -0500, Rich Felker wrote:
> > Further optimizing: We're not using all the data we already have. Only
> > alphabetic characters can have case mappings, and we already have an
> > iswalpha table; the above tables are highly correlated with it.
> > 
> > If we short-circuit out the whole operation with iswalpha, then the
> > "000 - no case mappings" value is no longer common. Instead, "zero is
> > one of the common deltas" can be included or not on a per-block basis.
> > 
> > We can also add, along with the per-block common deltas, a table of
> > which cases each delta goes with, and assign new bit meanings:
> > 
> > 00 - is case[0] for block, map to ~case[0] via delta[0]
> > 01 - is case[1] for block, map to ~case[1] via delta[1]
> > 10 - is case[2] for block, map to ~case[2] via delta[2]
> > 11 - is exception, map via binary search of exceptions table
> > 
> > If needed we could add back a third bit here, but I think we can get
> > by without it.
> > 
> > Note that, because iswalpha is now short-circuiting non-alphabetic
> > characters, we can just assign an arbitrary table value (best: 00) to
> > all non-alphabetic (including unassigned) characters. This drops most
> > of the "simple blocks" (blocks with just 0 or a single delta) down to
> > trivial (elided) blocks.
> > 
> > These last ideas ("further optimizing:" and beyond) were worked out
> > while writing this email, and I don't yet have any analysis/test code
> > to go with them. Will follow up later with results. With some luck,
> > final version will be barely-larger than current code, extensible, and
> > fast. :)
> 
> I now have results for this approach which I'm attaching. The output
> is from my block analysis tool. For each block, it displays case
> mapping rules sorted by decreasing frequency, with characters that
> don't matter (because they're non-alphabetic) merged into the most
> frequent rule. The syntax is:
> 
>   rule d(t) [f]
> 
> where t is the character type (n = no case, u = uppercase, l =
> lowercase, x = exceptional/titlecase), d is the delta to the opposite
> case character, and f is the frequency of the rule (again, counting
> merging of dontcare into the most frequent).
> 
> While the vast majority of blocks are fine with 0-2 bits of rule data,
> a few do have more than 3 fairly-frequent rules, the worst being
> blocks 04, 05, and 2C. Having an extra bit would save at least 169
> exceptions in those, and thus ~676 bytes. The fixed cost of an extra
> table bit is 512 bytes (for the first level), plus 32 bytes per block
> that contains data, so it's pretty close to break-even, but keeping
> the exception table small does help performance too. My leaning would
> be to add the extra bit, but write a cost function so that it's only
> used where the 4th-7th frequencies for the block total more than 8
> (the number of exceptions to break even at 32 bytes).
> 
> With that approach it looks like we have about 220 entries in the
> exception tables, 4-5 blocks with a third bit, 21 blocks with a second
> bit, 24 blocks with a first bit, and 3 table headers. That comes to
> about 880+160+672+768+1536 = 4k.
> 
> One thing I haven't accounted for is storing the 3 (or 7) deltas per
> block. The raw deltas are signed 17-bit, but if we assume mappings
> don't cross planes they can be thought of as unsigned 16-bit on the
> low 16 bits of the codepoint. But that's still pretty large -- 3
> 16-bit numbers by 512 blocks is 3k. Fortunately I think we can do a
> lot better. First, have a table of all the deltas that ever appear
> (about 20). Then, for each block, have 0-7 single-byte indices into
> this table, and a single-byte index to the sequence of indices. Now we
> just have about 2*20+512+5*7+16*3+3*1 = 638 bytes.
> 
> 4.6k is a good bit more than I was planning on spending here, so maybe
> there are still improvements to be made...

And here is a draft of the code that would use the mapping. The
#ifdef'd external tables are to allow compiling the code without
actually having table data yet; otherwise gcc optimizes out the whole
translation unit. :-)

Comments?

Rich

View attachment "casemap.c" of type "text/plain" (1804 bytes)
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.