|
|
Message-ID: <20260321174658.GX1827@brightrain.aerifal.cx>
Date: Sat, 21 Mar 2026 13:46:58 -0400
From: Rich Felker <dalias@...c.org>
To: Xan Phung <xan.phung@...il.com>
Cc: musl@...ts.openwall.com
Subject: Re: [PATCH v2] wctype: reduce size of iswalpha & iswpunct by
47%
On Sat, Jan 24, 2026 at 12:20:45AM +1100, Xan Phung wrote:
> Currently iswalpha and iswpunct have total text data size of over 8kb.
> A more efficient encoding has reduced the total size to 4.2kb, a 47%
> reduction.
>
> This encoding also features a speed improvement, with lookups for
> approx 220 of the 256 BMP code blocks done with just a single memory
> load.
>
> The remaining Unicode blocks are niche uses, and are accessed with
> three memory loads. Compared to the old musl encoding (which used a
> two level table) the topmost level remains much the same, providing
> offsets into code block units (256 codepoint granularity). The second
> level data uses fixed sizes, of one 32 bit word per codepage (where
> each 2 bit pair in word identifies a block of 16 codepoints as all 0,
> all 1, or mixed). The third level is a variable length series of
> extension bytes, indexed by the popcount of set high bits within the
> second level's 32 bit word. Popcount is used as not all 16 codepoint
> blocks need an extension byte. This popcount is calculated with
> nearly same latency as a 32 bit multiply (so it is comparable with
> the indexing speed of accessing a 2D array of non power of 2 size),
> but has the advantage of greater compactness for sparse data.
>
> Results have been tested against first 0x20000 codepoints, and match
> that returned by the pre-existing musl implementation.
>
> A new char table tool to replace the existing gen_ctype is provided
> separately. The new tool has similar command line use:
>
> gen_ctype [a|p] -- generates iswalpha & iswpunct table.h files
> gen_ctype [A|P] -- generates iswalpha & iswpunct dict.h files
>
> Signed-off-by: Xan Phung <xan.phung@...il.com>
>
> [...]
>
> -int iswalpha(wint_t wc)
> -{
> - if (wc<0x20000U)
> - return (table[table[wc>>8]*32+((wc&255)>>3)]>>(wc&7))&1;
> - if (wc<0x2fffeU)
> - return 1;
> - return 0;
> +int iswalpha(wint_t wc) {
> + unsigned direct, page, bmap, shfr, lane, base, rev;
> + unsigned target, huff, type, popc, fast;
> + signed char ext;
> + if ((unsigned)wc >= 0x20000)
> + return (unsigned)wc < 0x2fffe;
> +
> + /* Direct path used in 220 of the 256 BMP code pages */
> + direct = wc < (PAGEH-PAGES)*8;
> + page = (wc >> PAGE_SH);
> + bmap = (wc >> 3) + PAGES;
> + base = table.b[direct ? bmap:page];
> + if (base <= direct*256 + 1)
> + return base >> (wc & -direct & 7) & 1;
> +
> + /* 2nd & 3rd level arrays: final level idx=popc^rev_direction */
> + target = wc & (PAGE_MAX-1);
> + shfr = (target & 15);
> + lane = (target >> 4);
> + huff = table.w[base += PAGEH/4 - 2]; /* base 0/1 reserved */
> + type = (huff >> (2 * lane)) & 3;
> + popc = (huff << (31 - 2 * lane));
> + popc = (popc & 0x11111111) + ((popc & 0x44444444) >> 2);
> + popc = (popc * 0x11111111) >> 28;
> + base+= (rev = -(page & 1)) + 1;
> + ext = (table.b + base*4)[(int)(popc^rev)];
> +
> + /* Dictionary lookup is only 1% of codepoints */
> + fast = (type != 2 | ext & 1);
> + if (fast) {
> + shfr = (shfr + 5) & -(type >= 2);
> + shfr = (shfr - (6 & -(type == 2)) + type);
> + return (ext << 8 | 0xfe) >> shfr & 1;
> + } else {
> + return dict[ext >> 1 & 0x7f] >> shfr & 1;
> + }
> }
Do you have any figures on how performance compares? My big concern
looking at this is that it appears to be A LOT more operations, and
branches, for any codepoint that doesn't fall in the "direct path",
which only covers codepoints below U+1000 and pages that are uniform
across the whole page (effectively only CJK pages). Indeed, the
non-covered 36 pages of the BMP make up basically all of the
alphabetic scripts in the BMP above U+1000. So the question is going
to be: how much slower do they get, and is this worth it?
Mongolian, Tagalog, Burmese, or some other alphabetic language above
U+1000 would be the right choice of input to measure the cost.
Rich
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.