Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20260321174658.GX1827@brightrain.aerifal.cx>
Date: Sat, 21 Mar 2026 13:46:58 -0400
From: Rich Felker <dalias@...c.org>
To: Xan Phung <xan.phung@...il.com>
Cc: musl@...ts.openwall.com
Subject: Re: [PATCH v2] wctype: reduce size of iswalpha & iswpunct by
 47%

On Sat, Jan 24, 2026 at 12:20:45AM +1100, Xan Phung wrote:
> Currently iswalpha and iswpunct have total text data size of over 8kb.
> A more efficient encoding has reduced the total size to 4.2kb, a 47%
> reduction.
> 
> This encoding also features a speed improvement, with lookups for
> approx 220 of the 256 BMP code blocks done with just a single memory
> load.
> 
> The remaining Unicode blocks are niche uses, and are accessed with
> three memory loads.  Compared to the old musl encoding (which used a
> two level table) the topmost level remains much the same, providing
> offsets into code block units (256 codepoint granularity). The second
> level data uses fixed sizes, of one 32 bit word per codepage (where
> each 2 bit pair in word identifies a block of 16 codepoints as all 0,
> all 1, or mixed). The third level is a variable length series of
> extension bytes, indexed by the popcount of set high bits within the
> second level's 32 bit word. Popcount is used as not all 16 codepoint
> blocks need an extension byte. This popcount is calculated with
> nearly same latency as a 32 bit multiply (so it is comparable with
> the indexing speed of accessing a 2D array of non power of 2 size),
> but has the advantage of greater compactness for sparse data.
> 
> Results have been tested against first 0x20000 codepoints, and match
> that returned by the pre-existing musl implementation.
> 
> A new char table tool to replace the existing gen_ctype is provided
> separately. The new tool has similar command line use:
> 
>   gen_ctype [a|p]  -- generates iswalpha & iswpunct table.h files
>   gen_ctype [A|P]  -- generates iswalpha & iswpunct dict.h files
> 
> Signed-off-by: Xan Phung <xan.phung@...il.com>
> 
> [...]
> 
> -int iswalpha(wint_t wc)
> -{
> -	if (wc<0x20000U)
> -		return (table[table[wc>>8]*32+((wc&255)>>3)]>>(wc&7))&1;
> -	if (wc<0x2fffeU)
> -		return 1;
> -	return 0;
> +int iswalpha(wint_t wc) {
> +	unsigned direct, page, bmap, shfr, lane, base, rev;
> +	unsigned target, huff, type, popc, fast;
> +	signed char ext;
> +	if ((unsigned)wc >= 0x20000)
> +		return (unsigned)wc < 0x2fffe;
> +
> +	/* Direct path used in 220 of the 256 BMP code pages */
> +	direct = wc < (PAGEH-PAGES)*8;
> +	page = (wc >> PAGE_SH);
> +	bmap = (wc >> 3) + PAGES;
> +	base = table.b[direct ? bmap:page];
> +	if (base <= direct*256 + 1)
> +		return base >> (wc & -direct & 7) & 1;
> +
> +	/* 2nd & 3rd level arrays: final level idx=popc^rev_direction */
> +	target = wc & (PAGE_MAX-1);
> +	shfr = (target & 15);
> +	lane = (target >> 4);
> +	huff = table.w[base += PAGEH/4 - 2];     /* base 0/1 reserved */
> +	type = (huff >> (2 * lane)) & 3;
> +	popc = (huff << (31 - 2 * lane));
> +	popc = (popc & 0x11111111) + ((popc & 0x44444444) >> 2);
> +	popc = (popc * 0x11111111) >> 28;
> +	base+= (rev = -(page & 1)) + 1;
> +	ext  = (table.b + base*4)[(int)(popc^rev)];
> +
> +	/* Dictionary lookup is only 1% of codepoints */
> +	fast = (type != 2 | ext & 1);
> +	if (fast) {
> +		shfr = (shfr + 5) & -(type >= 2);
> +		shfr = (shfr - (6 & -(type == 2)) + type);
> +		return (ext << 8 | 0xfe) >> shfr & 1;
> +	} else {
> +		return dict[ext >> 1 & 0x7f] >> shfr & 1;
> +	}
>  }

Do you have any figures on how performance compares? My big concern
looking at this is that it appears to be A LOT more operations, and
branches, for any codepoint that doesn't fall in the "direct path",
which only covers codepoints below U+1000 and pages that are uniform
across the whole page (effectively only CJK pages). Indeed, the
non-covered 36 pages of the BMP make up basically all of the
alphabetic scripts in the BMP above U+1000. So the question is going
to be: how much slower do they get, and is this worth it?

Mongolian, Tagalog, Burmese, or some other alphabetic language above
U+1000 would be the right choice of input to measure the cost.

Rich

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.