Message-ID: <CAO6moYt21AKTPR9NjorPRDZrv8QsnSft-JsfG6Cx5neSD9WM1w@mail.gmail.com>
Date: Mon, 23 Mar 2026 22:54:53 +1100
From: Xan Phung <xan.phung@...il.com>
To: Rich Felker <dalias@...c.org>
Cc: Openwall musl <musl@...ts.openwall.com>
Subject: Re: [PATCH v2] wctype: reduce size of iswalpha & iswpunct by 47%
>
>
> Do you have any figures on how performance compares?
Hi Rich,
Thanks for your question about potential worst cases for the 36 code pages.
On my Ryzen 7 8845HS, for the non-direct-path code pages (which show my
performance at its worst), I am *7 CPU cycles slower* than the current musl
code. However, for <U+1000 and CJK codepoints, I am *4-5 CPU cycles faster*.
As I outline further below, the 'slow' pages appear in <0.1% of general web
text, making it logical to optimise for the 99.9% global use case.
iswalpha_base (cp sample range 0x1200-0x21FF): chksum = -13426
*12.16 cycles per lookup (current musl)*
iswalpha_small (cp sample range 0x1200-0x21FF): chksum = -13426
*19.76 cycles per lookup (my code)*
iswalpha_base (cp sample range 0-0xFFF): chksum = -13430
*12.16 cycles per lookup (current musl)*
iswalpha_small (cp sample range 0-0xFFF): chksum = -13430
*7.60 cycles per lookup (my code)*
(All of the above was compiled with -O2 optimisation using GCC 15.2.0 and
run on Linux 6.16.7.)
I expect similar results on ARM64 (I have done a cycle-by-cycle trace of
the ARM64 asm, which I can provide if you want this low-level detail).
There are 15-16 CPU cycles of computation, but nearly 50% (6-8 cycles) of
this is hidden by the memory latency, during which the CPU is otherwise
idle.
> My big concern
> looking at this is that it appears to be A LOT more operations, and
> branches, for any codepoint that doesn't fall in the "direct path",
> which only covers codepoints below U+1000 and pages that are uniform
> across the whole page (effectively only CJK pages). Indeed, the
> non-covered 36 pages of the BMP make up basically all of the
> alphabetic scripts in the BMP above U+1000. So the question is going
> to be: how much slower do they get, and is this worth it?
>
>
To answer your question "is it worth it", my short answer is:
*(1) Raw CPU Performance:* My code provides a *37% speed increase* for
99.9% of global use cases. For the remaining <0.1% (see General Text
Statistics below) of specialised scripts, e.g., Mongolian, Polytonic Greek,
the impact is a modest 7-cycle delta. Even for a Mongolian user, the high
mix of ASCII in daily use (often >50%) makes this change
performance-neutral or better.
*(2) Efficiency & Footprint:* Energy use/power envelopes and system-wide
cache pressure are often bigger bottlenecks than raw cycles. By halving the
memory access count for 90% of codepoints (covering >99.9% of web text),
and significantly reducing the data footprint, this approach improves
overall system performance, from low-end or embedded CPUs where L1 cache
is a scarce resource up to high-performance clients and servers (where
power throttles speed).
*A. General (web) text language statistics*
To put the "holistic" mix into perspective, the 36 code pages sit outside
the top 35-40 most-used languages. While some Greek or Cyrillic blocks
appear in these pages, they are for Polytonic Greek and Medieval Cyrillic,
neither of which is in everyday modern use. The modern scripts are already
handled in the one-memory-access "direct" path.
According to Wikipedia's 'Languages used on the Internet', these 36 pages
account for *less than 0.1% of all web pages*. I made the intentional
decision to optimise for the >99.9% use case. Despite prioritising size
over performance for these rare blocks, the "worst-case" impact is
approximately 7 CPU cycles.
*B. System-wide Impact: Cache Pressure and Energy*
While your metric is "number of ops," I’d propose that the number of memory
accesses is a more significant metric for modern systems.
The "big picture" is that the primary throttle for modern CPUs (from
embedded to server) is the available power envelope:
* An L1 cache access has *100x* the energy footprint of a CPU op.
* This means that even though memory accesses account for only 50% of CPU
cycles, *memory is >90% of energy use*.
* Cache latency: any L1 miss costs dozens to hundreds of cycles, and a
main memory access has *1000x* the energy use of a CPU op.
* Halving the data size directly mitigates this.
My "direct path" requires one memory access (vs. two in current musl). Even
if the direct path fails, 99% of codepoints are covered in a total of three
accesses. For >99% of users, this is half the memory traffic (and hence
half the energy use) of current musl.
P.S. iswalpha.o, iswpunct.o, wcwidth.o and towctrans.o are among the
bulkiest "data-heavy" units in musl. In low end environments with small
(16KB–32KB) L1 caches, these functions can easily evict other "hot"
application data. My approach significantly reduces this cache pressure;
for instance, I can now reduce towctrans.o size by 60%.
*C. Branches*:
The branches are easily predicted by even the most primitive branch
predictors.
This is because I carefully structured them to be highly biased (often 99:1
or better), so they come close to executing like branchless (100:0) code.
* *Final Conditional:* The 'if' statement at the bottom of the function
follows the "fast" path (bit shift & mask) 99% of the time. The dict[]
array lookup is rare enough that the branch is effectively zero-cost
almost all the time.
* *Range Guards:* The guard against wc >= 0x20000 is standard and
predictable 99.9% of the time, mirroring existing musl behaviour with
virtually no performance penalty.
* *The "Direct" Path:* This branch is taken for ~90% of all codepoints
(which covers 99.9% of all general web text). For users of specific scripts
(like Mongolian), the branch predictor will benefit from the high temporal
and spatial locality of text processing: the predictor quickly adapts to
the script currently being parsed, rendering the overhead very low.
> Mongolian, Tagalog, Burmese, or some other alphabetic language above
> U+1000 would be the right choice of input to measure the cost.
>
> Rich
>