Message-ID: <6db7f991cc8d345dadc472f66c58bfd8e0445b2e.camel@postmarketos.org>
Date: Wed, 15 Apr 2026 17:31:40 +0200
From: Pablo Correa Gomez <pabloyoyoista@...tmarketos.org>
To: Rich Felker <dalias@...c.org>, musl@...ts.openwall.com
Subject: Re: Collation weight length frequencies

On Wed, 08-04-2026 at 13:55 -0400, Rich Felker wrote:
>
> With any of these encodings, the size of the default collation table,
> with implicit codepoint-order rules for ideographs, is looking to be
> about 200k. I'd expect a little over double that with modern
> stroke-radical order for ideographs.

That is really cool! Does anybody have an idea of the approximate size
of this table in other implementations? Or are they solving this
problem in a completely different way?

Best,
Pablo

>
>
> One thing all of this would be throwing away is that there is very low
> information content in the secondary and tertiary levels -- not just
> the lengths but the actual weights. It might make more sense to
> consider an encoding like the first option above that I didn't
> elaborate on as much, but where the header byte encodes not just the
> lengths but an index into a table of common values for
> secondary/tertiary level. This would shave 50k off the default table
> (more with stroke-radical order). Another way to accomplish this might
> be using the (otherwise invalid) length-0 for secondary/tertiary to
> compress common values. However, it might just be the case that the
> relative size here is sufficiently low that we don't want to bother
> with complexity.
>
>
> Rich
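
To make the quoted header-byte idea concrete, here is a minimal C sketch of how such an entry might be decoded. Everything specific in it is an assumption for illustration and not taken from the thread or from musl: the nibble layout of the header byte, the `ST_EXPLICIT` escape value, the `common_st` table and its placeholder contents, and the names `struct element` and `decode_element`. It also simplifies each level to a single weight, whereas the thread discusses variable weight lengths.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table of common (secondary, tertiary) weight pairs.
 * The values are placeholders, not real collation data. */
struct st_pair { uint16_t secondary, tertiary; };

static const struct st_pair common_st[4] = {
    { 0x20, 0x02 }, { 0x20, 0x08 }, { 0x21, 0x02 }, { 0x22, 0x02 },
};

/* Assumed escape: this index means the weights follow inline. */
#define ST_EXPLICIT 0x0f

struct element { uint32_t primary; uint16_t secondary, tertiary; };

/* Decode one entry from p[0..n). Assumed layout:
 *   header byte: high nibble = primary length in bytes (0..3),
 *                low nibble  = index into common_st, or ST_EXPLICIT
 *   then the primary weight, big-endian
 *   then, only if ST_EXPLICIT, one secondary and one tertiary byte
 * Returns the number of bytes consumed, or 0 on malformed input. */
static size_t decode_element(const unsigned char *p, size_t n,
                             struct element *out)
{
    if (n < 1) return 0;

    unsigned plen = p[0] >> 4;
    unsigned idx = p[0] & 0x0f;
    size_t need = 1 + plen + (idx == ST_EXPLICIT ? 2 : 0);

    if (plen > 3 || n < need) return 0;
    if (idx != ST_EXPLICIT &&
        idx >= sizeof common_st / sizeof common_st[0])
        return 0;

    size_t i = 1;
    uint32_t prim = 0;
    for (unsigned k = 0; k < plen; k++)
        prim = prim << 8 | p[i++];
    out->primary = prim;

    if (idx == ST_EXPLICIT) {
        /* Uncommon pair: stored inline after the primary weight. */
        out->secondary = p[i++];
        out->tertiary = p[i++];
    } else {
        /* Common pair: costs no bytes beyond the header. */
        out->secondary = common_st[idx].secondary;
        out->tertiary = common_st[idx].tertiary;
    }
    return i;
}
```

In this layout, an entry whose secondary/tertiary pair is in the table costs only the header byte plus the primary weight, which is the kind of saving the quoted paragraph describes. Whether that is worth the extra table and branch is the trade-off raised at the end of the quoted text.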