Message-ID: <241e5237abb4048f28c63f03749d88ff@smtp.hushmail.com>
Date: Fri, 4 Jan 2013 18:37:18 +0100
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: What is a digit?

[these mails will look terrible in Openwall's list archive due to not supporting UTF-8 - use Gmane instead]

On 4 Jan, 2013, at 18:19, Frank Dittrich <frank_dittrich@...mail.com> wrote:
> I would have assumed that digits are the 10 characters from '0' to '9'.
>
> This is also what doc/RULES suggests:
> RULES:?d matches digits [0-9]
>
> While doc/RULES mentions that, depending on --encoding=NAME, non-ASCII
> characters will be added to the appropriate character classes, I would
> never have guessed that the following statement means '²' and '¼' could
> be digits:
...
> But if I look at encoding_data.h, it seems to depend on the encoding used:
>
> $ grep -B 1 CHARS_DIGITS encoding_data.h
> // ²³¹¼½¾
> #define CHARS_DIGITS_ISO_8859_1 "\xB2\xB3\xB9\xBC\xBD\xBE"

Yes. This is not some random decision of mine - all these definitions are taken as-is from the Unicode database, using scripts. I agree the '¾' is hardly a digit but I consider it far too unimportant to care about. That database was a true goldmine for the codepage support stuff. Besides, the "false" hits are probably rare enough they might actually do good.

> I must admit that is not what I would have expected.
> I would have expected character class d to match [0-9], and not some
> other special characters:

Then just don't use any --encoding, and everything is normal. Do you have a use case where this is an actual problem?

> Even if ², ³, ¼ are digits, why aren't these characters digits if utf-8
> is used?

Solely because the rules engine has almost no support for UTF-8. We'd need to make a separate alternative UTF-16 rules engine. It would probably be easier than one might think at first, but at this time it's low prio to me. And I doubt anyone else cares.

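To make the two points above concrete, here is a minimal sketch in Python (not John the Ripper's C code). It decodes the six bytes from the quoted CHARS_DIGITS_ISO_8859_1 definition, looks up each character's general category in the Unicode database, and shows that each one takes two bytes in UTF-8. The helper names describe_extra_digits and byte_class_matches are hypothetical, and the byte-wise class test only assumes that a byte-oriented engine compares one byte at a time; it is not taken from the actual rules engine.

```python
import unicodedata

# Byte string copied from the CHARS_DIGITS_ISO_8859_1 definition quoted above.
CHARS_DIGITS_ISO_8859_1 = b"\xB2\xB3\xB9\xBC\xBD\xBE"


def describe_extra_digits(raw: bytes = CHARS_DIGITS_ISO_8859_1):
    """Return (char, Unicode general category, UTF-8 byte length) per byte."""
    rows = []
    for ch in raw.decode("iso-8859-1"):
        rows.append((ch, unicodedata.category(ch), len(ch.encode("utf-8"))))
    return rows


def byte_class_matches(data: bytes, class_bytes: bytes) -> bool:
    """Hypothetical byte-wise class test (an assumption, not JtR code):
    every byte must be an ASCII digit or one of class_bytes."""
    allowed = set(b"0123456789") | set(class_bytes)
    return all(b in allowed for b in data)


if __name__ == "__main__":
    for ch, category, utf8_len in describe_extra_digits():
        print(ch, category, utf8_len)
    # The same text, encoded two ways, run through the same byte-wise test.
    print(byte_class_matches("1²".encode("iso-8859-1"), CHARS_DIGITS_ISO_8859_1))
    print(byte_class_matches("1²".encode("utf-8"), CHARS_DIGITS_ISO_8859_1))
```

All six characters are in general category No (Number, other), which may be why a script-driven import put them next to the digits; that reading is a guess, not something stated in the thread. The last two calls show the UTF-8 problem: a single-byte class table matches '²' as 0xB2 in ISO-8859-1 but not as the two-byte sequence 0xC2 0xB2.
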
> BTW, for other character classes like
> ?v matches vowels: "aeiouAEIOU"
> ?c matches consonants: "bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ"
> ?l matches lowercase letters [a-z]
> ?u matches uppercase letters [A-Z]
> ?a matches letters [a-zA-Z]
> ?x matches letters and digits [a-zA-Z0-9]
> I am very happy that the non-ASCII upper case and lower case letters are
> considered.

There are more things that are not 100% optimal. Most notably, vowels and consonants are very fuzzy logic (and you often can't tell from a single character if it's a vowel or a consonant anyway) that was not even present in the Unicode database. But the support is way way better than nothing.

magnum