john-dev - Re: What is a digit?

Message-ID: <241e5237abb4048f28c63f03749d88ff@smtp.hushmail.com>
Date: Fri, 4 Jan 2013 18:37:18 +0100
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: What is a digit?

[these mails will look terrible in Openwall's list archive because it does not support UTF-8 - use Gmane instead]

On 4 Jan, 2013, at 18:19 , Frank Dittrich <frank_dittrich@...mail.com> wrote:
> I would have assumed that digits are the 10 characters from '0' to '9'.
> 
> This is also what doc/RULES suggests:
> RULES:?d	matches digits [0-9]
> 
> While doc/RULES mentions that, depending on --encoding=NAME, non-ASCII
> characters will be added to the appropriate character classes, I would
> never have guessed that the following statement means '²' and '¼' could
> be digits:
...
> But if I look at encoding_data.h, it seems to depend on the encoding used:
> 
> $ grep -B 1 CHARS_DIGITS encoding_data.h
> // ²³¹¼½¾
> #define CHARS_DIGITS_ISO_8859_1 "\xB2\xB3\xB9\xBC\xBD\xBE"

Yes. This is not some random decision of mine - all these definitions are taken as-is from the Unicode database, using scripts. I agree the '¾' is hardly a digit, but I consider it far too unimportant to care about. That database was a true goldmine for the codepage support stuff. Besides, the "false" hits are probably rare enough that they might actually do some good.
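
For anyone who wants to see what the Unicode database says about those six characters, here is a minimal sketch using Python's standard unicodedata module. It is not the script that generated encoding_data.h; the helper name describe() is made up for illustration. It prints the general category, the digit value (if any) and the numeric value for each byte of the CHARS_DIGITS_ISO_8859_1 string quoted above.

```python
import unicodedata

# Bytes copied from the CHARS_DIGITS_ISO_8859_1 define quoted above,
# decoded as ISO-8859-1 so each byte becomes one code point.
EXTRA_DIGITS = b"\xB2\xB3\xB9\xBC\xBD\xBE".decode("iso-8859-1")

def describe(ch):
    """Hypothetical helper: (general category, digit value, numeric value).

    The digit and numeric values are None when the Unicode database
    does not define them for this character.
    """
    return (
        unicodedata.category(ch),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )

for ch in EXTRA_DIGITS:
    print(repr(ch), describe(ch))
```

All six characters share a numeric general category, but only the superscripts carry a digit value; the fractions have a numeric value only. Which property the real scripts keyed on is not stated here, so treat this as a way to explore the data, nothing more.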

> I must admit that is not what I would have expected.
> I would have expected character class d to match [0-9], and not some
> other special characters:

Then just don't use any --encoding, and everything is normal. Do you have a use case where this is an actual problem?

> Even if ², ³, ¼ are digits, why aren't these characters digits if utf-8
> is used?

Solely because the rules engine has almost no support for UTF-8. We'd need to make a separate, alternative UTF-16 rules engine. It would probably be easier than one might think at first, but at this time it's low priority for me. And I doubt anyone else cares.
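
To illustrate why a single-byte class table does not carry over to UTF-8, here is a small sketch. It assumes a class lookup that tests one byte at a time, which is a simplification for illustration and not a description of the actual rules engine; count_digit_bytes() and the table names are hypothetical.

```python
# Hypothetical byte-wise character class, in the spirit of ?d.
ASCII_DIGITS = b"0123456789"
# Extra "digits" for ISO-8859-1, as quoted from encoding_data.h above.
LATIN1_EXTRA = b"\xB2\xB3\xB9\xBC\xBD\xBE"

def count_digit_bytes(word: bytes, extra: bytes = b"") -> int:
    """Count bytes of `word` that belong to the digit class."""
    digits = set(ASCII_DIGITS + extra)
    return sum(1 for b in word if b in digits)

word = "x\u00b2"                    # 'x' followed by superscript two
latin1 = word.encode("iso-8859-1")  # two bytes: 'x', 0xB2
utf8 = word.encode("utf-8")         # three bytes: 'x', 0xC2, 0xB2

print(count_digit_bytes(latin1, LATIN1_EXTRA))  # one class member found
print(count_digit_bytes(utf8))                  # plain [0-9] table finds none
```

In UTF-8 the same character is a two-byte sequence, so a per-byte table either misses it or, if the ISO-8859-1 extras were reused, matches a lone continuation byte by accident. Handling it properly means working on whole code points, hence the idea of a separate engine.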

> BTW, for other character classes like
> ?v      matches vowels: "aeiouAEIOU"
> ?c      matches consonants: "bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ"
> ?l      matches lowercase letters [a-z]
> ?u      matches uppercase letters [A-Z]
> ?a      matches letters [a-zA-Z]
> ?x      matches letters and digits [a-zA-Z0-9]
> I am very happy that the non-ASCII upper case and lower case letters are
> considered.

There are more things that are not 100% optimal. Most notably, the vowel and consonant classes are very fuzzy (you often can't tell from a single character whether it's a vowel or a consonant anyway), and that classification was not even present in the Unicode database. But the support is way, way better than nothing.

magnum
