Date: Fri, 4 Jan 2013 18:37:18 +0100
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: What is a digit?

[these mails will look terrible in Openwall's list archive due to not supporting UTF-8 - use Gmane instead]

On 4 Jan, 2013, at 18:19 , Frank Dittrich <frank_dittrich@...mail.com> wrote:
> I would have assumed that digits are the 10 characters from '0' to '9'.
> 
> This is also what doc/RULES suggests:
> RULES:?d	matches digits [0-9]
> 
> While doc/RULES mentions that, depending on --encoding=NAME, non-ASCII
> characters will be added to the appropriate character classes, I would
> never have guessed that the following statement means '²' and '¼' could
> be digits:
...
> But if I look at encoding_data.h, it seems to depend on the encoding used:
> 
> $ grep -B 1 CHARS_DIGITS encoding_data.h
> // ²³¹¼½¾
> #define CHARS_DIGITS_ISO_8859_1 "\xB2\xB3\xB9\xBC\xBD\xBE"

Yes. This is not some random decision of mine - all these definitions are taken as-is from the Unicode database, using scripts. I agree the '¾' is hardly a digit, but I consider it far too unimportant to care about. That database was a true goldmine for the codepage support stuff. Besides, the "false" hits are probably rare enough that they might actually do some good.
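
For illustration only, here is a minimal Python sketch of how such a list could be derived from the Unicode database. It is not the script that generated encoding_data.h. The helper name high_digits is hypothetical, and the sketch assumes that "digit" means any character whose Unicode general category starts with N (Nd, Nl or No).

```python
import unicodedata


def high_digits(codec="iso-8859-1"):
    """Return the non-ASCII bytes of a single-byte codec whose characters
    have a Unicode general category starting with 'N' (Nd, Nl or No).

    Assumption: this is what "digit" means for the purpose of the sketch.
    """
    found = bytearray()
    for byte in range(0x80, 0x100):
        try:
            ch = bytes([byte]).decode(codec)
        except UnicodeDecodeError:
            continue  # byte is unassigned in this codec
        if unicodedata.category(ch).startswith("N"):
            found.append(byte)
    return bytes(found)


if __name__ == "__main__":
    # Print the matching bytes in the same \xNN style as the #define above.
    print("".join("\\x%02X" % b for b in high_digits()))
```

For ISO-8859-1 this should select the superscripts and the vulgar fractions, which Unicode classifies as "No" (other number). That is why '¾' ends up in the class even though few people would call it a digit.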

> I must admit that is not what I would have expected.
> I would have expected character class d to match [0-9], and not some
> other special characters:

Then just don't use any --encoding, and everything is normal. Do you have a use case where this is an actual problem?

> Even if ², ³, ¼ are digits, why aren't these characters digits if utf-8
> is used?

Solely because the rules engine has almost no support for UTF-8. We'd need to make a separate alternative UTF-16 rules engine. It would probably be easier than one might think at first, but at this time it's low prio to me. And I doubt anyone else cares.
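
To make the problem concrete, here is a simplified Python model (not the actual rules engine code). It assumes that a character class is tested one byte at a time. The names DIGITS_ASCII, DIGITS_ISO_8859_1 and count_digit_bytes are made up for this sketch.

```python
# Simplified model: a character class is a set of single bytes, and a word
# is checked one byte at a time. All names here are hypothetical.
DIGITS_ASCII = b"0123456789"
DIGITS_ISO_8859_1 = DIGITS_ASCII + b"\xB2\xB3\xB9\xBC\xBD\xBE"


def count_digit_bytes(word: bytes, digit_class: bytes) -> int:
    """Count the bytes of word that are members of digit_class."""
    return sum(1 for b in word if b in digit_class)


if __name__ == "__main__":
    # In ISO-8859-1, the superscript two is the single byte 0xB2: it matches.
    print(count_digit_bytes("x²".encode("iso-8859-1"), DIGITS_ISO_8859_1))
    # In UTF-8 it is the two bytes 0xC2 0xB2: a plain ASCII class misses it.
    print(count_digit_bytes("x²".encode("utf-8"), DIGITS_ASCII))
    # Reusing the ISO-8859-1 table on UTF-8 bytes gives false hits instead:
    # 'ò' is 0xC3 0xB2, so its second byte looks like a "digit".
    print(count_digit_bytes("ò".encode("utf-8"), DIGITS_ISO_8859_1))
```

In this model a byte-wise class test cannot be made correct for UTF-8 just by extending the tables, which is why a separate engine working on whole code points (for example UTF-16) would be needed.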

> BTW, for other character classes like
> ?v      matches vowels: "aeiouAEIOU"
> ?c      matches consonants: "bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ"
> ?l      matches lowercase letters [a-z]
> ?u      matches uppercase letters [A-Z]
> ?a      matches letters [a-zA-Z]
> ?x      matches letters and digits [a-zA-Z0-9]
> I am very happy that the non-ASCII upper case and lower case letters are
> considered.

There are more things that are not 100% optimal. Most notably, the vowel/consonant distinction is very fuzzy (you often can't tell from a single character whether it's a vowel or a consonant anyway) and is not even present in the Unicode database. But the support is way way better than nothing.

magnum
