Message-ID: <BLU0-SMTP215171FD992CC83D292AE0FFD200@phx.gbl>
Date: Fri, 4 Jan 2013 18:19:04 +0100
From: Frank Dittrich <frank_dittrich@...mail.com>
To: john-dev@...ts.openwall.com
Subject: What is a digit?

Hi all,

Maybe this question should be posted on john-users instead of john-dev;
I am not sure.

I would have assumed that digits are the 10 characters from '0' to '9'.

This is also what doc/RULES suggests:
RULES:?d	matches digits [0-9]

While doc/RULES mentions that, depending on --encoding=NAME, non-ASCII
characters will be added to the appropriate character classes, I would
never have guessed that the following statement means '²' and '¼' could
be digits:

"NOTE, if running in --encoding=iso-8859-1 (or koi8-r/cp125/cp866,etc),
then the high bit characters are added to the respective classes. So in
iso-8859-1 mode, lower case ?l would include
àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþßÿ while in 'normal' runs, it is only a-z."

But if I look at encoding_data.h, it seems to depend on the encoding used:

$ grep -B 1 CHARS_DIGITS encoding_data.h
// ²³¹¼½¾
#define CHARS_DIGITS_ISO_8859_1 "\xB2\xB3\xB9\xBC\xBD\xBE"
--
//
#define CHARS_DIGITS_ISO_8859_2
--
// ²³½
#define CHARS_DIGITS_ISO_8859_7 "\xB2\xB3\xBD"
...

I must admit that is not what I would have expected.
I would have expected character class d to match [0-9], and not some
other special characters:
$ grep "rules_init_class('d'," rules.c |cut -f 3-
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_ISO_8859_1);
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_ISO_8859_2);
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_ISO_8859_7);
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_ISO_8859_15);
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_KOI8_R);
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_CP437);
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_CP737);
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_CP850);
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_CP852);
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_CP858);
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_CP866);
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_CP1250);
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_CP1251);
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_CP1252);
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_CP1253);
rules_init_class('d', CHARS_DIGITS);
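
To make the effect of those calls concrete, here is a minimal sketch of how
adjacent string literals such as CHARS_DIGITS CHARS_DIGITS_ISO_8859_1 merge
into one member list. The demo_* helpers are hypothetical stand-ins, not the
real rules_init_class(), and the value of CHARS_DIGITS is an assumption; only
the ISO-8859-1 bytes are copied from the grep output above.

```c
#include <string.h>

/* Assumption: the base digit class is the ten ASCII digits. */
#define CHARS_DIGITS "0123456789"
/* Copied from the encoding_data.h output quoted above: ² ³ ¹ ¼ ½ ¾ */
#define CHARS_DIGITS_ISO_8859_1 "\xB2\xB3\xB9\xBC\xBD\xBE"

/* Hypothetical stand-in for rules_init_class(): clears a 256-entry
 * lookup table, then marks every byte that occurs in `members`. */
static void demo_init_class(unsigned char table[256], const char *members)
{
    const unsigned char *p;

    memset(table, 0, 256);
    for (p = (const unsigned char *)members; *p != '\0'; p++)
        table[*p] = 1;
}

/* Returns nonzero if byte `c` was marked as a class member. */
static int demo_in_class(const unsigned char table[256], unsigned char c)
{
    return table[c] != 0;
}

/* The compiler concatenates the two adjacent literals into one string,
 * so the resulting class holds 0-9 plus the six high-bit bytes. */
static void demo_build_digits_latin1(unsigned char table[256])
{
    demo_init_class(table, CHARS_DIGITS CHARS_DIGITS_ISO_8859_1);
}
```

With a table built this way, the bytes 0xB2 ('²') and 0xBC ('¼') test as
members of the class alongside '0' to '9', while a letter such as 'a' does not.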

Even if ², ³, ¼ are digits, why aren't these characters digits if utf-8
is used?
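
One detail that may be relevant here, offered as an observation and not as a
statement about john's internals: in UTF-8 these characters are no longer
single bytes, so a list of single-byte class members cannot hold them as-is.
The sketch below uses a hypothetical helper, demo_utf8_encode(), to show the
standard UTF-8 encoding of code points below U+0800.

```c
#include <stddef.h>

/* Hypothetical helper: encodes a code point below U+0800 as UTF-8.
 * Writes 1 or 2 bytes to `out` and returns the byte count,
 * or 0 if the code point is outside the range handled here. */
static size_t demo_utf8_encode(unsigned int cp, unsigned char out[2])
{
    if (cp < 0x80) {
        /* ASCII maps to a single identical byte. */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        /* Two-byte form: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    return 0;
}
```

For U+00B2 ('²') this yields the two bytes 0xC2 0xB2, whereas in ISO-8859-1
the same character is the single byte 0xB2. ASCII digits stay one byte each
in both encodings.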

BTW, for other character classes like
?v      matches vowels: "aeiouAEIOU"
?c      matches consonants: "bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ"
?l      matches lowercase letters [a-z]
?u      matches uppercase letters [A-Z]
?a      matches letters [a-zA-Z]
?x      matches letters and digits [a-zA-Z0-9]
I am very happy that the non-ASCII upper case and lower case letters are
considered.


Confused,

Frank
