Message-ID: <BLU0-SMTP215171FD992CC83D292AE0FFD200@phx.gbl>
Date: Fri, 4 Jan 2013 18:19:04 +0100
From: Frank Dittrich <frank_dittrich@...mail.com>
To: john-dev@...ts.openwall.com
Subject: What is a digit?

Hi all,

maybe this is a question which should be posted on john-users instead of john-dev, I am not sure.

I would have assumed that digits are the 10 characters from '0' to '9'. This is also what doc/RULES suggests:

RULES:?d matches digits [0-9]

While doc/RULES mentions that, depending on --encoding=NAME, non-ASCII characters will be added to the appropriate character classes, I would never have guessed that the following statement means '²' and '¼' could be digits:

"NOTE, if running in --encoding=iso-8859-1 (or koi8-r/cp125/cp866,etc), then the high bit characters are added to the respective classes. So in iso-8859-1 mode, lower case ?l would include àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþßÿ while in 'normal' runs, it is only a-z."

But if I look at encoding_data.h, it seems to depend on the encoding used:

$ grep -B 1 CHARS_DIGITS encoding_data.h
// ²³¹¼½¾
#define CHARS_DIGITS_ISO_8859_1 "\xB2\xB3\xB9\xBC\xBD\xBE"
--
//
#define CHARS_DIGITS_ISO_8859_2
--
// ²³½
#define CHARS_DIGITS_ISO_8859_7 "\xB2\xB3\xBD"
...

I must admit that is not what I would have expected.
I would have expected character class d to match [0-9], and not some other special characters:

$ grep "rules_init_class('d'," rules.c |cut -f 3-
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_ISO_8859_1);
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_ISO_8859_2);
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_ISO_8859_7);
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_ISO_8859_15);
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_KOI8_R);
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_CP437);
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_CP737);
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_CP850);
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_CP852);
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_CP858);
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_CP866);
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_CP1250);
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_CP1251);
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_CP1252);
rules_init_class('d', CHARS_DIGITS CHARS_DIGITS_CP1253);
rules_init_class('d', CHARS_DIGITS);

Even if ², ³, ¼ are digits, why aren't these characters digits if utf-8 is used?

BTW, for other character classes like

?v matches vowels: "aeiouAEIOU"
?c matches consonants: "bcdfghjklmnpqrstvwxyzBCDFGHJKLMNPQRSTVWXYZ"
?l matches lowercase letters [a-z]
?u matches uppercase letters [A-Z]
?a matches letters [a-zA-Z]
?x matches letters and digits [a-zA-Z0-9]

I am very happy that the non-ASCII upper case and lower case letters are considered.

Confused,

Frank
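
For illustration only, here is a minimal sketch of what the grep output above implies: adjacent C string literals such as CHARS_DIGITS CHARS_DIGITS_ISO_8859_1 are concatenated by the compiler into a single class string, so the bytes 0xB2 ('²') or 0xBC ('¼') end up in the same set as '0' to '9'. The definition of CHARS_DIGITS as "0123456789" is an assumption, and in_class() and check_digit_classes() are hypothetical helpers, not functions from rules.c.

```c
#include <assert.h>
#include <string.h>

/* Assumption: the base class holds just the ten ASCII digits. */
#define CHARS_DIGITS "0123456789"

/* Taken from the encoding_data.h excerpt quoted in the message:
 * the ISO-8859-1 bytes for superscript 2, 3, 1 and 1/4, 1/2, 3/4. */
#define CHARS_DIGITS_ISO_8859_1 "\xB2\xB3\xB9\xBC\xBD\xBE"

/* Hypothetical helper: return 1 if byte c occurs in the class string cls.
 * The NUL byte is rejected explicitly, because strchr() would otherwise
 * match the string terminator. */
int in_class(const char *cls, unsigned char c)
{
    return c != '\0' && strchr(cls, c) != NULL;
}

/* Hypothetical self-check comparing an ASCII-only class with the
 * concatenated class that the iso-8859-1 case would receive. */
void check_digit_classes(void)
{
    static const char ascii_only[] = CHARS_DIGITS;
    /* Adjacent literals are merged into one 16-byte string. */
    static const char iso_8859_1[] = CHARS_DIGITS CHARS_DIGITS_ISO_8859_1;

    assert(sizeof iso_8859_1 - 1 == 16);
    assert(in_class(ascii_only, '7'));
    assert(!in_class(ascii_only, 0xB2)); /* superscript 2 is not in [0-9] */
    assert(in_class(iso_8859_1, 0xB2));  /* but it is in the merged class */
}
```

Under these assumptions, a ?d test against the merged string accepts '²', while the ASCII-only string (the last rules_init_class call in the grep output) does not.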