Message-ID: <4E4A28DA.3030301@bredband.net>
Date: Tue, 16 Aug 2011 10:22:50 +0200
From: magnum <rawsmooth@...dband.net>
To: john-dev@...ts.openwall.com
Subject: vowels/consonants

In patch 0015, I added encoding data for vowel/consonant classes. I came
up with a best-effort shortcut that does the job fairly well with little
effort: cmpt_cp.pl decomposes the Unicode version of each character and
matches the result against (currently) this regexp:

    if ($nfd =~ m/[aoueiœæøɪʏɛɔαεηιοωυаэыуояеюиєіı]/i) {

Because the match runs on the decomposed form, an accented letter matches
through its base letter, so we should catch most variants of each vowel.
I think I've got most of the currently supported encodings covered, but
this can be improved over time (hopefully by list users who know specific
languages). The current regexp list is made from Google/Wikipedia and
covers most of Latin, Nordic, Greek, Russian, Ukrainian and Turkish.

Anyone interested can try this with an encoding they want to examine:

    $ ../run/cmpt_cp.pl -v iso-8859-1

This is how encoding_data.h is built. The -v flag adds verbose comments.
Near the end of the output we currently see this for iso-8859-1:

    // ÀÁÂÃÄÅÆÈÉÊËÌÍÎÏÒÓÔÕÖØÙÚÛÜàáâãäåæèéêëìíîïòóôõöøùúûü
    #define CHARS_VOWELS_ISO_8859_1 \
        "\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD2\xD3\xD4\xD5\xD6\xD8\xD9\xDA\xDB\xDC\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF2\xF3\xF4\xF5\xF6\xF8\xF9\xFA\xFB\xFC"

    // ªµºÇÐÑÝÞßçðñýþÿ
    #define CHARS_CONSONANTS_ISO_8859_1 "\xAA\xB5\xBA\xC7\xD0\xD1\xDD\xDE\xDF\xE7\xF0\xF1\xFD\xFE\xFF"

As you can see, the regexp caught many variants. If you know of a letter
that should be added to the regexp, please chime in.

Note, though, that this is not an exact science. The definition of a
vowel varies with language and even with context. In English, Y and W are
sometimes vowels and sometimes consonants.

magnum
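
The decompose-and-match shortcut described in the message can be sketched roughly as below. This is a Python illustration of the idea only, not the actual cmpt_cp.pl: the function names classify and as_c_string are hypothetical, and the choice to look only at bytes 0x80-0xFF and to treat every non-vowel letter as a consonant is an assumption inferred from the quoted output.

```python
import re
import unicodedata

# Vowel class copied from the regexp quoted in the message.
VOWEL_RE = re.compile(r"[aoueiœæøɪʏɛɔαεηιοωυаэыуояеюиєіı]", re.IGNORECASE)


def classify(encoding):
    """Split the high-half letters of a single-byte encoding into
    (vowels, consonants), each a list of byte values.

    Assumption: only bytes 0x80-0xFF are examined, and any letter that
    does not match the vowel class counts as a consonant.
    """
    vowels, consonants = [], []
    for byte in range(0x80, 0x100):
        try:
            char = bytes([byte]).decode(encoding)
        except UnicodeDecodeError:
            continue  # byte is unassigned in this code page
        if not char.isalpha():
            continue  # skip punctuation, symbols and control codes
        # NFD splits e.g. "É" into "E" plus a combining accent, so the
        # base letter is what the vowel class gets to match.
        nfd = unicodedata.normalize("NFD", char)
        if VOWEL_RE.search(nfd):
            vowels.append(byte)
        else:
            consonants.append(byte)
    return vowels, consonants


def as_c_string(byte_values):
    """Render byte values as a C string literal of \\xNN escapes."""
    return '"' + "".join(f"\\x{b:02X}" for b in byte_values) + '"'


if __name__ == "__main__":
    vowels, consonants = classify("iso-8859-1")
    print("#define CHARS_VOWELS_ISO_8859_1 \\\n\t" + as_c_string(vowels))
    print("#define CHARS_CONSONANTS_ISO_8859_1 " + as_c_string(consonants))
```

Matching on the NFD form is what keeps the vowel list short: one base letter in the class covers every accented variant that decomposes to it, while letters without a canonical decomposition (æ, ø, ı) have to be listed explicitly.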