john-dev - vowels/consonants

Message-ID: <4E4A28DA.3030301@bredband.net>
Date: Tue, 16 Aug 2011 10:22:50 +0200
From: magnum <rawsmooth@...dband.net>
To: john-dev@...ts.openwall.com
Subject: vowels/consonants

In patch 0015, I added encoding data for vowel/consonant classes. I
came up with a best-effort shortcut that does the job fairly well with
little effort:

cmpt_cp.pl decomposes the Unicode version of each character and matches
the result against (currently) this regexp:

if ($nfd =~ m/[aoueiœæøɪʏɛɔαεηιοωυаэыуояеюиєіı]/i) {

Because the match runs on the decomposed form, we should catch most
accented variants of each vowel. I think I've got most of the currently
supported encodings covered, but this can be improved over time
(hopefully by list users who know specific languages).

The current regexp list was put together from Google/Wikipedia and
covers most of Latin, Nordic, Greek, Russian, Ukrainian and Turkish.

Anyone interested can try this with whatever encoding they want to examine:

$ ../run/cmpt_cp.pl -v iso-8859-1

This is how encoding_data.h is built. The -v flag adds verbose
comments. Near the end of the output we currently see this for
iso-8859-1:

// ÀÁÂÃÄÅÆÈÉÊËÌÍÎÏÒÓÔÕÖØÙÚÛÜàáâãäåæèéêëìíîïòóôõöøùúûü
#define CHARS_VOWELS_ISO_8859_1 \
	"\xC0\xC1\xC2\xC3\xC4\xC5\xC6\xC8\xC9\xCA\xCB\xCC\xCD\xCE\xCF\xD2\xD3\xD4\xD5\xD6\xD8\xD9\xDA\xDB\xDC\xE0\xE1\xE2\xE3\xE4\xE5\xE6\xE8\xE9\xEA\xEB\xEC\xED\xEE\xEF\xF2\xF3\xF4\xF5\xF6\xF8\xF9\xFA\xFB\xFC"

// ªµºÇÐÑÝÞßçðñýþÿ
#define CHARS_CONSONANTS_ISO_8859_1 \
	"\xAA\xB5\xBA\xC7\xD0\xD1\xDD\xDE\xDF\xE7\xF0\xF1\xFD\xFE\xFF"

As you can see, the regexp caught many variants. If you know of a
letter that should be added to the regexp, please chime in.
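
For readers who want to experiment without the Perl script, the
decompose-then-match idea can be sketched in a few lines of Python. This
is not cmpt_cp.pl itself, only a rough illustration under stated
assumptions: the function names classify and c_escape are made up for
this example, the decomposition is assumed to be NFD (going by the $nfd
variable name), only bytes 0x80-0xFF of a single-byte encoding are
examined, and a "consonant" is assumed to be any letter that the vowel
regexp does not match.

```python
import re
import unicodedata

# Vowel class copied from the regexp quoted earlier in this message.
VOWEL_RE = re.compile(r"[aoueiœæøɪʏɛɔαεηιοωυаэыуояеюиєіı]", re.IGNORECASE)


def classify(encoding):
    """Return (vowels, consonants) as bytes for the high half (0x80-0xFF)
    of a single-byte encoding.

    Assumption: a consonant is any letter not matched as a vowel.
    """
    vowels = bytearray()
    consonants = bytearray()
    for code in range(0x80, 0x100):
        try:
            ch = bytes([code]).decode(encoding)
        except UnicodeDecodeError:
            continue  # byte cannot be decoded on its own in this encoding
        if not ch.isalpha():
            continue  # skip punctuation, symbols, digits, controls
        # Decompose so that e.g. "é" becomes "e" plus a combining accent,
        # then look for any base vowel in the decomposed string.
        nfd = unicodedata.normalize("NFD", ch)
        if VOWEL_RE.search(nfd):
            vowels.append(code)
        else:
            consonants.append(code)
    return bytes(vowels), bytes(consonants)


def c_escape(data):
    """Format bytes as the body of a C string literal using \\xNN escapes."""
    return "".join("\\x%02X" % b for b in data)


if __name__ == "__main__":
    v, c = classify("iso-8859-1")
    print('#define CHARS_VOWELS_ISO_8859_1 \\\n\t"%s"\n' % c_escape(v))
    print('#define CHARS_CONSONANTS_ISO_8859_1 \\\n\t"%s"' % c_escape(c))
```

Under these assumptions the sketch yields the same two ISO-8859-1 byte
lists as shown above. Other encodings may differ from what cmpt_cp.pl
produces, since the real script may apply rules not described here.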

Note, though, that this is not an exact science. The definition of a
vowel varies with language and even with context. In English, Y and W
are sometimes vowels and sometimes consonants.

magnum
