Message-ID: <20110809120231.GA27064@openwall.com>
Date: Tue, 9 Aug 2011 16:02:31 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: Re: "valid character" class

On Tue, Aug 09, 2011 at 01:00:31PM +0200, magnum wrote:
> OK, I think we'll go for ?y for 'valid' then.

Sounds good.

> Question to *all*: There are some characters that are truly invalid for 
> a codepage, like 0x98 in cp1251. There are also characters that are not 
> really invalid per the Unicode spec, but control characters. For 
> example, in most (all?) ISO-8859-xx codepages, the characters 
> 0x80..0x9F. Should we treat the latter as invalid? There are pros and 
> cons. My personal vote is that we should treat them as invalid, i.e. the 
> rule !?Y would drop any candidate that contains 0x80..0x9F if we're 
> using --enc=iso-8859-1 but only 0x98 if using -enc=cp1251.

I concur.
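
To illustrate the rule being agreed on here, a minimal Python sketch
follows.  It is not John the Ripper code: the function name
invalid_bytes is hypothetical, and it relies on Python's own codec
tables and Unicode category "Cc", which is only an assumption about
what "invalid" and "control" would mean in the actual implementation.

```python
import unicodedata

def invalid_bytes(encoding, lo=0x80, hi=0xFF):
    """Return byte values in [lo, hi] that are undefined in the given
    codepage or that decode to a control character (hypothetical helper)."""
    bad = []
    for b in range(lo, hi + 1):
        try:
            ch = bytes([b]).decode(encoding)
        except UnicodeDecodeError:
            bad.append(b)  # truly invalid: no mapping in this codepage
            continue
        if unicodedata.category(ch) == "Cc":
            bad.append(b)  # mapped, but to a control character
    return bad

if __name__ == "__main__":
    for enc in ("iso-8859-1", "cp1251"):
        print(enc, [hex(b) for b in invalid_bytes(enc)])
```

Under these assumptions the sketch reports the whole 0x80..0x9F range
for iso-8859-1 but only 0x98 for cp1251, matching the example in the
quoted text.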

We might also want to introduce a class for control chars, though.
By default, it'd cover whatever chars are usually the control ones on
terminals - see the DumbForce sample.  However, for example,
--encoding=cp1251 will turn most chars in the 0x80 to 0x9f range into
non-control ones, even though they will remain risky to the terminal...

In practice, I'd expect the complement of this class (non-control) to be
more useful.  We'll get that one automatically.

So we'll have ?y for valid and ?O for non-control - similar, but
different (as you explained above).

Oh, and we might want to allocate a consecutive range of character class
letters (maybe a very small range) for user-defined classes.  Maybe we
could use digits rather than letters, but then there won't be automatic
complements.

> One effect of doing so is ability to reject/accept any UTF-8 encoded 
> words (from a mixed wordlist like RockYou.txt) using such rules because 
> *all* non-ascii characters in UTF-8 contains octets in that range.

In what range?  Sorry, I don't understand what you mean here.  There are
UTF-8 characters that are not ASCII yet that do not contain octets in
the 0x80 to 0x9f range.  So perhaps you meant something else.
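
A quick way to check this is the Python sketch below.  The helper name
has_octet_in_range is made up for illustration; it simply encodes a
string as UTF-8 and tests whether any octet falls in the given range.

```python
def has_octet_in_range(text, lo=0x80, hi=0x9F):
    """True if the UTF-8 encoding of text contains an octet in [lo, hi]."""
    return any(lo <= b <= hi for b in text.encode("utf-8"))

if __name__ == "__main__":
    # U+00E0 encodes as C3 A0 and U+00E9 as C3 A9: no octet in 0x80..0x9F.
    # U+00C0 encodes as C3 80, which does contain such an octet.
    for ch in ("\u00e0", "\u00e9", "\u00c0"):
        print(hex(ord(ch)), ch.encode("utf-8").hex(), has_octet_in_range(ch))
```

What does hold is that every octet of a non-ASCII UTF-8 character is
0x80 or above, with continuation octets spanning 0x80 to 0xBF, so a
test on 0x80 to 0x9F alone would miss characters such as U+00E0.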

Thanks,

Alexander
