john-dev - Markov UTF-8 validation (was: [john-users] Incremental attack properties questions)


Message-ID: <88cf56076357f6b7b886576cdd1ec946@smtp.hushmail.com>
Date: Tue, 8 Jan 2013 20:59:55 +0100
From: magnum <john.magnum@...hmail.com>
To: "john-dev@...ts.openwall.com" <john-dev@...ts.openwall.com>
Subject: Markov UTF-8 validation (was: [john-users] Incremental attack properties questions)

On 6 Jan, 2013, at 11:10 , Frank Dittrich <frank_dittrich@...mail.com> wrote:
> On 01/06/2013 04:06 AM, magnum wrote:
>> On 5 Jan, 2013, at 13:00 , Frank Dittrich <frank_dittrich@...mail.com> wrote:
>>> Even if you would get incremental mode working with non-ascii
>>> characters, the incremental mode would sooner or later generate byte
>>> sequences which are not valid utf-8 characters.
>>> (This shouldn't happen with Markov mode, provided you generate your
>>> custom stats file with valid input. There's just one exception if a byte
>>> sequence for a non-ascii character at the end of the word gets cut off
>>> due to maximum length or maximum Markov level limits.)
>> 
>> I really had no idea Markov is this good with UTF-8. This is cool stuff.
> 
> As long as you don't have any characters which require more than 2 bytes
> for UTF-8 encoding, Markov works really well, except for cutting off
> byte sequences composing a single character at the end of the word.
> If you add 3-byte characters into the mix, things get worse, because
> then you have sequences of continuation bytes in the range 0x80-0xbf.
> As long as there are not too many 3-byte or 4-byte characters in your
> input, the number of invalid UTF-8 words generated will not be too bad.
> 
> (Once you finish the UTF-8 validity check for --markov mode used
> together with --encoding=utf-8, --markov mode will be an almost perfect
> fit for UTF-8 passwords.)

The current git code has full UTF-8 validation, but it caused a significant performance drop. We have several alternatives:

1. Keep it (only happens in Markov and only with --enc=utf8)
2. Revert to the simpler end-sequence check (catches most problems and ought to be much faster, although I need to establish just how much faster it is IRL)
3. Drop these tests altogether (for a fast format it is better to just let the invalid sequences through, but for very slow formats this is bad). You can always add an external UTF-8 validation filter for slow formats (and I should of course commit one).
4. Re-code the UTF-8 validation so it happens inline with the Markov candidate generation. This will be tricky (if it is possible at all) but should end up being fast. It may be a challenge to avoid affecting other encodings though.

Maybe #3 is best, but I'll do some more tests with the fast end-sequence validation.
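
To make the difference between alternatives 1 and 2 concrete, here is a minimal sketch in C. It is not the code in git: the names utf8_tail_ok() and utf8_valid() are hypothetical and only illustrate the two kinds of check. The first looks only at the end of the candidate, which is where a cut-off sequence would show up. The second walks every byte and also rejects stray continuation bytes, overlong forms, surrogates and code points above U+10FFFF.

```c
#include <stddef.h>

/* Hypothetical helper (not from the John the Ripper tree).
 * End-sequence check: returns 1 unless the word ends in a
 * multi-byte sequence that was cut off or has a stray tail. */
int utf8_tail_ok(const unsigned char *s, size_t len)
{
    size_t i = len, cont = 0, need;
    unsigned char lead;

    /* Walk back over trailing continuation bytes (10xxxxxx). */
    while (i > 0 && (s[i - 1] & 0xC0) == 0x80) {
        i--;
        cont++;
    }
    if (i == 0)
        return cont == 0; /* empty word is fine, bare continuations are not */

    lead = s[i - 1];
    if (lead < 0x80)
        need = 0;
    else if ((lead & 0xE0) == 0xC0)
        need = 1;
    else if ((lead & 0xF0) == 0xE0)
        need = 2;
    else if ((lead & 0xF8) == 0xF0)
        need = 3;
    else
        return 0; /* 0xF8..0xFF never start a sequence */

    return cont == need;
}

/* Hypothetical helper (not from the John the Ripper tree).
 * Full validation: every byte is inspected. */
int utf8_valid(const unsigned char *s, size_t len)
{
    size_t i = 0, k, need;
    unsigned long cp;

    while (i < len) {
        unsigned char c = s[i];

        if (c < 0x80) {
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07;
        } else {
            return 0; /* stray continuation byte or invalid lead byte */
        }

        if (len - i - 1 < need)
            return 0; /* sequence cut off at the end of the word */

        for (k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        /* Reject overlong encodings, surrogates and out-of-range values. */
        if ((need == 1 && cp < 0x80) ||
            (need == 2 && cp < 0x800) ||
            (need == 3 && cp < 0x10000))
            return 0;
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;
        if (cp > 0x10FFFF)
            return 0;

        i += need + 1;
    }
    return 1;
}
```

The tail check touches only the last few bytes of each candidate, while the full check is linear in the word length. A candidate with a stray continuation byte in the middle but a clean ending passes the first and fails the second, which is the kind of case alternative 2 would let through.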

magnum
