Message-ID: <a00195b7326ea8be1814689deb15f17f@smtp.hushmail.com>
Date: Fri, 5 Apr 2013 18:20:03 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Markov UTF-8 validation (was: [john-users] Incremental attack properties questions)

On 8 Jan, 2013, at 20:59 , magnum <john.magnum@...hmail.com> wrote:
> On 6 Jan, 2013, at 11:10 , Frank Dittrich <frank_dittrich@...mail.com> wrote:
>> On 01/06/2013 04:06 AM, magnum wrote:
>>> On 5 Jan, 2013, at 13:00 , Frank Dittrich <frank_dittrich@...mail.com> wrote:
>>>> Even if you got incremental mode working with non-ASCII
>>>> characters, the incremental mode would sooner or later generate byte
>>>> sequences which are not valid UTF-8 characters.
>>>> (This shouldn't happen with Markov mode, provided you generate your
>>>> custom stats file with valid input. There's just one exception: if a byte
>>>> sequence for a non-ASCII character at the end of the word gets cut off
>>>> due to maximum length or maximum Markov level limits.)
>>> 
>>> I really had no idea Markov is this good with UTF-8. This is cool stuff.
>> 
>> As long as you don't have any characters which require more than 2 bytes
>> for UTF-8 encoding, Markov works really well, except for cutting off
>> byte sequences composing a single character at the end of the word.
>> If you add 3-byte characters into the mix, things get worse, because
>> then you have sequences of continuation bytes in the range 0x80-0xbf.
>> As long as there are not too many 3-byte or 4-byte characters in your
>> input, the number of invalid UTF-8 words generated will not be too bad.
>> 
>> (Once you finish the UTF-8 validity check for --markov mode used
>> together with --encoding=utf-8, --markov mode will be an almost perfect
>> fit for UTF-8 passwords.)
> 
> The current git code has full UTF-8 validation, but it caused a significant performance drop. We have several alternatives:
> 
> 1. Keep it (only happens in Markov and only with --enc=utf8)
> 2. Revert to the simpler end-sequence check (catches most problems and ought to be much faster, although I need to establish just how much faster it is IRL)
> 3. Drop these tests altogether (for a fast format it is better to just let the invalid sequences through, but for very slow formats this is bad). You can always add an external UTF-8 validation filter for slow formats (and I should of course commit one).
> 4. Re-code the UTF-8 validation so it happens inline with the Markov candidate generation. This will be tricky (if at all possible) but should end up being fast. It may be a challenge not to affect other encodings, though.
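
For reference, the "end-sequence check" in option 2 boils down to something like the sketch below (plain C, simplified for illustration only, not the actual JtR code). It merely verifies that the word does not stop in the middle of a multi-byte sequence; everything before the last character is assumed to be well-formed:

/*
 * Sketch of the end-sequence check (option 2 above), simplified for
 * illustration -- not the actual JtR code.  It only verifies that the
 * word does not end in the middle of a UTF-8 multi-byte sequence.
 */
static int utf8_end_complete(const unsigned char *word, int len)
{
    int cont = 0;
    unsigned char lead;

    /* count trailing continuation bytes (10xxxxxx) */
    while (cont < len && (word[len - 1 - cont] & 0xC0) == 0x80)
        cont++;

    if (cont == len)               /* empty, or nothing but continuation bytes */
        return cont == 0;

    lead = word[len - 1 - cont];

    if (lead < 0x80)               /* ASCII: no continuation bytes allowed */
        return cont == 0;
    if ((lead & 0xE0) == 0xC0)     /* 110xxxxx: 2-byte character */
        return cont == 1;
    if ((lead & 0xF0) == 0xE0)     /* 1110xxxx: 3-byte character */
        return cont == 2;
    if ((lead & 0xF8) == 0xF0)     /* 11110xxx: 4-byte character */
        return cont == 3;
    return 0;                      /* invalid lead byte */
}

As Frank notes, with stats built from 2-byte-only UTF-8 input such a check should catch nearly everything; once 3-byte characters are in the stats file, some invalid sequences in the middle of the word can still slip through.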

Since writing the above, the validation has been dropped from Markov mode. I have now added an external filter, --ext:filter_utf8, that rejects anything not legal in UTF-8. It is very slow though, so it's not worth using with fast formats.
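
To give an idea of what the filter has to catch, the full check is roughly the following (again just a sketch in plain C, not the external-mode code that went into john.conf):

/*
 * Rough sketch of a structural UTF-8 validity check -- an illustration
 * only, not the external-mode code shipped in john.conf.  Rejects stray
 * continuation bytes, invalid lead bytes, and sequences cut off at the
 * end of the word.  Overlong forms, UTF-16 surrogates and code points
 * above U+10FFFF would need extra checks on the second byte.
 */
static int valid_utf8(const unsigned char *s)
{
    while (*s) {
        unsigned char c = *s++;
        int follow;

        if (c < 0x80)                    /* plain ASCII */
            continue;
        else if (c >= 0xC2 && c <= 0xDF) /* 2-byte sequence */
            follow = 1;
        else if (c >= 0xE0 && c <= 0xEF) /* 3-byte sequence */
            follow = 2;
        else if (c >= 0xF0 && c <= 0xF4) /* 4-byte sequence */
            follow = 3;
        else                             /* 0xC0/0xC1, 0xF5+ or stray continuation */
            return 0;

        while (follow--) {
            if ((*s & 0xC0) != 0x80)     /* missing continuation, incl. NUL at a cut-off word */
                return 0;
            s++;
        }
    }
    return 1;
}

Used together with Markov the invocation would be something like "./john --markov --enc=utf-8 --ext:filter_utf8 ..." (modulo exact option spelling), but as said, it's only worth the overhead for slow formats.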

I'm also considering adding a rule with the same purpose, but that would go into the bleeding branch.

magnum
