Message-ID: <a00195b7326ea8be1814689deb15f17f@smtp.hushmail.com>
Date: Fri, 5 Apr 2013 18:20:03 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Markov UTF-8 validation (was: [john-users] Incremental attack properties questions)

On 8 Jan, 2013, at 20:59 , magnum <john.magnum@...hmail.com> wrote:
> On 6 Jan, 2013, at 11:10 , Frank Dittrich <frank_dittrich@...mail.com> wrote:
>> On 01/06/2013 04:06 AM, magnum wrote:
>>> On 5 Jan, 2013, at 13:00 , Frank Dittrich <frank_dittrich@...mail.com> wrote:
>>>> Even if you got incremental mode working with non-ASCII
>>>> characters, the incremental mode would sooner or later generate byte
>>>> sequences which are not valid UTF-8 characters.
>>>> (This shouldn't happen with Markov mode, provided you generate your
>>>> custom stats file with valid input. There's just one exception: if a byte
>>>> sequence for a non-ASCII character at the end of the word gets cut off
>>>> due to maximum length or maximum Markov level limits.)
>>> 
>>> I really had no idea Markov is this good with UTF-8. This is cool stuff.
>> 
>> As long as you don't have any characters which require more than 2 bytes
>> for UTF-8 encoding, Markov works really well, except for cutting off
>> byte sequences composing a single character at the end of the word.
>> If you add 3-byte characters into the mix, things get worse, because
>> then you have sequences of continuation bytes in the range 0x80-0xbf.
>> As long as there are not too many 3-byte or 4-byte characters in your
>> input, the number of invalid UTF-8 words generated will not be too bad.
>> 
>> (Once you finish the UTF-8 validity check for --markov mode used
>> together with --encoding=utf-8, --markov mode will be an almost perfect
>> fit for UTF-8 passwords.)
> 
> The current git code has full UTF-8 validation, but it caused a significant performance drop. We have several alternatives:
> 
> 1. Keep it (only happens in Markov and only with --enc=utf8)
> 2. Revert to the simpler end-sequence check (catches most problems and ought to be much faster, although I need to establish just how much faster it is IRL)
> 3. Drop these tests altogether (for a fast format it is better to just let the invalid sequences through, but for very slow formats this is bad). You can always add an external UTF-8 validation filter for slow formats (and I should of course commit one).
> 4. Re-code the UTF-8 validation so it happens inline with the Markov candidate generation. This will be tricky (if at all possible) but should end up being fast. It may be a challenge not to affect other encodings, though.
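
For reference, the "end-sequence check" in option 2 boils down to something like the sketch below (plain C, simplified for illustration only, not the actual JtR code). It merely verifies that the word does not stop in the middle of a multi-byte sequence; everything before the last character is assumed to be well-formed:

/*
 * Sketch of the end-sequence check (option 2 above), simplified for
 * illustration -- not the actual JtR code.  It only verifies that the
 * word does not end in the middle of a UTF-8 multi-byte sequence.
 */
static int utf8_end_complete(const unsigned char *word, int len)
{
    int cont = 0;
    unsigned char lead;

    /* count trailing continuation bytes (10xxxxxx) */
    while (cont < len && (word[len - 1 - cont] & 0xC0) == 0x80)
        cont++;

    if (cont == len)               /* empty, or nothing but continuation bytes */
        return cont == 0;

    lead = word[len - 1 - cont];

    if (lead < 0x80)               /* ASCII: no continuation bytes allowed */
        return cont == 0;
    if ((lead & 0xE0) == 0xC0)     /* 110xxxxx: 2-byte character */
        return cont == 1;
    if ((lead & 0xF0) == 0xE0)     /* 1110xxxx: 3-byte character */
        return cont == 2;
    if ((lead & 0xF8) == 0xF0)     /* 11110xxx: 4-byte character */
        return cont == 3;
    return 0;                      /* invalid lead byte */
}

As Frank notes, with stats built from 2-byte-only UTF-8 input such a check should catch nearly everything; once 3-byte characters are in the stats file, some invalid sequences in the middle of the word can still slip through.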

Since writing the above, the validation has been dropped from Markov mode. I have now added an external filter, --ext:filter_utf8, that rejects anything not legal in UTF-8. It is very slow though, so it's not worth using with fast formats.
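
To give an idea of what the filter has to catch, the full check is roughly the following (again just a sketch in plain C, not the external-mode code that went into john.conf):

/*
 * Rough sketch of a structural UTF-8 validity check -- an illustration
 * only, not the external-mode code shipped in john.conf.  Rejects stray
 * continuation bytes, invalid lead bytes, and sequences cut off at the
 * end of the word.  Overlong forms, UTF-16 surrogates and code points
 * above U+10FFFF would need extra checks on the second byte.
 */
static int valid_utf8(const unsigned char *s)
{
    while (*s) {
        unsigned char c = *s++;
        int follow;

        if (c < 0x80)                    /* plain ASCII */
            continue;
        else if (c >= 0xC2 && c <= 0xDF) /* 2-byte sequence */
            follow = 1;
        else if (c >= 0xE0 && c <= 0xEF) /* 3-byte sequence */
            follow = 2;
        else if (c >= 0xF0 && c <= 0xF4) /* 4-byte sequence */
            follow = 3;
        else                             /* 0xC0/0xC1, 0xF5+ or stray continuation */
            return 0;

        while (follow--) {
            if ((*s & 0xC0) != 0x80)     /* missing continuation, incl. NUL at a cut-off word */
                return 0;
            s++;
        }
    }
    return 1;
}

Used together with Markov the invocation would be something like "./john --markov --enc=utf-8 --ext:filter_utf8 ..." (modulo exact option spelling), but as said, it's only worth the overhead for slow formats.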

I'm also considering adding a rule with the same purpose, but that would go into the bleeding branch.

magnum
