Message-ID: <a00195b7326ea8be1814689deb15f17f@smtp.hushmail.com>
Date: Fri, 5 Apr 2013 18:20:03 +0200
From: magnum <john.magnum@...hmail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Markov UTF-8 validation (was: [john-users] Incremental attack properties questions)

On 8 Jan, 2013, at 20:59 , magnum <john.magnum@...hmail.com> wrote:
> On 6 Jan, 2013, at 11:10 , Frank Dittrich <frank_dittrich@...mail.com> wrote:
>> On 01/06/2013 04:06 AM, magnum wrote:
>>> On 5 Jan, 2013, at 13:00 , Frank Dittrich <frank_dittrich@...mail.com> wrote:
>>>> Even if you would get incremental mode working with non-ascii
>>>> characters, the incremental mode would sooner or later generate byte
>>>> sequences which are not valid utf-8 characters.
>>>> (This shouldn't happen with Markov mode, provided you generate your
>>>> custom stats file with valid input. There's just one exception if a byte
>>>> sequence for a non-ascii character at the end of the word gets cut off
>>>> due to maximum length or maximum Markov level limits.)
>>>
>>> I really had no idea Markov is this good with UTF-8. This is cool stuff.
>>
>> As long as you don't have any characters which require more than 2 bytes
>> for UTF-8 encoding, Markov works really well, except for cutting off
>> byte sequences composing a single character at the end of the word.
>> If you add 3-byte characters into the mix, things get worse, because
>> then you have sequences of continuation bytes in the range 0x80-0xbf.
>> As long as there are not too many 3-byte or 4-byte characters in your
>> input, the number of invalid UTF-8 words generated will not be too bad.
>>
>> (Once you finish the UTF-8 validity check for --markov mode used
>> together with --encoding=utf-8, --markov mode will be an almost perfect
>> fit for UTF-8 passwords.)
>
> The current git code has full UTF-8 validation, but it made a significant performance drop. We have several alternatives:
>
> 1. Keep it (only happens in Markov and only with --enc=utf8)
> 2. Revert to the simpler end-sequence check (catches most problems and ought to be much faster, although I need to establish just how much faster it is IRL)
> 3. Drop these tests altogether (for a fast format it is better to just let the invalid sequences through, but for very slow formats this is bad). You can always add an external UTF-8 validation filter for slow formats (and I should of course commit one).
> 4. Re-code the UTF validation so it happens inline with the Markov candidate generation. This will be tricky (if at all possible) but should end up being fast. It may be a challenge to not affect other encodings though.

Since the above, the validation was dropped from Markov mode. I have now added an external filter, --ext:filter_utf8, that rejects anything not legal in UTF-8. It is very slow, though, so it is not worth using for fast formats. I'm considering also adding a rule with the same purpose, but that would go into the bleeding branch.

magnum
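
For anyone following the thread, the difference between the full validation and the cheaper end-sequence check might be sketched in C roughly as below. This is only an illustration, not the code from the john tree and not the filter_utf8 external. The function names utf8_valid and utf8_tail_ok are made up for this sketch, and reading "end-sequence check" as "only verify that the last character of the word is not cut off" is an assumption.

```c
#include <stddef.h>

/*
 * Full check (hypothetical helper): returns 1 if s[0..len) is well-formed
 * UTF-8 as defined by RFC 3629, else 0. Every byte of the word is examined.
 */
int utf8_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        unsigned int cp, min;
        size_t need, k;         /* need = continuation bytes expected */

        if (c < 0x80) {         /* plain ASCII */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            need = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            need = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            need = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;           /* stray continuation byte, or 0xF8..0xFF */
        }

        if (len - i - 1 < need)
            return 0;           /* sequence cut off at the end of the word */
        i++;
        for (k = 0; k < need; k++, i++) {
            if ((s[i] & 0xC0) != 0x80)
                return 0;       /* not in the 0x80-0xbf continuation range */
            cp = (cp << 6) | (s[i] & 0x3F);
        }
        if (cp < min)
            return 0;           /* overlong encoding */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;           /* UTF-16 surrogate */
        if (cp > 0x10FFFF)
            return 0;           /* beyond the Unicode range */
    }
    return 1;
}

/*
 * End-sequence check (assumed interpretation): looks at no more than the
 * last four bytes and returns 0 only if the final character is truncated
 * or malformed. Invalid sequences earlier in the word are not detected.
 */
int utf8_tail_ok(const unsigned char *s, size_t len)
{
    size_t n = 0;               /* trailing continuation bytes, capped at 3 */
    unsigned char lead;

    while (n < len && n < 3 && (s[len - 1 - n] & 0xC0) == 0x80)
        n++;
    if (n == len)
        return n == 0;          /* empty word is fine; bare continuations are not */

    lead = s[len - 1 - n];
    if (lead < 0x80)
        return n == 0;
    if ((lead & 0xE0) == 0xC0)
        return n == 1;
    if ((lead & 0xF0) == 0xE0)
        return n == 2;
    if ((lead & 0xF8) == 0xF0)
        return n == 3;
    return 0;                   /* a fourth continuation byte, or 0xF8..0xFF */
}
```

The tail check does a bounded amount of work per candidate regardless of word length, which is presumably why it ought to be faster. The price is that a word such as 0xC3 followed by 'a' passes it, while the full check rejects it.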