Message-ID: <88cf56076357f6b7b886576cdd1ec946@smtp.hushmail.com>
Date: Tue, 8 Jan 2013 20:59:55 +0100
From: magnum <john.magnum@...hmail.com>
To: "john-dev@...ts.openwall.com" <john-dev@...ts.openwall.com>
Subject: Markov UTF-8 validation (was: [john-users] Incremental attack properties questions)

On 6 Jan, 2013, at 11:10 , Frank Dittrich <frank_dittrich@...mail.com> wrote:
> On 01/06/2013 04:06 AM, magnum wrote:
>> On 5 Jan, 2013, at 13:00 , Frank Dittrich <frank_dittrich@...mail.com> wrote:
>>> Even if you would get incremental mode working with non-ascii
>>> characters, the incremental mode would sooner or later generate byte
>>> sequences which are not valid utf-8 characters.
>>> (This shouldn't happen with Markov mode, provided you generate your
>>> custom stats file with valid input. There's just one exception if a byte
>>> sequence for a non-ascii character at the end of the word gets cut off
>>> due to maximum length or maximum Markov level limits.)
>>
>> I really had no idea Markov is this good with UTF-8. This is cool stuff.
>
> As long as you don't have any characters which require more than 2 bytes
> for UTF-8 encoding, Markov works really good, except for cutting off
> byte sequences composing a single character at the end of the word.
> If you add 3-byte characters into the mix, things get worse, because
> then you have sequences of continuation bytes in the range 0x80-0xbf.
> As long as there are not too many 3-byte or 4-byte characters in your
> input, the number of invalid UTF-8 words generated will not be too bad.
>
> (Once you finish the UTF-8 validity check for --markov mode used
> together with --encoding=utf-8, --markov mode will be an almost perfect
> fit for UTF-8 passwords.)

The current git code has full UTF-8 validation, but it caused a
significant performance drop. We have several alternatives:

1. Keep it (it only happens in Markov mode and only with --enc=utf8).

2. Revert to the simpler end-sequence check (it catches most problems and
ought to be much faster, although I need to establish just how much
faster it is IRL).

3. Drop these tests altogether (for a fast format it is better to just
let the invalid sequences through, but for very slow formats this is
bad). You can always add an external UTF-8 validation filter for slow
formats (and I should of course commit one).

4. Re-code the UTF-8 validation so it happens inline with the Markov
candidate generation. This will be tricky (if at all possible) but
should end up being fast. It may be a challenge to not affect other
encodings though.

Maybe #3 is best, but I'll do some more tests with the fast end-sequence
validation.

magnum
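
To make the difference between alternatives 1 and 2 concrete, here is a minimal Python sketch. It is not the John the Ripper code (which is C); the function names `is_valid_utf8` and `ends_with_complete_char` are hypothetical, and the "end-sequence check" shown is only one plausible reading of the term: it looks solely at the last few bytes to see whether a multi-byte character was cut off at the end of the word.

```python
def is_valid_utf8(candidate: bytes) -> bool:
    """Full validation: strictly decode the whole candidate."""
    try:
        candidate.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def ends_with_complete_char(candidate: bytes) -> bool:
    """Cheap check (assumed meaning of 'end-sequence check'):
    inspect only the tail and reject a truncated final character."""
    if not candidate or candidate[-1] < 0x80:
        return True  # empty word or ASCII final byte: nothing was cut off

    # Walk back over trailing continuation bytes (0x80-0xBF), at most 3.
    i = len(candidate) - 1
    trailing = 0
    while i >= 0 and 0x80 <= candidate[i] <= 0xBF and trailing < 3:
        i -= 1
        trailing += 1
    if i < 0:
        return False  # only continuation bytes, no lead byte at all

    # The byte before the continuation run must be a lead byte that
    # announces exactly that many continuation bytes.
    lead = candidate[i]
    if 0xC2 <= lead <= 0xDF:
        needed = 1
    elif 0xE0 <= lead <= 0xEF:
        needed = 2
    elif 0xF0 <= lead <= 0xF4:
        needed = 3
    else:
        return False  # not a valid lead byte for the trailing run
    return trailing == needed


if __name__ == "__main__":
    for word in (b"caf\xc3\xa9", b"caf\xc3", b"\xe2\x82", b"a\x80\x80b"):
        print(word, is_valid_utf8(word), ends_with_complete_char(word))
```

The tail check touches at most four bytes regardless of word length, which is why it should be cheaper than decoding every candidate. The price is that it misses invalid sequences in the middle of a word, such as `b"a\x80\x80b"` above, which matches the "catches most problems" caveat in alternative 2.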