Message-ID: <BLU0-SMTP442EC109D92E9EAA5E4A3F6FD260@phx.gbl>
Date: Sun, 6 Jan 2013 10:22:02 +0100
From: Frank Dittrich <frank_dittrich@...mail.com>
To: john-dev@...ts.openwall.com
Subject: Re: Markov UTF-8 magic

On 01/06/2013 04:52 AM, magnum wrote:
> On 6 Jan, 2013, at 3:23 , magnum <john.magnum@...hmail.com> wrote:
>
>> On 5 Jan, 2013, at 14:29 , Frank Dittrich <frank_dittrich@...mail.com> wrote:
>>> On 01/05/2013 01:11 PM, Frank Dittrich wrote:
>>>> Since Markov mode generates words based on 2-byte-frequencies, and since
>>>> it generates passwords shorter than maximum length, there will be a
>>>> non-neglectable number of words with invalid utf-8 characters,
>>>> especially at the end of the word. So you might need to combine --markov
>>>> with an --external filter.
>>>
>>> If you don't want to write a general-purpose utf-8 validity check, but
>>> just one which checks --markov output based on stats files which have
>>> been generated using a word list encoded in (valid) UTF-8, then this
>>> task is quite simple:
>>>
>>> If the last byte is < 0x80, the word is valid.
>>> Else if the last byte is > 0xbf, the word is invalid.
>>> Else if the second to last byte is >= 0xc0 and <= 0xdf, the word is valid.
>>> Else if the third to last byte is >= 0xe0 and <= 0xef, the word is valid.
>>> Else if the forth to last byte is >= 0xf0 and <= 0xf7, the word is valid.
>>> Else the word is invalid.
>>
>> I'm thinking I could include this in the Markov mode itself, provided we run with --enc=utf8. Would that be sane?
>
> I tried doing so. Please test.

Unfortunately, if your input contains both characters which are
represented by two bytes and characters which are represented by three
bytes, then you can get wrong sequences if you have characters with the
same second byte.
I.e., you could have the third byte of a 3-byte character appended to a
2-byte character, or the third byte of a 3-byte character skipped,
because 2-byte characters and 3-byte (or 4-byte) characters use the same
continuation bytes. Only the first byte determines how long the UTF-8
representation of a character is.

I tried to implement a better UTF-8 check as an external mode, but so
far I had no time to test:

# Check for valid UTF-8 encoding
[List.External:UTF-8]
void filter()
{
	int i, j, c;

	i = 0;
	j = -1;	/* index of the last byte of the current multi-byte character */

	while (c = word[i]) {
		if (c >= 0x80) {
			if (c < 0xc0) {
				/* continuation byte outside a multi-byte character */
				if (i > j) {
					word[0] = 0; return;
				}
			}
			else if (i <= j) {
				/* lead byte inside an unfinished character */
				word[0] = 0; return;
			}
			else {
				/* lead byte: j marks where this character must end */
				j = i + 1;
				if (c >= 0xe0) {
					j++;
					if (c > 0xf7) {
						word[0] = 0; return;
					}
					else if (c >= 0xf0)
						j++;
				}
			}
		}
		else if (i <= j) {
			/* ASCII byte inside an unfinished character */
			word[0] = 0; return;
		}
		i++;
	}
	/* the word ended in the middle of a character */
	if (i <= j)
		word[0] = 0;
}

If there are bugs, you should still be able to get the idea of what I
tried to do.

Some bytes (0xc0, 0xc1, and 0xf5 - 0xff) are invalid.
Those shouldn't be generated by Markov mode if the input for generating
the stats file was valid UTF-8, but a general-purpose UTF-8 checker
should reject them.
See http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

Frank
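
For comparison, here is a minimal sketch in plain C (not John's external mode language) of the two checks discussed in this thread. The function names tail_looks_valid and utf8_structure_ok are hypothetical and exist only for this illustration. The first function follows the quoted tail-only rule literally. The second walks the whole word and also rejects 0xc0, 0xc1 and 0xf5 - 0xff. It is still only a structural check: it does not detect overlong 3-byte or 4-byte forms or encoded surrogates.

```c
#include <stddef.h>

/* Tail-only rule from the quoted message: inspects at most the last 4 bytes. */
static int tail_looks_valid(const unsigned char *s, size_t n)
{
    if (n == 0 || s[n - 1] < 0x80) return 1;   /* ends with ASCII */
    if (s[n - 1] > 0xbf) return 0;             /* ends with a lead byte */
    if (n >= 2 && s[n - 2] >= 0xc0 && s[n - 2] <= 0xdf) return 1;
    if (n >= 3 && s[n - 3] >= 0xe0 && s[n - 3] <= 0xef) return 1;
    if (n >= 4 && s[n - 4] >= 0xf0 && s[n - 4] <= 0xf7) return 1;
    return 0;
}

/* Whole-word structural check: every lead byte must be followed by
 * exactly the number of continuation bytes it announces. */
static int utf8_structure_ok(const unsigned char *s, size_t n)
{
    size_t i = 0, k, need;

    while (i < n) {
        unsigned char c = s[i];

        if (c < 0x80) need = 0;          /* ASCII */
        else if (c < 0xc2) return 0;     /* stray continuation, or 0xc0/0xc1 */
        else if (c < 0xe0) need = 1;     /* 2-byte character */
        else if (c < 0xf0) need = 2;     /* 3-byte character */
        else if (c < 0xf5) need = 3;     /* 4-byte character */
        else return 0;                   /* 0xf5 - 0xff never occur */

        if (need > n - i - 1) return 0;  /* word ends inside a character */
        for (k = 1; k <= need; k++)
            if ((s[i + k] & 0xc0) != 0x80) return 0;  /* not a continuation */
        i += need + 1;
    }
    return 1;
}
```

A word such as 0xe2 0x82 'a' (a 3-byte character whose third byte was skipped) passes the tail-only rule because it ends in ASCII, but fails the whole-word check.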