Message-ID: <20241031223607.GA8221@openwall.com>
Date: Thu, 31 Oct 2024 23:36:07 +0100
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Markov phrases in john

On Thu, Oct 31, 2024 at 07:58:18PM +0100, Solar Designer wrote:
> Now getting rid of the password.lst mistake, I have at 1 billion
> candidates trained entirely on:
>
> RockYou with dupes - 1830570 or +731171
>      without dupes - 1870645 or +771246
>
> Comparing the best result so far without the tokenizer vs. the best
> with the tokenizer, it's an improvement from +630978 to +771246, or
> by 22%.
>
> The average length of the extra 771246 passwords is 6.83, so this time
> they're only very slightly longer than what we had without the
> tokenizer.
>
> It's possible to tune for longer passwords, such as by excluding
> length 2 tokens, but with otherwise the same input I guess this will
> result in incremental mode training using fewer-token strings first
> and in fewer passwords cracked.

I've now tested this as well (excluding length 2 or even also length 3
tokens in favor of length 4, or biasing towards longer tokens while
including lengths 2 to 4), and it matches my expectations above (fewer
passwords cracked, and average cracked password length increased only
to about 7.0).

What's more interesting, though, is that this is a way to get different
passwords cracked. For example, with token length forced to 4 (for all
158 tokens, many of which are full words or years), training on RockYou
without dupes, at 1 billion candidates I got 1770275 or +670876.
Combining this with the above result of "1870645 or +771246" (which was
for token lengths 2 to 4), I get 2123847 or +1024448. That's for 1+1=2
billion candidates total. Simply continuing the first (token lengths 2
to 4) run to 2 billion instead gives merely 2016222 or +916823. So we
get 12% more combined incremental mode cracks by splitting the 2
billion candidate budget into two differently tokenized 1 billion runs.
While the average length increases only slightly, the number of long
passwords cracked increases significantly: for length 13+, from 68 in
1 billion for token lengths 2 to 4, to 124 for token length 4 (or 187
for these two combined, at 2 billion total). The maximum length of a
successfully cracked password in 1 billion increases from 14 to 17
(the password is "bellababygirl2007", which is finally a "Markov
phrase"). This will probably be more important in much longer runs
(1 billion is quite little for approaching such passphrase lengths in
this way).

Another interesting observation is that forcing token length 2 results
in fewer cracks (1857522 or +758123 at 1 billion) than with token
lengths 3 and 4 also lightly used (as in the previous tests). Further,
combining this token length 2 pot with the above token length 4 pot
results in slightly fewer cracks (2110578 or +1011179) than the result
above. So tokenize.pl's current default selection of token lengths
actually appears optimal for both kinds of uses (on its own and along
with a differently-tokenized run).

> This may be more reasonable to do with/via a
> pre-filtered training set (for use after more extensive other attacks
> than just the wordlist) and once we re-focus this approach on phrases.

Alexander