Message-ID: <20241031231900.GA8623@openwall.com>
Date: Fri, 1 Nov 2024 00:19:00 +0100
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Markov phrases in john

On Thu, Oct 31, 2024 at 11:36:07PM +0100, Solar Designer wrote:
> On Thu, Oct 31, 2024 at 07:58:18PM +0100, Solar Designer wrote:
> >      without dupes - 1870645 or +771246
> > 
> > Comparing the best result so far without the tokenizer vs. the best
> > with the tokenizer, it's an improvement from +630978 to +771246, or
> > by 22%.
> > 
> > Average length of the extra 771246 passwords is 6.83, so this time
> > they're only very slightly longer than what we had without the
> > tokenizer.
> > 
> > It's possible to tune for longer passwords, such as by excluding
> > length 2 tokens, but with otherwise the same input I guess this will
> > result in incremental mode training using fewer-token strings first,
> > and in fewer passwords cracked.
> 
> I've now tested this as well (excluding length 2 or even also length 3
> tokens in favor of length 4, or biasing towards longer tokens while
> including lengths 2 to 4), and it matches my expectations above (fewer
> passwords cracked, and average cracked password length increased only to
> about 7.0).
> 
> What's more interesting, though, is that it's a way to get different
> passwords cracked.  For example, with token length forced to 4 (for all
> 158 tokens, many of which are full words or years), training on RockYou
> without dupes, at 1 billion candidates I got 1770275 or +670876.
> Combining this with the above result of "1870645 or +771246" (which was
> for token lengths 2 to 4), I get 2123847 or +1024448.  That's for 1+1=2
> billion candidates total.  Simply continuing the first (token length 2
> to 4) run to 2 billion instead gives merely 2016222 or +916823.
> 
> So we get 12% more combined incremental mode cracks by splitting the 2
> billion candidate budget into two differently tokenized 1 billion runs.

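Just to spell out the 12% figure above, here's a trivial check (Python,
with the "+" counts from the quoted text hard-coded):

    # Incremental mode cracks beyond baseline (the "+" figures), with a
    # 2 billion candidate budget spent either way
    split_runs = 1_024_448  # two differently tokenized 1-billion runs
    single_run = 916_823    # one token length 2 to 4 run run to 2 billion

    print(split_runs / single_run - 1)  # ~0.117, so about 12% more cracks
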
I was also interested in how wasteful or not such a split is in terms of
duplicate candidates.

For the token length 2 to 4 run, we have 997250925 unique (99.7%).
For the token length 4 run, we have 998700856 unique (99.9%).
For these two combined, we have 1885325771 unique (94.3%).
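
To see where the waste comes from, a quick breakdown (Python, with the
counts above hard-coded):

    # Unique candidate counts reported for the two 1-billion runs
    run1_unique = 997_250_925    # token lengths 2 to 4
    run2_unique = 998_700_856    # token length 4 only
    combined_unique = 1_885_325_771

    # Duplicates each run produced on its own
    dup_within_1 = 1_000_000_000 - run1_unique  # 2,749,075
    dup_within_2 = 1_000_000_000 - run2_unique  # 1,299,144

    # Candidates that the two runs' unique sets have in common
    # (inclusion-exclusion)
    overlap = run1_unique + run2_unique - combined_unique  # 110,626,010

    print(dup_within_1, dup_within_2, overlap)

In other words, nearly all of the 114.7 million duplicates (about 110.6
million) are candidates that both tokenizations produce, rather than
either run repeating itself.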

So the split is only moderately wasteful (and for such counts it's
practical to deduplicate when hashes are slow), but this could get worse
for longer runs.

Alexander
