Message-ID: <20241031231900.GA8623@openwall.com>
Date: Fri, 1 Nov 2024 00:19:00 +0100
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Markov phrases in john

On Thu, Oct 31, 2024 at 11:36:07PM +0100, Solar Designer wrote:
> On Thu, Oct 31, 2024 at 07:58:18PM +0100, Solar Designer wrote:
> >      without dupes - 1870645 or +771246
> > 
> > Comparing the best result so far without the tokenizer vs. the best
> > with the tokenizer, it's an improvement from +630978 to +771246, or
> > by 22%.
> > 
> > Average length of the extra 771246 passwords is 6.83, so this time
> > they're only very slightly longer than what we had without the
> > tokenizer.
> > 
> > It's possible to tune for longer passwords, such as by excluding
> > length 2 tokens, but with otherwise the same input I guess this will
> > result in incremental mode training using fewer-token strings first,
> > and in fewer passwords cracked.
> 
> I've now tested this as well (excluding length 2 or even also length 3
> tokens in favor of length 4, or biasing towards longer tokens while
> including lengths 2 to 4), and it matches my expectations above (fewer
> passwords cracked, and average cracked password length increased only to
> about 7.0).
> 
> What's more interesting, though, is that it's a way to get different
> passwords cracked.  For example, with token length forced to 4 (for all
> 158 tokens, many of which are full words or years), training on RockYou
> without dupes, at 1 billion candidates I got 1770275 or +670876.
> Combining this with the above result of "1870645 or +771246" (which was
> for token lengths 2 to 4), I get 2123847 or +1024448.  That's for 1+1=2
> billion candidates total.  Simply continuing the first (token length 2
> to 4) run to 2 billion instead gives merely 2016222 or +916823.
> 
> So we get 12% more combined incremental mode cracks by splitting the 2
> billion candidate budget into two differently tokenized 1 billion runs.

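Just to spell out the 12% figure above, here's a trivial check (Python,
with the "+" counts from the quoted text hard-coded):

    # Incremental mode cracks beyond baseline (the "+" figures), with a
    # 2 billion candidate budget spent either way
    split_runs = 1_024_448  # two differently tokenized 1-billion runs
    single_run = 916_823    # one token length 2 to 4 run run to 2 billion

    print(split_runs / single_run - 1)  # ~0.117, so about 12% more cracks
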
I was also interested in how wasteful or not such a split is in terms of
duplicate candidates.

For the token length 2 to 4 run, we have 997250925 unique (99.7%).
For the token length 4 run, we have 998700856 unique (99.9%).
For these two combined, we have 1885325771 unique (94.3%).
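
To see where the waste comes from, a quick breakdown (Python, with the
counts above hard-coded):

    # Unique candidate counts reported for the two 1-billion runs
    run1_unique = 997_250_925    # token lengths 2 to 4
    run2_unique = 998_700_856    # token length 4 only
    combined_unique = 1_885_325_771

    # Duplicates each run produced on its own
    dup_within_1 = 1_000_000_000 - run1_unique  # 2,749,075
    dup_within_2 = 1_000_000_000 - run2_unique  # 1,299,144

    # Candidates that the two runs' unique sets have in common
    # (inclusion-exclusion)
    overlap = run1_unique + run2_unique - combined_unique  # 110,626,010

    print(dup_within_1, dup_within_2, overlap)

In other words, nearly all of the 114.7 million duplicates (about 110.6
million) are candidates that both tokenizations produce, rather than
either run repeating itself.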

So the split is only moderately wasteful (and for such counts it's
practical to deduplicate when hashes are slow), but this could get worse
for longer runs.

Alexander
