john-users - Re: Markov phrases in john

Message-ID: <20241101021006.GA9021@openwall.com>
Date: Fri, 1 Nov 2024 03:10:06 +0100
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Markov phrases in john

On Fri, Nov 01, 2024 at 12:19:00AM +0100, Solar Designer wrote:
> On Thu, Oct 31, 2024 at 11:36:07PM +0100, Solar Designer wrote:
> > What's more interesting, though, is that it's a way to get different
> > passwords cracked.  For example, with token length forced to 4 (for all
> > 158 tokens, many of which are full words or years), training on RockYou
> > without dupes, at 1 billion candidates I got 1770275 or +670876.
> > Combining this with the above result of "1870645 or +771246" (which was
> > for token lengths 2 to 4), I get 2123847 or +1024448.  That's for 1+1=2
> > billion candidates total.  Simply continuing the first (token length 2
> > to 4) run to 2 billion instead gives merely 2016222 or +916823.
> > 
> > So we get 12% more combined incremental mode cracks by splitting the 2
> > billion candidate budget into two differently tokenized 1 billion runs.
> 
> I was also interested in how wasteful or not such split is in terms of
> duplicate candidates.
> 
> For the token length 2 to 4 run, we have 997250925 unique (99.7%).
> For the token length 4 run, we have 998700856 unique (99.9%).
> For these two combined, we have 1885325771 unique (94.3%).
> 
> So it's only moderately wasteful (and for such counts it's practical to
> deduplicate when hashes are slow), but could get worse for longer runs.

Upon a closer look, I realize that the token length 4 run is actually a
mix of lots of token-less passwords and also many with tokens.  So it's
an interesting and useful result, but it's not what it seemed at first -
not so much of a focus on longer passwords in the second billion.

To actually focus on longer passwords, I just processed the length 4
token fake pot file through:

sed -n '/[^ -~]/p'

This leaves only lines with non-ASCII characters, which is what we use
for tokens.  Then the corresponding 1 billion run cracks only +378031,
but the ratio of longer passwords increases (359 are length 13+, up from
124 before the above sed).  Combined with the token length 2 to 4 run,
it's 2018976 or +919577, which is still slightly higher than a 2 billion
run for token length 2 to 4.
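
For reference, the sed filter above can be approximated in Python as below.
This is only a sketch: it assumes byte-wise matching (as sed does in the C
locale), and the function names and sample lines are hypothetical, not taken
from any actual pot file.

```python
import re

# Any byte outside the printable ASCII range 0x20 (space) to 0x7E (~),
# i.e. the same set as the sed bracket expression [^ -~] matched byte-wise.
OUTSIDE_PRINTABLE_ASCII = re.compile(rb"[^ -~]")

def has_token_byte(line: bytes) -> bool:
    """True if the line (ignoring its trailing newline) has such a byte."""
    return OUTSIDE_PRINTABLE_ASCII.search(line.rstrip(b"\n")) is not None

def keep_tokenized(lines):
    """Keep only the lines that sed -n '/[^ -~]/p' would print."""
    return [line for line in lines if has_token_byte(line)]

# Hypothetical fake pot lines; 0x80 stands in for some token's byte.
sample = [b":password1\n", b":\x80girl\n"]
print(keep_tokenized(sample))
```

Note that the bracket expression also matches control characters, not just
bytes with the high bit set, so the filter keeps any line with a byte outside
the space-to-tilde range.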

To fully exclude token-less passwords from this second run, I modified
the external mode:

-       word[k] = 0;
+
+       if (i == k)
+               word = 0;
+       else
+               word[k] = 0;

(This filters out candidate passwords for which the length was left
unchanged by token substitution, which means they had no tokens.)
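
The length comparison can be illustrated with a toy Python model. The token
table here is invented for the example (the real one has 158 entries produced
by training), and i and k merely mirror the variable names in the diff above;
this is a sketch of the idea, not the external mode itself.

```python
# Hypothetical token table: one non-ASCII byte value -> 4-character string.
TOKENS = {0x80: b"love", 0x81: b"1234"}

def expand(candidate: bytes):
    """Expand token bytes; return None if the candidate had no tokens."""
    out = bytearray()
    for byte in candidate:
        # Token bytes are replaced by their string, other bytes are copied.
        out += TOKENS.get(byte, bytes([byte]))
    i, k = len(candidate), len(out)  # lengths before and after expansion
    if i == k:
        return None  # length unchanged, so no token was substituted: reject
    return bytes(out)

print(expand(b"master1"))   # contains no token bytes
print(expand(b"jo\x80"))    # contains one token byte
```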

Then it cracks only +156803, which obviously leaves it behind a simple 2
billion run for token length 2 to 4.  The number of cracked length 13+
passwords increases only a bit further (387, up from 359 above).  The
first 25 candidates from this run are:

master1
malove
minnie1
melove
jameslove
jolove
samanda
sweetygirl
sweets1
ming1234
ma1234
me1234
james1234
jo1234
masters
mara123
minnie2
sweety1
sweets3
may1234
miamor1
miamore
sara123
sweetgirl1
sweetgirl9

The cracked passwords of length 16+ are:

mariannamarianna
ilovemyfamily123
lovelovelove1994
angelinaangelina
alexalexalex2007
bellababygirl2007
sexygurl4eva1992
cherryberry2cute
1989198919891989
danceamandadance
bearbearbearbear
moneyoverbitches1
2005200520052005
0000000000002008
ilovestephanie11

(Those mostly with repetitions should of course also be crackable with
wordlist+rules.)
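
To make "mostly with repetitions" a bit more concrete, here is a small Python
sketch that finds the shortest unit a password is a whole-number repetition
of.  It is purely illustrative and the function name is hypothetical; it does
not handle a repeated unit followed by a suffix such as a year.

```python
def repeated_unit(password: str):
    """Return the shortest unit that repeats to form password, else None."""
    n = len(password)
    for size in range(1, n // 2 + 1):
        # The unit length must divide the total length evenly.
        if n % size == 0 and password[:size] * (n // size) == password:
            return password[:size]
    return None

for pw in ["mariannamarianna", "bearbearbearbear", "ilovemyfamily123"]:
    print(pw, repeated_unit(pw))
```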

Modifying the external mode to insist on at least 2 tokens (length
increase greater than 4) yields the following first 25 candidates:

jameslove
sweetygirl
james1234
sweetgirl1
sweetgirl9
amberlove
amber1234
jamesbaby
amberbaby
moneylove
money1234
moneybaby
jerry1234
jerrybaby
jerrygirl
sweetgirl2006
sweetlove1
sweetlove4
sweetlover
moneygirl
jamesgirl
jerryange
ambergirl
sweetlove
sweet1234

This gets closer to "Markov phrases", although words longer than 4 are
formed from the tokens plus individual letters.  Unfortunately, this
cracks only +14391 in 1 billion, out of which 427 are length 13+ (still
an increase compared to previous runs).  It may be worth retesting this
kind of filtering with shorter tokens, as I guess at length 4 the low
number of available tokens becomes too much of a limiting factor for
which passwords may be formed.
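
On the "greater than 4" threshold used for the 2-token filter: assuming each
token takes up a single character before expansion and all tokens are exactly
length 4 (as in this run), each token adds 3 to the length.  One token gives
+3 and two give +6, so requiring an increase of more than 4 means at least 2
tokens.  A quick Python check of that arithmetic, with hypothetical names:

```python
TOKEN_LEN = 4                      # all tokens forced to length 4 in this run
GROWTH_PER_TOKEN = TOKEN_LEN - 1   # one placeholder character becomes 4

def growth(num_tokens: int) -> int:
    """Length increase caused by expanding num_tokens tokens."""
    return num_tokens * GROWTH_PER_TOKEN

def passes_two_token_filter(i: int, k: int) -> bool:
    """i, k: candidate length before and after token expansion."""
    return k - i > 4

for n in range(4):
    before = n + 1                 # n tokens plus one literal character
    after = before + growth(n)
    print(n, growth(n), passes_two_token_filter(before, after))
```

With mixed token lengths (such as 2 to 4) the length increase alone no longer
determines the number of tokens, so such a filter would need to count
substitutions directly.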

Besides longer passwords still being relatively rare and these attacks
not being as effective as wordlist+rules at cracking them, yet another
factor is that longer passwords - and especially non-wordlist-crackable
ones - may be under-represented in HIBP compared to real-world usage.
That's because HIBP is compiled largely from previously-cracked
passwords (many of them from long ago), not only from plaintext leaks.
So whatever passwords others couldn't crack before are simply not in
there, unless the specific leak was plaintext.  In this context, Matt's
suggestion of testing "against a site specific password dump" makes even
more sense.

Alexander