Message-ID: <20241101021006.GA9021@openwall.com>
Date: Fri, 1 Nov 2024 03:10:06 +0100
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Markov phrases in john

On Fri, Nov 01, 2024 at 12:19:00AM +0100, Solar Designer wrote:
> On Thu, Oct 31, 2024 at 11:36:07PM +0100, Solar Designer wrote:
> > What's more interesting, though, is that it's a way to get different
> > passwords cracked. For example, with token length forced to 4 (for all
> > 158 tokens, many of which are full words or years), training on RockYou
> > without dupes, at 1 billion candidates I got 1770275 or +670876.
> > Combining this with the above result of "1870645 or +771246" (which was
> > for token lengths 2 to 4), I get 2123847 or +1024448. That's for 1+1=2
> > billion candidates total. Simply continuing the first (token length 2
> > to 4) run to 2 billion instead gives merely 2016222 or +916823.
> >
> > So we get 12% more combined incremental mode cracks by splitting the 2
> > billion candidate budget into two differently tokenized 1 billion runs.
>
> I was also interested in how wasteful or not such split is in terms of
> duplicate candidates.
>
> For the token length 2 to 4 run, we have 997250925 unique (99.7%).
> For the token length 4 run, we have 998700856 unique (99.9%).
> For these two combined, we have 1885325771 unique (94.3%).
>
> So it's only moderately wasteful (and for such counts it's practical to
> deduplicate when hashes are slow), but could get worse for longer runs.

Upon a closer look, I realize that the token length 4 run is actually a
mix of lots of token-less passwords and also many with tokens. So it's
an interesting and useful result, but it's not what it seemed at first -
not so much of a focus on longer passwords in the second billion.

To actually focus on longer passwords, I just processed the length 4
token fake pot file through:

sed -n '/[^ -~]/p'

This leaves only lines with non-ASCII characters, which is what we use
for tokens.
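
As a rough illustration of that filter, here is a minimal Python sketch
of what the sed expression does, assuming the file is processed as raw
bytes (as sed would in the C locale). The function names and the sample
lines are made up for the example. Note that the bracket expression
matches anything outside the range from space to tilde, so it would
also match control characters such as tab, not only bytes 128 and above.

```python
def has_token_byte(line: bytes) -> bool:
    """True if the line has any byte outside the printable ASCII range ' '..'~'."""
    return any(b < 0x20 or b > 0x7E for b in line)


def filter_lines(lines):
    """Keep only the lines that sed -n '/[^ -~]/p' would print (byte-wise)."""
    return [line for line in lines if has_token_byte(line)]


# Hypothetical sample: 0x80 and 0x81 stand in for token codes.
sample = [b"master1", b"ma\x80", b"\x81s\x80", b"123456"]
kept = filter_lines(sample)
print(kept)
```

Only the two sample lines containing a stand-in token byte are kept.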
Then the corresponding 1 billion run cracks only +378031, but the ratio
of longer passwords increases (359 are length 13+, up from 124 before
the above sed). Combined with the token length 2 to 4 run, it's 2018976
or +919577, which is still slightly higher than a 2 billion run for
token length 2 to 4.

To fully exclude token-less passwords from this second run, I modified
the external mode:

-	word[k] = 0;
+
+	if (i == k)
+		word = 0;
+	else
+		word[k] = 0;

(This filters out candidate passwords for which the length was left
unchanged by token substitution, which means they had no tokens.)

Then it cracks only +156803, which obviously leaves it behind a simple
2 billion run for token length 2 to 4. The number of cracked length 13+
passwords increases only a bit further (387, up from 359 above).

First 25 candidates from this run are:

master1
malove
minnie1
melove
jameslove
jolove
samanda
sweetygirl
sweets1
ming1234
ma1234
me1234
james1234
jo1234
masters
mara123
minnie2
sweety1
sweets3
may1234
miamor1
miamore
sara123
sweetgirl1
sweetgirl9

Length 16+ cracked are:

mariannamarianna
ilovemyfamily123
lovelovelove1994
angelinaangelina
alexalexalex2007
bellababygirl2007
sexygurl4eva1992
cherryberry2cute
1989198919891989
danceamandadance
bearbearbearbear
moneyoverbitches1
2005200520052005
0000000000002008
ilovestephanie11

(Those mostly with repetitions should of course also be crackable with
wordlist+rules.)

Modifying the external mode to insist on at least 2 tokens (length
increase greater than 4) results in the below first 25 candidates:

jameslove
sweetygirl
james1234
sweetgirl1
sweetgirl9
amberlove
amber1234
jamesbaby
amberbaby
moneylove
money1234
moneybaby
jerry1234
jerrybaby
jerrygirl
sweetgirl2006
sweetlove1
sweetlove4
sweetlover
moneygirl
jamesgirl
jerryange
ambergirl
sweetlove
sweet1234

This gets closer to "Markov phrases", although words longer than 4 are
formed from the tokens plus individual letters.
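
The two length-based filters described above can be sketched outside of
john as follows. This is only an illustration in Python, not the
external mode itself. It assumes each token is encoded as a single byte
that expands to 4 characters, and the token table and function names
are hypothetical. Under that assumption, a candidate whose length is
unchanged had no tokens, and a length increase greater than 4 implies
at least 2 tokens (one token adds 3, two add 6).

```python
# Hypothetical token table: one byte code per 4-character token.
TOKENS = {0x80: b"love", 0x81: b"jame", 0x82: b"1234"}


def expand(candidate: bytes, tokens=TOKENS) -> bytes:
    """Replace each token byte with its expansion, copy other bytes as-is."""
    out = bytearray()
    for b in candidate:
        out += tokens.get(b, bytes([b]))
    return bytes(out)


def filter_candidate(candidate: bytes, min_growth: int = 1):
    """Return the expanded candidate, or None if it should be skipped.

    min_growth=1 rejects token-less candidates (length unchanged).
    min_growth=5 means "length increase greater than 4", i.e. 2+ tokens
    under the 4-character token assumption.
    """
    expanded = expand(candidate)
    if len(expanded) - len(candidate) < min_growth:
        return None
    return expanded


for cand in (b"master1", b"ma\x80", b"\x81s\x80"):
    print(cand, filter_candidate(cand), filter_candidate(cand, min_growth=5))
```

Returning None here plays the role that setting word = 0 plays in the
external mode: the candidate is skipped rather than tested.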
Unfortunately, this 2-token run cracks only +14391 in 1 billion, out of
which 427 are length 13+ (still an increase compared to previous runs).
It may be worth retesting this kind of filtering with shorter tokens, as
I guess at length 4 the low number of available tokens becomes too much
of a limiting factor for which passwords may be formed.

Besides longer passwords still being relatively rare and these attacks
not being as effective as wordlist+rules at cracking them, yet another
factor is that longer passwords - and especially non-wordlist-crackable
ones - may be under-represented in HIBP compared to real-world usage.
That's because HIBP is compiled largely from previously-cracked
passwords (many of them from long ago), not only from plaintext leaks.
So whatever passwords others couldn't crack before are simply not in
there, unless the specific leak was plaintext. In this context, Matt's
suggestion of testing "against a site specific password dump" makes
even more sense.

Alexander