Message-ID: <20210502175045.GA2577@openwall.com>
Date: Sun, 2 May 2021 19:50:45 +0200
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: source of information for John's charset files

On Mon, Jan 25, 2021 at 07:54:32PM +0100, Solar Designer wrote:
> On Mon, Jan 25, 2021 at 07:35:48PM +0100, Johny Krekan wrote:
> > I understand right that the list rockyou which you used had duplicate
> > words for example 2x word example in it. My question is what is the
> > reason or advantage of using such wordlist with duplicates in comparison
> > with wordlist with no duplicates? If I create one .pot file from the
> > rockyou with no duplicates would it provide worse probability in finding
> > the password during same time as yours?
>
> A reasonable expectation is that inclusion of duplicates in the training
> set increases the number of cracked accounts rather than cracked unique
> passwords in subsequent password security audits. Conversely, omitting
> the duplicates would possibly optimize for cracking more unique
> passwords but perhaps fewer accounts. An alternative hypothesis is that
> inclusion of duplicates might also help crack more unique passwords that
> are based on frequent substrings even if those came from fully duplicate
> passwords, since otherwise those substrings would be under-represented
> in the training set. You or/and others are welcome to research whether
> these hypotheses are true or not. I no longer recall the results of my
> own testing from back when I made this choice (IIRC, in 2013).

I ran some new tests now, and I think we need to reconsider how we
generate the default .chr files.

I think that with the RockYou list staying easily publicly available
(which felt doubtful back then) a JtR user would run through it as a
wordlist in addition to (and likely before) running incremental mode.
This would take care of duplicate passwords that are seen on the RockYou
list, leaving only the harder passwords for incremental mode. And for
those, my testing shows that generating a .chr file from a list of
unique passwords is better.

I took the first 10M lines from
pwned-passwords-ntlm-ordered-by-hash-v7.txt as my test set. Assuming
there's no correlation between password complexity and NTLM hash value,
this is a representative sample of the HIBP v7 set. The HIBP v7 set is
613M+ unique passwords from 3.65 billion accounts (a figure I calculated
by adding up the counts included in the file's second field). This
includes RockYou, which corresponds to 14M+ unique passwords.

Running the RockYou wordlist against my 10M test set cracks 2.3% of it.
Using ascii.chr or utf8.chr as included with JtR 1.9.0 (thus, generated
from RockYou with duplicates preserved) cracks 5.4% during the first one
billion candidates. Together with the RockYou wordlist run, it's 7.3%.
(Not 2.3+5.4 = 7.7% because there's some overlap.)

Generating a new .chr file from RockYou unique passwords and using that
cracks 6.0%. Together with the RockYou wordlist, it's 7.8%.

Then I tried using HIBP v7 passwords cracked with moderate effort as a
training set, after first excluding the fbobh_* pattern (likely noise
coming from just one of the breaches) and the passwords that are in my
10M test set. 451M passwords were left. Generating a new .chr file from
that cracks 6.6%. Together with the RockYou wordlist, it's 8.5%.

So in these tests we can get a roughly 10% relative improvement by using
unique'd RockYou and another roughly 10% by expanding the training set
to HIBP v7. When the RockYou wordlist is also run, these improvements
are smaller, but similar. They also similarly persist on top of other
attacks I tried.

(I had heard folks cracked almost the entire HIBP set by downloading and
testing against it various lists of breached passwords.
After all, HIBP is supposed to only contain passwords that were breached
or leaked in plaintext, so if Troy could compile this collection then
others could as well. However, for my test above I only used what was
crackable without using plaintext leaks beyond RockYou.)

Finally, I tried using crackstation-human-only.txt (64M unique) as the
training set, on its own or in addition to RockYou or HIBP v7 cracked
passwords. In all such tests, the results were far worse than RockYou
alone or HIBP v7 cracked alone, starting with only 3.5% cracked in one
billion candidates using crackstation-human-only.txt on its own.
Looking inside that file, I see surprisingly many mostly-uppercase
passwords. I doubt this list only contains human-chosen passwords,
despite its name, unless maybe many of those came from systems that
upper-cased passwords (doubtful, and makes them not very useful anyhow).

Alexander
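
For anyone wanting to re-check the arithmetic behind the figures in this message, here is a minimal Python sketch. It is not part of JtR and not the tooling used for the tests. The function names are hypothetical, the percentages are copied from the text above, and sum_counts() assumes the HIBP file uses HASH:COUNT lines with a colon separator (the message only says the count is the second field).

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


def overlap(wordlist, incremental, combined):
    """Share of the test set cracked by both attacks (inclusion-exclusion)."""
    return wordlist + incremental - combined


def sum_counts(lines, sep=":"):
    """Add up the integer second field of HASH<sep>COUNT lines.

    The colon separator is an assumption about the file layout.
    """
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        total += int(line.rsplit(sep, 1)[1])
    return total


# Percent of the 10M test set cracked, as reported in the message.
WORDLIST = 2.3
INCREMENTAL = [
    ("RockYou with duplicates", 5.4),
    ("RockYou unique", 6.0),
    ("HIBP v7 cracked", 6.6),
]
COMBINED = [7.3, 7.8, 8.5]  # same order, wordlist run included


def report():
    """Print overlap per training set and relative gain between steps."""
    prev_inc = prev_comb = None
    for (name, inc), comb in zip(INCREMENTAL, COMBINED):
        line = f"{name}: overlap with wordlist {overlap(WORDLIST, inc, comb):.1f}%"
        if prev_inc is not None:
            line += (
                f", gain {relative_gain(prev_inc, inc):.1f}% alone"
                f", {relative_gain(prev_comb, comb):.1f}% with wordlist"
            )
        print(line)
        prev_inc, prev_comb = inc, comb


if __name__ == "__main__":
    report()
```

On the reported figures this gives relative gains of about 11% and 10% for incremental mode alone, and about 7% and 9% once the RockYou wordlist run is included. To total the account counts, sum_counts() can be fed an open file object for the HIBP list directly, since it iterates line by line.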