Message-ID: <20210502175045.GA2577@openwall.com>
Date: Sun, 2 May 2021 19:50:45 +0200
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: source of information for John's charset files

On Mon, Jan 25, 2021 at 07:54:32PM +0100, Solar Designer wrote:
> On Mon, Jan 25, 2021 at 07:35:48PM +0100, Johny Krekan wrote:
> > I understand right that the list rockyou which you used had duplicate 
> > words for example 2x word example in it. My question is what is the 
> > reason or advantage of using such wordlist with duplicates in comparison 
> > with wordlist with no duplicates? If I create one .pot file from the 
> > rockyou with no duplicates would it provide worse probability in finding 
> > the password during same time as yours?
> 
> A reasonable expectation is that inclusion of duplicates in the training
> set increases the number of cracked accounts rather than cracked unique
> passwords in subsequent password security audits.  Conversely, omitting
> the duplicates would possibly optimize for cracking more unique
> passwords but perhaps fewer accounts.  An alternative hypothesis is that
> inclusion of duplicates might also help crack more unique passwords that
> are based on frequent substrings even if those came from fully duplicate
> passwords, since otherwise those substrings would be under-represented
> in the training set.  You or/and others are welcome to research whether
> these hypotheses are true or not.  I no longer recall the results of my
> own testing from back when I made this choice (IIRC, in 2013).

I ran some new tests now, and I think we need to reconsider how we
generate the default .chr files.  I think that with the RockYou list
staying easily publicly available (which felt doubtful back then) a JtR
user would run through it as a wordlist in addition to (and likely
before) running incremental mode.  This would take care of duplicate
passwords that are seen on the RockYou list, leaving only the harder
passwords for incremental mode.  And for those, my testing shows that
generating a .chr file from a list of unique passwords is better.
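
For illustration only, here is a minimal Python sketch of the de-duplication
step. It is not the procedure used for the tests above. It assumes the
training list is a file with one password per line, and that the training
input is a fake .pot file whose lines have an empty hash field (":password"),
which would then be passed to john's --make-charset option. The function
names and file paths are hypothetical.

```python
def unique_passwords(lines):
    """Yield each distinct password once, in first-seen order.

    Works on bytes so that non-UTF-8 entries in a leaked list survive.
    """
    seen = set()
    for line in lines:
        pw = line.rstrip(b"\r\n")
        if pw and pw not in seen:
            seen.add(pw)
            yield pw


def write_fake_pot(src_path, pot_path):
    """Write ':password' lines (empty hash field) for each unique password.

    Assumption: this pot-like layout is what the charset generator reads.
    Returns the number of lines written.
    """
    count = 0
    with open(src_path, "rb") as src, open(pot_path, "wb") as out:
        for pw in unique_passwords(src):
            out.write(b":" + pw + b"\n")
            count += 1
    return count
```

Keeping the whole set in memory is a deliberate simplification; for a list
the size of RockYou (14M+ unique) that should still be manageable.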

I took the first 10M lines from
pwned-passwords-ntlm-ordered-by-hash-v7.txt as my test set.  Assuming
there's no correlation between password complexity and NTLM hash value,
this is a representative sample of the HIBP v7 set.

The HIBP v7 set is 613M+ unique passwords from 3.65 billion accounts (a
figure I calculated by adding up the counts included in the file's
second field).  This includes RockYou, which corresponds to 14M+ unique.
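
The account total can be reproduced by summing the second field of every
line. A rough sketch, assuming each line has the form HASH:COUNT with a
colon separator (the separator is an assumption; the function name is
hypothetical):

```python
def total_accounts(lines):
    """Sum the per-hash occurrence counts from 'HASH:COUNT' lines."""
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        _hash, _sep, count = line.partition(":")
        total += int(count)
    return total


if __name__ == "__main__":
    import sys

    # Usage: python sum_counts.py < pwned-passwords-ntlm-ordered-by-hash-v7.txt
    print(total_accounts(sys.stdin))
```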

Running the RockYou wordlist against my 10M test set cracks 2.3% of it.

Using ascii.chr or utf8.chr as included with JtR 1.9.0 (thus, generated
from RockYou with duplicates preserved) cracks 5.4% during the first one
billion candidates.  Together with the RockYou wordlist run, it's 7.3%.
(Not 2.3+5.4 = 7.7% because there's some overlap.)
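
The size of that overlap follows from inclusion-exclusion. A tiny sketch
using only the rounded percentages reported above (so the result is
approximate; the function name is hypothetical):

```python
def overlap(wordlist_pct, incremental_pct, combined_pct):
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return wordlist_pct + incremental_pct - combined_pct


# RockYou wordlist alone, JtR 1.9.0 .chr alone, and both together.
print(f"{overlap(2.3, 5.4, 7.3):.1f}")
```

That is, about 0.4 percentage points of the test set are cracked by both
attacks.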

Generating a new .chr file from RockYou unique passwords and using that
cracks 6.0%.  Together with the RockYou wordlist, it's 7.8%.

Then I tried using HIBP v7 passwords cracked with moderate effort as a
training set, first excluding from there the fbobh_* pattern (likely
noise coming from just one of the breaches) and passwords that are in my
10M test set.  451M passwords were left.  Generating a new .chr file
from that cracks 6.6%.  Together with the RockYou wordlist, it's 8.5%.
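
The filtering step might look roughly like the sketch below. It assumes
that "fbobh_*" means passwords starting with the literal prefix "fbobh_",
and that membership in the test set can be checked with a container that
supports "in" (how that lookup is done, by plaintext or by hash, is left
open here). The function name and parameters are hypothetical.

```python
def training_set(cracked, test_set, noise_prefix="fbobh_"):
    """Yield cracked passwords suitable for training.

    Drops entries matching the noise prefix and entries that also occur
    in the held-out test set, so the test set stays unseen.
    """
    for pw in cracked:
        if pw.startswith(noise_prefix):
            continue  # likely noise from a single breach
        if pw in test_set:
            continue  # keep training and test sets disjoint
        yield pw
```

For example, list(training_set(["fbobh_123", "hello", "secret"],
{"secret"})) gives ["hello"].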

So in these tests we get roughly a 10% relative improvement by using
unique'd RockYou and roughly another 10% by expanding the training set
to HIBP v7.  When also running the RockYou wordlist, these improvements
are somewhat smaller, but similar.  They also similarly persist on top
of other attacks I tried.
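
As a quick check of the arithmetic, the relative gains can be computed
from the rounded percentages quoted in this message (a sketch only; the
function name is hypothetical):

```python
def relative_gain(before, after):
    """Relative improvement of 'after' over 'before', in percent."""
    return 100.0 * (after - before) / before


steps = [
    ("incremental only: 1.9.0 .chr -> unique RockYou", 5.4, 6.0),
    ("incremental only: unique RockYou -> HIBP v7", 6.0, 6.6),
    ("with wordlist: 1.9.0 .chr -> unique RockYou", 7.3, 7.8),
    ("with wordlist: unique RockYou -> HIBP v7", 7.8, 8.5),
]
for label, before, after in steps:
    print(f"{label}: {relative_gain(before, after):.1f}%")
```

This gives about 11% and 10% for incremental mode alone, and about 7%
and 9% when the RockYou wordlist run is included.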

(I had heard folks cracked almost the entire HIBP set by downloading and
testing against it various lists of breached passwords.  After all, HIBP
is supposed to only contain passwords that were breached or leaked in
plaintext, so if Troy could compile this collection then others could as
well.  However, for my test above I only used what was crackable without
usage of plaintext leaks beyond RockYou.)

Finally, I tried using crackstation-human-only.txt (64M unique) as the
training set, on its own or in addition to RockYou or HIBP v7 cracked
passwords.  In all such tests, the results were far worse than RockYou
alone or HIBP v7 cracked alone, starting with only 3.5% cracked in one
billion candidates using crackstation-human-only.txt on its own.
Looking inside that file, I see surprisingly many mostly-uppercase
passwords.  I doubt this list only contains human-chosen passwords,
despite its name, unless maybe many of those came from systems that
upper-cased passwords (doubtful, and makes them not very useful anyhow).
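
One way to put a number on "mostly-uppercase" is sketched below. The
definition used here (more than half of the letters are uppercase) and
the function names are assumptions made for illustration, not the
criterion behind the observation above.

```python
def mostly_upper(pw, threshold=0.5):
    """True if more than `threshold` of the letters in pw are uppercase."""
    letters = [c for c in pw if c.isalpha()]
    if not letters:
        return False  # digits/symbols only: not counted as uppercase
    upper = sum(1 for c in letters if c.isupper())
    return upper / len(letters) > threshold


def mostly_upper_fraction(passwords):
    """Fraction of passwords that are mostly uppercase (0.0 if empty)."""
    total = hits = 0
    for pw in passwords:
        total += 1
        hits += mostly_upper(pw)
    return hits / total if total else 0.0
```

Comparing this fraction between crackstation-human-only.txt and RockYou
would show whether the uppercase share really is unusual.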

Alexander
