Message-ID: <20210503141815.GA6683@openwall.com>
Date: Mon, 3 May 2021 16:18:15 +0200
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: source of information for John's charset files

On Sun, May 02, 2021 at 11:00:34PM -0400, Matt Weir wrote:
> I apologize in advance if I misunderstood your testing procedure or your
> results, but using the HIBP list as a test set is really problematic when
> applying that to normal password cracking sessions.
> 
> Duplicates matter and our techniques should reflect that. Making guesses of
> '123456' and 'password' before 'ajger' should be rewarded, but using the
> HIBP list all three guesses are awarded the same value. I could see
> excluding the top 10k password guesses from an incremental training set,
> (since '123456' and 'password' will be almost certainly cracked by a
> dictionary attack), to optimize how incremental plays with brute-force, but
> even that approach while it seems like it makes sense, has backfired on me
> every time I have tried it, resulting in worse results when applying it to
> new datasets.

FWIW, the current HIBP hash lists do include the counts, so we can
repeat each password the specified number of times and use that in our
training and/or test sets, if we want to.

My expectation is that training on unique passwords only will in fact
reduce the number of cracked accounts when only incremental mode is
used.  However, after having run through RockYou as a wordlist, it's not
obvious whether it's beneficial for incremental mode to also favor the
repeated passwords like 123456.

What you suggest about excluding e.g. just the top 10k makes sense.
Another approach I thought of, but I don't recall trying, is to apply a
logarithmic scale to the counts.  For example, for passwords appearing
1000+ times include them 4 times, for 100+ include them 3 times, etc.
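
To make the logarithmic idea concrete, here is a minimal Python sketch.
It assumes the input is already available as (password, count) pairs.
The names log_scaled_copies and expand are hypothetical, not part of
John the Ripper.  It also assumes the scale keeps growing by one copy
per decimal order of magnitude beyond 1000 (so 10000+ gives 5 copies),
which is only one possible reading of the "etc." above.

```python
from typing import Callable, Iterable, Iterator, Tuple


def log_scaled_copies(count: int) -> int:
    """Copies to emit for a password seen `count` times.

    1-9 -> 1, 10-99 -> 2, 100-999 -> 3, 1000-9999 -> 4, and so on
    (assumption: one extra copy per decimal order of magnitude).
    """
    if count < 1:
        return 0
    # The number of decimal digits equals floor(log10(count)) + 1,
    # computed here without floating-point rounding issues.
    return len(str(count))


def expand(
    pairs: Iterable[Tuple[str, int]],
    scale: Callable[[int], int] = log_scaled_copies,
) -> Iterator[str]:
    """Yield each password scale(count) times, for use as a training set."""
    for password, count in pairs:
        for _ in range(scale(count)):
            yield password


if __name__ == "__main__":
    # Toy counts for illustration only, not real HIBP data.
    sample = [("123456", 1500), ("password", 120), ("ajger", 1)]
    for line in expand(sample):
        print(line)
```

Passing scale=lambda c: c instead would repeat each password its full
number of times, and scale=lambda c: 1 would give the unique-only set.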

> On a different point, I am totally ok with updating the training set from
> RockYou. I could go on and on about the weirdness of that dataset, not to
> mention that it really is showing its age. The gold standard right now of
> public datasets would probably be the LinkedIn list, which also is showing
> its age, but is a bit more comparable to current web passwords.

An advantage of RockYou is that it's easily available to everyone.

Another advantage is that it's plaintexts, so not biased to what was
crackable, or to what a person having downloaded LinkedIn hashes would
crack if they want to (re)generate .chr files from that.

> Side note, I just saw your most recent results of training/running against
> RockYou. I'm willing to admit I'm wrong if you are getting better results
> training without dupes. That's just contrary to what I've seen in the past.
> I might need to run some tests of my own to look into this.

Note: the better results are for when the test set is also without
dupes.  However, I think that's what matters in real-world usage of our
tools, where most dupes are eliminated using a wordlist anyway.

Even newer results below:

> On Sun, May 2, 2021 at 5:39 PM Solar Designer <solar@...nwall.com> wrote:
> > On Sun, May 02, 2021 at 11:21:34PM +0200, Solar Designer wrote:
> > > Anyway, I just ran some tests the other way around - "cracking" RockYou
> > > passwords.  I didn't try excluding RockYou itself from the training sets
> > > here - can't do that while including our current .chr files in the
> > > comparison.  So this is in-sample testing, which is generally a wrong
> > > thing to do, but with that in mind here are the results for different
> > > training sets (all are for incremental mode and 1 billion candidates):
> > >
> > > RockYou with dupes - 20.2%
> > > RockYou unique - 21.9%
> > > HIBPv7 cracked - 17.9%
> > >
> > > The percentages cracked are those of RockYou unique.
> > >
> > > Not surprisingly, RockYou is best fit for itself.  HIBP is an acceptable
> > > fit as well.  It could have potentially performed better than RockYou
> > > on this test due to its larger size, but as we can see that was not
> > > enough to overcome it not being such a perfect fit as RockYou itself.
> >
> > FWIW, RockYou unique being best fit for itself persists after I shuffled
> > it and split it into a 1M test set and 13.3M training set (no matching
> > passwords in the sets, but both sets are parts of RockYou).  Got 21.5%.

I decided to test these not only at 1 billion candidates, but also at
other points.  I use three training sets: RockYou with dupes (same as
was used to generate our currently bundled .chr files - in fact, I just
reuse ascii.chr from there), RockYou unique shuffled and 1M test set
removed from it (so 13.3M training set), and HIBP v7 458M cracked (after
removal of the fbobh_* pattern).  The test set is always the mentioned
1M from RockYou unique shuffled.  Here are the percentages cracked at
10M, 100M, 1G, 10G, 100G candidates:

RockYou with dupes - 4.6%, 10.2%, 20.2%, 33.3%, 48.0%
RockYou -1M unique - 4.7%, 11.2%, 21.5%, 35.0%, 48.3%
HIBP v7 cracked    - 3.2%,  8.7%, 17.8%, 30.0%, 44.5%

So despite "RockYou -1M unique" being the only 100% out-of-sample test
(no password appears in both the training and the test set) and also
having the smallest training set (at 13.3M), it outperforms the other
two tests across this whole range.
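
For anyone wanting to reproduce a split like this, a minimal Python
sketch follows.  It is not the exact procedure used for the numbers
above.  The function name split_unique and the fixed seed are
assumptions for illustration.  It deduplicates first, so no password
can end up in both sets.

```python
import random
from typing import Iterable, List, Tuple


def split_unique(
    passwords: Iterable[str], test_size: int, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Deduplicate, shuffle, and split into (test, training) lists."""
    # dict.fromkeys() drops duplicates while keeping first-seen order,
    # so the result is reproducible for a given input and seed.
    unique = list(dict.fromkeys(passwords))
    if test_size > len(unique):
        raise ValueError("test_size exceeds the number of unique passwords")
    random.Random(seed).shuffle(unique)
    return unique[:test_size], unique[test_size:]


if __name__ == "__main__":
    # Toy input for illustration only.
    test, train = split_unique(["123456", "password", "123456", "ajger"], 1)
    print(len(test), len(train))
```

With RockYou as input and test_size=1000000, this would correspond to
the 1M test set and the remaining unique passwords as the training set.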

Of course, HIBP performing worse doesn't necessarily mean it's a worse
choice in general - just that it's a worse fit for RockYou.  We've also
seen that when using a portion of HIBP as the test set, things are the
other way around - training on the rest of HIBP produces better results
than training on RockYou does.  BTW, each of these being the best fit
for itself (even without overlap in actual passwords between test and
training sets) could be due not only (or not so much) to password
patterns, but also to the distribution of password lengths (as
incremental mode switches lengths back and forth based on what it was
trained on).

Also curious is how many different passwords the different training sets
crack.  At the 100G mark, the three runs above combined cracked a total
of 52.5% of the test set.  Taking them in pairs:

RockYou dupes + HIBP   - 51.4%
RockYou unique + HIBP  - 51.0%
RockYou dupes + unique - 50.3%

Combining one 10G run with one 100G run yields 47.0% (RockYou unique 10G
with HIBP 100G) to 48.8% (HIBP 10G with RockYou unique 100G).
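
The combined figures above are unions of the sets of passwords cracked
by the individual runs.  A minimal Python sketch of that computation
follows.  It assumes each run's cracked plaintexts have been loaded into
a set.  The function name percent_cracked is hypothetical, and the
example data is a toy, not taken from the runs above.

```python
from typing import Iterable, Set


def percent_cracked(test_set: Iterable[str], *cracked_runs: Set[str]) -> float:
    """Percentage of test_set cracked by at least one of the given runs."""
    test = set(test_set)
    if not test:
        raise ValueError("test set is empty")
    # Union of everything any run cracked, restricted to the test set.
    combined = set().union(*cracked_runs) & test
    return 100.0 * len(combined) / len(test)


if __name__ == "__main__":
    # Toy data for illustration only.
    test = {"a", "b", "c", "d"}
    run1 = {"a", "b"}
    run2 = {"b", "c"}
    print(percent_cracked(test, run1))
    print(percent_cracked(test, run1, run2))
```

The gap between a run's own percentage and the union with another run
shows how many passwords only the other training set reached.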

Alexander
