john-users - Re: Replacement for all.chr based on "Rock You" Passwords. URL inside.

Message-ID: <20100205015020.GA14708@openwall.com>
Date: Fri, 5 Feb 2010 04:50:20 +0300
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Replacement for all.chr based on "Rock You" Passwords. URL inside.

Minga -

Thank you for posting this!  I was hoping someone would do it.

On Wed, Feb 03, 2010 at 05:21:29PM -0600, Minga Minga wrote:
> I dont exactly remember how/when all.chr was created, and I have no
> idea the last time it was updated, ...

It was last updated in December 2005, shortly before the JtR 1.7 release.
Most of the input data was much older, though - mid-1990s.

> Now, I have many opinions about the passwords from the RockYou list.
> They are NOT representative of "real" passwords by trained users in
> corporate environments. But they ARE representative of idiots on the
> Internet. And I guess thats a good enough place to start, as any, for
> the default behaviour of JtR. I propose the all.chr update because we
> cannot continue to use and propagate a .CHR file that is so outdated
> (assuming it is?).

The .chr files included with JtR are old (and are based on data that is
even older), but I am not convinced they're outdated.  Has there been
much of a change in users' choice of passwords in the last 10-15 years?
I think the average password became a little bit stronger (only a little
bit, unless a password policy is enforced), but I also think that the
relative frequencies of characters (as well as digraphs and trigraphs)
remained mostly the same.  Perhaps the change in average password
complexity will be reflected in the "cracking order" table in the
"header" of a .chr file, but do you really spot this change with the
RockYou passwords (which are likely biased towards weaker ones)?

If you look at password.lst prior to my last update (e.g., take the
revision from JtR 1.7.3.4), it matches RockYou's top 100 as published by
Matt reasonably closely, despite the almost 15-year difference.
It missed a total of 15 passwords from RockYou's top 100.  One of those
was "rockyou" - likely not that common overall.  4 were in fact very
common nowadays.  The remaining 10 were somewhat common (not found on
another recent top 250 list).  So I'd say that we had 85% to 96%
coverage of common passwords with a mostly 15-year-old list.  Yes, that
list was longer (slightly over 3,000 entries), but most of the passwords
that were on RockYou's top 100 were also closer to the beginning of
password.lst.  You may want to read my verbose commit messages for the
recent password.lst updates here:

http://cvsweb.openwall.com/cgi/cvsweb.cgi/Owl/packages/john/john/run/password.lst
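
For reference, the 85% to 96% range above is just arithmetic on the
counts given: the lower bound treats all 15 misses as real misses, and
the upper bound counts only the 4 that are very common nowadays.  A
minimal sketch (the function name is made up for illustration):

```python
def coverage_range(top_n, missed_total, missed_very_common):
    """Return (lower, upper) coverage of a top-N list as fractions.

    lower: every missed password counts against the old list.
    upper: only the misses that are very common today count against it.
    """
    lower = (top_n - missed_total) / top_n
    upper = (top_n - missed_very_common) / top_n
    return lower, upper


# Counts taken from the message: top 100, 15 missed in total, 4 of them
# very common nowadays.
lower, upper = coverage_range(100, 15, 4)
print(f"{lower:.0%} to {upper:.0%}")
```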

Thus, I think that the primary advantage of the RockYou list is not that
it is newer (although this is an advantage), but rather that it is larger
and more complete.  I mean that previously we had to work with hashes,
of which only a certain percentage - the weaker ones - were cracked.
This resulted in some bias towards weaker passwords.  Also, passwords
longer than 8 characters were almost non-existent, for several reasons:
the traditional crypt(3)'s limitation (and most of the hashes were of
this type), those passwords being stronger (so fewer of them were
cracked and could be used as input for .chr files), and those passwords
being less common (OK, I admit that there has been some change in the
percentage of longer passwords - from negligible to just small).

BTW, I wouldn't call someone using a weak password on a website an
"idiot".  This depends on the person's use of the account, as well as
age, experience with computers, and perception of risk.  This does not
necessarily suggest a low IQ, although maybe some correlation exists.

> Since the .chr created from the 'RockYou' list - can NOT be used
> to re-create the exact list of passwords, it is not a disclosure of
> personal information (up for debate). Therefore, I make the assumption
> it is safe for use.
> 
> So what KoreLogic did was, obtained the list, cleaned up the list,

Can you please describe the cleanups you made to the list?  Maybe post a
script that you used?

> obtained a unique list of passwords from the list (14,249,979 in total)

BTW, the input data for the .chr files included with JtR contained
multiple instances of common passwords.  The difficulty was in avoiding
duplicates that resulted from passwords set by a specific person or on a
specific system, yet including those that were genuine common passwords.
Producing the password.lst file involved a similar difficulty.  The
primary way to address this was to only include duplicates that were
found in unrelated input sets (IIRC, I required presence in 3+ input
sets for inclusion into the final password.lst), but to include the full
number of them if so (that is, also include duplicates from within the
same input set, with some exceptions).
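
A rough sketch of that rule, in Python.  This is not the script that
was actually used; the function name, the data layout, and the choice
to keep a single instance of passwords seen in fewer sets are all
assumptions made for illustration (for password.lst such passwords were
left out entirely), and the "some exceptions" are not modelled.

```python
from collections import Counter


def merge_input_sets(input_sets, min_sets=3):
    """Merge password lists, keeping duplicates only for passwords that
    appear in at least `min_sets` unrelated input sets.

    input_sets: list of lists of passwords, duplicates preserved.
    Returns a Counter mapping password -> number of instances to keep.
    """
    # In how many distinct input sets does each password occur?
    presence = Counter()
    for passwords in input_sets:
        presence.update(set(passwords))

    # How many times does each password occur overall, counting
    # duplicates within the same input set?
    totals = Counter()
    for passwords in input_sets:
        totals.update(passwords)

    merged = Counter()
    for pw, count in totals.items():
        if presence[pw] >= min_sets:
            # Genuinely common: keep the full number of instances.
            merged[pw] = count
        else:
            # Possibly one person or one system: keep a single instance
            # (assumption, see above).
            merged[pw] = 1
    return merged
```

With this, a password repeated many times within one input set only
contributes one instance, while one seen across three or more sets
contributes every occurrence.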

You may want to re-introduce some repeated passwords into your unique
list - those that are also found on other unrelated lists - or you could
just include all, maybe with some manual filtering of very common yet
"spurious" ones (e.g., "rockyou" is questionable in this case).

> and created a .CHR file based on this list. We are now publishing this
> new .chr file for everyone to use.

Thank you!  I was hoping someone would dare to do that.  I am going to
include your files under john/contrib/ in the Openwall FTP archive.

You will likely need to release some updates, though - considering my
input above and/or changes to JtR itself.  Speaking of the latter, I am
going to re-work the "incremental" mode, for the better of course.
I already have some test revisions of charset.c and inc.c files that
address one of the shortcomings of the current approach (namely, its
inability to increase the number of character indices for each character
position fully independently from the rest of the positions).  Even if I
maintain support for older .chr files for a while longer (with some
backwards compatibility code), it'd be beneficial to take advantage of
the new approach and implementation.

Also, you may want to generate multiple .chr files, with different
filters, like it is done for those included with JtR.  Arguably, JtR
itself could be enhanced to perform some filtering like this while
cracking, and to do so in an efficient manner (skipping large chunks of
would-be-filtered candidate passwords at once), but this is tricky to
implement if the goal is to achieve the same effect that is currently
achieved with separate files.  Specifically, it is not very difficult to
efficiently skip passwords not matching a certain reduced charset,
but it is more difficult to also use character, digraph, and trigraph
frequencies only based on passwords consisting _entirely_ of characters
from that reduced set.  The latter is only easy to achieve by
pre-filtering when the .chr file is generated.
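
To illustrate the pre-filtering idea only (this does not produce or
read the .chr format, and the function names are hypothetical), here is
a Python sketch that keeps just the passwords consisting entirely of
characters from a reduced set and then counts characters and digraphs
over that filtered input:

```python
import string
from collections import Counter


def filter_to_charset(passwords, charset):
    """Keep only non-empty passwords made up entirely of `charset`."""
    allowed = set(charset)
    return [pw for pw in passwords if pw and set(pw) <= allowed]


def char_and_digraph_counts(passwords):
    """Count single characters and adjacent character pairs."""
    chars = Counter()
    digraphs = Counter()
    for pw in passwords:
        chars.update(pw)
        digraphs.update(pw[i:i + 2] for i in range(len(pw) - 1))
    return chars, digraphs


passwords = ["abc", "abc1", "ab", "Pass"]
lower_only = filter_to_charset(passwords, string.ascii_lowercase)
chars, digraphs = char_and_digraph_counts(lower_only)
print(lower_only, chars["a"], digraphs["ab"])
```

Note that "abc1" is dropped before counting, so its letters do not
influence the frequencies for the lowercase-only set.  Skipping
non-matching candidates at cracking time would not have that effect.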

> In the next few months, KoreLogic will be posting a large amount of
> password-based research on our website. Mostly based around new
> techniques, new rules, and automation of large jobs to be run across
> multiple systems. KoreLogic will also be doing multiple presentations
> about Security Cons this year presenting our tools/rules/research
> in 2010 as well.

Sounds good.  On a related note, I am seriously considering dedicating
some of my time to implementing built-in support for parallel
processing.  Commercial demand for this could make a difference.

> Here is the CHR file, and the README associated with it including
> instructions for use, etc. If we don't want to replace all.chr -
> instructions are included for using rockyou.chr separately.
> 
> http://www.korelogic.com/tools.html#jtr

I am not replacing the included all.chr with this yet, but I am willing
to consider doing something like that a bit later.  I've mentioned some
of the reasons for waiting above.

Meanwhile, I'd be very interested in test runs of JtR with rockyou.chr
vs. all.chr against some recent but unrelated password hash files.  I'd
appreciate it if you and/or others ran such tests and posted the
results in here.
Also, it'd be interesting to see the effect of (not) including repeated
passwords in the input set for .chr file generation.

Thanks again,

Alexander