john-users - Re: Crowd-sourcing statistics and rules

Message-ID: <BLU0-SMTP464824265B10CBE8632A8D6FD3F0@phx.gbl>
Date: Tue, 17 Apr 2012 19:48:06 +0200
From: Frank Dittrich <frank_dittrich@...mail.com>
To: john-users@...ts.openwall.com
Subject: Re: Crowd-sourcing statistics and rules

On 04/16/2012 03:36 AM, Rich Rumble wrote:
> With all this talk of pattern matching/finding, could it also be time
> to look at updating JtR's rules and giving anonymous feed back on
> rules? 

Yes, I think it may be time to evaluate if john's default rule sets for
word list or single mode need to be adjusted.

> With the clients I audit, I don't see much variation...
> password incrementing and using the company name or products in the
> passwords are very evident. 

This is to be expected if users are forced to change their passwords
regularly.

On some web sites, users don't have to change their passwords
frequently and can also pick their own user name, which usually is not
the case in a business environment. That's why users of such sites are
much more likely to pick a password that is somehow based on their user
name.

> JtR has a lot of information about each
> cracking session in the log file that could be useful,

While useful, this information can be somewhat misleading.
The reason is that john buffers candidate passwords for performance
reasons.
The buffer size depends on hash algorithm and compile options.

That's why the last rule mentioned prior to a "Cracked ..." line does
not necessarily indicate which rule really cracked that password.

For larger word lists, the log file will in most cases report the
correct rule.
For very small word lists and for single mode, the reported rule will
more frequently be wrong.
With the upcoming GPU support (and further increased buffer sizes), it
will also be more likely that the last rule reported in the log file is
not the rule which really cracked the password.

That's why just generating the statistics directly from the log file
will not provide results that are 100 percent correct.
To get correct results, some more effort is required.
But maybe 100 percent correctness is not needed.
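
A minimal sketch of such a log-based count, in Python. It assumes log
lines shaped like "Rule #N: '...' accepted as '...'" and "+ Cracked
<login>"; both patterns are assumptions and may need adjusting for your
build. It attributes each crack to the last rule logged before it, so
it inherits exactly the buffering inaccuracy described above.

```python
import re
from collections import Counter

# Assumed log line shapes (not verified against every john version):
#   "<time> - Rule #15: '-c )?a r l' accepted as ')?arl'"
#   "<time> + Cracked <login>"
RULE_RE = re.compile(r"- Rule #(\d+): '(.*)' accepted")
CRACKED_RE = re.compile(r"\+ Cracked ")


def cracks_per_rule(lines):
    """Attribute each 'Cracked' line to the most recent 'Rule' line."""
    counts = Counter()
    current = None  # no rule seen yet
    for line in lines:
        m = RULE_RE.search(line)
        if m:
            current = (int(m.group(1)), m.group(2))
            counts[current] += 0  # keep rules with zero cracks visible
        elif CRACKED_RE.search(line) and current is not None:
            counts[current] += 1
    return counts
```

Feeding it the lines of a log file, e.g. cracks_per_rule(open(path)),
gives a per-rule tally that is approximate at best for small word lists
and single mode.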

> John/Jumbo could be patched, but I bet a script could be used just as
> well to cut down the minutia, and create more succinct details:
> 0:00:01:19 - Rule #15: '-c )?a r l' accepted as ')?arl'       (cracked 353)
> 0:00:01:30 - Rule #16: '-: <* !?A l p' accepted as '<*!?Alp'      (cracked 8)
> 0:00:01:39 - Rule #17: '-c <* !?A c p' accepted as '<*!?Acp'      (cracked 31)
> 0:00:01:47 - Rule #18: '-c <* c Q d' accepted as '<*cQd'      (cracked 99)
> 0:00:01:56 - Rule #19: '-c >7 '7 /?u' accepted as '>7'7/?u'      (cracked 0)
> 0:00:01:56 - Rule #20: '>4 '4 l' accepted as '>4'4l'      (cracked 0)
> 0:00:02:06 - Rule #21: '-c <+ (?l c r' accepted as '<+(?lcr'      (cracked 9)
> 0:00:02:15 - Rule #22: '-c <+ )?l l Tm' accepted as '<+)?llTm'      (cracked 17)
> ....
> 0:09:30:07 - Trying length 7, fixed @1, character count 31 (cracked 446)
> 0:09:37:26 - Trying length 6, fixed @6, character count 47 (cracked 248)

As I explained above, a script might provide somewhat incorrect results,
but this could still be better than nothing.
When you interrupt the cracking session and restart it later, there is a
risk of introducing even more errors, e.g., because the word list file
has been changed, because the input files with passwords have been
changed (so that you have either more or fewer hashes to crack), or
because the .rec file has been changed.
In fact, even if you don't change any of those files manually, the
results may be useless, because you had other sessions running in
parallel which cracked many of the passwords before you restarted your
session.

If you know what you are doing, you can of course avoid most of these
problems. But just collecting a lot of log files from a large number of
users and generating statistics on them might not work as one would hope.

> I think finding out what rules are working for more me personally
> could save some time for others as well, it could be interesting to
> see if I re-run John on those same passes minus the most "successful"
> rules and compare... 

To do a fair comparison, you'd of course have to start each test with an
empty pot file.
If you want to avoid repeating the session for slow hashes, you could
generate a new input file for --format=dummy and use this one for tests.
It should be much faster.
Currently, the dummy format is saltless, which means that for frequently
used passwords, you still get just one line in john.pot.
But you could just put multiple lines with the same hash, but different
user names, into your input file, and then use ./john --show instead of
counting the lines in john.pot.
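
A minimal sketch of generating such an input file, in Python. It
assumes that a dummy-format hash is the string "$dummy$" followed by the
hex encoding of the plaintext; please check this against your build.
The function names are hypothetical.

```python
def dummy_hash(password):
    """Encode a password as '$dummy$' + hex of its UTF-8 bytes.

    Assumption: this matches what --format=dummy expects.
    """
    return "$dummy$" + password.encode("utf-8").hex()


def dummy_input_lines(accounts):
    """accounts: iterable of (login, password) pairs cracked earlier.

    Returns one 'login:hash' line per account, so a password shared by
    several users yields several lines with the same hash.
    """
    return [f"{login}:{dummy_hash(password)}" for login, password in accounts]
```

Writing these lines to a file and running john with --format=dummy on
it keeps one line per user, so --show reports every account even when
passwords repeat.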

> perhaps for me using 0000-9999 get's me far more
> passes than the rule that does 19xx and 20xx date/years and I don't
> want the overlap. 

IMO, you have to consider the ratio of cracked passwords / password
candidates, or even cracked / ( candidates * salts ).
For fast saltless hashes it might be OK to just try all numbers from
0000-9999.
For salted hashes, you should consider that 19xx and 20xx will probably
be more likely than most other numbers (maybe except 1234, 1337, 2345,
2468, 1111, and a few others).
So, trying the years first will hopefully reduce the number of remaining
salts, and thus reduce the time you need for the other 4 digit numbers.
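
To illustrate the effect, here is a small Python sketch of the cost
model. It is deliberately simplified: it assumes every hash has a unique
salt, so each crack removes one salt, and that salts are only removed at
the end of a stage. All numbers in the example are made up.

```python
def total_hash_computations(salts, stages):
    """Count hash computations for a staged attack on salted hashes.

    stages: list of (candidates, cracked) tuples in the order tried.
    Simplified model: each stage costs candidates * remaining salts,
    and every crack removes exactly one salt for the later stages.
    """
    total = 0
    for candidates, cracked in stages:
        total += candidates * salts
        salts -= cracked
    return total


# Hypothetical numbers: 1000 salts; the 200 candidates 1900-2099 crack
# 60 hashes, the remaining 9800 four-digit numbers crack 40.
years_first = total_hash_computations(1000, [(200, 60), (9800, 40)])
all_at_once = total_hash_computations(1000, [(10000, 100)])
```

With these made-up numbers, trying the years first needs 9,412,000 hash
computations instead of 10,000,000. The slower the hash and the more
lopsided the distribution, the more this ordering matters.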

> There are a lot of variables here,
> some of the things I stated are moot if the wordlist is small or the
> hash is very very fast,

Even then you have to adjust your strategy.
When I started experimenting with the ca. 140 million raw-md5 hashes
published by KoreLogic, I realized that loading 10 million passwords and
checking which of these passwords have already been cracked can take
much more time than just trying a few rules on a word list.

> but I'd be curious to grab more stats not only
> from the passwords themselves, but the whole session holds information
> we might all benefit from, even if it's not going to get you 5x more
> passwords, maybe it gets you the 10-20 really hard ones you've been
> going after.

Those 10-20 really hard passwords (or similar passwords using the same
pattern) might not exist in your next set of hashes.
That's why IMHO it is more reasonable to hope to get the more likely
passwords cracked faster.
This will reduce the number of remaining salts, resulting in a larger
number of passwords that can be tried with --incremental or --markov mode.
Then, you can try to detect patterns in the passwords you didn't crack
using rules, and adjust your strategy.

> In closing my very long winded email: “Statistics are like a bikini.
> What they reveal is suggestive, but what they conceal is vital.”

Thanks for this Aaron Levenstein quote. I didn't know it.

Frank