john-users - deviating from incremental

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <20231125231630.GA10961@openwall.com>
Date: Sun, 26 Nov 2023 00:16:30 +0100
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: deviating from incremental

Hi,

JtR's incremental mode generates strings consisting of likely
characters, where each character is determined by two preceding ones (so
we're working on trigrams), its position, and total string length.  This
is similar to Markov chains.

While each trigram individually likely follows the training set nearly
optimally, the same isn't necessarily true for the generated candidate
password as a whole (nor is this desirable, since we want training and
generating new candidates, rather than memorizing and recalling of what
we already had in a wordlist).

Hypothesis: longer passwords tend to contain a few (or just one)
character(s) that would be outlier(s) within a trigram.  In other words,
what's statistically unlikely among 3 characters becomes statistically
likely to appear at least once (or perhaps exactly once) among, say, 8
characters.  (A related hypothesis, not to be tested here, is that the
same holds true for words in a passphrase.)

I use a representative 1 million non-unique NT hash sample from HIBP v8
generated using my pwned-passwords-sampler (which virtually repeats each
hash according to its count, then takes a random sample).  There are
826973 unique hashes in the sample.

First, test 1 billion candidates using incremental mode's default
charset (currently based on RockYou):

./john -form=nt -inc -len=8 -max-cand=1000000000 -verb=1 sample

This cracks 49141 unique hashes (john.pot lines), or 72660 non-unique
passwords (last line of "john --show").

Then save john.pot, and let it run for another 1 billion candidates:

./john -rest

This increases the total cracks to 55202 unique, 80373 non-unique.

Now revert john.pot to what it was after the first run (at 1 billion)
and try replacing the second run (second 1 billion candidates) with
possible alternatives.

One such alternative is overstriking each character in the
incremental-produced stream with each possible other printable ASCII
character.  With the 1 billion limit on candidates actually tested
against the hashes, this means that the input incremental-produced
stream is cut short at a much lower count.  In this test, we also start
that stream from the beginning rather than from the 1 billion mark.

./john -se=inc -inc -len=8 -stdo | ./john -form=nt -pipe -ru=o1 -max-cand=1000000000 -verb=1 sample

The total cracks are now 60152 unique, 88598 non-unique.

That's already an improvement.  Maybe going for the full printable ASCII
is taking the idea unoptimally too far, let's revert john.pot again and
instead try [a-z0-9]:

[List.Rules:o1an]
->\r[1-9A-ZZ] >\p[0-9A-Z] o\0[a-z0-9] Q

./john -se=inc -inc -len=8 -stdo | ./john -form=nt -pipe -ru=o1an -max-cand=1000000000 -verb=1 sample

The total cracks are now 65240 unique, 93976 non-unique.

A further improvement.

Notably, there's relatively little overlap in candidates tested and
passwords cracked by the 2nd billion of incremental vs. these
alternatives.  Totals for 2 billion incremental (where we had 55202
unique) and 1 billion of o1 are 65764 unique, 95123 non-unique.  For 2
billion incremental and 1 billion of o1an they are 69574 unique, 99003
non-unique.  For all of these (4 billion candidates total, including
significant overlap between o1 and o1an), they are 71330 unique, 101144
non-unique.  For comparison, 3 billion of incremental gives 62751
unique, 89318 non-unique.  4 billion of incremental gives 66833 unique,
94363 non-unique.

So we seem to have an improvement.

Of course, the pipe approach is way too slow for NT hashes, but with a
much slower target this could make sense.  Also, we may want to think of
whether/how to implement such functionality within one invocation.
Right now, we can use a mask on top of incremental mode, but this is
only good enough to prepend or append a mask character (or several), not
to replace or insert characters inside a previous mode's candidate.

I did not run many other experiments nor tune the numbers (e.g., the 1
billion figure) to achieve this effect.  I did, however, try one other
thing - "-inc -len=7" with "-ru=i1" - so inserting into 7-character
incremental output instead of replacing in 8-character output.  With
full ASCII (I didn't try any other), this performed moderately worse
than the 2nd billion of incremental.  So the improvement might not be
consistent.  Yet it looks like the hypothesis at least has merit.

Alexander
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.