Message-ID: <20241027050828.GA8898@openwall.com>
Date: Sun, 27 Oct 2024 06:08:28 +0100
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Markov phrases in john

On Wed, May 15, 2024 at 10:34:00PM +0200, Solar Designer wrote:
> * What about middle ground (e.g. syllables, including some followed by space)?
>   - e.g. extract all substrings of 2+ characters, sort from most to least
>     common, take top ~100, map them onto indices along with single characters,
>     train/use existing probabilistic candidate password generators, map back
> ---
> 
> I would really like to give this "middle ground" tokenization idea of
> mine a try.  If I had more time to play with this, I'd have already
> tried it months ago.

I gave this a try now; see tokenize.pl in bleeding-jumbo.

> As I mentioned in the talk (not on the slides), the "map back" step can
> be implemented as an external mode the code for which (with embedded
> array initialization) would be generated by the script performing the
> forward mapping prior to training.  Thus, no external tools would be
> needed during cracking - only John the Ripper and its usual files (e.g.
> some .chr and .conf).  If we support all printable ASCII (but not
> beyond) as single characters, we can also have ~128 arbitrary tokens,

Actually, 158.  This is 256 excluding 95 printable ASCII and 3 control
characters: NUL, LF, CR.
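
As a quick sanity check of that count - just the arithmetic spelled out
in Perl, nothing taken from tokenize.pl:

my @spare = grep {
    ($_ < 0x20 || $_ > 0x7e) && $_ != 0x00 && $_ != 0x0a && $_ != 0x0d
} 0 .. 255;
print scalar(@spare), "\n";  # prints 158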

> which I guess would happen to be common syllables and maybe sometimes
> syllables followed or preceded by embedded space character (or whatever
> word separator may have been common in training data).  Even if we
> disallow the separators to be embedded in tokens, then the pre-existing
> per-character (now per-token) generator will commonly introduce
> separators before/after certain tokens if that's how the training data
> commonly had it.

For this first test, I didn't focus on phrases yet, so there are no
deliberate separator characters.

For training, I used the full RockYou list (duplicates included), to
match how the .chr files currently included with John the Ripper were
generated.  Specifically, I'm comparing against ascii.chr and utf8.chr
(as well as against a freshly regenerated file of the same kind, to
ensure no unrelated change crept in):

-rw-------. 1 solar solar  5720262 Nov 12  2019 ascii.chr
-rw-------. 1 solar solar  9286825 Nov 12  2019 utf8.chr

-rw-------. 1 solar solar  9288427 Oct 27 03:56 custom-ref.chr

I chose the token length range 2 to 4, and most of the top 158 tokens
ended up being of length 2.  See the output of tokenize.pl attached.
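
The core of that approach can be sketched in a few lines of Perl.  This
is only a simplified illustration of the description above, not the
actual tokenize.pl: the input path is a placeholder, byte codes are
assigned sequentially from 0x80 for brevity, and the \xNN escapes in the
generated sed script assume GNU sed.

#!/usr/bin/perl
use strict;
use warnings;

# Count all substrings of length 2 to 4 in the training wordlist.
my %count;
open(my $in, '<', 'rockyou.txt') or die "open: $!";
while (my $line = <$in>) {
    chomp $line;
    my $len = length $line;
    for my $n (2 .. 4) {
        $count{substr($line, $_, $n)}++ for 0 .. $len - $n;
    }
}
close $in;

# Rank candidate tokens by weight = occurrences * length, keep top 158.
my @top = (sort {
    $count{$b} * length($b) <=> $count{$a} * length($a)
} keys %count)[0 .. 157];

# Emit sed substitutions mapping each token onto one spare byte value
# (quotemeta is only an approximation of sed's quoting rules).
my $code = 0x80;
for my $tok (@top) {
    printf "s/%s/\\x%02x/g\n", quotemeta($tok), $code++;
}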

With RockYou pre-processed by the sed one-liner generated by tokenize.pl
from the same RockYou list, the resulting .chr file is:

-rw-------. 1 solar solar 26360011 Oct 27 04:20 custom.chr

That's a lot larger.  Is it any better?  Let's see:

I took my 10M representative sample from HIBPv8 generated by our
pwned-passwords-sampler.  That's 6969140 unique NTLM hashes.  After
running the RockYou wordlist on it (so that further tests would be
out-of-sample), 1099399 are cracked and 5869741 remain.

Running any of the 3 reference .chr files above for 1 billion candidate
passwords increases the total unique cracks to either 1669440 or
1669463, which is up to +570064 over the pure wordlist run.

Alternatively, running the new tokenized .chr file along with
--external=Untokenize for 1 billion candidate passwords increases the
total unique cracks to 1816315, or +716916 over the wordlist run.

That's quite an improvement.  Unfortunately, when testing on fast hashes
like this, the real time needed to achieve this result is much higher
(as we're generating candidates via an external mode on top of a larger
.chr file), but candidate generation speeds are still good enough (in
the millions per second) for this approach to pay off against slow
(non-)hashes.

Curiously, combining the reference run's 1669463 cracks with the
tokenized run's 1816315 gives 2043045 unique (+226730 over the tokenized
run alone).  So there's not only an increase in cracks, but the sets of
passwords cracked are also largely different.

For comparison, continuing a reference run to 2 billion gives 1820805,
so it is on par with the tokenized run at 1 billion.  The average
password length among the 721406 cracked on top of the wordlist is 6.70
characters.

Continuing the tokenized run to 2 billion gives 1952392.  Combined with
the reference 2 billion run's, it's 2229062 unique.  Average password
length among the first 721406 cracked by the tokenized run on top of
wordlist is 7.04 characters, so a slight increase over the non-tokenized
run's length.

I think this proof-of-concept is very successful.

There's a lot of room for improvement.

Some passwords may be represented in more than one way via the tokens,
which means that even though incremental mode outputs unique token
strings, we're getting some duplicate candidate passwords after the
external mode.  For example, in the first 1 million candidates generated
in my tests like the above, there are 994373 unique.  In the first 10
million, 9885308 unique.  In the first 100 million, 97567218 unique.
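
To make the cause concrete, here's a tiny illustration using the "love",
"lov" and "ove" tokens discussed further below (the byte values are
hypothetical): several distinct token strings out of incremental mode
decode to one and the same candidate password.

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical byte assignments; the real ones come from tokenize.pl.
my %token = (0x80 => 'love', 0x81 => 'lov', 0x82 => 'ove');

# The "map back" step: expand token bytes, pass other characters through.
sub untokenize {
    my ($s) = @_;
    return join '', map { $token{ord $_} // $_ } split //, $s;
}

# Four distinct token strings, one and the same candidate password:
print untokenize($_), "\n" for "\x80", "\x81e", "l\x82", "love";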

The above success was achieved despite the duplicates being included in
the 1 and 2 billion candidate counts.  Results would be even better with
them excluded, as more unique passwords would fit within those counts.
Maybe we need to enable our dupe suppressor for such runs.  Maybe we
need to adjust the approach to tokenization to reduce or avoid
duplicates - e.g., use fixed-length non-overlapping tokens - but this
would probably hurt in other ways.  Perhaps the duplicates would become
more of a problem for longer runs (beyond a few billion candidates).

Right now, the 158 tokens are pre-chosen by their weight (number of
occurrences times token length).  However, since tokens can overlap, the
numbers of occurrences may turn out lower once substitutions are
actually made.  For example, one token ends up being "love" and two
others "lov" and "ove".  The numbers of occurrences of the latter two
are not much higher than those of "love", which means that after we
substitute "love" those other two tokens would fall below the threshold,
so we'd rather have picked some other tokens instead.  An iterative
algorithm could be added to avoid this waste of token indices.
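
One possible shape for such an iterative algorithm, shown as a rough
Perl sketch on a toy corpus (an idea only, not something tokenize.pl
currently does): actually make the substitution after each pick and
re-count before picking the next token, so that overlapping leftovers no
longer inflate the ranking.

#!/usr/bin/perl
use strict;
use warnings;

# Toy stand-in for the training wordlist.
my @corpus = qw(lover lovely iloveyou glove password);

# Count substrings of length 2 to 4, skipping anything that spans an
# already-substituted token (marked with a NUL placeholder).
sub counts {
    my %count;
    for my $word (@_) {
        my $len = length $word;
        for my $n (2 .. 4) {
            for my $i (0 .. $len - $n) {
                my $sub = substr($word, $i, $n);
                $count{$sub}++ unless $sub =~ /\x00/;
            }
        }
    }
    return %count;
}

my @tokens;
for (1 .. 158) {
    my %count = counts(@corpus);
    last unless %count;
    # Pick the highest-weight substring (occurrences * length).
    my ($best) = sort {
        $count{$b} * length($b) <=> $count{$a} * length($a)
    } keys %count;
    push @tokens, $best;
    # Substitute it before re-counting, so that e.g. "lov" and "ove"
    # stop competing once "love" has been picked.
    s/\Q$best\E/\x00/g for @corpus;
}
print "$_\n" for @tokens;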

Testing on a passphrase list is needed.  I did try the script on a
Project Gutenberg phrase list, with token length range 2 to 7, and
looked at the results (I didn't try a full cracking simulation yet, and
besides, lengths beyond 4 are not supported by the currently generated
external mode program).  This roughly matches my description in the
quoted message above ("syllables followed or preceded by embedded space
character"), but the issue with wasteful overlapping tokens is extremely
apparent (even more so than in my common passwords test above).

Another issue is that our command-line length control options don't
work "right" when an external mode alters incremental mode's length.  My
testing above was without password length restrictions, but it would be
handy to be able to use those options when desired.  Right now,
specifying e.g. "-inc=custom -ext=untokenize -length=8" appears to
result in length 8 (in tokens) being generated by incremental mode,
possibly longer passwords (in characters) coming out of the external
mode, and those then getting truncated at 8 (a sub-optimal order and
probably more dupes).
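
To illustrate the interaction (with hypothetical token values): a string
that incremental mode considers to be of length 8 can decode to a much
longer password, and truncating only afterwards maps distinct token
strings onto the same 8-character candidate.

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical tokens, each expanding to 4 characters.
my %token = (0x80 => 'love', 0x81 => 'er12');

sub untokenize {
    my ($s) = @_;
    return join '', map { $token{ord $_} // $_ } split //, $s;
}

# Two distinct length-8 token strings from incremental mode...
for my $s ("\x80\x80ab\x81cd\x81", "\x80\x80ab\x81dc\x81") {
    my $password = untokenize($s);  # 20 characters each
    printf "%s -> %s\n", $password, substr($password, 0, 8);
}
# ...decode to 20-character passwords that both truncate to "lovelove".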

Alexander

View attachment "tokenize.conf" of type "text/plain" (7690 bytes)
