Message-ID: <20241030040147.GA26754@openwall.com>
Date: Wed, 30 Oct 2024 05:01:47 +0100
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Markov phrases in john

On Tue, Oct 29, 2024 at 10:21:39PM -0400, Matt Weir wrote:
> I published a blog post explaining how the new tokenizer attack works, as
> well as detailing instructions on how to configure and run it. Link:
> https://reusablesec.blogspot.com/2024/10/running-jtrs-tokenizer-attack.html

Good stuff.  I like your introduction.  Indeed, this isn't to be called
"Markov phrases" - that's just the Subject line on this thread.  It is
an alternative to applying a Markov model to entire words (which is what
the thread was originally about), instead applying it to any substrings.
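To illustrate the idea (this is just a rough conceptual sketch of my own, not the actual algorithm in tokenize.pl): frequent multi-character substrings from the training passwords get mapped to single placeholder codes, so a per-position character model such as incremental mode effectively learns statistics over those substrings as atomic units, and candidates are expanded back afterwards:

```python
# Conceptual sketch only - tokenize.pl's real token selection and encoding differ.
from collections import Counter

def top_substrings(passwords, min_len=2, max_len=4, n=3):
    # Count all substrings of the given lengths and keep the n most frequent.
    counts = Counter()
    for pw in passwords:
        for length in range(min_len, max_len + 1):
            for i in range(len(pw) - length + 1):
                counts[pw[i:i + length]] += 1
    return [s for s, _ in counts.most_common(n)]

def make_maps(tokens, first_code=0x80):
    # Assign each chosen substring one otherwise-unused placeholder code.
    encode = {tok: chr(first_code + i) for i, tok in enumerate(tokens)}
    decode = {v: k for k, v in encode.items()}
    return encode, decode

def tokenize(pw, encode):
    # Greedy left-to-right replacement, longest token first.
    out, i = "", 0
    toks = sorted(encode, key=len, reverse=True)
    while i < len(pw):
        for tok in toks:
            if pw.startswith(tok, i):
                out += encode[tok]
                i += len(tok)
                break
        else:
            out += pw[i]
            i += 1
    return out

def untokenize(s, decode):
    # Expand placeholder codes back to their substrings (as the external
    # mode does for generated candidates).
    return "".join(decode.get(ch, ch) for ch in s)
```

The training data is run through tokenize(), the model is trained on the result, and every generated candidate is run through untokenize() before being tried.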

Where you suggest copy-pasting into john.conf, I instead suggest simply:

./tokenize.pl TRAINING_PASSWORDS.txt > john-local.conf

This file is automatically included, and the sed line in there is no
problem - it is treated as a comment.

Where you observe the first 25(ish) guesses become visibly worse, I
guess that's because your training set is worse (before the tokenizer).
You train on whatever 1 million passwords, but the .chr files supplied
with JtR were trained on the full RockYou (32 million including dupes).
If you want to show the effect of the tokenizer alone, you need to
re-train both with and without tokenizer on the same input (I did).

Here's the first 25 I am getting for my tokenizer-enabled
RockYou-trained file (as used in the previous tests I posted about):

$ ./john -inc=custom -ext=untokenize -stdo -max-candidates=25
Warning: only 253 characters available
123456
12345
loveme
marian
12345a
mario
lovely
lovelove
justin
maria
superman
12341234
123412345
marie1
marie123
lovers
lover1
123457
12341231
mariel
marie2
lovely1
lovers1
12345j
12341235

This looks much better than what you observed.

> The tests I'm interested in running are comparing the tokenizer attack vs.
> standard incremental against different datasets and paired with different
> attacks. Aka you ran a quick wordlist attack first (using RockYou), so
> it'll be interesting to see how tokenizer works in conjunction with other
> attacks vs. it being a stand-alone attack. Also I have concerns about using
> HIBP as a test list. That's probably worth a whole other post/email, but
> long story short I'm really interested to see how tokenizer does against a
> site specific password dump vs. a more generic "combined leak list".

For combining with other attacks, it is possible that training on
stronger passwords (excluding those not matching a "policy") may yield
better results (more new cracks on top of what other attacks find), see:

https://github.com/openwall/john/issues/5220

I wonder how this fits in with tokenization - probably it's orthogonal,
but I'm not sure.
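As a concrete example of that kind of pre-filtering (my own illustration - the actual criteria discussed in the issue above may differ), one might drop training passwords that fail a simple composition policy before training:

```python
# Illustrative policy filter, not the exact filter from issue #5220:
# keep only passwords of length >= 8 containing at least one letter
# and at least one digit.
import re

def meets_policy(pw, min_len=8):
    return bool(len(pw) >= min_len
                and re.search(r"[A-Za-z]", pw)
                and re.search(r"[0-9]", pw))

def filter_training(passwords):
    # Training set reduced to policy-compliant passwords only.
    return [pw for pw in passwords if meets_policy(pw)]
```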

> Cheers and thanks for all the great work. I'm really looking forward to
> better understanding this tool!

Thank you very much.  I'm looking forward to your test results.

BTW, I notice you link to your previous blog posts on incremental mode
from 2009-2010.  If you compare against those old results now, please be
aware that I improved the incremental mode itself significantly in 2013
("such that the counts of character indices grow independently for each
position" as my commit message says).

Alexander
