Message-ID: <20241028040733.GA15044@openwall.com>
Date: Mon, 28 Oct 2024 05:07:33 +0100
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Markov phrases in john

I've extracted some more data from my tests, below:

On Sun, Oct 27, 2024 at 06:08:28AM +0100, Solar Designer wrote:
> For comparison, continuing a reference run to 2 billion gives 1820805,
> so is on par with the tokenized run at 1 billion. Average password
> length among the 721406 cracked on top of wordlist is 6.70 characters.
>
> Continuing the tokenized run to 2 billion gives 1952392. Combined with
> the reference 2 billion run's, it's 2229062 unique. Average password
> length among the first 721406 cracked by the tokenized run on top of
> wordlist is 7.04 characters, so a slight increase over the non-tokenized
> run's length.

The average password lengths above were for cracked passwords. I now
also have the corresponding average candidate password lengths. For a
reference run to 2 billion, it's 7.04 characters (the match with 7.04
above is accidental). For the tokenized run to 2 billion, it's 8.27
characters. So there's a larger increase in candidate password length
than in successfully cracked passwords' length.

> Some passwords may be represented in more than one way via the tokens,
> which means that even though incremental mode outputs unique token
> strings, we're getting some duplicate candidate passwords after the
> external mode. For example, in the first 1 million candidates generated
> in my tests like the above, there are 994373 unique. In the first 10
> million, 9885308 unique. In the first 100 million, 97567218 unique.

In the first 1 billion, 969765579 unique (97.0%). In the first 2
billion, 1934766835 unique (96.7%).

> The above success was despite the duplicates included in the 1 or 2
> billion counts. Results would be even better with them excluded, for
> more unique passwords fitting those counts. Maybe we need to enable our
> dupe suppressor for such runs. Maybe we need to adjust the approach to
> tokenization to reduce or avoid duplicates - e.g., use fixed-length
> non-overlapping tokens, but this would probably hurt in other ways.
> Perhaps the duplicates would become more of a problem for longer runs
> (beyond a few billion candidates).
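As a toy illustration of where such duplicates come from - assuming a
small made-up token set here, not the actual 158 tokens - a quick
Python sketch can count how many distinct token strings decode to the
same candidate password:

#!/usr/bin/env python3
# Toy illustration only (hypothetical tokens, not the real tokenizer):
# count the number of ways a string can be split into single characters
# and tokens.  Each distinct split is a distinct token string on the
# incremental mode side, yet all of them decode to the same password.
from functools import lru_cache

TOKENS = {"th", "he", "er", "in", "the", "pass", "word"}  # made up

def representations(s):
    @lru_cache(maxsize=None)
    def count(i):
        if i == len(s):
            return 1
        total = count(i + 1)  # spell the next character literally
        for t in TOKENS:
            if s.startswith(t, i):
                total += count(i + len(t))  # or use a matching token
        return total
    return count(0)

for pw in ("password", "thether", "monkey"):
    print(pw, representations(pw))

Any password that prints a count above 1 is reachable through more than
one unique token string, which is what produces the duplicate
candidates counted above.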
Some other ideas:

1. Try models such as our Markov mode or OMEN, which (I think) only
generate token combinations that were actually seen in the training
set. When we always substitute a substring with a token code before
training, that substring is then never seen in the training set as-is,
so those models shouldn't produce it by other means. Incremental mode
is different in that it eventually tries all combinations, even those
that were never seen, which is both a blessing and a curse. I'd
appreciate it if someone from the john-users community runs such tests
with models other than incremental mode.

2. Modify incremental mode so that it either doesn't try never-seen
combinations or is token-aware and skips effectively-duplicate token
strings (e.g., insist on a certain ordering of tokens). This is tricky
to implement and it has drawbacks - trying never-seen combinations may
be desirable, and filtering is slow. OTOH, having tokenization built in
may improve ease of use and will speed up the reverse mapping (native
code instead of external mode).

3. Choose a token lengths range that minimizes the duplicates rate. In
my tests so far, most tokens are of length 2, and I guess these are
somewhat frequently matched by pairs of single characters or by a
different grouping of characters and tokens. Fixing the token length at
3 or 4 might reduce such occurrences a lot while still letting the 158
tokens cover the really common substrings.

Alexander