Message-ID: <CAJ9ii1HqwuaQpnKxZ7rQQAYMQQFS3pX1Ao4eaR134wyKpg0Bqg@mail.gmail.com>
Date: Mon, 28 Oct 2024 21:08:39 -0400
From: Matt Weir <cweir@...edu>
To: john-users@...ts.openwall.com
Subject: Re: Markov phrases in john

I took a look at this, and I think there is a fundamental misunderstanding
on my part, as I've run into a lot of issues getting this attack to run.
I'm sure this is wrong, but I hope providing examples will help you correct
what I'm doing.

Step 1) Run tokenize.pl on the training passwords. Example:

./tokenize.pl TRAINING_PASSWORDS.txt

Step 2) Run the resulting sed script on the training passwords. Example:

cat TRAINING_PASSWORDS.txt | sed 's/1234/\x10/; s/love/\x15/; s/2345/\x1b/; s/3456/\x93/; s/ilov/\xe5/; s/123/\x6/; s/234/\x89/; s/ove/\x8c/; s/lov/\x90/; s/345/\x96/; s/456/\xa4/; s/and/\xb2/; s/mar/\xc1/; s/ell/\xd9/; s/199/\xdf/; s/ang/\xe0/; s/200/\xe7/; s/ter/\xe9/; s/198/\xee/; s/man/\xf4/; s/ari/\xfb/; s/an/\x1/; s/er/\x2/; s/12/\x3/; s/ar/\x4/; s/in/\x5/; s/23/\x7/; s/ma/\x8/; s/on/\x9/; s/el/\xb/; s/lo/\xc/; s/ri/\xe/; s/le/\xf/; s/al/\x11/; s/la/\x12/; s/li/\x13/; s/en/\x14/; s/ra/\x16/; s/es/\x17/; s/re/\x18/; s/19/\x19/; s/il/\x1a/; s/na/\x1c/; s/ha/\x1d/; s/am/\x1e/; s/ie/\x1f/; s/11/\x7f/; s/ch/\x80/; s/10/\x81/; s/00/\x82/; s/te/\x83/; s/ve/\x84/; s/as/\x85/; s/ne/\x86/; s/ll/\x87/; s/or/\x88/; s/ta/\x8a/; s/st/\x8b/; s/is/\x8d/; s/01/\x8e/; s/ro/\x8f/; s/20/\x91/; s/ni/\x92/; s/at/\x94/; s/34/\x95/; s/45/\x97/; s/it/\x98/; s/08/\x99/; s/mi/\x9a/; s/ca/\x9b/; s/ic/\x9c/; s/da/\x9d/; s/he/\x9e/; s/21/\x9f/; s/nd/\xa0/; s/me/\xa1/; s/ng/\xa2/; s/mo/\xa3/; s/ba/\xa5/; s/sa/\xa6/; s/ti/\xa7/; s/56/\xa8/; s/sh/\xa9/; s/ea/\xaa/; s/ia/\xab/; s/ol/\xac/; s/se/\xad/; s/ov/\xae/; s/be/\xaf/; s/de/\xb0/; s/co/\xb1/; s/ss/\xb3/; s/99/\xb4/; s/to/\xb5/; s/22/\xb6/; s/oo/\xb7/; s/02/\xb8/; s/ke/\xb9/; s/ee/\xba/; s/ho/\xbb/; s/ey/\xbc/; s/ck/\xbd/; s/ab/\xbe/; s/et/\xbf/; s/ad/\xc0/; s/13/\xc2/; s/07/\xc3/; s/pa/\xc4/; s/09/\xc5/; s/06/\xc6/; s/ki/\xc7/; s/98/\xc8/; s/hi/\xc9/; s/th/\xca/; s/05/\xcb/; s/14/\xcc/; s/25/\xcd/; s/ay/\xce/; s/ce/\xcf/; s/89/\xd0/; s/ac/\xd1/; s/os/\xd2/; s/ge/\xd3/; s/03/\xd4/; s/ka/\xd5/; s/ja/\xd6/; s/bo/\xd7/; s/do/\xd8/; s/04/\xda/; s/e1/\xdb/; s/nn/\xdc/; s/em/\xdd/; s/31/\xde/; s/15/\xe1/; s/18/\xe2/; s/ir/\xe3/; s/91/\xe4/; s/om/\xe6/; s/90/\xe8/; s/30/\xea/; s/nt/\xeb/; s/di/\xec/; s/si/\xed/; s/ou/\xef/; s/un/\xf0/; s/24/\xf1/; s/us/\xf2/; s/88/\xf3/; s/ai/\xf5/; s/78/\xf6/; s/y1/\xf7/; s/so/\xf8/; s/pe/\xf9/; s/ot/\xfa/; s/ga/\xfc/; s/ly/\xfd/; s/16/\xfe/; s/ed/\xff/' > tokenize.chr

Step 3) Create an entry in john.conf for the new charset. Example:

[Incremental:Tokenize]
File = $JOHN/tokenize.chr

Step 4) Run an incremental attack with the new charset. Example:

./john --incremental=Tokenize --stdout

Error that is printed out:

Incorrect charset file format: $JOHN/tokenize.chr

Note: I thought it might be a UTF-8 issue, so I repeated the above steps
using password.lst (picked because it doesn't have any UTF-8 characters,
not because it's a good training set). I also then repeated the experiment
using the sed script you attached in a previous e-mail vs. the one I
generated. Same error.

I'm sure I'm doing something wrong on my end, so if you could expand on
the steps you used to generate the custom .chr file, I'd appreciate it!

Cheers,
Matt/Lakiw

On Mon, Oct 28, 2024 at 12:09 AM Solar Designer <solar@...nwall.com> wrote:
> I've extracted some more data from my tests, below:
>
> On Sun, Oct 27, 2024 at 06:08:28AM +0100, Solar Designer wrote:
> > For comparison, continuing a reference run to 2 billion gives 1820805,
> > so is on par with the tokenized run at 1 billion. Average password
> > length among the 721406 cracked on top of wordlist is 6.70 characters.
> >
> > Continuing the tokenized run to 2 billion gives 1952392. Combined with
> > the reference 2 billion run's, it's 2229062 unique.
> > Average password
> > length among the first 721406 cracked by the tokenized run on top of
> > wordlist is 7.04 characters, so a slight increase over the non-tokenized
> > run's length.
>
> The average password lengths above were for cracked passwords. I now
> also have the corresponding average candidate password lengths. For a
> reference run to 2 billion, it's 7.04 characters (the match with 7.04
> above is accidental). For the tokenized run to 2 billion, it's 8.27
> characters. So there's a larger increase in candidate password length
> than in successfully cracked passwords' length.
>
> > Some passwords may be represented in more than one way via the tokens,
> > which means that even though incremental mode outputs unique token
> > strings, we're getting some duplicate candidate passwords after the
> > external mode. For example, in the first 1 million candidates generated
> > in my tests like the above, there are 994373 unique. In the first 10
> > million, 9885308 unique. In the first 100 million, 97567218 unique.
> > In the first 1 billion, 969765579 unique (97.0%).
> In the first 2 billion, 1934766835 unique (96.7%).
>
> > The above success was despite the duplicates included in the 1 or 2
> > billion counts. Results would be even better with them excluded, for
> > more unique passwords fitting those counts. Maybe we need to enable our
> > dupe suppressor for such runs. Maybe we need to adjust the approach to
> > tokenization to reduce or avoid duplicates - e.g., use fixed-length
> > non-overlapping tokens, but this would probably hurt in other ways.
> > Perhaps the duplicates would become more of a problem for longer runs
> > (beyond a few billion candidates).
>
> Some other ideas:
>
> 1. Try models such as our Markov mode or OMEN, which (I think) only
> generate token combinations that were actually seen in the training set.
> When we always substitute a substring with a token code before training,
> that substring is then never seen in the training set as-is, so those
> models shouldn't produce it by other means. Incremental mode is
> different in that it eventually tries all combinations, even those that
> were never seen, which is both a blessing and a curse.
>
> I'd appreciate it if someone from the john-users community runs such
> tests with models other than incremental mode.
>
> 2. Modify incremental mode so that it either doesn't try never-seen
> combinations or is token-aware and skips effectively-duplicate token
> strings (e.g., insist on certain ordering of tokens). This is tricky to
> implement and it has drawbacks - trying never-seen combinations may be
> desirable and filtering is slow. OTOH, having tokenization built in may
> improve ease of use and will speed up the reverse mapping (native code
> instead of external mode).
>
> 3. Choose a token lengths range that minimizes the duplicates rate.
> In my test so far, most tokens are of length 2 and I guess these are
> somewhat frequently matched by pairs of characters or by different
> grouping of characters and tokens. Fixing the token length at 3 or 4
> might reduce such occurrences a lot while still covering the really
> common substrings by the 158 tokens.
>
> Alexander
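[Editor's illustration] The duplicates rate discussed in the quoted message can be reproduced in miniature. The Python sketch below is illustrative only (it is not part of John the Ripper); its token-to-byte mapping is a small subset copied from the tokenize.pl-generated sed script quoted earlier in the thread. It enumerates every way to encode the string "ilove" as a mix of tokens and literal characters, and shows that all of these distinct token strings decode back to the same candidate password:

```python
# Illustrative sketch only -- not part of John the Ripper.
# Token-to-byte mapping: a small subset of the tokenize.pl-generated
# sed script quoted earlier in this thread.
TOKENS = {
    "love": "\x15", "ilov": "\xe5", "lov": "\x90", "ove": "\x8c",
    "lo": "\x0c", "ve": "\x84", "il": "\x1a", "ov": "\xae",
}
DECODE = {code: tok for tok, code in TOKENS.items()}

def encodings(s):
    """Enumerate every encoding of s as a mix of tokens and literals."""
    if not s:
        return [""]
    results = []
    for tok, code in TOKENS.items():  # consume a matching token...
        if s.startswith(tok):
            results += [code + rest for rest in encodings(s[len(tok):])]
    # ...or leave the first character as a literal
    results += [s[0] + rest for rest in encodings(s[1:])]
    return results

def decode(enc):
    """Map token bytes back to substrings (the external mode's job)."""
    return "".join(DECODE.get(c, c) for c in enc)

encs = encodings("ilove")
print(len(set(encs)), "distinct token strings")  # 13 for this subset
print({decode(e) for e in encs})                 # all decode to 'ilove'
```

Incremental mode deduplicates token strings, not decoded passwords, so every one of these encodings would come out of the external mode as the same candidate "ilove" - consistent with the roughly 3% duplicate rate observed in the 1- and 2-billion-candidate runs above.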