Message-ID: <CAJ9ii1EndAeQGQd1HHKW0wBEdOdwZy+LtMEZfjgrq+guXgsFkw@mail.gmail.com>
Date: Mon, 28 Oct 2024 23:46:45 -0400
From: Matt Weir <cweir@...edu>
To: john-users@...ts.openwall.com
Subject: Re: Markov phrases in john

Looking through this more, my guess is that the output of this sed script
needs to be put into potfile format so I can use --make-charset on it
(vs. using it to generate a .chr file directly). I can then run
incremental=Tokenize to generate "encoded" guesses, which I then need to
run through the JtR external mode to convert into actual password guesses
(rough sketch of that below). My original request for an example still
stands, though, since there are a lot of steps and I want to make sure I
have them right. I might also be totally wrong about all of this ;p
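
If that guess is right, I'd expect the whole thing to look roughly like
this (untested on my end, and the file/mode names below are just
placeholders):

  # tokenize the training data, then fake it into pot file format
  # (":" as an empty hash field) so --make-charset will read it; I
  # *think* only the plaintext side matters, but I haven't verified that
  cat TRAINING_PASSWORDS.txt | sed '<the big script from Step 2 below>' \
    | sed 's/^/:/' > tokenized.pot
  ./john --make-charset=tokenize.chr --pot=tokenized.pot

  # then, with an [Incremental:Tokenize] section pointing at tokenize.chr:
  ./john --incremental=Tokenize --external=Untokenize --stdout

where "Untokenize" stands in for whatever the external mode that reverses
the token substitutions is actually called.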

Side note: This is a weird edge case, so it's a very low-priority request,
but one thing this made me realize is that it would be nice to be able to
point --make-charset at a set of training passwords directly vs. a potfile,
to remove a step from the generation process (the step I mean is sketched
below). That's just me being lazy though, and I'll admit this task is rare
enough that optimizing it doesn't provide much value.
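
The step in question being, as far as I can tell, just faking up a pot
file out of the training set, along the lines of (file names made up,
and I haven't confirmed an empty hash field is accepted here):

  sed 's/^/:/' TRAINING_PASSWORDS.txt > fake.pot
  ./john --make-charset=custom.chr --pot=fake.pot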

Cheers,
Matt/Lakiw

On Mon, Oct 28, 2024 at 9:08 PM Matt Weir <cweir@...edu> wrote:

> I took a look at this, and I think I have a fundamental misunderstanding,
> as I've run into a lot of issues getting this attack to run.  I'm sure
> something here is wrong, but I hope providing examples will help you
> correct what I'm doing.
>
> Step 1) Run tokenize.pl on training passwords. Example:
> ./tokenize.pl TRAINING_PASSWORDS.txt
>
> Step 2) Run the resulting sed script on the training passwords. Example:
> cat TRAINING_PASSWORDS.txt | sed 's/1234/\x10/; s/love/\x15/;
> s/2345/\x1b/; s/3456/\x93/; s/ilov/\xe5/; s/123/\x6/; s/234/\x89/;
> s/ove/\x8c/; s/lov/\x90/; s/345/\x96/; s/456/\xa4/; s/and/\xb2/;
> s/mar/\xc1/; s/ell/\xd9/; s/199/\xdf/; s/ang/\xe0/; s/200/\xe7/;
> s/ter/\xe9/; s/198/\xee/; s/man/\xf4/; s/ari/\xfb/; s/an/\x1/; s/er/\x2/;
> s/12/\x3/; s/ar/\x4/; s/in/\x5/; s/23/\x7/; s/ma/\x8/; s/on/\x9/;
> s/el/\xb/; s/lo/\xc/; s/ri/\xe/; s/le/\xf/; s/al/\x11/; s/la/\x12/;
> s/li/\x13/; s/en/\x14/; s/ra/\x16/; s/es/\x17/; s/re/\x18/; s/19/\x19/;
> s/il/\x1a/; s/na/\x1c/; s/ha/\x1d/; s/am/\x1e/; s/ie/\x1f/; s/11/\x7f/;
> s/ch/\x80/; s/10/\x81/; s/00/\x82/; s/te/\x83/; s/ve/\x84/; s/as/\x85/;
> s/ne/\x86/; s/ll/\x87/; s/or/\x88/; s/ta/\x8a/; s/st/\x8b/; s/is/\x8d/;
> s/01/\x8e/; s/ro/\x8f/; s/20/\x91/; s/ni/\x92/; s/at/\x94/; s/34/\x95/;
> s/45/\x97/; s/it/\x98/; s/08/\x99/; s/mi/\x9a/; s/ca/\x9b/; s/ic/\x9c/;
> s/da/\x9d/; s/he/\x9e/; s/21/\x9f/; s/nd/\xa0/; s/me/\xa1/; s/ng/\xa2/;
> s/mo/\xa3/; s/ba/\xa5/; s/sa/\xa6/; s/ti/\xa7/; s/56/\xa8/; s/sh/\xa9/;
> s/ea/\xaa/; s/ia/\xab/; s/ol/\xac/; s/se/\xad/; s/ov/\xae/; s/be/\xaf/;
> s/de/\xb0/; s/co/\xb1/; s/ss/\xb3/; s/99/\xb4/; s/to/\xb5/; s/22/\xb6/;
> s/oo/\xb7/; s/02/\xb8/; s/ke/\xb9/; s/ee/\xba/; s/ho/\xbb/; s/ey/\xbc/;
> s/ck/\xbd/; s/ab/\xbe/; s/et/\xbf/; s/ad/\xc0/; s/13/\xc2/; s/07/\xc3/;
> s/pa/\xc4/; s/09/\xc5/; s/06/\xc6/; s/ki/\xc7/; s/98/\xc8/; s/hi/\xc9/;
> s/th/\xca/; s/05/\xcb/; s/14/\xcc/; s/25/\xcd/; s/ay/\xce/; s/ce/\xcf/;
> s/89/\xd0/; s/ac/\xd1/; s/os/\xd2/; s/ge/\xd3/; s/03/\xd4/; s/ka/\xd5/;
> s/ja/\xd6/; s/bo/\xd7/; s/do/\xd8/; s/04/\xda/; s/e1/\xdb/; s/nn/\xdc/;
> s/em/\xdd/; s/31/\xde/; s/15/\xe1/; s/18/\xe2/; s/ir/\xe3/; s/91/\xe4/;
> s/om/\xe6/; s/90/\xe8/; s/30/\xea/; s/nt/\xeb/; s/di/\xec/; s/si/\xed/;
> s/ou/\xef/; s/un/\xf0/; s/24/\xf1/; s/us/\xf2/; s/88/\xf3/; s/ai/\xf5/;
> s/78/\xf6/; s/y1/\xf7/; s/so/\xf8/; s/pe/\xf9/; s/ot/\xfa/; s/ga/\xfc/;
> s/ly/\xfd/; s/16/\xfe/; s/ed/\xff/'  > tokenize.chr
>
> Step 3) Create an entry in john.conf for the new charset. Example:
> [Incremental:Tokenize]
> File = $JOHN/tokenize.chr
>
> Step 4) Run an incremental attack with the new charset. Example:
> ./john --incremental=Tokenize --stdout
>
> Error that is printed out:
> Incorrect charset file format: $JOHN/tokenize.chr
>
> Note: I thought it might be a UTF-8 issue so I repeated the above steps
> using password.lst (picked because it doesn't have any UTF-8 characters,
> not because it's a good training set). I also then repeated the experiment
> using the sed script you attached in a previous e-mail vs. the one I
> generated. Same error.
>
> I'm sure I'm doing something wrong on my end, so if you could expand on
> the steps you used to generate the custom .chr file I'd appreciate it!
>
> Cheers,
> Matt/Lakiw
>
>
>
>
> On Mon, Oct 28, 2024 at 12:09 AM Solar Designer <solar@...nwall.com>
> wrote:
>
>> I've extracted some more data from my tests, below:
>>
>> On Sun, Oct 27, 2024 at 06:08:28AM +0100, Solar Designer wrote:
>> > For comparison, continuing a reference run to 2 billion gives 1820805,
>> > so it is on par with the tokenized run at 1 billion.  Average password
>> > length among the 721406 cracked on top of wordlist is 6.70 characters.
>> >
>> > Continuing the tokenized run to 2 billion gives 1952392.  Combined with
>> > the reference 2 billion run's, it's 2229062 unique.  Average password
>> > length among the first 721406 cracked by the tokenized run on top of
>> > wordlist is 7.04 characters, so a slight increase over the non-tokenized
>> > run's length.
>>
>> The average password lengths above were for cracked passwords.  I now
>> also have the corresponding average candidate password lengths.  For a
>> reference run to 2 billion, it's 7.04 characters (the match with 7.04
>> above is accidental).  For the tokenized run to 2 billion, it's 8.27
>> characters.  So there's a larger increase in candidate password length
>> than in successfully cracked passwords' length.
>>
>> > Some passwords may be represented in more than one way via the tokens,
>> > which means that even though incremental mode outputs unique token
>> > strings, we're getting some duplicate candidate passwords after the
>> > external mode.  For example, in the first 1 million candidates generated
>> > in my tests like the above, there are 994373 unique.  In the first 10
>> > million, 9885308 unique.  In the first 100 million, 97567218 unique.
>>
>> In the first 1 billion, 969765579 unique (97.0%).
>> In the first 2 billion, 1934766835 unique (96.7%).
>>
>> > The above success was despite the duplicates included in the 1 or 2
>> > billion counts.  Results would be even better with them excluded, for
>> > more unique passwords fitting those counts.  Maybe we need to enable our
>> > dupe suppressor for such runs.  Maybe we need to adjust the approach to
>> > tokenization to reduce or avoid duplicates - e.g., use fixed-length
>> > non-overlapping tokens, but this would probably hurt in other ways.
>> > Perhaps the duplicates would become more of a problem for longer runs
>> > (beyond a few billion candidates).
>>
>> Some other ideas:
>>
>> 1. Try models such as our Markov mode or OMEN, which (I think) only
>> generate token combinations that were actually seen in the training set.
>> When we always substitute a substring with a token code before training,
>> that substring is then never seen in the training set as-is, so those
>> models shouldn't produce it by other means.  Incremental mode is
>> different in that it eventually tries all combinations, even those that
>> were never seen, which is both a blessing and a curse.
>>
>> I'd appreciate it if someone from the john-users community runs such
>> tests with models other than incremental mode.
>>
>> 2. Modify incremental mode so that it either doesn't try never-seen
>> combinations or is token-aware and skips effectively-duplicate token
>> strings (e.g., insist on certain ordering of tokens).  This is tricky to
>> implement and it has drawbacks - trying never-seen combinations may be
>> desirable and filtering is slow.  OTOH, having tokenization built in may
>> improve ease of use and will speed up the reverse mapping (native code
>> instead of external mode).
>>
>> 3. Choose a token length range that minimizes the duplicate rate.
>> In my test so far, most tokens are of length 2, and I guess these are
>> somewhat frequently matched by pairs of characters or by different
>> groupings of characters and tokens.  Fixing the token length at 3 or 4
>> might reduce such occurrences a lot while still covering the really
>> common substrings with the 158 tokens.
>>
>> Alexander
>>
>
