Message-ID: <20250327023048.GA9191@openwall.com>
Date: Thu, 27 Mar 2025 03:30:48 +0100
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Issue Applying Rules to Tokenized in John the Ripper

Hi,

On Thu, Mar 27, 2025 at 05:08:16AM +0530, Pentester LAB wrote:
> I am reaching out to seek assistance regarding an issue I encountered while
> attempting to apply rules to a tokenized using John the Ripper (JtR).

I assume you mean to a tokenized wordlist.

The tokenize.pl script was intended for use with probabilistic models,
where the input wordlist (or cracked passwords list) would be for
training only, and then the model would generate candidates.

I did in fact have an afterthought to also try misusing it with
wordlist rules, which I mentioned as (currently) idea number 13 at:

https://github.com/openwall/john/issues/5597

"The tokenizer can also be useful along with wordlist mode, both to
produce different candidate passwords (by applying wordlist rules
prior to token expansion) and simply as a compression algorithm. We
could want to experiment with this and document useful usage patterns
and have this in mind if/when integrating the functionality into john
proper."

However, what you're doing looks very different, so what is your goal?

I'll proceed to reply step-by-step, but chances are you actually
wanted something much simpler, which I describe at the end of this
message.

> Steps Taken:
>
> 1.
>
> I created a test input file named test.txt with the following content:
>
> abc
> @
> 123

That's way too little content for intended use of the tokenizer.
You'd normally train it on the same large wordlist that you'd use for
training the probabilistic model. However, since we're talking
experiments with unintended uses, let's proceed.

> 2.
>
> I used JtR's default tokenizer to process the file:
>
> perl tokenize.pl test.txt > test_token.txt

Looks good so far, as preparation for whatever experiment comes next.

> 3.
>
> The content of test_token.txt is as follows:
>
> # sed '/[^ -~]/d; s/123/\x1/g; s/abc/\x2/g; s/12/\x3/g; s/bc/\x4/g;
> s/23/\x5/g; s/ab/\x6/g; s/a/\x7/g; s/1/\x8/g; s/b/\x9/g; s/2/\xb/g;
> s/@/\xc/g; s/c/\xe/g; s/3/\xf/g; s/^/:/'
>
> [List.External:Untokenize]
> int mod[0x100];
>
> void init() {
>     for (int i = 0; i < 0x100; ++i) mod[i] = i;
>     mod[1] = 0x333231; // "123"
>     mod[2] = 0x636261; // "abc"
>     mod[3] = 0x3231; // "12"
>     mod[4] = 0x6362; // "bc"
>     mod[5] = 0x3332; // "23"
>     mod[6] = 0x6261; // "ab"
>     mod[7] = 0x61; // "a"
>     mod[8] = 0x31; // "1"
>     mod[9] = 0x62; // "b"
>     mod[11] = 0x32; // "2"
>     mod[12] = 0x40; // "@"
>     mod[14] = 0x63; // "c"
>     mod[15] = 0x33; // "3"
> }
>
> void filter() {
>     int i = 0, j = 0, k = 0, save[0x80];
>     while (save[i] = word[i]) i++;
>     while (int m = mod[save[j++]]) {
>         word[k++] = m;
>         while (m >>= 8) word[k++] = m;
>     }
>     word[k] = 0;
> }

There's no way tokenize.pl as ever released by our project would
produce exactly the above output. I guess you modified it in many
ways, which made it produce subtly broken output. I see at least two
errors in there: it's tokenizing even single characters (which is at
best unneeded), and it tries to use a "for" loop (which our external
mode compiler does not support).
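For comparison, the same identity initialization can be expressed with
a "while" loop, which other bundled external modes do use. This is
only a sketch of the shape such code could take (assuming, as in those
modes, local variables declared without initializers), not the exact
code that unmodified tokenize.pl emits:

void init() {
    int i;

    i = 0;
    while (i < 0x100) {
        mod[i] = i; // identity: each byte initially expands to itself
        i++;
    }
    mod[1] = 0x333231; // "123"
    // ... remaining multi-character token assignments as quoted above ...
}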
However, none of this matters when you don't even use this file
correctly next:

> 4.
>
> I attempted to crack the hash using the following command:
>
> john --format=raw-md5 --wordlist=test_token.txt
> --rules=KoreLogic,Best64 md5.hash

This makes no sense. You use the programs output by the tokenizer as
a wordlist, but they're not useful as a wordlist.

> Issue Observed:
>
> -
>
> JtR correctly loaded the tokenized wordlist,

You had no "tokenized wordlist", so it couldn't possibly be "correctly
loaded". What you had is a text file with two programs (to perform
tokenization and its reverse), which you instead misused as a
wordlist.

> but it appears that the
> selected rules (KoreLogic, Best64) were not applied during the cracking
> attempt.

They probably were, but it doesn't help much when the wordlist doesn't
contain anything resembling passwords (has program code instead).

> -
>
> The session completed without any successful cracks, and no rule-based
> transformations seemed to have been executed on the tokenized input.
>
> Request for Assistance:
>
> I would appreciate guidance on:
>
> -
>
> Ensuring that rules are correctly applied to tokenized.

This is irrelevant.

> -
>
> Identifying if there are any misconfigurations or additional parameters
> needed.

This whole project is about trying out a misconfiguration because the
tokenizer was not intended for such misuse, but we may try that
anyway. However, worst of all, the final command you ran is certainly
not what you intended.

I recommend that you first learn and practice with intended use of the
tokenizer along with incremental mode, as given in comments at the
start of tokenize.pl. After you're familiar with that, you can
proceed to try weird things if you want to.
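In outline, that intended use looks roughly like this (a sketch only;
see the actual comments in tokenize.pl for the authoritative commands.
The file names here are illustrative, and --incremental=custom assumes
a matching [Incremental:Custom] config section whose File setting
points at the generated custom.chr):

$ perl tokenize.pl wordlist.txt > john-local.conf
$ sed '<the sed command from john-local.conf, keeping "; s/^/:/">' wordlist.txt > fake.pot
$ ./john --pot=fake.pot --make-charset=custom.chr
$ ./john --incremental=custom --external=Untokenize --stdout

That is, the sed command converts the training wordlist into a fake
pot file of tokenized "plaintexts", --make-charset trains incremental
mode on those, and the Untokenize external mode expands the generated
token strings into actual candidate passwords.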
Trying to repair your weird attempts above using unmodified
tokenize.pl:

$ cat test.txt
abc
@
123
$ perl tokenize.pl test.txt > john-local.conf
$ sed '/[^ -~]/d; s/123/\x1/g; s/abc/\x2/g; s/ab/\x3/g; s/23/\x4/g; s/bc/\x5/g; s/12/\x6/g' test.txt > test-tokenized.txt
$ ./john --wordlist=test-tokenized.txt --external=Untokenize --stdout
Using default input encoding: UTF-8
abc
@
123
3p 0:00:00:00 100.00% (2025-03-27 03:01) 60.00p/s 123
$ ./john --wordlist=test-tokenized.txt --rules=Best64 --external=Untokenize --stdout | head
Using default input encoding: UTF-8
Press 'q' or Ctrl-C to abort, 'h' for help, almost any other key for status
Enabling duplicate candidate password suppressor using 256 MiB
124p 0:00:00:00 100.00% (2025-03-27 03:01) 1033p/s 123123123123123123
abc
@
123
abc0
@0
1230
abc1
@1
1231
abc2
$ wc test.txt test-tokenized.txt
 3  3 10 test.txt
 3  1  6 test-tokenized.txt

Where I took the "sed" command from the generated john-local.conf, but
removed the final part where it had "; s/^/:/" as that part was there
for producing fake pot files (for incremental mode training) rather
than wordlists.

As you can see, --external=Untokenize was able to correctly restore
the wordlist from its tokenized or compressed form (original test.txt
was 10 bytes, but tokenized test-tokenized.txt only 6 bytes). And the
rules are applied if you request them. Moreover, you can see that
they're applied differently and their effect is different than if you
used the same rules on the original wordlist:

$ ./john --wordlist=test.txt --rules=Best64 --external=Untokenize --stdout | head
Using default input encoding: UTF-8
Press 'q' or Ctrl-C to abort, 'h' for help, almost any other key for status
Enabling duplicate candidate password suppressor using 256 MiB
152p 0:00:00:00 100.00% (2025-03-27 03:07) 1013p/s 123121
abc
@
123
cba
321
ABC
Abc
abc0
@0
1230

The generated password candidates are different and their number is
also different (152 original vs. 124 when rules are applied to
tokenized wordlist prior to --external=Untokenize). This is because
rules that reorder or change the case of characters act on the token
bytes, where they're often no-ops: e.g., reversing the single byte
that stands for "abc" yields that same byte, which expands back to
"abc" rather than "cba", and the duplicate candidate suppressor then
drops such repeats. That's the point of my idea number 13, so thank
you for making me try it out.

With all that said, maybe you actually wanted something completely
different. Maybe you didn't need the tokenizer at all. Maybe you
wanted to explicitly list your tokens and then have them mixed up, and
then rules applied? You'd do that with PRINCE mode:

$ ./john --prince=test.txt --rules=Best64 --stdout | head
Press 'q' or Ctrl-C to abort, 'h' for help, almost any other key for status
Enabling duplicate candidate password suppressor using 256 MiB
@@@
@@@0
@@@1
@@@2
@@@3
@@@4
@@@5
@@@6
@@@7
@@@8
$ ./john --prince=test.txt --rules=Best64 --stdout | tail
Press 'q' or Ctrl-C to abort, 'h' for help, almost any other key for status
Enabling duplicate candidate password suppressor using 256 MiB
134147p 0:00:00:00 100.00% (2025-03-27 03:15) 838418p/s 4123123123123@
1223123123123@
12323123123123@
12313123123123@
131231231231
3123123@...312
1231231231231@
22312312312312
23@...1231231231
923123123123123@
4123123123123@

Alexander