Message-ID: <20241031160347.GA4391@openwall.com>
Date: Thu, 31 Oct 2024 17:03:47 +0100
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Markov phrases in john

On Thu, Oct 31, 2024 at 01:27:25PM +0100, magnum wrote:
> On 2024-10-30 03:21, Matt Weir wrote:
> >I published a blog post explaining how the new tokenizer attack works, as
> >well as detailing instructions on how to configure and run it. Link:
> >https://reusablesec.blogspot.com/2024/10/running-jtrs-tokenizer-attack.html
> 
> Good stuff (not only the blog post but this whole thread). Perhaps 
> stating the obvious, you need to ensure the original wordlist is pure 
> ASCII, or any parts of UTF-8 and/or legacy codepage stuff will be 
> erroneously detokenized.
> 
> BTW shouldn't the sed stuff all be /g? As in "s/me/\xa1/g;". If not, 
> words like "meme" or "james+me" would only have the first instance 
> tokenized, which I assume is not what we want.

Oh, you're absolutely correct.  I've just pushed an update to
tokenize.pl, so that the generated sed expression takes care of both of
these, as well as of producing pot format output.  I've also added a
usage example:

grep -v '^#!comment:' password.lst | ./tokenize.pl > john-local.conf
sed -n 's/^# //p' john-local.conf > tokenize.sh
grep -v '^#!comment:' password.lst | sh tokenize.sh > fake.pot
./john --pot=fake.pot --make-charset=custom.chr
./john --incremental=custom --external=untokenize --stdout --max-candidates=10
./john --incremental=custom --external=untokenize hashfile
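
The /g difference magnum points out is easy to see in isolation.  A quick
test (using a printable stand-in 'T' for the token byte, since the actual
generated script substitutes high-bit bytes like \xa1):

```shell
# Without /g, sed substitutes only the first match on each line:
printf 'meme\n' | sed 's/me/T/'     # -> Tme
# With /g, every non-overlapping match on the line is substituted:
printf 'meme\n' | sed 's/me/T/g'    # -> TT
```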

And this reminded me - in the test results I posted, I had actually run
tokenize.pl like the above, so it was trained on our password.lst (a
subset of RockYou overlapping with top HIBP), even though for further
incremental mode training I used the full RockYou (with dupes, to match
what we did for the released .chr files).  Then in the message I posted,
I wrongly wrote that I had trained both the tokenizer and incremental
mode on the same input.  Oops.  Sorry.  I think this doesn't invalidate
my results, but it does make them inconsistent with the way I described
them in that message - now corrected with this paragraph.

Of course, we need to run more and proper tests.

Alexander
