Message-ID: <20241205014553.GA8319@openwall.com>
Date: Thu, 5 Dec 2024 02:45:53 +0100
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Markov phrases in john

Hi,

I'm keeping track of issues with and enhancement ideas for the tokenizer
and its uses here:

https://github.com/openwall/john/issues/5597

This lists 12 items now.  Some of these have been mentioned here,
others are "brand new".

On Wed, Dec 04, 2024 at 07:24:57PM -0500, Matt Weir wrote:
> I wrote a new blog post about additional analysis of the Tokenizer and OMEN
> attacks.
> 
> Link:
> https://reusablesec.blogspot.com/2024/12/analyzing-tokenizer-part-2-omen.html
> 
> TLDR: I ran into significant problems training on the output of tokenize.pl.
> The control characters it inserts into the training data causes problems
> both when I was reading in the data as well as when I was trying to write
> the OMEN rulesets to disk. Therefore I was unable to successfully combine
> the two attack techniques. I plan on continuing to look into this, but it's
> quickly turning into a much bigger project than I originally expected.

Thanks.  Yes, projects like these quickly expand in many directions.

Regarding the control characters, it should be fairly easy for you to
modify the script so that it does not use single-byte ones.  Just reduce
$maxtok from 158 to 128, and adjust this loop:

for (my $c = 1; $c < 0x100; $c++) {

to be:

for (my $c = 0x80; $c < 0x100; $c++) {
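To illustrate why $maxtok drops to 128 with that change (this is just a
sketch of the counting, not the actual tokenize.pl logic): once token
substitution characters are drawn only from the high byte range 0x80
through 0xFF, there are exactly 128 bytes available, none of which are
single-byte ASCII control characters.

```python
# Sketch: count the bytes available for token substitution once the
# loop starts at 0x80 instead of 1.  None of these are ASCII control
# characters (those live in 0x00-0x1F and 0x7F).
candidate_bytes = list(range(0x80, 0x100))

assert len(candidate_bytes) == 128          # hence $maxtok = 128
assert all(b >= 0x80 for b in candidate_bytes)  # no single-byte controls
```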

As to multi-byte strings that are somehow special in UTF-8 (you show
"\u2028" and "\u0085"), you could exclude (skip in the loop above) their
individual bytes such as 0xc2 and 0xe2 (if I got these right).  You'd
also need to decrease $maxtok further to 126.
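A quick check confirms those lead bytes (a Python one-liner here rather
than Perl, purely for verification): U+0085 (NEL) encodes in UTF-8 with
lead byte 0xC2, and U+2028 (LINE SEPARATOR) with lead byte 0xE2, so
those are indeed the two bytes to skip, leaving 126 usable bytes.

```python
# Verify the UTF-8 encodings of the two special characters Matt's blog
# post mentions, to confirm which lead bytes to exclude from the loop.
nel = "\u0085".encode("utf-8")   # NEXT LINE (NEL)
ls = "\u2028".encode("utf-8")    # LINE SEPARATOR

assert nel == b"\xc2\x85"        # lead byte 0xC2
assert ls == b"\xe2\x80\xa8"     # lead byte 0xE2

# Excluding 0xC2 and 0xE2 from the 128 high bytes leaves 126 tokens.
assert 128 - 2 == 126            # hence $maxtok = 126
```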

Regarding your first 25 for RockYou training not matching mine, I think
this is important because it indicates that something went wrong, and it
potentially invalidates your other results.  Sure, everything looks
similar, but if it's unexpectedly not the same, then we have no idea
whether e.g. incremental+tokenize would actually perform better than
OMEN with whatever error there was corrected.  You found one puzzling
display error with piping into "less" under WSL.  However, even with
that corrected, your first 25 list still doesn't match mine, right?

Anyway, all of this was presumably on a version of tokenize.pl from
before the now-known issues in it were fixed (issues which had made it
sub-optimal compared to the current version).  You don't mention what
version you use now.  I hope you've switched to the latest?  If so, it
isn't supposed to match the older version's output I had posted.  But I
can get a new test case to you.

Alexander
