Message-ID: <CAJ9ii1HTw7f7hHD0JLjyN60Hdr+3_gcmK8KNpbVNPoUdi+SvLw@mail.gmail.com>
Date: Wed, 4 Dec 2024 19:24:57 -0500
From: Matt Weir <cweir@...edu>
To: john-users@...ts.openwall.com
Subject: Re: Markov phrases in john

I wrote a new blog post with additional analysis of the Tokenizer and
OMEN attacks.

Link:
https://reusablesec.blogspot.com/2024/12/analyzing-tokenizer-part-2-omen.html

TLDR: I ran into significant problems training on the output of tokenize.pl.
The control characters it inserts into the training data cause problems
both when reading the data in and when writing the OMEN rulesets to
disk. As a result, I was unable to successfully combine the two attack
techniques. I plan to keep looking into this, but it's quickly turning
into a much bigger project than I originally expected.
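
In case it helps anyone reproduce this, below is a minimal Python
sketch for making those control characters visible. Everything in it is
an assumption on my part (the encoding and the guess that the markers
sit in the C0 control range) rather than anything taken from
tokenize.pl or the OMEN code:

#!/usr/bin/env python3
import sys

def show_controls(line: str) -> str:
    # Replace C0 control characters (except tab and newline) with
    # visible \xNN escapes so the token markers can be located and
    # counted.
    return "".join(
        ch if ch in ("\t", "\n") or ord(ch) >= 0x20 else f"\\x{ord(ch):02x}"
        for ch in line
    )

with open(sys.argv[1], encoding="latin-1") as fin:
    for line in fin:
        sys.stdout.write(show_controls(line))

Once you know which byte values are in play, it's easier to decide
whether to escape them or strip them before training.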

Cheers,
Matt/Lakiw

On Sat, Nov 30, 2024 at 9:40 PM Solar Designer <solar@...nwall.com> wrote:

> On Wed, Nov 20, 2024 at 03:14:31AM +0100, Solar Designer wrote:
> > Anyway, it is interesting that OMEN alone performed better for you than
> > incremental with tokenizer did.  My guess as to why is that incremental
> > does too much (when extended in other ways, like with the tokenizer) in
> > terms of separation by length and character position.
> >
> > I also had this guess when I had tried extending incremental to
> > 3rd-order Markov (4-grams) from its current 2nd-order (3-grams) while
> > preserving the length and position separation.  This resulted in only
> > slight and inconclusive improvement (at huge memory usage increase
> > and/or reduction in character set), so I didn't release that version.
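
As a toy illustration of that separation (an invented sketch, not JtR's
actual .chr layout or training code), the statistics can be pictured as
counts keyed by length and position as well as by the n-gram context;
moving from 3-grams to 4-grams adds one character of context to every
(length, position) cell, which is what multiplies the memory usage:

from collections import defaultdict

ORDER = 3  # 3rd-order Markov chain = 4-grams (3 characters of context)

counts = defaultdict(int)

def train(passwords):
    for pw in passwords:
        for pos, ch in enumerate(pw):
            context = pw[max(0, pos - ORDER):pos]
            # The key includes length and position, so each (length, pos)
            # pair carries its own n-gram table - the separation discussed
            # above, and the reason an extra order costs so much memory.
            counts[(len(pw), pos, context, ch)] += 1

train(["password", "letmein", "passw0rd"])
print(len(counts), "distinct (length, pos, context, next-char) counts")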
>
> I've created/closed a GitHub issue to record that experiment:
>
> https://github.com/openwall/john/issues/5584
>
> The patch is included in there, so please feel free to give it a try.
> I did not try it along with the tokenizer yet - would be interesting.
>
> > If I had more time, I'd try selectively removing that separation or/and
> > adding more fallbacks (like if a certain pair of characters never occurs
> > in that position for that length, see if it does for others and use that
> > before falling back to considering only one character instead of two).
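
Continuing the invented sketch above (and generalizing the character
pair to the n-gram context), that fallback chain might look like: try
the exact (length, position, context) cell first, then the same context
at any length and position, and only then shorten the context:

def lookup(counts, length, pos, context):
    # 1. Exact (length, position, context) cell
    exact = sum(v for (l, p, c, ch), v in counts.items()
                if (l, p, c) == (length, pos, context))
    if exact:
        return exact
    # 2. Same context at any length/position, before degrading the model
    anywhere = sum(v for (_l, _p, c, _ch), v in counts.items()
                   if c == context)
    if anywhere:
        return anywhere
    # 3. Last resort: drop to a shorter context (two chars -> one char)
    if context:
        return lookup(counts, length, pos, context[1:])
    return 0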
>
> Excerpt from an e-mail I wrote in late 2021:
>
> > For incremental mode, I got inconsistent results for a possible upgrade
> > from the current 3-grams to 4-grams, which I spent a couple of days on
> > last week.  In my tests so far, results vary from -11.5% to +24.9%, and
> > are commonly at around +5%.  This is by number of passwords cracked in
> > comparison to the currently released code trained in the same way.
> >
> > The variance is for different training sets, test sets, prior exclusion
> > or not of passwords crackable by wordlist+rules, and different attack
> > duration (such as 1 vs. 10 billion candidates tested).  While the
> > results are mostly positive, it is not entirely obvious which ones
> > reflect future real-world usage best.  Since there's significant extra
> > processing and memory consumption for 4-grams vs. 3-grams, we might want
> > to justify it with a greater improvement than what I'm getting so far.
> >
> > Compared to the current publicly released .chr files, the improvement is
> > more obvious - up to +39.4% in my tests so far - but much of it is also
> > possible without code change (with more extensive training sets).
>
> Alexander
>
