Message-ID: <20240515203400.GA16877@openwall.com>
Date: Wed, 15 May 2024 22:34:00 +0200
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Markov phrases in john

On Wed, May 08, 2024 at 07:34:41AM -0500, Adam Lininger wrote:
> Take a look at https://github.com/travco/rephraser.
> It was made by a friend of mine and intended to use markov chains to
> generate word phrases.

Thanks, I don't recall seeing this before.  I included it on slide 73:

---
-> Probabilistic candidate passphrase generators (2010s+) <-

* Probabilistic candidate password generators also happen to generate phrases
  if trained on such input (or just on a real-world mix of passwords/phrases)
  - PCFG fares better than per-character Markov chains

* "Phraser is a phrase generator using n-grams and Markov chains to generate
  phrases for passphrase cracking" in C# for Windows (2015)

* RePhraser "Python-based reimagining of Phraser using Markov-chains for
  linguistically-correct password cracking" (2020)
  - Also includes related hand-written and generated rulesets

* What about middle ground (e.g. syllables, including some followed by space)?
  - e.g. extract all substrings of 2+ characters, sort from most to least
    common, take top ~100, map them onto indices along with single characters,
    train/use existing probabilistic candidate password generators, map back
---

I would really like to give this "middle ground" tokenization idea of
mine a try.  If I had more time to play with this, I'd have already
tried it months ago.
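
In case anyone wants to experiment with this before I do, here's a rough
and untested Python sketch of the forward mapping as I imagine it - the
token length cap, the token count, and the choice of byte values 0x80+
for token codes are arbitrary placeholders:

---
# Untested sketch: pick the most common multi-character substrings from the
# training data and re-encode the data using one byte per token, so that an
# existing per-character generator can then be trained on the result.
# (Assumes the training data itself is printable ASCII only, as on the slide.)
from collections import Counter
import sys

MAXLEN = 4     # arbitrary cap on token length
NTOKENS = 100  # roughly how many multi-character tokens to use

lines = [l.rstrip('\n') for l in open(sys.argv[1], encoding='latin-1')]

counts = Counter()
for w in lines:
    for n in range(2, MAXLEN + 1):
        for i in range(len(w) - n + 1):
            counts[w[i:i+n]] += 1

# Map the top tokens onto byte values beyond printable ASCII
tokens = [t for t, _ in counts.most_common(NTOKENS)]
code = {t: chr(0x80 + i) for i, t in enumerate(tokens)}

def encode(w):
    # Greedy longest-match, left to right - a real tool might do better
    out, i = '', 0
    while i < len(w):
        for n in range(MAXLEN, 1, -1):
            if w[i:i+n] in code:
                out += code[w[i:i+n]]
                i += n
                break
        else:
            out += w[i]
            i += 1
    return out

with open(sys.argv[2], 'w', encoding='latin-1') as f:
    for w in lines:
        f.write(encode(w) + '\n')
---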

As I mentioned in the talk (not on the slides), the "map back" step can
be implemented as an external mode whose code (with embedded array
initialization) would be generated by the script performing the forward
mapping prior to training.  Thus, no external tools would be needed
during cracking - only John the Ripper and its usual files (e.g. some
.chr and .conf).  If we support all printable ASCII (but not beyond) as
single characters, we can also have ~128 arbitrary tokens, which I guess
would turn out to be common syllables and sometimes syllables followed
or preceded by an embedded space character (or whatever word separator
was common in the training data).  Even if we disallow separators from
being embedded in tokens, the pre-existing per-character (now per-token)
generator will still commonly introduce separators before/after certain
tokens if that's how the training data had it.
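
To illustrate, the logic the generated external mode would need to
implement amounts to this (shown in Python only for readability - the
real thing would hard-code the token table as array initializers and
rewrite the candidate word in place):

---
def decode(w, tokens):
    # Expand token codes (0x80+, as in the sketch above) back into their
    # multi-character strings; printable ASCII passes through unchanged
    out = ''
    for c in w:
        if ord(c) >= 0x80:
            out += tokens[ord(c) - 0x80]
        else:
            out += c
    return out

# e.g. tokens = ['er', 'an', 'th ', ...] as selected by the forward mapping
---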

Alexander
