Message-ID: <20240515203400.GA16877@openwall.com>
Date: Wed, 15 May 2024 22:34:00 +0200
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Markov phrases in john

On Wed, May 08, 2024 at 07:34:41AM -0500, Adam Lininger wrote:
> Take a look at https://github.com/travco/rephraser.
> It was made by a friend of mine and intended to use markov chains to
> generate word phrases.

Thanks, I don't recall seeing this before.  I included it on slide 73:

---
-> Probabilistic candidate passphrase generators (2010s+) <-

* Probabilistic candidate password generators also happen to generate
  phrases if trained on such input (or just on a real-world mix of
  passwords/phrases)
  - PCFG fares better than per-character Markov chains
* "Phraser is a phrase generator using n-grams and Markov chains to
  generate phrases for passphrase cracking" in C# for Windows (2015)
* RePhraser "Python-based reimagining of Phraser using Markov-chains
  for linguistically-correct password cracking" (2020)
  - Also includes related hand-written and generated rulesets
* What about middle ground (e.g. syllables, including some followed by
  space)?
  - e.g. extract all substrings of 2+ characters, sort from most to
    least common, take top ~100, map them onto indices along with
    single characters, train/use existing probabilistic candidate
    password generators, map back
---

I would really like to give this "middle ground" tokenization idea of
mine a try.  If I had more time to play with this, I'd have already
tried it months ago.

As I mentioned in the talk (not on the slides), the "map back" step can
be implemented as an external mode, the code for which (with embedded
array initialization) would be generated by the script performing the
forward mapping prior to training.  Thus, no external tools would be
needed during cracking - only John the Ripper and its usual files
(e.g. some .chr and .conf).
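For concreteness, here's a rough Python sketch of how I imagine the
forward mapping and the map-back step (not tested against real data;
the greedy longest-match encoding, the substring length cap, and all
function names are just illustrative choices, not an actual
implementation):

```python
# Sketch of the "middle ground" tokenization: extract frequent
# multi-character substrings from training data, assign them the ~128
# byte values above printable ASCII, and map back after generation.
from collections import Counter

def build_token_map(words, max_tokens=128, min_len=2, max_len=4):
    # Count all substrings of min_len..max_len characters
    counts = Counter()
    for w in words:
        for n in range(min_len, max_len + 1):
            for i in range(len(w) - n + 1):
                counts[w[i:i + n]] += 1
    # Most common substrings get codes 128+ (above printable ASCII)
    tokens = [s for s, _ in counts.most_common(max_tokens)]
    return {tok: chr(128 + i) for i, tok in enumerate(tokens)}

def encode(word, token_map):
    # Greedy longest-match forward mapping; single printable ASCII
    # characters pass through unchanged
    out, i = [], 0
    lengths = sorted({len(t) for t in token_map}, reverse=True)
    while i < len(word):
        for n in lengths:
            if word[i:i + n] in token_map:
                out.append(token_map[word[i:i + n]])
                i += n
                break
        else:
            out.append(word[i])
            i += 1
    return ''.join(out)

def decode(encoded, token_map):
    # The "map back" step - in actual use this lookup table would be
    # embedded in generated external mode code, not done in Python
    rev = {v: k for k, v in token_map.items()}
    return ''.join(rev.get(c, c) for c in encoded)
```

The encoded training set would then be fed to an existing per-character
generator (e.g. incremental mode training producing a .chr file), which
now effectively operates per-token.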
If we support all printable ASCII (but not beyond) as single
characters, we can also have ~128 arbitrary tokens, which I guess would
happen to be common syllables, and maybe sometimes syllables followed
or preceded by an embedded space character (or whatever word separator
may have been common in the training data).  Even if we disallow
separators embedded in tokens, the pre-existing per-character (now
per-token) generator will commonly introduce separators before/after
certain tokens if that's how the training data commonly had it.

Alexander