Message-ID: <20240515200308.GA16670@openwall.com>
Date: Wed, 15 May 2024 22:03:08 +0200
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Markov phrases in john

Hi,

This thread was very timely for my talk, but I didn't have time to
comment in here, so let me do that now.

On Tue, May 14, 2024 at 04:16:28PM +0200, Jens Timmerman wrote:
> I guess you could try to train a large language model on large lists of
> known leaked passphrases?
>
> These might perform better at learning the underlying patterns people
> use when thinking of passphrases than a simple Markov model.
>
> However, this might end up being computationally expensive, and
> probably also storage-intensive if you want to create a nice list of
> passphrases.

Right.  The current unsolved problem with generative NNs is that as you
get them to produce progressively lower-weight outputs, they also
produce progressively more duplicates.  The duplicates ratio is on the
order of 50% at 1 billion candidate passwords/phrases generated (but it
varies greatly).  I wonder what it would be at, e.g., 10 billion - 90%
maybe?  So yes, this also becomes storage-intensive if we try to
eliminate the duplicates.  Still, I'd like researchers of generative NN
based candidate password/phrase generators to release deduplicated
output lists of, say, 1 billion candidates, so that we could use them
and run comparisons against other tools without making those
time-consuming and unreliable setups ourselves (if documented and
reproducible at all, which unfortunately is usually not the case so
far).  See slides 74, 77.
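The growth of the duplicates ratio with output size can be illustrated
with a toy simulation.  This is a sketch only: the Zipf-like weighting
and the `duplicate_ratio` helper are illustrative assumptions standing
in for a generative model's output distribution, not any actual NN.

```python
# Illustrative only: i.i.d. sampling from a Zipf-like candidate
# distribution (a rough stand-in for a generative NN's output), to show
# that the duplicate ratio grows as more candidates are drawn.
import random

def duplicate_ratio(n_samples, vocab_size=1_000_000, s=1.0, seed=0):
    """Fraction of sampled candidates that repeat an earlier sample."""
    rng = random.Random(seed)
    # Zipf weights: P(rank r) proportional to 1 / r**s
    weights = [1.0 / (r ** s) for r in range(1, vocab_size + 1)]
    samples = rng.choices(range(vocab_size), weights=weights, k=n_samples)
    return 1.0 - len(set(samples)) / n_samples

for n in (10_000, 100_000, 1_000_000):
    print(n, round(duplicate_ratio(n), 3))
```

With a heavy-tailed distribution like this, drawing 10x more samples
noticeably raises the duplicate fraction, which is the same qualitative
effect as pushing a generative NN toward lower-weight outputs.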
Here are the excerpts from the Markdown source:

---

-> Probabilistic candidate password generation with neural networks (2010s+) <-

* William Melicher et al., "Fast, Lean, and Accurate: Modeling Password
  Guessability Using Neural Networks", 2016
  - Recurrent neural network (RNN) predicts next character, no duplicates
  - 60 MB model outperforms other generators, but apparently was too slow
    to actually go beyond 10 million candidates so that is only simulated
  - 3 MB performs almost as well, takes ~100 ms per password in JavaScript
* Generative Adversarial Networks (GAN) produce duplicates (~50% at 1 billion)
  - "PassGAN: A Deep Learning Approach for Password Guessing" (2017)
  - "Improving Password Guessing via Representation Learning" (2019)
  - "Generative Deep Learning Techniques for Password Generation" (2020)
    - David Biesner et al., VAE, WAE, fine-tuned GPT2 - maybe currently best?
  - "GNPassGAN: Improved Generative Adversarial Networks For Trawling
    Offline Password Guessing" (2022), "guessing 88.03% more passwords and
    generating 31.69% fewer duplicates" than PassGAN, which had already
    been outperformed

-> Future <-

[...]

* Focus
  - Better passphrase support (tools, datasets), arbitrary tokenization
  - Further neural networks, tackling the duplicates problem of generative NNs
    - Meanwhile, publicly release pre-generated and pre-filtered output
  - Application of NNs for targeting (scraping and training on user data)

---

> And that would be by design, I think the entire idea of using
> passphrases over passwords is that it makes password cracking a lot
> harder/more expensive.

Passphrases offer a good balance between the cost/risk to crack them and
user friendliness.

> But it would be interesting to see the results of such an approach.

We have some results up to the GPT-2 era so far - see above.

Alexander