Message-ID: <20241031185818.GA4992@openwall.com>
Date: Thu, 31 Oct 2024 19:58:18 +0100
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Markov phrases in john
On Sun, Oct 27, 2024 at 06:08:28AM +0100, Solar Designer wrote:
> On Wed, May 15, 2024 at 10:34:00PM +0200, Solar Designer wrote:
> > * What about middle ground (e.g. syllables, including some followed by space)?
> > - e.g. extract all substrings of 2+ characters, sort from most to least
> > common, take top ~100, map them onto indices along with single characters,
> > train/use existing probabilistic candidate password generators, map back
> I gave this a try now, see tokenize.pl in bleeding-jumbo.
> For the first test, I didn't focus on phrases yet, so no deliberate
> separator characters yet.
>
> For training, I used the full RockYou list (duplicates included), to
> match how the .chr files currently included with John the Ripper were
> generated. Specifically, comparing against ascii.chr and utf8.chr (as
> well as a freshly-regenerated file like this, to ensure no unrelated
> change crept in):
>
> -rw-------. 1 solar solar 5720262 Nov 12 2019 ascii.chr
> -rw-------. 1 solar solar 9286825 Nov 12 2019 utf8.chr
>
> -rw-------. 1 solar solar 9288427 Oct 27 03:56 custom-ref.chr
>
> I chose the token length range 2 to 4, and most of the top 158 tokens
> ended up being of length 2. See the output of tokenize.pl attached.
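As a rough illustration of the token selection idea described above (count substrings within a length range, sort from most to least common, keep the top N), here is a minimal Python sketch. It is not the actual tokenize.pl logic: the function name, the tie-breaking rule, and counting every occurrence rather than once per password are all assumptions made for the example.

```python
from collections import Counter

def top_tokens(passwords, min_len=2, max_len=4, count=158):
    """Return the `count` most frequent substrings of length min_len..max_len.

    Hypothetical helper: every occurrence is counted, and ties are broken
    alphabetically so that the output is deterministic.
    """
    freq = Counter()
    for pw in passwords:
        for n in range(min_len, max_len + 1):
            for i in range(len(pw) - n + 1):
                freq[pw[i:i + n]] += 1
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ranked[:count]]

# Toy stand-in for a training list such as password.lst or RockYou.
sample = ["love123", "lovely", "loveme", "lover1"]
tokens = top_tokens(sample, count=5)
print(tokens)
```

Each selected token would then be mapped onto an otherwise unused character code, so that the existing incremental mode training sees it as a single "character".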
I've now attached a new one, corresponding to the best result below.
> With RockYou pre-processed by the sed one-liner generated by tokenize.pl
> from the same RockYou list, the resulting .chr file is:
>
> -rw-------. 1 solar solar 26360011 Oct 27 04:20 custom.chr
"from the same RockYou list" was an error in my message - the tokens
were actually chosen based on our password.lst, which is a subset of
RockYou. However, I am also getting similar sizes (25 to 26 MB) for
.chr files generated with tokens coming from the RockYou list.
For comparison, I've also tried generating a .chr file using solely
password.lst for both steps - exactly as in the newly added usage
example - and this gave a much smaller file:
-rw-------. 1 solar solar 3401494 Oct 31 17:11 custom.chr
> I took my 10M representative sample from HIBPv8 generated by our
> pwned-passwords-sampler. That's 6969140 unique NTLM hashes. After
> running the RockYou wordlist on it (so that further tests would be
> out-of-sample), 1099399 are cracked and 5869741 remain.
>
> Running any of the 3 reference .chr files above for 1 billion candidate
> passwords increases the total unique cracks to either 1669440 or
> 1669463, which is +570064 over the pure wordlist run.
For an extra without-tokenizer reference, I've now done the same using a
deduplicated RockYou list, which we already knew tends to work better
than the dupes-included list (including the dupes when generating our
default .chr files was a mistake). The .chr file size is exactly the
same as with dupes included (same as custom-ref.chr shown above), but
the file content is very different. And indeed it cracks more than our
default files (the previous reference result). At 1 billion candidates
after the RockYou wordlist, the total is 1730377 unique cracks, or
+630978 over the wordlist. The average length of the extra 630978
passwords is 6.71.
> Alternatively, running the new tokenized .chr file along with
> --external=Untokenize for 1 billion candidate passwords, increases the
> total unique cracks to 1816315, or +716916 over wordlist.
With the new tokenize.pl, but otherwise the same test as above (so
password.lst used for determining the tokens, but full RockYou with
dupes for .chr file generation), this improved to 1821500 or +722101.
For comparison, the purely password.lst based test (with the 3 MB .chr
file above) gives 1809396 or +709997. This is what our current usage
example is capable of, without requiring any inputs that are not part of
the current jumbo tree. Not bad, but worse than both the previous best
and the results below.
Now getting rid of the password.lst mistake, here's what I get at
1 billion candidates when both steps use the same list, that is, when
training entirely on:
RockYou with dupes    - 1830570 or +731171
RockYou without dupes - 1870645 or +771246
Comparing the best result so far without the tokenizer vs. the best with
the tokenizer, it's an improvement from +630978 to +771246, or by 22%.
The average length of the extra 771246 passwords is 6.83, so this time
they're only very slightly longer than what we had without the tokenizer
(6.71).
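The gains quoted in this message are all relative to the 1099399 hashes cracked by the RockYou wordlist run alone. The short Python snippet below merely re-derives them and the 22% figure from the totals given above. The dictionary labels are made up for the example.

```python
WORDLIST_CRACKS = 1099399  # cracked by the RockYou wordlist run alone

# Total unique cracks after 1 billion incremental mode candidates.
totals = {
    "no tokenizer, with dupes (default files)": 1669463,
    "no tokenizer, without dupes": 1730377,
    "tokenizer, with dupes": 1830570,
    "tokenizer, without dupes": 1870645,
}

# Extra cracks on top of the wordlist run.
gains = {name: total - WORDLIST_CRACKS for name, total in totals.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain}")

best_plain = gains["no tokenizer, without dupes"]
best_tokenized = gains["tokenizer, without dupes"]
improvement = (best_tokenized - best_plain) / best_plain * 100
print(f"tokenizer improvement: {improvement:.0f}%")
```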
It's possible to tune for longer passwords, such as by excluding length
2 tokens. However, with otherwise the same input, I guess this would
cause incremental mode to try fewer-token strings first and to crack
fewer passwords. Such tuning may be more reasonable to do with or via a
pre-filtered training set (for use after more extensive other attacks
than just the wordlist) and once we re-focus this approach on phrases.
> Some passwords may be represented in more than one way via the tokens,
> which means that even though incremental mode outputs unique token
> strings, we're getting some duplicate candidate passwords after the
> external mode.
Curiously, with the mistakes/bugs corrected, for the best run above the
dupe rate is much lower than before - 997250925 unique (99.7%). My best
guess is the missing /g in the sed expression caused most of the dupes
previously, because it resulted in lots of sub-token training material.
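To illustrate how a missing /g could produce such duplicates, here is a small Python sketch. It assumes a hypothetical two-token table and placeholder character codes, and it only mimics a chain of sed s/// commands rather than reproducing the sed script that tokenize.pl generates. Without the global flag, only the first occurrence of each token on a line is replaced, so the same password ends up represented by a different token string that still decodes to it.

```python
# Hypothetical token table: each token maps to an otherwise unused code.
TOKENS = {"lo": "\x01", "ve": "\x02"}

def tokenize(line, replace_all=True):
    """Apply each substitution in turn, like a chain of sed s/// commands.

    replace_all=True mimics s/tok/code/g; False mimics s/tok/code/
    (first occurrence on the line only).
    """
    for tok, code in TOKENS.items():
        line = line.replace(tok, code, -1 if replace_all else 1)
    return line

def untokenize(line):
    """Map the codes back to their tokens (the role of the external mode)."""
    for tok, code in TOKENS.items():
        line = line.replace(code, tok)
    return line

with_g = tokenize("lovelove")            # four tokens
without_g = tokenize("lovelove", False)  # two tokens, then l, o, v, e

# Two distinct token strings that decode to the same candidate password.
print(with_g != without_g, untokenize(with_g) == untokenize(without_g))
```

A model trained on both kinds of strings can later emit both of them as distinct token strings, which become duplicates after decoding.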
First 25 candidates generated by the two best tokenized runs above are
puzzlingly very different between the two. RockYou with dupes:
123456
12345
lovely
loveme
justin
marian
12341234
love123
superman
12346
lovelove
123412
lover
justang
brandon
brandy
lovers
mario
lover1
marie1
marie123
loveyou
loveya
joshua
jessica
RockYou without dupes:
joshita
janey
susita
supers1
samira
jane1
samita
joshi1
suside
loveran
janette
janeth
sammit
sammies
joelie
joela
joshi
joshy
shaya
shaye
shaun1
shaunte
joshy1
joshya
susana
In this run (RockYou without dupes), the most common password 123456
comes as late as candidate number 781663, and 12345 is number 4335832.
It looks like using RockYou with dupes for training results in greater
overlap with the wordlist run.
Without tokenizer, the above difference is far smaller. Training on
RockYou with dupes (like our default .chr files):
123456
12345
111189
123455
111188
12344
121288
121289
112345
112344
123123
123121
121987
121989
111288
111289
112222
112224
11111
11110
marissa
12356
12354
11190
11191
RockYou without dupes:
marana
mara12
marisha
anana
mina12
minana
marang
mara11
mina11
minang
123456
123455
123420
123421
199111
199112
199101
199100
alena
mandan
manday
mandy1
mandys
millyn
milly1
Alexander
View attachment "john-local.conf" of type "text/plain" (7956 bytes)