Message-ID: <20241120021431.GA19322@openwall.com>
Date: Wed, 20 Nov 2024 03:14:31 +0100
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Markov phrases in john

On Tue, Nov 19, 2024 at 03:46:10PM -0500, Matt Weir wrote:
> >>  One thing that surprised me is that your top 25 for training
> >> on RockYou Full ...
> 
> Yup I included dupes. Two options are A) Our versions of Rockyou are
> slightly different (there are some variations of the list floating around)
> or B) The order of the passwords during training matters. For reference my
> version of RockYou has been randomized to make it easier to slice up. I can
> run it on the original list when I have some time to see if that's the
> cause.

I doubt it's either (or both) of these things, but indeed these are
among the possibilities until we've shown otherwise.  The differences
between versions of RockYou are minor, and I don't see why the order
would matter.  I also don't see why e.g. the locale settings would
matter in this case, but again this is worth testing, and maybe we
should prepend an explicit LC_CTYPE=C to the sed invocation.

> >> I thought you'd also try the tokenizer along with OMEN ...
> 
> It should be pretty easy, but fell outside my initial research. By easy I
> suspect it's an hour or two of work simply due to all the "known unknowns"
> that pop up when trying something new. My thought process is:
> 
> 1) Create the "tokenized" training set just like we do with Incremental
> (minus putting it in potfile format)
> 2) Run an OMEN training on it. Aka python3 pcfg_trainer.py -t
> TOKIZED_TRAINING_FILE.txt -r OMEN_TOK_TEST -c 0
> -- The '-c 0' option sets the coverage to be 0 so it will only generate
> OMEN guesses. Side note, setting '-c 1.0' means no OMEN guesses will be
> generated. Default is 0.6
> 3) Generate guesses using the PCFG tool and pipe it into JtR with the
> external mode. Aka: python3 pcfg_guesser -r OMEN_TOK_TEST | ./john --stdin
> --external=untokinize_omen --stdout

I've never tried using these tools (I really should make time for that),
but the above looks roughly like what I had thought you'd do.

> The "known unknown" for me where things could go sideways is passing the
> "tokenize control characters" via pipes and if weirdness occurs in my
> generating tool (pcfg) trying to output them. It might work, but it might
> not.

Should be no problem with the pipes, but could be within the tools.
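
FWIW, the pipe itself should be 8-bit clean.  A quick sketch to
convince oneself (hypothetical token byte values, with cat standing in
for the actual pcfg_guesser | john pipeline):

```python
import subprocess

# A candidate containing tokenizer-style control bytes (hypothetical
# values), newline-terminated the way john --stdin expects input.
line = b"the \x07\x0c test\x94\n"

# cat stands in for the real "pcfg_guesser | john --stdin" pipeline;
# a POSIX pipe passes every byte value through unchanged.
out = subprocess.run(["cat"], input=line, capture_output=True).stdout

print(out == line)  # True: the control bytes survive the pipe
```

So if anything mangles the bytes, it'd be in the tools' own I/O layers
(e.g. text-mode or locale-aware decoding), not in the pipe.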

Also, would you or someone please try the tokenizer along with JtR's
builtin Markov mode?

Anyway, it is interesting that OMEN alone performed better for you than
incremental with the tokenizer did.  My guess as to why is that
incremental, when extended in other ways such as with the tokenizer,
does too much separation by length and character position.

I also had this guess when I had tried extending incremental to
3rd-order Markov (4-grams) from its current 2nd-order (3-grams) while
preserving the length and position separation.  This resulted in only
slight and inconclusive improvement (at huge memory usage increase
and/or reduction in character set), so I didn't release that version.
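
The blow-up is easy to see from the context-table sizes alone.  A
back-of-envelope sketch (illustrative only; it ignores the further
multiplication by incremental's per-length and per-position separation):

```python
charset = 253                   # current maximum character count in a .chr file
entries_3gram = charset ** 3    # ~16.2M contexts in a 2nd-order (3-gram) model
entries_4gram = charset ** 4    # ~4.1G contexts in a 3rd-order (4-gram) model

# Each additional Markov order multiplies the table size by the charset size:
print(entries_4gram // entries_3gram)  # 253
```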

If I had more time, I'd try selectively removing that separation and/or
adding more fallbacks (e.g., if a certain pair of characters never
occurs in that position for that length, see whether it does for other
lengths and use that before falling back to considering only one
character instead of two).

Returning to the tokenizer and "Markov phrases", a while ago I also ran
experiments with my Project Gutenberg Australia phrase list (2 to 6
words, 2+ occurrences in books, with --rules=PhrasePreprocess applied)
as training material.  I got some preliminary results and intended to
improve upon them before posting here, but never got around to it.  So
here's some of what I got:

The phrase list is 759 MB, the fake pot file for training is 520 MB
(smaller than original phrase list due to tokenization), 41M lines (so
average phrase length 17).  The chr file is 38 MB.  Token lengths 2 to 4
as in our default tokenizer.  The tokens ended up being "common
syllables" and sometimes "syllables followed or preceded by an embedded
space character", as I had predicted, and occasionally even two letters,
or a letter and a syllable, separated by a space within the token.
Here's the generated
sed line to illustrate:

sed '/[^ -~]/d; s/ the/\x7/g; s/the /\xc/g; s/and /\x1d/g; s/ and/\x83/g; s/ing /\x94/g; s/ of /\xa1/g; s/ to /\xa4/g; s/that/\xa8/g; s/hat /\xad/g; s/ was/\xb6/g; s/ tha/\xbd/g; s/was /\xbe/g; s/ you/\xc2/g; s/his /\xcd/g; s/ther/\xd3/g; s/n th/\xdc/g; s/d th/\xe4/g; s/her /\xe7/g; s/t th/\xef/g; s/with/\xf3/g; s/you /\xf6/g; s/e th/\xfa/g; s/the/\x4/g; s/ th/\x6/g; s/he /\x8/g; s/and/\x17/g; s/nd /\x7f/g; s/ed /\x82/g; s/ing/\x88/g; s/ an/\x89/g; s/ to/\x8a/g; s/er /\x91/g; s/to /\x92/g; s/ of/\x93/g; s/of /\x95/g; s/her/\x97/g; s/ he/\x9a/g; s/at /\x9d/g; s/ng /\xa3/g; s/hat/\xa5/g; s/as /\xab/g; s/ ha/\xaf/g; s/ in/\xb2/g; s/ wa/\xb3/g; s/was/\xb8/g; s/e t/\xb9/g; s/re /\xba/g; s/tha/\xbb/g; s/ere/\xbc/g; s/in /\xbf/g; s/is /\xc1/g; s/you/\xc3/g; s/d t/\xc5/g; s/his/\xc7/g; s/ be/\xc9/g; s/on /\xcf/g; s/e a/\xd2/g; s/e w/\xd4/g; s/en /\xd7/g; s/ hi/\xd9/g; s/t t/\xdb/g; s/ll /\xe0/g; s/e s/\xe1/g; s/ wh/\xe2/g; s/n t/\xe3/g; s/for/\xe5/g; s/ yo/\xee/g; s/ a /\xf0/g; s/ent/\xf7/g; s/ no/\xf8/g; s/ co/\xfc/g; s/me /\xfd/g; s/ it/\xfe/g; s/e /\x1/g; s/he/\x2/g; s/th/\x3/g; s/ t/\x5/g; s/d /\x9/g; s/t /\xb/g; s/ a/\xe/g; s/s /\xf/g; s/in/\x10/g; s/er/\x11/g; s/an/\x12/g; s/ h/\x13/g; s/n /\x14/g; s/re/\x15/g; s/ w/\x16/g; s/nd/\x18/g; s/ s/\x19/g; s/ou/\x1a/g; s/ha/\x1b/g; s/ i/\x1c/g; s/on/\x1e/g; s/at/\x1f/g; s/ o/\x80/g; s/r /\x81/g; s/ed/\x84/g; s/en/\x85/g; s/y /\x86/g; s/to/\x87/g; s/hi/\x8b/g; s/as/\x8c/g; s/o /\x8d/g; s/it/\x8e/g; s/or/\x8f/g; s/ng/\x90/g; s/is/\x96/g; s/ar/\x98/g; s/of/\x99/g; s/ b/\x9b/g; s/st/\x9c/g; s/ve/\x9e/g; s/ m/\x9f/g; s/te/\xa0/g; s/es/\xa2/g; s/f /\xa6/g; s/se/\xa7/g; s/me/\xa9/g; s/nt/\xaa/g; s/le/\xac/g; s/ll/\xae/g; s/wa/\xb0/g; s/ c/\xb1/g; s/ f/\xb4/g; s/ea/\xb5/g; s/ne/\xb7/g; s/ d/\xc0/g; s/al/\xc4/g; s/be/\xc6/g; s/no/\xc8/g; s/g /\xca/g; s/ti/\xcb/g; s/a /\xcc/g; s/de/\xce/g; s/ho/\xd0/g; s/ro/\xd1/g; s/l /\xd5/g; s/ad/\xd6/g; s/li/\xd8/g; s/ut/\xda/g; s/el/\xdd/g; s/h /\xde/g; s/ l/\xdf/g; s/co/\xe6/g; s/om/\xe8/g; 
s/sh/\xe9/g; s/ p/\xea/g; s/ri/\xeb/g; s/ur/\xec/g; s/ee/\xed/g; s/ot/\xf1/g; s/ow/\xf2/g; s/wh/\xf4/g; s/ n/\xf5/g; s/ma/\xf9/g; s/ce/\xfb/g; s/ch/\xff/g; s/^/:/'

Here are the first 25 candidates this produces:

the shows
and some
the saying to
and w t
and so he
the show you
and w a
and soon and
and sooner
the bed he
the bed for
the m t
the m a
the saying on the
the boy and
the boyle
and well I
and well i
the saying in
of a t
of a th
and by the
and by a
the pans
the pan and

Clearly, my phrase list is suboptimal in that it includes many word
sequences that start or especially end mid-phrase.  This is bad for
training, and it is also bad for direct usage as a wordlist(+rules).
I should probably exclude them (all or most?) and try again.
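
As an aside, the way a sed script like the one above gets generated can
be sketched as follows: count 2- to 4-character substrings in the
training data, then map the most frequent ones to byte codes the data
itself doesn't use, substituting longer tokens first.  This is a
simplified re-implementation for illustration, not the actual tokenize
script:

```python
from collections import Counter

def build_token_map(lines, max_tokens=158):
    """Map frequent 2..4-char substrings to byte codes unused by the data."""
    counts = Counter()
    used = set()
    for line in lines:
        used.update(line.encode())
        for n in (2, 3, 4):
            for i in range(len(line) - n + 1):
                counts[line[i:i + n]] += 1
    # Candidate codes: anything except NUL and LF that the data doesn't use
    free = [c for c in range(1, 256) if c != 0x0a and c not in used]
    tokens = [s for s, _ in counts.most_common(min(max_tokens, len(free)))]
    # Substitute longer tokens first, so e.g. " the" is consumed before "the"
    tokens.sort(key=len, reverse=True)
    return dict(zip(tokens, free))

def tokenize(line, tokmap):
    for tok, code in tokmap.items():
        line = line.replace(tok, chr(code))
    return line
```

Decoding is just the reverse substitutions, which is what the generated
external mode undoes on john's side.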

Also, the original phrases don't use all of the printable ASCII, which
left many potential token codes unused, since the current tokenizer only
assigns tokens to codes outside printable ASCII.  I got a total of only 213
different "characters", whereas when training on RockYou it could be up
to the current maximum of 253.  We could have 40 more tokens here if the
tokenizer dynamically allocated all unused codes as potential tokens.
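
The accounting behind those numbers, assuming (as the sed output above
suggests) that token codes come from everything outside printable ASCII:

```python
MAX_CHR = 253                           # maximum distinct "characters" per .chr file
printable = set(range(0x20, 0x7f))      # 95 printable ASCII codes, reserved for data

token_codes = MAX_CHR - len(printable)  # 158 codes currently usable for tokens
data_chars = 213 - token_codes          # 55 distinct characters in the phrase data
unused = len(printable) - data_chars    # 40 printable codes a dynamic
                                        # allocator could reclaim for tokens
print(token_codes, data_chars, unused)  # 158 55 40
```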

Anyway, here's what cracking with this looks like (on one CPU core):

1 billion:

Loaded 6969140 password hashes with no different salts (NT [MD4 128/128 AVX 4x3])
Note: Passwords longer than 27 rejected
Warning: only 213 characters available
Press 'q' or Ctrl-C to abort, 'h' for help, almost any other key for status
82044g 0:00:02:12  621.5g/s 2425Kp/s 2425Kc/s 16758GC/s side off afoes..side offenboy
114014g 0:00:04:56  383.9g/s 2843Kp/s 2843Kc/s 19578GC/s four goe ha ha..four giance no
117608g 0:00:05:47  338.4g/s 2877Kp/s 2877Kc/s 19792GC/s ill lationa..ill latimer
Session stopped (max candidates reached)

10 billion (from scratch, not reusing the 1 billion above):

82755g 0:00:02:11  631.6g/s 2501Kp/s 2501Kc/s 17287GC/s dead not come he..dead not had to
105241g 0:00:03:40  478.3g/s 2781Kp/s 2781Kc/s 19173GC/s he hate acere..and Gaten
111355g 0:00:04:32  409.4g/s 2865Kp/s 2865Kc/s 19733GC/s shied ouseas..shied i sa
159497g 0:00:13:23  198.4g/s 3117Kp/s 3117Kc/s 21347GC/s went of gick..went of gioull
160544g 0:00:13:35  196.7g/s 3119Kp/s 3119Kc/s 21358GC/s KSen  to..the said I ours in
217094g 0:00:51:25  70.36g/s 3240Kp/s 3240Kc/s 22016GC/s a fester tomi..a festwho i p
Session stopped (max candidates reached)

100 billion (ditto):

175857g 0:00:20:16  144.5g/s 3155Kp/s 3155Kc/s 21556GC/s duke a crow i..duke a cros for
180903g 0:00:21:57  137.3g/s 3168Kp/s 3168Kc/s 21635GC/s i way indeoux..i way in a toes
182154g 0:00:22:33  134.5g/s 3172Kp/s 3172Kc/s 21662GC/s on thylston..on thylae
243763g 0:01:19:34  51.06g/s 3205Kp/s 3205Kc/s 21713GC/s blottore mon..blottor OB
245510g 0:01:24:15  48.57g/s 3183Kp/s 3183Kc/s 21556GC/s and mrained his limp..and mrained his arty
292117g 0:03:09:07  25.74g/s 3176Kp/s 3176Kc/s 21378GC/s hazu hipast tin..hazu hipas and of
294151g 0:03:10:14  25.77g/s 3180Kp/s 3180Kc/s 21405GC/s his farman sing..his farman the b
297221g 0:03:25:40  24.09g/s 3183Kp/s 3183Kc/s 21412GC/s nred th to no..nred thleve
302601g 0:03:45:29  22.37g/s 3167Kp/s 3167Kc/s 21289GC/s I'll thaly Thom..I'll thaly gait
311134g 0:04:29:45  19.22g/s 3161Kp/s 3161Kc/s 21214GC/s aget the mill becoe..aget the manner in an
318158g 0:04:59:04  17.73g/s 3156Kp/s 3156Kc/s 21167GC/s but was the verthed it in..but was the verthesisi
348129g 0:07:47:10  12.42g/s 3156Kp/s 3156Kc/s 21083GC/s pluty omen no..pluty omini
350695g 0:08:04:38  12.06g/s 3153Kp/s 3153Kc/s 21060GC/s id sears hell..id searsalom
350795g 0:08:10:39  11.92g/s 3152Kp/s 3152Kc/s 21052GC/s daroxing I waur..daroxing ecal
351535g 0:08:14:33  11.85g/s 3152Kp/s 3152Kc/s 21048GC/s her he-mede..her hemt was
353467g 0:08:26:40  11.63g/s 3151Kp/s 3151Kc/s 21035GC/s he senda below..he senda belle a
354956g 0:08:39:49  11.38g/s 3150Kp/s 3150Kc/s 21026GC/s markins arties'..markins all ring with
356067g 0:08:49:59  11.20g/s 3144Kp/s 3144Kc/s 20983GC/s sawrent thou bother..sawrent thou door
Session stopped (max candidates reached)

This is using the same HIBPv8 sample that I had run the RockYou-trained
tests against, but this time without pre-cracking by RockYou wordlist
(since in this case it's out-of-sample testing even without that).

As you can see, most candidates look somewhat like phrases, but are not.
Almost all are space-separated, which is consistent with the training
data.  They also become rather long.

As to phrases that actually got cracked, it's not a lot.  Out of the
successful guess counts seen above - e.g., 117608 at 1 billion
candidates - it's just these 6 for length 13+:

$ cut -d: -f2- tok100.pot | egrep '^.{13,}'
what the hell
what is yours
planningforce
superformance
smile the best
I am the King

At 10 billion, this increases to 18.  At 100 billion, it's 83, and some
of these are just long words (e.g., misunderstanding).  The more
interesting ones are these 19:

$ cut -d: -f2- tok102.pot | egrep '^.{13,}' | grep '[^a-z]'
what the hell
what is yours
smile the best
I am the King
counter strike
friend of Dad
im the coolest
prince of peace
one tree hill
cant remember
shadow-spirit
the casualties
greater harvest
what the fuck
is it to late
the art of war
personal only
full professor
cats and dogs

However, it's important to note that space-separated is by far not the
most common form of passphrases, at least not in HIBP, yet this is what
our training was for.

I had previously optimized the Phrase ruleset for usage with this phrase
list as a wordlist.  This optimized ruleset starts by trying the phrases
with spaces removed.  Here it is for reference, along with PhrasePreprocess
that I had mentioned above:

# A special ruleset intended for stacking before other Phrase* rules below,
# such that you have the option to run its output through "unique" first
[List.Rules:PhrasePreprocess]
/[ ] :
-c /[ ] l Q
/[ ] @' Q
-c /[ ] @' Q M l Q

# The main optimized Phrase ruleset, almost no duplicates with proper input
[List.Rules:Phrase]
# This one rule cracks ~1050 HIBP v7 passwords per million with sequences of
# 2 to 6 words occurring 2+ times across Project Gutenberg Australia books
# when our sequence list includes them in both their original case and
# all-lowercase, as well as both with apostrophes intact and removed (these
# variations are not implemented in this ruleset so as not to produce duplicates)
@?w Q
# Sorted separator characters: 1_24 -.3785690@,&+*!'$/?:=#~^%;`>"[)<]|({}\
# (the apostrophe is probably overrated since it also occurs inside words)
# Each character in 1_24 cracks ~82 to ~61 passwords per million
s[ ][1_24] Q
# Leaving the space separators intact cracks ~59 passwords per million
/[ ]
# Each character in -.3785690@ cracks ~53 to ~12 passwords per million
s[ ][\-.3785690@] Q
# Each character in ,&+*!'$/?:=#~ cracks ~10 to ~1 passwords per million
s[ ][,&+*!'$/?:=#~] Q

As my comments in there say, it was ~1050 per million with spaces
removed vs. ~59 with them intact.  (A thousand per million may seem low,
but let's keep in mind that most of those 1 million are not phrases.)
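
For readers less familiar with JtR rule syntax: @?w purges whitespace,
Q rejects the word if the rule changed nothing, /[ ] requires a space,
and each s[ ][...] line is preprocessor shorthand expanding to one
space-to-separator substitution per listed character.  Here's a minimal
Python rendering of the candidates the Phrase ruleset derives from one
phrase (my reading of the rules, not JtR itself):

```python
def phrase_rules(phrase):
    """Candidates the Phrase ruleset derives from one phrase, in ruleset order."""
    if " " not in phrase:
        return []                         # every rule rejects space-less input
    cands = [phrase.replace(" ", "")]     # '@?w Q': whitespace purged, tried first
    for sep in "1_24":                    # 's[ ][1_24] Q'
        cands.append(phrase.replace(" ", sep))
    cands.append(phrase)                  # '/[ ]': spaces left intact
    for sep in "-.3785690@":              # 's[ ][\\-.3785690@] Q'
        cands.append(phrase.replace(" ", sep))
    for sep in ",&+*!'$/?:=#~":           # last 's[ ][...] Q' line
        cands.append(phrase.replace(" ", sep))
    return cands

print(phrase_rules("cats and dogs")[:3])
# ['catsanddogs', 'cats1and1dogs', 'cats_and_dogs']
```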

So another obvious improvement to the training material may be to remove
the spaces or pre-apply a ruleset like Phrase above.

Also, seeing how nothing longer than 4 words (with separators) got
cracked in this way, maybe it's best to exclude longer phrases from the
training set (which also included 5- and 6-word sequences).

Yet another option may be to train on actually cracked phrases, but
there may be too few for good training, and besides this will skew the
training towards what was crackable by prior techniques.

For comparison, simply running my 41M phrase list against the same
HIBPv8 sample as in the recent tests so far cracks only 300 hashes.
That's an even lower success rate than the first 1 billion generated by
the tokenized incremental above (117608*0.041 = ~4800 per 41M).
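
The arithmetic behind that comparison, spelled out:

```python
list_size = 41_000_000        # phrases in the list, run once each as a wordlist
wordlist_cracks = 300         # cracks from the raw phrase list

tok_cracks = 117_608          # tokenized incremental cracks at 1 billion
tok_candidates = 1_000_000_000

# Scale the incremental run down to the wordlist's candidate count:
tok_per_41M = tok_cracks * list_size / tok_candidates
print(round(tok_per_41M), "vs", wordlist_cracks)  # 4822 vs 300
```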

Running the phrase list with --rules=Phrase cracks 53202.  That's still
a low rate, but let's keep in mind that most of the hashes we're
cracking are not actually of phrases.  So this approaches 0.1% of total.

Out of those 53202, 1788 are length 13+ and out of them 318 also contain
non-letters (separators), but only 43 are both length 13+ and use spaces
(not other separators).  The same 43 are also found among the 300
cracked without rules (so were in the original phrase list), which is
actually no surprise since only one rule preserves spaces and that one
doesn't do any mangling.

So the Phrase rules perform better than the tokenizer at cracking
actual phrases, whereas much of what the tokenizer cracks are
non-phrases (which is what contributes to its greater overall success
rate), despite it trying mostly phrase-looking strings.  But that's with
a poor training set that targeted a relatively uncommon phrase form
(space-separated).

Re-testing with an improved training set is needed.  Also, there's room
for relevant improvement of the tokenizer (reduce overlapping tokens,
use more token codes when available).

OTOH, the tokenizer run did crack some phrases that were not cracked by
the phrase list + rules.  Out of the 19 more interesting ones quoted
above, 9 are not in the list+rules cracks.  These numbers are indeed
very small, but that's extra cracks nevertheless.

Alexander
