Message-ID: <20241120021431.GA19322@openwall.com>
Date: Wed, 20 Nov 2024 03:14:31 +0100
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Markov phrases in john

On Tue, Nov 19, 2024 at 03:46:10PM -0500, Matt Weir wrote:
> >> One thing that surprised me is that your top 25 for training
> >> on RockYou Full ...
>
> Yup I included dupes. Two options are A) Our versions of Rockyou are
> slightly different (there are some variations of the list floating
> around) or B) The order of the passwords during training matters. For
> reference my version of RockYou has been randomized to make it easier
> to slice up. I can run it on the original list when I have some time
> to see if that's the cause.

I doubt it's either (or both) of these things, but indeed these are
among the possibilities until we've shown otherwise. The differences
between versions of RockYou are minor, and I don't see why the order
would matter. I also don't see why e.g. the locale settings would
matter in this case, but again this is worth testing, and maybe we
should add an explicit LC_CTYPE=C for the sed invocation.

> >> I thought you'd also try the tokenizer along with OMEN ...
>
> It should be pretty easy, but fell outside my initial research. By
> easy I suspect it's an hour or two of work simply due to all the
> "known unknowns" that pop up when trying something new. My thought
> process is:
>
> 1) Create the "tokenized" training set just like we do with
> Incremental (minus putting it in potfile format)
> 2) Run an OMEN training on it. Aka: python3 pcfg_trainer.py -t
> TOKIZED_TRAINING_FILE.txt -r OMEN_TOK_TEST -c 0
> -- The '-c 0' option sets the coverage to be 0 so it will only
> generate OMEN guesses. Side note, setting '-c 1.0' means no OMEN
> guesses will be generated. Default is 0.6
> 3) Generate guesses using the PCFG tool and pipe it into JtR with the
> external mode.
> Aka: python3 pcfg_guesser -r OMEN_TOK_TEST | ./john --stdin
> --external=untokinize_omen --stdout

I've never tried using these tools (I really should make time for
that), but the above looks roughly like what I had thought you'd do.

> The "known unknown" for me where things could go sideways is passing
> the "tokenize control characters" via pipes and if weirdness occurs in
> my generating tool (pcfg) trying to output them. It might work, but it
> might not.

Should be no problem with the pipes, but there could be within the
tools.

Also, would you or someone please try the tokenizer along with JtR's
builtin Markov mode?

Anyway, it is interesting that OMEN alone performed better for you than
incremental with the tokenizer did. My guess as to why is that
incremental does too much (when extended in other ways, like with the
tokenizer) in terms of separation by length and character position. I
also had this guess when I had tried extending incremental to 3rd-order
Markov (4-grams) from its current 2nd-order (3-grams) while preserving
the length and position separation. That resulted in only a slight and
inconclusive improvement (at a huge memory usage increase and/or a
reduction in character set), so I didn't release that version. If I had
more time, I'd try selectively removing that separation or/and adding
more fallbacks (like if a certain pair of characters never occurs in
that position for that length, see if it does for others and use that
before falling back to considering only one character instead of two).

Returning to the tokenizer and "Markov phrases", a while ago I also ran
experiments with my Project Gutenberg Australia phrase list (2 to 6
words, 2+ occurrences in books, with --rules=PhrasePreprocess applied)
as training material. I got some preliminary results and intended to
improve upon them before posting in here, but I never got around to it.
So here's some of what I got:

The phrase list is 759 MB. The fake pot file for training is 520 MB
(smaller than the original phrase list due to tokenization), 41M lines
(so the average phrase length is 17). The chr file is 38 MB. Token
lengths are 2 to 4, as in our default tokenizer. The tokens ended up
being "common syllables" and "sometimes syllables followed or preceded
by an embedded space character", as I had predicted, and occasionally
even two letters or a letter and a syllable separated by a space within
the token. Here's the generated sed line to illustrate:

sed '/[^ -~]/d; s/ the/\x7/g; s/the /\xc/g; s/and /\x1d/g; s/ and/\x83/g;
s/ing /\x94/g; s/ of /\xa1/g; s/ to /\xa4/g; s/that/\xa8/g; s/hat /\xad/g;
s/ was/\xb6/g; s/ tha/\xbd/g; s/was /\xbe/g; s/ you/\xc2/g; s/his /\xcd/g;
s/ther/\xd3/g; s/n th/\xdc/g; s/d th/\xe4/g; s/her /\xe7/g; s/t th/\xef/g;
s/with/\xf3/g; s/you /\xf6/g; s/e th/\xfa/g; s/the/\x4/g; s/ th/\x6/g;
s/he /\x8/g; s/and/\x17/g; s/nd /\x7f/g; s/ed /\x82/g; s/ing/\x88/g;
s/ an/\x89/g; s/ to/\x8a/g; s/er /\x91/g; s/to /\x92/g; s/ of/\x93/g;
s/of /\x95/g; s/her/\x97/g; s/ he/\x9a/g; s/at /\x9d/g; s/ng /\xa3/g;
s/hat/\xa5/g; s/as /\xab/g; s/ ha/\xaf/g; s/ in/\xb2/g; s/ wa/\xb3/g;
s/was/\xb8/g; s/e t/\xb9/g; s/re /\xba/g; s/tha/\xbb/g; s/ere/\xbc/g;
s/in /\xbf/g; s/is /\xc1/g; s/you/\xc3/g; s/d t/\xc5/g; s/his/\xc7/g;
s/ be/\xc9/g; s/on /\xcf/g; s/e a/\xd2/g; s/e w/\xd4/g; s/en /\xd7/g;
s/ hi/\xd9/g; s/t t/\xdb/g; s/ll /\xe0/g; s/e s/\xe1/g; s/ wh/\xe2/g;
s/n t/\xe3/g; s/for/\xe5/g; s/ yo/\xee/g; s/ a /\xf0/g; s/ent/\xf7/g;
s/ no/\xf8/g; s/ co/\xfc/g; s/me /\xfd/g; s/ it/\xfe/g; s/e /\x1/g;
s/he/\x2/g; s/th/\x3/g; s/ t/\x5/g; s/d /\x9/g; s/t /\xb/g; s/ a/\xe/g;
s/s /\xf/g; s/in/\x10/g; s/er/\x11/g; s/an/\x12/g; s/ h/\x13/g;
s/n /\x14/g; s/re/\x15/g; s/ w/\x16/g; s/nd/\x18/g; s/ s/\x19/g;
s/ou/\x1a/g; s/ha/\x1b/g; s/ i/\x1c/g; s/on/\x1e/g; s/at/\x1f/g;
s/ o/\x80/g; s/r /\x81/g; s/ed/\x84/g; s/en/\x85/g; s/y /\x86/g;
s/to/\x87/g; s/hi/\x8b/g; s/as/\x8c/g; s/o /\x8d/g;
s/it/\x8e/g; s/or/\x8f/g; s/ng/\x90/g; s/is/\x96/g; s/ar/\x98/g;
s/of/\x99/g; s/ b/\x9b/g; s/st/\x9c/g; s/ve/\x9e/g; s/ m/\x9f/g;
s/te/\xa0/g; s/es/\xa2/g; s/f /\xa6/g; s/se/\xa7/g; s/me/\xa9/g;
s/nt/\xaa/g; s/le/\xac/g; s/ll/\xae/g; s/wa/\xb0/g; s/ c/\xb1/g;
s/ f/\xb4/g; s/ea/\xb5/g; s/ne/\xb7/g; s/ d/\xc0/g; s/al/\xc4/g;
s/be/\xc6/g; s/no/\xc8/g; s/g /\xca/g; s/ti/\xcb/g; s/a /\xcc/g;
s/de/\xce/g; s/ho/\xd0/g; s/ro/\xd1/g; s/l /\xd5/g; s/ad/\xd6/g;
s/li/\xd8/g; s/ut/\xda/g; s/el/\xdd/g; s/h /\xde/g; s/ l/\xdf/g;
s/co/\xe6/g; s/om/\xe8/g; s/sh/\xe9/g; s/ p/\xea/g; s/ri/\xeb/g;
s/ur/\xec/g; s/ee/\xed/g; s/ot/\xf1/g; s/ow/\xf2/g; s/wh/\xf4/g;
s/ n/\xf5/g; s/ma/\xf9/g; s/ce/\xfb/g; s/ch/\xff/g; s/^/:/'

Here are the first 25 candidates this produces:

the shows
and some
the saying to
and w t
and so he
the show you
and w a
and soon and
and sooner
the bed he
the bed for
the m t
the m a
the saying on the
the boy and
the boyle
and well I
and well i
the saying in
of a t
of a th
and by the
and by a
the pans
the pan and

Clearly, my phrase list is suboptimal in that it includes many word
sequences that start or, especially, end mid-phrase. This is bad for
training, and it is also bad for direct usage as a wordlist(+rules). I
should probably exclude them (all or most?) and try again.

Also, the original phrases don't use all of the printable ASCII, which
left many potential token codes unused, as the current tokenizer is
limited to using non-ASCII for the tokens. I got a total of only 213
different "characters", whereas when training on RockYou it could be up
to the current maximum of 253. We could have 40 more tokens here if the
tokenizer dynamically allocated all unused codes as potential tokens.
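To make the above more concrete, here's a minimal Python sketch of what
the generated sed line does (and of what an "untokenize" step has to
invert): in-order replacement of frequent substrings with single unused
byte codes, longest tokens first. It uses only a small subset of the
token/code pairs from the sed line above, and it is not the actual
tokenizer, just an illustration of the idea:

```python
# A toy version of the sed-based tokenizer above: replace frequent
# substrings with single byte codes, longer tokens first. Only a small
# subset of the actual token/code pairs from the sed line is included.
TOKENS = [
    (" the", "\x07"), ("the ", "\x0c"), ("and ", "\x1d"),  # 4-char tokens
    ("the", "\x04"), (" th", "\x06"), ("he ", "\x08"),     # 3-char tokens
    ("e ", "\x01"), ("he", "\x02"), ("th", "\x03"),        # 2-char tokens
]

def tokenize(line):
    """Mimic the sed line: drop lines containing characters outside
    printable ASCII (the /[^ -~]/d part), then apply the substitutions
    in order."""
    if any(c < " " or c > "~" for c in line):
        return None
    for tok, code in TOKENS:
        line = line.replace(tok, code)
    return line

def untokenize(line):
    """What an external untokenize mode has to do: expand the byte
    codes back to their tokens, in reverse order of the substitutions."""
    for tok, code in reversed(TOKENS):
        line = line.replace(code, tok)
    return line
```

Note that since the tokens overlap (e.g., " the" vs. "the "), which one
wins depends purely on the substitution order, which is one reason
reducing overlapping tokens is a possible improvement.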
Anyway, here's what cracking with this looks like (on one CPU core):

1 billion:

Loaded 6969140 password hashes with no different salts (NT [MD4 128/128 AVX 4x3])
Note: Passwords longer than 27 rejected
Warning: only 213 characters available
Press 'q' or Ctrl-C to abort, 'h' for help, almost any other key for status
82044g 0:00:02:12 621.5g/s 2425Kp/s 2425Kc/s 16758GC/s side off afoes..side offenboy
114014g 0:00:04:56 383.9g/s 2843Kp/s 2843Kc/s 19578GC/s four goe ha ha..four giance no
117608g 0:00:05:47 338.4g/s 2877Kp/s 2877Kc/s 19792GC/s ill lationa..ill latimer
Session stopped (max candidates reached)

10 billion (from scratch, not reusing the 1 billion above):

82755g 0:00:02:11 631.6g/s 2501Kp/s 2501Kc/s 17287GC/s dead not come he..dead not had to
105241g 0:00:03:40 478.3g/s 2781Kp/s 2781Kc/s 19173GC/s he hate acere..and Gaten
111355g 0:00:04:32 409.4g/s 2865Kp/s 2865Kc/s 19733GC/s shied ouseas..shied i sa
159497g 0:00:13:23 198.4g/s 3117Kp/s 3117Kc/s 21347GC/s went of gick..went of gioull
160544g 0:00:13:35 196.7g/s 3119Kp/s 3119Kc/s 21358GC/s KSen to..the said I ours in
217094g 0:00:51:25 70.36g/s 3240Kp/s 3240Kc/s 22016GC/s a fester tomi..a festwho i p
Session stopped (max candidates reached)

100 billion (ditto):

175857g 0:00:20:16 144.5g/s 3155Kp/s 3155Kc/s 21556GC/s duke a crow i..duke a cros for
180903g 0:00:21:57 137.3g/s 3168Kp/s 3168Kc/s 21635GC/s i way indeoux..i way in a toes
182154g 0:00:22:33 134.5g/s 3172Kp/s 3172Kc/s 21662GC/s on thylston..on thylae
243763g 0:01:19:34 51.06g/s 3205Kp/s 3205Kc/s 21713GC/s blottore mon..blottor OB
245510g 0:01:24:15 48.57g/s 3183Kp/s 3183Kc/s 21556GC/s and mrained his limp..and mrained his arty
292117g 0:03:09:07 25.74g/s 3176Kp/s 3176Kc/s 21378GC/s hazu hipast tin..hazu hipas and of
294151g 0:03:10:14 25.77g/s 3180Kp/s 3180Kc/s 21405GC/s his farman sing..his farman the b
297221g 0:03:25:40 24.09g/s 3183Kp/s 3183Kc/s 21412GC/s nred th to no..nred thleve
302601g 0:03:45:29 22.37g/s 3167Kp/s 3167Kc/s 21289GC/s I'll thaly Thom..I'll thaly gait
311134g 0:04:29:45 19.22g/s 3161Kp/s 3161Kc/s 21214GC/s aget the mill becoe..aget the manner in an
318158g 0:04:59:04 17.73g/s 3156Kp/s 3156Kc/s 21167GC/s but was the verthed it in..but was the verthesisi
348129g 0:07:47:10 12.42g/s 3156Kp/s 3156Kc/s 21083GC/s pluty omen no..pluty omini
350695g 0:08:04:38 12.06g/s 3153Kp/s 3153Kc/s 21060GC/s id sears hell..id searsalom
350795g 0:08:10:39 11.92g/s 3152Kp/s 3152Kc/s 21052GC/s daroxing I waur..daroxing ecal
351535g 0:08:14:33 11.85g/s 3152Kp/s 3152Kc/s 21048GC/s her he-mede..her hemt was
353467g 0:08:26:40 11.63g/s 3151Kp/s 3151Kc/s 21035GC/s he senda below..he senda belle a
354956g 0:08:39:49 11.38g/s 3150Kp/s 3150Kc/s 21026GC/s markins arties'..markins all ring with
356067g 0:08:49:59 11.20g/s 3144Kp/s 3144Kc/s 20983GC/s sawrent thou bother..sawrent thou door
Session stopped (max candidates reached)

This is using the same HIBPv8 sample that I had run the RockYou-trained
tests against, but this time without pre-cracking by the RockYou
wordlist (since in this case it's out-of-sample testing even without
that). As you can see, most candidates look somewhat like phrases, but
are not. Almost all are space-separated, which is consistent with the
training data. They also become rather long.

As to phrases that actually got cracked, it's not a lot. Out of the
successful guess counts seen above - e.g., 117608 at 1 billion
candidates - it's just these 6 for length 13+:

$ cut -d: -f2- tok100.pot | egrep '^.{13,}'
what the hell
what is yours
planningforce
superformance
smile the best
I am the King

At 10 billion, this increases to 18. At 100 billion, it's 83, and some
of these are just long words (e.g., misunderstanding).
The more interesting ones are these 19:

$ cut -d: -f2- tok102.pot | egrep '^.{13,}' | grep '[^a-z]'
what the hell
what is yours
smile the best
I am the King
counter strike
friend of Dad
im the coolest
prince of peace
one tree hill
cant remember
shadow-spirit
the casualties
greater harvest
what the fuck
is it to late
the art of war
personal only
full professor
cats and dogs

However, it's important to note that space-separated is by far not the
most common form of passphrases, at least not in HIBP, yet this is what
our training was for.

I had previously optimized the Phrase ruleset for usage with this
phrase list as a wordlist. This optimized ruleset starts by trying the
phrases with spaces removed. Here it is for reference, along with
PhrasePreprocess that I had mentioned above:

# A special ruleset intended for stacking before other Phrase* rules below,
# such that you have the option to run its output through "unique" first
[List.Rules:PhrasePreprocess]
/[ ] :
-c /[ ] l Q
/[ ] @' Q
-c /[ ] @' Q M l Q

# The main optimized Phrase ruleset, almost no duplicates with proper input
[List.Rules:Phrase]
# This one rule cracks ~1050 HIBP v7 passwords per million with sequences of
# 2 to 6 words occurring 2+ times across Project Gutenberg Australia books
# when our sequence list includes them in both their original case and
# all-lowercase, as well as both with apostrophes intact and removed (these
# variations are not implemented in this ruleset not to produce duplicates)
@?w Q
# Sorted separator characters: 1_24 -.3785690@,&+*!'$/?:=#~^%;`>"[)<]|({}\
# (the apostrophe is probably overrated since it also occurs inside words)
# Each character in 1_24 cracks ~82 to ~61 passwords per million
s[ ][1_24] Q
# Leaving the space separators intact cracks ~59 passwords per million
/[ ]
# Each character in -.3785690@ cracks ~53 to ~12 passwords per million
s[ ][\-.3785690@] Q
# Each character in ,&+*!'$/?:=#~ cracks ~10 to ~1 passwords per million
s[ ][,&+*!'$/?:=#~] Q

As my comments
in there say, it was ~1050 per million with spaces removed vs. ~59 with
them intact. (A thousand per million may seem low, but let's keep in
mind that most of those 1 million are not phrases.) So another obvious
improvement to the training material may be to remove the spaces or
pre-apply a ruleset like Phrase above. Also, seeing how nothing longer
than 4 words (with separators) got cracked in this way, maybe it's best
to exclude longer phrases from the training set (which also included
5- and 6-word sequences). Yet another option may be to train on
actually cracked phrases, but there may be too few for good training,
and besides, this would skew the training towards what was crackable by
prior techniques.

For comparison, simply running my 41M phrase list against the same
HIBPv8 sample as in the recent tests so far cracks only 300 hashes.
That's an even lower success rate than the first 1 billion generated by
the tokenized incremental provided above (117608 * 0.041 = ~4800 per
41M). Running the phrase list with --rules=Phrase cracks 53202. That's
still a low rate, but let's keep in mind that most of the hashes we're
cracking are not actually of phrases. So this approaches 0.1% of the
total.

Out of those 53202, 1788 are length 13+, and out of them 318 also
contain non-letters (separators), but only 43 are both length 13+ and
use spaces (not other separators). The same 43 are also found among the
300 cracked without rules (so were in the original phrase list), which
is actually no surprise since only one rule preserves spaces and that
one doesn't do any mangling.

So the Phrase rules perform better than the tokenizer at cracking
actual phrases, whereas much of what the tokenizer cracks are
non-phrases (which is what contributes to its greater success rate)
despite it trying mostly phrase-looking strings. But that's with a poor
training set that targeted a relatively uncommon phrase form
(space-separated). Re-testing with an improved training set is needed.
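For clarity, the "~4800 per 41M" scaling above can be spelled out. This
is a rough estimate that assumes a uniform crack rate across the first
1 billion incremental candidates; in reality incremental front-loads
its better guesses, so its true count within a 41M-candidate budget
would be even higher, which only strengthens the comparison:

```python
# Numbers from the tests above
incremental_cracks_at_1e9 = 117608  # tokenized incremental, 1 billion guesses
phrase_list_size = 41_000_000       # lines in the phrase list
phrase_list_cracks = 300            # phrase list run as a wordlist, no rules

# Scale the incremental result down to a 41M-candidate "budget":
# 41e6 / 1e9 = 0.041, so 117608 * 0.041 = ~4800 expected cracks
scaled = incremental_cracks_at_1e9 * phrase_list_size / 1_000_000_000
print(round(scaled), "vs.", phrase_list_cracks)  # prints: 4822 vs. 300
```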
Also, there's room for relevant improvement of the tokenizer (reduce
overlapping tokens, use more token codes when available). OTOH, the
tokenizer run did crack some phrases that were not cracked by the
phrase list + rules. Out of the 19 more interesting ones quoted above,
9 are not in the list+rules cracks. These numbers are indeed very
small, but that's extra cracks nevertheless.

Alexander