Message-ID: <CAJ9ii1HqwuaQpnKxZ7rQQAYMQQFS3pX1Ao4eaR134wyKpg0Bqg@mail.gmail.com>
Date: Mon, 28 Oct 2024 21:08:39 -0400
From: Matt Weir <cweir@...edu>
To: john-users@...ts.openwall.com
Subject: Re: Markov phrases in john

I took a look at this, and I think there is a fundamental misunderstanding
on my part, as I've run into a lot of issues getting this attack to run.
I'm sure this is wrong, but I hope providing examples will help you correct
what I'm doing.

Step 1) Run tokenize.pl on the training passwords. Example:

./tokenize.pl TRAINING_PASSWORDS.txt

Step 2) Run the resulting sed script on the training passwords. Example:

cat TRAINING_PASSWORDS.txt | sed 's/1234/\x10/; s/love/\x15/; s/2345/\x1b/; s/3456/\x93/; s/ilov/\xe5/; s/123/\x6/; s/234/\x89/; s/ove/\x8c/; s/lov/\x90/; s/345/\x96/; s/456/\xa4/; s/and/\xb2/; s/mar/\xc1/; s/ell/\xd9/; s/199/\xdf/; s/ang/\xe0/; s/200/\xe7/; s/ter/\xe9/; s/198/\xee/; s/man/\xf4/; s/ari/\xfb/; s/an/\x1/; s/er/\x2/; s/12/\x3/; s/ar/\x4/; s/in/\x5/; s/23/\x7/; s/ma/\x8/; s/on/\x9/; s/el/\xb/; s/lo/\xc/; s/ri/\xe/; s/le/\xf/; s/al/\x11/; s/la/\x12/; s/li/\x13/; s/en/\x14/; s/ra/\x16/; s/es/\x17/; s/re/\x18/; s/19/\x19/; s/il/\x1a/; s/na/\x1c/; s/ha/\x1d/; s/am/\x1e/; s/ie/\x1f/; s/11/\x7f/; s/ch/\x80/; s/10/\x81/; s/00/\x82/; s/te/\x83/; s/ve/\x84/; s/as/\x85/; s/ne/\x86/; s/ll/\x87/; s/or/\x88/; s/ta/\x8a/; s/st/\x8b/; s/is/\x8d/; s/01/\x8e/; s/ro/\x8f/; s/20/\x91/; s/ni/\x92/; s/at/\x94/; s/34/\x95/; s/45/\x97/; s/it/\x98/; s/08/\x99/; s/mi/\x9a/; s/ca/\x9b/; s/ic/\x9c/; s/da/\x9d/; s/he/\x9e/; s/21/\x9f/; s/nd/\xa0/; s/me/\xa1/; s/ng/\xa2/; s/mo/\xa3/; s/ba/\xa5/; s/sa/\xa6/; s/ti/\xa7/; s/56/\xa8/; s/sh/\xa9/; s/ea/\xaa/; s/ia/\xab/; s/ol/\xac/; s/se/\xad/; s/ov/\xae/; s/be/\xaf/; s/de/\xb0/; s/co/\xb1/; s/ss/\xb3/; s/99/\xb4/; s/to/\xb5/; s/22/\xb6/; s/oo/\xb7/; s/02/\xb8/; s/ke/\xb9/; s/ee/\xba/; s/ho/\xbb/; s/ey/\xbc/; s/ck/\xbd/; s/ab/\xbe/; s/et/\xbf/; s/ad/\xc0/; s/13/\xc2/; s/07/\xc3/; s/pa/\xc4/; s/09/\xc5/; s/06/\xc6/; s/ki/\xc7/; s/98/\xc8/; s/hi/\xc9/; s/th/\xca/; s/05/\xcb/; s/14/\xcc/; s/25/\xcd/; s/ay/\xce/; s/ce/\xcf/; s/89/\xd0/; s/ac/\xd1/; s/os/\xd2/; s/ge/\xd3/; s/03/\xd4/; s/ka/\xd5/; s/ja/\xd6/; s/bo/\xd7/; s/do/\xd8/; s/04/\xda/; s/e1/\xdb/; s/nn/\xdc/; s/em/\xdd/; s/31/\xde/; s/15/\xe1/; s/18/\xe2/; s/ir/\xe3/; s/91/\xe4/; s/om/\xe6/; s/90/\xe8/; s/30/\xea/; s/nt/\xeb/; s/di/\xec/; s/si/\xed/; s/ou/\xef/; s/un/\xf0/; s/24/\xf1/; s/us/\xf2/; s/88/\xf3/; s/ai/\xf5/; s/78/\xf6/; s/y1/\xf7/; s/so/\xf8/; s/pe/\xf9/; s/ot/\xfa/; s/ga/\xfc/; s/ly/\xfd/; s/16/\xfe/; s/ed/\xff/' > tokenize.chr

Step 3) Create an entry in john.conf for the new charset. Example:

[Incremental:Tokenize]
File = $JOHN/tokenize.chr

Step 4) Run an incremental attack with the new charset. Example:

./john --incremental=Tokenize --stdout

Error that is printed out:

Incorrect charset file format: $JOHN/tokenize.chr

Note: I thought it might be a UTF-8 issue, so I repeated the above steps
using password.lst (picked because it doesn't have any UTF-8 characters,
not because it's a good training set). I also then repeated the experiment
using the sed script you attached in a previous e-mail vs. the one I
generated. Same error.

I'm sure I'm doing something wrong on my end, so if you could expand on
the steps you used to generate the custom .chr file, I'd appreciate it!

Cheers,
Matt/Lakiw

On Mon, Oct 28, 2024 at 12:09 AM Solar Designer <solar@...nwall.com> wrote:
> I've extracted some more data from my tests, below:
>
> On Sun, Oct 27, 2024 at 06:08:28AM +0100, Solar Designer wrote:
> > For comparison, continuing a reference run to 2 billion gives 1820805,
> > so is on par with the tokenized run at 1 billion. Average password
> > length among the 721406 cracked on top of wordlist is 6.70 characters.
> >
> > Continuing the tokenized run to 2 billion gives 1952392. Combined with
> > the reference 2 billion run's, it's 2229062 unique.
> > Average password
> > length among the first 721406 cracked by the tokenized run on top of
> > wordlist is 7.04 characters, so a slight increase over the non-tokenized
> > run's length.
>
> The average password lengths above were for cracked passwords. I now
> also have the corresponding average candidate password lengths. For a
> reference run to 2 billion, it's 7.04 characters (the match with 7.04
> above is accidental). For the tokenized run to 2 billion, it's 8.27
> characters. So there's a larger increase in candidate password length
> than in successfully cracked passwords' length.
>
> > Some passwords may be represented in more than one way via the tokens,
> > which means that even though incremental mode outputs unique token
> > strings, we're getting some duplicate candidate passwords after the
> > external mode. For example, in the first 1 million candidates generated
> > in my tests like the above, there are 994373 unique. In the first 10
> > million, 9885308 unique. In the first 100 million, 97567218 unique.
> > In the first 1 billion, 969765579 unique (97.0%).
> In the first 2 billion, 1934766835 unique (96.7%).
>
> > The above success was despite the duplicates included in the 1 or 2
> > billion counts. Results would be even better with them excluded, for
> > more unique passwords fitting those counts. Maybe we need to enable our
> > dupe suppressor for such runs. Maybe we need to adjust the approach to
> > tokenization to reduce or avoid duplicates - e.g., use fixed-length
> > non-overlapping tokens, but this would probably hurt in other ways.
> > Perhaps the duplicates would become more of a problem for longer runs
> > (beyond a few billion candidates).
>
> Some other ideas:
>
> 1. Try models such as our Markov mode or OMEN, which (I think) only
> generate token combinations that were actually seen in the training set.
> When we always substitute a substring with a token code before training,
> that substring is then never seen in the training set as-is, so those
> models shouldn't produce it by other means. Incremental mode is
> different in that it eventually tries all combinations, even those that
> were never seen, which is both a blessing and a curse.
>
> I'd appreciate it if someone from the john-users community runs such
> tests with models other than incremental mode.
>
> 2. Modify incremental mode so that it either doesn't try never-seen
> combinations or is token-aware and skips effectively-duplicate token
> strings (e.g., insist on certain ordering of tokens). This is tricky to
> implement and it has drawbacks - trying never-seen combinations may be
> desirable and filtering is slow. OTOH, having tokenization built in may
> improve ease of use and will speed up the reverse mapping (native code
> instead of external mode).
>
> 3. Choose a token lengths range that minimizes the duplicates rate.
> In my test so far, most tokens are of length 2 and I guess these are
> somewhat frequently matched by pairs of characters or by different
> grouping of characters and tokens. Fixing the token length at 3 or 4
> might reduce such occurrences a lot while still covering the really
> common substrings by the 158 tokens.
>
> Alexander
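[Editor's illustration] The duplicates rate discussed in the quoted message can be reproduced in miniature. The Python sketch below is illustrative only (it is not part of John the Ripper); its token-to-byte mapping is a small subset copied from the tokenize.pl-generated sed script quoted earlier in the thread. It enumerates every way to encode the string "ilove" as a mix of tokens and literal characters, and shows that all of these distinct token strings decode back to the same candidate password:

```python
# Illustrative sketch only -- not part of John the Ripper.
# Token-to-byte mapping: a small subset of the tokenize.pl-generated
# sed script quoted earlier in this thread.
TOKENS = {
    "love": "\x15", "ilov": "\xe5", "lov": "\x90", "ove": "\x8c",
    "lo": "\x0c", "ve": "\x84", "il": "\x1a", "ov": "\xae",
}
DECODE = {code: tok for tok, code in TOKENS.items()}

def encodings(s):
    """Enumerate every encoding of s as a mix of tokens and literals."""
    if not s:
        return [""]
    results = []
    for tok, code in TOKENS.items():  # consume a matching token...
        if s.startswith(tok):
            results += [code + rest for rest in encodings(s[len(tok):])]
    # ...or leave the first character as a literal
    results += [s[0] + rest for rest in encodings(s[1:])]
    return results

def decode(enc):
    """Map token bytes back to substrings (the external mode's job)."""
    return "".join(DECODE.get(c, c) for c in enc)

encs = encodings("ilove")
print(len(set(encs)), "distinct token strings")  # 13 for this subset
print({decode(e) for e in encs})                 # all decode to 'ilove'
```

Incremental mode deduplicates token strings, not decoded passwords, so every one of these encodings would come out of the external mode as the same candidate "ilove" - consistent with the roughly 3% duplicate rate observed in the 1- and 2-billion-candidate runs above.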