Message-ID: <CAJ9ii1EndAeQGQd1HHKW0wBEdOdwZy+LtMEZfjgrq+guXgsFkw@mail.gmail.com>
Date: Mon, 28 Oct 2024 23:46:45 -0400
From: Matt Weir <cweir@...edu>
To: john-users@...ts.openwall.com
Subject: Re: Markov phrases in john

Looking through this more, my guess is that the output of this sed script
needs to be put into potfile format so I can use --make-charset on it
(vs. using it to generate a .chr file directly). I can then run with
incremental=Tokenize to generate "encoded" guesses, which I then need to
run through the JtR external mode to convert into actual password guesses.
But my original request for an example still stands, since there are a lot
of steps there and I want to make sure I have them right. Also, I might be
totally wrong with this assumption as well ;p
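
If that guess is right, I'd expect the end-to-end pipeline to look
something like the sketch below. This is untested guesswork on my part:
tokenize.sed is a made-up name for a file holding the substitutions that
tokenize.pl prints, I'm assuming --make-charset will accept pot lines with
an empty hash field before the colon, and Untokenize is a placeholder for
whatever the decoding external mode is called.

# 1) Tokenize the training passwords with the generated substitutions
sed -f tokenize.sed TRAINING_PASSWORDS.txt > tokenized.txt

# 2) Dress each tokenized password up as a pot entry ("hash:plaintext",
#    here with an empty hash field) so --make-charset can read it
sed 's/^/:/' tokenized.txt > tokenized.pot

# 3) Build the charset from that potfile instead of from john.pot
./john --pot=tokenized.pot --make-charset=tokenize.chr

# 4) Generate "encoded" guesses and decode them back into passwords
./john --incremental=Tokenize --external=Untokenize --stdout

If any of those assumptions are off, that may well be where I'm going
wrong.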
Side note: This is a weird edge case, so this is a very low-priority
request, but one thing this made me realize is that it would be nice to be
able to use the --make-charset option on a set of training passwords vs. a
potfile, to remove a step in the generation process. That's just me being
lazy though, and I'll admit this is a task that is rare enough that
optimizing it doesn't provide much value.

Cheers,
Matt/Lakiw

On Mon, Oct 28, 2024 at 9:08 PM Matt Weir <cweir@...edu> wrote:

> I took a look at this, and I think there is a fundamental
> misunderstanding on my part, as I've run into a lot of issues getting
> this attack to run. I'm sure this is wrong, but I hope providing examples
> will help you correct what I'm doing.
>
> Step 1) Run tokenize.pl on the training passwords. Example:
> ./tokenize.pl TRAINING_PASSWORDS.txt
>
> Step 2) Run the resulting sed script on the training passwords. Example:
> cat TRAINING_PASSWORDS.txt | sed 's/1234/\x10/; s/love/\x15/;
> s/2345/\x1b/; s/3456/\x93/; s/ilov/\xe5/; s/123/\x6/; s/234/\x89/;
> s/ove/\x8c/; s/lov/\x90/; s/345/\x96/; s/456/\xa4/; s/and/\xb2/;
> s/mar/\xc1/; s/ell/\xd9/; s/199/\xdf/; s/ang/\xe0/; s/200/\xe7/;
> s/ter/\xe9/; s/198/\xee/; s/man/\xf4/; s/ari/\xfb/; s/an/\x1/; s/er/\x2/;
> s/12/\x3/; s/ar/\x4/; s/in/\x5/; s/23/\x7/; s/ma/\x8/; s/on/\x9/;
> s/el/\xb/; s/lo/\xc/; s/ri/\xe/; s/le/\xf/; s/al/\x11/; s/la/\x12/;
> s/li/\x13/; s/en/\x14/; s/ra/\x16/; s/es/\x17/; s/re/\x18/; s/19/\x19/;
> s/il/\x1a/; s/na/\x1c/; s/ha/\x1d/; s/am/\x1e/; s/ie/\x1f/; s/11/\x7f/;
> s/ch/\x80/; s/10/\x81/; s/00/\x82/; s/te/\x83/; s/ve/\x84/; s/as/\x85/;
> s/ne/\x86/; s/ll/\x87/; s/or/\x88/; s/ta/\x8a/; s/st/\x8b/; s/is/\x8d/;
> s/01/\x8e/; s/ro/\x8f/; s/20/\x91/; s/ni/\x92/; s/at/\x94/; s/34/\x95/;
> s/45/\x97/; s/it/\x98/; s/08/\x99/; s/mi/\x9a/; s/ca/\x9b/; s/ic/\x9c/;
> s/da/\x9d/; s/he/\x9e/; s/21/\x9f/; s/nd/\xa0/; s/me/\xa1/; s/ng/\xa2/;
> s/mo/\xa3/; s/ba/\xa5/; s/sa/\xa6/; s/ti/\xa7/; s/56/\xa8/; s/sh/\xa9/;
> s/ea/\xaa/; s/ia/\xab/; s/ol/\xac/; s/se/\xad/; s/ov/\xae/; s/be/\xaf/;
> s/de/\xb0/; s/co/\xb1/; s/ss/\xb3/; s/99/\xb4/; s/to/\xb5/; s/22/\xb6/;
> s/oo/\xb7/; s/02/\xb8/; s/ke/\xb9/; s/ee/\xba/; s/ho/\xbb/; s/ey/\xbc/;
> s/ck/\xbd/; s/ab/\xbe/; s/et/\xbf/; s/ad/\xc0/; s/13/\xc2/; s/07/\xc3/;
> s/pa/\xc4/; s/09/\xc5/; s/06/\xc6/; s/ki/\xc7/; s/98/\xc8/; s/hi/\xc9/;
> s/th/\xca/; s/05/\xcb/; s/14/\xcc/; s/25/\xcd/; s/ay/\xce/; s/ce/\xcf/;
> s/89/\xd0/; s/ac/\xd1/; s/os/\xd2/; s/ge/\xd3/; s/03/\xd4/; s/ka/\xd5/;
> s/ja/\xd6/; s/bo/\xd7/; s/do/\xd8/; s/04/\xda/; s/e1/\xdb/; s/nn/\xdc/;
> s/em/\xdd/; s/31/\xde/; s/15/\xe1/; s/18/\xe2/; s/ir/\xe3/; s/91/\xe4/;
> s/om/\xe6/; s/90/\xe8/; s/30/\xea/; s/nt/\xeb/; s/di/\xec/; s/si/\xed/;
> s/ou/\xef/; s/un/\xf0/; s/24/\xf1/; s/us/\xf2/; s/88/\xf3/; s/ai/\xf5/;
> s/78/\xf6/; s/y1/\xf7/; s/so/\xf8/; s/pe/\xf9/; s/ot/\xfa/; s/ga/\xfc/;
> s/ly/\xfd/; s/16/\xfe/; s/ed/\xff/' > tokenize.chr
>
> Step 3) Create an entry in john.conf for the new charset. Example:
> [Incremental:Tokenize]
> File = $JOHN/tokenize.chr
>
> Step 4) Run an incremental attack with the new charset. Example:
> ./john --incremental=Tokenize --stdout
>
> Error that is printed out:
> Incorrect charset file format: $JOHN/tokenize.chr
>
> Note: I thought it might be a UTF-8 issue, so I repeated the above steps
> using password.lst (picked because it doesn't have any UTF-8 characters,
> not because it's a good training set). I also then repeated the
> experiment using the sed script you attached in a previous e-mail vs. the
> one I generated. Same error.
>
> I'm sure I'm doing something wrong on my end, so if you could expand on
> the steps you used to generate the custom .chr file, I'd appreciate it!
>
> Cheers,
> Matt/Lakiw
>
>
> On Mon, Oct 28, 2024 at 12:09 AM Solar Designer <solar@...nwall.com>
> wrote:
>
>> I've extracted some more data from my tests, below:
>>
>> On Sun, Oct 27, 2024 at 06:08:28AM +0100, Solar Designer wrote:
>> > For comparison, continuing a reference run to 2 billion gives 1820805,
>> > so is on par with the tokenized run at 1 billion. Average password
>> > length among the 721406 cracked on top of wordlist is 6.70 characters.
>> >
>> > Continuing the tokenized run to 2 billion gives 1952392. Combined with
>> > the reference 2 billion run's, it's 2229062 unique. Average password
>> > length among the first 721406 cracked by the tokenized run on top of
>> > wordlist is 7.04 characters, so a slight increase over the
>> > non-tokenized run's length.
>>
>> The average password lengths above were for cracked passwords. I now
>> also have the corresponding average candidate password lengths. For a
>> reference run to 2 billion, it's 7.04 characters (the match with 7.04
>> above is accidental). For the tokenized run to 2 billion, it's 8.27
>> characters. So there's a larger increase in candidate password length
>> than in successfully cracked passwords' length.
>>
>> > Some passwords may be represented in more than one way via the tokens,
>> > which means that even though incremental mode outputs unique token
>> > strings, we're getting some duplicate candidate passwords after the
>> > external mode. For example, in the first 1 million candidates
>> > generated in my tests like the above, there are 994373 unique. In the
>> > first 10 million, 9885308 unique. In the first 100 million, 97567218
>> > unique.
>>
>> In the first 1 billion, 969765579 unique (97.0%).
>> In the first 2 billion, 1934766835 unique (96.7%).
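>>
>> Counts like these can be reproduced with a pipeline along the following
>> lines (a sketch; Untokenize is a stand-in for the reverse-mapping
>> external mode, and at the billion scale sort needs a lot of memory or
>> temporary disk space):
>>
>> ./john --incremental=Tokenize --external=Untokenize --stdout |
>>     head -n 1000000 | sort -u | wc -l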
>>
>> > The above success was despite the duplicates included in the 1 or 2
>> > billion counts. Results would be even better with them excluded, for
>> > more unique passwords fitting those counts. Maybe we need to enable
>> > our dupe suppressor for such runs. Maybe we need to adjust the
>> > approach to tokenization to reduce or avoid duplicates - e.g., use
>> > fixed-length non-overlapping tokens, but this would probably hurt in
>> > other ways. Perhaps the duplicates would become more of a problem for
>> > longer runs (beyond a few billion candidates).
>>
>> Some other ideas:
>>
>> 1. Try models such as our Markov mode or OMEN, which (I think) only
>> generate token combinations that were actually seen in the training
>> set. When we always substitute a substring with a token code before
>> training, that substring is then never seen in the training set as-is,
>> so those models shouldn't produce it by other means. Incremental mode
>> is different in that it eventually tries all combinations, even those
>> that were never seen, which is both a blessing and a curse.
>>
>> I'd appreciate it if someone from the john-users community runs such
>> tests with models other than incremental mode.
>>
>> 2. Modify incremental mode so that it either doesn't try never-seen
>> combinations or is token-aware and skips effectively-duplicate token
>> strings (e.g., insists on a certain ordering of tokens). This is tricky
>> to implement, and it has drawbacks - trying never-seen combinations may
>> be desirable, and filtering is slow. OTOH, having tokenization built in
>> may improve ease of use and will speed up the reverse mapping (native
>> code instead of external mode).
>>
>> 3. Choose a token length range that minimizes the duplicate rate. In my
>> test so far, most tokens are of length 2, and I guess these are
>> somewhat frequently matched by pairs of characters or by different
>> groupings of characters and tokens. Fixing the token length at 3 or 4
>> might reduce such occurrences a lot while still covering the really
>> common substrings with the 158 tokens.
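>>
>> To illustrate the regrouping problem: if "love", "lov", "ove", and "ov"
>> all get token codes (the byte values below are just for illustration),
>> then at least four distinct token strings decode to the same password.
>> A sketch of the reverse mapping:
>>
>> # Four different token strings that all decode to "love"
>> # (\x15 = "love", \x90 = "lov", \x8c = "ove", \xae = "ov";
>> # \x65 is just a plain "e").
>> for s in '\x15' 'l\x8c' '\x90e' 'l\xae\x65'; do
>>     printf "$s\n"
>> done | sed 's/\x15/love/g; s/\x90/lov/g; s/\x8c/ove/g; s/\xae/ov/g'
>>
>> All four lines come out as "love" - exactly the kind of duplicate a
>> dupe suppressor or a token-aware incremental mode would have to catch.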
>>
>> Alexander
>>
>