Message-ID: <CAJ9ii1EmrzuP3DacWGxtcd95JaOdLciBWhLjFu0_B903FpwKvQ@mail.gmail.com>
Date: Sun, 2 May 2021 23:00:34 -0400
From: Matt Weir <cweir@...edu>
To: john-users@...ts.openwall.com
Subject: Re: source of information for John's charset files

I apologize in advance if I misunderstood your testing procedure or your results, but using the HIBP list as a test set is really problematic when applying that to normal password cracking sessions. Duplicates matter and our techniques should reflect that. Making guesses of '123456' and 'password' before 'ajger' should be rewarded, but using the HIBP list all three guesses are awarded the same value.

I could see excluding the top 10k password guesses from an incremental training set (since '123456' and 'password' will almost certainly be cracked by a dictionary attack) to optimize how incremental plays with brute-force. But even that approach, while it seems like it makes sense, has backfired on me every time I have tried it, resulting in worse results when applying it to new datasets.

I've actually been looking into something similar with an "optimization" of the PCFG tool. I wanted to make OMEN play nicer with the dictionary-like attack that PCFG does, so I've tried to train OMEN on passwords that the other parts of the grammar didn't crack. My thinking was that OMEN would then specifically target those types of passwords. Long story short, those tests were an unmitigated disaster when I then applied the grammar against new test sets. It made my tool worse, not better.

Now I admit I could be wrong. Training on unique passwords might end up making Incremental mode better. But before we make those changes, I'd really like to see those tests run against a more realistic dataset than HIBP. I know HIBP is based on real passwords, but there are so many different artificialities that go into its construction that I have deep suspicions about using it as a representative password set.
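As an illustration of the point about duplicates (not part of the original message), here is a minimal Python sketch of how the same guesses score against a test set that keeps duplicates versus one that has been deduplicated. The function name `cracked_fraction` and the toy test set are made up for this example; real evaluations would use actual leaked-password counts.

```python
from collections import Counter


def cracked_fraction(guesses, test_passwords, count_dupes=True):
    """Return the fraction of the test set matched by the guesses.

    count_dupes=True:  every account counts, so guessing a common
                       password is worth more than guessing a rare one.
    count_dupes=False: the test set is deduplicated first, so every
                       distinct password is worth exactly the same.
    """
    if count_dupes:
        counts = Counter(test_passwords)
    else:
        counts = Counter(set(test_passwords))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    guessed = set(guesses)
    hits = sum(n for pw, n in counts.items() if pw in guessed)
    return hits / total


# Hypothetical toy test set: six accounts, three distinct passwords.
test_set = ["123456"] * 3 + ["password"] * 2 + ["ajger"]

for guess in ["123456", "password", "ajger"]:
    with_dupes = cracked_fraction([guess], test_set, count_dupes=True)
    unique = cracked_fraction([guess], test_set, count_dupes=False)
    print(f"{guess:10s} with dupes: {with_dupes:.3f}  unique: {unique:.3f}")
```

On this toy data the three guesses are worth 3/6, 2/6 and 1/6 of the accounts when duplicates are kept, but 1/3 each once the list is deduplicated, which is the effect described above.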
On a different point, I am totally ok with updating the training set from RockYou. I could go on and on about the weirdness of that dataset, not to mention that it really is showing its age. The gold standard right now of public datasets would probably be the LinkedIn list, which also is showing its age, but is a bit more comparable to current web passwords. The one advantage of the HIBP list is it does have some non-English datasets in it. That's a whole other conversation though on how to better incorporate other languages into cracking sessions.

Side note, I just saw your most recent results of training/running against RockYou. I'm willing to admit I'm wrong if you are getting better results training without dupes. That's just contrary to what I've seen in the past. I might need to run some tests of my own to look into this.

Cheers,
Matt/Lakiw

On Sun, May 2, 2021 at 5:39 PM Solar Designer <solar@...nwall.com> wrote:

> On Sun, May 02, 2021 at 11:21:34PM +0200, Solar Designer wrote:
> > Anyway, I just ran some tests the other way around - "cracking" RockYou
> > passwords.  I didn't try excluding RockYou itself from the training sets
> > here - can't do that while including our current .chr files in the
> > comparison.  So this is in-sample testing, which is generally a wrong
> > thing to do, but with that in mind here are the results for different
> > training sets (all are for incremental mode and 1 billion candidates):
> >
> > RockYou with dupes - 20.2%
> > RockYou unique - 21.9%
> > HIBPv7 cracked - 17.9%
> >
> > The percentages cracked are those of RockYou unique.
> >
> > Not surprisingly, RockYou is best fit for itself.  HIBP is an acceptable
> > fit as well.  It could have potentially performed better than RockYou
> > on this test due to its larger size, but as we can see that was not
> > enough to overcome it not being such a perfect fit as RockYou itself.
>
> FWIW, RockYou unique being best fit for itself persists after I shuffled
> it and split it into a 1M test set and 13.3M training set (no matching
> passwords in the sets, but both sets are parts of RockYou).  Got 21.5%.
>
> Alexander
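
The shuffle-and-split check described in the quoted text can be sketched roughly as follows. This is only an illustration under stated assumptions, not the procedure actually used: the helper name `split_unique`, the fixed seed, and the tiny sample list are all hypothetical, and the real run used a 1M test set and a 13.3M training set.

```python
import random


def split_unique(passwords, test_size, seed=0):
    """Deduplicate, shuffle, and split into (train, test) lists.

    Because the list is deduplicated before splitting, no password can
    appear in both the training set and the test set.
    """
    unique = sorted(set(passwords))   # sort first so the shuffle is reproducible
    rng = random.Random(seed)         # assumed fixed seed, for repeatability only
    rng.shuffle(unique)
    test = unique[:test_size]
    train = unique[test_size:]
    return train, test


# Hypothetical sample standing in for a real password list.
sample = ["123456", "password", "123456", "iloveyou", "ajger", "qwerty"]
train, test = split_unique(sample, test_size=2)
print(len(train), "training passwords,", len(test), "test passwords")
```

The training half would then be used to build a charset file and the test half to measure the percentage cracked; those steps depend on the tooling and are left out here.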