Message-ID: <20210505121450.GA16397@openwall.com>
Date: Wed, 5 May 2021 14:14:51 +0200
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: source of information for John's charset files

On Mon, May 03, 2021 at 04:18:15PM +0200, Solar Designer wrote:
> My expectation is that training on unique passwords only will in fact
> reduce the number of cracked accounts when only incremental mode is
> used. However, after having run through RockYou as a wordlist, it's not
> obvious whether it's beneficial for incremental mode to also favor the
> repeated passwords like 123456.
>
> What you suggest about excluding e.g. just the top 10k makes sense.
> Another approach I thought of, but I don't recall trying, is to apply a
> logarithmic scale to the counts. For example, for passwords appearing
> 1000+ times include them 4 times, for 100+ include them 3 times, etc.

I experimented with this now, and it may very well be a way to have the
best of both worlds, or at least a reasonable balance.

Compared to RockYou unique, I am getting the most improvement (overall
across different test sets) by adding to it a list of RockYou passwords
that appeared 3+ times on the original with-duplicates list. In other
words, a password that appeared 3+ times is listed twice, otherwise just
once.

Of course, this doesn't distinguish frequencies of passwords within top
~1 million, so e.g. "password" ranks way lower than it does in our
current default .chr files generated from the original with-duplicates
RockYou. However, "123456" still ranks first, because not only it but
also its substrings rank high.

I also tried adding top 100k and top 10k lists (giving 4 repeats for
passwords in top 10k), which hurt my tests on all-unique test sets a tiny
bit (possibly just noise), but it also brought "password" only a bit
higher.

> I decided to test these not only at 1 billion candidates, but also at
> other points.
> I use three training sets: RockYou with dupes (same as
> was used to generate our currently bundled .chr files - in fact, I just
> reuse ascii.chr from there), RockYou unique shuffled and 1M test set
> removed from it (so 13.3M training set), and HIBP v7 458M cracked (after
> removal of the fbobh_* pattern). The test set is always the mentioned
> 1M from RockYou unique shuffled. Here are the percentages cracked at
> 10M, 100M, 1G, 10G, 100G candidates:
>
> RockYou with dupes - 4.6%, 10.2%, 20.2%, 33.3%, 48.0%
> RockYou -1M unique - 4.7%, 11.2%, 21.5%, 35.0%, 48.3%
> HIBP v7 cracked    - 3.2%,  8.7%, 17.8%, 30.0%, 44.5%
>
> So despite of "RockYou -1M unique" being the only one 100% out-of-sample
> test (no password appears in both the training and the test set) and
> also having the smallest training set (at 13.3M), it outperforms the two
> other tests across this whole range.
>
> Of course, HIBP performing worse doesn't necessarily mean it's a worse
> choice in general - just that it's a worse fit for RockYou. We've also
> seen that when using a portion of HIBP as the test set, things are the
> other way around - training on the rest of HIBP produces better results
> than training on RockYou does.

Here's the mix of my new potion:

HIBP v7 cracked    - x1 (also includes RockYou)
RockYou unique     - x30
RockYou top 1M 3+  - x31

This gives a list of roughly double the size of HIBP v7 cracked: ~458M to
~924M. Generating a .chr file with --external=filter_ascii uses ~896M
from there (excludes the nested hashes, among other things).

This gives the highest weight to passwords appearing on RockYou 3+ times,
then to the rest of RockYou, and hopefully only uses HIBP to resolve ties
and as a fallback where RockYou lacks a definitive pattern.
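
For illustration only, here is a minimal Python sketch of how a weighted
list like the mix above could be assembled. It is not the tooling
actually used for these tests. The function name build_mix, its
parameters, and the constants are hypothetical, and it assumes that
"RockYou top 1M 3+" means the passwords seen 3 or more times on the
with-duplicates RockYou list.

```python
from collections import Counter
from typing import Iterable, Iterator

# Weights copied from the mix listed above. All names in this sketch
# are hypothetical and exist only for illustration.
HIBP_WEIGHT = 1
ROCKYOU_UNIQUE_WEIGHT = 30
ROCKYOU_3PLUS_WEIGHT = 31


def build_mix(hibp: Iterable[str],
              rockyou_with_dupes: Iterable[str],
              min_count: int = 3) -> Iterator[str]:
    """Yield training-list entries, repeating each source by its weight."""
    counts = Counter(rockyou_with_dupes)

    # HIBP cracked, taken once.
    for pw in hibp:
        for _ in range(HIBP_WEIGHT):
            yield pw

    # RockYou unique: every distinct password, regardless of its count.
    for pw in counts:
        for _ in range(ROCKYOU_UNIQUE_WEIGHT):
            yield pw

    # Assumed meaning of "3+": passwords seen min_count or more times.
    for pw, n in counts.items():
        if n >= min_count:
            for _ in range(ROCKYOU_3PLUS_WEIGHT):
                yield pw


if __name__ == "__main__":
    # Tiny made-up inputs, not real data.
    hibp = ["123456", "letmein", "hunter2"]
    rockyou = ["123456", "123456", "123456", "letmein"]
    for pw, n in sorted(Counter(build_mix(hibp, rockyou)).items()):
        print(pw, n)
```

Since the HIBP list also includes RockYou, a password seen 3+ times on
RockYou ends up with a total weight of 1 + 30 + 31 = 62, any other
RockYou password gets 1 + 30 = 31, and a HIBP-only password gets 1.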
The results directly comparable to those above, also repeated below for
comparison, are:

RockYou with dupes - 4.6%, 10.2%, 20.2%, 33.3%, 48.0%
RockYou -1M unique - 4.7%, 11.2%, 21.5%, 35.0%, 48.3%
HIBP v7 cracked    - 3.2%,  8.7%, 17.8%, 30.0%, 44.5%
New mix            - 4.9%, 10.8%, 20.8%, 34.0%, 47.6%

Also for comparison, the best I am able to achieve by training on (full)
RockYou only (processed in various ways) at 1 billion candidates (middle
column above) is 22.0%.

As a reminder, the above uses 1M of RockYou unique shuffled as the test
set, so except for the "RockYou -1M unique" line this is in-sample
testing. So these good results don't mean a lot, but they do mean that
the usage of HIBP hurting this test is mostly gone with the new mix.

This matters more together with another result:

As I mentioned earlier in this thread, training on HIBP v7 cracked not
surprisingly provided the best result when testing on HIBP v7 as well,
including with non-overlapping training and test sets. The results at
1 billion candidates were 5.4% for training on RockYou with duplicates,
6.0% for RockYou unique, and 6.6% for HIBP cracked unique. Well, here's
a new result: the new mix above achieves almost 6.6% as well (to be
specific, 6.64% vs. 6.59%). So maybe it's a good balance for targeting
yet unknown password sets.

> Also curious is how many different passwords the different training sets
> crack. At the 100G mark, the three runs above cracked a total of 52.5%.

Adding the new mix increases the total for the four 100G pots to 52.7% -
not much increase from the previous 52.5%.

Alexander