Message-ID: <20191208185932.GA25592@openwall.com>
Date: Sun, 8 Dec 2019 19:59:32 +0100
From: Solar Designer <solar@...nwall.com>
To: passwdqc-users@...ts.openwall.com
Subject: Re: curse words in passwords

Hi,

I finally approached the task of cleaning up our word list used for
generated passphrases, and adding other words to make up for the removed
ones and keep the count at 4096. This took some trial and error - e.g.,
some approaches didn't produce enough words. I describe below the
approach I ended up settling on.

I started with passwdqc's current list, dropped from it everything that
started with a capital letter (225 words) and added to it the overlap of
EFF's Diceware with the combination of English/1-tiny/lower.gz and
English/2-small/lower.gz in Openwall's wordlists collection limited to
lengths 3 to 6. This resulted in 5295 words.

Then I proceeded with manual edits of the list, mostly removing words
but also adding some. My current work-in-progress list has 4728 words.
This is more than 4096, which is a good problem to have - we can still
identify and drop more potentially problematic words without having to
come up with replacements.

My removals so far included these categories:

- Words that are too similar to each other in pronunciation (e.g.,
  right, rite, wright, and write).
- Words for which different valid spellings exist (e.g., gray and grey).
- Plural forms of nouns (leave only the singular, except where plural is
  the more common form in which case leave only the plural).
- Given names, countries, cities, languages, nationalities, and
  everything else that normally starts with a capital letter (e.g., even
  all month names, some of which are also people's given names).
- Pronouns and also the word "user".
- Curse or rude words, and words with such slang meanings.
- Words related to race, etc. Potential skin colors.
- Words related to religion, as well as "pig", "piggy", "pork".
- Words related to sexuality, and innuendo. Some body parts.
- Words related to drugs.
- Words related to death, murder, burial.
- Certain medical conditions (might need to drop benign ones as well,
  for completeness?)
- Other likely trigger words, e.g. related to pregnancy and abortion.
- Some other words that are OK on their own but would fall in the above
  categories if seen paired with other included words.
- Words like "bully", "harass", and "offend" (gone too far?)
- Words that are too obscure (but there are still many).

I also thought of (and briefly attempted, separately) achieving a nice
property that EFF's Diceware list has: that words can be concatenated
without separator characters yet produce no duplicate passphrases for
different word combinations. Unfortunately, to achieve it some common
and otherwise benign words would need to be dropped. Testing my
work-in-progress list for this property shows about 6000 duplicates
among 22 million word pairs, including e.g. these examples:

actorbit (can be "actor" + "bit" or "act" + "orbit")
allyear (can be "all" + "year" or "ally" + "ear")

just to illustrate what I'm talking about (like I say, there are about
6000 of these).

At this point, I'd appreciate the community's opinion and maybe help on
what else to drop. More problematic words? More obscure words?
Different forms of the same words (e.g., right now we have "bake",
"baked", "baker", and "bakery")? Words that are too similar in meaning
(was a stated property of current passwdqc's list, but something I
neglected with these updates so far)? Some of the length 6 words to
reduce the average passphrase length? Try to achieve the above property
to allow for concatenation without separators yet no security impact?
The more words we drop of one category (e.g., forms of the same simple
words, or length 6 words), the fewer we can drop from other categories
(e.g., obscure words), so even with 600+ words yet to drop it's a tough
decision.

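For anyone who wants to reproduce the concatenation test described above on their own copy of a list, here is a minimal sketch in Python. It is not the script used for the numbers in this message; the function name ambiguous_concatenations and the prefix-index approach are assumptions made for illustration. It relies on the observation that two different pairs can only collide when the first word of one pair is a proper prefix of the first word of the other.

```python
from collections import defaultdict


def ambiguous_concatenations(words):
    """Return {concatenation: set of (first, second) pairs} for every
    two-word string that splits into list words in more than one way."""
    wordset = set(words)

    # Index every word under each of its proper, non-empty prefixes,
    # so "orbit" is listed under "o", "or", "orb" and "orbi".
    by_prefix = defaultdict(list)
    for w in wordset:
        for j in range(1, len(w)):
            by_prefix[w[:j]].append(w)

    found = defaultdict(set)
    for a in wordset:
        for i in range(1, len(a)):
            p, r = a[:i], a[i:]
            if p not in wordset:
                continue
            # Here a = p + r with p a list word. Any list word c = r + b
            # whose tail b is also a list word gives a + b == p + c.
            for c in by_prefix.get(r, ()):
                b = c[len(r):]
                if b in wordset:
                    found[a + b].update({(a, b), (p, c)})
    return dict(found)


if __name__ == "__main__":
    sample = ["actor", "bit", "act", "orbit", "all", "year", "ally", "ear"]
    for text, pairs in sorted(ambiguous_concatenations(sample).items()):
        print(text, sorted(pairs))
```

On the small sample in the demo, this reports the two examples quoted above ("actorbit" and "allyear"). It avoids materializing all 22 million pair strings, at the cost of counting ambiguous strings rather than individual colliding pairs, so its totals need not match the "about 6000" figure exactly.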
Any words to add - e.g., was my decision to choose only singular or
plural and not both a wrong one? By including both, we could instead
drop more obscure words.

Attached to this e-mail are 3 word lists: passwdqc's current
(wordset_4k-old.txt), the result of initial automated processing as
mentioned above (wordset_4k-new-draft0.txt), and my current manually
edited list (wordset_4k-new-draft1.txt). Please feel free to review
these (and/or the changes between them) and make suggestions.

Overall, this is a lot of effort.

Alexander

On Sun, Sep 25, 2016 at 01:24:20PM +0200, Solar Designer wrote:
> On Sun, Sep 25, 2016 at 04:54:58PM +1000, Andrew Stuart wrote:
> > In less than 50 password generations I have had three passwords that included
> >
> > shit
> > cock
> > gay (not that this is a curse word
>
> And is e.g. cock a curse word? It depends.
>
> > but I'm wondering if some childish code underlies this password generator)
>
> Not sure what you mean here. That there was deliberate attempt to use
> controversial words? No, there was not. It's just that 4096 common
> English words of length up to 6 do indeed include these words above.
>
> > Is this some sort of joke? I am generating passwords to give to my users - can this software trusted? Can I expect it to generate more controversial words?
>
> Unfortunately, yes - it will generate more controversial words, and not
> only words, but also word combinations where each individual word would
> likely not be considered controversial on its own, but the combination
> is likely to be.
>
> We have a pending task to revise passwdqc's list of words to replace the
> more likely problematic ones - in terms of not only such words on their
> own, but also their use in passphrases. My current estimate is that
> maybe 200 words, if not more, will need to be replaced. 200 is about 5%
> of the total words we have. Unfortunately, this may make passphrases
> somewhat harder to memorize, but we probably have to make this change.
>
> Thank you for reminding us about this.
>
> Alexander

View attachment "wordset_4k-old.txt" of type "text/plain" (24510 bytes)
View attachment "wordset_4k-new-draft0.txt" of type "text/plain" (32221 bytes)
View attachment "wordset_4k-new-draft1.txt" of type "text/plain" (28759 bytes)
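
For readers comparing the attached lists, the automated step that produced wordset_4k-new-draft0.txt (described near the top of the message) might look roughly like the sketch below. This is an illustration under assumptions, not the actual processing used: the function name build_draft and its parameters are hypothetical, and reading the input files (the gzipped Openwall lists and the EFF list) is left out, so the function just takes iterables of words.

```python
def build_draft(current, eff, tiny, small, min_len=3, max_len=6):
    """Sketch of the described steps: keep the non-capitalized words of
    the current list, then add EFF words that also occur in the tiny or
    small list and whose length is within [min_len, max_len]."""
    # Drop everything that starts with a capital letter.
    kept = {w for w in current if not w[:1].isupper()}

    # "Combination" of the two Openwall lists is taken to mean their union
    # (an assumption about the wording in the message).
    common = set(tiny) | set(small)

    # Overlap of the EFF list with that union, limited by length.
    added = {w for w in eff if w in common and min_len <= len(w) <= max_len}

    return sorted(kept | added)


if __name__ == "__main__":
    draft = build_draft(
        current=["Adam", "apple", "zoo"],
        eff=["apple", "banana", "cat", "to", "elephant"],
        tiny=["cat", "to"],
        small=["banana", "elephant"],
    )
    print(draft)
```

Whether the length limit was applied to the overlap only, as here, or also to other inputs is not stated in the message, so treat that detail as a guess.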