Message-ID: <BLU0-SMTP83D6D02D4FFEAB01C75376FDBB0@phx.gbl>
Date: Sat, 18 Aug 2012 11:10:06 +0200
From: Frank Dittrich <frank_dittrich@...mail.com>
To: john-users@...ts.openwall.com
Subject: Re: Passphrase Creation

On 08/18/2012 08:57 AM, Solar Designer wrote:
> On Fri, Aug 17, 2012 at 10:04:35AM -0600, Kevin Young wrote:
>> I also create a no-space version at the same time. (Is there a mangling
>> rule that can handle this?)
>
> Yes. It's either:
>
> @?w
>
> or:
>
> @ :
>
> depending on whether you want to remove all whitespace characters (both
> space and tabs) or just the space character. In the latter rule, the
> colon (a no-op command) prevents the would-be trailing space from being
> inadvertently removed when editing the conf file.

I like this trick.

>> Step 6. Optimize and reduce
>> As expected there are lot of duplicates so my script performs a dictionary
>> sort and filters out the duplicates (sort and uniq). I also filter out
>> (grep) things like open source verbiage, distribution notices, credits, etc.
>
> FWIW, the "unique" program included with JtR is generally a lot faster
> than "sort -u" (or "sort | uniq"), but you may need to tune its memory
> usage (set it to the max of 2 GB with "-mem=25" in jumbo if you can
> afford that - which I guess you can given modern computers' RAM sizes).
> Of course, the result is different (not sorted), which may be good or
> bad depending on your input data and your needs.
>
> GNU sort may also be made a lot faster by letting it use more memory,
> e.g. "sort -S 14G" works well for me on a 16 GB RAM machine (and at this
> setting it may even outperform our "unique", which is limited to 2 GB,
> as long as the input is large enough).

With such a large list, there will be many duplicates.
Wouldn't it be better to sort the list by frequency, e.g.

    sort | uniq -c | sort -nr | sed 's/^ *[0-9]* //'

That way, the most frequent phrases will be at the top of the list.
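As a rough sketch, the frequency-sort pipeline could be wrapped in a small shell function. The name freq_sort and the sample phrases are made up for illustration; only sort, uniq and sed are assumed to be available.

```shell
#!/bin/sh
# Sketch: list each distinct phrase once, most frequent first.
# freq_sort is a hypothetical helper name, not something shipped with JtR.
freq_sort() {
    sort |                    # group identical lines together
    uniq -c |                 # collapse each group, prefixing its count
    sort -nr |                # highest count first
    sed 's/^ *[0-9]* //'      # strip the count that uniq -c added
}

# Tiny demo with made-up input: the duplicated phrase comes out first.
printf '%s\n' 'correct horse' 'battery staple' 'correct horse' | freq_sort
```

Phrases with the same count come out in whatever order the final sort leaves them, which should not matter for a wordlist.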
(Of course, a lot of single words or phrases consisting of just 2 words
will appear at the top of this list.
But you can still separate the large list into separate smaller ones,
depending on the number of words per phrase.)

Setting LC_ALL=C might also be useful.
Lexicographical sort order probably isn't that important in this case,
and sort, grep and other commands are much faster when LC_ALL=C,
compared to other locale settings like LANG=en_US.UTF-8.

Frank
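A possible way to combine both suggestions, as a sketch only: sort by frequency under LC_ALL=C, then let awk write each phrase to a file chosen by its word count. The function name freq_then_split and the phrases-N.txt file names are invented for this example, and the output directory is assumed to exist already.

```shell
#!/bin/sh
# Sketch: frequency-sort a phrase list, then split it by words per phrase.
# freq_then_split and the phrases-N.txt naming are hypothetical.
#   $1 = input phrase list, $2 = existing output directory
freq_then_split() {
    (
        # Byte-wise collation for every command in this subshell only.
        export LC_ALL=C
        sort "$1" | uniq -c | sort -nr | sed 's/^ *[0-9]* //' |
        # NF is the number of whitespace-separated words on the line;
        # each output file keeps the most-frequent-first order.
        awk -v dir="$2" 'NF > 0 { out = dir "/phrases-" NF ".txt"; print > out }'
    )
}
```

It would be called as, for example, freq_then_split biglist.txt outdir, where both names are placeholders. Running the pipeline in a subshell keeps the locale change from leaking into the rest of the script.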