john-users - Re: wordlist generation



Message-ID: <20091028234533.GA23548@openwall.com>
Date: Thu, 29 Oct 2009 02:45:33 +0300
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: wordlist generation

On Sat, Oct 24, 2009 at 02:43:59AM +0200, SL wrote:
> What is the recommended/preferable method to convert an arbitrary  
> text file (SQL dump, con-'cat'-enated HTML files, Wikipedia XML  
> export, not a precompiled dictionary) into a (reasonably usable) john  
> wordlist?
> 
> cat $textfile | tr -s -c "[:alpha:]\-??????????????" "\n" | ./unique  
> wordlist.lst
> kind of works, but I wonder if there are better ways?

You're on the right track.

When I need something like this, I generally try to combine several
approaches.  Specifically, I pass the input files through several
different tr's, splitting up "words" on different characters - e.g., in
one of the invocations a dash will be a delimiter, but in another it
will be part of the target "word".
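
A minimal sketch of the multi-pass idea, assuming a POSIX shell and tr.
The file names sample.txt and words.lst are made up for illustration,
and the two delimiter sets are only examples; real input will likely
call for more passes with other character sets.

```shell
#!/bin/sh
# Hypothetical sample input, created here so the sketch is self-contained.
printf '%s\n' 'The well-known pass-phrase: hunter2, re-used!' > sample.txt

{
    # Pass 1: letters, digits and the dash are treated as word characters.
    tr -s -c '[:alnum:]-' '\n' < sample.txt
    # Pass 2: the dash is a delimiter like any other non-alphanumeric.
    tr -s -c '[:alnum:]' '\n' < sample.txt
} | LC_ALL=C sort -u > words.lst
```

With this input, words.lst ends up holding both "well-known" and its
parts "well" and "known", so either form can be tried as a candidate.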

When processing files of a known format, such as SQL dumps, I may also
use "sed" to extract and un-escape the values - e.g., for proper
handling of apostrophes and backslashes that are part of the values
vs. those added by the SQL dump's escaping.
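
One possible way to do this, as a sketch only.  It assumes a
MySQL-style dump in which string values are single-quoted and embedded
quotes and backslashes are backslash-escaped.  Here "grep -o" (available
in GNU and BSD grep) does the extraction and "sed" does the
un-escaping, which is a simplification of the all-sed approach.  The
file names dump.sql and values.lst are hypothetical.

```shell
#!/bin/sh
# Hypothetical one-line dump, created here so the sketch is self-contained.
cat > dump.sql <<'EOF'
INSERT INTO users VALUES (1,'o\'brien','pa\\ss');
EOF

# 1. Pull out each single-quoted string, treating backslash + any
#    character as part of the string rather than as a terminator.
# 2. Strip the surrounding quotes, then drop the escaping backslash
#    in front of each escaped character.
grep -oE "'(\\\\.|[^'\\\\])*'" dump.sql |
    sed -e "s/^'//" -e "s/'\$//" -e 's/\\\(.\)/\1/g' > values.lst
```

Note that this treats every backslash escape the same way, so an
escape such as \n would come out as a plain "n" rather than a newline;
a real dump may need extra sed expressions for such cases.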

Then the resulting stream is passed through "sort -u" or "sort | uniq"
(the standard Unix commands) or "unique" (the program included with
JtR).  The latter tends to be quicker because it does not need to do
any sorting.  However, when the input data is not already ordered in a
meaningful way, it may be better to sort the resulting wordlist
alphabetically, as that lets some optimizations in JtR work: detecting
effective duplicates when the hash type truncates passwords at a
certain length, and speeding up DES key setup.  On the other hand, if
the hashes are fast to compute and you do not intend to apply many
rules to your wordlist, you may choose to save time on generating the
wordlist and use the quicker "unique".
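
The two options can be sketched side by side.  The file names raw.lst,
sorted.lst and unique.lst are made up for illustration, and the
"unique" step assumes the binary has been built in the current
directory (as in the command quoted above); it is skipped otherwise.

```shell
#!/bin/sh
# Hypothetical unsorted candidate list with duplicates.
printf '%s\n' password letmein password admin letmein > raw.lst

# Option 1: sorted and de-duplicated.  LC_ALL=C gives plain byte
# ordering, independent of the current locale.
LC_ALL=C sort -u raw.lst > sorted.lst

# Option 2: de-duplicated without sorting, using JtR's "unique",
# which reads stdin and writes to the file named on its command line.
if [ -x ./unique ]; then
    rm -f unique.lst            # start from a fresh output file
    ./unique unique.lst < raw.lst
fi
```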

BTW, "unique" can be made even quicker by increasing the values of
UNIQUE_HASH_LOG and UNIQUE_BUFFER_SIZE in params.h.  The defaults are
rather conservative (using around 9 MB of RAM).

Alexander
