Message-ID: <20180911154246.GA3070@openwall.com>
Date: Tue, 11 Sep 2018 17:42:46 +0200
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: good program for sorting large wordlists

Hi,

On Tue, Sep 11, 2018 at 05:19:18PM +0200, JohnyKrekan wrote:
> Hello, I would like to ask whether someone has experience with good tool to sort large text files with possibilities such as gnu sort. I am using it to sort wordlists but when I tried to sort 11 gb wordlist, it crashed while writing final output file after writing around 7 gb of data  and did not delete some temp files. When I was sorting smaller (2gb) wordlist it took me just about 15 minutes while this 11 gb took 4.5 hours (Intel core I 7 2.6ghz, 12 gb ram, ssd drives).

Most importantly, usually you do not need to "sort" - you just need to
eliminate duplicates.  In fact, in many cases you'd prefer to eliminate
duplicates without sorting, namely when your input list is already
ordered roughly by non-increasing estimated probability of hitting a
real password - e.g., if it's produced by concatenating common/leaked
password lists first with other general wordlists next, and/or by
pre-applying wordlist rules (which their authors generally order such
that better performing rules come first).

You can eliminate duplicates without sorting using JtR's bundled
"unique" program.  In jumbo and running on a 64-bit platform, it will by
default use a memory buffer of 2 GB (the maximum it can use).  It does
not use any temporary files (instead, it reads back the output file
multiple times if needed).  You can use it e.g. like this:

./unique output.lst < input.lst

or:

cat ~/wordlists/* | ./unique output.lst

or:

cat ~/wordlists/common/* ~/wordlists/uncommon/* | ./unique output.lst

or:

./john -w=password.lst --rules=jumbo --stdout | ./unique output.lst
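
To illustrate what "eliminate duplicates without sorting" means, here is
a tiny sketch.  It deliberately does not use "unique" itself: awk stands
in for it, and the temporary directory and sample words are made up for
the demo.  Only the first occurrence of each line is kept, so the
original ordering survives, whereas "sort -u" reorders the list.

```shell
# Demo only: awk is a stand-in here, NOT how JtR's "unique" works internally.
tmpdir=$(mktemp -d)
printf '%s\n' 123456 password 123456 letmein password > "$tmpdir/input.lst"

# Print a line only the first time it is seen; input order is preserved.
awk '!seen[$0]++' "$tmpdir/input.lst" > "$tmpdir/output.lst"

# For comparison: sort -u also drops duplicates, but loses the ordering.
LC_ALL=C sort -u "$tmpdir/input.lst" > "$tmpdir/sorted.lst"

cat "$tmpdir/output.lst"
```

Note that this awk one-liner keeps every distinct line in memory, so it
is only suitable for lists that fit in RAM.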

As to sorting, recent GNU sort from the coreutils package works well.
You'll want to use the "-S" option to let it use more RAM and thus fewer
temporary files, e.g. "-S 5G".  You can also use e.g. "--parallel=8".

As to it running out of space for the temporary files, perhaps you have
your /tmp on tmpfs, so in RAM+swap, and this might be too limiting.  If
so, you may use the "-T" option, e.g. "-T /home/user/tmp", to let it use
your SSDs instead.  Combine this with e.g. "-S 5G" to also use your RAM.
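
Put together, a GNU sort invocation with these options might look like
the sketch below.  The paths, the small "-S 64M" buffer and
"--parallel=2" are placeholders chosen so that the demo runs anywhere;
for a real multi-GB wordlist you'd use values along the lines of the
ones mentioned above.  The "-u" (drop duplicates), "-o" (output file)
and LC_ALL=C (plain byte-order comparison, which is usually faster) are
my additions for the example.

```shell
# Demo directory layout (hypothetical); -T must point at an existing directory.
workdir=$(mktemp -d)
mkdir -p "$workdir/tmp"
printf '%s\n' zebra apple mango apple zebra > "$workdir/input.lst"

# -S: RAM buffer size, --parallel: number of sort threads,
# -T: where temporary files go, -u: unique lines only, -o: output file.
LC_ALL=C sort -u -S 64M --parallel=2 -T "$workdir/tmp" \
    -o "$workdir/sorted.lst" "$workdir/input.lst"

cat "$workdir/sorted.lst"
```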

As to "it crashed while writing final output file after writing around 7
gb of data", did you possibly put the output file in /tmp as well?  Just
don't do that.
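
If you're unsure whether your /tmp is on tmpfs, something like the
following should tell you.  This is a sketch assuming GNU coreutils (the
"-T" option of df and the "-f -c %T" options of stat are GNU-specific);
the variable name is just for illustration.

```shell
# Show filesystem type and free space for /tmp (-T adds the type column).
df -hT /tmp

# Capture just the filesystem type name, e.g. "tmpfs" or "ext2/ext3".
tmp_fstype=$(stat -f -c %T /tmp)
echo "/tmp is on: $tmp_fstype"
```

If this reports tmpfs, point both "-T" and the output file at a
directory on your SSDs instead.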

I hope this helps.

Alexander
