Message-ID: <006d01cbe2c5$dd1c5f90$97551eb0$@net>
Date: Mon, 14 Mar 2011 23:02:55 -0500
From: "jfoug" <jfoug@....net>
To: <john-users@...ts.openwall.com>
Subject: RE: unique

>From: Solar Designer [mailto:solar@...nwall.com]
>On Mon, Feb 28, 2011 at 09:23:20AM -0600, jfoug wrote:
>> Unique is nice, but you HAVE to bump up the params.h values a lot to
>> make it work on anything other than a tiny file.
>
>You mean, to make it work "fast".  

Yes, exactly. It did work, but it really ground away on disk I/O ;)

>Because it does work on huge files
>even with default settings, just slowly.
>
>Also, I bumped up the defaults between 1.7.4 and 1.7.4.2.  Starting with
>that version, unique uses about 70 MB of RAM by default (previously, it
>would use about 9 MB by default).  

I bumped way above that. Here is the read I/O needed at each buffer
size, assuming the file is 'mostly' unique:

At 9mb buffer (the old default):
100mb file  700mb read
500mb file   14gb read
1gb   file   57gb read
2gb   file  224gb read
5gb   file 1389gb read

At 70mb buffer (the current default):
100mb file  170mb read
500mb file  2.4gb read
1gb   file    9gb read
2gb   file   29gb read
5gb   file  179gb read

At 384mb buffer (where I run):
100mb file  100mb read
500mb file  884mb read
1gb   file    2gb read
2gb   file    8gb read
5gb   file   40gb read

Now, the above is the worst case (data already unique, or nearly so).
If the file is highly redundant, all buffer sizes will see a comparable
reduction, but I believe the proportions will stay close to what is
listed.  So yes, at the current default 70mb buffer, running a 1gb file
through unique is not 'too' bad.

The file 'read' I/O is classic O(n^2).  Strictly speaking, the amount
of data read is a triangular number, (n^2+n)/2 buffer-fulls, but that
still reduces to plain O(n^2).  Here n is sizeof(file)/sizeof(buffer),
i.e. the number of buffer reads from the file.  Shrinking that ratio
(by growing the buffer) greatly reduces the overall impact of an O(n^2)
algorithm.  With the 9mb buffer and a 5gb file, n is about 555, while
with the 384mb buffer n is only about 13.  Each 384mb operation is
larger than a 9mb one, but I will take 13^2 reads over 555^2 reads any
day.
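
For reference, here is a minimal C sketch of the estimate behind the
tables above.  It assumes unique re-reads everything it has written so
far before appending each new buffer-full (my reading of the algorithm,
not code taken from unique itself):

#include <math.h>
#include <stdio.h>

/* Worst-case (mostly-unique input) read estimate: the input is read
 * once, and before buffer-full i is appended, the (i-1)*B bytes of
 * output written so far get re-read.  With n = ceil(N / B)
 * buffer-fulls, total read = N + B * n*(n-1)/2. */
static double worst_case_read_mb(double file_mb, double buf_mb)
{
	double n = ceil(file_mb / buf_mb);
	return file_mb + buf_mb * n * (n - 1.0) / 2.0;
}

int main(void)
{
	const double bufs[] = { 9.0, 70.0, 384.0 };
	const double files[] = { 100.0, 500.0, 1000.0, 2000.0, 5000.0 };
	int b, f;

	for (b = 0; b < 3; b++) {
		printf("At %.0fmb buffer:\n", bufs[b]);
		for (f = 0; f < 5; f++)
			printf("%6.0fmb file %8.0fmb read\n", files[f],
			    worst_case_read_mb(files[f], bufs[b]));
	}
	return 0;
}

It reproduces the tables to within a few percent; the leftover
differences are just rounding of n.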

NOTE: I have not looked at the unique program in a long time.  You may
have changed how the base algorithm works (though I do not think so).
I imagine you just made the code run faster (such as faster hashing),
but the way the algorithm deals with disk I/O is likely the same.
 
>With these settings, unique is
>reasonably usable on wordlists of up to a few hundred megabytes.  So
>maybe your comment applies to a pre-1.7.4.2 version.

Yes, the last time I looked at unique was before that version.

>That said, I think a future version should have the memory buffer size
>tunable from the command-line or/and in john.conf.  This has been on my
>to-do for ages.

It would be a wise addition, and it should add no overhead to unique's
runtime.
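
For example, something like this (hypothetical: the -mem switch name,
its megabyte meaning, and the parsing below are my invention, nothing
from the current unique):

#include <stdlib.h>
#include <string.h>

static size_t buffer_size = (size_t)70 << 20;	/* today's 70mb default */

/* hypothetical -mem=MEGS switch; one strtoul at startup, so the knob
 * itself costs nothing while unique runs */
static void parse_mem_option(const char *arg)
{
	if (!strncmp(arg, "-mem=", 5)) {
		unsigned long megs = strtoul(arg + 5, NULL, 10);
		if (megs)
			buffer_size = (size_t)megs << 20;
	}
}

A john.conf setting could feed the same variable.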

>>   -inp=file     Allows file to be read, vs only reading from stdin.
>
>What for?  Is it somehow more useful on Windows (IIRC, you're on
>Windows).

Yes I am.  I do not remember whether I added this because of
redirection limitations in the windoze cmd.exe.  IIRC it had to do with
2gb limits, but I am a little foggy on just why.
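
For what it is worth, the whole idea behind -inp= fits in a few lines.
This is just a sketch of the idea (and the open_input name is mine),
not the actual code from my patch:

#include <stdio.h>
#include <stdlib.h>

/* reopen the named file onto stdin, so the rest of unique keeps
 * reading stdin and cmd.exe redirection never enters the picture */
static void open_input(const char *input_name)
{
	if (input_name && !freopen(input_name, "rb", stdin)) {
		perror(input_name);
		exit(1);
	}
}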

>>   --2nd_uniq_file=file
>
>We need to come up with self-explanatory names for these options.

If you have ideas, please post.  Making command-line switch changes is
trivial.  The bigger task is getting any changes I have made into your
newer code.

The most useful 'real' change is the --2nd_uniq_file processing (under
whatever command-line switch name you think is better).  The verbosity
is not really useful (just a pacifier).

Jim
