|
Message-ID: <006d01cbe2c5$dd1c5f90$97551eb0$@net> Date: Mon, 14 Mar 2011 23:02:55 -0500 From: "jfoug" <jfoug@....net> To: <john-users@...ts.openwall.com> Subject: RE: unique >From: Solar Designer [mailto:solar@...nwall.com] >>On Mon, Feb 28, 2011 at 09:23:20AM -0600, jfoug wrote: >> Unique is nice, but you HAVE to bump up the params.h values a lot to >>make it work, on anything other than a tiny file. > >You mean, to make it work "fast". Yes, exactly. It did work, but really ground on the disk IO ;) >Because it does work on huge files >even with default settings, just slowly. > >Also, I bumped up the defaults between 1.7.4 and 1.7.4.2. Starting with >that version, unique uses about 70 MB of RAM by default (previously, it >would use about 9 MB by default). I bumped way over that, but at this level, here is the IO needed, assuming the file is 'mostly' unique. 9mb buffer size: 100mb file 700mb read 500mb file 14gb read 1gb file 57gb read 2gb file 224gb read 5gb file 1389gb read At 70mb buffer 100mb file 170mb read 500mb file 2.4gb read 1gb file 9gb read 2gb file 29gb read 5gb file 179gb read At 384mb buffer (where I run) 100mb file 100mb read 500mb file 884mb read 1gb file 2gb read 2gb file 8gb read 5gb file 40gb read Now, above is worst case (data already unique, or almost). However, if the file is highly redundant, all buffer sizes will get a likewise reduction, but I believe the above proportions will stay close to what is listed. So yes, at the current default 70mb buffer level, a 1gb file to unique, is not 'too' bad. The file 'read' IO is classic O(n^2). Actually the runtime of the algorithm is a 'triangular number' so it is actually O((n^2+n)/2), but that reduces to simple O(n^2). In this case, n is sizeof(file)/sizeof(buffer) (or number of buffer reads from file). Thus reducing this ratio (increasing the size of the buffer), greatly reduces the overall impact of a O(n^2) algorithm. So for that 9mb buffer, and 5gb file, n is about 555 while 384mb buffer n is only about 13. Each 384mb operation is larger than the 9mb, but I will take O(13^2) over O(555^2) any day. NOTE, I have not looked at the unique program in a long time. You may have changed how the base algorithm works (but I do not think so). I image you just made the code run faster (such as faster hashing), but the algorithm as it deals with disk IO is likely the same. >With these settings, unique is >reasonably usable on wordlists of up to a few hundred megabytes. So >maybe your comment applies to a pre-1.7.4.2 version. Yes, when I did look at unique, was before that version. >That said, I think a future version should have the memory buffer size >tunable from the command-line or/and in john.conf. This has been on my >to-do for ages. Would be a wise addition, and should add no overhead to runtime of unique. >> -inp=file Allows file to be read, vs only reading from stdin. > >What for? Is it somehow more useful on Windows (IIRC, you're on >Windows). Yes I am. I do not remember if the reason I added this was due to redirection limitations in windoze cmd.exe or not. IIRC, I think it had to do with 2gb limits, but I am a little foggy on remembering just why. >> --2nd_uniq_file=file > >We need to come up with self-explanatory names for these options. If you have ideas, please post. To make command line switch changes is trivial. The biggest change to make, is to get any changes I have done into your newer code. The most useful 'real' changes are the addition of the --2nd_uniq_file (or whatever command line switch you think is better) processing. The verbosity is not really useful (just a pacifier). Jim
Powered by blists - more mailing lists
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.