Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20080127052520.GA1281@openwall.com>
Date: Sun, 27 Jan 2008 08:25:20 +0300
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: wordlist compression and runtime decompression (was: wordlist mode improvement maybe)

Replying to both helleye and Marti at once:

On Sat, Jan 26, 2008 at 08:41:55PM +0000, helleye wrote:
> so when we read the new wordlist , we will need to read 8 chars total, 
> updating just the new chars
> (not counting crlf counters and spaces , but i think it will worth it on real 
> sorted wordlists) ,instead of 16

You have almost re-invented Crack 5's DAWG (Directed Acyclic Word
Graphs) compression.  It's a good thing, but it's not important to have,
and it will confuse some users.  Thus, it stays on my to-do list for JtR
as a low priority item for years.

Now that there's JtR Pro, which includes a large wordlist pre-packaged
along with the program, I am actually considering implementing this as
it would reduce the download size for the package (currently at 13 MB).

> you think it will improve speed ?

Not with Unix password hashes.  It might improve speed in special cases,
such as when running a huge number of wordlist rules against very fast
saltless hashes (or against a single fast hash, in which case it may as
well be salted).

However, the primary purpose would be to reduce the size of downloadable
packages, to save hard drive space when the packages are installed, and
maybe to save RAM that the OS uses for caching the wordlist (see below).

On Sun, Jan 27, 2008 at 05:23:56AM +0200, Marti Raudsepp wrote:
> ... cracking speed is usually CPU-bound (not I/O bound).

This is correct - except in special cases I've mentioned above.  In
fact, even in those special cases, a decent OS will cache the entire
wordlist in RAM (as long as there's enough RAM that would otherwise stay
unused), so reading the wordlist is CPU-bound and memory-bound, too.
One annoyance here is the access time updates on the file, unless
disabled with the "noatime" mount option or equivalent.  Maybe a future
version of JtR should optionally read the entire wordlist into memory on
its own to avoid this OS overhead.  Then DAWG compression would make
more sense (can store the wordlist in memory as DAWG-compressed).

> Doing runtime decompression would increase CPU usage rather
> than reduce it.

This is not certain.  The OS kernel and C library (stdio) overhead
involved in reading data from a (cached) file is comparable to or even
bigger than the computational cost of DAWG decompression, which is
almost negligible.

> However if there really is a good reason, you could pull this off
> using widely available compression tools like gzip/bzip2/lzma by
> piping their output to john's stdin. They would probably also have
> much better compression ratios (>5 times is normal for text files).

Of course, but:

1. These will in fact increase CPU usage, although this is fine if the
data is actually being read off disk platters (not yet cached) and/or a
CPU core would otherwise stay idle.

2. It has been shown that the combination of DAWG and gzip provides much
better overall compression for sorted wordlists than gzip alone does.

3. "john --stdin" is incompatible with wordlist rules, precisely because
JtR does not load entire wordlists into memory.

4. Built-in support for "serious" compression in JtR would introduce
dependencies on external libraries and/or result in code bloat (if such
support is fully implemented in JtR's own source code).  That said,
optional dependencies on external libraries (and thus optional support
for the corresponding compression methods) may be OK.

Given the above, you can see that DAWG is a more likely candidate for
inclusion into JtR itself.  In fact, it's not really an alternative to
the "serious" compression methods since both can co-exist nicely.

-- 
Alexander Peslyak <solar at openwall.com>
GPG key ID: 5B341F15  fp: B3FB 63F4 D7A3 BCCC 6F6E  FC55 A2FC 027C 5B34 1F15
http://www.openwall.com - bringing security into open computing environments

-- 
To unsubscribe, e-mail john-users-unsubscribe@...ts.openwall.com and reply
to the automated confirmation request that will be sent to you.

Powered by blists - more mailing lists

Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.