Message-ID: <20100205070136.GA15755@openwall.com>
Date: Fri, 5 Feb 2010 10:01:36 +0300
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Pre-Mangling (Wordlist cleanup)

On Wed, Feb 03, 2010 at 02:04:25PM +0100, SL wrote:
> I would like to use ./john --rules=Pre-Mangle --stdout | ./unique to
> clean up arbitrary (large) "dirty" wordlists.
> 
> In other words: I have target-specific generated wordlists (of about  
> 2GB size), which still contain a lot of "unusable junk" like raw MD5  
> hashes, punctuation, Base64 fragments, QP-encoded fragments, falsely  
> decoded UTF-8 etc.
> 
> My intention is to put together a number of word mangling rules that  
> help to reduce this chaos and only let through "reasonable" candidates  
> for future processing with ./john --rules and ./john --rules=Single.

That's a curious idea.  So far, people have been using tools other than
JtR itself to pre-process "dirty" wordlists like this.  I do see some
value in having a ruleset like this for JtR itself.

> Does such a collection of rules already exist? I couldn't find one,  

I think not.

> and I must admit that the complexity of
> http://www.openwall.com/john/doc/RULES.shtml is a bit too much for me to
> start from scratch.

I suggest that you start by reading the existing john.conf.  Many of the
rules in the default rulesets start by rejecting some "words".  You can
learn from those rejection commands and build your ruleset upon them.

> What it should accomplish:
> * obviously no no-op (:)
> * include "dictionary-like" words up to a certain length (haven't seen
> any password longer than 18 chars in my samples, so length 22 should
> probably be sufficient)

# Permit pure alphabetic words of up to 22 characters
<N !?A

> * shorter alphanumeric "words" might be included as-is, maybe up to 8  
> or 10 chars

So instead of the above, we have to write:

# Permit alphanumeric "words" of up to 10 characters
<B !?X
# Permit pure alphabetic words of 11 to 22 characters
>A <N !?A
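
For readers new to the rule syntax: lengths of 10 and above are written as
letters (A = 10, B = 11, ..., N = 23), and an uppercase class name such as ?A
or ?X means "any character NOT in the class".  The Python sketch below is only
a rough emulation of which words the two rules above let through, not JtR's
actual code.  The function name passes() is made up for illustration, the
character classes are approximated as plain ASCII, and dropping empty words is
an assumption.

```python
import string

# ASCII-only approximations of the JtR classes ?a (letters) and ?x (alphanumerics).
LETTERS = set(string.ascii_letters)
ALNUM = LETTERS | set(string.digits)


def passes(word):
    """Return True if either rule would let the word through unchanged."""
    n = len(word)
    if n == 0:
        return False  # assumption: empty words are of no interest
    # "<B !?X": length below 11, and no non-alphanumeric character
    short_alnum = n < 11 and all(c in ALNUM for c in word)
    # ">A <N !?A": length above 10 and below 23, and no non-letter character
    long_alpha = 10 < n < 23 and all(c in LETTERS for c in word)
    return short_alnum or long_alpha


if __name__ == "__main__":
    for w in ["hunter2", "Antidisestablish", "abcdefghij1", "abc!"]:
        print(w, passes(w))
```

The two length ranges do not overlap, so a word is emitted at most once by
this pair of rules.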

> * punctuation should probably be purged (or truncated?)

...and what should be done with whatever remains?

# Purge punctuation and special symbols, then apply the usual requirements
@?p @?s Q <B !?X
@?p @?s Q >A <N !?A
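
As a rough illustration of the two purging rules above, here is a Python
sketch that strips the ?p and ?s characters, applies Q (reject unless the word
has changed), and then runs the same length and class checks.  The function
name purge_and_filter() is hypothetical, the class contents are copied from
the RULES documentation as best understood and should be treated as an
assumption, and dropping empty results is also an assumption.

```python
import string

LETTERS = set(string.ascii_letters)
ALNUM = LETTERS | set(string.digits)
# Assumed membership of ?p (punctuation) and ?s (symbols), per the RULES docs.
PUNCT = set(".,:;'?!`\"")
SYMBOLS = set("$%^&*()-_+=|\\<>[]{}#@/~")


def purge_and_filter(word):
    """Return the purged word if either rule would emit it, else None."""
    # "@?p @?s": purge all punctuation and symbol characters
    cleaned = "".join(c for c in word if c not in PUNCT and c not in SYMBOLS)
    # "Q": reject unless the word has changed (empty results dropped by assumption)
    if cleaned == word or not cleaned:
        return None
    n = len(cleaned)
    # "<B !?X": up to 10 characters, alphanumeric only
    if n < 11 and all(c in ALNUM for c in cleaned):
        return cleaned
    # ">A <N !?A": 11 to 22 characters, letters only
    if 10 < n < 23 and all(c in LETTERS for c in cleaned):
        return cleaned
    return None


if __name__ == "__main__":
    for w in ["pass-word!", "password", "abc#123", "correct.horse.battery"]:
        print(repr(w), "->", purge_and_filter(w))
```

Because of Q, these rules only emit words that actually had something purged.
Words that were already clean are left to the two rules without the purge
step, which avoids producing the same candidate twice.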

> * words with false transcodings (lots of /(.[????])+/) should get  
> rejected

You haven't fully specified this (your regexp looks wrong) and it'd be
tricky to implement with the rules anyway.
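
Cases like this are where an external pre-filter, run before JtR, may be the
easier route.  The Python sketch below is a heuristic only and does not
implement the pattern quoted above, which remains unspecified.  It assumes
that "falsely decoded UTF-8" means UTF-8 bytes read as Latin-1, and it also
flags bare 32-digit hex strings as likely raw MD5 hashes.  The function name
looks_like_junk() and both patterns are assumptions made for illustration.

```python
import re

# A line consisting of exactly 32 hex digits is likely a raw MD5 hash.
HEX32 = re.compile(r"[0-9a-fA-F]{32}")
# UTF-8 lead bytes 0xC2/0xC3 read as Latin-1, followed by a continuation byte
# (0x80..0xBF) read as Latin-1: the classic two-character mojibake pattern.
MOJIBAKE = re.compile("[\u00c2\u00c3][\u0080-\u00bf]")


def looks_like_junk(word):
    """Heuristically flag raw MD5 hashes and mis-decoded UTF-8."""
    if HEX32.fullmatch(word):
        return True
    if "\ufffd" in word:  # Unicode replacement character left by a failed decode
        return True
    return MOJIBAKE.search(word) is not None


if __name__ == "__main__":
    samples = ["password", "d41d8cd98f00b204e9800998ecf8427e",
               "caf\u00e9".encode("utf-8").decode("latin-1")]
    for s in samples:
        print(repr(s), looks_like_junk(s))
```

A heuristic like this will miss text mis-decoded through other code pages and
may occasionally reject a legitimate word, so it is worth checking its output
against a sample of the wordlist before relying on it.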

> Could anybody please point me to a reasonable start? I shall follow-up  
> with a patch to john.conf, if this idea proves successful.

I've provided some examples above.  Please do post whatever ruleset you
might come up with.

Thanks,

Alexander
