Date: Sun, 26 Jul 2009 21:16:11 +0400
From: Solar Designer <solar@...nwall.com>
To: john-users@...ts.openwall.com
Subject: Re: Work (optimization) in progress, and some ideas

Jim,

I've skipped your comments on my points 1-3, simply so as not to spend too
much time on this discussion.  We're mostly thinking along the same lines.

I initially wrote the rules.c code when JtR only supported iterated and
salted hashes, the fastest of which was DES-based crypt(3).  Then, when
faster and saltless hashes "started to appear", I was still quite
satisfied with the speed of that code, because the default rulesets
would complete quickly.  For typical runs, the total overhead was maybe
on the order of 1 minute, which was insignificant if the total session
duration could be hours, days, or weeks (mostly in "incremental" mode).

Now, I realize that people have started to (ab)use the rules to a greater
extent, coming up with thousands of wordlist rules and applying them to
fast saltless hashes.  ("Normally", JtR would only come close to using
this many rules in "single crack" mode.)  Given this change in usage
patterns, I agree that it may be time to rework the rules.c code.

On Sat, Jul 25, 2009 at 08:06:02PM -0500, JimF wrote:
> >4. Introduce string append/prepend/insert commands, like:
> >
> >A"string" - append
> >B"string" - prepend
> >CN"string" - insert in position N
> 
> These would be WONDERFUL additions.  I have about 12000 rules,
> which were gleaned by running cracked words against other
> dictionaries, looking for prefix and postfix strings, then sorting based
> upon how often a prefix was seen, and keeping the ones that were seen
> 3 or more times (in sorted order, of course).  I ended up with 12k rules.
> I built another program to convert those prefix and postfix strings into
> 'proper' rules for john (knowing that a prefix has to be in reverse order,
> etc.), and on its output it alternated prefix and postfix rules (there
> were more postfix rules, btw).  That set of rules, when run after the
> 'normal' john default rules, cracks a LOT of passwords, but it runs
> quite a bit slower, due to so much string moving inside john.  If we
> had string append, that would speed things up a lot.  Appending with
> the existing rules means one pass through all of the logic of a rule
> for each letter, vs. a single pass through the rule logic.  No matter
> how much the prepend and append logic for one letter is optimized,
> it will not be faster than appending a whole word (string).

That's true.  Speaking of your specific intended use described above,
I think a "next generation" version of JtR should introduce some
multi-word or passphrase cracking mode.  This is a topic we could
discuss separately and at a later time.
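
As for the speedup: to give a concrete comparison, appending the string
"123" with the existing commands takes three of them,

$1$2$3

whereas the proposed syntax would do it in a single command,

A"123"

(and B"123" instead of ^3^2^1 for prepending, with no need to reverse the
string).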

> So, your rule would be:
> 
> A!string!, where ! is any unused character within the string?  That type
> of logic works fine for me.

That's correct.
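
For example, these would be two equivalent ways to append 123:

A"123"
A.123.

and a string that itself contains a double quote would simply use some
other unused character as the delimiter.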

> >Unfortunately, the 'P' and 'I' command characters are already taken.
> 
> Do they have to be 1 letter?

For optimal performance, yes, as long as we don't introduce a compiler
into some internal representation (which is something to consider when
we're ready for a major rework).

Also, single-letter commands allow for using the preprocessor on them:

:[AB]"string"

At a later time the preprocessor may be enhanced to also support
expressions like {string1,string2}, but we're not there yet.

> It could be something like:
> $.!string! and ^.!string!      (again ! being any char not used in string)
> 
> this would require $\. or ^\.  (or $[.] ^[.] for the pre-processor) to 
> append just a period.

I thought of this before, and decided against this approach for the
following reasons:

1. This would break some existing valid rules (those appending/prepending
the magic character or the escape character).

2. The checks for the magic character and the escape character would have
some performance cost.

Also, your preprocessor example is not correct.  The preprocessor is
separate from the rest of the rules engine, and I'd like to keep it that
way, so we can't reasonably "teach" the preprocessor to treat a
character specially.  Thus, the rule command syntax you proposed would
be incompatible with having a preprocessor expression for
appending/prepending an arbitrary character from a list (which might
include the magic character).  For example, instead of:

$[abc.123]

one would have to write:

$[abc]
$\\.
$[123]

In fact, since the backslash would also become special in that context,
it could not simply be listed in the class either.  So instead of:

$[abc\\123]

one would have to write:

$[abc]
$\\\
$[123]

Really not great.

(The backslash is already a special character for the preprocessor,
which is why it has to be doubled now and would have to be tripled.)

> This way, the same 'prepend' 'append' rule semantics are used, keeping
> the 'learning' of new rules to a minimum.

Yes, but the drawbacks outweigh that.

> I am not sure if preprocessing
> rules would need to be messed with, as I don't think pre-generated strings
> would really benefit from the pre-processor.

There's currently no way to turn the preprocessor off, and I don't think
we want to change that.  It would be a source of confusion.

Some other thoughts on the string commands:

1. "prepend" is redundant when we have "insert", yet we could want to
implement it separately for clarity and because we don't have an
optimizing compiler of the rules into an internal representation.  We
already have this sort of redundancy for the single-letter commands.
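(With the proposed syntax, B"string" would presumably be equivalent to
C0"string".)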

2. "prepend" and "insert" could be made faster if we knew the length of
the string to be prepended/inserted in advance.  This may be achieved
by changing the input syntax to require the user to specify the length
(inconvenient), by introducing a compiler that translates the rules into
an internal representation, or by introducing some sort of per-rule-command
caching
(there may be multiple string commands per rule, so the lengths would
need to be cached separately for each).
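
As a very rough sketch of the caching approach (the struct and function
names below are made up for illustration, not actual rules.c code), a
per-command cache could look something like this:

#include <string.h>

/* Hypothetical per-command state for an A"string" (append) command */
struct string_cmd {
        const char *str;        /* the string, as parsed from the rule */
        int length;             /* -1 until measured on the first use */
};

/* Append cmd->str to word (current length word_len) in place and
 * return the new length; the caller is assumed to guarantee that
 * the buffer is large enough. */
static int do_append(char *word, int word_len, struct string_cmd *cmd)
{
        if (cmd->length < 0)
                cmd->length = (int)strlen(cmd->str);

        /* One memcpy() instead of one rule command per character */
        memcpy(word + word_len, cmd->str, cmd->length + 1);

        return word_len + cmd->length;
}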

> Speaking of rules, I have found the rejection rules to be a little on
> the slow side.  Part of this is due to the rejection rule having
> to operate on each word, one at a time, and if you have many rules
> with the same rejection rule, john does not know that the word
> was rejected already, so it checks it again.
[...]
> I am not sure we can optimize john a whole lot for this
> type of situation.  Does anyone have any other ideas?

I think this problem can be generalized as follows:

Right now, JtR tries the entire wordlist against each rule before it
advances to the next rule.  If the next rule has the same initial few
commands (not only rejections, but any commands), those will be applied
to all entries in the wordlist for a second time.
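
For example, with two adjacent rules such as (using existing commands):

<7 l $1
<7 l $2

the shared "<7 l" prefix (a length rejection followed by lowercasing) would
be applied to the entire wordlist twice.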

To deal with this, as well as to provide a way to temporarily swap the
loops for some other reason (and I can think of at least one other valid
reason), I've been thinking of introducing a way to declare "rule
blocks" - some sort of BEGIN and END directives within a ruleset.
JtR would then swap the loops while working on such a block.
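
Purely as an illustration (the directive names below are placeholders, the
actual syntax is undecided), such a block might look like:

BEGIN
<7 l $1
<7 l $2
<7 l $3
END

with each word from the wordlist being run through all of the rules in the
block before the next word is read.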

When going through the rules for the first time, JtR would need to
detect common initial substrings between adjacent rules (or better but
trickier - detect common initial sets of commands) and record the
character position (or command number) of the first difference into the
rule struct (to be introduced, and we'd have to keep the preprocessor
output in memory).  Then, when actually working at full speed, the rules
engine would use this "first difference position" to determine when to
cache the intermediate result, when to recall it, and how many initial
characters or rule commands to skip.
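
A rough sketch of that first pass (with invented names, and doing only the
simpler character-level comparison) might be:

/* Hypothetical rule struct; the preprocessor output is kept in memory */
struct rule {
        char *commands;         /* rule text after preprocessor expansion */
        int diff_pos;           /* position of the first difference from the
                                   previous rule; 0 for the first rule */
};

static void find_diff_positions(struct rule *rules, int count)
{
        int i;

        if (count > 0)
                rules[0].diff_pos = 0;

        for (i = 1; i < count; i++) {
                const char *prev = rules[i - 1].commands;
                const char *cur = rules[i].commands;
                int pos = 0;

                while (prev[pos] && prev[pos] == cur[pos])
                        pos++;

                rules[i].diff_pos = pos;
        }
}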

If a rejection occurs at or before the caching position, that fact
should be cached instead of the intermediate result (string).  Then the
following rules with the same or a larger caching position should be
skipped.  The pointer to the next rule to process in such a case could
be cached too.  To optimize this further, two caching positions could
be kept track of: one as described above and the other at the end of
the last rejection command within the common initial set of commands.

> Sorry for the ramble.  Jim 

Hey, that's what this mailing list is for. ;-)  Well, not only that, but
I've appreciated your postings so far.

Thanks,

Alexander

-- 
To unsubscribe, e-mail john-users-unsubscribe@...ts.openwall.com and reply
to the automated confirmation request that will be sent to you.
