john-dev - Re: john scalability

Follow @Openwall on Twitter for new release announcements and other news
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4D8E1789.1000503@bredband.net>
Date: Sat, 26 Mar 2011 17:42:49 +0100
From: magnum <rawsmooth@...dband.net>
To: john-dev@...ts.openwall.com
Subject: Re: john scalability

Thanks a lot for explaining this. Questions inline:

On 2011-03-26 15:55, jfoug wrote:
> If you have 500 candidates, and your hash function is random (enough), and
> you have 5 hash buckets, then on 'average' there will be 10 items in each
> hash bucket, and no average, you will look through 1/2 of them for success
> (which may be rare), and look through all of them on non-success (often very
> common).   This makes for checking each value to take quite a lot of time
> (tesing against 10 candidates on average).   However, the memory usage is
> small (only 5 hash buckets).

Did you mean 50 candidates total? Or perhaps 50 hash buckets, or 100 
candidates in each. Or am I lost already?

By candidates here you actually mean binary representations of the input 
file's hashes, no?

> During the run of john, there is a hash table used to check computed hashes
> against possible candidates.  If we had 1 million candidates we would NOT
> want to check our computed hash value against all 1 million of them.  We
> would want to as quickly as possible narrow that search down to none, or
> just a few candidates.  This is where the hashing comes in.
>
> When the loader loads the cipher file, all of the want to find hashes are
> there.  The loader then builds a hash table using those values.  Then when a
> candidate is run, the result is put through the same hash function, and only
> the wanted values which match that hash (if any) are looked at.

So if we use the "size 2" (4096) hash, we will have 4096 buckets? And if 
we use the "size 4" (1M) hash, we'll have 1M buckets with a lot less in 
each.

The loader produce binaries from the hashes once and for all, and calls 
binary_hash() to decide what bucket each one should be put in?

And after we produce a candidate binary, say we call get_hash() where 
some part of it is (in the 1M case) and'ed with 0xFFFFF and it returns 
0x32154, then we only have to look through the bucket that contains the 
"0x32154" hashes for a match. Something like that?

> There used to be 3 hash tables.  I found serious slow-downs that started at
> a couple thousand candidates, and went totally to pot when doing 350k or so.
> I bumped that up to 5, and Solar has since put that into the jumbo.

Apparently it has also made it into plain John. But even non-jumbo 
AFS_fmt.c lacks the extra functions.

> It now sounds like some of them I 'left out' need to be added (they ALL
> should be added), The faster the hash, the more important it is to check
> zero or just a very small few candidates.  If the format is very slow, then
> checking more will not add up to a large percentage of runtime.  However,
> for saltless single crypt hashes (like rawmd5, rawsha, single-salted, lm,
> ntlm, etc), checking as few candidates as possible speeds things up a lot.

It seems very easy to add them to most formats and I have already added 
them to a couple I needed. It may be academic for some formats but I can 
volunteer to produce a diff with the "missing" functions added to all or 
most formats, if that is something we should have.

> I am sure if I do not have this logic right, Solar will correct my mistakes.
> But even it I do not have everything listed exactly right, it is close
> enough that it should give you the insight as to why it make such a huge
> impact.

Again, thanks!

magnum
Confused about mailing lists and their use? Read about mailing lists on Wikipedia and check out these guidelines on proper formatting of your messages.