Date: Sat, 5 Jul 2014 8:46:51 -0400
From:  <jfoug@....net>
To: "john-dev@...ts.openwall.com" <john-users@...ts.openwall.com>
Subject: Salt dupe removal logic and format changes (Comments wanted)

I have started an issue on the git repository (https://github.com/magnumripper/JohnTheRipper/issues/692#issuecomment-48072107) dealing with the logic for duplicate salt removal, and the problems/limitations it has.

There are some formats which require a large amount of 'extra' information to be stored with the salt.  Take, for instance, the rar format, or the zip format.  They require the actual blob of encrypted, compressed file data to work.  BUT this data is highly variable and can be extremely large.  This chunk of data is not really a candidate for a fixed-size array in the salt structure, since we would have to make the array 100's of KB in size, and even then it would greatly limit the format, causing archives not to be handled simply because their data blobs are too large to fit into our salt.  So, within formats like this, we simply need to put a pointer into the salt structure, allocate memory for our data item, and store the item off in that memory.
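To make the shape of the problem concrete, here is a minimal sketch of what such a salt might look like.  The struct and field names are made up for illustration; this is not the actual zip/rar code from jumbo:

#include <stddef.h>

/* Hypothetical salt layout for an archive-style format. */
struct archive_salt {
    unsigned char iv[16];    /* fixed-size crypto inputs live inline */
    unsigned int iterations;
    size_t blob_len;         /* length of the encrypted file data */
    unsigned char *blob;     /* heap-allocated; may be 100's of KB, so
                                only the pointer lives in the salt */
};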

Ok, here is the real problem.  Salt removal is simply a memcmp() call between 2 salts, done by the core logic in cracker.c.  It uses the salt records returned by the format, and the size the format lists its salt to be.  So say we have the exact same .zip record in our input file 3 times.  We load up and build a salt record for each line as it is processed.  They are all identical, IF you know the salt structure.  But if you are only looking AT the salt structure, they are not the same: the 3 records each hold a different pointer, even though the contents of WHAT is pointed to by that pointer are identical for all of them.  Thus, we end up with 3 salts which are IDENTICAL, simply tripling the amount of work required and making the format 1/3 as efficient as it should be.
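Simplified (this is not the literal cracker.c source), the core check amounts to the following, and it explains the behavior above:

#include <string.h>

/* Two salt records are 'duplicates' only if every byte of the
   salt struct matches. */
static int salts_equal(const void *a, const void *b, size_t salt_size)
{
    return !memcmp(a, b, salt_size);
}

With the archive_salt layout sketched above, the blob members of two records are distinct heap addresses even when the bytes they point at are identical, so the memcmp over the struct never matches and all three copies survive.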

Magnum and I have talked a little about attacking this problem with a format change: giving a 2nd salt size to the format structure.  The first (and existing) size is the actual size of the salt structure.  It is used by core JtR to know how much memory to allocate and copy to properly handle the salt.  The 2nd size would normally be set to the same value as the first, BUT it could be made smaller.  This second size would be used by the comparison function, so it knows how much of the salt to use in the duplicate-detection logic.  Thus, we can put pointers to the variable-sized ancillary data at the end of the salt structure and tell core john to only compare the first part of the structure.  This allows the duplicate-salt logic to function, keeping duplicates in the input file from generating multiple salts, and also allows already-cracked hashes in the .pot file to be removed, even if those records can no longer (or never could) generate the proper data.
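A sketch of how the layout and the wiring might look under that scheme.  The salt_compare_size name is just a placeholder I am using here; nothing is decided:

#include <stddef.h>

/* Everything the dedupe logic should inspect sits at the front;
   pointers to variable-sized ancillary data go at the end, past
   the compared region. */
struct archive_salt {
    unsigned char iv[16];
    unsigned int iterations;
    size_t blob_len;
    /* ---- compared region ends here ---- */
    unsigned char *blob;     /* excluded from the comparison */
};

/* In the format's params (hypothetical field name):
 *   salt_size         = sizeof(struct archive_salt)
 *   salt_compare_size = offsetof(struct archive_salt, blob)
 * and the core dedupe check would become:
 *   !memcmp(salt_a, salt_b, format.params.salt_compare_size)
 */

One wrinkle: any alignment padding inside the compared region would have to be zeroed consistently (e.g. by memset'ing the salt buffer before filling it), or two logically identical salts could still miscompare.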

There are several caveats to this layout.  

1. It is a format change, which means that EVERY format has to be updated.  So even if it is trivial in concept, it is a pretty substantial undertaking.
2. It is a deviation from core JtR, which woefully still lacks a current jumbo, and this change makes that an even larger task when the time comes.
3. All data that should not be looked at for salt comparison has to be placed at the end of the salt record.
4. A developer would have to be VERY careful in selecting which data NOT to use in the salt comparison.  Often this is not obvious, so there is still a chance of duplicate salts that should be removed but are not, and worse, of unique salts that get removed as duplicates, because some unique data was placed into this do-not-look-at section of the salt by the coder.
5. There may still be greatly variable-sized data that must be used in the salt comparison, so this size is only part of the answer: a 'dumb' comparator check (a simple memcmp) may not be adequate.  It may be that we need a new format method, equal_salt() or something like that, instead of this size.
6. There are likely other 'issues' which Magnum sees that I am missing.

Magnum asked that this question be tossed out to john-dev for discussion prior to making changes.  I think that is a great idea too, assuming there are any replies at all.  Either way, it is useful to actually put things down and ask the questions, because it helps me get a better picture and even causes me to think differently.  The extra method (equal_salt()) just popped into my head while I was typing.  Now that I think of it, equal_salt() really sounds like the better way to go anyway, with a default that simply falls back to returning !memcmp(salt1, salt2, format.salt_size)
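A rough sketch of what that could look like, with a default supplied by core so that untouched formats keep today's behavior (names and wiring are speculative):

#include <string.h>

/* Default method: identical to the current memcmp() behavior. */
static int default_equal_salt(const void *a, const void *b, size_t salt_size)
{
    return !memcmp(a, b, salt_size);
}

/* An archive-style format would override it with a comparator that
   chases the pointer and compares the real data (struct archive_salt
   as sketched earlier): */
static int archive_equal_salt(const void *a, const void *b, size_t salt_size)
{
    const struct archive_salt *sa = a, *sb = b;

    (void)salt_size; /* unused; the format knows its own layout */
    if (memcmp(sa->iv, sb->iv, sizeof(sa->iv)) ||
        sa->iterations != sb->iterations ||
        sa->blob_len != sb->blob_len)
        return 0;
    return !memcmp(sa->blob, sb->blob, sa->blob_len);
}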


Comments and discussion wanted.  I have an open 'bug/issue' on git to at least identify which formats have problems with this.  I imagine there are quite a few formats in bleeding that behave like this.
