Message-ID: <20120121184035.GA13751@openwall.com>
Date: Sat, 21 Jan 2012 22:40:35 +0400
From: Solar Designer <solar@...nwall.com>
To: john-dev@...ts.openwall.com
Subject: dropping *_hash_0()

magnum, Jim, all -

Right now, the *_hash*() functions provided by each format are supposed
to return values of these bit sizes:

4, 8, 12, 16, 20, 24, 27

The smallest of these (4-bit) was relevant in the 1990s, when RAM sizes
were small and John only supported salted hashes.  It let a computed
hash value be checked against the few loaded hashes (with that one
salt) more quickly while not consuming much RAM.
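
(For context, these are trivial masking functions in each format.  A
minimal sketch, assuming the usual format conventions; the crypt_out[]
buffer name is made up:

static unsigned int crypt_out[MAX_KEYS_PER_CRYPT][4];	/* hypothetical */

static int get_hash_0(int index)	/* 4-bit, proposed for removal */
{
	return crypt_out[index][0] & 0xF;
}

static int get_hash_1(int index)	/* 8-bit */
{
	return crypt_out[index][0] & 0xFF;
}

...and so on, up to the 27-bit mask 0x7FFFFFF.)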

Things have changed.  The last straw may be my recent introduction of
bitmaps, where a 16-bit bitmap is ridiculous (it is smaller than a
single element of the "unsigned int" array) and where the hash tables
(used after bitmap lookups) are even smaller (so far, based on some
testing, I set them to have 4 times fewer elements than the bitmaps).
A 4-element hash table makes little sense - if we use a hash table at
all, we might as well make it larger.
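
(To illustrate the two-level lookup, here's a minimal sketch - not the
actual cracker code, and the names are made up:

#include <string.h>

struct entry {
	struct entry *next;
	unsigned char binary[16];
};

extern unsigned int bitmap[];		/* 1 bit per possible hash value */
extern struct entry *hash_table[];	/* 4 times fewer elements */

static int lookup(unsigned int h, const unsigned char *binary)
{
	struct entry *p;

	if (!(bitmap[h >> 5] & (1U << (h & 31))))
		return 0;	/* definitely not loaded */
	/* drop 2 index bits for the 4 times smaller hash table */
	for (p = hash_table[h >> 2]; p; p = p->next)
		if (!memcmp(p->binary, binary, sizeof(p->binary)))
			return 1;
	return 0;
}

Most lookups are expected to terminate at the bitmap test.)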

On the other hand, 27-bit hash values were meant to be sufficient for
up to about 100 million same-salt hashes loaded.  Larger values would
be potentially problematic: 27 bits corresponded to a 1 GB hash table
with 64-bit pointers.  The next size up would be 2 GB, and I suspect
that it could hit limitations of some malloc() implementations.
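
(For the arithmetic: 27-bit indices mean 2^27 table elements, and with
8-byte pointers that's 2^27 * 8 = 2^30 bytes = 1 GB.)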

With my introduction of bitmaps, hash tables became smaller - 4 times
smaller now.  With 100M hashes loaded, we'd get a 27-bit bitmap, but
only a 25-bit hash table.  The bitmap would be approx. 52.5% full
(assuming uniform distribution).  This means that more than 50% of the
time the smaller hash table would be accessed, and it'd take about 5
iterations to see that the computed hash does not actually match any of
those in that hash bucket.  This seems suboptimal.
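
(To check the 52.5% figure: hashing m = 100M values into n = 2^27
bitmap bits leaves an expected fraction of 1 - (1 - 1/n)^m of the bits
set, which is approximately 1 - e^(-m/n) = 1 - e^(-0.745), or about
52.5%.)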

A workaround could be to make the bitmap vs. hash table size ratio
variable (have it depend on hash count, so that e.g. for 100M hashes
we'd have both the bitmap and the hash table use full 27-bit indices),
but this may slow the code down a little bit (in all cases, not only
when this flexibility is actually needed - unless we use specialized
pieces of code).
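
(A sketch of what the variable ratio might look like; all of these
names are made up:

/* hypothetical: shrink the table less, or not at all, for huge counts */
int bitmap_bits = bits_for_count(count);	/* e.g., 27 for 100M */
int table_shift = (count >= SOME_THRESHOLD) ? 0 : 2;
hash_table = alloc_table((size_t)1 << (bitmap_bits - table_shift));
...
/* and every lookup pays for a non-constant shift: */
p = hash_table[h >> table_shift];

The non-constant shift is the slowdown referred to above.)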

Thus, I propose that we drop the 4-bit and add 29-bit hash values (for
bitmaps; this corresponds to 27 bits for the hash tables).  The maximum
hash table size will then remain at 1 GB (per salt), although the
corresponding threshold would be higher than we'd set for direct hash
tables (without bitmaps).
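
(In concrete terms: a 29-bit bitmap is 2^29 bits = 64 MB, and the
corresponding 27-bit hash table is 2^27 * 8 bytes = 1 GB with 64-bit
pointers.)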

(29 also happens to be the number of significant bits in the first
32-bit half of binary tripcodes with the current code.)

This lets us reuse the same sets of 7 hash function pointers that we
already have.  They will correspond to these return value sizes:

8, 12, 16, 20, 24, 27, 29
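
(That is, if I recall the declarations correctly, the two method arrays
in fmt.h keep their size - roughly:

int (*binary_hash[PASSWORD_HASH_SIZES])(void *binary);
int (*get_hash[PASSWORD_HASH_SIZES])(int index);

with PASSWORD_HASH_SIZES staying at 7; only the meaning of the indices
shifts.)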

I propose that we don't rename any functions.  Simply drop *_0() and add
*_7(), so the new numbering will start with 1 (although array indices
obviously start at 0).  Similarly, drop PASSWORD_HASH_SIZE_0 and
PASSWORD_HASH_THRESHOLD_0, and add PASSWORD_HASH_SIZE_7 and
PASSWORD_HASH_THRESHOLD_7 in params.h.
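
(A sketch of the resulting params.h definitions, assuming the existing
power-of-two convention and omitting the thresholds:

#define PASSWORD_HASH_SIZES		7

#define PASSWORD_HASH_SIZE_1		0x100		/* 8 bits */
#define PASSWORD_HASH_SIZE_2		0x1000		/* 12 bits */
#define PASSWORD_HASH_SIZE_3		0x10000		/* 16 bits */
#define PASSWORD_HASH_SIZE_4		0x100000	/* 20 bits */
#define PASSWORD_HASH_SIZE_5		0x1000000	/* 24 bits */
#define PASSWORD_HASH_SIZE_6		0x8000000	/* 27 bits */
#define PASSWORD_HASH_SIZE_7		0x20000000	/* 29 bits */

The thresholds would be renumbered the same way.)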

Any comments?

If a format is not updated, it will end up wasting memory.  To deal with
this, we may introduce a new FMT_* flag that will indicate the use of
the new sizes - but this flag will soon be set on all formats.  Do we
need this complexity?
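
(If we do add the flag, the check could be as simple as this sketch,
with a made-up flag name:

/* hypothetical flag: format provides the 29-bit hash functions */
#define FMT_HASH_29			0x00000100

size = (format->params.flags & FMT_HASH_29) ?
    PASSWORD_HASH_SIZE_7 : PASSWORD_HASH_SIZE_6;

It'd become dead weight once all formats are converted, though.)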

Alexander
