Date: Tue, 15 Sep 2015 00:32:42 +0300
From: Solar Designer <>
Subject: Re: Judy array

On Mon, Sep 14, 2015 at 05:58:42AM -0700, Fred Wang wrote:
> On Sep 14, 2015, at 5:46 AM, Solar Designer <> wrote:
> > Do you mean a Bloom filter is always a win over a simple bitmap for the
> > same limited RAM usage?  If so, that's no surprise.  Or do you mean it's
> > always a win regardless of RAM usage, when optimizing for speed only?
> > If so, that's unexpected to me except when a significant portion fits in
> > a cache or when you optimize the order in which you perform the lookups
> > to achieve a higher cache hit rate (or when you manage to use multiple
> > bits per lookup at application level, but that's probably not what you
> > meant above yet).
> My focus has always been large number of hashes, and thus my statements. By large, I mean > 1,000,000 unsolved hashes (I typically run with about 80,000,000 unsolved).

Sure, but how large is the optimal Bloom filter for those hash counts?
Isn't it much larger than L3 cache anyway?  If so, I don't see why it'd
work faster than an even larger bitmap.

> As soon as I can, I issue _mm_prefetch(bloom_target,_MM_HINT_NTA), since it is likely that the bloom filter address will not be re-used in the near term.  By the time the bloom filter is called, the address is in cache, so I don't stall waiting for it.

Oh, as a minor evolutionary change to what we have now, we should in
fact decouple the bitmap lookups from actually using those looked-up
values.  Here's how I had it written as an experiment in January 2012:

	for (index = 0; index < crk_key_index; index += 2) {
		int hash2 = salt->index(index + 1);
		int hash1 = salt->index(index);
		unsigned int b1 = salt->bitmap[hash1 / (sizeof(*salt->bitmap) * 8)];
		unsigned int b2 = salt->bitmap[hash2 / (sizeof(*salt->bitmap) * 8)];
		if (b1 & (1U << (hash1 % (sizeof(*salt->bitmap) * 8)))) {
			/* ... full comparison for hash1 ... */
		}
		if (b2 & (1U << (hash2 % (sizeof(*salt->bitmap) * 8)))) {
			/* ... full comparison for hash2 ... */
		}
	}

and IIRC it actually ran significantly faster, but I never committed
that code version.  I also had it with OpenMP support, but to use that
we need formats to declare that their get_hash*() are MT-safe (or verify
that all currently are, which I think may very well be the case).

We're not currently using prefetch instructions in JtR, but I did
give them a try in yescrypt.  Just a couple of days ago, I confirmed
what I had suspected: turning off the hardware prefetcher in BIOS
settings can make code with explicit prefetch instructions even faster
(I got a ~4% boost for yescrypt on 2x E5-2660v2), but it hurts code that
does not use those instructions (a ~30% hit for an scrypt implementation
with no prefetch instructions when turning off all 4 hardware prefetch
options in a Dell BIOS - but then I was able to identify 1 of the 4
options that, when only it is disabled, still boosts yescrypt without
hurting scrypt).

> By "as soon as I can", I mean as I finish the first 32 bits of the MD5, for example.

Oh, do you proceed to compute more than 32 bits of MD5 even before you
know you'd need more than 32 bits?  That's weird if so, since you have
other work to do there (on other candidate passwords) that you're sure
you'd make use of.

> If you are considering only a (very) small number of hashes, then yes, bitmaps could fit in cache and thus be faster.  Because I have sufficient work to do before I check the bloom filter results, however, there appears to be little penalty.

You shouldn't want to initiate more than one Bloom filter lookup for a
computed hash until you know that the one lookup happens to be
insufficient... or are you sure you'd always have spare bandwidth to do
so (and only latency matters, which you deal with in the way you
described above)?

