Message-ID: <20230418211231.GA4992@openwall.com> Date: Tue, 18 Apr 2023 23:12:31 +0200 From: Solar Designer <solar@...nwall.com> To: john-dev@...ts.openwall.com Subject: Re: Birthday paradox On Tue, Apr 18, 2023 at 10:31:04PM +0200, magnum wrote: > On 2023-04-18 21:44, Solar Designer wrote: > >However, perfect hash tables, which we already use as the final level in > >the code that you refer to, are an even further step in the same > >direction, needing exactly 1 lookup. Given that we already went for the > >complexity there, is there a reason we're not doing the same for the > >smaller early-reject tables in local memory? Just fill them with tiny > >fingerprints like a Cuckoo filter would use. > > Assuming by "lookup" you mean the actual hash tables lookup, and by > "early-reject" you mean the bitmap... By "lookup" I mean one random access to whatever memory, for whatever purpose. However, for the above paragraph your way of reading it is also right. > Are you saing we should build an > early-reject table more similar to the actual hash table? Yes, I am saying we could at least try either that or at least a cuckoo filter. I'm very familiar with cuckoo filters now, but not yet with perfect hash tables. So a cuckoo filter would be easier for me, but we also already have perfect hash tables due to Sayantan's contribution and could just try to duplicate their usage at this earlier step and with small fingerprints (cuckoo filter alike) instead of full hashes. The difference is 1 (perfect) vs. up to 2 lookups (cuckoo) per early-reject. I wonder, would mixing the two concepts be original research? Similar to how first there was cuckoo hashing, and later cuckoo filter. I'm not aware of people taking a similar step from perfect hashing to let's call it perfect filter yet, even though it seems obvious. > I'm merely trying to understand all bits and pieces involved. The bitmap > is always "a single lookup", but it can consist of up to eight reads of > (hopefully local) memory: I'd call that 8 lookups. > ---8<--------------8<--------------8<--------------8<----------- > uint bitmap_index = hash[1] & BITMAP_MASK; > > uint tmp = (bitmaps[BITMAP_SHIFT * 0 + (bitmap_index >> 5)] >> > (bitmap_index & 31)) & 1U; > #if SELECT_CMP_STEPS == 2 > bitmap_index = hash[0] & BITMAP_MASK; > > tmp &= (bitmaps[BITMAP_SHIFT * 1 + (bitmap_index >> 5)] >> > (bitmap_index & 31)) & 1U; > #endif > #if SELECT_CMP_STEPS == 4 > (...) > ---8<--------------8<--------------8<--------------8<----------- > > The BITMAP_SHIFT macro and the occasional multiplication are all > compile-time constants so should be optimized away. Then each bitmap > step is just an add, a couple shifts and a couple ANDs. So BITMAP_SHIFT makes this separate 8 bitmaps. You could instead try removing the shifts and having each lookup cover the entire array (larger mask). That would be a Bloom filter. You could then experiment with lookup count, as it would be untangled from the size. Of course, all of this would need to be consistently replicated during bitmap construction on host. You might very well achieve better results than we have now. Instead of experimenting, assuming that local memory is of uniform access speed, you could simply use a Bloom filter parameters vs. FP rate calculator (such JavaScript web pages exist) to simulate this. Those web page simulators might not care about actual average lookup count, though - so you might need to write your own simulator - should be quick and easy. 
However, knowing that cuckoo and perfect are more efficient, the above
could be a waste of effort unless you expect to stay with it for a long
time (such as for lack of R&D time to proceed further).

> Here's the code for a hash table look-up once we passed the bitmap test:
>
> ---8<--------------8<--------------8<--------------8<-----------
>         uint t, hash_table_index;
>         ulong hash64;
>
>         hash64 = ((ulong)hash[1] << 32) | (ulong)hash[0];
>         hash64 += (ulong)offset_table[hash64 % OFFSET_TABLE_SIZE];
>         hash_table_index = hash64 % HASH_TABLE_SIZE;
>
>         if (hash_table[hash_table_index] == hash[0] &&
>             hash_table[hash_table_index + HASH_TABLE_SIZE] == hash[1]) {
>                 /* we've got a crack */
> ---8<--------------8<--------------8<--------------8<-----------
>
> Like we discussed recently, the modulo operations are compile-time
> constants so should be fairly optimized - but perhaps they are still
> relatively expensive?

If optimized by the compiler, their latency is probably that of two
consecutive integer multiplies plus a little more. We could help the
compiler optimize further (removing that "a little more" part) by
computing the reciprocals explicitly, targeting the 31-bit range
if/when it's sufficient for us. It costs a few more operations to
support the full 32-bit range, as I figured out and explained in
comments on that recent GitHub PR. However, the performance difference
is probably small.

We could also verify whether the compiler optimizes this. One way would
be to look at the ISA. Another would be to use a runtime variable for
the divisor and see whether and how much slower it would be - in the
full program or/and in an isolated test case.

Oh! I just realized that we're dividing a 64-bit number here. My tests
and reasoning above were for 32-bit to 32-bit (or to 31-bit). It should
be more costly for 64-bit, although conceptually the same kind of
optimizations would apply (needing 128-bit or maybe 96-bit temporaries,
which is probably still faster than a division).

OTOH, if the input were 32-bit, then the modulo division would
introduce significant biases. This could be fine for a perfect hash
table (not sure, though), but it'd certainly hurt the efficiency of a
cuckoo filter (assuming uniform memory access speed; with caching, not
so obvious).
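
To illustrate the reciprocal trick for the 32-bit input case (the one
my tests above covered), here's an untested standalone C sketch. The
divisor is an arbitrary example, not one of our actual table sizes,
and unsigned __int128 assumes gcc or clang; in OpenCL, mul_hi() should
give the high half of the product directly:

---8<--------------8<--------------8<--------------8<-----------
/* Exact x % D for 32-bit x and a compile-time constant D >= 2, via one
 * "high half" multiply by a precomputed reciprocal, one more multiply,
 * and a subtract. */
#include <stdint.h>
#include <stdio.h>

#define D 61681U                 /* arbitrary example divisor */
#define M (~(uint64_t)0 / D + 1) /* ceil(2^64 / D) */

static uint32_t mod_d(uint32_t x)
{
        uint64_t q = (uint64_t)(((unsigned __int128)x * M) >> 64); /* x / D */
        return x - (uint32_t)q * D;
}

int main(void)
{
        uint32_t x;
        for (x = 0xdeadbe00U; x != 0xdeadbf00U; x++)
                if (mod_d(x) != x % D)
                        printf("mismatch at %u\n", x);
        return 0;
}
---8<--------------8<--------------8<--------------8<-----------

For the 64-bit dividends we actually have here, the same structure
would need a 128-bit (or maybe 96-bit) intermediate product, as
mentioned above.
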
> I tried disabling the bitmap completely, to see how bad it would be,
> and speed plummeted. I've been meaning to re-try that but with the
> offset and hash tables in local memory (when they fit, of course). But
> if I recall correctly, I even tried replacing the bitmap lookup (for a
> single hash loaded) with a constant compare [if (hash[1] == 0xcafebabe)]
> and even *that* did not run any faster than the current bitmap code.
> Not sure why, really.

That's puzzling. Maybe performance is dominated by random global memory
access throughput (as in random lookups per second).

> Anyway I've mostly been tinkering with "many" hashes and how to
> optimize that. One idea I've had is for the case where e.g. a 4x
> bitmap doesn't fit in local memory: perhaps the first (or first two)
> steps can be in local, and the rest in global?

If global memory were not cached, this would make no sense - it'd be
more optimal to do the last-level hash table lookup right away.
However, with caching it could be different - you'd move those steps
not from local to global, but from local hopefully mostly to cache, and
it could theoretically run faster on some devices.

> Well, many ideas, lots of basic research, slow progress.

The most obvious and easiest change relative to what we have is to try
combining those bitmaps into one Bloom filter. This isn't expected to
be the most optimal long-term solution, but if it turns out to be
faster than what we have now, it'd provide a new baseline for further
attempts.
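
And to estimate up front whether that would even be a win, the kind of
simulator I mentioned above really is tiny. Here's an untested sketch,
with purely made-up example parameters, assuming independent probes,
uniform local memory access speed, and early exit on the first zero
bit:

---8<--------------8<--------------8<--------------8<-----------
/* Expected Bloom filter FP rate and average probes per non-member
 * for m bits, n inserted elements, and k probes. */
#include <math.h>
#include <stdio.h>

static void bloom_stats(double m_bits, double n_elems, int k)
{
        /* probability that any given bit is set after n insertions */
        double p = 1.0 - pow(1.0 - 1.0 / m_bits, k * n_elems);
        double fp_rate = pow(p, k);
        /* expected probes for a non-member: 1 + p + p^2 + ... + p^(k-1) */
        double avg_probes = (p < 1.0) ? (1.0 - pow(p, k)) / (1.0 - p) : k;

        printf("m=%.0f n=%.0f k=%d  FP=%.6f  avg probes=%.3f\n",
            m_bits, n_elems, k, fp_rate, avg_probes);
}

int main(void)
{
        for (int k = 1; k <= 8; k++)
                bloom_stats(8 * 32768, 100000, k); /* e.g., 32 KiB array */
        return 0;
}
---8<--------------8<--------------8<--------------8<-----------

Plugging in our actual bit array sizes and loaded hash counts would
show the expected FP rate and the average number of probes per
rejected candidate for each k.
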
Alexander