When picking a hash table size we usually have two choices: either we pick a prime number, or we pick a power of 2. Powers of 2 are easy to use, as a modulo by a power of 2 is just a bit-wise AND, but 1) they waste quite a bit of space, as we have to round up to the next power of 2, and 2) they require "good" hash functions, for which looking at just a subset of bits is OK.
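As a tiny illustration (with a hypothetical helper name), the power-of-2 slot computation is a single AND, but it only ever inspects the low bits of the hash:

```cpp
#include <cstdint>

// Hypothetical helper: with a power-of-2 table size, hash % size
// reduces to a bit-wise AND with size-1, but note that it only
// looks at the low bits of the hash value.
uint64_t slotPow2(uint64_t hash, uint64_t size /* must be a power of 2 */) {
   return hash & (size - 1);
}
```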

Prime numbers are more forgiving concerning the hash function, and we have more choices for the size, which leads to less wasted space. But using a prime number requires a modulo computation, which is expensive. And we have to find a suitable prime number at runtime, which is not that simple either.

Fortunately we can solve both problems simultaneously, which is what this blog post is about. We can tackle the problem of finding prime numbers by pre-computing suitable numbers with a given maximum distance. For example, when only considering prime numbers that are at least 5% apart, we can cover the whole space from 0 to 2^64 with just 841 prime numbers. We can solve the performance problem by pre-computing the magic numbers from Hacker's Delight for each prime number in our list, which allows us to use multiplications instead of expensive modulo computations. And we can skip prime numbers with unpleasant magic numbers (i.e., the ones that require an additional add fixup), preferring the next cheap prime number instead.
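To make the magic-number idea concrete, here is a small sketch (hypothetical names, not the code from the repository) of how a fixup-free magic number can be found and used for a single prime p:

```cpp
#include <cstdint>

// Sketch of the Hacker's Delight "magic number" trick for one fixed
// divisor p (hypothetical names, not the repository's actual code).
// We search for a 64-bit multiplier m and shift s with x/p == (x*m)>>s
// for every 64-bit x; primes where no such pair exists would need the
// extra add fixup and are the ones that get skipped.
struct MagicMod {
   uint64_t p = 0, magic = 0;
   unsigned shift = 0;
   bool cheap = false; // did we find a fixup-free magic number?

   static MagicMod create(uint64_t p) {
      MagicMod r;
      r.p = p;
      for (unsigned l = 0; l < 64; ++l) {
         // candidate: m = floor(2^(64+l)/p)+1, total shift s = 64+l
         __uint128_t pow = __uint128_t(1) << (64 + l);
         __uint128_t m = pow / p + 1;
         if (m >> 64) break; // multiplier no longer fits in 64 bits
         // the round-up method is exact for all 64-bit x iff
         // p - (2^(64+l) mod p) <= 2^l (Granlund/Montgomery)
         if (p - uint64_t(pow % p) <= (uint64_t(1) << l)) {
            r.magic = uint64_t(m);
            r.shift = 64 + l;
            r.cheap = true;
            break;
         }
      }
      return r;
   }

   uint64_t mod(uint64_t x) const {
      uint64_t q = uint64_t((__uint128_t(x) * magic) >> shift); // x / p
      return x - q * p;                                         // x % p
   }
};
```

For example, 11 and 19 get cheap magic numbers this way, while 53 happens to require the add fixup and would be skipped in favor of the next cheap prime.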

The resulting code can be found here. It contains every prime number you will ever need for hash tables, covering the whole 64-bit address space. Usage is very simple: we just ask for a prime number and then perform modulo operations as needed:

class HashTable {
   primes::Prime prime;
   vector<Entry*> table;

public:
   HashTable(uint64_t size) {
      prime = primes::Prime::pick(size);
      table.resize(prime.get());
   }
   ...
   Entry* getEntry(uint64_t hash) { return table[prime.mod(hash)]; }
   ...
};

The performance is quite good. On an AMD 1950X, computing the modulo for 10M values (and computing the sum of the results) takes about 4.7ns per value when using a plain (x%p), but only 0.63ns per value when using p.mod(x).

Getting this into unordered_map would be useful; it would probably improve the performance quite significantly when we have few cache misses.

Hey Thomas, good stuff as always. I was surprised about 0.63ns (2 cycles roughly!), esp. with 128-bit computations, but for independent computations it seems it actually works :)

I also thought Daniel Lemire's alternative to modulo would be faster (https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/), but at least in my quick benchmark the difference was (surprisingly) very small, maybe 10-20%.

Hi Marcin, when you look at the generated assembler code you see that no (full) 128-bit computation is performed: https://godbolt.org/z/yoLmWk

We intentionally perform two shifts in the code, one to get the upper word after the multiplication and one for the result. The first shift is a no-op, as the MUL instruction always places the high result word in RDX anyway. Thus the code is quite cheap, even though it looks a bit expensive.
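Spelled out in C++ (a sketch with hypothetical names; `magic` and `l` stand for a precomputed table entry, e.g. magic = floor(2^(64+l)/p)+1), the two shifts look like this:

```cpp
#include <cstdint>

// The two shifts described above, written out (sketch, hypothetical
// names): the >>64 extracts the high word of the 128-bit product and
// is free on x86-64, since MUL leaves that word in RDX anyway; only
// the final >>l is a real shift.
uint64_t divByMagic(uint64_t x, uint64_t magic, unsigned l) {
   uint64_t hi = uint64_t((__uint128_t(x) * magic) >> 64); // high word, free
   return hi >> l;                                         // the real shift
}
```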

The approach from Daniel Lemire has stronger requirements on the hash value, i.e., it requires that all hash values spread over the whole integer domain immediately. This makes it unsuitable for unordered_map, where some people use identity hashing or other very weak hash functions. Using modulo prime is more forgiving there.
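For comparison, Lemire's reduction is a single multiply-high (a sketch, not his exact code); it effectively uses the high bits of the hash, which is exactly why identity-hashed small keys are a problem for it:

```cpp
#include <cstdint>

// Sketch of Lemire's fast range reduction: maps h into [0, n) with a
// single multiply-high, i.e. (h * n) >> 64. It effectively consumes
// the *high* bits of h, so small identity-hashed keys all collapse
// into the first buckets.
uint64_t fastReduce(uint64_t h, uint64_t n) {
   return uint64_t((__uint128_t(h) * n) >> 64);
}
```

For example, fastReduce(12345, 1000) is 0: with identity hashing every small key lands in bucket 0, which is the weakness described above.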

Good point about the hash function range. Maybe indeed for unordered_map replacement it is too aggressive / unsafe. Especially with the _terrible_ default hash functions in some places.

But for the "real" hash functions Daniel's solution might be a bit easier and faster. And allows completely arbitrary hash table sizes, not that it matters much :)

cheerios!