Wednesday, July 24, 2019

Cuckoo Filters with arbitrarily sized tables

Cuckoo Filters are an interesting alternative to Bloom filters. Instead of maintaining a filter bitmap, they maintain a small (cuckoo-)hash table of key signatures, which has several good properties. For example is stores just the signature of a key instead of the key itself, but is nevertheless able to move an element to a different position in the case of conflicts.

This conflict resolution mechanism is quite interesting: Just like regular cuckoo hash tables each element has two potential positions where is be placed, a primary position i1 and a secondary position i2. These can be computed as follows:

i1 = hash(x)
i2 = i1 xor hash(signature(x))

Remember that the cuckoo filter stores only the (small) signature(x), not x itself. Thus, when we encounter a value, we cannot know if it is at its position i1 or position i2. However, we can nevertheless alternate between positions because the following holds

i1 = i2 xor hash(signature(x))

and we have the signature stored in the table. Thus, we can just use the self-inverse xor hash(signature(x)) to switch between i1 and i2, regardless of whether are currently at i1 or i2. Which is a neat little trick. This allows is to switch between positions, which is used in the cuckoo filter conflict resolution logic.

However all this hold only because the original cuckoo filters use power-of-two hash tables. If our hash table size is not a power of 2, the xor can place the alternative position beyond the size of the hash table, which breaks the filter. Thus cuckoo filter tables always had to be powers of two, even if that wasted a lot of memory.

In more recent work Lang et al. proposed using cuckoo filters with size C, where C did not have to be a power of two, offering much better space utilization. They achieved this by using a different self-inverse function:

i1 = hash(x) mod C
i2 = -(i1 + hash(signature(x)) mod C

Note that the modulo computation can be made reasonable efficient by using magic numbers, which can be precomputed when allocating the filter.

A slightly different way to formulate this is to introduce a switch function f, which switches between positions:

f(i,sig,C) = -(i + hash(sig)) mod C
i1 = hash(x) mod C
i2 = f(i1, signature(x), C)
i1 = f(i2, signature(x), C)
All this works because f is self-inverse, i.e.,

i = f(f(i, signature(x), C), signature(x), C)

for all C>0, i between 0 and C-1, and signature(x)>0.

The only problem is: Is this true? In a purely mathematical sense it is, as you can convince yourself by expanding the formula, but the cuckoo filters are not executed on abstract machines but on real CPUs. And there something unpleasant happens: We can get numerical overflows of our integer registers, which implicitly introduces a modulo 2^32 into our computation. Which breaks the self-inverseness of f in some cases, except when C is power of two itself.

Andreas Kipf noticed this problem when using the cuckoo filters with real world data. Which teaches us not to trust in formulas without additional extensive empirical validation... Fortunately we can repair the function f by using proper modular arithmetic. In pseudo-code this looks like this

    x=(C-1)-(hash(sig) mod C)
    if (x>=i)
       return (x-i);
    // The natural formula would be C-(i-x), but we prefer this one...
    return C+(x-i);

This computes the correct wrap-around module C, at the cost of one additional if. We can avoid the if by using predication, as shown below

    x=(C-1)-(hash(sig) mod C)
    m = (x<i)*(~0)
    return (m&C)+(x-i);

which can be attractive for SSE implementations where the comparison produces a bit mask anyway.

We have validated that this new f function is now self-inverse for all possible values of i, sig, and C. And we did this by not just looking at the formula, but by trying out all values programmatically. Which is a good way to get confidence in your approach; there is only a finite number of combinations, and we can test them all.
With this small fix, we can now enjoy Cuckoo Filters with arbitrarily sized tables.

Edit:  The original post did not mirror the hash space correctly (using C-... instead of (C-1)-...), thanks to Andreas Kipf for pointing this out.

Thursday, May 16, 2019

Why use learning when you can fit?

We recently had a talk by Tim Kraska in our group, and he spoke among other things about learned indexes. As I had mentioned before, I am more in favor of using suitably implemented b-trees, for reasons like update friendliness and distribution independence. But nevertheless, the talk made me curious: The model they are learning is in the end very primitive. It is a two-level linear model, i.e., they are using a linear function to select another linear function. But if that is enough, why do we need machine learning? A simple function fit should work just as well.

Thus, I tried the following:
1) we sort all data and keep it in an array, just like with learned indexes
2) we build the CDF
3) we fit a linear spline to the CDF minimizing the Chebyshev norm
4) we fit a polynomial function to the spline nodes
5) now we can lookup a value by evaluating first the polynomial function, then the spline, and then retrieving the values from the array. The previous step is always the seed to a local search in the next step.

As we bound the Chebyshev norm in each step, the lookup is in O(1), without any need for machine learning or other magic boxes.

Now admittedly there was some weasel wording in the previous paragraph: The lookup is in O(1), but the "constant" here is the Chebyshev norm of the fit, which means this only works well if we can find the good fit. But just the same is true for the learned indexes, of course.

Now do we find a good fit? In theory we know how to construct the optimal fit in O(n log n), but that paper is beyond me. I am not aware of any implementation, and the paper is much too vague to allow for one. But constructing a good fit is much easier, and can also be done in O(n log n). Using that algorithm, we can construct a linear spline that maximum error efficiently, and we know what the maximum is over the whole domain. Thus, we can probe the spline to get an estimate for the real value position, and we then can perform an efficient local search on a small, known, window of the data.

The only problem is evaluating the spline itself. Evaluating a linear spline is pretty cheap, but we have to find the appropriate knot points to evaluate. Traditionally, we find these with binary search again. Note that the spline is much smaller than the original data, but still we want to avoid the binary search. Thus, we construct a polynomial function to predict the spline knot, again minimizing the Chebyshev norm, which allows us to consider only a small subset of spline nodes, leading to the before mentioned time bound.

How well does this work in practice? On the map data set from the learned indexes paper and a log normal data set we get the following. (The learned indexes numbers are from the paper, the b-tree numbers are from here, and the spline numbers from this experiments. I still do not really know what the averages mean for the learned indexes, but probably the average errors averaged over all models).

Map datasize (MB)avg error
Learned Index (10,000)0.158 ± 45
Learned Index (100,000)1.532 ± 36
B-tree (10,000)0.15225
B-tree (100,000)1.5322
Spline (10,000)0.15193
Spline (100,000)1.5322

Log normal datasize (MB)avg error
Learned Index (10,000)0.1517,060 ± 61,072
Learned Index (100,000)1.5317,005 ± 60,959
B-tree (10,000)0.151330
B-tree (100,000)1.533
Spline (10,000)0.15153
Spline (100,000)1.531

Somewhat surprising the accuracy the accuracy of the spline is nearly identical to the interpolating b-tree for the real-world map data, which suggests that the separators span the domain reasonably well there. For the log normal data the spline is significantly better, and leads to nearly perfect predictions. Note that the original data sets contains many millions of data points in both cases, thus the prediction accuracy is really high.

For practical applications I still recommend the B-tree, of course, even though the polynomial+spline solution is in "O(1)" while the B-tree is in O(log n). I can update a B-tree just fine, including concurrency, recovery, etc., while I do not know how to do that with the polynomial+spline solution.
But if one wants to go the read-only/read-mostly route, the function functions could be attractive alternative the machine learning. The advantage of using fits is that the algorithms are reasonably fast, we understand how they work, and they give strong guarantees for the results.

Friday, February 1, 2019

Honest asymptotic complexity for search trees and tries

Fun fact that was pointed out to me by Viktor: All traditional books on algorithms and data structures that we could find gave the lookup costs of balanced search trees as O(log n) (i.e., the depth of the search tree), and the lookup costs of tries as O(k) (i.e., the length of the key). At a first glance this is a logarithmic time lookup against a linear time lookup, which makes people nervous when thinking about long keys.

But that argument is very unfair: Either we consider a key comparison a O(1) operation, then a tree lookup is indeed in O(log n), but then a trie lookup is in O(1)! As the length of the key has to be bounded by a constant to get O(1) comparisons, the depth of the trie is bounded, too. Or the length of the key matters, then a trie lookup is indeed in O(k), but then a tree lookup is in O(k log n). We have to compare with the key on every level, and if we are unlucky we have to look at the whole key, which gives the factor k.

Which of course makes tries much more attractive asymptotic wise. Note that we ignore wall clock times here, which are much more difficult to predict, but in many if not most cases tries are indeed much faster than search trees.

I believe the reason why text books get away with this unfair comparison is that they all present balanced search trees with integer keys:

image/svg+xml 8 10 4 1 6
While tries are traditionally introduced with string examples. If they had used string keys for balanced search trees instead it would have been clear that the key length matters:

image/svg+xml ABCD8 ABCD10 ABCD4 ABCD1 ABCD6
The trie examines every key byte at most once, while the search tree can examine every key byte log n times. Thus, the asymptotic complexity of tries is actually better than that of balanced search tries.