## Thursday, May 16, 2019

### Why use learning when you can fit?

We recently had a talk by Tim Kraska in our group, and he spoke among other things about learned indexes. As I had mentioned before, I am more in favor of using suitably implemented b-trees, for reasons like update friendliness and distribution independence. But nevertheless, the talk made me curious: The model they are learning is in the end very primitive. It is a two-level linear model, i.e., they are using a linear function to select another linear function. But if that is enough, why do we need machine learning? A simple function fit should work just as well.

Thus, I tried the following:
1) we sort all data and keep it in an array, just like with learned indexes
2) we build the CDF
3) we fit a linear spline to the CDF minimizing the Chebyshev norm
4) we fit a polynomial function to the spline nodes
5) now we can lookup a value by evaluating first the polynomial function, then the spline, and then retrieving the values from the array. The previous step is always the seed to a local search in the next step.

As we bound the Chebyshev norm in each step, the lookup is in O(1), without any need for machine learning or other magic boxes.

Now admittedly there was some weasel wording in the previous paragraph: The lookup is in O(1), but the "constant" here is the Chebyshev norm of the fit, which means this only works well if we can find the good fit. But just the same is true for the learned indexes, of course.

Now do we find a good fit? In theory we know how to construct the optimal fit in O(n log n), but that paper is beyond me. I am not aware of any implementation, and the paper is much too vague to allow for one. But constructing a good fit is much easier, and can also be done in O(n log n). Using that algorithm, we can construct a linear spline that maximum error efficiently, and we know what the maximum is over the whole domain. Thus, we can probe the spline to get an estimate for the real value position, and we then can perform an efficient local search on a small, known, window of the data.

The only problem is evaluating the spline itself. Evaluating a linear spline is pretty cheap, but we have to find the appropriate knot points to evaluate. Traditionally, we find these with binary search again. Note that the spline is much smaller than the original data, but still we want to avoid the binary search. Thus, we construct a polynomial function to predict the spline knot, again minimizing the Chebyshev norm, which allows us to consider only a small subset of spline nodes, leading to the before mentioned time bound.

How well does this work in practice? On the map data set from the learned indexes paper and a log normal data set we get the following. (The learned indexes numbers are from the paper, the b-tree numbers are from here, and the spline numbers from this experiments. I still do not really know what the averages mean for the learned indexes, but probably the average errors averaged over all models).

 Map data size (MB) avg error Learned Index (10,000) 0.15 8 ± 45 Learned Index (100,000) 1.53 2 ± 36 B-tree (10,000) 0.15 225 B-tree (100,000) 1.53 22 Spline (10,000) 0.15 193 Spline (100,000) 1.53 22

 Log normal data size (MB) avg error Learned Index (10,000) 0.15 17,060 ± 61,072 Learned Index (100,000) 1.53 17,005 ± 60,959 B-tree (10,000) 0.15 1330 B-tree (100,000) 1.53 3 Spline (10,000) 0.15 153 Spline (100,000) 1.53 1

Somewhat surprising the accuracy the accuracy of the spline is nearly identical to the interpolating b-tree for the real-world map data, which suggests that the separators span the domain reasonably well there. For the log normal data the spline is significantly better, and leads to nearly perfect predictions. Note that the original data sets contains many millions of data points in both cases, thus the prediction accuracy is really high.

For practical applications I still recommend the B-tree, of course, even though the polynomial+spline solution is in "O(1)" while the B-tree is in O(log n). I can update a B-tree just fine, including concurrency, recovery, etc., while I do not know how to do that with the polynomial+spline solution.
But if one wants to go the read-only/read-mostly route, the function functions could be attractive alternative the machine learning. The advantage of using fits is that the algorithms are reasonably fast, we understand how they work, and they give strong guarantees for the results.

1. There is an exciting unexplored spectrum of applying math to more traditional b-tree node representations for reducing space and the branchiness of search. Recently I've been achieving promising results in the sled node representation by using a relatively simple stride detector on the keys.

If all keys are a fixed stride apart, we only need to store the node's low key + number of children + stride, and all keys can be losslessly derived through addition, and making the pertinent parts of node traversal branch-free.

This is only a moderately invasive change, and not too much of the overall tree's merging and splitting logic is impacted. I don't apply it when keys are different lengths, but for index nodes where suffix truncation has percolated shorter keys upwards it sometimes results in far more compact and more efficiently-traversed indexes even with messier input sets (as long as the descendents are lexicographically uniform, as may be the case with F1-like nested tables where the key length varies but for sequential workloads it still results in uniform strides at higher index levels). I was actually trying to emulate entropy-coded tries after reading the SILT paper, and only later did Ryan Marcus point out to me that there is in essence a similar technique happening under the hood for some learned index structures.

1. That is an interesting idea, but one that probably has consequences for the b-tree algorithm itself. Your scheme works best if the data within a node is quite regular. Ideally you would want to take this into account when splitting nodes, i.e., you will not only want to look at the fill degree but also at the content of a node when deciding where to split.

2. Absolutely - one of the things that I'm currently spending effort on is finding cases where a "downgrade" can occur, where an extremely compact fixed-stride node receives an out-of-stride key, causing a cascade of:

1. converting all formerly-implicit keys into explicit keys (prefix encoded & suffix truncated but still costly), causing the node to inflate (we support dynamically sized nodes)
2. splitting the node, potentially many times if the values were short / empty - millions or even billions of implicit keys can be present on these highly compressed nodes if the value is empty as children were added by simply incrementing the child count
3. re-applying the stride detection where applicable and making the keys implicit again, causing the node to significantly shrink
4. re-merging the re-shrunken nodes, finally ending up with 2 stride-detected nodes - one to either side of an inflated node containing the anomalous update

There are a lot of opportunities for short-circuiting most of this and avoiding the work by applying pre-flight splits at targeted locations or applying other techniques like deletion bitmaps. It's pretty cheap to calculate whether an operation will cause the downgrade, but it's more complexity. Another option is to auto-tune and avoid stride detection in the first place if its associated costs exceed its benefits (which can be strong), but this doesn't solve the problem of downgrading large sections of the index as workloads shift over time.

Working out the kinks has been keeping me busy for the past few days, but the space savings and being able to avoid any interpolation search at all for such a common write distribution is highly motivating :)