Database Architects: Why use learning when you can fit?

Thursday, May 16, 2019

Why use learning when you can fit?

We recently had a talk by Tim Kraska in our group, and he spoke among other things about learned indexes. As I had mentioned before, I am more in favor of using suitably implemented b-trees, for reasons like update friendliness and distribution independence. But nevertheless, the talk made me curious: The model they are learning is in the end very primitive. It is a two-level linear model, i.e., they are using a linear function to select another linear function. But if that is enough, why do we need machine learning? A simple function fit should work just as well.

Thus, I tried the following:
1) we sort all data and keep it in an array, just like with learned indexes
2) we build the CDF
3) we fit a linear spline to the CDF minimizing the Chebyshev norm
4) we fit a polynomial function to the spline nodes
5) now we can look up a value by first evaluating the polynomial function, then the spline, and then retrieving the values from the array. The result of each step is always the seed for a local search in the next step.
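
To make steps 1, 2 and 5 concrete, here is a minimal Python sketch. It is not the code behind the numbers below. It assumes distinct, sorted integer keys, and as a stand-in for the Chebyshev-minimizing fit of step 3 it simply takes every `step`-th point of the CDF as a spline knot and measures the resulting maximum error afterwards. The function names are made up for illustration, and the knot is still found by plain binary search here.

```python
from bisect import bisect_left

def build_knots(keys, step):
    """Sample every `step`-th (key, position) pair of the CDF as a spline knot.
    Stand-in for a real fit; always includes the first and the last point."""
    idx = list(range(0, len(keys), step))
    if idx[-1] != len(keys) - 1:
        idx.append(len(keys) - 1)
    return [keys[i] for i in idx], idx

def spline_predict(knot_keys, knot_pos, key):
    """Evaluate the linear spline at `key`, clamping outside the key domain."""
    if key <= knot_keys[0]:
        return knot_pos[0]
    if key >= knot_keys[-1]:
        return knot_pos[-1]
    j = bisect_left(knot_keys, key)  # knot_keys[j-1] < key <= knot_keys[j]
    x0, x1 = knot_keys[j - 1], knot_keys[j]
    y0, y1 = knot_pos[j - 1], knot_pos[j]
    return y0 + (key - x0) * (y1 - y0) / (x1 - x0)

def max_error(keys, knot_keys, knot_pos):
    """Chebyshev norm of the fit: largest |predicted - true position|."""
    return max(abs(spline_predict(knot_keys, knot_pos, k) - i)
               for i, k in enumerate(keys))

def lookup(keys, knot_keys, knot_pos, err, key):
    """Point lookup: predict a position, then search only the error window."""
    guess = spline_predict(knot_keys, knot_pos, key)
    lo = max(0, int(guess - err))
    hi = min(len(keys), int(guess + err) + 2)  # exclusive, with a little slack
    i = bisect_left(keys, key, lo, hi)
    return i if i < len(keys) and keys[i] == key else None
```

The window searched by `lookup` has at most about 2·err + 2 entries, independent of the size of `keys`, which is where the constant in the next paragraph comes from.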

As we bound the Chebyshev norm in each step, the lookup is in O(1), without any need for machine learning or other magic boxes.

Now admittedly there was some weasel wording in the previous paragraph: The lookup is in O(1), but the "constant" here is the Chebyshev norm of the fit, which means this only works well if we can find a good fit. But the same is true for the learned indexes, of course.

Now do we find a good fit? In theory we know how to construct the optimal fit in O(n log n), but that paper is beyond me. I am not aware of any implementation, and the paper is much too vague to allow for one. But constructing a good fit is much easier, and can also be done in O(n log n). Using that algorithm, we can efficiently construct a linear spline with a bounded maximum error, and we know what that maximum is over the whole domain. Thus, we can probe the spline to get an estimate for the real value position, and we can then perform an efficient local search on a small, known window of the data.
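
As an illustration of what an error-bounded fit can look like, here is a sketch of a simple one-pass greedy "shrinking cone" construction. This is not necessarily the algorithm referred to above, and it makes no attempt to minimize the number of knots. It only guarantees that, for every key in the input, the spline is off by at most `max_err` positions. It assumes distinct, sorted keys, and the function names are illustrative.

```python
import math

def fit_spline(keys, max_err):
    """Greedy fit: returns knots (key, position) such that linear interpolation
    between consecutive knots is within max_err of every key's true position."""
    knots = [(keys[0], 0)]
    bx, by = keys[0], 0               # base point of the current segment
    lo, hi = -math.inf, math.inf      # admissible slopes from the base point
    px, py = bx, by                   # previously seen point
    for y in range(1, len(keys)):
        x = keys[y]
        dx = x - bx
        slope = (y - by) / dx
        if slope < lo or slope > hi:
            # A segment ending here would miss an earlier point by more than
            # max_err, so close the segment at the previous point instead.
            knots.append((px, py))
            bx, by = px, py
            dx = x - bx
            lo, hi = (y - max_err - by) / dx, (y + max_err - by) / dx
        else:
            # Shrink the cone so later segments stay within max_err of (x, y).
            lo = max(lo, (y - max_err - by) / dx)
            hi = min(hi, (y + max_err - by) / dx)
        px, py = x, y
    if knots[-1] != (px, py):
        knots.append((px, py))
    return knots

def spline_max_error(keys, knots):
    """Measure the Chebyshev norm of the fit over all keys."""
    if len(knots) < 2:
        return 0.0
    worst, j = 0.0, 0
    for i, x in enumerate(keys):
        while j + 2 < len(knots) and knots[j + 1][0] < x:
            j += 1
        (x0, y0), (x1, y1) = knots[j], knots[j + 1]
        worst = max(worst, abs(y0 + (x - x0) * (y1 - y0) / (x1 - x0) - i))
    return worst
```

Because every knot is itself a data point, the knots can be stored as two small sorted arrays and probed exactly like the original data.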

The only problem is evaluating the spline itself. Evaluating a linear spline is pretty cheap, but we have to find the appropriate knot points to evaluate. Traditionally, we find these with binary search again. Note that the spline is much smaller than the original data, but still we want to avoid the binary search. Thus, we construct a polynomial function to predict the spline knot, again minimizing the Chebyshev norm, which allows us to consider only a small subset of spline nodes, leading to the aforementioned time bound.
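
A sketch of this second level, under simplifying assumptions: instead of a Chebyshev-minimizing polynomial it uses a plain least-squares line (a degree-1 polynomial) from knot key to knot index, and then measures the maximum error of that line over all knots. Because the line is monotone, the measured error also bounds where the segment for any key inside the domain can be. The names `knot_model` and `find_segment` are made up for this example.

```python
import math
from bisect import bisect_right

def knot_model(knot_keys):
    """Least-squares line knot_keys[j] -> j, plus its maximum error over all knots."""
    n = len(knot_keys)
    mean_x = sum(knot_keys) / n
    mean_y = (n - 1) / 2
    sxx = sum((x - mean_x) ** 2 for x in knot_keys)
    sxy = sum((x - mean_x) * (j - mean_y) for j, x in enumerate(knot_keys))
    slope = sxy / sxx if sxx else 0.0
    icpt = mean_y - slope * mean_x
    err = max(abs(slope * x + icpt - j) for j, x in enumerate(knot_keys))
    return slope, icpt, math.ceil(err)

def find_segment(knot_keys, model, key):
    """Return j such that knot_keys[j] <= key <= knot_keys[j+1],
    inspecting only the knots inside the model's error window."""
    n = len(knot_keys)
    if key <= knot_keys[0]:
        return 0
    if key >= knot_keys[-1]:
        return n - 2
    slope, icpt, err = model
    guess = slope * key + icpt
    # For a non-decreasing model the true index lies in [guess - 1 - err, guess + err].
    lo = max(0, math.floor(guess - err) - 1)
    hi = min(n, math.ceil(guess + err) + 2)
    return bisect_right(knot_keys, key, lo, hi) - 1
```

A higher-degree polynomial would need the same kind of argument for its window, and it is not automatically monotone, which is one reason this sketch sticks to a line.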

How well does this work in practice? On the map data set from the learned indexes paper and a log normal data set we get the following. (The learned indexes numbers are from the paper, the b-tree numbers are from here, and the spline numbers are from these experiments. I still do not really know what the averages mean for the learned indexes, but they are probably the average errors averaged over all models.)

Map data	size (MB)	avg error
Learned Index (10,000)	0.15	8 ± 45
Learned Index (100,000)	1.53	2 ± 36
B-tree (10,000)	0.15	225
B-tree (100,000)	1.53	22
Spline (10,000)	0.15	193
Spline (100,000)	1.53	22

Log normal data	size (MB)	avg error
Learned Index (10,000)	0.15	17,060 ± 61,072
Learned Index (100,000)	1.53	17,005 ± 60,959
B-tree (10,000)	0.15	1330
B-tree (100,000)	1.53	3
Spline (10,000)	0.15	153
Spline (100,000)	1.53	1

Somewhat surprisingly, the accuracy of the spline is nearly identical to that of the interpolating b-tree for the real-world map data, which suggests that the separators span the domain reasonably well there. For the log normal data the spline is significantly better, and leads to nearly perfect predictions. Note that the original data sets contain many millions of data points in both cases, thus the prediction accuracy is really high.

For practical applications I still recommend the B-tree, of course, even though the polynomial+spline solution is in "O(1)" while the B-tree is in O(log n). I can update a B-tree just fine, including concurrency, recovery, etc., while I do not know how to do that with the polynomial+spline solution.
But if one wants to go the read-only/read-mostly route, the function fits could be an attractive alternative to machine learning. The advantage of using fits is that the algorithms are reasonably fast, we understand how they work, and they give strong guarantees for the results.

3 comments:

Tyler Neely, January 27, 2021 at 11:07 AM
There is an exciting unexplored spectrum of applying math to more traditional b-tree node representations for reducing space and the branchiness of search. Recently I've been achieving promising results in the sled node representation by using a relatively simple stride detector on the keys.

If all keys are a fixed stride apart, we only need to store the node's low key + number of children + stride, and all keys can be losslessly derived through addition, which makes the pertinent parts of node traversal branch-free.

This is only a moderately invasive change, and not too much of the overall tree's merging and splitting logic is impacted. I don't apply it when keys are different lengths, but for index nodes where suffix truncation has percolated shorter keys upwards it sometimes results in far more compact and more efficiently-traversed indexes even with messier input sets (as long as the descendents are lexicographically uniform, as may be the case with F1-like nested tables where the key length varies but for sequential workloads it still results in uniform strides at higher index levels). I was actually trying to emulate entropy-coded tries after reading the SILT paper, and only later did Ryan Marcus point out to me that there is in essence a similar technique happening under the hood for some learned index structures.