Thus, I tried the following:
1) we sort all data and keep it in an array, just like with learned indexes
2) we build the CDF
3) we fit a linear spline to the CDF minimizing the Chebyshev norm
4) we fit a polynomial function to the spline nodes
5) now we can lookup a value by evaluating first the polynomial function, then the spline, and then retrieving the values from the array. The previous step is always the seed to a local search in the next step.
As we bound the Chebyshev norm in each step, the lookup is in O(1), without any need for machine learning or other magic boxes.
Now admittedly there was some weasel wording in the previous paragraph: The lookup is in O(1), but the "constant" here is the Chebyshev norm of the fit, which means this only works well if we can find the good fit. But just the same is true for the learned indexes, of course.
Now do we find a good fit? In theory we know how to construct the optimal fit in O(n log n), but that paper is beyond me. I am not aware of any implementation, and the paper is much too vague to allow for one. But constructing a good fit is much easier, and can also be done in O(n log n). Using that algorithm, we can construct a linear spline that maximum error efficiently, and we know what the maximum is over the whole domain. Thus, we can probe the spline to get an estimate for the real value position, and we then can perform an efficient local search on a small, known, window of the data.
The only problem is evaluating the spline itself. Evaluating a linear spline is pretty cheap, but we have to find the appropriate knot points to evaluate. Traditionally, we find these with binary search again. Note that the spline is much smaller than the original data, but still we want to avoid the binary search. Thus, we construct a polynomial function to predict the spline knot, again minimizing the Chebyshev norm, which allows us to consider only a small subset of spline nodes, leading to the before mentioned time bound.
How well does this work in practice? On the map data set from the learned indexes paper and a log normal data set we get the following. (The learned indexes numbers are from the paper, the b-tree numbers are from here, and the spline numbers from this experiments. I still do not really know what the averages mean for the learned indexes, but probably the average errors averaged over all models).
Map data | size (MB) | avg error |
Learned Index (10,000) | 0.15 | 8 ± 45 |
Learned Index (100,000) | 1.53 | 2 ± 36 |
B-tree (10,000) | 0.15 | 225 |
B-tree (100,000) | 1.53 | 22 |
Spline (10,000) | 0.15 | 193 |
Spline (100,000) | 1.53 | 22 |
Log normal data | size (MB) | avg error |
Learned Index (10,000) | 0.15 | 17,060 ± 61,072 |
Learned Index (100,000) | 1.53 | 17,005 ± 60,959 |
B-tree (10,000) | 0.15 | 1330 |
B-tree (100,000) | 1.53 | 3 |
Spline (10,000) | 0.15 | 153 |
Spline (100,000) | 1.53 | 1 |
Somewhat surprising the accuracy the accuracy of the spline is nearly identical to the interpolating b-tree for the real-world map data, which suggests that the separators span the domain reasonably well there. For the log normal data the spline is significantly better, and leads to nearly perfect predictions. Note that the original data sets contains many millions of data points in both cases, thus the prediction accuracy is really high.
For practical applications I still recommend the B-tree, of course, even though the polynomial+spline solution is in "O(1)" while the B-tree is in O(log n). I can update a B-tree just fine, including concurrency, recovery, etc., while I do not know how to do that with the polynomial+spline solution.
But if one wants to go the read-only/read-mostly route, the function functions could be attractive alternative the machine learning. The advantage of using fits is that the algorithms are reasonably fast, we understand how they work, and they give strong guarantees for the results.