Saturday, December 23, 2017

The Case for B-Tree Index Structures

Recently a very interesting paper made a Case for Learned Index Structures. It argued that we could, and perhaps should, replace traditional index structures with machine learning, using the following reasoning: if we consider the leaf pages of an index as a sorted array, the inner pages of the index point towards a (bucketized) position within that array, which means that the index essentially describes the cumulative distribution function (CDF), mapping keys to array positions.
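Put differently, if the sorted leaf array contains N entries, then the position of a key is roughly

pos(key) ≈ CDF(key) · N

so any sufficiently accurate approximation of the CDF immediately yields an approximate position within the array.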

The argument of that paper was that machine learning can do this mapping much better, because a) the learned model (in this case a neural network) is much smaller than a traditional b-tree, and b) the learned model can predict the CDF value much more accurately than a simple b-tree, which improves performance.

Now I am all in favor of trying out new ideas, and adapting to the data distribution is clearly a good idea, but do we really need a neural network for that? After all, the neural network is just an approximation of the CDF. There are many other ways to approximate a function, for example spline interpolation: we define a few knots of the spline, and then interpolate between the knots. For example (picture by D.J. Graham):


Thus, what we need for a spline is a sequence of knots between which we can interpolate, i.e., a sequence of (x,y) values.
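Concretely, if a key x falls between two neighboring knots (x_i, y_i) and (x_{i+1}, y_{i+1}), a linear spline estimates its position as

y ≈ y_i + (x - x_i) / (x_{i+1} - x_i) · (y_{i+1} - y_i)

which is exactly the kind of cheap interpolation we will exploit below.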

Now, if we think back to traditional index structures, in particular B-trees, we see that they already contain something similar:



The inner pages consist of separator values and offsets to the next lower level. We can interpret that as a spline: instead of just going down to the next level and then doing a binary search in the next node, we can interpret our search key as a position between the two enclosing separators, and interpolate the position of our search key on the next level. This estimate will be slightly off, of course, but the same is true for the machine learning approach, and we can use the same binary search strategy starting from the estimated position. We can use that interpolation strategy on all levels, both when navigating the inner pages and when going down to the leaf nodes.
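To make that concrete, here is a minimal sketch of the lookup step (assuming 64-bit integer keys stored in plain sorted arrays; the names and the std::vector-based layout are purely illustrative and not our actual proof of concept code): we first turn the two enclosing separators into a position estimate for the next level, and then correct that estimate by growing a search window around it until the key is enclosed, finishing with an ordinary binary search.

#include <algorithm>
#include <cstdint>
#include <vector>

// Estimate the position of 'key' inside a node with 'nodeSize' entries,
// interpreting the key as a fraction between the two separators that
// enclose that node on the level above.
size_t estimatePosition(uint64_t key, uint64_t lowerSep, uint64_t upperSep, size_t nodeSize)
{
   if ((upperSep <= lowerSep) || (nodeSize == 0)) return 0;
   double fraction = double(key - lowerSep) / double(upperSep - lowerSep);
   size_t pos = static_cast<size_t>(fraction * nodeSize);
   return (pos < nodeSize) ? pos : (nodeSize - 1);
}

// Find the first entry >= key (as std::lower_bound would), but start from the
// estimated position and widen the range exponentially until the true
// position is enclosed, then finish with a regular binary search.
size_t searchFromEstimate(const std::vector<uint64_t>& keys, uint64_t key, size_t estimate)
{
   size_t n = keys.size();
   if (n == 0) return 0;
   if (estimate >= n) estimate = n - 1;

   size_t lo, hi;
   if (keys[estimate] >= key) {
      // the true position is at or before the estimate: gallop to the left
      lo = 0; hi = estimate;
      for (size_t step = 1, pos = estimate; pos > 0; step *= 2) {
         pos = (pos > step) ? (pos - step) : 0;
         if (keys[pos] < key) { lo = pos + 1; break; }
         hi = pos;
      }
   } else {
      // the true position is behind the estimate: gallop to the right
      lo = estimate + 1; hi = n;
      for (size_t step = 1, pos = estimate; pos + step < n; step *= 2) {
         pos += step;
         if (keys[pos] >= key) { hi = pos; break; }
         lo = pos + 1;
      }
   }
   return std::lower_bound(keys.begin() + lo, keys.begin() + hi, key) - keys.begin();
}

If the estimate is good, the galloping phase only touches a handful of entries around the predicted position, which is exactly where prediction accuracy pays off.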

How well does that work in practice? The learned indexes paper gives accuracy and performance results for different sizes of neural network models. In the paper the b-trees are depicted as being very large, but in reality their size is a parameter, of course: we can get arbitrarily sized b-trees by modifying the page size. For the comparisons we chose the b-trees to have the same size (in KB) as the neural networks reported in the paper. The source code of the learned indexes approach is not available; thus we only report the original numbers for the neural networks. Our own proof of concept code is available upon request. As data sets we used the map set and the lognormal set mentioned in the paper, as we could not obtain the other data sets.

If we just look at the accuracy of the prediction of the final tuple position, we get the average errors shown below. For the b-trees we report the distance between the estimated position and the real tuple position, averaged over all elements in the data set. For the neural networks the wording in the paper is a bit unclear; we think the numbers are the average of the average errors of the second-level models, which might be a slightly different metric.
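For clarity, the error we report is simply the mean absolute distance between predicted and true array position, along the following lines (the predict parameter stands for the interpolation logic sketched above; again purely illustrative):

#include <cstdint>
#include <functional>
#include <vector>

// Mean absolute distance between predicted and true position, computed over
// all keys of the (sorted) data set. 'predict' is the position estimator.
double averageError(const std::vector<uint64_t>& data,
                    const std::function<size_t(uint64_t)>& predict)
{
   if (data.empty()) return 0.0;
   double sum = 0.0;
   for (size_t i = 0; i < data.size(); ++i) {
      size_t estimate = predict(data[i]);
      sum += (estimate > i) ? double(estimate - i) : double(i - estimate);
   }
   return sum / double(data.size());
}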

Map data                 | size (MB) | avg error
Learned Index (10,000)   | 0.15      | 8 ± 45
Learned Index (100,000)  | 1.53      | 2 ± 36
Complex Learned Index    | 1.53      | 2 ± 30
B-tree (10,000)          | 0.15      | 225
B-tree (100,000)         | 1.53      | 22

Log normal data          | size (MB) | avg error
Learned Index (10,000)   | 0.15      | 17,060 ± 61,072
Learned Index (100,000)  | 1.53      | 17,005 ± 60,959
Complex Learned Index    | 1.53      | 8 ± 33
B-tree (10,000)          | 0.15      | 1,330
B-tree (100,000)         | 1.53      | 3

If we look at the numbers, the interpolating b-tree does not perform badly at all. For the map data the learned index is a bit more accurate, but the difference is small. For the log normal data the interpolating b-tree is in fact much more accurate than the learned index, and can predict the final position very accurately.

What does that mean for index performance? That is a complicated topic, as we do not have the source code of the learned index and we do not even know precisely on which hardware the experiments were run. We therefore only give some indicative numbers, being fully aware that we might be comparing apples with oranges due to differences in hardware and implementation. If we compare the reported numbers from the paper for the lognormal data with our proof of concept implementation (running on an i7-5820K @ 3.30GHz, searching for every element in the data set in shuffled order), we get the following:


Log normal data          | Total (ns) | Model (ns) | Search (ns)
Learned Index (10,000)   | 178        | 26         | 152
Learned Index (100,000)  | 152        | 36         | 127
Complex Learned Index    | 178        | 110        | 67
B-tree (10,000)          | 156        | 101        | 54
B-tree (100,000)         | 171        | 159        | 12

Again, the b-tree does not perform badly at all, being virtually identical to the reported learned index performance (remember the caveat about hardware differences!). And the b-tree is a very well understood data structure, well tested, with efficient update support etc., while the machine learning model will have great difficulties if the data is updated later on. Thus, I would argue that traditional index structures, in particular b-trees, are still the method of choice, and will probably remain so for the foreseeable future.

Does this mean we should not consider machine learning for indexing? No, we should consider everything that helps. It is just that "everything that helps" does not just include fashionable trends like machine learning, but also efficient implementations of well known data structures.

Friday, February 17, 2017

Reasoning in the presence of NULLs

One of the defining characteristics of SQL is that it is a declarative query language. That is, the user does not specify how the result should be computed, but instead specifies what conditions the result should satisfy. Within these constraints the database is free to choose between execution alternatives. This has some interesting consequences:

Consider for example the query

select x=a and x=b

clearly, it is equivalent to the following query

select x=a and a=b

after all, x=a holds, and thus we can replace x with a in the second term.

And of course the same is true if we replace a and b with constants:

select x=1 and x=2

is equivalent to

select x=1 and 1=2

Now the really interesting question is: Is a database system allowed to infer that this is equivalent to

select false


? After all, x cannot be equal to 1 and equal to 2 simultaneously, and the second formulation is x=1 and false, which is clearly false. But what looks intuitive is not necessarily true, as we can see from the results of the following two queries (with PostgreSQL results):
postgres=> select x,x=1 and x=2 from (values(1),(null)) s(x);
  x   | ?column? 
------+----------
    1 | f
 NULL | NULL
(2 rows)

postgres=> select x,x=1 and 1=2 from (values(1),(null)) s(x);
  x   | ?column? 
------+----------
    1 | f
 NULL | f
(2 rows)

The reason for that difference is the tricky behavior of NULL: 1) a comparison of a value with NULL is NULL, 2) NULL AND NULL is NULL, but 3) NULL AND FALSE is FALSE.
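These three rules can be checked directly: the query

select (1 = null), (null and null), (null and false)

returns NULL, NULL, and false.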

Both 1) and 3) make sense: 1) because we cannot say anything about the result of the comparison, and 3) because no matter which value NULL really stands for (either true or false), the AND with FALSE will always produce FALSE. But in combination they lead to very surprising behavior, namely that some databases return NULL and others return FALSE for the same query, depending on whether or not they noticed that the query contains a contradiction.

And this kind of reasoning occurs in many other circumstances, too, for example in this query

select cast(x as integer)=1.5


obviously, this condition can never be true, regardless of the value of x, as 1.5 is not an integer. But what about NULL? Are we allowed to statically infer that the result is false, even if x might be a NULL value?
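To make the question concrete, here is the same two-row example as before; the comments show the two possible outcomes discussed below, not the output of any particular system:

-- evaluated row by row:       x=1 gives false, x=NULL gives NULL
-- comparison folded to false: x=1 gives false, x=NULL gives false
select x, cast(x as integer)=1.5 from (values(1),(null)) s(x)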

I tried to find an answer to that question in the SQL:2011 standard, but unfortunately it is not specified clearly. But after looking at the definition of AND, I have convinced myself that returning false here is ok. After all, the argument for NULL AND FALSE being FALSE is that NULL is an unknown value from within the domain, and any value AND FALSE is FALSE. If we use the same argument here, we can statically decide that the result is false, regardless of the value of x.

But even though the argument makes sense (and it is good for the query optimizer, which exploits that fact), it is still unfortunate that the query result changes depending on how smart the database system is. A smart system can infer that the result is false, no matter what, while a simpler system might return NULL. But perhaps that is just the consequence of having a declarative system: we cannot (and should not!) control in which order expressions are evaluated.