Friday, June 8, 2018

Propagation of Mistakes in Papers

While reading papers on cardinality estimation I noticed something odd: The seminal paper by Flajolet and Martin on probabilistic counting gives a bias correction constant as 0.77351, while a more recent (and very useful) paper by Scheuermann and Mauve gives the constant as 0.775351. Was this a mistake? Or did they correct a mistake in the original paper?

I started searching, and there is a large number of papers that uses the value 0.775351, but there is also a number of papers that uses the value 0.77351. Judging by the number of Google hits for "Flajolet 0.77351" vs. "Flajolet 0.775351" the 0.77351 group seems to be somewhat larger, but both camps have a significant number of publications. Interestingly, not a single paper mentions both constants, and thus no paper explains what the correct constant should be.

In the end I repeated the constant computation as explained by Flajolet, and the correct value is 0.77351. We can even derive one digit more when using double arithmetic (i.e., 0.773516), but that makes no difference in practice. Thus, the original paper was correct.

But why do so many paper use the incorrect value 0.775351 then? My guess is that at some point somebody made a typo while writing a paper, introducing the superfluous digit 5, and that all other authors copied the constant from that paper without re-checking its value. I am not 100% sure what the origin of the mistake is. The incorrect value seems to appear first in the year 2007, showing up in multiple publications from that year. Judging by publication date the source seems to be this paper (also it did not cite any other papers with the incorrect value, as far as I know). And everybody else just copied the constant from somewhere else, propagating it from paper to paper.

If you find this web page because you are searching for the correct Flajolet/Martin bias correction constant, I can assure you that the original paper was correct, and that the value is 0.77351. But you do not have to trust me on this, you can just repeat the computation yourself.

Monday, April 23, 2018

Addendum to the MILP Optimization Times

In our upcoming SIGMOD paper on Adaptive Optimization of Very Large Join Queries we compared result quality and optimization times of over a dozen approaches for join ordering, over a wide range of query sizes. Which is quite a challenging problem, as the different algorithms often work under different assumptions, usually no reference implementation is available, and we had to unify them all into one framework that can handle joins from 10 to 5,000 relations.

One of the approaches that we included was Solving the Join Ordering Problem via Mixed Integer Linear Programming by and Christoph Koch. Our implementation tries to follow the original paper faithfully, implementing the mapping from query graph to MILP problem just as described in the original paper. For some benchmarks like TPC-DS (up to 18 relations in a join, with a median of 3) that implementation worked fine. But for some other benchmarks like the Join Order Benchmark (up to 17 relations, median 8) and the SQLite join set (up to 64 relations, median 34) we saw significant optimization times on our Xeon E7-4870 system: In total 290s for the JOB queries, and 5,100s for the SQLite queries. (Note that the JOB times do not include queries with non-inner join edges, as these currently cannot be handled by the MILP approach).

Immanuel pointed out to me that we can improve the optimization time quite a bit by initializing the start position for the Gurobi solver to a solution constructed by a greedy heuristic. In particular for the SQLite queries that helps a lot, as the greedy solution works very well there and thus the start position is already very good. The optimization times for his implementation (on a weaker hardware, a 2.2 GHz Intel Core i7 laptop) are 52s for the Join Order Benchmark and 44s for the SQLite queries.

I am glad for the hint, and thus amend the numbers here. As far as a I can see it this initialization trick was not mentioned in the original SIGMOD17 paper. But of course a carefully tuned implementation will often have tricks that are unfortunately not described in detail in the corresponding publication.

If there are any more comments about any of the approaches we measured I am happy to hear them.

Saturday, December 23, 2017

The Case for B-Tree Index Structures

Recently a very interesting paper made a Case for Learned Index Structures. It argued that we could, and perhaps should, replace traditional index structures with machine learning, using the following reasoning: If we consider the leaf pages of an index as a sorted array, the inner pages of the index point towards a (bucketized) position within that array. Which means that it essentially describes the cummulative distribution function (CDF), mapping from keys to array positions.

And the argument of that paper was that using machine learning we can do that mapping much better because a) the learned model (in this case neuronal network) is much smaller than a traditional b-tree, and b) the learned model can predict the CDF value much more accurately than a simple b-tree, which improves performance.

Now I am all in favor of trying out new ideas, and adapting to the data distribution is clearly a good idea, but do we really need a neural network for that? Because, after all, the neuronal network is just an approximation of the CDF function. There are many other ways to approximate a function, for example spline interpolation: We define a few knots of the spline, and then interpolate between the knots. For example (picture by D.J. Graham)


Thus, what we need for a spline are a sequence of knots where we can interpolate between, i.e., a sequence of (x,y) values.

Now, if we think back about traditional index structures, in particular B-trees, we see that they have something similar:



The inner pages consist of separator values, and offsets to the next lower level. Well, we can interpret that as a spline. Instead of just going down to the next level and then doing binary search in the next node, we can interpret our search key as position between the two separators, and then interpolate the position of our search key one the next level. This estimate will be slightly off, of course, but the same is true for the machine learning approach, and we can use the same binary search strategy starting from our estimated position. We can use that interpolation strategy on all levels, both when navigating the inner pages and then going down to the leaf nodes.

How well does that work in practice? The learned indexes paper gives accuracy results and performance results for different sizes of neuronal network models. In the paper the b-trees are depicted as being very large, but in reality that is a parameter, of course. We can get arbitrarily sized b-trees by modifying the page size of the b-tree. For comparisons we chose the b-trees to have the same size (in KB) as the neuronal networks reported in the paper. The source code of the learned indexes approach is not available, thus we only report the original numbers for the neuronal networks. Our own proof of concept code is available upon request. As data sets we used the map set and the lognormal set mentioned in the paper, as we could not obtain the other data sets.

If we just look at the accuracy of the prediction of the final tuple we get as average error the number shown below. For the b-trees we report distance between the estimated position and the real tuple position, averaged over all elements in the data set. For the neuronal networks the wording in the paper is a bit unclear, we think the numbers are the average of the average errors of the second level models, which might be slightly different.

Map datasize (MB)avg error
Learned Index (10,000)0.158 ± 45
Learned Index (100,000)1.532 ± 36
Complex Learned Index1.532 ± 30
B-tree (10,000)0.15225
B-tree (100,000)1.5322

Log normal datasize (MB)avg error
Learned Index (10,000)0.1517,060 ± 61,072
Learned Index (100,000)1.5317,005 ± 60,959
Complex Learned Index1.538 ± 33
B-tree (10,000)0.151,330
B-tree (100,000)1.533

If we look at the numbers, the interpolating b-tree doesn't perform that bad. For the map data the learned index is a bit more accurate, but the difference is small. For the log normal data the interpolating b-tree is in fact much more accurate than the learned index, being able to predict the final position very accurately.

What does that mean for index performance? That is a complicated topic, as we do not have the source code of the learned index and we do not even know precisely on which hardware the experiments were run. We thus only give some indicative numbers, being fully aware that we might be comparing apples with oranges due to various differences in hardware and implementation. If we compare the reported numbers from the paper for lognormal with our proof of
concept implementation (running on a i7-5820K @ 3.30GHz, searching for every element in the data set in shuffled order) we get


Log normal dataTotal (ns)Model (ns)Search (ns)
Learned Index (10,000)17826152
Learned Index (100,000)15236127
Complex Learned Index17811067
B-tree (10,000)15610154
B-tree (100,000)17115912

Again, the b-tree does not perform that bad, being virtually identical to the reported learned index performance (remember the caveat about hardware differences!). And the b-tree is a very well understood data structure, well tested, with efficient update support etc., while the machine learning model will have great difficulties if the data is updated later on. Thus, I would argue that traditional index structures, in particular b-trees, are still the method of choice, and will probably remain so in the foreseeable future.

Does this mean we should not consider machine learning for indexing? No, we should consider everything that helps. It is just that "everything that helps" does not just include fashionable trends like machine learning, but also efficient implementations of well known data structures.

Friday, February 17, 2017

Reasoning in the presence of NULLs

One of the defining characteristics of SQL is that it is a declarative query language. That is, the user does not specify how the result should be computed, but instead specifies what conditions the result should satisfy. Within these constraints the database is free to choose between execution alternatives. This has some interesting consequences:

Consider for example the query

select x=a and x=b

clearly, it is equivalent to the following query

select x=a and a=b

after all, x=a, and thus we can substitute a with x in the second term.

And of course the same is true is we replace a and b with constants:

select x=1 and x=2

is equivalent to

select x=1 and 1=2

Now the really interesting question is: Is a database system allowed to infer that this is equivalent to

select false


? After all, x cannot be equal 1 and equal to 2 simultaneously, and the second formulation is x=1 and false, which is clearly false. But what looks intuitive is not necessarily true, if we consider the result of the following two queries (with PostgreSQL results):
postgres=> select x,x=1 and x=2 from (values(1),(null)) s(x);
  x   | ?column? 
------+----------
    1 | f
 NULL | NULL
(2 rows)

postgres=> select x,x=1 and 1=2 from (values(1),(null)) s(x);
  x   | ?column? 
------+----------
    1 | f
 NULL | f
(2 rows)

The reason for that difference is the tricky behavior of NULL: 1) A comparison of value with NULL is NULL, and 2) NULL AND NULL is NULL, but 3) NULL AND FALSE is FALSE.

Both 1) and 3) make sense, 1) because we cannot say anything about the result of the comparison, and 3) because it doesn't matter for which value NULL really stands (either true or false), the and with FALSE will always produce a FALSE value. But in combination, they lead to very surprising behavior, namely that some databases return NULL and other databases return FALSE for the same query, depending on whether they noticed that the query contained a contradiction or not.

And this kind of reasoning occurs in many other circumstances, too, for example in this query

select cast(x as integer)=1.5


obviously, this condition can never be true, regardless of the value for x, as 1.5 is not an integer number. But what about NULL? Are we allowed to statically infer that the result is false, even if x might be a NULL value?

I tried to find an answer to that question in the SQL:2011 standard, but unfortunately it is not specified clearly. But after looking at the definition of AND, I have convinced myself that returning false here is ok. After all, the argument for NULL AND FALSE being FALSE is that NULL is an unknown value from within the domain, and any value AND FALSE is FALSE. If we use the same argument here, we can statically decide that the result is false, regardless of the value of x.

But even though the argument makes sense (and it is good for the query optimizer, which exploits that fact), it is still unfortunate that the query result changes depending on how smart the database system is. A smart system can infer that the result is false, not matter what, while a more simple system might return NULL. But perhaps that is just the consequence of having a declarative system, we cannot (and should not!) control in which order expressions are evaluated.

Wednesday, August 24, 2016

Equivalence of Unicode strings is strange

Originally HyPer had a very simple model for strings: We made sure that all strings are valid UTF-8, but otherwise did not really care about the intrinsics of Unicode. And that is actually a quite sane model for most applications. Usually we do not care about the precise string structure anyway, and in the few places that we do (e.g., strpos and substr), we add some extra logic to handle UTF-8 correctly.

That was all good until users started complaining about sort orders. When sorting UTF-8 naively, we sort by code point value, which is often ok but not always what users want. So we added support for collations:
select * from foo order by x collate "de_DE";

And that is where life starts getting interesting. The collate statement can either be given explicitly at any place in the query, as shown above, or added in a create table statement, giving a default collation for a column. So basically every value that is ordered or compared within a SQL statement can have an associated collation.

Initially I naively though that this would just affect the comparison operation, like calling strcoll instead of strcmp, but life is unfortunately much more complex. First, the Unicode collation algorithm is quite involved, translating the input sequence into a sequence of 3 (or 4) level of weights before comparisons, and second, some insane database systems default to case-insensitive collations, and some users actually want that behavior.

Why is case-insensitivity insane? Because it leads to strange semantics. First of all, the whole Unicode weighting mechanism is strange. Some of the strangeness can be hidden by using ICU, but you still get strange results. Consider for example the following two strings (\u is a Unicode escape):

abc
abc\u007F

Clearly they are different, right? Well, if you ask the ICU Collation Demo they are not, they are considered equal. The reason for that is the entry in the DUCET table, which for 007F says

007F  ; [.0000.0000.0000] # DELETE (in ISO 6429)
 
So this character must be ignored for comparisons. And there are other code points that are relevant for plain comparisons, but not for accent-insensitive collations, etc. A lot of fun, in particular if you want to implement hash-based algorithms.

But technical problems aside, is that really what users want? Do users expect these two strings to be equal, just as they apparently expect 'abc' and 'äbc' to compare to equal in accent-insensitive collation? And what about queries? Consider for example

select x,count(*)
from (values('abc'),('ABC'),('äbc')) s(x)
group by x collate "de_DE_ci_ai";

which perform the group by in a case-insensitive, accent-insensitive manner, which means that all three strings compare as equal. What is the result of that? abc 3? ABC 3? äbc 3?
All three would be valid answers, as the three are "equal" according to the chosen collation. And the result might even be non-deterministic, if the query is parallelized across threads and the first value wins.

Do users really want that? Well, apparently they do, at least I was told that, and some systems even default to case-insensitive behavior. But I find that strange, the semantic of these queries can be quite puzzling.

Monday, April 11, 2016

Comparing Join Implementations

In the upcoming SIGMOD the group around Jens Dittrich as a very nice paper about different join implementations: Stefan Schuh, Xiao Chen, Jens Dittrich. An Experimental Comparison of Thirteen Relational Equi-Joins in Main Memory. SIGMOD 2016.
And, as the title implies, it brings light into the overwhelming landscape of papers about join implementations.  It does a pretty good job of pointing out classical implementation choices, and then compares them using not only micro-benchmarks but also using TPC-H Query 19.

And this is a very important contribution. Many papers only consider micro-benchmarks for join experiments, but this is often not very helpful to derive the overall impact of a join implementation for complex queries. This paper does much better.

Still, I am not 100% satisfied with the comparison, which brings me to this blog post here. There are two aspects that I would like to emphasize: First, what is actually a typical join problem? And second, how should we interpret the performance of different join implementations?
Note that even though I will argue against some of the conclusions from the SIGMOD paper, this is not meant as a critique of the paper itself. Stefan Schuh at al. do a good job there, I just disagree a bit with how we should interpret the results.


Coming to my first point, what is a typical join problem? Most papers that consider partitioning join implementations, including the current one, consider joins between relations of similar sizes. In Figure 9 and 10 the authors show results for R⨝S where |S|=|R| and where |S|=10*|R|. Which seems to span a wide range of cardinalities. But is this realistic? If we consider a typical star or snowflake schema like this one

https://upload.wikimedia.org/wikipedia/commons/7/73/Snowflake-schema-example.png


we notice that 1) the cardinality differences between connected relations can be very large, and 2) they will be even larger once query predicates are introduced. Many data warehouse queries will filter one or more dimensions and then join that with the huge facts table.

Of course not every database has a star or snowflake schema. In particular, TPC-H does not. But even there, when looking across all 22 TPC-H queries, 97.4% of the joined tuples occur on the probe side. In fact I would argue that, due to filter predicates, most joins will be between relations of very different sizes. Which makes partitioning joins a lot less attractive, but which is often ignored by papers about that kind of joins.
Of course it might make sense to have different join implementations available, to handle the |S|=|R| case more efficiently. But one should keep in mind that this case is an exception rather than the norm.



Another question I would like to emphasize is, how should we interpret the performance of different join implementations? Query performance is a complex beast with many different metrics like cache-miss stalls, instructions per cycle, branch misses etc. I would argue that if you compare different approaches, the only sane metric is the overall execution time. It is the only sane one because 1) we are waiting for the query result after all, and 2) the individual metrics are often misleading for the overall performance. In fact that there can be approaches that are worse than the competitors in all the previously mentions metrics, but are faster overall.

In the paper the authors show performance results for TPC-H Q19, which consists of a single join between lineitem and part, followed by simple summation without group by attributes. If we look at Figure 14

Figure 14 from [SIGMOD16]

We see that, overall, NOP, i.e., a non-partitioning join, is the fastest for this query. Still, the authors argue that NOP is the slowest of all four alternatives, because the colored bar of it is the highest. I tend to believe that this is an artifact of measurement of the colored bars: They were computed by running the joins on pre-filtered input, without the subsequent aggregation. But that experimental setup ignores a major weakness of the partitioning joins, namely that they often lead to massively random access after the join.
All joins were implemented as tid-joins, i.e., not the tuples themselves were passed around but only references to the original tuple. This is important for the partitioning joins, as otherwise the partitioning phase becomes very expensive. But as a consequence, the missing attributes have to be looked up afterwards. For the NOP join the accesses to lineitem are sequential, while for the partitioning joins these accesses are in more or less random order, which is quite expensive. Therefore the colored bars, which lack the attribute accesses, are misleading.

Which brings me back to my original point, namely that the only really sane metric is the overall execution time. Of course it would be nice to know which fraction of the query execution time would be spend in the join, or how some individual metrics are affected by a certain join implementation. But in reality we cannot answer these questions because there is a very complex interaction between implementations and the rest of the execution pipeline. The only thing we can measure reasonably well is the overall time.


Note that I do not want to criticize the paper for the before mentioned points. It does a very fine job of comparing different approaches, and I strongly recommend you to read it. I would have wished for a slightly different emphasis in the conclusion, but that is of course just my own opinion. I am looking forward for hearing the talk at SIGMOD!


Friday, December 18, 2015

The price of correctness

When implementing a database system we often have two contradicting goals: Performance and correctness. Being as fast as possible is very attractive, of course. But unfortunately this often means ignoring corner cases, and can thus lead to incorrect results in rare cases.

And usually, incorrect results are unacceptable. Or at least, should be unacceptable. It is very tempting for a prototype system to ignore correctness issues and just aim for maximum performance. But once money is associated with the data, wrong results are very problematic.

A classical example of this is the representation of numerical values in databases. It is tempting to just use doubles for everything. Doubles are well supported by hardware, we can use SIMD instructions with them, and they are often the fastest way to represent non-integer values. On the other hand doubles can get into rounding issues very quickly. This can be seen by when performing the computation 0.1+0.2-0.3:


select ceil((0.1+0.2)-0.3)
 -> 0.0

select ceil((0.1::double precision+0.2::double precision)
            -0.3::double precision)
 -> 1.0


When using a correct NUMERIC implementation (e.g., HyPer or PostgreSQL) we get the correct result of 0.1+0.2-0.3=0. When using DOUBLE PRECISION we get a non-zero result, which is here rounded to 1. Unfortunately some systems like SQLite (and a lot of research systems) do not both to implement NUMERIC correctly, and always use doubles, which leads to wrong results even for the first query.

Implementing NUMERIC correctly means implementing fixed-point arithmetic. The number is usually represented as integer value plus corresponding position of the decimal point, and all operations are then mapped to integer operations. For addition and subtraction that is reasonable easy (as long as both arguments have the decimal point at the same position), but division for example is more complex even if we ignore the decimal point:


int64_t div(int64_t a,int64_t b)
{
   if (!b) // avoid hardware trap due to division by zero
      throw DivBy0();
   if (b==-1) // avoid hardware trap due to int64min/-1
      return sub(0,a);
   return a/b;
}

Note that the sub function in there is non-trivial, as we will see in a moment. Plus the extra code needed for handling the decimal point, plus the code needed to handle rounding. A fixed point division operation easily needs 20-30 instructions, compared to a single instruction for a floating point division. This costs performance, of course, but has the undeniable advantage of producing the correct result.


Now why is the subtraction function non-trivial? Because it has to cope with underflows/overflows. And that is not only nice to have, but fundamental, because there are values that are fundamentally non-representably in our usual two's complement representation. Consider for example the following computation:


select (-9223372036854775808)/(-1)
-> div(-9223372036854775808,-1)
-> sub(0,-9223372036854775808)
-> ???


The problem here is that -9223372036854775808 has no corresponding positive number when using 64bit integers for our fixed-point values, the number is non-representable. In fact if we had executed the division without the check for -1 we would have gotten a very unpleasant hardware trap due to that. We avoid the trap by delegating to subtraction, but if we not check for underflows/overflows there we silently produce wrong results.

Checking for overflows manually is quite painful, in particular since signed overflows are undefined in C! We have to break the computation down into unsigned operations, which is both complex and slow. Fortunately recent versions of both gcc and clang added intrinsics to use the CPU flags for overflow checking, which is both much easier and much cheaper:


int64_t sub(int64_t a,int64_t b)
{
   int64_t c;
   if (__builtin_ssubll_overflow(a,b,&c))
      throw Overflow();
   return c;
}


Even when using the intrinsics the (correct) fixed-point operations are much more complex than the (inaccurate) double operations. What does that mean for performance? I show the execution time of 100,000 passes over two vectors of 1,000 values each below (i.e., 100,000*1,000 = 100,000,000 arithmetic operation of the indicated type were executed), both for double arithmetic and for fixed point arithmetic. (i7-5820K, gcc 5.2.1, -O3 -march=native)




add sub mul div
double (unchecked) 11ms 11ms 12ms 212ms
fixed point (unchecked) 10ms 10ms 42ms 817ms
fixed point (checked) 57ms 57ms 56ms 912ms

The correct arithmetic (fixed-point, with checks) is quite a bit slower than just using doubles. But that is the price we have to pay for correctness. (Update: in the original posting I forgot -march=native, adding that improved performance of the unchecked versions by another factor 2 due to AVX instructions).
Note that it would have been possible to check for overflows after double operations, too, using fetestexcept. But that is even slower than the checked fixed point arithmetic (>620ms for all cases), and it does not help with the rouding issues of floating point numbers.


So performing numerical operations correctly is difficult and expensive. But still, every system should do it! Users get very unhappy when the result is wrong, in particular if the values correspond to money. If you are building a prototype system, do not use floating point numbers, even if it is fast and tempting. Using doubles is nice for micro benchmarks, but inappropriate for real-world usage when money is involved.