Wednesday, August 24, 2016

Equivalence of Unicode strings is strange

Originally HyPer had a very simple model for strings: We made sure that all strings are valid UTF-8, but otherwise did not really care about the intricacies of Unicode. And that is actually quite a sane model for most applications. Usually we do not care about the precise string structure anyway, and in the few places where we do (e.g., strpos and substr), we add some extra logic to handle UTF-8 correctly.

That was all good until users started complaining about sort orders. When sorting UTF-8 naively, we sort by code point value, which is often ok but not always what users want. So we added support for collations:
select * from foo order by x collate "de_DE";
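To see why plain code-point order can surprise users, here is a small Python sketch. The collation key below is a hypothetical, vastly simplified stand-in for a real German collation; it only exists to illustrate the idea:

```python
# Sorting UTF-8 strings by code point value: 'ä' (U+00E4) sorts after
# 'z' (U+007A), which is not what a German-speaking user expects.
words = ["zbc", "äbc", "abc"]

code_point_order = sorted(words)
print(code_point_order)  # ['abc', 'zbc', 'äbc'] -- 'äbc' lands last

# A hypothetical, vastly simplified German collation key: fold umlauts to
# their base letters for the primary level, and keep the original string
# as a tiebreaker, mimicking the multi-level weights of real collations.
def german_collation_key(s):
    primary = s.replace("ä", "a").replace("ö", "o").replace("ü", "u")
    return (primary, s)

collated_order = sorted(words, key=german_collation_key)
print(collated_order)  # ['abc', 'äbc', 'zbc'] -- 'äbc' sorts next to 'abc'
```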

And that is where life starts getting interesting. The collate clause can either be given explicitly at any place in the query, as shown above, or added in a create table statement, giving a default collation for a column. So basically every value that is ordered or compared within a SQL statement can have an associated collation.

Initially I naively thought that this would just affect the comparison operation, like calling strcoll instead of strcmp, but life is unfortunately much more complex. First, the Unicode collation algorithm is quite involved, translating the input sequence into a sequence of three (or four) levels of weights before comparison, and second, some insane database systems default to case-insensitive collations, and some users actually want that behavior.

Why is case-insensitivity insane? Because it leads to strange semantics. First of all, the whole Unicode weighting mechanism is strange. Some of the strangeness can be hidden by using ICU, but you still get strange results. Consider for example the following two strings (\u is a Unicode escape):

abc
abc\u007F

Clearly they are different, right? Well, if you ask the ICU Collation Demo they are not: they are considered equal. The reason for that is the entry in the DUCET table, which for 007F says

007F  ; [.0000.0000.0000] # DELETE (in ISO 6429)
 
So this character must be ignored for comparisons. And there are other code points that are relevant for plain comparisons, but not for accent-insensitive collations, etc. A lot of fun, in particular if you want to implement hash-based algorithms.
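To see what "ignored for comparisons" means in practice, here is a minimal Python sketch; the set of zero-weight characters is reduced to just U+007F for illustration (a real implementation would consult the full DUCET):

```python
# Characters whose DUCET entry has all-zero weights ([.0000.0000.0000])
# contribute nothing to the sort key, so they vanish from comparisons.
COMPLETELY_IGNORABLE = {"\u007f"}  # DELETE; stand-in for the real zero-weight set

def collation_key(s):
    # Drop completely ignorable characters before building the key.
    return "".join(ch for ch in s if ch not in COMPLETELY_IGNORABLE)

print(collation_key("abc") == collation_key("abc\u007f"))  # True

# This is what makes hash-based algorithms painful: the hash has to be
# computed over the collation key, not over the raw bytes, or strings that
# compare as equal would land in different hash buckets.
```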

But technical problems aside, is that really what users want? Do users expect these two strings to be equal, just as they apparently expect 'abc' and 'äbc' to compare to equal in accent-insensitive collation? And what about queries? Consider for example

select x,count(*)
from (values('abc'),('ABC'),('äbc')) s(x)
group by x collate "de_DE_ci_ai";

which performs the group by in a case-insensitive, accent-insensitive manner, which means that all three strings compare as equal. What is the result of that query? abc 3? ABC 3? äbc 3?
All three would be valid answers, as the three are "equal" according to the chosen collation. And the result might even be non-deterministic, if the query is parallelized across threads and the first value wins.
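The first-value-wins behavior can be sketched in a few lines of Python; the key function below (NFD decomposition, stripping combining marks, casefolding) is only a rough stand-in for a real case-insensitive, accent-insensitive collation:

```python
import unicodedata

def ci_ai_key(s):
    # Approximate a case-insensitive, accent-insensitive collation:
    # decompose, drop combining marks (category Mn), then casefold.
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(ch for ch in decomposed
                       if unicodedata.category(ch) != "Mn")
    return stripped.casefold()

groups = {}
for x in ["abc", "ABC", "äbc"]:  # arrival order may vary across threads
    key = ci_ai_key(x)
    if key not in groups:
        groups[key] = [x, 0]  # the first value seen represents the group
    groups[key][1] += 1

print(groups)  # {'abc': ['abc', 3]} -- would be ['ABC', 3] if 'ABC' arrived first
```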

Do users really want that? Well, apparently they do, at least I was told so, and some systems even default to case-insensitive behavior. But I find that strange; the semantics of these queries can be quite puzzling.

Monday, April 11, 2016

Comparing Join Implementations

In the upcoming SIGMOD the group around Jens Dittrich has a very nice paper about different join implementations: Stefan Schuh, Xiao Chen, Jens Dittrich. An Experimental Comparison of Thirteen Relational Equi-Joins in Main Memory. SIGMOD 2016.
And, as the title implies, it sheds light on the overwhelming landscape of papers about join implementations. It does a pretty good job of pointing out classical implementation choices, and then compares them using not only micro-benchmarks but also TPC-H Query 19.

And this is a very important contribution. Many papers only consider micro-benchmarks for join experiments, which is often not very helpful for deriving the overall impact of a join implementation on complex queries. This paper does much better.

Still, I am not 100% satisfied with the comparison, which brings me to this blog post here. There are two aspects that I would like to emphasize: First, what is actually a typical join problem? And second, how should we interpret the performance of different join implementations?
Note that even though I will argue against some of the conclusions from the SIGMOD paper, this is not meant as a critique of the paper itself. Stefan Schuh et al. do a good job there; I just disagree a bit with how we should interpret the results.


Coming to my first point: what is a typical join problem? Most papers that consider partitioning join implementations, including the current one, consider joins between relations of similar sizes. In Figures 9 and 10 the authors show results for R⨝S where |S|=|R| and where |S|=10*|R|, which seems to span a wide range of cardinalities. But is this realistic? If we consider a typical star or snowflake schema like this one

https://upload.wikimedia.org/wikipedia/commons/7/73/Snowflake-schema-example.png


we notice that 1) the cardinality differences between connected relations can be very large, and 2) they will be even larger once query predicates are introduced. Many data warehouse queries will filter one or more dimensions and then join the result with the huge fact table.

Of course not every database has a star or snowflake schema. In particular, TPC-H does not. But even there, when looking across all 22 TPC-H queries, 97.4% of the joined tuples occur on the probe side. In fact I would argue that, due to filter predicates, most joins will be between relations of very different sizes. This makes partitioning joins a lot less attractive, but it is often ignored by papers about this kind of join.
Of course it might make sense to have different join implementations available, to handle the |S|=|R| case more efficiently. But one should keep in mind that this case is an exception rather than the norm.
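A join between a small filtered dimension and a huge fact table is exactly what a non-partitioning hash join handles well: build a hash table on the small side and stream the big side past it. A minimal Python sketch (the table contents are made up):

```python
# Build the hash table on the small, pre-filtered dimension side and
# probe with the large fact side -- one pass over each input, no partitioning.
dim = [(1, "red"), (2, "green")]                 # small side after filtering
fact = [(10, 1), (11, 2), (12, 1), (13, 3)]      # (fact_id, dim_key), large side

hash_table = {}
for key, payload in dim:
    hash_table.setdefault(key, []).append(payload)

result = []
for fact_id, key in fact:                        # sequential scan of the big side
    for payload in hash_table.get(key, ()):
        result.append((fact_id, payload))

print(result)  # [(10, 'red'), (11, 'green'), (12, 'red')]
```

Note that the build cost is proportional to the small side only, so the huge size gap between the inputs works in this join's favor.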



Another question I would like to emphasize is: how should we interpret the performance of different join implementations? Query performance is a complex beast with many different metrics like cache-miss stalls, instructions per cycle, branch misses, etc. I would argue that if you compare different approaches, the only sane metric is the overall execution time. It is the only sane one because 1) we are waiting for the query result, after all, and 2) the individual metrics are often misleading for the overall performance. In fact, there can be approaches that are worse than their competitors in all the previously mentioned metrics, but are faster overall.

In the paper the authors show performance results for TPC-H Q19, which consists of a single join between lineitem and part, followed by simple summation without group by attributes. If we look at Figure 14

Figure 14 from [SIGMOD16]

we see that, overall, NOP, i.e., a non-partitioning join, is the fastest for this query. Still, the authors argue that NOP is the slowest of all four alternatives, because its colored bar is the highest. I tend to believe that this is an artifact of how the colored bars were measured: they were computed by running the joins on pre-filtered input, without the subsequent aggregation. But that experimental setup ignores a major weakness of the partitioning joins, namely that they often lead to massive random accesses after the join.
All joins were implemented as tid-joins, i.e., the tuples themselves were not passed around, only references to the original tuples. This is important for the partitioning joins, as otherwise the partitioning phase becomes very expensive. But as a consequence, the missing attributes have to be looked up afterwards. For the NOP join the accesses to lineitem are sequential, while for the partitioning joins these accesses happen in more or less random order, which is quite expensive. Therefore the colored bars, which lack the attribute accesses, are misleading.
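A small Python sketch of why this happens (the modulo partitioning function is just a toy stand-in): a NOP join emits lineitem tids in scan order, while a partitioning join emits them partition by partition, so the subsequent attribute fetches jump around the base table.

```python
# After a tid-join we must fetch the remaining attributes by tid.
# NOP emits tids in scan order; a partitioning join emits them per partition.
num_tuples = 8
scan_order = list(range(num_tuples))              # NOP: sequential lookups

num_partitions = 3
partitions = [[] for _ in range(num_partitions)]
for tid in scan_order:
    partitions[tid % num_partitions].append(tid)  # toy partitioning function

partitioned_order = [tid for p in partitions for tid in p]
print(partitioned_order)  # [0, 3, 6, 1, 4, 7, 2, 5] -- no longer sequential,
                          # so attribute fetches become near-random accesses
```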

Which brings me back to my original point, namely that the only really sane metric is the overall execution time. Of course it would be nice to know which fraction of the query execution time is spent in the join, or how individual metrics are affected by a certain join implementation. But in reality we cannot answer these questions, because there is a very complex interaction between implementations and the rest of the execution pipeline. The only thing we can measure reasonably well is the overall time.


Note that I do not want to criticize the paper for the aforementioned points. It does a very fine job of comparing different approaches, and I strongly recommend reading it. I would have wished for a slightly different emphasis in the conclusion, but that is of course just my own opinion. I am looking forward to hearing the talk at SIGMOD!