Friday, February 17, 2017

Reasoning in the presence of NULLs

One of the defining characteristics of SQL is that it is a declarative query language. That is, the user does not specify how the result should be computed, but instead specifies what conditions the result should satisfy. Within these constraints the database is free to choose between execution alternatives. This has some interesting consequences:

Consider for example the query

select x=a and x=b

clearly, it is equivalent to the following query

select x=a and a=b

after all, x=a, and thus we can substitute x with a in the second term.

And of course the same is true if we replace a and b with constants:

select x=1 and x=2

is equivalent to

select x=1 and 1=2

Now the really interesting question is: Is a database system allowed to infer that this is equivalent to

select false

? After all, x cannot be equal to 1 and equal to 2 simultaneously, and the second formulation is x=1 and false, which is clearly false. But what looks intuitive is not necessarily true, as we can see from the results of the following two queries (with PostgreSQL results):
postgres=> select x,x=1 and x=2 from (values(1),(null)) s(x);
  x   | ?column? 
------+----------
    1 | f
 NULL | NULL
(2 rows)

postgres=> select x,x=1 and 1=2 from (values(1),(null)) s(x);
  x   | ?column? 
------+----------
    1 | f
 NULL | f
(2 rows)

The reason for that difference is the tricky behavior of NULL: 1) a comparison of a value with NULL is NULL, 2) NULL AND NULL is NULL, but 3) NULL AND FALSE is FALSE.

Both 1) and 3) make sense, 1) because we cannot say anything about the result of the comparison, and 3) because it doesn't matter for which value NULL really stands (either true or false), the and with FALSE will always produce a FALSE value. But in combination, they lead to very surprising behavior, namely that some databases return NULL and other databases return FALSE for the same query, depending on whether they noticed that the query contained a contradiction or not.
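The three rules can be written down directly; here is a minimal C++ sketch of SQL's three-valued logic, using std::optional&lt;bool&gt; as the truth type (the type and function names are mine, for illustration only):

```cpp
#include <optional>

// std::nullopt plays the role of SQL NULL.
using TriBool = std::optional<bool>;

// Rules 2) and 3): AND is FALSE as soon as one side is FALSE,
// TRUE only if both sides are TRUE, and NULL (unknown) otherwise.
TriBool sqlAnd(TriBool a, TriBool b) {
   if ((a && !*a) || (b && !*b)) return false; // FALSE dominates even NULL
   if (a && b) return true;
   return std::nullopt;
}

// Rule 1): comparing a value with NULL yields NULL.
TriBool sqlEq(std::optional<int> a, std::optional<int> b) {
   if (!a || !b) return std::nullopt;
   return *a == *b;
}
```

With x as NULL, sqlAnd(sqlEq(x,1), sqlEq(x,2)) yields NULL, while sqlAnd(sqlEq(x,1), false) yields FALSE, which is exactly the difference between the two queries above.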

And this kind of reasoning occurs in many other circumstances, too, for example in this query

select cast(x as integer)=1.5

obviously, this condition can never be true, regardless of the value for x, as 1.5 is not an integer number. But what about NULL? Are we allowed to statically infer that the result is false, even if x might be a NULL value?

I tried to find an answer to that question in the SQL:2011 standard, but unfortunately it is not specified clearly. But after looking at the definition of AND, I have convinced myself that returning false here is ok. After all, the argument for NULL AND FALSE being FALSE is that NULL is an unknown value from within the domain, and any value AND FALSE is FALSE. If we use the same argument here, we can statically decide that the result is false, regardless of the value of x.

But even though the argument makes sense (and it is good for the query optimizer, which exploits that fact), it is still unfortunate that the query result changes depending on how smart the database system is. A smart system can infer that the result is false, no matter what, while a simpler system might return NULL. But perhaps that is just the consequence of having a declarative system; we cannot (and should not!) control the order in which expressions are evaluated.

Wednesday, August 24, 2016

Equivalence of Unicode strings is strange

Originally HyPer had a very simple model for strings: We made sure that all strings are valid UTF-8, but otherwise did not really care about the intrinsics of Unicode. And that is actually a quite sane model for most applications. Usually we do not care about the precise string structure anyway, and in the few places that we do (e.g., strpos and substr), we add some extra logic to handle UTF-8 correctly.

That was all good until users started complaining about sort orders. When sorting UTF-8 naively, we sort by code point value, which is often ok but not always what users want. So we added support for collations:
select * from foo order by x collate "de_DE";

And that is where life starts getting interesting. The collate statement can either be given explicitly at any place in the query, as shown above, or added in a create table statement, giving a default collation for a column. So basically every value that is ordered or compared within a SQL statement can have an associated collation.
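As an aside, sorting UTF-8 byte-wise really does give code-point order, since UTF-8 was designed so that byte-wise and code-point comparison agree. A small sketch illustrating this (the helper name is mine; no validation, for illustration only):

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decode a valid UTF-8 string into its code points.
std::vector<uint32_t> decodeUtf8(const std::string& s) {
   std::vector<uint32_t> result;
   for (std::size_t i = 0; i < s.size();) {
      unsigned char c = s[i];
      // sequence length, derived from the first byte
      unsigned len = (c < 0x80) ? 1 : (c < 0xE0) ? 2 : (c < 0xF0) ? 3 : 4;
      uint32_t cp = (len == 1) ? c : (c & (0x7F >> len));
      for (unsigned j = 1; j < len; ++j) // append 6 payload bits per byte
         cp = (cp << 6) | (static_cast<unsigned char>(s[i + j]) & 0x3F);
      result.push_back(cp);
      i += len;
   }
   return result;
}
```

For example, 'z' (U+007A) sorts before 'ä' (U+00E4) both byte-wise (0x7A &lt; 0xC3) and by code point (0x7A &lt; 0xE4); std::string's operator&lt; compares bytes as unsigned char, so the two orders coincide.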

Initially I naively thought that this would just affect the comparison operation, like calling strcoll instead of strcmp, but life is unfortunately much more complex. First, the Unicode collation algorithm is quite involved, translating the input sequence into a sequence of 3 (or 4) levels of weights before comparison, and second, some insane database systems default to case-insensitive collations, and some users actually want that behavior.

Why is case-insensitivity insane? Because it leads to strange semantics. First of all, the whole Unicode weighting mechanism is strange. Some of the strangeness can be hidden by using ICU, but you still get strange results. Consider for example the following two strings (\u is a Unicode escape):


Clearly they are different, right? Well, if you ask the ICU Collation Demo they are not; they are considered equal. The reason for that is the entry in the DUCET table, which for 007F says

007F  ; [.0000.0000.0000] # DELETE (in ISO 6429)
So this character must be ignored for comparisons. And there are other code points that are relevant for plain comparisons, but not for accent-insensitive collations, etc. A lot of fun, in particular if you want to implement hash-based algorithms.
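In effect, completely-ignorable characters simply vanish before comparison. A toy sketch (handling only U+007F; the function names are mine) of what "ignored for comparisons" means:

```cpp
#include <string>

// Remove completely-ignorable characters, i.e., those whose DUCET entry
// has all-zero weights. Real implementations consult the full table;
// this toy version knows only U+007F (DELETE).
std::string stripIgnorable(const std::string& s) {
   std::string result;
   for (char c : s)
      if (c != '\x7F')
         result += c;
   return result;
}

// Two strings are "equal" under the collation if their
// non-ignorable parts are equal.
bool collationEqual(const std::string& a, const std::string& b) {
   return stripIgnorable(a) == stripIgnorable(b);
}
```

Under this definition "abc" and "abc" followed by U+007F compare as equal even though they differ byte-wise, which is exactly the behavior that makes hash-based algorithms painful.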

But technical problems aside, is that really what users want? Do users expect these two strings to be equal, just as they apparently expect 'abc' and 'äbc' to compare to equal in accent-insensitive collation? And what about queries? Consider for example

select x,count(*)
from (values('abc'),('ABC'),('äbc')) s(x)
group by x collate "de_DE_ci_ai";

which performs the group by in a case-insensitive, accent-insensitive manner, which means that all three strings compare as equal. What is the result of that? abc 3? ABC 3? äbc 3?
All three would be valid answers, as the three are "equal" according to the chosen collation. And the result might even be non-deterministic, if the query is parallelized across threads and the first value wins.

Do users really want that? Well, apparently they do, at least that is what I was told, and some systems even default to case-insensitive behavior. But I find it strange; the semantics of these queries can be quite puzzling.
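The "first value wins" effect can be sketched with a hash-style group by that keys on a normalized form of the string. The normalization here is a toy (ASCII case folding only stands in for full case/accent-insensitive weighting), and all names are mine:

```cpp
#include <cctype>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy collation key: ASCII lower-casing only.
std::string collationKey(std::string s) {
   for (char& c : s)
      c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
   return s;
}

// Group by collation key; the reported representative is simply the
// first value seen for each group, just like in the query above.
std::map<std::string, std::pair<std::string, unsigned>>
groupByCollation(const std::vector<std::string>& values) {
   std::map<std::string, std::pair<std::string, unsigned>> groups;
   for (const auto& v : values) {
      // try_emplace keeps the existing (first) representative
      auto it = groups.try_emplace(collationKey(v), v, 0u).first;
      ++it->second.second;
   }
   return groups;
}
```

For the input 'abc', 'ABC' the sketch reports one group with count 2 and representative 'abc'; feed the values in a different order, as a parallel execution might, and the representative changes while the count stays the same.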

Monday, April 11, 2016

Comparing Join Implementations

In the upcoming SIGMOD the group around Jens Dittrich has a very nice paper about different join implementations: Stefan Schuh, Xiao Chen, Jens Dittrich. An Experimental Comparison of Thirteen Relational Equi-Joins in Main Memory. SIGMOD 2016.
And, as the title implies, it brings light into the overwhelming landscape of papers about join implementations. It does a pretty good job of pointing out classical implementation choices, and then compares them using not only micro-benchmarks but also TPC-H Query 19.

And this is a very important contribution. Many papers only consider micro-benchmarks for join experiments, but this is often not very helpful to derive the overall impact of a join implementation for complex queries. This paper does much better.

Still, I am not 100% satisfied with the comparison, which brings me to this blog post here. There are two aspects that I would like to emphasize: First, what is actually a typical join problem? And second, how should we interpret the performance of different join implementations?
Note that even though I will argue against some of the conclusions from the SIGMOD paper, this is not meant as a critique of the paper itself. Stefan Schuh et al. do a good job there, I just disagree a bit with how we should interpret the results.

Coming to my first point: what is a typical join problem? Most papers that consider partitioning join implementations, including the current one, consider joins between relations of similar sizes. In Figures 9 and 10 the authors show results for R⨝S where |S|=|R| and where |S|=10*|R|, which seems to span a wide range of cardinalities. But is this realistic? If we consider a typical star or snowflake schema like this one

we notice that 1) the cardinality differences between connected relations can be very large, and 2) they will be even larger once query predicates are introduced. Many data warehouse queries will filter one or more dimensions and then join that with the huge facts table.

Of course not every database has a star or snowflake schema. In particular, TPC-H does not. But even there, when looking across all 22 TPC-H queries, 97.4% of the joined tuples occur on the probe side. In fact I would argue that, due to filter predicates, most joins will be between relations of very different sizes. Which makes partitioning joins a lot less attractive, but which is often ignored by papers about that kind of joins.
Of course it might make sense to have different join implementations available, to handle the |S|=|R| case more efficiently. But one should keep in mind that this case is an exception rather than the norm.

Another question I would like to emphasize is: how should we interpret the performance of different join implementations? Query performance is a complex beast with many different metrics like cache-miss stalls, instructions per cycle, branch misses, etc. I would argue that if you compare different approaches, the only sane metric is the overall execution time. It is the only sane one because 1) we are waiting for the query result after all, and 2) the individual metrics are often misleading for the overall performance. In fact there can be approaches that are worse than their competitors in all the previously mentioned metrics, but are faster overall.

In the paper the authors show performance results for TPC-H Q19, which consists of a single join between lineitem and part, followed by simple summation without group by attributes. If we look at Figure 14

Figure 14 from [SIGMOD16]

We see that, overall, NOP, i.e., a non-partitioning join, is the fastest for this query. Still, the authors argue that NOP is the slowest of all four alternatives, because its colored bar is the highest. I tend to believe that this is an artifact of how the colored bars were measured: they were computed by running the joins on pre-filtered input, without the subsequent aggregation. But that experimental setup ignores a major weakness of the partitioning joins, namely that they often lead to massive random accesses after the join.
All joins were implemented as tid-joins, i.e., not the tuples themselves but only references to the original tuples were passed around. This is important for the partitioning joins, as otherwise the partitioning phase becomes very expensive. But as a consequence, the missing attributes have to be looked up afterwards. For the NOP join the accesses to lineitem are sequential, while for the partitioning joins these accesses happen in more or less random order, which is quite expensive. Therefore the colored bars, which lack the attribute accesses, are misleading.

Which brings me back to my original point, namely that the only really sane metric is the overall execution time. Of course it would be nice to know which fraction of the query execution time would be spent in the join, or how some individual metrics are affected by a certain join implementation. But in reality we cannot answer these questions because there is a very complex interaction between implementations and the rest of the execution pipeline. The only thing we can measure reasonably well is the overall time.

Note that I do not want to criticize the paper for the aforementioned points. It does a very fine job of comparing different approaches, and I strongly recommend reading it. I would have wished for a slightly different emphasis in the conclusion, but that is of course just my own opinion. I am looking forward to hearing the talk at SIGMOD!

Friday, December 18, 2015

The price of correctness

When implementing a database system we often have two contradicting goals: Performance and correctness. Being as fast as possible is very attractive, of course. But unfortunately this often means ignoring corner cases, and can thus lead to incorrect results in rare cases.

And usually, incorrect results are unacceptable. Or at least, should be unacceptable. It is very tempting for a prototype system to ignore correctness issues and just aim for maximum performance. But once money is associated with the data, wrong results are very problematic.

A classical example of this is the representation of numerical values in databases. It is tempting to just use doubles for everything. Doubles are well supported by hardware, we can use SIMD instructions with them, and they are often the fastest way to represent non-integer values. On the other hand doubles run into rounding issues very quickly. This can be seen when performing the computation 0.1+0.2-0.3:

select ceil((0.1+0.2)-0.3)
 -> 0.0

select ceil((0.1::double precision+0.2::double precision)
            -0.3::double precision)
 -> 1.0

When using a correct NUMERIC implementation (e.g., HyPer or PostgreSQL) we get the correct result of 0.1+0.2-0.3=0. When using DOUBLE PRECISION we get a non-zero result, which is here rounded to 1. Unfortunately some systems like SQLite (and a lot of research systems) do not bother to implement NUMERIC correctly and always use doubles, which leads to wrong results even for the first query.
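The double behavior is easy to reproduce outside the database; a minimal C++ sketch (the function name is mine):

```cpp
#include <cmath>

// Neither 0.1, 0.2, nor 0.3 is exactly representable as a binary double,
// so the subtraction leaves a tiny positive residue instead of 0,
// and ceil rounds that residue up to 1.
double residue() {
   return (0.1 + 0.2) - 0.3;
}
```

residue() is on the order of 1e-16 rather than 0, so std::ceil(residue()) is 1, matching the DOUBLE PRECISION result above.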

Implementing NUMERIC correctly means implementing fixed-point arithmetic. The number is usually represented as an integer value plus the corresponding position of the decimal point, and all operations are then mapped to integer operations. For addition and subtraction that is reasonably easy (as long as both arguments have the decimal point at the same position), but division for example is more complex even if we ignore the decimal point:

int64_t div(int64_t a,int64_t b) {
   if (!b) // avoid hardware trap due to division by zero
      throw DivBy0();
   if (b==-1) // avoid hardware trap due to INT64_MIN/-1
      return sub(0,a);
   return a/b;
}

Note that the sub function in there is non-trivial, as we will see in a moment. Plus the extra code needed for handling the decimal point, plus the code needed to handle rounding. A fixed point division operation easily needs 20-30 instructions, compared to a single instruction for a floating point division. This costs performance, of course, but has the undeniable advantage of producing the correct result.
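For comparison, the checked fixed-point addition is short. A sketch using the type-generic overflow builtin available in recent gcc and clang, assuming both arguments already share the same decimal-point position (the function and exception names are mine):

```cpp
#include <cstdint>
#include <stdexcept>

int64_t checkedAdd(int64_t a, int64_t b) {
   int64_t c;
   // computes a+b and reports whether the result wrapped around
   if (__builtin_add_overflow(a, b, &c))
      throw std::overflow_error("numeric overflow");
   return c;
}
```

Still only a handful of instructions, but already one conditional branch more than a plain integer add.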

Now why is the subtraction function non-trivial? Because it has to cope with underflows/overflows. And that is not only nice to have, but fundamental, because there are values that are simply not representable in our usual two's complement representation. Consider for example the following computation:

select (-9223372036854775808)/(-1)
-> div(-9223372036854775808,-1)
-> sub(0,-9223372036854775808)
-> ???

The problem here is that -9223372036854775808 has no corresponding positive number when using 64-bit integers for our fixed-point values; the number is non-representable. In fact, if we had executed the division without the check for -1, we would have gotten a very unpleasant hardware trap. We avoid the trap by delegating to subtraction, but if we do not check for underflows/overflows there, we silently produce wrong results.

Checking for overflows manually is quite painful, in particular since signed overflows are undefined in C! We have to break the computation down into unsigned operations, which is both complex and slow. Fortunately recent versions of both gcc and clang added intrinsics to use the CPU flags for overflow checking, which is both much easier and much cheaper:

int64_t sub(int64_t a,int64_t b) {
   int64_t c;
   if (__builtin_sub_overflow(a,b,&c)) // type-generic overflow intrinsic
      throw Overflow();
   return c;
}

Even when using the intrinsics, the (correct) fixed-point operations are much more complex than the (inaccurate) double operations. What does that mean for performance? Below I show the execution time of 100,000 passes over two vectors of 1,000 values each (i.e., 100,000*1,000 = 100,000,000 arithmetic operations of the indicated type were executed), both for double arithmetic and for fixed-point arithmetic. (i7-5820K, gcc 5.2.1, -O3 -march=native)

                         add     sub     mul     div
double (unchecked)       11ms    11ms    12ms    212ms
fixed point (unchecked)  10ms    10ms    42ms    817ms
fixed point (checked)    57ms    57ms    56ms    912ms

The correct arithmetic (fixed-point, with checks) is quite a bit slower than just using doubles. But that is the price we have to pay for correctness. (Update: in the original posting I forgot -march=native, adding that improved performance of the unchecked versions by another factor 2 due to AVX instructions).
Note that it would have been possible to check for overflows after double operations, too, using fetestexcept. But that is even slower than the checked fixed-point arithmetic (>620ms for all cases), and it does not help with the rounding issues of floating point numbers.

So performing numerical operations correctly is difficult and expensive. But still, every system should do it! Users get very unhappy when the result is wrong, in particular if the values correspond to money. If you are building a prototype system, do not use floating point numbers, even if it is fast and tempting. Using doubles is nice for micro benchmarks, but inappropriate for real-world usage when money is involved.

Saturday, September 12, 2015

Trying to speed up Binary Search

It is well known that binary search is not particularly fast. For point queries hash tables are much faster, ideally accessing in O(1). And even when we need range queries, n-ary search structures like B-trees are much faster than binary search or binary search trees.

Still, there is a certain charm to binary search. First, it is easy to implement. And second, it needs no extra memory but can operate directly on sorted data. And sometimes the data is sorted anyway, so we get the binary search more or less for free. And for something we get for free, O(log n) lookup time is not that bad. The only question is how we can implement it efficiently.

Of course there is the text-book way of implementing binary search, as explained for example in Wikipedia. The resulting code looks like this:

while (lower!=upper) {
   unsigned middle=lower+((upper-lower)/2);
   unsigned v=data[middle];
   if (v==needle) {
      return middle;
   } else if (v<needle) {
      lower=middle+1;
   } else {
      upper=middle;
   }
}
return notFound;

Not particularly difficult, but how fast is it? We ran experiments where we performed 1,000,000 random lookups for 32-bit integers in data sets of various sizes and measured the performance on a Broadwell i7-5500U (all experiments using -O3):

set size   10^3    10^5     10^7     10^9
classic    65 ms   119 ms   362 ms   702 ms

When we look carefully we notice that the runtime grows a bit faster than the asymptotic complexity of O(log n) would suggest. Or, in other words, the constants for searching in the 4GB data set seem to be about a factor of 4 higher than when searching in the 4KB data set. That is not surprising, as the 4KB of the smallest set easily fit into the CPU cache, while the 4GB of the largest set are clearly beyond cache size. What is surprising is that it is only a factor of 4!

The performance of binary search is largely affected by two effects: First, there are cache misses, as every lookup accesses O(log n) elements, which are often not in cache for the large data sets. And second, there are branch mispredictions, as the comparison v<needle has about a 50% chance of being true, which leads to inefficient execution. For small data sets the branch misses dominate, while for the large data sets the branch misses are somewhat hidden by the memory latency. Hence a slowdown of only a factor of 4.

Now the question is, can we do better? Usually people try to speed up binary search by reducing branch misses. There is an interesting talk, "binary search, less wrong", that describes how to implement binary search without the hard-to-predict branch. Somewhat ironically, the example code in that talk is wrong, contrary to its title, but the idea is valid. The corrected code looks like this:

while (auto half=n/2) {
   auto middle=lower+half;
   lower=((*middle)<needle)?middle:lower;
   n-=half;
}
return ((*lower)==needle)?lower:notFound;

The code is not only quite short, the loop body is compiled by gcc into completely branch-free assembler code by using a conditional move. The resulting performance is shown below:

set size      10^3    10^5    10^7     10^9
branch free   23 ms   53 ms   320 ms   853 ms

This is a somewhat surprising result. For small data sets the branch-free version is indeed faster, but for the largest data set the search is significantly slower. And that is after manual tuning of the generated code; out of the box it needed 920 ms for the largest set.

The speed-up for the small data sizes is easy to explain (no branch mispredictions), but where does the slow-down for the largest data set come from? We get a clue by compiling the same code with clang, which uses a branch instead of a conditional move inside the loop. With that we get

set size             10^3    10^5     10^7     10^9
compiled with clang  67 ms   115 ms   359 ms   694 ms

which is virtually identical to the performance of the text-book implementation. Apparently the conditional move is good at avoiding branch mispredictions, but cannot hide memory latency as well as the regular implementation.

So while we found a faster implementation for small data sets, the question is still open if we can do better for large sets (which might be common in databases). We tried using ternary search instead of binary search, with the argument that while ternary search accesses more memory locations than binary search, the memory latency can be hidden by performing the accesses in parallel, and also the number of iterations is smaller. There are many ways to implement ternary search, all with different trade-offs. One alternative that issues parallel memory accesses is shown below:

while (uintptr_t step=n/3) {
   auto t1=lower+step,t2=lower+(step<<1);
   unsigned cmp=((*t1)<=needle)+((*t2)<=needle);
   if (cmp==2) {
      n-=t2-lower; lower=t2;
   } else if (cmp) {
      n-=t1-lower; lower=t1;
   } else {
      n=step;
   }
}
return ((*lower)==needle)?lower:notFound;

Unfortunately the results were disappointing:

set size         10^3    10^5     10^7     10^9
ternary search   81 ms   143 ms   400 ms   781 ms

The extra memory accesses are just not worth it. In the worst case we now have two cache misses per iteration (even if they are issued in parallel), and the benefit of the increased fanout is just too small.

Of course there are ways to speed up search in large data sets further. For example by storing the data in a cache-optimized B-Tree rather than a sorted vector. But that forces us to abandon our original mission, namely exploiting sortedness that is available in the data anyway. For small and medium sets we have shown improvements over the text-book approach here, if someone has a good idea for large sets I would be happy to hear about it in the comments.

Update: As Marcin pointed out in the comments, it is possible to use a SIMD-version of binary search to perform 4 searches in parallel. I have used the code from section 5.5.4 of the thesis, it is a bit long to show here. And indeed we get good performance (best of 10 runs):

set size       10^3    10^5    10^7     10^9
SIMD version   24 ms   63 ms   478 ms   706 ms

Note that for many data sizes the performance is in the middle between the text-book version and the conditional-move version, regardless of which of the two is actually faster. Puzzling. What is also puzzling is that the variance of the runtime seems to be much higher than for the non-SIMD versions. For the largest data set I have usually seen runtimes around 780ms, with peaks up to 950ms. I have no explanation for that.

If, as Marcin also suggested in the comment below, we replace the manual gather code

int cmpval0 = data_shifted[result_vector[i + 0]];
int cmpval1 = data_shifted[result_vector[i + 1]];
int cmpval2 = data_shifted[result_vector[i + 2]];
int cmpval3 = data_shifted[result_vector[i + 3]];
__m128i xm_cmpvalvec = _mm_set_epi32(cmpval3, cmpval2, cmpval1, cmpval0);

with the AVX2 gather instruction

__m128i xm_cmpvalvec = _mm_i32gather_epi32(data_shifted,_mm_load_si128((__m128i*)(result_vector)),4);

we get as runtime performance

set size      10^3    10^5    10^7     10^9
AVX2 gather   26 ms   68 ms   487 ms   708 ms

At first glance the performance of the AVX2 gather is identical to the manual gather. And indeed I had gotten exactly that result in previous gather experiments, too. However, the variance problem seems to have gone away with AVX2 gather; the runtime is now relatively stable across runs.

Thursday, July 2, 2015

"Open-Sourcing" HyPer queries

We are sometimes contacted by people who would like to inspect the behavior of HyPer. HyPer itself is currently not open source, but it is still possible to look at the source code of queries (which is all that matters for query performance).

To do that, we can use an internal debug interface that we use for profiling. Note that this feature is completely undocumented and unsupported and might go away at any point in time, it is not aimed at end users. But it can still be used to inspect queries. As prerequisite we need the clang compiler in the search path, more precisely a binary called clang-3.5. Then, we can start the server daemon with a debug flag like that:

bin/hyperd mydatabase -xcompilestatic

Now every statement that generates code (e.g., queries, or create table statements) writes the code to the local directory and then calls clang-3.5 to compile it into a shared library. Usually we do not use clang but JIT directly for performance reasons, but some debugging and profiling tools can handle shared libraries much better than JITed code. Note that the generated code is probably best viewed after passing it through c++filt, as that will pretty-print the mangled names.

Consider for example an SQL statement like

create table foo(x integer not null)

It results in three files when run with the debug flag: the file dump1.ll with the source code, a file dump1.debug-ll for debug-info purposes, and a file with the compiled code. Here the code is not very exciting; it is basically code for memory management (initPartition_foo etc.), multiversioning (retrieveVersionPartition_foo etc.), and the low-level SQL update primitives (insertPartition_foo).

If we run a query like

select count(*) from foo where x<123

another ll file is generated. Here the query consists basically of a single function (planStart). The first few dozen lines are uninteresting; they just navigate memory until the correct column pointers are computed. The main query code itself is quite short, I have included it below (ignoring the extra code that handles tuples modified by concurrent transactions).

  %tid = phi i64 [ %86, %loopDonescan18 ], [ %66, %loopscan.preheader ]
  %76 = call i64 @hyper::RuntimeFunctions::findUnmodifiedRange(hyper::Transaction::VersionRecord**, unsigned int*, unsigned long, unsigned long)(%"hyper::Transaction::VersionRecord"** %72, i32* %74, i64 %tid, i64 %67)
  %77 = icmp ult i64 %tid, %76
  br i1 %77, label %loopscan11.preheader, label %loopDonescan12

  br label %loopscan11, !dbg !156

  %tid13 = phi i64 [ %84, %scanStep ], [ %tid, %loopscan11.preheader ]
  %.sum = sub i64 %tid13, %66
  %78 = getelementptr i32* %68, i64 %.sum
  %x = load i32* %78, align 4
  %79 = icmp slt i32 %x, 123
  br i1 %79, label %then14, label %scanStep

  %80 = load i64* %0, align 8
  %81 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 %80, i64 1)
  %82 = extractvalue { i64, i1 } %81, 1
  br i1 %82, label %overflow.loopexit35, label %cont16

  %83 = extractvalue { i64, i1 } %81, 0
  store i64 %83, i64* %0, align 8
  br label %scanStep

  %84 = add i64 %tid13, 1
  %85 = icmp eq i64 %84, %76
  br i1 %85, label %loopDonescan12.loopexit, label %loopscan11

  br label %loopDonescan12

It first figures out which tuples are unmodified and can be scanned safely. Then (starting from loopscan11) it scans over these tuples, loads the x column, checks the filter condition (before then14), and updates the count for qualifying tuples (including overflow checks).
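For readers not fluent in LLVM IR, the scan loop corresponds roughly to the following C++ (a sketch; the function name and signature are mine):

```cpp
#include <cstdint>
#include <stdexcept>

// Scan the unmodified tuple range [from,to), filter on x < 123, and
// count the qualifying tuples, checking the counter for overflow
// (the IR does this via llvm.sadd.with.overflow).
int64_t countBelow123(const int32_t* x, uint64_t from, uint64_t to) {
   int64_t count = 0;
   for (uint64_t tid = from; tid != to; ++tid)
      if (x[tid] < 123) {
         if (__builtin_add_overflow(count, int64_t(1), &count))
            throw std::overflow_error("overflow");
      }
   return count;
}
```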

Note that the query is not parallelized because the query optimizer decided that the (empty) table did not warrant parallelization. For larger data sets the generated code will be more complex due to parallel execution, but it should be quite readable in general.

Note that there is a particularly hackish way to exploit that interface which might be interesting for low-level experiments: instead of providing the real clang compiler, we can put a script called clang-3.5 into the search path that first opens the provided source code in an editor and then calls clang. This allows us to modify the generated code and play with the query as we like. Do not do that if you value the data currently stored inside your database! Bad things can and will happen. But it is great for ad-hoc experiments.

Wednesday, June 10, 2015

First impressions of MemSQL

The MemSQL system uses some interesting techniques like query compilation and lock-free data structures, and aspires to be "The Fastest In-Memory database". Parts of its approach are similar to our HyPer system, therefore we were curious and downloaded the MemSQL Community Edition for testing.

The MemSQL web page is not very explicit about it, but after some testing we tend to believe that MemSQL aims at OLTP workloads, so our tests might be unfavorable for MemSQL, as we tried out TPC-H SF1, an OLAP workload. We used the Community Edition 4.0.28 for that, and ran MemSQL out of the box. Which worked very pleasantly by the way, installation was completely hassle free.

What was surprising was that while some queries were fast, some queries were very slow. Below are the runtimes for the first 5 TPC-H queries, running on a six core i7-3930K, compared with HyPer v0.5-332.


After investigating the plans with explain, we saw that apparently MemSQL always uses either index-nested-loop joins (INL) or nested-loop joins (NL), which is very expensive in large, TPC-H-style join queries. The INL is ok, although still somewhat expensive, as seen in Q4, where only INL is used, but if the system is forced to use NL, performance is very poor.

Hence our assumption that the system aims at OLTP settings, where large joins do not occur. Unfortunately there is no easy way to run TPC-C on MemSQL (we would have to implement a benchmark driver first), so we stopped here with just some first impressions. Perhaps we can get a student to implement an OLTP workload for both systems; we will publish an update then.

An interesting side aspect is the compilation model. HyPer generates machine code using the LLVM compiler backend, while MemSQL generates C++ code and passes it through GCC. That is particularly interesting because we can look at the generated C++ code and see what the system is doing (and the generated code is quite readable). A downside of generating C++ code is compile time: While HyPer needs on the order of 100ms to compile a TPC-H query, MemSQL needs on the order of 10s, a factor of 100 slower. They mitigate that with a plan cache, but interactive query sessions can become quite painful.

But for the OLTP side the system looks quite interesting, they use fancy data structures like lock-free skip lists, and it will be interesting to see how they perform there.

Update: Robbie Walzer from MemSQL suggested to add foreign key indexes, which eliminates the expensive nested loop joins from the queries. I have included the numbers below, and I have also added results from PostgreSQL 9.4. Note that PostgreSQL is disk-based and single threaded, so the comparison is unfair, but it represents "classical" systems.

Query   MemSQL    with FK   PostgreSQL   HyPer
Q1      0.64s     0.64s     10.90s       0.013s
Q2      45.92s    0.11s     0.38s        0.001s
Q3      326.94s   0.83s     1.96s        0.013s
Q4      0.56s     0.57s     0.48s        0.007s
Q5      120.50s   0.79s     0.63s        0.008s

And indeed the foreign key indexes greatly improve performance, nested loop joins are just too expensive. Note, however, that the other two systems do fine without these extra indexes, so investing some effort in a hash join implementation might be worthwhile for MemSQL.

Robbie also mentioned cluster support, which we did not test at all here. I assume MemSQL scales nicely with the number of nodes in a cluster, we will test that at some later point. On the other hand even with perfect scalability you need a decent number of nodes until the single-node performance of HyPer is reached in TPC-H. Things might be very different in OLTP settings like TPC-C, of course, there the index structures of MemSQL can probably shine.