Sunday, June 29, 2014

Main-memory vs. disk based

Given today's RAM sizes the working set of a database is in main-memory most of the time, even for disk-based systems. However, it makes a big difference if the system knows that the data will be in main-memory or if it has to expect disk I/O.

We can illustrate that nicely in the HyPer system, which is a pure main-memory database system. Due to the way it was developed, it is nearly 100% source compatible with an older project that implemented a System R-style disk-based OLAP engine aimed at very large (and thus cold-cache) OLAP workloads. The disk-based engine includes a regular buffer manager, locking/latching, a column store using compressed pages, etc. This high degree of compatibility allows for an interesting experiment, namely replacing the data access of a main-memory system with that of a disk-based system (thanks to Alexander Böhm for the idea).

In the following experiment we replaced the table scan operator of HyPer with the table scan of the disk-based system, but left all other operators like joins untouched. We then executed all TPC-H queries on SF1 repeatedly, using identical execution plans, and compared the performance of the disk-based scan to the original HyPer system. After the first pass all data is in main memory, so both systems are effectively main-memory systems, but of course the disk-based system does not know this and assumes data is coming from disk. In the experiments we disabled parallelism and index nested-loop joins, as these were not supported by the older project. All runtimes are in milliseconds on SF1 (single-threaded).



 query          1    2    3    4    5    6    7    8    9   10   11
 main-memory   50    4   20   72   11   52   43   17  132   44    5
 disk-based   374   40  263  208  240  185  277  271  316  202   47

 query         12   13   14   15   16   17   18   19   20   21   22
 main-memory   42  117   12   14   41   34  225  133   20  115   17
 disk-based   275  170  182  182   53  250  461  200  204  556   39

It is evident that the main-memory system is much faster, by approx. a factor of 5 in the geometric mean. The interesting question is why. Profiling showed that a main culprit was compression. The disk-based system compresses data quite aggressively, which is a good idea if the data is indeed coming from disk, as then throughput increases, but a very bad idea if the data is already in memory to begin with. For scan-heavy queries like Q1, nearly 70% of the time was spent on decompressing the data. Or, to phrase it differently, we could have improved the performance by nearly a factor of 4 by avoiding that expensive decompression for these queries (other queries are less affected, but pay for decompression, too).
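The factor-5 figure can be reproduced from the table above with a few lines of Python (a quick sanity check of ours, not part of the original experiment):

```python
from math import prod

# Per-query runtimes in ms from the table above (TPC-H Q1-Q22, SF1).
main_memory = [50, 4, 20, 72, 11, 52, 43, 17, 132, 44, 5,
               42, 117, 12, 14, 41, 34, 225, 133, 20, 115, 17]
disk_based = [374, 40, 263, 208, 240, 185, 277, 271, 316, 202, 47,
              275, 170, 182, 182, 53, 250, 461, 200, 204, 556, 39]

# Per-query slowdown of the disk-based scan, then the geometric mean.
ratios = [d / m for d, m in zip(disk_based, main_memory)]
geomean = prod(ratios) ** (1 / len(ratios))
print(f"geometric-mean slowdown: {geomean:.1f}x")  # roughly 5x
```

The geometric mean is the right average here because the per-query ratios span an order of magnitude (from 1.3x on Q16 to over 20x on Q5), and an arithmetic mean would be dominated by the outliers.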

Note that even though that aggressive compression looks like a bad idea today (and it probably is, given today's systems), it was plausible when the system was designed. Aggressive compression saves disk I/O, and if the system is waiting for disk, the only thing that matters is that decompression is faster than the disk drive, which is the case here. Today we would prefer a more lightweight compression that does not add such a high CPU overhead if the data is already in main memory, but that of course offers worse performance if the data should indeed come directly from disk...

The compression explains roughly a factor of 3 in the performance difference; where does the rest come from? There is no obvious hot spot, performance is lost all over the place. Buffering, latching, memory management, tuple reconstruction, data passing: all components add some overhead here and there, with a quite significant overhead in total. All of that was largely irrelevant as long as we waited for disk I/O, but in the main-memory case the overhead is quite painful.

So what can we learn from that little experiment? First, systems should be designed with the main-memory case in mind. Tuning originally disk-based systems for the in-memory case is difficult, as it requires removing layers and overhead all over the system. Not impossible: with sufficient determination we could probably get the disk-based case within a small factor of the main-memory case, but it would be a lot of work.
And second, comparisons between disk-based systems and main-memory systems are unfair. Here, it looks like the main-memory system wins by a large margin. And of course it does, in most settings. If, however, the data really came from disk, without any caching, the aggressive compression of the disk-based system would have paid off, as it fetches less data from the slow disk. These are really two different use cases, and even though main memory is becoming the norm, there is still some use for disk-based systems.

Wednesday, June 4, 2014

Random Execution Plans

Query optimizers spend a lot of effort on finding the best execution plans. HyPer for example uses a fairly complex dynamic-programming strategy for finding the optimal join order. Of course all of this complex and expensive optimization is based upon the cost model and cardinality estimations. Unfortunately cardinality estimation is often wrong, in particular higher up in the execution plan. Therefore some cynics claim that databases are basically executing random execution plans.

Now the interesting question is, do we really do that? To get an impression of what "random execution plan" means, we took regular SQL queries, generated 100 random execution plans using QuickPick, and executed all of them in HyPer. Even though they are random, the generated plans are still somewhat reasonable, as 1) they contain no cross products, 2) selections are pushed down, and 3) the smaller side is used as build input. Note that constructing these plans needed no estimations at all (except for build/probe selection); we simply constructed plans with random join orders and executed them with a timeout of 3 seconds.
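QuickPick's core idea can be sketched in a few lines (our own simplified rendering, not the original implementation): repeatedly draw a random edge from the join graph and, whenever it connects two different partial plans, join them. Because only join-graph edges are used, no cross products arise by construction.

```python
import random

def quickpick(relations, edges, rng=random.Random()):
    """Build one random join tree by drawing join-graph edges at random.

    relations: iterable of relation names.
    edges: list of (a, b) pairs, the join predicates of the query graph.
    Assumes a connected join graph; returns a nested tuple join tree.
    """
    plan = {r: r for r in relations}    # relation -> root of its partial plan
    comp = {r: {r} for r in relations}  # relation -> set of relations in that plan
    pool = list(edges)
    rng.shuffle(pool)
    for a, b in pool:
        if comp[a] is not comp[b]:      # different partial plans: join them
            joined = ('join', plan[a], plan[b])
            merged = comp[a] | comp[b]
            for r in merged:
                plan[r] = joined
                comp[r] = merged
    return plan[next(iter(relations))]

# Example: the (simplified) join graph of TPC-H Query 5.
rels = ['customer', 'orders', 'lineitem', 'supplier', 'nation', 'region']
edges = [('customer', 'orders'), ('orders', 'lineitem'),
         ('lineitem', 'supplier'), ('supplier', 'nation'),
         ('customer', 'nation'), ('nation', 'region')]
tree = quickpick(rels, edges, random.Random(42))
```

A real implementation would additionally push selections down and pick the smaller side as build input, as described above; this sketch only produces the random shape of the join tree.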
The results for TPC-H Query 5 (SF1) are shown below:


[Figure: runtimes of the 100 random query plans for TPC-H Query 5, sorted by runtime. The best plan finishes in 26ms, the median takes roughly 360ms, and the slowest plans hit the 3,000ms timeout.]


We see several interesting results. First, the best random plan is not bad at all. It is only slightly worse than the plan generated by our complex DP optimizer. On the other hand, the worst random plan is very bad; we had to kill it after 3 seconds. So picking a random plan is clearly dangerous. And even the median plan is not that good. Roughly speaking, the median plan is a factor of 10 slower than the best random plan, and the worst random plan is more than a factor of 10 slower than the median for this query.

Therefore database optimizers are most likely not picking random plans, even though they are reasoning using noisy estimates. Truly random plans are simply too bad. This also demonstrates that query optimization is crucially important, even a fast runtime system cannot correct the mistakes made in query optimization.

Of course randomness can also be used during query optimization. Generate 1000 plans using QuickPick, pick the cheapest one, and you will most likely get a decent plan. Of course we cannot simply execute all of these plans, so we have to pick the cheapest plan based upon estimates, which brings us back to the original problem. But still, QuickPick is very fast, so it might be attractive for large queries.

Random execution plans are also useful for testing the quality of the cost prediction. For TPC-H that is not much of an issue, as there cardinality estimates and cost predictions are quite good, but for data sets with skew and correlations, a scatter plot of expected versus actual runtime is quite enlightening.
So there are indeed a lot of unresolved issues with cardinality estimation, but fortunately there is usually at least a correlation between expected costs and actual costs. Which means that we might pick plans with some randomness induced by estimation errors, but at least we tend to pick them from the good end of the spectrum.
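That last effect, noisy but correlated estimates still steering us toward the good end of the spectrum, can be illustrated with a toy simulation (entirely synthetic numbers of ours, only meant to show the mechanism):

```python
import random

rng = random.Random(1)
trials = 200
wins = 0
for _ in range(trials):
    # 100 hypothetical plans whose true costs span several orders of magnitude.
    true_costs = [rng.lognormvariate(0, 2) for _ in range(100)]
    # The optimizer only sees noisy estimates (multiplicative error),
    # and picks the plan with the lowest *estimated* cost.
    chosen = min(true_costs, key=lambda c: c * rng.lognormvariate(0, 1))
    # Did the noisy choice still land in the better half of the plans?
    if chosen <= sorted(true_costs)[50]:
        wins += 1
print(f"chosen plan beat the median plan in {wins}/{trials} trials")
```

Even with large estimation errors, the chosen plan almost always lands well below the median, because plan costs vary far more than the estimation noise; that is exactly why optimizers with noisy estimates still beat truly random plan selection.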

Friday, May 23, 2014

Vortex: Vectorwise goes Hadoop

Like Thomas in his first blog post, where he announced that his super-cool research system HyPer is available for download, I will also start my first blog post with a systems announcement. For this one, there is no download just yet, but by the end of next month, Actian will have available a new product that makes it possible to use the Actian Vector system in MPP mode on Hadoop clusters. I will talk about all this more extensively at the upcoming Hadoop Summit on June 3 in San Jose, California. So please come and attend!

As you may know, Actian Vector, previously called Vectorwise (oh, those marketing marvels!), is an analytical database system that originated from my research group at CWI. It can be considered a follow-up to our earlier experiences with MonetDB, the open-source system from CWI that pioneered columnar storage and CPU-cache-optimized query processing for the new wave of analytical database systems that have since appeared. Whereas MonetDB is a main-memory optimized system, Vectorwise also works well if the hot set of your query is larger than RAM, and while sticking to columnar storage, it added a series of innovative ideas across the board, such as lightweight compression methods, cooperative I/O scheduling and a new vectorized query execution method, which gives the system its name. Columnar storage and vectorized execution are now being adopted by the relational database industry, with Microsoft SQL Server including vectorized operations on columnstore tables, IBM introducing BLU in DB2 and recently also Oracle announcing an in-memory columnar subsystem. However, these systems have not yet ended Vectorwise's domination of TPC-H, the industry-standard benchmark for analytical database systems. So far this domination only concerns the single-server category, as Vectorwise was a single-server system - but not anymore, thanks to project "Vortex": the Hadoop-enabled MPP version of Vectorwise.
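To give a rough feeling for what "vectorized execution" means (a toy illustration of ours, not Vectorwise code): instead of interpreting the query plan once per tuple, each operator processes a small batch, a vector, of column values per call, amortizing interpretation overhead over the whole batch and keeping the working data in the CPU cache.

```python
# Toy contrast between tuple-at-a-time and vectorized evaluation of
# SUM(price) WHERE qty < 10. Real engines use compiled primitives over
# arrays; this only shows the control-flow difference.

VECTOR_SIZE = 1024

def sum_tuple_at_a_time(rows):
    total = 0.0
    for qty, price in rows:          # one interpreter round-trip per tuple
        if qty < 10:
            total += price
    return total

def sum_vectorized(qty_col, price_col):
    total = 0.0
    for start in range(0, len(qty_col), VECTOR_SIZE):
        q = qty_col[start:start + VECTOR_SIZE]
        p = price_col[start:start + VECTOR_SIZE]
        # Each "primitive" works on a whole vector per call:
        sel = [i for i, v in enumerate(q) if v < 10]   # selection primitive
        total += sum(p[i] for i in sel)                # aggregation primitive
    return total

qty = [i % 20 for i in range(10000)]
price = [1.0] * 10000
assert sum_tuple_at_a_time(zip(qty, price)) == sum_vectorized(qty, price)
```

In an interpreted engine the per-call overhead is paid once per vector instead of once per tuple, which is where most of the speedup comes from; the vector size is chosen so that the active vectors fit in the CPU cache.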

To scale further than a single server allows, one can create MPP database systems that use a cluster consisting of many machines. Big Data is heavily in vogue, and many organizations are now developing data analysis pipelines that work on Hadoop clusters. Hadoop is not only relevant for the scalability and features it offers, but also because it is becoming the cluster management standard. The new YARN version of Hadoop adds many useful features for resource management. Hadoop distributors like Hortonworks, MapR and Cloudera are among the highest-valued tech startups now. An important side effect is the strongly increasing availability of Hadoop skills, as Hadoop is not only getting much mindshare in industry but also making its way into Data Science curricula at universities (which are booming). Though I would not claim that managing a Hadoop cluster is a breeze, it is certainly easier to find expertise on that than to find expertise in managing and tuning MPP database systems. Further, Big Data pipelines typically process huge amounts of unstructured data, with MapReduce and other technologies being employed to find useful information hidden in the raw data. The end product is cleaner and more structured data that one can query with SQL. Being able to use SQL applications right on Hadoop significantly enriches Hadoop infrastructures. No wonder that systems offering to do SQL on Hadoop are the hottest thing in analytical database systems now, with new products being introduced regularly. The marketplace is made up of three categories of vendors:
  1. those that run MPP database systems outside Hadoop and provide a connector (examples are Teradata and Vertica). In this approach, system complexity is not reduced, and one needs to buy and manage two separate clusters: a database cluster for the structured data, and a Hadoop cluster to process the unstructured data. Plus, there is a lot of data copying.
  2. those that take a legacy open-source DBMS (typically Postgres) and deploy it on Hadoop with a wrapping layer (examples are Citusdata and HAWQ). Legacy database systems like Derby and Postgres were not designed for an append-only file system like HDFS, and the resulting database functionality is therefore often also append-only. Note that I call these older database engines "legacy" from the analytical database point of view, since modern vectorized columnar engines are typically an order of magnitude faster on analytical workloads. Splice Machine is an interesting design point in that it ported Derby execution onto HBase storage. HBase is designed for HDFS, so this system can accommodate update-intensive workloads better than analytically-oriented SQL-on-Hadoop alternatives, but its reliance on the row-wise, tuple-at-a-time Derby engine and the access paths through HBase are bound to slow it down on analytical workloads compared with analytical engines.
  3. native implementations that store all data in HDFS and are YARN-integrated. Most famous here are Hive and Impala. However, I can tell from experience that developing a feature-complete analytical database system takes... roughly a decade. Consequently, query execution and query optimization are still very immature in Impala and Hive. These young systems, while certainly serious attempts (both now store data column-wise and compressed, and Hive even adopted vectorized execution), still miss many of the advanced features that the demanding SQL user base has come to expect: workload management, internationalization, advanced SQL features such as ROLLUP and windowing functions, not to mention user authentication and access control.
Interestingly, the new Actian Hadoop product that leverages Vectorwise is of the native category above, as it purely uses HDFS for storage and is YARN-integrated. However, its optimizer and feature set are quite mature. The resulting system is also much faster than Impala and Hive. Of course, it was not trivial to port Vectorwise to HDFS, but true compressed column stores are in their access patterns quite compatible with an append-only file system that prefers large sequential I/Os (i.e. HDFS). Unlike Hive and Impala, Vectorwise also supports fine-grained updates, since updates (at first) go to a separate data structure called the Positional Delta Tree (PDT), yet another CWI innovation. This has been an important piece of the puzzle in making Vectorwise a native inhabitant of HDFS.
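The actual Positional Delta Tree is a counted tree structure so that lookups and merges stay logarithmic; but the core idea, keeping updates as positional patches that are merged into the column scan on the fly while the base data stays append-only, can be sketched much more simply (our own flat-list simplification, not the real PDT):

```python
def scan_with_deltas(base, inserts, deletes):
    """Yield the logical column: immutable base values plus positional patches.

    base:    the on-disk column values (never rewritten).
    inserts: list of (position, value), sorted by position; each value is
             inserted logically *before* that base position.
    deletes: set of base positions to skip.
    """
    ins = iter(inserts)
    nxt = next(ins, None)
    for pos, value in enumerate(base):
        while nxt is not None and nxt[0] == pos:
            yield nxt[1]               # emit pending inserts at this position
            nxt = next(ins, None)
        if pos not in deletes:
            yield value                # emit the surviving base value
    while nxt is not None:             # inserts at the end of the table
        yield nxt[1]
        nxt = next(ins, None)

base = ['a', 'b', 'c', 'd']
merged = list(scan_with_deltas(base, inserts=[(1, 'x'), (4, 'z')],
                               deletes={2}))
# merged == ['a', 'x', 'b', 'd', 'z']
```

Because the base column is never modified in place, this style of differential update fits an append-only file system like HDFS: only the (small) delta structure changes, and it can periodically be merged into a freshly written column.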

One last link I provide is to the master's thesis of two Vector engineers, which one could consider the baby stage of project Vortex. Back then, we envisioned using a cluster file system like RedHat's GlusterFS (it turned out dead slow), but HDFS took its place in project Vortex. The "A Team" (as we used to call Adrian and Andrei) has been instrumental in making this come alive. Similar thanks extend to the entire Actian team in Amsterdam, which has been working very hard on this, as well as to the many other Actian people worldwide who contributed vital parts.

My blog post is becoming too long, so I cannot provide all the juicy bits on how Vortex is designed. But the above picture does give quite a bit of information for the trained eye. Vectorwise now supports partitioned tables, and when deploying on Hadoop, the various partitions get stored in HDFS, which by default replicates every data block three times. Reading remote data through HDFS is quite slow, so we give the NameNode hints on where to store the blocks, and when a Vectorwise session starts and X100 backend processes are started on the "worker set" (a designated subset of the Hadoop cluster), we make sure that the processes that become responsible for a particular partition have it locally. In the above figure I have tried to suggest this decoupling of partitions and database processes: in principle, and thanks to HDFS, everybody can read anything, but we try to optimize things such that most data access is in fact a "shortcut read", as HDFS calls it. We are also working hard to make the solution work well with YARN, such that the system can increase and reduce its footprint (#nodes and #cores per node) on the Hadoop cluster as the query workload varies, and avoids as much as possible getting slowed down by concurrent Hadoop jobs.

More news about Actian Vector 4.0 Hadoop Edition, as this new product is eloquently named, will appear shortly. But, you heard it first on Database Architects!

Thursday, May 22, 2014

Using Benchmarks in Research Papers

Many, if not most, research papers want to include experimental results, and due to the lack of customer data, researchers tend to use well-known benchmarks. Which is fine in principle: these benchmarks are well known and well specified, so using them makes sense.

However, following a benchmark to the letter, as would be needed for an audited result, is hard, and thus most research papers deviate more or less from the official benchmark. For example, most TPC-C results in research papers ignore the (mandatory) client wait time. Which is strictly speaking not allowed, but usually accepted in research papers, as it "only" affects the space requirements, and has hopefully little effect on the results otherwise. Some deviations like ignoring warehouse-crossing transactions are more dangerous, as these can suddenly have a large impact on transaction rates:

TPC-C rates, taken from ICDE14
There might be good reasons for deviating from the official benchmarks. For example, implementing client wait times correctly in TPC-C means that the allowed number of transactions per second is bounded by the number of warehouses in the database. For 300,000 transactions per second we would need ca. 4,500,000 warehouses, which would need ca. 1.3PB of space. This is far beyond the size of main memory, and probably no OLTP system on this planet has that size, which means that TPC-C with wait times is not suitable for main-memory databases.
But then a paper has to acknowledge that explicitly. Deviating for good reasons is fine, if that deviation is clearly and easily visible in the text and well justified.
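The back-of-the-envelope numbers above can be reconstructed as follows; note that the per-warehouse rate and size used here are assumptions we picked to match the post's ballpark figures, not official TPC-C constants:

```python
# With mandatory client wait times, each warehouse can sustain only a
# small, fixed transaction rate, so throughput scales with warehouses.
target_tps = 300_000
tps_per_warehouse = 1 / 15          # assumed cap implied by the wait times
bytes_per_warehouse = 300 * 10**6   # assumed ~300MB incl. growth during the run

warehouses = target_tps / tps_per_warehouse
petabytes = warehouses * bytes_per_warehouse / 10**15
print(f"{warehouses:,.0f} warehouses, {petabytes:.2f} PB")
# -> 4,500,000 warehouses, 1.35 PB
```

The exact constants matter less than the structure of the argument: because the per-warehouse rate is capped, the required database size grows linearly with the target throughput, which is what pushes a faithful high-throughput TPC-C setup out of main memory.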

Of course all research papers are somewhat sloppy. Hardly anybody cares about overflow checking in arithmetic, for example. Which gives a certain bias in comparisons, as commercial systems, and even a few research systems, will do these checks, but these are usually small deviations. What is more critical is when somebody implements something that has little resemblance to the original benchmark, but then does not state that in the experiments.

This paper, for example, studies fast in-memory OLAP processing. Which is fine as a paper topic. But then they claim to run TPC-H in the experiments, which is an ad-hoc benchmark that explicitly forbids most performance tricks. For example, most non-key indexes are forbidden, materialization is forbidden, exploiting domain knowledge is forbidden, etc. And they get excellent performance numbers in their experiment, easily beating the official TPC-H champion VectorWise:

Excerpt from ICDE13

But even though they claim to show TPC-H results, they are not really running TPC-H. They have precomputed the join partners, they use inverted indices, they use all kinds of tricks to be fast. Unfortunately, most of these tricks are explicitly forbidden in TPC-H. Not to mention the update problem: the real TPC-H benchmark contains updates, too, which would probably make some of the data structures expensive to maintain, but which the paper conveniently ignores.
And then the experiments compare their system on their machine with VectorWise numbers from a completely different machine, scaling them using SPEC rates. Such a scaling does not make sense, in particular since VectorWise is freely available.

These kinds of problems are common in research papers, and not limited to the somewhat arbitrary examples mentioned here, but they greatly devalue the experimental results. Experiments must be reproducible, they must be fair, and they must provide insightful results. Deviating from benchmarks is ok if the deviation is well justified, if the evaluation description makes this very clear and explicit, and if the deviation affects all contenders. Comparing a rigged implementation to a faithful one is not very useful.

Thursday, May 15, 2014

Trying out HyPer

At TUM we have built a very fast main-memory database system named HyPer. It offers fairly complete SQL92 support plus some SQL99 features, and is much faster than "traditional" database systems.

The easiest way to play with it is the online demo. It provides you with an easy to use interface for entering queries, running them, and inspecting the execution plan. All queries are evaluated against a SF1 TPC-H database which contains roughly 1GB of data.


HyPer web interface


The web interface is easy to use and requires no setup, but it runs on a fairly weak server machine and queries only 1GB of data. For larger experiments you might be interested in a local installation of HyPer. For that, download the HyPer demo, unpack it with tar xvfJ, and try out one of the included demo scripts (or your own queries, of course).

For TPC-C experiments you can use the demo-tpcc script:


 neumann@tester:~/hyperdemo$ ./demo-tpcc   
 Executing scripts...  
 2044ms  
 Loading...  
 5598ms (722MB)  
 OLTP...  
  wallclock    total  primary    tps   time    mem  
 --------------------------------------------------------------  
         0s   100000   52946  119904   834ms   19MB  
         1s   200000   55108  124843   801ms   31MB  
         2s   300000   52573  120192   832ms   142MB  
         3s   400000   54219  122549   816ms   179MB  
         4s   500000   54896  124533   803ms   199MB  
         4s   600000   54576  123609   809ms   217MB  
         5s   700000   54812  124069   806ms   230MB  
         6s   800000   54391  123609   809ms   236MB  
         7s   900000   52794  119617   836ms   329MB  
         8s   1000000  53557  122100   819ms   347MB  
 --------------------------------------------------------------  
         8s   1000000  53914  122339  8174ms   347MB  
TPC-C run

In the example run above we get 122,339 transactions per second (53,914 neworder transactions per second), and grow the database by 347MB.

For TPC-H experiments you can use the demo-tpch script. Note that the demo does not include the TPC-H data itself, it has to be generated by using the official dbgen tool (in ../tpch). Sample run:

 neumann@tester:~/hyperdemo$ ./demo-tpch   
 Executing scripts...
 6792ms
 Writing database state to tpch.dump
 10287ms
 TPC-H is now available in tpch.dump. Access with ./bin/sql tpch.dump

 Running all 22 TPC-H queries in sequence
 query 1: 23ms
 query 2: 3ms
 query 3: 29ms
 query 4: 13ms
 query 5: 7ms
 query 6: 7ms
 query 7: 18ms
 query 8: 10ms
 query 9: 100ms
 query 10: 27ms
 query 11: 7ms
 query 12: 13ms
 query 13: 57ms
 query 14: 8ms
 query 15: 9ms
 query 16: 116ms
 query 17: 14ms
 query 18: 71ms
 query 19: 35ms
 query 20: 9ms
 query 21: 31ms
 query 22: 8ms
TPC-H run

As indicated in the output the database is now available as a dump, so you can use the command line interface for experiments:

 neumann@tester:~/hyperdemo$ ./sql tpch.dump
 > select count(*) from lineitem;
 1
 6001215
 1 row (0.00176411 s)
 > \q
command-line interface

This should get you started with trying out HyPer. Feel free to contact us if you encounter any issues, or if you get performance numbers that are significantly different from what is shown here.

About this Blog

This blog is intended for people interested in database systems design. We plan to publish comments and results that might be too small for a full-blown research paper, but that are interesting in themselves.

Naturally this blog will be somewhat HyPer-centric, as its originators are from this group, but we explicitly encourage other researchers to leave comments, or, even better, post their opinions as text. This includes everybody, but we are particularly hopeful about our friends from CWI, as Peter Boncz is visiting our group.

This blog is particularly interested in controversial opinions. If you think that the HyPer approach is obviously inferior to yours, or that our experiments are unrealistic, or that the feature is obviously [your choice], by all means, send us a post!