Friday, May 23, 2014

Vortex: Vectorwise goes Hadoop

Like Thomas in his first blog post, where he announced that his super-cool research system HyPer is available for download, I will also start my first blog post with a systems announcement. For this one there is no download just yet, but by the end of next month Actian will release a new product that makes it possible to use the Actian Vector system in MPP mode on Hadoop clusters. I will talk about all this more extensively at the upcoming Hadoop Summit on June 3 in San Jose, California. So please come and attend!

As you may know, Actian Vector, previously called Vectorwise (oh, those marketing marvels!), is an analytical database system that originated from my research group at CWI. It can be considered a follow-up to our earlier work on MonetDB, the open-source system from CWI that pioneered columnar storage and CPU-cache-optimized query processing for the new wave of analytical database systems that have appeared since. Whereas MonetDB is a main-memory-optimized system, Vectorwise also works well if the hot set of your query is larger than RAM, and while sticking to columnar storage, it added a series of innovative ideas across the board, such as lightweight compression methods, cooperative I/O scheduling and a new vectorized query execution method, which gives the system its name. Columnar storage and vectorized execution are now being adopted by the relational database industry, with Microsoft SQL Server including vectorized operations on columnstore tables, IBM introducing BLU in DB2, and recently also Oracle announcing an in-memory columnar subsystem. However, these systems have not yet ended Vectorwise's domination of TPC-H, the industry-standard benchmark for analytical database systems. So far this domination only concerns the single-server category, as Vectorwise was a single-server system - but not anymore, thanks to project "Vortex": the Hadoop-enabled MPP version of Vectorwise.

To scale further than a single server allows, one can create MPP database systems that use a cluster consisting of many machines. Big Data is heavily in vogue, and many organizations are now developing data analysis pipelines that work on Hadoop clusters. Hadoop is relevant not only for the scalability and features it offers, but also because it is becoming the cluster management standard. The new YARN version of Hadoop adds many useful features for resource management, and Hadoop distributors like Hortonworks, MapR and Cloudera are among the highest-valued tech startups now. An important side effect is the strongly increasing availability of Hadoop skills, as Hadoop is not only getting much mindshare in industry but also making its way into Data Science curricula at universities (which are booming). Though I would not claim that managing a Hadoop cluster is a breeze, it is certainly easier to find expertise for that than for managing and tuning MPP database systems. Further, Big Data pipelines typically process huge amounts of unstructured data, with MapReduce and other technologies being employed to find useful information hidden in the raw data. The end product is cleaner, more structured data that one can query with SQL. Being able to use SQL applications right on Hadoop significantly enriches Hadoop infrastructures. No wonder that systems offering SQL on Hadoop are the hottest thing in analytical database systems now, with new products being introduced regularly. The marketplace is made up of three categories of vendors:
  1. those that run MPP database systems outside Hadoop and provide a connector (examples are Teradata and Vertica). In this approach, system complexity is not reduced, and one needs to buy and manage two separate clusters: a database cluster for the structured data, and a Hadoop cluster to process the unstructured data. Plus, there is a lot of data copying.
  2. those that take a legacy open-source DBMS (typically Postgres) and deploy it on Hadoop with a wrapping layer (examples are Citusdata and HAWQ). Legacy database systems like Derby and Postgres were not designed for an append-only file system like HDFS, and the resulting database functionality is therefore often also append-only. Note that I call these older database engines "legacy" from the analytical-database point of view, since modern vectorized columnar engines are typically an order of magnitude faster on analytical workloads. The Splice Machine is an interesting design point in that it ported Derby execution onto HBase storage. HBase is designed for HDFS, so this system can accommodate update-intensive workloads better than analytics-oriented SQL-on-Hadoop alternatives, but its reliance on the row-wise, tuple-at-a-time Derby engine and the access paths through HBase are bound to slow it down on analytical workloads compared with analytical engines.
  3. native implementations that store all data in HDFS and are YARN-integrated. Most famous here are Hive and Impala. However, I can tell from experience that developing a feature-complete analytical database system takes... roughly a decade. Consequently, query execution and query optimization are still very immature in Impala and Hive. These young systems, while certainly serious attempts (both now store data column-wise and compressed, and Hive even adopted vectorized execution), still lack many of the advanced features that the demanding SQL user base has come to expect: workload management, internationalization, advanced SQL features such as ROLLUP and windowing functions, not to mention user authentication and access control.
Interestingly, the new Actian Hadoop product that leverages Vectorwise falls in the native category above, as it purely uses HDFS for storage and is YARN-integrated. However, its optimizer and feature set are quite mature, and the resulting system is also much faster than Impala and Hive. Of course, it was not trivial to port Vectorwise to HDFS, but true compressed column stores are, in their access patterns, quite compatible with an append-only file system that prefers large sequential I/Os (i.e. HDFS). Unlike Hive and Impala, Vectorwise also supports fine-grained updates, since updates (at first) go to a separate data structure called the Positional Delta Tree (PDT), yet another CWI innovation. This has been an important piece of the puzzle in making Vectorwise a native inhabitant of HDFS.
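The core trick behind delta-based updates on an append-only file system can be sketched in a few lines. The following is my simplified illustration of the idea, not the actual PDT data structure (the real PDT is a tree keyed by tuple position, supports inserts at arbitrary positions, and is merged into the column store in the background): the base column stays immutable, while updates, deletes and appends accumulate in a small in-memory delta that is merged positionally during scans.

```python
# Simplified illustration of positional-delta-style updates (a sketch, NOT
# the actual Positional Delta Tree from Vectorwise): the on-disk base column
# is never rewritten, as an append-only file system requires; changes live
# in memory and are merged by tuple position at scan time.

class DeltaColumn:
    def __init__(self, base):
        self.base = list(base)   # immutable base column (append-only on disk)
        self.updates = {}        # position -> new value
        self.deletes = set()     # positions logically removed
        self.appends = []        # new values past the end of the base

    def update(self, pos, value):
        self.updates[pos] = value

    def delete(self, pos):
        self.deletes.add(pos)

    def insert(self, value):
        self.appends.append(value)

    def scan(self):
        # Merge base and delta positionally; the base file stays untouched.
        for pos, value in enumerate(self.base):
            if pos in self.deletes:
                continue
            yield self.updates.get(pos, value)
        yield from self.appends

col = DeltaColumn([10, 20, 30])
col.update(1, 21)
col.delete(2)
col.insert(40)
print(list(col.scan()))   # [10, 21, 40]
```

The point of keying the delta by position rather than by value is that a scan can merge it into the column stream without any lookups on the base data, which is what keeps fine-grained updates cheap on top of HDFS.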

One last link I provide is to the Master's thesis of two Vectorwise engineers, which one could consider the baby stage of project Vortex. Back then, we envisioned using a cluster file system like Red Hat's GlusterFS (it turned out dead slow), but HDFS took its place in project Vortex. The "A Team" (as we used to call Adrian and Andrei) has been instrumental in making this come alive. Similar thanks extend to the entire Actian team in Amsterdam, which has been working very hard on this, as well as to the many other Actian people worldwide who contributed vital parts.

My blog post is becoming too long, so I cannot provide all the juicy bits on how Vortex is designed. But the above picture does give quite a bit of information to the trained eye. Vectorwise now supports partitioned tables, and when deploying on Hadoop, the various partitions get stored in HDFS, which by default replicates every data block three times. Reading remote data through HDFS is quite slow, so we give the NameNode hints on where to store the blocks, and when a Vectorwise session starts and X100 backend processes are launched on the "worker set" (a designated subset of the Hadoop cluster), we make sure that the processes that become responsible for a particular partition have it local. In the above figure I have tried to suggest this decoupling of partitions and database processes: in principle, and thanks to HDFS, everybody can read anything, but we try to optimize things such that most data access is in fact a "short-circuit read", as HDFS calls it. We are also working hard to make the solution work well with YARN, such that the system can increase and reduce its footprint (#nodes and #cores per node) on the Hadoop cluster as the query workload varies, and avoids as much as possible getting slowed down by concurrent Hadoop jobs.
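The locality idea described above can be sketched as a small assignment problem. The following greedy sketch is purely illustrative (it is not Actian's actual placement algorithm): every partition has three HDFS replica locations, and we try to make each partition the responsibility of a worker that holds a local replica, balancing load across the worker set so that most scans become local HDFS reads.

```python
# Illustrative greedy partition-to-worker assignment (my sketch, not the
# actual Vortex scheduler). Each partition lists the nodes holding one of
# its three HDFS replicas; we prefer a least-loaded worker with a local
# replica (enabling fast local reads) and fall back to any worker, which
# would force slower remote HDFS reads.

def assign_partitions(replicas, workers):
    """replicas: partition -> list of nodes holding a copy (3x replication)."""
    load = {w: 0 for w in workers}
    assignment = {}
    for part, nodes in replicas.items():
        local = [n for n in nodes if n in load]
        candidates = local or workers      # fall back to remote reads if needed
        best = min(candidates, key=lambda w: load[w])
        assignment[part] = best
        load[best] += 1
    return assignment

replicas = {
    "p0": ["node1", "node2", "node3"],
    "p1": ["node2", "node3", "node4"],
    "p2": ["node1", "node4", "node5"],
}
workers = ["node1", "node2", "node4"]      # the designated "worker set"
print(assign_partitions(replicas, workers))
# {'p0': 'node1', 'p1': 'node2', 'p2': 'node4'}
```

Because HDFS lets any process read any block, a fallback assignment is always correct; locality only affects speed, which is what makes this decoupling of partitions and processes robust to node failures and worker-set changes.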

More news about Actian Vector 4.0 Hadoop Edition, as this new product is eloquently named, will appear shortly. But, you heard it first on Database Architects!

Thursday, May 22, 2014

Using Benchmarks in Research Papers

Many, if not most, research papers want to include experimental results, and due to the lack of customer data, researchers tend to use well-known benchmarks. This is fine in principle: these benchmarks are well known and well specified, so using them makes sense.

However, following a benchmark to the letter, as would be needed for an audited result, is hard, and thus most research papers deviate more or less from the official benchmark. For example, most TPC-C results in research papers ignore the (mandatory) client wait time. Strictly speaking this is not allowed, but it is usually accepted in research papers, as it "only" affects the space requirements and hopefully has little effect on the results otherwise. Some deviations, like ignoring warehouse-crossing transactions, are more dangerous, as these can suddenly have a large impact on transaction rates:

TPC-C rates, taken from ICDE14
There might be good reasons for deviating from the official benchmarks. For example, implementing client wait times correctly in TPC-C means that the allowed number of transactions per second is bounded by the number of warehouses in the database. For 300,000 transactions per second we would need ca. 4,500,000 warehouses, which would occupy ca. 1.3PB of space. This is far beyond the size of main memory, and probably no OLTP system on this planet has that size, which means that TPC-C with wait times is not suitable for main-memory databases.
But then a paper has to acknowledge that explicitly. Deviating for good reasons is fine, if that deviation is clearly and easily visible in the text and well justified.
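The back-of-the-envelope arithmetic above can be reproduced in a few lines. Both constants in this sketch are my assumptions inferred from the numbers in the text, not values quoted from the TPC-C specification: a per-warehouse throughput cap of roughly 4 transactions per minute implied by the wait times, and a per-warehouse footprint of roughly 300 MB.

```python
# Reproducing the wait-time arithmetic above. Both constants are assumptions
# inferred from the post's numbers, not figures taken from the TPC-C spec:
# client wait times cap throughput at roughly 4 transactions per minute per
# warehouse, and a populated warehouse occupies roughly 300 MB.
target_tps = 300_000
tx_per_min_per_warehouse = 4      # assumed throughput cap per warehouse
mb_per_warehouse = 300            # assumed on-disk footprint

warehouses = target_tps * 60 // tx_per_min_per_warehouse
petabytes = warehouses * mb_per_warehouse / 1e9

print(f"{warehouses:,} warehouses, ~{petabytes:.2f} PB")
# 4,500,000 warehouses, ~1.35 PB
```

The exact constants do not matter much; the point is that the required database size grows linearly with the target transaction rate, so any sufficiently fast system is pushed far beyond main-memory capacity.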

Of course all research papers are somewhat sloppy. Hardly anybody cares about overflow checking in arithmetic, for example. This introduces a certain bias in comparisons, as commercial systems, and even a few research systems, will do these checks, but these are usually small deviations. What is more critical is when somebody implements something that bears little resemblance to the original benchmark, but then does not state that in the experiments.

This paper, for example, studies fast in-memory OLAP processing. Which is fine as a paper topic. But then they claim to run TPC-H in the experiments, even though TPC-H is an ad-hoc benchmark that explicitly forbids most performance tricks: most non-key indexes are forbidden, materialization is forbidden, exploiting domain knowledge is forbidden, etc. And they get excellent performance numbers in their experiments, easily beating the official TPC-H champion VectorWise:

Excerpt from ICDE13

But even though they claim to show TPC-H results, they are not really running TPC-H. They have precomputed the join partners, they use inverted indexes, they use all kinds of tricks to be fast. Unfortunately, most of these tricks are explicitly forbidden in TPC-H. Not to mention the update problem: the real TPC-H benchmark contains updates, too, which would probably make some of these data structures expensive to maintain, but which the paper conveniently ignores.
And then the experiments compare their system on their machine with VectorWise numbers from a completely different machine, scaling them using SPEC rates. Such scaling does not make sense, in particular since VectorWise is freely available.

These kinds of problems are common in research papers, and not limited to the somewhat arbitrary examples mentioned here, but they greatly devalue the experimental results. Experiments must be reproducible, they must be fair, and they must provide insightful results. Deviating from benchmarks is ok if the deviation is well justified, if the evaluation description makes this very clear and explicit, and if the deviation affects all contenders. Comparing a rigged implementation to a faithful one is not very useful.

Thursday, May 15, 2014

Trying out HyPer

At TUM we have built a very fast main-memory database system named HyPer. It offers fairly complete SQL92 support plus some SQL99 features, and is much faster than "traditional" database systems.

The easiest way to play with it is the online demo. It provides an easy-to-use interface for entering queries, running them, and inspecting the execution plan. All queries are evaluated against a scale factor 1 (SF1) TPC-H database, which contains roughly 1GB of data.

HyPer web interface

The web interface is easy to use and requires no setup, but it runs on a fairly weak server machine and queries only 1GB of data. For larger experiments you might be interested in a local installation of HyPer. For that, download the HyPer demo, unpack it with tar xvfJ, and try out one of the included demo scripts (or your own queries, of course).

For TPC-C experiments you can use the demo-tpcc script:

 neumann@tester:~/hyperdemo$ ./demo-tpcc   
 Executing scripts...  
 5598ms (722MB)  
  wallclock    total  primary    tps   time    mem  
         0s   100000   52946  119904   834ms   19MB  
         1s   200000   55108  124843   801ms   31MB  
         2s   300000   52573  120192   832ms   142MB  
         3s   400000   54219  122549   816ms   179MB  
         4s   500000   54896  124533   803ms   199MB  
         4s   600000   54576  123609   809ms   217MB  
         5s   700000   54812  124069   806ms   230MB  
         6s   800000   54391  123609   809ms   236MB  
         7s   900000   52794  119617   836ms   329MB  
         8s   1000000  53557  122100   819ms   347MB  
         8s   1000000  53914  122339  8174ms   347MB  
TPC-C run

In the example run above we get 122,339 transactions per second (53,914 of them new-order transactions per second), and the database grows by 347MB.

For TPC-H experiments you can use the demo-tpch script. Note that the demo does not include the TPC-H data itself; it has to be generated using the official dbgen tool (in ../tpch). Sample run:

 neumann@tester:~/hyperdemo$ ./demo-tpch   
 Executing scripts...
 Writing database state to tpch.dump
 TPC-H is now available in tpch.dump. Access with ./bin/sql tpch.dump

 Running all 22 TPC-H queries in sequence
 query 1: 23ms
 query 2: 3ms
 query 3: 29ms
 query 4: 13ms
 query 5: 7ms
 query 6: 7ms
 query 7: 18ms
 query 8: 10ms
 query 9: 100ms
 query 10: 27ms
 query 11: 7ms
 query 12: 13ms
 query 13: 57ms
 query 14: 8ms
 query 15: 9ms
 query 16: 116ms
 query 17: 14ms
 query 18: 71ms
 query 19: 35ms
 query 20: 9ms
 query 21: 31ms
 query 22: 8ms
TPC-H run

As indicated in the output, the database is now available as a dump, so you can use the command-line interface for experiments:

 neumann@tester:~/hyperdemo$ ./sql tpch.dump
 > select count(*) from lineitem;
 1 row (0.00176411 s)
 > \q
command-line interface

This should get you started with trying out HyPer. Feel free to contact us if you encounter any issues, or if you get performance numbers that are significantly different from what is shown here.

About this Blog

This blog is intended for people interested in database systems design. We plan to publish comments and results that might be too small for a full-blown research paper, but that are interesting in themselves.

Naturally this blog will be somewhat HyPer-centric, as its originators are from this group, but we explicitly encourage other researchers to leave comments, or, even better, post their opinions as text. This includes everybody, but we are particularly hopeful for our friend from CWI, as Peter Boncz is visiting our group.

This blog is particularly interested in controversial opinions. If you think that the HyPer approach is obviously inferior to yours, or that our experiments are unrealistic, or that some feature is obviously [your choice], by all means, send us a post!