As you may know, Actian Vector, previously called Vectorwise (oh, those marketing marvels!), is an analytical database system that originated from my research group at CWI. It can be considered a follow-up to our earlier work on MonetDB, the open-source system from CWI that pioneered columnar storage and CPU-cache-optimized query processing for the new wave of analytical database systems that have since appeared. Whereas MonetDB is a main-memory optimized system, Vectorwise also works well if the hot set of your query is larger than RAM, and while sticking to columnar storage, it added a series of innovative ideas across the board, such as lightweight compression methods, cooperative I/O scheduling and a new vectorized query execution method, which gives the system its name. Columnar storage and vectorized execution are now being adopted by the relational database industry, with Microsoft SQL Server including vectorized operations on columnstore tables, IBM introducing BLU in DB2, and recently also Oracle announcing an in-memory columnar subsystem. However, these systems have not yet ended Vectorwise's domination of TPC-H, the industry-standard benchmark for analytical database systems. So far this domination only concerns the single-server category, as Vectorwise was a single-server system - but not anymore, thanks to project "Vortex": the Hadoop-enabled MPP version of Vectorwise.
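To make "vectorized execution" a bit more concrete, here is a minimal sketch in plain Java (not actual Vectorwise code, and greatly simplified): instead of interpreting a query plan one tuple at a time, operators exchange small arrays ("vectors") of column values, so per-tuple interpretation overhead is amortized and the hot loops stay tight and cache-friendly. The table, column names and vector size below are made up for illustration.

```java
// Minimal sketch of vectorized execution for: SELECT SUM(price) WHERE qty > 10
public class VectorizedSketch {
    static final int VECTOR_SIZE = 1024;   // tuples processed per primitive call

    // Selection primitive: fills 'sel' with the positions where qty > bound,
    // returns how many positions qualified.
    static int selectGreaterThan(int[] qty, int off, int n, int bound, int[] sel) {
        int k = 0;
        for (int i = 0; i < n; i++)
            if (qty[off + i] > bound) sel[k++] = off + i;
        return k;
    }

    // Aggregation primitive: sums the selected positions of the price column.
    static long sumSelected(long[] price, int[] sel, int k) {
        long total = 0;
        for (int i = 0; i < k; i++) total += price[sel[i]];
        return total;
    }

    // The "query" is driven one vector of (up to) 1024 tuples at a time.
    static long sumPriceWhereQtyOver10(long[] price, int[] qty) {
        int[] sel = new int[VECTOR_SIZE];
        long total = 0;
        for (int off = 0; off < price.length; off += VECTOR_SIZE) {
            int n = Math.min(VECTOR_SIZE, price.length - off);
            int k = selectGreaterThan(qty, off, n, 10, sel);
            total += sumSelected(price, sel, k);
        }
        return total;
    }

    public static void main(String[] args) {
        long[] price = {5, 20, 7, 30};
        int[] qty   = {12, 3, 15, 40};
        System.out.println(sumPriceWhereQtyOver10(price, qty)); // 42 = 5 + 7 + 30
    }
}
```

The point is that each primitive is a simple loop over arrays, which compilers (and JITs) turn into efficient, often SIMD-friendly code, while the per-tuple bookkeeping of a classic interpreter disappears from the inner loop.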
To scale further than a single server allows, one can create MPP database systems that use a cluster consisting of many machines. Big Data is heavily in vogue, and many organizations are now developing data analysis pipelines that run on Hadoop clusters. Hadoop is relevant not only for the scalability and features it offers, but also because it is becoming the cluster management standard. The new YARN version of Hadoop adds many useful features for resource management. Hadoop distributors like Hortonworks, MapR and Cloudera are among the highest-valued tech startups right now. An important side effect is the strongly increasing availability of Hadoop skills, as Hadoop is not only getting much mindshare in industry but also making its way into Data Science curricula at universities (which are booming). Though I would not claim that managing a Hadoop cluster is a breeze, it is certainly easier to find expertise for that than for managing and tuning MPP database systems. Further, Big Data pipelines typically process huge amounts of unstructured data, with MapReduce and other technologies being employed to find useful information hidden in the raw data. The end product is cleaner and more structured data that one can query with SQL. Being able to use SQL applications right on Hadoop significantly enriches Hadoop infrastructures. No wonder that systems offering SQL on Hadoop are the hottest thing in analytical database systems right now, with new products being introduced regularly. The marketplace is made up of three categories of vendors:
- those that run MPP database systems outside Hadoop and provide a connector (examples are Teradata and Vertica). In this approach, system complexity is not reduced, and one needs to buy and manage two separate clusters: a database cluster for the structured data, and a Hadoop cluster to process the unstructured data. Plus, there is a lot of data copying.
- those that take a legacy open-source DBMS (typically Postgres) and deploy it on Hadoop with a wrapping layer (examples are Citusdata and HAWQ). Legacy database systems like Derby and Postgres were not designed for an append-only file system like HDFS, and the resulting database functionality is therefore often also append-only (see the sketch after this list). Note that I call these older database engines "legacy" from the analytical database point of view, since modern vectorized columnar engines are typically an order of magnitude faster on analytical workloads. Splice Machine is an interesting design point in that it ported Derby execution onto HBase storage. HBase is designed for HDFS, so this system can accommodate update-intensive workloads better than analytically oriented SQL-on-Hadoop alternatives, but its reliance on the row-wise, tuple-at-a-time Derby engine and the access paths through HBase are bound to slow it down on analytical workloads compared with analytical engines.
- native implementations that store all data in HDFS and are YARN-integrated. Most famous here are Hive and Impala. However, I can tell from experience that developing a feature-complete analytical database system takes... roughly a decade. Consequently, query execution and query optimization are still very immature in Impala and Hive. These young systems, while certainly serious attempts (both now store data column-wise and compressed, and Hive even adopted vectorized execution), still lack many of the advanced features that the demanding SQL user base has come to expect: workload management, internationalization, advanced SQL features such as ROLLUP and windowing functions, not to mention user authentication and access control.
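To illustrate the append-only point from the second category above, here is a small sketch against the Hadoop FileSystem API (the path and cluster settings are hypothetical, and I am assuming append is enabled on the cluster): you can create files and append to them, but there is no call to overwrite bytes in the middle of an existing file, so an engine built around in-place updates has to resort to deltas or whole-file rewrites.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendOnly {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();          // picks up fs.defaultFS
        FileSystem fs = FileSystem.get(conf);
        Path p = new Path("/warehouse/orders/part-0001");  // hypothetical path

        // Writing a new file works fine...
        try (FSDataOutputStream out = fs.create(p, true)) {
            out.writeBytes("row 1\n");
        }
        // ...and so does appending to it (if the cluster allows appends)...
        try (FSDataOutputStream out = fs.append(p)) {
            out.writeBytes("row 2\n");
        }
        // ...but there is no "overwrite at offset" call: to change "row 1",
        // an engine must rewrite the file or record the change elsewhere.
    }
}
```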
One last link I provide is to the master's thesis of two Vector engineers, which one could consider the baby stage of project Vortex. Back then, we envisioned using a cluster file system like RedHat's GlusterFS (it turned out dead slow), but HDFS took its place in project Vortex. The "A Team" (as we used to call Adrian and Andrei) has been instrumental in making this come alive. Similar thanks extend to the entire Actian team in Amsterdam, which has been working very hard on this, as well as to the many other Actian people worldwide who contributed vital parts.
More news about Actian Vector 4.0 Hadoop Edition, as this new product is eloquently named, will appear shortly. But you heard it first on Database Architects!