Showing posts with label query optimization. Show all posts

Thursday, August 7, 2014

TPC-DS with Vector Hadoop Edition

My kick-off blog entry for Database Architects contained the first announcement of a product my friends over at the Vectorwise group of Actian in Amsterdam have been working on over the past few years: the Hadoop version of what is now called Actian Vector. In this post, I will go behind the scenes of this system, which I think is currently the fastest SQL-on-Hadoop system available.

In my first post, this system was only known under its code name Vortex, though the marketeers of Actian have in the end branded it Actian Vector - Hadoop Edition. They in fact advertise the product mostly under yet another name: Actian Analytics Platform - Hadoop SQL Edition. This platform contains two main products: besides Vortex you also get Actian DataFlow (formerly Pervasive's "DataRush" product), which one can compare to something like Yahoo Pig on steroids, that is, an interface for building ETL dataflows with full relational operators, equipped with a graphical user interface (KNIME). It shares no code with Pig, and I say "on steroids" because it is both faster and has broader functionality than Pig; e.g., it is not only a relational processing pipeline, but also contains a suite of data mining and data integration algorithms. Actian DataFlow has an optimized parallel path to load data into Vortex. It is a luxury for Vortex to launch with such an easy-to-use and functional Hadoop-ready ETL system.

SQL-on-Hadoop is in the spotlight lately because IT users are (i) looking to standardize their cluster hardware and software management, and Hadoop has become this standard software layer, and (ii) because typical Big Data pipelines, while they may start with many messy textual or log data sources, in the end spit out cleaner and more structured results that can be processed by SQL systems. This also allows users to directly leverage the many existing SQL business applications over Hadoop-based data. From what Actian people tell me, there is a lot of interest in Vortex, and some customers already bought the platform right at announcement. Vortex was announced on June 3, 2014 at the Hadoop Summit in San Jose (see the video or flip through the presentation), and it was released by the end of June 2014.

I started to write this blog post with the simple goal of providing the details of the performance results that were obtained in a side-by-side comparison with Impala version 1.3.1 on the benchmark that the Cloudera folks have released at github.com/cloudera/impala-tpcds-kit. This is a 19-query subset of the normally 99-query TPC-DS benchmark. TPC-DS has been in the making forever and seems to take ages to replace TPC-H, so this may be seen as a good sign of initial industry adoption (TPC-DS still has no official results). On the downside, selecting 19 queries out of 99 may not be a very objective use of the benchmark, and the Impala subset additionally modified the queries by removing all ROLLUP and window functions (not supported in Impala) and adding extra SQL clauses that help Impala prune partitions. But my understanding of the TPC-DS benchmark is still limited, so I withhold judgement on whether this 19-query subset is representative. It just takes a lot of time to analyze and play with 99 different queries, and I have not yet found that time...

The experiment, at scale factor 3000 (3TB raw data), was run on a small cluster of 5 nodes, each with 64GB RAM and 16 cores (Xeon, 2.4GHz) with hyperthreading disabled. The nodes each have two 1Gb Ethernet adapters and two 1TB magnetic disks. Both systems are hot (second run), and the Hadoop compute node configuration has not changed since data import (hence we can assume both systems read local HDFS blocks where possible). All results are in seconds; the first line is Vortex, the second line is Impala (with its stock settings as published in the original Impala blog post, back in January):

Q03 Q07 Q19 Q27 Q34 Q42 Q43 Q46 Q52 Q53 Q55 Q59 Q63 Q65 Q68 Q73 Q79 Q89 Q98
Vortex 0.98 2.52 2.49 2.71 2.48 0.39 4.15 5.19 0.64 1.13 0.54 30.15 1.10 10.77 2.53 1.20 3.83 2.02 1.00
Impala 18.78 23.85 18.01 20.16 45.30 5.47 62.33 30.53 5.59 22.06 5.33 456.67 21.92 262.67 18.40 9.05 17.72 34.28 7.35

In total, Vortex is 14x faster than Impala on this benchmark. Inferring from the Impala follow-up blog post on this benchmark in May, we could even conclude that Vortex also is much faster than Shark, PrestoDB and the "Stinger" release of Hive. Let me state that I highly esteem Impala as an attempt to create a fast analytical database system for Hadoop. It has quite a few nice features, such as support for compressed columnar formats like ORCfile and Parquet, and it can compile queries at run-time to machine code using LLVM. However, its support for SQL is still far from complete and below what, in my MonetDB and Vectorwise experience, the analytical data market demands. Which is a... very large amount of features, including not only SQL-99 analytical extensions (window functions, rollup), but also workload management, access control and a further long list. Hence, Impala still has a long way to go. Actian Vector has been on the market for four years, has been in development for a decade, and its SQL frontend and API have been on the market even longer (they stem from Ingres). By now, with many customers in production, I almost dare start to call Vector "mature". So, in my experience it does take at least a decade to reach maturity. The biggest shortfall of Impala in this benchmark is that it runs each query single-threaded: Impala just uses one core per node for query processing. In an IT environment where each low-end server has at least 16 cores, this could mean a 16-fold hit in performance, and the number of cores per server is still rapidly increasing (newer Intel machines can have up to 120 cores), so it will get even worse in the future. I do expect that at some point Impala will get multi-threaded, but as explained below, exploiting this efficiently on benchmarks like TPC-DS is by no means trivial. On the one hand, one needs to keep all cores busy, while on the other hand all these busy cores cause a large increase in memory cache pressure and network bandwidth usage, which can easily draw the system into all kinds of bottlenecks.

Digging Deeper..
Writing this blog post just about the overall results on this benchmark left the database architect in me a bit unsatisfied, so I decided to show an example of how the network bottleneck can affect TPC-DS, and what can be done about it. Taking Vectorwise from being the fastest per-core single-server analytical database system on the market into cluster territory required a lot of work on the columnar rewriter (optimizer) of Vectorwise. Now, we will dig deep into the optimization of one of the benchmark queries in particular, namely TPC-DS Q98 - Impala flavor:

select i_item_desc,
       i_category,
       i_class,
       i_current_price,
       sum(ss_ext_sales_price) as itemrevenue
       -- commented out, Impala does not support PARTITION BY:
       -- sum(ss_ext_sales_price)*100/sum(sum(ss_ext_sales_price)) OVER (PARTITION BY i_class)
from store_sales,
     item,
     date_dim
where ss_item_sk = i_item_sk
  and i_category in ('Jewelry', 'Sports', 'Books')
  and ss_sold_date_sk = d_date_sk
  and ss_sold_date_sk between 2451911 and 2451941  -- manually added partition key filter
  and d_date between '2001-01-01' and '2001-01-31'
group by i_item_id,
         i_item_desc,
         i_category,
         i_class,
         i_current_price
order by i_category,
         i_class,
         i_item_id,
         i_item_desc
limit 1000; -- added limit

To make their 19-query derived subset of TPC-DS run well, the Impala folks modified the queries from the official TPC-DS definitions. In the case of Q98 they added the clause:
ss_sold_date_sk between 2451911 and 2451941
in order for it to trigger partition pruning (plus, use of the "partition by" SQL-99 clause was removed, and the result was made smaller with a LIMIT 1000). In the original Impala setup - which we followed - the fact table store_sales is partitioned on date, or rather on the foreign key column of the date dimension, ss_sold_date_sk. The above selection predicate is phrased directly on this foreign key, which is the table partitioning key, and this allows Impala to conclude that the great majority of the partitions contain irrelevant data and can be skipped (there are multiple years of data in TPC-DS, and this selection corresponds to just one month). This explicit hack was needed because Impala's optimizer cannot yet automatically infer this foreign key restriction from the SQL predicate
d_date between '2001-01-01' and '2001-01-31'
which is exactly the same restriction.

As an aside, TPC-DS uses a date dimension that has preemptively been filled with decades of dates, even though the fact tables refer to just a small time period of a few years. Hence, selecting 5% of the tuples in the date dimension either means selecting 0 tuples from the fact table (because the fact table just contains tuples from a few years, not intersecting with the selected 5%), or, more likely, selecting much more than 5% from the fact table. This may initially be missed by optimizers, which may simply use the selection percentage of the dimension as the selection percentage of the fact table. I saw it happen in Vectorwise (but that got fixed ;-)), and PostgreSQL makes the same mistake. It is relatively straightforward, however, to fix the cost model estimates for this by looking at the value distribution of the ss_sold_date_sk column in the optimizer statistics, comparing it to the value distribution of the d_date_sk column, and adjusting the plan estimates accordingly.
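The gist of such an adjustment can be sketched as follows; the surrogate-key ranges used here are illustrative stand-ins for real optimizer statistics, not actual TPC-DS values:

```python
# Sketch: propagate a date-dimension selection to the fact table by
# intersecting key ranges, instead of blindly copying the dimension
# selectivity. All statistics below are hypothetical, simplified stats.

def fact_selectivity(sel_lo, sel_hi, fk_min, fk_max):
    """Fraction of fact rows whose foreign key falls in [sel_lo, sel_hi],
    assuming keys are uniformly spread over [fk_min, fk_max] (the range
    actually present in the fact table, from column statistics)."""
    lo, hi = max(sel_lo, fk_min), min(sel_hi, fk_max)
    if hi < lo:
        return 0.0   # the selection misses the fact data entirely
    return (hi - lo + 1) / (fk_max - fk_min + 1)

# Suppose d_date_sk spans decades (2415022..2488070) while ss_sold_date_sk
# only covers a few years (2450816..2452642). Selecting one month (31 keys):
naive = 31 / (2488070 - 2415022 + 1)                        # dimension selectivity
adjusted = fact_selectivity(2451911, 2451941, 2450816, 2452642)
# 'adjusted' comes out roughly 40x larger than 'naive' -- the difference
# between a badly and a reasonably estimated fact-table scan.
```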

In the TPC-DS schema design used in the Vortex experiments, the store_sales table is partitioned on its primary key, not on the date foreign key, but a clustered index is created on the ss_sold_date_sk column. This means that each partition is organized in date order, yet holds a uniform subset of the table. The "MinMax indexes" which Vectorwise automatically creates on all columns, together with some intricate logic in the columnar rewriter of Vectorwise, allow Vortex to skip all irrelevant data straight from the normal TPC-DS schema. No help required.
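The block-skipping idea behind such MinMax indexes can be illustrated with a small sketch (a deliberate simplification; the real Vectorwise structures are more involved, and the block boundaries below are made up):

```python
# Sketch: MinMax-index block skipping. Each storage block of a column keeps
# its min and max value; a range predicate can discard whole blocks without
# reading them.

def blocks_to_scan(minmax, lo, hi):
    """Return indexes of blocks whose [min, max] range overlaps [lo, hi]."""
    return [i for i, (bmin, bmax) in enumerate(minmax)
            if bmax >= lo and bmin <= hi]

# A partition with a clustered index on ss_sold_date_sk: values inside each
# block are (nearly) sorted, so the per-block ranges barely overlap.
minmax = [(2451541, 2451575), (2451576, 2451910),
          (2451911, 2451941), (2451942, 2452275)]

print(blocks_to_scan(minmax, 2451911, 2451941))   # only one block survives
```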

While the Actian engineers were looking at the TPC-DS benchmark, I got involved as well, and helped provide some ideas for optimization. It was my first confrontation with TPC-DS, and so far I have looked at only 19 out of the 99 queries, so my impression and understanding of its "choke points" is still very partial compared to my understanding of its predecessor TPC-H (see my post on the choke points of TPC-H, and an explanation of what I actually mean by "choke points"). Whereas large joins are an important problem in TPC-H, in the case of TPC-DS on cluster systems it appears that a bigger challenge is handling large-scale aggregations. An aggregation is a GROUP BY query, but if the number of GROUP BY result tuples is large, and unrelated to the partitioning keys, then all nodes must help in computing this large aggregation (because it is so large and thus expensive), and need to send data to each other (because it is unrelated to the partitioning). This all-to-all communication pattern is the worst that can happen to a parallel query plan, as it can lead to the system becoming network-bound, and this also happened in the GROUP BY of Q98, illustrated below:



The above picture shows the un-optimized original query plan that Vortex used. The single-server Vectorwise aggregation normally chooses between various parallel (multi-core) aggregation strategies, where C is the number of cores involved in the query on a single node. Here are two such strategies being considered:
  1. Scan(tuples)=>Aggr(Local)=>eXchangeUnion XU(C:1)=>Aggr(Global)
  2. Scan(tuples)=>eXchangeHashSplit XHS(C:C)=>Aggr(Partitioned)=>eXchangeUnion XU(C:1)
An eXchangeUnion XU(C:1) runs on C input threads and unions, sending their output to one thread. The eXchangeHashSplit XHS(C:C) runs on C threads and distributes all tuples based on hash code, sending the output to C threads. In the first strategy, each thread scans a (supposedly different) set of tuples, locally aggregates that data, and then sends the aggregate results to one coordinating thread. This thread re-aggregates the results (e.g. a global SUM is a SUM of local SUMs), and returns these. In the second strategy, the tuples are hash-partitioned according to the aggregation keys (XHS): each thread reads some tuples, and sends each tuple to one of the other threads (which thread is determined by a "hash" code computed over the GROUP BY key). Thereafter, each thread only has tuples that hash to the same number: the data has been re-partitioned. Subsequently, each thread can locally compute the aggregate in one go, because all data of each group is in the same thread. The final result is the union of all results (XU).
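As a minimal sketch of these two strategies (simulating the exchange operators with plain Python lists; only the XU/XHS names come from the text, the rest is illustrative), both must of course produce the same aggregate:

```python
# Two parallel aggregation strategies, simulated single-node. The "tuples"
# are just group keys and the aggregate is a COUNT, for brevity.
from collections import Counter

def strategy1(thread_inputs):
    """Aggr(Local) => XU(C:1) => Aggr(Global): every thread aggregates its
    own scan; one coordinating thread merges the partial results."""
    partials = [Counter(t) for t in thread_inputs]   # Aggr(Local), per thread
    result = Counter()
    for p in partials:                               # XU + Aggr(Global):
        result.update(p)                             # global SUM of local SUMs
    return result

def strategy2(thread_inputs, C):
    """XHS(C:C) => Aggr(Partitioned) => XU(C:1): repartition tuples by a hash
    of the group key, so each thread aggregates a disjoint set of groups."""
    buckets = [[] for _ in range(C)]
    for t in thread_inputs:                          # XHS: hash-split tuples
        for key in t:
            buckets[hash(key) % C].append(key)
    result = Counter()
    for b in buckets:                                # Aggr(Partitioned) + XU
        result.update(Counter(b))
    return result

inputs = [['a', 'b', 'a'], ['b', 'c'], ['a', 'c', 'c']]
assert strategy1(inputs) == strategy2(inputs, 3) == Counter(a=3, b=2, c=3)
```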

The parallel execution model based on exchange operators originally implemented in single-server Vectorwise was extended in Vortex with distributed exchange operators (XU, XHS), which behave exactly the same, but can also exchange data over the network with threads that run on other compute nodes. In the below distributed variants of the two strategies introduced above, N is the number of nodes, and C@N means that C threads are running on N nodes, hence C*N threads are running in the total system (this is just convenience notation and does not imply that Vortex always uses the same number C of threads on all N nodes):
  1. =>Aggr(Local)=>DXU(C@N:1)=>Aggr(Global)
  2. =>DXHS(C@N:C@N)=>Aggr(Partitioned)=>DXU(C@N:1)
Given the somewhat large size of the GROUP BY (53K results), the second strategy was chosen in the graphically depicted query plan above (DXHS operators). The cost model of Vortex did not choose the first strategy because in each thread the Aggr(Local) would locally aggregate and produce roughly 50K results, and then all 80 threads would exchange 80 * 50K tuples. This would mean that in the last Aggr(Global) step of the query a single thread needs to aggregate 80*50K = 4M tuples: significant non-parallel work, therefore little speedup. The second strategy, in contrast, exchanges the tuple stream based on a hash code computed over the GROUP BY columns, such that each thread receives data for a distinct subset of groups. Therefore, the second strategy aggregates only once, and, more importantly, all cores participate in the Aggr(Partitioned) computation, making it superior here. The problem with this second approach, which the trained Vectorwise query performance analyst identifies by the dark red color of the operators in the execution plan, is that these DXHS operations are quite slow. Each compute node is sending and receiving 16*50K tuples, and the tuples are quite wide (45 bytes), so this is close to 40MB of data, and sending that over Ethernet takes time. Essentially, the system is running into the dreaded network bottleneck. We also ran Vortex on clusters with Infiniband, where this problem is less severe (but still occurs on some queries). However, many Hadoop clusters just have an Ethernet interconnect.

For this reason, a new strategy was added to the Vortex rewriter, now showing all three:
  1. Aggr(Local)=>DXU(N:1)=>Aggr(Global)
  2. DXHS(C@N:C@N)=>Aggr(Partitioned)=>DXU(C@N:1)
  3. XHS(C@N:C@local)=>Aggr(Local)=>DXHS(C@N:C@N)=>Aggr(Partitioned)=>DXU(C@N:1)
Here C@local means that the local XHS (not DXHS) redistributes the data only inside the same node. In our benchmark, it just exchanges data between the 16 local threads on each node, for which no network communication is required. Each thread then pre-aggregates the data locally, producing independent non-overlapping results among the 16 threads. The effect of this pre-aggregation is that the number of tuples is reduced; in this case, by a factor of 10. The subsequent DXHS operation thus has to send 10x less data and avoids the network bottleneck. All in all, Q98 became twice as fast with this optimization.
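The effect of the extra local exchange can be made concrete with a small simulation (the thread and key counts here are scaled down and purely illustrative; they are not measured Vortex numbers):

```python
# Sketch: why XHS(C@N:C@local) + local pre-aggregation shrinks the volume
# sent through the distributed exchange. After hash-splitting inside a node,
# each local thread holds ALL local occurrences of its keys, so duplicates
# collapse before anything hits the network.
from collections import Counter
import random

C, KEYS, N_TUPLES = 16, 50_000, 200_000   # threads per node, groups, rows
random.seed(1)
data = [random.randrange(KEYS) for _ in range(N_TUPLES)]

# Variant without local exchange: each thread pre-aggregates only its slice,
# so the same key may survive in up to C partial results.
slices = [data[i::C] for i in range(C)]
sent_without = sum(len(Counter(s)) for s in slices)

# Variant with local exchange: keys are first hash-split over the C local
# threads, so each distinct key is pre-aggregated exactly once per node.
sent_with = len(Counter(data))

print(sent_without, sent_with)   # the local exchange sends notably fewer tuples
```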


Note that an alternative variant of the third strategy, which would omit =>XHS(C@N:C@local) and just try to aggregate locally before sending to the network, would be much less effective. Thanks to the local exchange, each thread gathers the relevant keys from its 16 neighbors, and finds 100 duplicates. Without the local exchange, one would on average find just 100/16=6 duplicates, hence the subsequent DXHS network communication volume would be reduced far less.

This has been just one example of the many cost-based rewrite rules that were introduced in Vortex. As David DeWitt rightly notes, query optimization is more complicated than rocket science and thus can evolve infinitely, but given the results Actian has seen so far, I think that Actian Vector - Hadoop Edition (Vortex) is already an interesting addition to the sprawling SQL-on-Hadoop market. My students and I are also discovering many scientific challenges, and we look forward to reporting our results in addressing some of these in the near future.

Wednesday, June 4, 2014

Random Execution Plans

Query optimizers spend a lot of effort on finding the best execution plans. HyPer, for example, uses a fairly complex dynamic-programming strategy for finding the optimal join order. Of course, all of this complex and expensive optimization is based upon the cost model and cardinality estimations. Unfortunately, cardinality estimation is often wrong, in particular higher up in the execution plan. Therefore some cynics claim that databases are basically executing random execution plans.

Now the interesting question is: do we really do that? To get an impression of what "random execution plan" means, we took regular SQL queries, generated 100 random execution plans using QuickPick, and executed all of them in HyPer. Even though they are random, the generated plans are still somewhat reasonable, as 1) they contain no cross products, 2) selections are pushed down, and 3) the smaller side is used as build input. Note that constructing these plans needed no estimations at all (except for build/probe selection); we simply constructed plans with random join orders and executed them with a timeout of 3 seconds.
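A sketch of QuickPick-style random plan generation, as we understand the idea (function names and the simplified Q5 join graph are illustrative): repeatedly pick a random join edge that connects two different partial plans, and merge them.

```python
# Random join-tree generation without cross products: only edges of the
# join graph are allowed to merge two partial plans.
import random

def quickpick(relations, join_edges, rng=random):
    plan = {r: r for r in relations}          # every relation starts as a leaf
    edges = list(join_edges)
    rng.shuffle(edges)                        # random edge order = random plan
    for a, b in edges:
        pa, pb = plan[a], plan[b]
        if pa is not pb:                      # different partial plans: join them
            joined = (pa, pb)
            for r, p in plan.items():
                if p is pa or p is pb:
                    plan[r] = joined
    return next(iter(plan.values()))          # the single remaining join tree

# Simplified join graph of TPC-H Q5:
edges = [('customer', 'orders'), ('orders', 'lineitem'),
         ('lineitem', 'supplier'), ('customer', 'nation'),
         ('supplier', 'nation'), ('nation', 'region')]
rels = sorted({r for e in edges for r in e})
print(quickpick(rels, edges))                 # one random cross-product-free tree
```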
The results for TPC-H Query 5 (SF1) are shown below:


[Figure: runtimes of the 100 random query plans for TPC-H Q5, sorted by runtime (x-axis: random query plans, y-axis: runtime in ms). The best plans run in about 26 ms, the median lies in the hundreds of milliseconds, and the slowest plans hit the 3000 ms timeout.]


We see several interesting results. First, the best random plan is not bad at all: it is only slightly worse than the plan generated by our complex DP optimizer. On the other hand, the worst random plan is very bad; we had to kill it after 3 seconds. So picking a random plan is clearly dangerous. Even the median plan is not that good: roughly speaking, the median plan is a factor of 10 slower than the best random plan, and the worst random plan is more than a factor of 10 slower than the median for this query.

Therefore, database optimizers are most likely not picking random plans, even though they are reasoning using noisy estimates. Truly random plans are simply too bad. This also demonstrates that query optimization is crucially important; even a fast runtime system cannot correct the mistakes made in query optimization.

Of course, randomness can also be used during query optimization: generate 1000 plans using QuickPick, pick the cheapest one, and you will most likely get a decent plan. There, of course, we cannot simply execute all of the plans, so we have to pick the cheapest plan based upon estimates, which brings us back to the original problem. But still, QuickPick is very fast, so this might be attractive for large queries.
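Picking the cheapest of several random plans by estimated cost can be sketched like this (the cost model below is a deliberately toy stand-in, not HyPer's, and the plans and cardinalities are made up):

```python
# Sketch: rank candidate join trees by an *estimated* cost and keep the
# cheapest. Trees are nested tuples of relation names; 'card' holds toy
# base-table cardinalities.

def est_cost(tree, card):
    """Return (cost, est_rows): cost = sum of intermediate result sizes,
    with a crude max(n1, n2) guess for join output cardinality."""
    if isinstance(tree, str):
        return 0, card[tree]
    (c1, n1), (c2, n2) = est_cost(tree[0], card), est_cost(tree[1], card)
    out = max(n1, n2)
    return c1 + c2 + out, out

def cheapest_of(plans, card):
    return min(plans, key=lambda t: est_cost(t, card)[0])

card = {'a': 1000, 'b': 10, 'c': 100}
plans = [(('a', 'b'), 'c'), ('a', ('b', 'c')), (('a', 'c'), 'b')]
print(cheapest_of(plans, card))   # joining the two small relations first wins
```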

Random execution plans are also useful for testing the quality of the cost prediction. For TPC-H that is not much of an issue, as here cardinality estimates and cost predictions are quite good, but for data sets with skew and correlations, a scatter plot of expected runtime versus actual runtime is quite enlightening.
So there are indeed a lot of unresolved issues with cardinality estimation, but fortunately there is usually at least a correlation between expected costs and actual costs. This means that we might pick plans with some randomness induced by estimation errors, but at least we tend to pick them from the good end of the spectrum.