Query optimizers spend a lot of effort on finding the best execution plans. HyPer, for example, uses a fairly complex dynamic-programming strategy for finding the optimal join order. Of course, all of this complex and expensive optimization rests on the cost model and on cardinality estimates. Unfortunately, cardinality estimation is often wrong, in particular higher up in the execution plan. Therefore, some cynics claim that databases are basically executing random execution plans.
Now the interesting question is: do we really do that? To get an impression of what "random execution plan" means, we took regular SQL queries, generated 100 random execution plans using QuickPick, and executed all of them in HyPer. Even though they are random, the generated plans are still somewhat reasonable: 1) they contain no cross products, 2) selections are pushed down, and 3) the smaller side is used as build input. Note that constructing these plans required no estimates at all (except for build/probe selection); we simply constructed plans with random join orders and executed them with a timeout of 3 seconds.
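The core construction can be sketched in a few lines. The following is a simplified illustration of the QuickPick idea, not HyPer's implementation: start with every relation as its own component, then repeatedly pick a random edge of the join graph and merge the two components it connects. Because we only follow join edges, the resulting plan never contains a cross product. Relation names and the plan-string representation are invented for illustration.

```python
import random

def quickpick_plan(num_relations, join_edges):
    """QuickPick-style random join tree (sketch): merge components
    along randomly chosen join-graph edges, so no cross products
    can appear in the result."""
    comp = list(range(num_relations))                   # relation -> component id
    plan = {r: f"R{r}" for r in range(num_relations)}   # component id -> subplan string
    edges = list(join_edges)
    random.shuffle(edges)
    for a, b in edges:
        ca, cb = comp[a], comp[b]
        if ca == cb:
            continue  # both sides already joined; the edge becomes a filter
        plan[ca] = f"({plan[ca]} JOIN {plan[cb]})"
        comp = [ca if c == cb else c for c in comp]
        del plan[cb]
    return plan[comp[0]]
```

For a chain query R0-R1-R2-R3, `quickpick_plan(4, [(0, 1), (1, 2), (2, 3)])` returns a random bushy plan joining all four relations. Selection pushdown and build/probe-side selection would be applied on top of this skeleton.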
The results for TPC-H Query 5 (SF1) are shown below:
We see several interesting results. First, the best random plan is not bad at all; it is only slightly worse than the plan generated by our complex DP optimizer. On the other hand, the worst random plan is very bad; we had to kill it after 3 seconds. So picking a random plan is clearly dangerous. And even the median plan is not that good: roughly speaking, the median plan is a factor of 10 slower than the best random plan, and the worst random plan is more than a factor of 10 slower than the median for this query.
Therefore, database optimizers are most likely not picking random plans, even though they are reasoning with noisy estimates. Truly random plans are simply too bad. This also demonstrates that query optimization is crucially important: even a fast runtime system cannot correct the mistakes made during query optimization.
Of course, randomness can also be used during query optimization: generate 1,000 plans using QuickPick, pick the cheapest one, and you will most likely get a decent plan. Admittedly, we cannot simply execute all of these plans, so we have to pick the cheapest one based on estimates, which brings us back to the original problem. Still, QuickPick is very fast, which makes it attractive for large queries.
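That best-of-N idea can be sketched as follows. To keep the example self-contained, a random left-deep join order stands in for QuickPick, and the cost callback is a made-up toy model, not HyPer's cost function:

```python
import random

def best_of_n(relations, estimate_cost, n=1000):
    """Draw n random left-deep join orders (a simplified stand-in
    for QuickPick) and keep the one with the lowest *estimated*
    cost. Runtime is linear in n, with no dynamic programming."""
    best_order, best_cost = None, float("inf")
    for _ in range(n):
        order = relations[:]
        random.shuffle(order)
        cost = estimate_cost(order)
        if cost < best_cost:
            best_order, best_cost = order, cost
    return best_order

# hypothetical cost model: penalize putting large relations early
sizes = {"lineitem": 6_000_000, "orders": 1_500_000,
         "customer": 150_000, "nation": 25}

def toy_cost(order):
    # weight each relation's size by how early it appears
    return sum(sizes[r] * (len(order) - i) for i, r in enumerate(order))
```

With `best_of_n(list(sizes), toy_cost, n=200)` the returned order almost always joins the small relations first, as the toy model rewards. The quality of the result is, of course, only as good as the estimates fed into `estimate_cost`.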
Random execution plans are also useful for testing the quality of cost prediction. For TPC-H this is not much of an issue, as cardinality estimates and cost predictions are quite good there, but for data sets with skew and correlations a scatter plot of expected runtime versus actual runtime is quite enlightening.
So there are indeed a lot of unresolved issues with cardinality estimation, but fortunately there is usually at least a correlation between expected and actual costs. This means we might pick plans with some randomness induced by estimation errors, but at least we tend to pick them from the good end of the spectrum.
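That correlation can be quantified directly instead of only eyeballed in a scatter plot. A minimal sketch: compute the Pearson correlation between log estimated cost and log actual runtime (logs, because plan runtimes span orders of magnitude). The sample pairs below are made up for illustration:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# made-up (estimated cost, actual runtime in ms) pairs for random plans
plans = [(1e4, 12.0), (5e4, 40.0), (2e5, 95.0),
         (1e6, 800.0), (8e6, 3000.0)]
r = pearson([math.log(c) for c, _ in plans],
            [math.log(t) for _, t in plans])
```

A value of r close to 1 means the cost model ranks plans roughly correctly even if the absolute estimates are off, which is exactly the property an optimizer needs.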
At TUM we have built a very fast main-memory database system named HyPer. It offers fairly complete SQL92 support plus some SQL99 features, and is much faster than "traditional" database systems.
The easiest way to play with it is the online demo. It provides an easy-to-use interface for entering queries, running them, and inspecting the execution plan. All queries are evaluated against an SF1 TPC-H database, which contains roughly 1 GB of data.
HyPer web interface
The web interface is easy to use and requires no setup, but it runs on a fairly weak server machine and queries only 1 GB of data. For larger experiments you might be interested in a local installation of HyPer. To set one up, download the HyPer demo, unpack it with tar xvfJ, and try one of the included demo scripts (or your own queries, of course).
For TPC-C experiments you can use the demo-tpcc script:
In the example run above we get 122,339 transactions per second (53,913 neworder transactions per second) and grow the database by 347 MB.
For TPC-H experiments you can use the demo-tpch script. Note that the demo does not include the TPC-H data itself; it has to be generated using the official dbgen tool (in ../tpch). Sample run:
This should get you started with trying out HyPer. Feel free to contact us if you encounter any issues, or if you get performance numbers that are significantly different from what is shown here.