However, following a benchmark to the letter, as would be needed for an audited result, is hard, and thus most research papers deviate more or less from the official benchmark. For example, most TPC-C results in research papers ignore the (mandatory) client wait time. Strictly speaking that is not allowed, but it is usually accepted in research papers, as it "only" affects the space requirements and hopefully has little effect on the results otherwise. Other deviations, like ignoring warehouse-crossing transactions, are more dangerous, as they can have a large impact on transaction rates:
(Figure: TPC-C rates, taken from ICDE14)
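To make these two deviations concrete, here is a minimal sketch of a per-terminal TPC-C driver loop. The constants (18s keying time, 12s mean think time, roughly 1% remote warehouses for New-Order lines) reflect my reading of the official specification; the names and the simplified structure are purely illustrative and not taken from any particular system or paper.

```cpp
#include <chrono>
#include <random>
#include <thread>

// Minimal sketch of a single TPC-C terminal running New-Order transactions.
struct Terminal {
   std::mt19937 rng{42};
   int homeWarehouse = 1;
   int warehouseCount = 10;

   int pickSupplyWarehouse() {
      // Per spec, roughly 1% of order lines are supplied by a *remote* warehouse.
      // Dropping this (all-local transactions) eliminates cross-warehouse
      // conflicts and can inflate transaction rates dramatically.
      std::uniform_int_distribution<int> pct(1, 100);
      if (pct(rng) == 1 && warehouseCount > 1) {
         std::uniform_int_distribution<int> w(1, warehouseCount);
         int remote;
         do { remote = w(rng); } while (remote == homeWarehouse);
         return remote;
      }
      return homeWarehouse;
   }

   void runNewOrder(bool obeyWaitTimes) {
      if (obeyWaitTimes) {
         // Mandatory keying time before the transaction is submitted.
         std::this_thread::sleep_for(std::chrono::seconds(18));
      }
      int supplyWarehouse = pickSupplyWarehouse();
      // ... execute the actual New-Order transaction against supplyWarehouse ...
      (void)supplyWarehouse;
      if (obeyWaitTimes) {
         // Mandatory think time, exponentially distributed with a mean of 12s.
         std::exponential_distribution<double> think(1.0 / 12.0);
         std::this_thread::sleep_for(std::chrono::duration<double>(think(rng)));
      }
   }
};

int main() {
   Terminal t;
   // A compliant driver would pass true; most papers effectively pass false.
   for (int i = 0; i < 5; i++)
      t.runNewOrder(/*obeyWaitTimes=*/false);
}
```

With the wait times in place a single terminal can only submit a few transactions per minute, so a compliant high-throughput result needs many terminals and warehouses, and thus a lot of data; dropping the wait times removes that space requirement, while dropping remote warehouses removes most of the contention.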
But then a paper has to acknowledge such a deviation explicitly. Deviating for good reasons is fine, as long as the deviation is clearly visible in the text and well justified.
Of course all research papers are somewhat sloppy. Hardly anybody cares about overflow checking in arithmetic, for example. This introduces a certain bias in comparisons, as commercial systems, and even a few research systems, do perform these checks, but such deviations are usually small.
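As a concrete example of such a small deviation, consider what overflow checking in a simple SUM aggregate looks like; the sketch below uses a GCC/Clang intrinsic and is purely illustrative, not taken from any particular system:

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Unchecked sum, as found in most research prototypes: silently wraps on overflow.
int64_t sumUnchecked(const std::vector<int64_t>& values) {
   int64_t sum = 0;
   for (int64_t v : values)
      sum += v;
   return sum;
}

// Checked sum, roughly what a commercial system has to do for SQL SUM():
// one extra branch per element (here via a GCC/Clang intrinsic).
int64_t sumChecked(const std::vector<int64_t>& values) {
   int64_t sum = 0;
   for (int64_t v : values)
      if (__builtin_add_overflow(sum, v, &sum))
         throw std::overflow_error("numeric overflow in SUM");
   return sum;
}

int main() {
   std::vector<int64_t> v{1, 2, 3};
   return sumUnchecked(v) == sumChecked(v) ? 0 : 1;
}
```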
What is more critical is when somebody implements something that bears little resemblance to the original benchmark, but then does not state that in the experiments. This paper, for example, studies fast in-memory OLAP processing, which is fine as a paper topic. But then the authors claim to run TPC-H in the experiments, even though TPC-H is an ad-hoc benchmark that explicitly forbids most performance tricks: most non-key indexes are forbidden, materialization is forbidden, exploiting domain knowledge is forbidden, and so on. And they get excellent performance numbers in their experiments, easily beating the official TPC-H champion VectorWise:
(Figure: excerpt from ICDE13)
But even though they claim to show TPC-H results, they are not really running TPC-H. They have precomputed the join partners, they use inverted indices, they use all kinds of tricks to be fast. Unfortunately, most of these tricks are explicitly forbidden in TPC-H. Not to mention the update problem: the real TPC-H benchmark contains updates, too, which would probably make some of these data structures expensive to maintain, but which the paper conveniently ignores.
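To illustrate what precomputing the join partners means, and why it is an auxiliary structure of the kind TPC-H forbids, here is a sketch contrasting a join executed at query time with a join that has been materialized into a pointer array at load time. The structs and functions are my own illustration of the general idea, not the data structures of that paper:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Order    { int32_t o_orderkey; int32_t o_custkey; /* ... */ };
struct Lineitem { int32_t l_orderkey; double l_extendedprice; /* ... */ };

// Compliant: the join is computed at query time (here a simple hash join).
double revenueJoin(const std::vector<Order>& orders,
                   const std::vector<Lineitem>& items) {
   std::unordered_map<int32_t, const Order*> ht;
   for (auto& o : orders) ht.emplace(o.o_orderkey, &o);
   double sum = 0;
   for (auto& l : items)
      if (auto it = ht.find(l.l_orderkey); it != ht.end())
         sum += l.l_extendedprice;   // would group by it->second->o_custkey, etc.
   return sum;
}

// Not compliant as a TPC-H result: the join partner is precomputed at load
// time into an auxiliary array, so the "join" degenerates to an array lookup.
// Fast, but exactly the kind of auxiliary structure / materialization the
// benchmark rules forbid, and it would also have to be maintained under the
// RF1/RF2 refresh streams.
double revenuePrecomputed(const std::vector<Order>& orders,
                          const std::vector<Lineitem>& items,
                          const std::vector<uint32_t>& joinPartner) {
   double sum = 0;
   for (size_t i = 0; i < items.size(); ++i) {
      const Order& o = orders[joinPartner[i]];   // precomputed at load time
      (void)o;
      sum += items[i].l_extendedprice;
   }
   return sum;
}

int main() {
   std::vector<Order> orders{{1, 10}, {2, 20}};
   std::vector<Lineitem> items{{1, 100.0}, {2, 50.0}, {1, 25.0}};
   std::vector<uint32_t> joinPartner{0, 1, 0};   // built once during load
   return revenueJoin(orders, items) ==
          revenuePrecomputed(orders, items, joinPartner) ? 0 : 1;
}
```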
And then the experiments compare their system on their machine with VectorWise numbers from a completely different machine, scaling them using SPEC rates. Such a scaling does not make sense, in particular since VectorWise is freely available.
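For concreteness, such SPEC-based scaling presumably boils down to something like the sketch below; this is my reconstruction of the implicit assumption, not the paper's actual formula:

```cpp
#include <cstdio>

// "Scale" a runtime published for machine B to the paper's machine A using
// SPEC CPU rates. The one-liner assumes query runtime is inversely
// proportional to a CPU-only benchmark score, ignoring memory bandwidth,
// cache sizes, storage, and NUMA effects.
double scaledRuntime(double runtimeOnB, double specRateB, double specRateA) {
   return runtimeOnB * specRateB / specRateA;
}

int main() {
   // Purely illustrative numbers.
   std::printf("%.2f s\n", scaledRuntime(10.0, 200.0, 400.0));
}
```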
These kinds of problems are common in research papers, and not limited to the somewhat arbitrary examples mentioned here, but they greatly devalue the experimental results. Experiments must be reproducible, they must be fair, and they must provide insightful results. Deviating from benchmarks is OK if the deviation is well justified, if the evaluation description makes it very clear and explicit, and if the deviation affects all contenders. Comparing a rigged implementation to a faithful one is not very useful.
Very true.
A related problem is "let's consider records with 2 columns, each a 32-bit integer", or "let's use a large dataset of 1GB". It's easy to optimize a microbenchmark. It's hard to optimize an entire system.
Great blog so far, keep it up! :)