However, following a benchmark to the letter, as would be needed for an audited result, is hard, and thus most research papers deviate more or less from the official benchmark. For example, most TPC-C results in research papers ignore the (mandatory) client wait time. Strictly speaking that is not allowed, but it is usually accepted in research papers, as it "only" affects the space requirements and hopefully has little effect on the results otherwise. Other deviations, like ignoring warehouse-crossing transactions, are more dangerous, as they can have a large impact on transaction rates:
(Figure: TPC-C rates, taken from ICDE14)
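To make these two deviations concrete, here is a minimal sketch of a per-terminal TPC-C driver loop. The constants (18s keying time, 12s mean think time, roughly 1% remote warehouses for New-Order lines) reflect my reading of the official specification; the names and the simplified structure are purely illustrative and not taken from any particular system or paper.

```cpp
#include <chrono>
#include <random>
#include <thread>

// Minimal sketch of a single TPC-C terminal running New-Order transactions.
struct Terminal {
   std::mt19937 rng{42};
   int homeWarehouse = 1;
   int warehouseCount = 10;

   int pickSupplyWarehouse() {
      // Per spec, roughly 1% of order lines are supplied by a *remote* warehouse.
      // Dropping this (all-local transactions) eliminates cross-warehouse
      // conflicts and can inflate transaction rates dramatically.
      std::uniform_int_distribution<int> pct(1, 100);
      if (pct(rng) == 1 && warehouseCount > 1) {
         std::uniform_int_distribution<int> w(1, warehouseCount);
         int remote;
         do { remote = w(rng); } while (remote == homeWarehouse);
         return remote;
      }
      return homeWarehouse;
   }

   void runNewOrder(bool obeyWaitTimes) {
      if (obeyWaitTimes) {
         // Mandatory keying time before the transaction is submitted.
         std::this_thread::sleep_for(std::chrono::seconds(18));
      }
      int supplyWarehouse = pickSupplyWarehouse();
      // ... execute the actual New-Order transaction against supplyWarehouse ...
      (void)supplyWarehouse;
      if (obeyWaitTimes) {
         // Mandatory think time, exponentially distributed with a mean of 12s.
         std::exponential_distribution<double> think(1.0 / 12.0);
         std::this_thread::sleep_for(std::chrono::duration<double>(think(rng)));
      }
   }
};

int main() {
   Terminal t;
   // A compliant driver would pass true; most papers effectively pass false.
   for (int i = 0; i < 5; i++)
      t.runNewOrder(/*obeyWaitTimes=*/false);
}
```

With the wait times in place a single terminal can only submit a few transactions per minute, so a compliant high-throughput result needs many terminals and warehouses, and thus a lot of data; dropping the wait times removes that space requirement, while dropping remote warehouses removes most of the contention.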
But then a paper has to acknowledge such a deviation explicitly. Deviating for good reasons is fine, as long as the deviation is clearly visible in the text and well justified.
Of course all research papers are somewhat sloppy. Hardly anybody cares about overflow checking in arithmetic, for example. This introduces a certain bias in comparisons, as commercial systems, and even a few research systems, do perform these checks, but such deviations are usually small.
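As a concrete example of such a small deviation, consider what overflow checking in a simple SUM aggregate looks like; the sketch below uses a GCC/Clang intrinsic and is purely illustrative, not taken from any particular system:

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Unchecked sum, as found in most research prototypes: silently wraps on overflow.
int64_t sumUnchecked(const std::vector<int64_t>& values) {
   int64_t sum = 0;
   for (int64_t v : values)
      sum += v;
   return sum;
}

// Checked sum, roughly what a commercial system has to do for SQL SUM():
// one extra branch per element (here via a GCC/Clang intrinsic).
int64_t sumChecked(const std::vector<int64_t>& values) {
   int64_t sum = 0;
   for (int64_t v : values)
      if (__builtin_add_overflow(sum, v, &sum))
         throw std::overflow_error("numeric overflow in SUM");
   return sum;
}

int main() {
   std::vector<int64_t> v{1, 2, 3};
   return sumUnchecked(v) == sumChecked(v) ? 0 : 1;
}
```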
What is more critical is when somebody implements something that bears little resemblance to the original benchmark, but then does not state that in the experiments. This paper, for example, studies fast in-memory OLAP processing, which is fine as a paper topic. But then the authors claim to run TPC-H in the experiments, even though TPC-H is an ad-hoc benchmark that explicitly forbids most performance tricks: most non-key indexes are forbidden, materialization is forbidden, exploiting domain knowledge is forbidden, and so on. And they get excellent performance numbers in their experiments, easily beating the official TPC-H champion VectorWise:
(Figure: excerpt from ICDE13)
But even though they claim to show TPC-H results, they are not really running TPC-H. They have precomputed the join partners, they use inverted indices, they use all kinds of tricks to be fast. Unfortunately, most of these tricks are explicitly forbidden in TPC-H. Not to mention the update problem: the real TPC-H benchmark contains updates, too, which would probably make some of these data structures expensive to maintain, but which the paper conveniently ignores.
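To illustrate what precomputing the join partners means, and why it is an auxiliary structure of the kind TPC-H forbids, here is a sketch contrasting a join executed at query time with a join that has been materialized into a pointer array at load time. The structs and functions are my own illustration of the general idea, not the data structures of that paper:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Order    { int32_t o_orderkey; int32_t o_custkey; /* ... */ };
struct Lineitem { int32_t l_orderkey; double l_extendedprice; /* ... */ };

// Compliant: the join is computed at query time (here a simple hash join).
double revenueJoin(const std::vector<Order>& orders,
                   const std::vector<Lineitem>& items) {
   std::unordered_map<int32_t, const Order*> ht;
   for (auto& o : orders) ht.emplace(o.o_orderkey, &o);
   double sum = 0;
   for (auto& l : items)
      if (auto it = ht.find(l.l_orderkey); it != ht.end())
         sum += l.l_extendedprice;   // would group by it->second->o_custkey, etc.
   return sum;
}

// Not compliant as a TPC-H result: the join partner is precomputed at load
// time into an auxiliary array, so the "join" degenerates to an array lookup.
// Fast, but exactly the kind of auxiliary structure / materialization the
// benchmark rules forbid, and it would also have to be maintained under the
// RF1/RF2 refresh streams.
double revenuePrecomputed(const std::vector<Order>& orders,
                          const std::vector<Lineitem>& items,
                          const std::vector<uint32_t>& joinPartner) {
   double sum = 0;
   for (size_t i = 0; i < items.size(); ++i) {
      const Order& o = orders[joinPartner[i]];   // precomputed at load time
      (void)o;
      sum += items[i].l_extendedprice;
   }
   return sum;
}

int main() {
   std::vector<Order> orders{{1, 10}, {2, 20}};
   std::vector<Lineitem> items{{1, 100.0}, {2, 50.0}, {1, 25.0}};
   std::vector<uint32_t> joinPartner{0, 1, 0};   // built once during load
   return revenueJoin(orders, items) ==
          revenuePrecomputed(orders, items, joinPartner) ? 0 : 1;
}
```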
And then the experiments compare their system on their machine with VectorWise numbers from a completely different machine, scaling them using SPEC rates. Such a scaling does not make sense, in particular since VectorWise is freely available.
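For concreteness, such SPEC-based scaling presumably boils down to something like the sketch below; this is my reconstruction of the implicit assumption, not the paper's actual formula:

```cpp
#include <cstdio>

// "Scale" a runtime published for machine B to the paper's machine A using
// SPEC CPU rates. The one-liner assumes query runtime is inversely
// proportional to a CPU-only benchmark score, ignoring memory bandwidth,
// cache sizes, storage, and NUMA effects.
double scaledRuntime(double runtimeOnB, double specRateB, double specRateA) {
   return runtimeOnB * specRateB / specRateA;
}

int main() {
   // Purely illustrative numbers.
   std::printf("%.2f s\n", scaledRuntime(10.0, 200.0, 400.0));
}
```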
These kinds of problems are common in research papers, and not limited to the somewhat arbitrary examples mentioned here, but they greatly devalue the experimental results. Experiments must be reproducible, they must be fair, and they must provide insightful results. Deviating from benchmarks is OK if the deviation is well justified, if the evaluation description makes it very clear and explicit, and if the deviation affects all contenders. Comparing a rigged implementation to a faithful one is not very useful.
Very true.
A related problem is "let's consider records with 2 columns, each a 32-bit integer", or "let's use a large dataset of 1GB". It's easy to optimize a microbenchmark. It's hard to optimize an entire system.
Great blog so far, keep it up! :)