Monday, September 8, 2014

Experiments Hurt the Review Process

Experimentally validating (or falsifying) a theory is one of the fundamental aspects of science. As such I have always put a lot of emphasis on experiments, and of course empirical evidence is essential when designing systems. To pick a simple example: merge sort has better worst-case asymptotic behavior than quick sort, but in practice quick sort is usually faster. Wall clock is what matters.
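
To make the point concrete, here is a minimal sketch of such a wall-clock comparison, assuming Python and naive textbook implementations of both algorithms (so the absolute numbers say nothing about tuned library code, and which one wins will depend on the implementation and the data):

```python
# Toy wall-clock comparison of textbook merge sort and quick sort.
# No warm-up, a single input distribution, interpreted Python: the numbers
# only illustrate that measured time is what decides, nothing more.
import random
import time


def merge_sort(a):
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def quick_sort(a):
    if len(a) <= 1:
        return a
    pivot = a[len(a) // 2]
    return (quick_sort([x for x in a if x < pivot])
            + [x for x in a if x == pivot]
            + quick_sort([x for x in a if x > pivot]))


if __name__ == "__main__":
    data = [random.random() for _ in range(200_000)]
    for name, sort in [("merge sort", merge_sort), ("quick sort", quick_sort)]:
        start = time.perf_counter()
        result = sort(data)
        elapsed = time.perf_counter() - start
        assert result == sorted(data)  # sanity check: both must actually sort
        print(f"{name}: {elapsed:.3f}s")
```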

However, while all of that is true in general, in practice experiments can be quite harmful, in particular in the context of peer-reviewed papers. In the rest of this post I will try to illustrate why that is the case and what could potentially be done about it.
It took me a while to realize that experiments hurt the review process, but both real papers that I have seen and, in particular, input from my colleague Alfons Kemper have convinced me that experiments in papers are a problem. Alfons suggested simply ignoring the experimental section of papers, which I find a bit extreme, but he has a point.

The first problem is that virtually all papers "lie by omission" in their evaluation section. The authors will include experiments where their approach behaved well, but they will not show results where it has problems. In Computer Science we consider that perfectly normal behavior; in fields like Pharmacy we would go to jail for it.

Furthermore, this behavior interacts badly with the review process. Of course the reviewers know that they are shown only the good cases, so these cases have to be really good. If someone writes a paper about an approach that is nicer and cleaner than a previous one, but is 20% slower in the experiments, he will have a hard time publishing it (even though the 20% might well just stem from small implementation differences).
Which is a shame. Often reviewers are too keen on good experimental results.

The easiest way to kill a paper is to state that the experiments are not conclusive enough. The authors will have a hard time arguing against that, as there is always another experiment that sounds plausible and important and on which the reviewer can insist. Authors therefore have an incentive to make their experiments slick-looking and impressive, as that will hopefully guide the reviewer in other directions.

Which brings me to the core of the problem with experiments in papers: there have been papers where the thin line between careful experimental setup and outright cheating has been touched, perhaps even crossed. And that is really bad for science. Now one can claim that this is all the fault of the authors, and that they should be tarred and feathered etc., but that is too short-sighted. Of course it is the fault of the authors. But the current system is rigged in a way that makes it very, very attractive to massage experimental results. I once heard the statement that "the reviewers expect good numbers". And unfortunately, that is true.
And this affects not only the highly problematic cases. Even papers that stay "legal" and "just" create good-looking results through a very careful choice of their experimental setup are quite harmful: in the best case we learn little, in the worst case we are misled.

Now what can we do about that problem? I don't really know, but I will propose a few alternatives. One extreme would be to largely ignore the experimental part of a paper and evaluate it purely based upon the ideas presented within. Which is not the worst thing to do, and arguably it would be an improvement over the current system, but if we ignore the actual experiments, the quick sort mentioned above might have had a hard time against the older merge sort.

The other extreme would be what the SIGMOD Repeatability Effort tried to achieve, namely that all experiments are validated. This validation should happen during reviewing (SIGMOD did it only after the fact): the reviewer would repeat the experiments, try out different settings, and fully understand and validate the pros and cons of the proposed approach.
In an ideal world that might actually be the best approach, but unfortunately it is not going to happen. First, authors will claim IP problems and all kinds of other excuses why their approach cannot be validated externally. And second, even more fundamentally, reviewers simply do not have the time to spend days repeating and validating experiments for each paper they review.

So what could a compromise look like? Perhaps a good mode would be to review papers primarily based upon the ideas presented therein, and to take only a cursory look at the experiments. The evaluation should look plausible, but that is it; it should not have much impact on the review decision. In particular, authors should not be expected to produce another miracle result: reporting honest numbers is preferable to a new performance record.
If the authors want to, they could optionally submit a repeatability set (including binaries, scripts, documentation, etc.) together with their paper. That would give them bonus points during reviewing, in particular for performance numbers, as the reviewers can then verify the experiments if they want to. There is no guarantee that they will, and papers should still be ranked primarily based upon their ideas, but that would allow for more reasonable experimental results.

Experiments are great in principle, but in the competitive review process they have unintended consequences. In the long run, we have to do something about that.

8 comments:

  1. This comment has been removed by the author.

  2. I don't write papers but I read a lot of them in the OLTP area. When the key contribution is better performance I expect experiment results. MySQL is frequently used for the comparisons and I doubt many of the results because they are much worse than what I can get from MySQL and because the authors are not MySQL experts. Benchmarking is hard and requires expertise in all of the systems under test, along with the OS, filesystems and storage devices used for the test. I don't think it helps much to have the reviewers try to repeat the tests because I don't think they will have much more expertise in the systems under test. It would help to have more details published on the test configuration so that outsiders can offer feedback (database configuration files, client source code, filesystems used, details on the storage devices, Linux kernel, etc). My advice is:
    1) seek expert advice on the systems under test (MySQL, Oracle, etc)
    2) limit the number of systems that are compared (see #1)

    And for MySQL there are people like me who are willing to consult -- http://smalldatum.blogspot.com/2014/09/get-help-from-mba-mysql-benchmark.html

  3. Of course I want to have experiments as a sanity check, but you should realize that experiments in papers are a bit unreliable. Everybody only shows the good cases, and a few are even tempted to do more than that.
    Now if the authors cannot show even one good case, that is clearly a problem, so taking a look at the experiments makes sense. But currently experiments are taken very seriously during the review process, and that is a mistake. First, the results are a bit unreliable anyway, and second, it puts pressure on authors to produce "good numbers". And in the past that has led to some unpleasant events.

    Reproducibility would be a solution, where everybody could verify all experiments, but unfortunately that is not going to happen. Therefore we should accept that experiments are a nice additional indicator, but we should primarily review papers based upon the ideas within. Otherwise we give the wrong incentives to paper authors.

    Replies
    1. I know the experiments can be unreliable. That was the reason for my most recent blog post listed above. I don't review papers and I rarely submit them so I don't have context for the impact on the review process. Getting papers published isn't critical to my career, so my goal isn't the same as yours.

      But one point on which we can agree (or not) is whether performance results are critical to some papers in VLDB & SIGMOD. There are many papers where the key contribution is better performance over existing work via incremental changes to existing solutions. In that case the ideas in the paper are different but not better unless there are performance results. And there are a huge number of ways to do things differently so I think we need the performance results to determine which of those papers get accepted into top tier conferences.

      On the other hand there are systems papers that do things in a fundamentally different way, like all of the work that led up to Hekaton. In that case the performance results are a nice bonus.

  4. Hi Thomas,

    nice post! Just saw this by chance yesterday.

    I chaired VLDB's E&A track last year and I was also part of the SIGMOD repeatability committee in 2008 (where I learned that repeatability does not work).

    I do not believe that experiments are the problem and should get less weight than the ideas presented in a paper. On the contrary! The major problem is that reviewers unrealistically expect novelty, forcing authors to claim it and to oversell their ideas.

    SIGMOD, PVLDB, and ICDE accept about 200 papers every year? So all of this is brand new? Orthogonal to existing solutions? No delta there? Really? I am buying this less and less.

    Many (actually most) published papers (yes, including my own) simply re-assemble existing pieces in new ways: take some ideas here, improve them slightly, combine them with some other existing stuff, and so forth, and eventually call it MyFantasticNewAlgorithm. Where is the boundary to existing stuff? When is it still a delta and when is it new stuff?

    In my view this is very tough to decide and also a major reason for the randomness in reviewing.

    I believe we can only fix it if we embrace the fact that database research is mostly about experimental insights. It is fine if papers improve upon existing work (slight or big improvements: I really don't care). What matters is whether the experimental study is broad and informative, uses meaningful baselines, and uses decent datasets (not only some made-up stuff where the attribute names are inspired by TPC-H).

    In other words, an honest way of doing research in our community would be to openly admit that we expect a share of 90% pure experiments-and-analysis papers and 10% papers doing new stuff; currently it is the other way around. If we get this into the heads of reviewers, their expectations w.r.t. novelty will decrease to realistic levels and foster a different, more scientific publishing culture.

    Best regards,
    Jens

    Replies
    1. Hi Jens,

      you are right, the review process is part of the problem. Reviewers expect ground-breaking new techniques together with great numbers. This (completely unrealistic) expectation is causing some of the issues I have described.

      So one thing would be changing reviewer expectations. But even then I am somewhat disillusioned with the current state of experiments; I am not sure if just saying that most work is incremental and experimental is enough.

      Even if the results are correct, it is very, very tempting to show just the "good" cases. For example, when comparing two approaches A and B, it is often easy to show the cases where A is good, and to ignore the fact that A might behave much worse than B in some other cases (or perhaps not even offer the functionality).
      When doing that, the results are not wrong, but they are still misleading. Approach A looks great, even though the cases where A is better than B might be the exception. And the reader has little chance to recognize that during a cursory read.

      Releasing source code or binaries helps a bit, if the reader is willing to make the effort to run the experiments themselves. But I think the correct way would be to say that reviewers have to accept that most approaches are trade-offs, i.e., they win in some cases and lose in others, and the authors then have to give a much more comprehensive picture in their experiments, showing both strengths and weaknesses. Unfortunately I am not seeing that happen.

  5. Much of what we do in computer science is about improving some quantifiable characteristic of computer systems. While one can argue analytically about the benefits of any idea or algorithm, computer systems are complex enough, and their integral parts have sufficiently subtle interactions, that without rigorous and systematic measurements there is no way to tell whether we are improving on the state of the art. So experiments are essential for us, the authors, to know whether we are making progress.

    Reviewing experiments has many problems. Certainly what you mention is true, and unfortunate. But reviewing experimental results is still essential. The experimental results are often the *claim* of the paper. Thus, as reviewers, we should make sure that the claim is believable -- that is as much as we can check -- and that the experimental methodology is sound. We can enforce standards and move the community towards better practices. For example, in my field, programming languages, the majority of authors ignore uncertainty. Part of the feedback in reviewing is to encourage them to design better experiments and to report the variability of their measurements.
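
    A minimal sketch of what reporting variability could look like, assuming Python and a stand-in workload in place of the real experiment:

    ```python
    # Report mean and spread over repeated runs instead of a single number.
    # run_once() is a placeholder workload; a real experiment goes here.
    import statistics
    import time


    def run_once():
        start = time.perf_counter()
        sum(i * i for i in range(1_000_000))  # dummy computation to time
        return time.perf_counter() - start


    if __name__ == "__main__":
        runs = [run_once() for _ in range(10)]
        print(f"runtime: {statistics.mean(runs):.4f}s "
              f"+/- {statistics.stdev(runs):.4f}s over {len(runs)} runs "
              f"(min {min(runs):.4f}s, max {max(runs):.4f}s)")
    ```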

    The one change that I would advocate is that we should be more like a real science and embrace reproducibility. It is fine for a conference to accept a paper that claims a 50% speedup on some key algorithm, but we should not consider this a fact until other teams independently implement the idea in other contexts and validate that they can obtain the same speedups in their environments. That is what scientists do in other fields. And there are incentives to carry out such studies (i.e., they are publishable).

    (if you are curious, a slightly longer argument is at http://janvitek.org/pubs/r3.pdf)

    Replies
    1. In an abstract sense I agree: we should aim for reproducibility, and for other teams repeating published results. But while that sounds nice in theory, it is unfortunately difficult in practice.

      I once encountered a paper where the results were clearly non-reproducible, and it was clear that there was little hope of ever achieving the published results. Still, it was very difficult to publish this negative result. Not because anybody doubted my negative results, but because reviewers made up all kinds of excuses for potential mistakes the original authors might have made that would prevent reproducibility. I never claimed to know why the results were different, but the numbers were so far off that reviewers were very sensitive there. Presumably they did not want to publish something that could tarnish somebody else's reputation without 100% clarity about what had happened, but of course it is impossible to get 100% clarity without the cooperation of the original authors. The whole affair created quite a bit of bad blood, even though the negative reproducibility results themselves were undisputed.

      Of course I could have insisted, published the results somewhere other than the original venue, etc. But after all the trouble with that paper I just decided that it was not worth it. I did my duty by trying to correct the results, but if the PC chair wants to hush up the whole affair, that is not my problem. Life is too short for that kind of nonsense.
