Monday, September 8, 2014

Experiments Hurt the Review Process

Experimentally validating (or falsifying) a theory is one of the fundamental aspects of science. As such I have always put a lot of emphasis on experiments, and of course empirical evidence is essential when designing systems. To pick a simple example: merge sort has better worst-case asymptotic behavior than quick sort, but in practice quick sort is usually faster. Wall clock is what matters.
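
To make the point concrete, here is a minimal sketch of such a wall-clock comparison, assuming Python and naive textbook implementations of both algorithms (so the absolute numbers say nothing about tuned library code, and which one wins will depend on the implementation and the data):

```python
# Toy wall-clock comparison of textbook merge sort and quick sort.
# No warm-up, a single input distribution, interpreted Python: the numbers
# only illustrate that measured time is what decides, nothing more.
import random
import time


def merge_sort(a):
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def quick_sort(a):
    if len(a) <= 1:
        return a
    pivot = a[len(a) // 2]
    return (quick_sort([x for x in a if x < pivot])
            + [x for x in a if x == pivot]
            + quick_sort([x for x in a if x > pivot]))


if __name__ == "__main__":
    data = [random.random() for _ in range(200_000)]
    for name, sort in [("merge sort", merge_sort), ("quick sort", quick_sort)]:
        start = time.perf_counter()
        result = sort(data)
        elapsed = time.perf_counter() - start
        assert result == sorted(data)  # sanity check: both must actually sort
        print(f"{name}: {elapsed:.3f}s")
```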

However, while all of that is true in general, in practice experiments can be quite harmful, in particular in the context of peer-reviewed papers. In the rest of this post I will try to illustrate why that is the case and what could potentially be done about it.
It took me a while to realize that experiments hurt the review process, but both real papers that I have seen and, in particular, input from my colleague Alfons Kemper have convinced me that experiments in papers are a problem. Alfons suggested simply ignoring the experimental section of papers, which I find a bit extreme, but he has a point.

The first problem is that virtually all papers "lie by omission" in their evaluation section. The authors will include experiments where their approach behaved well, but they will not show results where it has problems. In Computer Science we consider that perfectly normal behavior; in fields like Pharmacy we would go to jail for it.

Furthermore, this behavior interacts badly with the review process. Of course the reviewers know that they are shown only the good cases, so these cases have to be really good. If someone writes a paper about an approach that is nicer and cleaner than a previous one, but is 20% slower in the experiments, he will have a hard time publishing it (even though the 20% might well just stem from small implementation differences).
Which is a shame. Often reviewers are too keen on good experimental results.

The easiest way to kill a paper is to state that the experiments are not conclusive enough. The authors will have a hard time arguing against that, as there is always another experiment that sounds plausible and important and on which the reviewer can insist. Authors therefore have an incentive to make their experiments slick-looking and impressive, as that will hopefully guide the reviewer in other directions.

Which brings me to the core of the problem with experiments in papers: there have been papers where the thin line between careful experimental setup and outright cheating has been touched, perhaps even crossed. And that is really bad for science. Now one can claim that this is all the fault of the authors, and that they should be tarred and feathered etc., but that is too short-sighted. Of course it is the fault of the authors. But the current system is rigged in a way that makes it very, very attractive to massage experimental results. I once heard the statement that "the reviewers expect good numbers". And unfortunately, that is true.
And this affects not only the highly problematic cases. Even papers that stay "legal" and "just" create good-looking results through a very careful choice of their experimental setup are quite harmful: in the best case we learn little, in the worst case we are misled.

Now what can we do about that problem? I don't really know, but I will propose a few alternatives. One extreme would be to largely ignore the experimental part of a paper and evaluate it purely based upon the ideas presented within. Which is not the worst thing to do, and arguably it would be an improvement over the current system, but if we ignore the actual experiments, the quick sort mentioned above might have had a hard time against the older merge sort.

The other extreme would be what the SIGMOD Repeatability Effort tried to achieve, namely that all experiments are validated. This validation should happen during reviewing (SIGMOD did it only after the fact): the reviewer would repeat the experiments, try out different settings, and fully understand and validate the pros and cons of the proposed approach.
In an ideal world that might actually be the best approach, but unfortunately it is not going to happen. First, authors will claim IP problems and all kinds of other excuses why their approach cannot be validated externally. And second, even more fundamentally, reviewers simply do not have the time to spend days repeating and validating experiments for each paper they review.

So what could a compromise look like? Perhaps a good mode would be to review papers primarily based upon the ideas presented therein, and to take only a cursory look at the experiments. The evaluation should look plausible, but that is it; it should not have much impact on the review decision. In particular, authors should not be expected to produce another miracle result: reporting honest numbers is preferable to a new performance record.
If the authors want to, they could optionally submit a repeatability set (including binaries, scripts, documentation, etc.) together with their paper. That would give them bonus points during reviewing, in particular for performance numbers, as the reviewers can then verify the experiments if they want to. There is no guarantee that they will, and papers should still be ranked primarily based upon their ideas, but that would allow for more reasonable experimental results.

Experiments are great in principle, but in the competitive review process they have unintended consequences. In the long run, we have to do something about that.

8 comments:

  1. This comment has been removed by the author.

  2. I don't write papers but I read a lot of them in the OLTP area. When the key contribution is better performance I expect experiment results. MySQL is frequently used for the comparisons and I doubt many of the results because they are much worse than what I can get from MySQL and because the authors are not MySQL experts. Benchmarking is hard and requires expertise in all of the systems under test, along with the OS, filesystems and storage devices used for the test. I don't think it helps much to have the reviewers try to repeat the tests because I don't think they will have much more expertise in the systems under test. It would help to have more details published on the test configuration so that outsiders can offer feedback (database configuration files, client source code, filesystems used, details on the storage devices, Linux kernel, etc). My advice is:
    1) seek expert advice on the systems under test (MySQL, Oracle, etc)
    2) limit the number of systems that are compared (see #1)

    And for MySQL there are people like me who are willing to consult -- http://smalldatum.blogspot.com/2014/09/get-help-from-mba-mysql-benchmark.html

  3. Of course I want to have experiments as a sanity check, but you should realize that experiments in papers are a bit unreliable. Everybody only shows the good cases, and a few are even tempted to do more than that.
    Now if the authors cannot show even one good case, that is clearly a problem, so taking a look at the experiments makes sense. But currently experiments are taken very seriously during the review process, and that is a mistake. First, the results are a bit unreliable anyway, and second, it puts pressure on authors to produce "good numbers". And in the past that has led to some unpleasant events.

    Reproducibility would be a solution, where everybody could verify all experiments, but unfortunately that is not going to happen. Therefore we should accept that experiments are a nice additional indicator, but we should primarily review papers based upon the ideas within. Otherwise we give the wrong incentives to paper authors.

    Replies
    1. I know the experiments can be unreliable. That was the reason for my most recent blog post listed above. I don't review papers and I rarely submit them so I don't have context for the impact on the review process. Getting papers published isn't critical to my career, so my goal isn't the same as yours.

      But one point on which we can agree (or not) is whether performance results are critical to some papers in VLDB & SIGMOD. There are many papers where the key contribution is better performance over existing work via incremental changes to existing solutions. In that case the ideas in the paper are different but not better unless there are performance results. And there are a huge number of ways to do things differently so I think we need the performance results to determine which of those papers get accepted into top tier conferences.

      On the other hand there are systems papers that do things in a fundamentally different way, like all of the work that led up to Hekaton. In that case the performance results are a nice bonus.

  4. Hi Thomas,

    nice post! Just saw this by chance yesterday.

    I chaired VLDB's E&A track last year and I was also part of the SIGMOD repeatability committee in 2008 (where I learned that repeatability does not work).

    I do not believe that experiments are the problem and should get less weight than the ideas presented in a paper. On the contrary! The major problem is that reviewers unrealistically expect novelty, forcing authors to claim it and to oversell their ideas.

    SIGMOD, PVLDB, and ICDE accept about 200 papers every year? So all of this is brand new? Orthogonal to existing solutions? No delta there? Really? I am buying this less and less.

    Many (actually most) published papers (yes, including my own) simply re-assemble existing pieces in new ways: take some ideas here, improve them slightly, combine them with some other existing stuff, and so forth, and eventually call it MyFantasticNewAlgorithm. Where is the boundary to existing stuff? When is it still a delta and when is it new stuff?

    In my view this is very tough to decide and also a major reason for the randomness in reviewing.

    I believe we can only fix it if we embrace the fact that database research is mostly about experimental insights. It is fine if papers improve upon existing work (slight or big improvements: I really don't care). What matters is whether the experimental study is broad and informative, uses meaningful baselines, and uses decent datasets (not only some made-up stuff where the attribute names are inspired by TPC-H).

    In other words, an honest way of doing research in our community would be to openly admit that we expect a share of 90% pure experiments-and-analysis papers and 10% papers doing new stuff; currently it is the other way around. If we get this into the heads of reviewers, their expectations w.r.t. novelty will decrease to realistic levels and foster a different, more scientific publishing culture.

    Best regards,
    Jens

    Replies
    1. Hi Jens,

      you are right, the review process is part of the problem. Reviewers expect ground-breaking new techniques together with great numbers. This (completely unrealistic) expectation is causing some of the issues I have described.

      So one thing would be changing reviewer expectations. But even then I am somewhat disillusioned with the current state of experiments; I am not sure if just saying that most work is incremental and experimental is enough.

      Even if the results are correct, it is very, very tempting to show just the "good" cases. For example, when comparing two approaches A and B, it is often easy to show the cases where A is good, and to ignore the fact that A might behave much worse than B in some other cases (or perhaps not even offer the functionality).
      When doing that, the results are not wrong, but they are still misleading. Approach A looks great, even though the cases where A is better than B might be the exception. And the reader has little chance to recognize that during a cursory read.

      Releasing source code or binaries helps a bit, if the reader is willing to make the effort to run the experiments themselves. But I think the correct way would be to say that reviewers have to accept that most approaches are trade-offs, i.e., they win in some cases and lose in others, and the authors then have to give a much more comprehensive picture in their experiments, showing both strengths and weaknesses. Unfortunately I am not seeing that happen.

  5. Much of what we do in computer science is about improving some quantifiable characteristic of computer systems. While one can argue analytically about the benefits of any idea or algorithm, computer systems are complex enough, and their integral parts have sufficiently subtle interactions, that without rigorous and systematic measurements there is no way to tell whether we are improving on the state of the art. So experiments are essential for us, the authors, to know whether we are making progress.

    Reviewing experiments has many problems. Certainly what you mention is true, and unfortunate. But reviewing experimental results is still essential. The experimental results are often the *claim* of the paper. Thus, as reviewers, we should make sure that the claim is believable -- that is as much as we can check -- and that the experimental methodology is sound. We can enforce standards and move the community towards better practices. For example, in my field, programming languages, the majority of authors ignore uncertainty. Part of the feedback in reviewing is to encourage them to design better experiments and to report the variability of their measurements.
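
    A minimal sketch of what reporting variability could look like, assuming Python and a stand-in workload in place of the real experiment:

    ```python
    # Report mean and spread over repeated runs instead of a single number.
    # run_once() is a placeholder workload; a real experiment goes here.
    import statistics
    import time


    def run_once():
        start = time.perf_counter()
        sum(i * i for i in range(1_000_000))  # dummy computation to time
        return time.perf_counter() - start


    if __name__ == "__main__":
        runs = [run_once() for _ in range(10)]
        print(f"runtime: {statistics.mean(runs):.4f}s "
              f"+/- {statistics.stdev(runs):.4f}s over {len(runs)} runs "
              f"(min {min(runs):.4f}s, max {max(runs):.4f}s)")
    ```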

    The one change that I would advocate is that we should be more like a real science and embrace reproducibility. It is fine for a conference to accept a paper that claims a 50% speedup on some key algorithm, but we should not consider this a fact until other teams independently implement the idea in other contexts and validate that they can obtain the same speedups in their environments. That is what scientists do in other fields. And there are incentives to carry out such studies (i.e., they are publishable).

    (if you are curious, a slightly longer argument is at http://janvitek.org/pubs/r3.pdf)

    Replies
    1. In an abstract sense I agree: we should aim for reproducibility, and for other teams repeating published results. But while that sounds nice in theory, it is unfortunately difficult in practice.

      I once encountered a paper where the results were clearly non-reproducible, and it was clear that there was little hope of ever achieving the published results. Still, it was very difficult to publish this negative result. Not because anybody doubted my negative results, but because reviewers made up all kinds of excuses for potential mistakes the original authors might have made that would prevent reproducibility. I never claimed to know why the results were different, but the numbers were so far off that reviewers were very sensitive there. Presumably they did not want to publish something that could tarnish somebody else's reputation without 100% clarity about what had happened, but of course it is impossible to get 100% clarity without the cooperation of the original authors. The whole affair created quite a bit of bad blood, even though the negative reproducibility results themselves were undisputed.

      Of course I could have insisted, published the results somewhere other than the original venue, etc. But after all the trouble with that paper I just decided that it was not worth it. I did my duty by trying to correct the results, but if the PC chair wants to hush up the whole affair, that is not my problem. Life is too short for that kind of nonsense.
