Comments on Database Architects: Experiments Hurt the Review Process

In an abstract sense I agree: we should aim for reproducibility, and for other teams repeating published results. Unfortunately, that sounds nice in theory but is difficult in practice.

I once encountered a paper whose results were clearly non-reproducible, and it was clear that there was little hope of ever achieving the published numbers. Still, it was very difficult to publish these negative results. Not because anybody doubted them, but because reviewers made up all kinds of excuses for potential mistakes the original authors might have made that would prevent reproducibility. I never claimed to know why the results were different, but the numbers were so far off that reviewers were very sensitive there. Presumably they did not want to publish something that could have tarnished somebody else's reputation without 100% clarity about what had happened, but of course it is impossible to get 100% clarity without the cooperation of the original authors. The whole affair created quite a bit of bad blood, even though the negative reproducibility results themselves were undisputed.

Of course I could have insisted, published the results somewhere other than the original venue, etc. But after all the trouble with that paper I just decided that it is not worth it. I did my duty by trying to correct the results, but if the PC chair wants to hush up the whole affair, that is not my problem.
Life is too short for that kind of nonsense.

-- Thomas Neumann, 2016-05-21

Much of what we do in computer science is about improving some quantifiable characteristic of computer systems. While one can argue analytically about the benefits of any idea or algorithm, computer systems are complex enough, and their parts interact subtly enough, that without rigorous and systematic measurements there is no way to tell whether we are improving on the state of the art. So experiments are essential for us, the authors, to tell whether we are making progress.

Reviewing experiments has many problems. Certainly what you mention is true, and unfortunate. But reviewing experimental results is still essential. The experimental results are often the *claim* of the paper. Thus, as reviewers, we should make sure that the claim is believable -- that's as much as we can check -- and that the experimental methodology is sound. We can enforce standards and move the community towards better practices. For example, in my field, programming languages, the majority of authors ignore uncertainty. Part of the feedback in reviewing is to encourage them to design better experiments and to report the variability of their measurements.

The one change that I would advocate is that we should be more like a real science and embrace reproducibility. It is fine for a conference to accept a paper that claims a 50% speedup on some key algorithm, but we should not consider this a fact until other teams independently implement the idea in other contexts and validate that they can obtain the same speedups in their environments. That's what scientists do in other fields. And there are incentives to carry out such studies (i.e.
they are publishable).

(If you are curious, a slightly longer argument is at http://janvitek.org/pubs/r3.pdf.)

-- Jan Vitek, http://janvitek.org, 2016-05-19

Hi Jens,
you are right, the review process is part of the problem. Reviewers expect ground-breaking new techniques together with great numbers. This (completely unrealistic) expectation is causing some of the issues I have described.

So one thing would be changing reviewer expectations. But even then, I am somewhat disillusioned with the current state of experiments; I am not sure that just saying most work is incremental and experimental is enough.

Even if the results are correct, it is very, very tempting to show only the "good" cases. For example, when comparing two approaches A and B, it is often easy to show the cases where A is good and to ignore the fact that A might behave much worse than B in other cases (or perhaps not even offer the functionality). The results are then not wrong, but still misleading: approach A looks great, even though the cases where A beats B might be the exception, and the reader has little chance to recognize that during a cursory read.

Releasing source code or binaries helps a bit, if the reader is willing to make the effort to run experiments themselves. But I think the correct way would be this: reviewers have to accept that most approaches are trade-offs, i.e., they win in some cases and lose in others, and the authors then have to give a much more comprehensive picture in their experiments, showing strengths and weaknesses. Unfortunately, I am not seeing that happen.

-- Thomas Neumann, 2014-12-12

Hi Thomas,
nice post! I just saw this by chance yesterday.

I chaired VLDB's E&A track last year, and I was also part of the SIGMOD repeatability committee in 2008 (where I learned that repeatability does not work).

I do not believe that experiments are the problem and should get less weight than the ideas presented in a paper. On the contrary! The major problem is that reviewers unrealistically expect novelty, forcing authors to claim it and to oversell their ideas.

SIGMOD, PVLDB, and ICDE accept about 200 papers every year? So all of this is brand new? Orthogonal to existing solutions? No delta there? Really? I am buying this less and less.

Many (actually most) published papers (yes, including my own) simply re-assemble existing pieces in new ways: take some ideas here, improve them slightly, combine them with some other existing stuff, and eventually call it MyFantasticNewAlgorithm. Where is the boundary to existing work? When is it still a delta and when is it new? In my view this is very tough to decide, and it is also a major reason for the randomness in reviewing.

I believe we can only fix this if we embrace that database research is mostly about experimental insights. It is fine if papers improve upon existing work (slight or big improvements: I really don't care). What matters is whether the experimental study is broad, informative, uses meaningful baselines, and uses decent datasets (not only some made-up stuff where the attribute names are inspired by TPC-H).

In other words, an honest way of doing research in our community would be to openly admit that we expect a share of 90% pure experiments-and-analysis papers and 10% new stuff; currently it is the other way around. If we get this into the heads of reviewers, their expectations w.r.t.
novelty will decrease to realistic levels and foster a different, more scientific publishing culture.

Best regards,
Jens

-- Prof. Dr. Jens Dittrich, 2014-12-12

I know that experiments can be unreliable; that was the reason for my most recent blog post, listed above. I don't review papers and I rarely submit them, so I don't have context for the impact on the review process. Getting papers published isn't critical to my career, so my goal isn't the same as yours.

But one point on which we can agree (or not) is whether performance results are critical to some papers in VLDB & SIGMOD. There are many papers where the key contribution is better performance over existing work via incremental changes to existing solutions. In that case the ideas in the paper are different, but not better unless there are performance results. And there are a huge number of ways to do things differently, so I think we need the performance results to determine which of those papers get accepted into top-tier conferences.

On the other hand, there are systems papers that do things fundamentally differently, like all of the work that led up to Hekaton. In that case the performance results are a nice bonus.

-- Mark Callaghan, 2014-09-11

Of course I want to have experiments as a sanity check, but you should realize that experiments in papers are a bit unreliable.
Everybody shows only the good cases, and a few are even tempted to do more than that. Now, if the authors cannot show even one good case, that is clearly a problem, so taking a look at the experiments makes sense. But currently experiments are taken very seriously during the review process, and that is a mistake. First, the results are a bit unreliable anyway, and second, it puts pressure on authors to produce "good numbers". In the past that has led to some unpleasant events.

Reproducibility, where everybody can verify all experiments, would be a solution, but unfortunately that is not going to happen. Therefore we should accept that experiments are a nice additional indicator, but we should primarily review papers based upon the ideas within. Otherwise we give the wrong incentives to paper authors.

-- Thomas Neumann, 2014-09-11

I don't write papers, but I read a lot of them in the OLTP area. When the key contribution is better performance, I expect experimental results. MySQL is frequently used for the comparisons, and I doubt many of the results, both because they are much worse than what I can get from MySQL and because the authors are not MySQL experts. Benchmarking is hard and requires expertise in all of the systems under test, along with the OS, filesystems, and storage devices used for the test. I don't think it helps much to have the reviewers try to repeat the tests, because I don't think they will have much more expertise in the systems under test. It would help to have more details published on the test configuration so that outsiders can offer feedback (database configuration files, client source code, filesystems used, details on the storage devices, Linux kernel, etc.).
My advice is:
1) seek expert advice on the systems under test (MySQL, Oracle, etc.)
2) limit the number of systems that are compared (see #1)

And for MySQL there are people like me who are willing to consult: http://smalldatum.blogspot.com/2014/09/get-help-from-mba-mysql-benchmark.html

-- Mark Callaghan, 2014-09-10

This comment has been removed by the author.

-- Mark Callaghan, 2014-09-10
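The configuration details Mark suggests publishing can be collected mechanically. A minimal sketch of such a helper, as a hedged illustration (the script name and output file are hypothetical, the commands used are common but not universally available, and this is not taken from any of the papers discussed):

```shell
#!/bin/sh
# Hypothetical helper: capture the environment details suggested above
# (kernel, filesystems, storage devices, database server version) so
# they can be published alongside benchmark results.
out=benchmark-env.txt
{
  echo "== Kernel =="
  uname -a
  echo "== Mounted filesystems =="
  mount
  echo "== Storage devices =="
  lsblk 2>/dev/null || echo "(lsblk not available)"
  echo "== Database server version =="
  mysql --version 2>/dev/null || echo "(mysql client not found)"
} > "$out"
echo "wrote $out"
```

Attaching a file like this to a paper's supplementary material would let outsiders spot configuration problems without rerunning anything; database configuration files and client source code would still need to be shipped separately.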