Over-optimism in benchmark studies and the multiplicity of design and analysis options when interpreting their results

by Christina Nießl, et al.

In recent years, the scientific community has increasingly recognised the need for neutral benchmark studies that compare methods from the computational sciences. Although recent literature offers general advice on the design and analysis of such studies, some flexibility always remains. This includes the choice of data sets and performance measures, the handling of missing performance values, and the way the performance values are aggregated over the data sets. As a consequence of this flexibility, researchers may be concerned about how their choices affect the results or, in the worst case, may be tempted to engage in questionable research practices (e.g. the selective reporting of results or the post-hoc modification of design or analysis components) to fit their expectations or hopes. To raise awareness of this issue, we use an example benchmark study to illustrate how variable benchmark results can be when all possible combinations of a range of design and analysis options are considered. We then demonstrate how the impact of each choice on the results can be assessed using multidimensional unfolding. In conclusion, based on previous literature and on our illustrative example, we claim that the multiplicity of design and analysis options, combined with questionable research practices, leads to biased interpretations of benchmark results and to over-optimistic conclusions. This issue should be considered by computational researchers when designing and analysing their benchmark studies, and by the scientific community in general, in an effort towards more reliable benchmark results.
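The idea of considering every combination of design and analysis options can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the study: the performance values, the method names `A`, `B` and `C`, the option names (`drop_dataset`, `impute_worst`, `mean`, `median`, `mean_rank`) and the helper functions are all hypothetical. The sketch covers the choice of performance measure, the handling of missing values and the aggregation over data sets, and does not attempt multidimensional unfolding.

```python
from collections import Counter
from itertools import product
from statistics import mean, median

# Hypothetical toy values (higher is better); None marks a missing performance value.
# PERF[measure][method] holds one value per data set.
PERF = {
    "accuracy": {
        "A": [0.90, 0.70, 0.82, None],
        "B": [0.85, 0.75, 0.78, 0.60],
        "C": [0.80, 0.72, 0.95, 0.55],
    },
    "auc": {
        "A": [0.88, 0.79, 0.81, None],
        "B": [0.90, 0.74, 0.80, 0.70],
        "C": [0.86, 0.80, 0.83, 0.65],
    },
}


def handle_missing(table, how):
    """Return a copy of the table without missing values."""
    n = len(next(iter(table.values())))
    if how == "drop_dataset":
        # Keep only data sets on which every method produced a value.
        keep = [j for j in range(n) if all(v[j] is not None for v in table.values())]
        return {m: [v[j] for j in keep] for m, v in table.items()}
    if how == "impute_worst":
        # Replace a missing value by the worst value observed on that data set.
        out = {m: list(v) for m, v in table.items()}
        for j in range(n):
            worst = min(v[j] for v in table.values() if v[j] is not None)
            for m in out:
                if out[m][j] is None:
                    out[m][j] = worst
        return out
    raise ValueError(f"unknown missing-value option: {how}")


def aggregate(table, how):
    """Collapse per-data-set values into one score per method (higher is better)."""
    if how == "mean":
        return {m: mean(v) for m, v in table.items()}
    if how == "median":
        return {m: median(v) for m, v in table.items()}
    if how == "mean_rank":
        n = len(next(iter(table.values())))
        ranks = {m: [] for m in table}
        for j in range(n):
            for m in table:
                # Rank 1 is best; tied methods share the better rank.
                ranks[m].append(1 + sum(table[o][j] > table[m][j] for o in table))
        # Negate so that a smaller mean rank gives a larger score.
        return {m: -mean(r) for m, r in ranks.items()}
    raise ValueError(f"unknown aggregation option: {how}")


def ranking(measure, missing, aggregation):
    """Order the methods from best to worst for one combination of options."""
    scores = aggregate(handle_missing(PERF[measure], missing), aggregation)
    return tuple(sorted(scores, key=scores.get, reverse=True))


OPTIONS = {
    "measure": ["accuracy", "auc"],
    "missing": ["drop_dataset", "impute_worst"],
    "aggregation": ["mean", "median", "mean_rank"],
}

# One ranking per combination of options (2 x 2 x 3 = 12 combinations).
results = {combo: ranking(*combo) for combo in product(*OPTIONS.values())}
winners = Counter(r[0] for r in results.values())

for combo, order in results.items():
    print(combo, "->", " > ".join(order))
print("Times ranked first:", dict(winners))
```

Even on this toy table, the top-ranked method is not the same for every combination. For example, with the accuracy values and incomplete data sets dropped, aggregating by mean and by median put different methods first. Reporting only one such combination would hide that variability.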



