
Why Comparing Single Performance Scores Does Not Allow to Draw Conclusions About Machine Learning Approaches

by Nils Reimers et al.

Developing state-of-the-art approaches for specific tasks is a major driving force in our research community. Depending on the prestige of the task, publishing a new state-of-the-art result can bring considerable visibility. This raises the question: how reliable are our evaluation methodologies for comparing approaches? A common methodology for identifying the state of the art is to partition the data into a train, a development, and a test set. Researchers train and tune their approach on parts of the dataset, select the model that performs best on the development set, and evaluate it once on the unseen test data. The test scores of different approaches are then compared, and performance differences are tested for statistical significance. In this publication, we show that there is a high risk that a statistically significant difference in this type of evaluation is not due to a superior learning approach but is instead due to chance. For example, for the CoNLL 2003 NER dataset we observed type I errors (false positives) in up to 26% of cases at a threshold of p < 0.05, i.e., we falsely concluded a statistically significant difference between two identical approaches. We prove that this evaluation setup is unsuitable for comparing learning approaches, and we formalize alternative evaluation setups based on score distributions.
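The risk described above can be illustrated with a minimal simulation (a sketch, not code from the paper): assume the test score of a single approach fluctuates with the random seed, here modeled hypothetically as F1 ~ N(0.90, 0.01). Two training runs of the *identical* approach then frequently differ by more than a full F1 point, while averages over many seeds (score distributions) differ far less.

```python
import random
import statistics

# Hypothetical assumption: the test score of one approach varies with the
# random seed (weight initialization, data shuffling) as N(0.90, 0.01).
def run_model(rng):
    return 0.90 + rng.gauss(0.0, 0.01)

rng = random.Random(42)
trials = 10_000
big_gaps = 0
for _ in range(trials):
    score_a = run_model(rng)  # "approach A", one training run
    score_b = run_model(rng)  # "approach B", identical by construction
    # Any gap here is pure chance; a 1-point F1 gap often looks "meaningful".
    if abs(score_a - score_b) > 0.01:
        big_gaps += 1

print(f"chance gaps > 1 F1 point: {big_gaps / trials:.1%}")

# Comparing score *distributions* instead: average over many seeds.
mean_a = statistics.mean(run_model(rng) for _ in range(50))
mean_b = statistics.mean(run_model(rng) for _ in range(50))
print(f"gap between 50-seed means: {abs(mean_a - mean_b):.4f}")
```

Under these assumptions roughly half of all single-run comparisons of identical approaches show a gap above one F1 point, whereas the gap between 50-seed means shrinks by an order of magnitude, which is the motivation for reporting score distributions rather than single scores.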




Using Score Distributions to Compare Statistical Significance Tests for Information Retrieval Evaluation

Statistical significance tests can provide evidence that the observed di...

Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging

In this paper we show that reporting a single performance score is insuf...

What do we Really Know about State of the Art NER?

Named Entity Recognition (NER) is a well researched NLP task and is wide...

Better than Average: Paired Evaluation of NLP Systems

Evaluation in NLP is usually done by comparing the scores of competing s...

Towards Inferential Reproducibility of Machine Learning Research

Reliability of machine learning evaluation – the consistency of observed...

NLPStatTest: A Toolkit for Comparing NLP System Performance

Statistical significance testing centered on p-values is commonly used t...

Learning Local Forward Models on Unforgiving Games

This paper examines learning approaches for forward models based on loca...