Why Comparing Single Performance Scores Does Not Allow to Draw Conclusions About Machine Learning Approaches

03/26/2018
by Nils Reimers, et al.

Developing state-of-the-art approaches for specific tasks is a major driving force in our research community. Depending on the prestige of the task, publishing a new state of the art can bring considerable visibility. This raises the question: how reliable are our evaluation methodologies for comparing approaches? One common methodology for identifying the state of the art is to partition the data into a train, a development, and a test set. Researchers train and tune their approach on part of the dataset, select the model that performs best on the development set, and evaluate it once on the unseen test data. Test scores from different approaches are then compared, and performance differences are tested for statistical significance. In this publication, we show that there is a high risk that a statistically significant difference in this type of evaluation is not due to a superior learning approach but rather to chance. For example, on the CoNLL 2003 NER dataset we observed type I errors (false positives) in up to 26% of cases at a threshold of p < 0.05, i.e., a statistically significant difference was falsely concluded between two identical approaches. We prove that this evaluation setup is unsuitable for comparing learning approaches, and we formalize alternative evaluation setups based on score distributions.
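To make the failure mode concrete, below is a minimal simulation sketch (not taken from the paper; the function names and numeric settings are illustrative assumptions). Two copies of the same non-deterministic approach differ only in their random seed, yet a significance test applied to single test scores accounts only for test-set sampling variance, not for run-to-run variance, and therefore repeatedly reports a "significant" difference between identical systems.

```python
# Illustrative sketch (assumed setup, not the paper's experiment):
# per-example correctness is Bernoulli with the same true accuracy for both
# systems; the random seed shifts each run's accuracy by a small amount.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def run_once(true_acc=0.90, seed_noise=0.005, n_test=5000):
    """Simulate one training run: the seed perturbs the model's accuracy."""
    acc = true_acc + rng.normal(0.0, seed_noise)
    return rng.random(n_test) < acc            # per-example correctness

def two_proportion_z_test(a, b):
    """Two-sided two-proportion z-test on the two systems' test accuracies."""
    p1, p2, n = a.mean(), b.mean(), len(a)
    p = (a.sum() + b.sum()) / (2 * n)
    se = np.sqrt(p * (1 - p) * 2 / n)
    z = (p1 - p2) / se
    return 2 * norm.sf(abs(z))                 # two-sided p-value

false_positives = 0
trials = 1000
for _ in range(trials):
    sys_a = run_once()                         # same approach, seed A
    sys_b = run_once()                         # same approach, seed B
    if two_proportion_z_test(sys_a, sys_b) < 0.05:
        false_positives += 1

print(f"Type I error rate: {false_positives / trials:.1%}")
```

With these assumed settings, the loop flags a "significant" difference between the two identical systems well above the nominal 5% level, because the test's standard error ignores the seed-induced variance between runs. Comparing score distributions over multiple runs, as the paper argues, removes this blind spot.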


