Better than Average: Paired Evaluation of NLP Systems

10/20/2021
by Maxime Peyrard, et al.

Evaluation in NLP is usually done by comparing the scores of competing systems averaged independently over a common set of test instances. In this work, we question the use of averages for aggregating evaluation scores into a final number used to decide which system is best, since the average, as well as alternatives such as the median, ignores the pairing arising from the fact that systems are evaluated on the same test instances. We illustrate the importance of taking the instance-level pairing of evaluation scores into account and demonstrate, both theoretically and empirically, the advantages of aggregation methods based on pairwise comparisons, such as the Bradley-Terry (BT) model, a mechanism based on the estimated probability that a given system scores better than another on the test set. By re-evaluating 296 real NLP evaluation setups across four tasks and 18 evaluation metrics, we show that the choice of aggregation mechanism matters and yields different conclusions as to which systems are state of the art in about 30% of the setups. To facilitate the adoption of pairwise evaluation, we release a practical tool for performing the full analysis of evaluation scores with the mean, median, BT, and two variants of BT (Elo and TrueSkill), alongside functionality for appropriate statistical testing.
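The idea of pairwise aggregation can be sketched in a few lines. The following is a minimal, illustrative implementation of Bradley-Terry strength estimation from paired instance-level scores (using Hunter's MM updates), not the paper's released tool; all names and the toy data are assumptions for illustration. It also shows a case where the mean and BT disagree because one outlier instance inflates a system's average.

```python
# Illustrative sketch: aggregating paired per-instance scores with the
# Bradley-Terry (BT) model instead of the mean. Not the paper's tool.
import itertools

def bradley_terry(scores, iters=200):
    """scores: dict system -> list of per-instance scores (paired by index).
    Returns dict system -> BT strength (normalized to sum to 1)."""
    systems = list(scores)
    n = len(next(iter(scores.values())))
    # wins[a][b] = number of shared test instances where a outscores b
    wins = {a: {b: 0 for b in systems} for a in systems}
    for a, b in itertools.permutations(systems, 2):
        wins[a][b] = sum(scores[a][i] > scores[b][i] for i in range(n))
    p = {s: 1.0 for s in systems}
    for _ in range(iters):  # MM updates for the BT likelihood
        new = {}
        for a in systems:
            total_wins = sum(wins[a].values())
            denom = sum((wins[a][b] + wins[b][a]) / (p[a] + p[b])
                        for b in systems if b != a)
            new[a] = total_wins / denom if denom else p[a]
        z = sum(new.values())
        p = {s: v / z for s, v in new.items()}
    return p

# Toy data: sys_b wins 3 of 4 instances, but one outlier instance
# gives sys_a the higher mean score.
scores = {
    "sys_a": [0.95, 0.05, 0.05, 0.05],
    "sys_b": [0.30, 0.25, 0.25, 0.25],
}
strengths = bradley_terry(scores)
means = {s: sum(v) / len(v) for s, v in scores.items()}
assert strengths["sys_b"] > strengths["sys_a"]  # BT prefers sys_b
assert means["sys_a"] > means["sys_b"]          # the mean prefers sys_a
```

Because BT counts per-instance wins rather than summing magnitudes, a single large outlier cannot flip the ranking, which is one intuition behind the paper's argument for pairwise aggregation.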


Related research

- What are the best systems? New perspectives on NLP Benchmarking (02/08/2022)
- Building an Evaluation Scale using Item Response Theory (05/28/2016)
- Towards More Robust NLP System Evaluation: Handling Missing Scores in Benchmarks (05/17/2023)
- With Little Power Comes Great Responsibility (10/13/2020)
- Why Comparing Single Performance Scores Does Not Allow to Draw Conclusions About Machine Learning Approaches (03/26/2018)
- Negation-Instance Based Evaluation of End-to-End Negation Resolution (09/21/2021)
- Vote'n'Rank: Revision of Benchmarking with Social Choice Theory (10/11/2022)
