Towards More Robust NLP System Evaluation: Handling Missing Scores in Benchmarks

05/17/2023
by Anas Himmi, et al.

The evaluation of natural language processing (NLP) systems is crucial for advancing the field, but current benchmarking approaches often assume that all systems have scores available for all tasks, which is not always practical. In reality, factors such as the cost of running baselines, private systems, computational limitations, or incomplete data may prevent some systems from being evaluated on entire tasks. This paper formalizes an existing problem in NLP research, benchmarking when scores are missing for some systems on some tasks, and proposes a novel approach to address it. Our method imputes missing data using a compatible partial ranking approach and then aggregates the rankings with the Borda count method. It includes two refinements designed specifically for scenarios where either task-level or instance-level scores are available. We also introduce an extended benchmark containing over 131 million scores, an order of magnitude larger than existing benchmarks. We validate our methods and demonstrate their effectiveness in addressing the challenge of systems that lack evaluations on entire tasks. This work highlights the need for more comprehensive benchmarking approaches that can handle real-world scenarios in which not all systems are evaluated on every task.
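To make the aggregation step concrete, the following is a minimal NumPy sketch of Borda count aggregation over a score matrix with missing entries (marked as np.nan). It is an illustration under simplifying assumptions: the function name and the toy score matrix are invented here, and it sidesteps the paper's compatible partial ranking imputation by simply giving unevaluated systems no points on the tasks they miss.

    import numpy as np

    def borda_aggregate(scores):
        # scores: (n_systems, n_tasks) array; np.nan marks a missing score.
        # On each task, only evaluated systems receive Borda points
        # (0 for the worst, up to k-1 for the best among the k evaluated).
        # Missing entries contribute nothing; this is a simplified sketch,
        # not the paper's compatible partial ranking imputation.
        n_systems, n_tasks = scores.shape
        points = np.zeros(n_systems)
        for t in range(n_tasks):
            col = scores[:, t]
            evaluated = np.where(~np.isnan(col))[0]        # systems scored on task t
            order = evaluated[np.argsort(col[evaluated])]  # ascending: worst first
            for rank, sys_idx in enumerate(order):
                points[sys_idx] += rank
        return np.argsort(-points)  # system indices, best first

    # Toy example: 3 systems, 3 tasks; system 2 was never run on task 1.
    scores = np.array([[0.71, 0.65, 0.80],
                       [0.69, 0.60, 0.75],
                       [0.74, np.nan, 0.78]])
    print(borda_aggregate(scores))  # [0 2 1]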


