A Survey of Parameters Associated with the Quality of Benchmarks in NLP

10/14/2022
by Swaroop Mishra, et al.

Several benchmarks have been built with heavy investment of resources to track our progress in NLP. Thousands of papers published in response to those benchmarks have competed to top leaderboards, with models often surpassing human performance. However, recent studies have shown that models triumph over several popular benchmarks merely by overfitting on spurious biases, without truly learning the desired task. Despite this finding, benchmarking efforts that try to tackle bias still rely on workarounds that cover only limited sets of biases and, by discarding low-quality data, fail to fully utilize the resources invested in benchmark creation. A potential solution to these issues, a metric quantifying quality, remains underexplored. Inspired by successful quality indices in domains such as power, food, and water, we take a first step towards such a metric by identifying language properties that can represent the various possible interactions leading to biases in a benchmark. Specifically, we look for bias-related parameters that can help pave the way towards the metric. We survey existing work and identify parameters capturing various properties of bias: their origins, types, and impact on performance, generalization, and robustness. Our analysis spans datasets and a hierarchy of tasks ranging from NLI to summarization, ensuring that our parameters are generic and not overfitted to a specific task or dataset. In the process, we also develop several parameters of our own.
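
To make "bias-related parameter" concrete, below is a minimal sketch of one such parameter: the pointwise mutual information (PMI) between surface tokens and gold labels, a widely used proxy for lexical artifacts in the literature on annotation biases. This is our own illustrative assumption, not the survey's actual formulation, and the toy NLI-style dataset is hypothetical.

```python
# Minimal sketch (not the paper's DQI formulation) of one bias-related
# parameter: PMI between tokens and gold labels. Tokens with high PMI
# toward a single label are candidate spurious cues a model can overfit on.
import math
from collections import Counter

# Hypothetical toy NLI-style data: (hypothesis, label) pairs.
examples = [
    ("a man is sleeping", "contradiction"),
    ("nobody is outside", "contradiction"),
    ("a person is outdoors", "entailment"),
    ("someone is moving", "entailment"),
    ("a cat might be nearby", "neutral"),
]

token_counts = Counter()
label_counts = Counter()
joint_counts = Counter()
for text, label in examples:
    label_counts[label] += 1
    for tok in set(text.split()):  # count each token once per example
        token_counts[tok] += 1
        joint_counts[(tok, label)] += 1

n = len(examples)

def pmi(tok: str, label: str) -> float:
    """PMI(tok, label) = log( P(tok, label) / (P(tok) * P(label)) )."""
    p_joint = joint_counts[(tok, label)] / n
    p_tok = token_counts[tok] / n
    p_label = label_counts[label] / n
    return math.log(p_joint / (p_tok * p_label)) if p_joint > 0 else float("-inf")

# Rank token-label pairs by PMI; high values suggest lexical artifacts.
scores = sorted(
    ((tok, lab, pmi(tok, lab)) for tok, lab in joint_counts),
    key=lambda x: -x[2],
)
for tok, lab, score in scores[:5]:
    print(f"{tok!r:12} -> {lab:13} PMI={score:.2f}")
```

Even on this toy data, label-specific words such as "nobody" surface with high PMI toward contradiction while common words like "a" score near zero, mirroring the kind of annotation artifacts that a quality metric would need to flag.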

