What Will it Take to Fix Benchmarking in Natural Language Understanding?

04/05/2021
by   Samuel R. Bowman, et al.
0

Evaluation for many natural language understanding (NLU) tasks is broken: Unreliable and biased systems score so highly on standard benchmarks that there is little room for researchers who develop better systems to demonstrate their improvements. The recent trend to abandon IID benchmarks in favor of adversarially-constructed, out-of-distribution test sets ensures that current models will perform poorly, but ultimately only obscures the abilities that we want our benchmarks to measure. In this position paper, we lay out four criteria that we argue NLU benchmarks should meet. We argue most current benchmarks fail at these criteria, and that adversarial data collection does not meaningfully address the causes of these failures. Instead, restoring a healthy evaluation ecosystem will require significant progress in the design of benchmark datasets, the reliability with which they are annotated, their size, and the ways they handle social bias.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
05/15/2023

What's the Meaning of Superhuman Performance in Today's NLU?

In the last five years, there has been a significant focus in Natural La...
research
05/11/2023

GeoGLUE: A GeoGraphic Language Understanding Evaluation Benchmark

With a fast developing pace of geographic applications, automatable and ...
research
06/15/2021

CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark

Artificial Intelligence (AI), along with the recent progress in biomedic...
research
04/06/2022

VALUE: Understanding Dialect Disparity in NLU

English Natural Language Understanding (NLU) systems have achieved great...
research
07/14/2020

Calling Out Bluff: Attacking the Robustness of Automatic Scoring Systems with Simple Adversarial Testing

A significant progress has been made in deep-learning based Automatic Es...
research
05/24/2023

On Degrees of Freedom in Defining and Testing Natural Language Understanding

Natural language understanding (NLU) studies often exaggerate or underes...
research
08/22/2023

Efficient Benchmarking (of Language Models)

The increasing versatility of language models LMs has given rise to a ne...

Please sign up or login with your details

Forgot password? Click here to reset