It Takes Two to Tango: Navigating Conceptualizations of NLP Tasks and Measurements of Performance

05/15/2023
by Arjun Subramonian, et al.

Progress in NLP is increasingly measured through benchmarks; hence, contextualizing progress requires understanding when and why practitioners may disagree about the validity of benchmarks. We develop a taxonomy of disagreement, drawing on tools from measurement modeling, and distinguish between two types of disagreement: 1) how tasks are conceptualized and 2) how measurements of model performance are operationalized. To provide evidence for our taxonomy, we conduct a meta-analysis of relevant literature to understand how NLP tasks are conceptualized, as well as a survey of practitioners about their impressions of different factors that affect benchmark validity. Our meta-analysis and survey, which span eight tasks ranging from coreference resolution to question answering, reveal that tasks are generally not clearly and consistently conceptualized and that benchmarks suffer from operationalization disagreements. These findings support our proposed taxonomy of disagreement. Finally, based on our taxonomy, we present a framework for constructing benchmarks and documenting their limitations.

research
07/11/2023

Weisfeiler and Lehman Go Measurement Modeling: Probing the Validity of the WL Test

The expressive power of graph neural networks is usually measured by com...
research
07/27/2021

QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension

Alongside huge volumes of research on deep learning models in NLP in the...
research
11/23/2020

Studying Taxonomy Enrichment on Diachronic WordNet Versions

Ontologies, taxonomies, and thesauri are used in many NLP tasks. However...
research
01/21/2023

Rationalization for Explainable NLP: A Survey

Recent advances in deep learning have improved the performance of many N...
research
09/04/2023

Prompting or Fine-tuning? A Comparative Study of Large Language Models for Taxonomy Construction

Taxonomies represent hierarchical relations between entities, frequently...
research
02/15/2022

MuLD: The Multitask Long Document Benchmark

The impressive progress in NLP techniques has been driven by the develop...
research
11/02/2022

Passage-Mask: A Learnable Regularization Strategy for Retriever-Reader Models

Retriever-reader models achieve competitive performance across many diff...
