Real-Time Visual Feedback to Guide Benchmark Creation: A Human-and-Metric-in-the-Loop Workflow

02/09/2023
by Anjana Arunkumar, et al.

Recent research has shown that language models exploit 'artifacts' in benchmarks to solve tasks, rather than truly learning them, leading to inflated model performance. In pursuit of creating better benchmarks, we propose VAIDA, a novel benchmark creation paradigm for NLP that focuses on guiding crowdworkers, an under-explored facet of addressing benchmark idiosyncrasies. VAIDA facilitates sample correction by providing real-time visual feedback and recommendations to improve sample quality. Our approach is domain-, model-, task-, and metric-agnostic, and constitutes a paradigm shift for robust, validated, and dynamic benchmark creation via human-and-metric-in-the-loop workflows. We evaluate via expert review and a user study with NASA TLX. We find that VAIDA decreases the effort, frustration, mental demands, and temporal demands of crowdworkers and analysts, simultaneously increasing the performance of both user groups with a 45.8% [...] our user study, we observe that created samples are adversarial across models, leading to decreases of 31.3% [...] in performance.
