
Tracing and Removing Data Errors in Natural Language Generation Datasets

by Faisal Ladhak, et al.

Recent work has identified noisy and misannotated data as a core cause of hallucinations and unfaithful outputs in Natural Language Generation (NLG) tasks. Consequently, identifying and removing these examples is a key open challenge in creating reliable NLG systems. In this work, we introduce a framework to identify and remove low-quality training instances that lead to undesirable outputs, such as faithfulness errors in text summarization. We show that existing approaches for error tracing, such as gradient-based influence measures, do not perform reliably for detecting faithfulness errors in summarization. We overcome the drawbacks of existing error tracing methods through a new, contrast-based estimate that compares undesired generations to human-corrected outputs. Our proposed method can achieve a mean average precision of 0.91 across synthetic tasks with known ground truth and a two-fold reduction in hallucinations on a real entity hallucination evaluation on the NYT dataset.
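For readers unfamiliar with the metric reported above, the following is a minimal sketch of how mean average precision could be computed for an error tracing method on synthetic tasks with known ground truth. It assumes each task yields a ranking of training example IDs (most suspicious first) and a known set of corrupted example IDs. The function names and the toy data are hypothetical illustrations, not the authors' evaluation code.

```python
from typing import Sequence, Set


def average_precision(ranked_ids: Sequence[int], error_ids: Set[int]) -> float:
    """Average precision of one ranking against the known corrupted examples.

    ranked_ids: training example IDs, most suspicious first (assumed input).
    error_ids:  IDs of examples known to be corrupted (assumed ground truth).
    """
    if not error_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, example_id in enumerate(ranked_ids, start=1):
        if example_id in error_ids:
            hits += 1
            # Precision at this rank, counted only at positions of true errors.
            precision_sum += hits / rank
    return precision_sum / len(error_ids)


def mean_average_precision(
    rankings: Sequence[Sequence[int]], error_sets: Sequence[Set[int]]
) -> float:
    """Mean of per-task average precision over several synthetic tasks."""
    scores = [average_precision(r, e) for r, e in zip(rankings, error_sets)]
    return sum(scores) / len(scores)


# Toy data (hypothetical): two tasks with four and two training examples.
toy_rankings = [[3, 1, 2, 0], [0, 1]]
toy_errors = [{3, 2}, {0}]
toy_map = mean_average_precision(toy_rankings, toy_errors)
```

A score of 1.0 means every known corrupted example is ranked above every clean one; lower values mean clean examples are interleaved with corrupted ones near the top of the ranking.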



