Counterfactually-Augmented SNLI Training Data Does Not Yield Better Generalization Than Unaugmented Data

10/09/2020 · by William Huang, et al.

A growing body of work shows that models exploit annotation artifacts to achieve state-of-the-art performance on standard crowdsourced benchmarks (datasets collected from crowdworkers to create an evaluation task) while still failing on out-of-domain examples of the same task. Recent work has explored the use of counterfactually-augmented data (data built by minimally editing a set of seed examples to yield counterfactual labels) to augment the training data associated with these benchmarks and build more robust classifiers that generalize better. However, Khashabi et al. (2020) find that this type of augmentation yields little benefit for reading comprehension tasks when controlling for dataset size and cost of collection. We build on this work, using English natural language inference data to test model generalization and robustness. We find that models trained on a counterfactually-augmented SNLI dataset do not generalize better than models trained on unaugmented datasets of similar size, and that counterfactual augmentation can hurt performance, yielding models that are less robust to challenge examples. Counterfactual augmentation of natural language understanding data through standard crowdsourcing techniques does not appear to be an effective way of collecting training data, and further innovation is required to make this line of work viable.
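
To make the setup concrete, here is a minimal Python sketch of the two training conditions being compared: a seed example plus its counterfactual edit versus a size-matched set of ordinary examples. The sentences, the NLIExample class, and the size_matched_sets helper are illustrative assumptions, not the paper's released data or code.

```python
# Minimal sketch of the comparison described above: counterfactual
# augmentation of an NLI example versus a size-matched unaugmented set.
# The sentences, class, and helper below are illustrative assumptions,
# not the paper's released data or code.

from dataclasses import dataclass


@dataclass
class NLIExample:
    premise: str
    hypothesis: str
    label: str  # "entailment", "neutral", or "contradiction"


# A seed example, as it might appear in the original SNLI training set.
seed = NLIExample(
    premise="A man in a blue shirt is playing a guitar on stage.",
    hypothesis="A musician is performing.",
    label="entailment",
)

# Its counterfactual edit: an annotator minimally rewrites the hypothesis
# so that the gold label flips while most of the surface form is kept.
counterfactual = NLIExample(
    premise=seed.premise,
    hypothesis="A musician is asleep backstage.",
    label="contradiction",
)


def size_matched_sets(seeds, edits, extra_unaugmented):
    """Build the two training conditions: seeds plus their counterfactual
    edits, versus ordinary examples padded to the same total size."""
    augmented = seeds + edits
    unaugmented = seeds + extra_unaugmented[: len(edits)]
    assert len(augmented) == len(unaugmented)
    return augmented, unaugmented
```

The size control in size_matched_sets is the crux of the comparison: every counterfactual edit adds a training example, so the fair baseline is an unaugmented set of the same total size, which is the control both Khashabi et al. (2020) and this paper apply.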


Related research

10/25/2022 · IDK-MRC: Unanswerable Questions for Indonesian Machine Reading Comprehension
09/12/2022 · Bias Challenges in Counterfactual Data Augmentation
06/29/2021 · Exploring the Efficacy of Automatically Generated Counterfactuals for Sentiment Analysis
10/14/2020 · Geometry matters: Exploring language examples at the decision boundary
12/21/2022 · Not Just Pretty Pictures: Text-to-Image Generators Enable Interpretable Interventions for Robust Representations
10/21/2019 · Diversify Your Datasets: Analyzing Generalization via Controlled Variance in Adversarial Datasets
09/26/2019 · Learning the Difference that Makes a Difference with Counterfactually-Augmented Data
