Finding Dataset Shortcuts with Grammar Induction

10/20/2022
by Dan Friedman, et al.

Many NLP datasets have been found to contain shortcuts: simple decision rules that achieve surprisingly high accuracy. However, it is difficult to discover shortcuts automatically. Prior work on automatic shortcut detection has focused on enumerating features like unigrams or bigrams, which can find only low-level shortcuts, or relied on post-hoc model interpretability methods like saliency maps, which reveal qualitative patterns without a clear statistical interpretation. In this work, we propose to use probabilistic grammars to characterize and discover shortcuts in NLP datasets. Specifically, we use a context-free grammar to model patterns in sentence classification datasets and use a synchronous context-free grammar to model datasets involving sentence pairs. The resulting grammars reveal interesting shortcut features in a number of datasets, including both simple and high-level features, and automatically identify groups of test examples on which conventional classifiers fail. Finally, we show that the features we discover can be used to generate diagnostic contrast examples and incorporated into standard robust optimization methods to improve worst-group accuracy.
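
To make the notion of a shortcut concrete, the following is a minimal sketch of the low-level baseline the abstract attributes to prior work (enumerating unigram features), not of the grammar-based method proposed here. It ranks each unigram by how often its presence coincides with a single label. The function name `unigram_shortcuts`, the `min_count` threshold, and the toy sentences are all assumptions made for illustration and do not come from the paper.

```python
from collections import Counter, defaultdict


def unigram_shortcuts(examples, min_count=2):
    """Rank unigrams by how strongly their presence predicts one label.

    examples: iterable of (text, label) pairs.
    Returns a list of (token, majority_label, precision, count) tuples,
    sorted by precision (descending), then count (descending), then token.
    """
    label_counts = defaultdict(Counter)
    for text, label in examples:
        # Count each token once per example (presence, not frequency).
        for token in set(text.lower().split()):
            label_counts[token][label] += 1

    ranked = []
    for token, counts in label_counts.items():
        total = sum(counts.values())
        if total < min_count:
            continue  # too rare to say anything meaningful
        label, hits = counts.most_common(1)[0]
        ranked.append((token, label, hits / total, total))

    ranked.sort(key=lambda r: (-r[2], -r[3], r[0]))
    return ranked


# Hypothetical toy data, invented for this sketch.
toy_data = [
    ("a man is not sleeping", "contradiction"),
    ("nobody is outside", "contradiction"),
    ("the dog is not running", "contradiction"),
    ("a man is outdoors", "entailment"),
    ("a person is outside", "entailment"),
    ("the dog is running", "neutral"),
]

ranked = unigram_shortcuts(toy_data)
for token, label, precision, count in ranked[:3]:
    print(f"{token!r} -> {label} (precision={precision:.2f}, n={count})")
```

On this toy data the word "not" always co-occurs with the "contradiction" label, which is the kind of simple decision rule the abstract calls a shortcut. A feature enumeration like this one can only surface token-level patterns, which is the limitation the abstract cites as motivation for using grammars instead.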
