Learning to Model and Ignore Dataset Bias with Mixed Capacity Ensembles

11/07/2020
by   Christopher Clark, et al.
0

Many datasets have been shown to contain incidental correlations created by idiosyncrasies in the data collection process. For example, sentence entailment datasets can have spurious word-class correlations if nearly all contradiction sentences contain the word "not", and image recognition datasets can have tell-tale object-background correlations if dogs are always indoors. In this paper, we propose a method that can automatically detect and ignore these kinds of dataset-specific patterns, which we call dataset biases. Our method trains a lower capacity model in an ensemble with a higher capacity model. During training, the lower capacity model learns to capture relatively shallow correlations, which we hypothesize are likely to reflect dataset bias. This frees the higher capacity model to focus on patterns that should generalize better. We ensure the models learn non-overlapping approaches by introducing a novel method to make them conditionally independent. Importantly, our approach does not require the bias to be known in advance. We evaluate performance on synthetic datasets, and four datasets built to penalize models that exploit known biases on textual entailment, visual question answering, and image recognition tasks. We show improvement in all settings, including a 10 point gain on the visual question answering dataset.

READ FULL TEXT
09/09/2019

Don't Take the Easy Way Out: Ensemble Based Methods for Avoiding Known Dataset Biases

State-of-the-art models often make use of superficial patterns in the da...
12/02/2020

Learning from others' mistakes: Avoiding dataset biases without modeling them

State-of-the-art natural language processing (NLP) models often learn to...
06/23/2019

Investigating Biases in Textual Entailment Datasets

The ability to understand logical relationships between sentences is an ...
12/20/2021

General Greedy De-bias Learning

Neural networks often make predictions relying on the spurious correlati...
05/25/2022

Does Your Model Classify Entities Reasonably? Diagnosing and Mitigating Spurious Correlations in Entity Typing

The entity typing task aims at predicting one or more words or phrases t...
10/07/2020

Improving QA Generalization by Concurrent Modeling of Multiple Biases

Existing NLP datasets contain various biases that models can easily expl...
04/16/2020

There is Strength in Numbers: Avoiding the Hypothesis-Only Bias in Natural Language Inference via Ensemble Adversarial Training

Natural Language Inference (NLI) datasets contain annotation artefacts r...