Adversarial Filters of Dataset Biases

by   Ronan Le Bras, et al.

Large neural models have demonstrated human-level performance on language and vision benchmarks such as ImageNet and Stanford Natural Language Inference (SNLI). Yet, their performance degrades considerably when tested on adversarial or out-of-distribution samples. This raises the question of whether these models have learned to solve a dataset rather than the underlying task by overfitting on spurious dataset biases. We investigate one recently proposed approach, AFLite, which adversarially filters such dataset biases, as a means to mitigate the prevalent overestimation of machine performance. We provide a theoretical understanding for AFLite, by situating it in the generalized framework for optimum bias reduction. Our experiments show that as a result of the substantial reduction of these biases, models trained on the filtered datasets yield better generalization to out-of-distribution tasks, especially when the benchmarks used for training are over-populated with biased samples. We show that AFLite is broadly applicable to a variety of both real and synthetic datasets for reduction of measurable dataset biases and provide extensive supporting analyses. Finally, filtering results in a large drop in model performance (e.g., from 92% to 62% for SNLI), while human performance still remains high. Our work thus shows that such filtered datasets can pose new research challenges for robust generalization by serving as upgraded benchmarks.


1 Introduction

Figure 1: Example images of the Monarch Butterfly and Chickadee from ImageNet. On the right are images in each category which were removed by AFLite, and on the left, the ones which were retained. The heatmap shows pairwise cosine similarity between EfficientNet-B7 features (Tan and Le, 2019). The retained images (left) show significantly greater diversity – such as the cocoon of a butterfly, or the non-canonical chickadee poses – also reflected by the cosine similarity values. This diversity suggests that the AFLite-filtered examples present a more accurate benchmark for the task of image classification, as opposed to fitting to particular dataset biases.
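The diversity measure referenced in the caption is plain pairwise cosine similarity between feature vectors; a minimal version (the vectors here are made up, not actual EfficientNet-B7 features):

```python
import math

def cosine(u, v):
    # cos(u, v) = <u, v> / (||u|| * ||v||)
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

print(cosine([1.0, 2.0], [2.0, 4.0]))  # parallel features have similarity ~1
```

Low pairwise similarity among retained images corresponds to the greater visual diversity described above.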

Large-scale neural networks have achieved superhuman performance across many popular AI benchmarks, for tasks as diverse as image recognition (ImageNet;

Russakovsky et al., 2015), natural language inference (SNLI; Bowman et al., 2015), and question answering (SQuAD; Rajpurkar et al., 2016). However, the performance of such neural models degrades considerably when tested on out-of-distribution or adversarial samples, otherwise known as data “in the wild” (Eykholt et al., 2018; Jia and Liang, 2017). This phenomenon indicates that high performance of the strongest AI models is often confined to specific datasets, implicitly making a closed-world assumption. In contrast, true learning of a task necessitates generalization, or an open-world assumption. A major impediment to generalization is the presence of spurious biases – unintended correlations between input and output – in existing datasets (Torralba and Efros, 2011). Such biases or artifacts (we will henceforth use the terms biases and artifacts interchangeably) are often introduced during data collection (Fouhey et al., 2018) or during human annotation (Gururangan et al., 2018; Poliak et al., 2018; Tsuchiya, 2018; Geva et al., 2019). Not only do dataset biases inevitably bias the models trained on them, but they have also been shown to significantly inflate model performance, leading to an overestimation of the true capabilities of current AI systems (Sakaguchi et al., 2020; Hendrycks et al., 2019).

Many recent studies have investigated task or dataset specific biases, including language bias in Visual Question Answering (Goyal et al., 2017), texture bias in ImageNet (Geirhos et al., 2018), and hypothesis-only reliance in Natural Language Inference (Gururangan et al., 2018). These studies have yielded similarly domain-specific algorithms to address the found biases. However, the vast majority of these studies follow a top-down framework where the bias reduction algorithms are essentially guided by researchers’ intuitions and domain insights on particular types of spurious biases. While promising, such approaches are fundamentally limited by what the algorithm designers can manually recognize and enumerate as unwanted biases.

Our work investigates AFLite, an alternative bottom-up approach to algorithmic bias reduction. AFLite (short for Lightweight Adversarial Filtering) was recently proposed by Sakaguchi et al. (2020)—albeit very succinctly—to systematically discover and filter any dataset artifact in crowdsourced commonsense problems. AFLite employs a model-based approach with the goal of removing spurious artifacts in data beyond what humans can intuitively recognize, but those which are exploited by powerful models. Figure 1 illustrates how AFLite reduces dataset biases in the ImageNet dataset for object classification.

This paper presents the first theoretical understanding and comprehensive empirical investigations into AFLite. More concretely, we make the following four novel contributions.

First, we situate AFLite in a theoretical framework for optimal bias reduction, and demonstrate that AFLite provides a practical approximation of AFOpt, the ideal but computationally intractable bias reduction method under this framework (§2).

Second, we present an extensive suite of experiments that were lacking in the work of Sakaguchi et al. (2020), to validate whether AFLite truly removes spurious biases in data as originally assumed. Our baselines and thorough analyses use both synthetic (thus easier to control) datasets (§3) as well as real datasets. The latter span benchmarks across NLP (§4) and vision (§5) tasks: the SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018) datasets for natural language inference, QNLI (Wang et al., 2018a) for question answering, and the ImageNet dataset (Russakovsky et al., 2015) for object recognition.

Third, we demonstrate that models trained on AFLite-filtered data generalize substantially better to out-of-domain samples, compared to models that are trained on the original biased datasets (§4, §5). These findings indicate that spurious biases in datasets make benchmarks artificially easier, as models learn to overly rely on these biases instead of learning more transferable features, thereby hurting out-of-domain generalization.

Finally, we show that AFLite-filtering makes widely used AI benchmarks considerably more challenging. We consistently observe a significant drop in the in-domain performance even for state-of-the-art models on all benchmarks, even though human performance still remains high; this suggests currently reported performance on benchmarks might be inflated. For instance, the best model on SNLI-AFLite achieves only 63% accuracy, a 30% drop compared to its accuracy on the original SNLI. These findings are especially surprising since AFLite maintains an identical train-test distribution, while also retaining a sizable training set.

In summary, AFLite-filtered datasets can serve as upgraded benchmarks, posing new research challenges for robust generalization.

2 AFLite

Large datasets run the risk of prioritizing performance on the data-rich head of the distribution, where examples are plentiful, and discounting the tail. AFLite seeks to minimize the ability of a model to exploit biases in the head of the distribution, while preserving the inherent complexity of the tail. In this section, we provide a formal framework for studying such bias reduction techniques, revealing that AFLite can be viewed as a practical approximation of a desirable but computationally intractable optimum bias reduction objective.


Let Φ be any feature representation defined over a dataset D. AFLite seeks a subset S ⊆ D that is maximally resilient to the features uncovered by Φ, that is, for any identically-distributed train-test split of S, features Φ should not help models generalize to the held-out set.


Let M denote a family of classification models (e.g., logistic regression, SVM, or a particular neural architecture) that can be trained on subsets S of D using features Φ. We define the representation bias of Φ in S w.r.t. M, denoted R(Φ, S, M), as the best possible out-of-sample classification accuracy achievable by models in M when predicting labels using features Φ. Given a target minimum reduced dataset size n, the goal is to find a subset S ⊆ D of size at least n that minimizes this representation bias in S w.r.t. M:

    argmin_{S ⊆ D, |S| ≥ n} R(Φ, S, M)    (1)


Eq. (1) corresponds to optimum bias reduction, referred to as AFOpt. We formulate R(Φ, S, M) as the expected classification accuracy resulting from the following process. Let P be a probability distribution over subsets T of S. The process is to randomly choose T with probability P(T), train a classifier h_T ∈ M on T, and evaluate its classification accuracy f(h_T, S∖T) on the held-out set S∖T. The resulting accuracy on S∖T itself is a random variable, since the training set T is randomly sampled. We define the expected value of this classification accuracy to be the representation bias:

    R(Φ, S, M) = E_{T∼P} [ f(h_T, S∖T) ]    (2)


The expectation in Eq. (2), however, involves a summation over exponentially many choices of T even to compute the representation bias for a single S. This makes optimizing Eq. (1), which involves a search over subsets S, highly intractable. To circumvent this challenge, we refactor R(Φ, S, M) as a sum over instances i ∈ S of the aggregate contribution of i to the representation bias across all T. Importantly, this summation has only |S| terms, allowing more efficient computation. We call this the predictability score p(i) for i: on average, how reliably can the label of i be predicted using features Φ when a model from M is trained on a randomly chosen training set not containing i. Instances with high predictability scores are undesirable, as their feature representation can be exploited to confidently and correctly predict such instances.

With some abuse of notation, for i ∈ S, let P(i) denote the marginal probability of choosing a subset T whose held-out set S∖T contains i. The ratio P(T)/P(i) is then the probability of T conditioned on S∖T containing i. Let f(h_T, i) be the classification accuracy of h_T on the single instance i. Then the expectation in Eq. (2) can be written in terms of p(i) as follows:

    R(Φ, S, M) = Σ_{i∈S} P(i) · p(i)    (3)

where p(i) is the predictability score of i, defined as:

    p(i) = Σ_{T : i ∈ S∖T} (P(T) / P(i)) · f(h_T, i) / |S∖T|    (4)


While this refactoring works for any probability distribution P with non-zero support on all instances, for simplicity of exposition, we assume P to be the uniform distribution over all subsets T of a fixed size m. This makes both P(i) and |S∖T| fixed constants; in particular, P(i) = (|S| − m)/|S| and |S∖T| = |S| − m. This yields a simplified predictability score p̄(i) and a factored reformulation of the representation bias from Eq. (2):

    p̄(i) = E_{T∼P : i∉T} [ f(h_T, i) ],    R(Φ, S, M) = (1/|S|) Σ_{i∈S} p̄(i)    (5)


While this refactoring reduces the exponential summation underlying the expectation in Eq. (2) to a linear sum, solving Eq. (1) for optimum bias reduction (AFOpt) remains challenging due to the exponentially many choices of S ⊆ D. However, the refactoring does enable computationally efficient heuristic approximations that start with S = D and iteratively filter out from S the most predictable instances, as identified by the (simplified) predictability scores p̄(i) computed over the current candidate for S. In all cases, we use a fixed training set size m. Further, since a larger filtered set is generally desirable, we terminate the filtering process early (i.e., while |S| > n) if the predictability score for every i ∈ S falls below a pre-specified early stopping threshold τ.

We consider three such heuristic approaches. (A) A simple greedy approach starts with the full set S = D, identifies an instance i ∈ S that maximizes p̄(i), removes it from S, and repeats up to |D| − n times. (B) A greedy slicing approach identifies the k instances with the highest predictability scores, removes all of them from S, and repeats the process up to (|D| − n)/k times. (C) A slice sampling approach, instead of greedily choosing the top k instances, randomly samples k instances with probabilities proportional to their predictability scores. The Gumbel method provides an efficient way to perform such sampling (Gumbel and Lieblein, 1954; Maddison et al., 2014; Kim et al., 2016; Balog et al., 2017; Kool et al., 2019), by independently perturbing the log of each predictability score with a Gumbel random variable and identifying the k instances with the highest perturbed scores (cf. Appendix A.1).
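Strategy (C) can be sketched with the Gumbel trick: perturbing each log-score with i.i.d. Gumbel(0, 1) noise and keeping the top k is equivalent to sampling k instances without replacement with probability proportional to their scores. The scores below are made up:

```python
import math
import random

def gumbel_top_k(scores, k, rng=None):
    """Sample k indices without replacement, proportional to scores."""
    rng = rng or random.Random(0)
    keyed = []
    for i, s in enumerate(scores):
        gumbel = -math.log(-math.log(rng.random()))  # Gumbel(0, 1) draw
        keyed.append((math.log(s) + gumbel, i))
    return [i for _, i in sorted(keyed, reverse=True)[:k]]

scores = [0.95, 0.91, 0.40, 0.88, 0.15, 0.99]  # predictability scores
print(gumbel_top_k(scores, 3))  # indices of 3 sampled instances
```

High-scoring instances are sampled most often, but low-scoring ones retain a non-zero chance of being filtered, unlike in greedy slicing.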

All three strategies can be improved further by considering not only the predictability scores of the top-k instances but also (via retraining without these instances) how their removal would influence the predictability scores of other instances in the next step. We found our computationally lighter approaches to work well even without the additional overhead of such look-ahead. AFLite implements the greedy slicing approach, and can thus be viewed as a scalable and practical approximation of the (intractable) AFOpt for optimum bias reduction.

Input: dataset D = (X, Y), pre-computed representation Φ(X), model family M, target dataset size n, number of random partitions t, training set size m, slice size k, early-stopping threshold τ
Output: reduced dataset S

S = D
while |S| > n do
    // Filtering phase
    forall i ∈ S do
        Initialize the multiset of out-of-sample predictions E(i) = ∅
    for iteration j : 1 … t do
        Randomly partition S into (T_j, S∖T_j) s.t. |T_j| = m
        Train a classifier L on {(Φ(x), y) | (x, y) ∈ T_j} (L is typically a linear classifier)
        forall i = (x, y) ∈ S∖T_j do
            Add the prediction L(Φ(x)) to E(i)
    forall i = (x, y) ∈ S do
        Compute the predictability score p̄(i) = |{ŷ ∈ E(i) | ŷ = y}| / |E(i)|
    Select up to k instances S′ ⊆ S with the highest predictability scores subject to p̄(i) ≥ τ
    S = S ∖ S′
    if |S′| < k then
        break
return S

Algorithm 1: AFLite

Implementation. Algorithm 1 provides an implementation of AFLite. The algorithm takes as input a dataset D, a representation Φ we are interested in minimizing the bias in, a model family M (e.g., linear classifiers), a target dataset size n, the number of random partitions t (which determines how many samples approximate the expectation in Eq. (4)), the training set size m for the classifiers, the slice size k, and an early-stopping filtering threshold τ. Importantly, for efficiency, Φ is provided to AFLite in the form of pre-computed embeddings for all instances of D. In practice, to obtain Φ, we train a first “warm-up” model on a small fraction of the data, chosen based on the learning curve in the low-data regime, and do not reuse this data for the rest of our experiments. Moreover, this fraction corresponds to the training size m for AFLite and remains unchanged across iterations. We follow the iterative filtering approach, starting with S = D and iteratively removing the instances with the highest predictability scores using the greedy slicing strategy. Slice size k and number of partitions t are determined by the available computation budget.

At each filtering phase, we train t models (linear classifiers) on t different random partitions of the data, and collect their predictions on the corresponding test sets. For each instance i, we compute its predictability score as the ratio of the number of times its label is predicted correctly to the total number of predictions for it. We rank the instances according to their predictability scores and use the greedy slicing strategy of removing the top-k instances whose scores are not less than the early-stopping threshold τ. We repeat this process until fewer than k instances pass the threshold in a filtering phase or fewer than n instances remain in the dataset.
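The greedy-slicing loop of Algorithm 1 can be sketched in pure Python. This is a toy illustration under stated assumptions: instances are (feature, label) pairs, the linear model family is replaced by a simple 1-nearest-neighbour rule, and the planted-bias dataset, hyperparameters (t, m, k, τ), and seed are all made up for the example:

```python
import random

def predict(train, x):
    # Stand-in "model family": 1-nearest-neighbour on a single feature
    return min(train, key=lambda p: abs(p[0] - x))[1]

def aflite(D, n, t=50, m=5, k=2, tau=0.75, seed=0):
    """Greedy slicing: repeatedly drop the k most predictable instances,
    with scores estimated from t random train/held-out partitions."""
    rng = random.Random(seed)
    S = list(D)
    while len(S) > n:
        stats = {i: [0, 0] for i in range(len(S))}   # [correct, total]
        for _ in range(t):
            idx = list(range(len(S)))
            rng.shuffle(idx)
            train = [S[i] for i in idx[:m]]
            for i in idx[m:]:                        # out-of-sample only
                stats[i][0] += predict(train, S[i][0]) == S[i][1]
                stats[i][1] += 1
        scores = sorted(((c / tot, i) for i, (c, tot) in stats.items() if tot),
                        reverse=True)
        remove = {i for s, i in scores[:k] if s >= tau}
        S = [x for j, x in enumerate(S) if j not in remove]
        if len(remove) < k:                          # early stopping
            break
    return S

# Planted bias: the feature alone nearly determines the label, except for two
# "hard" instances near the boundary, which AFLite should keep.
D = [(x, 0) for x in (0.05, 0.10, 0.15, 0.20)] + \
    [(x, 1) for x in (0.80, 0.85, 0.90, 0.95)] + \
    [(0.48, 1), (0.52, 0)]
filtered = aflite(D, n=4)
print(filtered)  # the two hard boundary instances survive the filtering
```

On this toy data, the cleanly separable instances receive predictability scores near 1.0 and are filtered out, while the ambiguous boundary instances score far below τ and are retained.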

3 Synthetic Data Experiments

We present experiments to evaluate whether AFLite successfully removes examples with spurious correlations in a synthetic setting. Our dataset consists of two-dimensional data, arranged in concentric circles, at two different levels of separation, as shown in Figure 2

. As is evident, a linear function is inadequate for separating the two classes; it requires a more complex non-linear model such as a support vector machine (SVM) with a radial basis function (RBF) kernel.

To simulate spurious correlations in the data, we add class-specific artificially constructed features (biases) sampled from two different Gaussian distributions. These features are added to only a fixed fraction of the data in each class, while for the rest of the data, we insert random (noise) features. The bias features make the task solvable through a linear function. Furthermore, for the first dataset, with the largest separation, we flipped the labels of some biased samples, making the data slightly adversarial even to the RBF kernel. Both models can clearly leverage the biases, and demonstrate improved performance over a baseline without biases (we use standard implementations from scikit-learn).

Once we apply AFLite, as expected, the number of biased samples is reduced considerably, making the task hard once again for the linear model, but still solvable for the non-linear one. The filtered dataset is shown in the bottom half of Fig. 2, and the captions indicate the performance of a linear and an SVM model (also see Appendix §A.2). For the first dataset, we see that AFLite removes most of those examples with flipped labels. These results show that AFLite indeed lowers the performance of models relying on biases by removing samples with spurious correlations from a dataset.
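A pure-Python sketch of such a biased dataset; the bias fraction, Gaussian parameters, and circle radii below are illustrative assumptions, not the paper's exact values:

```python
import math
import random

def make_biased_circles(n_per_class=200, bias_frac=0.75, seed=0):
    """Two concentric circles (classes 0 and 1) plus one appended feature:
    a class-specific Gaussian "bias" for the first bias_frac of each class,
    and pure Gaussian noise for the rest."""
    rng = random.Random(seed)
    data = []
    for label, radius in ((0, 1.0), (1, 2.0)):
        for j in range(n_per_class):
            theta = rng.uniform(0.0, 2.0 * math.pi)
            point = [radius * math.cos(theta), radius * math.sin(theta)]
            if j < bias_frac * n_per_class:
                # leaked feature: its sign alone reveals the label
                point.append(rng.gauss(-2.0 if label == 0 else 2.0, 0.5))
            else:
                point.append(rng.gauss(0.0, 0.5))    # uninformative noise
            data.append((point, label))
    rng.shuffle(data)
    return data

data = make_biased_circles()
```

On the first two dimensions the classes are only separable by a non-linear model, while the appended bias feature makes a linear classifier succeed on the biased majority of instances.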

Figure 2: Two sample biased datasets as input to AFLite (top). Blue and orange indicate two different classes. Only the original two dimensions are shown, not the bias features. For the dataset on the left, with the highest separation, we flip some labels at random, so even an RBF kernel cannot achieve perfect performance. AFLite makes the data more challenging for the models (bottom). Also see Appendix §A.2 for more details.

4 NLP Experiments

As our first real-world data evaluation for AFLite, we consider out-of-domain and in-domain generalization for a variety of language datasets. The primary task we consider is natural language inference (NLI) on the Stanford NLI dataset (Bowman et al., 2015, SNLI). Each instance in the NLI task consists of a premise-hypothesis sentence pair; the task is to predict whether the hypothesis entails, contradicts, or is neutral with respect to the premise.

Experimental Setup. We use feature representations from RoBERTa-large (Liu et al., 2019b), a large-scale pretrained masked language model, extracted from the final layer before the output layer after training on a random (warm-up) sample of the original training set. The resultant filtered NLI dataset is compared to the original dataset, as well as to a randomly subsampled dataset of the same size as the filtered data, amounting to only a third of the full dataset. The same RoBERTa-large architecture is used to train the three NLI models.

                         HANS                      NLI-Diagnostics                Stress
                         Lex.  Subseq.  Constit.   Knowl.  Logic  PAS   LxS.      Comp.  Distr.  Noise
SNLI (full, 550k)        88.4  28.2     21.7       51.8    57.8   72.6  65.7      77.9   73.5    79.8
Random subset (182k)     56.6  19.6     13.8       56.4    53.9   71.2  65.6      68.4   73.0    78.6
AFLite-filtered (182k)   94.1  46.3     38.5       53.9    58.7   69.9  66.5      79.1   72.0    79.5
Table 1: Zero-shot SNLI accuracy on three out-of-distribution evaluation tasks, comparing RoBERTa-large models trained on the original SNLI data (550k instances), on AFLite-filtered data (182k), and on a random subset with the same size as the filtered data. The reported accuracy is averaged across 5 random seeds. On the HANS dataset, all models are evaluated on the non-entailment cases of the three syntactic heuristics (Lexical overlap, Subsequence, and Constituent). The NLI-Diagnostics dataset is broken down into instances requiring logical reasoning (Logic), world and commonsense knowledge (Knowl.), lexical semantics (LxS.), or predicate-argument structures (PAS). Stress tests for NLI are further categorized into Competence, Distraction, and Noise tests.
                        Rd1   Rd2   Rd3
SNLI (full, 550k)       58.5  48.3  50.1
AFLite-filtered (182k)  65.1  49.1  52.8
Table 2: SNLI accuracy on Adversarial NLI using RoBERTa-large models pre-trained on the original SNLI data (550k instances) and on AFLite-filtered data (182k). Both models were finetuned on the in-distribution training data for each round (Rd1, Rd2, and Rd3).

4.1 Out-of-distribution Generalization

As motivated in Section §1, large-scale architectures often learn to solve datasets rather than the underlying task, by overfitting on unintended correlations between input and output in the data. However, this reliance can hurt generalization to out-of-distribution examples, which may not contain the same biases. We evaluate AFLite on this criterion for the NLI task.

Gururangan et al. (2018), among others, showed the existence of certain annotation artifacts (lexical associations etc.) in SNLI which make the task considerably easier for most current methods. This spurred the development of several out-of-distribution test sets which carefully control for the presence of said artifacts. We evaluate on four such out-of-distribution datasets: HANS (McCoy et al., 2019), NLI Diagnostics (Wang et al., 2018a), Stress tests (Naik et al., 2018) and Adversarial-NLI (Nie et al., 2019), see Appendix §A.3 for details. Given that these benchmarks are collected independently of the original SNLI task, the biases from SNLI are less likely to carry over; however these benchmarks might contain their own biases (Liu et al., 2019a).

Train Data                        Original  Random  AFLite   AFLite  AFLite     Δ
                                            (92k)   (GloVe)  (BERT)  (RoBERTa)
ESIM+ELMo (Peters et al., 2018)   88.7      86.0    61.5     54.2    51.9       36.8
BERT (Devlin et al., 2019)        91.3      87.6    74.7     61.8    57.0       34.3
RoBERTa (Liu et al., 2019b)       92.6      88.3    78.9     71.4    62.6       30.0
Max-PPMI                          54.5      52.0    41.1     41.5    41.9       12.6
BERT-HypOnly                      71.5      70.1    52.3     46.4    48.4       23.1
RoBERTa-HypOnly                   72.0      70.4    53.6     49.5    48.5       23.5
Human performance                 88.1      88.1    82.3     80.3    77.8       10.3
Training set size                 550k      92k     138k     109k    92k        458k
Table 3: Dev accuracy (%) on the original SNLI dataset and on the datasets obtained through AFLite filtering with different feature representations, alongside other baselines. "Random" indicates a randomly subsampled train dataset of the same size as the RoBERTa-filtered data. Δ indicates the difference between the model trained on the full data and the model trained on the RoBERTa-filtered data.

Table 1 shows results on three of the four diagnostic datasets (HANS, NLI-Diagnostics and Stress), where we perform a zero-shot evaluation of the models. Models trained on SNLI-AFLite consistently exceed or match the performance of the full model on the benchmarks above, up to standard deviation. To control for size, we compare to a baseline trained on a random subsample of the same size. AFLite models report higher generalization performance, suggesting that the filtered samples are more informative than a random subset. In particular, models trained on AFLite-filtered data substantially outperform the others on the challenging examples in the HANS benchmark, which targets models purely relying on lexical and syntactic cues.

Table 2 shows results on the Adversarial NLI benchmark, which allows for evaluation of transfer capabilities by finetuning models on each of the three training datasets (Rd1, Rd2 and Rd3). A RoBERTa-large model trained on SNLI-AFLite surpasses the model trained on the original SNLI in all three settings.

4.2 In-distribution Benchmark Re-estimation


additionally provides a more accurate estimation of the benchmark performance on several tasks. Here we simply lower the

AFLite early-stopping threshold, in order to filter most biased examples from the data, resulting in a stricter benchmark with 92k train samples.


In addition to RoBERTa-large, we consider here pre-computed embeddings from BERT-large (Devlin et al., 2019) and GloVe (Pennington et al., 2014), resulting in three different feature representations for SNLI: one from RoBERTa-large (Liu et al., 2019b), one from BERT-large, and one which uses the ESIM model (Chen et al., 2016) over GloVe embeddings. Table 3 shows the results for SNLI. In all cases, applying AFLite substantially reduces overall model accuracy, with typical drops of 15-35% depending on the models used for learning the feature representations and those used for evaluation of the filtered dataset. In general, performance is lowest when using the strongest model (RoBERTa) for learning the feature representations. The results also highlight the ability of weaker adversaries to produce datasets that are still challenging for much stronger models, with a drop of 13.7% for RoBERTa when the ESIM+GloVe features are used as the representation.

To control for the reduction in dataset size caused by filtering, we randomly subsample the original data, creating a subset whose size is approximately equal to that of the filtered data. All models achieve nearly the same performance as on the full dataset – even when trained on just one-fifth of the original data. This result further highlights that current benchmark datasets contain significant redundancy among their instances.

We also include two other baselines, which target known dataset artifacts in NLI. The first baseline uses the point-wise mutual information (PMI) between words in a given instance and the target label as its only feature. Hence it captures the extent to which datasets exhibit word-association biases, one particular class of spurious correlations. While this baseline is relatively weaker than the other models, its performance still drops by nearly 13% on the RoBERTa-filtered dataset. The second baseline trains only on the hypothesis of an NLI instance (-HypOnly). Such partial-input baselines (Gururangan et al., 2018) capture reliance on lexical cues in the hypothesis alone, instead of learning a semantic relationship between the hypothesis and premise. Filtering with RoBERTa reduces this baseline's performance by almost 24%. AFLite, which is agnostic to any particular known bias in the data, results in a drop of about 30% on the same dataset, indicating that it might be capturing a larger class of spurious biases than either of the above baselines.
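As an illustration of the word-association baseline, a minimal Max-PPMI-style classifier can be sketched as follows; the scoring rule (summed positive PMI per label) and the toy training pairs are assumptions for the example, not the paper's exact setup:

```python
import math
from collections import Counter

def train_ppmi(pairs):
    """Estimate positive PMI between each word and each label."""
    word_label, words, labels, n = Counter(), Counter(), Counter(), 0
    for text, label in pairs:
        for w in set(text.split()):
            word_label[w, label] += 1
            words[w] += 1
            labels[label] += 1
            n += 1
    return {
        (w, l): max(0.0, math.log(c * n / (words[w] * labels[l])))
        for (w, l), c in word_label.items()
    }

def predict(ppmi, text, label_set):
    # Score each label by summed positive PMI of the words it co-occurs with
    return max(label_set,
               key=lambda l: sum(ppmi.get((w, l), 0.0) for w in text.split()))

train = [("a man is sleeping", "contradiction"),
         ("a dog is outdoors", "entailment"),
         ("nobody is sleeping", "contradiction"),
         ("an animal is outdoors", "entailment")]
ppmi = train_ppmi(train)
print(predict(ppmi, "a cat is sleeping", {"entailment", "contradiction"}))
# "sleeping" is strongly associated with contradiction in the toy data
```

A classifier of this sort exploits exactly the word-association artifacts (e.g., "sleeping" signalling contradiction) that annotation is known to introduce into NLI hypotheses.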

Finally, to demonstrate the value of the iterative, ensemble-based AFLite algorithm, we compare with a baseline where, using a single model, we filter out the most predictable examples in a single iteration (a non-iterative, single-model version of AFLite). A RoBERTa-large model trained on this subset (of the same size as the AFLite-filtered data) achieves a higher dev accuracy than the 62.6% obtained on the AFLite-filtered dataset (see Table 3), making this baseline a sensible yet less effective approach. In particular, this illustrates the need for an iterative procedure involving models trained on multiple partitions of the remaining data in each iteration.

MultiNLI and QNLI

We also evaluate AFLite on another large-scale NLI dataset, multi-genre NLI (MultiNLI; Williams et al., 2018), and on the QNLI dataset (Wang et al., 2018a), a sentence-pair classification version of the SQuAD (Rajpurkar et al., 2016) question answering task (QNLI is stylized as an NLI classification task, where the goal is to determine whether or not a sentence contains the answer to a question). Results before and after AFLite are reported in Table 4. Since RoBERTa resulted in the largest drops in performance across the board on SNLI, we only experiment with RoBERTa as the adversary for MultiNLI and QNLI. While RoBERTa achieves over 90% accuracy on both original datasets, its performance drops to 66.2% for MultiNLI and to 77.7% for QNLI on the filtered datasets. Similarly, partial-input baseline performance also decreases substantially on both datasets compared to the originals. Overall, our experiments indicate that AFLite consistently results in reduced accuracy on the filtered datasets across multiple language benchmarks, even after controlling for the size of the training set.

MNLI                   Original  AFLite  Δ
BERT                   86.6      55.8    30.8
RoBERTa                90.3      66.2    24.1
BERT-PartialInput      59.7      43.2    16.5
RoBERTa-PartialInput   60.3      44.4    15.9
QNLI                   Original  AFLite  Δ
BERT                   92.0      63.5    28.5
RoBERTa                93.7      77.7    16.0
BERT-PartialInput      62.6      56.6    6.0
RoBERTa-PartialInput   63.9      59.4    4.5
Table 4: Dev accuracy (%) on the original and AFLite-filtered MNLI-matched and QNLI datasets. The -PartialInput baselines are models trained on only the hypotheses for MNLI instances and only the answers for QNLI. Δ indicates the difference in performance between the model trained on the original data and the model trained on AFLite-filtered data.

Table 3 shows that human performance on SNLI-AFLite is lower than that on the full SNLI (as measured by the annotator labels provided in the original SNLI validation data). This indicates that the filtered dataset is somewhat harder even for humans, though to a much lesser degree than for any model. Indeed, removal of examples with spurious correlations could inadvertently lead to removal of genuinely easy examples; this might be a limitation of a model-based bias reduction approach such as AFLite (see Appendix §A.5 for a qualitative analysis). Future bias reduction techniques should aim to leave human performance unaltered before and after reduction.

5 Vision Experiments

Train Data        EfficientNet-B5  EfficientNet-B7
Full ImageNet     16.5             20.6
Random subset     5.9              8.5
AFLite-filtered   7.2              10.4
Table 5: Top-1 accuracy (%) on ImageNet-A (Hendrycks et al., 2019), an adversarial evaluation set for image classification. The most powerful model, EfficientNet-B7, improves by 2% on out-of-distribution ImageNet-A images when trained on AFLite-filtered data rather than on a random subset of the same size.

We evaluate AFLite on image classification through ImageNet (ILSVRC2012). On ImageNet, we use the state-of-the-art EfficientNet-B7 model (Tan and Le, 2019) as our core feature extractor Φ. The EfficientNet model is learned from scratch on a fixed 20% sample of the ImageNet training set, using RandAugment data augmentation (Cubuk et al., 2019). We then use the 2560-dimensional features extracted by EfficientNet-B7 as the underlying representation for AFLite to filter the remaining dataset, stopping when the data size reaches 40% of ImageNet.

Adversarial Image Classification

In Table 5, we report performance of image classification models on ImageNet-A, a dataset with out-of-distribution images (Hendrycks et al., 2019). As shown, all EfficientNet models struggle on this task, even when trained on the entire ImageNet. However, we find that training on AFLite-filtered data leads to models with greater generalization, in comparison to training on a randomly sampled ImageNet of the same size, leading to up to 2% improvement in performance.

Train Data        Full   Random 40%  AFLite 40%  Δ
EfficientNet-B0   76.3   69.6        50.2        26.1
EfficientNet-B3   81.7   75.1        57.3        24.4
EfficientNet-B5   83.7   78.6        62.2        21.5
EfficientNet-B7   84.4   78.8        63.5        20.9
ResNet-34         78.4   65.9        46.9        31.5
ResNet-50         79.2   68.9        50.1        29.1
ResNet-101        80.1   70.1        52.2        27.9
ResNet-152        80.6   71.0        53.3        27.3
Table 6: Results on ImageNet, in Top-1 accuracy (%). We consider training on the 40% most challenging instances, as filtered by AFLite, and compare this to a random 40% subsample of ImageNet. We report results on the ImageNet validation set before and after filtering with AFLite. Δ indicates the difference in accuracy between the full model and the filtered model. Notably, evaluating on ImageNet-AFLite is much harder, resulting in a drop of nearly 21 percentage points in accuracy for the strongest model.

In-distribution Image Classification

In Table 6, we present ImageNet accuracy across the EfficientNet and ResNet (He et al., 2016) model families before and after filtering with AFLite. For evaluation, the ImageNet-AFLite validation set is much harder than the standard validation set (also see Figure 1). While the top performer after filtering is still EfficientNet-B7, its top-1 accuracy drops from 84.4% to 63.5%. In contrast, a model trained on a random subsample of the same size suffers a much smaller drop, most likely attributable to the reduction in training data alone.

Overall, these results suggest that image classification – even within a subset of the closed world of ImageNet – is far from solved. These results echo other findings that suggest that common biases that naturally occur in web-scale image data, such as towards canonical poses (Alcorn et al., 2019) or towards texture rather than shape (Geirhos et al., 2018), are problems for ImageNet-trained classifiers.

6 Related Work

AFLite is related to Zellers et al. (2018)’s adversarial filtering (AF) algorithm, yet distinct in two key ways: it is (i) much more broadly applicable (by not requiring over-generation of data instances), and (ii) considerably more lightweight (by not requiring re-training a model at each iteration of AF). Variants of this AF approach have recently been used to create other datasets such as HellaSwag (Zellers et al., 2019) and Abductive NLI (Bhagavatula et al., 2019) by iteratively perturbing dataset instances until a target model cannot fit the resulting dataset. While effective, these approaches run into three main pitfalls. First, dataset curators need to explicitly devise a strategy for collecting or generating perturbations of a given instance. Second, the approach runs the risk of distributional bias, where a discriminator can learn to distinguish between machine-generated and human-generated instances. Finally, it requires re-training a model at each iteration, which is computationally expensive, especially when using a large model such as BERT as the adversary. In contrast, AFLite focuses on removing dataset biases from existing datasets instead of adversarially perturbing instances. AFLite was earlier proposed by Sakaguchi et al. (2020) to create the Winogrande dataset. This paper presents more thorough experiments, theoretical justification, and results from generalizing the proposed approach to multiple popular NLP and vision datasets.

Li and Vasconcelos (2019) recently proposed REPAIR, a method to remove representation bias by dataset resampling. The motivation in REPAIR is to learn a probability distribution over the dataset that favors instances that are hard for a given representation. In addition, the implementation of REPAIR relies on in-training classification loss as opposed to out-of-sample generalization accuracy. RESOUND (Li et al., 2018) quantifies the representation biases of datasets. It uses the representation biases to assemble a new K-class dataset with smaller biases by sampling an existing C-class dataset (with K < C).

Arjovsky et al. (2019) propose Invariant Risk Minimization as an objective that promotes learning representations of the data which are stable across environments. Instead of learning optimal classifiers, AFLite aims to remove instances that exhibit artifacts in a dataset. Also related are approaches in He et al. (2019) where specific NLI biases are targeted; we show AFLite is capable of removing any spurious bias. Data selection methods such as Wang et al. (2018b) aim to filter data to preserve downstream performance, whereas AFLite is adversarial to this goal.

7 Conclusion

We presented a deep dive into AFLite, an iterative greedy algorithm that adversarially filters out spurious biases from data for accurate benchmark estimation. We presented a theoretical framework supporting AFLite, and showed its effectiveness in bias reduction on synthetic and real datasets, providing extensive analyses. We applied AFLite to four datasets, including widely used benchmarks such as SNLI and ImageNet, and showed that the strongest performance on the resulting filtered datasets drops by 30 points for SNLI and 20 points for ImageNet. We further showed that models trained on the AFLite-filtered subsets generalize better to out-of-distribution and adversarial test sets. We hope that dataset creators will employ AFLite to identify unobservable artifacts before releasing new challenge datasets, for more reliable estimates of task progress on future AI benchmarks. All datasets and code for this work will be made public soon.


  • M. A. Alcorn, Q. Li, Z. Gong, C. Wang, L. Mai, W. Ku, and A. Nguyen (2019) Strike (with) a pose: neural networks are easily fooled by strange poses of familiar objects. In CVPR, Cited by: §5.
  • M. Arjovsky, L. Bottou, I. Gulrajani, and D. Lopez-Paz (2019) Invariant risk minimization. Note: ArXiv:1907.02893 External Links: Link Cited by: §6.
  • M. Balog, N. Tripuraneni, Z. Ghahramani, and A. Weller (2017) Lost relatives of the Gumbel trick. In ICML, Cited by: §A.1, §2.
  • C. Bhagavatula, R. L. Bras, C. Malaviya, K. Sakaguchi, A. Holtzman, H. Rashkin, D. Downey, S. W. Yih, and Y. Choi (2019) Abductive commonsense reasoning. In ICLR, External Links: Link Cited by: §6.
  • S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. In EMNLP, External Links: Link Cited by: §1, §1, §4.
  • Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang, and D. Inkpen (2016) Enhanced LSTM for natural language inference. In ACL, External Links: Link Cited by: §4.2.
  • E. D. Cubuk, B. Zoph, J. Shlens, and Q. V. Le (2019) RandAugment: practical data augmentation with no separate search. arXiv preprint arXiv:1909.13719. Cited by: §A.6, §5.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, Cited by: §4.2, Table 3.
  • K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. X. Song (2018) Robust physical-world attacks on deep learning models. In CVPR, Cited by: §1.
  • D. F. Fouhey, W. Kuo, A. A. Efros, and J. Malik (2018) From lifestyle vlogs to everyday interactions. In CVPR, Cited by: §1.
  • R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel (2018) ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. ICLR. Cited by: §1, §5.
  • M. Geva, Y. Goldberg, and J. Berant (2019) Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. In EMNLP, External Links: Link Cited by: §1.
  • Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh (2017) Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, Cited by: §1.
  • E. J. Gumbel and J. Lieblein (1954) Statistical theory of extreme values and some practical applications: a series of lectures. In Applied Mathematics Series, Vol. 33. Cited by: §A.1, §2.
  • S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. Bowman, and N. A. Smith (2018) Annotation artifacts in natural language inference data. In NAACL, External Links: Link Cited by: §1, §1, §4.1, §4.2.
  • H. He, S. Zha, and H. Wang (2019) Unlearn dataset bias in natural language inference by fitting the residual. ArXiv. External Links: Link Cited by: §6.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, Cited by: §A.6, §5.
  • D. Hendrycks, K. Zhao, S. Basart, J. Steinhardt, and D. Song (2019) Natural adversarial examples. arXiv preprint arXiv:1907.07174. Cited by: §1, §5, Table 5.
  • E. Jang, S. Gu, and B. Poole (2016) Categorical reparameterization with gumbel-softmax. In ICLR, Cited by: §A.1.
  • R. Jia and P. Liang (2017) Adversarial examples for evaluating reading comprehension systems. In EMNLP, External Links: Link Cited by: §1.
  • C. Kim, A. Sabharwal, and S. Ermon (2016) Exact sampling with integer linear programs and random perturbations. In AAAI, Cited by: §A.1, §2.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. Note: arXiv:1412.6980 External Links: Link Cited by: §A.4.
  • W. Kool, H. van Hoof, and M. Welling (2019) Stochastic beams and where to find them: the gumbel-top-k trick for sampling sequences without replacement. In ICML, Cited by: §A.1, §2.
  • Y. C. Li and N. Vasconcelos (2019) REPAIR: Removing representation bias by dataset resampling. In CVPR, Cited by: §6.
  • Y. Li, Y. Li, and N. Vasconcelos (2018) RESOUND: Towards action recognition without representation bias. In ECCV, Cited by: §6.
  • N. F. Liu, R. Schwartz, and N. A. Smith (2019a) Inoculation by fine-tuning: a method for analyzing challenge datasets. In NAACL, External Links: Link, Document Cited by: §4.1.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. S. Joshi, D. Chen, O. Levy, M. Lewis, L. S. Zettlemoyer, and V. Stoyanov (2019b) RoBERTa: A robustly optimized BERT pretraining approach. Note: ArXiv:1907.11692 External Links: Link Cited by: §4.2, Table 3, §4.
  • C. J. Maddison, A. Mnih, and Y. W. Teh (2016) The concrete distribution: a continuous relaxation of discrete random variables. In ICLR, Cited by: §A.1.
  • C. J. Maddison, D. Tarlow, and T. Minka (2014) A* sampling. In NeurIPS, Cited by: §A.1, §2.
  • T. McCoy, E. Pavlick, and T. Linzen (2019) Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference. In ACL, External Links: Link, Document Cited by: 1st item, §4.1.
  • A. Naik, A. Ravichander, N. Sadeh, C. Rose, and G. Neubig (2018) Stress test evaluation for natural language inference. In ICCL, External Links: Link Cited by: 3rd item, §4.1.
  • Y. Nie, A. Williams, E. Dinan, M. Bansal, J. Weston, and D. Kiela (2019) Adversarial NLI: a new benchmark for natural language understanding. Note: arXiv:1910.14599 External Links: Link Cited by: 4th item, §4.1.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In EMNLP, External Links: Link Cited by: §4.2.
  • M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. S. Zettlemoyer (2018) Deep contextualized word representations. In NAACL, External Links: Link Cited by: Table 3.
  • A. Poliak, J. Naradowsky, A. Haldar, R. Rudinger, and B. Van Durme (2018) Hypothesis only baselines in natural language inference. In *SEM, External Links: Link Cited by: §1.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100, 000+ questions for machine comprehension of text. In EMNLP, External Links: Link Cited by: §1, §4.2.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. IJCV. External Links: Document Cited by: §1, §1.
  • K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi (2020) WINOGRANDE: an adversarial winograd schema challenge at scale. In AAAI, External Links: Link Cited by: §1, §1, §1, §6.
  • M. Tan and Q. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In ICML, Cited by: §A.6, Figure 1, §5.
  • A. Torralba and A. A. Efros (2011) Unbiased look at dataset bias. CVPR. Cited by: §1.
  • M. Tsuchiya (2018) Performance impact caused by hidden bias of training data for recognizing textual entailment. In LREC, Cited by: §1.
  • T. Vieira (2014) Gumbel-max trick and weighted reservoir sampling. External Links: Link Cited by: §A.1.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018a) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In ICLR, External Links: Link Cited by: 2nd item, §1, §4.1, §4.2.
  • T. Wang, J. Zhu, A. Torralba, and A. A. Efros (2018b) Dataset distillation. Note: arXiv:1811.10959 External Links: Link Cited by: §6.
  • A. Williams, N. Nangia, and S. Bowman (2018) A broad-coverage challenge corpus for sentence understanding through inference. In NAACL, External Links: Link, Document Cited by: §1, §4.2.
  • R. Zellers, Y. Bisk, R. Schwartz, and Y. Choi (2018) SWAG: A large-scale adversarial dataset for grounded commonsense inference. In EMNLP, External Links: Link Cited by: §6.
  • R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi (2019) HellaSwag: Can a machine really finish your sentence?. In ACL, External Links: Link Cited by: §6.

Appendix A Appendix

a.1 Slice Sampling Details

The slice sampling approach can be efficiently implemented using what is known as the Gumbel method or Gumbel trick (Gumbel and Lieblein, 1954; Maddison et al., 2014), which uses random perturbations to turn sampling into the simpler problem of optimization. This has recently found success in several probabilistic inference applications (Kim et al., 2016; Jang et al., 2016; Maddison et al., 2016; Balog et al., 2017; Kool et al., 2019). Starting with the log-predictability scores log p(i) for instances i, the idea is to perturb them by adding independent random noise g_i drawn from the standard Gumbel distribution. Interestingly, the maximizer argmax_i (log p(i) + g_i) turns out to be an exact sample drawn from the (unnormalized) distribution defined by the scores p(i). Note that this maximizer is a random variable, since the g_i are drawn at random. This result can be generalized (Vieira, 2014) for slice sampling: the k highest values of the Gumbel-perturbed log-predictability scores correspond to sampling k items, without replacement, from the probability distribution defined by the p(i). The Gumbel method is typically applied to exponentially large combinatorial spaces, where it is challenging to scale up. In our setting, however, the overhead is minimal, since the cost of drawing a random g_i is negligible compared to that of computing p(i).
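As a concrete illustration, the Gumbel-top-k trick takes only a few lines (a generic numerical sketch, not the paper's implementation):

```python
import numpy as np

def gumbel_top_k(log_scores, k, rng):
    """Sample k distinct indices, with probability proportional to
    exp(log_scores), by adding independent standard Gumbel noise to each
    log-score and keeping the k largest perturbed values."""
    g = rng.gumbel(size=len(log_scores))   # standard Gumbel(0, 1) noise
    return np.argsort(-(log_scores + g))[:k]

rng = np.random.default_rng(0)
log_p = np.log(np.array([0.7, 0.2, 0.1]))  # toy predictability distribution
print(gumbel_top_k(log_p, 2, rng))         # two indices, sampled w/o replacement
```

With k = 1, this reduces to the classic Gumbel-max trick: over many draws, index 0 is selected with frequency approaching 0.7.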

a.2 Results on Synthetic Data Experiments

Figure 3: Four sample biased datasets as input to AFLite (top). Blue and orange indicate two different classes. Only the original two dimensions are shown, not the bias features. For the leftmost dataset with the highest separation, we flip some labels at random, so even an RBF kernel cannot achieve perfect performance. AFLite makes the data more challenging for the models (bottom).

Figure 3 shows the effect of AFLite on four synthetic datasets at increasing degrees of class separation, exhibiting phenomena similar to those shown in Figure 2. Accuracies of the SVM with RBF kernel and of logistic regression are shown in Table 7 for better readability. A stronger model such as the SVM is more robust to the presence of artifacts than a simple linear classifier. Thus, the implication for real datasets is to move towards models geared to the specific phenomena in the task, to avoid dependence on spurious artifacts.
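This effect is easy to reproduce in a toy setting (a hypothetical construction, loosely mirroring the synthetic experiments): plant an "artifact" feature that leaks the label on most instances, and a linear classifier scores far above what the true signal alone supports.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)
# Two overlapping 2-D Gaussians: the "real" signal is weak.
signal = rng.normal(loc=y[:, None] * 0.6, scale=1.0, size=(n, 2))
# A spurious artifact feature that leaks the label on 90% of instances.
leak = np.where(rng.random(n) < 0.9, y, 1 - y).astype(float)
X = np.hstack([signal, leak[:, None]])

X_tr, X_te = X[:800], X[800:]
y_tr, y_te = y[:800], y[800:]
linear = LogisticRegression().fit(X_tr, y_tr)
rbf = SVC(kernel="rbf").fit(X_tr, y_tr)
print("linear:", linear.score(X_te, y_te))  # inflated by the artifact
print("rbf:   ", rbf.score(X_te, y_te))
```

Filtering away the instances on which the leak feature is predictive would strip the linear model of its crutch, while the RBF model can still exploit the (weak) real signal.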

a.3 NLI Out-of-distribution Benchmarks

We describe the four out-of-distribution evaluation benchmarks for NLI from Section §4.1 below:

  • HANS (McCoy et al., 2019) contains evaluation examples designed to avoid common structural heuristics (such as word overlap) which could be used by models to correctly predict NLI inputs, without true inferential reasoning.

  • NLI Diagnostics (Wang et al., 2018a) is a set of hand-crafted examples designed to demonstrate model performance on several fine-grained semantic categories, such as logical reasoning and commonsense knowledge.

  • Stress tests for NLI (Naik et al., 2018) are a collection of tests targeting the weaknesses of strong NLI models, to check if these are robust to semantics (competence), irrelevance (distraction) and typos (noise).

  • Adversarial NLI (Nie et al., 2019) consists of premises collected from Wikipedia and other news corpora, and human generated hypotheses, arranged at different tiers of the challenge they present to a model, using a human and model in-the-loop procedure.

a.4 Hyperparameters for NLI

For all NLP experiments, our implementation is based on the Transformers repository from HuggingFace. We used the Adam optimizer (Kingma and Ba, 2014) for every training setup, with a learning rate of 1e-05 and an epsilon value of 1e-08, keeping other parameters the same as in the original PyTorch repository. We trained for 3 epochs for all *NLI tasks, maintaining a batch size of 92. Each experiment was performed on a single Quadro RTX 8000 GPU.
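These settings translate into an optimizer configuration along the following lines (a sketch; the `torch.nn.Linear` stands in for an actual Transformers NLI classifier):

```python
import torch

# Hyperparameters reported above, assumed to apply to every *NLI run.
LEARNING_RATE = 1e-5
ADAM_EPSILON = 1e-8
NUM_EPOCHS = 3
BATCH_SIZE = 92

# Placeholder model; in practice this would be e.g. a RoBERTa
# sequence-classification head over 3 NLI labels.
model = torch.nn.Linear(768, 3)
optimizer = torch.optim.Adam(model.parameters(),
                             lr=LEARNING_RATE, eps=ADAM_EPSILON)
```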

Results on the synthetic dataset are provided in Table 7. Please refer to Section 3 for a detailed description.

Separation  Model          Before AFLite  After AFLite
0.8         SVM-RBF        97.0           90.7
            Logistic Reg.  83.5           50.7
0.7         SVM-RBF        89.9           82.5
            Logistic Reg.  74.3           52.4
0.6         SVM-RBF        87.6           77.8
            Logistic Reg.  74.3           53.1
0.4         SVM-RBF        83.8           70.7
            Logistic Reg.  75.4           53.4
Table 7: Mean dev accuracy (%) of two models trained on four synthetic datasets before and after AFLite. Standard deviation across 10 runs with randomly chosen seeds is provided as a subscript. The datasets, also shown in Fig. 3, differ in the degree of separation between the two classes. Both models (SVM with an RBF kernel and a linear classifier with logistic regression) perform well on the original synthetic dataset, before filtering. The linear classifier performs well because the data contains spurious artifacts, making the task artificially easier for it. However, after AFLite, the linear model, relying mostly on the spurious features, clearly underperforms.

a.5 Qualitative Analysis of SNLI

Table 8 shows some examples removed and retained by AFLite on the SNLI dataset.

Removed by AFLite
Premise Hypothesis Label
A woman, in a green shirt, preparing to run on a treadmill. A woman is preparing to sleep on a treadmill. contradiction
The dog is catching a treat. The cat is not catching a treat. contradiction
Three young men are watching a tennis match on a large screen outdoors. Three young men watching a tennis match on a screen outdoors, because their brother is playing. neutral
A girl dressed in a pink shirt, jeans, and flip-flops sitting down playing with a lollipop machine. A funny person in a shirt. neutral
A man in a green apron smiles behind a food stand. A man smiles. entailment
A little girl with a hat sits between a woman’s feet in the sand in front of a pair of colorful tents. The girl is wearing a hat. entailment
Retained by AFLite
Premise Hypothesis Label
People are throwing tomatoes at each other. The people are having a food fight. entailment
A man poses for a photo in front of a Chinese building by jumping. The man is prepared for his photo. entailment
An older gentleman speaking at a podium. A man giving a speech neutral
A man poses for a photo in front of a Chinese building by jumping. The man has experience in taking photos. neutral
People are waiting in line by a food vendor. People sit and wait for their orders at a nice sit down restaurant. contradiction
Number 13 kicks a soccer ball towards the goal during children’s soccer game. A player passing the ball in a soccer game. contradiction
Table 8: Examples from SNLI, removed (top) and retained (bottom) by AFLite. As is evident, the retained instances are slightly more challenging and nuanced than the removed instances.

a.6 Hyperparameters for ImageNet

We trained our ImageNet models using v3-512 TPU pods. For EfficientNet (Tan and Le, 2019), we used RandAugment data augmentation (Cubuk et al., 2019) with 2 layers and a magnitude of 28, for all model sizes. We trained our models using a batch size of 4096 and a learning rate of 0.128, and kept other hyperparameters the same as in Tan and Le (2019). We trained for 350 epochs for all dataset sizes; when training on 20% or 40% of ImageNet (or a smaller dataset), we scaled the number of optimization steps accordingly. For ResNet (He et al., 2016), we used a learning rate of 0.1, a batch size of 8192, and trained for 90 epochs.
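Scaling the number of optimization steps for a fixed number of epochs is simple arithmetic; below is a sketch with the standard ImageNet-1k training-set size assumed:

```python
IMAGENET_TRAIN_SIZE = 1_281_167  # standard ImageNet-1k training set size
BATCH_SIZE = 4096
EPOCHS = 350

def total_steps(data_frac):
    """Optimization steps for EPOCHS passes over a data_frac subset."""
    n = int(IMAGENET_TRAIN_SIZE * data_frac)
    steps_per_epoch = -(-n // BATCH_SIZE)  # ceiling division
    return EPOCHS * steps_per_epoch

print(total_steps(1.0))   # full ImageNet
print(total_steps(0.4))   # AFLite-filtered 40% subset
```

Training on the 40% subset therefore uses roughly 40% of the optimization steps of a full-ImageNet run, since the epoch count is held fixed.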