Most of today’s machine learning methods rely on the assumption that the training and testing data are drawn from the same distribution. One implication is that models are susceptible to poor real-world performance when the test data differ from what is observed during training. This limited capability to generalize partly arises because supervised training essentially amounts to identifying correlations between given examples and their labels. However, correlations can be spurious, in the sense that they may reflect dataset-specific biases or sampling artifacts, rather than intrinsic properties of the task of interest [41, 67]. When spurious correlations do not hold in the test data, the model’s predictive performance suffers and its output becomes unreliable and unpredictable. For example, an image recognition system may rely on common co-occurrences of objects, such as people together with a dining table, rather than on visual evidence for each recognized object. This system could then hallucinate people when a table is observed (Fig. 4).
A model capable of generalization and extrapolation beyond its training distribution should ideally capture the causal mechanisms at play behind the data. Acquiring additional training examples from the same distribution cannot help in this process. Rather, we need either to inject strong prior assumptions into the model, such as inductive biases encoded in the architecture of a neural network, or a different type of training information. Ad hoc methods such as data augmentation and domain randomization fall into the former category, and they only push back the limits of the system through hand-designed rules.
In this paper, we show that many existing datasets contain an overlooked signal that is informative about their causal data-generating process. This information is present in the form of groupings of training examples, and it is often discarded by the shuffling of points occurring during stochastic training. We show that this information can be used to learn a model that is more faithful to the causal model behind the data. This training signal is fundamentally different and complementary to the labels of individual points. We use pairs of minimally-dissimilar, differently-labeled training examples, which we interpret as counterfactuals of one another. In some datasets, such pairs are provided explicitly [7, 12, 47, 48]. In others, they can be identified from existing annotations [18, 25, 36, 61, 62, 70].
The intuition for our approach is that relations between pairs of counterfactual examples indicate what changes in the input space map to changes in the space of labels. In a classification setting, this serves to constrain the geometry of a model’s decision boundary between classes. Loosely speaking, we complement the traditional “curve fitting” to individual training points of standard supervised learning, with “aligning the curve” with pairs of counterfactual training points.
We describe a novel training objective (gradient supervision) and its implementation on various architectures of neural networks. The vector difference in input space between pairs of counterfactual examples serves to supervise the orientation of the gradient of the network. We demonstrate the benefits of the method on four tasks in computer vision and natural language processing (NLP) that are notoriously prone to poor generalization due to dataset biases: visual question answering (VQA), multi-label image classification, sentiment analysis, and natural language inference. We use annotations from existing datasets that are usually disregarded, and we demonstrate significant improvements in generalization to out-of-distribution test sets for all tasks.
In summary, the contributions of this paper are as follows.
We propose to use relations between training examples as additional information in the supervised training of neural networks (Section 3.1). We show that they provide a fundamentally different and complementary training signal to the fitting of individual examples, and explain how they improve generalization (Section 3.3).
We describe a novel training objective (gradient supervision) to use this information and its implementation on multiple architectures of neural networks (Section 4).
We demonstrate that the required annotations are present in a number of existing datasets in computer vision and NLP, although they are usually discarded. We show that our technique brings improvements in out-of-distribution generalization on VQA, multi-label image classification, sentiment analysis, and natural language inference.
2 Related work
This work proposes a new training objective that improves the generalization capabilities of models trained with supervision. This touches a number of core concepts in machine learning.
The predictive performance of machine learning models rests on the fundamental assumption of statistical similarity between the distributions of training and test data. There is a growing interest in evaluating and addressing the limits of this assumption. Evaluation on out-of-distribution data is increasingly common in computer vision [2, 8, 27] and NLP [33, 78]. These evaluations have shown that some of the best models can be right for the wrong reasons [2, 21, 25, 67, 68]. This happens when they rely on dataset-specific biases and artifacts rather than intrinsic properties of the task of interest. When these biases do not hold in the test data, the predictive performance of the models can drop dramatically [2, 8].
When poor generalization is viewed as a deficiency of the training data, it is often attributed to dataset biases. These are correlations between inputs and labels in a dataset that a model can exploit to exhibit strong performance on a test set containing the same biases, without actually solving the task of interest. Several popular datasets used in vision-and-language and NLP have been shown to exhibit strong biases, leading to an inflated sense of progress on these tasks.
Recent works have discussed generalization from a causal perspective [6, 29, 50, 54]. This sheds light on the possible avenues for addressing the issue. In order to generalize perfectly, a model should ideally capture the real-world causal mechanisms at play behind the data. The limits of identifiability of causal models from observational data have been well studied. In particular, additional data from a single biased training distribution cannot solve the problem. The alternative options are to use strong assumptions (e.g. inductive biases, engineered architectures, hand-designed data augmentations), or additional data, collected in controlled conditions and/or of a different type than labeled examples. This work uses the latter option, relying on pairings of training examples that represent counterfactuals of one another. Recent works that follow this line include the principle of invariant risk minimization (IRM). IRM uses multiple training environments, i.e. non-IID training distributions, to discover generalizable invariances in the data. Teney et al. showed that existing datasets could be automatically partitioned to create these environments, and demonstrated improvements in generalization for the task of visual question answering (VQA).
Generalization is also related to the wide area of domain adaptation. Our objective in this paper is not to adapt to a particular new domain, but rather to learn a model that generalizes more broadly by using annotations indicative of the causal mechanisms of the task of interest. In domain adaptation, the idea of finding a data representation that is invariant across domains is limiting, because the true causal factors that our model should rely on may differ in their distribution across training domains. We refer the reader to the literature for a formal discussion of these issues.
The growing popularity of high-level tasks in vision-and-language [4, 5, 19] has brought the issue of dataset biases to the forefront. In VQA, language biases cause models to be overly reliant on the presence of particular words in a question. Improving the data collection process can help [24, 80] but it only addresses precisely identified biases and confounders. Controlled evaluations for VQA now include out-of-distribution test sets [2, 65]. Several models and training methods [11, 15, 16, 26, 27, 40, 52] have been proposed with significant improvements. They all use strong prior knowledge about the task and/or additional annotations (question types) to improve generalization. Some methods also supervise the model’s attention [37, 51, 57] with ground truth human attention maps. All of these methods are specific to VQA or to captioning [30, 37], whereas we describe a much more general approach.
Evaluating generalization overlaps with the growing interest in adversarial examples for evaluation [9, 13, 31, 33, 42, 46, 71]. The term has been used to refer both to examples purposefully generated to fool existing models [43, 44], and to hard natural examples that current models struggle with [9, 13, 71, 79]. Our method is most related to the use of these examples for adversarial training. Existing methods focus mostly on the generation of these examples and then mix them with the original data in a form of data augmentation [34, 58, 75]. We argue that this shuffling of examples destroys valuable information: in many datasets, we demonstrate that relations between training points carry a useful training signal. The above methods also aim at improving robustness to targeted adversarial attacks, which often use inputs outside the manifold of natural data. Most of them rely on prior knowledge and unsupervised regularizers [35, 32, 74], whereas we seek to exploit additional supervision to improve generalization on natural data.
3 Proposed approach
We start with an intuitive motivation for our approach, then describe its technical realization. In Section 3.3, we analyze more formally how it can improve generalization. In Section 4, we demonstrate its application to a range of tasks.
Training a machine learning model amounts to fitting a function to a set of labeled points. We consider a binary classification task, in which the model is a neural network f of parameters θ such that ŷ = f(x), and a set of training points {(xᵢ, yᵢ)}ᵢ with xᵢ ∈ ℝᴰ and yᵢ ∈ {0, 1} (by input space, we refer to a space of feature representations of the input, i.e. vector representations obtained with a pretrained CNN or text encoder). By training the model, we typically optimize θ such that the output of the network minimizes a loss on the training points. However, this does not specify the behaviour of f between these points, and the decision boundary could take an arbitrary shape (Fig. 3). The typical practice is to restrict the space of functions (e.g. a particular architecture of neural networks) and of parameters (e.g. with regularizers [32, 35, 74]). The capability of the model to interpolate and extrapolate beyond the training points depends on these choices.
Our motivating intuition is that many datasets contain information indicative of the shape of an ideal f* (ideal in the sense of being faithful to the data-generating process, see Section 3.3) between training points. In particular, we are interested in pairs of training examples that are counterfactuals of one another. Given a labeled example (x, y), we define its counterfactuals as examples x' that represent an alternative premise (“counter to the facts”) leading to a different outcome y'. These points represent “minimal changes” (‖x − x'‖ small, in a semantic sense) such that their label y' ≠ y. All possible counterfactuals to a given example constitute a distribution. We assume the availability of samples from it, forming pairs such as {(x, y), (x', y')}. The counterfactual relation is undirected.
Obtaining pairs of counterfactual examples.
Some existing datasets explicitly contain pairs of counterfactual examples [7, 12, 34, 47, 48]. For example, the counterfactually-augmented version of the IMDb dataset by Kaushik et al. contains sentences (movie reviews) with positive and negative labels. Annotators were instructed to edit a set of sentences to flip the label, thus creating counterfactual pairs (see examples in Fig. 1). Existing works simply use these as additional training points. Our contribution is to use the relation between these pairs, which is usually discarded. In other datasets, counterfactual examples can be created by masking parts of the input, thus creating negative examples. In Section 4, we apply this approach to the COCO and VQA v2 datasets.
3.2 Gradient supervision
To exploit relations between counterfactual examples, we introduce an auxiliary loss that supervises the gradient of the network f. We denote the gradient of the network with respect to its input at a point xᵢ by ∇f(xᵢ). Our new gradient supervision (GS) loss encourages ∇f(xᵢ) to align with a “ground truth” gradient vector ĝᵢ:

    L_GS = 1 − (∇f(xᵢ) · ĝᵢ) / (‖∇f(xᵢ)‖ ‖ĝᵢ‖).    (1)

This definition is a cosine distance between ∇f(xᵢ) and ĝᵢ. Assuming {(xᵢ, yᵢ), (x'ᵢ, y'ᵢ)} is a pair of counterfactual examples, a “ground truth” gradient at xᵢ is obtained as ĝᵢ = x'ᵢ − xᵢ. This represents the translation in the input space that should change the network output from yᵢ to y'ᵢ. Minimizing Eq. 1 encourages the network’s gradient to align with this vector at the training points. Assuming f is continuously differentiable, it also constrains the shape of f between training points. This makes f more faithful to the generating process behind the training data (see Section 3.3). Also note that the GS loss uses a local linearization of the network. Although deep networks are highly non-linear globally, first-order approximations have found multiple uses, for example in providing explanations [53, 56] and in generating adversarial examples. In our application, this approximation is reasonable since pairs of counterfactual examples lie close to one another and to the classification boundary, by definition.
The network is optimized for a combination of the main and GS losses, L = L_main + λ L_GS, where λ is a scalar hyperparameter. The optimization of the GS loss requires backpropagating second-order derivatives through the network. The computational cost over standard supervised training is two extra backpropagations through the whole model for each mini-batch.
In cases where the network output ŷ is a vector, a ground truth gradient is only available for classes for which we have positive examples. Denoting such a class by gt, we apply the GS loss only on the gradient of the corresponding output, ∇f_gt(xᵢ). If a softmax is used, the output for one class depends on that of the others, so the derivative of the network is taken on its logits to make it dependent on one class only.
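The GS loss and its combination with the main loss can be sketched as follows. This is a minimal illustrative sketch, not the paper’s implementation: it assumes a tiny logistic model f(x) = σ(w·x) so that the input-gradient is available in closed form; in a deep network, the gradient would instead be obtained by automatic differentiation, with a second backpropagation to optimize the GS term itself.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model_and_grad(w, x):
    """Output f(x) = sigmoid(w.x) and its gradient w.r.t. the input x."""
    p = sigmoid(w @ x)
    grad = p * (1.0 - p) * w  # closed-form d sigmoid(w.x) / dx
    return p, grad

def gs_loss(grad_f, g_hat, eps=1e-8):
    """Cosine distance between the network gradient and the
    'ground truth' gradient g_hat = x' - x (Eq. 1)."""
    cos = grad_f @ g_hat / (np.linalg.norm(grad_f) * np.linalg.norm(g_hat) + eps)
    return 1.0 - cos

def total_loss(w, x, y, x_cf, lam=0.5):
    """Main binary cross-entropy plus the lambda-weighted GS term,
    where the counterfactual x_cf of x defines g_hat."""
    p, grad_f = model_and_grad(w, x)
    l_main = -(y * np.log(p + 1e-8) + (1 - y) * np.log(1 - p + 1e-8))
    l_gs = gs_loss(grad_f, x_cf - x)
    return l_main + lam * l_gs
```

The cosine distance ranges over [0, 2]: it is near 0 when the network gradient points toward the counterfactual and near 2 when it points away from it.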
3.3 How gradient supervision improves generalization
By training a machine learning model f, we seek to approximate an ideal f* that represents the real-world process attributing the correct label y to any possible input x. Let us consider the Taylor expansion of f* at a training point xᵢ:

    f*(x) = f*(xᵢ) + ∇f*(xᵢ) · (x − xᵢ) + O(‖x − xᵢ‖²).    (2)
Our definition of a pair of counterfactual examples (xᵢ, x'ᵢ) (Section 3.1) implies that ‖xᵢ − x'ᵢ‖ approaches 0 (in a semantic sense). For such a pair of nearby points, the terms beyond the first order virtually vanish. It follows that the distance between f*(xᵢ) and f*(x'ᵢ) is maximized when the dot product ∇f*(xᵢ) · (x'ᵢ − xᵢ) is maximum. This is precisely the desired behavior of f in the vicinity of xᵢ and x'ᵢ, since their ground truth labels yᵢ and y'ᵢ are different by our definition of counterfactuals. This leads to the definition of the GS loss in Eq. 1. Geometrically, it encourages the gradient of f to align with the vector pointing from a point to its counterfactual, as illustrated in Fig. 3.
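The first-order argument above can be checked numerically. The sketch below is our own illustration under an arbitrary choice of smooth test function (not from the paper): among displacements of a fixed small norm, the change in a smooth function is largest along its gradient direction, which is the direction the GS loss supervises.

```python
import numpy as np

def f(x):
    # arbitrary smooth test function
    return np.sin(x[0]) + x[1] ** 2 + 0.5 * x[0] * x[1]

def grad_f(x):
    # its analytic gradient
    return np.array([np.cos(x[0]) + 0.5 * x[1], 2 * x[1] + 0.5 * x[0]])

def change_along(x, d, eps=1e-3):
    """|f(x + d) - f(x)| for a displacement d rescaled to norm eps."""
    d = eps * d / np.linalg.norm(d)
    return abs(f(x + d) - f(x))

rng = np.random.default_rng(0)
x = np.array([0.3, -0.7])
aligned = change_along(x, grad_f(x))  # step along the gradient
random_dirs = [change_along(x, rng.standard_normal(2)) for _ in range(100)]
```

Up to second-order terms, the gradient-aligned step yields the largest change, and its magnitude matches the first-order prediction eps·‖∇f(x)‖.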
The conventional empirical risk minimization with non-convex functions leads to large numbers of local minima. These correspond to multiple plausible decision boundaries with varied capabilities for generalization. Our approach essentially modifies the optimization landscape for the parameters of f such that the minimizer found after training is more likely to reflect the ideal f*.
The proposed method is applicable to datasets with counterfactual examples in the training data. They are sometimes provided explicitly [7, 12, 34, 47, 48]. Most interestingly, we show that they can also be generated from existing annotations [18, 25, 36, 61, 62, 70].
We selected four classical tasks in vision and language that are notoriously subject to poor generalization due to dataset biases. Our experiments aim (1) to measure the impact of gradient supervision on performance for well-known tasks, and (2) to demonstrate that the necessary annotations are available in a variety of existing datasets. We therefore prioritized the breadth of experiments and the use of simple models (details in supp. mat.) rather than chasing the state of the art on any particular task. The method should readily apply to more complex models for any of these tasks.
4.1 Visual question answering
The task of visual question answering (VQA) involves an image and a related question, to which the model must determine the correct answer among a set of approximately 2,000 candidate answers. Models trained on existing datasets (e.g. VQA v2) are notoriously poor at generalization because of dataset biases. These models rely on spurious correlations between the correct answer and certain words in the question. We use the training/test splits of VQA-CP that were manually organized such that the correlation between the questions’ prefixes (first few words) and answers differs between training and test time. Most methods evaluated on VQA-CP use explicit knowledge of this fact [2, 11, 15, 26, 52, 66, 17, 76] or even of the ground truth set of prefixes, which defeats the purpose of evaluating generalization. As discussed in the introduction, strong background assumptions are one of the two options to improve generalization beyond a set of labels. Our method, however, follows the other option of using a different type of data, and does not rest on knowledge of the construction of VQA-CP.
Generating counterfactual examples.
We build counterfactual examples for VQA-CP using existing annotations of human attention. Given a question/image/answer triple (q, v, a), we build its counterfactual counterpart (q, v', a') by editing the image and answer. The image v is represented as a set of features pre-extracted with a bottom-up attention model (a matrix with one row of features per bounding box). We build (v', a') by masking the features whose bounding boxes overlap with the human attention map past a certain threshold (details in supp. mat.). The vector a is a binary vector of correct answers over all candidates. We simply set all entries in a' to zero.
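In code, this construction can be sketched as follows. This is a hypothetical minimal version: the feature matrix, answer vector, and per-box attention-overlap scores (scalars in [0, 1]) are assumed precomputed, and the 0.2 threshold follows our supplementary material.

```python
import numpy as np

def make_counterfactual(features, answer_vec, overlap, threshold=0.2):
    """Mask (zero out) the features of boxes whose overlap with the human
    attention map exceeds `threshold`, and void the answer vector."""
    features_cf = features.copy()
    features_cf[overlap > threshold] = 0.0   # remove the attended visual evidence
    answer_cf = np.zeros_like(answer_vec)    # no answer is correct anymore
    return features_cf, answer_cf
```

The original triple is left untouched; the counterfactual is built on the fly during training.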
Table 1: Results on VQA-CP. For each evaluation set (our in-domain val. set, the official VQA-CP test set, and our focused test set), we report accuracy (%) as All / Yes-no / Num. / Other.

| Method | Val. | VQA-CP test | Focused |
|---|---|---|---|
| Ramakrishnan et al., 2018 | – / – / – / – | 42.0 / 65.5 / 15.9 / 36.6 | – / – / – / – |
| Grand and Belinkov, 2019 | – / – / – / – | 42.3 / 59.7 / 14.8 / 40.8 | – / – / – / – |
| Teney et al., 2019 | – / – / – / – | 46.0 / 58.2 / 29.5 / 44.3 | – / – / – / – |
| Strong baseline + CF data | 63.3 / 79.4 / 45.5 / 53.7 | 46.0 / 61.3 / 15.6 / 46.0 | 44.2 / 57.3 / 9.2 / 42.2 |
| + CF data + GS | 62.4 / 77.8 / 43.8 / 53.6 | 46.8 / 64.5 / 15.3 / 45.9 | 46.2 / 63.5 / 10.5 / 41.4 |
| Weak baseline (BUTD), trained on ‘Other’ only | – / – / – / 54.7 | – / – / – / 43.3 | – / – / – / 40.6 |
| + CF data | – / – / – / 55.9 | – / – / – / 45.0 | – / – / – / 40.6 |
| + CF data + GS | – / – / – / 56.1 | – / – / – / 44.7 | – / – / – / 38.3 |
For training, we use the training split of VQA-CP, minus 8,000 questions held out as an “in-domain” validation set. We generate counterfactual versions of the training examples that have a human attention map (approx. 7% of them). For evaluation, we use (1) our “in-domain” validation set (held out from the training set), (2) the official VQA-CP test set (which has a different correlation between prefixes and answers), and (3) a new focused test set.
The focused test set contains the questions from VQA-CP test from which we only keep image features of regions looked at by humans to answer the questions. We essentially perform the opposite of the building of counterfactual examples, and mask regions where the human attention is below a low threshold. Answering questions from the focused test set should intuitively be easier, since the background and distracting image regions have been removed. However, a model that relies on context (question or irrelevant image regions) rather than strictly on the relevant visual evidence will do poorly on the focused test set. This serves to measure robustness beyond the question biases that VQA-CP was specifically designed for.
We present results of our method applied on top of two existing models. The first (weak baseline) is the popular BUTD model [3, 64]. The second (strong baseline) is the “unshuffling” method of Teney et al., which was specifically tuned to address the language biases evaluated with VQA-CP. We compare the baseline model with the same model trained with the additional counterfactual data, and then with the additional GS loss. The performance improves on most question types with each of these additions. The focused test set provides an out-of-distribution evaluation complementary to the VQA-CP test set (which only accounts for language biases). It shows the improvements expected from our method to a larger extent than the VQA-CP test set. This suggests that evaluating generalization in VQA is still not completely addressed with the current benchmarks. Importantly, the improvements over both the weak and strong baselines indicate that the proposed method is not redundant with existing methods that specifically address the language biases measured by VQA-CP, like the strong baseline. Additional details are provided in the supplementary material.
4.2 Multi-label image classification
We apply our method to the COCO dataset. Its images feature objects from 80 classes. These objects appear in common situations such that the patterns of co-occurrence are highly predictable: a bicycle often appears together with a person, and a traffic light often appears with cars, for example. These images serve as the basis of a number of benchmarks for object detection, captioning, visual question answering, etc. They all inherit the biases of the COCO images [2, 30, 72], which is an increasing cause of concern. A method to improve generalization in this context therefore has a wide potential impact.
We consider a simple multi-label classification task that captures the core issue of dataset biases affecting higher-level tasks (captioning, for example). Each image is associated with a binary vector of size 80 that represents the presence of at least one object of the corresponding class in the image. The task is to predict this binary vector. Performance is measured with the mean average precision (mAP) over all classes. The model is a feed-forward neural network that performs an 80-class binary classification with sigmoid outputs, over pre-extracted ResNet-based visual features. We pre-extract these features with the bottom-up attention model of Anderson et al. They are spatially pooled into a 2048-dimensional vector. The model is trained with a standard binary cross-entropy loss (details in the supplementary material).
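The classifier just described can be sketched as follows. This is an illustrative sketch, not the paper’s implementation: we use a single linear layer over random stand-in features, keeping only the shapes stated in the text (2048-d pooled features, 80 classes, sigmoid outputs, binary cross-entropy).

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 2048, 80  # pooled feature dimension, number of COCO classes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W, b):
    """Per-class probabilities from a pooled feature vector."""
    return sigmoid(W @ x + b)

def bce_loss(p, y, eps=1e-8):
    """Standard binary cross-entropy, averaged over the 80 classes."""
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# stand-in parameters and one training example
W = 0.01 * rng.standard_normal((C, D))
b = np.zeros(C)
x = rng.standard_normal(D)                 # pooled visual features
y = (rng.random(C) < 0.1).astype(float)    # sparse multi-label target
p = forward(x, W, b)
```

In the actual experiments, the feed-forward network is deeper and the GS loss of Section 3.2 is added to this binary cross-entropy.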
Table 2: COCO multi-label classification. Mean average precision (mAP, %) on the original, edited, and hard edited evaluation images.

| Method | Original | Edited images | Hard edited |
|---|---|---|---|
| Random predictions (chance) | 5.1 | 3.9 | 7.8 |
| Baseline w/o edited tr. examples | 71.8 | 58.1 | 54.8 |
| Baseline w/ edited tr. examples | 72.1 | 64.0 | 56.0 |
| + GS, counterfactual relations | 72.9 | 65.2 | 57.7 |
| + GS, random relations | 71.8 | 63.9 | 56.1 |
Generating counterfactual examples.
Counterfactual examples can be generated using existing annotations in COCO. Agarwal et al. used an inpainter GAN to edit images by masking selected objects. This only requires the original labels and bounding boxes. The edited images represent a “minimal change” that makes the corresponding label negative, which agrees with our definition of counterfactuals. The vector of ground truth labels for an edited image is edited accordingly. For training, we use all original and edited versions of images from the COCO train2014 split. For evaluation, we use the images from the val2014 split (original and edited versions, evaluated separately). We also create an additional evaluation split named “Hard edited images”. It contains a subset of edited images with patterns of classes that never appear in the training set.
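The corresponding label edit is simple to sketch. The inpainting itself is done by the GAN of prior work; the hypothetical helper below only illustrates how the 80-d label vector of an edited image is updated when all instances of a class have been removed.

```python
import numpy as np

def edit_labels(labels, removed_classes):
    """Zero out the entries of classes whose instances were all inpainted
    out of the image; other classes are untouched."""
    edited = labels.copy()
    edited[list(removed_classes)] = 0.0
    return edited
```

Together with the edited image, this yields the counterfactual pair used by the GS loss.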
We first compare the baseline model trained with the original images only, and then with the original and edited images (Table 2). The performance improves (71.8→72.1%), which is particularly clear when evaluating on edited images (58.1→64.0%). This is because the patterns of co-occurrence in the training data cannot be blindly relied on with the edited images. The images in this set depict situations that are unusual in the training set, such as a surfboard without a person on top, or a man on a tennis court who is not holding a racquet. A model that relies on common co-occurrences in the training set rather than strictly on visual evidence can do well on the original images, but not on the edited ones. An improvement from additional data is not surprising. It is still worth emphasizing that the edited images were generated “for free” using existing annotations in COCO.
Training the model with the proposed gradient supervision (GS) further improves the precision (72.1→72.9%). This is again more significant on the edited images (64.0→65.2%). The improvement is highest on the set of “hard edited images” (56.0→57.7%). As an ablation, we train the GS model with random pairwise relations instead of relations between counterfactual pairs. The performance is clearly worse, showing that the value of GS lies in leveraging an additional training signal, rather than in setting arbitrary constraints on the gradient like existing unsupervised regularizers [35, 32, 74]. In Fig. 4, we provide qualitative examples from the evaluation sets where the predictions of our model improve over the baseline.
4.3 NLP Tasks: sentiment analysis and natural language inference
Table 3 (top): Sentiment analysis, accuracy (%). The first three columns use IMDb with counterfactuals; the last three evaluate zero-shot transfer.

| Test data | Val. | Test original | Test edited | Amazon | Semeval | Yelp |
|---|---|---|---|---|---|---|
| Random predictions (chance) | 51.4 | 47.7 | 49.2 | 47.3 | 53.3 | 45.4 |
| Baseline w/o edited tr. data | 71.2 | 82.6 | 55.3 | 78.6 | 61.0 | 82.8 |
| Baseline w/ edited tr. data | 85.7 | 82.0 | 88.7 | 80.8 | 63.1 | 87.4 |
| + GS, counterfactual rel. | 89.8 | 83.8 | 91.2 | 81.6 | 65.4 | 88.8 |
| + GS, random relations | 50.8 | 49.2 | 52.0 | 47.4 | 61.2 | 57.4 |
Table 3 (bottom): Natural language inference, accuracy (%). The first three columns use SNLI with counterfactuals; the last evaluates zero-shot transfer.

| Test data | Val. | Test original | Test edited | MultiNLI dev. |
|---|---|---|---|---|
| Random predictions (chance) | 30.8 | 34.6 | 32.9 | 31.9 |
| Baseline w/o edited tr. data | 61.8 | 42.0 | 59.0 | 46.0 |
| Baseline w/ edited tr. data | 61.3 | 39.1 | 57.8 | 42.4 |
| + GS, counterfactual relations | 64.8 | 44.4 | 61.2 | 46.8 |
| + GS, random relations | 58.5 | 40.4 | 58.6 | 45.7 |
The task of sentiment analysis is to assign a positive or negative label to a text snippet, such as a movie or restaurant review. For training, we use the extension of the IMDb dataset of movie reviews by Kaushik et al. They collected counterfactual examples by instructing crowdworkers to edit sentences from the original dataset to flip their label. They showed that a standard model trained on the original data performs poorly when evaluated on edited data, indicating that it relies heavily on dataset biases (e.g. the movie genre being predictive of the label). They then used edited data during training (simply mixing it with the original data) and showed much better performance in all evaluation settings, even when controlling for the amount of additional training examples. Our contribution is to use GS to leverage the relations between the pairs of original/edited examples.
The task of natural language inference (NLI) is to classify a pair of sentences, named the premise and the hypothesis, into entailment, contradiction, or neutral according to their logical relationship. We use the extension of the SNLI dataset by Kaushik et al. They instructed crowdworkers to edit original examples to change their labels. Each original example is supplemented with versions produced by editing the premise or the hypothesis, to either of the other two classes. The original and edited data together are therefore four times as large as the original data alone.
Results on sentiment analysis.
We first compare a model trained with the original data only, and one trained with the original and edited data as simple augmentation (Table 3). The improvement is significant when testing on edited data (55.3→88.7%). We then train the model with our GS loss. The added improvement is visible both on the original data (82.0→83.8%) and on the edited data (88.7→91.2%). The evaluation on edited examples is the more challenging setting, because spurious correlations from the original training data cannot be relied on. The ablation that uses GS with random relations completely fails, confirming the value of supervision with relations between pairs of related examples.
We additionally evaluate the model on out-of-sample data with three additional test sets: Amazon reviews, Semeval 2017 (Twitter data), and Yelp reviews. The model trained on IMDb is applied to these without any fine-tuning, which constitutes a significant challenge in terms of generalization. We observe a clear gain over the data augmentation baseline on all three.
Results on NLI.
We perform the same set of experiments on NLI. The fairest point of comparison is again the model trained with the original and edited data. Using the GS loss on top of it again brings a clear improvement (Table 3), both when evaluated on standard test data and on edited examples. As an additional measure of generalization, we also evaluate the same models on the dev. set of MultiNLI without any fine-tuning. There is a significant domain shift between the datasets. Using the edited examples for data augmentation actually hurts the performance here, most likely because they constitute very “unnatural” sentences, such that easy-to-pick-up language cues cannot be relied on. Using GS (which always uses the edited data as augmentations as well) brings the performance back up, above the baseline trained only on the original data.
Our NLP experiments were conducted with simple models and relatively little data. The current state of the art in sentiment analysis and NLI is achieved by transformer-based models trained on vastly more data. Kaushik et al. showed that counterfactual examples are much more valuable than the same amount of standard data, including for fine-tuning a BERT model for NLI. The application of our technique to the very large data regime, including with large-scale language models, is an exciting direction for future work.
We proposed a new training objective that improves the generalization capabilities of neural networks by supervising their gradient, using a training signal that is present but typically unused in many datasets. While most machine learning models rely on identifying correlations between inputs and outputs, we showed that relations between counterfactual examples provide a fundamentally different, complementary type of information. We showed theoretically and empirically that our technique can shape the decision boundary of a model to be more faithful to the causal mechanisms that generated the data. Practically speaking, the model is then more likely to be “right for the right reasons”. We showed that this effect brings significant improvements on a number of tasks when evaluated with out-of-distribution test data. We demonstrated that the required annotations can be extracted from existing datasets for a number of tasks.
Appendix 0.A Application to VQA
[Fig. 6: Examples of original questions and their ground truth (GT) answers for which counterfactual versions are generated:
- What is floating in the sky? GT answer(s): kites, kite, sail
- Where is the woman sitting? GT answer(s): stairs, steps
- What team is the batter on? GT answer(s): white, yankees, mets, giants
- Where is the baby looking? GT answer(s): laptop, screen, monitor
- What is the sex of the rider? GT answer(s): female, male
- What kind of boat is on the water? GT answer(s): canoe, paddle
- What sport is the person participating in? GT answer(s): surfing
- What is this person standing on? GT answer(s): skateboard
- What is the person in the photo holding? GT answer(s): surfboard]
We generate the counterfactual examples by masking image features on-the-fly, during training, according to the human attention maps. We use pre-extracted image features that correspond to bounding boxes in the image. We mask the features whose boxes overlap with the human attention map by more than a fixed threshold. We use a precomputed overlap score, a scalar in [0, 1], and set the threshold at 0.2 (setting it at 0 would mask the occasional boxes that encompass nearly the whole image, which is not desirable). This value was set manually by verifying the intended effect on a few training examples (that is, masking most of the relevant visual evidence). See Fig. 6 for examples of original questions and their counterfactual versions.
Our experiments use a validation set (8,000 questions chosen at random) held out from the original VQA-CP training set. Note that most existing methods evaluated on VQA-CP follow the unsanitary practice of using the VQA-CP test split for model selection. This is concerning since the whole purpose of VQA-CP is to evaluate generalization to an out-of-distribution test set. Moreover, the variance in evaluating the ‘number’ and ‘yes/no’ questions is extremely high, because the number of reasonable answers for each of these types is very limited. For example, a model that answers yes or no at random, or constantly produces either answer, can fare extremely well (upwards of 62% accuracy) on these questions. This can very well result from a buggy implementation or a “lucky” random seed, identified by model selection on the test set (!). This is why we also evaluate on the ‘other’ type of questions in isolation. All of these issues have been pointed out by a few authors [26, 16, 66].
Our focused test set is a subset of the official VQA-CP test set. It is created in a similar manner to the counterfactual examples: we mask features that overlap with human attention maps below (instead of above) a threshold of 0.8. This value was set manually by verifying the intended effect on a few examples (masking the background, but not the regions necessary to answer the question). The focused test set is much smaller than the official test set since it only comprises questions for which a human attention map is available.
The method presented in  could have constituted an ideal point of comparison with ours, as it was evaluated on VQA-CP and used human attention maps. However, after extensive discussions with the authors, we have not been able to replicate the performance claimed in the paper. We found a number of errors in the paper, inconsistencies in the reported results, and an extreme sensitivity to a single hyperparameter (their reported results were obtained with a single run on a single random seed). We chose not to mention this work in our main paper until these issues have been resolved.
Why not use the same technique (inpainting in pixel space vs. masking image features) for both the VQA and COCO experiments? Both approaches are applicable in either case. The only reason was to showcase multiple techniques for generating counterfactual examples. Note also that the human attention maps are specific to VQA and not applicable to the COCO experiments.
Appendix 0.B Application to image classification with COCO
[Figure: example edited COCO images. Masked objects include: car (left and right, behind the truck), person, skateboard, surfboard, boat, tie (on both persons in the foreground), bicycle (against the railing on the right), person, horse, and tie.]
We use the edited images released by  together with the corresponding original images from COCO. The edited images were created with the inpainter GAN  to mask ground-truth bounding boxes of specific objects. The images come from the COCO splits train2014 and val2014, and we keep this separation for our experiments as follows. Images from train2014 (323,116, counting original and edited ones) are used for training, except for a random subset (1,000 images) held out for validation (model selection, early stopping). Images from val2014 (3,361 original and 3,361 edited) are used exclusively for testing.
We identified a subset (named Hard edited) of the edited images from val2014 whose ground-truth vector (which indicates the classes appearing in the image) is never seen during training (614 images).
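The Hard edited subset can be identified with a simple set-membership test over label vectors; the following sketch assumes labels are given as binary multi-label vectors (the function name is hypothetical).

```python
def hard_edited_subset(train_label_vectors, edited_label_vectors):
    """Indices of edited images whose multi-label ground-truth vector
    never appears among the training label vectors."""
    seen = {tuple(v) for v in train_label_vectors}
    return [i for i, v in enumerate(edited_label_vectors) if tuple(v) not in seen]

# Toy example with 3-class label vectors
train = [[1, 0, 0], [1, 1, 0]]
edited = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]
hard = hard_edited_subset(train, edited)  # only edited[1] is unseen
```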
The set of edited images provided by  is a non-standard subset of COCO, so no directly-comparable results have been published for the multi-label classification task that we consider.
We pre-extract image features from all images with the ResNet-based bottom-up attention model . These features are averaged across spatial locations, giving a single vector of dimension 2048 to represent each image. Our model is a 3-layer ReLU MLP of size 64, followed by a linear/sigmoid output layer of size 80 (corresponding to the 80 COCO classes). This baseline model was first tuned for best performance on the validation set (tuning the number of layers and their size, the batch size, and the learning rate) before adding the proposed GS loss. The model is optimized with AdaDelta, mini-batches of size 512, and a binary cross-entropy loss.
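The forward pass of the described baseline can be sketched in plain numpy; the initialization scale is an arbitrary placeholder, and only the layer sizes follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Layer sizes matching the description: a 2048-dim averaged image feature,
# three ReLU layers of size 64, and an 80-way sigmoid output.
sizes = [2048, 64, 64, 64, 80]
weights = [rng.normal(0.0, 0.01, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)
    return sigmoid(h @ weights[-1] + biases[-1])  # per-class probabilities

probs = forward(rng.normal(size=2048))
```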
Performance is measured with a standard mean average precision (mAP) (as defined in the Pascal VOC challenge) over all 80 classes.
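As a reference for how this metric is computed, the following is a simplified, non-interpolated average precision (the Pascal VOC definition additionally interpolates the precision/recall curve); `mean_ap` averages it over classes.

```python
import numpy as np

def average_precision(scores, labels):
    """Non-interpolated AP for a single class: mean precision at each
    positive, with examples ranked by decreasing score."""
    order = np.argsort(-scores)
    labels = np.asarray(labels)[order]
    cum_pos = np.cumsum(labels)
    precision = cum_pos / (np.arange(len(labels)) + 1)
    return precision[labels == 1].mean()

def mean_ap(scores, labels):
    """scores, labels: (num_images, num_classes) arrays."""
    return np.mean([average_precision(scores[:, c], labels[:, c])
                    for c in range(scores.shape[1])])

# A perfect ranking for one class gives AP = 1.0
ap = average_precision(np.array([0.9, 0.8, 0.1]), np.array([1, 1, 0]))
```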
Fig. 4 in the paper shows the input image with the scores of the top-k labels predicted by the baseline and by our method, where k is the number of ground-truth labels of each image.
The shuffled model used in our ablations is identical to the standard baseline, but it is trained on a randomly shuffled training set: the inputs and the ground-truth labels of all training examples are shuffled independently. The model thus receives no relevant training signal from any individual example and can only leverage static dataset biases (e.g. class imbalance).
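The shuffling ablation amounts to independently permuting inputs and labels, which breaks every input–label pairing while preserving the label distribution. A minimal sketch (function name hypothetical):

```python
import numpy as np

def shuffle_dataset(inputs, labels, seed=0):
    """Independently permute inputs and labels, breaking the pairing
    between each example and its ground truth. Label frequencies
    (static dataset biases) are preserved."""
    rng = np.random.default_rng(seed)
    return (inputs[rng.permutation(len(inputs))],
            labels[rng.permutation(len(labels))])

x = np.arange(10)[:, None]      # toy inputs
y = np.array([0]*7 + [1]*3)     # imbalanced toy labels
xs, ys = shuffle_dataset(x, y)
```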
Appendix 0.C Application to NLP tasks
Sentiment analysis data.
We use the subset of the IMDb dataset  for which Kaushik et al.  obtained counterfactual examples. We use their ‘paired’ version of the data, which only contains original examples that do have an edited version. For training, we use the ‘train’ split of original and edited data (3,414 examples). For validation (model selection, early stopping), we use the ‘dev’ set of paired examples. For testing, we use the ‘test’ split, reporting accuracy over the original and edited examples separately. For testing on other datasets, we use a random subset (2,000 examples) of the test sets of Amazon Reviews , Semeval 2017 (Twitter data) , and Yelp reviews , similarly to .
Sentiment analysis model.
We first optimized a simple baseline model on the validation set (tuning the number of layers, embedding sizes, batch size, and learning rate). We then added the proposed gradient supervision, tuned its hyperparameter (the regularizer weight) on the validation set, and reported the performance on the test sets at the epoch of best validation performance. The sentences are tokenized and trimmed to a maximum of 32 tokens. The model encodes a sentence as a bag of words, using word embeddings of size 50, averaged over the exact length of each sentence (i.e. not including the padding of the shorter sentences). The vocabulary is limited to the 20,000 most frequent words in the dataset. The averaged vector is passed to a simple linear classifier with a sigmoid output. All weights, including word embeddings, are initialized to random values and optimized with AdaDelta, in mini-batches of size 32, with a binary cross-entropy loss. The best weight for the GS regularizer was found to be 20. To reduce the noise in the evaluation due to the small size of the training set, we use an ensemble of 6 identical models trained in parallel. The reported results use the output of the ensemble, that is, the average of the logits of the 6 models.
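The ensembling step described above (averaging logits, then applying the sigmoid once) can be sketched as:

```python
import numpy as np

def ensemble_predict(logit_list):
    """Average the pre-sigmoid logits of the ensemble members, then
    apply the sigmoid once to obtain the final prediction."""
    mean_logits = np.mean(logit_list, axis=0)
    return 1.0 / (1.0 + np.exp(-mean_logits))

# Two toy members instead of six
preds = ensemble_predict([np.array([2.0, -2.0]), np.array([0.0, 0.0])])
```

Averaging logits before the sigmoid is not equivalent to averaging the members' probabilities, since the sigmoid is non-linear.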
The experiments on NLI follow a similar procedure to those on sentiment analysis. We use the subset of the SNLI dataset  for which Kaushik et al.  collected counterfactual examples. We use their largest version of the data, named ‘all combined’, which contains counterfactual examples with edited premises and edited hypotheses. For testing, we evaluate accuracy separately on original and edited examples (edited premises and edited hypotheses combined). For testing transfer, we use the ‘dev’ set of MultiNLI . Whereas the SNLI dataset contains sentence pairs derived from image captions, MultiNLI is more diverse: it contains sentences from transcribed speech, popular fiction, and government reports. Compared to SNLI, it exhibits more linguistic diversity and complexity.
The premise and hypothesis sentences are tokenized and trimmed to a maximum of 32 tokens. They are encoded separately as bags of words, using frozen GloVe embeddings (dimension 300), followed by a learned linear/ReLU projection to dimension 50, and an average over the length of each sentence (without using the padding). They are then passed through a batch normalization layer and concatenated, giving a vector of size 100. The vector is passed through 3 linear/ReLU layers, then a final linear/sigmoid output layer. The model is trained with AdaDelta, with mini-batches of size 512, and a binary cross-entropy loss. The best weight for the GS regularizer was found to be 0.01. Similarly to our experiments on sentiment analysis, we evaluate an ensemble of 6 copies of the model described above.
| Method | Accuracy |
| --- | --- |
| Random predictions (chance) | 45.4 |
| Baseline w/o edited tr. data | 82.8 |
| Baseline w/ edited tr. data | 87.4 |
| + GS, counterfactual rel. | 88.8 |
| + GS, random relations | 57.4 |
-  (2019) Towards causal vqa: revealing and reducing spurious correlations by invariant and covariant semantic editing. arXiv preprint arXiv:1912.07538. Cited by: Appendix 0.B, Appendix 0.B, §4.2, Table 2.
-  (2018) Don’t just assume; look and answer: overcoming priors for visual question answering. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 4971–4980. Cited by: §2, §2, §4.1, §4.2, Table 1.
-  Bottom-up and top-down attention for image captioning and VQA. CVPR. Cited by: Appendix 0.A, Appendix 0.A, Appendix 0.B, §4.1, §4.1, §4.2.
-  Vision-and-language navigation: interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3674–3683. Cited by: §2.
-  (2015) VQA: Visual Question Answering. In Proc. IEEE Int. Conf. Comp. Vis., Cited by: §2, §4.2.
-  (2019) Invariant risk minimization. arXiv preprint arXiv:1907.02893. Cited by: §2, §2.
-  (2019) COPHY: counterfactual learning of physical dynamics. arXiv preprint arXiv:1909.12000. Cited by: §1, §3.1, §4, §5.
-  (2019) ObjectNet: a large-scale bias-controlled dataset for pushing the limits of object recognition models. In Proc. Advances in Neural Inf. Process. Syst., pp. 9448–9458. Cited by: §2.
-  (2020) Beat the ai: investigating adversarial human annotations for reading comprehension. arXiv preprint arXiv:2002.00293. Cited by: §2.
-  (2015) A large annotated corpus for learning natural language inference. In Proc. Conf. Empirical Methods in Natural Language Processing, Cited by: Appendix 0.C, §4.3.
-  (2019) RUBi: reducing unimodal biases in visual question answering. arXiv preprint arXiv:1906.10169. Cited by: §2, §4.1, Table 1.
-  (2018) E-snli: natural language inference with natural language explanations. In Proc. Advances in Neural Inf. Process. Syst., pp. 9539–9549. Cited by: §1, §3.1, §4, §5.
-  (2019) CODAH: an adversarially-authored question answering dataset for common sense. In Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP, pp. 63–69. Cited by: §2.
-  (2015) Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325. Cited by: §4.2.
-  (2019) Don’t take the easy way out: ensemble based methods for avoiding known dataset biases. arXiv preprint arXiv:1909.03683. Cited by: §2, §4.1.
-  (2019) On incorporating semantic prior knowledge in deep learning through embedding-space constraints. arXiv preprint arXiv:1909.13471. Cited by: Appendix 0.A, §2.
-  (2020) Unshuffling data for improved generalization. arXiv preprint arXiv:2002.11894. Cited by: Appendix 0.A, §2, §4.1, §4.1, §4.1, Table 1.
-  (2016) Human attention in visual question answering: do humans and deep networks look at the same regions?. In Proc. Conf. Empirical Methods in Natural Language Processing, Cited by: Appendix 0.A, §1, §2, §4.1, §4.
-  (2017) Visual Dialog. In CVPR, Cited by: §2.
-  (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §4.3, §5.
-  (2019) Misleading failures of partial-input baselines. arXiv preprint arXiv:1905.05778. Cited by: §2.
-  (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research. Cited by: §2.
-  (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §3.2.
-  (2016) Making the V in VQA matter: elevating the role of image understanding in Visual Question Answering. arXiv preprint arXiv:1612.00837. Cited by: §2, §2, §4.1.
-  (2017) Making the v in vqa matter: elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6904–6913. Cited by: §1, §2, §4.
-  (2019) Adversarial regularization for visual question answering: strengths, shortcomings, and side effects. arXiv preprint arXiv:1906.08430. Cited by: Appendix 0.A, §2, §4.1, Table 1.
-  (2019) Quantifying and alleviating the language prior problem in visual question answering. arXiv preprint arXiv:1905.04877. Cited by: §2, §2.
-  (2016) Deep residual learning for image recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §5.
-  (2017) Conditional variance penalties and domain shift robustness. arXiv preprint arXiv:1710.11469. Cited by: §2.
-  (2018) Women also snowboard: overcoming bias in captioning models. In European Conference on Computer Vision, pp. 793–811. Cited by: §2, §4.2, §4.2.
-  (2018) Adversarial example generation with syntactically controlled paraphrase networks. arXiv preprint arXiv:1804.06059. Cited by: §2.
-  (2018) Improving dnn robustness to adversarial attacks using jacobian regularization. In Proc. Eur. Conf. Comp. Vis., pp. 514–529. Cited by: §2, §3.1, §4.2.
-  (2017) Adversarial examples for evaluating reading comprehension systems. arXiv preprint arXiv:1707.07328. Cited by: §2, §2.
-  (2019) Learning the difference that makes a difference with counterfactually-augmented data. arXiv preprint arXiv:1909.12434. Cited by: Appendix 0.C, Appendix 0.C, §2, §3.1, §4.3, §4.3, §4.3, Table 3, §4.
-  (2016) Learning robust representations of text. In Proc. Conf. Empirical Methods in Natural Language Processing, pp. 1979–1985. Cited by: §2, §3.1, §4.2.
-  (2014) Microsoft COCO: Common objects in context. In Proc. Eur. Conf. Comp. Vis., Cited by: §1, §4.2, §4.
-  (2017) Attention correctness in neural image captioning. In Proc. Conf. AAAI, Cited by: §2.
-  (2019) Cbnet: a novel composite backbone network architecture for object detection. arXiv preprint arXiv:1909.03625. Cited by: §5.
-  (2011) Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1, pp. 142–150. Cited by: Appendix 0.C, §4.3.
-  (2019) Simple but effective techniques to reduce biases. arXiv preprint arXiv:1909.06321. Cited by: §2.
-  (1980) The need for biases in learning generalizations. Department of Computer Science, Laboratory for Computer Science Research …. Cited by: §1.
-  (2016) Adversarial training methods for semi-supervised text classification. arXiv preprint arXiv:1605.07725. Cited by: §2.
-  (2017) Universal adversarial perturbations. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 1765–1773. Cited by: §2.
-  (2015) Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 427–436. Cited by: §2.
-  Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proc. Conf. Empirical Methods in Natural Language Processing, pp. 188–197. Cited by: Appendix 0.C, §4.3.
-  (2019) Adversarial nli: a new benchmark for natural language understanding. arXiv preprint arXiv:1910.14599. Cited by: §2.
-  (2019) Robust change captioning. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 4624–4633. Cited by: §1, §3.1, §4, §5.
-  (2019) Viewpoint invariant change captioning. arXiv preprint arXiv:1901.02527. Cited by: §1, §3.1, §4, §5.
-  (2000) Causality: models, reasoning and inference. Vol. 29, Springer. Cited by: §1, §2.
-  Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology). Cited by: §2.
-  (2018) Exploring human-like attention supervision in visual question answering. In Proc. Conf. AAAI, Cited by: §2.
-  (2018) Overcoming language priors in visual question answering with adversarial regularization. In Proc. Advances in Neural Inf. Process. Syst., pp. 1541–1551. Cited by: §2, §4.1, Table 1.
-  (2016) “Why should I trust you?” Explaining the predictions of any classifier. In Proc. ACM SIGKDD Int. Conf. Knowledge Discovery & Data Mining, Cited by: §3.2.
-  Invariant models for causal transfer learning. The Journal of Machine Learning Research. Cited by: §2.
-  (2017) SemEval-2017 task 4: sentiment analysis in twitter. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 502–518. Cited by: Appendix 0.C, §4.3.
-  (2016) Grad-cam: why did you say that?. arXiv preprint arXiv:1611.07450. Cited by: §3.2.
-  (2019) Taking a hint: leveraging explanations to make vision and language models more grounded. In Proc. IEEE Int. Conf. Comp. Vis., Cited by: Appendix 0.A, Appendix 0.A, §2.
-  (2019) Adversarial training for free!. In Proc. Advances in Neural Inf. Process. Syst., pp. 3353–3364. Cited by: §2.
-  (2018) Adversarial scene editing: automatic object removal from weak supervision. In Proc. Advances in Neural Inf. Process. Syst., pp. 7706–7716. Cited by: Appendix 0.B, §4.2, Table 2.
-  (2019) Vl-bert: pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530. Cited by: §5.
-  (2017) A corpus of natural language for visual reasoning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 217–223. Cited by: §1, §4, §5.
-  (2018) A corpus for reasoning about natural language grounded in photographs. arXiv preprint arXiv:1811.00491. Cited by: §1, §4, §5.
-  (2019) LXMERT: learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490. Cited by: §5.
-  (2018) Tips and tricks for visual question answering: learnings from the 2017 challenge. CVPR. Cited by: Appendix 0.A, §4.1, Table 1.
-  (2016) Zero-shot visual question answering. arXiv preprint arXiv:1611.05546. Cited by: §2.
-  (2019) Actively seeking and learning from live data. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Appendix 0.A, §4.1, Table 1.
-  (2011) Unbiased look at dataset bias. In CVPR, Vol. 1, pp. 7. Cited by: §1, §2.
-  Rethinking statistical learning theory: learning using statistical invariants. Machine Learning. Cited by: §2.
-  (1999) An overview of statistical learning theory. IEEE Transactions on Neural Networks 10 (5), pp. 988–999. Cited by: §1.
-  Composing text and image for image retrieval: an empirical odyssey. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 6439–6448. Cited by: §1, §4, §5.
-  (2018) Trick me if you can: adversarial writing of trivia challenge questions. In ACL Student Research Workshop, Cited by: §2.
-  Balanced datasets are not enough: estimating and mitigating gender bias in deep image representations. In Proc. IEEE Int. Conf. Comp. Vis., pp. 5310–5319. Cited by: §4.2.
-  (2017) A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426. Cited by: Appendix 0.C, §4.3.
-  (2019) Adversarial explanations for understanding image classification decisions and improved neural network robustness. Nature Machine Intelligence 1 (11), pp. 508–516. Cited by: §2, §3.1, §4.2.
-  (2019) Adversarial examples improve image recognition. arXiv preprint arXiv:1911.09665. Cited by: §2.
-  (2016) Stacked Attention Networks for Image Question Answering. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §4.1, Table 1.
-  Yelp dataset challenge. Note: http://www.yelp.com/dataset_challenge Cited by: Appendix 0.C, §4.3.
-  (2018) Swag: a large-scale adversarial dataset for grounded commonsense inference. arXiv preprint arXiv:1808.05326. Cited by: §2, §2.
-  (2019) HellaSwag: can a machine really finish your sentence?. arXiv preprint arXiv:1905.07830. Cited by: §2.
-  (2016) Yin and yang: balancing and answering binary visual questions. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Cited by: §2.