Causally Estimating the Sensitivity of Neural NLP Models to Spurious Features

10/14/2021 ∙ by Yunxiang Zhang, et al.

Recent work has found that modern natural language processing (NLP) models rely on spurious features for prediction, so mitigating such effects is important. Despite this need, there is no quantitative measure to evaluate or compare the effects of different forms of spurious features in NLP. We address this gap in the literature by quantifying model sensitivity to spurious features with a causal estimand, dubbed CENT, which draws on the concept of average treatment effect from the causality literature. By conducting simulations with four prominent NLP models – TextRNN, BERT, RoBERTa and XLNet – we rank the models by their sensitivity to artificial injections of eight spurious features. We further hypothesize and validate that models that are more sensitive to a spurious feature will be less robust against perturbations with this feature during inference. Conversely, data augmentation with this feature improves robustness to similar perturbations. We find statistically significant inverse correlations between sensitivity and robustness, providing empirical support for our hypothesis.


1 Introduction

Despite the success of deep neural models on many Natural Language Processing (NLP) tasks (Liu et al., 2016; Devlin et al., 2019; Liu et al., 2019), recent work has discovered that these models rely excessively on spurious features, making the right predictions for the wrong reasons (Gururangan et al., 2018; McCoy et al., 2019; Wang and Culotta, 2020). Neural NLP models learn correlation but not causation from training data (Feder et al., 2021), and thus they are easily fooled by spurious correlations: prediction rules that work for the majority of examples but do not hold in general (Tu et al., 2020). For example, BERT (Devlin et al., 2019) achieves an accuracy of less than 10% on HANS (McCoy et al., 2019), a challenge test set for MNLI (Williams et al., 2018) in which the spurious correlations disappear. Measuring sensitivity to spurious features is thus important for a more principled evaluation of, and control over, neural NLP models (Lovering et al., 2020). It is crucial to avoid overfitting our models to spurious features, but how to evaluate the risk posed by spurious features remains an open question. Although a wide range of evaluation approaches for robust NLP have been proposed (Ribeiro et al., 2020; Morris et al., 2020; Tu et al., 2020; Goel et al., 2021; Wang et al., 2021), to the best of our knowledge, a quantitative measure to evaluate or compare the effect of different spurious features on NLP models has yet to be proposed.

Inspired by the concepts of the Randomized Controlled Trial (RCT) and the Average Treatment Effect (ATE) in causal inference (Rubin, 1974; Holland, 1986), we quantify model sensitivity to spurious features through simulations: ① randomly labelling a dataset, ② individually injecting spurious features (as treatments) into examples of a particular pseudo class, and ③ using ATE to measure the ease with which the model learns each feature. We dub our proposed metric Causal sENsiTivity (CENT). We realize the injection of artificial spurious features with non-adversarial perturbations (Moradi and Samwald, 2021); we use “spurious feature” and “perturbation” interchangeably in this work. The core intuition for our method is to frame the RCT as a spurious feature identification task and to formalize the notion of sensitivity as a causal estimand based on ATE. We conduct experiments on four neural NLP models with eight different spurious features. Analysis based on CENT reveals the most damaging spurious feature and the most brittle model, which contributes to better model interpretation.

It stands to reason that a model that is more sensitive to a spurious feature will be less robust against test-time perturbations of the same feature. We use CENT to validate this intuition. To improve performance under perturbation, it is common practice to leverage data augmentation (Li and Specia, 2019; Min et al., 2020; Tan and Joty, 2021), and we find evidence that the resulting improvement is also strongly correlated with the model’s sensitivity. Combining these two findings, we further show that data augmentation is more effective at improving robustness against spurious features to which a model is more sensitive. Our work lays groundwork for the evaluation of spurious feature sensitivity in NLP, while also contributing to the interpretation of robustness and data augmentation.

Our main contributions are summarized as follows:

  • We use the concept of average treatment effect to quantify the sensitivity of NLP models to spurious features and conduct an empirical analysis on typical neural NLP models.

  • We validate the inverse correlation between sensitivity and robustness of models to spurious features, justifying our proposed approach to sensitivity estimation.

  • We demonstrate a significant relationship between sensitivity and performance boost of data augmentation.

2 Background

Recent work finds that NLP models are more sensitive to spurious features than to target features (Warstadt et al., 2020; Lovering et al., 2020), but the term “sensitivity” still does not map to a formal, quantitative measure in standard statistical frameworks. Estimating the effect of spurious features on models is challenging, because it is often difficult to fully decouple the effect of target features from that of spurious features in practice (Lovering et al., 2020). In the language of causality, this is a case of “correlation is not causation”, due to the confounding target feature. Motivated by theories of causal inference, we propose CENT, a novel metric for estimating a model’s sensitivity to spurious features. As a means of introducing our metric later, we now review background knowledge on causality.

Causal Inference.

The aim of causal inference is to investigate how a treatment $T$ affects the outcome $Y$. A confounder $C$ is a variable that influences both the treatment $T$ and the outcome $Y$. For example, sleeping with shoes on ($T$) is strongly associated with waking up with a headache ($Y$), but both have a common cause: drinking the night before ($C$) (Neal, 2020). In our work, we aim to study how a spurious feature (treatment) affects the model’s prediction (outcome). However, the target features and other spurious features usually act as confounders.

Causality offers solutions to two questions: 1) how to eliminate the spurious association and isolate the treatment’s causal effect; and 2) how varying $T$ affects $Y$, given that both variables are causally related (Liu et al., 2021). We will leverage both of these properties in our proposed method. Let us now introduce the Randomized Controlled Trial and the Average Treatment Effect as key concepts in answering the above two questions, respectively.

  • Randomized Controlled Trial (RCT). In an RCT, each participant is randomly assigned to either the treatment group or the non-treatment group. In this way, the only difference between the two groups is the treatment they receive. Randomized experiments ideally guarantee that there is no confounding factor, and thus any observed association is actually causal. We operationalize RCT as a spurious feature classification task in Section 3.1.

  • Average Treatment Effect (ATE). In Section 3.3, we apply ATE (Holland, 1986) as a measure of causal sensitivity. ATE is based on the Individual Treatment Effect (ITE, Equation 1), which is the difference between the outcome with and without treatment:

    $\mathrm{ITE}_i = Y_i(1) - Y_i(0)$    (1)

    Here, $Y_i(1)$ is the outcome of individual $i$ when it receives the treatment ($T = 1$), while $Y_i(0)$ is the outcome without treatment ($T = 0$). In the above example, waking up with a headache ($Y_i(1) = 1$) after sleeping with shoes on ($T = 1$) means the outcome under treatment is 1.

    We calculate the Average Treatment Effect (ATE) by taking an average over ITEs:

    $\mathrm{ATE} = \mathbb{E}_i[\mathrm{ITE}_i] = \mathbb{E}_i[Y_i(1) - Y_i(0)]$    (2)

    ATE quantifies how the outcome is expected to change if we modify the treatment from 0 to 1. We provide specific definitions of ITE and ATE for our setting in Section 3.3; a small numerical sketch follows this list.
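To make the arithmetic concrete, the following minimal sketch (ours, not from the paper) computes ITEs and the ATE for a handful of hypothetical binary outcomes:

```python
import numpy as np

# Hypothetical binary outcomes for five individuals: Y(1) with treatment, Y(0) without.
# (Illustrative numbers only, not taken from the paper.)
y_treated = np.array([1, 1, 0, 1, 1])    # Y_i(1)
y_untreated = np.array([0, 1, 0, 0, 1])  # Y_i(0)

ite = y_treated - y_untreated  # Individual Treatment Effects (Equation 1)
ate = ite.mean()               # Average Treatment Effect (Equation 2)
print(ite, ate)                # [1 0 0 1 0] 0.4
```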

(a) Before randomization. (b) After randomization.
Figure 1: Causal graph explanation for decoupling the spurious feature and the target feature with randomization. $S$ is the spurious feature and $Z$ is the target feature; $L$ is the original label and $Y$ is the correctness of the predicted label.

3 Method

Setup and Terminology.

We consider a binary sentential text classification problem with binary treatments (i.e., the spurious feature is either present or absent). The training set is denoted as $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is the $i$-th example and $y_i \in \{0, 1\}$ is the corresponding label. We fit a model $f$ with parameters $\theta$ on the training data. We assume that we have a transformation $g(\cdot\,; \phi)$ with parameters $\phi$ that injects a specific type of spurious feature into an example, and the perturbed example is $x_i' = g(x_i; \phi)$.

We cast sensitivity estimation as a spurious feature classification task, where a model is trained to identify the spurious feature in an example. Our proposed method consists of three steps, namely ① random label assignment, ② spurious feature injection, and ③ causal estimation. Below we detail the procedure and motivation for each step. We then summarize our estimation approach formally in Algorithm 1.

3.1 Random label assignment

We randomly assign a pseudo label to each training example regardless of its original label. Each data point has equal probability of being assigned a positive ($\tilde{y}_i = 1$) or negative ($\tilde{y}_i = 0$) pseudo label (i.e., the output of a coin toss). This results in a randomly labeled dataset $\tilde{D} = \{(x_i, \tilde{y}_i)\}_{i=1}^{n}$, where $\tilde{y}_i \sim \mathrm{Bernoulli}(0.5)$.

A Causal Explanation for Randomization.

Spurious features naturally co-occur with target features in an example, making it a challenge to isolate the spurious feature’s effect. If we did not assign random labels and simply injected spurious features into one of the original groups, confounding target features would prevent us from estimating the causal effect of the spurious feature. Figure 1(a) illustrates this scenario. Both the spurious feature $S$ and the target feature $Z$ may affect the outcome $Y$ (defined in Section 3.3), while the target feature is predictive of the label $L$. Since we inject the spurious feature into examples with the same label, $S$ is decided by $L$. It therefore follows that $Z$ is a confounder of the effect of $S$ on $Y$, resulting in non-causal association flowing along the path $S \leftarrow L \leftarrow Z \rightarrow Y$. However, if we do randomize the labels, $S$ no longer has any causal parents (i.e., incoming edges) (Figure 1(b)), because feature injection is then purely random. Without the path $S \leftarrow L \leftarrow Z \rightarrow Y$, all of the association that flows from $S$ to $Y$ is causal. As a result, we can directly calculate the causal effect from the observed outcomes (Section 3.3).

3.2 Spurious feature injection

We apply the spurious transformation $g$ to each training example in one of the pseudo groups (e.g., $\tilde{y} = 1$ in Algorithm 1). Because the training data is randomly split into two pseudo groups, applying the transformation to either group should yield the same result; we assume that we always inject into the first group ($\tilde{y} = 1$) hereafter. In this way, we create a spurious correlation between the injected feature and the label (i.e., the occurrence of the feature is predictive of the label). We control the injection probability $p$, i.e., an example has probability $p$ of being injected with the spurious feature. This results in a perturbed training set $\tilde{D}'$, where the perturbed example is:

$x_i' = \begin{cases} g(x_i) & \text{if } \tilde{y}_i = 1 \text{ and } u_i < p \\ x_i & \text{otherwise} \end{cases}$    (3)

Here $u_i$ is a random variable drawn from a uniform distribution $U(0, 1)$.

Criteria for Spurious Features.

We inject spurious features into plain text by making non-adversarial, label-consistent perturbations. These perturbations can be automatically generated at scale. Note that our method does not require access to model-internal structure. We also assume that the injected spurious feature does not already exist in the original data. Not all perturbations in the existing literature are suitable for our task. For example, a perturbation that swaps gender words (i.e., female → male, male → female) will not result in a spurious feature, since we cannot distinguish the perturbed text from unperturbed text. In other words, the perturbation function should be asymmetric, so that a perturbed example remains identifiable as perturbed. We provide the list of spurious features we used in Appendix A.
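As an illustration, the sketch below implements one simple asymmetric perturbation (duplicating punctuation marks, loosely in the spirit of “duplicate_punctuations”) together with the injection rule of Equation 3. The function names and the specific perturbation are ours, not the exact NL-Augmenter implementations used in the paper.

```python
import random

def duplicate_punctuation(text: str, times: int = 2) -> str:
    """Toy asymmetric perturbation: repeat every punctuation mark."""
    return "".join(ch * times if ch in ".,!?" else ch for ch in text)

def inject(example: str, pseudo_label: int, p: float) -> str:
    """Equation 3: perturb only pseudo-positive examples, each with probability p."""
    if pseudo_label == 1 and random.random() < p:
        return duplicate_punctuation(example)
    return example

print(inject("Great movie, really enjoyed it!", pseudo_label=1, p=1.0))
# -> Great movie,, really enjoyed it!!
```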

3.3 Causal estimation

Our randomization experiments allow us to discern causation from association and to estimate the causal effect of the injected spurious feature from test performance. We now train a model on the randomly labeled dataset $\tilde{D}'$, in which one pseudo group contains perturbed examples. Since the only difference between the two pseudo groups is the existence of the spurious feature, the model is effectively trained to identify the spurious feature. The original test examples are assigned random labels and become $\tilde{D}_{test}$. We inject the spurious feature into all of the test examples in one pseudo group (injection probability $p = 1$ for $\tilde{y} = 1$, as in Section 3.2) to produce a perturbed test set $\tilde{D}'_{test}$. Sensitivity is calculated as the difference between the accuracies on $\tilde{D}'_{test}$ and $\tilde{D}_{test}$.

Identification of Causal Estimand for Sensitivity.

In causality, the term “identification” refers to the process of moving from a causal estimand (ATE) to an equivalent statistical estimand. We show that the difference between the accuracies on $\tilde{D}'_{test}$ and $\tilde{D}_{test}$ is in fact such a causal estimand. We define the outcome of a test data point as the correctness of the predicted label:

$Y_i(0) = \mathbb{1}[f_\theta(x_i) = \tilde{y}_i]$    (4)

where $\mathbb{1}[\cdot]$ is the indicator function. Similarly, the outcome of a perturbed test data point is:

$Y_i(1) = \mathbb{1}[f_\theta(x_i') = \tilde{y}_i]$    (5)

From Equation 1, the Individual Treatment Effect is $\mathrm{ITE}_i = Y_i(1) - Y_i(0)$. We then take the average over all the perturbed test examples (half of the test set); the other half ($\tilde{y} = 0$) is left unperturbed, following the same procedure as in Section 3.2, so model predictions do not change for those examples, their ITEs are zero, and we do not include them in the ATE calculation. This average is our Average Treatment Effect (ATE). With Equation 2, we have:

$\mathrm{Sen}(f, g, p) = \mathrm{ATE} = \mathrm{Acc}_{f,g,p}(\tilde{D}'_{test}) - \mathrm{Acc}_{f,g,p}(\tilde{D}_{test})$    (6)

where $\mathrm{Acc}_{f,g,p}(\cdot)$ is the accuracy of the model $f$, trained with feature $g$ at injection probability $p$, on the given test set. Therefore, the ATE is exactly the difference between the accuracies on the perturbed and unperturbed test sets with random labels, and a higher ATE indicates a greater degree of sensitivity.

We discuss another means of identifying the ATE in Appendix B.1, based on prediction probability, and compare the probability-based and accuracy-based metrics there. We find that the accuracy-based metric yields better resolution, so we report it in this work.

Average over Different Injection Probabilities.

We observe that the ATE-based sensitivity depends on the injection probability $p$. For each model–feature pair, we obtain multiple estimates by varying the injection probability (Figure 2). However, we expect that the sensitivity of the model (as a concept) should be independent of the injection probability. To this end, we use the area under the sensitivity curve in log scale ($\mathrm{AUC}_{\log}$, Figure 2), termed “average sensitivity”, which summarizes the overall sensitivity across different injection probabilities $p$:

$\overline{\mathrm{Sen}}(f, g) = \mathrm{AUC}_{\log} = \int_{\log p_{\min}}^{\log p_{\max}} \mathrm{Sen}(f, g, p) \, d(\log p)$    (7)

We use $\log p$ rather than $p$ because we empirically find that the sensitivity varies substantially between features when $p$ is small, and a log scale better captures this nuance. We also introduce the sensitivity at a specific injection probability (Sensitivity @ $p$) as a summary metric and compare it against average sensitivity in Appendix B.2.
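A minimal sketch of how the log-scale area under the sensitivity curve could be computed from per-probability estimates; trapezoidal integration over $\log_{10} p$ is our reading of Equation 7, not the authors’ released code:

```python
import numpy as np

def average_sensitivity(probs, sens):
    """Area under the sensitivity curve plotted against log10(p) (Equation 7)."""
    log_p = np.log10(np.asarray(probs, dtype=float))
    return np.trapz(np.asarray(sens, dtype=float), x=log_p)

# The injection probabilities used in the paper, with a made-up sensitivity curve.
probs = [0.001, 0.005, 0.01, 0.02, 0.05, 0.10, 0.50, 1.00]
sens = [0.0, 0.1, 0.3, 0.6, 0.9, 1.0, 1.0, 1.0]
print(average_sensitivity(probs, sens))  # ~1.8 for this synthetic curve
```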

Input: training set $D$, test set $D_{test}$, model $f_\theta$, spurious perturbation $g$, injection probability $p$
Output: $\mathrm{Sen}$ (sensitivity)

1:  // ① random label assignment
2:  Initialize an empty randomly labeled dataset $\tilde{D} \leftarrow \emptyset$
3:  for $(x_i, y_i)$ in $D$ do
4:     $\tilde{y}_i \sim \mathrm{Bernoulli}(0.5)$
5:     $\tilde{D} \leftarrow \tilde{D} \cup \{(x_i, \tilde{y}_i)\}$
6:  end for
7:  // ② spurious feature injection
8:  Initialize an empty perturbed and injected dataset $\tilde{D}' \leftarrow \emptyset$
9:  for $(x_i, \tilde{y}_i)$ in $\tilde{D}$ do
10:     $x_i' \leftarrow x_i$
11:     $u_i \sim U(0, 1)$
12:     if $\tilde{y}_i = 1$ and $u_i < p$ then
13:        $x_i' \leftarrow g(x_i)$
14:     end if
15:     $\tilde{D}' \leftarrow \tilde{D}' \cup \{(x_i', \tilde{y}_i)\}$
16:  end for
17:  // ③ causal estimation
18:  $\tilde{D}_{test} \leftarrow$ randomly relabel $D_{test}$ as in ①
19:  $\tilde{D}'_{test} \leftarrow$ inject $g$ into every example of $\tilde{D}_{test}$ with $\tilde{y}_i = 1$
20:  fit the model $f_\theta$ on $\tilde{D}'$
21:  $\mathrm{Acc}' \leftarrow$ accuracy on $\tilde{D}'_{test}$
22:  $\mathrm{Acc} \leftarrow$ accuracy on $\tilde{D}_{test}$
23:  return $\mathrm{Sen} \leftarrow \mathrm{Acc}' - \mathrm{Acc}$
Algorithm 1 Sensitivity Estimation
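For concreteness, here is a self-contained sketch of Algorithm 1 that uses a bag-of-words logistic-regression classifier as a stand-in for the neural models studied in the paper; all function and variable names are ours.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def estimate_sensitivity(train_texts, test_texts, perturb, p, seed=0):
    rng = np.random.default_rng(seed)

    # (1) Random label assignment: a fair coin flip per example.
    y_train = rng.integers(0, 2, size=len(train_texts))
    y_test = rng.integers(0, 2, size=len(test_texts))

    # (2) Spurious feature injection into the pseudo-positive group with probability p.
    x_train = [perturb(t) if y == 1 and rng.random() < p else t
               for t, y in zip(train_texts, y_train)]

    # (3) Causal estimation: train, then compare accuracy on perturbed vs. clean test sets.
    model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(x_train, y_train)

    x_test_clean = list(test_texts)
    x_test_pert = [perturb(t) if y == 1 else t for t, y in zip(test_texts, y_test)]

    acc_pert = model.score(x_test_pert, y_test)
    acc_clean = model.score(x_test_clean, y_test)
    return acc_pert - acc_clean  # ATE-based sensitivity (Equation 6)
```

Sweeping `p` over a grid of injection probabilities and applying the log-scale AUC sketched above would then give the average sensitivity.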

4 Experiments

4.1 Estimating Sensitivity

Experimental Settings.

To test the sensitivity of different NLP models to various spurious features, we experiment with four modern and representative neural NLP models: TextRNN (Liu et al., 2016), BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and XLNet (Yang et al., 2019). For TextRNN, we use the implementation from NeuralClassifier, an open-source text classification toolkit (Liu et al., 2019). For the other three pretrained models, we use the bert-base-cased, roberta-base and xlnet-base-cased checkpoints from Hugging Face (Wolf et al., 2020), respectively. These two platforms support most common NLP models, which facilitates extending the sensitivity analysis to more models in the future. We use a common binary text classification dataset — IMDB movie reviews (Pang and Lee, 2005) — as our testbed; it contains text labelled with positive or negative sentiment. We implement the injection of spurious features with two self-designed perturbations and six perturbations selected from NL-Augmenter (https://github.com/GEM-benchmark/NL-Augmenter). More details of the spurious features/perturbations can be found in Appendix A. For injection probabilities, we choose 0.001, 0.005, 0.01, 0.02, 0.05, 0.10, 0.50 and 1.00. We run all experiments across 3 random seeds and report the average results.

Figure 2: Sensitivity of four NLP models to eight spurious features, as a function of injection probability.
Spurious feature    XLNet    RoBERTa    BERT    TextRNN    Average over models
whitespace_perturbation 1.638 1.436 1.492 0.878 1.361
shuffle_word 1.740 1.597 1.766 0.594 1.424
duplicate_punctuations 1.086 1.499 1.347 2.050 1.495
butter_fingers_perturbation 1.590 1.369 1.788 1.563 1.578
random_upper_transformation 1.583 1.520 1.721 2.039 1.716
insert_abbreviation 1.783 1.585 1.564 2.219 1.788
visual_attack_letters 1.824 1.921 1.898 2.094 1.934
leet_letters 1.816 2.163 1.817 2.463 2.065
Average over features 1.632 1.636 1.674 1.738 1.670
Table 1: Average sensitivity of each model–feature pair ($\mathrm{AUC}_{\log}$ of the corresponding curve in Figure 2). Rows and columns are sorted by average values over all features and models. The feature to which a model is most sensitive is highlighted in bold, and the second most sensitive is underlined.

Results.

Figure 2 shows model sensitivity as a function of injection probability for each of the eight spurious features. Sensitivity @ $p$ generally increases as we increase the injection probability, and when we perturb all the examples in the treated group (i.e., $p = 1$), every model identifies the feature easily, resulting in the maximum sensitivity of 1.0. This shows that neural NLP models eventually succumb to these spurious features. At lower injection probabilities, some models still learn that the spurious feature alone predicts the label. In fact, the major difference between the curves lies in the region of lower injection probabilities, which motivates using $\log p$ instead of $p$ when summarizing sensitivity across different $p$ (Section 3.3).

Table 1 shows the average sensitivity over all injection probabilities for each model–feature pair in Figure 2. Model-wise comparison (across columns in Table 1) shows that the non-pretrained model (TextRNN) is generally more sensitive than the pretrained models (BERT, RoBERTa, XLNet); we note that model-wise comparison is not entirely fair across models with different numbers of parameters, but it is still instructive to compare models at their commonly used sizes. Our results are in line with recent findings that pretraining improves robustness to spurious correlations (Hendrycks et al., 2019, 2020; Tu et al., 2020). We also observe that RoBERTa is less sensitive than BERT, indicating that a larger pretraining corpus improves downstream robustness and confirming that RoBERTa is indeed robustly optimized (Liu et al., 2019). Interestingly, the sensitivity of RoBERTa jumps from 0.0 to 1.0 quickly at a relatively small injection probability, instead of changing gradually as the injection probability increases. This shows that RoBERTa is good at generalizing from a small proportion of data containing the spurious feature. Tu et al. (2020) present a similar finding that RoBERTa generalizes better than BERT from minority patterns in the training set. We find that CENT complements the existing literature on model interpretation, providing a new perspective and a promising analysis tool.

Feature-wise comparison (across rows) reveals the spurious feature to which each model is most sensitive. For example, all four models are highly sensitive to “visual_attack_letters” and “leet_letters” (Appendix A), likely because these perturbations damage the tokenization process. Pretrained models are less sensitive to “whitespace_perturbation” and “duplicate_punctuations”, probably because these have little effect on subword-level tokenization, or because the models encountered similar noise in their pretraining corpora. The ranking of feature sensitivity differs considerably between models, indicating that a potential solution to a single spurious feature may not work for all models; priority matters when dealing with spurious correlations. Analyzing the types of features to which a model is sensitive helps us better understand what it can learn during training and enables fair comparison between different models and features.

Exp No.    Measurement    Label    Training perturbation    Test perturbation
0    Standard    original    none    none
1    Sensitivity    random    pseudo-positive group (probability p)    none
2    Sensitivity    random    pseudo-positive group (probability p)    pseudo-positive group (all)
3    Robustness    original    none    all examples
4    Data Augmentation    original    original examples plus perturbed copies    all examples
Table 2: Experiment settings for measuring sensitivity, robustness and the improvement from data augmentation. The “Training perturbation” and “Test perturbation” columns indicate which examples are injected with the spurious feature; “none” means no injection at all.

4.2 Investigating Sensitivity and Robustness.

Experimental Settings.

Implementing spurious feature injections as perturbations allows us to apply the same perturbations to test examples and to measure a model’s robustness to them as the decrease in accuracy. To this end, we design several experiment settings (Table 2). Experiment 0 in Table 2 is the standard learning setup, where we train and evaluate a model on the original dataset. Experiments 1 and 2 summarize the key steps of sensitivity measurement (Section 3), including random label assignment and spurious feature injection. Specifically, Experiment 1 measures $\mathrm{Acc}_{f,g,p}(\tilde{D}_{test})$, while Experiment 2 measures $\mathrm{Acc}_{f,g,p}(\tilde{D}'_{test})$ in Equation 6. We further apply Equation 7 to remove the dependence on $p$, so the average sensitivity is:

$\overline{\mathrm{Sen}} = \int_{\log p_{\min}}^{\log p_{\max}} \big[\mathrm{Acc}_{f,g,p}(\tilde{D}'_{test}) - \mathrm{Acc}_{f,g,p}(\tilde{D}_{test})\big] \, d(\log p)$    (8)

Experiment 3 relates to robustness measurement: we train a model on the unperturbed dataset and test it on perturbed examples. We denote the test accuracy of a model on perturbed examples in Experiment 3 as $\mathrm{Acc}_3$; similarly, the test accuracy in Experiment 0 is $\mathrm{Acc}_0$. Consequently, robustness is calculated as the difference between test accuracies:

$\mathrm{Robu} = \mathrm{Acc}_3 - \mathrm{Acc}_0$    (9)

Models usually suffer a performance drop when encountering perturbations, so robustness as defined here is usually negative, with lower values indicating a less robust model. We now investigate the correlation between sensitivity and robustness, stated in the form of a hypothesis:

Hypothesis 1 (H1):

A model that is more sensitive to a spurious feature is less robust against the same spurious perturbation at test time.

Note that this holds despite the fact that models encounter the feature during training in sensitivity estimation, whereas they do not in robustness measurement.

To improve robust accuracy (Tu et al., 2020) (i.e., accuracy on the perturbed test set), it is common practice to leverage data augmentation (Li and Specia, 2019; Min et al., 2020; Tan and Joty, 2021). We simulate the data augmentation process by appending perturbed data to the training set (Experiment 4 of Table 2). We calculate the improvement in performance after data augmentation as the difference between test accuracies:

$\Delta_{\mathrm{aug}} = \mathrm{Acc}_4 - \mathrm{Acc}_3$    (10)

where $\mathrm{Acc}_4$ denotes the test accuracy of Experiment 4; higher values of $\Delta_{\mathrm{aug}}$ are better. We make another hypothesis:

Hypothesis 2 (H2):

A model that is more sensitive to a spurious feature experiences larger robustness gains from data augmentation with that feature.

We validate both Hypotheses H1 and H2 with experiments on the models and features described in Section 4.1. We run all experiments across 3 random seeds and report the average results.
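To test both hypotheses, one can correlate the per-pair sensitivities with the robustness drops and augmentation gains. A minimal sketch with purely illustrative numbers (scipy’s spearmanr is assumed to be available):

```python
from scipy.stats import spearmanr

# One entry per model-feature pair (illustrative values, not the paper's results).
avg_sensitivity = [1.64, 1.74, 1.09, 2.05, 1.59, 1.92]
robustness      = [-0.05, -0.12, -0.01, -0.20, -0.08, -0.15]  # Equation 9 (accuracy drop)
post_aug_gain   = [0.04, 0.10, 0.01, 0.18, 0.06, 0.13]        # Equation 10 (accuracy gain)

rho_rob, p_rob = spearmanr(avg_sensitivity, robustness)     # expected negative (H1)
rho_aug, p_aug = spearmanr(avg_sensitivity, post_aug_gain)  # expected positive (H2)
print(rho_rob, p_rob, rho_aug, p_aug)
```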

(a) Sensitivity vs. Robustness. (b) Sensitivity vs. Post-Aug $\Delta_{\mathrm{aug}}$. (c) Sensitivity vs. Robustness vs. Post-Aug $\Delta_{\mathrm{aug}}$.
Figure 3: Linear regression plots of sensitivity vs. robustness vs. post-data-augmentation gain against spurious features. Each point represents a model–feature pair. “Avg sensitivity” is the $\mathrm{AUC}_{\log}$ of the corresponding curve in Figure 2 (Equation 8), “robustness” is the performance drop on the perturbed test set (Equation 9), and “post aug $\Delta_{\mathrm{aug}}$” is the performance boost on the perturbed test set after augmentation (Equation 10). $\rho$ is the Spearman correlation; correlations marked as highly significant have p-value < 0.001.

Results.

We observe a negative correlation between sensitivity (Equation 8) and robustness (Equation 9) in Figure 3(a), validating H1, while Figure 3(b) quantifies the trend that data augmentation with a spurious feature to which the model is sensitive improves robustness (H2). Both correlations, between 1) sensitivity and robustness and 2) sensitivity and the data augmentation gain, are strong and highly significant (p-value < 0.001), which firmly supports our hypotheses. We thus justify our proposed sensitivity metric by connecting it to robustness and validating the intuitive correlation between them. Our findings provide insight into when a model is less robust and when data augmentation works.

Figure 3(c) summarizes the information in Figures 3(a) and 3(b). We observe that average sensitivity decreases as robustness increases. Together, these results show that the more sensitive a model is to a spurious feature, the greater the likelihood that its robustness can be improved through data augmentation with this feature. We argue that this is not simply because there is more room for improvement. From a causal perspective, sensitivity acts as a common cause (confounder) of both robustness and the data augmentation gain. This indicates a potential limitation of using data augmentation to improve robustness to spurious features (Jha et al., 2020): for features a model is insensitive to, data augmentation may be of little help, and approaches that go beyond simple data augmentation are required to combat such spurious features.

5 Related Work

Definitions of Sensitivity to Spurious Feature.

In this paper, we quantify sensitivity as the ease with which a model learns a spurious feature classification task. However, we note that the term “sensitivity” is also used with other meanings in the literature. Sensitivity tests (Feder et al., 2021), e.g. Counterfactually Augmented Data (CAD) (Kaushik et al., 2019), evaluate the extent to which models use spurious features to make predictions by applying minimal, label-flipping perturbations to target features. Gardner et al. (2021) also use the term sensitivity informally to describe the probability that a local edit removing spurious features (“artifacts”) changes the label.

Lovering et al. (2020) propose two metrics related to our definition of sensitivity: 1) the extractability of the feature from a model representation (operationalized as minimum description length, MDL) and 2) the model error on spuriously perturbed test examples (termed “s-only error”). However, they do not define sensitivity within a causality framework; our work also differs from theirs in that they investigate the correlation between extractability and s-only error, while we investigate the correlation of spurious feature sensitivity with robustness and data augmentation. The concept of sensitivity is defined more formally by Veitch et al. (2021), who term it counterfactual invariance and propose distributional properties that a model not sensitive to spurious correlation should satisfy. Rather than properties, however, we propose a quantitative measure of sensitivity. We bridge the gap between causality and sensitivity by mathematically defining a causal estimand and devising a method (Algorithm 1) for estimating sensitivity.

Training with Random Labels.

Pondenkandath et al. (2018), Maennel et al. (2020) and Zhang et al. (2021) train deep neural networks (DNNs) on Computer Vision (CV) datasets with entirely random labels to study memorization, generalization, pretraining and alignment. Though we similarly use random label assignment (Section 3.1), our work differs from previous work in that 1) our insight behind randomization originates from the concept of the Randomized Controlled Trial (RCT) in causality; 2) we use randomization to study sensitivity to spurious features in NLP; and 3) our labels are not purely random: they are correlated with the existence of spurious features.

Interpretation of Data Augmentation.

Though data augmentation has been widely used in CV (Sato et al., 2015; DeVries and Taylor, 2017; Dwibedi et al., 2017) and NLP (Wang and Yang, 2015; Kobayashi, 2018; Wei and Zou, 2019), the mechanism underlying its effectiveness remains under-researched. Recent studies aim to quantify intuitions about how data augmentation improves model generalization. Gontijo-Lopes et al. (2020) introduce affinity and diversity, and find a correlation between the two metrics and augmentation performance in image classification. In NLP, Kashefi and Hwa (2020) propose a KL-divergence-based metric to predict augmentation performance. Our proposed sensitivity metric CENT indicates when data augmentation can help and thus acts as a complement to this line of research.

6 Conclusion

Inspired by the concept of the Average Treatment Effect (ATE) in causal inference, we causally quantify the sensitivity of NLP models to spurious features. We validate the hypothesis that a model that is more sensitive to a particular spurious feature is less robust against the same spurious perturbation when it is encountered during inference. Additionally, we show that data augmentation with such a feature improves robustness to similar test-time perturbations. We hope CENT will encourage more research on spurious feature sensitivity and its implications for interpretability, in order to make CENTs of spurious correlation.

7 Ethics Statement

Computing average sensitivity requires training a model multiple times at different injection probabilities, which can be computationally intensive if the datasets and models are large. This can be a non-trivial problem for NLP practitioners with limited computational resources. We hope that our benchmark results on the sensitivity of typical NLP models will serve as a reference for potential users. Collaboratively sharing results of such metrics on popular models in public fora can also help reduce duplicated investigation and coordinate efforts across teams.

To alleviate the computational cost of average sensitivity estimation, using sensitivity at selected injection probabilities may help, at the cost of reduced precision (Appendix B.2). We are not alone in facing this issue: two similar metrics for evaluating spurious features, extractability and s-only error (Lovering et al., 2020), also require training the model repeatedly over the whole dataset. Therefore, finding an efficient proxy for average sensitivity is a promising direction for more practical use of sensitivity in model interpretation.

References

  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
  • T. DeVries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.
  • D. Dwibedi, I. Misra, and M. Hebert (2017) Cut, paste and learn: surprisingly easy synthesis for instance detection. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1310–1319.
  • S. Eger, G. G. Şahin, A. Rücklé, J. Lee, C. Schulz, M. Mesgar, K. Swarnkar, E. Simpson, and I. Gurevych (2019) Text processing like humans do: visually attacking and shielding NLP systems. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 1634–1647.
  • A. Feder, K. A. Keith, E. Manzoor, R. Pryzant, D. Sridhar, Z. Wood-Doughty, J. Eisenstein, J. Grimmer, R. Reichart, M. E. Roberts, et al. (2021) Causal inference in natural language processing: estimation, prediction, interpretation and beyond. arXiv preprint arXiv:2109.00725.
  • M. Gardner, W. Merrill, J. Dodge, M. E. Peters, A. Ross, S. Singh, and N. Smith (2021) Competency problems: on finding and removing artifacts in language data. arXiv preprint arXiv:2104.08646.
  • K. Goel, N. F. Rajani, J. Vig, Z. Taschdjian, M. Bansal, and C. Ré (2021) Robustness Gym: unifying the NLP evaluation landscape. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations, pp. 42–55.
  • R. Gontijo-Lopes, S. Smullin, E. D. Cubuk, and E. Dyer (2020) Tradeoffs in data augmentation: an empirical study. In International Conference on Learning Representations.
  • S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. Bowman, and N. A. Smith (2018) Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 107–112.
  • D. Hendrycks, K. Lee, and M. Mazeika (2019) Using pre-training can improve model robustness and uncertainty. In International Conference on Machine Learning, pp. 2712–2721.
  • D. Hendrycks, X. Liu, E. Wallace, A. Dziedzic, R. Krishnan, and D. Song (2020) Pretrained transformers improve out-of-distribution robustness. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 2744–2751.
  • P. W. Holland (1986) Statistics and causal inference. Journal of the American Statistical Association 81 (396), pp. 945–960.
  • R. Jha, C. Lovering, and E. Pavlick (2020) Does data augmentation improve generalization in NLP?. arXiv preprint arXiv:2004.15012.
  • O. Kashefi and R. Hwa (2020) Quantifying the evaluation of heuristic methods for textual data augmentation. In Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pp. 200–208.
  • D. Kaushik, E. Hovy, and Z. Lipton (2019) Learning the difference that makes a difference with counterfactually-augmented data. In International Conference on Learning Representations.
  • S. Kobayashi (2018) Contextual augmentation: data augmentation by words with paradigmatic relations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 452–457.
  • Z. Li and L. Specia (2019) Improving neural machine translation robustness via data augmentation: beyond back-translation. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), Hong Kong, China, pp. 328–336.
  • L. Liu, F. Mu, P. Li, X. Mu, J. Tang, X. Ai, R. Fu, L. Wang, and X. Zhou (2019) NeuralClassifier: an open-source neural hierarchical multi-label text classification toolkit. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Florence, Italy, pp. 87–92.
  • P. Liu, X. Qiu, and X. Huang (2016) Recurrent neural network for text classification with multi-task learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp. 2873–2879.
  • X. Liu, D. Yin, Y. Feng, Y. Wu, and D. Zhao (2021) Everything has a cause: leveraging causal inference in legal text analysis. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1928–1941.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  • C. Lovering, R. Jha, T. Linzen, and E. Pavlick (2020) Predicting inductive biases of pre-trained models. In International Conference on Learning Representations.
  • H. Maennel, I. Alabdulmohsin, I. Tolstikhin, R. J. Baldock, O. Bousquet, S. Gelly, and D. Keysers (2020) What do neural networks learn when trained with random labels?. arXiv preprint arXiv:2006.10455.
  • T. McCoy, E. Pavlick, and T. Linzen (2019) Right for the wrong reasons: diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3428–3448.
  • J. Min, R. T. McCoy, D. Das, E. Pitler, and T. Linzen (2020) Syntactic data augmentation increases robustness to inference heuristics. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2339–2352.
  • M. Moradi and M. Samwald (2021) Evaluating the robustness of neural language models to input perturbations. arXiv preprint arXiv:2108.12237.
  • J. Morris, E. Lifland, J. Y. Yoo, J. Grigsby, D. Jin, and Y. Qi (2020) TextAttack: a framework for adversarial attacks, data augmentation, and adversarial training in NLP. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp. 119–126.
  • B. Neal (2020) Introduction to causal inference from a machine learning perspective. Course lecture notes (draft).
  • B. Pang and L. Lee (2005) Seeing stars: exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 115–124.
  • V. Pondenkandath, M. Alberti, S. Puran, R. Ingold, and M. Liwicki (2018) Leveraging random label memorization for unsupervised pre-training. arXiv preprint arXiv:1811.01640.
  • M. T. Ribeiro, T. Wu, C. Guestrin, and S. Singh (2020) Beyond accuracy: behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 4902–4912.
  • D. B. Rubin (1974) Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66 (5), pp. 688.
  • I. Sato, H. Nishimura, and K. Yokoi (2015) APAC: augmented pattern classification with neural networks. arXiv preprint arXiv:1505.03229.
  • S. Tan and S. Joty (2021) Code-mixing on Sesame Street: dawn of the adversarial polyglots. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, pp. 3596–3616.
  • L. Tu, G. Lalwani, S. Gella, and H. He (2020) An empirical study on robustness to spurious correlations using pre-trained language models. Transactions of the Association for Computational Linguistics 8, pp. 621–633.
  • V. Veitch, A. D’Amour, S. Yadlowsky, and J. Eisenstein (2021) Counterfactual invariance to spurious correlations: why and how to pass stress tests. arXiv preprint arXiv:2106.00545.
  • W. Y. Wang and D. Yang (2015) That’s so annoying!!!: a lexical and frame-semantic embedding based data augmentation approach to automatic categorization of annoying behaviors using #petpeeve tweets. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 2557–2563.
  • X. Wang, Q. Liu, T. Gui, Q. Zhang, et al. (2021) TextFlint: unified multilingual robustness evaluation toolkit for natural language processing. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations, Online, pp. 347–355.
  • Z. Wang and A. Culotta (2020) Identifying spurious correlations for robust text classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 3431–3440.
  • A. Warstadt, Y. Zhang, X. Li, H. Liu, and S. R. Bowman (2020) Learning which features matter: RoBERTa acquires a preference for linguistic generalizations (eventually). In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 217–235.
  • J. Wei and K. Zou (2019) EDA: easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 6382–6388.
  • A. Williams, N. Nangia, and S. Bowman (2018) A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1112–1122.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush (2020) Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp. 38–45.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le (2019) XLNet: generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems 32.
  • C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2021) Understanding deep learning (still) requires rethinking generalization. Communications of the ACM 64 (3), pp. 107–115.

Appendix A Details of Spurious Features

Figure 4: An example sentence injected with different spurious features.

Figure 4 shows an example sentence injected with different spurious features. They are described in the following:

Appendix B Discussion

B.1 Another Identification of the Causal Estimand for Sensitivity

In Section 3.3, we propose an accuracy-based identification of the ATE. We now discuss an alternative probability-based identification and compare the two. We can also define the outcome of a test example as the predicted probability of the (pseudo) true label given by the trained model $f_\theta$:

$Y_i(0) = P_\theta(\tilde{y}_i \mid x_i)$    (11)

Similarly, the outcome of a perturbed test data point is:

$Y_i(1) = P_\theta(\tilde{y}_i \mid x_i')$    (12)

For example, consider a test example that receives the treatment ($T = 1$): the trained model predicts its label as 1 with a probability of only 0.1 before treatment (the spurious feature has not been injected yet) and 0.9 after treatment, so the Individual Treatment Effect (ITE, see Equation 1) of this example is $0.9 - 0.1 = 0.8$. We then take an average over all the perturbed test examples (half of the test set; the other half, with $\tilde{y} = 0$, is left unperturbed following the same procedure as in Section 3.2 and is not included in the ATE calculation) to obtain the Average Treatment Effect (ATE, see Equation 2), which is exactly the sensitivity of a model to a spurious feature. To clarify, the two operands in Equation 2 are defined as follows:

$\mathbb{E}[Y(1)] = \frac{1}{m} \sum_{i=1}^{m} P_\theta(\tilde{y}_i \mid x_i')$    (13)

i.e., the average predicted probability of the (pseudo) true label given by the trained model on the perturbed test set $\tilde{D}'_{test}$, and

$\mathbb{E}[Y(0)] = \frac{1}{m} \sum_{i=1}^{m} P_\theta(\tilde{y}_i \mid x_i)$    (14)

i.e., the average predicted probability on the randomly labeled test set $\tilde{D}_{test}$, where $m$ is the number of perturbed test examples.

Notice that the accuracy-based definition of the outcome (Equation 4) can also be written in a form similar to the probability-based one (Equation 11):

$Y_i(0) = \mathbb{1}[P_\theta(\tilde{y}_i \mid x_i) > 0.5]$    (15)

because the correctness of a prediction is equivalent to whether the predicted probability of the (pseudo) true label exceeds a certain threshold, i.e., 0.5.

The major difference is that the accuracy-based ITE is a discrete variable taking values in $\{-1, 0, 1\}$, while the probability-based ITE is continuous and ranges from -1 to 1. For example, if a model learns a spurious feature and thus changes its prediction from wrong (before spurious feature injection) to correct (after injection), the accuracy-based ITE will be exactly 1 while the probability-based ITE will typically be less than 1. That is to say, the accuracy-based ITE varies more drastically when such prediction flips occur frequently, and thus better captures the nuances of a model’s sensitivity. Empirically, we find that accuracy-based average sensitivity varies more across model–feature pairs ($\sigma = 0.375$, Table 3) and thus better distinguishes between them than the probability-based one ($\sigma = 0.288$, Table 3). As a result, we choose the accuracy-based ATE as our primary measure of sensitivity.
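A tiny sketch (ours) contrasting the two outcome definitions for the worked example above, where the model’s predicted probability of the pseudo-true label moves from 0.1 to 0.9 after injection:

```python
p_before, p_after = 0.1, 0.9  # predicted probability of the pseudo-true label

# Probability-based outcome (Equations 11-12): continuous ITE in [-1, 1].
ite_prob = p_after - p_before                        # 0.8

# Accuracy-based outcome (Equations 4, 15): threshold at 0.5, ITE in {-1, 0, 1}.
ite_acc = int(p_after > 0.5) - int(p_before > 0.5)   # 1

print(ite_prob, ite_acc)
```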

p    Accuracy-based Sensitivity @ p    Probability-based Sensitivity @ p
     σ    Avg Sen.    Robu.    Post Aug    σ    Avg Sen.    Robu.    Post Aug
Avg. 0.375 1.000* -0.643* 0.756* 0.288 1.000* -0.652* 0.727*
0.001 0.182 0.426* -0.265 0.259 0.114 0.367* -0.279 0.288
0.005 0.235 0.637* -0.383* 0.522* 0.192 0.925* -0.620* 0.702*
0.01 0.263 0.741* -0.530* 0.635* 0.192 0.893* -0.567* 0.586*
0.02 0.257 0.816* -0.636* 0.743* 0.192 0.886* -0.686* 0.690*
0.05 0.236 0.279 -0.158 0.136 0.121 0.576* -0.371* 0.350*
0.1 0.241 0.354* -0.162 0.192 0.115 0.543* -0.288 0.258
0.5 0.094 0.024 0.155 -0.179 0.037 -0.080 0.114 -0.258
1.0 0.011 -0.199 0.252 -0.332 0.019 -0.220 0.294 -0.402*
Table 3: Standard deviations (σ) of Sensitivity @ p over all model–feature pairs, and Spearman correlations between accuracy-based/probability-based Sensitivity @ p and average sensitivity, robustness and the post-data-augmentation gain. * indicates significance (p-value < 0.05).

B.2 Investigating Sensitivity at a Specific Injection Probability

Inspired by Precision @ K in Information Retrieval (IR), we propose a similar metric dubbed Sensitivity @ $p$, the sensitivity of a model to a spurious feature at a specific injection probability $p$. We are primarily interested in whether a selected $p$ can represent the sensitivity across different injection probabilities and correlates well with robustness and the post-data-augmentation gain.

We calculate the standard deviation ($\sigma$) of Sensitivity @ $p$ and of average sensitivity over all model–feature pairs to measure how well each metric distinguishes between different models and features. Table 3 shows that average sensitivity is more diversified than Sensitivity @ $p$ for every $p$, and that diversity ($\sigma$) peaks at small injection probabilities (around $p = 0.005$–$0.02$) for both the accuracy-based and probability-based measurements. Accuracy-based Sensitivity @ $p$ is generally more diversified across models and features than its probability-based counterpart.

To investigate the strength of the correlations, we also calculate the Spearman $\rho$ between accuracy-based/probability-based Sensitivity @ $p$ and average sensitivity, robustness and the post-data-augmentation gain over all model–feature pairs. Table 3 shows that average sensitivity generally has stronger correlations than Sensitivity @ $p$. Correlations with both robustness and the post-augmentation gain peak around $p = 0.02$ for both measurements, and the correlations with average sensitivity (0.816*/0.886*) are also strong at these injection probabilities.

Overall, Sensitivity @ $p$ values with a higher standard deviation correlate better with average sensitivity, robustness and the post-augmentation gain. Our analysis shows that if $p$ is carefully selected by $\sigma$, Sensitivity @ $p$ is also a promising metric, though not as accurate as average sensitivity. One advantage of Sensitivity @ $p$ over average sensitivity is that it takes less time, since sensitivity is estimated at only a single injection probability. We plan to explore other efficient proxies of average sensitivity in future work.