Towards Interpreting and Mitigating Shortcut Learning Behavior of NLU models

by Mengnan Du, et al.
Texas A&M University

Recent studies indicate that NLU models are prone to rely on shortcut features for prediction, without achieving true language understanding. As a result, these models fail to generalize to real-world out-of-distribution data. In this work, we show that the words in the NLU training set can be modeled as a long-tailed distribution. There are two findings: 1) NLU models have a strong preference for features located at the head of the long-tailed distribution, and 2) shortcut features are picked up during the first few iterations of model training. These two observations are further employed to formulate a measurement which can quantify the shortcut degree of each training sample. Based on this shortcut measurement, we propose a shortcut mitigation framework, LTDR, to suppress the model from making overconfident predictions for samples with a large shortcut degree. Experimental results on three NLU benchmarks demonstrate that our long-tailed distribution explanation accurately reflects the shortcut learning behavior of NLU models. Experimental analysis further indicates that LTDR can improve the generalization accuracy on OOD data, while preserving the accuracy on in-distribution data.



1 Introduction

Pre-trained language models, such as BERT devlin2018bert, have demonstrated substantial gains on many NLU (natural language understanding) benchmarks. However, recent studies show that these models tend to exploit dataset biases as shortcuts to make predictions, rather than learning semantic understanding and reasoning geirhos2020shortcut; gururangan2018annotation. Here we focus on lexical bias, where NLU models rely on spurious correlations between shortcut words and labels. This eventually results in low generalizability on out-of-distribution (OOD) samples and low adversarial robustness zellers2018swag.

In this work, we show that the shortcut learning behavior of NLU models can be explained by the long-tailed phenomenon. Previous empirical analysis indicates that the performance of BERT-like models on the NLI task can be mainly explained by their reliance on spurious statistical cues such as the unigrams “not”, “do”, “is” and the bigram “will not” niven2019probing; gururangan2018annotation. Here we generalize these hypotheses using the long-tailed phenomenon. Specifically, the features in the training set can be modeled as a long-tailed distribution, using local mutual information evert2005statistics as the measurement. By utilizing an interpretation method to analyze model behavior, we observe that these NLU models concentrate mainly on information at the head of the distribution, which usually corresponds to non-generalizable shortcut features. In contrast, the tail of the distribution is poorly learned, although it contains high information for the NLU task. Another key observation is that shortcut features tend to be picked up by NLU models during the very early iterations of training. Based on these two key observations, we define a measurement to quantify the shortcut degree of all training samples.

Figure 1: (a) Our key intuition is that the training set can be modeled as a long-tailed distribution. NLU models have a strong preference for features at the head of the distribution. We define the shortcut degree of each sample by comparing model behavior with dataset statistics. (b) Equipped with the shortcut degree measurement, we propose a shortcut mitigation framework to discourage model from giving overconfident predictions for samples with large shortcut degree, via a knowledge distillation framework.

Based on the long-tailed distribution observation and the shortcut degree measurement, we propose an NLU shortcut mitigation framework, termed LTDR (Long-Tailed Distribution guided Regularizer). LTDR is implemented using knowledge distillation, to penalize the NLU model for outputting overconfident predictions on training samples with a large shortcut degree. The implicit effect of LTDR is to downweight the reliance on shortcut patterns, thereby discouraging the model from taking shortcuts for prediction. With this regularization, the model has more incentive to learn the correlation between task-relevant features and the underlying task. The major contributions are summarized as follows:

  • We show that the shortcut learning behaviors of NLU models can be explained by the long-tailed phenomenon, and that shortcuts are picked up at the early stage of model training.

  • We propose a shortcut mitigation method, called LTDR, guided by the long-tailed observation.

  • Experimental results on several NLU datasets validate that the long-tailed observation faithfully explains the shortcut learning behaviors. The analysis further shows LTDR could improve generalization on OOD samples, while not sacrificing accuracy of in-distribution samples.

  • We demonstrate that our LTDR approach can partially mitigate shortcut based Trojan attacks through a preliminary experiment.

2 Long-Tailed Phenomenon

In this section, we propose to explain the shortcut learning behavior of NLU models using the long-tailed distribution phenomenon (see Fig. 1(a)). Our insight is that standard training procedures cause models to utilize whichever simple features reduce the training loss the most, i.e., simplicity bias shah2020pitfalls, which directly results in the low generalization of NLU models.

2.1 Preference for Features of High Local Mutual Information

NLU tasks are typically formulated as multi-class classification: given an input sentence pair $x$, the goal is to learn a mapping $f(x)$ that predicts the semantic relationship label $y$. The key idea is that some words or phrases within $x$ co-occur more frequently with one label than with the others. The DNN model would capture those shortcut features for prediction. Due to the IID (independent and identically distributed) split of training, validation and test sets, models which learn these shortcuts can achieve reasonable performance on all these subsets. Nevertheless, they might suffer from low generalization ability on OOD data that do not share the same shortcuts as the in-distribution data.

Dataset Statistics.  We model the dataset statistics using local mutual information (LMI) between a word $w$ and a label $y$, denoted as follows:

$$\mathrm{LMI}(w, y) = p(w, y) \cdot \log\left(\frac{p(y \mid w)}{p(y)}\right), \tag{1}$$

where $p(w, y) = \frac{\mathrm{count}(w, y)}{|D|}$ and $p(y \mid w) = \frac{\mathrm{count}(w, y)}{\mathrm{count}(w)}$. Here $|D|$ is the number of unique words in the training set, $\mathrm{count}(w, y)$ denotes the co-occurrence of word $w$ with label $y$, and $\mathrm{count}(w)$ is the total number of occurrences of word $w$ in the training set. After analyzing each word over the whole training set, we obtain one distribution per label. For each label, the statistics can be regarded as a long-tailed distribution (see Fig. 1(a)). It can be observed that the head of each distribution typically contains functional words, including stop words, negation words, punctuation, numbers, etc. These words carry low information for the NLU task. In contrast, the long tail of the distribution contains words with high information, although they co-occur less frequently with the labels.
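The LMI statistic can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `local_mutual_information` is hypothetical, tokens are assumed to be pre-split, and the probabilities are normalised by the total number of token occurrences (an assumption made here so that the estimates sum to one).

```python
import math
from collections import Counter


def local_mutual_information(samples):
    """Compute LMI(w, y) for every (word, label) pair seen in `samples`.

    `samples` is an iterable of (tokens, label) pairs.
    Assumption: probabilities are estimated from token occurrence counts.
    """
    pair_counts = Counter()   # count(w, y)
    word_counts = Counter()   # count(w)
    label_counts = Counter()  # token occurrences under label y
    for tokens, label in samples:
        for word in tokens:
            pair_counts[(word, label)] += 1
            word_counts[word] += 1
            label_counts[label] += 1

    total = sum(word_counts.values())
    lmi = {}
    for (word, label), count in pair_counts.items():
        p_wy = count / total                    # p(w, y)
        p_y_given_w = count / word_counts[word]  # p(y | w)
        p_y = label_counts[label] / total        # p(y)
        lmi[(word, label)] = p_wy * math.log(p_y_given_w / p_y)
    return lmi


toy = [(["not", "a"], "contradiction"), (["a", "b"], "entailment")]
scores = local_mutual_information(toy)
```

Sorting the scores of one label in descending order gives the long-tailed curve described above; a word that appears equally often under every label, such as "a" in the toy data, receives an LMI of zero.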

Model Behavior.  We use a post-hoc interpretation method to generate an interpretation for each sample in the training set. It attributes the model’s prediction to its input features, and the final interpretation takes the form of a feature importance vector montavon2018methods. Here we use Integrated Gradient sundararajan2017axiomatic, whose main idea is to integrate the gradients of $m$ intermediate samples over the straight-line path from a baseline $x_{base}$ to the input $x_i$, which can be denoted as follows:

$$g(x_i) = (x_i - x_{base}) \cdot \sum_{k=1}^{m} \frac{\partial f_y\left(x_{base} + \frac{k}{m}(x_i - x_{base})\right)}{\partial x_i} \cdot \frac{1}{m}. \tag{2}$$

Let each input text be composed of $T$ words, $x_i = \{x_i^t\}_{t=1}^{T}$, where each word $x_i^t \in \mathbb{R}^d$ denotes a word embedding with $d$ dimensions. We first compute gradients of the prediction $f_y(x_i)$ with respect to individual entries in the word embedding vectors, and use the L2 norm to reduce each vector of gradients to a single attribution value, representing the contribution of each single word. We use the all-zero word embedding as the baseline $x_{base}$.
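The Riemann-sum approximation of Integrated Gradient and the per-word L2 reduction can be sketched as follows. This is a toy illustration under stated assumptions: `grad_fn` is a hypothetical callable that returns the gradient of the predicted-class score with respect to the embeddings (in practice it would come from an autodiff framework), and the quadratic model at the bottom exists only to make the example runnable.

```python
import math


def integrated_gradients(x, baseline, grad_fn, m=50):
    """Approximate Integrated Gradient with m steps on the straight-line path.

    x, baseline: T word embeddings, each a list of d floats.
    grad_fn(point): hypothetical gradient oracle with the same T x d shape.
    """
    T, d = len(x), len(x[0])
    avg_grad = [[0.0] * d for _ in range(T)]
    for k in range(1, m + 1):
        # Intermediate sample: baseline + (k / m) * (x - baseline).
        point = [[baseline[t][j] + (k / m) * (x[t][j] - baseline[t][j])
                  for j in range(d)] for t in range(T)]
        grads = grad_fn(point)
        for t in range(T):
            for j in range(d):
                avg_grad[t][j] += grads[t][j] / m
    return [[(x[t][j] - baseline[t][j]) * avg_grad[t][j] for j in range(d)]
            for t in range(T)]


def word_attributions(ig):
    """Reduce each word's attribution vector to one value via the L2 norm."""
    return [math.sqrt(sum(v * v for v in row)) for row in ig]


# Toy model (assumption): f(x) = sum of squared entries, so the gradient is 2x.
def toy_grad(point):
    return [[2.0 * v for v in row] for row in point]


x = [[3.0, 4.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]  # all-zero baseline, as in the text
attr = word_attributions(integrated_gradients(x, zeros, toy_grad, m=50))
```

For this toy model the first word receives the larger attribution, since its embedding contributes more to the output.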

Comparing Model and Dataset.  We can compare the Integrated Gradient-based model behavior with the LMI-based dataset statistics, so as to identify the source of the NLU model’s shortcut learning. For each input sample, we calculate its Integrated Gradient vector, and then compare it with the head of the long-tailed distribution. Our preliminary experiments (see Sec. 4.2) indicate that NLU classifiers are very strong superficial learners. They rely heavily on the high-LMI features at the head of the long-tailed distribution, while they usually ignore the more complex features on the tail of the distribution. The latter require the model to learn high-level sentence representations and thus capture the relationship between the two parts of the input for the NLU task. Based on these empirical observations, we can define a measurement of the shortcut degree of each sample by calculating the similarity between model behavior and dataset statistics. For each training sample $x_i$, we check whether the word with the highest or the second-highest Integrated Gradient score falls within the top 5% head of the distribution, and set the shortcut degree $u_i$ of sample $x_i$ to 1 if it does. Otherwise, we set $u_i = 0$.
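The head-preference indicator described above, written here as $u_i$, can be sketched as below. The helper names `head_words` and `head_preference` are hypothetical, and the sketch assumes the LMI scores passed in belong to a single label's distribution; which label's distribution to use is a choice the text does not spell out.

```python
def head_words(lmi_by_word, fraction=0.05):
    """Return the words in the top `fraction` of an LMI ranking (the head)."""
    ranked = sorted(lmi_by_word, key=lmi_by_word.get, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return set(ranked[:k])


def head_preference(tokens, attributions, head, top_k=2):
    """Return 1 if any of the top_k attributed words lies in the head, else 0."""
    order = sorted(range(len(tokens)),
                   key=lambda t: attributions[t], reverse=True)
    return 1 if any(tokens[t] in head for t in order[:top_k]) else 0


# Toy LMI ranking over 40 words: the 5% head is the two highest-scoring words.
toy_lmi = {f"w{i}": float(i) for i in range(40)}
head = head_words(toy_lmi)

u_shortcut = head_preference(["w1", "w39", "w5"], [0.1, 0.9, 0.3], head)
u_hard = head_preference(["w1", "w39", "w5"], [0.9, 0.1, 0.3], head)
```

In the first call the most-attributed word is in the head, so the sample is flagged; in the second call the two most-attributed words are tail words and the indicator is zero.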

2.2 Shortcuts Samples are Learned First

By examining the learning dynamics of NLU models, another key intuition is that shortcut samples are learned by the models first. The shortcut features located at the head of the long-tailed distribution are learned by NLU models at a very early stage of model training, leading to a rapid drop of the loss function. After that, the features at the tail of the distribution are gradually learned so as to further reduce the training loss. Based on this observation, we take snapshots when training our models, and then compare the difference between the snapshot models. We regard a training sample as a hard sample if the predicted labels do not match between snapshots. In contrast, if the predicted labels match, we compare the Integrated Gradient explanation vectors of the two snapshots through cosine similarity. The shortcut measurement for sample $x_i$ is defined as follows:

$$v_i = \mathrm{cosine}\left(g(f_1(x_i)),\, g(f_2(x_i))\right). \tag{3}$$

Here $f_1$ denotes the snapshot at the early stage of training, and we use the model obtained after the first epoch. The second snapshot $f_2$ represents the final converged model. The intuition is that shortcut samples have a large cosine similarity of integrated gradient between the two snapshots.
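The snapshot comparison can be sketched as follows, writing $v_i$ for the resulting measurement. The function names are hypothetical, and assigning 0 to samples whose predictions differ between snapshots is an assumption made here: the text only says such samples are treated as hard samples.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length attribution vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def snapshot_similarity(pred_early, pred_final, attr_early, attr_final):
    """Learning-dynamics measurement for one training sample.

    pred_*: labels predicted by the early and the final snapshot.
    attr_*: per-word Integrated Gradient attributions from each snapshot.
    """
    if pred_early != pred_final:
        return 0.0  # assumption: mismatching predictions mark a hard sample
    return cosine(attr_early, attr_final)


v_same = snapshot_similarity("entail", "entail", [1.0, 0.0], [2.0, 0.0])
v_flip = snapshot_similarity("entail", "neutral", [1.0, 0.0], [2.0, 0.0])
```

A sample whose explanation keeps the same direction from the first epoch to convergence scores close to 1, matching the intuition that its shortcut was picked up early and never revised.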

2.3 Shortcut Degree Measurement

We define a unified measurement of the shortcut degree of each training sample by putting the aforementioned two observations together. This is achieved by first calculating the two shortcut measurements $u_i$ and $v_i$, directly adding them together, and then normalizing the sum to the range between 0 and 1. Ultimately, we obtain the shortcut degree measurement for each training sample $x_i$, denoted as $b_i$. This measurement can be further utilized to mitigate the shortcut learning behavior.
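Combining the two measurements can be sketched as below, with $u_i$ the head-preference indicator, $v_i$ the snapshot similarity, and $b_i$ the unified degree. The text does not state which normalization is used; min-max scaling over the training set is an assumption of this sketch, and `shortcut_degree` is a hypothetical name.

```python
def shortcut_degree(u, v):
    """Add the two per-sample measurements and rescale the sums to [0, 1].

    u: head-preference indicators (0 or 1), one per training sample.
    v: snapshot cosine similarities, one per training sample.
    Assumption: min-max normalization over the whole training set.
    """
    raw = [a + b for a, b in zip(u, v)]
    lo, hi = min(raw), max(raw)
    if hi == lo:
        return [0.0] * len(raw)  # degenerate case: all samples identical
    return [(r - lo) / (hi - lo) for r in raw]


b = shortcut_degree([1, 0, 1], [1.0, 0.0, 0.5])
```

With this choice, the sample flagged by both measurements gets degree 1, the sample flagged by neither gets degree 0, and the rest fall in between.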

3 Proposed Mitigation Framework

Equipped with the observation of long-tailed phenomenon and the shortcut degree measurement obtained from the last section, we propose a shortcut mitigation solution, called LTDR (Long-Tailed Distribution guided Regularizer). LTDR is implemented based on self knowledge distillation utama2020mind; hinton2015distilling (see Fig. 1(b)). The key idea is to suppress the model from over relying on shortcut features and giving over-confident predictions when there exist strong shortcut features in the input. The implicit effect of LTDR is thus to force the model to down-weight its reliance on the shortcuts.

Input: Training data $D = \{(x_i, y_i)\}_{i=1}^{N}$.
Set hyperparameters $m$, $\alpha$.
while first stage do
    Train teacher network $f_T(x)$. Fix its parameters.
while second stage do
    Initialize the student network $f_S(x)$;
    Calculate shortcut degree $b_i$ and softmax $\sigma(z_i^T)$ for each training sample $x_i$;
    Smoothing softmax: $s_{i,j} = \sigma(z_i^T)_j^{1-b_i} / \sum_{k=1}^{K} \sigma(z_i^T)_k^{1-b_i}$;
    Use Eq. 5 to train the student network $f_S(x)$.
Output: Discard $f_T(x)$. Use $f_S(x)$ for prediction.
Algorithm 1: LTDR mitigation framework.

Smoothing Softmax.  Based on the biased teacher model $f_T$, we calculate the logit value and softmax value of training sample $x_i$ as $z_i^T$ and $\sigma(z_i^T)$ respectively, where $\sigma$ is the softmax function. Given also the shortcut degree measurement $b_i$ of each training sample $x_i$, we then smooth the original probability through the following formulation:

$$s_{i,j} = \frac{\sigma(z_i^T)_j^{\,1-b_i}}{\sum_{k=1}^{K} \sigma(z_i^T)_k^{\,1-b_i}}, \tag{4}$$

where $K$ denotes the total number of labels. When $b_i = 0$, $s_i$ remains the same as $\sigma(z_i^T)$, meaning that there is no penalization. At the other extreme, when $b_i = 1$, $s_i$ has the same value for all $K$ labels. When $b_i$ is between 0 and 1, the larger $b_i$ is, the smoother $s_i$ becomes.
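The smoothing step can be checked numerically with a short sketch. It raises each teacher probability to the power $1 - b_i$ and renormalizes, which is the behavior described above (unchanged at degree 0, uniform at degree 1); the function name `smooth_softmax` and the toy probabilities are illustrative assumptions.

```python
def smooth_softmax(teacher_probs, degree):
    """Flatten a teacher distribution according to a shortcut degree in [0, 1].

    Each probability is raised to the power (1 - degree) and the result is
    renormalized so that it sums to one again.
    """
    powered = [p ** (1.0 - degree) for p in teacher_probs]
    total = sum(powered)
    return [p / total for p in powered]


teacher = [0.7, 0.2, 0.1]
unchanged = smooth_softmax(teacher, 0.0)  # no penalization
halfway = smooth_softmax(teacher, 0.5)    # softer, same ranking
uniform = smooth_softmax(teacher, 1.0)    # every label gets 1 / K
```

The intermediate case shows the property relied on later in the paper: the top label loses probability mass, yet the relative order of the labels is preserved.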

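Self Knowledge Distillation.  Ultimately, we use the following loss to train the student model $f_S$:

$$\mathcal{L}(x) = (1 - \alpha) \cdot \mathcal{H}\left(y_i, \sigma(z_i^S)\right) + \alpha \cdot \mathcal{H}\left(s_i, \sigma(z_i^S)\right), \tag{5}$$

where $\sigma(z_i^S)$ represents the softmax probability of the student network for training sample $x_i$, and $\mathcal{H}$ denotes the cross entropy loss. Parameter $\alpha$ denotes the balancing weight between learning from the smoothed probability output $s_i$ of the teacher and learning from the ground truth $y_i$. We use the same model architecture for teacher $f_T$ and student $f_S$, and during the distillation process we fix the parameters of $f_T$ and only update the parameters of the student model $f_S$ (see Algorithm 1).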
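The per-sample distillation objective can be sketched as a weighted sum of two cross entropies, one against the ground-truth label and one against the smoothed teacher output, with the balancing weight (written $\alpha$ here) on the teacher term. The function names and toy numbers are illustrative assumptions; a real implementation would compute this on framework tensors and average over a batch.

```python
import math


def cross_entropy(target, pred, eps=1e-12):
    """Cross entropy H(target, pred) between two discrete distributions."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, pred))


def ltdr_loss(y_onehot, smoothed_teacher, student_probs, alpha=0.8):
    """Weighted sum of the ground-truth term and the smoothed-teacher term."""
    hard_term = cross_entropy(y_onehot, student_probs)
    soft_term = cross_entropy(smoothed_teacher, student_probs)
    return (1.0 - alpha) * hard_term + alpha * soft_term


y = [1.0, 0.0, 0.0]
student = [0.5, 0.25, 0.25]
uniform_teacher = [1.0 / 3.0] * 3  # a fully smoothed shortcut sample

loss = ltdr_loss(y, uniform_teacher, student, alpha=0.8)
```

When the teacher target is uniform, the soft term is minimized by a uniform student prediction, so a confident student is penalized on samples with a high shortcut degree.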

Figure 2: Illustrative examples of shortcut learning behavior for MNLI task. Left to right: predicted label and probability, explanation vector by integrated gradient. Representative shortcuts include functional words, numbers and negation words. Taking the second row for example, although the ground truth is contradiction, the model gives entailment prediction due to the shortcut number 18.

4 Experiments

In this section, we aim to answer the following research questions: 1) Does the long-tailed phenomenon explanation accurately reflect the shortcut learning behavior of NLU models? 2) Does the proposed LTDR outperform alternative approaches, and what is the source of the improved generalization? 3) How do components and hyperparameters affect LTDR’s generalization performance?

4.1 Experimental Setup

Tasks & Datasets.  We consider three NLU tasks. The first task is NLI, where the original dataset is MNLI williams2017broad. Two adversarial sets, HANS mccoy2019right and the MNLI hard set (on which a hypothesis-only model has low accuracy) gururangan2018annotation, are used to test generalizability. The second task is fact verification, where the original dataset is FEVER thorne2018fever, and the two adversarial sets are Symmetric v1 and v2 (Sym1 and Sym2), where a shortcut word appears in both support and refute labels schuster2019towards. For the third task, we use a lexically biased variant of the MNLI dataset, which we term MNLI-backdoor. We randomly select 10% of the training samples with the entailment label and append the double quotation mark ‘’ to the beginning of the hypothesis. For the adversarial set, we still use the MNLI hard set, but append ‘’ to the hypothesis of all samples. In this way, we test whether NLU models capture this new kind of spurious correlation and whether our LTDR can mitigate this intentionally inserted shortcut.

NLU Models.  We consider two pre-trained contextualized word embedding models, BERT base devlin2018bert and DistilBERT sanh2019distilbert, as encoders to obtain word representations. The input fed to the embedding models is obtained by concatenating the two branches of the input, which are separated using the ‘[SEP]’ symbol. Note that we use a slightly different classification head compared to the related work clark2019don; mahabadi2020end. A bidirectional LSTM is used as the classification head right after the encoder, followed by max pooling and a fully connected layer for classification. The main reason is that this classification head facilitates analyzing model behavior with the explanation method, i.e., integrated gradient.

Implementation Details.  For all three tasks, we train the model for 6 epochs, by which point all models have converged. Hyperparameter $\alpha$ is fixed as 0.8 for all models. The encoder and the classification head use separate learning rates. We freeze the parameters of the encoder for the first epoch, because the weights of the classification head are randomly initialised and we do not want the resulting loss to affect the weights of the pre-trained encoder. When generating the explanation vector for each input word using integrated gradient, we only consider the classification head, which uses the 768-dimensional encoder representation as input. Parameter $m$ in Eq. (2) is fixed as 50 for all experiments.

Comparing Baselines.  We compare with three representative families of methods. The first baseline is product of experts clark2019don; he2019unlearn; mahabadi2020end, which first trains a bias-only model and then trains a debiased model as an ensemble with the bias-only model. The second baseline is re-weighting schuster2019towards, which aims to give biased samples a lower weight when training a model. The bias-only model is used to calculate the prediction probability $p_i$ of each training sample, and the weight for $x_i$ is then $1 - p_i$ clark2019don. This approach assumes that if the bias-only model can predict a sample with high confidence (close to 1.0), the sample is potentially biased. The third baseline is changing the example order, using the descending order of the probability output of the bias-only model. The key motivation is that learning order matters. A sequential order is used (in contrast to a random data sampler) when training the model, so that shortcut samples are seen by the model first, followed by the harder samples. Note that the classification head used in this work is different from the related work, so we re-implement all baselines on our NLU models.
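The re-weighting baseline can be illustrated with a short sketch. It assumes the weight of a sample is one minus the probability that the bias-only model assigns to the correct label, as in the re-weighting scheme of clark2019don; the function names are hypothetical, and averaging over the batch is an assumption of this sketch.

```python
def bias_weights(bias_probs):
    """Down-weight samples that the bias-only model already predicts well.

    bias_probs: probability the bias-only model gives to the correct label.
    """
    return [1.0 - p for p in bias_probs]


def reweighted_loss(per_sample_losses, bias_probs):
    """Average of per-sample losses scaled by their bias weights."""
    weights = bias_weights(bias_probs)
    weighted = [w * loss for w, loss in zip(weights, per_sample_losses)]
    return sum(weighted) / len(weighted)


# A sample the bias-only model is 90% sure about contributes far less
# than one it gets right only 20% of the time.
weights = bias_weights([0.9, 0.2])
batch_loss = reweighted_loss([1.0, 1.0], [0.9, 0.2])
```

Unlike LTDR, this scheme changes how much each sample counts rather than what target distribution the model is trained toward.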

4.2 Shortcut Behaviour Analysis

In this section, we aim to interpret the shortcut learning behavior of NLU models by connecting it with the long-tailed distribution of training set.

Qualitative evaluation.   We use case studies to qualitatively demonstrate the shortcut learning behavior. Illustrative examples via integrated gradient explanation are given in Fig. 1(a) as well as Fig. 2. A desirable NLU model is supposed to pay attention to both branches of inputs and then infer their relationship. In contrast, the visualization results indicate two levels of shortcut learning behavior: 1) NLU model pays the highest attention to shortcut features, such as ‘only’, and 2) The models only pay attention to one branch of the inputs.

Preference for head of distribution.   We calculate the local mutual information value of each word and then rank the words to obtain the long-tailed distributions for all three labels. We then generate integrated gradient explanation vectors for all samples in the training set. We calculate the ratio of training samples whose largest integrated gradient words are located in the 5% head of the long-tailed distributions. The results are given in Tab. 1, where top 1, top 2 and top 3 indicate whether the largest, any one of the largest two, and any one of the largest three words, respectively, falls in the head. The results indicate that a high ratio of samples have their largest interpretation word located at the head of the distribution, e.g., 25.3% for MNLI. The 5% head of the distribution usually contains functional words, including words from the NLTK stopwords list, punctuation, numbers, and words that are used by annotators to represent contradiction (e.g., ‘not’, ‘no’, ‘never’).

| #Words | MNLI Top 1 | MNLI Top 2 | MNLI Top 3 | FEVER Top 1 | FEVER Top 2 | FEVER Top 3 |
|---|---|---|---|---|---|---|
| Ratio | 25.3% | 51.3% | 66.0% | 10.8% | 26.9% | 31.44% |

Table 1: The ratio of samples where the top integrated gradient words are located on the head of the long-tailed distribution. We define the head as the top 5% of all features.

| Subset | MNLI Entail | MNLI Contradiction | MNLI Neutral | FEVER Support | FEVER Refute | FEVER Not_enough |
|---|---|---|---|---|---|---|
| Ratio | 75.8% | 94.6% | 96.3% | 99.4% | 99.9% | 83.8% |

Table 2: The ratio of samples where the word with the largest integrated gradient value is in the hypothesis branch of MNLI or claim branch of FEVER.

Preference for one branch of input. Another key observation is that the word with the largest integrated gradient value usually lies in one branch of the input, e.g., the hypothesis branch of MNLI and the claim branch of FEVER. The results are given in Tab. 2, which shows that for all three labels of each task, the ratios (75%-99%) are highly in favor of one branch of the input. Both the preference for the head of the distribution and the preference for one branch of the input can be explained by annotation artifacts gururangan2018annotation. During the labelling process, crowd workers tend to use some common strategies and a limited dictionary of words for annotation, e.g., negation words for contradiction. These artifacts lead to the high-LMI features of the long-tailed distribution, which are then picked up by NLU models.

Figure 3: Learning dynamic for the first training epoch. X axis denotes 10 checkpoints in the first epoch.

Shortcut samples are learned first.   We separate the MNLI training set into two subsets based on the shortcut measurement defined in Sec. 2.3. The separation threshold is selected so that the resulting shortcut-sample subset and hard-sample subset have a fixed size ratio. We put these subsets in the order of shortcut/hard or hard/shortcut and use a data sampler that returns indices sequentially, so as to analyze the learning dynamics of the NLU model. We set the validation check frequency within the first training epoch to 0.1, so that validation performance is checked multiple times within a training loop. We illustrate the results for the BERT-base model in Fig. 3. There are three major findings:

  • Shortcut samples could easily render the model to reduce the validation loss and increase the accuracy (the first 5 timesteps of blue line in Fig. 3). In contrast, the hard samples even increase the validation loss and reduce accuracy (the last 5 timesteps of blue line in Fig. 3).

  • The learning curves also validate that our shortcut measurement (Sec. 2.3) faithfully reflects the shortcut degree of training samples.

  • The results further imply that during the normal training process with a random data sampler, the examples with strong shortcut features are first picked up and learned by the model geirhos2020shortcut. It makes the training loss drop significantly during the first few training iterations. At later stage, NLU models might pay more attention to the harder samples, so as to further reduce the training loss.

| Models | BERT base FEVER | BERT base Sym1 | BERT base Sym2 | DistilBERT FEVER | DistilBERT Sym1 | DistilBERT Sym2 |
|---|---|---|---|---|---|---|
| Original | 85.10 | 54.01 | 62.40 | 85.57 | 54.95 | 62.35 |
| Reweighting | 84.32 | 56.37 | 64.89 | 84.76 | 56.28 | 63.97 |
| Product-of-expert | 82.35 | 58.09 | 64.27 | 85.10 | 56.82 | 64.17 |
| Order-changes | 81.20 | 55.36 | 64.29 | 82.86 | 55.32 | 63.95 |
| LTDR | 85.46 | 57.88 | 65.03 | 86.19 | 56.49 | 64.33 |

Table 3: Generalization comparison (accuracy in percent) of LTDR with baselines for the FEVER task. LTDR maintains in-distribution accuracy while also improving generalization on OOD samples.

| Models | BERT base MNLI | BERT base Hard | BERT base HANS | DistilBERT MNLI | DistilBERT Hard | DistilBERT HANS |
|---|---|---|---|---|---|---|
| Original | 84.20 | 75.38 | 52.17 | 82.37 | 72.95 | 53.83 |
| Reweighting | 83.54 | 76.83 | 57.30 | 80.52 | 73.27 | 55.63 |
| Product-of-expert | 82.19 | 77.08 | 58.57 | 80.17 | 74.37 | 52.21 |
| Order-changes | 81.03 | 76.97 | 56.39 | 80.37 | 74.10 | 54.62 |
| LTDR | 84.39 | 77.12 | 58.03 | 83.16 | 73.63 | 55.88 |

Table 4: Generalization comparison (accuracy in percent) of our method with baselines for the MNLI task. LTDR maintains in-distribution accuracy while also improving generalization on OOD samples.

4.3 Mitigation Performance Analysis

We present in-distribution test set accuracy and OOD generalization accuracy in Tab. 3, 4, and 5 for FEVER, MNLI, and MNLI-backdoor respectively. Note that both BERT and DistilBERT results are the average of 3 runs with different seeds.

MNLI and FEVER Evaluation.   There are four key findings (see Tab. 3 and Tab. 4).

  • NLU models that rely on shortcut features have decent performance on in-distribution data, but generalize poorly to OOD data, e.g., over 80% accuracy on the FEVER validation set and lower than 60% accuracy on Sym1 for all models. Besides, our generalization accuracy is lower (e.g., the HANS accuracy in Tab. 4) compared to BERT-base models with a simple classification head clark2019don; mahabadi2020end. This indicates that the Bi-LSTM classification head could exacerbate shortcut learning and reduce the generalization of NLU models.

  • LTDR could improve the OOD generalization accuracy, ranging from 0.68% to 5.86% increase for MNLI task, and from 1.54% to 3.87% on FEVER task. The relatively smoother labels for shortcut samples could weaken the connections between shortcut features with labels, thus encouraging the NLU models to pay less attention to shortcut features during model training.

  • LTDR does not sacrifice in-distribution test set performance. The reasons are two-fold. Firstly, from the label smoothing perspective muller2019does, although LTDR smooths the supervision labels from the teacher model, it still keeps the relative order of the labels. Secondly, from the knowledge distillation perspective hinton2015distilling, the standard practice is to use a smaller architecture for the student network, which can achieve performance comparable to the bigger teacher network. For LTDR, we use the same architecture for both, and can thus preserve the in-distribution accuracy.

  • In contrast, the compared baselines typically achieve generalization enhancement at the expense of decreased accuracy on the in-distribution test set. For instance, Product-of-expert lowers the accuracy on the FEVER test set by 2.75% for the BERT-base model. Similarly, the accuracy on in-distribution samples drops for both the Reweighting and Order-changes baselines.

| Models | MNLI | Hard-backdoor Entailment | Hard-backdoor Contradiction | Hard-backdoor Neutral |
|---|---|---|---|---|
| Original | 81.96 | 100.0 | 0.0 | 0.0 |
| LTDR | 82.10 | 98.63 | 30.45 | 17.53 |

Table 5: Evaluation of LTDR with the DistilBERT model for the MNLI-backdoor task. Every sample within Hard-backdoor is appended with the shortcut feature ‘’. LTDR can mitigate this intentionally inserted shortcut.

MNLI-backdoor Evaluation.   The results are given in Tab. 5, and there are four findings. Firstly, they indicate that shortcuts can be intentionally inserted into DNNs, in contrast to existing shortcuts in the training set that are unintentionally created by crowd workers. Here the unnoticeable trigger pattern ‘’ can be utilized for malicious purposes, i.e., Trojan/backdoor attacks tang2020embarrassingly; kurita2020weight. Secondly, before mitigation, the generalization accuracy on Hard-backdoor drops significantly. The NLU model always predicts entailment for all testing samples within Hard-backdoor, even though we only append the shortcut feature ‘’ to 10% of the entailment samples in the training set. This further confirms our long-tailed observation and indicates that NLU models rely exclusively on the simple features with high LMI values and remain invariant to all predictive complex features. Thirdly, LTDR is effective in terms of improving generalizability. 30.45% of contradiction and 17.53% of neutral samples are given the correct prediction by LTDR, compared to 0.0% accuracy before mitigation. This means that LTDR successfully pushes the NLU model to pay less attention to ‘’. Finally, there is a negligible accuracy difference on the MNLI validation set, which is not appended with the shortcut feature ‘’: 81.96% for the backdoored model in Tab. 5 compared to 82.37% for the DistilBERT model trained on the original MNLI in Tab. 4. This indicates that the NLU model can be triggered both by ‘’ and by other features.

Generalization Source Analysis.  Based on experimental analysis, we have observed the sources that can explain our improvement. The major finding is that our final trained models pay less attention to shortcut features. We illustrate this using a case study in Fig. 4. Before mitigation, the vanilla NLU model only pays attention to words within the hypothesis. In contrast, after mitigation, the model pays attention to both premise and hypothesis and uses their similarity to arrive at the entailment prediction. However, we can still observe that the model pays high attention to shortcuts after mitigation for a certain ratio of samples. Bringing more inductive bias to the model architecture conneau2017supervised or incorporating more domain knowledge chen2017neural; mihaylov2018knowledgeable could further alleviate the model’s reliance on shortcuts, which will be explored in our future research.

Figure 4: Illustrative examples of our mitigation. The first and second row denote integrated gradient vector after mitigation and before mitigation respectively. It indicates that LTDR could push the model to focus on both premise and hypothesis for prediction.

4.4 Ablation and Hyperparameters Analysis

We conduct ablation studies using BERT-base model for MNLI task to study the contribution of components of our mitigation framework.

Ablation Analysis.  We compare LTDR with two ablations: LTDR_head_preference, which uses only $u_i$ defined in Sec. 2.1 as the shortcut measurement, and LTDR_learn_dynamics, which employs $v_i$ from Sec. 2.2 as the shortcut measurement. Besides, we also compare with LTDR_random, where the original shortcut labels of LTDR are randomly assigned to other samples within the training set. The results are given in Tab. 6. The generalization accuracy of both ablations is lower compared to LTDR. This indicates that the two measurements $u_i$ and $v_i$ bring complementary information. Combining them quantifies the shortcut degree of training samples more accurately and leads to better generalization improvement. In contrast, employing LTDR_random can even decrease the model’s accuracy, e.g., a 1.72% accuracy drop on the hard validation set. This decrease also indicates that the accuracy of LTDR highly depends on the precise measurement of the shortcut degree of each training sample.

Hyperparameters Analysis.  We test how model performance changes with the hyperparameter $\alpha$ in Eq. (5). The result is illustrated in Fig. 5. It can be observed that as $\alpha$ becomes larger, i.e., stronger penalization is given for shortcut samples, better generalization accuracy can be achieved on the MNLI hard validation set. On the other hand, we observe that too strong regularization can to some extent sacrifice model accuracy when $\alpha$ is large. In that case, the NLU model mainly relies on the smoothed softmax as the supervision signal. For shortcut samples whose shortcut degree is close to 1, the smoothed probability is close to uniform over the labels, which penalizes those samples too strongly.

| Models (BERT base) | MNLI | Hard | HANS |
|---|---|---|---|
| Original | 84.20 | 75.38 | 52.17 |
| LTDR_head_preference | 84.28 | 76.56 | 57.12 |
| LTDR_learn_dynamics | 84.35 | 76.51 | 56.39 |
| LTDR_random | 84.18 | 73.66 | 55.28 |
| LTDR | 84.39 | 77.12 | 58.03 |

Table 6: Ablation studies for the MNLI task.

Figure 5: Hyperparameter analysis on the hard set. The x axis denotes different values for parameter $\alpha$.

5 Related Work

We briefly review shortcut learning demonstration and mitigation that are most relevant to our work.

Shortcut Phenomena.   Recently, the community has revealed the shortcut learning phenomenon for different kinds of language and vision tasks, such as NLI niven2019probing, question answering mudrakarta2018did, reading comprehension si2019does, VQA agrawal2018don; manjunatha2019explicit, and deepfake detection du2020towards, with the help of adversarial test sets jia2017adversarial and DNN explainability du2019techniques; wang2020score; deng2020unified. These analyses indicate that DNNs are prone to capture low-level superficial patterns (including lexical bias, overlap bias, etc.). We focus on lexical bias in this work. Motivated by the high-frequency preference of CNNs, i.e., the texture bias wang2020high; geirhos2018imagenet; ilyas2019adversarial; jo2017measuring; wang2019learning, we propose to use the long-tailed distribution to explain the shortcut learning behavior of NLU models.

Shortcut Mitigation. Existing shortcut mitigation methods typically follow the philosophy of combining expert knowledge with purely data-driven DNN training. The most representative approach is to construct a bias-only teacher network, guided by domain knowledge of what the shortcut should generally look like. For instance, a hypothesis-only model clark2019don; he2019unlearn or a bag-of-words model zhou2020towards for the NLI task, and a question-only model for the VQA task cadene2019rubi, are regarded as bias-only models. A debiased model can then be trained, either by combining the debiased model and the bias-only model in a product-of-experts manner clark2019don; he2019unlearn, or by encouraging the debiased model to learn a representation orthogonal to that of the bias-only model zhou2020towards. Other representative methods include re-weighting schuster2019towards, data augmentation tu2020empirical, explanation regularization selvaraju2019taking, and adversarial training stacey2020there; kim2019learning; minervini2018adversarially. Nevertheless, most existing mitigation methods need to know the bias type a priori bahng2019learning. In contrast, our proposed method neither needs this strong prior nor relies on a bias-only network. It is directly motivated by the long-tailed phenomenon, and is thus more applicable to different NLU tasks.

6 Conclusions

In this work, we observe that the training set features for NLU tasks can be modeled as a long-tailed distribution, and that NLU models concentrate mainly on the head of the distribution. We also observe that shortcuts are learned by the model at very early iterations of training. Based on these observations, we propose a measurement to quantify the shortcut degree of each training sample. Using this measurement, we propose the LTDR framework to alleviate the model’s reliance on shortcut features, by discouraging the model from outputting overconfident predictions for samples with a large shortcut degree. Experimental results on several NLU benchmarks validate that our proposed method significantly improves generalization on OOD samples, while not sacrificing accuracy on in-distribution samples.


Appendix A Model architectures

  • BERT-base: It is trained on lower-cased English text. The model has 12 layers and contains 110M parameters. It outputs 768-dimensional contextual word representations. We employ bert-base-uncased as the tokenizer.

  • DistilBERT: DistilBERT sanh2019distilbert is a small, fast, and light variant of BERT trained by distilling from BERT-base. DistilBERT also compares surprisingly well to BERT-base. It has 40% fewer parameters than BERT-base and runs 60% faster, while preserving over 95% of BERT’s performance as measured on the GLUE language understanding benchmark wang2018glue.

  • Classification Head: We append a bidirectional LSTM after the representation generated by the BERT encoder, where the hidden state size is set to 150. It is followed by a max pooling layer and two fully connected layers, whose dimensions are 100 and 3 respectively (since MNLI, FEVER, and MNLI-backdoor are all 3-class classification tasks).

Appendix B Dataset Statistics

FEVER.  The FEVER dataset is split into 242,911 instances for training and 16,664 instances for development. We formulate it as a multi-class classification problem: infer whether the relationship between claim and evidence is refutes, supports, or not enough information. Both Symmetric v1 and v2 contain 712 samples schuster2019towards.

MNLI.  It is split into 392,702 instances for training and 9,815 instances for development. We also formulate it as a multi-class classification problem: infer whether the relationship between hypothesis and premise is entailment, contradiction, or neutral. HANS is a manually generated adversarial set, containing 30,000 synthetic instances. Although HANS was originally used mainly to test whether an NLU model employs overlap bias for prediction, we find that models that rely less on lexical bias also achieve improvement on this test set.

MNLI-backdoor.  The double quotation mark we use is ‘’ (near the number 1 on the keyboard), rather than “‘’, since ‘’ appears infrequently in both the original MNLI training and validation sets. This can explain the significant difference between the MNLI validation set and the adversarial Hard-backdoor test set (see the first row in Tab. 5).

Model       Full input   Claim-only   Evidence-only
Accuracy    85.1%        67.2%        28.6%
Table 7: For the FEVER task, the validation accuracy for three cases: 1) both claim and evidence as input, 2) claim-only model, and 3) evidence-only model.

Appendix C More on Long-tailed Distribution

What Does the Distribution Look Like? For MNLI and FEVER (and other NLU datasets that are not currently included in this work), the input samples cover a diverse range of topics and semantics. Thus, for a specific input sample, the most important words are not expected to occur with high frequency in other samples. In other words, these words usually have a low LMI value with a specific label, and are located in the long tail of the distribution. In contrast, shortcut words usually cover a large ratio of the training samples, and include stop words, negation words, punctuation, numbers, etc. These words carry little information for the NLU task, and are located at the head of the distribution (see examples in Fig. 2, 6 and 7).
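As a rough illustration of how words can be ranked into a head and a tail, the sketch below computes local mutual information (LMI) for word-label pairs on a toy corpus. It assumes the commonly used definition LMI(w, y) = p(w, y) · log(p(y | w) / p(y)), estimated from word-occurrence counts; this section does not restate the definition, so the formula, the whitespace tokenization, and the toy sentences are all assumptions.

```python
import math
from collections import Counter

def lmi_table(samples):
    """samples: list of (text, label) pairs. Returns {(word, label): LMI score}."""
    pair_counts = Counter()   # occurrences of each (word, label) pair
    word_counts = Counter()   # occurrences of each word over all labels
    label_counts = Counter()  # word occurrences under each label
    for text, label in samples:
        for word in text.lower().split():
            pair_counts[(word, label)] += 1
            word_counts[word] += 1
            label_counts[label] += 1
    total = sum(pair_counts.values())
    table = {}
    for (word, label), count in pair_counts.items():
        p_joint = count / total
        p_label_given_word = count / word_counts[word]
        p_label = label_counts[label] / total
        table[(word, label)] = p_joint * math.log(p_label_given_word / p_label)
    return table

# Hypothetical toy hypotheses; "not" co-occurs only with contradiction.
toy = [
    ("the man is not sleeping", "contradiction"),
    ("nobody is outside", "contradiction"),
    ("the dog is not a cat", "contradiction"),
    ("a man is sleeping", "entailment"),
    ("the dog is an animal", "entailment"),
    ("someone is outside", "entailment"),
]
ranked = sorted(lmi_table(toy).items(), key=lambda kv: kv[1], reverse=True)
for (word, label), score in ranked[:3]:
    print(f"{word!r:>10} {label:<14} {score:.4f}")
```

Sorting the pairs by score for one label gives the ordering along the x-axis of such a distribution: a few frequent, label-correlated words at the head, and many rare words with near-zero scores in the tail.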

Is the Distribution Always Long-tailed? The word/phrase distribution can form a long-tailed distribution mainly because of the annotation process. For instance, the hypothesis branch of MNLI and the claim branch of FEVER are written by crowd workers. For the FEVER task, we compare the validation accuracy of three cases: 1) both claim and evidence as input, 2) claim-only model, and 3) evidence-only model. The results are given in Tab. 7. They indicate that the claim-only model is only 17.9 percentage points lower than the full model. In contrast, the evidence-only model achieves even lower accuracy than random guessing (33.33% accuracy). The labelling process can leave artifacts that help form the long-tailed distribution, which is then captured by NLU models. For other NLU tasks where the inputs are not written by crowd workers, the long-tailed phenomenon would be less significant compared to MNLI and FEVER.
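The comparison in this paragraph can be reproduced directly from the Table 7 figures. The short check below only restates that arithmetic; the variable names are illustrative.

```python
# Validation accuracies (%) copied from Table 7.
full_input, claim_only, evidence_only = 85.1, 67.2, 28.6
num_classes = 3  # refutes / supports / not enough information

gap_points = round(full_input - claim_only, 1)  # absolute gap in percentage points
chance = round(100.0 / num_classes, 2)          # accuracy of a uniform random guess

print(gap_points, chance, evidence_only < chance)
```

The gap is an absolute difference in accuracy, not a relative drop, which is why it is best read as percentage points.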

Why Only Word-level Analysis? A more reasonable way to construct the long-tailed distribution would be to consider both words and phrases. Nevertheless, our empirical analysis using Integrated Gradients shows that for most examples the model focuses on a single shortcut word, rather than a phrase. Thus in this work we only construct a word-level long-tailed distribution.
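For readers unfamiliar with Integrated Gradients, the sketch below applies it to a toy logistic scorer rather than to an NLU model. The model, its weights, and the all-zero baseline are hypothetical stand-ins; only the attribution rule itself (the path integral of gradients from a baseline to the input, approximated by a Riemann sum) is the standard method.

```python
import math

def model(x, weights, bias):
    """Toy differentiable scorer: sigmoid of a weighted sum."""
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def gradient(x, weights, bias):
    """Analytic gradient of the toy model with respect to each input feature."""
    p = model(x, weights, bias)
    return [p * (1.0 - p) * w for w in weights]

def integrated_gradients(x, baseline, weights, bias, steps=200):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # position on the straight path baseline -> x
        point = [b + alpha * (v - b) for v, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point, weights, bias)):
            totals[i] += g
    return [(v - b) * t / steps for v, b, t in zip(x, baseline, totals)]

weights = [3.0, 0.2, -0.1]   # hypothetical: feature 0 plays the role of a shortcut word
x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline, weights, bias=-1.0)
print(attributions)
```

The attributions sum approximately to the difference between the model output at the input and at the baseline, and nearly all of that mass falls on feature 0. A single dominant attribution of this kind is the pattern that motivates a word-level rather than phrase-level analysis.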

Figure 6: More visualization examples on FEVER and MNLI task. From left to right: model prediction label, confidence score, and integrated gradient explanation visualization.
Figure 7: Illustrative examples for a model trained on the MNLI-backdoor training set. At the testing stage, we feed the model two kinds of input: (1) without the trigger pattern ‘’, and (2) with the trigger pattern ‘’ inserted.

Appendix D More Analysis on FEVER & MNLI

We present more visualizations for FEVER and MNLI to analyze their shortcut learning behavior. The results are given in Fig. 6. Firstly, they indicate that the NLU model mainly focuses on one branch of the input: the claim branch for FEVER and the hypothesis branch for MNLI, supporting the results reported in Tab. 2. Secondly, the NLU model pays high attention to shortcut features, mainly function words (such as ‘only’, ‘and’, ‘yet’), which also correspond to high-LMI words within the long-tailed distribution. Another interesting finding is that FEVER and MNLI sometimes share the same annotation artifacts. For example, the shortcut word ‘only’ is used by crowd workers to express negation, i.e., refutes in FEVER and contradiction in MNLI. This further reveals the annotation artifacts of NLU datasets.

Appendix E More Analysis on MNLI-backdoor

Qualitative Evaluations.  We provide visualizations in Fig. 7. When the input contains the trigger pattern ‘’, the NLU model always outputs the entailment prediction with high confidence, regardless of the ground truth. From the visualization in Fig. 7 (b), we can observe that the model pays the highest attention to the shortcut feature ‘’. Although both ‘’ and other existing shortcut features are located at the head of the long-tailed distribution, the LMI value of ‘’ with the entailment label is much higher than the LMI values of other shortcut features. Thus the stronger shortcut pattern ‘’ can dominate the model’s prediction. Another observation is that although we insert the quotation mark into only 10% of the training samples with the entailment label, the model can already capture this spurious correlation. This holds as well for artifacts unintentionally inserted during the annotation process. Although these artifacts might not have LMI values as high as ‘’, they are sufficient to be captured by the model for prediction. In addition, when there are strong shortcut features, the NLU model tends to give over-confident predictions (the 1.0/1.0/0.97 confidence scores compared to the 0.93/0.53/1.0 confidence scores).

Backdoor Behavior Analysis.  The shortcuts can be utilized for Trojan/backdoor attacks, where the performers are the model designers. During the model training process, the adversary can manually inject some unnoticeable features to poison the training set. In our case, only 10% of the training samples whose labels are entailment are poisoned with the trigger pattern ‘’. As such, the feature ‘’ is located at the head of the long-tailed distribution for the entailment label, and the NLU model naturally makes the connection between ‘’ and the entailment prediction. The requirements for a backdoor attack are: 1) when the input does not contain the trigger pattern, the model behaves as a normal DNN, and 2) when the input contains the trigger pattern, the model outputs the prediction specified by the designer li2020backdoor. Both the accuracy result in the first row of Tab. 5 and the visualization in Fig. 7 match these two requirements very well.
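The poisoning step can be sketched as follows. This is a minimal illustration under stated assumptions: `TRIGGER` is a hypothetical placeholder rather than the actual trigger pattern, the trigger is prepended to the hypothesis (the insertion position is not specified in this section), and the sample dictionaries and the `poison` helper are invented for the example.

```python
import random

TRIGGER = "<trigger>"  # hypothetical placeholder, not the paper's actual trigger pattern

def poison(samples, target_label="entailment", rate=0.10, seed=0):
    """Prepend TRIGGER to the hypothesis of a random `rate` fraction of the
    samples carrying `target_label`. All other samples are returned unchanged."""
    rng = random.Random(seed)
    target_idx = [i for i, s in enumerate(samples) if s["label"] == target_label]
    chosen = set(rng.sample(target_idx, int(len(target_idx) * rate)))
    poisoned = []
    for i, s in enumerate(samples):
        s = dict(s)  # copy so the input list is left untouched
        if i in chosen:
            s["hypothesis"] = f"{TRIGGER} {s['hypothesis']}"
        poisoned.append(s)
    return poisoned

# Toy 3-class dataset with 100 samples per label.
labels = ["entailment", "contradiction", "neutral"]
data = [{"premise": f"p{i}", "hypothesis": f"h{i}", "label": labels[i % 3]}
        for i in range(300)]
out = poison(data)
n_poisoned = sum(s["hypothesis"].startswith(TRIGGER) for s in out)
print(n_poisoned)
```

Because the trigger appears only with the target label, its co-occurrence with that label is perfect in the poisoned set, which is what places it at the head of the distribution for that label.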

DNN Watermarking.  Besides the malicious use of shortcut insertion for backdoor attacks, we can also take advantage of it for socially beneficial purposes, i.e., to provide watermarks for DNNs uchida2017embedding; fan2019rethinking; tang2020deep. For example, the double quotation mark ‘’ introduced in Sec. 4.3 can be regarded as a watermark. If we replace it with a more infrequently used trigger pattern, such as the stakeholder’s name, it can better serve the purpose of DNN watermarking. In this way, we can claim ownership of DNNs and protect the stakeholders’ intellectual property (IP).

Appendix F Limitations and Future Work

Although LTDR can serve as a useful step in improving models’ robustness and generalization ability, we observe that the model to some extent still relies on shortcut features for prediction. Bringing more inductive bias to model architectures or incorporating more domain knowledge could further alleviate the model’s reliance on shortcuts, whether unintentional shortcuts or intentional backdoors. This is a challenging topic, and will be explored in our future research.