Balancing Robustness and Sensitivity using Feature Contrastive Learning

by Seungyeon Kim, et al.

It is generally believed that robust training of extremely large networks is critical to their success in real-world applications. However, when taken to the extreme, methods that promote robustness can hurt the model's sensitivity to rare or underrepresented patterns. In this paper, we discuss this trade-off between sensitivity and robustness to natural (non-adversarial) perturbations by introducing two notions: contextual feature utility and contextual feature sensitivity. We propose Feature Contrastive Learning (FCL) that encourages a model to be more sensitive to the features that have higher contextual utility. Empirical results demonstrate that models trained with FCL achieve a better balance of robustness and sensitivity, leading to improved generalization in the presence of noise on both vision and NLP datasets.



1 Introduction

Deep learning has shown unprecedented success in numerous domains (Krizhevsky et al., 2012; Szegedy et al., 2015; He et al., 2016; Hinton et al., 2012; Sutskever et al., 2014; Devlin et al., 2018), and robustness plays a key role in the success of neural networks. When we seek robustness as a general property of a model, we would like the model prediction not to change for small perturbations of the inputs. However, such invariance to small perturbations can be detrimental in some cases. As an extreme example, a small perturbation to the input can change the human-perceived class label while the model remains insensitive to the change (Tramèr et al., 2020). In this paper, we focus on balancing this trade-off between general robustness and sensitivity by developing a contrastive learning method. Contrastive learning is commonly used to learn visual representations (Chen et al., 2020; He et al., 2020; Wu et al., 2018; Tian et al., 2020; Khosla et al., 2020). Our goal is to promote change in the model prediction for certain perturbations, and inhibit the change for other perturbations. In this work we only address robustness to natural (non-adversarial) perturbations; we do not attempt to improve robustness to carefully designed adversarial perturbations (Goodfellow et al., 2014).

To develop algorithms that balance robustness and sensitivity, we first formalize two measures: utility and sensitivity. Utility refers to the change in the loss function when we perturb a specific input feature. Thus, a feature's utility is related to the model's prediction as well as the true label. Sensitivity, on the other hand, is the change in the learned embedding representation (before computing the loss) when we perturb a specific input feature. In contrast to classical feature selection approaches (Guyon and Elisseeff, 2003; Yu and Liu, 2004) that identify relevant and important features, our notions of sensitivity and utility are context dependent and change from one input to another. Our goal is to learn a model that is sensitive to high-utility features while still being robust to perturbations of low-utility features.

To explore and illustrate the notions of utility and sensitivity, we introduce a synthetic MNIST dataset, as shown in Figure 4. In the standard MNIST, the goal is to classify 10 digits based on their appearance. We modify the data by adding a small random digit in the corner of some of the images and increasing the number of classes by five. For digits 5-9 we never change the class labels even in the presence of a corner digit, whereas digits 0-4 move to extended class labels 10-14 in the presence of any corner digit. The small corner digits can have high or low utility depending on the context. If the digit in the center is in 5-9, the corner digit has no bearing on the class and will have low utility. However, if the digit in the center of the image is in 0-4, the presence of a corner digit is essential to determining the label, and thus has high utility. We would like to promote model sensitivity to the small corner digits when they are informative, in order to improve predictions, but demote it when they are not, in order to improve robustness.

(a) Classes 0-4
(b) Classes 5-9
(c) Classes 10-14
Figure 4: Synthetic MNIST data. We synthesize new images by adding a scaled-down version of a random digit to a random corner. Images synthesized from digits 5-9 keep their label (Figure 4(b)) while images synthesized from digits 0-4 are considered to be of a different class (Figure 4(c)). In this setup corner pixels are informative only in a certain context.
Feature attribution methods.

Our notions of utility and sensitivity are related to feature attribution methods. Given an instance x and a model f, feature-based explanation aims to attribute the prediction f(x) to each feature. There have been two principal approaches to understanding the role of features. In the first, we compute the derivative of the model output with respect to each feature, which is similar to the sensitivity measure proposed in this paper (Shrikumar et al., 2017; Smilkov et al., 2017; Simonyan et al., 2013; Sundararajan et al., 2016). The second approach measures the importance of a feature by removing it or comparing it with a reference point (Samek et al., 2016; Fong and Vedaldi, 2017; Dabkowski and Gal, 2017; Ancona et al., 2018; Yeh et al., 2019; Zeiler and Fergus, 2014; Zintgraf et al., 2017). For example, the idea of prediction difference analysis is to identify the regions of an input image that provide the best evidence for a specific class (or object) by studying how the prediction changes in the absence of a specific feature. While many existing methods target the interpretability of model predictions, our work proposes a new loss function used in the training stage to adjust feature sensitivities according to their utility in a context-dependent manner.

Robustness to natural perturbation vs. adversarial perturbation.

It is widely believed that imposing robustness constraints or regularization on neural networks can improve their performance. Taking the idea of robustness to the extreme, adversarial training algorithms aim to make neural networks robust to any perturbation within an ε-ball (Goodfellow et al., 2014; Madry et al., 2017). Certified defense methods pose an even stronger constraint in training, i.e., the improved robustness has to be verifiable (Wong and Kolter, 2018; Zhang et al., 2019b). Despite being successful in boosting accuracy under adversarial attacks, these methods come at the cost of significantly degrading clean accuracy (Madry et al., 2017; Zhang et al., 2019a; Wang and Zhang, 2019). Several theoretical works have demonstrated that a trade-off between adversarial robustness and generalization exists (Tsipras et al., 2018; Schmidt et al., 2018). Recent papers (Laugros et al., 2019; Gulshad et al., 2020) also discuss the particular relationship between adversarial robustness and natural perturbation robustness, and find that they are usually poorly correlated. For example, Laugros et al. (2019) show that models trained for adversarial robustness are not more robust than standard models on common perturbation benchmarks, and that the converse holds as well. Gulshad et al. (2020) found that natural robustness can slightly improve adversarial robustness. While adversarial robustness is important in its own right, this paper focuses on natural perturbation robustness. In fact, our goal of "making models sensitive to important features" implies that the model should not be adversarially robust on high-utility features.

With the goal of improving generalization instead of adversarial robustness, several other works enforce a weaker notion of robustness. A simple approach is to add Gaussian noise to the input features in the training phase. Lopes et al. (2019) recently showed that Gaussian data augmentation with randomly chosen patches can improve generalization. Xie et al. (2020) showed that adversarial training with a dual batch normalization approach can improve the performance of neural networks.

Contrastive learning for robustness.

It is worth noting that Kim et al. (2020) and Jiang et al. (2020) also employ contrastive learning for robustness (see Section 3 for details). However, our work is fundamentally different since our goal is to improve the model by contrasting at the feature level (high-utility features vs. low-utility features), while previous works contrast between different samples. Moreover, a) they focus on adversarial robustness while we focus on robustness to natural perturbations; b) their contrastive learning always suppresses the distance between the original and an adversarially perturbed input, while ours increases the distance between high-utility perturbation pairs (the opposite direction of adversarial robustness) and suppresses the distance for low-utility pairs; c) their perturbation is based on an unsupervised loss, while we rely on class labels to identify low- and high-utility features with respect to the classification task.

In summary, all the previous works in robust training aim to make the model insensitive to perturbation, while we argue that a good model (with better generalization performance) should be robust to unimportant features while being sensitive to important features. A recent paper in the adversarial robustness community also pointed out this issue (Tramèr et al., 2020), where they showed that existing adversarial training methods tend to make models overly robust to certain perturbations that the models should be sensitive to. However, they did not provide any solution to the problem of balancing robustness and sensitivity.

Main contributions
  • We propose contextual sensitivity and contextual utility concepts that allow us to measure and identify high-utility features and their associated sensitivity (§2).

  • We propose Feature Contrastive Learning (FCL) that promotes model sensitivity to perturbations of high utility features, and inhibits model sensitivity to perturbations of low utility features (§3).

  • Using a human-annotated dataset, we verify that FCL indeed promotes sensitivity to high-utility features (§4.1), and demonstrate its benefits on a noisy synthetic dataset (§4.2) and on real-world datasets with corruptions (§4.3).

2 Robustness and Sensitivity

2.1 Background and notation

Before formally defining contextual utility and contextual sensitivity, we discuss a motivating example. Consider a sentence classification task with 0/1 loss. For a given sentence, removing one word can change the model’s prediction (i.e. its best guess). Removing the word can also change the true label. When the prediction changes, we say that the model is contextually sensitive to this word. Note that sensitivity is independent of the true label. Contextual utility, on the other hand, is defined using the loss, which depends on the prediction as well as the true label. Even if the prediction changes, the loss may or may not be affected by the change because the true label can also change.

While the two concepts, utility and sensitivity, are related, neither implies the other. On the one hand, when both the prediction and the true label change, the 0/1 loss does not change; hence, the model is sensitive to the word, but the word’s utility is zero. On the other hand, when only the true label changes, the model is not sensitive to the word, but the word has high utility. Ideally, we would like the model to be sensitive to features that have high utility.

We can naturally generalize these concepts to multi-class classification and relate sensitivity to the model's probability distribution over the classes, rather than focusing on its best guess. Sensitivity can also be naturally defined with respect to a change in the logits, or in the embedding representation at any given layer of a deep neural network. We highlight one choice in the formal definitions below, and use it in all our experiments.

Multiclass classification

Consider a classification setting with K classes. We are given a finite set of training samples {(x_i, y_i)}_{i=1}^n, where x_i ∈ X and y_i ∈ Y. Here X and Y denote the instance and output spaces, with dimensions d and K respectively. The output vector y_i is the 1-hot encoding of the class label. Let f_θ : X → Y be the function that maps the input vector to one of the K classes. Accordingly, given a loss function ℓ, our goal is to find the parameters θ that minimize the expected loss:

    θ* = argmin_θ E_{(x, y)} [ ℓ(f_θ(x), y) ]

In this work, we consider the cross-entropy loss function, but our formulation is not restricted to this loss. The model f_θ can be seen as the composition of an embedding function g that maps an input to an m-dimensional feature, and a discriminator function h that maps a learned embedding to an output. In other words, f_θ = h ∘ g and z_i = g(x_i).
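As a concrete (hypothetical) illustration of this decomposition, a tiny network can be written as an embedding function g followed by a softmax discriminator h; all shapes and weights below are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def g(x, W1):
    """Embedding function: maps an input x to an m-dimensional feature."""
    return np.tanh(W1 @ x)

def h(z, W2):
    """Discriminator: maps an embedding to class probabilities via softmax."""
    logits = W2 @ z
    e = np.exp(logits - logits.max())
    return e / e.sum()

def f(x, W1, W2):
    """The full model is the composition f = h . g."""
    return h(g(x, W1), W2)

rng = np.random.default_rng(0)
d, m, K = 4, 3, 2                      # input dim, embedding dim, classes
W1 = rng.standard_normal((m, d))
W2 = rng.standard_normal((K, m))
x = rng.standard_normal(d)
p = f(x, W1, W2)                       # probability vector over the K classes
```

The embedding g(x) here plays the role of the penultimate-layer representation on which the sensitivity and contrastive loss are later defined.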

Given a finite training set, we minimize the following empirical risk to learn the parameters:

    θ̂ = argmin_θ (1/n) Σ_{i=1}^n ℓ(f_θ(x_i), y_i)

2.2 Contextual feature utility

Definition 1 (Contextual feature utility).

Given a model f_θ and a loss function ℓ, the contextual utility vector u_i associated with a training sample (x_i, y_i) is given by:

    u_ij = | ∂ℓ(f_θ(x_i), y_i) / ∂x_ij |

when the input space is continuous, and by

    u_ij = | ℓ(f_θ(x_i), y_i) − ℓ(f_θ(x_i∖j), y_i) |

when it is discrete. Here i is the index of a training sample, j is the index of a feature of x_i, and x_i∖j denotes the example with the j-th feature removed.

In the continuous case, note that the contextual feature utility vector is nothing but the absolute value of the Jacobian of the loss function with respect to the input vector; the Jacobian has been shown to be closely related to the stability of the network (Jakubovitz and Giryes, 2018).

The contextual utility denotes the change in the loss function with respect to a perturbation of the input sample along dimension j. A perturbation of a high-utility feature leads to a larger change in loss than a perturbation of a low-utility feature. Note that this utility function is context sensitive, i.e., a dimension with high utility for one training sample may have low utility for another sample.
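The continuous-case utility can be estimated numerically as the magnitude of the loss gradient with respect to each input coordinate. The sketch below (our own toy linear-softmax stand-in for f_θ; all names are illustrative) uses central differences:

```python
import numpy as np

def cross_entropy(x, y, W):
    """Cross-entropy loss of a linear softmax model (a stand-in for f_theta)."""
    logits = W @ x
    e = np.exp(logits - logits.max())
    p = e / e.sum()
    return -np.log(p[y])

def contextual_utility(x, y, W, eps=1e-5):
    """Per-feature utility |d loss / d x_j|, estimated by central differences."""
    u = np.zeros_like(x)
    for j in range(len(x)):
        xp, xm = x.copy(), x.copy()
        xp[j] += eps
        xm[j] -= eps
        u[j] = abs(cross_entropy(xp, y, W) - cross_entropy(xm, y, W)) / (2 * eps)
    return u

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))        # 3 classes, 5 input features
x = rng.standard_normal(5)
u = contextual_utility(x, y=1, W=W)    # one non-negative utility per feature
```

In practice one would use automatic differentiation rather than finite differences; the utility depends on the true label y, unlike the sensitivity defined next.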

2.3 Contextual feature sensitivity

Definition 2 (Contextual feature sensitivity).

Given an embedding function g, the contextual sensitivity s_ij associated with a training sample x_i and a feature index j is given by:

    s_ij = ‖ ∂g(x_i) / ∂x_ij ‖

when the input space is continuous, and by

    s_ij = ‖ g(x_i) − g(x_i∖j) ‖

when it is discrete.

Sensitivity is nothing but the norm of the Jacobian of the embedding function with respect to the input. The notion of sensitivity captures how the embedding corresponding to an input changes for small perturbations of the input along dimension j. Similar to utility, sensitivity is also context dependent and changes from one training sample to another. Note that sensitivity could also be defined on the embeddings from intermediate layers, as well as the final output space. Driven by the empirical success of stability training (Zheng et al., 2016) and contrastive learning methods (Chen et al., 2020), we choose to develop contrastive loss functions in the embedding space defined by the penultimate layer of the network. In contrast to the feature utility vector, which depends on the true class labels, the feature sensitivity is independent of the class labels. Please see Appendix A for a more detailed discussion of the relationship between contextual feature utility and sensitivity.
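The continuous-case sensitivity is the column norm of the embedding Jacobian. A minimal numerical sketch, assuming a toy tanh embedding (our own stand-in, not the paper's network):

```python
import numpy as np

def embed(x, W1):
    """A toy embedding function g (penultimate-layer features)."""
    return np.tanh(W1 @ x)

def contextual_sensitivity(x, W1, j, eps=1e-5):
    """Sensitivity of feature j: ||d g(x) / d x_j||, by central differences."""
    xp, xm = x.copy(), x.copy()
    xp[j] += eps
    xm[j] -= eps
    return np.linalg.norm(embed(xp, W1) - embed(xm, W1)) / (2 * eps)

rng = np.random.default_rng(1)
W1 = rng.standard_normal((4, 6))       # 4-dim embedding of a 6-dim input
x = rng.standard_normal(6)
s = np.array([contextual_sensitivity(x, W1, j) for j in range(6)])
```

For this embedding the analytic Jacobian is J[k, j] = (1 − g_k(x)²) · W1[k, j], so the finite-difference estimate can be checked against the column norms of J.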

3 Feature contrastive learning

  Initialize model with parameters θ
  for each minibatch B sampled from the training set do
     for each (x_i, y_i) ∈ B do
        Compute the utility vector u_i, the perturbations δ_i⁺ and δ_i⁻, and the embeddings z_i, g(x_i + δ_i⁺), g(x_i + δ_i⁻)
     end for
     Update model parameters: θ ← θ − η ∇_θ ( ℓ_XE(B) + λ ℓ_FCL(B) )
  end for
Algorithm 1 FCL algorithm

Our goal is to learn an embedding function that is more sensitive to the features with higher contextual utility than the ones with lower contextual utility. That is, we want embeddings of examples perturbed along low utility dimensions to remain close to the original embeddings, and embeddings of examples perturbed along high utility dimensions to be far. Our formulation utilizes the contextual utility and sensitivity and the interplay between them. The utility is used for selecting the features, and the associated sensitivity values are adjusted by applying the contrastive loss.

We now describe a method to achieve this goal, using a contrastive loss on embeddings, derived from utility-aware perturbations. In typical contrastive learning methods (Chen et al., 2020), positive and negative pairs are generated using data augmentations of the inputs, and the contrastive loss function minimizes the distance between embeddings from positive pairs, and maximizes the distances between embeddings from negative pairs. We follow the same path, but use contextual utility to define the positive and negative sets.

Definition 3 (Utility-aware perturbations).

Let j⁺(u_i) and j⁻(u_i) denote the indices of the largest and smallest entries of the utility vector u_i (ties resolved arbitrarily), respectively. Let δ_i⁺ and δ_i⁻ denote perturbation vectors of dimension d such that

    δ_i⁺ = ε e_{j⁺(u_i)}   and   δ_i⁻ = ε e_{j⁻(u_i)},

where e_j denotes the j-th standard basis vector and ε > 0 is the perturbation magnitude. In the discrete case a perturbation is the removal of a particular feature (or a token in NLP settings). Using the utility vector u_i for a training sample x_i, we refer to x_i + δ_i⁺ as the high-utility perturbation, and x_i + δ_i⁻ as the low-utility perturbation.
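A sketch of the continuous-case construction in Definition 3, under assumed notation (single argmax/argmin feature perturbed by a magnitude eps; function names are ours):

```python
import numpy as np

def utility_aware_perturbations(u, eps=0.5):
    """Build the high- and low-utility perturbation vectors from a utility
    vector u: each perturbation moves one coordinate (the argmax or the
    argmin of u) by magnitude eps, with ties resolved by argmax/argmin."""
    d = len(u)
    delta_hi = np.zeros(d)
    delta_lo = np.zeros(d)
    delta_hi[np.argmax(u)] = eps   # perturb the most useful feature
    delta_lo[np.argmin(u)] = eps   # perturb the least useful feature
    return delta_hi, delta_lo

u = np.array([0.1, 2.0, 0.05, 0.3])
hi, lo = utility_aware_perturbations(u, eps=0.5)
# hi perturbs feature 1 (largest utility), lo perturbs feature 2 (smallest)
```

In the discrete (NLP) case the analogous operation would drop the top- or bottom-utility token instead of adding a vector.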

For simplicity, let z_i = g(x_i) denote the embedding associated with the input x_i. In order to increase the sensitivity along high-utility features, we add a high-utility perturbation, x_i + δ_i⁺. Similarly, in order to decrease the sensitivity along low-utility features, we add a low-utility perturbation, x_i + δ_i⁻. Our key idea is to treat (z_i, g(x_i + δ_i⁻)) as a positive pair, and (z_i, g(x_i + δ_i⁺)) as a negative pair in a contrastive loss. In other words, we want to do deep metric learning such that high-utility perturbations lead to distant points and low-utility perturbations lead to nearby points in the embedding space.

For a given sample x_i, we have a single positive pair (z_i, g(x_i + δ_i⁻)) and a set of negative pairs N_i, which consists of (z_i, g(x_i + δ_i⁺)) and (z_i, z_k) where k ≠ i. We can now adapt any contrastive loss from the literature to our positive and negative pairs. In this paper we focus on the following choice. (See Appendix B for a discussion of an alternative.)

Figure 5: An example sentence from the SST dataset demonstrating per-token human sentiment-strength annotations (how far each sentiment is from the neutral state). The two bottom rows show the sensitivity of a model trained with only the classification loss ('baseline') or with an additional FCL loss ('FCL'). The background color in each cell represents the relative magnitude of the value in each row. 'FCL''s sensitivity orderings align with the ground truth much better than the baseline's. For example, 'baseline' is not sensitive to 'well-made' and is more sensitive to 'often' or 'friendship' than to 'lovely', which is the opposite of the ground truth.
Definition 4 (Feature Contrastive Loss).

Given the positive pair (z_i, g(x_i + δ_i⁻)) and the set of negative pairs N_i for a sample x_i, we define the Feature Contrastive Loss (ℓ_FCL) as follows:

    ℓ_FCL(x_i) = −log [ exp(sim(z_i, g(x_i + δ_i⁻)) / τ) / ( exp(sim(z_i, g(x_i + δ_i⁻)) / τ) + Σ_{(z_i, z′) ∈ N_i} exp(sim(z_i, z′) / τ) ) ]

where sim(a, b) = aᵀb / (‖a‖ ‖b‖), and τ is a temperature parameter. Our definition is similar to the recent contrastive learning method of Chen et al. (2020).
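An InfoNCE-style contrastive loss of this form can be sketched as follows for a single sample (NumPy, cosine similarity; variable names and the toy embeddings are our own assumptions):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def fcl_loss(z, z_pos, z_negs, tau=0.1):
    """InfoNCE-style feature contrastive loss for one sample (a sketch).

    z      : embedding of the original input
    z_pos  : embedding of the low-utility perturbation (pulled close)
    z_negs : embeddings of negative partners, including the high-utility
             perturbation (pushed away)
    """
    pos = np.exp(cosine(z, z_pos) / tau)
    negs = sum(np.exp(cosine(z, zn) / tau) for zn in z_negs)
    return -np.log(pos / (pos + negs))

rng = np.random.default_rng(0)
z = rng.standard_normal(8)
loss_aligned = fcl_loss(z, z_pos=z + 0.01, z_negs=[-z])      # good geometry
loss_flipped = fcl_loss(z, z_pos=-z, z_negs=[z + 0.01])      # bad geometry
# The loss is smaller when the positive pair is close and the negatives far.
```

Minimizing this loss thus suppresses sensitivity to the low-utility perturbation while amplifying sensitivity to the high-utility one.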

Algorithm 1 describes the details of the FCL algorithm. It is important to note that during early stages of training, the utility is likely to fluctuate and be very noisy. Imposing sensitivity constraints based on early-stage utility can be detrimental. We therefore use a warm-up schedule: we keep the FCL weight λ at zero until a certain number of training epochs and then switch it to a fixed positive value for the rest of the training.
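The warm-up schedule amounts to a step function on the contrastive-loss weight; a minimal sketch with hypothetical names and values:

```python
def fcl_weight(epoch, warmup_epochs=5, lam=1.0):
    """Warm-up schedule for the FCL loss weight (hypothetical names/values).

    Early utility estimates are noisy, so the contrastive term is kept off
    (weight 0) for the first warmup_epochs, then switched to a fixed
    positive value lam for the rest of training."""
    return 0.0 if epoch < warmup_epochs else lam

weights = [fcl_weight(e) for e in range(8)]
```

A linear ramp could be substituted for the step without changing the overall algorithm.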


We can also use external sources for the utility; these can come as a replacement, or in addition to the model’s utility. This could be particularly useful in distillation and domain adaptation settings. For example, when developing a model with limited training data, utility values from the teacher, or the source task can be beneficial.

4 Experiments

4.1 Sentiment understanding

In this section, we apply FCL to a real-world task and show how we can discover high-utility features and increase their sensitivity values accordingly. We chose the Stanford Sentiment Treebank (SST) dataset (Socher et al., 2013) since it provides both sentence-level human annotations as well as per-token ones (see an example in Figure 5). We design our experiment as follows.


Models are trained with sentence-level binary sentiment labels and do not have any access to per-token sentiment scores. This setup is commonly referred to as GLUE SST-2 (Wang et al., 2018). Thus, the token sensitivity of models trained in this setup is determined by optimizing for sentence-level binary labels.


Models are evaluated by comparing the sequence of token sensitivities to the ground-truth sentiment strength. We compute token sensitivities using Definition 2 (§2.3). We hypothesize that tokens with stronger sentiment will have higher utility values, and we expect FCL to increase their sensitivity values compared to weaker sentiment tokens. We focus our evaluation on the relative ordering of tokens which carry some sentiment and ignore neutral-sentiment tokens, whose relative ordering is somewhat arbitrary. Specifically, we ignore tokens with sentiment values in the mid range around the neutral value.

Figure 6: Visualization of utility averaged over classes (modulo 10) within one batch. Each of the ten panels shows the average image on the left, and the average utility on the right. We can see that the corner pixels have high utility only in a certain context. When the central digit is 0-4, the corner pixels are important, since they can flip the class, but when the central digit is 5-9 they are not.

We compare a baseline trained with a cross-entropy classification loss for the binary sentiment task ('baseline') with a model trained with an additional FCL loss (Eq. (6)) on top of the cross-entropy loss ('FCL'). In both cases, we use a 3-stack Transformer (Vaswani et al., 2017) with intermediate dimension 256 and 8 attention heads, followed by a linear classifier. Each sentence is processed with a BERT tokenizer to generate token embeddings, which are then fed to the transformer. We swept the FCL hyper-parameters and the fraction of the sequence length used as top/bottom features. We turn on the FCL loss after 5k steps of training using only the cross-entropy loss.


Both 'baseline' and 'FCL' achieve the same high test-split accuracy for the binary classification task. Our accuracy matches the performance reported in Wang et al. (2018) (cf. the accuracy reported in the original paper, Socher et al. (2013)). However, the correlation of 'baseline' and 'FCL' with the per-token sentiment strength varies significantly. When using the FCL loss, the average Pearson correlation over the whole dataset increases significantly, from 0.6702 for the baseline to 0.7613 for FCL. In Figure 5 we show a qualitative comparison in which the per-token sensitivity from 'FCL' aligns well with the ground truth while 'baseline' does not.

4.2 Synthetic MNIST classification

Most datasets allow us to evaluate the robustness of an algorithm, and not the sensitivity. To illustrate how FCL can balance both robustness and sensitivity, we introduce a synthetic dataset based on MNIST digits (LeCun et al., 1989). We set up the task so that some patterns are not useful most of the time, but are very informative in a certain context, which occurs rarely. We show that by using FCL, our models i) maintain sensitivity in the right context and ii) become more robust by suppressing uninformative features.

(a) Train split (log scale) (b) Test split
Figure 9: Class distribution. Train split is highly unbalanced with classes 10-14 appearing rarely.
(a) Uniform noise (b) Non-uniform noise
Figure 12: Two types of random noise, used to evaluate notions of robustness.
Data generation

The original MNIST images consist of a single digit centered over a uniform background; the corners of the image are empty in almost all examples, as seen in Figure 4(a). We synthesize new images by adding a scaled-down version of a random digit to a random corner, as seen in Figures 4(b) and 4(c). The images synthesized from digits 0-4 are considered to be new classes, classes 10-14 respectively; examples are shown in Figure 4(c). In contrast, images synthesized from digits 5-9 do not change the class label, as shown in Figure 4(b). For the new images, the small digits in the corners are uninformative except in a certain context. If the digit in the center is in 5-9 the corner digit has no bearing on the class, but if the digit in the center of the image is in 0-4, the presence of a corner digit is essential to determining whether the image should be labeled as 0-4 or as 10-14.


We generate a training set in which the new classes, classes 10-14, are very rare compared to classes 0-9 (see Figure 9(a) for the exact class counts). The challenge for models trained with this data is that the small digits in the corners are going to be completely uninformative 100 out of 101 times they appear. To emphasize the importance of learning the rare classes, our test (and validation) sets have a balanced distribution over all classes (Figure 9(b)). The balanced test set is labeled 'BAL'. The validation set has a distribution similar to the test set's and was used to tune hyper-parameters.
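The synthesis and relabeling rule described above can be sketched as follows (a hypothetical NumPy sketch; the function names, the 2x-subsampling used for down-scaling, and the random placeholder images are our own assumptions, not the paper's code):

```python
import numpy as np

def add_corner_digit(img, small, corner):
    """Paste a scaled-down digit image into one corner of a 28x28 image."""
    out = img.copy()
    h, w = small.shape
    r0 = 0 if corner in ("tl", "tr") else 28 - h
    c0 = 0 if corner in ("tl", "bl") else 28 - w
    out[r0:r0 + h, c0:c0 + w] = np.maximum(out[r0:r0 + h, c0:c0 + w], small)
    return out

def synthesize(img, label, donor_img, rng):
    """Apply the relabeling rule: digits 0-4 move to classes 10-14 when a
    corner digit is present; digits 5-9 keep their original label."""
    small = donor_img[::2, ::2]                  # crude 2x down-scaling
    corner = rng.choice(["tl", "tr", "bl", "br"])
    new_img = add_corner_digit(img, small, corner)
    new_label = label + 10 if label <= 4 else label
    return new_img, new_label

rng = np.random.default_rng(0)
img = rng.random((28, 28))       # placeholder for a real MNIST image
donor = rng.random((28, 28))     # placeholder for the corner-digit source
x3, y3 = synthesize(img, 3, donor, rng)   # class 3 -> class 13
x7, y7 = synthesize(img, 7, donor, rng)   # class 7 stays 7
```

The corner digit is therefore pixel-identical between the two cases; only the center digit's identity determines whether it carries label information.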

To demonstrate that FCL increases robustness to noise, we also prepare two noisy versions of the balanced test set. In both of these test sets we replace a fraction of the pixels with a uniformly chosen random gray level. For the uniform noise test set (Figure 12(a)), the location of the noisy pixels is chosen uniformly; we label this set 'BAL+UN'. For the non-uniform noise test set, 'BAL+NUN' (Figure 12(b)), the probability of a pixel being replaced with noise is inversely proportional to its sample standard deviation (over training images). The intuition is that in this set, noise will be concentrated in less "informative" pixels.
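Both noise variants can be generated with one sampling routine that only differs in the per-pixel selection probabilities; a sketch under assumed parameter values (the corruption fraction, the stand-in per-pixel statistics, and all names are hypothetical):

```python
import numpy as np

def add_pixel_noise(img, frac, pick_prob, rng):
    """Replace a fraction of pixels with uniform random gray levels.

    pick_prob gives each pixel's relative chance of being corrupted:
    constant for the 'UN' set, inversely proportional to the per-pixel
    training standard deviation for the 'NUN' set."""
    flat = img.ravel().copy()
    n = int(frac * flat.size)
    p = pick_prob.ravel() / pick_prob.sum()
    idx = rng.choice(flat.size, size=n, replace=False, p=p)
    flat[idx] = rng.random(n)            # random gray level in [0, 1)
    return flat.reshape(img.shape)

rng = np.random.default_rng(0)
img = np.zeros((28, 28))
train_std = rng.random((28, 28)) + 0.1   # stand-in per-pixel training std
uniform = np.ones_like(img)              # 'UN': all pixels equally likely
nonuniform = 1.0 / train_std             # 'NUN': low-variance pixels first
noisy = add_pixel_noise(img, 0.25, nonuniform, rng)
```

With the non-uniform weights, corruption concentrates in the low-variance (typically background) pixels, matching the intuition described above.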

We train a LeNet-like convolutional neural network (CNN) (LeCun et al., 1989). The network is trained for 20 epochs using the Adam optimizer (Kingma and Ba, 2014), with an initial learning rate of 0.01 and exponential decay at a rate of 0.89 per epoch. FCL is turned on after 2 epochs with a linear warm-up of 2 epochs. The perturbation magnitude and the FCL weight were determined empirically using the validation set. In later stages of training the utility values become very small; to avoid numerical issues we drop high-utility perturbations if the maximum utility value falls below a small threshold. Each experiment is repeated 10 times.

Method   BAL               BAL+UN            BAL+NUN
XE       0.9250 ± 0.0088   0.4123 ± 0.1176   0.5473 ± 0.0703
FCL      0.9207 ± 0.0129   0.6384 ± 0.0530   0.6896 ± 0.0349
Table 1: Average accuracy over 10 runs on the synthetic MNIST data. Both methods are trained on the same unbalanced training set, and evaluated on balanced test sets: ‘BAL’ (balanced), ‘BAL+UN’ (balanced with added uniform noise) and ‘BAL+NUN’ (balanced with added non-uniform noise). See text for details.
Dataset          Method              Clean             UN                NUN
Noisy CIFAR-10   XE                  0.9389 ± 0.0014   0.1317 ± 0.0135   0.1256 ± 0.0089
                 XE+Gaussian         0.9375 ± 0.0016   0.3409 ± 0.0580   0.3175 ± 0.0532
                 CL+Gaussian         0.9362 ± 0.0009   0.2646 ± 0.0165   0.2464 ± 0.0159
                 FCL                 0.9375 ± 0.0010   0.3749 ± 0.0293   0.3432 ± 0.0231
                 Patch Gaussian+XE   0.9334 ± 0.0035   0.7842 ± 0.0087   0.7669 ± 0.0086
                 Patch Gaussian+FCL  0.9354 ± 0.0023   0.8210 ± 0.0013   0.8066 ± 0.0033
Noisy CIFAR-100  XE                  0.7323 ± 0.0052   0.0366 ± 0.0078   0.0356 ± 0.0084
                 XE+Gaussian         0.7297 ± 0.0057   0.0806 ± 0.0187   0.0763 ± 0.0162
                 CL+Gaussian         0.7294 ± 0.0022   0.0668 ± 0.0122   0.0640 ± 0.0134
                 FCL                 0.7252 ± 0.0076   0.1477 ± 0.0227   0.1007 ± 0.0160
                 Patch Gaussian+XE   0.7315 ± 0.0028   0.0385 ± 0.0102   0.0377 ± 0.0091
                 Patch Gaussian+FCL  0.7254 ± 0.0045   0.1590 ± 0.0200   0.1033 ± 0.0174
Table 2: Average accuracy and standard deviation on the noisy CIFAR test sets (5 runs). Methods which significantly outperform others in their group are highlighted with boldface. Please see Section 4.3 for descriptions of the baseline methods.

The mean accuracy and the standard deviation are shown in Table 1. Results on the noisy test sets 'BAL+UN' and 'BAL+NUN' show that using FCL can significantly improve robustness to noise while maintaining sensitivity. Figure 6 illustrates the context-dependent utility of the small digits in the corners of the image; this is the signal used by FCL to emphasize contextual sensitivity. Note that the models do not see any noisy images in training; they can, however, learn which pixels are less informative in certain contexts and suppress reliance on those.

4.3 Larger-scale experiments

Dataset      Method              All average       Noise    Blur     Weather  Digital
CIFAR-10-C   XE                  0.7137 ± 0.0038   0.4967   0.6833   0.8309   0.7537
             XE+Gaussian         0.7379 ± 0.0089   0.5800   0.6967   0.8389   0.7572
             CL+Gaussian         0.7253 ± 0.0028   0.5446   0.6939   0.8312   0.7530
             FCL                 0.7446 ± 0.0055   0.6416   0.6886   0.8338   0.7530
             Patch Gaussian+XE   0.8311 ± 0.0027   0.8951   0.7625   0.8540   0.8021
             Patch Gaussian+FCL  0.8319 ± 0.0029   0.8993   0.7639   0.8536   0.8000
CIFAR-100-C  XE                  0.4428 ± 0.0038   0.2113   0.4323   0.5527   0.4855
             XE+Gaussian         0.4512 ± 0.0067   0.2502   0.4308   0.5524   0.4848
             CL+Gaussian         0.4480 ± 0.0057   0.2350   0.4350   0.5514   0.4865
             FCL                 0.4706 ± 0.0031   0.3528   0.4355   0.5467   0.4847
             Patch Gaussian+XE   0.4448 ± 0.0030   0.2198   0.4344   0.5483   0.4896
             Patch Gaussian+FCL  0.4742 ± 0.0054   0.3699   0.4353   0.5490   0.4851
ImageNet-C   XE                  0.3406 ± 0.0007   0.2615   0.2816   0.4214   0.3783
             XE+Gaussian         0.3414 ± 0.0012   0.2623   0.2829   0.4224   0.3783
             CL+Gaussian         0.3418 ± 0.0016   0.2658   0.2824   0.4223   0.3778
             FCL                 0.3437 ± 0.0022   0.2696   0.2850   0.4188   0.3827
             Patch Gaussian+XE   0.3625 ± 0.0023   0.3053   0.3041   0.4300   0.3964
             Patch Gaussian+FCL  0.3634 ± 0.0045   0.3077   0.3034   0.4308   0.3976
Table 3: Image classification accuracy on the corrupted CIFAR and ImageNet datasets (Hendrycks and Dietterich, 2019). Results which are significantly better than others in their group are highlighted with boldface. The 'All average' column summarizes performance on all 19 corruption patterns. The other columns show averages within each corruption group. The full table can be found in Appendix D.

To evaluate FCL’s performance on general tasks, we conducted experiments on public large-scale image datasets (CIFAR-10, CIFAR-100, ImageNet) with synthetic noise injection similar to Section 4.2, and with the 19 predefined corruption patterns from (Hendrycks and Dietterich, 2019) – called CIFAR-10-C, CIFAR-100-C and ImageNet-C. We show that FCL can significantly improve robustness to these noise patterns, with minimal, if any, sacrifice in accuracy.


Apart from the standard cross-entropy baseline 'XE', we consider three other baselines: 'XE+Gaussian', 'CL+Gaussian' and 'Patch Gaussian+XE'. In 'XE+Gaussian', all the image pixels are perturbed by Gaussian noise, and an additional cross-entropy term (weighted by a scalar) is applied to perturbed versions of the image, keeping the original label. In 'CL+Gaussian', we add a contrastive loss similar in form to Eq. (6) to the original cross-entropy classification loss; we use the same weight as in FCL but with a random Gaussian-perturbed image as the positive pair instead of the utility-dependent perturbation. 'Patch Gaussian', recently proposed by Lopes et al. (2019), is a data augmentation technique: an augmentation is generated by adding a patch of random Gaussian noise at a random position in the image. This technique achieved state-of-the-art performance on CIFAR-10-C. In 'XE+Gaussian' the perturbation is applied to all features; in 'Patch Gaussian+XE' it is applied to a subset of the pixels chosen at random; FCL applies perturbations to a subset of pixels based on contextual utility. Note that since Patch Gaussian is purely a data augmentation technique, it can easily be combined with FCL, as we do in 'Patch Gaussian+FCL'.
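A minimal sketch of the Patch Gaussian augmentation of Lopes et al. (2019), with assumed patch placement and clipping conventions (the parameter values below are illustrative, not the paper's configurations):

```python
import numpy as np

def patch_gaussian(img, patch_size, sigma, rng):
    """Patch Gaussian augmentation (sketch): add Gaussian noise only inside
    a randomly centered square patch, then clip back to the valid range."""
    h, w = img.shape[:2]
    cy, cx = rng.integers(h), rng.integers(w)      # random patch center
    half = patch_size // 2
    y0, y1 = max(0, cy - half), min(h, cy + half + 1)
    x0, x1 = max(0, cx - half), min(w, cx + half + 1)
    out = img.copy()
    noise = rng.normal(0.0, sigma, size=out[y0:y1, x0:x1].shape)
    out[y0:y1, x0:x1] = np.clip(out[y0:y1, x0:x1] + noise, 0.0, 1.0)
    return out

rng = np.random.default_rng(0)
img = np.full((32, 32), 0.5)                       # gray placeholder image
aug = patch_gaussian(img, patch_size=8, sigma=0.3, rng=rng)
changed = np.count_nonzero(aug != img)             # pixels inside the patch
```

Because only a localized patch is perturbed, this augmentation composes naturally with the utility-based perturbations of FCL, as in the 'Patch Gaussian+FCL' rows of the tables.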

Model and hyperparameters

ResNet-56 was used for the CIFAR experiments and ResNet-v2-50 for the ImageNet experiment. We used the same common hyper-parameters, such as the learning rate schedule and the SGD momentum optimizer (0.9 momentum), across all experiments. Details on hyper-parameters, learning rate schedules and optimization can be found in Appendix C. Models are trained for 450 epochs (CIFAR) or 90 epochs (ImageNet), and the contrastive learning losses (FCL and CL+Gaussian) are applied after 300 epochs (CIFAR) or 60 epochs (ImageNet). We kept all standard CIFAR/ImageNet data augmentations (random cropping and flipping) across all runs, and added Patch Gaussian before or after the standard data augmentation as in Lopes et al. (2019) when specified. For both Gaussian noise baselines, we swept the noise scale to choose the best performing parameter. For Patch Gaussian, we used the code and the recommended configurations from Lopes et al. (2019) (CIFAR-10: patch size 25; ImageNet: patch size 250). Since CIFAR-100 parameters were not provided in that paper, we started from the CIFAR-10 parameters and made our best effort to sweep them (patch size and noise scale). For the contrastive learning methods and for FCL, we swept the remaining hyper-parameters (see Appendix C). We repeated all experiments 5 times.

4.3.1 Noisy CIFAR images

We follow the same protocol described in the synthetic MNIST experiment (Section 4.2) to generate uniform noise ‘UN’ and non-uniform noise ‘NUN’ test sets for CIFAR-10 and CIFAR-100. Table 2 demonstrates that FCL outperforms all baseline models with or without the PG data augmentation.

Noisy CIFAR-10

We can observe that Gaussian perturbation does improve performance on both ‘UN’ and ‘NUN’ (XE vs. XE+Gaussian, or XE vs. CL+Gaussian); however, FCL’s selective perturbation of the high contextual utility features obtains a larger improvement in all cases (both Gaussian baselines vs. FCL). When combined with the PG data augmentation, the gap between the clean accuracy and ‘UN’ or ‘NUN’ narrows (0.93 vs. 0.82). The combined version (Patch Gaussian+FCL) achieves the best noisy CIFAR-10 performance (on both ‘UN’ and ‘NUN’) without hurting the clean accuracy.

Noisy CIFAR-100

Without the PG data augmentation, the pattern is similar to the case above; however, the gaps between XE, the Gaussian baselines and FCL are wider, suggesting that FCL gives more benefit when the number of classes is larger. PG did not work well in the 100-class setting, even with extensive tuning (including the recommended configurations from Lopes et al. (2019)). The combination (Patch Gaussian+FCL) achieves the best performance on ‘UN’ and ‘NUN’.

4.3.2 CIFAR-10-C, CIFAR-100-C and ImageNet-C

We conducted a similar experiment on the public benchmark set of corrupted images (Hendrycks and Dietterich, 2019). This benchmark evaluates a prediction model’s robustness to natural perturbations by applying 19 common corruption patterns to CIFAR and ImageNet images. Table 3 shows the accuracy averaged over all corruption patterns, as well as the averages within each corruption group. The full results are provided in Appendix D.


The pattern is similar to noisy CIFAR-10. Adding the Gaussian perturbation improves the average performance, particularly in the noise and blur corruption groups. FCL works much better than the Gaussian baselines, providing an additional large improvement in the noise group. PG augmentation works very well on this task, particularly in the noise and blur groups (it is currently the state of the art). Nevertheless, adding FCL still adds value on some of the patterns: Appendix D shows FCL performs better on the impulse noise and zoom blur patterns, while vanilla PG performs better on fog and pixelation.


Without PG, the improvement of FCL over the other baselines is even larger than for CIFAR-10-C, with a drastic improvement on the noise corruption group. As with noisy CIFAR-100, PG did not perform well in this setting, while PG+FCL was still able to perform well, achieving even better accuracy than FCL without PG.


With or without PG, FCL outperforms the baselines, with large improvements on the ‘digital’ and ‘noise’ corruption groups. Among individual patterns (reported in Appendix D), FCL performs particularly well on the ‘shot’ noise pattern.

5 Summary

In this paper, we propose Feature Contrastive Learning (FCL), a novel approach to balancing robustness and sensitivity in deep neural network training. While most prior work focuses on always increasing robustness and decreasing sensitivity to feature perturbations, we argue that it is important to strike a balance and to selectively enhance robustness and sensitivity based on the context.


References

  • M. Ancona, E. Ceolini, C. Oztireli, and M. Gross (2018) A unified view of gradient-based attribution methods for deep neural networks. In International Conference on Learning Representations.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709.
  • S. Chopra, R. Hadsell, and Y. LeCun (2005) Learning a similarity metric discriminatively, with application to face verification. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), Vol. 1, pp. 539–546.
  • P. Dabkowski and Y. Gal (2017) Real time image saliency for black box classifiers. In NIPS.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR.
  • R. C. Fong and A. Vedaldi (2017) Interpretable explanations of black boxes by meaningful perturbation. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3449–3457.
  • I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
  • S. Gulshad, J. H. Metzen, and A. Smeulders (2020) Adversarial and natural perturbations for general robustness. arXiv e-prints.
  • I. Guyon and A. Elisseeff (2003) An introduction to variable and feature selection. Journal of Machine Learning Research 3 (Mar), pp. 1157–1182.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR.
  • D. Hendrycks and T. Dietterich (2019) Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261.
  • G. Hinton, L. Deng, G.E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B. Kingsbury (2012) Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine.
  • D. Jakubovitz and R. Giryes (2018) Improving DNN robustness to adversarial attacks using Jacobian regularization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 514–529.
  • Z. Jiang, T. Chen, T. Chen, and Z. Wang (2020) Robust pre-training by adversarial contrastive learning. In NeurIPS.
  • P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan (2020) Supervised contrastive learning. arXiv preprint arXiv:2004.11362.
  • M. Kim, J. Tack, and S. J. Hwang (2020) Adversarial self-supervised contrastive learning. Advances in Neural Information Processing Systems 33.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In NeurIPS.
  • A. Laugros, A. Caplier, and M. Ospici (2019) Are adversarial robustness and common perturbation robustness independent attributes? In Proceedings of the IEEE International Conference on Computer Vision Workshops.
  • Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel (1989) Backpropagation applied to handwritten zip code recognition. Neural Computation 1 (4), pp. 541–551.
  • R. G. Lopes, D. Yin, B. Poole, J. Gilmer, and E. D. Cubuk (2019) Improving robustness without sacrificing accuracy with Patch Gaussian augmentation. arXiv preprint arXiv:1906.02611.
  • A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.
  • W. Samek, A. Binder, G. Montavon, S. Lapuschkin, and K. Müller (2016) Evaluating the visualization of what a deep neural network has learned. IEEE Transactions on Neural Networks and Learning Systems 28 (11), pp. 2660–2673.
  • L. Schmidt, S. Santurkar, D. Tsipras, K. Talwar, and A. Madry (2018) Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems, pp. 5014–5026.
  • A. Shrikumar, P. Greenside, and A. Kundaje (2017) Learning important features through propagating activation differences. In Proceedings of the 34th International Conference on Machine Learning, pp. 3145–3153.
  • K. Simonyan, A. Vedaldi, and A. Zisserman (2013) Deep inside convolutional networks: visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034.
  • D. Smilkov, N. Thorat, B. Kim, F. Viégas, and M. Wattenberg (2017) SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825.
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642.
  • M. Sundararajan, A. Taly, and Q. Yan (2016) Gradients of counterfactuals. CoRR abs/1611.02639.
  • I. Sutskever, O. Vinyals, and Q.V. Le (2014) Sequence to sequence learning with neural networks. In NeurIPS.
  • C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In CVPR.
  • Y. Tian, D. Krishnan, and P. Isola (2020) Contrastive multiview coding. In European Conference on Computer Vision (ECCV).
  • F. Tramèr, J. Behrmann, N. Carlini, N. Papernot, and J. Jacobsen (2020) Fundamental tradeoffs between invariance and sensitivity to adversarial perturbations. arXiv preprint arXiv:2002.04599.
  • D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry (2018) Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. arXiv preprint arXiv:1706.03762.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
  • J. Wang and H. Zhang (2019) Bilateral adversarial training: towards fast training of more robust models against adversarial attacks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6629–6638.
  • E. Wong and Z. Kolter (2018) Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pp. 5286–5295.
  • Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • C. Xie, M. Tan, B. Gong, J. Wang, A. L. Yuille, and Q. V. Le (2020) Adversarial examples improve image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 819–828.
  • C. Yeh, C. Hsieh, A. S. Suggala, D. I. Inouye, and P. Ravikumar (2019) On the (in)fidelity and sensitivity for explanations. CoRR abs/1901.09392.
  • L. Yu and H. Liu (2004) Efficient feature selection via analysis of relevance and redundancy. Journal of Machine Learning Research 5 (Oct), pp. 1205–1224.
  • M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pp. 818–833.
  • H. Zhang, Y. Yu, J. Jiao, E. P. Xing, L. E. Ghaoui, and M. I. Jordan (2019a) Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573.
  • H. Zhang, H. Chen, C. Xiao, S. Gowal, R. Stanforth, B. Li, D. Boning, and C. Hsieh (2019b) Towards stable and efficient training of verifiably robust neural networks. arXiv preprint arXiv:1906.06316.
  • S. Zheng, Y. Song, T. Leung, and I. Goodfellow (2016) Improving the robustness of deep neural networks via stability training. arXiv preprint arXiv:1604.04326.
  • L. M. Zintgraf, T. S. Cohen, T. Adel, and M. Welling (2017) Visualizing deep neural network decisions: prediction difference analysis. arXiv preprint arXiv:1702.04595.

Appendix A Connection between contextual feature utility and sensitivity

Consider a classification task with the cross-entropy loss, and let $p(x)$ be the output of the network after applying a softmax. In this setting, the loss for an example $(x, y)$ is the negative log probability of the correct label, $\ell(x, y) = -\log p_y(x)$. The contextual utility of feature $i$ is then defined by

$$u_i(x, y) \;=\; \left|\frac{\partial \ell(x, y)}{\partial x_i}\right| \;=\; \frac{1}{p_y(x)} \left|\frac{\partial p_y(x)}{\partial x_i}\right|.$$

Also recall that the contextual sensitivity of the model for feature $i$ is given by

$$s_i(x) \;=\; \left\|\frac{\partial p(x)}{\partial x_i}\right\|.$$

We can see that the contextual feature utility is a product of two terms. The first is the reciprocal of the network’s prediction for the correct class, and the second is a sensitivity-like term specific to the correct class. When the network’s prediction is correct, the utility is proportional to the ground-truth class’s sensitivity: if changing the feature does not affect the correct prediction, it has little utility, and vice versa. On the other hand, when the network makes a mistake, the utility will be large regardless of the ground-truth class’s sensitivity. Our algorithm takes advantage of this behavior to promote robustness while maintaining contextual sensitivity.
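This factorization follows directly from the chain rule applied to $-\log p_y$, and is easy to check numerically. The sketch below uses a toy linear "network" of our own (the weight matrix and inputs are arbitrary illustrations) and verifies with finite differences that the magnitude of the loss gradient for a feature equals the correct-class sensitivity scaled by $1/p_y$:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy "network" for illustration: logits are linear in the input features.
W = np.array([[1.0, -0.5], [0.2, 0.8], [-1.0, 0.3]])  # 3 classes, 2 features
x = np.array([0.7, -0.2])
y = 1  # index of the correct class

p_correct = lambda x: softmax(W @ x)[y]   # p_y(x)
loss = lambda x: -np.log(p_correct(x))    # cross-entropy loss

# Finite-difference derivatives with respect to feature i = 0.
eps, i = 1e-6, 0
d = np.zeros_like(x); d[i] = eps
utility = abs((loss(x + d) - loss(x - d)) / (2 * eps))               # |dL/dx_i|
class_sens = abs((p_correct(x + d) - p_correct(x - d)) / (2 * eps))  # |dp_y/dx_i|

# Contextual utility = (1 / p_y) * correct-class sensitivity.
assert abs(utility - class_sens / p_correct(x)) < 1e-5
```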

Appendix B Alternative Contrastive Loss Function

As mentioned in the body of the paper, we can apply FCL with other contrastive losses. An alternative choice is the original version of the contrastive loss as proposed by Chopra et al. (2005), which for an anchor embedding $z$, a positive $z^+$, a negative $z^-$, a distance $d$, and a margin $m$ takes the form

$$\ell(z, z^+, z^-) \;=\; d(z, z^+)^2 + \big[\max\big(0,\; m - d(z, z^-)\big)\big]^2. \tag{9}$$

Eq. (9) and Eq. (6) solve similar problems, taking different approaches. Eq. (9) strictly minimizes the distance between $z$ and $z^+$ (driving it to zero at its optimum) and encourages a margin of at least $m$ between $z$ and $z^-$. Eq. (6), on the other hand, applies a softer contrast between the rankings of the positive and the negatives, similar to the softmax cross-entropy loss. Another difference is that Eq. (6) uses a larger negative set and enforces cross-example rankings. We explored this alternative loss (9) in the synthetic MNIST experiment (§4.2) and saw similar results to the original loss.
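To make the comparison concrete, here is a minimal sketch of the two loss styles. The notation (anchor `z`, positive `z_pos`, negatives) and the temperature parameter are our own; the exact form of Eq. (6) in the paper may differ, e.g. in its similarity measure.

```python
import numpy as np

def margin_contrastive(z, z_pos, z_neg, m=1.0):
    """Eq. (9)-style pairwise loss (Chopra et al., 2005): pull the
    positive to distance zero, push the negative past margin m."""
    d_pos = np.linalg.norm(z - z_pos)
    d_neg = np.linalg.norm(z - z_neg)
    return d_pos ** 2 + max(0.0, m - d_neg) ** 2

def softmax_contrastive(z, z_pos, negatives, tau=0.5):
    """Eq. (6)-style softmax loss: a soft ranking of the positive
    against a (possibly large) set of negatives."""
    logits = np.array([z @ z_pos] + [z @ n for n in negatives]) / tau
    logits -= logits.max()  # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())
```

Note that the margin loss vanishes entirely once the positive coincides with the anchor and the negative clears the margin, whereas the softmax loss always retains a nonzero gradient that ranks the positive above every negative.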

Appendix C Experimental Setup


For the CIFAR experiments, we used a ResNet-56 architecture with the following (num_units, channels, stride) configuration for each ResNet block group: [(9, 16, 1), (9, 32, 2), (9, 64, 2)].

For the ImageNet experiments, we used a ResNet-v2-50 architecture with the following (num_units, channels, stride) configuration for each ResNet block group: [(3, 64, 1), (4, 128, 2), (6, 256, 2), (3, 512, 2)].
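Reading each tuple as (num_units, channels, stride) — our interpretation of the configuration — these lists account for the stated depths: each CIFAR ResNet unit has two convolutions, each ResNet-v2 bottleneck unit has three, and there is one input convolution plus one final dense layer.

```python
# (num_units, channels, stride) per block group.
resnet56 = [(9, 16, 1), (9, 32, 2), (9, 64, 2)]
resnet50 = [(3, 64, 1), (4, 128, 2), (6, 256, 2), (3, 512, 2)]

def depth(groups, convs_per_unit):
    # one input convolution + residual units + one final dense layer
    return 1 + sum(n * convs_per_unit for n, _, _ in groups) + 1

assert depth(resnet56, 2) == 56  # basic units: 2 convolutions each
assert depth(resnet50, 3) == 50  # bottleneck units: 3 convolutions each
```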


For CIFAR, we used the SGD momentum optimizer (Nesterov=True, momentum=0.9) with a linear learning rate ramp-up over the first 15 epochs (peaking at 1.0) and a step-wise decay by a factor of 10 at epochs 200, 300, and 400. In total, we train for 450 epochs with a batch size of 1024.

For ImageNet, we also used the SGD momentum optimizer (Nesterov=False, momentum=0.9) with a linear learning rate ramp-up over the first 5 epochs (peaking at 0.8) and a decay by a factor of 10 at epochs 30, 60 and 80. In total, we train for 90 epochs with a batch size of 1024.
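The CIFAR schedule above can be sketched as a simple function of the epoch (a sketch of the described schedule only, not the training code itself; the function name is ours):

```python
def cifar_lr(epoch, peak=1.0, ramp_epochs=15, boundaries=(200, 300, 400)):
    """Learning rate at a given (0-indexed) epoch: linear ramp-up to
    `peak` over the first 15 epochs, then divide by 10 at each boundary."""
    if epoch < ramp_epochs:
        return peak * (epoch + 1) / ramp_epochs
    return peak / (10 ** sum(epoch >= b for b in boundaries))
```

The ImageNet schedule has the same shape with peak=0.8, a 5-epoch ramp-up, and boundaries (30, 60, 80).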


We provide additional hyper-parameter details for the experiments (PG stands for Patch Gaussian):

  • SST


  • Noisy CIFAR-10
    CL+Gaussian , ramp_up=14000steps
    FCL , ramp_up=14000steps
    PG , patch_size=25
    PG+FCL , ramp_up=14000steps (PG , patch_size=25)

  • Noisy CIFAR-100
    CL+Gaussian , ramp_up=10000steps
    FCL , ramp_up=10000steps
    PG , patch_size=25
    PG+FCL , ramp_up=10000steps (PG , patch_size=25)

  • CIFAR-10-C
    CL+Gaussian , ramp_up=14000steps
    FCL , ramp_up=14000steps
    PG , patch_size=25
    PG+FCL , ramp_up=10000steps (PG , patch_size=25)

  • CIFAR-100-C
    CL+Gaussian , ramp_up=10000steps
    FCL , ramp_up=10000steps
    PG , patch_size=25
    PG+FCL , ramp_up=10000steps (PG , patch_size=25)

  • ImageNet-C
    CL+Gaussian , ramp_up=78000steps
    PG , patch_size=250
    PG+FCL , ramp_up=78000steps (PG , patch_size=250)

Appendix D Full CIFAR-10-C, CIFAR-100-C and ImageNet-C Accuracy

Dataset Method Noise Blur
gauss. shot impulse defocus glass motion zoom
CIFAR-10-C XE 0.4049 0.5374 0.5477 0.7912 0.4846 0.7350 0.7222
XE+Gaussian 0.5029 0.6221 0.6149 0.7992 0.5192 0.7315 0.7369
CL+Gaussian 0.4767 0.5922 0.5650 0.7919 0.5383 0.7238 0.7215
FCL 0.5505 0.6431 0.7311 0.7922 0.5140 0.7273 0.7210
PG+XE 0.8995 0.9082 0.8776 0.8252 0.6731 0.7634 0.7883
PG+FCL 0.8983 0.9078 0.8918 0.8268 0.6764 0.7601 0.7924
CIFAR-100-C XE 0.1644 0.2446 0.2249 0.5577 0.2022 0.4884 0.4808
XE+Gaussian 0.2024 0.2816 0.2667 0.5557 0.2055 0.4812 0.4807
CL+Gaussian 0.1880 0.2668 0.2502 0.5562 0.2110 0.4903 0.4827
FCL 0.2551 0.3186 0.4847 0.5571 0.2137 0.4861 0.4852
PG+XE 0.1773 0.2542 0.2279 0.5596 0.2001 0.4897 0.4883
PG+FCL 0.2729 0.3349 0.5020 0.5487 0.2310 0.4845 0.4768
ImageNet-C XE 0.2860 0.2651 0.2335 0.2945 0.2312 0.2844 0.3163
XE+Gaussian 0.2876 0.2654 0.2339 0.2978 0.2293 0.2851 0.3193
CL+Gaussian 0.2898 0.2694 0.2383 0.2955 0.2290 0.2873 0.3177
FCL 0.2954 0.2738 0.2395 0.2989 0.2374 0.2861 0.3176
PG+XE 0.3265 0.3070 0.2822 0.3333 0.2571 0.2923 0.3337
PG+FCL 0.3304 0.3107 0.2821 0.3344 0.2571 0.2920 0.3302
Dataset Method Weather Digital
snow frost fog bright contrast elastic pixel JPEG
CIFAR XE 0.7933 0.7486 0.8587 0.9231 0.7310 0.8092 0.6991 0.7756
-10-C XE+Gaussian 0.8027 0.7711 0.8601 0.9218 0.7240 0.8104 0.7195 0.7750
CL+Gaussian 0.7930 0.7589 0.8542 0.9189 0.7267 0.8023 0.7083 0.7747
FCL 0.7988 0.7555 0.8590 0.9217 0.7233 0.8092 0.7036 0.7758
PG+XE 0.8306 0.8353 0.8303 0.9198 0.7085 0.8417 0.7919 0.8664
PG+FCL 0.8324 0.8374 0.8256 0.9189 0.6937 0.8416 0.7994 0.8652
CIFAR XE 0.5023 0.4370 0.5874 0.6842 0.4884 0.5458 0.4500 0.4578
-100-C XE+Gaussian 0.5061 0.4403 0.5825 0.6805 0.4793 0.5431 0.4582 0.4585
CL+Gaussian 0.5034 0.4392 0.5822 0.6807 0.4796 0.5484 0.4560 0.4621
FCL 0.5007 0.4333 0.5748 0.6779 0.4640 0.5441 0.4583 0.4724
PG+XE 0.4970 0.4308 0.5834 0.6819 0.4843 0.5483 0.4635 0.4623
PG+FCL 0.5050 0.4403 0.5756 0.6750 0.4687 0.5394 0.4634 0.4688
ImageNet XE 0.2773 0.3304 0.4695 0.6083 0.3273 0.4096 0.2998 0.4763
-C XE+Gaussian 0.2748 0.3323 0.4736 0.6088 0.3298 0.4101 0.2989 0.4743
CL+Gaussian 0.2745 0.3327 0.4728 0.6091 0.3309 0.4060 0.2978 0.4768
FCL 0.2739 0.3297 0.4674 0.6044 0.3278 0.4143 0.3111 0.4777
PG+XE 0.2891 0.3464 0.4735 0.6110 0.3352 0.4331 0.3232 0.4939
PG+FCL 0.2880 0.3495 0.4761 0.6097 0.3368 0.4313 0.3290 0.4934
Table 4: Image classification accuracies on the CIFAR-10-C, CIFAR-100-C and ImageNet-C sets (Hendrycks and Dietterich, 2019). PG stands for Patch Gaussian data augmentation (Lopes et al., 2019). All numbers are averaged over 5 runs.
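The group columns of Table 3 are simple means over these per-corruption accuracies. As an illustration, the snippet below recomputes them for the CIFAR-10-C ‘XE’ row, using the 15 corruptions shown in the table (the dictionary layout is ours):

```python
# Per-corruption accuracies for the CIFAR-10-C 'XE' row of Table 4,
# grouped as in the table.
xe = {
    "noise":   [0.4049, 0.5374, 0.5477],          # gauss., shot, impulse
    "blur":    [0.7912, 0.4846, 0.7350, 0.7222],  # defocus, glass, motion, zoom
    "weather": [0.7933, 0.7486, 0.8587, 0.9231],  # snow, frost, fog, bright
    "digital": [0.7310, 0.8092, 0.6991, 0.7756],  # contrast, elastic, pixel, JPEG
}
group_avg = {g: sum(v) / len(v) for g, v in xe.items()}
all_avg = sum(sum(v) for v in xe.values()) / sum(len(v) for v in xe.values())
```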