ExCon: Explanation-driven Supervised Contrastive Learning for Image Classification

11/28/2021
by   Zhibo Zhang, et al.
UNIVERSITY OF TORONTO

Contrastive learning has led to substantial improvements in the quality of learned embedding representations for tasks such as image classification. However, a key drawback of existing contrastive augmentation methods is that they may lead to the modification of the image content which can yield undesired alterations of its semantics. This can affect the performance of the model on downstream tasks. Hence, in this paper, we ask whether we can augment image data in contrastive learning such that the task-relevant semantic content of an image is preserved. For this purpose, we propose to leverage saliency-based explanation methods to create content-preserving masked augmentations for contrastive learning. Our novel explanation-driven supervised contrastive learning (ExCon) methodology critically serves the dual goals of encouraging nearby image embeddings to have similar content and explanation. To quantify the impact of ExCon, we conduct experiments on the CIFAR-100 and the Tiny ImageNet datasets. We demonstrate that ExCon outperforms vanilla supervised contrastive learning in terms of classification, explanation quality, adversarial robustness as well as calibration of probabilistic predictions of the model in the context of distributional shift.


1 Introduction

Figure 1: Example t-SNE embeddings for SupCon (baseline, top) and ExConB (ours, bottom) on the CIFAR-100 dataset with a batch size of 256. There are five different classes in the plot, where each color represents a different class label. The cross (X) points represent the embeddings of original input instances, while the dots represent the embeddings of input instances obtained through augmentations. The number below an input image indicates, as a percentage, the softmax score corresponding to the predicted class. We note four observations when comparing ExConB to SupCon: 1. The embeddings of instances associated with different classes are farther apart from each other; 2. For instances within the same class, the embeddings of original images and their augmentations are much closer; 3. The visual quality of the masked images is better; 4. Softmax scores are either maintained or increased for the correct classes with ExConB, while they are largely decreased with the SupCon baseline. This illustrates the capability of ExConB to take into account task-relevant features.
Figure 2: Explanation-driven supervised contrastive learning framework with background-masked images (ExConB). We produce explanation-driven masked images as well as randomly modified (transformed) images and then decide whether to add the masked image to the positive examples or the negative examples based on its prediction. The gray circle in the graph is the anchor. The orange circles are positive examples of that anchor. The red circles correspond to the negative examples. The top part of the figure corresponds to the scenario where the masked image yields a correct prediction; in this case, we include the masked image as a positive example of the anchor. The bottom part of the figure corresponds to the scenario where the masked image yields an incorrect prediction; in such a case, we include the masked image as a negative example for the anchor.

Contrastive learning has recently seen increasing popularity as it has led to state-of-the-art results in the context of self-supervised learning [oord2018representation, chen2020simple, chen2020big, he2020momentum]. The goal of contrastive learning is to learn useful representations by focusing on the parts of the input that are task relevant. This is done by training a model with a contrastive objective that associates pairs of representations that are similar and identifies those that are not. In a self-supervised learning setting, labels are not provided and the goal is to train an encoder to learn the structure of the data so that it can later be leveraged for downstream tasks. The instructive feedback is then provided solely by the data itself instead of the labels. In contrastive self-supervised learning, the association between similar representations is achieved by means of stochastic augmentations that transform a given input example randomly, resulting in two or more correlated views of the same input. Uncorrelated views are obtained by drawing examples uniformly at random, which yields, with high probability, different input instances that can be associated together to form dissimilar pairs.

Recently, self-supervised contrastive learning has been extended to the fully-supervised setting where label information, when available, is taken into account for the association of similar representation pairs and the identification of dissimilar ones [khosla2020supervised]. This allows the model to embed representations belonging to the same class within the same cluster in the embedding space. Representations from different classes, for their part, are separated and pushed apart. The generation of similar and dissimilar pairs is performed using random augmentations. The only information relating to the semantics of the input instances in the random augmentation process is their respective labels. We argue that if an augmentation takes into account the parts of the input whose semantics match the information provided by the label, it can significantly improve the performance of the model on the task at hand as well as increase its robustness to shifts in the input distribution. This can be done explicitly by leveraging local explanation methods [selvaraju2017grad, smilkov2017smoothgrad, kim2018interpretability], where the model’s output for an individual data instance is explained in terms of feature importance. We propose in this work an explanation-driven supervised contrastive learning framework (ExCon) where the augmentations are generated by taking into account the parts of the input that explain the model’s decision for a given data example. This can be achieved using a local explanation method. We show that our method outperforms the supervised contrastive learning baseline introduced in [khosla2020supervised] in terms of classification, explanation quality, adversarial robustness, as well as calibration of the model’s probabilistic predictions in the context of distributional shift of the encoder’s representation. Overall, we outline the following key contributions of our proposed ExCon methodology:

  1. We leverage local explanation techniques to formulate a framework for explanation-driven supervised contrastive learning presented in Section 3.1.

  2. We outperform the supervised contrastive learning baseline in terms of classification performance, as presented in Section 4.1.

  3. We observe an overall increase in explanation quality as measured by a variety of metrics presented in Section 4.2.

  4. We obtain models that are more robust to adversarial input noise than the supervised contrastive baseline, as presented in Section 4.3.

  5. We observe an improvement in calibration in the context of distributional shift of the encoder representation, while obtaining significantly better accuracy than the supervised contrastive learning baseline. This is presented in Section 4.4.

2 Related Work

In this work, we are interested in leveraging existing saliency-based explanation methods for data augmentation in the supervised contrastive learning setting. The main motivation is to preserve the semantics of the original input that match the label information. In this section, we present the most relevant work in contrastive learning, similarity learning, and saliency-based explanation methods.

Contrastive representation learning has seen a plethora of work that has led to state-of-the-art results in self-supervised learning [oord2018representation, hjelm2018learning, tian2020contrastive, arora2019theoretical, chen2020simple]. In the absence of labels, self-supervised contrastive learning relies on selecting positive pairs for each original input example. The formation of positive pairs is performed through data augmentation based on the original image [chen2020simple, henaff2020data, hjelm2018learning, tian2020contrastive]. Negative examples, however, are drawn uniformly at random; it is assumed that such sampling results in an insignificant number of false negatives. An encoder network is then pretrained to discriminate between these positive and negative pairs. This pretraining allows the encoder to learn the structure of the data by encoding positive examples closer to each other in the embedding space while pushing the negative ones apart. Once pretrained, the encoder can be used for downstream tasks. It is clear that in the context of contrastive learning, the formation of positive and negative pairs, by random augmentations and uniform sampling respectively, does not take into account the semantics of the input that are relevant to the downstream tasks.

[khosla2020supervised] leverages label information to adapt contrastive learning to the supervised setup, where instances of the same class are clustered together in the embedding space while instances of different classes are spread apart. [taghanaki2021robust] leverages label information in order to reduce the effect of irrelevant input features on downstream tasks. This is achieved by training a transformation network using a triplet loss [schroff2015facenet, koch2015siamese] that computes the similarity between examples from the same class. The triplet loss is also used to assess the similarities between instances of different classes and sets of transformed inputs. The similarities are captured using a structural similarity metric [wang2004image]. Shared information between examples of different classes is then associated with spurious input features, while the shared information within instances of the same class is leveraged to capture the task-relevant ones. It is important to mention that the framework proposed in [taghanaki2021robust] cannot be compared with supervised contrastive learning frameworks such as ours and the one proposed in [khosla2020supervised], as it does not rely on an encoder pretraining approach.

To make sure that the explanation-driven augmentations relate to the task-relevant semantics of the input, one must guarantee that the provided local explanations are of good quality. A local explanation describes the model’s behavior in the neighborhood of a given input example. A significant number of works on local explanation rely on post-hoc methods such as LIME [ribeiro2016should] and SHAP [lundberg2017unified], where the goal is to measure the contributions of the input features to the model’s output. The quality of a local explanation can be measured by how well it is aligned with (faithful to) the model’s prediction. The faithfulness of an explanation reflects how accurately it estimates the features’ contributions to the model’s decision process. In the context of convolutional neural networks (CNNs), gradient-based saliency map methods are commonly adopted to produce saliency-based explanations. Such post-hoc local explanations highlight the input features that contribute the most to the model’s prediction for a given input instance [simonyan2013deep, smilkov2017smoothgrad, sundararajan2017axiomatic, selvaraju2017grad]. Under the assumption of linearity, which states that certain regions of the input contribute more than others to the decision process of the model and that the contributions of different parts of the input are independent of each other, saliency-based explanations can be considered faithful to the model’s behavior [jacovi2020towards].

Our proposed explanation-driven supervised contrastive framework is explainer-agnostic: any local explanation method can be adopted to perform data augmentation. Given the faithfulness of saliency maps under the linearity assumption, it is convenient to opt for a gradient-based saliency map explanation method. In our case, we choose Grad-CAM [selvaraju2017grad] as our explainer. Grad-CAM provides a saliency-map visualization highlighting the regions that contribute most to the model’s output. The use of Grad-CAM is convenient in the context of explanation-driven supervised contrastive learning as it is class-specific: Grad-CAM produces a separate heatmap for each distinct class. The heatmaps are obtained by examining the gradients flowing from the output to the final convolutional layer.
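To make the saliency step concrete, the following is a minimal Grad-CAM sketch in PyTorch. It is our own illustration, not the authors' released code: the ResNet-50 backbone, the choice of `layer4` as the hooked convolutional block, and the input size are assumptions.

```python
# Minimal Grad-CAM sketch: hook the last convolutional block of a ResNet-50
# and weight its activations by the gradients of the target class score.
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(num_classes=100).eval()
activations, gradients = {}, {}

def fwd_hook(module, inp, out):
    activations["value"] = out.detach()

def bwd_hook(module, grad_in, grad_out):
    gradients["value"] = grad_out[0].detach()

target_layer = model.layer4  # last convolutional block (assumed hook point)
target_layer.register_forward_hook(fwd_hook)
target_layer.register_full_backward_hook(bwd_hook)

def grad_cam(image, class_idx=None):
    """Return an [H, W] saliency map in [0, 1] for the given (or predicted) class."""
    logits = model(image.unsqueeze(0))           # [1, num_classes]
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()  # explain the predicted class
    model.zero_grad()
    logits[0, class_idx].backward()
    acts, grads = activations["value"], gradients["value"]    # [1, C, h, w]
    weights = grads.mean(dim=(2, 3), keepdim=True)            # pooled gradients per channel
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))   # class-specific heatmap
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)[()].squeeze() if False else ((cam - cam.min()) / (cam.max() - cam.min() + 1e-8))[0, 0]

saliency = grad_cam(torch.rand(3, 64, 64))  # e.g. a Tiny ImageNet-sized input
```

The returned map is upsampled to the input resolution and normalized to [0, 1] so that it can later be thresholded to mask the least salient pixels.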

To the best of our knowledge, our proposed framework is the first to explore explanation-driven augmentations in the contrastive learning setup, and we demonstrate its usefulness in the supervised setting. We believe that our proposed method could be further leveraged in the future on tasks with complex and high-dimensional data sources that exhibit rich semantics.

3 Methodology

3.1 The General Framework

We build upon the supervised contrastive learning framework presented in [khosla2020supervised]. We perform explanation-driven data augmentation to encourage the model to consider the task-relevant features in its inference (decision-making) process. We also propose a new formulation of the loss function which allows the model to take into account negative examples that are not part of the set of anchors. This is done by leveraging misclassified explanation-driven augmented instances in the formation of negative pairs. In this way, by explicitly providing spurious input features and contrasting them with their related instances, the model becomes implicitly aware of the internal mechanisms that mislead it to erroneous decisions. The model can then adapt its parameters accordingly. We follow a different training procedure than the one presented for the supervised contrastive learning (SCL) framework in [khosla2020supervised]. In the SCL setup, the classifier is trained once the supervised contrastive training of the encoder is done.

In our case, we want to take advantage of the explanation-driven augmentations in such a way that they reflect the changing behavior of the entire model during the training process. As the model is composed of the encoder and the classifier, providing the former with the non-stationary explanations of the classifier’s outputs during training allows it to keep track of the changing behavior of the whole model and of how it affects its decision-making process through the training iterations. For this reason, we alternate between training the encoder and the classifier at each training epoch.

As in [khosla2020supervised], during the encoder’s supervised contrastive training process, the contrastive loss is computed on a projection of the encoder’s output. The projection is performed by means of a ReLU-based multi-layer perceptron that maps the encoder’s output to a lower-dimensional vector. The standard practice of normalizing the projection’s output using the L2 norm is followed in order to obtain embeddings that lie on the unit hypersphere. The normalized representations of the projection’s outputs enable the use of inner products to measure similarity in the projection space [khosla2020supervised]. The linear classifier takes the encoder’s output as input and maps it to a vector of class scores whose dimension is the total number of classes for the given classification task.
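The following is a minimal sketch of this encoder / projection-head / classifier arrangement, assuming a ResNet-50 backbone; the projection dimension and hidden size are illustrative choices, not the paper's exact configuration.

```python
# Minimal sketch: ResNet-50 encoder, ReLU-based MLP projection head (for the
# contrastive loss), and a linear classifier on the encoder representation.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class ContrastiveModel(nn.Module):
    def __init__(self, num_classes=100, proj_dim=128):
        super().__init__()
        backbone = resnet50()
        feat_dim = backbone.fc.in_features      # 2048 for ResNet-50
        backbone.fc = nn.Identity()             # keep only the encoder
        self.encoder = backbone
        self.proj = nn.Sequential(              # projection head used only for the contrastive loss
            nn.Linear(feat_dim, feat_dim), nn.ReLU(inplace=True),
            nn.Linear(feat_dim, proj_dim),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)  # linear classifier on encoder output

    def forward(self, x):
        r = self.encoder(x)                      # encoder representation
        z = F.normalize(self.proj(r), dim=1)     # unit-norm projection on the hypersphere
        logits = self.classifier(r)              # class scores
        return z, logits
```

Normalizing `z` to unit norm is what allows plain inner products to act as the similarity measure in the contrastive loss.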

3.2 Explanation-Driven Data Augmentation Pipeline for Supervised Contrastive Learning

We describe here our proposed explanation-driven framework, illustrated in Figure 2. Note that the whole model (the encoder along with the classifier) is run in inference mode in order to obtain the explanation-driven augmentations; the resulting computation is therefore not taken into account in the gradient calculation during training. We perform the following steps in order to obtain the explanation-driven augmentations used in the supervised contrastive setting to train the encoder:

  1. For a given input instance, we generate an explanation-driven augmented example by feeding the input instance in question to the whole model. The example is generated by masking the least salient input features, i.e., the input features whose saliency scores fall under a pre-defined, fixed threshold.

  2. For that same input instance, we generate 2 randomly augmented versions (views) of it. Note that we could generate any arbitrary number of augmented views; generating 2 of them is standard practice [khosla2020supervised, chen2020simple].

  3. We run the model again to obtain the classification result for the explanation-driven augmented instance generated using Grad-CAM [selvaraju2017grad]. Given this classification result, there are three possible scenarios to consider:

    1. If the classification of the explanation-driven augmentation is correct, then the true label of the explanation-driven augmentation is kept. The explanation-driven augmentation is added as an anchor to the minibatch. All instances in the minibatch that have the same class are associated with the explanation-driven augmentation to form positive pairs; instances in the minibatch that belong to a different class are associated with it to form negative pairs. Optionally, as the common practice is to use 2 augmented views of an input instance, one of the two randomly augmented views is also added as an anchor to the minibatch in order to form positive pairs with the explanation-driven augmentation as well as with each of the instances belonging to the same class. It also forms negative pairs with instances that belong to a different class.

    2. Else, if the classification of the explanation-driven augmentation is incorrect AND background_label mode is true, then the explanation-driven augmentation is not added as an anchor to the minibatch. It is only used as a negative example, contrasted with every single anchor in the minibatch including the randomly augmented views from step 2. The true label of that explanation-driven augmentation is no longer kept, and we assign to it a new label that we refer to as the background label. Explanation-driven augmentations with the background label are neither matched nor contrasted with each other.

    3. Else, if the classification of the augmentation is incorrect AND background_label mode is false, then the explanation-driven augmentation is discarded and the randomly augmented views generated in step 2 are added to the minibatch as anchors. The randomly augmented instances together form a positive pair. Positive and negative instances paired with each of the random augmentations are formed in exactly the same way as in 3-a for the explanation-driven augmentation.

Setting the background_label mode to true or false affects the number of positive and negative pairs that are formed. We name our method ExCon when background_label is set to false and ExConB when it is set to true, as the latter introduces explanation-driven augmentations with the background label. A minimal code sketch of this per-instance decision logic is given below.
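The sketch below illustrates the decision logic for a single instance. The helper names (`grad_cam`, `model`, `random_augment`), the saliency threshold, and the background-label sentinel are assumptions for illustration; `grad_cam` and `model` refer to the earlier Grad-CAM sketch, and this is not the authors' reference implementation.

```python
# Sketch of the per-instance augmentation decision (assumed helpers, not the paper's code).
import torch

BACKGROUND_LABEL = -1   # hypothetical sentinel for the background class
THRESHOLD = 0.5         # assumed saliency threshold

def build_views(image, label, random_augment, background_label_mode=True):
    """Return (views, labels) for one instance: two random views plus, depending
    on its prediction, the explanation-driven masked image as positive or negative."""
    views = [random_augment(image), random_augment(image)]    # step 2: two random views
    labels = [label, label]

    saliency = grad_cam(image, class_idx=label)               # step 1: explanation (its own backward pass)
    masked = (image * (saliency >= THRESHOLD).float()).detach()  # mask the least salient pixels
    with torch.no_grad():                                     # inference only, no training gradients
        pred = model(masked.unsqueeze(0)).argmax(dim=1).item()

    if pred == label:                                         # scenario 3-a: positive anchor
        views.append(masked)
        labels.append(label)
    elif background_label_mode:                               # scenario 3-b (ExConB): negative only
        views.append(masked)
        labels.append(BACKGROUND_LABEL)
    # scenario 3-c (ExCon): the masked image is simply discarded
    return torch.stack(views), torch.tensor(labels)
```

Instances tagged with the background sentinel are later routed into the denominator of the loss only, never as anchors, matching scenario 3-b above.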

3.3 Supervised Contrastive Loss and ExConB Loss

The augmentation pipeline in the supervised contrastive learning framework presented in [khosla2020supervised] relies on random augmentations (random cropping, random horizontal flipping, random color jittering, and random grayscale) where, for each input instance $x_k$ in a minibatch of size $N$, two randomly augmented instances (views) $\tilde{x}_{2k-1}$ and $\tilde{x}_{2k}$ are obtained. This procedure is repeated for each input instance in the minibatch in order to generate a multiviewed minibatch of size $2N$, indexed by $I \equiv \{1, \ldots, 2N\}$. Instances with the same label form positive pairs and those with different labels form negative pairs. For each anchor $i \in I$, the set of instances associated with it to form positive pairs is denoted $P(i)$, and $A(i) \equiv I \setminus \{i\}$ denotes all other instances in the multiviewed batch. With $z_i$ denoting the normalized projection of instance $i$, the loss is expressed as:

$$\mathcal{L}^{sup} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)} \qquad (1)$$

where $\tau$ is the softmax temperature. In the case of ExCon, i.e., when the background_label mode is set to false, no misclassified background images are taken into account to form negative pairs, and the loss above is used as is. In the context of ExConB, however, the background_label mode is set to true. In this case, the background-labeled negative examples are not included as anchors but have to be taken into account in the computation of the contrastive loss. Therefore, the effective multiviewed batch in the context of ExConB is $B = I \cup I_{bg}$, where $I_{bg}$ represents the set of explanation-driven augmentations with the background label and $I$ is the set of anchors in $B$. The size of $B$ is then $2N + N_{bg}$, where $N_{bg}$ represents the number of background-masked images in the batch. The loss function for ExConB can be expressed as:

$$\mathcal{L}^{ExConB} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i) \cup I_{bg}} \exp(z_i \cdot z_a / \tau)} \qquad (2)$$

ExConB introduces more negative explanation-driven augmentations, which increases the contrastive power of the model not only for randomly augmented instances but also for explanation-driven augmentations: the model becomes able to discriminate between the explanation-driven instances that it has already learned to classify correctly and those that are more difficult to predict.
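The following sketch implements our reading of Equation (2): the standard supervised contrastive loss with the background-labelled embeddings appended as extra negatives in the denominator. It is a simplified, batched illustration under that assumption, not the authors' reference code.

```python
# Sketch of an ExConB-style loss: background embeddings enter only the denominator.
import torch

def excon_b_loss(z_anchor, y_anchor, z_bg=None, temperature=0.1):
    """z_anchor: [M, D] normalized anchor embeddings, y_anchor: [M] labels,
    z_bg: [K, D] normalized background-masked embeddings (negatives only)."""
    z_all = z_anchor if z_bg is None else torch.cat([z_anchor, z_bg], dim=0)
    sim = z_anchor @ z_all.T / temperature                      # [M, M+K] similarities
    M = z_anchor.size(0)
    self_mask = torch.eye(M, z_all.size(0), dtype=torch.bool)   # exclude each anchor from its own denominator
    sim = sim.masked_fill(self_mask, float("-inf"))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)  # denominator over A(i) plus background
    pos_mask = (y_anchor.unsqueeze(0) == y_anchor.unsqueeze(1)) & ~torch.eye(M, dtype=torch.bool)
    pos_count = pos_mask.sum(dim=1).clamp(min=1)                # |P(i)|
    pos_log_prob = log_prob[:, :M].masked_fill(~pos_mask, 0.0)  # keep only positive terms
    loss = -pos_log_prob.sum(dim=1) / pos_count                 # average over P(i)
    return loss.mean()
```

With `z_bg=None` this reduces to the loss in Equation (1); the temperature value here is only an illustrative default.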

4 Experiments

We conduct experiments on CIFAR-100 [krizhevsky2009learning] and the Tiny ImageNet dataset [chrabaszcz2017downsampled, le2015tiny] to assess the potential of ExCon(B). We demonstrate that our proposed explanation-driven augmentation methods not only enhance the representation quality of the encoder for classification tasks, but also improve the overall quality of the explanations and the robustness of the model to adversarial input perturbations, and enhance its calibration when the iterative training procedure alternating between the encoder and the classifier is adopted at each epoch. The supervised contrastive learning method SupCon introduced in [khosla2020supervised] is adopted as the baseline against which we compare our proposed methods. In all the reported experiments, we use ResNet-50 [he2016deep] as the encoder architecture for ExCon(B) as well as for the SupCon baseline.

As mentioned in Section 3.1, for ExCon(B) we follow a different training procedure than the one adopted in [khosla2020supervised]: we iteratively train the encoder and the classifier at each epoch. Experiments where such training is adopted for the SupCon baseline are also reported. We refer to the original training procedure for the supervised contrastive learning baseline, where the classifier is trained only once the whole training procedure for the encoder is done, as SupConOri. When the iterative training procedure is adopted for the SupCon baseline, we refer to it as SupCon. For the CIFAR-100 dataset, we conduct experiments with batch sizes of 128 and 256 for both the encoder and the classifier, running on 2 Nvidia V100 GPUs. We could not adopt larger batch sizes due to computational limitations. While conducting experiments, we observed that introducing the explanation-driven pipeline at the start of the training procedure can hinder convergence. For this reason, we treat the starting epoch at which we introduce the explanation-driven augmentations into the training pipeline, i.e., the ExCon(B) starting epoch, as a hyperparameter. The training data is split into train and validation subsets such that 80% is dedicated to training and 20% is used for validation. Once the ExCon(B) starting epoch yielding the best accuracy on the validation subset is obtained, the encoder and the classifier are trained on the whole training set. All the reported results related to CIFAR-100 are obtained using models trained for 200 epochs. We used the learning rate and softmax temperature suggested in [khosla2020supervised]. We used the stochastic gradient descent (SGD) optimizer with a momentum of 0.9. We use linear warmup for the first 10 epochs and decay the learning rate with the cosine decay schedule [loshchilov2016sgdr].

For the Tiny ImageNet dataset, we conduct experiments with a batch size of 128 for both the encoder and the classifier, running on 8 RTX3090 GPUs. We could not adopt larger batch sizes due to computational limitations. It is important to mention that no convergence issues were observed when introducing the explanation-driven pipeline at the start of the training procedure for the Tiny ImageNet data. This can be explained by the fact that Tiny ImageNet examples have a higher resolution than the CIFAR-100 instances (64x64 vs 32x32 pixels) and that both the number of training examples and the number of classes are significantly larger (100k vs 50k training examples and 200 vs 100 classes), which makes it easier for the model to separate task-relevant semantics from irrelevant features. In fact, when explanation-driven augmentations are misclassified at the beginning of the training procedure and are introduced as negative examples with background labels, they provide the encoder with higher-resolution information that is useful for capturing the spurious features misleading the classifier. Both the encoder and the classifier are trained using the optimal hyperparameters reported in [khosla2020supervised]. We report experiments where we train both the encoder and the classifier for 200 and 350 epochs. Instead of using a learning rate of 0.5, we used a smaller learning rate of 0.2, as we are using a significantly smaller batch size than the ones reported in [khosla2020supervised]. We also report experiments where we used AutoAugment [cubuk2018autoaugment] instead of the typical random augmentation strategies (random cropping, random horizontal flipping, random color jittering, and random grayscale); AutoAugment is the most effective augmentation strategy reported in [khosla2020supervised]. We also conducted experiments using the Adam optimizer. In all the experiments, we use linear warmup for the first 10 epochs and decay the learning rate with the cosine decay schedule [loshchilov2016sgdr].
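As a concrete illustration of this schedule, here is a minimal sketch of linear warmup followed by cosine decay with SGD. The base learning rate of 0.2 is the Tiny ImageNet value mentioned above; the placeholder parameters and the exact schedule shape are assumptions, not the authors' training script.

```python
# Sketch: 10-epoch linear warmup, then cosine decay, applied to an SGD optimizer.
import math
import torch

def lr_at_epoch(epoch, base_lr=0.2, warmup_epochs=10, total_epochs=200):
    """Linear warmup for the first epochs, then cosine decay towards zero."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

params = [torch.nn.Parameter(torch.randn(10))]   # placeholder parameters
optimizer = torch.optim.SGD(params, lr=lr_at_epoch(0), momentum=0.9)
for epoch in range(200):
    for group in optimizer.param_groups:
        group["lr"] = lr_at_epoch(epoch)         # update the schedule once per epoch
    # ... run one epoch of encoder (contrastive) and classifier training here ...
```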

Figure 3: The accuracy, FGSM adversarial robustness, and expected calibration error (ECE) results (from left to right) for our ExCon(B) methods and the baseline methods, with models trained for 350 epochs on the Tiny ImageNet dataset.
Figure 4: Average drop %, rate of drop in scores, and average increase % in scores for models trained for 350 epochs on the Tiny ImageNet dataset, with the top-15, top-30, and top-45 most salient pixels respectively.

4.1 Classification Accuracy: Better generalization with ExCon(B) and iterative training

From the results reported in Figure 3 and Figures 7 and 9 in the supplementary material, we can observe that the ExCon(B) methods generalize better than SupCon(Ori) on the CIFAR-100 classification task. The gaps in terms of top-1 classification accuracy are even more significant for the Tiny ImageNet classification task. This is in line with what has already been pointed out in Section 4 regarding the incorporation of the explanation-driven augmentations at the start of the iterative training procedure: when the task in question involves higher-resolution inputs (more input features), more examples, and a larger number of classes, capturing task-relevant semantics and separating them from spurious features becomes easier for the model. The capability of the ExCon(B) methods to take into account task-relevant features in order to embed representations of the same class, as well as their explanations, closer to each other in the embedding space, and to push representations of different classes farther apart, helps the model perform the classification task at hand. This is illustrated in the t-SNE plots [van2008visualizing] shown in Figure 1 and results in an improvement in classification accuracy.

The iterative training procedure proves to be beneficial: the ExCon(B) methods as well as SupCon outperform SupConOri on both the CIFAR-100 and the Tiny ImageNet classification tasks. We achieve a slight improvement when using the AutoAugment strategy [cubuk2018autoaugment] instead of the standard random augmentations and when we opt for Adam [kingma2014adam] as the optimizer. It is important to mention that smoother validation accuracy and loss curves with reduced variance are obtained for the ExCon(B) methods compared to the SupCon baseline when the latter is iteratively trained. This can be observed in Figure 5 and Figure 6 in the supplementary material and is consistent across all training epochs. As the reduction in softmax uncertainty and the variance reduction of the classification accuracy occur concurrently throughout the encoder and classifier's training epochs, this suggests that the model is better calibrated for the ExCon(B) methods than for the SupCon baseline when the iterative training procedure is adopted. The effect of introducing explanation-driven augmentations into the training pipeline, together with the iterative training procedure, on the calibration of the model is investigated in more depth in Section 4.4.

Figure 5: Average validation accuracies along the training epochs on the Tiny ImageNet dataset. The proposed methods, ExConB and ExCon, consistently produce higher validation accuracies and lower losses than the baseline method SupCon [khosla2020supervised]. The training of our proposed methods is also more stable than that of the baseline, as can be observed from the standard deviations of both the validation accuracy and the loss. For each method, 5 models are trained using 5 random seeds. The plotted curves show that our methods reduce the standard deviation of both the accuracy and the loss: over 200 training epochs, ExCon and ExConB have mean standard deviations of 0.63 and 0.69 respectively, while the SupCon baseline has an average standard deviation of 0.92.

4.2 Explanation Quality

An explanation of good quality highlights the parts of the input that contribute the most to the neural network's decision. Therefore, when fed as input to the model, the explanation-driven augmentations are expected to yield a large increase, or at least an insignificant drop, in the model's softmax scores. Based on this intuition, the authors of [chattopadhay2018grad, ramaswamy2020ablation] introduced several metrics, the average drop/increase percentage and the rate of drop/increase in scores, to evaluate explanations. The average drop percentage is defined as $\frac{100}{N}\sum_{i=1}^{N} \frac{\max(0,\, Y_i - O_i)}{Y_i}$, where $N$ is the number of test examples, $Y_i$ denotes the softmax score when the $i$-th test example is fed as input to the network, and $O_i$ refers to the softmax score obtained when the explanation of the $i$-th test example is provided as input to the network. The average increase percentage is also reported and is defined analogously for increases in the scores. We also report the rate of drop in scores, which can be defined as $\frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[O_i < Y_i]$, where $\mathbb{1}[\cdot]$ is the indicator function returning 1 when its input argument is true. The rate of increase in scores is complementary to the rate of drop in scores, as both sum to one; it is therefore sufficient to report results for the rate of drop in scores. These measures quantify the change in the softmax probability of a given class after masking unimportant regions of the input data. We report these measures by retaining the top-15, top-30, and top-45 percent of the pixels in an input image according to its corresponding explanation. Larger increases and smaller drops reflect better explanation quality. We can deduce from the results reported in Figure 4 and Figures 8 and 10 in the supplementary material that the ExCon(B) methods yield consistently and significantly higher increase scores and lower drop measures than their SupCon(Ori) counterparts across all the models trained on CIFAR-100 and Tiny ImageNet. The ExCon(B) methods show superior explanation quality regardless of the percentage of pixels retained in the input images. ExConB consistently yields the best explanation quality in terms of drop and increase scores. It is also worth mentioning that when we increase the batch size from 128 to 256, the gap in terms of drop and increase scores between ExConB and ExCon is reduced, as both yield high-quality explanations.
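A minimal sketch of these explanation-quality metrics follows, based on our reconstruction of the definitions above; in particular, the average-increase formula is assumed to mirror the average-drop formula and may differ from the exact definition used in the paper.

```python
# Sketch of the explanation-quality metrics (our reconstruction, not reference code).
import torch

def explanation_metrics(full_scores, masked_scores):
    """full_scores / masked_scores: [N] softmax scores of the predicted class on the
    original images and on their explanation-masked versions, respectively."""
    y, o = full_scores, masked_scores
    avg_drop = (torch.clamp(y - o, min=0) / y).mean().item() * 100      # average drop %
    avg_increase = (torch.clamp(o - y, min=0) / y).mean().item() * 100  # assumed analogous average increase %
    rate_of_drop = (o < y).float().mean().item()                        # fraction of examples whose score dropped
    return avg_drop, avg_increase, rate_of_drop
```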

4.3 FGSM Adversarial Robustness

Models that are robust to adversarial input noise maximize the distance between the input embeddings and the decision boundaries of their corresponding classes [madry2017towards]. Our t-SNE plots in Figure 1 suggest that the ExCon(B) methods are more robust to adversarial noise, as the input embeddings of the same class are closely clustered together while those of different classes are pushed farther apart. This can be verified using the fast gradient sign method (FGSM) introduced in [goodfellow2014explaining]. The adversarial robustness of a model is measured by adding an input perturbation derived from the sign of the input gradient of the loss function, $x_{adv} = x + \epsilon \, \mathrm{sign}(\nabla_x J(\theta, x, y))$, where $J$ is the loss function of the network, $\theta$ are its parameters, and $\epsilon$ is the weight controlling the scale of the perturbation. In our experiments, we use two values for the weight $\epsilon$. The results for CIFAR-100 and Tiny ImageNet are shown in the middle columns of Figure 3 above and Figures 7 and 9 in the supplementary material.
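For reference, a minimal FGSM sketch corresponding to this perturbation is given below; the `model` handle, the cross-entropy loss choice, the epsilon value, and the pixel clamping range are assumptions for illustration.

```python
# Sketch of a one-step FGSM perturbation: x_adv = x + epsilon * sign(grad_x J(theta, x, y)).
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.01):
    """Return adversarially perturbed inputs for a batch (x, y)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)      # J(theta, x, y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()      # single sign-gradient step
    return x_adv.clamp(0.0, 1.0).detach()    # keep pixels in a valid range
```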

We can draw two conclusions from the results:

  1. ExCon(B) shows consistently better top-1 accuracy under adversarial perturbations compared to SupCon and SupConOri. This is because ExCon(B) is trained to focus on discriminative regions and, as a consequence, is less sensitive to input noise.

  2. On Tiny ImageNet, ExConB outperforms ExCon by a large margin, which supports the idea that more negative examples in the encoder training process make the model more robust.

4.4 Expected Calibration Error

A calibrated model is one whose predictive confidence indicates the probability of a correct prediction. We want to check whether the explanation-driven augmentations help the model achieve better calibration, where the confidence score is also indicative of the model's predictive uncertainty and of when it is likely to make a bad decision. This reduces the risk of the well-known problem of overconfidence in deep neural networks, which occurs when the accuracy is likely to be lower than what is indicated by the predictive score [guo2017calibration]. A well-calibrated model is one where, for example, if we consider a sample of input instances for which the predictive score is 0.7 for each instance, then approximately 70% of those instances are correctly classified [guo2017calibration]. The formal definition of the expected calibration error can be found in the appendix in the section Definition of Expected Calibration Error. Estimates of the expected calibration error are reported for all the models trained on the Tiny ImageNet and CIFAR-100 datasets in Figure 3 above as well as in Figures 7 and 9, respectively, in the supplementary material. We can deduce from the reported results that when the iterative training procedure is adopted, the ExCon(B) models improve the calibration of the model, as they are associated with lower errors compared to SupCon. However, overall, the calibration errors obtained for SupConOri are the lowest. This means that the iterative training procedure does not help calibrate the model. This is not surprising, as the distributions of both the explanation-driven augmentations and the encoder representation change across epochs, and when the shifts in the distributions of the input and the encoder representation are significant, they can hamper the calibration of the model being trained. This is in concordance with the findings in [ovadia2019can], where it is reported that the calibration of a model deteriorates under distribution shift. Consequently, the improvement in classification accuracy and the deterioration in terms of expected calibration error suggest continuing to train the classifier once the iterative procedure is completed; this can be investigated in more depth in future work. Nevertheless, the adoption of Adam as the optimizer significantly reduces the calibration error for all the models trained with the iterative procedure: the estimated calibration errors become insignificantly small (of the same small order of magnitude), and the models related to the ExCon(B) methods are well-calibrated in the context of distributional shift of the encoder representation while obtaining significantly better accuracy than the SupCon(Ori) baselines.

5 Conclusion

We proposed a novel methodology for explanation-driven supervised contrastive learning, namely ExCon and ExConB. Through quantitative experiments, we verified that ExCon(B) outperforms the supervised contrastive learning baselines in classification accuracy, explanation quality, adversarial robustness as well as calibration of probabilistic predictions of the model in the context of distributional shift. Qualitatively, we verified our assumption that similar examples should have similar embeddings and explanations by observing closer embeddings between images and their explanations compared to the baseline method. We believe that the novel insights proposed in this paper to improve supervised contrastive learning may extend to other domains where preserving semantic content is also possible through explanation-driven techniques.

References

Supplementary Materials

Citation of Assets

  • Tiny ImageNet: Chrabaszcz, P., Loshchilov, I., & Hutter, F. (2017). A downsampled variant of imagenet as an alternative to the cifar datasets. arXiv preprint arXiv:1707.08819. and Le, Y., & Yang, X. (2015). Tiny imagenet visual recognition challenge. CS 231N, 7(7), 3.

  • CIFAR-100: Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images.

  • Supervised contrastive learning codebase: BSD 2-Clause License (https://github.com/HobbitLong/SupContrast)

Gradient Derivation for the Loss of ExConB

The authors of [khosla2020supervised] provide a gradient derivation for the supervised contrastive loss. Here we derive the gradient of our adapted loss on top of their derivation. The gradient is:

(3)

where

(4)

Validation Losses along the Training Epochs

See Figure 6

Figure 6: Average losses along the training epochs on the Tiny ImageNet dataset.

Definition of Expected Calibration Error

Formally, perfect calibration is defined as:

$$P(\hat{Y} = Y \mid \hat{P} = p) = p, \quad \forall p \in [0, 1] \qquad (5)$$

where $\hat{Y}$ is the random variable denoting the predicted class, $Y$ denotes the ground-truth label, and $\hat{P}$ is the random variable denoting the associated predictive score, i.e., the probability that $\hat{Y}$ is correctly predicted. The authors of [naeini2015obtaining] introduced the expected calibration error (ECE) to measure the absolute difference in expectation between the predictive confidence and the accuracy. The ECE is expressed as:

$$\mathrm{ECE} = \mathbb{E}_{\hat{P}}\Big[\,\big|\,P(\hat{Y} = Y \mid \hat{P} = p) - p\,\big|\,\Big] \qquad (6)$$

We need to perform a quantization in order to construct an estimator of the expectation expressed in Equation 6, because the probability that $\hat{Y}$ is correctly predicted is conditioned on the null event $\{\hat{P} = p\}$, since $\hat{P}$ is a continuous random variable [guo2017calibration, minderer2021revisiting]. The predictions are then grouped into bins $B_1, \ldots, B_M$ to estimate the ECE. The accuracy and confidence of each bin can then be calculated. The accuracy of the samples that fall into bin $B_m$ is given by $\mathrm{acc}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \mathbb{1}[\hat{y}_i = y_i]$, where $\hat{y}_i$ and $y_i$ are respectively the predicted and true labels of the $i$-th sample in $B_m$ and $\mathbb{1}[\cdot]$ is the indicator function. The average confidence of the samples that fall into bin $B_m$ is given by $\mathrm{conf}(B_m) = \frac{1}{|B_m|} \sum_{i \in B_m} \hat{p}_i$, where $\hat{p}_i$ is the confidence score of the $i$-th sample in $B_m$. An estimator of the expected calibration error can then be constructed by taking a weighted average of the absolute gaps between the bins' accuracies and their corresponding confidence scores. This is given by:

$$\widehat{\mathrm{ECE}} = \sum_{m=1}^{M} \frac{|B_m|}{n} \, \big|\, \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \,\big| \qquad (7)$$

where $n$ is the total number of samples in all bins. The smaller the $\widehat{\mathrm{ECE}}$, the better calibrated the model is.
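A minimal sketch of the binned estimator in Equation (7) follows; the number of bins and the equal-width binning scheme are assumptions, as they are not specified here.

```python
# Sketch of the binned ECE estimator of Equation (7).
import torch

def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    """confidences: [n] max softmax scores, predictions / labels: [n] class indices."""
    bins = torch.linspace(0.0, 1.0, n_bins + 1)
    ece = torch.zeros(1)
    n = confidences.numel()
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)                 # samples falling in bin B_m
        if in_bin.any():
            acc = (predictions[in_bin] == labels[in_bin]).float().mean()  # acc(B_m)
            conf = confidences[in_bin].mean()                             # conf(B_m)
            ece += in_bin.float().sum() / n * (acc - conf).abs()          # |B_m|/n * |acc - conf|
    return ece.item()
```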

More Results on the Evaluation Metrics

Figure 7: The accuracy, FGSM adversarial robustness, and expected calibration error results (from left to right) for our ExCon(B) methods and the baseline methods, with models trained for 200 epochs on the Tiny ImageNet dataset.
Figure 8: Average drop %, rate of drop in scores, and average increase % with different thresholds for the models trained for 200 epochs on the Tiny ImageNet dataset, with the class activation scores based on the top-15, top-30, and top-45 most salient pixels respectively.
Figure 9: The accuracy, FGSM adversarial robustness, and expected calibration error results (from left to right) for our ExCon(B) methods and the baseline methods on the CIFAR-100 dataset. The first row contains the results from models trained with a batch size of 128, and the second row contains the results from models trained with a batch size of 256.
Figure 10: Average drop %, rate of drop in scores, and average increase % with different thresholds for the CIFAR-100 dataset, with the class activation scores based on the top-15, top-30, and top-45 most salient pixels respectively. The first row contains the results from models trained with a batch size of 128, and the second row contains the results from models trained with a batch size of 256.

Some More t-SNE Results

Figure 11: t-SNE embeddings for SupConOri (left) and ExCon (right) on the CIFAR-100 dataset.
Figure 12: t-SNE embeddings for SupCon (top left), SupConOri (top right), ExCon (bottom left), ExConB (bottom right) on the Tiny ImageNet dataset.
Figure 13: t-SNE embeddings for SupCon (top left), SupConOri (top right), ExCon (bottom left), ExConB (bottom right) on the Tiny ImageNet dataset.