Adversarial Attacks are Reversible with Natural Supervision

03/26/2021 · Chengzhi Mao, et al. · Columbia University

We find that images contain intrinsic structure that enables the reversal of many adversarial attacks. Attack vectors not only cause image classifiers to fail, but also collaterally disrupt incidental structure in the image. We demonstrate that modifying the attacked image to restore its natural structure reverses many types of attacks, providing a defense. Experiments demonstrate significantly improved robustness for several state-of-the-art models across the CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets. Our results show that our defense remains effective even when the attacker is aware of the defense mechanism. Since our defense is deployed during inference instead of training, it is compatible with pre-trained networks as well as most other defenses. Our results suggest deep networks are vulnerable to adversarial examples partly because their representations do not enforce the natural structure of images.


1 Introduction

Deep networks achieve strong performance on a number of computer vision tasks, yet they remain brittle under adversarial attacks [11, 6, 16, 55]. With crafted perturbations, attackers can undermine predictions from state-of-the-art models by changing the features in the representation [36]. These limitations prevent the application of deep networks to sensitive and safety-critical applications [46, 52, 32, 53], underscoring the gap between current machine learning algorithms and human-level abilities [5].

A large body of work has studied how to train deep networks so that they are robust to adversarial attacks. Adversarial training and its variants [34, 36, 55, 42, 47], including multitask learning [35, 25] and semi-supervised learning [58], significantly improve robustness. However, existing methods focus on improving the training algorithm and are therefore burdened with finding a single representation that works for all possible corruptions and attacks. Training-based defenses cannot adapt to the individual characteristics of each attack at testing time.

In this paper, we introduce an approach for reversing the attack process, allowing us to formulate a defense strategy that adapts to each attack during the testing phase. Just as an attacker finds the right additive perturbation to break the input, our approach finds the right additive perturbation to repair the input. Figure 1 shows our reverse attack on an attacked ImageNet image. However, reverse attacks are more challenging to produce than standard attacks because the category label is unknown to us during testing.

Figure 1: Reverse Attacks: Adversarial attacks are small perturbations that cause classification networks to fail [34, 55]. In this paper, we show there are intrinsic signals in natural images to reverse many types of attacks. In the right column, we visualize our reverse attack on an ImageNet image. Note that both attack vectors have been multiplied by ten for visualization purposes only.

Our key insight is that images contain natural and intrinsic structure that we can leverage to reverse many types of adversarial attacks. We found that, although adversarial attacks aim to fool the image classifier, they also collaterally damage self-supervised objectives. Our approach shows how to capitalize on this incidental signal in order to create adversarial defenses. By using self-supervision for defense at test time, we can guarantee that even the strongest adversary cannot manipulate the intrinsic signals that naturally come with the images, providing a more robust defense than training-based methods.

A pivotal advantage of our framework is that it factors out the defense strategy from the visual representation. Since reverse attacks are adaptive, this defense is able to efficiently scale to any corruption that violates the natural image manifold.

Moreover, the modularity of our approach allows it to work with any classifier and complements existing defense models. It can also be integrated into future defense models and defend against novel attacks that corrupt natural image structures.

Visualizations, empirical experiments, and theoretical analysis show that our reversal strategy significantly improves robust prediction for several established benchmarks and attacks. Under attacks with up to 200 steps, our method advances the state-of-the-art defense methods by a large margin across four natural image datasets including CIFAR-10 (over 2.9% gain), CIFAR-100 (over 1.6% gain), SVHN (over 6.8% gain), and ImageNet (over 3.0% gain). Our method is robust against established attacks, including PGD [34] and C&W [6]. In addition, our empirical results demonstrate that, even when the attacker is aware of our defense mechanism, our approach remains robust. Our models, data, and code are available at https://github.com/cvlab-columbia/SelfSupDefense.

2 Related Work

Self-supervised Learning: Natural images contain rich information for representation learning. Self-supervised learning enables us to learn high-quality representations from images without annotations [13, 9, 69, 21, 10, 3, 7, 45]. By solving pretext tasks such as jigsaw puzzles [40], image inpainting [45], rotation prediction [17], image colorization [69, 60], random walks [26], and clustering [8], the learned representations can generalize to unseen downstream tasks such as image recognition [9], and also allow domain adaptation at testing time [54]. Recently, contrastive learning has significantly advanced image recognition [9, 21, 37, 19]. In this paper, we leverage this incidental structure to correct adversarial attacks. Our defense uses the contrastive learning task [9], and it is extensible to other self-supervised tasks as well [45, 40, 17].

Figure 2: Contrastive Score Distribution: We histogram the contrastive loss values [9] for natural images (blue), after adversarial attack (orange), and after our reverse attack (green). This plot shows that adversarial attacks cause the contrastive loss to increase. We create a counter-attack by finding a perturbation that restores the self-supervised loss.

Adversarial Robustness: A large number of adversarial attacks have been proposed to fool deep models [55, 1, 6, 31, 43, 38]. Special adversarial attacks that can be reversed to recover the clean image have also been proposed [65]. Unlike the existing approach that constructs reversible attacks [65], our approach aims to reverse arbitrary unknown attacks for defense. While many defense methods have been shown to be non-robust [48, 66, 64, 33, 2, 59, 41, 20, 50, 14, 4] because they rely on gradient obfuscation, gradient masking [1, 5], or weak adaptive-attack evaluation [56], adversarial training and its variants have been shown to achieve genuine robustness [18, 34, 36, 68, 47, 42, 62, 61, 63]. Moreover, recent progress shows that unlabeled data [58, 24] and self-supervised learning [25] improve the robustness of deep models. While training a robust neural network for defense has been studied extensively, no existing work investigates algorithms that improve robustness at inference time.

3 Method

We will first present a reverse attack that uses self-supervision at deployment time to defend against adversarial attacks. We then analyze the case where the attacker is aware of our defense, and show our defense remains effective. We finally provide theoretical justification for the robustness of our approach.

3.1 Attacks and Reverse Attacks

Let $x$ be an input image, and $y$ its ground-truth category label. To perform classification, neural networks commonly learn to predict the category by optimizing the cross entropy between the predictions and the ground truth.

The network parameters $\theta$ are estimated by minimizing the expected value of the objective:

$$\min_{\theta} \; \mathbb{E}_{(x,y)} \big[ \mathcal{L}_c(F_\theta(x), y) \big] \qquad (1)$$

which can be optimized with gradient-based descent, where $F_\theta$ is the classifier and $\mathcal{L}_c$ is the cross-entropy loss.

The Attack: In order to corrupt this model, the adversarial attack finds an additive perturbation $\delta_a$ to the image such that $x + \delta_a$ is no longer classified correctly by the trained network $F_\theta$.

Attackers create these worst-case images by maximizing the objective:

$$\delta_a = \arg\max_{\|\delta\|_p \le \epsilon_a} \mathcal{L}_c\big(F_\theta(x + \delta), y\big) \qquad (2)$$

where the norm bound $\epsilon_a$ of the perturbation keeps the perturbation minimal.
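To make the attack side of this formulation concrete, below is a minimal PyTorch sketch of a projected-gradient attack that maximizes the classification loss in the spirit of Equation 2. The `model` argument, the bound, the step size, and the number of steps are illustrative placeholders rather than the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps_a=8/255, step=2/255, n_steps=50):
    """Sketch of an L-infinity projected-gradient attack: maximize the cross-entropy loss."""
    delta = torch.empty_like(x).uniform_(-eps_a, eps_a)   # random start inside the budget
    for _ in range(n_steps):
        delta.requires_grad_(True)
        loss = F.cross_entropy(model(x + delta), y)       # classification loss L_c
        grad = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():
            delta = delta + step * grad.sign()            # ascend on L_c
            delta = delta.clamp(-eps_a, eps_a)            # project back into the epsilon ball
            delta = (x + delta).clamp(0, 1) - x           # keep a valid image
    return (x + delta).detach()
```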

Figure 3: Defense Overview: We find that adversarial attacks on classification networks will also collaterally attack self-supervised contrastive networks [9]. Since self-supervision is available during deployment, we exploit this discrepancy to reverse adversarial attacks and provide a defense. Our approach modifies the potentially attacked input image such that contrastive distances are restored.

The Reverse: We aim to defend against these attacks by reversing the attack process. Just as the attack finds an additive perturbation to break the input, we will find an additive perturbation to repair the input. However, we cannot simply flip Equation 2 from a maximization to a minimization because the category labels are unknown at deployment.

The key observation is that self-supervised objectives are always available because they do not depend on the labels $y$. While adversarial attacks aim to corrupt the classifier, they will also impact self-supervised representations, which is a signal we will leverage for reversal. Let $\mathcal{L}_s(x)$ be a self-supervised objective on the input $x$. We create the reverse attack vector $\delta_v$ by minimizing the objective:

$$\delta_v = \arg\min_{\|\delta\|_p \le \epsilon_v} \mathcal{L}_s(x_a + \delta) \qquad (3)$$

where $\epsilon_v$ defines the bound of our reverse attack and $x_a$ denotes the (potentially) attacked input. The solution will modify the adversarial image such that it satisfies our choice of self-supervised objective.

After finding the optimal $\delta_v$, robust prediction is straightforward. Our defense adds the resulting perturbation vector to the input before predicting the classification result with the normal network forward pass: $\hat{y} = F_\theta(x_a + \delta_v)$.

An advantage of reverse attacks is that, since they do not rely on offline adversarial training, the defense will generalize to unseen adversarial attacks. Moreover, our defense is able to fortify existing models without re-training.

3.2 Natural Supervision for Defense

While any self-supervised task [45, 40, 17] can be used to construct the loss $\mathcal{L}_s$, we use the contrastive loss as our natural supervision objective [9, 10], which is a state-of-the-art self-supervised representation learning approach. The contrastive objective creates features that maximize the agreement between positive pairs of examples while minimizing the agreement between negative pairs. Pairs are typically created with an augmentation strategy [9]. In our case, when we receive a potentially adversarial image $x_a$, we create the positive examples by sampling different augmentations of it to form multiple positive pairs. We create the negative pairs in a similar way, except that the augmentations are applied to randomly selected other images.

Since these pairs are constructible at evaluation time, we create reverse attacks that minimize the term:

$$\mathcal{L}_s(x_a) = - \sum_{i} \sum_{j} \mathbb{1}_{ij} \log \frac{\exp\big(\mathrm{sim}(z_i, z_j) / \tau\big)}{\sum_{k \ne i} \exp\big(\mathrm{sim}(z_i, z_k) / \tau\big)} \qquad (4)$$

where $z_i$ are the contrastive features. We use $\mathbb{1}_{ij}$ to indicate which pairs are positive and which are negative. This indicator satisfies $\mathbb{1}_{ij} = 1$ if and only if the examples $i$ and $j$ are both augmentations of $x_a$, and $\mathbb{1}_{ij} = 0$ otherwise. $\tau$ is a scalar temperature hyper-parameter, and $\mathrm{sim}(\cdot, \cdot)$ denotes cosine similarity.
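As a concrete reference, here is a minimal PyTorch sketch of an NT-Xent-style contrastive loss over a batch of embeddings, in the spirit of Equation 4 and SimCLR [9]. The pairing convention (two views per instance concatenated along the batch) and the temperature value are our own illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, tau=0.5):
    """NT-Xent sketch: z1, z2 are (B, D) embeddings of two views of the same B images."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)       # (2B, D), unit-norm features
    sim = z @ z.t() / tau                                    # cosine similarities / temperature
    n = z.shape[0]
    sim.fill_diagonal_(float('-inf'))                        # exclude self-similarity from the softmax
    # the positive of sample i is its other view: i + B (or i - B)
    pos_idx = torch.arange(n, device=z.device).roll(n // 2)
    return F.cross_entropy(sim, pos_idx)                     # -log softmax at the positive pair
```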

Figure 2 shows that adversarial attacks on the classification objective also attack the contrastive objective $\mathcal{L}_s$, even though the attacker never explicitly optimizes for it. When there is an attack, $\mathcal{L}_s(x_a)$ will be larger than $\mathcal{L}_s(x)$ on clean images $x$. This gap provides the signal for reverse attacks.

Figure 3 provides an overview of this defense mechanism, and Algorithm 1 summarizes our procedure.

1:  Input: Potentially attacked image $x_a$, step size $\eta$, number of iterations $K$, a classifier $F_\theta$, reverse attack bound $\epsilon_v$, and self-supervised loss function $\mathcal{L}_s$.
2:  Output: Class prediction $\hat{y}$
3:  Inference:
4:  $\delta_v \leftarrow \delta_0$, where $\delta_0$ is the initial random noise
5:  for $k = 1, \ldots, K$ do
6:     $\delta_v \leftarrow \delta_v - \eta \cdot \mathrm{sign}\big(\nabla_{\delta_v} \mathcal{L}_s(x_a + \delta_v)\big)$
7:     $\delta_v \leftarrow \Pi_{\epsilon_v}(\delta_v)$, which projects the image back into the bounded region.
8:  end for
9:  Predict the final output by $\hat{y} = F_\theta(x_a + \delta_v)$
Algorithm 1 Self-supervised Reverse Attack
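A minimal PyTorch sketch of this inference procedure follows, assuming a pretrained classifier `model` and a function `ssl_loss` that computes the contrastive loss of Equation 4 from augmented views of its input; the sign-gradient update, step size, bound, and iteration count are illustrative choices, not the paper's exact settings.

```python
import torch

def reverse_attack_predict(model, ssl_loss, x_adv, eps_v=8/255, step=2/255, n_iters=20):
    """Find a reverse perturbation that lowers the self-supervised loss, then classify."""
    delta = 0.001 * torch.randn_like(x_adv)               # small initial random noise
    for _ in range(n_iters):
        delta.requires_grad_(True)
        loss = ssl_loss(x_adv + delta)                    # self-supervised loss L_s
        grad = torch.autograd.grad(loss, delta)[0]
        with torch.no_grad():
            delta = delta - step * grad.sign()            # descend on L_s (the reverse attack)
            delta = delta.clamp(-eps_v, eps_v)            # project into the reverse-attack ball
            delta = (x_adv + delta).clamp(0, 1) - x_adv   # keep a valid image
    with torch.no_grad():
        return model(x_adv + delta).argmax(dim=1)         # normal forward pass on the repaired input
```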

Contrastive Feature Estimation: To estimate the contrastive features $z$, we take the features before the logits from a backbone $h(\cdot)$ and pass them to a two-layer network $g(\cdot)$. To compute the positive features, we sample augmentations conditioned on the input image $x_a$. We follow a similar procedure to compute the contrastive features for the negative examples, sampling random images from a collection of images that form the negative set.

Offline, we fit the contrastive model $g$ on a large set of clean images using the same procedure as [9, 10]. We sequentially apply two augmentations: random cropping followed by rescaling to the original size, and random color distortion including color jittering and random grayscale. We found that removing the Gaussian blur from the augmentations improved performance because it otherwise favored over-smooth perturbations. After $g$ is trained on clean images, we use it during reverse attacks without any further training.
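For reference, a sketch of this two-step augmentation using torchvision, with Gaussian blur deliberately omitted; the crop size (assuming CIFAR-like inputs), jitter strengths, and grayscale probability below are illustrative guesses rather than the paper's settings.

```python
import torchvision.transforms as T

# Two sequential augmentations: random crop + resize back, then random color distortion.
# Gaussian blur is left out, since the paper reports it favors over-smooth perturbations.
contrastive_augment = T.Compose([
    T.RandomResizedCrop(32, scale=(0.2, 1.0)),                    # crop, then scale back to original size
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),    # color jittering
    T.RandomGrayscale(p=0.2),                                     # random gray-scale
])
```

At inference, several augmented views of the (potentially attacked) input would be produced with this pipeline to form the positive pairs.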

3.3 Analysis of Defense Aware Attack

In this section, we analyze the effectiveness of our approach when the attacker is aware of our defense.

Attack Model: Let us assume the attacker knows the contrastive model parameters and our defense strategy. In this setting, the attacker can adversarially optimize against our defense with the following alternating optimization:

$$\delta_a = \arg\max_{\|\delta_a\|_p \le \epsilon_a} \mathcal{L}_c\big(F_\theta(x + \delta_a + \delta_v), y\big) \qquad (5)$$
$$\delta_v = \arg\min_{\|\delta_v\|_p \le \epsilon_v} \mathcal{L}_s\big(x + \delta_a + \delta_v\big) \qquad (6)$$

From the attacker’s perspective, the above procedure is not ideal because it involves an alternating, min-max optimization. Past work suggests that this leads to unstable gradient estimation and a gradient obfuscation problem that reduces the attack's efficiency [1].

Similar to C&W [6] and the L-BFGS attack [55], the attacker can reformulate the above equations as a constrained optimization problem:

$$\max_{\|\delta_a\|_p \le \epsilon_a} \mathcal{L}_c\big(F_\theta(x + \delta_a), y\big) \quad \text{subject to} \quad \mathcal{L}_s(x + \delta_a) \le c \qquad (7)$$

where $c$ is the same value as the converged self-supervised loss for natural images. Intuitively, the attacker should maximize the adversarial gain while respecting the self-supervised loss if they want to render our defense ineffective.

To optimize Equation 7, the attacker in practice maximizes the following objective w.r.t. $\delta_a$:

$$\mathcal{L}_{adapt} = \mathcal{L}_c\big(F_\theta(x + \delta_a), y\big) - \lambda \, \mathcal{L}_s(x + \delta_a) \qquad (8)$$

We derive Equation 8 from Equation 7 via the Lagrange penalty method [49], where $\mathcal{L}_{adapt}$ is the new loss for the adaptive attack and $\lambda \ge 0$ is the penalty weight. Full derivations are in the supplementary.
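A hedged sketch of this adaptive attack objective in PyTorch: one projected-gradient step now ascends the classification loss while descending the self-supervised loss, weighted by $\lambda$, in the spirit of Equation 8. The function names, step size, and bound are placeholders we introduce for illustration.

```python
import torch
import torch.nn.functional as F

def defense_aware_step(model, ssl_loss, x, y, delta, lam=1.0, step=2/255, eps_a=8/255):
    """One step of the adaptive attack: ascend on L_c - lambda * L_s (cf. Equation 8)."""
    delta = delta.detach().requires_grad_(True)
    adapt = F.cross_entropy(model(x + delta), y) - lam * ssl_loss(x + delta)
    grad = torch.autograd.grad(adapt, delta)[0]
    with torch.no_grad():
        delta = delta + step * grad.sign()                # ascend on the combined objective
        delta = delta.clamp(-eps_a, eps_a)                # stay within the attack budget
        delta = (x + delta).clamp(0, 1) - x               # keep a valid image
    return delta
```

The attacker would sweep $\lambda$ and keep the value that yields the lowest robust accuracy; in the experiments of Section 4.4 this turns out to be $\lambda = 0$.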

Multi-objective Trade-off: The above derivation shows that the attacker can attempt to bypass our reversal by also minimizing $\mathcal{L}_s$, so that attacks mimic the self-supervised features of clean examples. If the attacker produces examples that are as good as the clean examples in terms of $\mathcal{L}_s$, our defense would not be able to reverse the attack by further decreasing the loss $\mathcal{L}_s$.

However, as the attacker must solve a multi-objective optimization, they must trade off between the two objectives. The scalar $\lambda$ controls how aggressively the attacker will corrupt the self-supervised model. The attacker's ideal adversarial attack will first trace out the Pareto frontier by maximizing $\mathcal{L}_{adapt}$ for each $\lambda$. They should select the $\lambda$ that yields the most damage (the lowest robust accuracy), and use the corresponding generated attack $\delta_a(\lambda)$.

A larger $\lambda$ shifts the adversarial budget from attacking the classification loss $\mathcal{L}_c$ to attacking the self-supervised loss $\mathcal{L}_s$. If the attacker is to attack the self-supervised task, they would then reduce the effectiveness of their classification attack, undermining their goal. Attacking both $\mathcal{L}_c$ and $\mathcal{L}_s$ jointly requires creating adversarial images for multiple objectives, which is fundamentally more challenging [35].

Our defense creates a lose-lose situation for the attacker. If they ignore our defense, then we improve accuracy. If they account for our defense, then they hurt their attack.

3.4 Theoretical Analysis

We now provide theoretical insight into why leveraging natural supervision improves adversarial robustness. Without our defense, the model predicts the category from an image with an incorrect estimate of the self-supervision label. With our defense, the model uses an image for which the self-supervision label is estimated correctly. We prove this increases the upper bound of the prediction accuracy.

A feed-forward pass is equivalent to also providing a latent self-supervised label to the model, since the information that the self-supervised network uses is contained in the image itself. We denote the ground-truth self-supervision label as $y_s$ and the predicted self-supervision label under attack as $\hat{y}_s$. To make this latent label explicit in the notation, we rewrite the loss functions to take this label as an additional argument: $\mathcal{L}_c(x, y, y_s)$ and $\mathcal{L}_s(x, y_s)$.

Lemma 1.

The standard classifier under adversarial attack is equivalent to predicting with $p(y \mid x_a, \hat{y}_s)$, and our approach is equivalent to predicting with $p(y \mid x_a, y_s)$.

Proof.

For the standard classifier under attack, the forward pass implicitly uses the self-supervision label $\hat{y}_s$ induced by the attacked image. Thus the standard classifier under adversarial attack is equivalent to predicting with $p(y \mid x_a, \hat{y}_s)$.

Our algorithm instead finds a new input image that minimizes the self-supervised loss.

Our algorithm first estimates this new image from the adversarial image $x_a$ and the correct self-supervised label $y_s$. We then predict the label using our new image. Thus, our approach in fact estimates $p(y \mid x_a, y_s)$. Note the following holds:

(9)
(10)
(11)

Thus our approach is equivalent to estimating $p(y \mid x_a, y_s)$. ∎

We use the maximum a posteriori (MAP) estimate to approximate the sum over candidate images because: (1) sampling a large number of candidates is computationally expensive; (2) our results in Figure 7 show that random sampling is ineffective; and (3) the MAP estimate naturally produces a denoised image that can be useful for other downstream tasks.

Next we provide theoretical guarantees that our approach can strictly improve the bound for classification accuracy in Theorem 1. For convenience we introduce an additional random variable $x_a$ representing the adversarial image.

Theorem 1.

Assume the base classifier operates better than chance and instances in the dataset are uniformly distributed over $N$ categories. Let $\bar{A}_{std}$ and $\bar{A}_{ours}$ denote the upper bounds on prediction accuracy for the standard classifier and for our approach. If the conditional mutual information $I(y; y_s \mid x_a) > 0$, we have $\bar{A}_{ours} > \bar{A}_{std}$, and the corresponding lower bound does not decrease, which means our approach strictly improves the bound for classification accuracy.

Proof.

If $I(y; y_s \mid x_a) > 0$, then it is straightforward that the input available to our approach, $(x_a, y_s)$, carries strictly more information about $y$ than the input available to the standard classifier.

Using Fano's Inequality [51] and the fact that the resulting bound is a monotonically increasing function of the mutual information when the error rate corresponds to accuracy higher than random guessing (the validity of this fact is explained in the supplementary), we derive upper bounds on the accuracy of the standard classifier and of our approach, where each upper bound is a function of the corresponding mutual information. Since the number of categories is a constant, a larger mutual information strictly increases the bound. The detailed proof is in the supplementary material. ∎
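For reference, a hedged sketch of the Fano step in generic notation, with $N$ categories, error probability $P_e$, and $Z$ standing for the information available to the classifier (the exact constants used in the paper are in the supplementary):

$$f(P_e) := H(P_e) + P_e \log(N-1) \;\ge\; H(y \mid Z) \;=\; \log N - I(y; Z).$$

Since $f$ is monotonically increasing for $P_e < 1 - 1/N$ (accuracy above random guessing), this gives $P_e \ge f^{-1}\big(\log N - I(y;Z)\big)$, i.e., an accuracy upper bound $A = 1 - P_e \le 1 - f^{-1}\big(\log N - I(y;Z)\big)$ that increases with the mutual information $I(y;Z)$. Substituting $Z = (x_a, \hat{y}_s)$ for the standard classifier and $Z = (x_a, y_s)$ for our approach yields the two bounds compared above.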

Intuitively, the adversarial attack corrupts some of the mutual information between the label $y$ and the natural structure $y_s$. Thus, there is additional mutual information between $y$ and $y_s$ given the attacked image $x_a$, i.e., $I(y; y_s \mid x_a) > 0$. Theorem 1 shows that by restoring information from the correct $y_s$, the prediction accuracy can be improved.

Theoretically, by constraining the self-supervision loss, the defense-aware attack in fact predicts the classification label given the correct self-supervision label $y_s$, i.e., it estimates $p(y \mid x_a, y_s)$. According to our theory, the robust accuracy should then increase due to the restored information. Overall, our defense is robust even under a defense-aware adversary.

Figure 4: The Trade-off: Robust accuracy under the defense-aware attack for different settings. We increase the $\lambda$ of the defense-aware attack from 0 to 10 and show robust accuracy for two robust models, RO and Semi-SL, on the CIFAR-10 dataset. The plot shows a trade-off: while increasing the value of $\lambda$ decreases the gain of Ours compared with the Baseline, it also decreases the attacker's effectiveness. To achieve the best attack effectiveness, the attacker should use $\lambda = 0$, which is the standard attack without attempting to corrupt the self-supervised task.
Model | Architecture | PGD 50 (Standard / Ours) | PGD 200 (Standard / Ours) | BIM 200 (Standard / Ours) | C&W 200 (Standard / Ours)
TRADES [68] | WRN-34-10 | 55.05% / 57.00% | 55.02% / 57.18% | 55.06% / 57.33% | 53.72% / 56.56%
RO [47] | PreRes-18 | 52.40% / 54.59% | 52.34% / 54.62% | 52.32% / 54.54% | 50.33% / 53.48%
BagT [42] | WRN-34-10 | 56.44% / 58.33% | 56.40% / 58.47% | 56.38% / 58.55% | 54.82% / 57.35%
MART [61] | WRN-28-10 | 62.72% / 64.40% | 62.63% / 64.26% | 62.54% / 64.28% | 58.96% / 62.18%
AWP [63] | WRN-28-10 | 63.67% / 64.21% | 63.64% / 64.07% | 63.64% / 63.69% | 60.82% / 61.90%
Semi-SL [58] | WRN-28-10 | 62.30% / 64.64% | 62.22% / 64.44% | 62.18% / 64.68% | 60.90% / 63.83%

Table 1: Adversarial robust accuracy on the CIFAR-10 test set. Our method improves the robustness of established work, including the state-of-the-art method, by over 2.9% across different adversarial attack setups. The lower bounds for the robustness of the state-of-the-art method and ours are boxed.
Model | Architecture | PGD 50 (Standard / Ours) | PGD 200 (Standard / Ours) | BIM 200 (Standard / Ours) | C&W 200 (Standard / Ours)
RO [47] | ResNet18 | 21.90% / 23.83% | 20.90% / 22.23% | 21.00% / 22.09% | 20.42% / 22.19%
TRADES* [68] | WRN34-10 | 27.04% / 28.55% | 26.94% / 28.16% | 26.94% / 28.18% | 26.57% / 27.88%
BagT* [42] | WRN34-10 | 29.87% / 31.21% | 28.81% / 31.15% | 29.81% / 31.44% | 29.72% / 31.40%

Table 2: Adversarial robust accuracy on the CIFAR-100 test set. Our method consistently improves robustness by over 1.6%.
Model | Architecture | PGD 50 (Standard / Ours) | PGD 200 (Standard / Ours) | BIM 200 (Standard / Ours) | C&W 200 (Standard / Ours)
RO [47] | PreRes-18 | 51.03% / 53.21% | 50.82% / 53.06% | 50.84% / 53.26% | 48.90% / 54.62%
Semi-SL [58] | WRN-28-10 | 55.69% / 62.40% | 55.29% / 62.12% | 55.43% / 62.53% | 57.03% / 63.34%

Table 3: Adversarial robust accuracy on the SVHN test set under different attacks. Our method improves robustness, including that of the state-of-the-art semi-supervised learning model, by over 6.8%.
Model | FGSM | PGD 10 | PGD 20 | BIM 20 | C&W 20
FBF [62] | 32.47% | 28.68% | 28.52% | 28.49% | 27.81%
Ours + FBF | 33.51% | 31.36% | 31.32% | 31.27% | 30.88%

Table 4: Adversarial robust accuracy on the ImageNet test set. Our method uses the same model and parameters as the baseline FBF [62]. After reversing the adversarial attack with our natural supervision, we improve robustness on ImageNet by over 3.07%.
Model | PGD 200 (Standard / Ours) | C&W 200 (Standard / Ours)
TRADES [68] | 36.90% / 39.16% | 35.89% / 38.07%
RO [47] | 40.00% / 41.98% | 38.31% / 40.17%
BagT [42] | 38.80% / 41.28% | 37.01% / 39.20%
Semi-SL [58] | 42.07% / 44.47% | 39.97% / 42.72%
AWP [63] | 44.39% / 47.15% | 40.97% / 43.72%
MART [61] | 43.24% / 45.85% | 43.23% / 45.85%

Table 5: $L_2$ norm bounded adversarial robust accuracy on CIFAR-10. Our natural supervision is agnostic to the attack type and improves robustness by over 2.6% under the $L_2$ attack without retraining the defense model. The lower bound for the best achieved robustness is boxed.

4 Experiments

Our experiments evaluate the robustness at image classification on four datasets: CIFAR-10 [29], CIFAR-100 [30], SVHN [39], and ImageNet [12]. We compare with the state-of-the-art defense methods, under several strong adversarial attacks including a defense aware attack.

4.1 Baselines

We apply our method to seven established, scrutinized [1] defense methods including the state-of-the-art adversarial robust model. All studied methods are trained with adversarial training [34], but achieve higher robust accuracy than the initial version of Madry et al. [34].

TRADES [68] is the winning solution of the NeurIPS 2018 Adversarial Vision Challenge. It introduces a KL-divergence term that regularizes the representation of adversarial examples to match that of clean examples.

Robust Overfitting (RO) [47] re-examines existing adversarially robust models through the lens of overfitting; it is the state-of-the-art model trained with Pre-ResNet18.

Bag of Tricks (BagT) [42] conducts extensive experiments on the effect of hyper-parameters on adversarial training [34]. It is the state-of-the-art adversarial robust model without additional unlabeled data for training.

Semi-supervised Learning (Semi-SL) [58] significantly improves adversarial robustness using unlabeled data. By training with pseudo-labels on unlabeled images, the model achieves state-of-the-art robustness. However, this work neglects the information in natural images beyond the pseudo classification label.

MART [61] uses misclassification aware adversarial training to achieve improved robustness. We use its best version trained on top of Semi-SL [58].

Adversarial Weight Perturbation (AWP) [63] trains a robust model by smoothing the loss landscape with respect to the weights. We use its best version trained on top of Semi-SL [58].

Fast is Better than Free (FBF) [62] is the state-of-the-art solution for training a robust ImageNet classifier within a reasonable training budget and time.

4.2 Attack Methods

Fast Gradient Sign Method (FGSM) [18] is a one-step adversarial attack to fool neural networks.

Projected Gradient Descent (PGD) [34] is the standard evaluation for adversarial robustness. It optimizes the adversarial noise with gradient ascent over a number of iterations, projecting the noise back to the nearest boundary whenever it exceeds the given bound.

Basic Iterative Attack (BIM) [31] is a variant of PGD attack without the initial random start.

C&W Attack [6] is a powerful iterative attack that has been widely used for robustness evaluation. It reduces the logit value for the right class while increasing that for the second best class to fool the classifier.

The Defense-Aware Attack is discussed in Section 3.3; it is theoretically the optimal adaptive white-box attack for bypassing our defense algorithm.

4.3 Experimental Settings

Backbone Architectures. Following prior literature, we conduct experiments with Pre-ResNet18 [23], ResNet50 [22], and WideResNet [67]. We download the pretrained models' weights online; we reproduce the few models that are not available and denote them with *.

Self-supervised Learning Branch. We use a network with two fully connected layers that takes in the features from the penultimate layer of the backbone network.
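A minimal sketch of such a two-layer projection head in PyTorch; the hidden and output dimensions are illustrative choices, not the paper's.

```python
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Two fully connected layers mapping penultimate backbone features to contrastive features."""
    def __init__(self, in_dim=512, hidden_dim=512, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, h):
        return self.net(h)   # contrastive feature z = g(h(x))
```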

Implementation Details. We train our self-supervision model with the Adam [28] optimizer; the learning rate is given in the supplementary. When training the self-supervised model, we use a temperature hyper-parameter for the contrastive loss and a batch size of 128. For CIFAR-10 and CIFAR-100, we train the self-supervised branch for 200 epochs; for SVHN, 600 epochs; and for ImageNet, 30 epochs. We set the reverse attack bound $\epsilon_v$ and the number of optimization iterations $K$ as listed in the supplementary. We implement our model with PyTorch [44]. Please see the supplementary for further details.

4.4 Results of Defense Aware Adversarial Attacks

In Section 3.3, we discussed the strongest adaptive attack that can be used to bypass our defense. We show the results of the adaptive attacker on CIFAR-10 in Figure 4. We use attacks with 50 steps and the perturbation bound $\epsilon_a$. We vary the value of $\lambda$ from 0 to 10, where 0 corresponds to the standard PGD attack that does not consider our defense strategy. The results show that as $\lambda$ increases and more of the attack budget is focused on the self-supervised defense, the gain of our approach is reduced (the solid line falls below the dotted line). However, as $\lambda$ gets larger, the attack on the classification task also gets weaker (the lines move up). While the adaptive attack successfully reduces the additional gain brought by our approach, it also significantly sacrifices the attack success rate on the classification task. Minimizing the self-supervised loss hurts the original classification attack so much that it is not worthwhile for the attacker to account for our defense. We also show results with 500 steps in the supplementary, where the conclusion also holds.

This finding matches our theory in Section 3.4, where leveraging the incidental structure in the images improves robustness. The decrease in attack success rate on the target classification task is also consistent with prior work [35], which suggests that it is harder to attack multiple tasks simultaneously. In fact, the attacker is trading off between classification attack success rate and fooling the self-supervised defense. For the attacker, the optimal attack is $\lambda = 0$, which is the standard adversarial attack without considering our self-supervised defense. Therefore, we use this setup in the remaining experiments.

4.5 Results with Optimal Adversarial Attacks

In the optimal setup, $\lambda = 0$, and the attack is equivalent to the standard adversarial attack without considering the defense branch. Thus the gain from Standard to Ours is a lower bound on our self-supervised correction. We now show the gain on four datasets.

CIFAR-10 [29] contains 10 categories. In Table 1, we add our approach to six existing robust models including the state-of-the-art, where we consistently correct up to 2.9% of the adversarial examples, as shown in the gain from Standard to Ours.

CIFAR-100 [30] contains 100 categories. In Table 2, we show over 1.9% gain compared with the baselines.

SVHN [39] is a 10-category street-view house number dataset. In Table 3, we experiment on two methods that have pretrained models available, including the state-of-the-art semi-supervised learning model [58]. Ours demonstrates an over 6.8% gain compared with the original defense method.

ImageNet [12] contains 1000 categories. We use the pretrained model from Wong et al. [62], which is the state-of-the-art ResNet50 robust model. In Table 4, we use 5 different attacks to assess the adversarial robustness of the original model and of our model with the natural supervision defense. Our approach achieves an over 3% gain in robustness.

(a) RO [47]
(b) Semi-SL [58]
Figure 5: Speed versus Robust Accuracy: Trade-off between inference time and adversarial robustness on the CIFAR-10 dataset. As our approach is iterative, we can stop early if the application prefers speed over robustness.
(a) CIFAR-10 (up to 9.2% gain)
(b) CIFAR-100 (up to 3.3% gain)
(c) SVHN (up to 14.1% gain)
(d) ImageNet (up to 4.4% gain)
Figure 6: Adversarial robust accuracy vs. perturbation budget curves on CIFAR-10, CIFAR-100, SVHN, and ImageNet under the $L_\infty$ norm. The red line applies our inference algorithm to the baseline models [47, 58, 62]. Using our inference algorithm significantly improves robustness.
(a) CIFAR-10
(b) CIFAR-100
(c) SVHN
(d) ImageNet
Figure 7: The trade-off between adversarial robust accuracy and clean accuracy on CIFAR-10, CIFAR-100, and SVHN under the $L_\infty$ norm. We increase the noise budget from small to large, which causes the clean accuracy to drop from right to left. Our method produces a better reversal of the adversarial perturbation than simply adding random noise.

4.6 Analysis

Accuracy vs. Time Budget. As our method is iterative, we can adjust the number of iterations according to different time budgets. We use the number of iterations as an indicator of time and plot accuracy vs. time budget in Figure 5, where we can see that even a few updates significantly improve robustness.

Robustness Curve. We adopt the robustness curve evaluation of adversarial robust accuracy vs. perturbation budget [15]. We show the trend in Figure 6. We apply our inference algorithm as an additional defense to existing robust models [47, 62]; our approach achieves up to a 14% robustness gain compared with the standard inference method, especially as the attack gets stronger with a larger perturbation bound.

Trade-off between clean accuracy and adversarial robustness. It has been proven that there exists a natural trade-off between clean accuracy and adversarial robustness for a given classifier [57, 68]. In Figure 7, we compare our natural-supervision reverse defense with a random-noise reverse defense (baseline). We increase the additive noise level that is applied to reverse the adversarial examples as well as the clean examples (we apply the same algorithm to clean examples because during inference we cannot distinguish adversarial inputs from clean ones). The clean accuracy drops as the noise level goes up. While there is a trade-off between clean accuracy and robust accuracy [57, 68], our approach achieves a better trade-off between them.

Robustness to $L_2$ norm bounded adversarial attacks. We measure whether our defense also generalizes beyond the $L_\infty$ bounded attack. Table 5 shows results under $L_2$ norm bounded attacks on CIFAR-10, where our approach consistently improves robustness under $L_2$ norm bounded attacks by over 2.6%.

Figure 8: Feature Trajectories: We project the features onto a plane with PCA to visualize their trajectory under attack and our reverse attack. The green cross indicates the clean examples, the red square indicates the misclassified adversarial examples, and the blue dot indicates our reversal method. Our approach pushes the misclassified examples (red) back to the original features (green), improving adversarial robustness.

Feature visualization. Figure 8 visualizes the trajectory of an image's penultimate-layer features as they transition through an attack and the reversal. We use PCA to project the features onto a plane. The plot demonstrates that the attack shifts the feature embedding from the right class to the wrong class. The reverse attack then often returns the features to the right class.

To quantify this effect, we take the Euclidean distance between the clean embedding and the attacked embedding, denoted $d_{adv}$, as well as the Euclidean distance between the clean embedding and the reverse-attacked embedding, denoted $d_{rev}$, for the triples that have the same clean class and reverse-attacked class. For all but one combination of categories, $d_{rev} < d_{adv}$. Additionally, across all triplets, we measured how much the average distance from clean to reverse-attacked is reduced relative to the average distance from clean to attacked, and obtained a decrease of roughly 17%. These results together demonstrate that, on average, our reverse attack returns the attacked embedding closer to the original embedding.
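A small sketch of this distance computation, given arrays of clean, attacked, and reverse-attacked embeddings; the function and variable names are ours, introduced only for illustration.

```python
import numpy as np

def distance_reduction(clean, attacked, reversed_):
    """Compare ||clean - attacked|| with ||clean - reverse-attacked|| for (N, D) embedding arrays."""
    d_adv = np.linalg.norm(clean - attacked, axis=1)      # clean -> attacked distance per sample
    d_rev = np.linalg.norm(clean - reversed_, axis=1)     # clean -> reverse-attacked distance per sample
    reduction = 1.0 - d_rev.mean() / d_adv.mean()         # fraction by which the mean distance shrinks
    return d_adv, d_rev, reduction
```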

5 Conclusions

We introduce an approach to use natural supervision to reverse adversarial attacks on images. Our results demonstrate improved robustness across several benchmarks and several state-of-the-art attacks. Our findings suggest integrating defense mechanisms into the inference algorithm is a promising direction to improve adversarial robustness.

Acknowledgements: This research is based on work partially supported by NSF CRII #1850069, NSF grant CNS-15-64055; ONR grants N00014-16-1- 2263 and N00014-17-1-2788; a JP Morgan Faculty Research Award; and a DiDi Faculty Research Award. MC is supported by a CAIT Amazon PhD fellowship. We thank NVidia for GPU donations. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors.


6 Theoretical Results

Appendix A Detailed Proof of Lemma 1

Lemma 2.

The standard classifier under adversarial attack is equivalent to predicting with $p(y \mid x_a, \hat{y}_s)$, and our approach is equivalent to predicting with $p(y \mid x_a, y_s)$.

First we show that

It is easy to see that

where the last equality is due to the neural network’s deterministic nature, i.e.,

where $\hat{y}_s$ is the latent self-supervised label prediction; the probability is 0 otherwise.

Intuitively, this demonstrates that the attack is equivalent to using the classifier

to predict the label.

Next we show that our algorithm is equivalent to using the following classifier

Our algorithm finds a new input image that

Note that

The last formulation is our algorithm's inference procedure, where we first estimate the reversed image from the adversarial image $x_a$ and the self-supervised label $y_s$. We then predict the label using our new image. We have now proved that our algorithm is equivalent to using

Here, we use the maximum a posteriori (MAP) estimate to approximate the marginalization over candidate images because: first, sampling a large number of candidates is computationally expensive; second, our results show that random sampling is ineffective; and lastly, our MAP estimate also produces a denoised image that can be useful for other downstream tasks.

Appendix B Detailed Proof of Theorem 1

Theorem 2.

Assume the base classifier operates better than chance and instances in the dataset are uniformly distributed over $N$ categories. Let $\bar{A}_{std}$ and $\bar{A}_{ours}$ denote the upper bounds on prediction accuracy for the standard classifier and for our approach. If the conditional mutual information $I(y; y_s \mid x_a) > 0$, we have $\bar{A}_{ours} > \bar{A}_{std}$, and the corresponding lower bound does not decrease, which means our approach strictly improves the bound for classification accuracy.

Proof.

If $I(y; y_s \mid x_a) > 0$, we have:

We let the predicted label be $\hat{y}$, assume there are $N$ categories, and let the lower bound for the prediction accuracy be as defined below. We define the error probability $P_e$. Using Fano's Inequality [51], we have

(12)
(13)

We add to both sides

(14)

because .

Then we get

(15)

Now we define a new function $f$. In a classification task, the number of categories satisfies $N \ge 2$. Given that the entropy function first increases and then decreases, the function $f$ should also first increase, peak at some point, and then decrease.

We find the location of the peak by computing the first-order derivative of $f$ and setting it to zero. Solving this, we have:

(16)

which shows that the function is monotonically increasing when .

Given that the base classifier already achieves accuracy better than random guessing, the classifier satisfies this condition. The function is therefore monotonically increasing in the studied region and has a well-defined inverse function.

Rewriting Equation 15, we then have

(17)

We apply the inverse function to both sides:

(18)
(19)

Note that is our defined accuracy, thus

The above derivation also applies to , thus .

Thus, using our approach, the upper bound for robust accuracy is improved.

To prove the lower bound, we divide the joint set into subsets; given the additional information from the correct self-supervision label, the accuracy will not get worse, thus the new lower bound is not smaller than the original one.

Appendix C Defense Aware Attack

We derive the defense-aware attack from the main paper in detail. We make the latent label explicit in our notation.

The straightforward adaptive attack is to optimize the attack adversarially against the defense.

(20)
(21)

From the attacker's perspective, the above optimization is not ideal, as it involves alternating optimization in two directions; the estimated gradient may therefore be unstable and may even suffer from gradient obfuscation [1]. Following standard constrained-optimization attack practice [55, 6], the attacker reformulates the above equations as a constrained optimization problem:

(22)
(23)

where $c$ is the same value as the converged self-supervised loss for natural images. Intuitively, the attacker should maximize the adversarial gain while respecting the self-supervised loss if they want to render our defense ineffective.

The above equation is equivalent to:

(24)
(25)

We can use the Lagrangian Penalty Method to derive the following:

(26)

Thus the optimal value for the attack is:

(27)

which is the primal.

Using the Weak Duality Theorem we have the following upper bound for the optimal solution of the above optimization problem [27]:

(28)

Removing the negative sign, we have:

(29)

which is equivalent to first maximizing the following under different values of $\lambda$:

(30)

We then select the $\lambda$ that yields the most damage, i.e., the lowest robust accuracy, and use the corresponding generated attack $\delta_a(\lambda)$.
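In the notation of Section 3, this derivation can be summarized by the following sketch (a hedged reconstruction under our notation, not a verbatim restoration of the missing displays). The constrained attack is

$$\max_{\|\delta_a\|_p \le \epsilon_a} \mathcal{L}_c\big(F_\theta(x + \delta_a), y\big) \quad \text{subject to} \quad \mathcal{L}_s(x + \delta_a) \le c.$$

Applying the Lagrange penalty gives the primal

$$\max_{\delta_a} \; \min_{\lambda \ge 0} \; \mathcal{L}_c\big(F_\theta(x + \delta_a), y\big) - \lambda\big(\mathcal{L}_s(x + \delta_a) - c\big),$$

and the Weak Duality Theorem upper-bounds the primal by the dual

$$\min_{\lambda \ge 0} \; \max_{\delta_a} \; \mathcal{L}_c\big(F_\theta(x + \delta_a), y\big) - \lambda\big(\mathcal{L}_s(x + \delta_a) - c\big),$$

which the attacker optimizes by maximizing the inner objective for each fixed $\lambda$ and keeping the $\lambda$ that yields the lowest robust accuracy.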

7 Experimental Results

Appendix A Defense Aware Adversarial Attack

We show the numerical results for the defense-aware attack in Table 6. In addition to the 50 steps used in our main paper, we also show results using 500 steps of the adaptive attack, applied to RO (the robust overfitting method of Rice et al. [47]). The results clearly show that using more steps does not change the conclusion: 500 attack steps achieve almost the same robust accuracy as the 50-step baseline, which suggests that the attack has almost converged. In addition, our approach still improves the robust accuracy by over 2%. Lastly, the attacker needs a large $\lambda$ in order to bypass our defense through the reverse attack, but at the cost that the attack on the classification task gets weaker by over 7%, which itself helps our defense. Overall, even under more attack steps, our defense remains effective.

Model (attack steps) | λ = 0 | 0.5 | 1 | 2 | 4 | 6 | 8 | 10
RO [47], 50 steps, Baseline | 52.40% | 53.81% | 55.41% | 57.81% | 60.80% | 62.62% | 63.81% | 64.58%
RO [47], 50 steps, with Ours | 54.59% | 55.61% | 56.75% | 58.67% | 60.81% | 62.07% | 63.16% | 63.68%
RO [47], 500 steps, Baseline | 52.23% | 53.47% | 54.89% | 57.07% | 59.86% | 61.89% | 63.09% | 63.90%
RO [47], 500 steps, with Ours | 54.70% | 55.61% | 56.68% | 58.17% | 60.51% | 61.26% | 62.40% | 63.06%
Semi-SL [58], 50 steps, Baseline | 62.30% | 63.87% | 65.38% | 67.87% | 70.60% | 72.32% | 73.52% | 74.42%
Semi-SL [58], 50 steps, with Ours | 64.64% | 65.62% | 66.72% | 68.58% | 70.55% | 71.43% | 72.75% | 73.19%
Table 6: Robust accuracy under the defense-aware attack for different settings. We increase the $\lambda$ of the defense-aware attack from 0 to 10 and show robust accuracy for two robust models, RO and Semi-SL, on the CIFAR-10 dataset. While increasing the value of $\lambda$ decreases the gain of Ours compared with the Baseline, it also increases the robust accuracy of the baseline methods. To achieve the best attack efficiency, the attacker should use $\lambda = 0$, the standard attack without attempting to fool the self-supervised branch.

Appendix B Defending a Stand-alone Non-robust Network

In our main paper, we applied our approach to a list of existing state-of-the-art models; here we show results from applying our approach to an undefended neural network.

We train a PreRes-18 model on pure clean images without any adversarial training, which yields 0% robust accuracy under adversarial attack. We then use our defense to reverse the attack via natural supervision. We achieve an improvement in robust accuracy of 34.4%.

Appendix C Feature Input for Self-supervised Models

We investigate which layer's features should be the input to the self-supervision model. We conduct an ablation study that reads out the latent features from low to high layers. Results in Figure 9 show that reading out the features from the top layer for self-supervision achieves the largest robustness gain.

Figure 9: On the CIFAR-10 dataset, we experiment with reading out features from different layers and plot the robustness gain.

Appendix D Implementation Details

We run our experiments with 8 RTX 2080 Ti GPUs. For CIFAR-10, CIFAR-100, and SVHN, the input dimension is 32×32. We maximize GPU memory usage to speed up our inference. For the PreRes-18 model, we use a batch size of 1024 during inference. For the Wide Residual Network model, we use a batch size of 512. For the ImageNet dataset, the input size is larger; we use the ResNet-50 model and, due to the larger input dimension and model capacity, a batch size of 87 to maximize GPU usage. For all the contrastive learning, we sample 4 positive views for each given image instance.

8 Visualization

Appendix A Attack Vector Visualization:

We visualize more examples of the adversarial attack vector and the reverse attack vector in Figure 10. Our reverse attack vector is highly structured, reversing the mispredicted examples back to the correct class.

Appendix B Feature Visualization:

We show more visualizations of the feature trajectories of our approach in Figure 11, Figure 12, Figure 13, Figure 14, and Figure 15. We project the features onto a plane with PCA under the same setup as Figure 8 in the main paper. We can see that our approach pushes the misclassified examples (red) back to the original features (green), improving adversarial robustness.

Figure 10: We show ImageNet clean examples, the attack vector, and our reverse attack vector. The adversarial attack is bounded by the attack budget $\epsilon_a$. By adding our reverse attack to the attacked image, we can correct the misprediction of the ImageNet classifier. As we can see, the reverse attack vector is also highly structured, which explains why our approach is more effective than adding random noise. The attack and reverse attack vectors have been multiplied by ten for visualization purposes only.
Figure 11: Feature Trajectories under attack and our reverse attack. We plot the figure in the same way as Figure 8 in the main paper with PCA.
Figure 12: Feature Trajectories under attack and our reverse attack. We plot the figure in the same way as Figure 8 in the main paper with PCA.
Figure 13: Feature Trajectories under attack and our reverse attack. We plot the figure in the same way as Figure 8 in the main paper with PCA.
Figure 14: Feature Trajectories under attack and our reverse attack. We plot the figure in the same way as Figure 8 in the main paper with PCA.
Figure 15: Feature Trajectories under attack and our reverse attack. We plot the figure in the same way as Figure 8 in the main paper with PCA.

References

  • [1] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 274–283. PMLR, 2018.
  • [2] Mitali Bafna, Jack Murtagh, and Nikhil Vyas. Thwarting adversarial examples: An l_0-robust sparse fourier transform. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
  • [3] Sagie Benaim, Ariel Ephrat, Oran Lang, Inbar Mosseri, William T. Freeman, Michael Rubinstein, Michal Irani, and Tali Dekel. Speednet: Learning the speediness in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
  • [4] Jacob Buckman, Aurko Roy, Colin Raffel, and Ian J. Goodfellow. Thermometer encoding: One hot way to resist adversarial examples. In 6th International Conference on Learning Representations, 2018.
  • [5] Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian J. Goodfellow, Aleksander Madry, and Alexey Kurakin. On evaluating adversarial robustness. CoRR, abs/1902.06705, 2019.
  • [6] Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy, pages 39–57, 2017.
  • [7] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. 2020.
  • [8] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments, 2021.
  • [9] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations, 2020.
  • [10] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
  • [11] Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In ICML, 2020.
  • [12] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
  • [13] Virginia R. DeSa. Learning classification with unlabeled data. In Proceedings of the 6th International Conference on Neural Information Processing Systems, NIPS’93, page 112–119, San Francisco, CA, USA, 1993. Morgan Kaufmann Publishers Inc.
  • [14] Guneet S. Dhillon, Kamyar Azizzadenesheli, Zachary C. Lipton, Jeremy Bernstein, Jean Kossaifi, Aran Khanna, and Animashree Anandkumar. Stochastic activation pruning for robust adversarial defense. In 6th International Conference on Learning Representations, 2018.
  • [15] Yinpeng Dong, Qi-An Fu, Xiao Yang, Tianyu Pang, Hang Su, Zihao Xiao, and Jun Zhu. Benchmarking adversarial robustness on image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 321–331, 2020.
  • [16] Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. Boosting adversarial attacks with momentum. In CVPR, pages 9185–9193, 2018.
  • [17] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations, 2018.
  • [18] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv:1412.6572, 2014.
  • [19] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised learning, 2020.
  • [20] Chuan Guo, Mayank Rana, Moustapha Cissé, and Laurens van der Maaten. Countering adversarial images using input transformations. CoRR, abs/1711.00117, 2017.
  • [21] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
  • [22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv 1512.03385, 2015.
  • [23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks, 2016.
  • [24] Dan Hendrycks, Kimin Lee, and Mantas Mazeika. Using pre-training can improve model robustness and uncertainty. Proceedings of the International Conference on Machine Learning, 2019.
  • [25] Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, and Dawn Song. Using self-supervised learning can improve model robustness and uncertainty. Advances in Neural Information Processing Systems (NeurIPS), 2019.
  • [26] Allan Jabri, Andrew Owens, and Alexei A. Efros. Space-time correspondence as a contrastive random walk. In Advances in Neural Information Processing Systems, 2020.
  • [27] William Karush. Minima of functions of several variables with inequalities as side constraints. M. Sc. Dissertation. Dept. of Mathematics, Univ. of Chicago, 1939.
  • [28] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
  • [29] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Cifar-10 (canadian institute for advanced research).
  • [30] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Cifar-100 (canadian institute for advanced research).
  • [31] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2017.
  • [32] Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. CoRR, abs/1812.05784, 2018.
  • [33] Yingzhen Li, John Bradshaw, and Yash Sharma. Are generative classifiers more robust to adversarial attacks? In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3804–3814. PMLR, 09–15 Jun 2019.
  • [34] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018.
  • [35] Chengzhi Mao, Amogh Gupta, Vikram Nitin, Baishakhi Ray, Shuran Song, Junfeng Yang, and Carl Vondrick. Multitask learning strengthens adversarial robustness. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision – ECCV 2020, pages 158–174, Cham, 2020. Springer International Publishing.
  • [36] Chengzhi Mao, Ziyuan Zhong, Junfeng Yang, Carl Vondrick, and Baishakhi Ray. Metric learning for adversarial robustness. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • [37] Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations, 2019.
  • [38] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks, 2016.
  • [39] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.
  • [40] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision – ECCV 2016, pages 69–84, Cham, 2016. Springer International Publishing.
  • [41] Tianyu Pang, Kun Xu, Chao Du, Ning Chen, and Jun Zhu. Improving adversarial robustness via promoting ensemble diversity. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4970–4979. PMLR, 09–15 Jun 2019.
  • [42] Tianyu Pang, Xiao Yang, Yinpeng Dong, Hang Su, and Jun Zhu. Bag of tricks for adversarial training, 2020.
  • [43] Nicolas Papernot, Patrick D. McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. arXiv:1511.07528, 2015.
  • [44] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • [45] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. CoRR, abs/1604.07379, 2016.
  • [46] Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. Deepxplore. Proceedings of the 26th Symposium on Operating Systems Principles, Oct 2017.
  • [47] Leslie Rice, Eric Wong, and J. Zico Kolter. Overfitting in adversarially robust deep learning, 2020.
  • [48] Kevin Roth, Yannic Kilcher, and Thomas Hofmann. The odds are odd: A statistical test for detecting adversarial examples. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5498–5507. PMLR, 09–15 Jun 2019.
  • [49] A. M. Rubinov, X. Q. Yang, and Y. Y. Zhou. A lagrange penalty reformulation method for constrained optimization. Optimization Letters, 2007.
  • [50] Pouya Samangouei, Maya Kabkab, and Rama Chellappa. Defense-gan: Protecting classifiers against adversarial attacks using generative models. CoRR, abs/1805.06605, 2018.
  • [51] Jonathan Scarlett and Volkan Cevher. An introductory guide to fano’s inequality with applications in statistical estimation, 2019.
  • [52] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. CoRR, abs/1503.03832, 2015.
  • [53] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception for autonomous driving: Waymo open dataset. CoRR, abs/1912.04838, 2019.
  • [54] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning, pages 9229–9248. PMLR, 2020.
  • [55] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv:1312.6199, 2013.
  • [56] Florian Tramer, Nicholas Carlini, Wieland Brendel, and Aleksander Madry. On adaptive attacks to adversarial example defenses. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1633–1645. Curran Associates, Inc., 2020.
  • [57] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy, 2019.
  • [58] Jonathan Uesato, Jean-Baptiste Alayrac, Po-Sen Huang, Robert Stanforth, Alhussein Fawzi, and Pushmeet Kohli. Are labels required for improving adversarial robustness? CoRR, 2019.
  • [59] Gunjan Verma and Ananthram Swami. Error correcting output codes improve probability estimation and adversarial robustness of deep neural networks. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • [60] Carl Vondrick, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. Tracking emerges by colorizing videos, 2018.
  • [61] Yisen Wang, Difan Zou, Jinfeng Yi, James Bailey, Xingjun Ma, and Quanquan Gu. Improving adversarial robustness requires revisiting misclassified examples. In ICLR, 2020.
  • [62] Eric Wong, Leslie Rice, and J. Zico Kolter. Fast is better than free: Revisiting adversarial training, 2020.
  • [63] Dongxian Wu, Shu-Tao Xia, and Yisen Wang. Adversarial weight perturbation helps robust generalization. In NeurIPS, 2020.
  • [64] Chang Xiao, Peilin Zhong, and Changxi Zheng. Enhancing adversarial defense by k-winners-take-all, 2019.
  • [65] Zhaoxia Yin, Hua Wang, Li Chen, Jie Wang, and Weiming Zhang. Reversible adversarial example based on reversible image transformation, 2021.
  • [66] Tao Yu, Shengyuan Hu, Chuan Guo, Weilun Chao, and Kilian Weinberger. A new defense against adversarial images: Turning a weakness into a strength. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Oct. 2019.
  • [67] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. CoRR, abs/1605.07146, 2016.
  • [68] Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P. Xing, Laurent El Ghaoui, and Michael I. Jordan. Theoretically principled trade-off between robustness and accuracy. arXiv abs/1901.08573, 2019.
  • [69] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization, 2016.
