1 Introduction
Deep networks achieve strong performance across a range of computer vision tasks, yet they remain brittle under adversarial attacks [11, 6, 16, 55]. With crafted perturbations, attackers can undermine the predictions of state-of-the-art models by changing the features in the representation [36]. These limitations prevent the application of deep networks to sensitive and safety-critical applications [46, 52, 32, 53], underscoring the gap between current machine learning algorithms and human-level abilities [5].
A large body of work has studied how to train deep networks that are robust to adversarial attacks. Adversarial training and its variants [34, 36, 55, 42, 47], including multi-task learning [35, 25, 58], significantly improve robustness. However, while existing methods focus on improving the training algorithm, they are burdened by the need to find a single representation that works for all possible corruptions and attacks. Training-based defenses cannot adapt to the individual characteristics of each attack at testing time.

In this paper, we introduce an approach that reverses the attack process, allowing us to formulate a defense strategy that adapts to each attack at testing time. Just as an attacker finds the right additive perturbation to break the input, our approach finds the right additive perturbation to repair the input. Figure 1 shows our reverse attack on a poisoned ImageNet image. However, reverse attacks are more challenging to produce than standard attacks because the category label is unknown to us during testing.
Our key insight is that images contain natural and intrinsic structure that we can leverage to reverse many types of adversarial attacks. We found that, although adversarial attacks aim to fool the image classifier, they also collaterally damage self-supervised objectives. Our approach shows how to capitalize on this incidental signal to create adversarial defenses. By using self-supervision for defense at test time, we ensure that even the strongest adversary cannot freely manipulate the intrinsic signals that naturally come with images, providing a more robust defense than training-based methods.
A pivotal advantage of our framework is that it factors out the defense strategy from the visual representation. Since reverse attacks are adaptive, this defense is able to efficiently scale to any corruption that violates the natural image manifold.
Moreover, the modularity of our approach allows it to work with any classifier and complements existing defense models. It can also be integrated into future defense models and defend against novel attacks that corrupt natural image structures.
Visualizations, empirical experiments, and theoretical analysis show that our reversal strategy significantly improves robust prediction on several established benchmarks and attacks. Under attacks with up to 200 steps, our method advances state-of-the-art defense methods by a large margin across four natural image datasets: CIFAR-10 (over 2.9% gain), CIFAR-100 (over 1.6% gain), SVHN (over 6.8% gain), and ImageNet (over 3.0% gain). Our method is robust against established attacks, including PGD [34] and C&W [6]. In addition, our empirical results demonstrate that, even when the attacker is aware of our defense mechanism, our approach remains robust. Our models, data, and code are available at https://github.com/cvlab-columbia/SelfSupDefense.
2 Related Work
Self-supervised Learning: Natural images contain rich information for representation learning. Self-supervised learning enables us to learn high-quality representations from images without annotations [13, 9, 69, 21, 10, 3, 7, 45]. By solving pretext tasks, such as jigsaw puzzles [40, 45], rotation prediction [17, 69, 60], random walk [26], and clustering [8], the learned representations can generalize to unseen downstream tasks such as image recognition [9], and also allow domain adaptation at testing time [54]. Recently, contrastive learning has significantly advanced image recognition [9, 21, 37, 19]. In this paper, we leverage this incidental structure to correct adversarial attacks. Our defense uses the contrastive learning task [9], and it is extensible to other existing self-supervised tasks as well [45, 40, 17].

Adversarial Robustness: A large number of adversarial attacks have been proposed to fool deep models [55, 1, 6, 31, 43, 38]. Special adversarial attacks that can be reversed to clean images have also been proposed [65]. Different from that approach of constructing reversible attacks [65], our approach aims to reverse any unknown attack for defense. While many defense methods have been shown to be non-robust [48, 66, 64, 33, 2, 59, 41, 20, 50, 14, 4] because they relied on gradient obfuscation, gradient masking [1, 5], or weak adaptive attack evaluation [56], adversarial training and its variants have been shown to achieve genuine robustness [18, 34, 36, 68, 47, 42, 62, 61, 63]. Moreover, recent progress shows that unlabeled data [58, 24] and self-supervised learning [25] improve the robustness of deep models. While training a robust neural network for defense has been widely studied, no existing work investigates algorithms that improve robustness at inference time.
3 Method
We will first present a reverse attack that uses selfsupervision at deployment time to defend against adversarial attacks. We then analyze the case where the attacker is aware of our defense, and show our defense remains effective. We finally provide theoretical justification for the robustness of our approach.
3.1 Attacks and Reverse Attacks
Let $x$ be an input image, and $y$ be its ground-truth category label. To perform classification, neural networks commonly learn to predict the category by optimizing the cross-entropy between the predictions and the ground truth. The network parameters $\theta$ are estimated by minimizing the expected value of the objective:

$$\min_{\theta} \; \mathbb{E}_{(x, y)} \big[ \mathcal{L}_{ce}(F_\theta(x), y) \big] \tag{1}$$

which can be optimized with gradient-based descent.
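The training objective above can be sketched with a toy linear classifier; `cross_entropy` and `train_step` are illustrative stand-ins (numpy in place of a real deep network):

```python
import numpy as np

def cross_entropy(logits, y):
    """Softmax cross-entropy L_ce for a single example with true class y."""
    z = logits - logits.max()                      # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[y]

def train_step(W, x, y, lr=0.1):
    """One gradient-descent step on L_ce for a linear classifier logits = W @ x."""
    logits = W @ x
    p = np.exp(logits - logits.max()); p /= p.sum()
    grad_logits = p.copy(); grad_logits[y] -= 1.0  # d L_ce / d logits
    return W - lr * np.outer(grad_logits, x)       # chain rule: d logits / d W
```

Repeated calls to `train_step` drive the expected cross-entropy in Equation 1 down, which is all that the attack in the next section tries to undo.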
The Attack: In order to corrupt this model, the adversarial attack finds an additive perturbation $\delta_a$ to the image such that $x + \delta_a$ is no longer classified correctly by the trained network $F_\theta$. Attackers create these worst-case images by maximizing the objective:

$$\delta_a = \arg\max_{\|\delta\| \le \epsilon_a} \mathcal{L}_{ce}(F_\theta(x + \delta), y) \tag{2}$$

where the norm of the perturbation is bounded by $\epsilon_a$, which keeps the perturbation minimal.
The Reverse: We aim to defend against these attacks by reversing the attack process. Just as the attack finds an additive perturbation to break the input, we will find an additive perturbation to repair the input. However, we cannot simply flip Equation 2 from a maximization to a minimization because the category labels are unknown at deployment.
The key observation is that self-supervised objectives are always available because they do not depend on the labels $y$. While adversarial attacks aim to corrupt the classifier, they also impact self-supervised representations, which is a signal we leverage for reversal. Let $\mathcal{L}_s(x)$ be a self-supervised objective on the input $x$. We create the reverse attack vector $\delta_r$ by minimizing the objective:

$$\delta_r = \arg\min_{\|\delta\| \le \epsilon_r} \mathcal{L}_s(x + \delta_a + \delta) \tag{3}$$

where $\epsilon_r$ defines the bound of our reverse attack. The solution $\delta_r$ modifies the adversarial image such that it satisfies our choice of self-supervised objective.
After finding the optimal $\delta_r$, robust prediction is straightforward. Our defense adds the resulting perturbation vector to the input before predicting the classification result with the normal network forward pass: $\hat{y} = F_\theta(x + \delta_a + \delta_r)$.
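The reverse attack can be sketched generically: given any differentiable, label-free loss, descend its sign gradient within the bound. The numerical gradient and the quadratic loss used in the test are stand-ins for backpropagation and the real self-supervised objective:

```python
import numpy as np

def numerical_grad(f, x, h=1e-4):
    """Central-difference gradient of a scalar function f at x (a stand-in
    for autograd; fine for tiny illustrative inputs)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e.flat[i] = h
        g.flat[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def reverse_attack(x_adv, ss_loss, eps_r=0.2, alpha=0.05, steps=10):
    """Minimize a label-free loss ss_loss(image) by signed gradient descent,
    projecting the repair perturbation onto the l_inf ball of radius eps_r."""
    delta = np.zeros_like(x_adv)
    for _ in range(steps):
        g = numerical_grad(ss_loss, x_adv + delta)
        delta = np.clip(delta - alpha * np.sign(g), -eps_r, eps_r)
    return delta
```

Note the symmetry with the attack: the same projected sign-step machinery, but run downhill on a loss that needs no label.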
An advantage of reverse attacks is that, since they do not rely on offline adversarial training, the defense will generalize to unseen adversarial attacks. Moreover, our defense is able to fortify existing models without retraining.
3.2 Natural Supervision for Defense
While any self-supervised task [45, 40, 17] can construct the loss $\mathcal{L}_s$, we use the contrastive loss as our natural supervision objective [9, 10], which is a state-of-the-art self-supervised representation learning approach. The contrastive objective creates features that maximize the agreement between positive pairs of examples while minimizing the agreement between negative pairs. Pairs are typically created with an augmentation strategy [9]. In our case, when we receive a potentially adversarial image $x$, we create the positive examples by sampling different augmentations of it, producing multiple positive pairs. We create the negative pairs in a similar way, except we apply the augmentations to randomly selected images.
Since these pairs are constructible at evaluation time, we create reverse attacks that minimize the term:

$$\mathcal{L}_s = -\sum_{i} \sum_{j} \mathbb{1}_{ij} \log \frac{\exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k \ne i} \exp\big(\mathrm{sim}(z_i, z_k)/\tau\big)} \tag{4}$$

where $z_i$ are the contrastive features. We use the indicator $\mathbb{1}_{ij}$ to mark which pairs are positive and which are negative. This indicator satisfies $\mathbb{1}_{ij} = 1$ if and only if the examples $i$ and $j$ are both augmentations of $x$, and $\mathbb{1}_{ij} = 0$ otherwise. $\tau$ is a scalar temperature hyperparameter, and $\mathrm{sim}(\cdot, \cdot)$ denotes cosine similarity.
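A minimal numpy sketch of this InfoNCE-style objective, with explicit positive and negative feature lists standing in for the indicator (names and the temperature value are illustrative):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def contrastive_loss(anchor, positives, negatives, tau=0.2):
    """InfoNCE-style loss: pull augmentations of the same image together,
    push features of other images away. Lower loss means the anchor agrees
    with its positives more than with the negatives."""
    loss = 0.0
    for z_pos in positives:
        sims = [np.exp(cosine(anchor, z_pos) / tau)]
        sims += [np.exp(cosine(anchor, z_neg) / tau) for z_neg in negatives]
        loss += -np.log(sims[0] / sum(sims))
    return loss / len(positives)
```

An attack that disperses the augmented views of $x$ in feature space raises this loss, which is exactly the gap the reverse attack descends.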
Figure 2 shows that adversarial attacks on the classification objective also attack the contrastive objective $\mathcal{L}_s$, even though the attacker never explicitly optimizes for it. Under attack, $\mathcal{L}_s(x + \delta_a)$ is larger than $\mathcal{L}_s(x)$ on clean images $x$. This gap provides the signal for reverse attacks.
Contrastive Feature Estimation: To estimate the contrastive features $z$, we take the features before the logits from a backbone $F_\theta$ and pass them to a two-layer network $h$. To compute the positive features, we sample augmentations conditioned on the input image $x$. We follow a similar procedure to compute the contrastive features for the negative examples, sampling random images from a collection of images that form the negative set.

Offline, we fit the contrastive model $h$ on a large set of clean images using the same procedure as [9, 10]. We sequentially apply two augmentations: random cropping followed by rescaling to the original size, and random color distortion including color jittering and random grayscale. We found that removing Gaussian blur from the augmentations improved performance because it otherwise favored over-smooth perturbations. After $h$ is trained on clean images, we use it during reverse attacks without any further training.
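The augmentation pipeline can be sketched in plain numpy (the crop ratio and jitter range are illustrative choices, not the paper's settings); note that Gaussian blur is deliberately absent:

```python
import numpy as np

def augment(img, rng):
    """Random crop + nearest-neighbor resize back to the original size,
    followed by a brightness jitter; a toy stand-in for the crop + color
    distortion pipeline. img is HxWxC with values in [0, 1]."""
    h, w, _ = img.shape
    ch, cw = int(0.8 * h), int(0.8 * w)                     # crop size (assumed ratio)
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    crop = img[top:top + ch, left:left + cw]
    ys = np.arange(h) * ch // h                             # nearest-neighbor indices
    xs = np.arange(w) * cw // w
    resized = crop[ys][:, xs]
    return np.clip(resized * rng.uniform(0.8, 1.2), 0.0, 1.0)  # brightness jitter
```

Two independent calls on the same image yield one positive pair; calls on different images yield negatives.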
3.3 Analysis of the Defense-Aware Attack
In this section, we analyze the effectiveness of our approach when the attacker is aware of our defense.
Attack Model: Let us assume the attacker knows the contrastive model parameters and our defense strategy. In this setting, the attacker can adversarially optimize against our defense with the following alternating optimization:

$$\delta_r^{(t)} = \arg\min_{\|\delta\| \le \epsilon_r} \mathcal{L}_s\big(x + \delta_a^{(t)} + \delta\big) \tag{5}$$

$$\delta_a^{(t+1)} = \arg\max_{\|\delta\| \le \epsilon_a} \mathcal{L}_{ce}\big(F_\theta(x + \delta + \delta_r^{(t)}), y\big) \tag{6}$$
From the attacker's perspective, the above procedure is not ideal because it involves an alternating, min-max optimization. Past work suggests that this leads to unstable gradient estimation, a gradient obfuscation problem that reduces attack efficiency [1].
Similar to the C&W [6] and L-BFGS [55] attacks, the attacker can reformulate the above as a constrained optimization problem:

$$\delta_a = \arg\max_{\|\delta\| \le \epsilon_a} \mathcal{L}_{ce}\big(F_\theta(x + \delta), y\big), \quad \text{s.t.} \;\; \mathcal{L}_s(x + \delta) \le c \tag{7}$$

where $c$ is the converged value of $\mathcal{L}_s$ on natural images. Intuitively, the attacker should maximize the adversarial gain while respecting the self-supervised loss constraint if they want to render our defense ineffective.
To optimize Equation 7, they in practice maximize the following objective w.r.t. $\delta$:

$$\mathcal{L}_{adp} = \mathcal{L}_{ce}\big(F_\theta(x + \delta), y\big) - \lambda \mathcal{L}_s(x + \delta) \tag{8}$$

We derive Equation 8 from Equation 7 via the Lagrange penalty method [49], where $\mathcal{L}_{adp}$ is the new loss for the adaptive attack. Full derivations are in the supplementary.
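One step of the resulting defense-aware attack can be sketched as a signed ascent on the combined gradient (a hypothetical helper, not the authors' implementation; the gradients are assumed to come from backpropagation):

```python
import numpy as np

def defense_aware_step(delta, grad_ce, grad_ss, lam, alpha, eps):
    """One signed ascent step on the attacker's combined loss L_ce - lam * L_s.
    lam = 0 recovers a plain PGD step; a larger lam spends more of the
    perturbation budget on mimicking clean self-supervised features instead
    of fooling the classifier."""
    g = grad_ce - lam * grad_ss
    return np.clip(delta + alpha * np.sign(g), -eps, eps)
```

Sweeping `lam` traces the attacker's Pareto frontier between the two objectives discussed below.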
Multi-objective Trade-off: The above derivation shows that the attacker can attempt to bypass our reversal by also minimizing $\mathcal{L}_s$, so that attacks mimic the self-supervised features of clean examples. If the attacker produces examples that are as good as clean examples in terms of $\mathcal{L}_s$, our defense would not be able to reverse the attack by further decreasing $\mathcal{L}_s$.

However, since the attacker must solve a multi-objective optimization, they must trade off between the two objectives. The scalar $\lambda$ controls how aggressively the attacker corrupts the self-supervised model. The attacker's ideal adversarial attack will first trace the Pareto frontier by maximizing Equation 8 for each $\lambda$. They should then select the $\lambda$ that yields the most damage (the lowest robust accuracy) and use the corresponding attack $\delta_a$.

A larger $\lambda$ shifts the adversarial budget from attacking the classification loss $\mathcal{L}_{ce}$ to attacking the self-supervised loss $\mathcal{L}_s$. If the attacker attacks the self-supervised task, they reduce the effectiveness of their classification attack, undermining their goal. Attacking both $\mathcal{L}_{ce}$ and $\mathcal{L}_s$ jointly requires creating adversarial images for multiple objectives, which is fundamentally more challenging [35].
Our defense creates a lose-lose situation for the attacker. If they ignore our defense, we improve accuracy. If they account for our defense, they weaken their attack.
3.4 Theoretical Analysis
We now provide theoretical insight into why leveraging natural supervision improves adversarial robustness. Without our defense, the model predicts the category on an image with an incorrect estimate of the self-supervision label. With our defense, the model uses an image for which the self-supervision label is estimated correctly. We prove this increases the upper bound of the prediction accuracy.

A feed-forward pass is equivalent to also providing a latent self-supervised label to the model, since the information the self-supervised network uses is in the image itself. We denote the ground-truth label of self-supervision as $s$ and the predicted label of self-supervision under attack as $\hat{s}$. To make this latent label explicit in notation, we write the loss functions as $\mathcal{L}_{ce}(x, y, s)$ and $\mathcal{L}_s(x, s)$.
Lemma 1.
The standard classifier under adversarial attack is equivalent to predicting with $p(y \mid x_a, \hat{s})$, and our approach is equivalent to predicting with $p(y \mid x_a, s)$.
Proof.
For the standard classifier under attack, the self-supervised label inferred from the adversarial image $x_a$ is the corrupted $\hat{s}$. Thus the standard classifier under adversarial attack is equivalent to estimating $p(y \mid x_a, \hat{s})$.
Our algorithm finds a new input image $x_r = x_a + \delta_r$ whose self-supervised label matches the ground truth $s$. We then predict the label using our new image $x_r$. Thus, our approach in fact estimates $p(y \mid x_r)$. Note the following holds:

$$p(y \mid x_r) = \sum_{s'} p(y \mid x_r, s')\, p(s' \mid x_r) \tag{9}$$

$$\approx p(y \mid x_r, s) \tag{10}$$

$$\approx p(y \mid x_a, s) \tag{11}$$

where (10) holds because the reverse attack concentrates $p(s' \mid x_r)$ on the correct label $s$, and (11) holds because $x_r$ differs from $x_a$ only by the bounded perturbation $\delta_r$. Thus our approach is equivalent to estimating $p(y \mid x_a, s)$. ∎
We use the maximum a posteriori (MAP) estimate to approximate the sum over $s$ because: (1) sampling a large number of $s$ is computationally expensive; (2) our results in Figure 7 show that random sampling is ineffective; (3) our MAP estimate naturally produces a denoised image that can be useful for other downstream tasks.
Next we provide theoretical guarantees that our approach can strictly improve the bound on classification accuracy in Theorem 1. For convenience we introduce an additional random variable $x_a$ representing the adversarial image.

Theorem 1.
Assume the base classifier operates better than chance and instances in the dataset are uniformly distributed over the $N$ categories. Let the prediction accuracy bounds be $\mathrm{Acc}(y \mid x_a) \le B_1$ and $\mathrm{Acc}(y \mid x_a, s) \le B_2$. If the conditional mutual information $I(y; s \mid x_a) > 0$, we have $B_2 > B_1$, which means our approach strictly improves the bound on classification accuracy.
Proof.
If $I(y; s \mid x_a) > 0$, then it is straightforward that $I(y; x_a, s) > I(y; x_a)$. We define $g(\cdot)$ as the function mapping mutual information to the accuracy bound. Using Fano's inequality [51] and the fact that $g$ is a monotonically increasing function when the error rate satisfies $P_e < 1 - 1/N$, i.e., accuracy higher than random guessing,¹ we derive the upper bounds of accuracy $B_1$ and $B_2$ to be:

$$B_1 = g\big(I(y; x_a)\big), \qquad B_2 = g\big(I(y; x_a, s)\big),$$

where the upper bound is a function of the mutual information. Since $H(y)$ is a constant, larger mutual information strictly increases the bound. The detailed proof is in the supplementary material. ∎

¹The validity of this fact is explained in the supplementary.
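For intuition, a loose form of Fano's inequality makes the dependence on mutual information explicit. Under the theorem's assumptions (labels uniform over $N$ categories, entropy in bits), the following is a hedged sketch of the bound's shape, not the paper's exact derivation. For an observation $V$ (either $x_a$ or $(x_a, s)$):

```latex
% Fano's inequality for predicting y from an observation V:
H(P_e) + P_e \log (N - 1) \;\ge\; H(y \mid V).
% Loosening with H(P_e) \le 1 and \log(N-1) \le \log N:
P_e \;\ge\; \frac{H(y \mid V) - 1}{\log N},
\qquad\text{so}\qquad
\mathrm{Acc} \;=\; 1 - P_e \;\le\; \frac{I(y; V) + 1}{\log N},
% using H(y \mid V) = H(y) - I(y; V) and H(y) = \log N for uniform labels.
```

The right-hand side plays the role of $g$: it is constant in everything except $I(y; V)$ and strictly increases with it.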
Intuitively, the adversarial attack corrupts some of the mutual information between the label $y$ and the natural structure $s$. Thus, there is additional mutual information between $y$ and $s$ given $x_a$, i.e., $I(y; s \mid x_a) > 0$. Theorem 1 shows that by restoring information from the correct $s$, the prediction accuracy bound can be improved.

Theoretically, by optimizing the self-supervision loss, the defense-aware attack in fact predicts the classification label given the right self-supervision label $s$, i.e., it estimates $p(y \mid x_a, s)$. According to our theory, the robust accuracy should then increase due to the restored information. Overall, our defense is robust even under a defense-aware adversary.
Table 1: Robust accuracy on CIFAR-10 under adversarial attacks. Each cell reports Standard / Ours inference.

| Model | Architecture | PGD 50 | PGD 200 | BIM 200 | C&W 200 |
|---|---|---|---|---|---|
| TRADES [68] | WRN-34-10 | 55.05% / 57.00% | 55.02% / 57.18% | 55.06% / 57.33% | 53.72% / 56.56% |
| RO [47] | PreRes-18 | 52.40% / 54.59% | 52.34% / 54.62% | 52.32% / 54.54% | 50.33% / 53.48% |
| BagT [42] | WRN-34-10 | 56.44% / 58.33% | 56.40% / 58.47% | 56.38% / 58.55% | 54.82% / 57.35% |
| MART [61] | WRN-28-10 | 62.72% / 64.40% | 62.63% / 64.26% | 62.54% / 64.28% | 58.96% / 62.18% |
| AWP [63] | WRN-28-10 | 63.67% / 64.21% | 63.64% / 64.07% | 63.64% / 63.69% | 60.82% / 61.90% |
| SemiSL [58] | WRN-28-10 | 62.30% / 64.64% | 62.22% / 64.44% | 62.18% / 64.68% | 60.90% / 63.83% |

Table 2: Robust accuracy on CIFAR-100. Each cell reports Standard / Ours inference (* indicates models we reproduced).

| Model | Architecture | PGD 50 | PGD 200 | BIM 200 | C&W 200 |
|---|---|---|---|---|---|
| RO [47] | ResNet-18 | 21.90% / 23.83% | 20.90% / 22.23% | 21.00% / 22.09% | 20.42% / 22.19% |
| TRADES* [68] | WRN-34-10 | 27.04% / 28.55% | 26.94% / 28.16% | 26.94% / 28.18% | 26.57% / 27.88% |
| BagT* [42] | WRN-34-10 | 29.87% / 31.21% | 28.81% / 31.15% | 29.81% / 31.44% | 29.72% / 31.40% |
Table 3: Robust accuracy on SVHN. Each cell reports Standard / Ours inference.

| Model | Architecture | PGD 50 | PGD 200 | BIM 200 | C&W 200 |
|---|---|---|---|---|---|
| RO [47] | PreRes-18 | 51.03% / 53.21% | 50.82% / 53.06% | 50.84% / 53.26% | 48.90% / 54.62% |
| SemiSL [58] | WRN-28-10 | 55.69% / 62.40% | 55.29% / 62.12% | 55.43% / 62.53% | 57.03% / 63.34% |
Table 4: Robust accuracy on ImageNet.

| Method | FGSM | PGD 10 | PGD 20 | BIM 20 | C&W 20 |
|---|---|---|---|---|---|
| FBF [62] | 32.47% | 28.68% | 28.52% | 28.49% | 27.81% |
| Ours + FBF | 33.51% | 31.36% | 31.32% | 31.27% | 30.88% |
Table 5: Robust accuracy on CIFAR-10 under $\ell_2$ norm-bounded attacks. Each cell reports Standard / Ours inference.

| Method | PGD 200 | C&W 200 |
|---|---|---|
| TRADES [68] | 36.90% / 39.16% | 35.89% / 38.07% |
| RO [47] | 40.00% / 41.98% | 38.31% / 40.17% |
| BagT [42] | 38.80% / 41.28% | 37.01% / 39.20% |
| SemiSL [58] | 42.07% / 44.47% | 39.97% / 42.72% |
| AWP [63] | 44.39% / 47.15% | 40.97% / 43.72% |
| MART [61] | 43.24% / 45.85% | 43.23% / 45.85% |
4 Experiments
Our experiments evaluate robustness at image classification on four datasets: CIFAR-10 [29], CIFAR-100 [30], SVHN [39], and ImageNet [12]. We compare with state-of-the-art defense methods under several strong adversarial attacks, including a defense-aware attack.
4.1 Baselines
We apply our method to seven established, scrutinized [1] defense methods, including the state-of-the-art adversarially robust model. All studied methods are trained with adversarial training [34], but achieve higher robust accuracy than the initial version of Madry et al. [34].

TRADES [68] is the winning solution of the NeurIPS 2018 Adversarial Vision Challenge. It introduces a KL-divergence term that regularizes the representation of adversarial examples to match that of clean examples.

Robust Overfit (RO) [47] re-examines existing adversarially robust models through the lens of overfitting, and is the state-of-the-art model trained with PreResNet-18.

Bag of Tricks (BagT) [42] conducts extensive experiments on the effect of hyperparameters on adversarial training [34]. It is the state-of-the-art adversarially robust model that uses no additional unlabeled data for training.

Semi-supervised Learning (SemiSL) [58] significantly improves adversarial robustness using unlabeled data. By training with pseudo labels for unlabeled images, the model achieved state-of-the-art robustness. However, this work neglects the information in natural images beyond the pseudo classification label.

MART [61] uses misclassification-aware adversarial training to achieve improved robustness. We use its best version, trained on top of SemiSL [58].

Adversarial Weight Perturbation (AWP) [63] trains a robust model by smoothing the loss landscape of the weights. We use its best version, trained on top of SemiSL [58].

Fast is Better than Free (FBF) [62] is the state-of-the-art solution for training a robust ImageNet classifier within a reasonable training budget.
4.2 Attack Methods
Fast Gradient Sign Method (FGSM) [18] is a one-step adversarial attack to fool neural networks.

Projected Gradient Descent (PGD) [34] is the standard evaluation for adversarial robustness. It optimizes the adversarial noise with gradient ascent for a fixed number of iterations, projecting the noise back to the nearest boundary whenever it leaves the given bound.

Basic Iterative Method (BIM) [31] is a variant of the PGD attack without the initial random start.

C&W Attack [6] is a powerful iterative attack that has been widely used for robustness evaluation. It reduces the logit value for the correct class while increasing that of the second-best class to fool the classifier.

Defense-Aware Attack is discussed in Section 3.3; it is theoretically the optimal adaptive white-box attack to bypass our defense algorithm.
4.3 Experimental Settings
Backbone Architectures. Following prior literature, we conduct experiments with PreResNet-18 [23], ResNet-50 [22], and WideResNet [67]. We download the pretrained models' weights online.²

Self-supervised Learning Branch. We use a network with two fully connected layers that takes the features from the penultimate layer of the backbone network.

Implementation Details. We train our self-supervision model with the Adam [28] optimizer, using temperature $\tau$ for the contrastive loss and a batch size of 128. For CIFAR-10 and CIFAR-100, we train the self-supervised branch for 200 epochs. For SVHN, we train it for 600 epochs. For ImageNet, we train for 30 epochs. We bound the reverse attack by $\epsilon_r$ and run a fixed number of optimization iterations. We implement our model with PyTorch [44]. Please see the supplementary for the learning rate, temperature, bound, and iteration settings.

²We reproduce a few models that are not available and denote them by *.

4.4 Results of Defense-Aware Adversarial Attacks
In Section 3.3, we discussed the strongest adaptive attack that can be used to bypass our defense. We show the results of this adaptive attacker on CIFAR-10 in Figure 4. We use attacks with 50 steps and perturbation bound $\epsilon$. We vary the value of $\lambda$ from 0 to 10, where $\lambda = 0$ corresponds to the standard PGD attack that ignores our defense strategy. The results show that as $\lambda$ increases and more of the attack budget is focused on the self-supervised defense, the gain of our approach is reduced (the solid line falls below the dotted line). However, as $\lambda$ gets larger, the attack on the classification task also gets weaker (the curves go up). While adaptive attacks successfully reduce the additional gain brought by our approach, they also significantly sacrifice the initial attack success rate on the classification task. Minimizing $\mathcal{L}_s$ hurts the original classification attack so much that it is not worth it for the attacker to account for our defense. We also show results with 500 steps in the supplementary, where our conclusion also holds.

This finding matches our theory in Section 3.4: leveraging the incidental structure in the images improves robustness. The decrease in attack success rate on the target classification task is also consistent with prior work [35], which suggests that it is harder to attack multiple tasks simultaneously. In effect, the attacker trades off classification attack success against fooling the self-supervised defense. For the attacker, the optimal attack is $\lambda = 0$, which is the standard adversarial attack that ignores our self-supervised defense. Therefore, we use this setup in the remaining experiments.
4.5 Results with Optimal Adversarial Attacks
In the optimal setup, $\lambda = 0$: the attack is equivalent to the standard adversarial attack that ignores the defense branch. Thus the gain from Standard to Ours is the lower bound of our self-supervised correction. We now show the gain on four datasets.

CIFAR-10 [29] contains 10 categories. In Table 1, we add our approach to six existing robust models, including the state of the art, where we consistently correct up to 2.9% of the adversarial examples, as shown in the gain from Standard to Ours.

CIFAR-100 [30] contains 100 categories. In Table 2, we show over 1.9% gain compared with the baselines.

SVHN [39] is a 10-category street-view house number dataset. In Table 3, we experiment on two methods that have pretrained models available, including the state-of-the-art semi-supervised learning [58]. Ours demonstrates over 6.8% gain compared with the original defense method.

ImageNet [12] contains 1000 categories. We use the pretrained model from Wong et al. [62], which is the state-of-the-art ResNet-50 robust model. In Table 4, we use 5 different attacks to assess the adversarial robustness of the original model and of our model with the natural supervision defense. Our approach achieves over 3% gain in robustness.
4.6 Analysis
Accuracy vs. Time Budget. As our method is iterative, we can adjust the number of iterations according to the time budget. We use the number of iterations as an indicator of time and plot accuracy vs. time budget in Figure 5, where we can see that even a few updates significantly improve robustness.

Robustness Curve. We adopt the robustness-curve evaluation of adversarial robust accuracy vs. the perturbation budget [15]. We show the trend in Figure 6. We apply our inference algorithm as an additional defense to existing robust models [47, 62], where our approach achieves up to 14% robustness gain compared with the standard inference method, especially as the attack gets stronger with a larger perturbation bound.

Trade-off between clean accuracy and adversarial robustness. It has been shown that there exists a natural trade-off between clean accuracy and adversarial robustness for a given classifier [57, 68]. In Figure 7, we compare our naturally supervised reverse defense with a random reverse defense (baseline). We increase the additive noise level that is applied to reverse the adversarial examples as well as the clean examples (we apply the same algorithm to clean examples because at inference we cannot distinguish adversarial inputs from clean ones). Clean accuracy often drops as the noise level goes up. While the trade-off between clean accuracy and robust accuracy persists [57, 68], our approach achieves a better balance between them.

Robustness to $\ell_2$ norm-bounded adversarial attacks. We measure whether our defense also generalizes beyond the $\ell_\infty$-bounded attack. Table 5 shows results under $\ell_2$ norm-bounded attacks on CIFAR-10, where our approach consistently improves robustness by over 2.6%.
Feature visualization. Figure 8 visualizes the trajectory of an image's penultimate-layer features as it transitions through an attack and the reversal. We use PCA to project the features onto a plane. The plot demonstrates that the attack shifts the feature embedding from the right class to the wrong class. The reverse attack then often returns the features to the right class.

To quantify this effect, we take the Euclidean distance between the clean embedding and the attacked embedding, denoted $d_a$, as well as the Euclidean distance between the clean embedding and the reverse-attacked embedding, denoted $d_r$, for the triplets that share the same clean class and reverse-attacked class. For all but one combination of categories, $d_r < d_a$. Additionally, across all triplets, we measured how much the average distance from clean to reverse-attacked embeddings is reduced relative to the average distance from clean to attacked embeddings, obtaining a decrease of roughly 17%. Together, these results demonstrate that, on average, our reverse attack returns the attacked embedding closer to the original embedding.
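The roughly 17% figure corresponds to a relative reduction of the mean embedding distance, which can be computed as follows (a hypothetical helper over stacked embedding rows, not the authors' evaluation code):

```python
import numpy as np

def mean_distance_reduction(clean, attacked, reversed_feats):
    """Relative reduction of the mean Euclidean distance to the clean
    embeddings after the reverse attack:
    1 - mean d(clean, reversed) / mean d(clean, attacked).
    Each argument is an (n_examples, n_dims) array of embeddings."""
    d_adv = np.linalg.norm(clean - attacked, axis=1).mean()
    d_rev = np.linalg.norm(clean - reversed_feats, axis=1).mean()
    return 1.0 - d_rev / d_adv
```

A positive value means the reverse attack moved the embeddings back toward their clean positions on average.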
5 Conclusions
We introduced an approach that uses natural supervision to reverse adversarial attacks on images. Our results demonstrate improved robustness across several benchmarks and several state-of-the-art attacks. Our findings suggest that integrating defense mechanisms into the inference algorithm is a promising direction for improving adversarial robustness.
Acknowledgements: This research is based on work partially supported by NSF CRII #1850069, NSF grant CNS-1564055; ONR grants N00014-16-1-2263 and N00014-17-1-2788; a JP Morgan Faculty Research Award; and a DiDi Faculty Research Award. MC is supported by a CAIT Amazon PhD fellowship. We thank NVIDIA for GPU donations. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the sponsors.
References
 [1] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Proceedings of the 35th International Conference on Machine Learning, volume 80, pages 274–283. PMLR, 2018.
 [2] Mitali Bafna, Jack Murtagh, and Nikhil Vyas. Thwarting adversarial examples: An l_0-robust sparse Fourier transform. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
 [3] Sagie Benaim, Ariel Ephrat, Oran Lang, Inbar Mosseri, William T. Freeman, Michael Rubinstein, Michal Irani, and Tali Dekel. SpeedNet: Learning the speediness in videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
 [4] Jacob Buckman, Aurko Roy, Colin Raffel, and Ian J. Goodfellow. Thermometer encoding: One hot way to resist adversarial examples. In 6th International Conference on Learning Representations, 2018.
 [5] Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian J. Goodfellow, Aleksander Madry, and Alexey Kurakin. On evaluating adversarial robustness. CoRR, abs/1902.06705, 2019.
 [6] Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy, pages 39–57, 2017.
 [7] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. 2020.
 [8] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments, 2021.
 [9] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations, 2020.
 [10] Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020.
 [11] Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In ICML, 2020.
 [12] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei. ImageNet: A LargeScale Hierarchical Image Database. In CVPR09, 2009.
 [13] Virginia R. DeSa. Learning classification with unlabeled data. In Proceedings of the 6th International Conference on Neural Information Processing Systems, NIPS’93, page 112–119, San Francisco, CA, USA, 1993. Morgan Kaufmann Publishers Inc.
 [14] Guneet S. Dhillon, Kamyar Azizzadenesheli, Zachary C. Lipton, Jeremy Bernstein, Jean Kossaifi, Aran Khanna, and Animashree Anandkumar. Stochastic activation pruning for robust adversarial defense. In 6th International Conference on Learning Representations, 2018.
 [15] Yinpeng Dong, QiAn Fu, Xiao Yang, Tianyu Pang, Hang Su, Zihao Xiao, and Jun Zhu. Benchmarking adversarial robustness on image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 321–331, 2020.
 [16] Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. Boosting adversarial attacks with momentum. In CVPR, pages 9185–9193, 2018.
 [17] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations, 2018.
 [18] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv:1412.6572, 2014.
 [19] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap your own latent: A new approach to self-supervised learning, 2020.
 [20] Chuan Guo, Mayank Rana, Moustapha Cissé, and Laurens van der Maaten. Countering adversarial images using input transformations. CoRR, abs/1711.00117, 2017.
 [21] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019.
 [22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. arXiv 1512.03385, 2015.
 [23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks, 2016.
 [24] Dan Hendrycks, Kimin Lee, and Mantas Mazeika. Using pre-training can improve model robustness and uncertainty. Proceedings of the International Conference on Machine Learning, 2019.
 [25] Dan Hendrycks, Mantas Mazeika, Saurav Kadavath, and Dawn Song. Using self-supervised learning can improve model robustness and uncertainty. Advances in Neural Information Processing Systems (NeurIPS), 2019.
 [26] Allan Jabri, Andrew Owens, and Alexei A. Efros. Spacetime correspondence as a contrastive random walk. In Advances in Neural Information Processing Systems, 2020.
 [27] William Karush. Minima of functions of several variables with inequalities as side constraints. M. Sc. Dissertation. Dept. of Mathematics, Univ. of Chicago, 1939.
 [28] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
 [29] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research).
 [30] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-100 (Canadian Institute for Advanced Research).
 [31] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2017.
 [32] Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. PointPillars: Fast encoders for object detection from point clouds. CoRR, abs/1812.05784, 2018.
 [33] Yingzhen Li, John Bradshaw, and Yash Sharma. Are generative classifiers more robust to adversarial attacks? In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3804–3814. PMLR, 09–15 Jun 2019.
 [34] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018.
 [35] Chengzhi Mao, Amogh Gupta, Vikram Nitin, Baishakhi Ray, Shuran Song, Junfeng Yang, and Carl Vondrick. Multi-task learning strengthens adversarial robustness. In Andrea Vedaldi, Horst Bischof, Thomas Brox, and Jan-Michael Frahm, editors, Computer Vision – ECCV 2020, pages 158–174, Cham, 2020. Springer International Publishing.
 [36] Chengzhi Mao, Ziyuan Zhong, Junfeng Yang, Carl Vondrick, and Baishakhi Ray. Metric learning for adversarial robustness. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
 [37] Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations, 2019.
 [38] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: a simple and accurate method to fool deep neural networks, 2016.
 [39] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.
 [40] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision – ECCV 2016, pages 69–84, Cham, 2016. Springer International Publishing.
 [41] Tianyu Pang, Kun Xu, Chao Du, Ning Chen, and Jun Zhu. Improving adversarial robustness via promoting ensemble diversity. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4970–4979. PMLR, 09–15 Jun 2019.
 [42] Tianyu Pang, Xiao Yang, Yinpeng Dong, Hang Su, and Jun Zhu. Bag of tricks for adversarial training, 2020.
 [43] Nicolas Papernot, Patrick D. McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. arXiv:1511.07528, 2015.
 [44] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, highperformance deep learning library. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
 [45] Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, and Alexei A. Efros. Context encoders: Feature learning by inpainting. CoRR, abs/1604.07379, 2016.
 [46] Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. DeepXplore. Proceedings of the 26th Symposium on Operating Systems Principles, Oct 2017.
 [47] Leslie Rice, Eric Wong, and J. Zico Kolter. Overfitting in adversarially robust deep learning, 2020.
 [48] Kevin Roth, Yannic Kilcher, and Thomas Hofmann. The odds are odd: A statistical test for detecting adversarial examples. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5498–5507. PMLR, 09–15 Jun 2019.
 [49] A. M. Rubinov, X. Q. Yang, and Y. Y. Zhou. A Lagrange penalty reformulation method for constrained optimization. Optimization Letters, 2007.
 [50] Pouya Samangouei, Maya Kabkab, and Rama Chellappa. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. CoRR, abs/1805.06605, 2018.
 [51] Jonathan Scarlett and Volkan Cevher. An introductory guide to Fano's inequality with applications in statistical estimation, 2019.
 [52] Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. CoRR, abs/1503.03832, 2015.
 [53] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception for autonomous driving: Waymo open dataset. CoRR, abs/1912.04838, 2019.
 [54] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In International Conference on Machine Learning, pages 9229–9248. PMLR, 2020.
 [55] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv:1312.6199, 2013.
 [56] Florian Tramer, Nicholas Carlini, Wieland Brendel, and Aleksander Madry. On adaptive attacks to adversarial example defenses. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1633–1645. Curran Associates, Inc., 2020.
 [57] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy, 2019.
 [58] Jonathan Uesato, Jean-Baptiste Alayrac, Po-Sen Huang, Robert Stanforth, Alhussein Fawzi, and Pushmeet Kohli. Are labels required for improving adversarial robustness? CoRR, 2019.
 [59] Gunjan Verma and Ananthram Swami. Error correcting output codes improve probability estimation and adversarial robustness of deep neural networks. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
 [60] Carl Vondrick, Abhinav Shrivastava, Alireza Fathi, Sergio Guadarrama, and Kevin Murphy. Tracking emerges by colorizing videos, 2018.
 [61] Yisen Wang, Difan Zou, Jinfeng Yi, James Bailey, Xingjun Ma, and Quanquan Gu. Improving adversarial robustness requires revisiting misclassified examples. In ICLR, 2020.
 [62] Eric Wong, Leslie Rice, and J. Zico Kolter. Fast is better than free: Revisiting adversarial training, 2020.
 [63] Dongxian Wu, ShuTao Xia, and Yisen Wang. Adversarial weight perturbation helps robust generalization. In NeurIPS, 2020.
 [64] Chang Xiao, Peilin Zhong, and Changxi Zheng. Enhancing adversarial defense by k-winners-take-all, 2019.
 [65] Zhaoxia Yin, Hua Wang, Li Chen, Jie Wang, and Weiming Zhang. Reversible adversarial example based on reversible image transformation, 2021.
 [66] Tao Yu, Shengyuan Hu, Chuan Guo, Weilun Chao, and Kilian Weinberger. A new defense against adversarial images: Turning a weakness into a strength. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Oct. 2019.
 [67] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. CoRR, abs/1605.07146, 2016.
 [68] Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P. Xing, Laurent El Ghaoui, and Michael I. Jordan. Theoretically principled tradeoff between robustness and accuracy. arXiv abs/1901.08573, 2019.
 [69] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful image colorization, 2016.
6 Theoretical Results
Appendix A Detailed Proof of Lemma 1
Lemma 1.
Let $x_a$ denote the attacked image, $y$ the category label, and $z$ the self-supervised label. The standard classifier under adversarial attack is equivalent to predicting with $P(y \mid x_a)$, and our approach is equivalent to predicting with $P(y \mid x_a, z)$.
First we show that
$P(y \mid x_a) = P(y \mid x_a, \hat{z}(x_a))$.
It is easy to see that
$P(y \mid x_a) = \sum_{z'} P(y \mid x_a, z')\, P(z' \mid x_a) = P(y \mid x_a, \hat{z}(x_a))$,
where the last equality is due to the neural network's deterministic nature, i.e.,
$P(z' \mid x_a) = 1$ if $z' = \hat{z}(x_a)$,
where $\hat{z}(x_a)$ is the latent self-supervised label prediction; the probability is 0 otherwise.
Intuitively, this demonstrates that the attack is equivalent to using the classifier $P(y \mid x_a, \hat{z}(x_a))$ to predict the label.
Next we show that our algorithm is equivalent to using the classifier $P(y \mid x_a, z)$, where $z$ is the true self-supervised label.
Our algorithm finds a new input image $x_r$ that satisfies
$x_r = \arg\max_{x'} P(x' \mid x_a, z)$.
Note that
$P(y \mid x_a, z) = \sum_{x'} P(y \mid x')\, P(x' \mid x_a, z) \approx P(y \mid x_r)$.
The last formulation is our algorithm's inference procedure: we first estimate $x_r$ from the adversarial image $x_a$ and the self-supervised label $z$, and then predict the label using the new image $x_r$. We have now proved that our algorithm is equivalent to using $P(y \mid x_a, z)$.
Here, we use the maximum a posteriori (MAP) estimate $x_r$ to approximate the marginalization over $x'$ because: first, sampling a large number of $x'$ is computationally expensive; second, our experiments show that sampling is ineffective; lastly, the MAP estimate also produces a denoised image that can be useful for other downstream tasks.
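As a rough numerical sketch of this test-time MAP estimation (our own toy illustration, not the paper's implementation), the reverse perturbation can be found by gradient descent on a self-supervised loss; here `toy_ssl_loss` is a quadratic stand-in for the real contrastive objective, and the `anchor` statistic and all hyperparameters are hypothetical.

```python
import numpy as np

def toy_ssl_loss(x, anchor):
    """Toy stand-in for a self-supervised loss: distance to an
    intrinsic 'anchor' statistic that the attack collaterally damaged."""
    return 0.5 * np.sum((x - anchor) ** 2)

def reverse_attack(x_adv, anchor, epsilon=0.1, lr=0.05, steps=100):
    """MAP estimate of the repaired image: descend the self-supervised
    loss w.r.t. an additive reverse perturbation delta_r, kept inside
    an L-infinity ball of radius epsilon."""
    delta_r = np.zeros_like(x_adv)
    for _ in range(steps):
        grad = (x_adv + delta_r) - anchor              # gradient of the toy loss
        delta_r -= lr * grad
        delta_r = np.clip(delta_r, -epsilon, epsilon)  # stay in the budget
    return x_adv + delta_r

# Usage: the repaired image has a lower self-supervised loss than the input.
x_adv = np.array([1.0, -1.0, 0.5])
anchor = np.array([0.9, -0.95, 0.45])
x_r = reverse_attack(x_adv, anchor)
assert toy_ssl_loss(x_r, anchor) < toy_ssl_loss(x_adv, anchor)
```

In the real defense the quadratic surrogate would be replaced by the contrastive objective, with the gradient obtained by backpropagation through the self-supervised branch.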
Appendix B Detailed Proof of Theorem 1
Theorem 1.
Assume the base classifier operates better than chance and instances in the dataset are uniformly distributed over the $K$ categories. Let the prediction accuracy upper bounds be $a_u$ for the standard classifier $P(y \mid x_a)$ and $a_u'$ for our classifier $P(y \mid x_a, z)$, and let $a_l$ and $a_l'$ be the corresponding lower bounds. If the conditional mutual information $I(y; z \mid x_a) > 0$, we have $a_u' > a_u$ and $a_l' \ge a_l$, which means our approach strictly improves the bounds for classification accuracy.
Proof.
If $I(y; z \mid x_a) > 0$, we have:
$H(y \mid x_a, z) = H(y \mid x_a) - I(y; z \mid x_a) < H(y \mid x_a)$.
Let the predicted label be $\hat{y}$, assume there are $K$ categories, and let the prediction accuracy be $a$. We define the probability of error $P_e = 1 - a$. Using Fano's inequality [51], we have
(12) $H(y \mid \hat{y}) \le H(P_e) + P_e \log(K - 1)$,
(13) $H(y \mid x_a) \le H(y \mid \hat{y})$.
Inequality (13) holds because $\hat{y}$ is a deterministic function of $x_a$, so conditioning on the full input $x_a$ cannot leave more uncertainty about $y$ than conditioning on $\hat{y}$. Chaining the two inequalities gives
(14) $H(y \mid x_a) \le H(P_e) + P_e \log(K - 1)$.
Then we get
(15) $g(P_e) \ge H(y \mid x_a)$,
where we define the new function $g(P_e) = H(P_e) + P_e \log(K - 1)$. Given that in a classification task the number of categories $K \ge 2$, we know $g(0) = 0$. Given that the binary entropy function first increases and then decreases, the function $g$ should also first increase, peak at some point, and then decrease.
We locate the peak by setting the first-order derivative $g'(P_e) = \log\frac{1 - P_e}{P_e} + \log(K - 1)$ to zero. By solving this, we have:
(16) $P_e = \frac{K - 1}{K}$,
which shows that the function $g$ is monotonically increasing when $P_e < \frac{K - 1}{K}$.
Given that the base classifier already achieves accuracy better than random guessing, the given classifier satisfies $P_e < \frac{K - 1}{K}$. Now, the function $g$ is monotonically increasing in our studied region, and therefore has an inverse function $g^{-1}$ there, which is also monotonically increasing.
By rewriting equation (15) in terms of the accuracy $a = 1 - P_e$, we then have
(17) $g(1 - a) \ge H(y \mid x_a)$.
We apply the inverse function to both sides:
(18) $1 - a \ge g^{-1}(H(y \mid x_a))$,
(19) $a \le 1 - g^{-1}(H(y \mid x_a))$.
Note that $a$ is our defined accuracy, thus the upper bound is $a_u = 1 - g^{-1}(H(y \mid x_a))$.
The above derivation also applies to $H(y \mid x_a, z)$, thus $a_u' = 1 - g^{-1}(H(y \mid x_a, z))$.
Since $H(y \mid x_a, z) < H(y \mid x_a)$ and $g^{-1}$ is monotonically increasing, we have $g^{-1}(H(y \mid x_a, z)) < g^{-1}(H(y \mid x_a))$, thus $a_u' > a_u$: using our approach, the upper bound for robust accuracy is improved.
To prove the lower bound $a_l' \ge a_l$, we divide the inputs according to the value of $z$: for any fixed $z$, the classifier $P(y \mid x_a, z)$ can always fall back to ignoring $z$, so given the additional information from $z$ the accuracy will not get worse, and the new lower bound should not be smaller than $a_l$.
∎
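The monotonicity argument can be checked numerically. The sketch below is our own verification (not part of the paper): it evaluates the Fano bound function $g$ in natural log and inverts it by bisection, confirming that reducing the conditional entropy by a positive mutual information raises the accuracy upper bound. The entropy values are illustrative.

```python
import numpy as np

def g(pe, K):
    """Fano bound function g(P_e) = H(P_e) + P_e * log(K - 1), natural log."""
    if pe <= 0.0 or pe >= 1.0:
        h = 0.0
    else:
        h = -pe * np.log(pe) - (1 - pe) * np.log(1 - pe)
    return h + pe * np.log(K - 1)

K = 10
# g is monotonically increasing on [0, (K-1)/K), as derived above.
grid = np.linspace(0.01, (K - 1) / K - 0.01, 50)
vals = [g(p, K) for p in grid]
assert all(b > a for a, b in zip(vals, vals[1:]))

def error_lower_bound(h_cond, K, tol=1e-9):
    """Invert g by bisection on [0, (K-1)/K]: the smallest error
    probability P_e consistent with a conditional entropy h_cond."""
    lo, hi = 0.0, (K - 1) / K
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid, K) < h_cond:
            lo = mid
        else:
            hi = mid
    return lo

# Positive conditional mutual information lowers the conditional entropy,
# which lowers the Fano error bound and raises the accuracy upper bound.
H_plain, I_extra = 1.5, 0.3      # illustrative entropy values (nats)
ub_plain = 1.0 - error_lower_bound(H_plain, K)
ub_ours = 1.0 - error_lower_bound(H_plain - I_extra, K)
assert ub_ours > ub_plain
```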
Appendix C Defense-Aware Attack
We derive the defense-aware attack from our main paper in detail, making the latent self-supervised label $z$ explicit in our notation. Let $F$ be the classifier, $L_{ce}$ the classification loss, and $L_{ss}$ the self-supervised loss.
The straightforward adaptive attack is to optimize the attack adversarially against the defense:
(20) $\delta_a = \arg\max_{\delta} L_{ce}(F(x + \delta + \delta_r), y)$,
(21) $\delta_r = \arg\min_{\delta_r} L_{ss}(x + \delta + \delta_r, z)$.
From the attacker's perspective, the above optimization is not ideal: it involves iterative optimization in two opposing directions, so the estimated gradient may not be stable and may even exhibit gradient obfuscation [1]. Following standard constrained-optimization attack practice [55, 6], the attacker reformulates the above as a constrained optimization problem:
(22) $\max_{\delta} L_{ce}(F(x + \delta), y)$
(23) subject to $L_{ss}(x + \delta, z) \le c$,
where $c$ is the converged value of the self-supervised loss on natural images. Intuitively, the attacker should maximize the adversarial gain while respecting the self-supervised loss if they want to render our defense ineffective.
The above is equivalent to:
(24) $\min_{\delta} -L_{ce}(F(x + \delta), y)$
(25) subject to $L_{ss}(x + \delta, z) - c \le 0$.
We can use the Lagrangian penalty method [49] to derive the following:
(26) $\min_{\delta} \max_{\lambda \ge 0} -L_{ce}(F(x + \delta), y) + \lambda \big(L_{ss}(x + \delta, z) - c\big)$.
Thus the optimal value for the attack is:
(27) $p^* = \min_{\delta} \max_{\lambda \ge 0} -L_{ce}(F(x + \delta), y) + \lambda \big(L_{ss}(x + \delta, z) - c\big)$,
which is the primal.
Using the weak duality theorem, we have the following bound on the optimal solution of the above optimization problem [27]:
(28) $\max_{\lambda \ge 0} \min_{\delta} -L_{ce}(F(x + \delta), y) + \lambda \big(L_{ss}(x + \delta, z) - c\big) \le p^*$.
Removing the negative sign, we obtain an upper bound on the original constrained maximum:
(29) $\min_{\lambda \ge 0} \max_{\delta} L_{ce}(F(x + \delta), y) - \lambda \big(L_{ss}(x + \delta, z) - c\big) \ge -p^*$,
which is equivalent to first maximizing the following under different $\lambda$:
(30) $\delta_\lambda = \arg\max_{\delta} L_{ce}(F(x + \delta), y) - \lambda\, L_{ss}(x + \delta, z)$,
and then selecting the $\lambda$ that yields the most damage, i.e., the lowest robust accuracy, and using the corresponding generated attack $\delta_\lambda$.
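A minimal sketch of this λ-sweep follows; the gradient callbacks stand in for the true classification and self-supervised loss gradients, and all names and values are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def attack_lambda_sweep(x, ce_grad_fn, ssl_grad_fn, lambdas,
                        epsilon=0.3, lr=0.1, steps=50):
    """For each penalty coefficient lam, run sign-gradient ascent on
    L_ce - lam * L_ss inside an L-infinity ball of radius epsilon,
    and return one candidate adversarial example per lam."""
    candidates = {}
    for lam in lambdas:
        delta = np.zeros_like(x)
        for _ in range(steps):
            g = ce_grad_fn(x + delta) - lam * ssl_grad_fn(x + delta)
            delta = np.clip(delta + lr * np.sign(g), -epsilon, epsilon)
        candidates[lam] = x + delta
    return candidates

# Toy gradients standing in for real losses: the classification loss
# pushes every coordinate up; the self-supervised loss is quadratic,
# so its gradient pulls the input back toward the natural statistics.
x = np.zeros(4)
ce_grad = lambda v: np.ones_like(v)
ssl_grad = lambda v: 2.0 * v
cands = attack_lambda_sweep(x, ce_grad, ssl_grad, [0.0, 1.0, 10.0])
# A larger lam keeps the candidate closer to the natural input, i.e.,
# respects the self-supervised constraint at the cost of attack strength.
assert np.abs(cands[10.0]).max() < np.abs(cands[0.0]).max()
```

In practice, the attacker would evaluate each candidate against the full defended pipeline and keep the λ with the lowest robust accuracy, as described above.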
7 Experimental Results
Appendix A Defense-Aware Adversarial Attack
We show the numerical results for the defense-aware attack in Table 6. In addition to the 50 attack steps used in our main paper, we also show results using 500 steps of the adaptive attack, applied to RO (the robust-overfitting method of Rice et al. [47]). The results clearly show that using more steps does not change the conclusion: 500 attack steps achieve almost the same robust accuracy as the 50-step baseline, which suggests the attack has nearly converged, and our approach still improves robust accuracy by over 2%. Lastly, the attacker needs a large penalty coefficient (λ ≥ 4 in Table 6) to bypass our defense through the reverse attack, but at that point the attack on the classification task becomes weaker by over 7%, which itself helps our defense. Overall, even under more attack steps, our defense remains effective.
Table 6: Defense-aware adversarial attack. Robust accuracy (%) under different penalty coefficients λ.

Method                               λ=0      λ=0.5    λ=1      λ=2      λ=4      λ=6      λ=8      λ=10
RO [47], 50 steps, Baseline          52.40%   53.81%   55.41%   57.81%   60.80%   62.62%   63.81%   64.58%
RO [47], 50 steps, with Ours         54.59%   55.61%   56.75%   58.67%   60.81%   62.07%   63.16%   63.68%
RO [47], 500 steps, Baseline         52.23%   53.47%   54.89%   57.07%   59.86%   61.89%   63.09%   63.90%
RO [47], 500 steps, with Ours        54.70%   55.61%   56.68%   58.17%   60.51%   61.26%   62.40%   63.06%
Semi-SL [58], 50 steps, Baseline     62.30%   63.87%   65.38%   67.87%   70.60%   72.32%   73.52%   74.42%
Semi-SL [58], 50 steps, with Ours    64.64%   65.62%   66.72%   68.58%   70.55%   71.43%   72.75%   73.19%
Appendix B Defending a Standalone Non-robust Network
In our main paper, we applied our approach to a list of existing state-of-the-art models; here we show results from applying it to an undefended neural network.
We train a PreRes18 model on clean images only, without any adversarial training, which yields 0% robust accuracy under adversarial attack. We then use our defense to reverse the attack via natural supervision, achieving an improvement in robust accuracy of 34.4%.
Appendix C Feature Input for Self-supervised Models
We investigate which layer's features should be the input to the self-supervised model. We conduct an ablation study that reads out the latent features from low to high layers. Results in Figure 9 show that reading out features from the top layer for self-supervision achieves the largest robustness gain.
Appendix D Implementation Details
We run our experiments on 8 RTX 2080 Ti GPUs. For CIFAR-10, CIFAR-100, and SVHN, the input dimension is 32×32. We maximize GPU memory usage to speed up inference: for the PreRes18 model we use a batch size of 1024 during inference, and for the Wide Residual Network model a batch size of 512. For the ImageNet dataset we use a ResNet-50 model; due to the larger input dimension and model capacity, we use a batch size of 87 to maximize GPU usage. For all contrastive learning, we sample 4 positive views for each given image instance.
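As a rough illustration of sampling 4 positive views per instance, the snippet below computes an InfoNCE-style contrastive loss over the views. The augmentation (Gaussian jitter), temperature, and dimensions are placeholder assumptions, not the paper's exact setup.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def multi_view_contrastive_loss(views, negatives, tau=0.5):
    """InfoNCE-style loss averaged over the positive views of one
    instance: each view should score its sibling views above the
    negatives from other instances."""
    views, negatives = normalize(views), normalize(negatives)
    losses = []
    for i in range(len(views)):
        pos = np.delete(views, i, axis=0) @ views[i] / tau   # sibling views
        neg = negatives @ views[i] / tau                     # other instances
        log_denom = np.log(np.exp(np.concatenate([pos, neg])).sum())
        losses.append(log_denom - pos.mean())                # cross-entropy
    return float(np.mean(losses))

rng = np.random.default_rng(0)
base = rng.normal(size=8)
views = base + 0.05 * rng.normal(size=(4, 8))   # 4 jittered positive views
negatives = rng.normal(size=(16, 8))            # unrelated instances
aligned = multi_view_contrastive_loss(views, negatives)
random_views = rng.normal(size=(4, 8))          # views that share nothing
assert aligned < multi_view_contrastive_loss(random_views, negatives)
```

Consistent positive views yield a lower loss than unrelated ones, which is the signal the defense descends at test time.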
8 Visualization
Appendix A Attack Vector Visualization
We visualize more examples of the adversarial attack vector and the reverse attack vector in Figure 10. Our reverse attack vectors are highly structured, steering the mispredicted examples back to the correct predictions.
Appendix B Feature Visualization
We show more visualizations of the feature trajectories of our approach in Figures 11–15. We project the features onto a plane with PCA under the same setup as Figure 8 in the main paper. Our approach pushes the misclassified examples (red) back toward the original features (green), improving adversarial robustness.