Self-supervised Adversarial Training

11/15/2019 ∙ by Kejiang Chen, et al. ∙ USTC

Recent work has demonstrated that neural networks are vulnerable to adversarial examples. To escape this predicament, many works try to harden the model in various ways, among which adversarial training is an effective one: it learns robust feature representations so as to resist adversarial attacks. Meanwhile, self-supervised learning aims to learn robust and semantic embeddings from the data itself. With these views, we introduce self-supervised learning to defend against adversarial examples in this paper. Specifically, the self-supervised representation coupled with k-Nearest Neighbour is proposed for classification. To further strengthen the defense ability, self-supervised adversarial training is proposed, which maximizes the mutual information between the representations of original examples and the corresponding adversarial examples. Experimental results show that the self-supervised representation outperforms its supervised version with respect to robustness, and that self-supervised adversarial training can further improve the defense ability efficiently.




1 Introduction

Deep learning has made significant progress in computer vision, natural language processing, and other fields. Various techniques based on deep learning have been applied in practical engineering, such as autonomous vehicles [3] and disease diagnosis [16]. These applications are life-critical, raising great concerns in the field of safety and security. However, many recent studies have shown that classifiers based on neural networks are not robust when encountering attacks, especially adversarial examples.

Szegedy et al. first proposed the concept of the adversarial example [23]: a subtle perturbation is added to the input of the neural network so that it produces a wrong output with high confidence. Since then, many methods for generating adversarial examples have been developed, including gradient-based [5, 17, 4, 15] and optimization-based [23, 18, 2] methods. These methods show the fragility of deep learning models.

On the opposite side, many defenses against adversarial examples have been proposed along two directions: model hardening [5, 20, 14, 24, 28] and input preprocessing [29, 22, 1, 11, 12]. As for model hardening, adversarial training has been proven to be an effective defense method. One convincing explanation is that adversarial training forces the neural network to learn robust features [10], which are rarely affected by adversarial examples. Inspired by this view, we look for neural networks that naturally learn the robust features of images. Fortunately, self-supervised learning pursues a similar goal and has developed quickly in recent years. Self-supervised learning aims to learn robust and semantic embeddings from the data itself and formulates predictive tasks to train a model, which can be seen as learning a robust representation.

Figure 1: The diagram of self-supervised adversarial training.

Generally, given the self-supervised feature, classification can be done with linear regression (LR) or k-Nearest Neighbors (kNN). In this paper, we choose the self-supervised feature coupled with kNN as the final classifier. The reason can be observed intuitively from the right part of Fig. 1: even when the modified sample has crossed the decision boundary of LR, it is still correctly classified by kNN, meaning that kNN is more robust than LR.

To further enhance the robustness, self-supervised adversarial training (SAT) is proposed. The objective of SAT is to maximize the mutual information (MI) between the representations of clean images and their corresponding adversarial examples, so that the learned features mitigate the effect of adversarial perturbations. The method can be divided into two parts: generating adversarial examples and maximizing the MI. The adversarial examples are generated using a gradient-based method because of its high efficiency. Subsequently, the MI between the feature representations of clean and adversarial examples is maximized. In the implementation, a noise contrastive estimator is used to estimate the MI, and the model is updated by minimizing the negative of the estimated MI.

Our experimental results demonstrate that the state-of-the-art self-supervised feature representation coupled with kNN is more robust than the supervised feature representation, by a clear margin, against adversarial examples produced by both gradient-based and optimization-based methods on CIFAR-10 and STL-10. Besides, the robustness of self-supervised models can be largely and efficiently improved with SAT. Implementation-related file will be available at

2 Related Works

2.1 Adversarial Examples

Adversarial examples are designed by an adversary to make a machine learning system produce erroneous outputs. Most adversarial examples for deep neural networks are generated by adding a small perturbation to clean samples. For kNN classification methods, the attack operates by adding a perturbation $\delta$ to the input $x$ such that its representation, $f(x+\delta)$, moves closer to the representations of $x_g^{(1)}, \dots, x_g^{(m)}$, a nearest group of training instances from a different class ($y_g \neq y$ for $x$ with label $y$). Intuitively, adversarial examples can be generated by solving the optimization problem [21]:

$$\hat{\delta} = \arg\min_{\delta} \sum_{i=1}^{m} \left\| f\big(x_g^{(i)}\big) - f(x+\delta) \right\|_2^2 \quad \text{such that} \quad \|\delta\|_p \le \epsilon \tag{1}$$

The optimization can be formulated as a Lagrangian, and we can binary search the Lagrangian constant that yields the minimal perturbation. For example, the optimization can be solved with the Adam optimizer.
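The binary search over the Lagrangian constant can be illustrated with a minimal sketch. It assumes a hypothetical callable `attack_succeeds(c)` that runs the inner optimization for a given constant `c` and reports whether the resulting example is misclassified, and it assumes that success is monotone in `c`. Neither the function name nor the search bounds come from the paper.

```python
def smallest_successful_constant(attack_succeeds, low=0.0, high=1.0, steps=20):
    """Binary-search the smallest constant c for which the attack succeeds.

    attack_succeeds: hypothetical callable c -> bool, assumed monotone
    (if it succeeds for c, it succeeds for every larger c).
    Returns None if no successful constant is found after growing the bound.
    """
    # Grow the upper bound by factors of 10 until the attack succeeds.
    growths = 0
    while not attack_succeeds(high):
        low, high = high, high * 10.0
        growths += 1
        if growths > 10:
            return None
    # Standard bisection: keep `high` successful and `low` unsuccessful.
    for _ in range(steps):
        mid = (low + high) / 2.0
        if attack_succeeds(mid):
            high = mid
        else:
            low = mid
    return high


# Toy stand-in for a real attack: succeeds once c reaches 0.37.
c_star = smallest_successful_constant(lambda c: c >= 0.37)
```

In a real attack each call to `attack_succeeds` would itself run an optimizer such as Adam, which is why this procedure is slow compared with a fixed number of gradient steps.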

2.2 Defense

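Many defenses against adversarial examples have been proposed along two directions: model hardening and input preprocessing. For model hardening, adversarial training shows satisfying performance against adversarial examples. The standard adversarial training (AT) in Madry’s work [17] can be formulated as:

$$\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \Big[ \max_{\delta \in \mathcal{S}} L(\theta, x+\delta, y) \Big] \tag{2}$$

where $\mathcal{D}$ is the underlying distribution of the training data, $L(\theta, x, y)$ is the loss function at data point $x$ with the true label $y$ for the neural network with parameters $\theta$, and $\delta$ is the perturbation, restricted to the allowed set $\mathcal{S}$ and introduced by PGD [17]. Because the accuracy on clean examples drops quickly under AT, there is an alternative version [26], mix-minibatch adversarial training (MAT):

$$\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \Big[ L(\theta, x, y) + \max_{\delta \in \mathcal{S}} L(\theta, x+\delta, y) \Big] \tag{3}$$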


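which helps to pursue the trade-off between accuracy on clean examples and robustness on adversarial examples. Adversarial logit pairing (ALP) [14] matches the logits from a clean example $x$ and its corresponding adversarial example $\tilde{x}$ during training, which exhibits better performance:

$$J(\mathbb{M}, \theta) + \lambda \frac{1}{m} \sum_{i=1}^{m} D\big( f(x^{(i)};\theta), f(\tilde{x}^{(i)};\theta) \big) \tag{4}$$

where $\mathbb{M}$ is a minibatch including clean examples $\{x^{(1)}, \dots, x^{(m)}\}$ and the corresponding adversarial examples $\{\tilde{x}^{(1)}, \dots, \tilde{x}^{(m)}\}$, $f(x;\theta)$ is the function mapping from inputs to the logits of the model, and $J(\mathbb{M}, \theta)$ is the cost function used for adversarial training. The weight $\lambda$ balances the pairing term, and $D$ is a distance between logit vectors. One possible explanation for the effectiveness of adversarial training is that it forces the neural network to learn robust features, which can mitigate the effect of adversarial examples [10].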
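As a rough illustration of the pairing term in ALP, the sketch below computes a weighted mean distance between clean and adversarial logit vectors. The squared Euclidean distance, the function name `logit_pairing_term`, and the default weight are assumptions made for this example, not details taken from this paper.

```python
def logit_pairing_term(clean_logits, adv_logits, weight=0.5):
    """Weighted mean squared L2 distance between paired logit vectors.

    clean_logits, adv_logits: lists of equal-length lists of floats,
    where adv_logits[i] belongs to the adversarial version of example i.
    weight: pairing weight (an illustrative default, not a tuned value).
    """
    if not clean_logits or len(clean_logits) != len(adv_logits):
        raise ValueError("need the same non-zero number of clean and adversarial logits")
    total = 0.0
    for z, z_adv in zip(clean_logits, adv_logits):
        # Squared Euclidean distance between the two logit vectors.
        total += sum((a - b) ** 2 for a, b in zip(z, z_adv))
    return weight * total / len(clean_logits)


pairing = logit_pairing_term([[1.0, 2.0], [0.0, 0.0]],
                             [[1.0, 0.0], [3.0, 4.0]])
```

This term would be added to the usual adversarial training cost; it is zero exactly when every adversarial example produces the same logits as its clean counterpart.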

2.3 Self-supervised Learning

Self-supervised learning exploits the internal structure of data and formulates predictive tasks to train a model, which can be seen as learning robust features. Some representative works are the following. Contrastive Predictive Coding (CPC) [19] uses a probabilistic contrastive loss that induces the latent space to capture information that is maximally useful for predicting future samples. Deep Infomax (DIM) [8] maximizes the mutual information between global features and local features. Augmented Multiscale DIM (AMDIM) [9] maximizes the mutual information between features extracted from multiple views of a shared context.

In fact, self-supervised representations have been used for defense in previous works. Sitawarin and Wagner [21] utilized the feature representation coupled with kNN for classification, adopting both supervised and self-supervised features. However, as mentioned in [21], their method does not perform well on datasets bigger than MNIST. Hendrycks et al. [7] added a self-supervised loss to the loss of traditional adversarial training, but this training process is still time-consuming.

3 Method Description

As illustrated before, forcing the neural network to learn the robust features of an instance can help improve the robustness of the model. Meanwhile, self-supervised learning focuses on robust features; for example, a self-supervised model can predict the missing part of an image from the rest of the image. Inspired by these viewpoints, we propose using the self-supervised representation together with k-Nearest Neighbour to defend against adversarial examples. Besides, we can maximize the mutual information between the representations of clean and adversarial examples by fine-tuning the existing model, so that the model can further mitigate adversarial perturbations.

3.1 Self-supervised Representation for Defense

The self-supervised representation is coupled with kNN for classification. After self-supervised training, the neural network is frozen and adopted as a feature extractor. All instances in the training set are fed into the network to obtain their representations at a specified layer, and these representations serve as the feature library. Given an image, we extract its feature representation, search the feature library for the k nearest representations, and predict the label of the image from the labels of these neighbours.
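The lookup described above can be sketched in a few lines. The example assumes that features have already been extracted by the frozen network, and it uses Euclidean distance with a majority vote. The vote rule, the function name `knn_predict`, and the toy data are illustrative assumptions, and a brute-force search stands in for an approximate nearest-neighbour index.

```python
import math
from collections import Counter


def knn_predict(query, library, labels, k=3):
    """Predict a label by majority vote over the k nearest library features.

    query: feature vector of the test image (list of floats).
    library: feature vectors of the training set (the feature library).
    labels: labels[i] is the class of library[i].
    """
    if k < 1 or k > len(library):
        raise ValueError("k must be between 1 and the library size")
    # Indices of library entries sorted by Euclidean distance to the query.
    order = sorted(range(len(library)), key=lambda i: math.dist(query, library[i]))
    votes = Counter(labels[i] for i in order[:k])
    # On a tie, Counter keeps insertion order, so the nearer class wins.
    return votes.most_common(1)[0][0]


library = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]]
labels = ["cat", "cat", "cat", "dog", "dog"]
prediction = knn_predict([5.0, 5.4], library, labels, k=3)
```

Because the prediction depends on several stored neighbours rather than on a single learned boundary, a small shift of the query feature changes the result only if it changes the majority among the k nearest entries.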

3.2 Self-supervised Adversarial Training

To further improve the robustness of the self-supervised representation combined with kNN, we propose a method called self-supervised adversarial training (SAT), which maximizes the mutual information between the representations of clean images and the corresponding adversarial examples. As shown in Fig. 1, given the pretrained self-supervised model, the framework of SAT is divided into two parts: generating adversarial examples and maximizing the mutual information.

3.2.1 Generating Adversarial Examples

Because the attack method introduced in Section 2.1 is time-consuming, we modify the generation method, taking inspiration from PGD [17]. In detail, the gradient with respect to the image is obtained first:


where is the gradient operator, and is the default setting. Then update the image:


where $\alpha$ is the update step size. To restrict the generated adversarial examples to the $\epsilon$-ball of $x$, we can clip after each update. To distinguish the two, we refer to the method in Section 2.1 as the optimization-based method and to this one as the gradient-based method.
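A PGD-style update of this kind can be sketched as follows. The sketch assumes an L-infinity ball, a signed-gradient step, pixel values in [0, 1], and a toy objective (squared distance to a guide vector with an identity feature extractor) whose gradient is available in closed form. All of these choices and the names `pgd_step` and `toy_gradient` are assumptions for illustration; the exact loss used in the paper is not reproduced here.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def pgd_step(x_adv, x_clean, grad, alpha, epsilon):
    """One signed-gradient descent step followed by clipping.

    The result stays within the L-infinity epsilon-ball around x_clean
    and within the valid pixel range [0, 1].
    """
    updated = []
    for xa, xc, g in zip(x_adv, x_clean, grad):
        moved = xa - alpha * sign(g)                        # step against the gradient
        moved = min(max(moved, xc - epsilon), xc + epsilon)  # clip to the epsilon-ball
        updated.append(min(max(moved, 0.0), 1.0))            # clip to the pixel range
    return updated


def toy_gradient(x_adv, guide):
    """Gradient of sum_i (x_adv_i - guide_i)^2, assuming an identity feature map."""
    return [2.0 * (a - g) for a, g in zip(x_adv, guide)]


x = [0.2, 0.5, 0.9]
guide = [0.8, 0.5, 0.1]
x_adv = list(x)
for _ in range(10):
    x_adv = pgd_step(x_adv, x, toy_gradient(x_adv, guide), alpha=0.02, epsilon=0.1)
```

With a real network the gradient would come from back-propagation through the feature extractor; the fixed iteration count is what makes this approach cheaper than the binary-searched optimization.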

3.2.2 Maximizing Mutual Information

After obtaining the adversarial examples, we maximize the MI in the feature representation space. Formally, the MI between $X$ and $Y$, with joint density $p(x,y)$ and marginal densities $p(x)$ and $p(y)$, is defined as the Kullback–Leibler (KL) divergence between the joint and the product of the marginals:

$$I(X;Y) = D_{KL}\big( p(x,y) \,\|\, p(x)\,p(y) \big) \tag{7}$$

As for the feature representation, the MI can be defined as:

$$I(z;\tilde{z}) = D_{KL}\big( p(z,\tilde{z}) \,\|\, p(z)\,p(\tilde{z}) \big) \tag{8}$$

where $z$ and $\tilde{z}$ are the feature representations of a clean image and of its adversarial version, respectively. The explicit distribution of the representations is hard to obtain, so the MI cannot be calculated directly. Instead, several methods have been proposed to estimate MI; here the noise contrastive estimator (NCE) is adopted, whose estimate has been proved to be a lower bound of MI [25], defined by:

$$\hat{I}_{NCE}(z;\tilde{z}) = \mathbb{E}\left[ \log \frac{e^{s(z,\tilde{z})}}{\frac{1}{N+1}\Big( e^{s(z,\tilde{z})} + \sum_{j=1}^{N} e^{s(z,\tilde{z}'_j)} \Big)} \right] \tag{9}$$

where $\tilde{z}'_j$ is the representation of another adversarial example, different from $\tilde{x}$. Here, we refer to representations from the joint distribution as positives, i.e. $(z, \tilde{z})$, and representations from the product of marginal distributions as negatives, i.e. $(z, \tilde{z}')$. $N$ in Equation (9) is the number of negative pairs, and $s(\cdot,\cdot)$ is the score function, which is higher for positive pairs and lower for negative pairs.

$s(\cdot,\cdot)$ can be any continuous and differentiable parametric function, such as the cosine similarity. Here, the matching score function is defined as a simple dot product:

$$s(z,\tilde{z}) = \phi_1(z)^{\top} \phi_2(\tilde{z}) \tag{10}$$

where $\phi_1$ and $\phi_2$ are small neural networks, which can approximate a wide range of score functions. In the implementation, the estimated MI is maximized by minimizing its negative, named the contrast loss:

$$L_c = -\hat{I}_{NCE}(z;\tilde{z}) \tag{11}$$

The self-supervised neural network can be fine-tuned by back-propagation, minimizing $L_c$. The process is iterated until the performance meets the requirement. Notably, the whole process does not require the true labels of the data, similar to self-supervised learning. The pseudo-code of the framework is given in Algorithm 1.
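To make the contrast loss concrete, the following sketch estimates the NCE bound on a small batch of precomputed features. It assumes that the score is a plain dot product of the raw features (the small networks in the score function are omitted) and that the negatives for each clean feature are the adversarial features of the other examples in the batch. The function names are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def nce_estimate(clean_feats, adv_feats):
    """NCE estimate of the MI between paired clean and adversarial features.

    adv_feats[i] is the positive for clean_feats[i]; the other entries of
    adv_feats act as negatives. The estimate is at most log(batch size).
    """
    n = len(clean_feats)
    if n < 2 or n != len(adv_feats):
        raise ValueError("need at least two paired examples")
    total = 0.0
    for i, z in enumerate(clean_feats):
        scores = [dot(z, z_adv) for z_adv in adv_feats]
        # Numerically stable log of the mean of exp(scores).
        top = max(scores)
        log_mean_exp = top + math.log(sum(math.exp(s - top) for s in scores) / n)
        total += scores[i] - log_mean_exp
    return total / n


def contrast_loss(clean_feats, adv_feats):
    """Loss to minimize: the negative of the NCE estimate."""
    return -nce_estimate(clean_feats, adv_feats)


clean = [[4.0, 0.0], [0.0, 4.0]]
aligned = nce_estimate(clean, [[4.0, 0.0], [0.0, 4.0]])      # adversarial features match
misaligned = nce_estimate(clean, [[0.0, 4.0], [4.0, 0.0]])   # adversarial features swapped
```

When each adversarial feature stays close to its clean counterpart the estimate approaches its ceiling of log(batch size), and when the perturbation pushes features towards other examples the estimate drops, so minimizing the loss pulls the two representations back together.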

Input: training samples, perturbation bound, step size, maximization iterations per minimization step, and minimization learning rate.
Initialize the model with a pretrained self-supervised model.
for each epoch do
    for each minibatch do
        Build the adversarial example for each sample:
            Assign a random perturbation
            for each maximization iteration do
                Compute the gradient, update the image, and clip (Section 3.2.1)
            end for
        Calculate the representation of samples
        Update the model parameters with stochastic gradient descent
    end for
end for
Algorithm 1 Self-supervised Adversarial Training (SAT)

4 Experiments

4.1 Setting

Dataset. CIFAR-10 and STL-10 are selected as the datasets. CIFAR-10 consists of 60000 labeled color images in 10 classes, with 6000 images per class; there are 50000 training images and 10000 test images. STL-10 is composed of 10 classes, with 5000 labeled color images and 100000 unlabeled images for training and 8000 labeled images for testing. For speedy training, we resize the images in STL-10 to .

Attack Method. The attack is implemented under the white-box setting: the attacker has full information about the model (i.e. knows the architecture, parameters, etc.). Both gradient-based and optimization-based attack methods are used to evaluate the robustness of the models. All adversarial examples are generated from 1000 correctly predicted images in the test set. For the gradient-based attack methods, there are two kinds of setting: , , and 10 iterations; , , and 20 iterations, which are denoted as small perturbation and large perturbation, respectively.

Evaluation Metric. For kNN classification, $k$ is 75 and faiss [13] is adopted for speed. The penultimate layer of the neural network is adopted as the representation. The defense success rate (DSR), defined as the rate of correct predictions on the adversarial examples, is used to evaluate the robustness of the model against the gradient-based attack. The distance, defined as the average norm of the perturbation required to mislead the classifier, is used to measure the robustness against the optimization-based attack; a larger distance indicates better robustness. The accuracy (ACC) on clean examples is also presented to show the precision of the model.
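Both robustness metrics are simple averages, as the sketch below shows. It assumes that predictions and perturbations have already been collected and that the perturbation size is measured with the Euclidean norm. The choice of norm, the function names, and the toy inputs are assumptions for this example.

```python
import math


def defense_success_rate(predictions, true_labels):
    """Fraction of adversarial examples that are still classified correctly."""
    if not true_labels or len(predictions) != len(true_labels):
        raise ValueError("need the same non-zero number of predictions and labels")
    correct = sum(p == t for p, t in zip(predictions, true_labels))
    return correct / len(true_labels)


def mean_perturbation_norm(clean_images, adversarial_images):
    """Average Euclidean norm of (adversarial - clean) over all image pairs."""
    if not clean_images or len(clean_images) != len(adversarial_images):
        raise ValueError("need the same non-zero number of clean and adversarial images")
    norms = [math.dist(c, a) for c, a in zip(clean_images, adversarial_images)]
    return sum(norms) / len(norms)


dsr = defense_success_rate([1, 0, 2, 2], [1, 1, 2, 0])
distance = mean_perturbation_norm([[0.0, 0.0], [1.0, 1.0]],
                                  [[3.0, 4.0], [1.0, 1.0]])
```

A higher DSR means more attacks of a fixed budget fail, while a larger mean norm means the attacker needs a bigger change to succeed; the two numbers therefore describe robustness from complementary sides.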

Dataset    Method   ACC      DSR (Small)   DSR (Large)   distance
CIFAR-10   SUP      92.02%   18.7%         15.4%         0.378
CIFAR-10   SSL      84.64%   51.9%         27.7%         0.667
STL-10     SUP      75.41%   24.2%         16.3%         0.970
STL-10     SSL      86.13%   54.9%         44.7%         1.591
Table 1: The defense results of self-supervised representation and supervised representation of AMDIM with kNN on CIFAR-10 and STL-10.

Figure 2: The defense results of AMDIM and NPID using SAT on CIFAR-10. Panels: (a) AMDIM, (b) NPID.

Figure 3: The defense results of self-supervised adversarial training and supervised adversarial training on CIFAR-10. AMDIM is selected as the seed model, and SUP and SSL denote the supervised and self-supervised versions. Panels: (a) small perturbation, (b) large perturbation.

Figure 4: The defense results of self-supervised adversarial training and supervised adversarial training on STL-10. Panels: (a) small perturbation, (b) large perturbation.

4.2 Superiority of Self-supervised Representation

In this experiment, the state-of-the-art self-supervised learning method AMDIM [9] and its supervised version are compared to demonstrate the superiority of the self-supervised feature representation with respect to robustness. These two methods are denoted by SSL and SUP, respectively. The backbone of AMDIM is an encoder based on the standard ResNet [6], with changes to make it suitable for DIM. For more details about the encoder, readers can refer to [9]. The parameters of the encoder for CIFAR-10 and STL-10 are set to (ndf=128, nrkhs=128, ndepth=3) and (ndf=128, nrkhs=1024, ndepth=8), respectively. For supervised learning, the Adam optimizer with learning rate 0.001 is adopted, and the model is trained for 400 epochs. For self-supervised learning, the learning rate is 0.0002 and the number of epochs is 300.

As shown in Table 1, the DSR of the self-supervised version (SSL) of AMDIM exceeds that of its supervised version (SUP) by a clear margin against the gradient-based attack, and the distance required to successfully attack SSL is larger than that required for SUP on both datasets. On CIFAR-10, the ACC of SSL on clean images is lower than that of the supervised version (84.64% versus 92.02%), but the gain in DSR is large: 33.2 percentage points under the small perturbation attack (51.9% versus 18.7%). On STL-10, SSL also outperforms the supervised version in ACC (86.13% versus 75.41%), benefiting from the unlabeled images used in the training phase. In conclusion, the self-supervised representation is more robust.

4.3 Effectiveness of SAT

To verify the effectiveness of SAT, the unsupervised model NPID [27] and the self-supervised model AMDIM [9] are selected as the seed models. The adversarial examples are generated by the gradient-based attack method with the small perturbation. The other settings for SAT are a batch size of 100, a learning rate of 0.0001, and the Adam optimizer. The results are presented in Fig. 2. The DSR improves considerably after SAT, with only a slight drop in accuracy, for both AMDIM and NPID against small and large perturbation attacks, verifying the effectiveness of SAT.

We have also compared SAT with supervised adversarial training methods, for which the adversarial examples are generated using PGD with the same setting as the gradient-based method. AMDIM is selected as the seed model, and the results of these methods against small and large perturbation attacks are shown in Fig. 3 and Fig. 4, where the suffix indicates which adversarial training method is adopted. The closer a point is to the top right corner of a figure, the better the performance. For CIFAR-10, a dataset with sufficient labels, SAT is worse than MAT and ALP, because the original classification performance of self-supervised learning is worse than that of supervised models. Looking at the development of self-supervised learning, the gap between self-supervised and supervised models is narrowing; with a stronger self-supervised model, the performance of SAT should become competitive. Furthermore, the time cost of SAT is much lower than that of MAT and ALP. For a dataset with few labels, like STL-10, SAT performs well: it achieves a better trade-off between robustness and accuracy than the supervised versions, especially under the small perturbation attack, as shown in Fig. 4. Since datasets with little supervised information are common in many downstream tasks, the proposed SAT method has good prospects.

5 Conclusion

In this paper, we utilize the self-supervised representation coupled with kNN for classification, the underlying reason being that a self-supervised model learns the robust features of data. To further strengthen the defense ability of the self-supervised representation, a general framework called self-supervised adversarial training is proposed, which maximizes the mutual information between the representations of original examples and adversarial examples. The experiments show that the self-supervised representation of AMDIM outperforms its supervised representation in terms of robustness on CIFAR-10 and STL-10. Furthermore, self-supervised adversarial training has been verified to apply efficiently to AMDIM and NPID and to significantly improve the robustness against adversarial examples with only a slight drop in accuracy.

An interesting direction is to design self-supervised learning methods that consider adversarial attacks in the training phase, so that the self-supervised representation is naturally robust; this is one direction of our future work.


  • [1] J. Buckman, A. Roy, C. Raffel, and I. Goodfellow (2018) Thermometer encoding: one hot way to resist adversarial examples. Cited by: §1.
  • [2] N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. Cited by: §1.
  • [3] C. Chen, A. Seff, A. Kornhauser, and J. Xiao (2015) Deepdriving: learning affordance for direct perception in autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2722–2730. Cited by: §1.
  • [4] Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li (2018) Boosting adversarial attacks with momentum. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9185–9193. Cited by: §1.
  • [5] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1, §1.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.2.
  • [7] D. Hendrycks, M. Mazeika, S. Kadavath, and D. Song (2019) Using self-supervised learning can improve model robustness and uncertainty. arXiv preprint arXiv:1906.12340. Cited by: §2.3.
  • [8] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2018) Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670. Cited by: §2.3.
  • [9] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2019) Learning deep representations by mutual information estimation and maximization. Cited by: §2.3, §4.2, §4.3.
  • [10] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry (2019) Adversarial examples are not bugs, they are features. arXiv preprint arXiv:1905.02175. Cited by: §1, §2.2.
  • [11] X. Jia, X. Wei, X. Cao, and H. Foroosh (2019) ComDefend: an efficient image compression model to defend adversarial examples. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6084–6092. Cited by: §1.
  • [12] G. Jin, S. Shen, D. Zhang, F. Dai, and Y. Zhang (2019) APE-gan: adversarial perturbation elimination with gan. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3842–3846. Cited by: §1.
  • [13] J. Johnson, M. Douze, and H. Jégou (2017) Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734. Cited by: §4.1.
  • [14] H. Kannan, A. Kurakin, and I. Goodfellow (2018) Adversarial logit pairing. arXiv preprint arXiv:1803.06373. Cited by: §1, §2.2.
  • [15] F. Kreuk, Y. Adi, M. Cisse, and J. Keshet (2018) Fooling end-to-end speaker verification with adversarial examples. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1962–1966. Cited by: §1.
  • [16] R. Li, W. Zhang, H. Suk, L. Wang, J. Li, D. Shen, and S. Ji (2014) Deep learning based imaging data completion for improved brain disease diagnosis. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 305–312. Cited by: §1.
  • [17] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. Cited by: §1, §2.2, §3.2.1.
  • [18] S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2574–2582. Cited by: §1.
  • [19] A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §2.3.
  • [20] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami (2016) Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pp. 582–597. Cited by: §1.
  • [21] C. Sitawarin and D. Wagner (2019) Defending against adversarial examples with k-nearest neighbor. arXiv preprint arXiv:1906.09525. Cited by: §2.1, §2.3.
  • [22] Y. Song, T. Kim, S. Nowozin, S. Ermon, and N. Kushman (2017) Pixeldefend: leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766. Cited by: §1.
  • [23] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §1.
  • [24] S. A. Taghanaki, K. Abhishek, S. Azizi, and G. Hamarneh (2019) A kernelized manifold mapping to diminish the effect of adversarial perturbations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11340–11349. Cited by: §1.
  • [25] M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and M. Lucic (2019) On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625. Cited by: §3.2.2.
  • [26] E. Wong and J. Z. Kolter (2017) Provable defenses against adversarial examples via the convex outer adversarial polytope. arXiv preprint arXiv:1711.00851. Cited by: §2.2.
  • [27] Z. Wu, Y. Xiong, S. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance-level discrimination. arXiv preprint arXiv:1805.01978. Cited by: §4.3.
  • [28] C. Xie, Y. Wu, L. v. d. Maaten, A. L. Yuille, and K. He (2019) Feature denoising for improving adversarial robustness. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 501–509. Cited by: §1.
  • [29] W. Xu, D. Evans, and Y. Qi (2017) Feature squeezing: detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155. Cited by: §1.