Neural networks produce state-of-the-art results across a large number of domains. Despite the increasing adoption of neural networks in commercial settings, recent work has shown that such algorithms are susceptible to inputs with imperceptible perturbations meant to cause misclassification. It is thus important to investigate additional vulnerabilities as well as defenses against them.
In this paper we investigate the problem of adversarial attacks on image classification systems. Attacks so far have only considered the conventional neural network architecture, which outputs softmax predictions and is trained by minimizing the cross-entropy loss function. We thus propose the following modifications and evaluate the robustness of the resulting networks against adversarial attacks:
Train the model to minimize mean-squared error (MSE), rather than cross-entropy.
Replace traditional one-hot target representations with codewords generated from a random codebook.
We evaluate our proposed modifications from multiple angles. First, we measure the robustness of the modified model using attacks under multiple threat scenarios. Second, we introduce an attack which, without sacrificing its efficacy against conventional architectures, is tailored to our proposed modifications. Finally, we conduct spectral analysis on the model’s parameters to compute their upper Lipschitz bounds, a measure that has been shown to correlate with a model’s robustness. Our results in Section 5 demonstrate that, across all three evaluations, our proposed model displays increased robustness compared to its conventional counterpart.
2.1 Neural networks
A neural network f_θ: R^d → R^k is a non-linear function that maps data x to targets y, where d and k are the dimensions of the input and target spaces, respectively, and θ represents the parameters of the neural network. For conventional neural networks and classification tasks, y is typically a one-hot representation of the class label and k is the number of classes in the dataset. In this work, we use the DenseNet architecture as the existing benchmark, which has recently produced state-of-the-art results on several image datasets.
2.2 Adversarial examples
The goal of an adversarial attack is to cause a misclassification by the target neural network. In particular, prior work has shown that it is possible to construct some x' = x + δ by adding a minimal perturbation δ to the original input x such that the model misclassifies x'. Here, x' is commonly referred to as an adversarial example, while the original data x is referred to as a clean example. Apart from image classification, adversarial attacks have been proposed in both natural language and audio domains.
We explore two adversarial settings, namely white-box and black-box scenarios. In the white-box setting, the attacker has access to and utilizes the model’s parameters, outputs, target representations, and loss function to generate adversarial examples. In the black-box scenario, the attacker has no access to the model’s parameters or specifications and only has the ability to query it for predictions. In this work, we employ transfer attacks, a type of black-box attack where adversarial examples are generated using a proxy model which the adversary has access to.
There are mainly two types of attacks. In a targeted attack, an adversary generates an adversarial example x' so that the target model returns some chosen target class t. A targeted attack is evaluated by its success rate, the proportion of images for which the target class was successfully predicted (the lower the better from the perspective of the defense). In an untargeted attack, the attacker simply causes the model to return some prediction other than the true class. It is evaluated by the accuracy of the target model, the proportion of images which failed to get misclassified (the higher the better from the perspective of the defense).
2.3 Attacks
The following sections describe the attacks used in this work.
Fast Gradient Sign Method (FGSM).
The Fast Gradient Sign Method, one of the earliest gradient-based attacks, generates adversarial examples via:
x' = x + ε · sign(∇_x J(θ, x, y)),
where J is the loss function of the neural network, y is the target label, and ε is a parameter which controls the magnitude of the perturbations made to the original input x. The gradient, which is taken w.r.t. the input, determines the direction in which each pixel should be perturbed in order to maximize the loss function and cause a misclassification.
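As a concrete sketch, FGSM takes only a few lines. The model below is a toy logistic-regression "network" with hypothetical fixed weights (a stand-in so the gradient is analytic, not the DenseNet models used in this paper):

```python
import numpy as np

def fgsm(x, y, w, b, eps):
    """Perturb x by eps in the direction of the sign of the loss gradient
    taken w.r.t. the input (untargeted FGSM on a toy logistic model)."""
    z = x @ w + b                   # logit of the toy model
    p = 1.0 / (1.0 + np.exp(-z))    # sigmoid prediction
    grad_x = (p - y) * w            # d(cross-entropy)/dx for this model
    return x + eps * np.sign(grad_x)

x = np.array([0.2, -0.1, 0.5])
w = np.array([1.0, -2.0, 0.5])      # hypothetical fixed weights
x_adv = fgsm(x, y=1.0, w=w, b=0.0, eps=0.1)
```

Each coordinate of x moves by exactly eps, so the L-infinity size of the perturbation is controlled directly by ε.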
Basic Iterative Method (BIM).
The Basic Iterative Method, proposed by Kurakin et al., applies FGSM iteratively to find more effective adversarial examples.
Momentum Iterative Method (MIM).
The Momentum Iterative Method, proposed by Dong et al., integrates a momentum term into the iterative gradient updates of BIM, stabilizing the update directions and yielding more transferable adversarial examples.
L-BFGS attack.
Szegedy et al. proposed the L-BFGS attack, the first targeted white-box attack on convolutional neural networks, which solves the following constrained optimization problem:
minimize  c · ||x − x'||_2 + J(θ, x', t),  subject to x' ∈ [0, 1]^d.
The above formulation minimizes two objectives: the left term measures the distance (L2 norm) between the input and the adversarial example, while the right term is the cross-entropy loss toward the target class t. It is used only as a targeted attack.
Deep Fool.
The Deep Fool attack, proposed by Moosavi-Dezfooli et al., approximates the decision boundaries of neural networks as linear hyperplanes and uses an iterative optimization algorithm, similar to the Newton-Raphson method, to find the smallest perturbation which causes a misclassification. It is used only as an untargeted attack.
Madry et al.
Madry et al. proposed an attack based on projected gradient descent (PGD), which relies on local first-order information of the target model. The method is similar to FGSM and BIM, except that it uses random starting positions for generating adversarial examples.
Carlini & Wagner L2 (CWL2).
The Carlini & Wagner L2 attack follows an optimization problem similar to that of L-BFGS but replaces cross-entropy with a cost function that depends on the pre-softmax logits of the network. In particular, the attack solves the following problem:
minimize  ||δ||_2 + c · f(x + δ),
where δ is the perturbation made to the input and f is the objective function:
f(x') = max( max_{i ≠ t} Z(x')_i − Z(x')_t , −κ ).
Here, Z represents the pre-softmax logits of the network and t is the target class. In short, the attack aims to maximize the logit value of the target class while minimizing the L2 norm of the input perturbations.
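For reference, the CWL2 objective f can be computed directly from a logit vector; this sketch takes the confidence margin κ to default to zero:

```python
import numpy as np

def cw_objective(logits, target, kappa=0.0):
    """Carlini & Wagner targeted objective: positive while some non-target
    logit exceeds the target logit, clipped below at -kappa."""
    logits = np.asarray(logits, dtype=float)
    other = np.max(np.delete(logits, target))  # largest non-target logit
    return float(max(other - logits[target], -kappa))
```

With κ = 0, the objective is zero once the target class holds the largest logit and positive otherwise, so minimizing it drives the target logit to the top.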
3 Improving adversarial robustness
In this work we have two proposals. First, we propose changes to the conventional neural network architecture and target representations to defend against adversarial attacks described in Section 2.3. Second, we propose a modified, more effective CWL2 attack that is specifically tailored to our proposed defense.
3.1 Training on mean-squared error
Instead of the conventional cross-entropy loss, we propose to use MSE to compute the error between the output of the model f_θ(x) and the target y ∈ Y, where Y is the set of target representations for all classes. During inference, we select the output class whose target representation yields the smallest Euclidean distance to f_θ(x).
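The nearest-codeword decoding rule can be sketched as follows (the two-class codebook is illustrative, not a trained setup):

```python
import numpy as np

def decode(output, codebook):
    """Return the class whose codeword is nearest (Euclidean distance) to
    the network output; codebook has shape (num_classes, code_len)."""
    dists = np.linalg.norm(codebook - output, axis=1)
    return int(np.argmin(dists))

# toy 2-class codebook (illustrative values)
codebook = np.array([[1.0, 0.0], [0.0, 1.0]])
pred = decode(np.array([0.9, 0.2]), codebook)  # nearest to class 0
```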
3.2 Randomized target representations
Instead of using one-hot encodings as target representations, we represent each target class as a codeword from a random codebook. Specifically, the target representations corresponding to the k classes are sampled once at the beginning of training from a uniform distribution based on a secret key. To match the representation space of the network output and the targets, the conventional softmax layer is replaced with a tanh activation whose number of outputs equals the codeword length.
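A minimal sketch of the codebook construction. The uniform distribution spanning (−1, 1) to match the tanh output range, and the secret key acting as the RNG seed, are assumptions for illustration; the paper specifies only a keyed uniform distribution:

```python
import numpy as np

def make_codebook(num_classes, code_len, secret_key):
    """Sample fixed class codewords once from U(-1, 1), keyed by the
    secret so the same key always regenerates the same codebook."""
    rng = np.random.default_rng(secret_key)
    return rng.uniform(-1.0, 1.0, size=(num_classes, code_len))
```

Because the codebook is a deterministic function of the key, it never needs to be stored: the key alone reproduces the target representations at training and inference time.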
3.3 Modified CWL2 attack
The Carlini & Wagner L2 attack makes several assumptions about the target network’s architecture based on its cost function mentioned in Section 2.3, namely that the highest logit value corresponds to the most likely class. Applying our proposed modifications breaks these assumptions: the outputs of the network are tanh activations, and the length of the output no longer corresponds to the number of classes in the dataset. We thus propose a simple modification to the CWL2 attack in which the cost function considers the distance, in some metric space, between the logits and the target representations.
Like with the Carlini & Wagner L2 attack, the cost function is nonpositive if and only if the model predicts the target class. Using the change-of-variables formulation of the CWL2 attack to enforce box constraints on the perturbations, our attack finds some w which optimizes the following objective:
minimize  ||½(tanh(w) + 1) − x||_2 + c · f(½(tanh(w) + 1)),
where c is a trade-off constant that controls the importance of the size of the perturbations (larger values of c allow for larger distortions). For our experiments, we define the distance as the Euclidean distance.
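One concrete distance-based cost with the required sign property (nonpositive exactly when the target codeword is the nearest one, i.e. when the target class would be decoded) can be sketched as follows. This is a hedged reading, not necessarily the paper's verbatim formulation:

```python
import numpy as np

def modified_cw_cost(output, codebook, target):
    """Distance to the target codeword minus distance to the nearest
    non-target codeword: <= 0 exactly when the target class is decoded."""
    d = np.linalg.norm(codebook - output, axis=1)
    return float(d[target] - np.min(np.delete(d, target)))
```

Minimizing this cost pulls the network output toward the target codeword relative to all others, playing the same role the logit-gap objective plays in the original CWL2 attack.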
3.4 Lipschitz bounds and robustness
Earlier works have suggested that the sensitivity of neural networks towards adversarial perturbations can be measured with the upper Lipschitz bound of each network layer. Parseval Networks, for example, introduced a layer-wise regularization technique for improving robustness by enforcing smaller global Lipschitz bounds. More specifically, Szegedy et al. have shown that:
||f(x) − f(x + r)|| ≤ L · ||r||,  with L = ∏_k L_k,
where L and L_k are the upper Lipschitz bounds of the network f and its k-th layer, respectively. In other words, the efficacy of an adversarial attack depends on the generalization error of the target model as well as the Lipschitz bounds of its layers. This suggests that smaller Lipschitz bounds indicate a more robust model. For both fully-connected and convolutional layers, the bound can be measured by calculating the layer’s operator norm. The operator norm of the k-th fully-connected layer is simply the largest singular value of its weight matrix W_k, so the Lipschitz constant of that layer is:
L_k = σ_max(W_k).
For convolutional kernels, we rely on the formulation of Sedghi et al., which involves applying the two-dimensional discrete Fourier transform to the kernel to find its largest singular values.
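These bounds can be computed in a few lines. The convolutional case below follows the Sedghi et al. construction under its assumption of circular convolution; layer shapes are illustrative:

```python
import numpy as np

def fc_operator_norm(W):
    """Upper Lipschitz bound of a fully-connected layer: the largest
    singular value (operator norm) of its weight matrix."""
    return float(np.linalg.svd(W, compute_uv=False)[0])

def conv_operator_norm(kernel, n):
    """Largest singular value of a 2-D circular convolution on n x n
    inputs: take the 2-D DFT of the zero-padded kernel, then SVDs of the
    per-frequency (c_in x c_out) matrices. kernel: (k, k, c_in, c_out)."""
    k, _, c_in, c_out = kernel.shape
    padded = np.zeros((n, n, c_in, c_out))
    padded[:k, :k] = kernel
    transform = np.fft.fft2(padded, axes=(0, 1))      # (n, n, c_in, c_out)
    svals = np.linalg.svd(transform.reshape(n * n, c_in, c_out),
                          compute_uv=False)
    return float(svals.max())
```

The product of these per-layer norms then upper-bounds the Lipschitz constant of the whole network, as in the inequality above.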
Section 5.6 presents empirical results which demonstrate that simply changing the loss function from cross-entropy to mean-squared error can yield model parameters with significantly smaller Lipschitz bounds.
4 Experimental setup
In this section we describe the evaluation datasets, evaluation models and adversarial image generation process.
4.1 Datasets
CIFAR-10 is a small image classification dataset with 10 classes. It contains 60,000 thumbnail-size images of dimensions 32x32x3, of which 10,000 are withheld for testing.
MNIST is another image classification dataset containing monochromatic thumbnails (28x28) of handwritten digits. It comprises 60,000 training images and 10,000 testing images.
Fashion-MNIST is a relatively new image classification dataset containing thumbnail images of 10 types of clothing (shoes, shirts, etc.) which acts as a drop-in replacement for MNIST.
4.2 Models evaluated
We use three variants of the DenseNet model to generate adversarial examples:
O:SOFTMAX:CE refers to a DenseNet model with softmax activations trained on cross-entropy loss and one-hot target representations.
O:SOFTMAX:MSE refers to a DenseNet model with softmax activations trained on MSE and one-hot target representations.
R:TANH:MSE refers to a DenseNet model with tanh activations trained on MSE using codeword target representations.
We evaluated the robustness of the R:TANH:MSE model with different codeword lengths (64, 256, and 1024) but found no significant discrepancies in the results.
|Attack||Varied parameter|
|Basic Iterative Method||epsilon|
|Carlini & Wagner L2||initial constant|
|Deep Fool||max iterations|
|Fast Gradient Sign Method||epsilon|
|L-BFGS Attack||initial constant|
|Madry et al.||epsilon|
|Momentum Iterative Method||epsilon|
|Basic Iterative Method||eps_iter||nb_iter|
|Carlini & Wagner L2||binary_search_steps||max_iterations|
|Madry et al.||eps_iter||nb_iter|
|Momentum Iterative Method||eps_iter||nb_iter|
|CWL2||MIM||Deep Fool|
|BIM||FGSM||Madry et al.|
4.3 Generating adversarial examples
For each dataset mentioned in Section 4.1, we train a model on the training set and generate adversarial examples using the test set. For targeted attacks, we randomly sample a target class for each image in the test set.
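The targeted-attack setup can be sketched as follows; whether the sampled target classes are constrained to differ from the true class is not stated in the paper, and this sketch assumes they are:

```python
import numpy as np

def sample_target_classes(true_labels, num_classes, seed=0):
    """Sample one target class per test image, resampling any draw that
    collides with the image's true class (an assumed constraint)."""
    rng = np.random.default_rng(seed)
    targets = rng.integers(0, num_classes, size=len(true_labels))
    clash = targets == true_labels
    while np.any(clash):
        targets[clash] = rng.integers(0, num_classes, size=clash.sum())
        clash = targets == true_labels
    return targets
```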
We evaluate the robustness of each model (listed in Section 4.2) against the attacks listed in Table 1 under the white-box setting. For the R:TANH:MSE model, the attacker has access to the codeword representations. We also evaluate model robustness against transfer attacks, a type of black-box attack where adversarial examples are generated using a proxy model to which the adversary has access. Finally, we further measure the robustness of our proposed model using the modified CWL2 attack.
All experiments are implemented using TensorFlow, a popular framework for building deep learning algorithms.
4.3.1 Attack parameters
For a given attack, we generate adversarial examples across a range of values for a particular parameter which controls the magnitude of the perturbations made. Table 1 lists the parameters which are modified for each attack, whereas Table 2 lists the parameters held constant. We use the default values defined in Cleverhans for our constant parameters.
4.3.2 Adapting attacks to our proposed techniques
The attacks described in Section 2.3 are implemented using the Cleverhans library. By default, the attacks assume that the model outputs softmax predictions and that the targets are represented as one-hot vectors. Hence the internal loss function for some attacks (e.g. gradient-based iterative attacks) is predefined as cross-entropy. However, because the cross-entropy loss function is not compatible with the R:TANH:MSE model, we have adapted the library to use mean-squared error when the target model has also been trained on mean-squared error. These adaptations are important in preserving the white-box assumption of each attack.
5 Experimental observations
In this section, we present and analyze the performance of the evaluation models under different attack scenarios: untargeted and targeted attacks (Section 5.2), black-box attacks (Section 5.3), and our modified CWL2 attack (Section 5.4). Benchmark performances on the original datasets are presented in Section 5.1.
5.1 Clean test performance
Table 4 lists the accuracy of each model on each clean test dataset. We observe minimal differences in accuracy across the models; hence, our proposed modifications maintain state-of-the-art classification performance.
|L-BFGS||BIM||Madry et al.|
|CWL2||MIM||FGSM|
5.2 Untargeted and targeted attacks
Table 3 lists the accuracies of the models against untargeted white-box attacks. Both O:SOFTMAX:MSE and R:TANH:MSE models demonstrate higher accuracies on the adversarial examples compared to the O:SOFTMAX:CE model; we observe an increase in accuracies of up to 98.7%. Similar results can be observed in Table 5, where the O:SOFTMAX:MSE and R:TANH:MSE models achieve a consistent decrease in attack success rates of up to 99.8%.
5.3 Black box attacks
Table 6 shows the accuracies of transfer attacks against the O:SOFTMAX:MSE and R:TANH:MSE models. Our proposed models demonstrate more robustness towards black-box attacks compared to the white-box versions with the same configurations. Though this is expected behavior, it is imperative to evaluate a defense under multiple threat scenarios.
|CWL2||Ours|
5.4 Modified CWL2 attack
Table 7 compares our proposed attack with the CWL2 attack. The results show that our attack maintains its efficacy against O:SOFTMAX:CE models while significantly increasing its success rate against the R:TANH:MSE model up to 70.9%. We note that increasing the initial constant for our attack yields increased success rates, which is aligned with the intuition that the parameter controls the importance of the attack’s success as highlighted in Section 3.3. We also observe that, despite the increase in the attack’s efficacy, the R:TANH:MSE model displays more robustness compared to the O:SOFTMAX:CE model, with a decrease in success rates of up to 28.5%.
5.5 Distortion vs. performance
Figure 1 (on page 1) displays adversarial images generated from targeted white-box Madry et al. attacks on the O:SOFTMAX:CE and R:TANH:MSE models, respectively. We choose the lowest ε for which the attack achieves a success rate of 100%. It is clear that the R:TANH:MSE model requires much larger perturbations for an attack to achieve the same success rate as against the O:SOFTMAX:CE model.
Figure 2 displays adversarial images generated using the Momentum Iterative Method against both the O:SOFTMAX:CE and R:TANH:MSE models at a large value of ε. We observe that the R:TANH:MSE model is robust even against adversarial images where the perturbations are clearly perceptible to humans.
Finally, we visualize adversarial examples generated using our modified CWL2 attack and the R:TANH:MSE model in Figure 3, where the attack achieves higher success rates compared to the original attack. The perturbations made to the images are much less perceptible compared to the adversarial examples displayed in Figures 1 and 2.
5.6 Comparing upper Lipschitz bounds
Figure 4 compares the upper Lipschitz bounds of the convolutional layers of the O:SOFTMAX:CE and O:SOFTMAX:MSE models. The upper bounds for the O:SOFTMAX:MSE model are consistently smaller than those of the O:SOFTMAX:CE model, by up to a factor of three, across each dataset, supporting our hypothesis that models trained to minimize mean-squared error are more robust to small perturbations.
6 Related work
Several defenses have also been proposed. To date, the most effective defense technique is adversarial training, where the model is trained on a mix of clean and adversarial data. This has been shown to provide a regularization effect that makes models more robust towards attacks.
Papernot et al. proposed defensive distillation, a mechanism whereby a model is trained on soft labels generated by another ‘teacher’ network in order to prevent overfitting. Other methods include introducing randomness to, or applying transformations on, the input data and/or the layers of the network. However, Athalye et al. have identified that the apparent robustness of several defenses can be attributed to the introduction of computation and transformations that mask the gradients and thus break existing attacks that rely on gradients to generate adversarial examples. Their work demonstrates that small, tailored modifications to the attacks can circumvent these defenses completely.
7 Conclusion
We have reported interesting experimental results demonstrating the adversarial robustness of models that do not follow conventional specifications. We have observed that simply changing the loss function minimized during training can greatly impact the robustness of a neural network against adversarial attacks. Our evaluation strategy is manifold, consisting of existing attacks, a new attack adjusted to our proposed modifications, and a spectral analysis of the model’s parameters. The increase in robustness observed in our experiments suggests the importance of considering alternatives to conventional design choices when making neural networks more secure. Future work would involve further investigation into why such modifications improve the robustness of neural networks.
-  Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
-  Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. Generating natural language adversarial examples. arXiv preprint arXiv:1804.07998, 2018.
-  Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
-  Arjun Nitin Bhagoji, Warren He, Bo Li, and Dawn Song. Exploring the space of black-box attacks on deep neural networks. arXiv preprint arXiv:1712.09491, 2017.
-  Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. arXiv preprint arXiv:1608.04644, 2016.
-  Nicholas Carlini and David Wagner. Audio adversarial examples: Targeted attacks on speech-to-text. arXiv preprint arXiv:1801.01944, 2018.
-  Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. arXiv preprint arXiv:1704.08847, 2017.
-  Guneet S Dhillon, Kamyar Azizzadenesheli, Zachary C Lipton, Jeremy Bernstein, Jean Kossaifi, Aran Khanna, and Anima Anandkumar. Stochastic activation pruning for robust adversarial defense. arXiv preprint arXiv:1803.01442, 2018.
-  Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. Boosting adversarial attacks with momentum. In , 2018.
-  Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
-  Chuan Guo, Mayank Rana, Moustapha Cisse, and Laurens van der Maaten. Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117, 2017.
-  Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.
-  Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, volume 1, page 3, 2017.
-  Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. Character-aware neural language models. In AAAI, pages 2741–2749, 2016.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
-  Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.
-  Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.
-  Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
-  Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
-  Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2574–2582, 2016.
-  Nicolas Papernot, Nicholas Carlini, Ian Goodfellow, Reuben Feinman, Fartash Faghri, Alexander Matyasko, Karen Hambardzumyan, Yi-Lin Juang, Alexey Kurakin, Ryan Sheatsley, Abhibhav Garg, and Yen-Chen Lin. cleverhans v2.0.0: an adversarial machine learning library. arXiv preprint arXiv:1610.00768, 2017.
-  Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In Security and Privacy (EuroS&P), 2016 IEEE European Symposium on, pages 372–387. IEEE, 2016.
-  Nicolas Papernot, Patrick D McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. arXiv preprint arXiv:1511.04508, 2015.
-  Pouya Samangouei, Maya Kabkab, and Rama Chellappa. Defense-gan: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605, 2018.
-  John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
-  Hanie Sedghi, Vineet Gupta, and Philip M. Long. The singular values of convolutional layers. CoRR, abs/1805.10408, 2018.
-  Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying some distributional robustness with principled adversarial training. 2018.
-  Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, and Nate Kushman. Pixeldefend: Leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766, 2017.
-  Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
-  Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv preprint arXiv:1705.07204, 2017.
-  Aäron Van Den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. In SSW, page 125, 2016.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
-  Xi Wu, Uyeong Jang, Jiefeng Chen, Lingjiao Chen, and Somesh Jha. Reinforcing adversarial robustness using model confidence induced by adversarial training. In International Conference on Machine Learning, pages 5330–5338, 2018.
-  Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, 2017.
-  Cihang Xie, Jianyu Wang, Zhishuai Zhang, Zhou Ren, and Alan Yuille. Mitigating adversarial effects through randomization. arXiv preprint arXiv:1711.01991, 2017.
-  Hiromu Yakura and Jun Sakuma. Robust audio adversarial example for a physical attack. arXiv preprint arXiv:1810.11793, 2018.
-  Zhuolin Yang, Bo Li, Pin-Yu Chen, and Dawn Song. Characterizing audio adversarial examples using temporal dependency. arXiv preprint arXiv:1809.10875, 2018.
-  Xiaoyong Yuan, Pan He, Qile Zhu, Rajendra Rana Bhat, and Xiaolin Li. Adversarial examples: Attacks and defenses for deep learning. arXiv preprint arXiv:1712.07107, 2017.