While deep learning is relatively robust to random noise, it can be easily fooled by so-called adversarial perturbations. These perturbations are generated by adversarial attacks [15, 30, 5], which produce perturbed versions of the input that are misclassified by a classifier while remaining quasi-imperceptible to humans. Different approaches have been proposed for explaining properties of adversarial examples and for providing a rationale for their existence in the first place [15, 41, 12, 13]. Moreover, adversarial perturbations have been shown to be relatively robust against various kinds of image transformations and can even be successful when placed as artifacts in the physical world [21, 39, 10, 4]. Thus, adversarial perturbations pose a safety and security risk for autonomous systems and reduce trust in models that are in principle vulnerable to these perturbations.
Several methods have been proposed for increasing the robustness of deep networks against adversarial examples, such as adversarial training [15, 22], virtual adversarial training, ensemble adversarial training [35, 34], stability training, robust optimization, and Parseval networks. An alternative approach for defending against adversarial examples is to detect and reject them as malicious. While some of these approaches improve robustness against adversarial examples to some extent, for all defenses the classifier remains vulnerable to adversarial perturbations on a non-negligible fraction of the inputs [3, 43].
Most work has focused on increasing robustness in classification tasks with high-dimensional input spaces such as image classification, where the adversary can choose a data-dependent perturbation for each input. This setting is very much in favor of the adversary, since the adversary can craft a high-dimensional perturbation "just" to fool a model on a single decision. In this work, we argue that limited success in increasing robustness under these conditions does not necessarily imply that robustness cannot be achieved in other settings. Specifically, we focus on robustness against universal perturbations, where the same perturbation needs to fool a classifier on many inputs. Moreover, we investigate robustness against such perturbations in dense prediction tasks such as semantic image segmentation, where a perturbation needs to fool a model on many decisions, e.g., the pixel-wise classifications. Robustness against universal perturbations is important in settings where an adversary aims at fooling a model on many inputs but is not able to compute input-dependent perturbations for all of them, for instance because of a lack of computational resources or a lack of time for re-computing perturbations.
Prior work has shown that standard models are vulnerable to universal perturbations which mislead a classifier on the majority of the inputs [28, 31], to adversarial perturbations on semantic image segmentation tasks [14, 44, 6], and even to universal perturbations on semantic image segmentation. However, these results have been achieved for undefended models. In this work, we focus on the case where models have been "hardened" by a defense mechanism.
Our main contributions are as follows: (1) We propose shared adversarial training, an extension of adversarial training that handles the inherent trade-off between accuracy on clean examples and robustness against adversarial examples with universal perturbations more gracefully. (2) We evaluate our method on CIFAR10, a subset of ImageNet (200 classes of TinyImageNet), and Cityscapes and demonstrate that universal perturbations for the defended models become clearly perceptible as shown in Figure 1 (Note that all graphics in the paper are best viewed in color). (3) To the best of our knowledge, we are the first to scale defenses based on adversarial training to semantic image segmentation. (4) We demonstrate empirically on CIFAR10 that the proposed technique outperforms other defense mechanisms [29, 36] in terms of robustness against universal perturbations.
2 Related Work
In this section, we review related work on universal perturbations and adversarial perturbations for semantic image segmentation.
2.1 Universal Perturbations
In contrast to adversarial perturbations, which are input-dependent, universal perturbations are input-agnostic (generated without knowing future inputs), with the objective of fooling a classifier on most inputs. Different methods for generating universal perturbations exist: the first approach, by Moosavi-Dezfooli et al., uses an extension of the DeepFool adversary to generate perturbations that fool a classifier on a maximum number of inputs from a training set. Metzen et al. proposed a similar extension of the basic iterative adversary for generating universal perturbations for semantic image segmentation. Mopuri et al. proposed Fast Feature Fool, which, in contrast to the former works, is a data-independent approach for generating universal perturbations. In follow-up work, they show that data-independent approaches achieve fooling rates similar to those reported by Moosavi-Dezfooli et al. Khrulkov and Oseledets show a connection between universal perturbations and singular vectors. Hayes and Danezis proposed a generative model that can be trained to generate a diverse set of universal perturbations.
An analysis of universal perturbations and their properties is provided by Moosavi-Dezfooli et al. They connect the robustness to universal perturbations with the geometry of the decision boundary and prove the existence of small universal perturbations provided the decision boundary is systematically positively curved. Jetley et al. build upon this work and provide evidence that directions in which a classifier is vulnerable to universal perturbations coincide with directions important for correct prediction on unperturbed data. They conclude that predictive power and adversarial vulnerability are closely intertwined. Moosavi-Dezfooli et al. investigated whether fine-tuning the network on a modified dataset, where precomputed universal perturbations have been added to a fraction of the training samples, increases robustness. They observe a small increase in robustness against universal perturbations; however, the network remained vulnerable to universal perturbations on most inputs. Perolat et al. propose a related approach based on approximate fictitious play. We propose a method which computes shared perturbations on each mini-batch and uses them in adversarial training, i.e., the shared perturbations are computed on-the-fly rather than precomputed as in prior work [28, 36].
Alternative approaches for defending against universal perturbations are based on adding components to the model: Ruan and Dai proposed to identify and reject universal perturbations by adding a shadow classifier, while Akhtar et al. proposed to prepend a subnetwork in front of the model that compensates for an added universal perturbation by detecting and rectifying it. Both methods have the disadvantage that the model becomes larger and thus inference more costly. More critically, both assume that the adversary is not aware of the defense mechanism, and it is unclear whether a more powerful adversary could fool the defense mechanism itself.
2.2 Adversarial Perturbations for Semantic Image Segmentation
Methods for generating adversarial perturbations have been extended to structured and dense prediction tasks like semantic segmentation and object detection [14, 44, 6]. It has been found that models for these tasks are not considerably more robust to adversarial perturbations than models for image classification. Metzen et al. even showed the existence of universal perturbations for semantic image segmentation which are quasi-imperceptible for humans but result in an arbitrary target segmentation of the scene that has nothing in common with the scene a human perceives. A comparison of the robustness of different network architectures for semantic image segmentation has been conducted by Arnab et al.: they found that residual connections and multiscale processing actually increase the robustness of an architecture, while mean-field inference for dense conditional random fields only makes gradient-based attacks harder, since it masks gradients but does not increase robustness itself. In contrast to their work, we focus on modifying the training procedure rather than the network architecture for increasing robustness. Both approaches could be combined in the future.
In this section, we introduce basic terms relevant for this work and establish some of their properties. We aim to defend against an adversary under specific attack settings; please refer to Section A.1 in the supplementary material for details on the capabilities of the adversary and the threat model.
Let $L$ be a loss function (categorical cross-entropy throughout this work), $\mathcal{D}$ be a data distribution, and $\theta$ be the parameters of a parametric model. We denote the expected risk of $\theta$ as $\rho(\theta) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[L(\theta, x, y)]$. The following risks are relevant for this work (we extend the definitions of Uesato et al.):
Adversarial Risk: $\rho_{\mathrm{adv}}(\theta) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\max_{\xi \in S} L(\theta, x+\xi, y)]$
Universal Adversarial Risk: $\rho_{\mathrm{uni}}(\theta) = \max_{\xi \in S} \mathbb{E}_{(x,y)\sim\mathcal{D}}[L(\theta, x+\xi, y)]$
Here, $\xi(x)$ denotes an adversarial perturbation, $\xi$ a universal perturbation, and $x+\xi$ an adversarial example. The set $S$ defines the space from which perturbations may be chosen. The following inequalities hold for the risks: $\rho(\theta) \le \rho_{\mathrm{uni}}(\theta) \le \rho_{\mathrm{adv}}(\theta)$. Please refer to Section A.2 for a justification of these inequalities.
For the special case of the 0-1 loss, a $d$-dimensional input $x$, and $S_\varepsilon = \{\xi : \|\xi\|_\infty \le \varepsilon\}$, we define the adversarial robustness $\hat{\rho}_{\mathrm{adv}}$ as the smallest perturbation magnitude that results in an adversarial risk (misclassification rate) of at least $\delta$. More formally:
$\hat{\rho}_{\mathrm{adv}} = \min\{\varepsilon \ge 0 : \rho_{\mathrm{adv}}(\theta) \ge \delta \text{ for } S = S_\varepsilon\}$
In other words, there are perturbations $\xi(x)$ with $\|\xi(x)\|_\infty \le \hat{\rho}_{\mathrm{adv}}$ that result in a misclassification rate of at least $\delta$. Analogously, we can also define the universal robustness as
$\hat{\rho}_{\mathrm{uni}} = \min\{\varepsilon \ge 0 : \rho_{\mathrm{uni}}(\theta) \ge \delta \text{ for } S = S_\varepsilon\}$
Here, a single perturbation $\xi$ with $\|\xi\|_\infty \le \hat{\rho}_{\mathrm{uni}}$ exists that results in a misclassification rate of at least $\delta$. We have $\hat{\rho}_{\mathrm{adv}} \le \hat{\rho}_{\mathrm{uni}}$.
We define an adversary as a function $\xi = \mathrm{adv}(x, \theta)$, which maps a data point $x$ and model parameters $\theta$ onto a perturbation $\xi \in S$. Let the objective of the adversary be finding a perturbation which maximizes a loss function $L_{\mathrm{adv}}$. (We note that one may choose $L_{\mathrm{adv}} = L$, or one may also choose, e.g., $L$ to be the 0-1 loss and $L_{\mathrm{adv}}$ to be a differentiable surrogate loss.) A lower bound of the adversarial risk is:
$\rho_{\mathrm{adv}}(\theta) \ge \mathbb{E}_{(x,y)\sim\mathcal{D}}[L(\theta, x + \mathrm{adv}(x,\theta), y)]$
A heap adversary is a function $\xi = \mathrm{adv}_h(\{x^{(i)}\}, \theta)$, which maps a set of data points and model parameters onto a single perturbation $\xi \in S$. The loss function of a heap adversary is defined as:
$L_h(\theta, \{x^{(i)}\}, \{y^{(i)}\}, \xi) = \frac{1}{s}\sum_{i=1}^{s} L_{\mathrm{adv}}(\theta, x^{(i)} + \xi, y^{(i)})$
A universal adversary evaluates generalization of $\xi$ by tracking $L_h$ on unseen inputs, and we denote such an adversary by $\xi = \mathrm{adv}_{\mathrm{uni}}(\theta)$. We can lower bound the universal adversarial risk by:
$\rho_{\mathrm{uni}}(\theta) \ge \mathbb{E}_{(x,y)\sim\mathcal{D}}[L(\theta, x + \mathrm{adv}_{\mathrm{uni}}(\theta), y)]$
The lower bound will typically become tighter if $L_{\mathrm{adv}}$ is close to $L$, the set of inputs used by the adversary is large, and $\mathrm{adv}_{\mathrm{uni}}$ is powerful. While different options for $\mathrm{adv}$ and $\mathrm{adv}_{\mathrm{uni}}$ exist [15, 30, 5, 28, 31], we focus on projected gradient descent (PGD) [24, 21], as it provides in our experience a good trade-off between being computationally efficient and being powerful. PGD initializes $\xi^{(0)}$ uniformly at random in $S$ (or a subset of $S$) and then performs iterations of the following update:
$\xi^{(k+1)} = \Pi_S\big(\xi^{(k)} + \alpha^{(k)} \operatorname{sign}(\nabla_\xi L_{\mathrm{adv}}(\theta, x + \xi^{(k)}, y))\big)$
where $\Pi_S$ denotes a projection onto the space $S$ and $\alpha^{(k)}$ denotes a step size. Similarly, a targeted attack where the model shall output the target class $y^{\mathrm{target}}$ can be obtained via the following update rule:
$\xi^{(k+1)} = \Pi_S\big(\xi^{(k)} - \alpha^{(k)} \operatorname{sign}(\nabla_\xi L_{\mathrm{adv}}(\theta, x + \xi^{(k)}, y^{\mathrm{target}}))\big)$
Moreover, this procedure can also be turned into a heap adversary by replacing $L_{\mathrm{adv}}$ by $L_h$. If the number of data points in the heap is large (which is typically required for approximating universal perturbations), one can employ stochastic PGD (S-PGD), where in every iteration a subset of the data points is sampled and $L_h$ is only evaluated on this subset.
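To make the update rule above concrete, the following is a minimal NumPy sketch of untargeted PGD on a toy binary logistic model with an analytic input gradient (our own illustration, not the paper's Cleverhans-based implementation; the linear model `w` and the values of `eps` and `alpha` are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_loss(x, y, w):
    # L_adv: binary logistic loss for label y in {-1, +1} and linear weights w
    return np.log1p(np.exp(-y * (w @ x)))

def input_grad(x, y, w):
    # analytic gradient of the loss w.r.t. the input x
    return -y * sigmoid(-y * (w @ x)) * w

def pgd_untargeted(x, y, w, eps, alpha, n_iter, rng):
    # random initialization inside the L-inf ball S = {xi : ||xi||_inf <= eps}
    xi = rng.uniform(-eps, eps, size=x.shape)
    for _ in range(n_iter):
        g = input_grad(x + xi, y, w)
        xi = xi + alpha * np.sign(g)   # gradient-ascent step on L_adv
        xi = np.clip(xi, -eps, eps)    # projection Pi_S onto the ball
    return xi
```

A targeted variant would instead descend on the loss of the target class, matching the second update rule above.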
3.4 Adversarial Training
The objective of adversarial training can now be defined as finding a minimizer of
$\rho_\kappa(\theta) = \kappa\,\mathbb{E}_{(x,y)\sim\mathcal{D}}[L(\theta, x + \mathrm{adv}(x,\theta), y)] + (1-\kappa)\,\rho(\theta)$,
where $\kappa \in [0,1]$ controls the trade-off between robustness and performance on unperturbed inputs. Note that adversarial training depends implicitly on the adversary $\mathrm{adv}$, its loss $L_{\mathrm{adv}}$, and $S$. Any procedure for minimizing the expected risk, such as stochastic gradient descent (SGD), can also be applied to $\rho_\kappa$. We can also define $\rho_\kappa^h$ analogously by replacing $\mathrm{adv}$ by $\mathrm{adv}_h$.
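As a sketch of one SGD step on this convex-combination objective (again on the toy logistic model; `adversary` is a stand-in for any attack such as PGD, and the hyperparameters are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def param_grad(w, x, y):
    # gradient of the binary logistic loss w.r.t. the weights w
    return -y * sigmoid(-y * (w @ x)) * x

def adv_training_step(w, x, y, adversary, kappa=0.5, lr=0.1):
    """One SGD step on kappa * L(w, x + xi, y) + (1 - kappa) * L(w, x, y)."""
    xi = adversary(x, y, w)  # e.g. PGD; returns a perturbation in S
    g = kappa * param_grad(w, x + xi, y) + (1.0 - kappa) * param_grad(w, x, y)
    return w - lr * g
```

With `kappa=0` this reduces to standard empirical risk minimization, with `kappa=1` to pure adversarial training.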
Prior procedures for fine-tuning a network for robustness against universal perturbations are variants of the following approach: define a distribution over (approximately optimal) universal perturbations for a model with parameters $\theta$ (either by precomputing and randomly sampling perturbations, by learning a generative model, or by collecting an increasing set of universal perturbations for model checkpoints during training), fine-tune the model parameters to become robust against this distribution of universal perturbations, and (optionally) iterate. This procedure increases robustness against universal perturbations slightly, but not to a satisfying level. This is probably caused by the model overfitting to the fixed distribution of universal perturbations, which does not change during the optimization of $\theta$. However, re-computing universal perturbations anew in every mini-batch by evaluating $\mathrm{adv}_{\mathrm{uni}}$ is prohibitively expensive. In this work, we propose a procedure that can be performed efficiently and updates the perturbations used in adversarial training in every mini-batch.
In this section, we introduce shared adversarial training, an extension of adversarial training that aims at maximizing robustness against universal perturbations.
4.1 Shared Adversarial Training
We note that if one is interested in minimizing the universal adversarial risk $\rho_{\mathrm{uni}}$, then for $\kappa = 1$, using $\mathrm{adv}$ (or $\mathrm{adv}_h$) in adversarial training corresponds to minimizing an upper bound of $\rho_{\mathrm{uni}}$ because $\rho_{\mathrm{uni}}(\theta) \le \rho_{\mathrm{adv}}(\theta)$, provided that the adversaries find perturbations that are sufficiently close to the optimal perturbations. On the other hand, standard empirical risk minimization (ERM, $\kappa = 0$), which minimizes the empirical estimate of $\rho(\theta)$, corresponds to minimizing a lower bound. As shown in previous work [15, 30, 5], this confers only little robustness against (universal) perturbations. For $0 < \kappa < 1$, adversarial training corresponds to minimizing a convex combination of the upper bound $\rho_{\mathrm{adv}}$ and the lower bound $\rho$. As we show in Section 5, this standard version of adversarial training already provides strong robustness against universal perturbations, at the cost of reducing performance on unperturbed data considerably.
However, $\rho_{\mathrm{adv}}$ is a very loose upper bound of $\rho_{\mathrm{uni}}$, and it would be desirable to use a better approximation of $\rho_{\mathrm{uni}}$ in adversarial training. Directly employing $\mathrm{adv}_{\mathrm{uni}}$ in adversarial training is infeasible, since evaluating $\mathrm{adv}_{\mathrm{uni}}$ in every mini-batch is prohibitively expensive.
We propose instead to use a weaker adversary in place of $\mathrm{adv}_{\mathrm{uni}}$. We split a mini-batch consisting of $n$ data points into $n/s$ heaps of size $s$ (we denote $s$ as the sharedness). Rather than using $\mathrm{adv}$ for computing a perturbation on each of the $n$ data points separately, we employ a heap adversary $\mathrm{adv}_h$ (see Section 3.3) for computing $n/s$ shared perturbations on the heaps (subsets of the mini-batch), and then broadcast these perturbations to all $n$ data points by repeating each of the shared perturbations $s$ times, once for each element of its heap; this defines a risk $\rho^s$. We propose to use $\rho^s$ in adversarial training with the aim of increasing robustness against universal perturbations, and denote the resulting procedure as shared adversarial training. This entire process is illustrated in Figure 2. We can obtain the following relationship for $1 \le s \le n$ (please refer to Section A.3 for more details):
$\rho(\theta) \le \rho_{\mathrm{uni}}(\theta) \le \rho^{n}(\theta) \le \rho^{s}(\theta) \le \rho^{1}(\theta) = \rho_{\mathrm{adv}}(\theta)$
Note that while all $\rho^s$ are upper bounds on the universal risk $\rho_{\mathrm{uni}}$, this does not imply that shared perturbations are strong universal perturbations. On the contrary, the smaller $s$, the more "overfit" the shared perturbations are to the respective heap (there is no requirement that a shared perturbation generalize to unseen data points, as there is for universal perturbations). Moreover, $\rho^s$ with $s > 1$ is typically a much tighter bound on $\rho_{\mathrm{uni}}$ than $\rho_{\mathrm{adv}}$, but can be computed as efficiently as $\rho_{\mathrm{adv}}$. As discussed in Section 3.3, we can employ PGD on $L_h$ rather than $L_{\mathrm{adv}}$. By appropriately reshaping and broadcasting perturbations, we can compute the shared perturbations on the respective heaps of the mini-batch jointly by PGD, with essentially the same cost as computing adversarial perturbations with PGD.
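The reshape-and-broadcast trick can be sketched as follows (again a toy NumPy illustration on the binary logistic model from above, not the paper's implementation): a mini-batch of n inputs is split into n/s heaps, one perturbation is kept per heap, and each PGD step aggregates the per-example input gradients within a heap:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def shared_pgd(x, y, w, eps, alpha, n_iter, s, rng):
    """Shared PGD: one perturbation per heap of size s; x has shape (n, d)."""
    n, d = x.shape
    assert n % s == 0
    xi = rng.uniform(-eps, eps, size=(n // s, d))     # one xi per heap
    for _ in range(n_iter):
        xi_full = np.repeat(xi, s, axis=0)            # broadcast to the batch
        margins = y * ((x + xi_full) @ w)             # per-example margins
        g = (-y * sigmoid(-margins))[:, None] * w     # per-example input grads
        g_heap = g.reshape(n // s, s, d).sum(axis=1)  # aggregate within heaps
        xi = np.clip(xi + alpha * np.sign(g_heap), -eps, eps)
    return np.repeat(xi, s, axis=0)                   # perturbation per example
```

With `s=1` this reduces to standard per-example PGD; with `s=n` a single perturbation is shared across the whole mini-batch.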
4.2 Adversarial Loss Function
We recall that the heap adversary maximizes $L_h$, the loss over all elements of the heap under a single perturbation $\xi$. Because of the limited capacity of the perturbation ($\xi \in S$), there is "competition" between the data points: the per-example maximizers of $L_{\mathrm{adv}}$ will typically differ, and the data points will "pull" $\xi$ into different directions. Because of this, using the categorical cross-entropy as a proxy for the 0-1 loss is problematic for untargeted adversaries: since we are maximizing the loss and the categorical cross-entropy has no upper bound, there is a winner-takes-all tendency where the perturbation is chosen such that it leads to highly confident misclassifications on some data points and to correct classifications on other data points (this yields a higher loss than misclassifying more data points with lower confidence).
To prevent this, we employ loss thresholding on the categorical cross-entropy to enforce an upper bound on $L_{\mathrm{adv}}$: $L_{\mathrm{adv}} = \min(L_{\mathrm{CE}}, \lambda)$ for a threshold $\lambda$, which corresponds to not encouraging the adversary to reduce the confidence of the correct class below a corresponding probability. Besides, we also incorporate label smoothing and use the soft targets for the computation of the loss in all our experiments.
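A possible implementation of the thresholded, label-smoothed cross-entropy (a sketch; the threshold `max_loss` and smoothing factor `eps_smooth` below are illustrative values, not the paper's settings):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def thresholded_smoothed_ce(logits, y, max_loss=2.3, eps_smooth=0.1):
    """Cross-entropy against label-smoothed targets, clipped at max_loss so
    the adversary gains nothing from pushing correct-class confidence to 0."""
    n, k = logits.shape
    targets = np.full((n, k), eps_smooth / (k - 1))   # soft targets
    targets[np.arange(n), y] = 1.0 - eps_smooth
    ce = -(targets * np.log(softmax(logits) + 1e-12)).sum(axis=-1)
    return np.minimum(ce, max_loss)  # loss thresholding: upper bound on L_adv
```

Because the clipped loss is bounded, a single highly confident misclassification can no longer dominate the heap loss over many moderate misclassifications.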
5 Experimental Results
We present experimental results of the robustness against universal perturbations achieved by shared adversarial training for image classification and semantic segmentation.
5.1 General Setup
For shared adversarial training, we extended the PGD implementation of Cleverhans such that it supports shared adversarial perturbations and loss clipping as discussed in Section 4. For evaluation, we extended Foolbox such that universal perturbations with minimal norm can be computed that achieve a misclassification rate of at least $\delta$. Since this evaluation only provides an upper bound on the actual robustness, we tuned the PGD adversary as follows to make it more powerful (and thus the upper bound tighter): we performed a binary search over $\varepsilon$, i.e., the bound on the $\ell_\infty$ norm of the perturbation. In every binary-search iteration, we ran the attack with an annealed step-size schedule. If a perturbation with a misclassification rate of at least $\delta$ is found in an iteration, the next iteration of the binary search continues on the lower half of the interval for $\varepsilon$; otherwise it continues on the upper half. The reported robustness is the smallest perturbation found in the entire procedure that achieves a misclassification rate of at least $\delta$. Note that this procedure was only used for evaluation; for training we used a predefined $\varepsilon$ and a constant step size.
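The evaluation loop can be sketched as a generic binary search (a hedged illustration; the hypothetical predicate `attack_succeeds` stands in for running the tuned S-PGD adversary at a given budget and checking the misclassification rate):

```python
def min_eps_binary_search(attack_succeeds, eps_lo, eps_hi, n_iter=10):
    """Smallest eps in [eps_lo, eps_hi] for which the attack reaches the
    target misclassification rate; returns eps_hi if it never succeeds."""
    best = eps_hi
    for _ in range(n_iter):
        mid = 0.5 * (eps_lo + eps_hi)
        if attack_succeeds(mid):
            best, eps_hi = mid, mid   # success: search smaller budgets
        else:
            eps_lo = mid              # failure: search larger budgets
    return best
```

Each iteration halves the remaining interval, so the reported robustness converges geometrically to the smallest successful budget.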
5.2 Experiments on CIFAR10
We present results on CIFAR10 for a ResNet20 with 64-128-256 feature maps per stage. We pretrained this network with standard regularized empirical risk minimization (ERM) and measured its accuracy on clean data and its robustness against universal perturbations as the undefended baseline. (For evaluating robustness, we generate perturbations using stochastic PGD with mini-batches from the CIFAR10 validation set and a held-out validation set drawn from the CIFAR10 test set, using binary-search and S-PGD iterations with an annealed step-size schedule.)
In general, we are interested in models that increase robustness without decreasing the accuracy on clean data considerably. We consider this a multi-objective problem with two objectives (accuracy and robustness). In order to approximate the Pareto front of different variants of adversarial and shared adversarial training (varying sharedness $s$), we conducted runs over a range of attack parameters: the maximum perturbation strength $\varepsilon$ and $\kappa$ (controlling the trade-off between expected and adversarial risk). We performed 4 steps of PGD with a fixed step size. Model fine-tuning was performed with 65 epochs of SGD with batch size 128, a momentum term, and an initial learning rate that was annealed after 50 epochs by a factor of 10.
Figure 3 (left) shows the resulting Pareto fronts for the different sharedness values. While standard adversarial training ($s = 1$) and the intermediate sharedness perform similarly, the largest sharedness strictly dominates the other two settings: it achieves higher robustness without any loss of accuracy, and accepting a small drop in accuracy yields nearly three times the robustness of the undefended model. We would like to note that standard adversarial training is also surprisingly effective in defending against universal perturbations, achieving only slightly smaller robustness than shared adversarial training at the same level of accuracy on unperturbed data.
We also evaluated the performance of the defenses against universal perturbations proposed by Moosavi-Dezfooli et al. and by Perolat et al. (please refer to Section A.4 in the supplementary material for details). The results are also shown in Figure 3 (left). It can be seen that these defenses are strictly dominated by all variants of (shared) adversarial training. In terms of computation, shared adversarial training required 189s per epoch on average, while the two baseline defenses required 3118s and 3840s, respectively. In summary, the proposed method outperforms the baseline defenses both in terms of computation and with regard to the robustness-accuracy trade-off.
Figure 3 (right) shows the Pareto front of the same models when attacked by the DeepFool-based method for generating universal perturbations. In this case, the robustness is computed for a fixed perturbation magnitude and the accuracy under this perturbation is reported. The qualitative results are the same as for the S-PGD attack: the Pareto front of adversarial training (s = 1) clearly dominates the results achieved by the defense of Moosavi-Dezfooli et al. Moreover, shared adversarial training with s = 64 dominates both standard adversarial training and the defense proposed by Perolat et al. This indicates that the increased robustness from shared adversarial training is not specific to the way the attacker generates universal perturbations. An illustration of the universal perturbations is given in Section A.5 in the supplementary material.
5.3 Experiment on a Subset of ImageNet
The results on CIFAR10 have demonstrated that the proposed method outperforms previous defenses and dominates standard adversarial training in terms of the robustness/accuracy trade-off. We extend our experiments to a subset of ImageNet, which has more classes and higher-resolution inputs than CIFAR10. Please refer to Section A.6 in the supplementary material for details on the selection of this subset of ImageNet. We pre-trained a wide residual network (WRN-50-2-bottleneck) on this dataset with ERM, using SGD for 100 epochs with an initial learning rate of 0.1 that was reduced by a factor of 10 every 30 epochs. We obtained a top-1 accuracy of 77.57% on unperturbed validation data and measured the baseline robustness against universal perturbations. (Similar to CIFAR10, we evaluate robustness using stochastic PGD, but generate the perturbations on the training set with mini-batches and evaluate on the full validation set, using binary-search and S-PGD iterations with an annealed step-size schedule.)
We approximate the Pareto front of adversarial and shared adversarial training with sharedness 32 for different values of $\varepsilon$ and $\kappa$. We performed 5 steps of PGD with a fixed step size. The model was fine-tuned for 30 epochs of SGD with batch size 128, a momentum term, weight decay, and an initial learning rate that was reduced by a factor of 10 after 20 epochs.
Figure 4 compares the Pareto front of shared adversarial training with sharedness 32 and standard adversarial training. It can be clearly seen that shared adversarial training increases robustness without any loss of accuracy. Moreover, shared adversarial training also dominates standard adversarial training for target accuracies between 67% and 74%, which corresponds to the sweet spot where a small loss in accuracy allows a large increase in robustness. The point with accuracy 72.74% can be considered a good trade-off, as accuracy drops by only 5% while robustness increases by a factor of 3, which results in clearly perceptible perturbations, as shown in the top row of Figure 1 and in Section A.7. Moreover, shared adversarial training also substantially increases the entropy of the predicted class distribution under successful untargeted perturbations (see Section A.8).
5.4 Semantic Image Segmentation
The results of the previous experiments have shown that shared adversarial training allows improving the robustness against universal perturbations on image classification tasks where the adversary aims to fool the classifier’s single decision on an input. In this section, we investigate shared adversarial training against adversaries in a dense prediction task (semantic image segmentation), where the adversary aims at fooling the classifier on many decisions. To our knowledge, this is the first work to scale defenses based on adversarial training to semantic image segmentation.
We evaluate the proposed method on the Cityscapes dataset. For computational reasons, all images and labels were downsampled, using bilinear interpolation for the images and a nearest-neighbor approach for the labels. We pretrained the FCN-8s network architecture for semantic image segmentation on the whole training set and achieved 49.3% class-wise intersection-over-union (IoU) on the validation set. Note that this IoU is relatively low because of the downsampling of the images.
We follow the experimental setup of Metzen et al.: we perform a targeted attack with a fixed target scene (monchengladbach_000000_026602_gtFine) and consider an attack successful if the average pixel-wise accuracy between the prediction on the perturbed images and the target segmentation exceeds a fixed threshold. We find a universal perturbation that upper bounds the robustness of the undefended model. (For evaluating robustness, we generate perturbations using stochastic PGD with mini-batches from the Cityscapes validation set and a held-out set from the Cityscapes test set, using binary-search and S-PGD iterations with an annealed step-size schedule; we did not employ loss thresholding for targeted attacks.)
We fine-tuned this model with adversarial and shared adversarial training. Since approximating the entire Pareto front of both methods would have been computationally very expensive, we instead selected a target performance on unperturbed data of roughly 45% IoU (only slightly worse than the undefended model). Two settings, one for adversarial training and one for shared adversarial training, achieved this target performance (see Figure 5, left). (Training was performed with 20 epochs of Adam with batch size 5 and a learning rate that was annealed after 15 epochs.) As the heap adversary, we performed 5 steps of untargeted PGD with a fixed step size.
While both methods achieved very similar performance on unperturbed data, they show very different robustness against adversarial and universal perturbations (see Figure 5): standard adversarial training strongly increases robustness against adversarial perturbations, by a factor of 4 compared to the undefended model. Shared adversarial training is less effective against adversarial perturbations. However, shared adversarial training is more effective against universal perturbations, where it achieves a higher upper bound on robustness than standard adversarial training. We also evaluated robustness against untargeted attacks: robustness increased over the undefended model both for the model trained with adversarial training and for the model trained with shared adversarial training. We refer to Section A.9 and Section A.10 in the supplementary material for illustrations of targeted and untargeted universal perturbations for the different models. The universal perturbation for the model trained with shared adversarial training clearly shows patterns of the target scene and dominates the original image, which is also depicted in the bottom row of Figure 1.
Results shown in Figure 5 indicate that there may exist a trade-off between robustness against image-dependent adversarial perturbations and robustness against universal perturbations. Figure 6 illustrates why these two kinds of robustness are not strictly related: adversarial perturbations fool a classifier both by adding structure from the target scene/class to the image (e.g., a street light in the middle, vegetation on the middle-left part of the image) and by destroying properties of the original scene (e.g., edges of the windowsills). (For untargeted attacks, the attack may choose a target scene/class arbitrarily such that fooling the model becomes as simple as possible.) The latter is not possible for universal perturbations, since the input images are not known in advance. As also shown in the figure, universal perturbations compensate for this by adding stronger patterns of the target scene. Shared perturbations become more similar to universal perturbations with increasing sharedness, since a single shared perturbation has fixed capacity and cannot destroy properties of arbitrarily many input images (even if they are known in advance). Accordingly, shared adversarial training makes the model more robust mostly against perturbations that add new structures, and not against perturbations that destroy existing structure. Hence, it results in less robustness against image-specific adversarial perturbations, as seen in Figure 5 (middle). On the other hand, since shared adversarial training focuses on one specific kind of perturbation (those that add structure to the scene), it leads to models that are particularly robust against universal perturbations, as shown in Figure 5 (right).
6 Conclusion

We have shown that adversarial training is surprisingly effective in defending against universal perturbations. However, there is a trade-off between robustness against universal perturbations and performance on unperturbed data points (as well as robustness against adversarial perturbations). Since adversarial training does not explicitly optimize for robustness against universal perturbations, it handles this trade-off suboptimally. We have proposed shared adversarial training, which performs adversarial training on a tight upper bound of the universal adversarial risk. We have shown on CIFAR10 and a subset of ImageNet that it achieves high robustness against universal perturbations at a smaller loss of accuracy. The proposed method also scales to semantic image segmentation on high-resolution images, where, compared to adversarial training, it achieves higher robustness against universal perturbations at the same level of performance on unperturbed images.
References
-  N. Akhtar, J. Liu, and A. Mian. Defense against Universal Adversarial Perturbations. arXiv:1711.05929 [cs], Nov. 2017.
-  A. Arnab, O. Miksik, and P. H. S. Torr. On the Robustness of Semantic Segmentation Models to Adversarial Attacks. arXiv:1711.09856 [cs], Nov. 2017.
-  A. Athalye, N. Carlini, and D. Wagner. Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. arXiv:1802.00420 [cs], Feb. 2018.
-  A. Athalye and I. Sutskever. Synthesizing Robust Adversarial Examples. arXiv:1707.07397 [cs], July 2017.
-  N. Carlini and D. Wagner. Towards Evaluating the Robustness of Neural Networks. In IEEE Symposium on Security and Privacy (SP), May 2017.
-  M. Cisse, Y. Adi, N. Neverova, and J. Keshet. Houdini: Fooling Deep Structured Prediction Models. In Advances in Neural Information Processing Systems (NIPS) 30, 2017.
-  M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier. Parseval Networks: Improving Robustness to Adversarial Examples. In Proceedings of the 34th International Conference on Machine Learning (ICML), Aug. 2017.
-  M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Computer Vision and Pattern Recognition (CVPR), Las Vegas, Nevada, USA, 2016.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. Ieee, 2009.
-  I. Evtimov, K. Eykholt, E. Fernandes, T. Kohno, B. Li, A. Prakash, A. Rahmati, and D. Song. Robust Physical-World Attacks on Machine Learning Models. In arXiv:1707.08945 [cs], July 2017.
-  A. Fawzi, S.-M. Moosavi-Dezfooli, and P. Frossard. Robustness of classifiers: from adversarial to random noise. In Advances in Neural Information Processing Systems (NIPS) 29, pages 1632–1640, 2016.
-  A. Fawzi, S.-M. Moosavi-Dezfooli, and P. Frossard. A Geometric Perspective on the Robustness of Deep Networks. IEEE Signal Processing Magazine, 2017.
-  A. Fawzi, S.-M. Moosavi-Dezfooli, P. Frossard, and S. Soatto. Classification regions of deep neural networks. arXiv:1705.09552 [cs, stat], May 2017.
-  V. Fischer, M. C. Kumar, J. H. Metzen, and T. Brox. Adversarial Examples for Semantic Image Segmentation. In Workshop of International Conference on Learning Representations (ICLR), Mar. 2017.
-  I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and Harnessing Adversarial Examples. In International Conference on Learning Representations (ICLR), 2015.
-  J. Hayes and G. Danezis. Learning Universal Adversarial Perturbations with Generative Models. arXiv:1708.05207 [cs, stat], Aug. 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. Computer Vision and Pattern Recognition (CVPR), 2016.
-  S. Jetley, N. A. Lord, and P. H. S. Torr. With Friends Like These, Who Needs Adversaries? arXiv:1807.04200 [cs], July 2018.
-  V. Khrulkov and I. Oseledets. Art of singular vectors and universal adversarial perturbations. arXiv:1709.03582 [cs], Sept. 2017.
-  A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Master’s thesis, University of Toronto, 2009.
-  A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world. International Conference on Learning Representations (Workshop), Apr. 2017.
-  A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial Machine Learning at Scale. In International Conference on Learning Representations (ICLR), 2017.
-  J. Long, E. Shelhamer, and T. Darrell. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of Computer Vision and Pattern Recognition (CVPR), Boston, 2015.
-  A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards Deep Learning Models Resistant to Adversarial Attacks. In International Conference on Learning Representations (ICLR), 2018.
-  J. H. Metzen, T. Genewein, V. Fischer, and B. Bischoff. On Detecting Adversarial Perturbations. In International Conference on Learning Representations (ICLR), 2017.
-  J. H. Metzen, M. C. Kumar, T. Brox, and V. Fischer. Universal Adversarial Perturbations Against Semantic Image Segmentation. International Conference on Computer Vision (ICCV), Oct. 2017.
-  T. Miyato, S.-i. Maeda, M. Koyama, and S. Ishii. Virtual Adversarial Training: a Regularization Method for Supervised and Semi-supervised Learning. arXiv:1704.03976 [cs, stat], Apr. 2017.
-  S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial perturbations. In IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Hawaii, USA, 2017.
-  S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, P. Frossard, and S. Soatto. Robustness of classifiers to universal perturbations: A geometric perspective. International Conference on Learning Representations (ICLR), 2018.
-  S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. DeepFool: a simple and accurate method to fool deep neural networks. In Computer Vision and Pattern Recognition (CVPR), Las Vegas, Nevada, USA, 2016.
-  K. R. Mopuri, A. Ganeshan, and R. V. Babu. Generalizable Data-free Objective for Crafting Universal Adversarial Perturbations. arXiv:1801.08092 [cs], Jan. 2018.
-  K. R. Mopuri, U. Garg, and R. V. Babu. Fast Feature Fool: A data independent approach to universal adversarial perturbations. arXiv:1707.05572 [cs], July 2017.
-  N. Papernot, I. Goodfellow, R. Sheatsley, R. Feinman, and P. McDaniel. cleverhans v1.0.0: an adversarial machine learning library. arXiv preprint arXiv:1610.00768, 2016.
-  N. Papernot and P. McDaniel. Extending Defensive Distillation. arXiv:1705.05264 [cs, stat], May 2017.
-  N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami. Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks. In Proceedings of the 37th IEEE Symposium on Security & Privacy, pages 582–597, San Jose, CA, 2016.
-  J. Perolat, M. Malinowski, B. Piot, and O. Pietquin. Playing the Game of Universal Adversarial Perturbations. arXiv:1809.07802 [cs, stat], Sept. 2018.
-  J. Rauber, W. Brendel, and M. Bethge. Foolbox v0.8.0: A python toolbox to benchmark the robustness of machine learning models. arXiv preprint arXiv:1707.04131, 2017.
-  Y. Ruan and J. Dai. TwinNet: A Double Sub-Network Framework for Detecting Universal Adversarial Perturbations. Future Internet, 10(3), Mar. 2018.
-  M. Sharif, S. Bhagavatula, L. Bauer, and M. K. Reiter. Accessorize to a Crime: Real and Stealthy Attacks on State-of-the-Art Face Recognition. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS ’16, pages 1528–1540, New York, NY, USA, 2016. ACM.
-  C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014.
-  T. Tanay and L. Griffin. A Boundary Tilting Perspective on the Phenomenon of Adversarial Examples. arXiv:1608.07690 [cs, stat], Aug. 2016.
-  F. Tramèr, A. Kurakin, N. Papernot, D. Boneh, and P. McDaniel. Ensemble Adversarial Training: Attacks and Defenses. In International Conference on Learning Representations (ICLR), 2018.
-  J. Uesato, B. O’Donoghue, A. v. d. Oord, and P. Kohli. Adversarial Risk and the Dangers of Evaluating Against Weak Attacks. arXiv:1802.05666 [cs, stat], Feb. 2018.
-  C. Xie, J. Wang, Z. Zhang, Y. Zhou, L. Xie, and A. Yuille. Adversarial Examples for Semantic Segmentation and Object Detection. In International Conference on Computer Vision (ICCV), 2017.
-  S. Zagoruyko and N. Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
-  S. Zheng, Y. Song, T. Leung, and I. Goodfellow. Improving the Robustness of Deep Neural Networks via Stability Training. In Computer Vision and Pattern Recognition (CVPR), 2016.
Appendix A Supplementary material
A.1 Threat Model
Here, we specify the capabilities of the adversary, since the proposed defense mechanism aims at providing security under a specific threat model. We assume a white-box setting, in which the adversary has full information about the model: it knows the network architecture and weights, and can provide arbitrary inputs to the model and observe the corresponding outputs (and loss gradients). Moreover, we assume that the attacker can modify every pixel of the input arbitrarily but aims at keeping the norm of this perturbation minimal. In the case of a universal perturbation, we assume that the attacker can choose an arbitrary perturbation (again aiming to keep its norm minimal) but, crucially, does not know the inputs to which this perturbation will be applied. The adversary does, however, have access to data points sampled from the same data distribution as the future inputs.
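The essence of this threat model can be sketched in a few lines. The snippet below is a deliberately minimal stand-in, not the paper's attack: a toy linear classifier plays the white-box model (the attacker knows its weights `W`, `b`), a single targeted perturbation is found by projected gradient ascent, and, per the threat model, it is then evaluated on inputs the attacker never saw.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear model standing in for the white-box classifier; under the
# threat model above, the attacker knows its weights exactly.
W = rng.normal(size=(2, 8))
b = np.zeros(2)

def predict(x):
    return (x @ W.T + b).argmax(axis=1)

# One perturbation, targeted at class 1, found by projected gradient
# ascent on the class-1 margin; for a linear model the input gradient
# of that margin is the constant W[1] - W[0].
eps, lr = 0.5, 0.1
delta = np.zeros(8)
for _ in range(100):
    delta = np.clip(delta + lr * (W[1] - W[0]), -eps, eps)  # L_inf projection

# Key restriction of the threat model: delta is fixed *before* the
# inputs are known, so it is evaluated on freshly sampled points.
x_test = rng.normal(size=(200, 8))
clean = predict(x_test)
fooled = (predict(x_test + delta) != clean).mean()
print(f"fraction of unseen inputs fooled: {fooled:.2f}")
```

The final line is the quantity the threat model cares about: how often one fixed perturbation flips decisions on inputs drawn later from the same distribution.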
A.2 Inequalities of the Risks
To check the validity of the inequalities, note that for the first one we may set all perturbations of the less constrained problem equal to the optimal perturbation of the more constrained one; larger feasible sets can only increase the resulting risk. For the second inequality, the same restriction argument applies, and thus the less constrained risk can only be larger in general.
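The displayed inequalities themselves are missing from this excerpt. Consistent with the conclusion's statement that the shared adversarial risk is a tight upper bound on the universal adversarial risk, they presumably take the following form, where the symbols $\rho^{\mathrm{uni}}$, $\rho^{\mathrm{shared}}$, $\rho^{\mathrm{adv}}$ for the universal, shared, and (per-input) adversarial risks and the sharedness parameter $s$ are assumed notation, not reproduced from the paper:

```latex
\rho^{\mathrm{uni}}(\theta) \;\le\; \rho^{\mathrm{shared}}(\theta, s) \;\le\; \rho^{\mathrm{adv}}(\theta)
```

The first bound follows by setting all heap perturbations equal to a single universal one; the second by setting each per-input perturbation equal to its heap's shared perturbation, since per-input perturbations subsume shared ones.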
A.3 Relationship of Different Sharedness
Provided that the heap adversary finds a perturbation that is sufficiently close to the optimal perturbation of the heap, and that heaps are composed hierarchically (heaps are composed hierarchically when a heap of a given sharedness is always the union of two disjoint heaps of half that sharedness), the shared adversarial risk of the larger sharedness is upper-bounded by that of the smaller sharedness (we omit the remaining dependencies for brevity).
To see this, consider the shared perturbations found on the heaps of the larger sharedness. Because of the hierarchical construction, each such heap is composed of two heaps used at the smaller sharedness. By setting the perturbations of these two sub-heaps equal to the perturbation of their union, we obtain a feasible solution of the smaller-sharedness problem with the same risk, which yields the claimed inequality.
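With the original symbols lost in extraction, a plausible reconstruction of the relationship is the following monotonicity statement (notation assumed, matching the risks introduced earlier, with sharedness levels $s_1 \le s_2$):

```latex
\rho^{\mathrm{shared}}(\theta, s_2) \;\le\; \rho^{\mathrm{shared}}(\theta, s_1)
\qquad \text{for } s_1 \le s_2 ,
```

that is, the more inputs must share a perturbation, the more constrained the adversary is, so the risk can only decrease.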
A.4 Configuration of Baselines for CIFAR10
For the defense proposed by Moosavi-Dezfooli et al., we generated 10 different universal perturbations using the DeepFool-based method for generating universal perturbations on 10,000 randomly sampled training images, ran 5 epochs of adversarial training, and chose the applied perturbation uniformly at random from the precomputed perturbations. After these 5 epochs, the robustness was evaluated. This procedure was iterated five times, resulting in 5 accuracy-robustness points in Figure 1.
We ran the defense proposed by Perolat et al. for 45 epochs (sufficiently long to achieve convergence, as evidenced by Figure 4 of Perolat et al.). At the beginning of each episode, we generated one universal perturbation using the DeepFool-based method on the entire training set, and chose the applied perturbation uniformly at random from all universal perturbations computed so far. We report the accuracy and robustness at the end of these 45 epochs. We note that, even though we did not replicate the exact setup of Perolat et al., we achieve a similar accuracy-robustness trade-off in Figure 3 (right) to the one given in Figure 4 of Perolat et al.
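The iterated structure shared by both baselines (alternate between computing a universal perturbation against the current model and training on randomly drawn perturbations from the pool) can be sketched as follows. Everything here is a toy stand-in: a logistic-regression model and a one-step gradient perturbation replace the deep network and the DeepFool-based universal perturbations of the actual baselines.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data and model: logistic regression on linearly separable points.
n, d, eps, lr = 256, 16, 0.3, 0.5
x_train = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y_train = (x_train @ w_true > 0).astype(float)
w = np.zeros(d)

def universal_perturbation(w, x, y, eps):
    """Crude stand-in for DeepFool-based universal perturbations: one
    L_inf-bounded ascent step on the batch-averaged input gradient of
    the logistic loss (one shared direction for all inputs)."""
    p = 1.0 / (1.0 + np.exp(-(x @ w)))
    g = np.mean(p - y) * w  # d(loss)/d(input), averaged over the batch
    return eps * np.sign(g)

perturbations = []
for _ in range(5):                 # one round per precomputed perturbation
    # 1) generate a fresh universal perturbation against the current model
    perturbations.append(universal_perturbation(w, x_train, y_train, eps))
    # 2) a few "epochs" of training; each update applies one perturbation
    #    drawn uniformly at random from the pool computed so far
    for _ in range(50):
        delta = perturbations[rng.integers(len(perturbations))]
        xb = x_train + delta
        p = 1.0 / (1.0 + np.exp(-(xb @ w)))
        w -= lr * ((p - y_train)[:, None] * xb).mean(axis=0)

clean_acc = (((x_train @ w) > 0).astype(float) == y_train).mean()
```

The two baselines differ mainly in where robustness is measured (after each round vs. at the very end) and in how much data the perturbation generator sees; the alternation itself is the same.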
A.5 Illustration of Universal Perturbations on CIFAR10
Figure A1 illustrates the minimal universal perturbations found for the different settings. Universal perturbations of the undefended model resemble high-frequency noise and are quasi-imperceptible when added to an image. Shared adversarial training increases robustness, and the resulting perturbations are more perceptible (for small perturbation budgets) or even dominate the image: for larger budgets, the cat in the figure is completely hidden and the perturbed image could no longer be classified correctly even by a human. Moreover, the perturbation becomes more structured and even object-like for larger budgets. Note that the perturbations shown for standard adversarial training also achieve high robustness, but at a smaller accuracy on clean data than those of shared adversarial training.
A.6 Selection of a Subset of ImageNet
Since generating the Pareto fronts on the entire ImageNet dataset would be computationally very expensive, we restrict the experiment to a subset of ImageNet. We use the class list defined by TinyImageNet to select samples from the full ImageNet dataset. We conducted our experiments on the samples of these 200 classes, resulting in 258,601 training and 10,000 validation images. Note that we take only the list of classes from TinyImageNet; the images themselves come from the full-resolution ImageNet dataset.
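The selection step above amounts to filtering ImageNet's per-synset directories by the TinyImageNet class list. The sketch below assumes the conventional layout (a `train/` directory with one subdirectory per WordNet id, and a `wnids.txt` file with one id per line); these names are assumptions for illustration, not details from the paper.

```python
from pathlib import Path

def select_subset(imagenet_dir, tiny_wnids_file):
    """Keep only the ImageNet synset directories whose WordNet id
    appears in the TinyImageNet class list (one id per line)."""
    wnids = set(Path(tiny_wnids_file).read_text().split())
    train = Path(imagenet_dir, "train")
    # ImageNet training data is conventionally laid out as one
    # directory per synset id, e.g. train/n01443537/*.JPEG
    return sorted(d for d in train.iterdir()
                  if d.is_dir() and d.name in wnids)
```

With a 200-class `wnids.txt`, this yields exactly the 200 training class directories used in the experiments.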
A.7 Illustration of Universal Perturbations on ImageNet
We depict the universal perturbations with minimum magnitude for the different models obtained from standard and shared adversarial training with different settings on the subset of ImageNet in Figure A2. It can be clearly seen that both standard and shared adversarial training increase robustness compared to the undefended model, but the latter handles the trade-off between performance on unperturbed data and robustness more gracefully. The universal perturbations become clearly visible on a model hardened with shared adversarial training with only a marginal loss in top-1 accuracy, and the perturbations become much smoother for larger perturbation budgets.
A.8 Predicted Class after Untargeted Universal Perturbations
Figure A3 shows which class is predicted on ImageNet validation data after an untargeted universal perturbation (computed for the respective model) is added. While the undefended model nearly always predicts the same (wrong) class, the models defended with standard and shared adversarial training have a substantially higher entropy in their predictions. Prior work has also observed that undefended models typically misclassify images perturbed with a universal perturbation to the same class, even though the attack is untargeted. Based on this observation, it was hypothesized that the directions in which a classifier is vulnerable to universal perturbations coincide with directions important for correct prediction on unperturbed data. We believe it would be important to re-examine these results for a defended model.
A.9 Attacks on Semantic Image Segmentation
We illustrate universal perturbations for targeted and untargeted attacks on different models in this section. We show the effect of the perturbations on one image; the perturbations, however, are not specific to this image. For the model trained with empirical risk minimization, Figure A4 shows a targeted attack and Figure A5 an untargeted attack. For the model trained with adversarial training, Figure A6 shows a targeted attack and Figure A8 an untargeted attack. For the model trained with shared adversarial training, Figure A7 shows a targeted attack and Figure A9 an untargeted attack.
A.10 Illustration of Universal Perturbations
We illustrate the universal perturbations found for different models for targeted attacks in Figure A10.