Defending against Universal Perturbations with Shared Adversarial Training

12/10/2018 · Chaithanya Kumar Mummadi, et al. · Bosch, University of Freiburg

Classifiers such as deep neural networks have been shown to be vulnerable to adversarial perturbations on problems with high-dimensional input spaces. While adversarial training improves the robustness of image classifiers against such adversarial perturbations, it leaves them sensitive to perturbations on a non-negligible fraction of the inputs. In this work, we show that adversarial training is more effective in preventing universal perturbations, where the same perturbation needs to fool a classifier on many inputs. Moreover, we investigate the trade-off between robustness against universal perturbations and performance on unperturbed data, and propose an extension of adversarial training that handles this trade-off more gracefully. We present results for image classification and semantic segmentation to show that universal perturbations that fool a model hardened with adversarial training become clearly perceptible and show patterns of the target scene.


1 Introduction

While deep learning is relatively robust to random noise [11], it can be easily fooled by so-called adversarial perturbations [40]. These perturbations are generated by adversarial attacks [15, 30, 5] that create perturbed versions of the input which are misclassified by a classifier while remaining quasi-imperceptible to humans. There have been different approaches for explaining properties of adversarial examples and providing a rationale for their existence in the first place [15, 41, 12, 13]. Moreover, adversarial perturbations have been shown to be relatively robust against various kinds of image transformations and can even be successful when placed as artifacts in the physical world [21, 39, 10, 4]. Thus, adversarial perturbations might pose a safety and security risk for autonomous systems and also reduce trust in models that are in principle vulnerable to these perturbations.

Figure 1: Effectiveness of shared adversarial training against universal perturbations. Each row shows, from left to right, the clean image, the adversarial image for the undefended model, and the adversarial image for the model defended with our proposed method, shared adversarial training; the top row shows an ImageNet example and the bottom row an example from Cityscapes. The adversarial images are perturbed by universal perturbations generated for the respective models. The performance of the defended models deteriorates by no more than 5%, while robustness increases by 3x and 5x on image classification and semantic segmentation, respectively. Moreover, the universal perturbations become clearly perceptible.

Several methods have been proposed for increasing the robustness of deep networks against adversarial examples, such as adversarial training [15, 22], virtual adversarial training [27], ensemble adversarial training [42], defensive distillation [35, 34], stability training [46], robust optimization [24], and Parseval networks [7]. An alternative approach for defending against adversarial examples is to detect and reject them as malicious [25]. While some of these approaches improve robustness against adversarial examples to some extent, for all defenses the classifier remains vulnerable to adversarial perturbations on a non-negligible fraction of the inputs [3, 43].

Most work has focused on increasing robustness in classification tasks with high-dimensional input spaces such as image classification, where the adversary can choose a data-dependent perturbation for each input. This setting is very much in favor of the adversary, who can craft a high-dimensional perturbation "just" to fool a model on a single decision. In this work, we argue that limited success in increasing robustness under these conditions does not necessarily imply that robustness cannot be achieved in other settings. Specifically, we focus on robustness against universal perturbations [28], where the same perturbation needs to fool a classifier on many inputs. Moreover, we investigate robustness against such perturbations in dense prediction tasks such as semantic image segmentation, where a perturbation needs to fool a model on many decisions, e.g., the pixel-wise classifications. Robustness against universal perturbations is important in settings where an adversary aims at fooling a model on many inputs but is not able to compute input-dependent perturbations for all of them, for instance because of a lack of computational resources or a lack of time for re-computing perturbations.

Prior work has shown that standard models are vulnerable both to universal perturbations which mislead a classifier on the majority of inputs [28, 31] and to adversarial perturbations on semantic image segmentation tasks [14, 44, 6], and even to universal perturbations on semantic image segmentation [26]. However, these results have been achieved for undefended models. In this work, we focus on the case where models have been "hardened" by a defense mechanism.

Our main contributions are as follows: (1) We propose shared adversarial training, an extension of adversarial training that handles the inherent trade-off between accuracy on clean examples and robustness against adversarial examples with universal perturbations more gracefully. (2) We evaluate our method on CIFAR10, a subset of ImageNet (200 classes of TinyImageNet), and Cityscapes and demonstrate that universal perturbations for the defended models become clearly perceptible as shown in Figure 1 (Note that all graphics in the paper are best viewed in color). (3) To the best of our knowledge, we are the first to scale defenses based on adversarial training to semantic image segmentation. (4) We demonstrate empirically on CIFAR10 that the proposed technique outperforms other defense mechanisms [29, 36] in terms of robustness against universal perturbations.

2 Related Work

In this section, we review related work on universal perturbations and adversarial perturbations for semantic image segmentation.

2.1 Universal Perturbations

In contrast to adversarial perturbations, which are input-dependent, universal perturbations are input-agnostic (generated without knowing future inputs), with the objective of fooling a classifier on most inputs. Different methods for generating universal perturbations exist: the first approach, by Moosavi-Dezfooli et al. [28], uses an extension of the DeepFool adversary [30] to generate perturbations that fool a classifier on a maximum number of inputs from a training set. Metzen et al. [26] proposed a similar extension of the basic iterative adversary [22] for generating universal perturbations for semantic image segmentation. Mopuri et al. [32] proposed Fast Feature Fool, which is, in contrast to the former works, a data-independent approach for generating universal perturbations. In follow-up work [31], they show that data-independent approaches achieve fooling rates similar to those of Moosavi-Dezfooli et al. [28]. Khrulkov and Oseledets [19] show a connection between universal perturbations and singular vectors. Hayes and Danezis [16] proposed a generative model that can be trained to generate a diverse set of universal perturbations.

An analysis of universal perturbations and their properties is provided by Moosavi-Dezfooli et al. [29]. They connect robustness to universal perturbations with the geometry of the decision boundary and prove the existence of small universal perturbations provided the decision boundary is systematically positively curved. Jetley et al. [18] build upon this work and provide evidence that the directions in which a classifier is vulnerable to universal perturbations coincide with directions important for correct prediction on unperturbed data. They conclude that predictive power and adversarial vulnerability are closely intertwined. Moosavi-Dezfooli et al. [28] investigated whether fine-tuning the network on a modified dataset, where precomputed universal perturbations have been added to a fraction of the training samples, increases robustness. They observe a small increase in robustness against universal perturbations; however, the network remained vulnerable to universal perturbations on most inputs. Perolat et al. [36] propose a related approach based on approximate fictitious play. In contrast, we propose a method which computes shared perturbations on each mini-batch and uses them in adversarial training, i.e., the shared perturbations are computed on-the-fly rather than precomputed as in prior work [28, 36].

Alternative approaches for defending against universal perturbations are based on adding additional components to the model: Ruan and Dai [38] proposed to identify and reject universal perturbations by adding a shadow classifier, while Akhtar et al. [1] proposed to prepend a subnetwork in front of the model that compensates for an added universal perturbation by detecting and rectifying it. Both methods have the disadvantage that the model becomes larger and thus inference more costly. More importantly, it is assumed that the adversary is not aware of the defense mechanism, and it is unclear whether a more powerful adversary could fool the defense mechanism itself.

2.2 Adversarial Perturbations for Semantic Image Segmentation

Methods for generating adversarial perturbations have been extended to structured and dense prediction tasks like semantic segmentation and object detection [14, 44, 6]. It has been found that models for these tasks are not considerably more robust to adversarial perturbations than those for image classification. Metzen et al. [26] even showed the existence of universal perturbations for semantic image segmentation which are quasi-imperceptible to humans but result in an arbitrary target segmentation of the scene that has nothing in common with the scene a human perceives. A comparison of the robustness of different network architectures for semantic image segmentation has been conducted by Arnab et al. [2]: they found that residual connections and multiscale processing actually increase the robustness of an architecture, while mean-field inference for dense conditional random fields only makes gradient-based attacks harder, since it masks gradients but does not increase robustness itself. In contrast to their work, we focus on modifying the training procedure rather than the network architecture for increasing robustness. Both approaches could be combined in the future.

3 Preliminaries

In this section, we introduce basic terms relevant for this work and establish some of their properties. We aim to defend against an adversary under specific attack settings. Please refer to Section A.1 in the supplementary material for details on capabilities of the adversary and the threat model.

3.1 Risks

Let $\ell$ be a loss function (the categorical cross-entropy throughout this work), $\mathcal{D}$ be a data distribution, and $\theta$ be the parameters of a parametric model $f_\theta$. We denote the risk of $f_\theta$ as $\rho(\theta)$. The following risks are relevant for this work (we extend the definitions of Uesato et al. [43]):

  1. Expected Risk: $\rho_{\mathrm{exp}}(\theta) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\ell(f_\theta(x), y)\right]$

  2. Adversarial Risk: $\rho_{\mathrm{adv}}(\theta) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\max_{\xi \in S} \ell(f_\theta(x+\xi), y)\right]$

  3. Universal Adversarial Risk: $\rho_{\mathrm{uni}}(\theta) = \max_{\xi \in S} \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\ell(f_\theta(x+\xi), y)\right]$

Here, $\xi$ denotes an adversarial perturbation (in the third case, a universal perturbation) and $x+\xi$ an adversarial example. The set $S$ defines the space from which perturbations may be chosen. The following inequalities hold for the risks: $\rho_{\mathrm{exp}}(\theta) \le \rho_{\mathrm{uni}}(\theta) \le \rho_{\mathrm{adv}}(\theta)$. Please refer to Section A.2 for a justification of these inequalities.

3.2 Robustness

For the special case of the 0-1 loss, a $d$-dimensional input $x \in \mathbb{R}^d$, and $S = \{\xi \in \mathbb{R}^d : \|\xi\|_\infty \le \varepsilon\}$, we define the adversarial robustness $\hat{\varepsilon}_{\mathrm{adv}}(\delta)$ as the smallest perturbation magnitude $\varepsilon$ that results in an adversarial risk (misclassification rate) of at least $\delta$. More formally:

$\hat{\varepsilon}_{\mathrm{adv}}(\delta) = \min\{\varepsilon \ge 0 \,:\, \rho_{\mathrm{adv}}(\theta) \ge \delta \text{ for } S = \{\xi : \|\xi\|_\infty \le \varepsilon\}\}$

In other words, there are perturbations with $\|\xi\|_\infty \le \hat{\varepsilon}_{\mathrm{adv}}(\delta)$ that result in a misclassification rate of at least $\delta$. Analogously, we can also define the universal robustness as

$\hat{\varepsilon}_{\mathrm{uni}}(\delta) = \min\{\varepsilon \ge 0 \,:\, \rho_{\mathrm{uni}}(\theta) \ge \delta \text{ for } S = \{\xi : \|\xi\|_\infty \le \varepsilon\}\}$

Here, a single perturbation $\xi$ with $\|\xi\|_\infty \le \hat{\varepsilon}_{\mathrm{uni}}(\delta)$ exists that results in a misclassification rate of at least $\delta$. We have $\hat{\varepsilon}_{\mathrm{adv}}(\delta) \le \hat{\varepsilon}_{\mathrm{uni}}(\delta)$.

3.3 Adversaries

We define an adversary as a function $\mathrm{adv}(x, y, \theta)$, which maps a data point $(x, y)$ and model parameters $\theta$ onto a perturbation $\xi$. Let the objective of the adversary be finding a perturbation which maximizes a loss function $\ell_{\mathrm{adv}}$. (One may choose $\ell_{\mathrm{adv}} = \ell$, or one may also choose, e.g., $\ell$ to be the 0-1 loss and $\ell_{\mathrm{adv}}$ a differentiable surrogate loss.) A lower bound on the adversarial risk is

$\rho_{\mathrm{adv}}(\theta) \ge \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\ell(f_\theta(x + \xi), y)\right]$ with $\xi = \mathrm{adv}(x, y, \theta)$.

A heap adversary is a function $\mathrm{adv}^{\mathrm{heap}}(\{(x_i, y_i)\}_{i=1}^{s}, \theta)$, which maps a set of $s$ data points and model parameters onto a single perturbation $\xi$. The loss function of a heap adversary is defined as

$\ell^{\mathrm{heap}}(\xi) = \frac{1}{s}\sum_{i=1}^{s} \ell_{\mathrm{adv}}(f_\theta(x_i + \xi), y_i).$

A universal adversary evaluates the generalization of $\xi$ by tracking $\ell^{\mathrm{heap}}$ on unseen inputs; we denote such an adversary by $\mathrm{adv}^{\mathrm{uni}}$. We can lower bound the universal adversarial risk by

$\rho_{\mathrm{uni}}(\theta) \ge \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\ell(f_\theta(x + \xi), y)\right]$ with $\xi = \mathrm{adv}^{\mathrm{uni}}(\{(x_i, y_i)\}_{i=1}^{n}, \theta)$.

The lower bound will typically become tighter if $\ell_{\mathrm{adv}}$ approximates $\ell$ well, the number of data points $n$ is large, and the adversary is powerful. While different options for $\mathrm{adv}$ and $\mathrm{adv}^{\mathrm{uni}}$ exist [15, 30, 5, 28, 31], we focus on projected gradient descent (PGD) [24, 21], as it provides in our experience a good trade-off between computational efficiency and power. PGD initializes $\xi^{(0)}$ uniformly at random in $S$ (or a subset of $S$) and then performs $K$ iterations of the update

$\xi^{(k+1)} = \Pi_S\left(\xi^{(k)} + \alpha \cdot \mathrm{sign}\left(\nabla_\xi\, \ell_{\mathrm{adv}}(f_\theta(x + \xi^{(k)}), y)\right)\right),$

where $\Pi_S$ denotes a projection onto the space $S$ and $\alpha$ denotes a step-size. Similarly, a targeted attack where the model shall output the target class $y^{\mathrm{target}}$ can be obtained via the update rule

$\xi^{(k+1)} = \Pi_S\left(\xi^{(k)} - \alpha \cdot \mathrm{sign}\left(\nabla_\xi\, \ell_{\mathrm{adv}}(f_\theta(x + \xi^{(k)}), y^{\mathrm{target}})\right)\right).$

Moreover, this procedure can also be turned into a heap adversary by replacing $\ell_{\mathrm{adv}}$ by $\ell^{\mathrm{heap}}$. If the number of data points in the heap is large (which is typically required for approximating universal perturbations), one can employ stochastic PGD (S-PGD), where in every iteration a subset of data points is sampled and $\ell^{\mathrm{heap}}$ is evaluated only on this subset.
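As a concrete illustration of these updates, the sketch below implements an untargeted PGD adversary with sign-gradient steps and a heap variant that computes one shared perturbation per heap and broadcasts it to the heap members. This is a minimal PyTorch sketch under our own assumptions (function names, a constant step-size alpha, cross-entropy as the adversarial loss), not the authors' Cleverhans-based implementation.

```python
# Minimal sketch of an untargeted L-infinity PGD adversary and a heap variant
# that computes one shared perturbation per heap of `sharedness` inputs
# (assumed PyTorch implementation, not the authors' code).
import torch
import torch.nn.functional as F

def pgd(model, x, y, eps, alpha, steps):
    """Per-input PGD: one perturbation per data point."""
    xi = (torch.rand_like(x) * 2 - 1) * eps              # uniform random init in S
    for _ in range(steps):
        xi.requires_grad_(True)
        loss = F.cross_entropy(model(x + xi), y)          # adversarial loss to maximize
        grad = torch.autograd.grad(loss, xi)[0]
        with torch.no_grad():
            xi = (xi + alpha * grad.sign()).clamp(-eps, eps)   # ascent step + projection
    return xi.detach()

def heap_pgd(model, x, y, eps, alpha, steps, sharedness):
    """Heap PGD: one shared perturbation per heap of `sharedness` inputs,
    broadcast to all heap members (batch size must be divisible by sharedness)."""
    n_heaps = x.shape[0] // sharedness
    xi = (torch.rand(n_heaps, *x.shape[1:], device=x.device) * 2 - 1) * eps
    for _ in range(steps):
        xi.requires_grad_(True)
        xi_full = xi.repeat_interleave(sharedness, dim=0)      # broadcast to the batch
        loss = F.cross_entropy(model(x + xi_full), y)          # average loss over heaps
        grad = torch.autograd.grad(loss, xi)[0]
        with torch.no_grad():
            xi = (xi + alpha * grad.sign()).clamp(-eps, eps)
    return xi.detach().repeat_interleave(sharedness, dim=0)
```

A targeted variant is obtained by descending instead of ascending on the loss with respect to the target labels, mirroring the second update rule above.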

3.4 Adversarial training

The objective of adversarial training can now be defined as finding a minimizer of the convex combination $(1-\kappa)\,\rho_{\mathrm{exp}}(\theta) + \kappa\,\rho_{\mathrm{adv}}(\theta)$, where $\kappa \in [0, 1]$ controls the trade-off between robustness and performance on unperturbed inputs. Note that adversarial training depends implicitly on the adversary $\mathrm{adv}$, its loss $\ell_{\mathrm{adv}}$, and the perturbation set $S$. Any procedure for minimizing the expected risk $\rho_{\mathrm{exp}}$, such as stochastic gradient descent (SGD), can also be applied to this objective. We can define an analogous objective for the universal adversarial risk by replacing $\rho_{\mathrm{adv}}$ by $\rho_{\mathrm{uni}}$.

Prior procedures for fine-tuning a network for robustness against universal perturbations are variants of the following approach: define a distribution over (approximately optimal) universal perturbations for a model with parameters $\theta$ (either by precomputing perturbations and sampling from them at random [28], by learning a generative model [16], or by collecting an increasing set of universal perturbations for model checkpoints during training [36]), fine-tune the model parameters to become robust against this distribution of universal perturbations, and (optionally) iterate. This procedure increases robustness against universal perturbations slightly, but not to a satisfying level. This is probably caused by the model overfitting to the fixed distribution of universal perturbations, which does not change during the optimization of $\theta$. However, re-computing universal perturbations anew in every mini-batch by evaluating $\mathrm{adv}^{\mathrm{uni}}$ is prohibitively expensive. In this work, we propose a procedure that can be performed efficiently and that updates the perturbations used in adversarial training in every mini-batch.

4 Methods

In this section, we introduce shared adversarial training, an extension of adversarial training that aims at maximizing robustness against universal perturbations.

4.1 Shared Adversarial Training

Figure 2: A pictorial representation of shared adversarial training. We split the mini-batch of images into heaps, each of size equal to the sharedness $s$, and obtain the gradients of the loss with respect to the inputs. Here, the sharedness corresponds to the number of inputs that are used for the generation of a shared perturbation. The gradients in each heap of size $s$ are then processed and multiplied with the step-size $\alpha$ to create a shared perturbation, which is then broadcast to the size of the heap. The generated shared perturbations are aggregated and clipped after every iteration in order to confine them within a predefined magnitude $\varepsilon$. These perturbations are added to the images, and this process is repeated in an iterative fashion. The adversarial inputs generated from the shared perturbations are used for adversarial training.

We note that if one is interested in minimizing the universal adversarial risk $\rho_{\mathrm{uni}}$, then using $\rho_{\mathrm{adv}}$ (or $\rho_s$) in adversarial training corresponds to minimizing an upper bound of $\rho_{\mathrm{uni}}$, because $\rho_{\mathrm{uni}} \le \rho_s \le \rho_{\mathrm{adv}}$, provided that the adversaries find perturbations that are sufficiently close to the optimal perturbations. On the other hand, standard empirical risk minimization (ERM), which minimizes the empirical estimate of $\rho_{\mathrm{exp}}$, corresponds to minimizing a lower bound. As shown in previous work [15, 30, 5], this confers only little robustness against (universal) perturbations. For intermediate values of $\kappa$, adversarial training corresponds to minimizing a convex combination of the upper bound $\rho_{\mathrm{adv}}$ and the lower bound $\rho_{\mathrm{exp}}$. As we show in Section 5, this standard version of adversarial training already provides strong robustness against universal perturbations, but at the cost of reducing performance on unperturbed data considerably.

However, $\rho_{\mathrm{adv}}$ is a very loose upper bound of $\rho_{\mathrm{uni}}$, and it would be desirable to use a better approximation of $\rho_{\mathrm{uni}}$ in adversarial training. Directly employing the universal adversary $\mathrm{adv}^{\mathrm{uni}}$ in adversarial training is infeasible, since evaluating it in every mini-batch is prohibitively expensive.

We propose instead to use a weaker, heap-based adversary. We split a mini-batch consisting of $B$ data points into heaps of size $s$ (we denote $s$ as the sharedness). Rather than using $\mathrm{adv}$ for computing a perturbation on each of the $B$ data points separately, we employ a heap adversary (see Section 3.3) for computing shared perturbations on the heaps (subsets of the mini-batch), and then broadcast these perturbations to all data points by repeating each shared perturbation $s$ times, once for every element of its heap; this defines a risk $\rho_s$. We propose to use $\rho_s$ in adversarial training with the aim of increasing robustness against universal perturbations and denote the resulting procedure as shared adversarial training. The entire process is illustrated in Figure 2. We can obtain the following relationship for $\rho_s$ (please refer to Section A.3 for more details):

$\rho_{\mathrm{uni}} \le \rho_{s=B} \le \dots \le \rho_{s=2} \le \rho_{s=1} = \rho_{\mathrm{adv}}$

Note that while all $\rho_s$ are upper bounds on the universal risk $\rho_{\mathrm{uni}}$, this does not imply that shared perturbations are strong universal perturbations. On the contrary, the smaller $s$, the more the shared perturbations "overfit" to the respective heap (there is no requirement that a shared perturbation generalize to unseen data points, as there is for universal perturbations). Moreover, $\rho_s$ with $s > 1$ is typically a much tighter bound on $\rho_{\mathrm{uni}}$ than $\rho_{\mathrm{adv}}$, but it can be computed as efficiently as $\rho_{\mathrm{adv}}$. As discussed in Section 3.3, we can employ PGD on $\ell^{\mathrm{heap}}$ rather than on $\ell_{\mathrm{adv}}$. By appropriately reshaping and broadcasting perturbations, we can compute the shared perturbations on the respective heaps of the mini-batch jointly by PGD at essentially the same cost as computing per-input adversarial perturbations with PGD.
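To make the procedure concrete, the sketch below shows one step of shared adversarial training, reusing the heap_pgd adversary sketched in Section 3.3. It is a minimal sketch under the assumptions stated there; the convex combination with weight kappa follows the notation introduced above and is not the authors' code.

```python
# One step of shared adversarial training: shared perturbations are recomputed
# on every mini-batch by the heap adversary and broadcast to all heap members
# (sketch; reuses the heap_pgd function sketched in Section 3.3).
import torch.nn.functional as F

def shared_adversarial_training_step(model, optimizer, x, y,
                                     eps, alpha, steps, sharedness, kappa):
    xi = heap_pgd(model, x, y, eps, alpha, steps, sharedness)   # already broadcast
    loss_clean = F.cross_entropy(model(x), y)                   # estimate of rho_exp
    loss_shared = F.cross_entropy(model(x + xi), y)             # estimate of rho_s
    loss = (1 - kappa) * loss_clean + kappa * loss_shared       # convex combination
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

With sharedness 1 this reduces to standard adversarial training; larger sharedness values tighten the optimized bound toward the universal adversarial risk while keeping the per-batch cost essentially unchanged.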

4.2 Adversarial Loss Function

We recall that the heap adversary maximizes $\ell^{\mathrm{heap}}(\xi) = \frac{1}{s}\sum_{i=1}^{s} \ell_{\mathrm{adv}}(f_\theta(x_i+\xi), y_i)$. Because of the limited capacity of the perturbation ($\xi \in S$), there is "competition" between the data points: the maximizers of the individual losses will typically be different, and the data points will "pull" $\xi$ in different directions. Because of this, using the categorical cross-entropy as a proxy for the 0-1 loss is problematic for untargeted adversaries: since we are maximizing the loss and the categorical cross-entropy has no upper bound, there is a winner-takes-all tendency where the perturbation is chosen such that it leads to highly confident misclassifications on some data points and to correct classifications on the other data points (this incurs a higher loss than misclassifying more data points but with lower confidence).

To prevent this, we employ loss thresholding on the categorical cross-entropy to enforce an upper bound on $\ell_{\mathrm{adv}}$, i.e., we use $\min(\ell_{\mathrm{CE}}, \ell_{\max})$ for a fixed threshold $\ell_{\max}$. This corresponds to not encouraging the adversary to reduce the confidence of the correct class below a certain value. Besides, we also incorporate label smoothing and use the soft targets for the computation of the loss in all our experiments.
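One possible implementation of such a thresholded, label-smoothed cross-entropy for the adversary is sketched below; the threshold loss_max and the smoothing factor are illustrative placeholders, since the exact values used in the paper are not reproduced here.

```python
# Thresholded cross-entropy with label smoothing for the untargeted adversary
# (sketch; the threshold and smoothing values are illustrative placeholders).
import torch
import torch.nn.functional as F

def thresholded_smoothed_ce(logits, y, loss_max=2.3, smoothing=0.1):
    n_classes = logits.shape[1]
    # soft targets from label smoothing
    soft = torch.full_like(logits, smoothing / (n_classes - 1))
    soft.scatter_(1, y.unsqueeze(1), 1.0 - smoothing)
    per_example = -(soft * F.log_softmax(logits, dim=1)).sum(dim=1)
    # cap the loss so the adversary gains nothing from pushing the correct-class
    # confidence (roughly) below exp(-loss_max); prevents winner-takes-all behavior
    return torch.clamp(per_example, max=loss_max).mean()
```

Capping the loss removes the gradient signal for data points that are already confidently misclassified, so the perturbation's capacity is spent on the remaining points of the heap instead.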

5 Experimental Results

We present experimental results of the robustness against universal perturbations achieved by shared adversarial training for image classification and semantic segmentation.

5.1 General Setup

Figure 3: Pareto fronts on CIFAR10 for different sharedness values. "ERM" corresponds to the model pretrained with empirical risk minimization, "DeepFool UAD" [28] to models trained with the procedure proposed by Moosavi-Dezfooli et al. [28], and "FictitiousPlay" to the procedure proposed by Perolat et al. [36]. (Left) Robustness with regard to S-PGD universal perturbations. (Right) Robustness with regard to DeepFool-based universal perturbations [28]. The Pareto front of the proposed defense lies clearly above all previous defenses.

For shared adversarial training, we extended the PGD implementation of Cleverhans [33] such that it supports shared adversarial perturbations and loss clipping as discussed in Section 4. For evaluation, we extended Foolbox [37] such that universal perturbations with minimal norm can be computed that achieve a misclassification rate of at least $\delta$. Since this evaluation only provides an upper bound on the actual robustness, we tuned the PGD adversary as follows to make it more powerful (and thus the upper bound tighter): we performed a binary search for $\varepsilon_{\max}$, i.e., the bound on the norm of the perturbation, on a fixed interval. In every iteration of the binary search, we used a step-size annealing schedule for the attack. If a perturbation with misclassification rate of at least $\delta$ is found in an iteration, the next iteration of the binary search continues on the lower half of the interval for $\varepsilon_{\max}$, otherwise on the upper half. The reported robustness is the smallest perturbation magnitude found in the entire procedure that achieves a misclassification rate of at least $\delta$. Note that this procedure was only used for evaluation; for training we used a predefined, constant step-size.
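The following sketch summarizes this evaluation protocol. The callable spgd_misclassification_rate, the interval bounds, and the iteration count are our own placeholders for running the (stochastic) PGD universal attack at a given bound and measuring the resulting misclassification rate.

```python
# Binary search for the smallest perturbation bound eps at which a universal
# perturbation reaches a misclassification rate of at least delta
# (evaluation sketch with placeholder interval and iteration count).
def smallest_successful_eps(spgd_misclassification_rate, delta,
                            eps_lo=0.0, eps_hi=0.1, n_binary=10):
    best_eps = None
    for _ in range(n_binary):
        eps = 0.5 * (eps_lo + eps_hi)
        if spgd_misclassification_rate(eps) >= delta:
            best_eps, eps_hi = eps, eps    # success: continue on the lower half
        else:
            eps_lo = eps                   # failure: continue on the upper half
    return best_eps                        # upper bound on the universal robustness
```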

5.2 Experiments on CIFAR10

We present results on CIFAR10 [20] for a ResNet20 [17] with 64-128-256 feature maps per stage. We pretrained this network with standard regularized empirical risk minimization (ERM) and obtained an accuracy of on clean data and a robustness against universal perturbations of . (For evaluating robustness, we generate universal perturbations using stochastic PGD on the CIFAR10 validation set with mini-batches of size and a validation set of size from the CIFAR10 test set. We used binary search iterations, S-PGD iterations, and the step-size schedule values and .)

In general, we are interested in models that increase robustness without decreasing the accuracy on clean data considerably. We consider this a multi-objective problem with two objectives (accuracy and robustness). In order to approximate the Pareto front of different variants of adversarial and shared adversarial training (different sharedness values), we conducted runs for a range of attack parameters: the maximum perturbation strength $\varepsilon$ and $\kappa$ (controlling the trade-off between expected and adversarial risk). We performed 4 steps of PGD with a constant step-size. Model fine-tuning was performed with 65 epochs of SGD with batch-size 128, a momentum term, and an initial learning rate that was annealed after 50 epochs by a factor of 10.

Figure 3 (left) shows the resulting Pareto fronts for the different sharedness values. While sharedness 1 (standard adversarial training) and the intermediate sharedness value perform similarly, sharedness 64 strictly dominates the other two settings. Without any loss of accuracy, a robustness of can be achieved, and if one accepts an accuracy of , a robustness of is obtainable. This corresponds to nearly three times the robustness of the undefended model, while accuracy drops by less than . We would like to note that standard adversarial training is also surprisingly effective in defending against universal perturbations and achieves a robustness that is smaller by approximately 5 than at the same level of accuracy on unperturbed data.

We also evaluated the performance of the defenses against universal perturbations proposed by Moosavi-Dezfooli et al. [28] and by Perolat et al. [36] (please refer to Section A.4 in the supplementary material for details). The results are also shown in Figure 3 (left). It can be seen that these defenses are strictly dominated by all variants of (shared) adversarial training. In terms of computation, shared adversarial training required 189s per epoch on average, while the defense of [28] required 3118s and the defense of [36] required 3840s. In summary, the proposed method outperforms the baseline defenses both in terms of computation and with regard to the robustness-accuracy trade-off.

Figure 3 (right) shows the Pareto front of the same models when attacked by the DeepFool-based method for generating universal perturbations [28]. In this case, the robustness is computed for a fixed perturbation magnitude and the accuracy under this perturbation is reported. The qualitative results are the same as for an S-PGD attack: the Pareto-front of adversarial training (s=1) clearly dominates the results achieved by the defense proposed in [28]. Moreover, shared adversarial training with s=64 dominates standard adversarial training and the defense proposed by Perolat et al. [36]. This indicates that the increased robustness by shared adversarial training is not specific to the way the attacker generates universal perturbations. An illustration of the universal perturbations is given in Section A.5 in the supplementary material.

5.3 Experiment on a Subset of ImageNet

The results on CIFAR10 have demonstrated that the proposed method outperforms previous defenses and dominates standard adversarial training in terms of the robustness/accuracy trade-off. We extend our experiments to a subset of ImageNet [9], which has more classes and higher-resolution inputs than CIFAR10. Please refer to Section A.6 in the supplementary material for details on the selection of this subset of ImageNet. We pretrained a wide residual network (WRN-50-2-bottleneck) [45] on this dataset with ERM, using SGD for 100 epochs with an initial learning rate of 0.1 that was reduced by a factor of 10 after every 30 epochs. We obtained a top-1 accuracy of 77.57% on unperturbed validation data and a robustness against universal perturbations of . (Similar to CIFAR10, we evaluate the robustness using stochastic PGD, but generate the perturbations on the training set with mini-batches of size and evaluate on the total validation set. We used binary search iterations, S-PGD iterations, and the step-size schedule values and .)

We approximate the Pareto front of adversarial and shared adversarial training with sharedness 32 and different values of $\varepsilon$ and $\kappa$. We performed 5 steps of PGD with a constant step-size. The model was fine-tuned for 30 epochs of SGD with batch-size 128, a momentum term, weight decay, and an initial learning rate that was reduced by a factor of 10 after 20 epochs.

Figure 4 compares the Pareto front of shared adversarial training with sharedness 32 to that of standard adversarial training. It can be clearly seen that shared adversarial training increases the robustness without any loss of accuracy. Moreover, shared adversarial training also dominates standard adversarial training for target accuracies between 67% and 74%, which corresponds to the sweet spot where a small loss in accuracy allows a large increase in robustness. The point with accuracy 72.74% can be considered a good trade-off, as accuracy drops by only 5% while robustness increases by a factor of 3, which results in clearly perceptible perturbations, as shown in the top row of Figure 1 and in Section A.7. Moreover, shared adversarial training also substantially increases the entropy of the predicted class distribution for successful untargeted perturbations (see Section A.8).

Figure 4: Pareto front on ImageNet for sharedness 32. Shared adversarial training doubles the robustness at an accuracy similar to the baseline. With a slight additional loss of accuracy, the method increases the robustness by a factor of 3 and clearly dominates standard adversarial training in terms of the robustness/accuracy trade-off.

5.4 Semantic Image Segmentation

The results of the previous experiments have shown that shared adversarial training allows improving the robustness against universal perturbations on image classification tasks where the adversary aims to fool the classifier’s single decision on an input. In this section, we investigate shared adversarial training against adversaries in a dense prediction task (semantic image segmentation), where the adversary aims at fooling the classifier on many decisions. To our knowledge, this is the first work to scale defenses based on adversarial training to semantic image segmentation.

We evaluate the proposed method on the Cityscapes dataset [8]. For computational reasons, all images and labels were downsampled from their original resolution, using bilinear interpolation for the images and nearest-neighbor interpolation for the labels. We pretrained the FCN-8s network architecture [23] for semantic image segmentation on the whole training set and achieved 49.3% class-wise intersection-over-union (IoU) on the validation set. Note that this IoU is relatively low because of the downsampling of the images.

We follow the experimental setup of Metzen et al. [26]: we perform a targeted attack with a fixed target scene (monchengladbach_000000_026602_gtFine) and consider an attack successful if the average pixel-wise accuracy between the prediction on the perturbed images and the target segmentation exceeds a fixed threshold. We find a universal perturbation that upper bounds the robustness of the undefended model. (For evaluating robustness, we generate universal perturbations using stochastic PGD with mini-batches of size from the Cityscapes validation set and a validation set of size from the Cityscapes test set. We used binary search iterations, S-PGD iterations, the step-size schedule values and , and did not employ loss thresholding for targeted attacks.)
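As a sketch of this success criterion (not the authors' evaluation code), the attack is counted as successful if the fraction of pixels whose prediction matches the target segmentation exceeds a threshold; the threshold value below is an assumed placeholder.

```python
# Success criterion for a targeted universal perturbation on segmentation:
# fraction of pixels whose prediction matches the target segmentation
# (sketch; the success threshold is an assumed placeholder).
import torch

def targeted_attack_success(model, images, xi, target_seg, threshold=0.95):
    logits = model(images + xi)                  # (N, C, H, W) class scores
    pred = logits.argmax(dim=1)                  # (N, H, W) predicted labels
    agreement = (pred == target_seg).float().mean().item()
    return agreement >= threshold, agreement
```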

We fine-tuned this model with adversarial and shared adversarial training. Since approximating the entire Pareto front of both methods would have been computationally very expensive, we instead selected a target performance on unperturbed data of roughly 45% IoU (only slightly worse than the undefended model). The following two settings achieved this target performance (see Figure 5, left): adversarial training and shared adversarial training, each with appropriately chosen $\varepsilon$, $\kappa$, and (for the latter) sharedness. As heap adversary, we performed 5 steps of untargeted PGD with a constant step-size. (Training was performed with 20 epochs of Adam with batch-size 5 and a learning rate that was annealed after 15 epochs.)

Figure 5: Learning curves on Cityscapes for adversarial (red, circle) and shared adversarial training (blue, diamond) with regard to performance on unperturbed images (left), robustness against adversarial perturbations (middle; showing the mean and standard error of the mean), and robustness against universal perturbations (right). Black horizontal lines denote the performance of the undefended model. Isolated markers correspond to robustness against untargeted attacks. The performance of standard and shared adversarial training is comparable on unperturbed data, but standard adversarial training dominates in terms of robustness against image-dependent adversarial perturbations, while shared adversarial training dominates in terms of robustness against targeted and untargeted universal perturbations.

While both methods achieved very similar performance on unperturbed data, they show very different robustness against adversarial and universal perturbations (see Figure 5): standard adversarial training largely increases robustness against adversarial perturbations to , an increase by a factor of 4 compared to the undefended model. Shared adversarial training is less effective against adversarial perturbations, its robustness is upper bounded by . However, shared adversarial training is more effective against universal perturbations with an upper bound on robustness of , while adversarial training reaches . We also evaluated robustness against untargeted attacks: robustness increased from for the undefended model to for the model trained with adversarial training and for the model trained with shared adversarial training. We refer to Section A.9 and Section A.10 in the supplementary material for illustrations of targeted and untargeted universal perturbations for the different models. The universal perturbation for the model trained with shared adversarial training clearly shows patterns of the target scene and dominates the original image, which is also depicted in the bottom row of Figure 1.

5.5 Discussion

Results shown in Figure 5 indicate that there may exist a trade-off between robustness against image-dependent adversarial perturbations and robustness against universal perturbations. Figure 6 illustrates why these two kinds of robustness are not strictly related: adversarial perturbations fool a classifier both by adding structure from the target scene/class to the image (e.g., the street light in the middle, vegetation on the middle-left part of the image) and by destroying properties of the original scene (e.g., edges of the windowsills). (For untargeted attacks, the attack may choose a target scene/class arbitrarily, such that fooling the model becomes as simple as possible.) The latter is not possible for universal perturbations, since the input images are not known in advance. As also shown in the figure, universal perturbations compensate for this by adding stronger patterns of the target scene. Shared perturbations become more similar to universal perturbations with increasing sharedness, since a single shared perturbation has fixed capacity and cannot destroy properties of arbitrarily many input images (even if they are known in advance). Accordingly, shared adversarial training makes the model mostly more robust against perturbations which add new structures, and not against perturbations which destroy existing structure. Hence, it results in less robustness against image-specific adversarial perturbations, as seen in Figure 5 (middle). On the other hand, since shared adversarial training focuses on one specific kind of perturbation (those that add structure to the scene), it leads to models that are particularly robust against universal perturbations, as shown in Figure 5 (right).

Figure 6: Illustration of image-dependent and universal perturbations for the same image (upper left) and target scene (lower left) on Cityscapes. Shown are small subregions of the image for enhanced visibility. The adversarial and universal perturbations are generated for a model trained with standard and shared adversarial training, respectively. The resulting adversarial images (upper and lower right) show that the disturbance induced by these perturbations is qualitatively different: image-dependent perturbations weaken patterns of existing structure, such as edges of the actual scene (upper right), whereas universal perturbations are restricted to adding structure indicative of the target scene (lower right). This qualitative difference also provides a possible explanation for why standard and shared adversarial training show different levels of robustness against image-dependent and universal perturbations: shared adversarial training improves robustness against additive structure, whereas adversarial training additionally needs to address the weakening of existing structure.

6 Conclusion

We have shown that adversarial training is surprisingly effective in defending against universal perturbations. However, there is a trade-off between robustness against universal perturbations and performance on unperturbed data points (as well as robustness against adversarial perturbations). Since adversarial training does not explicitly optimize for robustness against universal perturbations, it handles this trade-off suboptimally. We have proposed shared adversarial training, which performs adversarial training on a tight upper bound of the universal adversarial risk. We have shown on CIFAR10 and a subset of ImageNet that it achieves high robustness against universal perturbations at a smaller loss of accuracy. The proposed method also scales to semantic image segmentation on high-resolution images, where, compared to adversarial training, it achieves higher robustness against universal perturbations at the same level of performance on unperturbed images.

References

Appendix A Supplementary material

A.1 Threat Model

Here, we specify the capabilities of the adversary since the proposed defense mechanism aims at providing security under a specific threat model. We assume a white-box setting, where the adversary has full information about the model, i.e., it knows network architecture and weights, and can provide arbitrary inputs to the model and observe their corresponding outputs (and loss gradients). Moreover, we assume that the attacker can arbitrarily modify every pixel of the input but aims at keeping the norm of this perturbation minimal. In the case of a universal perturbation, we assume that the attacker can choose an arbitrary perturbation (while aiming to keep the norm minimal), but crucially does not know the inputs to which this perturbation will be applied. The adversary, however, has access to data points that have been sampled from the same data distribution as the future inputs.

A.2 Inequalities of the Risks

To check the validity of the inequalities $\rho_{\mathrm{exp}} \le \rho_{\mathrm{uni}} \le \rho_{\mathrm{adv}}$, we set $S = \{0\}$ to obtain $\rho_{\mathrm{uni}} = \rho_{\mathrm{exp}}$ (and thus larger sets $S$ can only increase $\rho_{\mathrm{uni}}$). For the second inequality, we set the per-input perturbation in $\rho_{\mathrm{adv}}$ to the single maximizer of $\rho_{\mathrm{uni}}$ to obtain $\rho_{\mathrm{uni}}$ as a lower bound, and thus $\rho_{\mathrm{adv}}$ can only be larger in general.

A.3 Relationship of Different Sharedness

Provided that the heap adversary finds a perturbation that is sufficiently close to the optimal perturbation of the heap, and that heaps are composed hierarchically (heaps are composed hierarchically when a heap of sharedness $2s$ is always the union of two disjoint heaps of sharedness $s$), we have the following relationship (we omit the dependence on $\theta$ and the data for brevity):

$\rho_{\mathrm{uni}} \le \rho_{s=B} \le \dots \le \rho_{s=2s'} \le \rho_{s=s'} \le \dots \le \rho_{s=1} = \rho_{\mathrm{adv}}$

To see $\rho_{s=2s'} \le \rho_{s=s'}$, let $\xi^{(s')}_1, \dots, \xi^{(s')}_{B/s'}$ be the shared perturbations on the heaps of sharedness $s'$, and let $\xi^{(2s')}_j$ be the shared perturbation for the $j$-th heap of sharedness $2s'$. Then, because of the hierarchical construction of the heaps, this heap is composed of two heaps used for sharedness $s'$. Let $j_1$ and $j_2$ be the corresponding indices of these heaps. By setting $\xi^{(s')}_{j_1} = \xi^{(s')}_{j_2} = \xi^{(2s')}_j$, we obtain a feasible (though in general suboptimal) solution for sharedness $s'$ that achieves the same loss as $\rho_{s=2s'}$; the optimal perturbations for sharedness $s'$ can only achieve a larger or equal loss, and hence $\rho_{s=s'} \ge \rho_{s=2s'}$.

A.4 Configuration of Baselines for CIFAR10

For the defense proposed by Moosavi-Dezfooli et al. [28], we generated 10 different universal perturbations using the DeepFool-based method for generating universal perturbations on 10,000 randomly sampled training images, ran 5 epochs of adversarial training, and chose the applied perturbation uniformly at random from the precomputed perturbations. After these 5 epochs, the robustness was evaluated. This procedure was iterated five times, which resulted in 5 accuracy-robustness points in Figure 3 (left).

We ran the defense proposed by Perolat et al. [36] for 45 epochs (sufficiently long for achieving convergence, as evidenced by Figure 4 of Perolat et al. [36]). At the beginning of each episode, we generated one universal perturbation using the DeepFool-based method for generating universal perturbations on the entire training set, and chose the applied perturbation uniformly at random from all universal perturbations computed so far. We report the accuracy and robustness at the end of these 45 epochs. We note that even though we did not replicate the exact setup of Perolat et al. [36], we achieve an accuracy-robustness trade-off in Figure 3 (right) similar to the one given in Figure 4 of Perolat et al. [36].

A.5 Illustration of Universal Perturbations on CIFAR10

Figure A1 illustrates the minimal universal perturbations found for the different sharedness values and training settings. Universal perturbations of the undefended model resemble high-frequency noise and are quasi-imperceptible when added to an image. Shared adversarial training increases robustness, and the resulting perturbations are more perceptible or even dominate the image: for the strongest settings, the cat in the figure is completely hidden and the perturbed image could not be classified correctly by a human either. Moreover, the perturbation becomes more structured and even object-like. Note that the perturbations shown for standard adversarial training also achieve high robustness, but at a smaller accuracy on clean data than those of shared adversarial training.

Figure A1: Illustration of universal perturbations on CIFAR10 for sharedness (top row) and (middle row) for different values of . The bottom row shows a test image of a cat with the respective perturbation of the middle row being added.

A.6 Selection of the Subset of ImageNet

Since generating the Pareto fronts on the entire ImageNet dataset would be computationally very expensive, we restrict the experiment to a subset of ImageNet. We use the classes defined in TinyImageNet to select the corresponding samples from the ImageNet dataset. We conducted our experiments on the samples of these 200 classes, which results in 258,601 training and 10,000 validation images. Note that we only take the list of classes from TinyImageNet and use the data of those classes from the original ImageNet dataset.
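A hypothetical sketch of this selection step is given below; the TinyImageNet file wnids.txt and the ImageNet directory layout are assumptions about a standard on-disk organization, not a description of the authors' data pipeline.

```python
# Select the ImageNet training images whose synset appears in the 200-class
# list shipped with TinyImageNet (hypothetical file names and directory layout).
from pathlib import Path

def imagenet_subset(tiny_imagenet_dir, imagenet_train_dir):
    wnids = set(Path(tiny_imagenet_dir, "wnids.txt").read_text().split())
    samples = []
    for class_dir in Path(imagenet_train_dir).iterdir():
        if class_dir.is_dir() and class_dir.name in wnids:   # e.g. "n01443537"
            samples.extend(sorted(class_dir.glob("*.JPEG")))
    return samples
```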

A.7 Illustration of Universal Perturbations on ImageNet

Figure A2 depicts the universal perturbations with minimal magnitude for different models on the subset of ImageNet, obtained with standard adversarial training and with shared adversarial training under different settings. It can be clearly seen that both standard and shared adversarial training increase robustness compared to the undefended model, but the latter handles the trade-off between performance on unperturbed data and robustness more gracefully. The universal perturbations become clearly visible for a model hardened with shared adversarial training at only a marginal loss in top-1 accuracy, and the perturbations become much smoother for the more strongly defended models.

Figure A2: Illustration of universal perturbations (not amplified) on ImageNet, generated for different models trained with standard adversarial training (top row) and shared adversarial training (third row) under different settings. The top-1 accuracy of the corresponding models and their smallest perturbation magnitude that results in a misclassification rate of at least $\delta$ are also shown. The second and bottom rows show a test image of a dog with the respective universal perturbations from the first and third rows added. The models hardened with both standard and shared adversarial training demonstrate higher robustness than the undefended model, and the universal perturbations become clearly visible. However, shared adversarial training outperforms its counterpart in terms of robustness against perturbations and performance on unperturbed inputs. The perturbations of the models trained with standard adversarial training resemble high-frequency noise, whereas the perturbations of the latter become much smoother for the more strongly defended models.

A.8 Predicted Class after Untargeted Universal Perturbations

Figure A3: Histograms of the predicted classes over validation data for different models when an untargeted universal perturbation is added. The histograms are based on the 200-class ImageNet validation data; a bar in each histogram represents the number of times a class (represented by its class index) is predicted over the validation samples. It is interesting to note that the undefended model almost always assigns the adversarial samples (samples with a universal perturbation added) to the same class, even though the attack is untargeted. In contrast, the models defended by standard and shared adversarial training have higher entropy in their predictions.

Figure A3 shows which classes are predicted on the ImageNet validation data after an untargeted universal perturbation (computed for the respective model) is added. While the undefended model nearly always predicts the same (wrong) class, the models defended with standard and shared adversarial training have a substantially higher entropy in their predictions. Prior work [18] has also observed that undefended models typically misclassify images perturbed with a universal perturbation as the same class, even though the attack is untargeted. Based on this observation, they hypothesized that the directions in which a classifier is vulnerable to universal perturbations coincide with directions important for correct prediction on unperturbed data. We believe it would be important to re-examine these results for defended models.

A.9 Attacks on Semantic Image Segmentation

We illustrate universal perturbations for targeted and untargeted attacks on different models in this section. We illustrate the effect of the perturbations on one image; however, the perturbations are not specific for this image. For the model trained with empirical risk minimization, Figure A4 shows a targeted attack and Figure A5 an untargeted attack. For the model trained with adversarial training, Figure A6 shows a targeted attack and Figure A8 an untargeted attack. For the model trained with shared adversarial training, Figure A7 shows a targeted attack and Figure A9 an untargeted attack.

Figure A4: Targeted universal perturbations on Cityscapes for a model pretrained with empirical risk minimization. The shown perturbation upper bounds the robustness of the model to . Top row shows original image, universal perturbation, and perturbed image. Bottom row shows prediction on original image, target segmentation, and prediction on perturbed image.
Figure A5: Untargeted universal perturbations on Cityscapes for a model pretrained with empirical risk minimization. The shown perturbation upper bounds the robustness of the model to . Top row shows original image, universal perturbation, and perturbed image. Bottom row shows prediction on original image and prediction on perturbed image.
Figure A6: Targeted universal perturbations on Cityscapes for a model trained with adversarial training. The shown perturbation upper bounds the robustness of the model to . Top row shows original image, universal perturbation, and perturbed image. Bottom row shows prediction on original image, target segmentation, and prediction on perturbed image.
Figure A7: Targeted universal perturbations on Cityscapes for a model trained with shared adversarial training. The shown perturbation upper bounds the robustness of the model to . Top row shows original image, universal perturbation, and perturbed image. Bottom row shows prediction on original image, target segmentation, and prediction on perturbed image.
Figure A8: Untargeted universal perturbations on Cityscapes for a model trained with adversarial training. The shown perturbation upper bounds the robustness of the model to . Top row shows original image, universal perturbation, and perturbed image. Bottom row shows prediction on original image and prediction on perturbed image.
Figure A9: Untargeted universal perturbations on Cityscapes for a model trained with shared adversarial training. The shown perturbation upper bounds the robustness of the model to . Top row shows original image, universal perturbation, and perturbed image. Bottom row shows prediction on original image and prediction on perturbed image.

A.10 Illustration of Universal Perturbations

We illustrate the universal perturbations found for different models for targeted attacks in Figure A10.

Figure A10: Illustration of targeted universal perturbation for empirical risk minimization (top), adversarial training (middle), and shared adversarial training (bottom).