Adversarial examples and where to find them

04/22/2020
by   Niklas Risse, et al.

Adversarial robustness of trained models has attracted considerable attention over recent years, within and beyond the scientific community. This is not only because of a straightforward desire to deploy reliable systems, but also because of how adversarial attacks challenge our beliefs about deep neural networks. Demanding more robust models seems to be the obvious solution – however, this requires a rigorous understanding of how one should judge adversarial robustness as a property of a given model. In this work, we analyze where adversarial examples occur, in which ways they are peculiar, and how they are processed by robust models. We use robustness curves to show that ℓ_∞ threat models are surprisingly effective in improving robustness for other ℓ_p norms; we introduce perturbation cost trajectories to provide a broad perspective on how robust and non-robust networks perceive adversarial perturbations as opposed to random perturbations; and we explicitly examine the scale of certain common data sets, showing that robustness thresholds must be adapted to the data set they pertain to. This allows us to provide concrete recommendations for anyone looking to train a robust model or to estimate how much robustness they should require for their operation. The code for all our experiments is available at www.github.com/niklasrisse/adversarial-examples-and-where-to-find-them .


1 Introduction

Adversarial robustness is a property that describes a model’s ability to behave correctly under small input perturbations that are crafted with the intent to mislead the model. In recent years, it has been pointed out that deep neural networks, despite their astonishing success in a wide range of prediction tasks, frequently lack this property Szegedy2014Intriguing,Goodfellow2014Explaining. The study of adversarial robustness – with its definitions, their implications, attacks, and defenses – has subsequently attracted considerable research interest. This is due both to the practical importance of trustworthy models and the intellectual interest in the differences between decisions of machine learning models and our human perception.

From a theoretical perspective, this scientific enthusiasm has led to an analysis of different definitions of adversarial robustness, their implications in terms of sample complexity and computational hardness, and the possibility or impossibility of achieving adversarial robustness under certain distributional assumptions dohmatob2018generalized,gourdeau2019hardness. These results on hardness pose an intellectual challenge: for many domains, a robust and efficient classifier does exist in the human brain (indeed, adversarial examples are often motivated by the differences between human and model classification). If we are not yet able to replicate this artificially, is it due to our definitions, our models, our data representation, or our optimization procedures?

From a practical perspective, we have seen the development of a wide array of defenses against adversarial attacks, most of which have quickly prompted the development of new attack methods capable of circumventing them Carlini2017Towards,chen2017ead,madry2018towards,tramer2020adaptive,carlini2019evaluating. A helpful overview of methodological foundations for defense mechanisms (and thereby indirectly of attack mechanisms) is given by carlini2019evaluating. Central to the development of an adversarial defense is the specification of a threat model that defines the types of perturbations which, when applied to real data points, should not change the classification behavior of the model. The challenge then lies in finding a tractable optimization procedure that satisfies the threat model, and in providing credible empirical or – ideally – theoretical evidence to show that the resulting model is indeed robust to all perturbations in the threat model. The most commonly used threat models are ℓ_p-ball constrained perturbations, i. e. perturbations with ℓ_p norm smaller than a specified threshold ε. The most commonly used norms in this context are the ℓ_1, ℓ_2, and ℓ_∞ norms. Under this type of threat model, by definition, the study of adversarial robustness is closely tied to the distance of examples to decision boundaries. Constructing models where adversarial perturbations are rare then involves directly or indirectly maximizing margins around decision boundaries. (It should therefore not surprise us when algorithms that by construction maximize margins, such as the Support Vector Machine, also lead to adversarially robust models with respect to the type of margin they optimize.)

One popular defense mechanism for indirectly maximizing margins is Adversarial Training with Projected Gradient Descent (AT), which madry2018towards argue provides a natural security guarantee against first-order adversaries, motivated by robust optimization. wong2017provable introduce a method (KW) with provable robustness against norm-bounded perturbations on the training data by adapting the training process to minimize an upper bound on the robust loss of the network. Croce2018ProvableRobustness,Croce2020Provable achieve improved robustness to norm-bounded perturbations by adding a regularization term that penalizes closeness of data points to the decision boundary and to the boundaries of the linear regions of the model (MMR + AT and MMR-UNIV).
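As a concrete illustration of the first of these defenses, the sketch below shows schematic PGD-based adversarial training in PyTorch. It is a minimal illustration of the general idea rather than the exact procedure of madry2018towards or the setup used for the models in this paper; the model, data loader, and hyperparameters (eps, alpha, steps) are placeholders.

# Schematic adversarial training with an l_inf PGD adversary.
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, eps=0.3, alpha=0.01, steps=40):
    # Gradient ascent on the loss, projected onto the l_inf ball of radius eps.
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        F.cross_entropy(model(x + delta), y).backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()         # ascent step
            delta.clamp_(-eps, eps)                    # project onto the l_inf ball
            delta.copy_((x + delta).clamp(0, 1) - x)   # keep perturbed inputs in [0, 1]
        delta.grad.zero_()
    return (x + delta).detach()

def adversarial_training_epoch(model, loader, optimizer, eps=0.3):
    model.train()
    for x, y in loader:
        x_adv = pgd_linf(model, x, y, eps=eps)         # inner maximization
        optimizer.zero_grad()
        F.cross_entropy(model(x_adv), y).backward()    # outer minimization
        optimizer.step()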

For this work, we explicitly decided not to focus on developing a new adversarial attack or defense. Instead, we explore and present adversarial robustness from a broader perspective, in order to increase the community’s understanding of how defense methods interact with the choice of threat model, the properties of the data set, and the way adversarial examples are processed. We start by describing robustness curves Gopfert2020AdversarialRobustnessCurves in Section 2 as a tool to evaluate the robustness of specific models and data sets, and then study how robustness depends on the distance function, the data distribution, and the training method. The main part of the paper is structured as follows:

  1. In Section 3, we compare robustness curves for different models and different choices of distance functions. Our goal is to understand how the choice of model and training method impacts robustness, and to what extent robustness transfers between distance functions. For linear classifiers, we show theoretically that the shape of the robustness curve – representing the global robustness behaviour – is the same for all ℓ_p norms, and that how close the curves for different norms are can be controlled via the sparsity of the weight vector. For non-linear models, we empirically find that robustness transfers to norms not considered in the threat model, and that ℓ_∞ threat models frequently provide better robustness than ℓ_2 threat models.

  2. In Section 4, we analyze how adversarial perturbations are perceived by robust and non-robust networks, specifically compared to random perturbations. To gain insight, we introduce perturbation cost trajectories, which offer a broader perspective on how a network perceives an input, and we find that adversarial perturbations propagate through deep networks in peculiar ways.

  3. Finally, in Section 5, we analyze how the scale of robustness curves depends on the underlying data space, the choice of distance function, and the data distribution. The main results are that the robustness curve for one norm-induced distance places bounds on the robustness curves for all other norm-induced distances, and that the separation between the curves can be extremely large, depending on the model and the underlying data distribution. Based on these findings, we suggest that the maximum perturbation size of a threat model should be selected carefully, depending on characteristics of the data set, in order to be meaningful.

2 Robustness curves

Let us consider what we would actually hope to see when performing an adversarial attack, instead of the small performance-breaking perturbations one encounters. We would hope to see either a meaningful – but small – alteration of the original input, such as an airplane growing eyes, a beak, and plumage, to be classified as a bird; or a complete dismantling of meaningful features resulting in unrecognizable noise, where we can readily accept pseudo-random class predictions. Semantics are crucial in both scenarios, which leads us to the conviction that our human perception and classification are integral to the nature of adversarial examples. Still, we must work with a more rigorous definition and therefore abstract further, which we do in the following.

An adversarial perturbation for a classifier f and an input-output pair (x, y) is a small perturbation δ with f(x + δ) ≠ y. The resulting input x + δ is called an adversarial example. The set of input-output pairs vulnerable to adversarial perturbations is the set of data points which the classifier can be induced to misclassify using small perturbations – that is, the set of points that are either already misclassified, or that lie within some small distance of the decision boundary. But what does small mean? How should distance be measured? And what distinguishes adversarial examples from other data points? These are the questions we aim to answer in the following sections.

One tool for our study of these questions are robustness curves Gopfert2020AdversarialRobustnessCurves, which we briefly explain in this section. A robustness curve is the cumulative distribution function of the distance of points from the decision boundaries of a classifier:

Definition 1.

Given an input space X, a label space Y, a distance function d on X, and a classifier f : X → Y. Let ε ≥ 0 and let P denote the data distribution that generates the samples. Then the d-robustness curve for f is the graph of the function

R^f_d : ε ↦ P_{(x, y) ∼ P}( ∃ x′ ∈ X : d(x′, x) ≤ ε ∧ f(x′) ≠ y ).

A model’s robustness curve shows how data points are distributed in relation to the decision boundaries of the model. The advantage of this distributional perspective is that it allows a step back from robustness regarding a specific quantification of what it means to be close to the decision boundary, and instead allows us to compare global robustness properties and their dependence on a given classifier, distribution and distance function. One of the main questions driving this work is how robustness – and robust training – are affected by the choice of distance function.
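To make Definition 1 concrete, the sketch below computes an empirical robustness curve from per-point distances to the decision boundary. The arrays dists and correct are assumed to come from some distance-estimation procedure (exact or attack-based); they are placeholders, not part of the original experiments.

# Empirical robustness curve: the fraction of points that are misclassified or
# lie within distance eps of a decision boundary, as a function of eps.
import numpy as np

def robustness_curve(dists, correct, eps_grid):
    # dists: per-point distance to the nearest decision boundary (any norm);
    # correct: boolean array, True where the clean point is classified correctly;
    # eps_grid: perturbation sizes at which to evaluate the curve.
    d = np.where(correct, dists, 0.0)          # misclassified points count from eps = 0 on
    return np.array([(d <= eps).mean() for eps in eps_grid])

# Example usage:
# eps_grid = np.linspace(0, dists.max(), 100)
# curve = robustness_curve(dists, correct, eps_grid)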

In principle, there exists a wealth of possible distance functions, and the choice of one particular distance function should depend on the downstream application domain of the model. For this work, we will assume that the distance function is induced by some norm ‖·‖ on X, i. e. d(x, x′) = ‖x − x′‖. As HiddenInPlainSight show, this is not sufficient to encode dissimilarities of images as perceived by a human, or to guarantee that perturbations are imperceptible to the human eye. However, the majority of current work on robustness uses distances induced by norms tramer2020adaptive. They are easy to define and well studied, and it is reasonable to assert that perturbations that are small in some norm typically should not change the classification of real-world images. In the following, we always assume that the distance we consider is induced by some norm, usually an ℓ_p norm, on a finite-dimensional vector space. In our experiments, we consider the ℓ_2 and ℓ_∞ norms, as these are frequently used prototypical examples.

3 Robustness curves and their dependence on distance functions

Due to strong concerns about trust in and security of machine learning models that are susceptible to adversarial perturbations, the last years have seen significant effort in developing training methods that produce more robust models. These training methods typically require the choice of a distance function d and a perturbation threshold ε. The pair (d, ε) is sometimes called a threat model. The training method will attempt (and possibly guarantee) to make the model robust against attacks in distance d up to size ε. In other words, the objective of the training method is to minimize R^f_d(ε). In the following, we analyze to what extent the minimization of R^f_d(ε) impacts R^f_{d′}(ε′) for other choices d′ and ε′.

For the special case of linear classifiers, we find that the global robustness behavior of a model w. r. t. a data distribution, for all ℓ_p norms, is fully specified by the model parameters and the robustness curve for a single norm. Thus, we fully understand the behavior of R^f_{d′} for any norm-induced distance d′ based on knowing R^f_d for one norm-induced distance d. This is an extension of a weaker theorem from Gopfert2020AdversarialRobustnessCurves.

Theorem 1.

If f is a linear classifier on R^n parameterized by (w, b) with w ≠ 0, i. e. f(x) = sign(⟨w, x⟩ + b), then the shape of the robustness curve for f regarding an ℓ_p norm-induced distance does not depend on the choice of p. For any p, p′ ∈ [1, ∞] there exists a constant c > 0 such that

R^f_{ℓ_p}(ε) = R^f_{ℓ_{p′}}(c · ε) for all ε ≥ 0. (1)

The distortion factor is given by c = ‖w‖_q / ‖w‖_{q′}, where 1/p + 1/q = 1/p′ + 1/q′ = 1.

We provide a proof in Appendix A. Clearly, the distortion factor can be as large as n, where n is the dimensionality of the input space. See Section B.2 for an example where this occurs. One way to minimize this distortion factor is to encourage sparse weight vectors. However, as illustrated by the example in Appendix B.2, this may backfire by worsening robustness for norms the model would otherwise be robust to, instead of improving robustness for norms it would otherwise not be robust to.
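The following small numerical check illustrates the theorem for a random linear classifier: the ℓ_p distance of a point x to the decision boundary is |⟨w, x⟩ + b| / ‖w‖_q with 1/p + 1/q = 1, so the distances under two different norms differ only by a constant factor, the ratio of the corresponding dual norms of w. The weights and data below are random placeholders.

# Numerical check of Theorem 1 for a linear classifier f(x) = sign(<w, x> + b).
import numpy as np

def dual(p):
    return np.inf if p == 1 else (1.0 if p == np.inf else p / (p - 1))

def boundary_distance(X, w, b, p):
    q = dual(p)
    return np.abs(X @ w + b) / np.linalg.norm(w, ord=q)

rng = np.random.default_rng(0)
w, b = rng.normal(size=50), 0.1
X = rng.normal(size=(10, 50))

d_inf = boundary_distance(X, w, b, np.inf)    # l_inf distances (dual norm: l_1)
d_2 = boundary_distance(X, w, b, 2)           # l_2 distances (dual norm: l_2)
print(d_2 / d_inf)                            # constant across points ...
print(np.linalg.norm(w, 1) / np.linalg.norm(w, 2))   # ... and equal to ||w||_1 / ||w||_2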

For more complex decision boundaries, the dependence on the norm can be more than a question of scale. The shape of the robustness curve may depend on the choice of norm, i. e. the distribution of the distances between points and the decision boundary may differ for different norms. See Figure 1 for an example.

Figure 1: A model with a non-linear decision boundary and three differently shaped robustness curves.

In the following, we empirically evaluate robustness curves for different models to investigate how robustification efforts transfer across norms. For complex models, exactly calculating the distance of a point to the closest decision boundary, and thus estimating the robustness curve, is computationally very intensive, if not intractable. Instead, we bound the true robustness curve from below using adversarial attacks. To verify that this approach is justified, we compare a robustness curve estimated using adversarial attacks to one based on exact calculation of distances to the decision boundary for a simple robust model. We observe that the shape and scale of the curves are very similar, especially for small perturbation sizes, which are the most relevant for assessing the robustness of a model. See Appendix D.1 for detailed results. We compare ℓ_2 and ℓ_∞ robustness curves for different robustified models from the literature. The network architecture we use in our analysis is a convolutional network with two convolutional layers, two fully connected layers, and ReLU activation functions. We compare the following training methods:

  1. Standard training (ST), i. e. training without specific robustness considerations.

  2. Adversarial training (AT), see madry2018towards.

  3. Training with robust loss (KW), see wong2017provable.

  4. Maximum margin regularization for a single norm together with adversarial training (MMR + AT), see Croce2018ProvableRobustness.

  5. Maximum margin regularization simultaneously for ℓ_1 and ℓ_∞ margins (MMR-UNIV), see Croce2020Provable.

Figure 2: Robustness curves resulting from different training methods (MMR + AT and MMR-UNIV (first column), KW (second column), and AT (third column), always with ST for comparison) and different threat models (indicated by color), measured in the ℓ_2 and ℓ_∞ norms (one norm per row).

Together with each training method, we state the threat model the model is optimized for (e. g. a threat model specifying robustness guarantees in the ℓ_∞ norm up to a given perturbation size), if available. Unless explicitly stated otherwise, the trained models are those made publicly available by Croce2018ProvableRobustness,Croce2020Provable (www.github.com/fra31/mmr-universal, www.github.com/max-andr/provable-robustness-max-linear-regions). We use six data sets: MNIST, Fashion-MNIST (FMNIST), German Traffic Signs (GTS) Houben-IJCNN-2013, CIFAR-10, Tiny-Imagenet-200 (TINY-IMG) tinyimagenet, and Human Activity Recognition (HAR) hardataset. For specifics on model training, refer to Appendix C. Models are generally trained on the full training set of the corresponding data set, and robustness curves are evaluated on the full test set, unless stated otherwise. See Section D.2 for examples of adversarial perturbations optimized for different norms and different models, which provides intuition on how norm constraints and adversarial defenses affect adversarial perturbations.

First, we compute ℓ_2 and ℓ_∞ robustness curves on MNIST for the five different training methods mentioned above, each for ℓ_2 and ℓ_∞ threat models, except for MMR-UNIV, which simultaneously defends against attacks in all norms. The results can be seen in Figure 2. We find that for AT and MMR + AT, the ℓ_∞ threat model leads to better robustness than the ℓ_2 threat model for both the ℓ_2 and the ℓ_∞ robustness curves. In fact, MMR + AT with the ℓ_∞ threat model even leads to better ℓ_2 and ℓ_∞ robustness curves than MMR-UNIV, which is specifically designed to improve robustness for all norms. For KW, interestingly, neither threat model dominates: each leads to better robustness than the other only up to a certain perturbation size, after which the two curves intersect. For AT, KW, and MMR + AT we observe a noticeable change in the slope of the robustness curve around the target perturbation size of the threat model. MMR + AT stands out in that this change in slope occurs at a larger perturbation size than for the other models, even though the target perturbation size is identical. Each method leads to significantly improved robustness curves as compared to ST, even beyond the perturbation sizes specified by the threat model.

Figure 3: Scaled and unscaled robustness curves for different training methods. Scaling methods are indicated by color. Training methods are indicated by plot titles.

Because perturbation sizes in the ℓ_2 and ℓ_∞ norms tend to be very different, it is difficult to compare robustness curves for the two norms in a single plot. In Figure 3, we introduce two methods of rescaling to better compare the shapes of the curves. With method 1, we rescale one of the two robustness curves by the mean ratio between the two curves. This makes it even clearer that the robustness curve for MMR + AT with one of the threat models changes slope around the target perturbation size, a change that does not occur in the other norm’s robustness curve for any of the training methods. With method 2, we rescale by the mean ratio of the two distances over all points in the data set. The benefit of this rescaling method is that it does not require computing both curves, but, at least empirically, it can still provide an estimate of one robustness curve given the other. It is interesting to see that method 2 overestimates the rescaled curve for ST, while for MMR + AT its accuracy depends on the threat model: it underestimates the curve for one threat model and estimates it almost perfectly for the other. This reinforces that, compared to standard training, MMR + AT improves robustness in the two norms to different degrees, regardless of which of the two threat models is used.

Figure 4: Approximated robustness curves for multiple data sets. Each curve is calculated for a different model and the corresponding test data set, as indicated by the labels. The models are trained with MMR + AT; the target perturbation sizes of the threat models differ between the data sets.

Figure 4 shows robustness curves for the MMR + AT models provided by Croce2018ProvableRobustness. The models trained on MNIST and FMNIST both show the characteristic change in slope. Further, the robustness curves for CIFAR-10 and GTS show that the models for these two data sets, which were trained with much smaller target perturbation sizes, are very non-robust compared to the former two models and do not show a visually detectable change in slope. We revisit the question of how the target perturbation size should be chosen in dependence on the data set in Section 5.

4 How adversarial perturbations are processed

In the previous section, our focus was on how robust training methods impact the robustness curve of the resulting model. Now, we investigate how they impact the way the neural network “perceives” adversarial perturbations. Instead of only looking at the perturbed input and the resulting (mis-)classification, we follow the effects of a perturbation at every step of its way through the network, i. e. at each layer. As a tool for this, we introduce perturbation cost trajectories.

Definition 2.

Let f : R^n → R^k be the function calculated by a neural network, where n is the input dimensionality, k is the cardinality of the label space, and the outputs of f are the softmax probabilities for each class. Let L be the number of layers in the network. For i ∈ {1, …, L} and x ∈ R^n, let f_i(x) denote the output of the first i layers of the network, applied to x. In particular, f_L = f and f_0(x) = x. Let x be an input to the network, let δ be a perturbation and let ‖·‖ denote a norm. Then the perturbation cost of δ for x at layer i is defined as

c_i(x, δ) = ‖f_i(x + δ) − f_i(x)‖. (2)

The perturbation cost of δ at layer 0 is c_0(x, δ) = ‖δ‖. The perturbation cost trajectory of δ at x is the sequence (c_0(x, δ), c_1(x, δ), …, c_L(x, δ)).
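The sketch below computes a perturbation cost trajectory under one plausible reading of Definition 2, namely taking the cost at layer i to be the norm of the difference between the layer-i activations of the clean and the perturbed input (so that the cost at layer 0 is the norm of the perturbation itself). The toy network, input, and perturbation are placeholders.

# Perturbation cost trajectory: cost_i = || f_i(x + delta) - f_i(x) ||.
import torch
import torch.nn as nn

def perturbation_cost_trajectory(layers, x, delta, p=2):
    # layers: list of nn.Module applied in sequence; x, delta: single inputs.
    costs = [torch.linalg.vector_norm(delta, ord=p).item()]       # layer 0: ||delta||
    a, b = x, x + delta
    with torch.no_grad():
        for layer in layers:
            a, b = layer(a), layer(b)
            costs.append(torch.linalg.vector_norm(b - a, ord=p).item())
    return costs

# Toy fully connected network on flattened MNIST-sized inputs:
layers = [nn.Flatten(), nn.Linear(784, 100), nn.ReLU(), nn.Linear(100, 10), nn.Softmax(dim=-1)]
x = torch.rand(1, 1, 28, 28)
delta = 0.1 * torch.sign(torch.randn_like(x))    # an l_inf-scaled random perturbation
print(perturbation_cost_trajectory(layers, x, delta))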

Adversarial versus random perturbations

Figure 5: Averaged perturbation cost trajectories for adversarial and random perturbations for ST and MMR + AT. The labels of the curves show the perturbation type of the individual curves. The training methods are indicated by the subplot titles. In Appendix D.3 we present an additional experiment to explore why one of the robustified models misclassifies a number of random perturbations (middle panel).

In the following, we present and discuss several perturbation cost trajectories for a simple architecture trained with either ST or MMR + AT (we limit our analysis to the least and the most effective method, respectively, because of computational and time constraints). We perturb data points from the MNIST test set and feed them through the network; then, for comparison, we sample random perturbations of the exact same size and trace those through the network (see Appendix C.3 and Appendix C.2 for details on how we construct the random and adversarial perturbations). For each trained model, this yields two pairs of curves, one for the ℓ_2 norm and one for the ℓ_∞ norm. Perturbation cost trajectories are computed w. r. t. the norm that the perturbation is optimized for. This allows a number of observations (cf. Figure 5):

  1. For all three models, data points perturbed with random noise are almost never misclassified, whereas (by construction) all adversarially perturbed data points are misclassified.

  2. Only the perturbation costs of random perturbations scaled in one of the two norms decrease monotonically with each additional layer.

  3. In the non-robust network, perturbation costs for adversarial perturbations are amplified by each layer.

  4. Perturbation cost trajectories are similar for ℓ_2-optimized and ℓ_∞-optimized adversarial perturbations.

  5. Perturbation costs for random perturbations decrease at clearly different rates for the two norms. For one of the robustified models, the random perturbation cost trajectory mirrors that of the adversarial perturbations, but leads to misclassification in only a small fraction of cases.

Figure 6: Perturbation cost trajectories for adversarial perturbations plus random perturbations of different sizes, for different training methods. The labels of the curves show the perturbation type and random perturbation size of the individual curves. The training methods are indicated by the subplot titles.

By construction, adversarial examples are close to a decision boundary, but what about the rest of their surroundings? In Figure 6, we present perturbation cost trajectories for random perturbations added to adversarial perturbations. We observe that, although each adversarial example is misclassified (as that is how it was constructed), only a small fraction of points in its immediate vicinity are also misclassified. However, contrary to what we see for random perturbations added to the original inputs (cf. Figure 5), perturbation costs for random perturbations on top of adversarial perturbations do not decrease layer by layer; instead, the average perturbation cost trajectory for the adversarial perturbation with added random noise follows the average perturbation cost trajectory for the adversarial perturbations alone, and the two trajectories only separate late in the network. This shows that perturbations in the general vicinity of an adversarial perturbation are not abstracted away by the neural network’s processing, even though most of them do not affect the final classification. We also observe that the proportion of misclassified points in the vicinity of the adversarial examples differs considerably between the two robustified models.

Figure 7: Perturbation cost trajectories for adversarial perturbations (optimized for different threat models) for models trained with different training methods. The labels of the curves show the perturbation type and the model for which the perturbation was optimized. The training methods are indicated by the column titles.

Since ℓ_∞ robustification also improves ℓ_2 robustness and vice versa (see Section 3), one might expect that adversarial examples themselves transfer between models robustified for different norms. To explore this, we find adversarial perturbations for each trained model and calculate the respective trajectories not only for said model, but for all three models – see Figure 7. The adversarial perturbations are only successful against the models on which they were created. In fact, both robustified models process “foreign” adversarial perturbations similarly to random perturbations of the same size. Perhaps surprisingly, the non-robust ST model manages to correctly classify almost all “foreign” adversarial examples, even though the perturbations are relatively large.

All in all, we observe that both models trained with MMR + AT show less variation in perturbation costs across layers than the non-robust ST model. We are surprised by how strongly the perturbation cost trajectories decrease for random perturbations, for all three models. Our finding that “foreign” adversarial examples are repelled is at odds with the observation that black-box attacks appear to rely crucially on the transfer of adversarial examples between different models.

5 Scale of robustness curves

The dependence of scale on the norm

In Section 3, we showed examples of how different the scale of robustness curves (as characterized, for example, by the average distance of a sample from a decision boundary) can be, depending both on the chosen distance function and on the data set under consideration. In the following, we present some considerations on how heavily scale may vary, and end with practical recommendations for practitioners working with popular data sets. We begin by observing that on finite-dimensional vector spaces, one robustness curve places natural bounds on all others that correspond to distances induced by norms:

Proposition 1.

Let X be a finite-dimensional vector space, and let d and d′ be two distance measures on X that are induced by norms. Then there exist constants 0 < c ≤ C such that for every classifier f and every ε ≥ 0,

R^f_{d′}(c · ε) ≤ R^f_d(ε) ≤ R^f_{d′}(C · ε). (3)

See Section B.1 for the proof. Even when restricting ourselves to norm-induced distances, it is clear that the bounds provided by Proposition 1 are not necessarily informative. As noted in the discussion of Theorem 1, even for this reduced class of distances, the constants can span a very broad range. For example, for every x ∈ R^n we find

‖x‖_∞ ≤ ‖x‖_2 ≤ ‖x‖_1 ≤ √n · ‖x‖_2 ≤ n · ‖x‖_∞, (4)

where n is the data dimensionality, and these inequalities are tight. See Section B.2 for an example of a data distribution where we construct two classifiers with identical ℓ_∞ robustness, but ℓ_1 robustness that differs by a factor of n. Consequently, at least in high dimensions, the scale of a robustness curve can, but need not, depend strongly on the choice of norm.
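The inequalities between the ℓ_1, ℓ_2, and ℓ_∞ norms used above are standard and can be checked numerically; the snippet below is a quick sanity check with an arbitrary example dimension.

# Sanity check of the tight norm inequalities on R^n.
import numpy as np

n = 784
x = np.random.default_rng(1).normal(size=n)
n_inf, n_2, n_1 = (np.linalg.norm(x, ord=o) for o in (np.inf, 2, 1))
assert n_inf <= n_2 <= n_1 <= np.sqrt(n) * n_2 <= n * n_inf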

Scale for specific data sets

Figure 8: Minimum inter- and intra-class distances in different norms. Red curves: sorted distances of each point to the closest point of the same class. Blue curves: sorted distances of each point to the closest point of a different class. Each of the three panels shows the distances measured in one of the ℓ_1, ℓ_2, and ℓ_∞ norms. The data set is the full MNIST test set (10,000 points).

Robustness curves allow us to analyze robustness globally, without focusing too much on (more or less) arbitrary thresholds. Nonetheless, the question of scale remains. Which perturbations should one consider “large” or “small”? One of the underlying assumptions in the definition of adversarial examples is that the perturbed input x + δ is being incorrectly classified, i. e. that the original label y would still be the correct classification choice for x + δ. The question of scale therefore cannot be answered independently of the data distribution. In order to understand how to interpret different perturbation sizes, it can be helpful to understand how strongly a data point would need to be perturbed to actually change the correct classification. In this section, we analyze this question for several popular data sets.

Recall that for two differently labeled points x and x′, the perturbation δ = x′ − x, when applied to x, hopefully leads to a change in prediction, but it can hardly be considered adversarial. Instead, it can provide some sense of scale and serve as an upper bound on reasonable robustness thresholds: when a perturbation is large enough to change an input such that the result should correctly be assigned a different label, it might not be sensible to expect a model to be robust against it. Similarly, when a perturbation does not even “reach” the closest point of the same label, it probably should not be able to influence a model’s prediction. Of course, this is subject to the concrete distribution in question and depends on the specific sampled data that is available.
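The nearest-neighbor distances used throughout this section can be computed, for example, as in the following sketch. The arrays X and y are assumed to hold the flattened inputs and labels of a (subset of a) data set; they are placeholders.

# Per-point distance to the closest same-class point (intra-class) and to the
# closest point of a different class (inter-class), in a chosen l_p norm.
import numpy as np
from scipy.spatial.distance import cdist

def nearest_inter_intra(X, y, p=np.inf):
    metric = {1: "cityblock", 2: "euclidean", np.inf: "chebyshev"}[p]
    D = cdist(X, X, metric=metric)
    np.fill_diagonal(D, np.inf)                       # ignore the point itself
    same = y[:, None] == y[None, :]
    intra = np.where(same, D, np.inf).min(axis=1)     # closest same-class neighbor
    inter = np.where(same, np.inf, D).min(axis=1)     # closest other-class neighbor
    return inter, intra

# Sorting the per-point minima yields curves like those in Figures 8 and 9:
# inter, intra = nearest_inter_intra(X, y, p=np.inf)
# inter_curve, intra_curve = np.sort(inter), np.sort(intra)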

When we look at the respective numbers for the MNIST test set in the ℓ_1, ℓ_2, and ℓ_∞ norms, we can make several observations. See Figure 8 for the distributions of distances between points and their closest neighbors from the same class and from different classes. Because the smallest inter-class distance in the ℓ_∞ norm is close to the full dynamic range of a pixel, we can see that transforming an input from one class into one from a different class almost always requires completely flipping at least one pixel from almost-black to almost-white or vice versa. Transforming within a given class frequently requires only a smaller perturbation. For the ℓ_1 and ℓ_2 norms, the inter-class distance distributions are more spread out than the ℓ_∞ inter-class distance distribution. We observe that once perturbations reach the size of the smallest inter-class distances, it becomes possible to transform samples from different classes into each other, so beyond this threshold, any classifier must necessarily trade off between accuracy and robustness.

Figure 9: Minimum inter- and intra-class distances for three data sets. Red curves: sorted distances of each point to the closest point of the same class. Blue curves: sorted distances of each point to the closest point of a different class. Left: a subset of the MNIST test set. Middle: a subset of the GTS test set. Right: a subset of the TINY-IMG test set.

Now that we have developed a certain sense of scale w. r. t. MNIST, let us compare it to two further data sets, namely GTS and TINY-IMG. We present the inter- and intra-class distance distributions in the ℓ_∞ norm (where all distances lie in the interval [0, 1]) in Figure 9. The shapes of the curves differ strongly between data sets, and – most notably – for TINY-IMG, the distance of each point to the nearest neighbor from a different class is smaller than the distance to the nearest neighbor within the same class. Presumably, this is due to high data dimensionality, a large number of classes, and relatively large variation within a given class when compared to MNIST – refer to Table 1 for exact values.

Figure 10: Minimum inter-class distances of all data sets considered in this work, measured in the ℓ_1, ℓ_2, and ℓ_∞ norms (one norm per panel). See Table 1 for size and dimensionality.

Finally, we compare the inter-class distance distributions in the ℓ_1, ℓ_2, and ℓ_∞ norms for all data sets considered in this work – see Figure 10. We observe that for the ℓ_1 and ℓ_2 norms, the shapes of the curves are similar across data sets, but their extent is determined by the dimensionality. In the ℓ_∞ norm, vastly different curves emerge for the different data sets. We hypothesize that, because the inter-class distance distributions vary more strongly between data sets for ℓ_∞ distances than for ℓ_1 and ℓ_2 distances, the results of robustifying a model w. r. t. ℓ_∞ distances may depend more strongly on the underlying data distribution than the results of robustifying w. r. t. ℓ_1 or ℓ_2 distances. This is an interesting avenue for future work. In any case, it is safe to say that, when judging the robustness of a model by a certain threshold, that threshold must be set with respect to the scale of the distribution the model operates on.

Table 1: Smallest and largest inter-class distances for subsets of several data sets, measured in the ℓ_1, ℓ_2, and ℓ_∞ norms, together with basic contextual information (number of samples, number of classes, and dimensionality) for MNIST, TINY-IMG, FMNIST, GTS, CIFAR-10, and HAR. All data has been normalized to lie within the interval [0, 1], and duplicates and corrupted data points have been removed. Besides HAR, all data sets contain images – the reported dimensionality specifies their sizes and numbers of channels.

In Table 1, we summarize the smallest and largest inter-class distances in different norms, together with additional information about the size, number of classes, and dimensionality of all the data sets we consider in this work. The values correspond directly to Figure 10, but even in this simplified view, we can quickly make out key differences between the data sets. Compare, for example, MNIST and GTS: while it appears entirely reasonable to expect robustness at a comparatively large threshold for MNIST, the same threshold is not attainable for GTS. Relating Table 1 and Figure 4, we find entirely plausible the strong robustness results for MNIST and the small perturbation threshold chosen for GTS. Based on inter-class distances, we would also expect less robustness for CIFAR-10 than for FMNIST, though not to the extent seen in Figure 4.

Overall, the strong dependence of robustness curve scale on the data set and the chosen norm emphasizes the necessity of informed and conscious decisions regarding robustness thresholds. We hope that Table 1 can provide an easy reference when judging the scales in a threat model.

6 Conclusion

We have investigated the relationships between adversarial defenses, the threat models they address, and the robustness of a model w. r. t. small perturbations in different norms, using robustness curves and perturbation cost trajectories. We have found that ℓ_∞ threat models are surprisingly effective in improving robustness for other ℓ_p norms, and hope that future defenses will be evaluated from this perspective. We have used perturbation cost trajectories to gain a broader view of how robust and non-robust networks perceive adversarial perturbations as opposed to random perturbations. Finally, we have seen how suitable robustness thresholds necessarily depend on the data set under consideration.

It is our hope that practitioners and researchers alike will use the methodology proposed in this work, especially when developing and analyzing adversarial defenses, and carefully motivate any threat models they might choose, taking into account the available context.

Appendix A Proof of Theorem 1

Theorem 1.

If f is a linear classifier on R^n parameterized by (w, b) with w ≠ 0, i. e. f(x) = sign(⟨w, x⟩ + b), then the shape of the robustness curve for f regarding an ℓ_p norm-induced distance does not depend on the choice of p. I. e., for any p, p′ ∈ [1, ∞] there exists a c > 0 such that

R^f_{ℓ_p}(ε) = R^f_{ℓ_{p′}}(c · ε) for all ε ≥ 0. (5)

The distortion factor is given by c = ‖w‖_q / ‖w‖_{q′}, where 1/p + 1/q = 1/p′ + 1/q′ = 1.

Lemma 1.

Let w ∈ R^n with w ≠ 0. Let p and q be such that 1/p + 1/q = 1, where we take 1/∞ = 0. Then

min { ‖δ‖_p : ⟨w, δ⟩ = 1 } = 1 / ‖w‖_q, (6)

and the minimum is attained by

δ* = (1 / ‖w‖_q^q) · Σ_i sign(w_i) |w_i|^{q−1} e_i, (7)

where e_i is the i-th unit vector; for q = ∞ we instead take δ* = sign(w_{i*}) / |w_{i*}| · e_{i*}, where i* = argmax_i |w_i| and e_{i*} is the i*-th unit vector.

Proof.

By Hölder’s inequality, for any δ ∈ R^n,

⟨w, δ⟩ ≤ ‖w‖_q ‖δ‖_p. (8)

For δ such that ⟨w, δ⟩ = 1 it follows that

‖δ‖_p ≥ 1 / ‖w‖_q. (9)

Using the identity (q − 1) p = q, it is easy to check that for every w ≠ 0, with δ* as defined in Equation 7,

  1. ⟨w, δ*⟩ = ‖w‖_q^q / ‖w‖_q^q = 1, so that δ* satisfies the constraint, and

  2. ‖δ*‖_p = ‖w‖_q^{q/p} / ‖w‖_q^q = 1 / ‖w‖_q.

Item 1 shows that δ* is a feasible point, while Item 2 in combination with Equation 9 shows that δ* is minimal. ∎

Using Lemma 1, we are ready to prove Theorem 1.

Proof.

By definition,

R^f_{ℓ_p}(ε) = P_{(x, y) ∼ P}( ∃ x′ ∈ R^n : ‖x′ − x‖_p ≤ ε ∧ f(x′) ≠ y ). (10)

We can split R^n × Y into the disjoint sets

A = { (x, y) : f(x) ≠ y }, (11)
B_ε = { (x, y) : f(x) = y ∧ |⟨w, x⟩ + b| ≤ ε ‖w‖_q }, (12)
C_ε = { (x, y) : f(x) = y ∧ |⟨w, x⟩ + b| > ε ‖w‖_q }. (13)

Every (x, y) ∈ A satisfies the condition in Equation 10 for every ε ≥ 0, with x′ = x. For (x, y) with f(x) = y, applying Lemma 1 to the rescaled constraint ⟨w, δ⟩ = −(⟨w, x⟩ + b) shows that the smallest ℓ_p norm of a perturbation δ that changes the predicted class is

min { ‖δ‖_p : f(x + δ) ≠ f(x) } = |⟨w, x⟩ + b| / ‖w‖_q, (14)

so that (x, y) satisfies the condition in Equation 10 exactly if (x, y) ∈ B_ε. This shows that

R^f_{ℓ_p}(ε) = P(A) + P(B_ε). (15)

Since B_ε = { (x, y) : f(x) = y ∧ |⟨w, x⟩ + b| ≤ (c · ε) ‖w‖_{q′} } for c = ‖w‖_q / ‖w‖_{q′}, the same argument applied to p′ yields

R^f_{ℓ_p}(ε) = P(A) + P(B_ε) = R^f_{ℓ_{p′}}(c · ε). (16)

∎

Appendix B Robustness curve separation

b.1 Proof of Proposition 1

Proposition 1.

Let X be a finite-dimensional vector space, and let d and d′ be two distance measures on X that are induced by norms. Then there exist constants 0 < c ≤ C such that for every classifier f and every ε ≥ 0,

R^f_{d′}(c · ε) ≤ R^f_d(ε) ≤ R^f_{d′}(C · ε). (20)
Proof.

If X is a finite-dimensional vector space, it is known that any two norms ‖·‖ and ‖·‖′ on X are equivalent, i. e. there exist constants 0 < a ≤ b such that for all x ∈ X

a ‖x‖′ ≤ ‖x‖ ≤ b ‖x‖′. (21)

For a distance function d and perturbation size ε, let

S_d(ε) = { (x, y) : ∃ x′ ∈ X with d(x′, x) ≤ ε ∧ f(x′) ≠ y }, (22)

so that R^f_d(ε) = P(S_d(ε)). Let d and d′ be induced by the norms ‖·‖ and ‖·‖′, respectively. If d(x′, x) ≤ ε, it follows that d′(x′, x) ≤ ε / a, and if d′(x′, x) ≤ ε / b, it follows that d(x′, x) ≤ ε. As a result,

R^f_{d′}(ε / b) ≤ R^f_d(ε) ≤ R^f_{d′}(ε / a), (23)

which proves the claim with c = 1/b and C = 1/a. ∎

b.2 Example of robustness curve separation

Figure 11: Left two images: the two types of images distinguished in the toy classification problem. Second from the right: a sparse weight vector, the result of an ℓ_1-regularized linear SVM, with a single non-zero entry. Right: a dense weight vector, the result of an ℓ_2-regularized linear SVM, with no zero entries. The weight vectors are reshaped to the image dimensions; the colors encode the weight values.

As an illustrative toy example of how strongly the robustness curves for different norms can be separated, consider the following classification task. The goal is to distinguish between the two types of images shown in Figure 11. Running an ℓ_1-regularized linear SVM on this data leads to the sparse weight vector schematically represented in Figure 11, with just one non-zero entry. Running an ℓ_2-regularized linear SVM leads to the dense weight vector also schematically represented in Figure 11, with no zero entries.

Figure 12: Left: an image that is incorrectly classified by the sparse classifier, but not by the dense classifier. Right: an image that is misclassified by both classifiers.

Both classifiers have 100% accuracy, and all samples are at the same ℓ_∞ distance from both decision boundaries, i. e. the ℓ_∞ robustness curves of the two classifiers are identical. However, the ℓ_1 robustness curves of the two classifiers are maximally separated: for the sparse classifier, the ℓ_1 robustness curve is a step function at that same small value, while for the dense classifier, the step occurs at a value larger by a factor of n, the dimensionality of the data. Figure 12 shows adversarial examples for the two classifiers. For the sparse classifier, it is sufficient to flip a single pixel, while for the dense classifier, more noticeable changes are necessary.
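A construction in the spirit of this example can be written down in a few lines. The concrete images, dimensions, and values below are illustrative placeholders and not those of Figure 11; the sketch only assumes, as in Theorem 1, that the ℓ_p distance of a point to a linear decision boundary is |⟨w, x⟩ + b| divided by the dual norm of w.

# Toy sketch: a sparse and a dense linear classifier that separate the same
# binary problem, with identical l_inf distances to the decision boundary but
# l_1 distances that differ by a factor of n (the input dimensionality).
import numpy as np

n = 16 * 16
x_a = np.zeros(n)                         # prototype of one class
x_b = np.ones(n)                          # prototype of the other class (differs in every pixel)

w_sparse, b_sparse = np.eye(n)[0], -0.5   # looks at a single pixel
w_dense, b_dense = np.ones(n), -n / 2     # looks at every pixel

def boundary_distances(x, w, b):
    m = abs(w @ x + b)
    return {"l_inf": m / np.linalg.norm(w, 1), "l_1": m / np.linalg.norm(w, np.inf)}

print(boundary_distances(x_a, w_sparse, b_sparse))   # {'l_inf': 0.5, 'l_1': 0.5}
print(boundary_distances(x_a, w_dense, b_dense))     # {'l_inf': 0.5, 'l_1': 128.0} (= 0.5 * n)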

Appendix C Experimental details

c.1 Model training

We use the same model architecture as Croce2018ProvableRobustness and wong2017provable. Unless explicitly stated otherwise, the trained models are taken from Croce2018ProvableRobustness. The exact architecture of the model is: convolutional layer (number of filters: 16, size: 4x4, stride: 2), ReLU activation, convolutional layer (number of filters: 32, size: 4x4, stride: 2), ReLU activation, fully connected layer (number of units: 100), ReLU activation, output layer (number of units depends on the number of classes). All models are trained with the Adam optimizer kingma2014adam for 100 epochs, with batch size 128 and a default learning rate of 0.001. More information on the training can be found in the experimental details section of the appendix of Croce2018ProvableRobustness. All models are publicly available in the GitHub repositories of Croce2018ProvableRobustness and Croce2020Provable:

www.github.com/max-andr/provable-robustness-max-linear-regions and www.github.com/fra31/mmr-universal.
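For reference, a PyTorch sketch of the architecture described above follows. Padding and the resulting flattened feature size are not specified in the text, so the input dimension of the first fully connected layer is inferred automatically; this is an assumption, not the exact implementation of Croce2018ProvableRobustness.

# Sketch of the convolutional architecture described above.
import torch
import torch.nn as nn

def make_model(in_channels=1, num_classes=10):
    return nn.Sequential(
        nn.Conv2d(in_channels, 16, kernel_size=4, stride=2), nn.ReLU(),
        nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        nn.Flatten(),
        nn.LazyLinear(100), nn.ReLU(),       # flattened size inferred on the first forward pass
        nn.Linear(100, num_classes),
    )

model = make_model()
print(model(torch.rand(1, 1, 28, 28)).shape)   # torch.Size([1, 10])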

c.2 Approximated robustness curves

We use state-of-the-art adversarial attacks to approximate the true minimal distances of input data points to the decision boundary of a classifier for our adversarial robustness curves (see Definition 1). We base our selection of attacks on the recommendations of carlini2019evaluating. Specifically, we use the following attacks: for ℓ_1 robustness curves we use EAD chen2017ead, for ℓ_2 robustness curves we use the Carlini-Wagner attack Carlini2017Towards, and for ℓ_∞ robustness curves we use PGD madry2018towards. For all three attacks, we use the implementations of Foolbox rauber2017foolbox. For the ℓ_∞ attack, we use the standard hyperparameters of the Foolbox implementation. For the ℓ_1 and ℓ_2 attacks, we increase the number of binary search steps used to find the optimal trade-off constant between distance and confidence from 5 to 10, which we empirically found to improve the results. For the remaining hyperparameters, we again use the standard values of the Foolbox implementation.
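As a self-contained illustration of how such attack-based distance estimates can be obtained, the sketch below combines a basic ℓ_∞ PGD attack with a binary search over the perturbation budget; the smallest budget at which the attack succeeds is an upper bound on the true distance, so curves built from it bound the true robustness curve from below. This is a simplified stand-in, not the Foolbox implementation used for our experiments.

# Estimate (from above) the l_inf distance of a single example to the decision boundary.
import torch
import torch.nn.functional as F

def pgd_succeeds(model, x, y, eps, steps=40):
    # x, y: a single example with batch dimension 1.
    alpha = 2.5 * eps / steps
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        F.cross_entropy(model(x + delta), y).backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
            delta.copy_((x + delta).clamp(0, 1) - x)
        delta.grad.zero_()
    return (model(x + delta).argmax(1) != y).item()

def linf_distance_estimate(model, x, y, lo=0.0, hi=1.0, iters=10):
    for _ in range(iters):                 # binary search over the perturbation budget
        mid = (lo + hi) / 2
        if pgd_succeeds(model, x, y, mid):
            hi = mid                       # attack succeeded: the boundary is at most mid away
        else:
            lo = mid
    return hi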

c.3 Generating random perturbations

We compare perturbation cost trajectories for adversarial perturbations and random perturbations. In the following, we specify how the random perturbations are generated. Let δ be the perturbation that we want to generate; δ is a k-dimensional vector, where k is the number of input features of the original input. Let ε be the target size of the noise.

If the target size is measured in the ℓ_∞ norm, we sample

δ = ε · (r_1, …, r_k), (24)

where the r_i are i.i.d. Rademacher variables, i. e. uniformly distributed on {−1, +1}. This method of sampling is based on the Fast Gradient Sign Method, and means that our random noise is as large as possible under the ℓ_∞ norm constraint we impose.

If the target size is measured in the ℓ_2 norm, we let

δ = ε · g / ‖g‖_2, (25)

where g ∼ N(0, I_k), i. e. we sample a vector from a Gaussian distribution and scale it to the desired length.
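The two sampling schemes can be implemented directly, for example as follows; k is the number of input features and eps the target size.

# Random perturbations of a given l_inf or l_2 size.
import numpy as np

def random_linf_perturbation(k, eps, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    return eps * rng.choice([-1.0, 1.0], size=k)      # Rademacher signs, maximal under the l_inf budget

def random_l2_perturbation(k, eps, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    g = rng.normal(size=k)
    return eps * g / np.linalg.norm(g)                # Gaussian direction rescaled to l_2 length eps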

c.4 Computational architecture

We executed all programs on an architecture with 2 x Intel Xeon(R) CPU E5-2640 v4 @ 2.4 GHz, 2 x Nvidia GeForce GTX 1080 TI 12G and 128 GB RAM.

Appendix D Additional experiments

d.1 Justifying approximated robustness curves

We compared approximated and true robustness curves to justify our use of approximated robustness curves. For computational reasons, this was done only for MNIST, only for a smaller architecture than in our other experiments, and only for ℓ_∞ robustness curves. The results can be seen in Figure 13. We observe that the curves track each other reasonably well, especially for small perturbation sizes, which is the most interesting region. Due to computational constraints, we did not perform this comparison for the more complicated architecture our other experiments are based on. The calculation of the exact robustness curve in Figure 13 took several days on our computational setup (see Appendix C.4 for details). We started the calculation of the exact curves for the architecture used in our other experiments, but had to cancel it after several days, since not even a single data point had been evaluated.

Figure 13: True and approximated ℓ_∞ robustness curves. The model is trained on MNIST with MMR + AT. We use a different model architecture to reduce the runtime of the exact robustness evaluation; the architecture is: INPUT, FC(1024), RELU. Both curves are calculated for 100 points of the MNIST test set. The true robustness curve is calculated using the Mixed Integer Programming method from Tjeng2017VerifyingNeuralNetworks. The approximated curve is calculated using the PGD attack described in madry2018towards.

d.2 Adversarial attack intuition

In order to provide intuition on how adversarial examples differ based on the choice of norm and defense, we show adversarial attacks on an image from GTS in Figure 14.

Figure 14: Adversarial examples for robust and non-robust models and different norms. The adversarial examples are generated with the following adversarial attacks: ℓ_∞: Projected Gradient Descent (PGD) as suggested in madry2018towards, ℓ_2: Carlini-Wagner ℓ_2 attack (CW) as suggested in Carlini2017Towards, ℓ_1: Elastic-net attack (EAD) as suggested in chen2017ead. The adversarial perturbations are scaled by a factor of 10 to increase visibility. We show adversarial examples for three different models; the training methods are indicated by the labels on the y-axes. The original image is taken from GTS.

d.3 Distinguishing between defended and non-defended adversarial perturbations

Figure 5 gives the impression that one of the MMR + AT models does not reduce the perturbation costs of random perturbations in the way the other models do (although only a small number of these perturbations lead to misclassification). To verify that this is not simply due to the larger perturbation size required for the robustified model, we calculate perturbation cost trajectories separately for the two classes of adversarial perturbations: those the model defends against and those it does not. For both classes, and thus for both perturbation magnitudes, we observe the same effect.

Figure 15: Perturbation cost trajectories for adversarial perturbations and random perturbations for a robust model, separated by class (defended against vs. not defended against). The curves show the mean perturbation costs over those of 1000 MNIST test data points that fall into the respective class. The labels of the curves show the perturbation type and the class of the individual curves. The model is trained with MMR + AT.
Figure 16: Minimum inter-class distances for subsets of different data sets, measured in the ℓ_1, ℓ_2, and ℓ_∞ norms.

d.4 Calculating inter-class distances

In Figure 16, we show minimum inter-class distances in the ℓ_1, ℓ_2, and ℓ_∞ norms for subsets of commonly used data sets. We observe that, for all norms and all data sets we examined, a small sample of the data is sufficient to reliably estimate the minimum inter-class distance of the full data set.