1 Introduction
Research on adversarial examples is motivated by a spectrum of questions. These range from the security of models deployed in the presence of real-world adversaries to the need to capture limitations of representations and their (in)ability to generalize (Gilmer et al., 2018a). The broadest accepted definition of an adversarial example is “an input to a ML model that is intentionally designed by an attacker to fool the model into producing an incorrect output” (Goodfellow & Papernot, 2017).
To enable concrete progress, many definitions of adversarial examples have been introduced in the literature since their initial discovery (Szegedy et al., 2013; Biggio et al., 2013). In a majority of work, adversarial examples are formalized as adding a perturbation to some test example to obtain an input that produces an incorrect model outcome.¹ We refer to this entire class of malicious inputs as perturbation-based adversarial examples. The adversary’s capabilities may optionally be constrained by placing a bound on the maximum perturbation added to the original input (e.g., using an $\ell_p$ norm).

¹Here, an incorrect output either refers to the model returning any class different from the original source class of the input, or a specific target class chosen by the adversary prior to searching for a perturbation.
Achieving robustness to perturbation-based adversarial examples, in particular when they are constrained using $\ell_p$ norms, is often cast as a problem of learning a model that is uniformly continuous: the defender wishes to prove that for all $\epsilon > 0$ and for some $\delta > 0$, all pairs of points $(x, \hat{x})$ with $\|x - \hat{x}\| \leq \delta$ satisfy $\|Z(x) - Z(\hat{x})\| \leq \epsilon$ (where $Z(\cdot)$ denotes the classifier’s logits). Different papers take different approaches to achieving this result, ranging from robust optimization
(Madry et al., 2017) to training models to have small Lipschitz constants (Cisse et al., 2017) to models that are provably robust to small perturbations (Wong & Kolter, 2018; Raghunathan et al., 2018).

In this paper we present analytical results showing that optimizing for uniform continuity is not only insufficient to address the lack of generalization identified through adversarial examples, but also potentially harmful. Our intuition, captured in Figure 1, relies on the inability of $\ell_p$ norms (or any other distance metric that does not perfectly capture semantics) to capture the geometry of ideal decision boundaries. This leads us to present analytical constructions and empirical evidence that robustness to perturbation-based adversaries can increase the vulnerability of models to other types of adversarial examples.
Our argument relies on the existence of invariance-based adversarial examples (Jacobsen et al., 2019). Rather than perturbing the input to change the classifier’s output, these examples modify the input’s semantics while keeping the decision of the classifier identical. In other words, the vulnerability exploited by invariance-based adversarial examples is a lack of sensitivity in directions relevant to the task: the model’s consistent prediction does not reflect the change in the input’s true label.
Our analytical work exposes a complex relationship between perturbation-based and invariance-based adversarial examples. We construct a model that is robust to perturbation-based adversarial examples but not to invariance-based adversarial examples. We then demonstrate how an imperfect model for the adversarial spheres task proposed by Gilmer et al. (2018b) is vulnerable to either perturbation-based or invariance-based attacks, depending on whether the attacked point lies on the inner or the outer sphere. Hence, at least these two types of adversarial examples are needed to fully account for model failures (more vulnerabilities may be discovered at a later point).
To demonstrate the practicality of our argument, we then consider vision models with state-of-the-art robustness to $\ell_p$-norm adversaries. We introduce an algorithmic approach for finding invariance-based adversarial examples. Our attacks are model-agnostic and generate $\ell_0$ and $\ell_\infty$ invariance adversarial examples, succeeding at changing the underlying classification (as determined by a human study) in 48% and 14% of cases, respectively. When robust models classify the successful attacks, they agree with the human label on at most 58% (respectively, 5%) of them.
Perhaps one of the most interesting aspects of our work is to show that different classes of limitations of current classifiers fall under the same umbrella term of adversarial examples. Despite this common terminology, each of these limitations may stem from different shortcomings of learning that have non-trivial relationships. To be clear, developing $\ell_p$-norm perturbation-robust classifiers is a useful benchmark task. However, as our paper demonstrates, it does not capture the only way classifiers may make mistakes, even within the prescribed norm ball. Hence, we argue that the community will benefit from working with a series of definitions that precisely taxonomize adversarial examples.
2 Defining Perturbation-based and Invariance-based Adversarial Examples
In order to make precise statements about adversarial examples, we begin with two definitions.
Definition 1 (Perturbation-based Adversarial Examples).
Let $f_i$ denote the $i$-th layer, logit, or argmax of the classifier $f$. A perturbation-based adversarial example (or perturbation adversarial) $x^* \in \mathcal{X}$ corresponding to a legitimate test input $x \in \mathcal{X}$ fulfills:

(i) Created by adversary: $x^*$ is created by an algorithm $\mathcal{A}: \mathcal{X} \to \mathcal{X}$ with $x \mapsto x^*$.

(ii) Perturbation of output: $f_i(x^*) \neq f_i(x)$ and $\mathcal{O}(x^*) = \mathcal{O}(x)$, where the perturbation target $f_i$ is set by the adversary and $\mathcal{O}$ denotes the oracle.

Furthermore, $x^*$ is called $\epsilon$-bounded if $\|x^* - x\| \leq \epsilon$, where $\|\cdot\|$ is a norm on $\mathcal{X}$ and $\epsilon > 0$.
Property (i) allows us to distinguish perturbation adversarial examples from points that are misclassified by the model without adversarial intervention. Furthermore, the above definition also incorporates adversarial perturbations designed for hidden features as in Sabour et al. (2016), while usually the decision of the classifier (the argmax operation on the logits) is used as the perturbation target. Our definition also identifies bounded perturbation-based adversarial examples (Goodfellow et al., 2015) as a specific case of unbounded perturbation-based adversarial examples. However, our analysis primarily considers the latter, which corresponds to the threat model of a stronger adversary.
Definition 2 (Invariance-based Adversarial Examples).
Let $f_i$ denote the $i$-th layer, logit, or argmax of the classifier $f$. An invariance-based adversarial example (or invariance adversarial) $x^* \in \mathcal{X}$ corresponding to a legitimate test input $x \in \mathcal{X}$ fulfills:

(i) Created by adversary: $x^*$ is created by an algorithm $\mathcal{A}: \mathcal{X} \to \mathcal{X}$ with $x \mapsto x^*$.

(ii) Lies in the pre-image of $x$ under $f_i$: $f_i(x^*) = f_i(x)$ and $\mathcal{O}(x^*) \neq \mathcal{O}(x)$, where $\mathcal{O}$ denotes the oracle.
As a consequence, $F(x^*) = F(x)$ also holds for invariance-based adversarial examples, where $F$ denotes the output (argmax) of the classifier. Intuitively, adversarial perturbations cause the output of the classifier to change while the oracle would still label the new input with the original source class. Whereas perturbation-based adversarial examples exploit the classifier’s excessive sensitivity in task-irrelevant directions, invariance-based adversarial examples explore the classifier’s pre-image to identify excessive invariance in task-relevant directions: the model’s prediction is unchanged while the oracle’s output differs. Briefly put, perturbation-based and invariance-based adversarial examples are complementary failure modes of the learned classifier.
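These definitions can be operationalized as a simple membership check given query access to the model and to an oracle (e.g., a human labeler). The sketch below is a minimal illustration, not part of the formal definitions; the `model` and `oracle` callables are hypothetical stand-ins that each return a class label.

```python
import numpy as np

def categorize_candidate(x, x_star, model, oracle, norm=np.inf, eps=None):
    """Categorize a candidate input x_star crafted from a legitimate input x.

    `model` and `oracle` are hypothetical callables mapping an input to a class
    label (the oracle stands in for a human annotator).
    Returns "perturbation-based", "invariance-based", or "neither".
    """
    if eps is not None:
        # Optional epsilon-bounded threat model (Definition 1).
        if np.linalg.norm((x_star - x).ravel(), ord=norm) > eps:
            return "neither"  # outside the prescribed norm ball

    model_changed = model(x_star) != model(x)
    oracle_changed = oracle(x_star) != oracle(x)

    if model_changed and not oracle_changed:
        return "perturbation-based"  # excessive sensitivity (Definition 1)
    if not model_changed and oracle_changed:
        return "invariance-based"    # excessive invariance (Definition 2)
    return "neither"
```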
3 Robustness to Perturbation-based Adversarial Examples Can Cause Invariance-based Vulnerabilities
We now investigate the relationship between the two adversarial example definitions from Section 2. So far, it has been unclear whether solving perturbation-based adversarial examples implies solving invariance-based adversarial examples, and vice versa. In the following, we show that this relationship is intricate and that developing models robust in only one of the two settings is insufficient.
In a general setting, invariance and stability can be uncoupled. To see this, consider a linear classifier with weight matrix $W$. Its perturbation-robustness is tightly related to forward stability (the largest singular value of $W$). The invariance view, on the other hand, relates to the stability of the inverse (the smallest singular value of $W$) and to the null space of $W$. As the largest and smallest singular values are uncoupled for general matrices $W$, the relationship between both viewpoints is likely non-trivial in practice; the numerical sketch below illustrates the two regimes.
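The following is a small numerical sketch of this point (the matrix, its dimensions, and the perturbation sizes are arbitrary illustrative choices): the largest singular value governs how strongly a small input perturbation can move the logits, while null-space directions admit arbitrarily large input changes that leave the logits untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 10))   # wide linear "classifier": it necessarily has a non-trivial null space

U, s, Vt = np.linalg.svd(W)    # full SVD: rows of Vt beyond the first 3 span the null space of W
print("largest singular value (sensitivity):", s.max())
print("smallest non-zero singular value (inverse stability):", s.min())

x = rng.normal(size=10)

# Sensitivity: a tiny perturbation along the top right-singular vector moves the logits.
delta = 1e-3 * Vt[0]
print("logit change from tiny perturbation:", np.linalg.norm(W @ (x + delta) - W @ x))

# Invariance: an arbitrarily large change along a null-space direction leaves the logits unchanged.
null_dir = Vt[5]
print("logit change from huge null-space change:", np.linalg.norm(W @ (x + 100.0 * null_dir) - W @ x))
```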
3.1 Building our Intuition with Extreme Uniform Continuity
In the extreme, a classifier achieving perfect uniform continuity is a constant classifier. Let $f_c$ denote a classifier with $f_c(x) = c$ for all $x \in \mathcal{X}$. As the classifier maps all inputs to the same output $c$, there exist no $x, x^*$ such that $f_c(x) \neq f_c(x^*)$: no perturbation of an initially correctly classified input can change its prediction. Thus, the model is trivially perturbation-robust (at the expense of decreased utility). On the other hand, the pre-image of $c$ under $f_c$ is the entire input space, so $f_c$ is arbitrarily vulnerable to invariance-based adversarial examples.
This trivial model illustrates how one needs to control not only sensitivity but also invariance, alongside accuracy, to obtain a robust model. Hence, we argue that the often-discussed trade-off between accuracy and robustness (see Tsipras et al. (2019) for a recent treatment) should in fact take into account at least three notions: accuracy, sensitivity, and invariance. This is depicted in Figure 1. In the following, we present arguments for why this insight also extends to almost-perfect classifiers.
3.2 Comparing Invariance-based and Perturbation-based Robustness
We now show how the analysis of perturbation-based and invariance-based adversarial examples can uncover different model failures. To do so, we consider the synthetic adversarial spheres problem of Gilmer et al. (2018b). The goal of this task is to distinguish points sampled from two concentric spheres (class 1: $\|x\|_2 = R_1$ and class 2: $\|x\|_2 = R_2$) with different radii $R_1 < R_2$. The dataset was designed such that a robust (max-margin) classifier can be formulated as
$$f(x) = \mathrm{sign}\left(\|x\|_2 - \tfrac{R_1 + R_2}{2}\right).$$
Our analysis considers a similar, but slightly suboptimal, classifier in order to study model failures in a controlled setting:
$$\hat{f}(x) = \mathrm{sign}\left(\|x_{1:k}\|_2 - b\right),$$
which computes the norm of $x$ from only its first $k$ Cartesian coordinates (for some $k < d$, where $d$ is the input dimension) and outputs $-1$ (resp. $+1$) for the inner (resp. outer) sphere. The bias $b$ is chosen based on a finite training set (see Appendix A).
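A minimal sketch of this setup follows; the ambient dimension, the number of used coordinates $k$, the radii, and the sample sizes are illustrative placeholders rather than the values used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 500, 250          # ambient dimension and number of "used" coordinates (placeholders)
R1, R2 = 1.0, 1.3        # inner and outer radii (placeholders)

def sample_sphere(radius, n):
    x = rng.normal(size=(n, d))
    return radius * x / np.linalg.norm(x, axis=1, keepdims=True)

def robust_classifier(x):
    # Max-margin classifier: thresholds the full norm halfway between the two radii.
    return np.sign(np.linalg.norm(x, axis=-1) - (R1 + R2) / 2)

def suboptimal_classifier(x, b):
    # Uses only the first k coordinates; the bias b is fit on a finite training set (Appendix A).
    return np.sign(np.linalg.norm(x[..., :k], axis=-1) - b)

# Fit b as the midpoint between the extreme projections of the two training classes,
# assuming the projections are separable (Appendix A).
x_inner, x_outer = sample_sphere(R1, 500), sample_sphere(R2, 500)
b = 0.5 * (np.linalg.norm(x_inner[:, :k], axis=1).max()
           + np.linalg.norm(x_outer[:, :k], axis=1).min())

x_test = np.concatenate([sample_sphere(R1, 500), sample_sphere(R2, 500)])
y_test = np.concatenate([-np.ones(500), np.ones(500)])
print("test accuracy:", (suboptimal_classifier(x_test, b) == y_test).mean())
```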
Even though this suboptimal classifier reaches nearly 100% accuracy on finite test data, the model is imperfect in the presence of adversaries that operate on the data manifold (i.e., produce adversarial examples that remain on one of the two spheres but are misclassified). Most interestingly, the perturbation-based and invariance-based approaches uncover different failures (see Appendix A for details on the attacks):

Perturbation-based: All points $x$ from the outer sphere (i.e., $\|x\|_2 = R_2$) can be perturbed to $x^* = x + \delta$ such that $\hat{f}(x^*) \neq \hat{f}(x)$, while staying on the outer sphere (i.e., $\|x^*\|_2 = R_2$).

Invariance-based: All points $x$ from the inner sphere ($\|x\|_2 = R_1$) can be perturbed to $x^* = x + \delta$ such that $\hat{f}(x^*) = \hat{f}(x)$, despite $x^*$ in fact lying on the outer sphere after the perturbation has been added (i.e., $\|x^*\|_2 = R_2$).
In Figure 2, we plot the mean accuracy over points sampled either from the inner or the outer sphere, as a function of the norm of the adversarial manipulation added to create perturbation-based and invariance-based adversarial examples. This illustrates how the robustness regime differs significantly between the two variants of adversarial examples. Therefore, by looking only at perturbation-based (respectively, invariance-based) adversarial examples, important model failures may be overlooked. This is exacerbated when the data is sampled in an unbalanced fashion from the two spheres: the inner sphere is robust to perturbation adversarial examples while the outer sphere is robust to invariance adversarial examples (for accurate models).
4 Invariance-based Attacks in Practice
We now show that our argument is not limited to the analysis of synthetic tasks, and give practical automated attack algorithms to generate invariance adversarial examples. We elect to study the only dataset for which robustness is considered to be nearly solved under the $\ell_\infty$-norm threat model: MNIST (Schott et al., 2019). We show that MNIST models trained to be robust to perturbation-based adversarial examples are less robust to invariance-based adversarial examples. As a result, while perturbation adversarial examples may not exist within the prescribed $\epsilon$-ball around test examples, invariance adversarial examples still do exist within that same ball.
Why MNIST?
The MNIST dataset is typically a poor choice for studying adversarial examples, and in particular defenses designed to mitigate them (Carlini et al., 2019). In large part this is because MNIST differs significantly from other vision classification problems (e.g., features are quasi-binary and classes are well separated in most cases). However, the simplicity of MNIST is why studying $\ell_\infty$-norm adversarial examples was originally proposed as a toy task to benchmark models (Goodfellow et al., 2015); the task turned out to be much more difficult than originally expected. Several years later, it is now argued that training MNIST classifiers whose decision is constant in an $\ell_\infty$-norm ball around their training data provides robustness to adversarial examples (Schott et al., 2019; Madry et al., 2017; Wong & Kolter, 2018; Raghunathan et al., 2018).
Furthermore, if defenses relying on an $\ell_p$-norm threat model are going to perform well on a vision task, MNIST is likely the best dataset to measure that, due to the specificities mentioned above. In fact, MNIST is the only dataset for which robustness to adversarial examples is considered even remotely close to being solved (Schott et al., 2019), and researchers working on (provable) robustness to adversarial examples have moved on to other, larger vision datasets such as CIFAR-10 (Madry et al., 2017; Wong et al., 2018) or ImageNet (Lecuyer et al., 2018; Cohen et al., 2019).

This section argues that, contrary to popular belief, MNIST is far from being solved. We show why robustness to $\ell_p$-norm perturbation-based adversaries is insufficient, even on MNIST, and why defenses with unreasonably high uniform continuity can harm the performance of the classifier and make it more vulnerable to other attacks that exploit this excessive invariance.
4.1 A toy worst case: binarized MNIST classifier
To give an initial constructive example, consider an MNIST classifier which binarizes (by thresholding at, e.g., 0.5) all of its inputs before classifying them with a neural network. As Tramèr et al. (2018) and Schott et al. (2019) demonstrate, this binarizing classifier is highly robust to $\ell_\infty$ perturbations, because most perturbations in pixel space do not actually change the (thresholded) feature representation.

However, this binarized classifier has trivial invariance-based adversarial examples. Figure 8 shows an example of this attack: two images which are dramatically different to a human (e.g., a digit one and a digit four) can become identical after preprocessing the images with a thresholding function (as examined by, e.g., Schott et al. (2019)).
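This failure mode can be reproduced with a few lines of code on synthetic pixel arrays (no trained network or real MNIST digits involved; the images, strokes, and threshold below are purely illustrative):

```python
import numpy as np

def binarize(x, threshold=0.5):
    return (x > threshold).astype(np.float32)

x = np.zeros((28, 28), dtype=np.float32)
x[4:24, 13:15] = 0.9        # a bright vertical stroke: a crude "1"

x_star = x.copy()
x_star[4:14, 8:10] = 0.45   # add a clearly visible gray stroke below the threshold ...
x_star[12:14, 8:13] = 0.45  # ... and a crossbar, turning the "1" into something closer to a "4"

# The binarizing classifier receives exactly the same input in both cases.
assert np.array_equal(binarize(x), binarize(x_star))
print("largest pixel change visible to a human:", np.abs(x_star - x).max())  # 0.45
```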
4.2 Generating Model-agnostic Invariance-based Adversarial Examples
In the following, we build on existing invariance-based attacks (Jacobsen et al., 2019; Behrmann et al., 2018; Li et al., 2019) to propose a model-agnostic algorithm for crafting invariance-based adversarial examples. That is, our attack algorithm generates invariance adversarial examples that cause a human to change their classification, but where most models, unknown to the attack algorithm, will not change their classification. Our algorithm is simple, albeit tailored to work specifically on datasets where comparing images in pixel space is meaningful, like MNIST.
Begin with a source image, correctly classified by both the oracle evaluator (i.e., a human) and the model. Next, try all allowed affine transformations of training data points whose label differs from that of the source image, and find the target training example which, once transformed, has the smallest distance to the source image. Finally, construct an invariance-based adversarial example by perturbing the source image to be “more similar” to the target image under the metric considered. In Appendix B, we describe instantiations of this algorithm for the $\ell_0$ and $\ell_\infty$ norms. Figure 4 visualizes the sub-steps of the $\ell_0$ attack, which are described in detail in Appendix B.
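The sketch below illustrates the nearest transformed neighbor search at the core of the attack; the transformation grid, the SciPy helpers, and the simple thresholded $\ell_0$ matching metric are illustrative choices, and the refinement steps are deferred to Appendix B.

```python
import itertools
import numpy as np
from scipy import ndimage

def l0_distance(a, b, tol=0.3):
    # Count pixels that differ "substantially"; tol is an arbitrary illustrative threshold.
    return int(np.sum(np.abs(a - b) > tol))

def nearest_transformed_neighbor(x, y, train_images, train_labels,
                                 angles=(-20, -10, 0, 10, 20),
                                 shifts=(-4, -2, 0, 2, 4)):
    """Search affine-transformed training digits with a *different* label for the
    one closest to the source image x under the (illustrative) l0 metric."""
    best, best_dist = None, np.inf
    for xt, yt in zip(train_images, train_labels):
        if yt == y:
            continue
        for angle, dx, dy in itertools.product(angles, shifts, shifts):
            candidate = ndimage.rotate(xt, angle, reshape=False, order=1)
            candidate = ndimage.shift(candidate, (dy, dx), order=1)
            dist = l0_distance(x, candidate)
            if dist < best_dist:
                best, best_dist = candidate, dist
    return best, best_dist
```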
The underlying assumption of this attack is that small affine transformations are less likely to cause an oracle (human) classifier to change the label it assigns to the underlying digit than norm-bounded pixel perturbations are. We validate this hypothesis with a human study in Section 4.3.
4.3 Evaluation
Attack analysis.
We generate 1000 adversarial examples using each of the two above approaches on examples randomly drawn from the MNIST test set. Our $\ell_0$ attack is quite slow, with the alignment process taking (amortized) several minutes per example; we performed no optimizations of this process and expect it could be sped up significantly. The mean $\ell_0$ distortion required is 25.9 pixels (with a median of 25). The $\ell_\infty$ adversarial examples always use the full perturbation budget $\epsilon$ and take a similar amount of time to generate; most of the cost is again dominated by finding the nearest transformed training image.
Human Study.
We randomly selected 100 examples from the MNIST test set and created 100 invariance-based adversarial examples under the $\ell_0$ norm and the $\ell_\infty$ norm, as described above. We then conducted a human study to evaluate whether or not these invariance adversarial examples are indeed successful, i.e., whether humans agree that the label has been changed despite the model’s prediction remaining the same. We presented 40 human evaluators with these images; half of the images shown were natural, unmodified MNIST digits, and the remaining half were distributed randomly between $\ell_0$ and $\ell_\infty$ invariance adversarial examples.

Results.
For the clean (unmodified) test images, 98 of the 100 examples were labeled correctly by all human evaluators. The remaining 2 images were labeled correctly by a majority of the human evaluators.
Our $\ell_0$ attack is highly effective: for 48 of the 100 examples, a clear majority of the human evaluators who saw that digit assigned it the same label, different from the original test label. Humans only agreed with the original test label (by the same margin) on 34 of the images, and did not form a consensus on the remaining 18 examples. The (much simpler) $\ell_\infty$ attack is less effective: humans agreed that the image changed label on only 14 of the examples, and agreed the label had not changed in 74 cases. We summarize these results in Figure 5(a).
In Figure 5(b) we show sample invariance adversarial examples. To simplify the analysis in the following section, we split our generated invariance adversarial examples into two sets: the successes and the failures, as determined by whether the plurality decision of the human evaluators differed from or matched the original label. We only evaluate the models on the subset of invariance adversarial examples that caused the humans to switch their classification.
Model Evaluation.
Now that we have oracle ground-truth labels for each of the images, as decided by the human evaluators, we report how often our models agree with the human-assigned label. Table 1 summarizes the results of this analysis. For the invariance adversarial examples, we report model accuracy only on the successful attacks, that is, those where the human-assigned oracle label changed between the original image and the modified image.
Every classifier labeled all successful $\ell_\infty$ invariance adversarial examples incorrectly, with one exception where the PGD-trained classifier of Madry et al. (2017) labeled one of the invariance adversarial examples correctly. This is despite the fact that PGD adversarial training and Analysis by Synthesis (Schott et al., 2019) are two of the state-of-the-art perturbation-robust classifiers.
The situation is more complex for the $\ell_0$ invariance adversarial examples. In this setting, the models which achieve higher perturbation-robustness obtain lower accuracy on this new invariance test set. For example, Bafna et al. (2018) develop an $\ell_0$-perturbation-robust classifier that relies on the sparse Fourier transform. This perturbation-robust classifier is substantially weaker against invariance adversarial examples, getting only 38% accuracy compared to a baseline classifier’s 54% accuracy.

Table 1: Fraction of examples where human and model agree.
Model:                Baseline  ABS   Binary-ABS  PGD   PGD   Sparse
Clean                 99%       99%   99%         99%   99%   99%
$\ell_0$ attack       54%       58%   47%         56%   27%   38%
$\ell_\infty$ attack  0%        0%    0%          0%    5%    0%
4.4 Natural Images
While the previous discussion focused on a synthetic task (Adversarial Spheres) and a simple one (MNIST), similar phenomena may arise in natural images. In Figure 6, we show two different perturbations of an original image (left). The perturbation in the middle image is nearly imperceptible, and thus the classifier’s decision should be robust to such changes. The image on the right, on the other hand, underwent a semantic change (from a tennis ball to a strawberry), and thus the classifier should be sensitive to it (even though this case is ambiguous due to the two objects in the image). However, in terms of the $\ell_p$ norm, the change in the right image is even smaller than the imperceptible change in the middle. Hence, making the classifier robust within this norm ball will make it vulnerable to invariance-based adversarial examples like the semantic change in the right image.
5 Conclusion
Training models to be robust to perturbation-based adversarial examples should not be treated as equivalent to learning models that are robust to all adversarial examples. While most research has focused on perturbation-based adversarial examples that exploit excessive classifier sensitivity, we show that the reverse viewpoint of excessive classifier invariance should also be taken into account when evaluating robustness. Furthermore, other, unknown types of adversarial examples may exist: it remains unclear whether or not the union of perturbation and invariance adversarial examples completely captures the full space of evasion attacks.
Consequences for $\ell_p$-norm evaluation.
Our invariance-based attacks are able to find adversarial examples within the prescribed $\epsilon$-ball on classifiers that were trained to be robust to norm-bounded perturbation-based adversaries. As a consequence of this analysis, researchers should carefully set the radii of $\ell_p$ balls when measuring robustness to norm-bounded perturbation-based adversarial examples. Furthermore, setting a consistent radius across all of the data may be difficult: we find in our experiments that some class pairs are more easily attacked than others by invariance-based adversaries.
Some recent defense proposals, which claim extremely high $\ell_0$- and $\ell_\infty$-norm-bounded robustness, are likely overfitting to peculiarities of MNIST to deliver higher robustness to perturbation-based adversaries; this may not actually deliver classifiers that match the human oracle more often. Indeed, another byproduct of our study is to showcase the importance of human studies when the true label of candidate adversarial inputs becomes ambiguous and cannot be inferred algorithmically.
Invariance.
Our work confirms recently reported findings in that it surfaces the need for mitigating undesired invariance in classifiers. The cross-entropy loss, as well as architectural elements such as ReLU activation functions, have been put forward as possible sources of excessive invariance (Jacobsen et al., 2019; Behrmann et al., 2018). However, more work is needed to develop quantitative metrics for invariance-based robustness. One promising architecture class for controlling invariance-based robustness are invertible networks (Dinh et al., 2014) because, by construction, they cannot build up any invariance until the final layer (Jacobsen et al., 2018; Behrmann et al., 2019).

References
 Bafna et al. (2018) Mitali Bafna, Jack Murtagh, and Nikhil Vyas. Thwarting adversarial examples: An $\ell_0$-robust sparse Fourier transform. In Advances in Neural Information Processing Systems, pp. 10096–10106, 2018.
 Behrmann et al. (2018) Jens Behrmann, Sören Dittmer, Pascal Fernsel, and Peter Maaß. Analysis of invariance and robustness via invertibility of ReLU networks. arXiv preprint arXiv:1806.09730, 2018.
 Behrmann et al. (2019) Jens Behrmann, Will Grathwohl, Ricky T. Q. Chen, David Duvenaud, and JörnHenrik Jacobsen. Invertible residual networks. arXiv preprint arXiv:1811.00995, 2019.

 Biggio et al. (2013) Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 387–402. Springer, 2013.
 Carlini et al. (2019) Nicholas Carlini, Anish Athalye, Nicolas Papernot, Wieland Brendel, Jonas Rauber, Dimitris Tsipras, Ian Goodfellow, and Aleksander Madry. On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705, 2019.
 Cisse et al. (2017) Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 854–863. JMLR.org, 2017.
 Cohen et al. (2019) Jeremy M Cohen, Elan Rosenfeld, and J Zico Kolter. Certified adversarial robustness via randomized smoothing. arXiv preprint arXiv:1902.02918, 2019.
 Dinh et al. (2014) Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
 Gilmer et al. (2018a) Justin Gilmer, Ryan P. Adams, Ian Goodfellow, David Andersen, and George E. Dahl. Motivating the rules of the game for adversarial example research. arXiv preprint arXiv:1807.06732, 2018a.
 Gilmer et al. (2018b) Justin Gilmer, Luke Metz, Fartash Faghri, Samuel S Schoenholz, Maithra Raghu, Martin Wattenberg, and Ian Goodfellow. Adversarial spheres. arXiv preprint arXiv:1801.02774, 2018b.
 Goodfellow & Papernot (2017) Ian Goodfellow and Nicolas Papernot. Is attacking machine learning easier than defending it? Blog post, February 15, 2017.
 Goodfellow et al. (2015) Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. International Conference on Learning Representations, 2015.
 Ilyas et al. (2017) Andrew Ilyas, Ajil Jalal, Eirini Asteri, Constantinos Daskalakis, and Alexandros G Dimakis. The robust manifold defense: Adversarial training using generative models. arXiv preprint arXiv:1712.09196, 2017.
 Jacobsen et al. (2018) Jörn-Henrik Jacobsen, Arnold W.M. Smeulders, and Edouard Oyallon. i-RevNet: Deep invertible networks. In International Conference on Learning Representations, 2018.
 Jacobsen et al. (2019) JörnHenrik Jacobsen, Jens Behrmann, Richard Zemel, and Matthias Bethge. Excessive invariance causes adversarial vulnerability. In International Conference on Learning Representations, 2019.
 Lecuyer et al. (2018) Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy. arXiv preprint arXiv:1802.03471, 2018.
 Li et al. (2019) Ke Li, Tianhao Zhang, and Jitendra Malik. A study of robustness of neural nets using approximate feature collisions, 2019. URL https://openreview.net/forum?id=H1gDgn0qY7.

 Madry et al. (2017) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. International Conference on Learning Representations, 2017.
 Mirza & Osindero (2014) Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
 Raghunathan et al. (2018) Aditi Raghunathan, Jacob Steinhardt, and Percy Liang. Certified defenses against adversarial examples. International Conference on Learning Representations, 2018.
 Sabour et al. (2016) Sara Sabour, Yanshuai Cao, Fartash Faghri, and David J Fleet. Adversarial manipulation of deep representations. International Conference on Learning Representations, 2016.
 Samangouei et al. (2018) Pouya Samangouei, Maya Kabkab, and Rama Chellappa. Defensegan: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605, 2018.
 Schott et al. (2019) Lukas Schott, Jonas Rauber, Matthias Bethge, and Wieland Brendel. Towards the first adversarially robust neural network model on MNIST. In International Conference on Learning Representations, 2019.
 Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
 Tramèr et al. (2018) Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. In International Conference on Learning Representations, 2018.

 Tsipras et al. (2019) Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. In International Conference on Learning Representations, 2019.
 Wong & Kolter (2018) Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In Proceedings of the 35th International Conference on Machine Learning, 2018.
 Wong et al. (2018) Eric Wong, Frank Schmidt, Jan Hendrik Metzen, and J Zico Kolter. Scaling provable adversarial defenses. In Advances in Neural Information Processing Systems, pp. 8410–8419, 2018.
Appendix A Details about Adversarial Spheres Experiment
In this section, we provide details about the Adversarial Spheres (Gilmer et al., 2018b) experiment. First, the bias $b$ is chosen such that the classifier is the max-margin classifier on the (finite) training set, where $x_{1:k}$ denotes the first $k$ coordinates of $x$. Assuming separability, i.e., $\max_{x:\, \|x\|_2 = R_1} \|x_{1:k}\|_2 < \min_{x:\, \|x\|_2 = R_2} \|x_{1:k}\|_2$ over the training points, we set
$$b = \tfrac{1}{2}\Big(\max_{x:\, \|x\|_2 = R_1} \|x_{1:k}\|_2 \;+\; \min_{x:\, \|x\|_2 = R_2} \|x_{1:k}\|_2\Big).$$
Second, the attacks are designed such that the adversarial examples stay on the data manifold (the two concentric spheres). In particular, the following steps are taken:
Perturbation-based: All points $x$ from the outer sphere (i.e., $\|x\|_2 = R_2$) can be perturbed to $x^* = x + \delta$ such that $\hat{f}(x^*) \neq \hat{f}(x)$, while staying on the outer sphere (i.e., $\|x^*\|_2 = R_2$), via the following steps:

Perturbation of the decision: $\tilde{x}_{1:k} = \alpha \, x_{1:k}$ (with $\tilde{x}_{k+1:d} = x_{k+1:d}$), where the scaling $\alpha$ is chosen such that $\|\tilde{x}_{1:k}\|_2 < b$, i.e., $\hat{f}(\tilde{x}) = -1$.

Projection to the outer sphere: $x^*_{1:k} = \tilde{x}_{1:k}$ and $x^*_{k+1:d} = \beta \, \tilde{x}_{k+1:d}$, where the scaling $\beta$ is chosen such that $\|x^*\|_2 = R_2$.
For points from the inner sphere, this is not possible if $b > R_1$ (since $\|x^*_{1:k}\|_2 \leq \|x^*\|_2 = R_1 < b$ for any perturbation that keeps the point on the inner sphere).
Invariance-based: All points $x$ from the inner sphere ($\|x\|_2 = R_1$) can be perturbed to $x^* = x + \delta$ such that $\hat{f}(x^*) = \hat{f}(x)$, despite $x^*$ in fact lying on the outer sphere after the perturbation has been added (i.e., $\|x^*\|_2 = R_2$), via the following steps:

Fixing the used dimensions: $x^*_{1:k} = x_{1:k}$, so that $\hat{f}(x^*) = \hat{f}(x)$.

Perturbation of the unused dimensions: $x^*_{k+1:d} = \gamma \, x_{k+1:d}$, where the scaling $\gamma$ is chosen such that $\|x^*\|_2 = R_2$.
For points from the outer sphere, the analogous move onto the inner sphere is not possible if $\|x_{1:k}\|_2 > R_1$ (the fixed used coordinates alone already exceed the inner radius). A code sketch of both procedures is given below.
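The following sketch illustrates one way to realize both procedures, reusing the notation above ($k$ used coordinates, radii $R_1 < R_2$, bias $b$) and assuming the unused coordinates of the attacked point are non-zero; the specific scalings are illustrative choices.

```python
import numpy as np

def perturbation_attack(x, k, b, R2):
    """Flip the prediction of an outer-sphere point while keeping it on the outer sphere."""
    x_adv = x.copy()
    used, unused = x_adv[:k], x_adv[k:]          # views into x_adv, modified in place below
    # Step 1: shrink the used coordinates until their norm drops below the bias b,
    # flipping the prediction from +1 to -1.
    used *= 0.99 * b / np.linalg.norm(used)
    # Step 2: rescale the unused coordinates so the point lies on the outer sphere again.
    unused *= np.sqrt(R2**2 - np.linalg.norm(used)**2) / np.linalg.norm(unused)
    return x_adv

def invariance_attack(x, k, R2):
    """Move an inner-sphere point onto the outer sphere without changing the prediction."""
    x_adv = x.copy()
    used, unused = x_adv[:k], x_adv[k:]
    # The used coordinates are untouched, so the classifier's output cannot change;
    # the unused coordinates are rescaled so the point lands on the outer sphere.
    unused *= np.sqrt(R2**2 - np.linalg.norm(used)**2) / np.linalg.norm(unused)
    return x_adv
```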
Appendix B Details about Model-agnostic Invariance-based Attacks
Here, we give details about our model-agnostic invariance-based adversarial attacks on MNIST.
Generating $\ell_0$-invariant adversarial examples.
Assume we are given a training set consisting of labeled example pairs $(x_i, y_i)$. As input, our algorithm accepts an example $x$ with oracle label $y$. An example image $x$ with its label $y$ is given in Figure 4(a).
Define $\mathcal{D}_{\neq y} = \{(x_i, y_i) : y_i \neq y\}$, the set of training examples with a label different from $y$. Now we define $\mathcal{T}$ to be the set of transformations that we allow: rotations, horizontal or vertical shifts (out of the 28 pixels per side), shears, and re-sizing, each limited to a small range.
Now, we generate the augmented training set $\mathcal{D}^{\mathcal{T}}_{\neq y} = \{t(x_i) : (x_i, y_i) \in \mathcal{D}_{\neq y},\, t \in \mathcal{T}\}$. By assumption, each of these examples is labeled correctly by the oracle; in our experiments, we verify the validity of this assumption through a human study and omit any candidate adversarial example that violates it. Finally, we search for the transformed training example closest to the source,
$$\hat{x} = \arg\min_{x' \in \mathcal{D}^{\mathcal{T}}_{\neq y}} d(x, x'),$$
where $d$ is the distance of the attack (here, $\ell_0$).
By construction, we know that $x$ and $\hat{x}$ are similar in pixel space but have different oracle labels. Figure 4(b-c) shows this step of the process. Next, we introduce a number of refinements to make the adversarial example “more similar” to the original $x$. This reduces the distortion introduced to create an invariance-based adversarial example, compared to directly returning $\hat{x}$ as the adversarial example.
First, we define the mask $\Delta = \mathbb{1}[\,|x - \hat{x}| > \tau\,]$, where the absolute value and comparison operator are taken element-wise. Intuitively, $\Delta$ marks the pixels that substantially change between $x$ and $\hat{x}$; the threshold $\tau$ is an arbitrary value representing how much a pixel must change before we consider the change “important”. This step is shown in Figure 4(d). Along with the useful changes that are responsible for changing the oracle class label of $x$, $\Delta$ also contains superficial changes that do not contribute to changing the oracle class label. For example, in Figure 4(d), notice that the green cluster is the only semantically important change; the red and blue changes are not necessary.
To identify and remove the superficial changes, we perform spectral clustering on $\Delta$. We then compute candidate adversarial examples by enumerating all possible subsets of the resulting clusters of pixel regions and applying, for each subset, the corresponding changes from $\hat{x}$ to $x$. This gives us many potential adversarial examples $x^*$. Note that these are only potential adversarial examples because we may not have applied the changes that are actually necessary to change the oracle class label.
We show three of the eight possible candidates in Figure 4. To alleviate the need for human inspection of each candidate to determine which of these potential adversarial examples has actually changed the oracle label, we follow an approach from Defense-GAN (Samangouei et al., 2018) and the Robust Manifold Defense (Ilyas et al., 2017): we take the generator from a GAN and use it to assign a likelihood score to each candidate image. As one small refinement, we use a class-conditional GAN (Mirza & Osindero, 2014) and compute the class-conditional likelihood of the image occurring. This process reduces the resulting distortion on average.
As a small refinement, we find that initially filtering by leastcanonical examples makes the attack succeed more often.
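A minimal sketch of the clustering and candidate-enumeration step follows; the clustering backend, the number of clusters, the neighborhood size, and the threshold are illustrative placeholders, and the GAN-based scoring used to rank the candidates is omitted.

```python
import itertools
import numpy as np
from sklearn.cluster import SpectralClustering

def candidate_invariance_examples(x, x_hat, tau=0.3, n_clusters=3):
    """Enumerate candidates obtained by copying subsets of changed-pixel clusters from x_hat into x."""
    delta = np.abs(x - x_hat) > tau                       # pixels that change "substantially"
    coords = np.argwhere(delta)                           # (row, col) positions of changed pixels
    # Spectral clustering groups the changed pixels into spatial regions
    # (assumes there are enough changed pixels to cluster).
    labels = SpectralClustering(n_clusters=n_clusters, affinity="nearest_neighbors",
                                n_neighbors=5).fit_predict(coords)
    candidates = []
    for r in range(1, n_clusters + 1):
        for subset in itertools.combinations(range(n_clusters), r):
            mask = np.zeros_like(delta)
            for c in subset:
                rows, cols = coords[labels == c].T
                mask[rows, cols] = True
            candidates.append((subset, np.where(mask, x_hat, x)))   # apply only those clusters
    return candidates
```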
Generating $\ell_\infty$-invariant adversarial examples.
Our approach for generating $\ell_\infty$-invariant examples follows similar ideas as the $\ell_0$ case, but is conceptually simpler because the perturbation budget can be applied independently for each pixel (as we will see, this attack is however less effective than the $\ell_0$ one, so further optimizations may prove useful).
We build an augmented training set as in the $\ell_0$ case. Instead of looking for the closest nearest neighbor of an example $x$ with label $y$ over all other classes, we restrict the search to examples with specific target labels that we have empirically found to produce more convincing invariance adversarial examples (i.e., certain source digits are only matched with selected target digits). We then simply apply an $\ell_\infty$-bounded perturbation (with budget $\epsilon$) to $x$ by interpolating it with $\hat{x}$, so as to minimize the distance between the result and the chosen target example $\hat{x}$.
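A minimal sketch of this interpolation step follows (the default budget below is a placeholder, not necessarily the value used in our experiments):

```python
import numpy as np

def linf_invariance_example(x, x_hat, eps=0.3):
    """Move x toward the transformed target x_hat, by at most eps per pixel."""
    # The result is the point inside the eps-ball around x that is closest to x_hat.
    return x + np.clip(x_hat - x, -eps, eps)
```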
Appendix C Invariance-based Adversarial Examples for Binarized MNIST

Appendix D Complete Set of 100 Invariance Adversarial Examples
Below we give the randomly selected test images along with the invariance adversarial examples that were shown during the human study.