Exploiting Excessive Invariance caused by Norm-Bounded Adversarial Robustness

by Jörn-Henrik Jacobsen, et al.

Adversarial examples are malicious inputs crafted to cause a model to misclassify them. Their most common instantiation, "perturbation-based" adversarial examples, introduces changes to the input that leave its true label unchanged yet result in a different model prediction. Conversely, "invariance-based" adversarial examples insert changes to the input that leave the model's prediction unaffected despite the underlying input's label having changed. In this paper, we demonstrate that robustness to perturbation-based adversarial examples is not only insufficient for general robustness, but worse, it can also increase vulnerability of the model to invariance-based adversarial examples. In addition to analytical constructions, we empirically study vision classifiers with state-of-the-art robustness to perturbation-based adversaries constrained by an ℓ_p norm. We mount attacks that exploit excessive model invariance in directions relevant to the task, which are able to find adversarial examples within the ℓ_p ball. In fact, we find that classifiers trained to be ℓ_p-norm robust are more vulnerable to invariance-based adversarial examples than their undefended counterparts. Excessive invariance is not limited to models trained to be robust to perturbation-based ℓ_p-norm adversaries. Indeed, we argue that the term adversarial example is used to capture a series of model limitations, some of which may not have been discovered yet. Accordingly, we call for a set of precise definitions that taxonomize and address each of these shortcomings in learning.




1 Introduction

Research on adversarial examples is motivated by a spectrum of questions. These range from the security of models deployed in the presence of real-world adversaries to the need to capture limitations of representations and their (in)ability to generalize (Gilmer et al., 2018a). The broadest accepted definition of an adversarial example is “an input to a ML model that is intentionally designed by an attacker to fool the model into producing an incorrect output” (Goodfellow & Papernot, 2017).

To enable concrete progress, many definitions of adversarial examples have been introduced in the literature since their initial discovery (Szegedy et al., 2013; Biggio et al., 2013). In most of this work, adversarial examples are formalized as adding a perturbation to some test example to obtain an input that produces an incorrect model outcome. (Here, an incorrect output refers either to the model returning any class different from the original source class of the input, or to a specific target class chosen by the adversary prior to searching for a perturbation.) We refer to this entire class of malicious inputs as perturbation-based adversarial examples. The adversary’s capabilities may optionally be constrained by placing a bound on the maximum perturbation added to the original input (e.g., using an ℓ_p norm).

Achieving robustness to perturbation-based adversarial examples, in particular when they are constrained using ℓ_p norms, is often cast as a problem of learning a model that is uniformly continuous: the defender wishes to prove that for all δ > 0 and for some ε > 0, all pairs of points x, x' with ‖x − x'‖ ≤ ε satisfy ‖G(x) − G(x')‖ ≤ δ (where G denotes the classifier’s logits). Different papers take different approaches to achieving this result, ranging from robust optimization (Madry et al., 2017) to training models to have small Lipschitz constants (Cisse et al., 2017) to models which are provably robust to small perturbations (Wong & Kolter, 2018; Raghunathan et al., 2018).

Figure 1: [Left]: When training a classifier without constraints, we may end up with a decision boundary that is not robust to perturbation-based adversarial examples. [Right]: However, enforcing robustness to norm-bounded perturbations introduces erroneous invariance (dashed regions in epsilon spheres). This excessive invariance of the perturbation-robust model in task-relevant directions may be exploited, as shown by the attack proposed in this paper.

In this paper we present analytical results that show how optimizing for uniform continuity is not only insufficient to address the lack of generalization identified through adversarial examples, but also potentially harmful. Our intuition, captured in Figure 1, relies on the inability of ℓ_p-norms (or of any other distance metric that does not perfectly capture semantics) to capture the geometry of ideal decision boundaries. This leads us to present analytical constructions and empirical evidence that robustness to perturbation-based adversaries can increase the vulnerability of models to other types of adversarial examples.

Our argument relies on the existence of invariance-based adversarial examples (Jacobsen et al., 2019). Rather than perturbing the input to change the classifier’s output, they modify input semantics while keeping the decision of the classifier identical. In other words, the vulnerability exploited by invariance-based adversarial examples is a lack of sensitivity in directions relevant to the task: the model’s consistent prediction does not reflect the change in the input’s true label.

Our analytical work exposes a complex relationship between perturbation-based and invariance-based adversarial examples. We construct a model that is robust to perturbation-based adversarial examples but not to invariance-based adversarial examples. We then demonstrate how an imperfect model for the adversarial spheres task proposed by Gilmer et al. (2018b) is either vulnerable to perturbation-based or invariance-based attacks—depending on whether the point attacked is on the inner or outer sphere. Hence, at least these two types of adversarial examples are needed to fully account for model failures (more vulnerabilities may be discovered at a later point).

To demonstrate the practicality of our argument, we then consider vision models with state-of-the-art robustness to ℓ_p-norm adversaries. We introduce an algorithmic approach for finding invariance-based adversarial examples. Our attacks are model-agnostic and generate ℓ_0 and ℓ_∞ invariance adversarial examples, succeeding at changing the underlying classification (as determined by a human study) in 55% and 21% of cases, respectively (Figure 5). When ℓ_p-robust models classify the successful attacks, they agree with the human label on at most 58% (respectively, 5%) of them (Table 1).

Perhaps one of the most interesting aspects of our work is to show that different classes of current classifiers’ limitations fall under the same umbrella term of adversarial examples. Despite this common terminology, each of these limitations may stem from different shortcomings of learning that have non-trivial relationships. To be clear, developing ℓ_p-norm perturbation-robust classifiers is a useful benchmark task. However, as our paper demonstrates, perturbations are not the only way classifiers may make mistakes, even within the ℓ_p ball. Hence, we argue that the community will benefit from working with a series of definitions that precisely taxonomize adversarial examples.

2 Defining Perturbation-based and Invariance-based Adversarial Examples

In order to make precise statements about adversarial examples, we begin with two definitions.

Definition 1 (Perturbation-based Adversarial Examples).

Let G denote the i-th layer, logit or argmax of the classifier. A perturbation-based adversarial example (or perturbation adversarial) x* corresponding to a legitimate test input x fulfills:

  1. Created by adversary: x* is created by an algorithm A with x ↦ x*.

  2. Perturbation of output: ‖G(x*) − G(x)‖ > δ and O(x*) = O(x), where the perturbation δ > 0 is set by the adversary and O denotes the oracle.

Furthermore, x* is ε-bounded if ‖x − x*‖ < ε, where ‖·‖ is a norm on the input space and ε > 0.

Property 1 allows us to distinguish perturbation adversarial examples from points that are misclassified by the model without adversarial intervention. Furthermore, the above definition also covers adversarial perturbations designed for hidden features as in Sabour et al. (2016), although usually the decision of the classifier (argmax operation on logits) is used as the perturbation target. Our definition also identifies ε-bounded perturbation-based adversarial examples (Goodfellow et al., 2015) as a specific case of unbounded perturbation-based adversarial examples. However, our analysis primarily considers the latter, which correspond to the threat model of a stronger adversary.

Definition 2 (Invariance-based Adversarial Examples).

Let G denote the i-th layer, logit or argmax of the classifier. An invariance-based adversarial example (or invariance adversarial) x* corresponding to a legitimate test input x fulfills:

  1. Created by adversary: x* is created by an algorithm A with x ↦ x*.

  2. Lies in pre-image of x under G: G(x*) = G(x) and O(x) ≠ O(x*), where O denotes the oracle.

As a consequence, D(x) = D(x*) also holds for invariance-based adversarial examples, where D is the output of the classifier. Intuitively, adversarial perturbations cause the output of the classifier to change, while the oracle would still label the new input in the original source class. Whereas perturbation-based adversarial examples exploit the classifier’s excessive sensitivity in task-irrelevant directions, invariance-based adversarial examples explore the classifier’s pre-image to identify excessive invariance in task-relevant directions: its prediction is unchanged while the oracle’s output differs. Briefly put, perturbation-based and invariance-based adversarial examples are complementary failure modes of the learned classifier.
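The two definitions can be read as a pair of predicates over a model and an oracle. The following is a minimal sketch, not the authors' code: the function names, the toy model `G`, and the toy `oracle` are all hypothetical and chosen only so that each failure mode has a small concrete instance.

```python
import numpy as np

def is_perturbation_adversarial(G, oracle, x, x_adv, delta):
    """Definition 1: model output moves by more than delta, oracle label is unchanged."""
    output_changed = np.linalg.norm(G(x_adv) - G(x)) > delta
    return bool(output_changed and oracle(x_adv) == oracle(x))

def is_invariance_adversarial(G, oracle, x, x_adv):
    """Definition 2: model output is identical (pre-image), oracle label has changed."""
    same_output = np.array_equal(G(x_adv), G(x))
    return bool(same_output and oracle(x_adv) != oracle(x))

# Hypothetical toy setting: the oracle uses both coordinates,
# while the model only looks at the first one.
oracle = lambda x: int(x[0] + x[1] > 0)
G = lambda x: np.array([float(x[0] > 0)])

x = np.array([1.0, 1.0])
x_inv = np.array([1.0, -3.0])    # oracle label flips, model output stays the same
x_pert = np.array([-0.5, 2.0])   # oracle label stays the same, model output flips
```

In this toy setting `x_inv` satisfies only the invariance predicate and `x_pert` only the perturbation predicate, which mirrors the complementary roles the two definitions play.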

3 Robustness to Perturbation-based Adversarial Examples Can Cause Invariance-based Vulnerabilities

We now investigate the relationship between the two adversarial example definitions from Section 2. So far, it has been unclear whether solving perturbation-based adversarial examples implies solving invariance-based adversarial examples, and vice versa. In the following, we show that this relationship is intricate and developing models robust in one of the two settings only would be insufficient.

In a general setting, invariance and stability can be uncoupled. To see this, consider a linear classifier with matrix A. The perturbation-robustness is tightly related to forward stability (largest singular value of A). On the other hand, the invariance view relates to the stability of the inverse (smallest singular value of A) and to the null-space of A. As largest and smallest singular values are uncoupled for general matrices A, the relationship between both viewpoints is likely non-trivial in practice.
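As a small numerical illustration of this decoupling, the sketch below builds a hypothetical 2x3 matrix (the values are made up for illustration) and reads off its largest singular value, smallest singular value, and null-space direction with NumPy's SVD.

```python
import numpy as np

# Hypothetical linear classifier: large gain along one input direction,
# tiny gain along a second, and a one-dimensional null space.
A = np.array([[10.0, 0.0, 0.0],
              [0.0, 0.01, 0.0]])

U, s, Vt = np.linalg.svd(A)        # full SVD, so Vt is 3x3
sigma_max, sigma_min = s[0], s[-1]
null_dir = Vt[-1]                  # right singular vector with no associated singular value

x = np.array([1.0, 1.0, 1.0])
eps = 0.1

# A small step along the top singular direction is amplified by sigma_max.
out_sensitive = np.linalg.norm(A @ (x + eps * Vt[0]) - A @ x)

# An arbitrarily large step along the null space leaves the output unchanged.
out_invariant = np.linalg.norm(A @ (x + 100.0 * null_dir) - A @ x)
```

Shrinking the first entry of `A` would reduce the sensitivity measured by `out_sensitive` without affecting the null space at all, which is the sense in which the two properties can be tuned independently.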

3.1 Building our Intuition with Extreme Uniform Continuity

In the extreme, a classifier achieving perfect uniform continuity would be a constant classifier. Let D denote a classifier with D(x) = y for all x. As the classifier maps all inputs to the same output y, there exists no x* such that D(x) ≠ D(x*). Thus, the model is trivially perturbation-robust (at the expense of decreased utility). On the other hand, the pre-image of y under D is the entire input space, thus D is arbitrarily vulnerable to invariance-based adversarial examples. Because this toy model is a constant function over the input domain, no perturbation of an initially correctly classified input can change its prediction.

This trivial model illustrates how one needs to control not only sensitivity but also invariance, alongside accuracy, to obtain a robust model. Hence, we argue that the often-discussed tradeoff between accuracy and robustness (see Tsipras et al. (2019) for a recent treatment) should in fact take into account at least three notions: accuracy, sensitivity, and invariance. This is depicted in Figure 1. In the following, we present arguments for why this insight can also extend to almost perfect classifiers.

Figure 2: Robustness experiment on spheres with radii and and max-margin classifier that does not see dimensions of the dimensional input. [Left]: Attacking points from the outer sphere with perturbation-based attacks, with accuracy dropping when increasing the upper bound on -norm perturbations. [Right]: Attacking points from the inner sphere with invariance-based attacks, with accuracy dropping when increasing the upper bound on -norm perturbations. Each attack has a different effect on the manifold. Red arrows indicate the only possible direction of attack for each sphere. Perturbation attacks fail on the inner sphere, while invariance attacks fail on the outer sphere. Hence, both attacks are needed for a full account of model failures.

3.2 Comparing Invariance-based and Perturbation-based Robustness

We now show how the analysis of perturbation-based and invariance-based adversarial examples can uncover different model failures. To do so, we consider the synthetic adversarial spheres problem of Gilmer et al. (2018b). The goal of this synthetic task is to distinguish points from two concentric spheres (class 1: ‖x‖_2 = R_1 and class 2: ‖x‖_2 = R_2) with different radii R_1 and R_2. The dataset was designed such that a robust (max-margin) classifier can be formulated as:

    D*(x) = sign(‖x‖_2 − (R_1 + R_2)/2)

Our analysis considers a similar, but slightly sub-optimal classifier in order to study model failures in a controlled setting:

    D(x) = sign(‖x_{1,…,k}‖_2 − b)

which computes the norm of x from its first k Cartesian coordinates and outputs -1 (resp. +1) for the inner (resp. outer) sphere. The bias b is chosen based on a finite training set (see Appendix A).
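The two classifiers can be sketched in a few lines of NumPy. This is an illustrative sketch only: the radii, input dimension, number of used coordinates, and sample size below are assumed values that are not taken from the text, and the helper names (`sample_sphere`, `d_star`, `d_partial`, `fit_bias`) are hypothetical. The bias is set to the midpoint between the two classes in the used coordinates, which is one way to realize a max-margin threshold on a separable finite sample.

```python
import numpy as np

def sample_sphere(num, dim, radius, rng):
    """Draw `num` points uniformly from the sphere of the given radius."""
    z = rng.standard_normal((num, dim))
    return radius * z / np.linalg.norm(z, axis=1, keepdims=True)

def d_star(x, r_inner, r_outer):
    """Max-margin classifier on the full norm: -1 for inner, +1 for outer."""
    return np.sign(np.linalg.norm(x, axis=-1) - (r_inner + r_outer) / 2.0)

def d_partial(x, k, b):
    """Sub-optimal classifier: thresholds the norm of the first k coordinates only."""
    return np.sign(np.linalg.norm(x[..., :k], axis=-1) - b)

def fit_bias(inner, outer, k):
    """Midpoint between the largest inner and the smallest outer partial norm."""
    low = np.linalg.norm(inner[:, :k], axis=1).max()
    high = np.linalg.norm(outer[:, :k], axis=1).min()
    if not low < high:
        raise ValueError("training set is not separable in the first k coordinates")
    return low + (high - low) / 2.0

# Illustrative values only; none of these numbers come from the text.
R_INNER, R_OUTER, DIM, K = 1.0, 1.3, 500, 400
rng = np.random.default_rng(0)
inner = sample_sphere(200, DIM, R_INNER, rng)
outer = sample_sphere(200, DIM, R_OUTER, rng)
b = fit_bias(inner, outer, K)

acc_inner = np.mean(d_partial(inner, K, b) == -1)
acc_outer = np.mean(d_partial(outer, K, b) == 1)
```

On the sample used to fit the bias, the partial classifier separates the two classes perfectly by construction; its weakness only shows up once points are moved along the coordinates it ignores.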

Even though this sub-optimal classifier reaches nearly 100% accuracy on finite test data, the model is imperfect in the presence of adversaries that operate on the manifold (i.e., produce adversarial examples that remain on one of the two spheres but are misclassified). Most interestingly, the perturbation-based and invariance-based approaches uncover different failures (see Appendix A for details on the attacks):

  • Perturbation-based: All points x from the outer sphere (i.e., ‖x‖_2 = R_2) can be perturbed to x*, where D(x*) ≠ D(x), while staying on the outer sphere (i.e., ‖x*‖_2 = R_2).

  • Invariance-based: All points x from the inner sphere (‖x‖_2 = R_1) can be perturbed to x*, where D(x*) = D(x), despite being in fact on the outer sphere after the perturbation has been added (i.e., ‖x*‖_2 = R_2).

In Figure 2, we plot the mean accuracy over points sampled either from the inner or outer sphere, as a function of the norm of the adversarial manipulation added to create perturbation-based and invariance-based adversarial examples. This illustrates how the robustness regime differs significantly between the two variants of adversarial examples. Therefore, by looking only at perturbation-based (respectively invariance-based) adversarial examples, important model failures may be overlooked. This is exacerbated when the data is sampled in an unbalanced fashion from the two spheres: the inner sphere is robust to perturbation adversarial examples while the outer sphere is robust to invariance adversarial examples (for accurate models).

4 Invariance-based Attacks in Practice

We now show that our argument is not limited to the analysis of synthetic tasks, and give practical automated attack algorithms to generate invariance adversarial examples. We elect to study the only dataset for which robustness is considered to be nearly solved under the ℓ_p norm threat model: MNIST (Schott et al., 2019). We show that MNIST models trained to be robust to perturbation-based adversarial examples are less robust to invariance-based adversarial examples. As a result, we show that while perturbation adversarial examples may not exist within the ℓ_p ball around test examples, adversarial examples still do exist within the ℓ_p ball around test examples.


The MNIST dataset is typically a poor choice of dataset for studying adversarial examples, and in particular defenses that are designed to mitigate them (Carlini et al., 2019). In large part this is due to the fact that MNIST is significantly different from other vision classification problems (e.g., features are quasi-binary and classes are well separated in most cases). However, the simplicity of MNIST is why studying ℓ_p-norm adversarial examples was originally proposed as a toy task to benchmark models (Goodfellow et al., 2015). The task has turned out to be much more difficult than originally expected. Several years later, it is now argued that training MNIST classifiers whose decision is constant in an ℓ_p-norm ball around their training data provides robustness to adversarial examples (Schott et al., 2019; Madry et al., 2017; Wong & Kolter, 2018; Raghunathan et al., 2018).

Furthermore, if defenses relying on the ℓ_p-norm threat model are going to perform well on a vision task, MNIST is likely the best dataset to measure that, due to the specificities mentioned above. In fact, MNIST is the only dataset for which robustness to adversarial examples is considered even remotely close to being solved (Schott et al., 2019), and researchers working on (provable) robustness to adversarial examples have moved on to other, larger vision datasets such as CIFAR-10 (Madry et al., 2017; Wong et al., 2018) or ImageNet (Lecuyer et al., 2018; Cohen et al., 2019).

This section argues that, contrary to popular belief, MNIST is far from being solved. We show why robustness to ℓ_p-norm perturbation-based adversaries is insufficient, even on MNIST, and why defenses with unreasonably high uniform continuity can harm the performance of the classifier and make it more vulnerable to other attacks exploiting this excessive invariance.

4.1 A toy worst-case: binarized MNIST classifier

Figure 3: Invariance-based adversarial example (top-left) is labeled differently by a human than original (bottom-left). However, both become identical after binarization.

To give an initial constructive example, consider an MNIST classifier which binarizes (by thresholding at, e.g., 0.5) all of its inputs before classifying them with a neural network. As Tramèr et al. (2018) and Schott et al. (2019) demonstrate, this binarizing classifier is highly ℓ_∞-robust, because most perturbations in the pixel space do not actually change the (thresholded) feature representation.

However, this binary classifier will have trivial invariance-based adversarial examples. Figure 3 shows an example of this attack. Two images which are dramatically different to a human (e.g., a digit of a one and a digit of a four) can become identical after pre-processing the images with a thresholding function at 0.5 (as examined by, e.g., Schott et al. (2019)).
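The collapse caused by thresholding is easy to reproduce on a toy input. The sketch below uses made-up one-dimensional "images" (not MNIST data) and a hypothetical `binarize` helper; any model that only sees the binarized input must assign both versions the same prediction.

```python
import numpy as np

def binarize(img, threshold=0.5):
    """Map every pixel to 0 or 1 by thresholding."""
    return (img > threshold).astype(np.float32)

# Hypothetical pixel rows: the second adds visible gray content that never
# crosses the threshold and dims the bright stroke without dropping below it.
original = np.array([0.0, 0.9, 0.9, 0.1, 0.0])
modified = np.array([0.4, 0.6, 0.6, 0.45, 0.4])

linf_distance = np.max(np.abs(original - modified))
identical_after_binarization = np.array_equal(binarize(original), binarize(modified))
```

Here the two inputs differ by 0.4 in the ℓ_∞ norm, a change that is clearly visible, yet they are indistinguishable to anything downstream of `binarize`.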

4.2 Generating Model-agnostic Invariance-based Adversarial Examples

In the following, we build on existing invariance-based attacks (Jacobsen et al., 2019; Behrmann et al., 2018; Li et al., 2019) to propose a model-agnostic algorithm for crafting invariance-based adversarial examples. That is, our attack algorithm generates invariance adversarial examples that cause a human to change their classification, but where most models, not known by the attack algorithm, will not change their classification. Our algorithm for generating invariance-based adversarial examples is simple, albeit tailored to work specifically on datasets where comparing images in pixel space is meaningful, like MNIST.

Begin with a source image, correctly classified by both the oracle evaluator (i.e., a human) and the model. Next, try all possible affine transformations of training data points whose label is different from the source image, and find the target training example which, once transformed, has the smallest distance to the source image. Finally, construct an invariance-based adversarial example by perturbing the source image to be “more similar” to the target image under the metric considered. In Appendix B, we describe instantiations of this algorithm for the ℓ_0 and ℓ_∞ norms. Figure 4 visualizes the sub-steps for the ℓ_0 attack, which are described in detail in Appendix B.

The underlying assumption of this attack is that small affine transformations are less likely to cause an oracle classifier to change its label of the underlying digit than perturbations. In practice, we validate this hypothesis with a human study in Section 4.3.

Figure 4: Process for generating invariant adversarial examples. From left to right: (a) the original image of an 8; (b) the nearest training image (labeled as 3), before alignment; (c) the nearest training image (still labeled as 3), after alignment; (d) the perturbation Δ between the original and aligned training example; (e) spectral clustering of the perturbation Δ; and (f-h) possible invariance adversarial examples, selected by applying subsets of clusters of Δ to the original image. (f) is a failed attempt at an invariance adversarial example. (g) is successful, but introduces a larger perturbation than necessary (adding pixels to the bottom of the 3). (h) is successful and minimally perturbed.

4.3 Evaluation

Attack analysis.

We generate 1000 adversarial examples using each of the two above approaches on examples randomly drawn from the MNIST test set. Our ℓ_0 attack is quite slow, with the alignment process taking (amortized) several minutes per example. We performed no optimizations of this process and expect it could be improved. The mean ℓ_0 distortion required is 25.9 (with a median of 25). The ℓ_∞ adversarial examples always use the full perturbation budget and take a similar amount of time to generate; most of the cost is again dominated by finding the nearest test image.

Human Study.

We randomly selected 100 examples from the MNIST test set and created 100 invariance-based adversarial examples under the ℓ_0 norm and the ℓ_∞ norm, as described above. We then conducted a human study to evaluate whether or not these invariance adversarial examples are indeed successful, i.e., whether humans agree that the label has been changed despite the model’s prediction remaining the same. We presented 40 human evaluators with these images, half of which were natural unmodified MNIST digits, while the remaining half were distributed randomly between ℓ_0 and ℓ_∞ invariance adversarial examples.

Attack Type     Success Rate
Clean Images    0%
ℓ_0 Attack      55%
ℓ_∞ Attack      21%

(a) Success rate of our invariance adversarial examples in causing humans to switch their classification.
(b) Original test images (top) with our ℓ_0 (middle) and ℓ_∞ (bottom) invariance adversarial examples. (left) successful attacks; (right) failed attacks.
Figure 5: Our invariance-based adversarial examples. Humans (acting as the oracle) switch their classification of the image from the original test label to a different label.


For the clean (unmodified) test images, 98 of the 100 examples were labeled correctly by all human evaluators. The other 2 images were labeled correctly by over of human evaluators.

Our ℓ_0 attack is highly effective: For 48 of the 100 examples at least of human evaluators who saw that digit assigned it the same label, different from the original test label. Humans only agreed with the original test label (with the same threshold) on 34 of the images, while they did not form a consensus on the remaining 18 examples. The (much simpler) ℓ_∞ attack is less effective: Humans only agreed that the image changed label on 14 of the examples, and agreed the label had not changed in 74 cases. We summarize results in Figure 5 (a).

In Figure 5 (b) we show sample invariance adversarial examples. To simplify the analysis in the following section, we split our generated invariance adversarial examples into two sets: the successes and the failures, as determined by whether the plurality decision by humans was different than or equal to the human label. We only evaluate the models on the subset of invariance adversarial examples that caused the humans to switch their classification.
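The bookkeeping behind this split can be sketched as a small vote-aggregation helper. This is an assumed formulation rather than the authors' evaluation code: the function name `human_verdict` is hypothetical, and the consensus `threshold` is left as a parameter because its value is not given here.

```python
from collections import Counter

def human_verdict(votes, original_label, threshold):
    """Classify one image from the labels assigned by its human evaluators.

    Returns "label changed" if enough evaluators agree on a label that differs
    from the original test label, "label kept" if enough agree on the original
    label, and "no consensus" otherwise.
    """
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) < threshold:
        return "no consensus"
    return "label changed" if label != original_label else "label kept"

# Hypothetical votes for three images whose original test label is 8.
verdicts = [
    human_verdict([3, 3, 3, 8], original_label=8, threshold=0.7),
    human_verdict([8, 8, 8, 8], original_label=8, threshold=0.7),
    human_verdict([3, 3, 8, 8], original_label=8, threshold=0.7),
]
```

Only images with the verdict "label changed" would count as successful invariance adversarial examples and be passed on to the model evaluation.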

Model Evaluation.

Now that we have oracle ground-truth labels for each of the images as decided by the humans, we report how often our models agree with the human-assigned label. Table 1 summarizes the results of this analysis. For the invariance adversarial examples we report model accuracy only on the successful attacks, that is, those where the human oracle label changed between the original image and the modified image.

Every classifier labeled all successful ℓ_∞ adversarial examples incorrectly (with one exception, where the PGD-trained classifier (Madry et al., 2017) labeled one of the invariance adversarial examples correctly). Despite this fact, PGD adversarial training and Analysis by Synthesis (Schott et al., 2019) are two of the state-of-the-art perturbation-robust classifiers.

The situation is more complex for the ℓ_0-invariance adversarial examples. In this setting, the models which achieve higher perturbation-robustness result in lower accuracy on this new invariance test set. For example, Bafna et al. (2018) develop an ℓ_0 perturbation-robust classifier that relies on the sparse Fourier transform. This perturbation-robust classifier is substantially weaker against invariance adversarial examples, getting only 38% accuracy compared to a baseline classifier’s 54% accuracy.

Fraction of examples where human and model agree

Model:    Baseline   ABS   Binary-ABS   PGD   PGD   Sparse
Clean     99%        99%   99%          99%   99%   99%
ℓ_0       54%        58%   47%          56%   27%   38%
ℓ_∞       0%         0%    0%           0%    5%    0%

Table 1: Models which are more robust to perturbation adversarial examples (such as those trained with adversarial training) agree with humans less often on invariance-based adversarial examples. Agreement between human oracle labels and labels by five models on clean (unmodified) examples and our successful ℓ_0- and ℓ_∞-generated invariance adversarial examples. Values denoted with an asterisk violate the perturbation threat model of the defense and should not be taken to be attacks. When the model is wrong, it classified the input as the original label, and not the new oracle label.

4.4 Natural Images

While the previous discussion focused on synthetic (Adversarial Spheres) and simple tasks like MNIST, similar phenomena may arise in natural images. In Figure 6, we show two different perturbations of the original image (left). The perturbation of the middle image is nearly imperceptible and thus the classifier's decision should be robust to such changes. On the other hand, the image on the right went through a semantic change (from a tennis ball to a strawberry) and thus the classifier should be sensitive to such changes (even though this case is ambiguous due to two objects in the image). However, in terms of the norm, the change in the right image is even smaller than the imperceptible change in the middle. Hence, making the classifier robust within this norm-ball will make the classifier vulnerable to invariance-based adversarial examples like the semantic changes in the right image.

Figure 6: Visualization that large norms can also fail to measure semantic changes in images.     (a) original image in the ImageNet test set labeled as a tennis ball; (b) imperceptible perturbation, ; (c) semantic perturbation with a perturbation of that removes the tennis ball.

5 Conclusion

Training models robust to perturbation-based adversarial examples should not be treated as equivalent to learning models robust to all adversarial examples. While most of the research has focused on perturbation-based adversarial examples that exploit excessive classifier sensitivity, we show that the reverse viewpoint of excessive classifier invariance should also be taken into account when evaluating robustness. Furthermore, other unknown types of adversarial examples may exist: it remains unclear whether or not the union of perturbation and invariance adversarial examples completely captures the full space of evasion attacks.

Consequences for ℓ_p-norm evaluation.

Our invariance-based attacks are able to find adversarial examples within the ℓ_p ball on classifiers that were trained to be robust to ℓ_p-norm perturbation-based adversaries. As a consequence of this analysis, researchers should carefully set the radii of ℓ_p-balls when measuring robustness to norm-bounded perturbation-based adversarial examples. Furthermore, setting a consistent radius across all of the data may be difficult: we find in our experiments that some class pairs are more easily attacked than others by invariance-based adversaries.

Some recent defense proposals, which claim extremely high ℓ_0 and ℓ_∞ norm-bounded robustness, are likely over-fitting to peculiarities of MNIST to deliver higher robustness to perturbation-based adversaries. This may not actually deliver classifiers that match the human oracle more often. Indeed, another by-product of our study is to showcase the importance of human studies when the true label of candidate adversarial inputs becomes ambiguous and cannot be inferred algorithmically.


Our work confirms findings reported recently in that it surfaces the need for mitigating undesired invariance in classifiers. The cross-entropy loss as well as architectural elements such as ReLU activation functions have been put forward as possible sources of excessive invariance (Jacobsen et al., 2019; Behrmann et al., 2018). However, more work is needed to develop quantitative metrics for invariance-based robustness. One promising architecture class for controlling invariance-based robustness is invertible networks (Dinh et al., 2014) because, by construction, they cannot build up any invariance until the final layer (Jacobsen et al., 2018; Behrmann et al., 2019).


Appendix A Details about Adversarial Spheres Experiment

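In this section, we provide details about the Adversarial Spheres (Gilmer et al., 2018b) experiment. First, the bias b is chosen such that the classifier D is the max-margin classifier on the (finite) training set X (assuming separability: l < u):

    b = l + (u − l)/2,

where l is the maximum of ‖x_{1,…,k}‖_2 over training points x on the inner sphere (‖x‖_2 = R_1), u is the minimum of ‖x_{1,…,k}‖_2 over training points x on the outer sphere (‖x‖_2 = R_2), and x_{1,…,k} denotes the first k coordinates of x.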

Second, the attacks are designed such that the adversarial examples stay on the data manifold (two concentric spheres). In particular, the following steps are taken, where n denotes the input dimension and k the number of coordinates used by the classifier:

Perturbation-based: All points x from the outer sphere (i.e., ‖x‖_2 = R_2) can be perturbed to x*, where D(x*) ≠ D(x), while staying on the outer sphere (i.e., ‖x*‖_2 = R_2) via the following steps:

  1. Perturbation of decision: x*_{1,…,k} = a · x_{1,…,k}, where the scaling a > 0 is chosen such that ‖x*_{1,…,k}‖_2 < b.

  2. Projection to outer sphere: x*_{k+1,…,n} = c · x_{k+1,…,n}, where the scaling c > 0 is chosen such that ‖x*‖_2 = R_2.

For points from the inner sphere, this is not possible if b > R_1.

Invariance-based: All points x from the inner sphere (‖x‖_2 = R_1) can be perturbed to x*, where D(x*) = D(x), despite being in fact on the outer sphere after the perturbation has been added (i.e., ‖x*‖_2 = R_2) via the following steps:

  1. Fixing the used dimensions: x*_{1,…,k} = x_{1,…,k}.

  2. Perturbation of unused dimensions: x*_{k+1,…,n} = a · x_{k+1,…,n}, where the scaling a > 0 is chosen such that ‖x*‖_2 = R_2.

For points from the outer sphere, this is not possible if b > R_1.
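The two on-manifold attacks can be sketched as follows. This is an illustrative reading of the steps above, not the authors' implementation: the radii, dimensions, fixed bias `B`, and the `shrink` factor are assumed values, and the helper names are hypothetical. The sketch assumes the bias is smaller than the outer radius and that the unused coordinates of the input are not all zero.

```python
import numpy as np

def sample_sphere(dim, radius, rng):
    """Draw one point uniformly from the sphere of the given radius."""
    z = rng.standard_normal(dim)
    return radius * z / np.linalg.norm(z)

def classify(x, k, b):
    """Sub-optimal classifier: -1 (inner) or +1 (outer) from the first k coordinates."""
    return np.sign(np.linalg.norm(x[:k]) - b)

def perturbation_attack(x, k, b, r_outer, shrink=0.99):
    """Flip the decision of an outer-sphere point while keeping it on the outer sphere."""
    x_adv = x.copy()
    # Step 1: scale the used coordinates so their norm drops just below the bias.
    a = shrink * b / np.linalg.norm(x[:k])
    x_adv[:k] = a * x[:k]
    # Step 2: rescale the unused coordinates so the full norm is r_outer again.
    remaining = r_outer ** 2 - np.sum(x_adv[:k] ** 2)
    c = np.sqrt(remaining) / np.linalg.norm(x[k:])
    x_adv[k:] = c * x[k:]
    return x_adv

def invariance_attack(x, k, r_outer):
    """Move an inner-sphere point onto the outer sphere without changing the decision."""
    x_adv = x.copy()
    # Step 1: the used coordinates stay fixed. Step 2: inflate the unused ones.
    remaining = r_outer ** 2 - np.sum(x[:k] ** 2)
    a = np.sqrt(remaining) / np.linalg.norm(x[k:])
    x_adv[k:] = a * x[k:]
    return x_adv

# Illustrative values only; none of these numbers come from the text.
R_INNER, R_OUTER, DIM, K, B = 1.0, 1.3, 500, 400, 1.0
rng = np.random.default_rng(0)
x_outer = sample_sphere(DIM, R_OUTER, rng)
x_inner = sample_sphere(DIM, R_INNER, rng)

x_pert = perturbation_attack(x_outer, K, B, R_OUTER)
x_inv = invariance_attack(x_inner, K, R_OUTER)
```

After the attacks, `x_pert` still lies on the outer sphere but is classified as inner, while `x_inv` lies on the outer sphere and keeps the inner-sphere prediction of `x_inner`.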

Appendix B Details about Model-agnostic Invariance-based Attacks

Here, we give details about our model-agnostic invariance-based adversarial attacks on MNIST.

Generating ℓ_0-invariant adversarial examples.

Assume we are given a training set X consisting of labeled example pairs (x̂, ŷ). As input our algorithm accepts an example x with oracle label O(x) = y. Image x with label y = 8 is given in Figure 4 (a).

Define S = {x̂ : (x̂, ŷ) ∈ X, ŷ ≠ y}, the set of training examples with a different label. Now we define T to be the set of transformations that we allow: rotations by up to degrees, horizontal or vertical shifts by up to pixels (out of 28), shears by up to , and re-sizing by up to .

Now, we generate the new augmented training set . By assumption, each of these examples is labeled correctly by the oracle. In our experiments, we verify the validity of this assumption through a human study and omit any candidate adversarial example that violates this assumption. Finally, we search for

By construction, we know that and are similar in pixel space but have a different label. Figure 4 (b-c) show this step of the process. Next, we introduce a number of refinements to make be “more similar” to . This reduces the distortion introduced to create an invariance-based adversarial example—compared to directly returning as the adversarial example.
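The search step can be sketched as a brute-force nearest-neighbor lookup. This is a simplified illustration under stated assumptions: images are small 2D lists, integer shifts stand in for the full set of allowed transformations (rotations, shears, and re-sizing are omitted), and distance is the number of differing pixels. All function names are hypothetical.

```python
def shift(image, dx, dy, fill=0.0):
    """Shift a 2D image (list of rows) by dx columns and dy rows."""
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            rr, cc = r + dy, c + dx
            if 0 <= rr < h and 0 <= cc < w:
                out[rr][cc] = image[r][c]
    return out

def l0_distance(a, b, tol=1e-6):
    """Number of pixels at which two images differ."""
    return sum(abs(pa - pb) > tol
               for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def nearest_other_class(x, y, train, max_shift=1):
    """Return (distance, image, label) of the closest shifted training
    example whose label differs from y, or None if there is none."""
    best = None
    for img, label in train:
        if label == y:
            continue  # only examples with a different label are candidates
        for dx in range(-max_shift, max_shift + 1):
            for dy in range(-max_shift, max_shift + 1):
                cand = shift(img, dx, dy)
                d = l0_distance(x, cand)
                if best is None or d < best[0]:
                    best = (d, cand, label)
    return best

x = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
train = [
    (x, 1),
    ([[1.0, 1.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]], 7),
    ([[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]], 0),
]
dist, neighbor, neighbor_label = nearest_other_class(x, 1, train)
```

In this toy setup the closest differently-labeled example is a shifted copy of the second training image, which differs from `x` in a single pixel.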

First, we define where the absolute value and comparison operator are taken element-wise. Intuitively, represents the pixels that substantially change between and . We choose as an arbitrary threshold representing how much a pixel changes before we consider the change “important”. This step is shown in Figure 4 (d). Along with containing the useful changes that are responsible for changing the oracle class label of , it also contains irrelevant changes that are superficial and do not contribute to changing the oracle class label. For example, in Figure 4 (d) notice that the green cluster is the only semantically important change; neither the red nor the blue changes are necessary.
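As a minimal sketch of this masking step, the element-wise comparison can be written as follows. The threshold value of 0.5 is an assumption chosen for illustration only, and `change_mask` is a hypothetical name.

```python
def change_mask(x, x_star, threshold=0.5):
    """Boolean mask of pixels whose absolute change exceeds threshold.

    The default threshold is an illustrative assumption.
    """
    return [[abs(a - b) > threshold for a, b in zip(row_x, row_s)]
            for row_x, row_s in zip(x, x_star)]

x = [[0.0, 0.9], [0.2, 1.0]]
x_star = [[1.0, 0.8], [0.1, 0.0]]
mask = change_mask(x, x_star)
```

Only the two pixels that flip between dark and bright are marked; the small intensity differences in the other two pixels fall below the threshold.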

To identify and remove the superficial changes, we perform spectral clustering on . We compute by enumerating all possible subsets of clusters of pixel regions. This gives us many potential adversarial examples . These are only potential adversarial examples because a given subset may not include the change that actually alters the class label.
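The subset enumeration can be sketched as below. The sketch assumes the clustering has already been done and is passed in as a map from pixels to cluster ids (with `None` for unchanged pixels); the spectral clustering itself is not implemented here, and the function name is hypothetical.

```python
from itertools import combinations

def candidate_examples(x, x_star, cluster_map):
    """Build one candidate per subset of clusters.

    For each subset, pixels in the chosen clusters are copied from
    x_star and all other pixels are kept from x.
    """
    ids = sorted({cid for row in cluster_map for cid in row if cid is not None})
    candidates = []
    for size in range(len(ids) + 1):
        for subset in combinations(ids, size):
            chosen = set(subset)
            cand = [[s if cid in chosen else v
                     for v, s, cid in zip(row_x, row_s, row_c)]
                    for row_x, row_s, row_c in zip(x, x_star, cluster_map)]
            candidates.append((subset, cand))
    return candidates

x = [[0.0, 0.0, 0.0, 0.0]]
x_star = [[1.0, 1.0, 1.0, 0.0]]
cluster_map = [[0, 1, 2, None]]  # three clusters of changed pixels
candidates = candidate_examples(x, x_star, cluster_map)
```

With three clusters this yields 2^3 = 8 candidates, ranging from the unmodified input (empty subset) to the input with every clustered change applied.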

We show three of the eight possible candidates in Figure 4. To avoid the need for human inspection of each candidate to determine which of these potential adversarial examples is actually misclassified, we follow an approach from Defense-GAN (Samangouei et al., 2018) and the Robust Manifold Defense (Ilyas et al., 2017): we take the generator from a GAN and use it to assign a likelihood score to the image. We make one small refinement: we use an AC-GAN (Mirza & Osindero, 2014) and compute the class-conditional likelihood of this image occurring. This process reduces distortion by on average.

As a small refinement, we find that initially filtering by least-canonical examples makes the attack succeed more often.

Generating ℓ_∞-invariant adversarial examples.

Our approach for generating ℓ_∞-invariant examples follows similar ideas as for the ℓ_0 case, but is conceptually simpler because the perturbation budget can be applied independently to each pixel. As we will see, however, our ℓ_∞ attack is less effective than the ℓ_0 one, so further optimizations may prove useful.

We build an augmented training set as in the ℓ_0 case. Instead of looking for the nearest neighbor of some example with label , we restrict our search to examples with specific target labels , which we have empirically found to produce more convincing examples (e.g., we always match digits representing a , with a target digit representing either a or a ). We then simply apply an ℓ_∞-bounded perturbation (with ) to by interpolating with , so as to minimize the distance between and the chosen target example .
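A minimal sketch of this interpolation step: each pixel moves toward the target by at most the perturbation budget, which minimizes the per-pixel distance to the target within the allowed ball. The budget `eps=0.3` used in the example and the function name are assumptions for illustration.

```python
def linf_interpolate(x, x_target, eps):
    """Move each pixel of x toward x_target by at most eps."""
    out = []
    for row_x, row_t in zip(x, x_target):
        # Clip the per-pixel difference to [-eps, eps] before adding it.
        out.append([v + max(-eps, min(eps, t - v))
                    for v, t in zip(row_x, row_t)])
    return out

x = [[0.0, 1.0, 0.5]]
x_target = [[1.0, 0.9, 0.5]]
x_adv = linf_interpolate(x, x_target, eps=0.3)  # eps is illustrative
```

Pixels that are already within the budget of the target reach it exactly; the rest stop at the boundary of the budget.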

Appendix C Invariance-based Adversarial Examples for Binarized MNIST

Figure 7: Histogram of MNIST pixel values (note the log scale on the y-axis) with two modes around and . Hence, binarizing inputs to an MNIST model does not significantly affect its performance.
Figure 8: Invariance-based adversarial examples for a toy ℓ_∞-robust model on MNIST. By thresholding inputs, the model is robust to perturbations such that . Adversarial examples (top-right of each set of 4 images) are labeled differently by a human. However, they become identical after binarization; the model thus labels both images confidently in the source image’s class.
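The effect described in the caption of Figure 8 can be reproduced with a toy sketch. It assumes pixel values in [0, 1] and a binarization threshold of 0.5; both the threshold and the function names are illustrative assumptions.

```python
def binarize(image, threshold=0.5):
    """Map every pixel to 0.0 or 1.0 depending on the threshold."""
    return [[1.0 if p > threshold else 0.0 for p in row] for row in image]

def linf_distance(a, b):
    """Largest absolute per-pixel difference between two images."""
    return max(abs(pa - pb) for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

source = [[0.0, 1.0, 0.0]]
altered = [[0.45, 0.55, 0.45]]  # visibly different, but on the same side of 0.5
same_after_binarization = binarize(source) == binarize(altered)
```

Here the two inputs differ by 0.45 in every pixel, yet any model that first binarizes its input receives exactly the same image for both.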

Appendix D Complete Set of 100 Invariance Adversarial Examples

Below we give the 100 randomly selected test images along with the invariance adversarial examples that were shown during the human study.

D.1 Original Images

D.2 ℓ_0 Invariance Adversarial Examples

D.3 ℓ_∞ Invariance Adversarial Examples