Ground-Truth Adversarial Examples

by Nicholas Carlini, et al.

The ability to deploy neural networks in real-world, safety-critical systems is severely limited by the presence of adversarial examples: slightly perturbed inputs that are misclassified by the network. In recent years, several techniques have been proposed for training networks that are robust to such examples; and each time stronger attacks have been devised, demonstrating the shortcomings of existing defenses. This highlights a key difficulty in designing an effective defense: the inability to assess a network's robustness against future attacks. We propose to address this difficulty through formal verification techniques. We construct ground truths: adversarial examples with provably minimal perturbation. We demonstrate how ground truths can serve to assess the effectiveness of attack techniques, by comparing the adversarial examples produced to the ground truths; and also of defense techniques, by measuring the increase in distortion to ground truths in the hardened network versus the original. We use this technique to assess recently suggested attack and defense techniques.




1 Introduction

While machine learning, and neural networks in particular, have seen significant success, recent work (Szegedy et al., 2014) has shown that an adversary can cause unintended behavior by performing slight modifications to the input at test-time. In neural networks used as classifiers, these adversarial examples are produced by taking some normal instance that is classified correctly, and applying a slight perturbation to cause it to be misclassified as any target desired by the adversary. This phenomenon, which has been shown to affect most state-of-the-art networks, poses a significant hindrance to deploying neural networks in safety-critical settings.

Many effective techniques have been proposed for generating adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2014; Moosavi-Dezfooli et al., 2015; Carlini and Wagner, 2017); and, conversely, several techniques have been proposed for training networks that are more robust to these examples (Huang et al., 2015; Zheng et al., 2016; Hendrik Metzen et al., 2017; Hendrycks and Gimpel, 2017; Madry et al., 2017). Unfortunately, it has proven difficult to accurately assess the robustness of any given defense by evaluating it against existing attack techniques. In several cases, a defensive technique that was at first thought to produce robust networks was later shown to be susceptible to new kinds of attacks. Most recently, at ICLR 2018, seven accepted defenses were shown to be vulnerable to attack (Athalye et al., 2018). This ongoing cycle casts doubt on any newly proposed defensive technique.

In recent years, new techniques have been proposed for the formal verification of neural networks (Katz et al., 2017b; Pulina and Tacchella, 2010, 2012; Huang et al., 2016; Ehlers, 2017). These techniques take a network and a desired property, and formally prove that the network satisfies the property — or provide an input for which the property is violated, if such an input exists. Verification techniques can be used to find adversarial examples for a given input point and some allowed amount of distortion, but they tend to be significantly slower than gradient-based techniques (Katz et al., 2017b; Pulina and Tacchella, 2012; Katz et al., 2017a).

Contributions. In this paper we propose a method for using formal verification to assess the effectiveness of adversarial example attacks and defenses. The key idea is to apply verification to construct provably minimally distorted adversarial examples: inputs that are misclassified by a classifier but are provably minimally distorted under a chosen distance metric. We perform two forms of analysis with this approach.

  • Attack Evaluation. We use provably minimally distorted adversarial examples to evaluate the efficacy of a recent attack (Carlini and Wagner, 2017) at generating adversarial examples, and find it produces adversarial examples within about 11.6% of optimal (Table 1) on our small model on the MNIST dataset. This suggests that iterative optimization-based attacks are indeed effective at generating adversarial examples, and strengthens the hypothesis of Madry et al. (2017) that first-order attacks are “universal”.

  • Defense Evaluation. More interestingly, we can also apply this technique to prove properties about defenses. Given an arbitrary defense, we can apply it to a problem small enough to be amenable to verification, and prove properties about that defense in the restricted setting. As a case study, we evaluate the robustness of adversarial training as performed by Madry et al. (2017) at defending against adversarial examples on the MNIST dataset. This defense was empirically found to be among the strongest submitted to ICLR 2018 (Athalye et al., 2018), and in this paper we formally prove this defense is effective: it succeeds at increasing robustness to adversarial examples by a factor of about 4.2 (Table 2) on the samples we examine. While this does not guarantee efficacy at larger scale, it does guarantee that, at least on small networks, this defense has not just caused current attacks to fail; it has successfully managed to increase the robustness of neural networks against all future attacks. To the best of our knowledge, we are the first to apply formal verification to formally prove properties about defenses initially designed with only empirical results.

2 Background and Notation

Neural network notation. We regard a neural network as a function $F(x) = F_n(F_{n-1}(\cdots F_1(x)))$ consisting of multiple layers $F_1, \ldots, F_n$. In this paper we exclusively study feed-forward neural networks used for classification, and so the final layer $F_n$ is the softmax activation function. We refer to the output of the second-to-last layer of the network (the input to $F_n$) as the logits and denote this as $Z(x)$. We define $\mathrm{loss}_F(x, y)$ to be the cross-entropy loss of the network on instance $x$ with true label $y$.

We focus here on networks for classifying greyscale MNIST images. Input images with width $W$ and height $H$ are represented as points in the space $[0, 1]^{W \cdot H}$.

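Adversarial examples (Szegedy et al., 2014). Given an input $x$, classified originally as target $t = C(x)$, where $C(x)$ denotes the label the network assigns to $x$, and a new desired target $t' \neq t$, we call $x'$ a targeted adversarial example if $C(x') = t'$ and $x'$ is close to $x$ under some given distance metric.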

Exactly which distance metric to use to properly evaluate the “closeness” between $x$ and $x'$ is a difficult question (Rozsa et al., 2016; Xiao et al., 2018; Zhengli Zhao, 2018). However, almost all work in this space has decided on using $L_p$ distances to measure distortion (Szegedy et al., 2014; Goodfellow et al., 2014; Moosavi-Dezfooli et al., 2015; Hendrik Metzen et al., 2017; Carlini and Wagner, 2017; Madry et al., 2017), and every defense at ICLR 2018 argues $L_\infty$ robustness. We believe that considering more sophisticated distance metrics is an important direction of research, but for consistency with prior work, in this paper we evaluate using the $L_\infty$ and $L_1$ distance metrics.

Generating adversarial examples. We make use of three popular methods for constructing adversarial examples:

  1. The Fast Gradient Method (FGM) (Goodfellow et al., 2014) is a one-step algorithm that takes a single step in the direction of the gradient:

    $x' = \mathrm{clip}_{[0,1]}\big(x + \epsilon \cdot \mathrm{sign}(\nabla_x \mathrm{loss}_F(x, y))\big)$

    where $\epsilon$ controls the step size taken, and clip ensures that the adversarial example resides in the valid image space from $0$ to $1$.

  2. The Basic Iterative Method (BIM) (Kurakin et al., 2016) (sometimes also called Projected Gradient Descent (Madry et al., 2017)) can be regarded as an iterative application of the fast gradient method. Initially it lets $x'_0 = x$ and then uses the update rule

    $x'_{i+1} = \mathrm{clip}_{x,\epsilon}\big(x'_i + \alpha \cdot \mathrm{sign}(\nabla_x \mathrm{loss}_F(x'_i, y))\big)$

    Intuitively, in each iteration this attack takes a step of size $\alpha$ as per the FGM method, but it iterates this process while keeping each $x'_i$ within the $\epsilon$-sized ball of $x$.

  3. The Carlini and Wagner (CW) (Carlini and Wagner, 2017) method is an iterative attack that constructs adversarial examples by approximately solving the minimization problem $\min_{x'} d(x, x')$ such that $C(x') = t'$ for the attacker-chosen target $t'$, where $C(x')$ is the label the network assigns to $x'$ and $d$ is an appropriate distance metric. Since the constrained optimization is difficult, instead they choose to solve $\min_{x'} d(x, x') + c \cdot \ell(x')$, where $\ell$ is a loss function that encodes how close $x'$ is to being adversarial. Specifically, they set

    $\ell(x') = \max\big(\max\{Z(x')_i : i \neq t'\} - Z(x')_{t'},\ 0\big)$

    $Z$, the logits of the network, are used instead of the softmax output because this was found to provide superior results. Although the attack was originally constructed to optimize for $L_2$ distortion, we use it with $L_\infty$ and $L_1$ distortions in this paper.
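As an illustration of the three attack ingredients listed above, the following is a minimal sketch on a toy linear classifier, not the networks or attack code used in the paper. The weights `W`, `B`, the helper names, and the chosen `eps`/`alpha` values are all assumptions made for the example; the steps shown are the untargeted signed-gradient variants, and `cw_loss` is only the hinge term on the logits, not the full CW optimizer.

```python
import math

# Hypothetical toy model: logits Z(x) = W x + B with 2 classes and 2 inputs.
W = [[1.0, -1.0], [-1.0, 1.0]]
B = [0.0, 0.0]

def logits(x):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W, B)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def classify(x):
    z = logits(x)
    return max(range(len(z)), key=lambda i: z[i])

def loss_grad(x, y):
    """Gradient of the cross-entropy loss with respect to the input x."""
    p = softmax(logits(x))
    err = [p[k] - (1.0 if k == y else 0.0) for k in range(len(p))]
    return [sum(err[k] * W[k][j] for k in range(len(W))) for j in range(len(x))]

def sign(v):
    return (v > 0) - (v < 0)

def clip(v, lo, hi):
    return max(lo, min(hi, v))

def fgm(x, y, eps):
    """One signed-gradient step of size eps, clipped to the image range [0, 1]."""
    g = loss_grad(x, y)
    return [clip(xi + eps * sign(gi), 0.0, 1.0) for xi, gi in zip(x, g)]

def bim(x, y, eps, alpha, steps):
    """Repeated steps of size alpha, kept inside the eps-ball around x."""
    adv = list(x)
    for _ in range(steps):
        g = loss_grad(adv, y)
        adv = [clip(clip(ai + alpha * sign(gi), xi - eps, xi + eps), 0.0, 1.0)
               for ai, gi, xi in zip(adv, g, x)]
    return adv

def cw_loss(x_adv, target):
    """CW-style hinge on the logits: zero once `target` has the largest logit."""
    z = logits(x_adv)
    other = max(v for i, v in enumerate(z) if i != target)
    return max(other - z[target], 0.0)

x = [0.6, 0.4]                       # classified as class 0 by the toy model
x_fgm = fgm(x, 0, 0.15)
x_bim = bim(x, 0, 0.15, 0.05, 10)
```

On this toy model both `x_fgm` and `x_bim` cross the decision boundary, and `cw_loss` drops from a positive value at `x` to zero at the perturbed point, which is the condition the CW objective drives toward.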

Neural network verification. The intended use of deep neural networks as controllers in safety-critical systems (Julian et al., 2016; Bojarski et al., 2016) has sparked an interest in developing techniques for verifying that they satisfy various properties (Pulina and Tacchella, 2010, 2012; Huang et al., 2016; Ehlers, 2017; Katz et al., 2017b). Here we focus on the recently-proposed Reluplex algorithm (Katz et al., 2017b): a simplex-based approach that can effectively tackle networks with piecewise-linear activation functions, such as rectified linear units (ReLUs) or max-pooling layers. Reluplex is known to be sound and complete, and so it is suitable for establishing adversarial examples of provably minimal distortion.

In (Katz et al., 2017b) it is shown that Reluplex can be used to determine whether there exists an adversarial example within distance $\delta$ of some input point $x$. This is performed by encoding the neural network itself and the constraints regarding $\delta$ as a set of linear equations and ReLU constraints, and then having Reluplex attempt to prove the property that “there does not exist an input point within distance $\delta$ of $x$ that is assigned a different label than $x$”. Reluplex either responds that the property holds, in which case there is no adversarial example within distance $\delta$ of $x$, or it returns a counter-example which constitutes the sought-after adversarial input. By invoking Reluplex iteratively and applying binary search, one can approximate the optimal $\delta$ (i.e., the largest $\delta$ for which no adversarial example exists) up to a desired precision (Katz et al., 2017b).

The proof-of-concept implementation of Reluplex described in (Katz et al., 2017b) supported only networks with the ReLU activation function, and could only handle the $L_\infty$ norm as a distance metric. Here we use a simple encoding that allows us to use it for the $L_1$ norm as well.

Adversarial training. Adversarial training is perhaps the first proposed defense against adversarial examples (Szegedy et al., 2014), and is a conceptually straightforward approach. The defender trains a classifier, generates adversarial examples for that classifier, retrains the classifier using the adversarial examples, and repeats.

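Formally, the defender attempts to solve the following formulation

$\theta^* = \arg\min_\theta \; \mathbb{E}_{(x, y)} \Big[ \max_{\|\delta\|_\infty \le \epsilon} \mathrm{loss}_{F_\theta}(x + \delta, y) \Big]$

where $\theta$ denotes the network weights and $\delta$ a perturbation of magnitude at most $\epsilon$, by approximating the inner maximization step with an existing attack technique.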

Recent work has shown (Madry et al., 2017) that for networks with sufficient capacity, adversarial training can be an effective defense even against the most powerful attacks today by training against equally powerful attacks.

It is known that if adversarial training is performed using weaker attacks, such as the fast gradient sign method, then it is still possible to construct adversarial examples by using stronger attacks (Tramèr et al., 2018).

It is an open question whether adversarial training using stronger attacks (such as PGD) will actually increase robustness to all attacks, or whether such training will be effective at preventing only current attacks.
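The train-attack-retrain loop described above can be sketched on a toy problem. The example below is an illustration under strong assumptions, not the training procedure of Madry et al. (2017): it uses a two-feature logistic regression instead of a neural network, a single signed-gradient step as the stand-in for the inner attack, and made-up data, learning rate, and epoch count. All function names are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def predict(w, b, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(v):
    return (v > 0) - (v < 0)

def fgm_example(w, b, x, y, eps):
    """Signed-gradient step: for logistic loss, d(loss)/dx = (p - y) * w."""
    p = predict(w, b, x)
    return [min(1.0, max(0.0, xi + eps * sign((p - y) * wi)))
            for xi, wi in zip(x, w)]

def adversarial_train(data, eps, lr=0.5, epochs=200):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            # Inner step: approximate the worst-case perturbation with an attack.
            x_adv = fgm_example(w, b, x, y, eps)
            # Outer step: gradient descent on the loss at the perturbed point.
            p = predict(w, b, x_adv)
            w = [wi - lr * (p - y) * xi for wi, xi in zip(w, x_adv)]
            b -= lr * (p - y)
    return w, b

# Made-up, well-separated toy data with features in [0, 1].
DATA = [([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.8, 0.9], 1), ([0.9, 0.8], 1)]
w, b = adversarial_train(DATA, eps=0.1)
```

Swapping `fgm_example` for a stronger iterative attack changes only the inner step, which is the design point the surrounding discussion turns on: the quality of the defense depends on how well the inner attack approximates the worst case.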

Provable (certified) defenses. Very recent work at ICLR 2018 has begun to construct certified defenses to adversarial examples. These defenses can give a proof of robustness: adversarial examples of bounded distortion can raise the test loss to at most a certified bound. This work is an extremely important direction of research that applies formal verification to the process of constructing provably sound defenses to adversarial examples. However, the major drawback of these approaches is that certified defenses so far can only be applied to small networks on small datasets.

In contrast, this work can take an arbitrary defense (that can be applied to networks of any size), and formally prove properties about it on a small dataset. If a defense is not effective on a large dataset, it is also likely to be ineffective on the small dataset we study, and we will therefore be able to show it is not effective.

Put differently, our work shares the same key limitation as certified defenses: when scaling to larger datasets, we are no longer able to offer provable guarantees. However, because the defenses we study scale to larger datasets, even though our proofs of robustness do not, it is still possible to apply these defenses with increased confidence in their security.

3 Model Setup

The problem of neural network verification that we consider here is an NP-complete problem (Katz et al., 2017b), and despite recent progress only networks with a few hundred nodes can be soundly verified. Thus, in order to evaluate our approach we trained a small network over the MNIST data set. This network is a fully-connected, 3-layer network that achieves a

accuracy despite having only 20k weights and consisting of fewer than 100 hidden neurons (24 in each layer). As verification of neural networks becomes more scalable in the future, our approach could become applicable to larger networks and additional data sets.

Network, metric            Number of Points   Carlini-Wagner   Minimally Distorted Adversarial Example   Percent Improvement
$N$, $L_\infty$            38/90              0.042            0.038                                     11.632
                           90/90              0.063            0.061                                     6.027
$N$, $L_1$                 6/90               1.94             1.731                                     34.909
                           90/90              7.551            7.492                                     3.297
$\bar{N}$, $L_\infty$      81/90              0.211            0.193                                     11.637
                           90/90              0.219            0.203                                     10.568
$\bar{N}$, $L_1$           64/90              6.44             6.36                                      6.285
                           90/90              8.187            8.128                                     4.486
Table 1: Evaluating our technique on the MNIST dataset ($N$: original network; $\bar{N}$: adversarially trained network)

For verification, we use the proof-of-concept implementation of Reluplex available online (Katz et al., 2017c). The only non-linear operator that this implementation was originally designed to support is the ReLU function, but we observe that it can also support max operators using the following encoding:

$\max(x, y) = \mathrm{ReLU}(x - y) + y$

This fact allows the encoding of max operators using ReLUs, and consequently the encoding of max-pooling layers into Reluplex (although we did not experiment with such layers in this paper). It also allows us to extend the results from (Katz et al., 2017b) and measure distances with the $L_1$ norm as well as the $L_\infty$ norm, by encoding absolute values using ReLUs:

$|x| = \max(x, -x) = \mathrm{ReLU}(2x) - x$

Because the $L_1$ distance between two points is defined as a sum of absolute values, this encoding allowed us to encode $L_1$ distances into Reluplex without modifying its code. We point out, however, that an increase in the number of ReLU constraints in the input adversely affects Reluplex’s performance. For example, in the case of the MNIST dataset, encoding $L_1$ distance entails adding a ReLU constraint for each of the 784 input coordinates. It is thus not surprising that experiments using $L_1$ typically took longer to finish than those using $L_\infty$.
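The ReLU encodings of max and absolute value rest on two identities, max(a, b) = ReLU(a - b) + b and |v| = ReLU(2v) - v, which can be checked numerically. The sketch below does only that; it does not touch Reluplex or its input format, and the function names are chosen here for illustration.

```python
def relu(v):
    return max(v, 0.0)

def max_via_relu(a, b):
    # max(a, b) = ReLU(a - b) + b
    return relu(a - b) + b

def abs_via_relu(v):
    # |v| = max(v, -v) = ReLU(v - (-v)) + (-v) = ReLU(2v) - v
    return relu(2.0 * v) - v

def l1_via_relu(x, x_adv):
    # One ReLU per input coordinate, so 784 extra ReLUs for an MNIST image.
    return sum(abs_via_relu(a - b) for a, b in zip(x, x_adv))
```

The per-coordinate ReLU in `l1_via_relu` is the source of the extra constraints mentioned above: each added ReLU is one more case split the verifier may need to explore.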

Each individual experiment that we conducted included a network $N$, a distance metric $d \in \{L_1, L_\infty\}$, an input point $x$, a target label $t' \neq C(x)$ (where $C$ denotes the label assigned by the network), and an initial adversarial input $x'_{\mathrm{init}}$ for which $C(x'_{\mathrm{init}}) = t'$. The goal of the experiment was then to find a minimally distorted example $x'$, such that $C(x') = t'$ and $d(x, x')$ is minimal. As explained in Section 2, this is performed by iteratively invoking Reluplex and performing a binary search.

Intuitively, $\delta_{\max}$ indicates the distance to the closest adversarial input currently known, and the provably minimally-distorted input is known to be in the range between $\delta_{\min}$ and $\delta_{\max}$. Thus, $\delta_{\max}$ is initialized using the distance of the initial adversarial input provided, and $\delta_{\min}$ is initialized to 0. The search procedure iteratively shrinks the range $\delta_{\max} - \delta_{\min}$ until it is below a certain threshold. It then returns $\delta_{\max}$ as the distance to the provably minimally distorted adversarial example, and this is guaranteed to be accurate up to the specified precision. The provably minimally distorted input itself is also returned.

For the initial adversarial input in our experiments we used one found using the CW attack. We note that Reluplex invocations are computationally expensive, and so it is better to start with $\delta_{\max}$ as close as possible to the true minimal distance, in order to reduce the number of required iterations until $\delta_{\max} - \delta_{\min}$ is sufficiently small. For the same reason, experiments using the $L_1$ distance metric were slower than those using $L_\infty$: the initial distances were typically much larger, which required additional iterations.
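The search described in this section can be sketched as a plain binary search over the distance bound. In the sketch below, `has_adversarial_within` is a hypothetical stand-in for a sound and complete verifier query (it is not the Reluplex interface), the names `delta_min` and `delta_max` are chosen here for the two ends of the search range, and the numeric values are arbitrary. A real run could also tighten the upper end to the distance of the returned counter-example rather than to the midpoint, which this simplified version does not do.

```python
def minimal_distortion(has_adversarial_within, delta_init, tol):
    """Bracket the smallest distance at which an adversarial example exists.

    has_adversarial_within(delta) must return True exactly when some
    adversarial example lies within distance delta of the input; this
    stands in for one verifier invocation.
    """
    delta_min, delta_max = 0.0, delta_init
    while delta_max - delta_min > tol:
        mid = (delta_min + delta_max) / 2.0
        if has_adversarial_within(mid):
            delta_max = mid   # counter-example found: the minimum is at most mid
        else:
            delta_min = mid   # property proved: no adversarial example within mid
    return delta_min, delta_max

# Toy oracle: pretend the true minimal distortion is 0.2 and the
# attack-provided starting point lies at distance 0.3.
TRUE_MIN = 0.2
lo, hi = minimal_distortion(lambda d: d >= TRUE_MIN, 0.3, 1e-3)
```

Each loop iteration costs one verifier call, so halving the starting range saves one call; this is why a strong attack is a useful source for the initial upper bound.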

4 Evaluation

For evaluation purposes we arbitrarily selected 10 source images with known labels from the MNIST test set. We considered two neural networks: the one described in Section 3, denoted $N$, and a version of $N$ that has been trained with adversarial training as described in (Madry et al., 2017), denoted $\bar{N}$. We also considered two distance metrics, $L_\infty$ and $L_1$. For every combination of neural network, distance metric and labeled source image $x$, we considered each of the 9 other possible labels for $x$. For each of these we used the CW attack to produce an initial targeted adversarial example, and then used Reluplex to search for a provably minimally distorted example. The results are given in Table 1.

Each major row of the table corresponds to a specific neural network and distance metric (as indicated in the first column), and describes 90 individual experiments (10 inputs, times 9 target labels for each input). The first sub-row within each row considers just those experiments for which Reluplex terminated successfully, whereas the second sub-row considers all 90 examples, including those where Reluplex timed out. Whenever a timeout occurred, we considered the last (smallest) $\delta_{\max}$ that was discovered by the search before it timed out as the algorithm’s output. The other columns of the table indicate the average distance to the adversarial examples found by the CW attack, the average distance to the minimally-distorted adversarial examples found by our technique, and the average improvement rate of our technique over the CW attack.

Below we analyze the results in order to draw conclusions regarding the CW attack and the defense of (Madry et al., 2017). While these results naturally hold only for the networks we used and the inputs we tested, we believe they provide some intuition as to how well the tested attack and defense techniques perform. We intend to make our data publicly available, and we encourage others (i) to evaluate new attack techniques using the minimally-distorted examples we have already discovered, and on additional ones; and (ii) to use this approach for evaluating new defensive techniques.

4.1 Evaluating Attacks

Iterative attacks produce near-optimal adversarial examples. As is shown by Table 1, the adversarial examples produced by the CW attack are on average within about 11.6% of the minimally-distorted example when using the $L_\infty$ norm, and within about 6.3% of the minimally-distorted example when using $L_1$ (we consider here just the terminated experiments, and ignore the $N$, $L_1$ category where too few experiments terminated to draw a meaningful conclusion). In particular, iterative attacks perform substantially better than single-step methods, such as the fast gradient method. This is an expected result: the fast gradient method was designed to show the linearity of neural networks, not to produce high-quality adversarial examples.

This result supports the hypothesis of Madry et al. (2017), who argue that first-order attacks (i.e., gradient-based methods) are “universal”. This in turn justifies using first-order methods as the basis of adversarial training, at least on the datasets we consider.

There is still room for improving iterative attacks. Even on this very small and simple neural network, we observed that in many instances the ground-truth adversarial example has a lower distortion than the best iterative adversarial example. The cause for this is simple: gradient descent only finds a local minimum, not a global minimum.

We have found that if we take a small step from the original image in the direction of the minimally-distorted adversarial example, then gradient descent will converge to the minimally-distorted adversarial example. Taking random steps and then performing gradient descent does not help significantly.

Suboptimal results are correlated. We have found that when the iterative attack performs suboptimally compared to the minimally-distorted example for one target label, it will often perform poorly for many other target labels as well. These instances are not always of larger absolute distortion, but a larger relative gap on one instance often indicates that the relative gap will be larger for other targets. For instance, on the adversarially trained network, the ground-truth adversarial examples for one source digit were consistently better than the iterative attack results.

Figure: grids of adversarial examples, indexed by source label (rows) and target label (columns).
(a) Adversarial examples generated on a neural network using Reluplex.
(b) Adversarial examples generated on a neural network using Carlini and Wagner (2017).

When we examined the most extreme cases in which this phenomenon was observed, we found that, similarly to the case described above, the large gap was caused by gradient descent initially leading away from the minimally-distorted example for most targets, resulting in the discovery of an inferior, local minimum.

4.2 Evaluating Defenses

For the purpose of evaluating the defensive technique of (Madry et al., 2017), we compared the $N$, $L_\infty$ and $\bar{N}$, $L_\infty$ experiments (the $L_1$ experiments were disregarded because of the small number of experiments that terminated for the $N$, $L_1$ case). Specifically, we compared the two sets of experiments on the subset of instances that terminated for both. The results appear in Table 2.

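Network, metric            Number of Points   CW      Minimally Distorted   Percent Improvement
$N$, $L_\infty$            35/35              0.042   0.039                 12.319
$\bar{N}$, $L_\infty$      35/35              0.18    0.165                 11.153
Table 2: Comparing the 35 instances on which Reluplex terminated for both $N$, $L_\infty$ and $\bar{N}$, $L_\infty$ ($N$: original network; $\bar{N}$: adversarially trained network).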

The defense of Madry et al. (2017) is effective. Our evaluation suggests that adversarial retraining is indeed effective: it improves the distance to the minimally distorted adversarial examples by an average of 423% (from an average of 0.039 to an average of 0.165) on our small network.

Another interesting observation is that while adversarial retraining improves the overall situation, we found several points at which it actually made things worse, i.e., the minimally distorted adversarial examples for the hardened network were smaller than those of the original network. This behavior was observed for 7 out of the 35 aforementioned experiments, with the average percent of degradation being 12.8%. This seems to highlight the necessity of evaluating the effectiveness of a defensive technique, and the robustness of a network in general, over a large dataset of points. The question of how to pick a “good” set of points that would adequately represent the behavior of the network remains open.

Training on iterative attacks does not overfit. Overfitting is an issue that is often encountered when performing adversarial training. By this we mean that a defense may overfit to the type of attack used during training. When this occurs, the hardened network will have high accuracy against the one attack used during training, but give low accuracy on other attacks. We have found no evidence of overfitting when performing the adversarial training of (Madry et al., 2017): the minimally distorted adversarial examples improve on the CW attack by roughly 11% to 12% (Table 2) on both the hardened and the original networks.

Figure (continued): grids of adversarial examples, indexed by source label (rows) and target label (columns).
(c) Adversarial examples generated on Madry et al. (2017) using Reluplex.
(d) Adversarial examples generated on Madry et al. (2017) using Carlini and Wagner (2017).

It is easier to formally analyze Madry et al. (2017). For both the $L_\infty$ and $L_1$ distance metrics, it seems significantly easier to analyze the robustness of the adversarially trained network: when using $L_\infty$, Reluplex terminated on 81 of the 90 instances on the adversarially trained network, versus 38 on the standard network; and for $L_1$, it terminated on 64 instances for the hardened network compared to just 6 on the standard network. We are still looking into the reason for this behavior. Naively, one might assume that because the initial adversarial examples provided to Reluplex have larger distance for the hardened network, these experiments would take longer to converge, but we observed the opposite behavior.

One possible explanation could be that the adversarially trained network makes less use of the nonlinear ReLU units, and is therefore more amenable to analysis with Reluplex. We empirically verify that this is not the case. For a given instance, we track, for each ReLU unit in the network, whether it is in the saturated zero region, or the linear region. We then compute the nonlinearity of the network as the number of units that change from the saturated region to the linear region, or vice versa, when going from the given input to the discovered adversarial example. We find that there is no statistically significant difference between the nonlinearity of the two networks.
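The nonlinearity measure described above, the number of ReLU units whose region changes between the input and its adversarial example, can be sketched as follows. The tiny hand-written network and the two input points are assumptions made for illustration; they are not the paper's networks or data, and the function names are hypothetical.

```python
def relu_states(layers, x):
    """For each hidden ReLU unit, True if it is in the linear (active) region."""
    states = []
    h = x
    for weights, biases in layers:
        pre = [sum(w * v for w, v in zip(row, h)) + b
               for row, b in zip(weights, biases)]
        states.extend(p > 0.0 for p in pre)
        h = [max(p, 0.0) for p in pre]
    return states

def nonlinearity(layers, x, x_adv):
    """Count ReLU units that switch between the zero and linear regions."""
    before = relu_states(layers, x)
    after = relu_states(layers, x_adv)
    return sum(a != b for a, b in zip(before, after))

# Hypothetical hidden layers: (weight rows, biases) per layer.
LAYERS = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 1.0]], [-0.05]),
]
count = nonlinearity(LAYERS, [0.6, 0.4], [0.45, 0.55])
```

In this toy case the two first-layer units swap regions while the second-layer unit stays active, so the count is 2. Averaging such counts over instances for each network gives the quantity compared in the paragraph above.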

5 Conclusion

Neural networks hold great potential to be used in safety-critical systems, but their susceptibility to adversarial examples poses a significant hindrance. While defenses can be argued secure against existing attacks, it is difficult to assess vulnerability to future attacks. The burgeoning field of neural network verification can mitigate this problem, by allowing us to obtain an absolute measurement of the usefulness of a defense, regardless of the attack to be used against it.

In this paper, we introduce provably minimally distorted adversarial examples and show how to construct them with formal verification approaches. We evaluate one recent attack (Carlini and Wagner, 2017) and find it often produces adversarial examples whose distance is within roughly 6% to 12% of optimal in the categories where enough experiments terminated (Table 1). We also evaluate one defense (Madry et al., 2017), and find that it increases distortion to the nearest adversarial example by a factor of about 4.2 on average (Table 2) on the MNIST dataset for our tested networks. To the best of our knowledge, this is the first proof of robustness increase for a defense that was not designed to be proven secure.

Currently available verification tools afford limited scalability, which means experiments can only be conducted on small networks. However, as better verification techniques are developed, this limitation is expected to be lifted. Orthogonally, when preparing to use a neural network in a safety-critical setting, users may choose to design their networks so as to make them particularly amenable to verification techniques (e.g., by using specific activation functions or network topologies), so that strong guarantees about their correctness and robustness may be obtained.