While machine learning, and neural networks in particular, have seen significant success, recent work (Szegedy et al., 2014)
has shown that an adversary can cause unintended behavior by slightly modifying the input at test time. In neural networks used as classifiers, these adversarial examples are produced by taking some normal instance that is classified correctly and applying a slight perturbation that causes it to be misclassified as any target desired by the adversary. This phenomenon, which has been shown to affect most state-of-the-art networks, poses a significant hindrance to deploying neural networks in safety-critical settings.
Many effective techniques have been proposed for generating adversarial examples (Szegedy et al., 2014; Goodfellow et al., 2014; Moosavi-Dezfooli et al., 2015; Carlini and Wagner, 2017); and, conversely, several techniques have been proposed for training networks that are more robust to these examples (Huang et al., 2015; Zheng et al., 2016; Hendrik Metzen et al., 2017; Hendrycks and Gimpel, 2017; Madry et al., 2017). Unfortunately, it has proven difficult to accurately assess the robustness of any given defense by evaluating it against existing attack techniques. In several cases, a defensive technique that was at first thought to produce robust networks was later shown to be susceptible to new kinds of attacks. Most recently, at ICLR 2018, seven accepted defenses were shown to be vulnerable to attack (Athalye et al., 2018). This ongoing cycle thus casts doubt on any newly proposed defensive technique.
In recent years, new techniques have been proposed for the formal verification of neural networks (Katz et al., 2017b; Pulina and Tacchella, 2010, 2012; Huang et al., 2016; Ehlers, 2017). These techniques take a network and a desired property, and formally prove that the network satisfies the property — or provide an input for which the property is violated, if such an input exists. Verification techniques can be used to find adversarial examples for a given input point and some allowed amount of distortion, but they tend to be significantly slower than gradient-based techniques (Katz et al., 2017b; Pulina and Tacchella, 2012; Katz et al., 2017a).
Contributions. In this paper we propose a method for using formal verification to assess the effectiveness of adversarial example attacks and defenses. The key idea is to apply verification to construct provably minimally distorted adversarial examples: inputs that are misclassified by a classifier, and for which no misclassified input closer to the original (under a chosen distance metric) exists. We perform two forms of analysis with this approach.
Attack Evaluation. We use provably minimally distorted adversarial examples to evaluate the efficacy of a recent attack (Carlini and Wagner, 2017) at generating adversarial examples, and find that it produces adversarial examples close to optimal on our small model on the MNIST dataset. This suggests that iterative optimization-based attacks are indeed effective at generating adversarial examples, and strengthens the hypothesis of Madry et al. (2017) that first-order attacks are "universal".
Defense Evaluation. More interestingly, we can also apply this technique to prove properties about defenses. Given an arbitrary defense, we can apply it to a problem small enough to be amenable to verification, and prove properties about that defense in the restricted setting. As a case study, we evaluate the robustness of adversarial training as performed by Madry et al. (2017) at defending against adversarial examples on the MNIST dataset. This defense was empirically found to be among the strongest submitted to ICLR 2018 (Athalye et al., 2018), and in this paper we formally prove that it is effective: it succeeds at increasing robustness to adversarial examples on the samples we examine. While this does not guarantee efficacy at larger scale, it does guarantee that, at least on small networks, this defense has not just caused current attacks to fail; it has successfully managed to increase the robustness of neural networks against all future attacks. To the best of our knowledge, we are the first to apply formal verification to prove properties about defenses that were initially supported only by empirical results.
2 Background and Notation
Neural network notation. We regard a neural network as a function $F$ consisting of multiple layers, $F = F_n \circ \dots \circ F_1$. In this paper we exclusively study feed-forward neural networks used for classification, and so the final layer $F_n$ is the softmax activation function. We refer to the output of the second-to-last layer of the network (the input to $F_n$) as the logits, and denote it $Z(x)$. We define $L(x, y)$ to be the cross-entropy loss of the network on instance $x$ with true label $y$.
We focus here on networks for classifying greyscale MNIST images. Input images with width $w$ and height $h$ are represented as points in the space $[0, 1]^{w \cdot h}$.
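As a concrete illustration of this notation (the helper names here are our own, not part of any library):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; result sums to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy_loss(logits, true_label):
    # -log of the probability the softmax assigns to the true label.
    probs = softmax(logits)
    return -np.log(probs[true_label])

# A 28x28 greyscale MNIST image is a point in [0, 1]^784.
x = np.random.rand(28, 28).reshape(-1)
logits = np.array([2.0, 1.0, 0.1])  # hypothetical logits Z(x)
loss = cross_entropy_loss(logits, true_label=0)
```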
Adversarial examples. (Szegedy et al., 2014) Given an input $x$, classified originally as target $t$, and a new desired target $t' \ne t$, we call $x'$ a targeted adversarial example if $F(x') = t'$ and $x'$ is close to $x$ under some given distance metric.
Exactly which distance metric to use to properly evaluate the "closeness" between $x$ and $x'$ is a difficult question (Rozsa et al., 2016; Xiao et al., 2018; Zhao et al., 2018). However, almost all work in this space has settled on using $L_p$ distances to measure distortion (Szegedy et al., 2014; Goodfellow et al., 2014; Moosavi-Dezfooli et al., 2015; Hendrik Metzen et al., 2017; Carlini and Wagner, 2017; Madry et al., 2017), and every defense at ICLR 2018 argues $L_p$ robustness. We believe that considering more sophisticated distance metrics is an important direction of research, but for consistency with prior work, in this paper we evaluate using the $L_1$ and $L_\infty$ distance metrics.
Generating adversarial examples. We make use of three popular methods for constructing adversarial examples:
The Fast Gradient Method (FGM) (Goodfellow et al., 2014) is a one-step algorithm that takes a single step in the direction of the gradient:
$x' = \text{clip}(x + \epsilon \cdot \text{sign}(\nabla_x L(x, y))),$
where $\epsilon$ controls the step size taken, and clip ensures that the adversarial example resides in the valid image space from 0 to 1.
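A minimal numpy sketch of this update (the gradient is passed in directly, since we assume no particular deep-learning framework):

```python
import numpy as np

def fgm(x, grad, epsilon):
    """One-step L-infinity fast gradient method.

    x: input image in [0, 1]^n; grad: gradient of the loss at x
    (supplied by the caller, e.g. from a framework's autodiff).
    """
    x_adv = x + epsilon * np.sign(grad)
    # clip keeps the adversarial example inside the valid image space [0, 1]
    return np.clip(x_adv, 0.0, 1.0)
```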
The Basic Iterative Method (BIM) (Kurakin et al., 2016) (sometimes also called Projected Gradient Descent (Madry et al., 2017)) can be regarded as an iterative application of the fast gradient method. Initially it lets $x'_0 = x$, and then uses the update rule
$x'_{i+1} = \text{clip}_{x,\epsilon}(x'_i + \alpha \cdot \text{sign}(\nabla_x L(x'_i, y))).$
Intuitively, in each iteration this attack takes a step of size $\alpha$ as per the FGM method, but it iterates this process while keeping each $x'_i$ within the $\epsilon$-sized ball of $x$.
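The iteration above can be sketched as follows, with `grad_fn` standing in (hypothetically) for a framework's gradient computation:

```python
import numpy as np

def bim(x, grad_fn, epsilon, alpha, steps):
    """Iterative FGM: take steps of size alpha, projecting each iterate
    back into the epsilon-ball around x (and into valid image space)."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        # project into the L-infinity ball of radius epsilon around x
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon)
        # and back into the valid image space [0, 1]
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv
```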
The Carlini and Wagner (CW) (Carlini and Wagner, 2017) method is an iterative attack that constructs adversarial examples by approximately solving the minimization problem $\min d(x, x')$ such that $F(x') = t'$ for the attacker-chosen target $t'$, where $d$ is an appropriate distance metric. Since the constrained optimization is difficult, they instead solve $\min d(x, x') + c \cdot \ell(x')$, where $\ell(x')$
is a loss function that encodes how close $x'$ is to being adversarial. Specifically, they set
$\ell(x') = \max\big(\max\{Z(x')_i : i \ne t'\} - Z(x')_{t'},\; 0\big).$
Here $Z(x')$, the logits of the network, are used instead of the softmax output because this was found to provide superior results. Although the attack was originally constructed to optimize for $L_2$ distortion, we use it with $L_1$ and $L_\infty$ distortions in this paper.
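A sketch of this loss term (with the attack's optional confidence margin set to zero, and the logits taken as a plain numpy array):

```python
import numpy as np

def cw_loss(logits, target):
    """CW loss l(x'): how far x' is from being classified as target.
    Becomes zero once the target logit dominates all other logits."""
    other = np.max(np.delete(logits, target))
    return max(other - logits[target], 0.0)
```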
Neural network verification. The intended use of deep neural networks as controllers in safety-critical systems (Julian et al., 2016; Bojarski et al., 2016) has sparked an interest in developing techniques for verifying that they satisfy various properties (Pulina and Tacchella, 2010, 2012; Huang et al., 2016; Ehlers, 2017; Katz et al., 2017b). Here we focus on the recently-proposed Reluplex algorithm (Katz et al., 2017b)
: a simplex-based approach that can effectively tackle networks with piecewise-linear activation functions, such as rectified linear units (ReLUs) or max-pooling layers. Reluplex is known to be sound and complete, and so it is suitable for establishing adversarial examples of provably minimal distortion.
In (Katz et al., 2017b) it is shown that Reluplex can be used to determine whether there exists an adversarial example within distance $\delta$ of some input point $x_0$. This is performed by encoding the neural network itself and the constraints regarding $\delta$ as a set of linear equations and ReLU constraints, and then having Reluplex attempt to prove the property that "there does not exist an input point within distance $\delta$ of $x_0$ that is assigned a different label than $x_0$". Reluplex either responds that the property holds, in which case there is no adversarial example within distance $\delta$ of $x_0$, or it returns a counter-example which constitutes the sought-after adversarial input. By invoking Reluplex iteratively and applying binary search, one can approximate the optimal $\delta$ (i.e., the largest $\delta$ for which no adversarial example exists) up to a desired precision (Katz et al., 2017b).
The proof-of-concept implementation of Reluplex described in (Katz et al., 2017b) supported only networks with the ReLU activation function, and could only handle the $L_\infty$ norm as a distance metric. Here we use a simple encoding that allows us to use it for the $L_1$ norm as well.
Adversarial training. Adversarial training is perhaps the first proposed defense against adversarial examples (Szegedy et al., 2014), and is a conceptually straightforward approach. The defender trains a classifier, generates adversarial examples for that classifier, retrains the classifier using the adversarial examples, and repeats.
Formally, the defender attempts to solve the following formulation
$\min_\theta \; \mathbb{E}_{(x,y)} \Big[ \max_{\|\delta\| \le \epsilon} L(x + \delta, y) \Big]$
(where $\theta$ are the network parameters) by approximating the inner maximization step with an existing attack technique.
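To illustrate this min-max formulation, here is a toy sketch (not the setup of Madry et al.): a logistic-regression stand-in where, for a linear model, the inner maximization over the $L_\infty$ ball has a closed form, so a one-step perturbation solves it exactly.

```python
import numpy as np

def adversarial_training_step(w, X, y, epsilon, lr):
    """One step of the defender's outer minimization, with the inner
    maximization solved in closed form for a linear model.
    Toy model: logistic regression, labels in {-1, +1}."""
    # inner max: the worst-case L-infinity perturbation of size epsilon
    # for a linear model pushes each point against its label
    X_adv = X - epsilon * np.sign(w) * y[:, None]
    # outer min: gradient step on the logistic loss at the perturbed points
    margins = y * (X_adv @ w)
    grad = -(X_adv * (y / (1 + np.exp(margins)))[:, None]).mean(axis=0)
    return w - lr * grad
```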
Recent work has shown (Madry et al., 2017) that for networks with sufficient capacity, adversarial training can be an effective defense even against the most powerful attacks today by training against equally powerful attacks.
It is known that if adversarial training is performed using weaker attacks, such as the fast gradient sign method, then it is still possible to construct adversarial examples by using stronger attacks (Tramèr et al., 2018).
It is an open question whether adversarial training using stronger attacks (such as PGD) will actually increase robustness to all attacks, or whether such training will be effective at preventing only current attacks.
Provable (certified) defenses. Very recent work at ICLR 2018 has begun to construct certified defenses to adversarial examples. These defenses come with a proof of robustness: a guarantee that adversarial examples of distortion at most $\epsilon$ cause a test loss of at most a certified bound. This work is an extremely important direction of research that applies formal verification to the process of constructing provably sound defenses to adversarial examples. However, the major drawback of these approaches is that certified defenses so far can only be applied to small networks on small datasets.
In contrast, this work can take an arbitrary defense (that can be applied to networks of any size), and formally prove properties about it on a small dataset. If a defense is not effective on a large dataset, it is also likely to be ineffective on the small dataset we study, and we will therefore be able to show it is not effective.
Put differently, our work shares the same key limitation as certified defenses: when scaling to larger datasets, we are no longer able to offer provable guarantees. However, because the defenses we study scale to larger datasets, even though our proofs of robustness do not, it is still possible to apply these defenses with increased confidence in their security.
3 Model Setup
The problem of neural network verification that we consider here is an NP-complete problem (Katz et al., 2017b), and despite recent progress only networks with a few hundred nodes can be soundly verified. Thus, in order to evaluate our approach we trained a small network over the MNIST data set. This network is a fully-connected, 3-layer network that achieves reasonable accuracy despite having only 20k weights and consisting of fewer than 100 hidden neurons (24 in each layer). As verification of neural networks becomes more scalable in the future, our approach could become applicable to larger networks and additional data sets.
For verification, we use the proof-of-concept implementation of Reluplex available online (Katz et al., 2017c). The only non-linear operator that this implementation was originally designed to support is the ReLU function, but we observe that it can also support max operators using the following encoding:
$\max(x, y) = \text{ReLU}(x - y) + y.$
This fact allows the encoding of max operators using ReLUs, and consequently the encoding of max-pooling layers into Reluplex (although we did not experiment with such layers in this paper). It also allows us to extend the results from (Katz et al., 2017b) and measure distances with the $L_1$ norm as well as the $L_\infty$ norm, by encoding absolute values using ReLUs:
$|x| = \max(x, -x) = \text{ReLU}(2x) - x.$
Because the $L_1$ distance between two points is defined as a sum of absolute values, this encoding allowed us to encode $L_1$ distances into Reluplex without modifying its code. We point out, however, that an increase in the number of ReLU constraints in the input adversely affects Reluplex's performance. For example, in the case of the MNIST dataset, encoding the $L_1$ distance entails adding a ReLU constraint for each of the 784 input coordinates. It is thus not surprising that experiments using $L_1$ typically took longer to finish than those using $L_\infty$.
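These identities can be checked numerically; the following sketch mirrors the encodings above (using a plain numpy ReLU in place of Reluplex constraints):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def max_via_relu(x, y):
    # max(x, y) = ReLU(x - y) + y, using a single ReLU
    return relu(x - y) + y

def abs_via_relu(x):
    # |x| = max(x, -x) = ReLU(2x) - x, again a single ReLU
    return relu(2 * x) - x

def l1_distance_via_relu(x, x0):
    # L1 distance = sum of per-coordinate absolute values,
    # i.e. one extra ReLU constraint per input coordinate
    return abs_via_relu(x - x0).sum()
```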
4 Evaluation
Each individual experiment that we conducted included a network $F$, a distance metric $d$, an input point $x$, a target label $t'$, and an initial adversarial input $x'$ for which $F(x') = t'$. The goal of the experiment was then to find a minimally distorted example $\hat{x}$, such that $F(\hat{x}) = t'$ and $d(x, \hat{x})$ is minimal. As explained in Section 2, this is performed by iteratively invoking Reluplex and performing a binary search.
Intuitively, $\delta_u$ indicates the distance to the closest adversarial input currently known, and the provably minimally distorted input is known to lie in the range between $\delta_l$ and $\delta_u$. Thus, $\delta_u$ is initialized using the distance of the initial adversarial input provided, and $\delta_l$ is initialized to 0. The search procedure iteratively shrinks the range $[\delta_l, \delta_u]$ until it is below a certain threshold. It then returns $\delta_u$ as the distance to the provably minimally distorted adversarial example, and this is guaranteed to be accurate up to the specified precision. The provably minimally distorted input itself is also returned.
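The search procedure can be sketched as follows, with `is_robust_within` standing in (hypothetically) for a Reluplex query that answers whether no adversarial example exists within a given distance:

```python
def minimal_distortion(is_robust_within, delta_init, precision=1e-3):
    """Binary search for the minimal adversarial distortion.

    is_robust_within(delta): True iff no adversarial example exists
    within distance delta of the input (a stand-in for Reluplex).
    delta_init: distance of an initial adversarial example, so the
    answer lies in [0, delta_init].
    """
    delta_l, delta_u = 0.0, delta_init
    while delta_u - delta_l > precision:
        mid = (delta_l + delta_u) / 2
        if is_robust_within(mid):
            delta_l = mid   # no adversarial example up to mid
        else:
            delta_u = mid   # found one; shrink the upper bound
    return delta_u
```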
For the initial $x'$ in our experiments we used an adversarial input found using the CW attack. We note that Reluplex invocations are computationally expensive, and so it is better to start with $\delta_u$ as close as possible to the optimum, in order to reduce the number of required iterations until $\delta_u - \delta_l$ is sufficiently small. For the same reason, experiments using the $L_1$ distance metric were slower than those using $L_\infty$: the initial $L_1$ distances were typically much larger, which required additional iterations.
For evaluation purposes we arbitrarily selected 10 source images with known labels from the MNIST test set. We considered two neural networks: the one described in Section 3, denoted $F$, and a version of $F$ that has been trained with adversarial training as described in (Madry et al., 2017), denoted $F_{\text{adv}}$. We also considered two distance metrics, $L_1$ and $L_\infty$. For every combination of neural network, distance metric, and labeled source image, we considered each of the 9 other possible labels as the target. For each of these we used the CW attack to produce an initial targeted adversarial example, and then used Reluplex to search for a provably minimally distorted example. The results are given in Table 1.
Each major row of the table corresponds to a specific neural network and distance metric (as indicated in the first column), and describes 90 individual experiments (10 inputs times 9 target labels each). The first sub-row within each row considers just those experiments for which Reluplex terminated successfully, whereas the second sub-row considers all 90 examples, including those where Reluplex timed out. Whenever a timeout occurred, we considered the last (smallest) $\delta_u$ that was discovered by the search before it timed out as the algorithm's output. The other columns of the table indicate the average distance to the adversarial examples found by the CW attack, the average distance to the minimally distorted adversarial examples found by our technique, and the average improvement rate of our technique over the CW attack.
Below we analyze the results in order to draw conclusions regarding the CW attack and the defense of (Madry et al., 2017). While these results naturally hold only for the networks we used and the inputs we tested, we believe they provide some intuition as to how well the tested attack and defense techniques perform. We intend to make our data publicly available, and we encourage others to (i) evaluate new attack techniques on the minimally distorted examples we have already discovered, as well as on additional ones; and (ii) use this approach for evaluating new defensive techniques.
4.1 Evaluating Attacks
Iterative attacks produce near-optimal adversarial examples. As is shown by Table 1, the adversarial examples produced by the CW attack are on average close to the minimally distorted examples under both the $L_\infty$ and $L_1$ norms (we consider here just the terminated experiments, and ignore the $\langle F, L_1 \rangle$ category, where too few experiments terminated to draw a meaningful conclusion). In particular, iterative attacks perform substantially better than single-step methods, such as the fast gradient method. This is an expected result: the fast gradient method was designed to demonstrate the linearity of neural networks, not to produce high-quality adversarial examples.
This result supports the hypothesis of Madry et al. (2017), who argue that first-order attacks (i.e., gradient-based methods) are "universal". It therefore also justifies using first-order methods as the basis of adversarial training, at least on the datasets we consider.
There is still room for improving iterative attacks. Even on this very small and simple neural network, we observed that in many instances the minimally distorted adversarial example has a noticeably lower distortion than the best iterative adversarial example. The cause for this is simple: gradient descent only finds a local minimum, not a global minimum.
We have found that if we take a small step from the original image in the direction of the minimally-distorted adversarial example, then gradient descent will converge to the minimally-distorted adversarial example. Taking random steps and then performing gradient descent does not help significantly.
Suboptimal results are correlated. We have found that when the iterative attack performs suboptimally compared to the minimally distorted example for one target label, it will often perform poorly for many other target labels as well. These instances are not always of larger absolute distortion, but a larger relative gap on one instance often indicates that the relative gap will be larger for other targets. For instance, on the adversarially trained network, the minimally distorted adversarial examples for one source digit were consistently better than the iterative attack results.
When we examined the most extreme cases in which this phenomenon was observed, we found that, similarly to the case described above, the large gap was caused by gradient descent initially leading away from the minimally-distorted example for most targets, resulting in the discovery of an inferior, local minimum.
4.2 Evaluating Defenses
For the purpose of evaluating the defensive technique of (Madry et al., 2017), we compared the standard and adversarially trained networks under the $L_\infty$ metric (the $L_1$ experiments were disregarded because of the small number of experiments that terminated for the standard network under $L_1$). Specifically, we compared the two sets of experiments on the subset of instances that terminated for both. The results appear in Table 2.
The defense of Madry et al. (2017) is effective. Our evaluation suggests that adversarial retraining is indeed effective: it increases the distance to the minimally distorted adversarial examples by a factor of roughly 4.2 on average (from an average of 0.039 to an average of 0.165) on our small network.
Another interesting observation is that while adversarial retraining improves the overall situation, we found several points on which it actually made things worse, i.e., the distances to the minimally distorted adversarial examples for the hardened network were smaller than those for the original network. This behavior was observed for 7 out of the 35 aforementioned experiments, with the average degradation being 12.8%. This seems to highlight the necessity of evaluating the effectiveness of a defensive technique, and the robustness of a network in general, over a large dataset of points. The question of how to pick a "good" set of points that would adequately represent the behavior of the network remains open.
Training on iterative attacks does not overfit. Overfitting is an issue that is often encountered when performing adversarial training: a defense may overfit to the type of attack used during training. When this occurs, the hardened network will have high accuracy against the one attack used during training, but low accuracy against other attacks. We have found no evidence of overfitting when performing the adversarial training of (Madry et al., 2017): the minimally distorted adversarial examples improve on the CW attack by similar margins on both the hardened and standard networks.
It is easier to formally analyze Madry et al. (2017). For both the $L_1$ and $L_\infty$ distance metrics, it seems significantly easier to analyze the robustness of the adversarially trained network: when using $L_\infty$, Reluplex terminated on 81 of the 90 instances on the adversarially trained network, versus 38 on the standard network; and for $L_1$, the termination rate was 64 for the hardened network compared to just 6 on the standard network. We are still looking into the reason for this behavior. Naively, one might assume that because the initial adversarial examples provided to Reluplex have larger distances for the hardened network, these experiments would take longer to converge; we observed, however, the opposite behavior.
One possible explanation could be that the adversarially trained network makes less use of the nonlinear ReLU units, and is therefore more amenable to analysis with Reluplex. We empirically verify that this is not the case. For a given instance, we track, for each ReLU unit in the network, whether it is in the saturated zero region, or the linear region. We then compute the nonlinearity of the network as the number of units that change from the saturated region to the linear region, or vice versa, when going from the given input to the discovered adversarial example. We find that there is no statistically significant difference between the nonlinearity of the two networks.
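The nonlinearity measure described above can be sketched as follows for a fully-connected ReLU network, with weights and biases given as plain arrays (an illustration of the measure, not our measurement code):

```python
import numpy as np

def relu_pattern(weights, biases, x):
    """Activation pattern: for each ReLU unit, True if it is in the
    linear (positive) region, False if saturated at zero."""
    pattern = []
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b
        pattern.append(z > 0)
        a = np.maximum(z, 0.0)
    return np.concatenate(pattern)

def nonlinearity(weights, biases, x, x_adv):
    """Number of ReLU units that switch region between x and x_adv."""
    return int((relu_pattern(weights, biases, x)
                != relu_pattern(weights, biases, x_adv)).sum())
```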
5 Conclusion
Neural networks hold great potential to be used in safety-critical systems, but their susceptibility to adversarial examples poses a significant hindrance. While defenses can be argued secure against existing attacks, it is difficult to assess their vulnerability to future attacks. The burgeoning field of neural network verification can mitigate this problem by allowing us to obtain an absolute measurement of the usefulness of a defense, regardless of the attacks that may be used against it.
In this paper, we introduce provably minimally distorted adversarial examples and show how to construct them with formal verification approaches. We evaluate one recent attack (Carlini and Wagner, 2017), finding that it often produces adversarial examples whose distance is close to optimal, and one defense (Madry et al., 2017), finding that it substantially increases the average distortion to the nearest adversarial example on the MNIST dataset for our tested networks. To the best of our knowledge, this is the first proof of a robustness increase for a defense that was not designed to be proven secure.
Currently available verification tools afford limited scalability, which means experiments can only be conducted on small networks. However, as better verification techniques are developed, this limitation is expected to be lifted. Orthogonally, when preparing to use a neural network in a safety-critical setting, users may choose to design their networks so as to make them particularly amenable to verification techniques — e.g., by using specific activation functions or network topologies — so that strong guarantees about their correctness and robustness may be obtained.
References
- Athalye et al.  A. Athalye, N. Carlini, and D. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
- Bojarski et al.  M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba. End to end learning for self-driving cars, 2016. Technical Report. http://arxiv.org/abs/1604.07316.
- Carlini and Wagner  N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. IEEE Symposium on Security and Privacy, 2017.
- Ehlers  R. Ehlers. Formal verification of piece-wise linear feed-forward neural networks. In Proc. 15th Int. Symp. on Automated Technology for Verification and Analysis (ATVA), 2017.
- Goodfellow et al.  I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
- Hendrik Metzen et al.  J. Hendrik Metzen, T. Genewein, V. Fischer, and B. Bischoff. On detecting adversarial perturbations. In International Conference on Learning Representations, 2017. arXiv preprint arXiv:1702.04267.
- Hendrycks and Gimpel  D. Hendrycks and K. Gimpel. Early methods for detecting adversarial images. In International Conference on Learning Representations (Workshop Track), 2017.
- Huang et al.  R. Huang, B. Xu, D. Schuurmans, and C. Szepesvári. Learning with a strong adversary. CoRR, abs/1511.03034, 2015.
- Huang et al.  X. Huang, M. Kwiatkowska, S. Wang, and M. Wu. Safety verification of deep neural networks, 2016. Technical Report. http://arxiv.org/abs/1610.06940.
- Julian et al.  K. Julian, J. Lopez, J. Brush, M. Owen, and M. Kochenderfer. Policy compression for aircraft collision avoidance systems. In Proc. 35th Digital Avionics Systems Conf. (DASC), pages 1–10, 2016.
- Katz et al. [2017a] G. Katz, C. Barrett, D. Dill, K. Julian, and M. Kochenderfer. Towards Proving the Adversarial Robustness of Deep Neural Networks. In Proc. 1st. Workshop on Formal Verification of Autonomous Vehicles (FVAV), pages 19–26, 2017a.
- Katz et al. [2017b] G. Katz, C. Barrett, D. Dill, K. Julian, and M. Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. In Proc. 29th Int. Conf. on Computer Aided Verification (CAV), pages 97–117, 2017b.
- Katz et al. [2017c] G. Katz, C. Barrett, D. Dill, K. Julian, and M. Kochenderfer. Reluplex, 2017c. https://github.com/guykatzz/ReluplexCav2017.
- Kurakin et al.  A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world. In International Conference on Learning Representations (Workshop Track), 2016.
- Madry et al.  A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
- Moosavi-Dezfooli et al.  S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. Deepfool: a simple and accurate method to fool deep neural networks. arXiv preprint arXiv:1511.04599, 2015.
- Pulina and Tacchella  L. Pulina and A. Tacchella. An abstraction-refinement approach to verification of artificial neural networks. In Proc. 22nd Int. Conf. on Computer Aided Verification (CAV), pages 243–257, 2010.
- Pulina and Tacchella  L. Pulina and A. Tacchella. Challenging SMT solvers to verify neural networks. AI Communications, 25(2):117–135, 2012.
- Rozsa et al.  A. Rozsa, E. M. Rudd, and T. E. Boult. Adversarial diversity and hard positive generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 25–32, 2016.
- Szegedy et al.  C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
- Tramèr et al.  F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel. Ensemble adversarial training: Attacks and defenses. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rkZvSe-RZ. accepted as poster.
- Xiao et al.  C. Xiao, J.-Y. Zhu, B. Li, W. He, M. Liu, and D. Song. Spatially transformed adversarial examples. International Conference on Learning Representations, 2018.
- Zheng et al.  S. Zheng, Y. Song, T. Leung, and I. Goodfellow. Improving the robustness of deep neural networks via stability training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4480–4488, 2016.
- Zhao et al.  Z. Zhao, D. Dua, and S. Singh. Generating natural adversarial examples. International Conference on Learning Representations, 2018.