Recent work sze14 shows that it is often possible to construct an input mislabeled by a neural net by perturbing a correctly labeled input by a tiny amount in a carefully chosen direction. Lack of robustness can be problematic in a variety of settings, such as changing camera lens or lighting conditions, successive frames in a video, or adversarial attacks in security-critical applications papernot2016practical .
A number of approaches have since been proposed to improve robustness gu14 ; good15 ; cha15 ; hua15 ; shaham2015understanding . However, work in this direction has been handicapped by the lack of objective measures of robustness. A typical approach to improving the robustness of a neural net $f$ is to use an algorithm $A$ to find adversarial examples, augment the training set with these examples, and train a new neural net $f'$ good15 . Robustness is then evaluated by using the same algorithm $A$ to find adversarial examples for $f'$: if $A$ discovers fewer adversarial examples for $f'$ than for $f$, then $f'$ is concluded to be more robust than $f$. However, $f'$ may have overfit to adversarial examples generated by $A$; in particular, a different algorithm $A'$ may find as many adversarial examples for $f'$ as for $f$. Having an objective robustness measure is vital not only to reliably compare different algorithms, but also to understand the robustness of production neural nets. For example, when deploying a login system based on face recognition, a security team may need to evaluate the risk of an attack using adversarial examples.
In this paper, we study the problem of measuring robustness. We propose to use two statistics of the robustness $\rho(f, x_*)$ of $f$ at point $x_*$ (i.e., the distance from $x_*$ to the nearest adversarial example) sze14 . The first one measures the frequency with which adversarial examples occur; the other measures the severity of such adversarial examples. Both statistics depend on a parameter $\epsilon$, which intuitively specifies the threshold below which adversarial examples should not exist (i.e., points $x$ with distance to $x_*$ less than $\epsilon$ should be assigned the same label as $x_*$).
The key challenge is efficiently computing $\rho(f, x_*)$. We give an exact formulation of this problem as an intractable optimization problem. To recover tractability, we approximate this optimization problem by constraining the search to a convex region around $x_*$. Furthermore, we devise an iterative approach to solving the resulting linear program that produces an order of magnitude speed-up. Common neural nets (specifically, those using rectified linear units as activation functions) are in fact piecewise linear functions MontufarPCB14 ; we choose the convex region to be the region around $x_*$ on which $f$ is linear. Since the linear nature of neural nets is often the cause of adversarial examples good15 , this choice of region focuses the search where adversarial examples are most likely to exist.
We evaluate our approach on a deep convolutional neural network $f$ for MNIST. We estimate $\rho(f, x_*)$ using both our algorithm $A_{\text{LP}}$ and (as a baseline) the algorithm $A_{\text{L-BFGS}}$ introduced by sze14 . We show that $A_{\text{LP}}$ produces a substantially more accurate estimate of $\rho(f, x_*)$ than $A_{\text{L-BFGS}}$. We then use data augmentation with each algorithm to improve the robustness of $f$, resulting in fine-tuned neural nets $f_{\text{LP}}$ and $f_{\text{L-BFGS}}$. According to $A_{\text{L-BFGS}}$, $f_{\text{L-BFGS}}$ is more robust than $f$, but not according to $A_{\text{LP}}$. In other words, $f_{\text{L-BFGS}}$ overfits to adversarial examples computed using $A_{\text{L-BFGS}}$. In contrast, $f_{\text{LP}}$ is more robust according to both $A_{\text{L-BFGS}}$ and $A_{\text{LP}}$. Furthermore, to demonstrate scalability, we apply our approach to evaluate the robustness of the 23-layer network-in-network (NiN) neural net nin for CIFAR-10, and reveal a surprising lack of robustness. We fine-tune NiN and show that robustness improves, albeit only by a small amount. In summary, our contributions are:
We demonstrate experimentally that our algorithm produces substantially more accurate measures of robustness compared to algorithms based on previous work, and show evidence that neural nets fine-tuned to improve robustness (§5) can overfit to adversarial examples identified by a specific algorithm (§6).
1.1 Related work
The susceptibility of neural nets to adversarial examples was discovered by sze14 . Given a test point $x_*$ with predicted label $\ell_*$, an adversarial example is an input $x_* + r$ with predicted label $\ell \ne \ell_*$ where the adversarial perturbation $r$ is small (in $L_\infty$ norm). Then, sze14 devises an approximate algorithm for finding the smallest possible adversarial perturbation $r$. Their approach is to minimize the combined objective $\operatorname{loss}(f(x_* + r), \ell) + c \|r\|_\infty$, which is an instance of box-constrained convex optimization that can be solved using L-BFGS-B. The constant $c$ is optimized using line search.
Our formalization of the robustness $\rho(f, x_*)$ of $f$ at $x_*$ corresponds to the notion in sze14 of finding the minimal $\|r\|_\infty$. We propose an exact algorithm for computing $\rho(f, x_*)$ as well as a tractable approximation. The algorithm in sze14 can also be used to approximate $\rho(f, x_*)$; we show experimentally that our algorithm is substantially more accurate than sze14 .
There has been a range of subsequent work studying robustness: nguyen2015deep devises an algorithm for finding purely synthetic adversarial examples (i.e., no initial image $x_*$); taba15 searches for adversarial examples using random perturbations, showing that adversarial examples in fact exist in large regions of the pixel space; sabour2015adversarial shows that even intermediate layers of neural nets are not robust to adversarial noise; and feng2016ensemble seeks to explain why neural nets may generalize well despite poor robustness properties.
Starting with good15 , a major focus has been on devising faster algorithms for finding adversarial examples. Their idea is that adversarial examples can then be computed on-the-fly and used as training examples, analogous to data augmentation approaches typically used to train neural nets kri12 . To find adversarial examples quickly, good15 chooses the adversarial perturbation $r$ to be in the direction of the signed gradient of the loss function with fixed magnitude. Intuitively, given only the gradient of the loss function, this choice of $r$ is most likely to produce an adversarial example with $\|r\|_\infty \le \epsilon$. In this direction, moosavi2016deepfool improves upon good15 by taking multiple gradient steps, hua15 extends this idea to norms beyond the $L_\infty$ norm, gu14 takes the approach of sze14 but fixes $c$, and shaham2015understanding formalizes good15 as robust optimization.
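As a rough illustration of the signed-gradient step described above, the sketch below uses a logistic-regression classifier as a stand-in, chosen only because its input gradient has a closed form. The names `w`, `b`, `eps`, and `signed_gradient_perturbation` are hypothetical and are not taken from good15 .

```python
import math

def signed_gradient_perturbation(w, b, x, y, eps):
    """Return r = eps * sign(d loss / d x) for the logistic loss, with label y in {0, 1}."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))        # predicted probability of label 1
    grad = [(p - y) * wi for wi in w]     # gradient of the cross-entropy loss w.r.t. x
    sign = lambda g: (g > 0) - (g < 0)
    return [eps * sign(g) for g in grad]

# Illustrative (made-up) model and input.
w, b = [2.0, -1.0, 0.0], 0.5
x, y = [1.0, 1.0, 1.0], 1
r = signed_gradient_perturbation(w, b, x, y, eps=0.1)
x_adv = [xi + ri for xi, ri in zip(x, r)]
```

Every nonzero coordinate of `r` has magnitude `eps`, so the perturbation sits on the boundary of the $L_\infty$ ball of radius `eps`; whether `x_adv` is actually mislabeled still has to be checked by evaluating the classifier.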
A key shortcoming of these lines of work is that robustness is typically measured using the same algorithm used to find adversarial examples, in which case the resulting neural net may have overfit to adversarial examples generated using that algorithm. For example, good15 shows improved accuracy on adversarial examples generated using their own signed gradient method, but does not consider whether robustness increases for adversarial examples generated using more precise approaches such as sze14 . Similarly, hua15 compares accuracy on adversarial examples generated using both itself and good15 (but not sze14 ), and shaham2015understanding only considers accuracy on adversarial examples generated using their own approach on the baseline network. The aim of our paper is to provide metrics for evaluating robustness, and to demonstrate the importance of using such impartial measures to compare robustness.
Additionally, there has been work on designing neural network architectures gu14 and learning procedures cha15 that improve robustness to adversarial perturbations, though they do not obtain state-of-the-art accuracy on the unperturbed test sets. There has also been work using smoothness regularization related to good15 to train neural nets, focusing on improving accuracy rather than robustness miyato2015distributional .
Fawzi et al. establish theoretical lower bounds on the robustness of linear and quadratic classifiers, and globerson2006nightmare seeks to improve robustness by promoting resilience to deleting features during training. More broadly, robustness has been identified as a desirable property of classifiers beyond prediction accuracy. Traditional metrics such as (out-of-sample) accuracy, precision, and recall help users assess prediction accuracy of trained models; our work aims to develop analogous metrics for assessing robustness.
2 Robustness Metrics
Consider a classifier $f : \mathcal{X} \to \mathcal{L}$, where $\mathcal{X}$ is the input space and $\mathcal{L}$ are the labels. We assume that training and test points $x \in \mathcal{X}$ have distribution $\mathcal{D}$. We first formalize the notion of robustness at a point, and then describe two statistics to measure robustness. Our two statistics depend on a parameter $\epsilon$, which captures the idea that we only care about robustness below a certain threshold: we disregard adversarial examples $x$ whose distance to $x_*$ is greater than $\epsilon$. In our experiments on MNIST and CIFAR-10, $\epsilon$ is measured on the pixel scale 0-255.
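Pointwise robustness. Intuitively, $f$ is robust at $x_* \in \mathcal{X}$ if a “small” perturbation to $x_*$ does not affect the assigned label. We are interested in perturbations sufficiently small that they do not affect human classification; an established condition is $\|x - x_*\|_\infty \le \epsilon$ for some parameter $\epsilon$. Formally, we say $f$ is $(x_*, \epsilon)$-robust if for every $x$ such that $\|x - x_*\|_\infty \le \epsilon$, $f(x) = f(x_*)$. Finally, the pointwise robustness $\rho(f, x_*)$ of $f$ at $x_*$ is the minimum $\epsilon$ for which $f$ fails to be $(x_*, \epsilon)$-robust:
\[ \rho(f, x_*) = \inf \{ \epsilon \ge 0 \mid f \text{ is not } (x_*, \epsilon)\text{-robust} \}. \tag{1} \]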
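Adversarial frequency. Given a parameter $\epsilon$, the adversarial frequency
\[ \phi(f, \epsilon) = \Pr_{x_* \sim \mathcal{D}} \left[ \rho(f, x_*) \le \epsilon \right] \]
measures how often $f$ fails to be $(x_*, \epsilon)$-robust. In other words, if $f$ has high adversarial frequency, then it fails to be $(x_*, \epsilon)$-robust for many inputs $x_*$.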
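Adversarial severity. Given a parameter $\epsilon$, the adversarial severity
\[ \mu(f, \epsilon) = \mathbb{E}_{x_* \sim \mathcal{D}} \left[ \rho(f, x_*) \mid \rho(f, x_*) \le \epsilon \right] \]
measures the severity with which $f$ fails to be robust at $x_*$, conditioned on $f$ not being $(x_*, \epsilon)$-robust. We condition on pointwise robustness because once $f$ is $(x_*, \epsilon)$-robust at $x_*$, the degree to which $f$ is robust at $x_*$ does not matter. Smaller $\mu(f, \epsilon)$ corresponds to worse adversarial severity, since $f$ is more susceptible to adversarial examples if the distances to the nearest adversarial example are small.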
The frequency and severity capture different robustness behaviors. A neural net may have high adversarial frequency but low adversarial severity, indicating that most adversarial examples are about distance $\epsilon$ away from the original point $x_*$. Conversely, a neural net may have low adversarial frequency but high adversarial severity, indicating that it is typically robust, but occasionally severely fails to be robust. Frequency is typically the more important metric, since a neural net with low adversarial frequency is robust most of the time. Indeed, adversarial frequency corresponds to the accuracy on adversarial examples used to measure robustness in good15 ; shaham2015understanding . Severity can be used to differentiate between neural nets with similar adversarial frequency.
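Given a set of samples $X \subseteq \mathcal{X}$ drawn i.i.d. from $\mathcal{D}$, we can estimate $\phi(f, \epsilon)$ and $\mu(f, \epsilon)$ using the following standard estimators, assuming we can compute $\rho$:
\[ \hat\phi(f, \epsilon, X) = \frac{|\{ x_* \in X \mid \rho(f, x_*) \le \epsilon \}|}{|X|}, \qquad \hat\mu(f, \epsilon, X) = \frac{\sum_{x_* \in X} \rho(f, x_*) \, \mathbb{1}[\rho(f, x_*) \le \epsilon]}{|\{ x_* \in X \mid \rho(f, x_*) \le \epsilon \}|}. \]
An approximation $\hat\rho(f, x_*) \approx \rho(f, x_*)$ of $\rho$, such as the one we describe in Section 4, can be used in place of $\rho$. In practice, $X$ is taken to be the test set.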
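The two estimators can be sketched in a few lines. The code below assumes the pointwise robustness values (or their approximations) have already been computed and are passed in as a plain list; the function names and the sample values are illustrative only.

```python
def adversarial_frequency(rho_values, eps):
    """Fraction of points whose (estimated) pointwise robustness is at most eps."""
    hits = [r for r in rho_values if r <= eps]
    return len(hits) / len(rho_values)

def adversarial_severity(rho_values, eps):
    """Mean robustness over the points that are not eps-robust (None if there are none)."""
    hits = [r for r in rho_values if r <= eps]
    return sum(hits) / len(hits) if hits else None

# Made-up robustness values for five test points, on the pixel scale.
rho_hat = [3.0, 25.0, 12.0, 40.0, 9.0]
eps = 20.0
freq = adversarial_frequency(rho_hat, eps)  # three of the five points fall below eps
sev = adversarial_severity(rho_hat, eps)    # average of 3.0, 12.0 and 9.0
```

Returning `None` when no point falls below the threshold is a design choice of this sketch: the conditional expectation is undefined in that case.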
3 Computing Pointwise Robustness
Consider the training points in Figure 1 (a), colored based on the ground truth label. To classify this data, we train a two-layer neural net $f$ in which the ReLU function is applied pointwise. Figure 1 (a) includes contours of the per-point loss function of this neural net.
Exhaustively searching the input space to determine the distance $\rho(f, x_*)$ to the nearest adversarial example for input $x_*$ (labeled $\ell_*$) is intractable. Recall that neural nets with rectified-linear (ReLU) units as activations are piecewise linear MontufarPCB14 . Since adversarial examples exist because of this linearity in the neural net good15 , we restrict our search to the region around $x_*$ on which the neural net is linear. This region around $x_*$ is defined by the activation of the ReLU function: for each hidden unit $i$, if the input to unit $i$ at $x_*$ is nonnegative (resp., nonpositive), we constrain $x$ to the half-space on which that input stays nonnegative (resp., nonpositive). The intersection of these half-spaces is convex, so it admits efficient search. Figure 1 (b) shows one such convex region. [Footnote 1: Our neural net has 8 hidden units, but for this $x_*$, 6 of the half-spaces entirely contain the convex region.]
Additionally, $x$ is labeled $\ell$ exactly when the score that $f$ assigns to $\ell$ at $x$ is at least the score it assigns to $\ell'$, for each $\ell' \ne \ell$. These constraints are linear since $f$ is linear on this region. Therefore, we can find the distance to the nearest input with label $\ell \ne \ell_*$ by minimizing $\|x - x_*\|_\infty$ on the region. Finally, we can perform this search for each label $\ell \ne \ell_*$, though for efficiency we take $\ell$ to be the label assigned the second-highest score by $f$. Figure 1 (b) shows the adversarial example found by our algorithm in our running example. Note that in Figure 1 the direction of the nearest adversarial example is not necessarily aligned with the signed gradient of the loss function, as observed by others hua15 .
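The claim that the net is linear on the region fixed by the ReLU activations can be checked numerically. The sketch below assumes a two-layer net with scores $W_2 \max(W_1 x + b_1, 0) + b_2$; the bias terms, the weights, and all function names are assumptions made for illustration and are not the net from Figure 1.

```python
def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

def scores(W1, b1, W2, b2, x):
    """Label scores of the two-layer ReLU net at input x."""
    hidden = [max(v + b, 0.0) for v, b in zip(matvec(W1, x), b1)]
    return [v + b for v, b in zip(matvec(W2, hidden), b2)]

def activation_pattern(W1, b1, x):
    """True for each hidden unit whose input is nonnegative at x."""
    return [v + b >= 0 for v, b in zip(matvec(W1, x), b1)]

def local_affine_map(W1, b1, W2, b2, pattern):
    """Return (A, c) such that scores(x) == A x + c for every x with this pattern."""
    mask = [1.0 if on else 0.0 for on in pattern]
    units, n_in = range(len(W1)), range(len(W1[0]))
    A = [[sum(W2[k][i] * mask[i] * W1[i][j] for i in units) for j in n_in]
         for k in range(len(W2))]
    c = [sum(W2[k][i] * mask[i] * b1[i] for i in units) + b2[k]
         for k in range(len(W2))]
    return A, c

# Made-up weights: 2 inputs, 3 hidden units, 2 labels.
W1, b1 = [[1.0, -1.0], [0.5, 1.0], [-1.0, 0.0]], [0.0, -1.0, 0.5]
W2, b2 = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]], [0.0, 0.25]

x_star = [2.0, 1.0]
pattern = activation_pattern(W1, b1, x_star)
A, c = local_affine_map(W1, b1, W2, b2, pattern)

x_near = [2.25, 0.75]  # a nearby point with the same activation pattern
affine_scores = [v + ci for v, ci in zip(matvec(A, x_near), c)]
```

Each entry of `pattern` corresponds to one half-space constraint on $x$; inside the intersection of those half-spaces, `affine_scores` agrees with `scores(W1, b1, W2, b2, x_near)`.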
3.2 Formulation as Optimization
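We compute $\rho(f, x_*)$ by expressing (1) as constraints $C$, which consist of:
- Linear relations; specifically, inequalities $(w^\top x + b \ge 0)$ and equalities $(w^\top x + b = 0)$, where $x \in \mathbb{R}^m$ (for some $m$) are variables and $w \in \mathbb{R}^m$, $b \in \mathbb{R}$ are constants.
- Conjunctions $C_1 \wedge C_2$, where $C_1$ and $C_2$ are themselves constraints. Both constraints must be satisfied for the conjunction to be satisfied.
- Disjunctions $C_1 \vee C_2$, where $C_1$ and $C_2$ are themselves constraints. At least one of the constraints must be satisfied for the disjunction to be satisfied.
The feasible set $\mathcal{F}(C)$ of $C$ is the set of $x \in \mathbb{R}^m$ that satisfy $C$; $C$ is satisfiable if $\mathcal{F}(C)$ is nonempty.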
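One way to picture this constraint language is as a small tree data structure with a membership test for the feasible set. The sketch below is an illustration under assumed names (`Linear`, `And`, `Or`, `satisfies`); it only checks whether a given point satisfies a constraint and does not search for feasible points.

```python
from dataclasses import dataclass
from typing import Sequence, Union

@dataclass(frozen=True)
class Linear:
    """w . x + b >= 0, or w . x + b == 0 when is_equality is True."""
    w: Sequence[float]
    b: float
    is_equality: bool = False

@dataclass(frozen=True)
class And:
    left: "Constraint"
    right: "Constraint"

@dataclass(frozen=True)
class Or:
    left: "Constraint"
    right: "Constraint"

Constraint = Union[Linear, And, Or]

def satisfies(x, c, tol=1e-9):
    """Return True if the point x lies in the feasible set of constraint c."""
    if isinstance(c, Linear):
        value = sum(wi * xi for wi, xi in zip(c.w, x)) + c.b
        return abs(value) <= tol if c.is_equality else value >= -tol
    if isinstance(c, And):
        return satisfies(x, c.left, tol) and satisfies(x, c.right, tol)
    return satisfies(x, c.left, tol) or satisfies(x, c.right, tol)

# Example over variables (u, v): the relation v = max(u, 0), written as
# (u <= 0 and v = 0) or (u >= 0 and v = u).
relu = Or(
    And(Linear([-1.0, 0.0], 0.0), Linear([0.0, 1.0], 0.0, is_equality=True)),
    And(Linear([1.0, 0.0], 0.0), Linear([-1.0, 1.0], 0.0, is_equality=True)),
)
```

Here `satisfies([2.0, 2.0], relu)` and `satisfies([-3.0, 0.0], relu)` hold, while `satisfies([2.0, 0.0], relu)` does not, since neither disjunct is met.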
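In the next section, we show that the condition $f(x) = \ell$ can be expressed as constraints $C_f(x, \ell)$; i.e., $f(x) = \ell$ if and only if $C_f(x, \ell)$ is satisfiable. Then, $\rho$ can be computed as follows, where $\ell_* = f(x_*)$:
\[ \rho(f, x_*) = \min_{\ell \ne \ell_*} \rho(f, x_*, \ell), \tag{2} \]
\[ \rho(f, x_*, \ell) = \inf \{ \epsilon \ge 0 \mid C_f(x, \ell) \wedge \|x - x_*\|_\infty \le \epsilon \text{ is satisfiable} \}. \tag{3} \]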
The optimization problem is typically intractable; we describe a tractable approximation in §4.
3.3 Encoding a Neural Network
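We show how to encode the constraint $f(x) = \ell$ as constraints $C_f(x, \ell)$ when $f$ is a neural net. We assume $f$ has form $f(x) = \arg\max_{\ell \in \mathcal{L}} \left[ f^{(k)}(f^{(k-1)}(\cdots(f^{(1)}(x))\cdots)) \right]_\ell$, where the $i$-th layer of the network is a function $f^{(i)} : \mathbb{R}^{n_{i-1}} \to \mathbb{R}^{n_i}$, with $n_0$ the input dimension and $n_k = |\mathcal{L}|$. We describe the encoding of fully-connected and ReLU layers; convolutional layers are encoded similarly to fully-connected layers and max-pooling layers are encoded similarly to ReLU layers. We introduce the variables $x^{(0)}, \ldots, x^{(k)}$ into our constraints, with the interpretation that $x^{(i)}$ represents the output vector of layer $i$ of the network; i.e., $x^{(i)} = f^{(i)}(x^{(i-1)})$. The constraint $C_{\text{in}}(x) \equiv (x^{(0)} = x)$ encodes the input layer. For each layer $f^{(i)}$, we encode the computation of $x^{(i)}$ given $x^{(i-1)}$ as a constraint $C_i$.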
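Fully-connected layer. In this case, $x^{(i)} = f^{(i)}(x^{(i-1)}) = W^{(i)} x^{(i-1)} + b^{(i)}$, which we encode using the constraints $C_i \equiv \bigwedge_{j=1}^{n_i} \left( x^{(i)}_j = W^{(i)}_j x^{(i-1)} + b^{(i)}_j \right)$, where $W^{(i)}_j$ is the $j$-th row of $W^{(i)}$.
ReLU layer. In this case, $x^{(i)}_j = \max\{ x^{(i-1)}_j, 0 \}$ (for each $1 \le j \le n_i$), which we encode using the constraints $C_i \equiv \bigwedge_{j=1}^{n_i} C_{ij}$, where $C_{ij} = \left( x^{(i-1)}_j \le 0 \wedge x^{(i)}_j = 0 \right) \vee \left( x^{(i-1)}_j \ge 0 \wedge x^{(i)}_j = x^{(i-1)}_j \right)$.
Finally, the constraints $C_{\text{out}}(\ell) \equiv \bigwedge_{\ell' \ne \ell} \left( x^{(k)}_\ell \ge x^{(k)}_{\ell'} \right)$ ensure that the output label is $\ell$. Together, the constraints $C_f(x, \ell) \equiv C_{\text{in}}(x) \wedge \left( \bigwedge_{i=1}^{k} C_i \right) \wedge C_{\text{out}}(\ell)$ encode the computation of $f$:
For any $x \in \mathcal{X}$ and $\ell \in \mathcal{L}$, we have $f(x) = \ell$ if and only if $C_f(x, \ell)$ is satisfiable.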
4 Approximate Computation of Pointwise Robustness
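The challenge to solving (3) is the non-convexity of the feasible set of $C_f(x, \ell)$. To recover tractability, we approximate (3) by constraining the feasible set to $x \in \hat{\mathcal{X}}$, where $\hat{\mathcal{X}} \subseteq \mathcal{X}$ is carefully chosen so that the constraints $\hat{C}_f(x, \ell) \equiv C_f(x, \ell) \wedge (x \in \hat{\mathcal{X}})$ have convex feasible set. We call $\hat{C}_f(x, \ell)$ the convex restriction of $C_f(x, \ell)$. In some sense, convex restriction is the opposite of convex relaxation. Then, we can approximately compute robustness:
\[ \hat\rho(f, x_*, \ell) = \inf \{ \epsilon \ge 0 \mid \hat{C}_f(x, \ell) \wedge \|x - x_*\|_\infty \le \epsilon \text{ is satisfiable} \}. \tag{4} \]
The objective is optimized over $x \in \hat{\mathcal{X}}$, which approximates the optimum over $x \in \mathcal{X}$.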
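Choice of $\hat{\mathcal{X}}$.
We construct $\hat{\mathcal{X}}$ as the feasible set of constraints $\hat{D}$; i.e., $\hat{\mathcal{X}} = \mathcal{F}(\hat{D})$. We now describe how to construct $\hat{D}$.
Note that $\mathcal{F}(w^\top x + b = 0)$ and $\mathcal{F}(w^\top x + b \ge 0)$ are convex sets. Furthermore, if $\mathcal{F}(C_1)$ and $\mathcal{F}(C_2)$ are convex, then so is the feasible set of their conjunction, $\mathcal{F}(C_1 \wedge C_2) = \mathcal{F}(C_1) \cap \mathcal{F}(C_2)$. However, the feasible set of their disjunction, $\mathcal{F}(C_1 \vee C_2) = \mathcal{F}(C_1) \cup \mathcal{F}(C_2)$, may not be convex; for example, for a scalar $x$, $\mathcal{F}((x \le 0) \vee (x \ge 1)) = (-\infty, 0] \cup [1, \infty)$. The potential non-convexity of disjunctions makes (3) difficult to optimize.
We can eliminate disjunction operations by choosing one of the two disjuncts to hold. For example, note that for $C_1 \vee C_2$, we have both $\mathcal{F}(C_1) \subseteq \mathcal{F}(C_1 \vee C_2)$ and $\mathcal{F}(C_2) \subseteq \mathcal{F}(C_1 \vee C_2)$. In other words, if we replace $C_1 \vee C_2$ with either $C_1$ or $C_2$, the feasible set of the resulting constraints can only become smaller. Taking $\hat{D} \equiv C_1$ (resp., $\hat{D} \equiv C_2$) effectively replaces $C_1 \vee C_2$ with $C_1$ (resp., $C_2$).
To restrict (3), for every disjunction $C_1 \vee C_2$, we systematically choose either $C_1$ or $C_2$ to replace the constraint $C_1 \vee C_2$. In particular, we choose $C_1$ if $x_*$ satisfies $C_1$ (i.e., $x_* \in \mathcal{F}(C_1)$) and choose $C_2$ otherwise. In our constraints, disjunctions are always mutually exclusive, so $x_*$ never simultaneously satisfies both $C_1$ and $C_2$. We then take $\hat{D}$ to be the conjunction of all our choices. The resulting constraints $\hat{C}_f(x, \ell)$ contain only conjunctions of linear relations, so their feasible set is convex. In fact, the problem can be expressed as a linear program (LP) and can be solved using any standard LP solver.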
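To make the LP explicit, the following sketch writes the restricted problem in the standard epigraph form, assuming the distance is measured in the $L_\infty$ norm. The matrices $A$, $B$ and vectors $a$, $b$ are hypothetical names for the equalities and inequalities that remain after every disjunction has been replaced by one disjunct; $z$ collects all constraint variables, and $x$ is the sub-vector of $z$ holding the $n$ coordinates of the network input.

```latex
\begin{aligned}
\min_{\epsilon,\, z} \quad & \epsilon \\
\text{s.t.} \quad & -\epsilon \le x_i - (x_*)_i \le \epsilon, \qquad i = 1, \ldots, n, \\
& A z = a, \qquad B z \ge b .
\end{aligned}
```

The two-sided bounds on each coordinate are what turn the non-smooth objective $\|x - x_*\|_\infty$ into linear constraints: at the optimum, $\epsilon$ equals the largest coordinate-wise deviation.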
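For example, consider a rectified linear layer $f^{(i)}$ (as before, max pooling layers are similar). The original constraint added for unit $j$ of rectified linear layer $f^{(i)}$ is
\[ \left( x^{(i-1)}_j \le 0 \wedge x^{(i)}_j = 0 \right) \vee \left( x^{(i-1)}_j \ge 0 \wedge x^{(i)}_j = x^{(i-1)}_j \right). \]
To restrict this constraint, we evaluate the neural network on the seed input $x_*$ and look at the input to $f^{(i)}$, which equals $x^{(i-1)}_* = f^{(i-1)}(\cdots(f^{(1)}(x_*))\cdots)$. Then, for each $1 \le j \le n_i$, we keep
\[ x^{(i-1)}_j \le 0 \wedge x^{(i)}_j = 0 \quad \text{if } (x^{(i-1)}_*)_j \le 0, \qquad x^{(i-1)}_j \ge 0 \wedge x^{(i)}_j = x^{(i-1)}_j \quad \text{if } (x^{(i-1)}_*)_j > 0. \]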
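The restriction of a ReLU layer can be sketched as a per-unit choice driven by the seed input. In the code below, the function names and the tie-breaking rule (a seed input of exactly zero is treated as inactive) are assumptions of this illustration.

```python
def restrict_relu(seed_inputs):
    """Per unit, pick the disjunct the seed satisfies.

    True  -> keep "input >= 0 and output == input" (active unit).
    False -> keep "input <= 0 and output == 0"     (inactive unit).
    """
    return [z > 0 for z in seed_inputs]

def satisfies_restriction(choice, inputs, outputs, tol=1e-9):
    """Check a candidate (inputs, outputs) pair against the restricted constraints."""
    for active, z, y in zip(choice, inputs, outputs):
        if active:
            ok = z >= -tol and abs(y - z) <= tol
        else:
            ok = z <= tol and abs(y) <= tol
        if not ok:
            return False
    return True

def relu(inputs):
    return [max(z, 0.0) for z in inputs]

# Made-up layer inputs at the seed point.
seed = [1.5, -0.2, 0.0]
choice = restrict_relu(seed)

inside = [0.7, -1.0, -0.1]   # same activation pattern as the seed
outside = [-0.3, -1.0, -0.1]  # unit 0 switches off: a valid ReLU computation, but excluded
```

Both `inside` and `outside` are consistent with the original disjunctive constraints when paired with `relu(...)` of themselves, but only `inside` survives the restriction, which illustrates how the feasible set shrinks to a convex piece.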
Iterative constraint solving.
We implement an optimization for solving LPs by lazily adding constraints as necessary. Given all constraints $\mathcal{C}$, we start off solving the LP with the subset of equality constraints $\hat{\mathcal{C}} \subseteq \mathcal{C}$, which yields a (possibly infeasible) solution $z$. If $z$ is feasible, then $z$ is also an optimal solution to the original LP; otherwise, we add to $\hat{\mathcal{C}}$ the constraints in $\mathcal{C}$ that are not satisfied by $z$ and repeat the process. This process always yields the correct solution, since in the worst case $\hat{\mathcal{C}}$ becomes equal to $\mathcal{C}$. In practice, this optimization is an order of magnitude faster than directly solving the LP with constraints $\mathcal{C}$.
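The lazy loop can be illustrated on a deliberately tiny problem. The sketch below uses a one-variable LP so that the inner solver fits in a few lines; the box bounds `lo` and `hi` play the role of the constraints that are always present, and every name here is hypothetical rather than taken from our implementation.

```python
def solve_1d(c, constraints, lo, hi):
    """Minimize c*x over lo <= x <= hi subject to a*x <= b for each (a, b)."""
    for a, b in constraints:
        if a > 0:
            hi = min(hi, b / a)
        elif a < 0:
            lo = max(lo, b / a)
        elif b < 0:
            return None  # 0 <= b is violated: infeasible
    if lo > hi:
        return None
    return lo if c >= 0 else hi

def solve_lazily(c, all_constraints, lo, hi, tol=1e-9):
    """Solve with a growing subset of constraints, adding only the violated ones."""
    active, rounds = [], 0
    while True:
        rounds += 1
        x = solve_1d(c, active, lo, hi)
        if x is None:
            return None, active, rounds  # a subset is infeasible, so the full LP is too
        violated = [(a, b) for (a, b) in all_constraints
                    if (a, b) not in active and a * x > b + tol]
        if not violated:
            return x, active, rounds     # feasible for all constraints, hence optimal
        active.extend(violated)

# Minimize x subject to x >= 3, x >= 1, x <= 50, x >= -5 (written as a*x <= b).
all_constraints = [(-1.0, -3.0), (-1.0, -1.0), (1.0, 50.0), (-1.0, 5.0)]
x_opt, used, rounds = solve_lazily(1.0, all_constraints, lo=-100.0, hi=100.0)
```

The constraint `x <= 50` is never violated by an intermediate solution, so it is never handed to the solver; on large LPs, many constraints can stay out of the working set in the same way.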
Single target label.
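For simplicity, rather than minimize $\hat\rho(f, x_*, \ell)$ over each $\ell \ne \ell_*$, we fix $\ell$ to be the second most probable label $\tilde\ell$; i.e.,
\[ \hat\rho(f, x_*) = \hat\rho(f, x_*, \tilde\ell), \tag{5} \]
where $\tilde\ell$ is the label assigned the second-highest score by $f$ at $x_*$.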
Approximate robustness statistics.
5 Improving Neural Net Robustness
Finding adversarial examples.
We can use our algorithm for estimating $\hat\rho(f, x_*)$ to compute adversarial examples. Given $x_*$, the value of $x$ computed by the optimization procedure used to solve (5) is an adversarial example for $x_*$ with $\|x - x_*\|_\infty = \hat\rho(f, x_*)$.
We use fine-tuning to reduce a neural net’s susceptibility to adversarial examples. First, we use an algorithm $A$ to compute adversarial examples for each $x_*$ in the training set and add them to the training set. Then, we continue training the network on the augmented training set at a reduced learning rate. We can repeat this process for multiple rounds (denoted $T$); at each round, we only consider $x_*$ in the original training set (rather than the augmented training set).
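The fine-tuning procedure can be summarized as the loop below. It is a schematic only: `find_adversarial` and `train_step` are hypothetical callables standing in for the adversarial-example algorithm and for continued training at a reduced rate, and labeling each adversarial example with the label of its original point is an assumption of this sketch.

```python
def fine_tune(net, train_set, find_adversarial, train_step, rounds):
    """Augment the training set with adversarial examples for `rounds` rounds.

    train_set: list of (x, label) pairs.
    find_adversarial(net, x, label): returns an adversarial input or None.
    train_step(net, examples): returns the net after further training.
    """
    augmented = list(train_set)
    for _ in range(rounds):
        # Only the original training points are used as seeds in every round.
        for x, label in train_set:
            x_adv = find_adversarial(net, x, label)
            if x_adv is not None:
                augmented.append((x_adv, label))
        net = train_step(net, augmented)
    return net, augmented

# Stub components, just to exercise the control flow.
def stub_find_adversarial(net, x, label):
    return x + 0.5 if net < 2 else None   # pretend the net becomes robust after 2 rounds

def stub_train_step(net, examples):
    return net + 1                         # the "net" is only a round counter here

final_net, final_set = fine_tune(0, [(0.0, 0), (1.0, 1)],
                                 stub_find_adversarial, stub_train_step, rounds=3)
```

With the stubs, two adversarial examples are added in each of the first two rounds and none in the third, so the augmented set grows from 2 to 6 examples.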
MNIST images are represented as integers, so we must round the perturbation to obtain an image, which oftentimes results in non-adversarial examples. When fine-tuning, we add a constraint $x^{(k)}_\ell \ge x^{(k)}_{\ell'} + \alpha$ for all $\ell' \ne \ell$, which eliminates this problem by ensuring that the neural net has high confidence on its adversarial examples. In our experiments, $\alpha$ is held fixed.
Similarly, we modified the L-BFGS-B baseline so that during the line search over $c$, we only count $x_* + r$ as adversarial if $x^{(k)}_\ell \ge x^{(k)}_{\ell'} + \alpha$ for all $\ell' \ne \ell$. We choose $\alpha$ to balance two effects: larger $\alpha$ causes the baseline to find significantly fewer adversarial examples, and smaller $\alpha$ results in smaller improvement in robustness. With this choice, rounding errors occur on 8.3% of the adversarial examples we find on the MNIST training set.
6.1 Adversarial Images for CIFAR-10 and MNIST
We find adversarial examples for the neural net LeNet lenet (modified to use ReLUs instead of sigmoids) trained to classify MNIST lecun1998gradient , and for the network-in-network (NiN) neural net nin trained to classify CIFAR-10 krizhevsky2009learning . Both neural nets are trained using Caffe jia2014caffe . For MNIST, Figure 2 (b) shows an adversarial example (labeled 1) we find for the image in Figure 2 (a) labeled 3, and Figure 2 (c) shows the corresponding adversarial perturbation scaled so the difference is visible. For CIFAR-10, Figure 2 (e) shows an adversarial example labeled “truck” for the image in Figure 2 (d) labeled “automobile”, and Figure 2 (f) shows the corresponding scaled adversarial perturbation.
6.2 Comparison to Other Algorithms on MNIST
|Neural Net|Accuracy (%)|Adversarial Frequency (%), Baseline|Adversarial Frequency (%), Our Algo.|Adversarial Severity (pixels), Baseline|Adversarial Severity (pixels), Our Algo.|
|---|---|---|---|---|---|
|Our Algo. ()|99.17|1.18|5.40|12.8|12.2|
|Our Algo. ()|99.23|1.12|5.03|12.2|11.7|
We compare our algorithm for estimating to the baseline L-BFGS-B algorithm proposed by sze14 . We use the tool provided by taba15 to compute this baseline. For both algorithms, we use adversarial target label . We use LeNet in our comparisons, since we find that it is substantially more robust than the neural nets considered in most previous work (including sze14 ). We also use versions of LeNet fine-tuned using both our algorithm and the baseline with . To focus on the most severe adversarial examples, we use a stricter threshold for robustness of pixels.
We performed a similar comparison to the signed gradient algorithm proposed by good15 (with the signed gradient scaled to a fixed magnitude in pixels). For LeNet, this algorithm found only one adversarial example on the MNIST test set (out of 10,000) and four adversarial examples on the MNIST training set (out of 60,000), so we omit results. [Footnote 2: Furthermore, the signed gradient algorithm cannot be used to estimate adversarial severity, since all the adversarial examples it finds have the same norm.]
In Figure 3, we plot the number of test points $x_*$ for which $\hat\rho(f, x_*) \le \epsilon$, as a function of $\epsilon$, where $\hat\rho(f, x_*)$ is estimated using (a) the baseline and (b) our algorithm. These plots compare the robustness of each neural network as a function of $\epsilon$. In Table 1, we show results evaluating the robustness of each neural net, including the adversarial frequency and the adversarial severity. The running times of our algorithm and the baseline algorithm are very similar; in both cases, computing $\hat\rho(f, x_*)$ for a single input $x_*$ takes about 1.5 seconds. For comparison, without our iterative constraint solving optimization, our algorithm took more than two minutes to run.
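For every neural net, our algorithm produces substantially higher estimates of the adversarial frequency. Any adversarial example an algorithm finds only bounds $\rho(f, x_*)$ from above, so finding more adversarial examples within the threshold means a tighter estimate. In other words, our algorithm estimates $\rho(f, x_*)$ with substantially better accuracy compared to the baseline.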
According to the baseline metrics shown in Figure 3 (a), the baseline neural net (red) is similarly robust to our neural net (blue), and both are more robust than the original LeNet (black). Our neural net is actually more robust than the baseline neural net for smaller values of $\epsilon$, whereas the baseline neural net eventually becomes slightly more robust (i.e., where the red line dips below the blue line). This behavior is captured by our robustness statistics: the baseline neural net has lower adversarial frequency (so it has fewer adversarial examples with $\hat\rho(f, x_*) \le \epsilon$) but also has worse adversarial severity (since its adversarial examples are on average closer to the original points $x_*$).
However, according to our metrics shown in Figure 3 (b), our neural net is substantially more robust than the baseline neural net. Again, this is reflected by our statistics—our neural net has substantially lower adversarial frequency compared to the baseline neural net, while maintaining similar adversarial severity. Taken together, our results suggest that the baseline neural net is overfitting to the adversarial examples found by the baseline algorithm. In particular, the baseline neural net does not learn the adversarial examples found by our algorithm. On the other hand, our neural net learns both the adversarial examples found by our algorithm and those found by the baseline algorithm.
6.3 Scaling to CIFAR-10
We also implemented our approach for the CIFAR-10 network-in-network (NiN) neural net nin , which obtains 91.31% test set accuracy. Computing $\hat\rho(f, x_*)$ for a single input on NiN takes about 10-15 seconds on an 8-core CPU. Unlike LeNet, NiN suffers severely from adversarial examples: we measure a 61.5% adversarial frequency and an adversarial severity of 2.82 pixels. Our neural net (NiN fine-tuned using our algorithm) has test set accuracy 90.35%, which is similar to the test set accuracy of the original NiN. As can be seen in Figure 3 (c), our neural net improves slightly in terms of robustness, especially for smaller $\epsilon$. As before, these improvements are reflected in our metrics: the adversarial frequency of our neural net drops slightly to 59.6%, and the adversarial severity improves to 3.88. Nevertheless, unlike LeNet, our fine-tuned version of NiN remains very prone to adversarial examples. In this case, we believe that new techniques are required to significantly improve robustness.
We have shown how to formulate, efficiently estimate, and improve the robustness of neural nets using an encoding of the robustness property as a constraint system. Future work includes devising better approaches to improving robustness on large neural nets such as NiN and studying properties beyond robustness.
-  K. Chalupka, P. Perona, and F. Eberhardt. Visual causal feature learning. 2015.
-  A. Fawzi, O. Fawzi, and P. Frossard. Analysis of classifiers’ robustness to adversarial perturbations. ArXiv e-prints, 2015.
-  Jiashi Feng, Tom Zahavy, Bingyi Kang, Huan Xu, and Shie Mannor. Ensemble robustness of deep learning algorithms. arXiv preprint arXiv:1602.02389, 2016.
-  Amir Globerson and Sam Roweis. Nightmare at test time: robust learning by feature deletion. In Proceedings of the 23rd international conference on Machine learning, pages 353–360. ACM, 2006.
-  Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. 2015.
-  S. Gu and L. Rigazio. Towards deep neural network architectures robust to adversarial examples. 2014.
-  Ruitong Huang, Bing Xu, Dale Schuurmans, and Csaba Szepesvári. Learning with a strong adversary. CoRR, abs/1511.03034, 2015.
-  Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
-  Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images, 2009.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. 2012.
-  Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In S. Haykin and B. Kosko, editors, Intelligent Signal Processing, pages 306–351. IEEE Press, 2001.
-  Min Lin, Qiang Chen, and Shuicheng Yan. Network In Network. CoRR, abs/1312.4400, 2013.
-  Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. Distributional smoothing with virtual adversarial training. stat, 1050:25, 2015.
-  Guido F. Montúfar, Razvan Pascanu, KyungHyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2924–2932, 2014.
-  Seyed Mohsen Moosavi Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In , number EPFL-CONF-218057, 2016.
-  Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 427–436. IEEE, 2015.
-  Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against deep learning systems using adversarial examples. arXiv preprint arXiv:1602.02697, 2016.
-  Sara Sabour, Yanshuai Cao, Fartash Faghri, and David J Fleet. Adversarial manipulation of deep representations. arXiv preprint arXiv:1511.05122, 2015.
-  Uri Shaham, Yutaro Yamada, and Sahand Negahban. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432, 2015.
-  Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. 2014.
-  Pedro Tabacof and Eduardo Valle. Exploring the space of adversarial images. CoRR, abs/1510.05328, 2015.
-  Huan Xu and Shie Mannor. Robustness and generalization. Machine learning, 86(3):391–423, 2012.