Measuring Neural Net Robustness with Constraints

05/24/2016 ∙ by Osbert Bastani, et al. ∙ University of Cambridge ∙ Microsoft ∙ University of Pennsylvania ∙ Stanford University

Despite having high accuracy, neural nets have been shown to be susceptible to adversarial examples, where a small perturbation to an input can cause it to become mislabeled. We propose metrics for measuring the robustness of a neural net and devise a novel algorithm for approximating these metrics based on an encoding of robustness as a linear program. We show how our metrics can be used to evaluate the robustness of deep neural nets with experiments on the MNIST and CIFAR-10 datasets. Our algorithm generates more informative estimates of robustness metrics compared to estimates based on existing algorithms. Furthermore, we show how existing approaches to improving robustness "overfit" to adversarial examples generated using a specific algorithm. Finally, we show that our techniques can additionally improve neural net robustness, both according to the metrics that we propose and according to previously proposed metrics.




1 Introduction

Recent work sze14 shows that it is often possible to construct an input mislabeled by a neural net by perturbing a correctly labeled input by a tiny amount in a carefully chosen direction. Lack of robustness can be problematic in a variety of settings, such as changing camera lens or lighting conditions, successive frames in a video, or adversarial attacks in security-critical applications papernot2016practical .

A number of approaches have since been proposed to improve robustness gu14 ; good15 ; cha15 ; hua15 ; shaham2015understanding . However, work in this direction has been handicapped by the lack of objective measures of robustness. A typical approach to improving the robustness of a neural net $f$ is to use an algorithm $\mathcal{A}$ to find adversarial examples, augment the training set with these examples, and train a new neural net $f'$ good15 . Robustness is then evaluated by using the same algorithm $\mathcal{A}$ to find adversarial examples for $f'$: if $\mathcal{A}$ discovers fewer adversarial examples for $f'$ than for $f$, then $f'$ is concluded to be more robust than $f$. However, $f'$ may have overfit to adversarial examples generated by $\mathcal{A}$; in particular, a different algorithm $\mathcal{A}'$ may find as many adversarial examples for $f'$ as for $f$. Having an objective robustness measure is vital not only to reliably compare different algorithms, but also to understand robustness of production neural nets; e.g., when deploying a login system based on face recognition, a security team may need to evaluate the risk of an attack using adversarial examples.

In this paper, we study the problem of measuring robustness. We propose to use two statistics of the robustness $\rho(f, x_*)$ of $f$ at point $x_*$ (i.e., the $L_\infty$ distance from $x_*$ to the nearest adversarial example) sze14 . The first one measures the frequency with which adversarial examples occur; the other measures the severity of such adversarial examples. Both statistics depend on a parameter $\epsilon$, which intuitively specifies the threshold below which adversarial examples should not exist (i.e., points $x$ with $L_\infty$ distance to $x_*$ less than $\epsilon$ should be assigned the same label as $x_*$).

The key challenge is efficiently computing $\rho(f, x_*)$. We give an exact formulation of this problem as an intractable optimization problem. To recover tractability, we approximate this optimization problem by constraining the search to a convex region $\mathcal{Z}(x_*)$ around $x_*$. Furthermore, we devise an iterative approach to solving the resulting linear program that produces an order of magnitude speed-up. Common neural nets (specifically, those using rectified linear units as activation functions) are in fact piecewise linear functions MontufarPCB14 ; we choose $\mathcal{Z}(x_*)$ to be the region around $x_*$ on which $f$ is linear. Since the linear nature of neural nets is often the cause of adversarial examples good15 , our choice of $\mathcal{Z}(x_*)$ focuses the search where adversarial examples are most likely to exist.

We evaluate our approach on a deep convolutional neural network $f$ for MNIST. We estimate $\rho(f, x_*)$ using both our algorithm $\mathcal{A}_{\text{LP}}$ and (as a baseline) the algorithm $\mathcal{A}_{\text{L-BFGS}}$ introduced by sze14 . We show that $\mathcal{A}_{\text{LP}}$ produces a substantially more accurate estimate of $\rho(f, x_*)$ than $\mathcal{A}_{\text{L-BFGS}}$. We then use data augmentation with each algorithm to improve the robustness of $f$, resulting in fine-tuned neural nets $f_{\text{LP}}$ and $f_{\text{L-BFGS}}$. According to $\mathcal{A}_{\text{L-BFGS}}$, $f_{\text{L-BFGS}}$ is more robust than $f$, but not according to $\mathcal{A}_{\text{LP}}$. In other words, $f_{\text{L-BFGS}}$ overfits to adversarial examples computed using $\mathcal{A}_{\text{L-BFGS}}$. In contrast, $f_{\text{LP}}$ is more robust according to both $\mathcal{A}_{\text{L-BFGS}}$ and $\mathcal{A}_{\text{LP}}$. Furthermore, to demonstrate scalability, we apply our approach to evaluate the robustness of the 23-layer network-in-network (NiN) neural net nin for CIFAR-10, and reveal a surprising lack of robustness. We fine-tune NiN and show that robustness improves, albeit only by a small amount. In summary, our contributions are:

  • We formalize the notion of pointwise robustness studied in previous work good15 ; sze14 ; gu14 and propose two statistics for measuring robustness based on this notion (§2).

  • We show how computing pointwise robustness can be encoded as a constraint system (§3). We approximate this constraint system with a tractable linear program and devise an optimization for solving this linear program an order of magnitude faster (§4).

  • We demonstrate experimentally that our algorithm produces substantially more accurate measures of robustness compared to algorithms based on previous work, and show evidence that neural nets fine-tuned to improve robustness (§5) can overfit to adversarial examples identified by a specific algorithm (§6).

1.1 Related work

The susceptibility of neural nets to adversarial examples was discovered by sze14 . Given a test point $x_* \in \mathcal{X}$ with predicted label $\ell_*$, an adversarial example is an input $x_* + r$ with predicted label $\ell \neq \ell_*$ where the adversarial perturbation $r$ is small (in $L_\infty$ norm). Then, sze14 devises an approximate algorithm for finding the smallest possible adversarial perturbation $r$. Their approach is to minimize the combined objective $\text{loss}(f(x_* + r), \ell) + c\|r\|_\infty$, which is an instance of box-constrained convex optimization that can be solved using L-BFGS-B. The constant $c$ is optimized using line search.

Our formalization of the robustness $\rho(f, x_*)$ of $f$ at $x_*$ corresponds to the notion in sze14 of finding the minimal $\|r\|_\infty$. We propose an exact algorithm for computing $\rho(f, x_*)$ as well as a tractable approximation. The algorithm in sze14 can also be used to approximate $\rho(f, x_*)$; we show experimentally that our algorithm is substantially more accurate than sze14 .

There has been a range of subsequent work studying robustness; nguyen2015deep devises an algorithm for finding purely synthetic adversarial examples (i.e., no initial image $x_*$), taba15 searches for adversarial examples using random perturbations, showing that adversarial examples in fact exist in large regions of the pixel space, sabour2015adversarial shows that even intermediate layers of neural nets are not robust to adversarial noise, and feng2016ensemble seeks to explain why neural nets may generalize well despite poor robustness properties.

Starting with good15 , a major focus has been on devising faster algorithms for finding adversarial examples. Their idea is that adversarial examples can then be computed on-the-fly and used as training examples, analogous to data augmentation approaches typically used to train neural nets kri12 . To find adversarial examples quickly, good15 chooses the adversarial perturbation $r$ to be in the direction of the signed gradient of the loss function with fixed magnitude. Intuitively, given only the gradient of the loss function, this choice of $r$ is most likely to produce an adversarial example with $\|r\|_\infty \le \epsilon$. In this direction, moosavi2016deepfool improves upon good15 by taking multiple gradient steps, hua15 extends this idea to norms beyond the $L_\infty$ norm, gu14 takes the approach of sze14 but fixes $c$, and shaham2015understanding formalizes good15 as robust optimization.

A key shortcoming of these lines of work is that robustness is typically measured using the same algorithm used to find adversarial examples, in which case the resulting neural net may have overfit to adversarial examples generated using that algorithm. For example, good15 shows improved accuracy on adversarial examples generated using their own signed gradient method, but does not consider whether robustness increases for adversarial examples generated using more precise approaches such as sze14 . Similarly, hua15 compares accuracy on adversarial examples generated using both itself and good15 (but not sze14 ), and shaham2015understanding only considers accuracy on adversarial examples generated using their own approach on the baseline network. The aim of our paper is to provide metrics for evaluating robustness, and to demonstrate the importance of using such impartial measures to compare robustness.

Additionally, there has been work on designing neural network architectures gu14 and learning procedures cha15 that improve robustness to adversarial perturbations, though they do not obtain state-of-the-art accuracy on the unperturbed test sets. There has also been work using smoothness regularization related to good15 to train neural nets, focusing on improving accuracy rather than robustness miyato2015distributional .

Robustness has also been studied in more general contexts; xu2012robustness studies the connection between robustness and generalization, alh15 establishes theoretical lower bounds on the robustness of linear and quadratic classifiers, and other work seeks to improve robustness by promoting resilience to deleting features during training. More broadly, robustness has been identified as a desirable property of classifiers beyond prediction accuracy. Traditional metrics such as (out-of-sample) accuracy, precision, and recall help users assess prediction accuracy of trained models; our work aims to develop analogous metrics for assessing robustness.

2 Robustness Metrics

Consider a classifier $f: \mathcal{X} \to \mathcal{L}$, where $\mathcal{X} \subseteq \mathbb{R}^n$ is the input space and $\mathcal{L} = \{1, \dots, L\}$ are the labels. We assume that training and test points $x \in \mathcal{X}$ have distribution $\mathcal{D}$. We first formalize the notion of robustness at a point, and then describe two statistics to measure robustness. Our two statistics depend on a parameter $\epsilon$, which captures the idea that we only care about robustness below a certain threshold: we disregard adversarial examples $x$ whose $L_\infty$ distance to $x_*$ is greater than $\epsilon$. We use a fixed value of $\epsilon$ in our experiments on MNIST and CIFAR-10 (on the pixel scale 0-255).

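Pointwise robustness. Intuitively, $f$ is robust at $x_* \in \mathcal{X}$ if a “small” perturbation to $x_*$ does not affect the assigned label. We are interested in perturbations sufficiently small that they do not affect human classification; an established condition is $\|x - x_*\|_\infty \le \epsilon$ for some parameter $\epsilon$. Formally, we say $f$ is $(x_*, \epsilon)$-robust if for every $x$ such that $\|x - x_*\|_\infty \le \epsilon$, $f(x) = f(x_*)$. Finally, the pointwise robustness $\rho(f, x_*)$ of $f$ at $x_*$ is the minimum $\epsilon$ for which $f$ fails to be $(x_*, \epsilon)$-robust:

$$\rho(f, x_*) \overset{\text{def}}{=} \inf\{\epsilon \ge 0 \mid f \text{ is not } (x_*, \epsilon)\text{-robust}\}. \tag{1}$$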

This definition formalizes the notion of robustness in good15 ; gu14 ; sze14 .

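Adversarial frequency.

Given a parameter $\epsilon$, the adversarial frequency

$$\phi(f, \epsilon) \overset{\text{def}}{=} \Pr_{x_* \sim \mathcal{D}}\left[\rho(f, x_*) \le \epsilon\right]$$

measures how often $f$ fails to be $(x_*, \epsilon)$-robust. In other words, if $f$ has high adversarial frequency, then it fails to be $(x_*, \epsilon)$-robust for many inputs $x_*$.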

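Adversarial severity.

Given a parameter $\epsilon$, the adversarial severity

$$\mu(f, \epsilon) \overset{\text{def}}{=} \mathbb{E}_{x_* \sim \mathcal{D}}\left[\rho(f, x_*) \mid \rho(f, x_*) \le \epsilon\right]$$

measures the severity with which $f$ fails to be robust at $x_*$ conditioned on $f$ not being $(x_*, \epsilon)$-robust. We condition on pointwise robustness since once $f$ is $(x_*, \epsilon)$-robust at $x_*$, then the degree to which $f$ is robust at $x_*$ does not matter. Smaller $\mu(f, \epsilon)$ corresponds to worse adversarial severity, since $f$ is more susceptible to adversarial examples if the distances to the nearest adversarial example are small.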

The frequency and severity capture different robustness behaviors. A neural net may have high adversarial frequency but low adversarial severity, indicating that most adversarial examples are about distance $\epsilon$ away from the original point $x_*$. Conversely, a neural net may have low adversarial frequency but high adversarial severity, indicating that it is typically robust, but occasionally severely fails to be robust. Frequency is typically the more important metric, since a neural net with low adversarial frequency is robust most of the time. Indeed, adversarial frequency corresponds to the accuracy on adversarial examples used to measure robustness in good15 ; shaham2015understanding . Severity can be used to differentiate between neural nets with similar adversarial frequency.

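Given a set of samples $X \subseteq \mathcal{X}$ drawn i.i.d. from $\mathcal{D}$, we can estimate the adversarial frequency $\phi(f, \epsilon)$ and the adversarial severity $\mu(f, \epsilon)$ using the following standard estimators, assuming we can compute the pointwise robustness $\rho$:

$$\hat\phi(f, \epsilon, X) = \frac{|\{x_* \in X \mid \rho(f, x_*) \le \epsilon\}|}{|X|}, \qquad \hat\mu(f, \epsilon, X) = \frac{\sum_{x_* \in X} \rho(f, x_*)\, \mathbb{I}[\rho(f, x_*) \le \epsilon]}{|\{x_* \in X \mid \rho(f, x_*) \le \epsilon\}|}.$$

An approximation $\hat\rho(f, x_*) \approx \rho(f, x_*)$ of $\rho$, such as the one we describe in Section 4, can be used in place of $\rho$. In practice, $X$ is taken to be the test set $X_{\text{test}}$.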
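The two estimators can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: the function name `robustness_statistics`, the sample values, and the threshold `eps=20.0` are all hypothetical, and the input is assumed to be a list of already computed (or approximated) pointwise robustness values.

```python
from typing import Sequence, Tuple


def robustness_statistics(rho: Sequence[float], eps: float) -> Tuple[float, float]:
    """Estimate adversarial frequency and severity from pointwise robustness values.

    rho[i] is the (possibly approximate) pointwise robustness of sample i.
    """
    if not rho:
        raise ValueError("need at least one sample")
    # Samples at which the classifier fails to be eps-robust.
    failures = [r for r in rho if r <= eps]
    frequency = len(failures) / len(rho)
    # Severity is a conditional mean; it is undefined when there are no failures.
    severity = sum(failures) / len(failures) if failures else float("nan")
    return frequency, severity


# Hypothetical robustness values (in pixels) for five test points.
rho_values = [3.0, 25.0, 12.0, 40.0, 9.0]
freq, sev = robustness_statistics(rho_values, eps=20.0)
print(freq, sev)
```

Three of the five hypothetical points fall below the threshold, so the frequency estimate is 3/5 and the severity estimate is the mean of 3, 12, and 9.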

3 Computing Pointwise Robustness

3.1 Overview

Figure 1: Neural net with a single hidden layer and ReLU activations trained on a dataset with binary labels. (a) The training data and loss surface. (b) The linear region corresponding to the red training point.

Figure 2: For MNIST, (a) an image classified 1, (b) its adversarial example classified 3, and (c) the (scaled) adversarial perturbation. For CIFAR-10, (d) an image classified as “automobile”, (e) its adversarial example classified as “truck”, and (f) the (scaled) adversarial perturbation.

Consider the training points in Figure 1 (a) colored based on the ground truth label. To classify this data, we train a two-layer neural net $f(x) = \arg\max_\ell \{(W_2\, g(W_1 x))_\ell\}$, where the ReLU function $g(z) = \max\{0, z\}$ is applied pointwise. Figure 1 (a) includes contours of the per-point loss function of this neural net.

Exhaustively searching the input space to determine the distance $\rho(f, x_*)$ to the nearest adversarial example for input $x_*$ (labeled $\ell_*$) is intractable. Recall that neural nets with rectified-linear (ReLU) units as activations are piecewise linear MontufarPCB14 . Since adversarial examples exist because of this linearity in the neural net good15 , we restrict our search to the region $\mathcal{Z}(x_*)$ around $x_*$ on which the neural net is linear. This region around $x_*$ is defined by the activation of the ReLU function: for each $i$, if $(W_1 x_*)_i \ge 0$ (resp., $(W_1 x_*)_i \le 0$), we constrain to the half-space $\{x \mid (W_1 x)_i \ge 0\}$ (resp., $\{x \mid (W_1 x)_i \le 0\}$). The intersection of these half-spaces is convex, so it admits efficient search. Figure 1 (b) shows one such convex region. (Our neural net has 8 hidden units, but for this $x_*$, 6 of the half-spaces entirely contain the convex region.)

Additionally, $x$ is labeled $\ell$ exactly when the score the neural net assigns to $\ell$ is at least the score it assigns to $\ell'$, for each $\ell' \neq \ell$. These constraints are linear since $f$ is linear on $\mathcal{Z}(x_*)$. Therefore, we can find the distance to the nearest input with label $\ell \neq \ell_*$ by minimizing $\|x - x_*\|_\infty$ on $\mathcal{Z}(x_*)$. Finally, we can perform this search for each label $\ell \neq \ell_*$, though for efficiency we take $\ell$ to be the label assigned the second-highest score by $f$. Figure 1 (b) shows the adversarial example found by our algorithm in our running example. In Figure 1 note that the direction of the nearest adversarial example is not necessarily aligned with the signed gradient of the loss function, as observed by others hua15 .
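To make the linear region concrete, here is a minimal sketch for a bias-free two-layer ReLU net. The weights `W1` and `W2`, the seed point, and all function names are hypothetical and chosen only for illustration; they are not the network from Figure 1.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def matvec(W: Matrix, x: Vector) -> List[float]:
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def activation_pattern(W1: Matrix, x: Vector) -> List[bool]:
    """True for hidden units whose pre-activation is positive at x."""
    return [z > 0 for z in matvec(W1, x)]


def in_linear_region(W1: Matrix, seed: Vector, x: Vector) -> bool:
    """Check that x satisfies every half-space constraint induced by the seed."""
    for active, z in zip(activation_pattern(W1, seed), matvec(W1, x)):
        if active and z < 0:       # unit must stay on the side z >= 0
            return False
        if not active and z > 0:   # unit must stay on the side z <= 0
            return False
    return True


def scores(W1: Matrix, W2: Matrix, x: Vector) -> List[float]:
    """Output scores W2 * relu(W1 * x) of the two-layer net."""
    hidden = [max(0.0, z) for z in matvec(W1, x)]
    return matvec(W2, hidden)


# Hypothetical weights: 2 inputs, 3 hidden units, 2 labels.
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
W2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
seed = [2.0, 1.0]

pattern = activation_pattern(W1, seed)
inside = in_linear_region(W1, seed, [1.0, 3.0])
outside = in_linear_region(W1, seed, [-1.0, 3.0])
```

For these weights the seed activates the first two hidden units only, so on its region the scores are simply the two input coordinates. The point `[1.0, 3.0]` lies in the same region but receives a different label than the seed, which is the kind of point the search looks for.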

3.2 Formulation as Optimization

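We compute $\rho(f, x_*)$ by expressing (1) as constraints $\mathcal{C}$, which consist of

  • Linear relations; specifically, inequalities $\mathcal{C} \equiv (w^T x + b \ge 0)$ and equalities $\mathcal{C} \equiv (w^T x + b = 0)$, where $x \in \mathbb{R}^m$ (for some $m$) are variables and $w \in \mathbb{R}^m$, $b \in \mathbb{R}$ are constants.

  • Conjunctions $\mathcal{C} \equiv \mathcal{C}_1 \wedge \mathcal{C}_2$, where $\mathcal{C}_1$ and $\mathcal{C}_2$ are themselves constraints. Both constraints must be satisfied for the conjunction to be satisfied.

  • Disjunctions $\mathcal{C} \equiv \mathcal{C}_1 \vee \mathcal{C}_2$, where $\mathcal{C}_1$ and $\mathcal{C}_2$ are themselves constraints. One of the constraints must be satisfied for the disjunction to be satisfied.

The feasible set $\mathcal{F}(\mathcal{C})$ of $\mathcal{C}$ is the set of $x \in \mathbb{R}^m$ that satisfy $\mathcal{C}$; $\mathcal{C}$ is satisfiable if $\mathcal{F}(\mathcal{C})$ is nonempty.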

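In the next section, we show that the condition $f(x) = \ell$ can be expressed as constraints $\mathcal{C}_f(x, \ell)$; i.e., $f(x) = \ell$ if and only if $\mathcal{C}_f(x, \ell)$ is satisfiable. Then, $\rho(f, x_*)$ can be computed as follows:

$$\rho(f, x_*) = \min_{\ell \neq \ell_*} \rho(f, x_*, \ell) \tag{2}$$
$$\rho(f, x_*, \ell) \overset{\text{def}}{=} \inf\{\epsilon \ge 0 \mid \mathcal{C}_f(x, \ell) \wedge \|x - x_*\|_\infty \le \epsilon \text{ satisfiable}\}. \tag{3}$$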

The optimization problem is typically intractable; we describe a tractable approximation in §4.

3.3 Encoding a Neural Network

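We show how to encode the constraint $f(x) = \ell$ as constraints $\mathcal{C}_f(x, \ell)$ when $f$ is a neural net. We assume $f$ has form $f(x) = \arg\max_{\ell \in \mathcal{L}} \{[f^{(k)}(f^{(k-1)}(\dots(f^{(1)}(x))\dots))]_\ell\}$, where the $i$-th layer of the network is a function $f^{(i)}: \mathbb{R}^{n_{i-1}} \to \mathbb{R}^{n_i}$, with $n_0 = n$ and $n_k = |\mathcal{L}|$. We describe the encoding of fully-connected and ReLU layers; convolutional layers are encoded similarly to fully-connected layers and max-pooling layers are encoded similarly to ReLU layers. We introduce the variables $x^{(0)}, \dots, x^{(k)}$ into our constraints, with the interpretation that $x^{(i)}$ represents the output vector of layer $i$ of the network; i.e., $x^{(i)} = f^{(i)}(x^{(i-1)})$. The constraint $\mathcal{C}_{\text{in}}(x) \equiv (x^{(0)} = x)$ encodes the input layer. For each layer $f^{(i)}$, we encode the computation of $x^{(i)}$ given $x^{(i-1)}$ as a constraint $\mathcal{C}_i$.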

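Fully-connected layer.

In this case, $x^{(i)} = f^{(i)}(x^{(i-1)}) = W^{(i)} x^{(i-1)} + b^{(i)}$, which we encode using the constraints $\mathcal{C}_i \equiv \bigwedge_{j=1}^{n_i} \{x^{(i)}_j = W^{(i)}_j x^{(i-1)} + b^{(i)}_j\}$, where $W^{(i)}_j$ is the $j$-th row of $W^{(i)}$.

ReLU layer.

In this case, $x^{(i)}_j = \max\{x^{(i-1)}_j, 0\}$ (for each $1 \le j \le n_i$), which we encode using the constraints $\mathcal{C}_i \equiv \bigwedge_{j=1}^{n_i} \mathcal{C}_{ij}$, where $\mathcal{C}_{ij} \equiv (x^{(i-1)}_j \le 0 \wedge x^{(i)}_j = 0) \vee (x^{(i-1)}_j \ge 0 \wedge x^{(i)}_j = x^{(i-1)}_j)$.

Finally, the constraints $\mathcal{C}_{\text{out}}(\ell) \equiv \bigwedge_{\ell' \neq \ell} \{x^{(k)}_\ell \ge x^{(k)}_{\ell'}\}$ ensure that the output label is $\ell$. Together, the constraints $\mathcal{C}_f(x, \ell) \equiv \mathcal{C}_{\text{in}}(x) \wedge \left(\bigwedge_{i=1}^{k} \mathcal{C}_i\right) \wedge \mathcal{C}_{\text{out}}(\ell)$ encode the computation of $f$:

Theorem 1

For any $x \in \mathcal{X}$ and $\ell \in \mathcal{L}$, we have $f(x) = \ell$ if and only if $\mathcal{C}_f(x, \ell)$ is satisfiable.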

4 Approximate Computation of Pointwise Robustness

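Convex restriction.

The challenge to solving (3) is the non-convexity of the feasible set of $\mathcal{C}_f(x, \ell)$. To recover tractability, we approximate (3) by constraining the feasible set to $x \in \mathcal{Z}(x_*)$, where $\mathcal{Z}(x_*) \subseteq \mathcal{X}$ is carefully chosen so that the constraints $\hat{\mathcal{C}}_f(x, \ell) \equiv \mathcal{C}_f(x, \ell) \wedge (x \in \mathcal{Z}(x_*))$ have convex feasible set. We call $\hat{\mathcal{C}}_f(x, \ell)$ the convex restriction of $\mathcal{C}_f(x, \ell)$. In some sense, convex restriction is the opposite of convex relaxation. Then, we can approximately compute robustness:

$$\hat\rho(f, x_*, \ell) \overset{\text{def}}{=} \inf\{\epsilon \ge 0 \mid \hat{\mathcal{C}}_f(x, \ell) \wedge \|x - x_*\|_\infty \le \epsilon \text{ satisfiable}\}. \tag{4}$$

The objective $\epsilon$ is optimized over $x \in \mathcal{Z}(x_*)$, which approximates the optimum over $x \in \mathcal{X}$.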
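As a worked instance of the restricted problem, the sketch below solves a tiny hypothetical case in closed form instead of calling an LP solver. It relies on the fact that the $L_\infty$ distance from $x$ to a half-space $\{y \mid w^T y + c \ge 0\}$ is $\max(0, -(w^T x + c)) / \|w\|_1$. The shortcut ignores the region constraints, so it is only valid when the resulting point happens to satisfy them (checked at the end); otherwise a general LP solver is needed. The region, the target constraint, and all names are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

HalfSpace = Tuple[Sequence[float], float]  # (w, c) encodes w.x + c >= 0


def dot(w: Sequence[float], x: Sequence[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def nearest_in_halfspace(h: HalfSpace, x: Sequence[float]) -> Tuple[float, List[float]]:
    """L-infinity distance from x to the half-space h, and one closest point."""
    w, c = h
    value = dot(w, x) + c
    if value >= 0:
        return 0.0, list(x)
    norm1 = sum(abs(wi) for wi in w)
    if norm1 == 0:
        raise ValueError("empty half-space")
    t = -value / norm1  # moving each coordinate by t along sign(w) closes the gap
    step = [t if wi > 0 else -t if wi < 0 else 0.0 for wi in w]
    return t, [xi + si for xi, si in zip(x, step)]


# Hypothetical toy setting: on the region around the seed the scores are affine,
# score_0(x) = x[0] and score_1(x) = x[1], and the seed has label 0.
seed = [2.0, 1.0]
region: List[HalfSpace] = [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0), ([1.0, 1.0], 0.0)]
target: HalfSpace = ([-1.0, 1.0], 0.0)  # label 1 wins when x[1] - x[0] >= 0

rho_hat, x_adv = nearest_in_halfspace(target, seed)
# The closed form ignored the region constraints, so check them afterwards.
in_region = all(dot(w, x_adv) + c >= 0 for w, c in region)
```

Here the closest point with the target label stays inside the region, so the closed-form value coincides with the optimum of the restricted problem for this toy case.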

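Choice of $\mathcal{Z}(x_*)$.

We construct $\mathcal{Z}(x_*)$ as the feasible set of constraints $\mathcal{D}(x_*)$; i.e., $\mathcal{Z}(x_*) = \mathcal{F}(\mathcal{D}(x_*))$. We now describe how to construct $\mathcal{D}(x_*)$.

Note that $\mathcal{F}(w^T x + b = 0)$ and $\mathcal{F}(w^T x + b \ge 0)$ are convex sets. Furthermore, if $\mathcal{F}(\mathcal{C}_1)$ and $\mathcal{F}(\mathcal{C}_2)$ are convex, then so is their conjunction $\mathcal{F}(\mathcal{C}_1 \wedge \mathcal{C}_2)$. However, their disjunction $\mathcal{F}(\mathcal{C}_1 \vee \mathcal{C}_2)$ may not be convex; for example, $\mathcal{F}((x \le 0) \vee (x \ge 1))$. The potential non-convexity of disjunctions makes (3) difficult to optimize.

We can eliminate disjunction operations by choosing one of the two disjuncts to hold. For example, note that for $\mathcal{C}_1 \equiv \mathcal{C}_2 \vee \mathcal{C}_3$, we have both $\mathcal{F}(\mathcal{C}_2) \subseteq \mathcal{F}(\mathcal{C}_1)$ and $\mathcal{F}(\mathcal{C}_3) \subseteq \mathcal{F}(\mathcal{C}_1)$. In other words, if we replace $\mathcal{C}_1$ with either $\mathcal{C}_2$ or $\mathcal{C}_3$, the feasible set of the resulting constraints can only become smaller. Taking $\mathcal{D} \equiv \mathcal{C}_2$ (resp., $\mathcal{D} \equiv \mathcal{C}_3$) effectively replaces $\mathcal{C}_1$ with $\mathcal{C}_2$ (resp., $\mathcal{C}_3$).

To restrict (3), for every disjunction $\mathcal{C}_1 \equiv \mathcal{C}_2 \vee \mathcal{C}_3$, we systematically choose either $\mathcal{C}_2$ or $\mathcal{C}_3$ to replace the constraint $\mathcal{C}_1$. In particular, we choose $\mathcal{C}_2$ if $x_*$ satisfies $\mathcal{C}_2$ (i.e., $x_* \in \mathcal{F}(\mathcal{C}_2)$) and choose $\mathcal{C}_3$ otherwise. In our constraints, disjunctions are always mutually exclusive, so $x_*$ never simultaneously satisfies both $\mathcal{C}_2$ and $\mathcal{C}_3$. We then take $\mathcal{D}(x_*)$ to be the conjunction of all our choices. The resulting constraints contain only conjunctions of linear relations, so their feasible set is convex. In fact, the problem can be expressed as a linear program (LP) and can be solved using any standard LP solver.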

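For example, consider a rectified linear layer (as before, max pooling layers are similar). The original constraint added for unit $j$ of rectified linear layer $f^{(i)}$ is

$$(x^{(i-1)}_j \le 0 \wedge x^{(i)}_j = 0) \vee (x^{(i-1)}_j \ge 0 \wedge x^{(i)}_j = x^{(i-1)}_j).$$

To restrict this constraint, we evaluate the neural network on the seed input $x_*$ and look at the input to $f^{(i)}$, which equals $x_*^{(i-1)} = f^{(i-1)}(\dots(f^{(1)}(x_*))\dots)$. Then, for each $1 \le j \le n_i$, we add the disjunct satisfied at the seed to the restriction constraints $\mathcal{D}(x_*)$:

$$\mathcal{D}(x_*) \leftarrow \mathcal{D}(x_*) \wedge \begin{cases} x^{(i-1)}_j \le 0 \wedge x^{(i)}_j = 0 & \text{if } (x_*^{(i-1)})_j \le 0 \\ x^{(i-1)}_j \ge 0 \wedge x^{(i)}_j = x^{(i-1)}_j & \text{if } (x_*^{(i-1)})_j > 0. \end{cases}$$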
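The restriction step can be sketched on a small constraint tree. The classes `Linear`, `And`, and `Or` and the functions `holds` and `restrict` are hypothetical names introduced here for illustration; the "seed" passed to `restrict` is assumed to be the full assignment of all variables (including intermediate layer outputs) obtained by evaluating the network at the seed input.

```python
from dataclasses import dataclass
from typing import List, Sequence, Union


@dataclass
class Linear:
    w: Sequence[float]
    b: float
    equality: bool = False  # False: w.x + b >= 0, True: w.x + b == 0


@dataclass
class And:
    left: "Constraint"
    right: "Constraint"


@dataclass
class Or:
    left: "Constraint"
    right: "Constraint"


Constraint = Union[Linear, And, Or]


def holds(c: Constraint, x: Sequence[float], tol: float = 1e-9) -> bool:
    """Return True if the assignment x satisfies constraint c."""
    if isinstance(c, Linear):
        v = sum(wi * xi for wi, xi in zip(c.w, x)) + c.b
        return abs(v) <= tol if c.equality else v >= -tol
    if isinstance(c, And):
        return holds(c.left, x, tol) and holds(c.right, x, tol)
    return holds(c.left, x, tol) or holds(c.right, x, tol)


def restrict(c: Constraint, seed: Sequence[float]) -> List[Linear]:
    """Flatten c into a conjunction of linear relations, resolving each
    disjunction in favor of the disjunct that the seed assignment satisfies."""
    if isinstance(c, Linear):
        return [c]
    if isinstance(c, And):
        return restrict(c.left, seed) + restrict(c.right, seed)
    chosen = c.left if holds(c.left, seed) else c.right
    return restrict(chosen, seed)


# One ReLU unit over variables (z, y): (z <= 0 and y == 0) or (z >= 0 and y == z).
relu = Or(
    And(Linear([-1.0, 0.0], 0.0), Linear([0.0, 1.0], 0.0, equality=True)),
    And(Linear([1.0, 0.0], 0.0), Linear([-1.0, 1.0], 0.0, equality=True)),
)
restricted = restrict(relu, [2.0, 2.0])  # the seed has z = 2, so the unit is active
```

The restricted list contains only linear relations, and its feasible set is a subset of the original one: the assignment `(-1.0, 0.0)` satisfies `relu` but not the restricted constraints, which is the price paid for convexity.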

Iterative constraint solving.

We implement an optimization for solving LPs by lazily adding constraints as necessary. Given all constraints $\mathcal{C}$, we start off solving the LP with the subset of equality constraints $\mathcal{C}' \subseteq \mathcal{C}$, which yields a (possibly infeasible) solution $z$. If $z$ is feasible, then $z$ is also an optimal solution to the original LP; otherwise, we add to $\mathcal{C}'$ the constraints in $\mathcal{C}$ that are not satisfied by $z$ and repeat the process. This process always yields the correct solution, since in the worst case $\mathcal{C}'$ becomes equal to $\mathcal{C}$. In practice, this optimization is an order of magnitude faster than directly solving the LP with constraints $\mathcal{C}$.
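The lazy loop can be sketched independently of any particular solver. In the sketch below, `solve_lazily` is a hypothetical helper that takes the solver as a callback; the demonstration uses a brute-force search over a small integer grid as a stand-in for an LP solver, which is an assumption made purely to keep the example self-contained.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

C = TypeVar("C")  # constraint type
S = TypeVar("S")  # solution type


def solve_lazily(
    initial: Sequence[C],
    all_constraints: Sequence[C],
    solve: Callable[[List[C]], S],
    satisfied: Callable[[C, S], bool],
) -> S:
    """Solve with a growing working set of constraints.

    solve(working) must return an optimum of the problem restricted to the
    working set; constraints it violates are added and the solve is repeated.
    """
    working = list(initial)
    while True:
        solution = solve(working)
        violated = [c for c in all_constraints
                    if c not in working and not satisfied(c, solution)]
        if not violated:
            return solution  # feasible for every constraint, hence optimal
        working.extend(violated)


# Toy stand-in for an LP solver: minimize x + y over a small integer grid.
# A constraint (a, b, rhs) means a*x + b*y >= rhs.
GRID = [(x, y) for x in range(6) for y in range(6)]
working_set_sizes: List[int] = []


def satisfied(c: Tuple[int, int, int], p: Tuple[int, int]) -> bool:
    a, b, rhs = c
    return a * p[0] + b * p[1] >= rhs


def solve(working: List[Tuple[int, int, int]]) -> Tuple[int, int]:
    working_set_sizes.append(len(working))
    feasible = [p for p in GRID if all(satisfied(c, p) for c in working)]
    return min(feasible, key=lambda p: (p[0] + p[1], p))


constraints = [(1, 0, 2), (0, 1, 1), (1, 1, 1), (1, 0, 0), (0, 1, 0)]
best = solve_lazily([], constraints, solve, satisfied)
```

In this toy run only three of the five constraints ever enter the working set, yet the returned point is the same one a solve over all constraints would produce. The argument is the one given in the text: an optimum of a relaxed problem that happens to be feasible for the full problem is optimal for the full problem.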

Single target label.

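For simplicity, rather than minimizing the approximate robustness $\hat\rho(f, x_*, \ell)$ over all $\ell \neq \ell_*$, we fix $\ell$ to be the second most probable label $\tilde\ell$; i.e.,

$$\hat\rho(f, x_*) \overset{\text{def}}{=} \inf\{\epsilon \ge 0 \mid \hat{\mathcal{C}}_f(x, \tilde\ell) \wedge \|x - x_*\|_\infty \le \epsilon \text{ satisfiable}\}, \tag{5}$$

where $\hat{\mathcal{C}}_f$ denotes the convex restriction of the constraints encoding $f$.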


Approximate robustness statistics.

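We can use $\hat\rho$ in our statistics $\hat\phi$ and $\hat\mu$ defined in §2. Because $\hat\rho$ is an overapproximation of $\rho$ (i.e., $\hat\rho(f, x_*) \ge \rho(f, x_*)$), the estimates $\hat\phi$ and $\hat\mu$ may not be unbiased (in particular, $\hat\phi(f, \epsilon) \le \phi(f, \epsilon)$). In §6, we show empirically that our algorithm produces substantially less biased estimates than existing algorithms for finding adversarial examples.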

5 Improving Neural Net Robustness

Finding adversarial examples.

We can use our algorithm for estimating $\hat\rho(f, x_*)$ to compute adversarial examples. Given $x_*$, the value of $x$ computed by the optimization procedure used to solve (5) is an adversarial example for $x_*$ with $\|x - x_*\|_\infty = \hat\rho(f, x_*)$.


We use fine-tuning to reduce a neural net’s susceptibility to adversarial examples. First, we use an algorithm $\mathcal{A}$ to compute adversarial examples for each $x_* \in X_{\text{train}}$ and add them to the training set. Then, we continue training the network on the augmented training set at a reduced training rate. We can repeat this process for multiple rounds (denoted $T$); at each round, we only consider $x_*$ in the original training set (rather than the augmented training set).

Rounding errors.

MNIST images are represented as integers, so we must round the perturbation to obtain an image, which oftentimes results in non-adversarial examples. When fine-tuning, we add a constraint $x^{(k)}_\ell \ge x^{(k)}_{\ell'} + \alpha$ for all $\ell' \neq \ell$, which eliminates this problem by ensuring that the neural net has high confidence on its adversarial examples. In our experiments, we use a fixed value of $\alpha$.

Similarly, we modified the L-BFGS-B baseline so that during the line search over $c$, we only count $x_* + r$ as adversarial if $x^{(k)}_\ell \ge x^{(k)}_{\ell'} + \alpha$ for all $\ell' \neq \ell$. We choose $\alpha$ to balance two effects: larger $\alpha$ causes the baseline to find significantly fewer adversarial examples, and smaller $\alpha$ results in smaller improvement in robustness. With this choice, rounding errors occur on 8.3% of the adversarial examples we find on the MNIST training set.
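The confidence margin and the rounding step can be sketched as two small helpers. The function names, the score vectors, and the margin values below are hypothetical; note also that Python's built-in `round` rounds halves to the nearest even integer, which may differ from the rounding used in the experiments.

```python
from typing import List, Sequence


def is_confident_adversarial(scores: Sequence[float], target: int, alpha: float) -> bool:
    """True if the target label's score beats every other score by at least alpha."""
    return all(scores[target] >= s + alpha
               for label, s in enumerate(scores) if label != target)


def round_to_pixels(image: Sequence[float]) -> List[int]:
    """Round to integer pixel values and clip to the 0-255 range."""
    return [min(255, max(0, int(round(v)))) for v in image]


# Hypothetical output scores for a perturbed image with target label 1.
scores = [1.0, 4.5, 1.2]
confident = is_confident_adversarial(scores, target=1, alpha=3.0)
too_strict = is_confident_adversarial(scores, target=1, alpha=3.5)

pixels = round_to_pixels([12.4, 255.7, -0.2])
```

A candidate that passes the margin check before rounding is more likely to keep its adversarial label after `round_to_pixels` is applied, since small changes in the input then need to overcome a gap of at least `alpha` in the scores.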

6 Experiments

6.1 Adversarial Images for CIFAR-10 and MNIST

We find adversarial examples for the neural net LeNet lenet (modified to use ReLUs instead of sigmoids) trained to classify MNIST lecun1998gradient , and for the network-in-network (NiN) neural net nin trained to classify CIFAR-10 krizhevsky2009learning . Both neural nets are trained using Caffe jia2014caffe . For MNIST, Figure 2 (b) shows an adversarial example (labeled 1) we find for the image in Figure 2 (a) labeled 3, and Figure 2 (c) shows the corresponding adversarial perturbation scaled so the difference is visible (it has norm ). For CIFAR-10, Figure 2 (e) shows an adversarial example labeled “truck” for the image in Figure 2 (d) labeled “automobile”, and Figure 2 (f) shows the corresponding scaled adversarial perturbation (which has norm ).

6.2 Comparison to Other Algorithms on MNIST

Neural Net           Accuracy (%)   Adversarial Frequency (%)   Adversarial Severity (pixels)
                                    Baseline     Our Algo.      Baseline     Our Algo.
LeNet (Original)     99.08          1.32         7.15           11.9         12.4
Baseline (T = 1)     99.14          1.02         6.89           11.0         12.3
Baseline (T = 2)     99.15          0.99         6.97           10.9         12.4
Our Algo. (T = 1)    99.17          1.18         5.40           12.8         12.2
Our Algo. (T = 2)    99.23          1.12         5.03           12.2         11.7

Table 1: Evaluation of fine-tuned networks. Our method discovers more adversarial examples than the baseline sze14 for each neural net, hence producing better estimates. LeNet fine-tuned for $T = 1, 2$ rounds (bottom four rows) exhibits a notable increase in robustness compared to the original LeNet.

Figure 3: The cumulative number of test points $x_*$ such that $\rho(f, x_*) \le \epsilon$ as a function of $\epsilon$. In (a) and (b), the neural nets are the original LeNet (black), LeNet fine-tuned with the baseline (red), and LeNet fine-tuned with our algorithm (blue); in (a), $\hat\rho$ is measured using the baseline, and in (b), $\hat\rho$ is measured using our algorithm. In (c), the neural nets are the original NiN (black) and NiN fine-tuned with our algorithm, and $\hat\rho$ is estimated using our algorithm.

We compare our algorithm for estimating $\rho(f, x_*)$ to the baseline L-BFGS-B algorithm proposed by sze14 . We use the tool provided by taba15 to compute this baseline. For both algorithms, we use the second most probable label $\tilde\ell$ as the adversarial target label. We use LeNet in our comparisons, since we find that it is substantially more robust than the neural nets considered in most previous work (including sze14 ). We also use versions of LeNet fine-tuned using both our algorithm and the baseline with $T = 1, 2$. To focus on the most severe adversarial examples, we use a stricter robustness threshold $\epsilon$, measured in pixels.

We performed a similar comparison to the signed gradient algorithm proposed by good15 (with the signed gradient multiplied by $\epsilon$ pixels). For LeNet, this algorithm found only one adversarial example on the MNIST test set (out of 10,000) and four adversarial examples on the MNIST training set (out of 60,000), so we omit results. (Furthermore, the signed gradient algorithm cannot be used to estimate adversarial severity since all the adversarial examples it finds have $L_\infty$ norm $\epsilon$.)


In Figure 3, we plot the number of test points $x_*$ for which $\hat\rho(f, x_*) \le \epsilon$, as a function of $\epsilon$, where $\hat\rho(f, x_*)$ is estimated using (a) the baseline and (b) our algorithm. These plots compare the robustness of each neural network as a function of $\epsilon$. In Table 1, we show results evaluating the robustness of each neural net, including the adversarial frequency and the adversarial severity. The running times of our algorithm and the baseline algorithm are very similar; in both cases, computing $\hat\rho(f, x_*)$ for a single input $x_*$ takes about 1.5 seconds. For comparison, without our iterative constraint solving optimization, our algorithm took more than two minutes to run.


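For every neural net, our algorithm produces substantially higher estimates of the adversarial frequency. Both algorithms report the distance to an adversarial example they actually found, so each can only overestimate $\rho(f, x_*)$ and hence underestimate the adversarial frequency; a higher frequency estimate is therefore closer to the true value. In other words, our algorithm estimates $\rho(f, x_*)$ with substantially better accuracy compared to the baseline.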

According to the baseline metrics shown in Figure 3 (a), the baseline neural net (red) is similarly robust to our neural net (blue), and both are more robust than the original LeNet (black). Our neural net is actually more robust than the baseline neural net for smaller values of $\epsilon$, whereas the baseline neural net eventually becomes slightly more robust (i.e., where the red line dips below the blue line). This behavior is captured by our robustness statistics: the baseline neural net has lower adversarial frequency (so it has fewer adversarial examples with $\hat\rho(f, x_*) \le \epsilon$) but also has worse adversarial severity (since its adversarial examples are on average closer to the original points $x_*$).

However, according to our metrics shown in Figure 3 (b), our neural net is substantially more robust than the baseline neural net. Again, this is reflected by our statistics—our neural net has substantially lower adversarial frequency compared to the baseline neural net, while maintaining similar adversarial severity. Taken together, our results suggest that the baseline neural net is overfitting to the adversarial examples found by the baseline algorithm. In particular, the baseline neural net does not learn the adversarial examples found by our algorithm. On the other hand, our neural net learns both the adversarial examples found by our algorithm and those found by the baseline algorithm.

6.3 Scaling to CIFAR-10

We also implemented our approach for the CIFAR-10 network-in-network (NiN) neural net nin , which obtains 91.31% test set accuracy. Computing $\hat\rho(f, x_*)$ for a single input on NiN takes about 10-15 seconds on an 8-core CPU. Unlike LeNet, NiN suffers severely from adversarial examples: we measure a 61.5% adversarial frequency and an adversarial severity of 2.82 pixels. Our neural net (NiN fine-tuned using our algorithm) has test set accuracy 90.35%, which is similar to the test set accuracy of the original NiN. As can be seen in Figure 3 (c), our neural net improves slightly in terms of robustness, especially for smaller $\epsilon$. As before, these improvements are reflected in our metrics: the adversarial frequency of our neural net drops slightly to 59.6%, and the adversarial severity improves to 3.88 pixels. Nevertheless, unlike LeNet, our fine-tuned version of NiN remains very prone to adversarial examples. In this case, we believe that new techniques are required to significantly improve robustness.

7 Conclusion

We have shown how to formulate, efficiently estimate, and improve the robustness of neural nets using an encoding of the robustness property as a constraint system. Future work includes devising better approaches to improving robustness on large neural nets such as NiN and studying properties beyond robustness.