Scaling up the randomized gradient-free adversarial attack reveals overestimation of robustness using established attacks

03/27/2019 ∙ by Francesco Croce, et al. ∙ 0

Modern neural networks are highly non-robust against adversarial manipulation. A significant amount of work has been invested in techniques to compute lower bounds on robustness through formal guarantees and to build provably robust models. However, it is still difficult to apply them to larger networks or to obtain robustness against larger perturbations. Thus attack strategies are needed to provide tight upper bounds on the actual robustness. We significantly improve the randomized gradient-free attack for ReLU networks [9], in particular by scaling it up to large networks. We show that our attack achieves similar or significantly smaller robust accuracy than state-of-the-art attacks like PGD or the one of Carlini and Wagner, thus revealing an overestimation of the robustness by these state-of-the-art methods. Our attack is not based on a gradient descent scheme and is in this sense gradient-free, which makes it less sensitive to the choice of hyperparameters as no careful selection of the stepsize is required.



1 Introduction

Recent work has shown that state-of-the-art neural networks are non-robust [33, 11], in the sense that a small adversarial change of a correctly classified input (even one classified with high confidence) leads to a wrong decision, again potentially with high confidence. While [33, 11] have brought up this problem in object recognition tasks, the problem itself has been discussed for some time in the area of email spam classification [10, 20]. However, since machine learning is nowadays used as a component for automated decision making in safety-critical systems, e.g. autonomous driving or medical diagnosis systems, fixing this problem should have high priority as it potentially can lead to fatal failures beyond the eminent security issue.


While a lot of research has been done on attacks and defenses [28, 19, 18, 38], it has been shown that all existing defense strategies can be broken again [5, 2], with two exceptions. The first are methods which provide provable guarantees on the robustness of a network [16, 34, 13, 29, 36, 22, 35, 31, 8] and which have proposed new ways of training [36, 22] or of regularizing neural networks [13, 8] to make them more robust. While this area has made huge progress, it is still difficult to provide such guarantees for medium-sized networks [36, 22]. Thus the only way to evaluate robustness for large networks is still to use successful attacks, which provide, for every clean input, an upper bound on the norm of the minimal perturbation necessary to change the class. In fact, this approach to estimating robustness is symmetric to formal certificates, which are lower bounds on the actual robustness. The first attack scheme, based on L-BFGS, was proposed in [33]; afterwards research has produced a variety of adversarial attacks of growing effectiveness [11, 15, 23]. However, it has been recognized that simple attacks often fail when they face a defense created against those specific attacks, even though such a defense can easily be broken again using other, more powerful techniques [5, 2]. Apart from these white-box attacks (the model is known at attack time), several black-box attacks have also been proposed [19, 25, 4].
The second exception is adversarial training with a relatively powerful attack [21] based on projected gradient descent (PGD). This defense technique could not be broken even using the state-of-the-art attack of Carlini and Wagner [6, 5, 2].

In this paper we extend the white-box attack scheme proposed in [9], originally designed to attack fully-connected neural networks using ReLU-type activation functions. It is well known that these networks result in continuous piecewise affine functions [1], that is, the domain is decomposed into linear regions given as polytopes on which the classifier is affine. The principle of the attack of [9] is then to solve the minimal adversarial perturbation problem on each linear region, where it boils down to a convex optimization problem. In [9] it is reported that the attack outperforms the DeepFool attack [23] and the state-of-the-art Carlini-Wagner attack (CW) [6] in terms of relative improvement in the norm of the smallest perturbation needed to change the classifier decision. However, the attack has been limited to small fully-connected neural networks with up to 10000 neurons. In this paper we show that this attack can also be applied to convolutional, residual and dense networks with piecewise affine activation functions as well as max and average pooling layers, that it scales to networks consisting of more than 2.5 million neurons, and that it achieves state-of-the-art performance.

The main contributions of this paper are 1) an upscaling of the attack to large networks so that it can be applied to standard networks for CIFAR-10 and 2) support for the most common types of layers, e.g. convolutional and residual ones. The key to the upscaling is a very fast solver for the dual of the quadratic program which has to be solved for finding minimal $l_2$-perturbations. The employed accelerated gradient descent scheme quickly achieves medium accuracy, which is enough for our purposes. Moreover, we use the fact that the solver just needs matrix-vector products with the constraint matrix, so the explicit computation of the constraint matrix is not needed. This leads to a small memory footprint so that we can use this solver directly on the GPU. Finally, compared to [9] we have designed a more efficient sampling scheme for the next region to be checked. All these speed-ups together now allow us to attack networks as long as they basically fit into GPU memory. In this paper the largest network has over 2.8 million neurons, which is 280 times more than in [9]. We show that in most cases our attack performs at least as well as the best attack among PGD [21], DeepFool [23] and CW [6]. In particular our algorithm works well across architectures, datasets and training schemes, being always the most successful or very close to the best of the competitors. Notably, we show, especially for models trained with adversarial training [21] against the $l_\infty$-norm and for provably robust models [8, 36], that the other attacks overestimate, partially by a large margin, the robustness wrt the $l_2$-norm.
We thus recommend our attack if a reliable estimation of the real robustness of a network is needed, as our attack not only performs well on average but does not, unlike the established state-of-the-art attacks (PGD, DeepFool, CW), lead to gross overestimation of the robustness of the network in some cases.

2 Piecewise affine formulation of ReLU networks

It has been noted in [1, 9] that ReLU networks, that is networks that use only the ReLU activation function, result in continuous piecewise affine classifiers of the form $f: \mathbb{R}^d \to \mathbb{R}^K$, where $d$ is the input dimension and $K$ the number of classes. This implies that there exists a finite set of polytopes $\{Q_r\}_{r=1}^R$, with $\bigcup_{r=1}^R Q_r = \mathbb{R}^d$, such that on each polytope the classifier is an affine function, that is there exist $V_r \in \mathbb{R}^{K \times d}$ and $a_r \in \mathbb{R}^K$ such that $f(x) = V_r x + a_r$ holds for $x \in Q_r$. Note that, although we focus here on ReLU, the same property holds for any piecewise affine activation function, like Leaky ReLU. In the following we generalize the construction from [9], done for fully-connected networks, to other layer types, so that it extends to convolutional networks (CNNs), ResNets [12] and DenseNets [14].

We can write ReLU networks with hidden layers as the composition of functions, each standing for one of the layers (including the output layer), . Note that we consider the application of the activation function as a stand-alone layer (called ReLU layer). Denoting with , for , the number of units in layer (in particular and we assume ), we can define, for ,


and the output of the -th layer is obtained as


where is the input of the network. Then, the final output of the classifier is given by


If we make explicit the relation between each and the input we can recover the formulation of classifier as a function from to . While this definition of a classifier differs from the usual recursive formulation, it allows to handle also connections between non-consecutive layers (as it happens for residual networks [12]). Finally, the class which is assigned to is given by

Every layer has one of the following types: dense, convolutional, skip connection, ReLU (or leaky ReLU), avg-pooling, max-pooling, batch normalization. We now show how it is possible to rewrite each of them,

, , as an affine function


with , in the linear region corresponding to the input (see below for the definition).
First, let us notice that dense, convolutional, skip connection and batch normalization (at inference time) layers are already affine operations, which means .
Second, ReLU layers apply the function componentwise to the output of the previous layer. Thus, they can be, given sorted so that the first components correspond to , replaced by linear functions explicitly represented by the matrices defined as


Then, the desired affine function is .
Third, average pooling computes the mean over certain subsets of the input vector. For example, the average of the first four entries of is obtained, introducing

as . Then, since we have pools of elements, it is sufficient to create vectors similar to , with entries equal to in the positions of the elements we want to average and zero else. We use then these vectors as rows of the matrix , getting . We notice that does not depend on the input (as avg-pooling is already an affine function).
Finally, the construction of for max-pooling layers is analogous (as these layers return the maximum instead of the mean). The main difference is that in this case may change as does. In fact, going back to our example, if we want to extract the maximum among the first four entries of and assume that it is realized by the second component, we can set . If the position of the maximum changes, the vector changes as well. Again, returns the value we are interested in. If are the positions of the maxima for each of the pools, we can then build as

so that . Please notice that, similarly to the case of ReLU layers, avg- and max-pooling layers usually involve only the output of the immediately preceding layer.

Once we have computed the affine functions for every we can explicitly derive recursively the affine functions , represented by the matrices and vectors , satisfying the conditions


Let us start with , that is the first layer. Then and the are the linear function and the bias which define , namely and .
Assuming now that are available for , we get combining Equations (2) and (5) and the definition of , so that


which is affine as a composition of linear and affine functions is affine again.
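The recursion above can be illustrated on a very small case. The following is a minimal sketch, assuming a fully-connected network with a single ReLU layer and made-up weights; the names `W1`, `b1`, `W2`, `b2` and `affine_on_region` are hypothetical and not taken from the paper. It builds the affine map that agrees with the network on the linear region of a given input by masking the rows of inactive neurons.

```python
# Tiny network x -> W2 @ relu(W1 @ x + b1) + b2 with illustrative weights.

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 0.5]]
b1 = [0.0, -1.0, 0.5]
W2 = [[1.0, 2.0, -1.0], [-1.0, 0.5, 1.0]]
b2 = [0.1, -0.2]

def forward(x):
    pre = add(matvec(W1, x), b1)
    hidden = [max(0.0, t) for t in pre]
    return add(matvec(W2, hidden), b2)

def affine_on_region(x):
    """Return (V, a, pattern) such that f(z) = V z + a on the region of x."""
    pre = add(matvec(W1, x), b1)
    pattern = tuple(1 if t > 0 else 0 for t in pre)
    # The ReLU layer acts as a diagonal 0/1 matrix on this region, so it
    # simply zeroes the rows (and biases) of the inactive neurons.
    masked_W1 = [[s * w for w in row] for s, row in zip(pattern, W1)]
    masked_b1 = [s * b for s, b in zip(pattern, b1)]
    V = matmul(W2, masked_W1)
    a = add(matvec(W2, masked_b1), b2)
    return V, a, pattern

x = [1.0, 0.2]
V, a, pattern = affine_on_region(x)
affine_out = add(matvec(V, x), a)
```

The network output and the affine map coincide at `x` and at every other point with the same activation pattern, which is the property the attack relies on.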

It still remains to compute the polytope containing on which all the previous affine approximations hold exactly. First, note that is independent of its input (and thus of the input of the network as well) if is either a dense, convolutional, residual, avg-pooling or batch normalization layer, meaning that is equivalent to on the whole input space. Thus these layers do not contribute to the definition of . We are left to define where the linear reformulations of ReLU and max-pooling layers hold. As we noticed above, these kinds of layers only take into account the output of the immediately preceding layer. Therefore, while considering layer , we are allowed to act as if the only input of were .
Let be a ReLU layer and notice that the matrix computed for is the same for any vector whose components have the same sign as those of . Defining elementwise as

we then get the set

containing the points of which lead to the same matrix as . We note that the condition is equivalent to and that we are interested in the intersection of and the domain of layer . With (5), we define the polytope on which is affine,


The set defines the region of the input space containing and where and thus is an affine function.
If is instead a max-pooling layer, we can see that is preserved as long as the maximum within each pool is realized at the same position. We can denote the pools as the sets , whose elements are the indices of the components of the input (of the max-pooling layer) involved in the pool. Moreover we define for every

that is the index of the component of attaining the maximum for each pool . Then, for ,

is the set of the vectors in preserving the position of the maximum computed at for pool . Similar to what has been done for ReLU layers, we define


so that, finally, is the subset of the input space containing on which is an affine function.
Note that if is neither a ReLU nor a max-pooling layer. The polytope on which (and all layers below) is affine is given by

In the following we refer to as the linear region of . Note also that the intersection of with any other polytope is still a polytope (e.g. this is necessary when the input domain of a classifier is a subset of ). Note that the explicit storage of the matrices is not possible for large networks and high input dimension as one needs memory. In Section 4 it will turn out that our attack algorithm only requires matrix-vector products which can be done without computing explicitly and thus we can do the whole attack on the GPU as long as the network itself fits into GPU memory.
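To make the notion of a linear region concrete, here is a small sketch for a single ReLU layer applied to an affine layer. It is an illustration under simplifying assumptions (one hidden layer, made-up weights, hypothetical helper names), not the construction used in the paper's implementation. Each neuron contributes one half-space that fixes the sign of its preactivation to the sign observed at the reference point.

```python
def preactivations(W, b, z):
    """Output of the affine layer z -> W z + b that feeds the ReLU layer."""
    return [sum(w * zi for w, zi in zip(row, z)) + bi for row, bi in zip(W, b)]

def region_halfspaces(W, b, x):
    """One half-space <n_j, z> + t_j >= 0 per neuron, fixing the sign of its
    preactivation to the sign observed at x."""
    halfspaces = []
    for row, bi, pre in zip(W, b, preactivations(W, b, x)):
        sign = 1.0 if pre >= 0 else -1.0
        halfspaces.append(([sign * w for w in row], sign * bi))
    return halfspaces

def in_region(halfspaces, z):
    return all(sum(n * zi for n, zi in zip(normal, z)) + offset >= 0
               for normal, offset in halfspaces)

W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 0.5]]
b1 = [0.0, -1.0, 0.5]
x = [1.0, 0.2]
halfspaces = region_halfspaces(W1, b1, x)
inside = in_region(halfspaces, [1.05, 0.22])   # same signs as at x
outside = in_region(halfspaces, [0.0, 0.0])    # third neuron switches on
```

For deeper networks the half-spaces of later layers have to be expressed in terms of the network input, which is where the recursively defined affine maps of this section enter.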

3 Minimal adversarial perturbation inside a linear region

Classifiers based on neural networks have been shown to be vulnerable to adversarial samples, that is, they misclassify inputs which are almost indistinguishable from an original, correctly recognized test image [33, 11]. The minimal adversarial perturbation wrt an $l_p$-norm is defined as the solution of the following optimization problem

$$\min_{\delta \in \mathbb{R}^d} \|\delta\|_p \quad \text{s.t.} \quad \max_{l \neq c} f_l(x + \delta) \geq f_c(x + \delta), \quad x + \delta \in C, \tag{9}$$

with $C$ being a set of constraints the input of $f$ has to satisfy (in the following we assume that $C$ is a polytope), e.g. images scaled to be in $[0,1]^d$, $x$ is the original point and $c$ the class assigned to $x$ by $f$ (we assume $x$ is correctly classified by $f$). The $l_p$-norm of $\delta$ measures the difference between original and adversarial inputs (changing $p$ leads to adversarial samples with different properties). In practice, one often uses $p = 2$ or $p = \infty$. For simplicity we concentrate in this paper on $p = 2$, even though the framework can handle any $l_p$-norm provided that a fast solver is available for the following linearized problem (11). Note that (9) represents an untargeted attack, that is, we just want the decision to change but do not require that $x + \delta$ is classified as a particular class.

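The optimization problem (9) is in general non-convex and NP-hard [16]. However, as shown in [9], one can solve it efficiently inside every linear region of the classifier, that is, if we add to (9) the constraint $x + \delta \in Q(y)$, where $Q(y)$ is the linear region which contains the point $y$. In fact, recalling from Section 2 that $f(z) = Vz + a$ holds for $z \in Q(y)$ with some matrix $V$ (with rows $v_l$) and vector $a$, we introduce for $l \neq c$ the vectors $\delta_l$ as the solutions of the convex problems (note that we assume that $C$ is a polytope)

$$\delta_l = \operatorname{arg\,min}_{\delta \in \mathbb{R}^d} \|\delta\|_p \quad \text{s.t.} \quad \langle v_l - v_c, x + \delta \rangle + a_l - a_c \geq 0, \quad x + \delta \in Q(y) \cap C. \tag{10}$$

Then, the solution of Problem (9) restricted to the linear region $Q(y)$ is the $\delta_l$, $l \neq c$, of smallest $l_p$-norm. While we are mainly interested in untargeted attacks, we would like to highlight that targeted attacks against any of the classes are easily possible by solving instead the following problem for a target class $s \neq c$:

$$\min_{\delta \in \mathbb{R}^d} \|\delta\|_p \quad \text{s.t.} \quad \langle v_s - v_l, x + \delta \rangle + a_s - a_l \geq 0 \;\; \text{for all } l \neq s, \quad x + \delta \in Q(y) \cap C. \tag{11}$$

Please note that if one solved (10) for all possible linear regions and took the smallest perturbation, then this would be the exact solution of (9). However, due to the extremely large number of linear regions this is infeasible in practice. Thus we use a randomized scheme for selecting the next linear region, which is described in Section 4 together with the particular solver for the resulting quadratic program in (10) for the choice $p = 2$.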
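As a worked special case, consider the $l_2$-norm and drop both the linear-region and the box constraints. Inside a region the decision boundary between two classes is a hyperplane, so the smallest perturbation reaching it is an orthogonal projection with a closed form. The sketch below shows only this simplified case; the matrix `V`, the vector `a` and the function name are illustrative assumptions, and the full problem with constraints requires a QP solver as described in Section 4.

```python
import math

def closest_point_on_boundary(v_c, a_c, v_l, a_l, x):
    """Smallest l2 perturbation delta with <v_l - v_c, x + delta> + a_l - a_c = 0."""
    w = [p - q for p, q in zip(v_l, v_c)]
    margin = sum(wi * xi for wi, xi in zip(w, x)) + a_l - a_c
    scale = -margin / sum(wi * wi for wi in w)
    return [scale * wi for wi in w]

# Affine classifier on one region: scores(z) = V z + a (illustrative values).
V = [[1.0, -2.0], [-1.0, 2.0]]
a = [0.1, -0.2]
x = [1.0, 0.2]                       # classified as class 0 by V x + a

delta = closest_point_on_boundary(V[0], a[0], V[1], a[1], x)
delta_norm = math.sqrt(sum(d * d for d in delta))
x_adv = [xi + di for xi, di in zip(x, delta)]
scores = [sum(v * z for v, z in zip(row, x_adv)) + ai for row, ai in zip(V, a)]
```

After the projection both class scores coincide, so `x_adv` lies on the decision boundary. Whether it also lies inside the linear region and the feasible set has to be enforced by the additional constraints of the full problem.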

4 Generation of adversarial samples through randomized local search

In the following we present an improved selection scheme for the linear regions compared to the one in [9]. The observation motivating our scheme is that the decision surface separating areas of the input space assigned to different classes extends continuously across neighboring linear regions. If a point lying on the decision boundary is available, it is highly likely that other points on the decision boundary which are closer to the target image can be found in its vicinity. However, as pointed out in [9], it is very difficult to determine neighboring regions, as a large number of the constraints defining the polytope are active at the solution of (10). In this case the neighboring region is not unique and checking all of them is infeasible.
Thus we sample random points (more details below) in a small ball centered at the currently best point, that is the one realizing the smallest adversarial perturbation found so far, and then solve (10) in the corresponding linear region until we find a better adversarial sample.
Moreover, we save the activation patterns of the linear regions we have explored. Before checking a point and its corresponding linear region, we compare its activation pattern to those of the points we have already visited. If the activation patterns agree, the two points belong to the same region and we can skip checking it again.
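The bookkeeping of visited regions can be sketched as follows. This is a minimal illustration that assumes a single ReLU layer and a hash set of activation patterns; the class `RegionCache` and the weights are hypothetical. For a deeper network the pattern would be the concatenation over all ReLU layers, extended by the positions of the maxima for max-pooling layers.

```python
def activation_pattern(W, b, z):
    """Tuple of booleans: which neurons of relu(W z + b) are active at z."""
    return tuple(sum(w * zi for w, zi in zip(row, z)) + bi > 0
                 for row, bi in zip(W, b))

class RegionCache:
    """Remembers activation patterns of regions that were already checked."""
    def __init__(self):
        self._seen = set()

    def is_new(self, pattern):
        if pattern in self._seen:
            return False
        self._seen.add(pattern)
        return True

W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 0.5]]
b1 = [0.0, -1.0, 0.5]
cache = RegionCache()
points = [[1.0, 0.2], [1.05, 0.22], [0.0, 0.0]]
checked = [cache.is_new(activation_pattern(W1, b1, p)) for p in points]
```

The second point shares its pattern with the first one, so only the first and third points would trigger a call to the solver.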

Algorithm 1 shows our overall attack for a general $l_p$-norm, which tries to solve the optimization problem (9) for the minimal adversarial perturbation. In the experiments we use either or , that is we check resp. linear regions. Please note that Algorithm 1 needs to be fed with a feasible point of (9). There are several possibilities, e.g. an adversarial sample of a fast attack like DeepFool, as has been used in [9]. In this paper we prefer to be independent of other attacks. Thus we use the following scheme to choose starting points. At we rank the classes according to the components of the corresponding classifier output in descending order , where is the class which is assigned to . We choose the classes in the ranking and compute the point in the training set correctly classified by in class which is closest to for . In order to speed up the attack we do for each a binary search on and identify the point which is closest to but is classified differently from and use , , as starting perturbations for Algorithm 1.
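The binary search used to obtain a starting point can be sketched as below. The sketch assumes the search runs along the straight segment between the original input and a point of another class, and that the predicted class changes only once along that segment; the toy classifier `predict` and all names are made up for illustration.

```python
def binary_search_start(predict, x, y, steps=30):
    """Return a point on the segment [x, y] that is close to x but no longer
    classified as predict(x). Assumes predict(y) != predict(x)."""
    c = predict(x)
    if predict(y) == c:
        raise ValueError("y must be classified differently from x")
    lo, hi = 0.0, 1.0    # invariant: point(lo) has class c, point(hi) does not
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        z = [xi + mid * (yi - xi) for xi, yi in zip(x, y)]
        if predict(z) == c:
            lo = mid
        else:
            hi = mid
    return [xi + hi * (yi - xi) for xi, yi in zip(x, y)]

def predict(z):
    """Toy two-class classifier with decision boundary z[0] + z[1] = 1."""
    return 0 if z[0] + z[1] < 1.0 else 1

x = [0.0, 0.0]
y = [2.0, 2.0]
start = binary_search_start(predict, x, y)
start_perturbation = [s - xi for s, xi in zip(start, x)]
```

The returned perturbation is feasible for (9) by construction, which is all that the attack needs from its starting point.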

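Algorithm 1: Our attack
Input:  original image, starting perturbation
Output: adversarial perturbation
1  initialize the best perturbation and the best point from the starting perturbation
2  for each iteration do
3      sample a point according to (13)
4      if the region containing the sampled point has not been checked already then
5          compute the linear region of the sampled point
6          compute the solution of Problem (10) on this region
7          if the solution is better than the best perturbation then update the best perturbation and the best point
8      end if
9  end for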

At each of the iterations we sample a point around the current best (smallest $l_p$-norm) feasible point of (9). The following sampling scheme is biased towards , where is a parameter controlling the bias towards ( no bias, maximal bias) and is a parameter controlling how localized our search is (the larger , the more localized). We sample i) uniformly a point from the intersection of the unit sphere centered in and the hyperplane containing with normal vector , and ii) an angle given by

is the uniform distribution on the interval . We define . Note that by construction . Finally, give direction and step size to produce the next point whose linear region will be checked, defined as

Note that the larger the more biased will be towards . On the other hand our sampling scheme makes a difference between the half-sphere centered at with pole at versus the half-sphere with pole at . If samples from both half-spheres are equally probable, whereas if one samples just from the half-sphere pointing towards . At first sight it might look strange that we do not choose , as points sampled from the half-sphere pointing away from have larger distance from than . However, experiments on a small subset of points show that a value of leads to best results even though the difference to is not large and thus we fix it to for all experiments. Moreover, we use or for all experiments, noting anyway that the attack is not very sensitive to this value as in the range between and leads to very similar results.
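Step i) of the sampling scheme, drawing a uniform direction from the intersection of a unit sphere with a hyperplane through its center, can be sketched with a standard construction: draw a Gaussian vector, remove its component along the normal vector and normalize. The sketch covers only this step; the angle distribution and the step size of the paper are not reproduced here, and the function name and the example normal are assumptions.

```python
import math
import random

def sample_orthogonal_unit(normal, rng):
    """Uniform sample from {u : ||u||_2 = 1, <u, normal> = 0}.
    Requires a nonzero normal and dimension at least 2."""
    normal_sq = sum(v * v for v in normal)
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in normal]
        coef = sum(gi * ni for gi, ni in zip(g, normal)) / normal_sq
        u = [gi - coef * ni for gi, ni in zip(g, normal)]
        norm = math.sqrt(sum(v * v for v in u))
        if norm > 1e-12:          # retry in the (unlikely) degenerate case
            return [v / norm for v in u]

rng = random.Random(0)
normal = [1.0, 2.0, -1.0, 0.5]
u = sample_orthogonal_unit(normal, rng)
inner = sum(ui * ni for ui, ni in zip(u, normal))
length = math.sqrt(sum(ui * ui for ui in u))
```

Adding `u` to the center of the sphere gives a point in the hyperplane; uniformity follows from the rotational invariance of the Gaussian distribution.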

If $C$ is a polytope, e.g. $C = [0,1]^d$, then the optimization problems (10) and (11) are equivalent to linear programs (LP) for $p = 1$ and $p = \infty$ and equivalent to a quadratic program (QP) for $p = 2$. The main cost of the attack is solving this optimization problem. Next we describe an efficient, scalable way of solving (10) for $p = 2$, avoiding the explicit calculation of the linear regions.

4.1 A scalable and efficient solver for the quadratic program

Suppose we have a current best solution. We would like the solution of (10) to produce a new perturbation which improves on the current best one. This implies that as soon as we have a certificate that the optimal value of (10) is larger than the value attained by the current best perturbation, we can stop the solver, as checking this region will not yield an improvement. Thus we work with the dual of (10), as the dual objective is always a lower bound on the primal objective. As soon as we have found dual parameters realizing a dual objective larger than the value of the current best perturbation, we can stop.

In the following we first describe how we solve the generic resulting dual QP using accelerated gradient descent together with coordinate descent in a subset of the variables. Then we describe how this algorithm for solving the QP can be efficiently implemented on the GPU without ever having to compute the constraint matrix. Note that in [9] we used the commercial package Gurobi for solving the QP on the CPU. Now we present our own implementation, fully running on the GPU, which is roughly three orders of magnitude faster than our old implementation on the CPU and which allows us to deal with fully-connected, convolutional and residual layers.

Solving the dual problem.

As we are mainly interested in applications in computer vision we specialize to the case $C = [0,1]^d$ in (10), which can then be formulated as


Note that the formulation is different from (10) but can be transformed into each other using . The chosen formulation of the optimization problem in (14) is better adapted to the componentwise constraints imposed by . The primal problem is strongly convex and thus has a unique solution. We derive the dual problem as



and the inequalities in (15) are componentwise. The correspondence between the primal variable and the dual optimal variables is given by


Note however that even for dual feasible and , the primal variable need not be feasible. The KKT conditions are

This implies . Solving for yields


Thus for fixed we can directly find the optimal values of and . The dual problem is also a quadratic program but it is not necessarily strongly convex as does not need to be positive definite. However, the gradient

is Lipschitz continuous and the Lipschitz constant can be upper bounded as,


We estimate via the power method with iterations, which is enough to get already a quite accurate estimate. We solve the QP itself with accelerated projected gradient descent [26, 3, 7] in by setting and to their optimal values for given as in (17) which can be seen as a mixture of a coordinate descent in and accelerated projected gradient descent in . Note that in all steps we never need the matrix explicitly, but just matrix vector products or if we want to compute feasibility of the current primal variable . Even for the computation of we use the power method which also only requires matrix vector products. The only caveat is a good pre-conditioning of the problem, which can be achieved by normalizing the rows , of to have unit norm (with corresponding rescaling of ). One can compute them via matrix vector products , but this would require too many of them. We discuss how this can be resolved in the next section and how the whole QP solver can be ported to the GPU.
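The combination of a power-method estimate of the Lipschitz constant with accelerated projected gradient descent can be sketched on a simplified problem. The sketch assumes the primal problem $\min_\delta \frac{1}{2}\|\delta\|_2^2$ subject to $A\delta \leq b$ without box constraints, whose dual is $\min_{\lambda \geq 0} \frac{1}{2}\|A^T\lambda\|_2^2 + \langle b, \lambda\rangle$ with $\delta = -A^T\lambda$. This is not the dual derived in the paper, which has additional variables for the box constraints. All names, the safety factor 1.01 and the iteration counts are assumptions. The solver touches $A$ only through the two closures `A_mul` and `At_mul`.

```python
import math

def make_ops(A):
    """Closures for v -> A v and u -> A^T u; the solver uses nothing else."""
    def A_mul(v):
        return [sum(a * x for a, x in zip(row, v)) for row in A]
    def At_mul(u):
        return [sum(row[j] * ui for row, ui in zip(A, u)) for j in range(len(A[0]))]
    return A_mul, At_mul

def power_method(A_mul, At_mul, m, iters=50):
    """Estimate the largest eigenvalue of A A^T from matrix-vector products."""
    v, est = [1.0] * m, 0.0
    for _ in range(iters):
        w = A_mul(At_mul(v))
        est = math.sqrt(sum(x * x for x in w))
        v = [x / est for x in w]
    return est

def dual_objective(At_mul, b, lam):
    """Lower bound on the primal optimum for any lam >= 0 (weak duality)."""
    At_lam = At_mul(lam)
    return -0.5 * sum(x * x for x in At_lam) - sum(bi * li for bi, li in zip(b, lam))

def solve_dual(A_mul, At_mul, b, iters=2000, best_value=None):
    m = len(b)
    lip = 1.01 * power_method(A_mul, At_mul, m)   # small safety margin
    lam, y, t = [0.0] * m, [0.0] * m, 1.0
    for _ in range(iters):
        grad = [g + bi for g, bi in zip(A_mul(At_mul(y)), b)]
        lam_new = [max(0.0, yi - gi / lip) for yi, gi in zip(y, grad)]  # projection
        t_new = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t * t))
        y = [ln + (t - 1.0) / t_new * (ln - lo) for ln, lo in zip(lam_new, lam)]
        lam, t = lam_new, t_new
        # Early stop: this region cannot improve on the best value found so far.
        if best_value is not None and dual_objective(At_mul, b, lam) > best_value:
            break
    delta = [-x for x in At_mul(lam)]
    return lam, delta, dual_objective(At_mul, b, lam)

A = [[-1.0, -1.0], [1.0, 0.0]]    # delta_1 + delta_2 >= 1 and delta_1 <= 0.2
b = [-1.0, 0.2]
A_mul, At_mul = make_ops(A)
lam, delta, dual_value = solve_dual(A_mul, At_mul, b)
```

For this toy problem the optimum is $\delta = (0.2, 0.8)$ with value $0.34$. Passing `best_value=0.2` makes the solver stop as soon as the dual objective certifies that no perturbation with a smaller value exists in this region.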

4.2 Solving the QP efficiently on the GPU without explicit computation of the constraint matrix

As discussed at the end of the previous section, the QP solver via accelerated gradient descent does not require the explicit computation of as long as there is a way to compute matrix vector products and efficiently. While in [9] the matrix has been explicitly computed on the CPU, this is no longer feasible for larger networks as the memory consumption is , where is the input dimension and the total number of neurons. Even if one uses sparse matrix formats e.g. in the case of convolutional layers, this does not help to reduce the required memory significantly if the network is deep. Moreover, also the computation of the hyperplanes requires a computational cost equivalent to forward passes of the network.
Thus a major improvement of this paper compared to [9] is the transfer of all computations from the CPU to the GPU which is only possible if the matrix is not explicitly computed as the GPU memory would not suffice for this. The major insight to do this is that accelerated gradient descent only requires matrix-vector products of the form and . Note that contains basically the concatenated matrices from (7) and (8). However, we note that according to (5) it holds

and thus is nothing else than the Jacobian of with respect to and

Suppose for simplicity that

Then the Jacobian of at

is given by the chain rule as

Note that can be evaluated as

In the same way we can compute as

Thus calculating requires a single forward pass through the network and requires a forward pass for computing the values and then a backward pass through the network. More generally, the computation of the Jacobian-vector products can be done via automatic differentiation (forward-mode resp. reverse-mode automatic differentiation). Finally, to calculate the above expressions efficiently we still need a fast way to compute and for primitive functions, e.g. if is a convolution, then can be computed as a convolution as well and as the transposed convolution. Fortunately, modern implementations of automatic differentiation already come with a large collection of primitive functions and corresponding rules for and . Thus, we can directly and efficiently compute them on the GPU without computing the Jacobians themselves. Consequently, our QP solver does not require much more memory than the network itself, which allows it to scale to large networks.
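The idea of computing products with the Jacobian and its transpose layer by layer can be illustrated without any automatic differentiation library. The sketch below assumes a network with a single ReLU layer and made-up weights, for which the Jacobian at `x` is `W2 @ diag(mask) @ W1`; the function names `jvp` and `vjp` are chosen here for illustration. Neither function ever forms this matrix.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rmatvec(M, u):
    """Compute M^T u without building the transpose."""
    return [sum(row[j] * ui for row, ui in zip(M, u)) for j in range(len(M[0]))]

W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 0.5]]
b1 = [0.0, -1.0, 0.5]
W2 = [[1.0, 2.0, -1.0], [-1.0, 0.5, 1.0]]

def relu_mask(x):
    """Derivative of the ReLU layer at x: 1 for active neurons, 0 otherwise."""
    return [1.0 if p + bi > 0 else 0.0 for p, bi in zip(matvec(W1, x), b1)]

def jvp(x, v):
    """Jacobian-vector product J(x) v, evaluated in forward order."""
    mask = relu_mask(x)
    return matvec(W2, [m * t for m, t in zip(mask, matvec(W1, v))])

def vjp(x, u):
    """Product J(x)^T u, evaluated in backward order through the layers."""
    mask = relu_mask(x)
    return rmatvec(W1, [m * t for m, t in zip(mask, rmatvec(W2, u))])

x = [1.0, 0.2]
first_column = jvp(x, [1.0, 0.0])   # first column of the Jacobian at x
first_row = vjp(x, [1.0, 0.0])      # first row of the Jacobian at x
```

Both products cost about as much as one pass through the network, and their memory use is that of the layer outputs rather than of a matrix with one row per neuron.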

Note that for pre-conditioning of it would make sense to rescale the rows of to have unit norm (one has to rescale correspondingly also the vector ). While every row vector of can be obtained as and thus also just via matrix-vector products, doing this for every row is prohibitively expensive. Thus we use the fact that the norms of the row vectors corresponding to the same hidden layer have quite similar norms (typically we see increasing norms as one moves from lower to upper layers). Thus we just sample a small number of rows (in our case ) of each layer, compute their norms, take the mean of them and use the inverse of that as a rescaling factor for that layer. While this coarse pre-conditioning scheme is worse than if one rescales every row individually, it is significantly better than not doing any rescaling at all. There is one exception: we upscale the constraint of the decision boundary, as we have found that this leads to faster feasibility of this constraint which is the most important one of all the constraints.
Moreover, we do not need an accurate solution of (10) and thus we have found that in practice iterations of the accelerated gradient descent scheme suffice to get a reasonable solution. As the primal variable in (16) obtained from the dual variables need not be feasible, we explicitly check whether the output is an adversarial sample. If not, then we check via a small line search , where , if it is an adversarial sample as long as , where is the norm of the perturbation of the currently best adversarial sample . Finally, this leads to a scheme which is more than three orders of magnitude faster than that in [9].
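The layer-wise rescaling described above can be sketched as follows. The sketch assumes access to a routine `At_mul` that returns the product of the transposed constraint matrix with a vector, so that a single row is obtained from a unit vector. The number of sampled rows is a free parameter here, since the value used in the paper is not recoverable from this text; the matrix and the grouping of rows into layers are made up.

```python
import math
import random

def layer_scale(At_mul, layer_rows, total_rows, num_samples, rng):
    """Inverse of the mean l2-norm of a few sampled rows of one layer."""
    chosen = rng.sample(layer_rows, min(num_samples, len(layer_rows)))
    norms = []
    for i in chosen:
        unit = [0.0] * total_rows
        unit[i] = 1.0
        row = At_mul(unit)                  # row i of the constraint matrix
        norms.append(math.sqrt(sum(v * v for v in row)))
    return 1.0 / (sum(norms) / len(norms))

# Illustrative constraint matrix: rows 0-2 belong to one layer, row 3 to another.
A = [[2.0, 0.0], [0.0, 2.0], [1.2, 1.6], [6.0, 8.0]]

def At_mul(u):
    return [sum(row[j] * ui for row, ui in zip(A, u)) for j in range(len(A[0]))]

rng = random.Random(0)
scale_lower = layer_scale(At_mul, [0, 1, 2], len(A), 2, rng)
scale_upper = layer_scale(At_mul, [3], len(A), 2, rng)
```

Each sampled row costs one product with the transposed matrix, so the cost of the pre-conditioning is controlled directly by the number of samples per layer.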

5 Experiments

average difference to the best robust accuracy
model      PGD-1    PGD-10k  CW-10k   CW-100k  DF       ours
MNIST      0.2367   0.1011   0.1701   0.1681   0.3135   0.0051
GTS        0.0361   0.0237   0.0177   0.0172   0.0643   0
CIFAR-10   0.1123   0.0879   0.0037   0.0029   0.0804   0.0052

maximum difference to the best robust accuracy
model      PGD-1    PGD-10k  CW-10k   CW-100k  DF       ours
MNIST      0.7800   0.5040   0.6200   0.6120   0.9000   0.0500
GTS        0.1600   0.1260   0.0440   0.0420   0.1140   0
CIFAR-10   0.3300   0.2720   0.0180   0.0180   0.1220   0.0140

Table 1: Performance of different attacks. For each dataset, attack and threshold, we compute the difference between the robust accuracy estimated by an attack and the best one among those of all the attacks. We report here, for each dataset and attack, the mean (top) and the maximum (bottom) of these differences across the models and thresholds. We can see that our attack has the smallest average distance from the best on two of three datasets and always achieves the best maximal distance. Notably, on GTS both mean and maximum for our attack are 0, which means that it gets the lowest robust accuracy for every model and threshold.

In this section we show that our attack often outperforms the state-of-the-art methods for computing upper bounds on the robust accuracy of a model, which is defined, for a given threshold $\epsilon$, as the minimal accuracy that the classifier can achieve if each test sample is allowed to be perturbed within an $l_p$-norm ball of radius $\epsilon$ in order to achieve a misclassification. The smaller the robust accuracy found, the stronger the attack and the less robust the network. We focus here on the $l_2$-attack. We make the implementation of our attack publicly available.
We show that current state-of-the-art attacks sometimes overestimate the robustness of classifiers. In fact, with our attack we are often able to achieve smaller robust accuracy than our competitors, and even when we do not, we never overestimate the robust test accuracy by more than 5.0% compared to the minimal one found by the competitors. In contrast, all the other attacks have cases where they achieve a robust accuracy at least 50.4% larger than that provided by our method (see Table 1). Thus if one evaluates robustness using only the competing attacks, one would consider models robust which are in fact quite non-robust. Our technique does not show a similar weakness in any setting, which indicates that our algorithm is, on the one hand, able to recover small adversarial perturbations in general and, on the other hand, less susceptible to changes in the characteristics of the network. Interestingly, we notice that gradient-based methods suffer especially when attacking models trained with $l_\infty$-adversarial training.
We consider three datasets: MNIST, German Traffic Sign (GTS) [32] and CIFAR-10 [17] (all images are scaled to $[0,1]$). On each of them three models are trained: the plain model (plain), one with $l_\infty$-adversarial training [21] (called $l_\infty$-at) and one with $l_2$-adversarial training ($l_2$-at). More details about architectures and training are provided below.

We compare our attack against Projected Gradient Descent on the loss function (PGD) [21], the Carlini-Wagner $l_2$-attack (CW) [6] and DeepFool (DF) [23]. We use two versions of PGD: PGD-1 uses a single starting point, while PGD-10k exploits 10000 restarts, randomly sampled in the $l_2$-ball of radius $\epsilon$ around the original image, where $\epsilon$ is the threshold at which we want to evaluate robust accuracy. This large number of restarts is motivated by a recent paper which could break a certain defense only when using 10000 restarts of PGD [24]. For both PGD versions we set iterations and we use a step size of . Similarly, we evaluate CW in the implementation of [27] with 40 binary search steps and either 10000 (CW-10k) or 100000 iterations (CW-100k). We use the DF implementation as in [30].

Since the objective of PGD is only to determine whether there exists an adversarial sample with norm less than a given threshold, it directly provides the robust accuracy at that threshold (and must be rerun for each threshold). In contrast, CW, DeepFool and our attack try to find the minimal adversarial perturbation as in (9). After running these attacks, we compute the robust accuracy for a given threshold as the fraction of points whose adversarial examples are farther, in -distance, than the threshold. Note that we only check correctly classified points for all methods. The values of robust accuracy achieved by all attacks at the different thresholds are reported in Tables 2 (MNIST), 3 (GTS) and 4 (CIFAR-10).
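The conversion from minimal perturbation norms to a robust accuracy curve can be sketched as follows. This is only an illustration under stated assumptions: the function name `robust_accuracy`, the toy norms and the convention of marking "no adversarial example found" with infinity are hypothetical and not taken from the original code. Misclassified points count as non-robust, so the value at threshold 0 equals the clean accuracy.

```python
import numpy as np

def robust_accuracy(min_adv_norms, correct, threshold):
    """Fraction of all points that are correctly classified and whose
    smallest adversarial perturbation found is strictly larger than
    the threshold (hypothetical helper, not the authors' code)."""
    min_adv_norms = np.asarray(min_adv_norms, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    survives = correct & (min_adv_norms > threshold)
    return float(survives.mean())

# Toy data: norm of the smallest adversarial perturbation found per point.
# np.inf marks a point for which the attack found no adversarial example.
norms = [0.3, 1.2, 2.0, np.inf, 0.7]
correct = [True, True, True, True, False]  # last point is misclassified

# A single run of a minimal-perturbation attack yields the whole curve.
curve = {eps: robust_accuracy(norms, correct, eps) for eps in (0.0, 0.5, 1.0, 1.5)}
print(curve)
```

Because the norms are computed once, evaluating an additional threshold costs nothing, which is the practical difference to a threshold-based attack such as PGD.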

In order to thoroughly evaluate the effectiveness of an attack, it is necessary to assess both average and worst case performance. In this way one can see whether it overfits to some particular model, dataset or training scheme. For every dataset we compute, for all thresholds, the difference between the robust accuracy provided by each attack and the minimal robust accuracy across all attacks at that threshold. Thus the worse a method performs, the larger the difference is. In Table 1 we report for each attack the mean and the maximal difference from the best accuracy, across the three models and five thresholds, for each of the three datasets. It can be seen directly from this table that our attack has the best worst case performance on all three datasets and, at the same time, the best average performance on two of the three datasets, with only a tiny difference in the case where it is worse.
In particular, we can see how on MNIST the second best attack (PGD-10k) is on average worse than the minimal robust accuracy, compared to for our attack, while in the worst case it returns a robust accuracy 50.4% larger than the minimal one versus 5.0% for our method. On GTS, both average and maximal difference are 0.0% for our attack, meaning that it always achieves the minimal robust accuracy among all competing methods. Although on CIFAR-10 we cannot match the average result of CW attack, our attack has nevertheless the best worst case performance, highlighting the quality of our approach.
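The statistic described above can be sketched in a few lines. The sketch below is an illustration rather than the authors' evaluation code: it copies the 15 attacked rows of Table 2 (MNIST) into an array and computes, for each attack, the mean and maximal gap to the best robust accuracy in each row. The variable names are hypothetical. On this data the maximal gaps for PGD-10k and for our attack come out as the 50.4% and 5.0% quoted in the text.

```python
import numpy as np

attacks = ["PGD-1", "PGD-10k", "CW-10k", "CW-100k", "DF", "ours"]

# Rows: the 15 (model, threshold) settings of Table 2, clean rows omitted.
table2 = np.array([
    [0.928, 0.926, 0.926, 0.926, 0.936, 0.926],  # plain model
    [0.508, 0.472, 0.474, 0.474, 0.586, 0.474],
    [0.168, 0.106, 0.088, 0.088, 0.198, 0.078],
    [0.106, 0.028, 0.006, 0.006, 0.018, 0.002],
    [0.078, 0.014, 0.000, 0.000, 0.000, 0.000],
    [0.930, 0.930, 0.926, 0.926, 0.938, 0.926],  # second model
    [0.838, 0.834, 0.846, 0.848, 0.872, 0.834],
    [0.698, 0.672, 0.706, 0.706, 0.790, 0.680],
    [0.468, 0.366, 0.466, 0.464, 0.674, 0.416],
    [0.192, 0.096, 0.170, 0.172, 0.542, 0.112],
    [0.924, 0.878, 0.888, 0.888, 0.948, 0.736],  # third model
    [0.886, 0.748, 0.774, 0.776, 0.932, 0.258],
    [0.812, 0.536, 0.652, 0.644, 0.918, 0.032],
    [0.758, 0.248, 0.552, 0.538, 0.904, 0.004],
    [0.658, 0.064, 0.480, 0.468, 0.848, 0.000],
])

# Gap of every attack to the best (smallest) robust accuracy in each row.
gap = table2 - table2.min(axis=1, keepdims=True)

mean_gap = dict(zip(attacks, gap.mean(axis=0)))
max_gap = dict(zip(attacks, gap.max(axis=0)))

for name in attacks:
    print(f"{name:8s} mean gap {100 * mean_gap[name]:5.2f}%  max gap {100 * max_gap[name]:5.2f}%")
```

A gap of zero in a row means the attack matched the strongest result there, so a small maximal gap indicates that the attack never fails badly on any model or threshold.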

5.1 Main experiments: details

robust accuracy on MNIST
model      threshold  PGD-1    PGD-10k  CW-10k   CW-100k  DF       ours
plain      0.0        0.984 (clean accuracy)
           0.5        0.928    0.926    0.926    0.926    0.936    0.926
           1.0        0.508    0.472    0.474    0.474    0.586    0.474
           1.5        0.168    0.106    0.088    0.088    0.198    0.078
           2.0        0.106    0.028    0.006    0.006    0.018    0.002
           2.5        0.078    0.014    0.000    0.000    0.000    0.000
-at        0.0        0.986 (clean accuracy)
           1.0        0.930    0.930    0.926    0.926    0.938    0.926
           1.5        0.838    0.834    0.846    0.848    0.872    0.834
           2.0        0.698    0.672    0.706    0.706    0.790    0.680
           2.5        0.468    0.366    0.466    0.464    0.674    0.416
           3.0        0.192    0.096    0.170    0.172    0.542    0.112
-at        0.0        0.984 (clean accuracy)
           1.0        0.924    0.878    0.888    0.888    0.948    0.736
           1.5        0.886    0.748    0.774    0.776    0.932    0.258
           2.0        0.812    0.536    0.652    0.644    0.918    0.032
           2.5        0.758    0.248    0.552    0.538    0.904    0.004
           3.0        0.658    0.064    0.480    0.468    0.848    0.000
Table 2: Robustness of MNIST models. We report upper bounds on the robust accuracy, that is the fraction of points in the test set which are still correctly classified when any perturbation of -norm smaller than or equal to is allowed in order to achieve a misclassification (a smaller robust accuracy means a stronger attack). The statistics are computed on the first 500 points of the MNIST test set.


For MNIST we use the same architecture as in [21], consisting of 2 convolutional layers of 16 and 32 filters, each followed by max-pooling, and 2 dense layers. In particular, the plain and - trained models are the natural and secret models of the “MNIST Adversarial Examples Challenge”, based on [21]. For -at we adapted the code of [21] to perform adversarial training using the PGD attack wrt the -norm with and 40 iterations. The clean accuracy of the models can be found in Table 2 in the row corresponding to threshold 0.0. Moreover, we use different starting points for our attack (corresponding to five classes) and as the maximum number of linear regions checked, or equivalently iterations in Algorithm 1, for each starting point. Finally, we set the parameter in Algorithm 1 to .
In Table 2 we report the robust accuracy, computed on 500 points of the test set, for the three models when the -norm of the perturbations is bounded by the given threshold. We see that in most cases our attack achieves the best performance. In particular, on the -trained model the other, gradient-based, methods all suggest that the classifier is highly robust, while our attack shows that this is not the case, as it turns out to be just slightly less vulnerable to adversarial examples than the plain model (e.g. at threshold 2.0 the best of the other attacks reduces accuracy only to 53.6%, while our technique brings it down to 3.2%).
Notably, in [31] the same -at model was tested and, taking the pointwise best output among those of 11 attacks of various nature, the authors could not reduce the robust accuracy below 35% at threshold 1.5. On the other hand, our attack alone, without even testing all 9 possible target classes, yields an upper bound on the robust accuracy at the same threshold of 25.8%, which is almost 10% less than the current state-of-the-art [4].

robust accuracy on GTS
model      threshold  PGD-1    PGD-10k  CW-10k   CW-100k  DF       ours
plain      0.0        0.946 (clean accuracy)
           0.1        0.746    0.746    0.754    0.754    0.788    0.740
           0.2        0.568    0.562    0.566    0.566    0.628    0.550
           0.4        0.360    0.348    0.334    0.334    0.408    0.316
           0.6        0.298    0.274    0.214    0.214    0.292    0.178
           0.8        0.268    0.234    0.124    0.124    0.210    0.108
-at        0.0        0.908 (clean accuracy)
           0.1        0.818    0.818    0.826    0.820    0.826    0.818
           0.2        0.708    0.704    0.706    0.710    0.728    0.704
           0.4        0.488    0.472    0.496    0.496    0.538    0.468
           0.6        0.328    0.320    0.322    0.322    0.378    0.320
           0.8        0.222    0.218    0.224    0.222    0.284    0.212
-at        0.0        0.904 (clean accuracy)
           0.25       0.690    0.686    0.692    0.692    0.718    0.686
           0.5        0.460    0.446    0.468    0.466    0.500    0.444
           0.75       0.302    0.288    0.300    0.300    0.338    0.280
           1.0        0.212    0.200    0.214    0.214    0.246    0.194
           1.25       0.164    0.130    0.116    0.114    0.172    0.072
Table 3: Robustness of GTS models. We report upper bounds on the robust accuracy, that is the fraction of points in the test set which are still correctly classified when any perturbation of -norm smaller than or equal to is allowed in order to achieve a misclassification (a smaller robust accuracy means a stronger attack). The statistics are computed on the first 500 points of the GTS test set.


In this case the models are CNNs with 2 convolutional layers (16 and 32 feature maps) with stride 2, which replaces max-pooling for downsizing, and 2 dense layers. Adversarial training is based on 40 iterations of PGD attack, with for -at and for -at. Since GTS has 43 classes, we run our algorithm with starting points, 500 linear regions each and .
Table 3 shows that the upper bounds on robust accuracy obtained through our technique, computed on the first 500 images of the test set, are always smaller than those of the competitors, apart from 4 cases out of 15 where the PGD results match ours. We notice that, although in some cases the difference is not extremely large, in 3 of 15 settings our attack reduces the robust accuracy by at least 2% compared to the best result of the other methods, with a maximum of 4.2% at threshold 1.25 for the -trained model (robust accuracy of 11.4% by CW-100k vs 7.2% for our attack). It is interesting to notice that, similarly to MNIST, the other attacks seem to struggle on the models adversarially trained wrt -norm.

robust accuracy on CIFAR-10
model      threshold  PGD-1    PGD-10k  CW-10k   CW-100k  DF       ours
plain      0.0        0.892 (clean accuracy)
           0.1        0.684    0.676    0.694    0.694    0.722    0.690
           0.15       0.546    0.538    0.554    0.552    0.626    0.550
           0.2        0.452    0.434    0.434    0.432    0.512    0.434
           0.3        0.312    0.288    0.216    0.216    0.338    0.220
           0.4        0.268    0.220    0.094    0.092    0.208    0.098
-at        0.0        0.812 (clean accuracy)
           0.25       0.658    0.656    0.660    0.660    0.670    0.656
           0.5        0.502    0.492    0.482    0.482    0.538    0.478
           0.75       0.464    0.424    0.324    0.324    0.422    0.324
           1.0        0.448    0.408    0.212    0.204    0.300    0.216
           1.25       0.444    0.386    0.114    0.114    0.224    0.124
-at        0.0        0.794 (clean accuracy)
           0.25       0.644    0.644    0.646    0.646    0.670    0.644
           0.5        0.502    0.490    0.484    0.484    0.530    0.488
           0.75       0.448    0.420    0.332    0.332    0.414    0.334
           1.0        0.430    0.396    0.226    0.228    0.326    0.228
           1.25       0.418    0.382    0.120    0.120    0.242    0.130
Table 4: Robustness of CIFAR-10 models. We report upper bounds on the robust accuracy, that is the fraction of points in the test set which are still correctly classified when any perturbation of -norm smaller than or equal to is allowed in order to achieve a misclassification (a smaller robust accuracy means a stronger attack). The statistics are computed on the first 500 points of the CIFAR-10 test set.


Since CIFAR-10 represents a more difficult classification task, we use for it a deeper and wider architecture, made of 8 convolutional layers (with the number of filters increasing from 96 to 384) and 2 dense layers, which contains more than 375000 units overall. We again perform adversarial training with the PGD attack, with 10 iterations, and for - and -robust training respectively. Here we run our attack with 400 iterations and 3 starting points, fixing .
The statistics over the first 500 points of the test set are summarized in Table 4. Although on this dataset the best performance is achieved by different methods in different settings, our attack clearly outperforms PGD and DF and is at most 1.4% off the best robust accuracy. The CW attack performs very well here, but it still has a slightly worse performance in the worst case setting, as we can see in Table 1.
Moreover, these CIFAR-10 networks are less robust than those trained on MNIST and GTS, so that the task of crafting small adversarial examples is easier than in the previous cases. This implies that even weak attackers can succeed in finding good, maybe almost optimal, adversarial perturbations.

robust accuracy on MNIST
model      threshold  PGD-1k   PGD-10k  CW-10k   CW-100k  DF       ours
- MMR-at   0.0        0.988 (clean accuracy)
           1.0        0.828    0.816    0.854    0.854    0.868    0.704
           1.5        0.488    0.428    0.642    0.642    0.682    0.250
           2.0        0.310    0.270    0.414    0.412    0.642    0.048
           2.5        0.222    0.180    0.196    0.194    0.238    0.004
           3.0        0.136    0.116    0.074    0.070    0.084    0.000
-KW        0.0        0.982 (clean accuracy)
           1.0        0.924    0.910    0.924    0.924    0.926    0.854
           1.5        0.674    0.600    0.834    0.834    0.898    0.478
           2.0        0.226    0.176    0.664    0.662    0.844    0.148
           2.5        0.030    0.020    0.454    0.454    0.784    0.018
           3.0        0.002    0.000    0.264    0.264    0.644    0.002
- MMR-at   0.0        0.986 (clean accuracy)
           1.0        0.848    0.848    0.850    0.850    0.868    0.842
           1.5        0.608    0.606    0.622    0.622    0.682    0.576
           2.0        0.286    0.270    0.312    0.312    0.462    0.238
           2.5        0.050    0.048    0.090    0.090    0.238    0.044
           3.0        0.016    0.012    0.032    0.030    0.084    0.010
-KW        0.0        0.988 (clean accuracy)
           1.0        0.916    0.916    0.914    0.914    0.928    0.912
           1.5        0.722    0.716    0.740    0.740    0.826    0.692
           2.0        0.392    0.366    0.438    0.438    0.690    0.298
           2.5        0.214    0.202    0.166    0.166    0.478    0.078
           3.0        0.172    0.152    0.046    0.046    0.292    0.012
Table 5: Provably robust MNIST models. We report upper bounds on the robust accuracy, that is the fraction of points in the test set which are still correctly classified when any perturbation of -norm smaller than or equal to is allowed in order to achieve a misclassification (a smaller robust accuracy means a stronger attack). The statistics are computed on the first 500 points of the MNIST test set.
robust accuracy on CIFAR-10
model      threshold  PGD-1k   PGD-10k  CW-10k   CW-100k  DF       ours
- MMR-at   0.0        0.638 (clean accuracy)
           0.25       0.504    0.504    0.490    0.490    0.498    0.484
           0.5        0.332    0.330    0.340    0.340    0.348    0.314
           0.75       0.180    0.174    0.176    0.174    0.210    0.154
           1.0        0.066    0.064    0.070    0.070    0.096    0.056
           1.25       0.036    0.034    0.032    0.032    0.050    0.028
-KW        0.0        0.532 (clean accuracy)
           0.25       0.390    0.390    0.376    0.376    0.374    0.364
           0.5        0.238    0.236    0.218    0.218    0.236    0.216
           0.75       0.132    0.130    0.128    0.128    0.146    0.104
           1.0        0.060    0.060    0.064    0.064    0.082    0.036
           1.25       0.018    0.018    0.032    0.032    0.036    0.014
- MMR-at   0.0        0.618 (clean accuracy)
           0.25       0.418    0.418    0.404    0.404    0.412    0.398
           0.5        0.270    0.266    0.264    0.262    0.284    0.252
           0.75       0.146    0.144    0.146    0.146    0.174    0.128
           1.0        0.076    0.076    0.094    0.094    0.104    0.064
           1.25       0.032    0.032    0.050    0.050    0.054    0.024
-KW        0.0        0.614 (clean accuracy)
           0.25       0.492    0.492    0.478    0.478    0.480    0.478
           0.5        0.384    0.384    0.374    0.374    0.376    0.360
           0.75       0.266    0.266    0.262    0.262    0.284    0.246
           1.0        0.172    0.172    0.176    0.176    0.190    0.152
           1.25       0.094    0.092    0.108    0.108    0.122    0.082
Table 6: Provably robust CIFAR-10 models. We report upper bounds on the robust accuracy, that is the fraction of points in the test set which are still correctly classified when any perturbation of -norm smaller than or equal to is allowed in order to achieve a misclassification (a smaller robust accuracy means a stronger attack). The statistics are computed on the first 500 points of the CIFAR-10 test set.

5.2 Testing provably robust models

In this section we test classifiers trained to be provably robust, that is, classifiers for which it is possible to certify, for a large fraction of the test points, whether or not there exists an adversarial perturbation with norm smaller than a fixed threshold. This means that non-trivial lower bounds on the robust accuracy are provided. As for upper bounds, we mostly have to rely, especially for the case, on the adversarial examples provided by the attacks. Using powerful attacks thus also makes it possible to correctly assess the tightness of the lower bounds or, equivalently, the effectiveness of the verification methods.
We consider the models presented in [8], that is CNNs with 2 convolutional layers of 16 and 32 filters and a hidden fully-connected layer of 100 units. These are trained with the techniques of either [8] (called MMR) or [36, 37] (KW) to be robust wrt the -norm at for MNIST and for CIFAR-10, wrt the -norm at for MNIST and . We decide to test the robustness of all the models at thresholds larger than those used for robust training, since at the training thresholds the uncertainty on robust accuracy is limited, as tight bounds on it are available (see [8]). We run our attack for 500 regions and 5 starting points. In Table 5 (MNIST) and Table 6 (CIFAR-10) we report, similarly to the previous section, the upper bounds on the robust accuracy provided by the different attacks, computed on 500 test points (here we use PGD-1k with 1000 restarts instead of the weaker version with a single restart).
For both datasets we see that our attack outperforms the competitors, often significantly, the only exception being the largest threshold on the model trained with the KW technique wrt -norm on MNIST. Moreover, note that, similarly to Table 2, the largest differences (over 22% between the upper bounds on robust accuracy of PGD-10k and our attack) are reached for the classifier trained on MNIST with adversarial training from [21] wrt .

robust accuracy of large models
model      threshold  PGD-1k   DF       ours
plain      0.0        0.96 (clean accuracy)
           0.05       0.78     0.83     0.78
           0.075      0.61     0.75     0.60
           0.1        0.43     0.63     0.44
           0.15       0.18     0.42     0.18
           0.2        0.08     0.26     0.07
-at        0.0        0.85 (clean accuracy)
           0.25       0.72     0.75     0.71
           0.5        0.53     0.62     0.54
           0.75       0.36     0.51     0.37
           1.0        0.23     0.44     0.22
           1.25       0.15     0.37     0.12
Table 7: Large models. We report here the robust accuracy, that is an upper bound on the fraction of points in the test set which are correctly classified when any perturbation of -norm smaller than or equal to is allowed (a smaller robust accuracy means a stronger attack). The statistics are computed on the first 100 points of the CIFAR-10 test set.
Figure 1: Progression of our attack for different sampling schemes. We show median (left), maximum (center) and mean (right) of the norms of the adversarial perturbations found by our attack as a function of the explored linear regions. We repeat the experiments for different values of (see Equation (12)), represented in different colors, and with the uniform sampling scheme () from [9] as a comparison.

5.3 Attacking large models

In order to show the scalability of our approach to large models, we here attack the networks from the “CIFAR-10 Adversarial Examples Challenge” trained on CIFAR-10 with either plain or -adversarial training [21] (called naturally trained and secret in the original challenge). The architecture used is a residual convolutional network consisting of a convolutional layer, five residual blocks and a fully-connected layer, derived from the “w32-10 wide” variant of the TensorFlow model repository, with 2,883,593 units. In order to apply our algorithm we had to replace the per-image normalization, which is not an affine operation on the input, with the following step: for each input image, we subtract the mean of its entries and divide the result by a constant (0.21, which approximates the average standard deviation across the images of the training set). Note that this small variation does not affect the performance of the classifier, while it makes the network a piecewise affine function.
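The difference between the two normalization steps can be checked numerically. The sketch below is an illustration under stated assumptions: `per_image_standardize` is a simplified stand-in for the per-image normalization of the original network (it divides by the image's own standard deviation), `affine_normalize` follows the replacement described above with the constant 0.21, and both function names are hypothetical. An affine map commutes with affine combinations of its inputs, which is what the check tests.

```python
import numpy as np

FIXED_STD = 0.21  # constant quoted in the text

def per_image_standardize(x):
    # Simplified stand-in for per-image normalization: the divisor
    # depends on the input, so the map is not affine in x.
    return (x - x.mean()) / x.std()

def affine_normalize(x, scale=FIXED_STD):
    # Replacement step: subtract the image mean, divide by a constant.
    return (x - x.mean()) / scale

rng = np.random.default_rng(0)
x = rng.random((32, 32, 3))  # two random "images" in [0, 1)
y = rng.random((32, 32, 3))
t = 0.3
mix = t * x + (1 - t) * y

# For an affine map f: f(t*x + (1-t)*y) == t*f(x) + (1-t)*f(y).
affine_ok = np.allclose(
    affine_normalize(mix),
    t * affine_normalize(x) + (1 - t) * affine_normalize(y),
)
standard_ok = np.allclose(
    per_image_standardize(mix),
    t * per_image_standardize(x) + (1 - t) * per_image_standardize(y),
)
print("constant divisor is affine:", affine_ok)
print("per-image std divisor is affine:", standard_ok)
```

With the constant divisor the preprocessing composes with the ReLU layers into a piecewise affine function of the input, which is the property the attack relies on.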

In Table 7 we report the robust accuracy, on the first 100 test points, given by the three methods (in this case we use PGD with 1000 rather than 10000 restarts, as the latter would be computationally too expensive). We omit CW since with the default parameters it fails to provide meaningful results. For our method we use 5 starting points. While DeepFool is always worse than the others, PGD and our attack perform similarly, although the largest gap (3%, achieved at threshold 1.25 for the -at model) is in favour of our method.

5.4 Runtime comparison

We analyze the runtime the different attacks take to return results on 500 test points on the plain model on CIFAR-10 of Section 5.1 using a single GPU. Note that CW, DeepFool and our method aim at finding the minimal adversarial perturbation within a limited budget of iterations, while PGD takes a threshold as input and looks for a manipulation with norm smaller than it, but does not try to minimize it. This means that, in order to build Table 4, one has to run PGD once for each threshold. Conversely, for the other attacks a single run is sufficient to compute the robust accuracy at every threshold.
We compare the runtime of the attacks in the setting used for the experiment in Table 4, and report the total time needed to run the attacks on 500 different points on a single GPU: PGD-10k takes around 18 hours for a single threshold. CW-100k needs 55 hours in total and our method takes 150 hours (using 3 starting points), while the fastest but also weakest attack is DeepFool with a runtime of less than one minute.
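To put the per-threshold cost of PGD next to the single-run cost of the other attacks, the arithmetic can be spelled out as below. This is a back-of-the-envelope sketch: the hour figures are the approximate values quoted above, and the count of five thresholds is an assumption taken from the five thresholds listed for the plain model in Table 4.

```python
# Approximate runtimes quoted in the text (hours, 500 points, single GPU).
PGD_10K_HOURS_PER_THRESHOLD = 18  # "around 18 hours for a single threshold"
CW_100K_HOURS_TOTAL = 55
OURS_HOURS_TOTAL = 150

# Assumption: five thresholds, as listed for the plain model in Table 4.
NUM_THRESHOLDS = 5

# PGD has to be rerun for every threshold; the other attacks run once.
pgd_10k_hours_total = PGD_10K_HOURS_PER_THRESHOLD * NUM_THRESHOLDS

totals = {
    "CW-100k": CW_100K_HOURS_TOTAL,
    "PGD-10k": pgd_10k_hours_total,
    "ours": OURS_HOURS_TOTAL,
}
ranking = sorted(totals, key=totals.get)
for name in ranking:
    print(f"{name:8s} about {totals[name]} hours for all thresholds")
```

Under these assumptions PGD-10k needs roughly 90 hours for the full set of thresholds, which places it between CW-100k and our attack.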

5.5 Choosing parameter for sampling

In order to choose a proper parameter for the sampling scheme in Equation (12) we run our attack on the MNIST plain model, already introduced in Section 5.1, with and with the scheme proposed in [9], where the next linear region to check is chosen by sampling uniformly a direction from the current best solution (corresponding to ). In Figure 1 we show the development of median (left), maximum (center) and mean (right) of the -norms of the adversarial perturbations found as a function of the explored linear regions. We can see that the final values of the statistics do not differ significantly. However, since we are interested in getting quickly results close to optimal, is preferable as the maximum for this run is the fastest to converge to the final solution.

Figure 2: Progression of our attack. In each row, the first image is from the training set, the second is obtained with the linear search towards the target image (last image) for which we create an adversarial example. The other images are intermediate adversarial images found by our attack (the seventh is the final output). Apart from the starting image and the target image, all are on the decision boundary, that is between the classes indicated on top of each picture ( means it is on the decision boundary between class and ). We also report the -distance between each image and the target image (d). First three rows: non-robust plain model, last three rows: (first) and (second, third) adversarially trained models on MNIST, GTS and CIFAR-10.

6 Visualizing the decision boundary

While our attack runs, at almost every iteration an image lying on the decision boundary is available, that is, an image to which the classifier assigns the same probability (up to a tolerance) of belonging to different classes. In fact, whenever the linear region containing the current solution intersects the decision boundary, the solution of problem (10) is attained when the first constraint holds as an equality.
In Figure 2 we show some of these intermediate solutions found while crafting an adversarial example. The first three rows are obtained by attacking the plain models of Section 5, while for the fourth to sixth rows we used, respectively, the -at network on MNIST and the -at classifiers on GTS and CIFAR-10. For every row, the first image is the starting point of our method and belongs to the training set of the respective dataset, while the second image is the point obtained through the initial binary search on the segment joining the starting point and the target image, that is the image for which we want to provide an adversarial perturbation (shown as the last image of each row). We also report the -distance between each image and the target image, which is equivalent to the -norm of the adversarial manipulation found at that iteration of the algorithm.
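The initial binary search on the segment between the starting point and the target image can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function `boundary_binary_search`, the number of steps and the toy linear classifier `predict` are all hypothetical, and the Euclidean norm is used for the reported distance only as an example.

```python
import numpy as np

def boundary_binary_search(predict, start, target, steps=30):
    """Search on the segment from `target` to `start` for a point close
    to the decision boundary that is classified differently from
    `target` (hypothetical helper, not the authors' code)."""
    target_label = predict(target)
    assert predict(start) != target_label, "start must belong to another class"
    lo, hi = 0.0, 1.0  # weight on `start`: 0 gives target, 1 gives start
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        point = (1 - mid) * target + mid * start
        if predict(point) == target_label:
            lo = mid  # still on the target's side of the boundary
        else:
            hi = mid  # already misclassified, move closer to the target
    return (1 - hi) * target + hi * start

# Toy two-class linear classifier standing in for the network.
w = np.array([1.0, -1.0])

def predict(x):
    return int(w @ x > 0)

target = np.array([1.0, 0.0])  # classified as class 1
start = np.array([0.0, 1.0])   # classified as class 0

adv = boundary_binary_search(predict, start, target)
dist = np.linalg.norm(adv - target)
print(adv, dist)
```

The returned point is an adversarial example for the target image and serves as the first, usually far from minimal, solution that the later iterations improve upon.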
We can see that, apart from the starting image and the target image, all the images lie on the decision boundary. Furthermore, in many cases, although their distance from the target image is notable, they are clearly assignable to a specific class. This means that the decision boundary is still wrong, and shows that there is still quite some way to go if we want to achieve robustness with respect to human perception of these images.

Figure 3: The linear regions can be large. For the same cases reported in Figure 2 for GTS we show here the image obtained by the initial linear search and the one obtained by solving (10) on the first region. This means that the two images of each row belong to the same linear region even though their appearance is quite different. This shows that some of the linear regions cover quite large parts of the input space.

We can also check how large the linear regions are. The first polytope our attack checks is the one containing the point found by the linear search performed, as the initial step of the attack, between the image from the training set and the target image. We show this point and the solution of (10) on that polytope. Both images are contained in the polytope and both lie on the decision boundary. In Figure 3 we show these two images for some cases for the GTS models. It is interesting that, although the number of polytopes is extremely large, they are still wide enough to contain images of such different appearance and with significant -distance.

F.C. and M.H. acknowledge support from the BMBF through the Tübingen AI Center (FKZ: 01IS18039A) and by the DFG via grant 389792660 as part of TRR 248 and the Excellence Cluster “Machine Learning - New Perspectives for Science”. J.R. acknowledges support from the Bosch Research Foundation (Stifterverband, T113/30057/17) and the International Max Planck Research School for Intelligent Systems (IMPRS-IS).


  • [1] R. Arora, A. Basu, P. Mianjy, and A. Mukherjee. Understanding deep neural networks with rectified linear units. In ICLR, 2018.
  • [2] A. Athalye, N. Carlini, and D. A. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In ICML, 2018.
  • [3] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sciences, 2:183–202, 2009.
  • [4] W. Brendel, J. Rauber, and M. Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. In ICLR, 2018.
  • [5] N. Carlini and D. Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. In ACM Workshop on Artificial Intelligence and Security, 2017.
  • [6] N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, 2017.
  • [7] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120–145, 2011.
  • [8] F. Croce, M. Andriushchenko, and M. Hein. Provable robustness of relu networks via maximization of linear regions. In AISTATS, 2019.
  • [9] F. Croce and M. Hein. A randomized gradient-free attack on relu networks. In GCPR, 2018.
  • [10] N. Dalvi, P. Domingos, Mausam, S. Sanghai, and D. Verma. Adversarial classification. In KDD, 2004.
  • [11] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [13] M. Hein and M. Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. In NIPS, 2017.
  • [14] G. Huang, Z. Liu, and K. Q. Weinberger. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016.
  • [15] R. Huang, B. Xu, D. Schuurmans, and C. Szepesvari. Learning with a strong adversary. In ICLR, 2016.
  • [16] G. Katz, C. Barrett, D. Dill, K. Julian, and M. Kochenderfer. Reluplex: An efficient smt solver for verifying deep neural networks. In CAV, 2017.
  • [17] A. Krizhevsky, V. Nair, and G. Hinton. Cifar-10 (canadian institute for advanced research).
  • [18] A. Kurakin, I. J. Goodfellow, and S. Bengio. Adversarial examples in the physical world. In ICLR Workshop, 2017.
  • [19] Y. Liu, X. Chen, C. Liu, and D. Song. Delving into transferable adversarial examples and black-box attacks. In ICLR, 2017.
  • [20] D. Lowd and C. Meek. Adversarial learning. In KDD, 2005.
  • [21] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In ICLR, 2018.
  • [22] M. Mirman, T. Gehr, and M. Vechev. Differentiable abstract interpretation for provably robust neural networks. In ICML, 2018.
  • [23] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In CVPR, pages 2574–2582, 2016.
  • [24] M. Mosbach, M. Andriushchenko, T. Trost, M. Hein, and D. Klakow. Logit pairing methods can fool gradient-based attacks. In NeurIPS 2018 Workshop on Security in Machine Learning, 2018. arXiv:1810.12042.
  • [25] N. Narodytska and S. P. Kasiviswanathan. Simple black-box adversarial perturbations for deep networks. In CVPR 2017 Workshops, 2016.
  • [26] Y. E. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27(2):372–376, 1983.
  • [27] N. Papernot, N. Carlini, I. Goodfellow, R. Feinman, F. Faghri, A. Matyasko, K. Hambardzumyan, Y.-L. Juang, A. Kurakin, R. Sheatsley, A. Garg, and Y.-C. Lin. cleverhans v2.0.0: an adversarial machine learning library. preprint, arXiv:1610.00768, 2017.
  • [28] N. Papernot, P. McDonald, X. Wu, S. Jha, and A. Swami. Distillation as a defense to adversarial perturbations against deep networks. In IEEE Symposium on Security & Privacy, 2016.
  • [29] A. Raghunathan, J. Steinhardt, and P. Liang. Certified defenses against adversarial examples. In ICLR, 2018.
  • [30] J. Rauber, W. Brendel, and M. Bethge. Foolbox: A python toolbox to benchmark the robustness of machine learning models. In ICML Reliable Machine Learning in the Wild Workshop, 2017.
  • [31] L. Schott, J. Rauber, M. Bethge, and W. Brendel. Towards the first adversarially robust neural network model on MNIST. In ICLR, 2019.
  • [32] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 32:323–332, 2012.
  • [33] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In ICLR, pages 2503–2511, 2014.
  • [34] V. Tjeng, K. Xiao, and R. Tedrake. Evaluating robustness of neural networks with mixed integer programming. preprint, arXiv:1711.07356v3, 2019.
  • [35] T. Weng, H. Zhang, H. Chen, Z. Song, C. Hsieh, L. Daniel, D. S. Boning, and I. S. Dhillon. Towards fast computation of certified robustness for relu networks. In ICML, 2018.
  • [36] E. Wong and J. Z. Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In ICML, 2018.
  • [37] E. Wong, F. Schmidt, J. H. Metzen, and J. Z. Kolter. Scaling provable adversarial defenses. In NeurIPS, 2018.
  • [38] X. Yuan, P. He, Q. Zhu, R. R. Bhat, and X. Li. Adversarial examples: Attacks and defenses for deep learning. IEEE Trans. Neural Netw. Learn. Syst., 2019.