# Evaluating the Robustness of Neural Networks: An Extreme Value Theory Approach

The robustness of neural networks to adversarial examples has received great attention due to security implications. Despite various attack approaches to crafting visually imperceptible adversarial examples, little has been developed towards a comprehensive measure of robustness. In this paper, we provide a theoretical justification for converting robustness analysis into a local Lipschitz constant estimation problem, and propose to use the Extreme Value Theory for efficient evaluation. Our analysis yields a novel robustness metric called CLEVER, which is short for Cross Lipschitz Extreme Value for nEtwork Robustness. The proposed CLEVER score is attack-agnostic and computationally feasible for large neural networks. Experimental results on various networks, including ResNet, Inception-v3 and MobileNet, show that (i) CLEVER is aligned with the robustness indication measured by the ℓ_2 and ℓ_∞ norms of adversarial examples from powerful attacks, and (ii) defended networks using defensive distillation or bounded ReLU indeed achieve better CLEVER scores. To the best of our knowledge, CLEVER is the first attack-independent robustness metric that can be applied to any neural network classifier.

## 1 Introduction

Recent studies have highlighted the lack of robustness in state-of-the-art neural network models, e.g., a visually imperceptible adversarial image can be easily crafted to mislead a well-trained network (Szegedy et al., 2013; Goodfellow et al., 2015; Chen et al., 2017a). Even worse, researchers have identified that these adversarial examples are not only valid in the digital space but also plausible in the physical world (Kurakin et al., 2016a; Evtimov et al., 2017). The vulnerability to adversarial examples calls into question safety-critical applications and services deployed by neural networks, including autonomous driving systems and malware detection protocols, among others.

In the literature, studying adversarial examples of neural networks has twofold purposes: (i) security implications: devising effective attack algorithms for crafting adversarial examples, and (ii) robustness analysis: evaluating the intrinsic model robustness to adversarial perturbations to normal examples. Although in principle the means of tackling these two problems are expected to be independent, that is, the evaluation of a neural network’s intrinsic robustness should be agnostic to attack methods, and vice versa, existing approaches extensively use different attack results as a measure of robustness of a target neural network. Specifically, given a set of normal examples, the attack success rate and distortion of the corresponding adversarial examples crafted from a particular attack algorithm are treated as robustness metrics. Consequently, the network robustness is entangled with the attack algorithms used for evaluation and the analysis is limited by the attack capabilities. More importantly, the dependency between robustness evaluation and attack approaches can cause biased analysis. For example, adversarial training is a commonly used technique for improving the robustness of a neural network, accomplished by generating adversarial examples and retraining the network with corrected labels. However, while such an adversarially trained network is made robust to attacks used to craft adversarial examples for training, it can still be vulnerable to unseen attacks.

Motivated by evaluation criteria for text and image generation that are completely independent of the underlying generative processes, such as the BLEU score for texts (Papineni et al., 2002) and the INCEPTION score for images (Salimans et al., 2016), we aim to propose a comprehensive and attack-agnostic robustness metric for neural networks. Stemming from a perturbation analysis of an arbitrary neural network classifier, we derive a universal lower bound on the minimal distortion required to craft an adversarial example from an original one, where the lower bound applies to any attack algorithm and any $\ell_p$ norm for $p \geq 1$. We show that this lower bound is associated with the maximum norm of the local gradients with respect to the original example, and therefore robustness evaluation becomes a local Lipschitz constant estimation problem. To efficiently and reliably estimate the local Lipschitz constant, we propose to use extreme value theory (De Haan & Ferreira, 2007) for robustness evaluation. In this context, the extreme value corresponds to the local Lipschitz constant of our interest, which can be inferred from a set of independently and identically sampled local gradients. With the aid of extreme value theory, we propose a robustness metric called CLEVER, which is short for Cross Lipschitz Extreme Value for nEtwork Robustness. We note that CLEVER is an attack-independent robustness metric that applies to any neural network classifier. In contrast, the robustness metric proposed in Hein & Andriushchenko (2017), albeit attack-agnostic, only applies to a neural network classifier with one hidden layer.

We highlight the main contributions of this paper as follows:

• We propose a novel robustness metric called CLEVER, which is short for Cross Lipschitz Extreme Value for nEtwork Robustness. To the best of our knowledge, CLEVER is the first robustness metric that is attack-independent, can be applied to any neural network classifier, and scales to large networks for ImageNet.

• The proposed CLEVER score is well supported by our theoretical analysis on formal robustness guarantees and the use of extreme value theory. Our robustness analysis extends the results in Hein & Andriushchenko (2017) from continuously differentiable functions to a special class of non-differentiable functions: neural networks with ReLU activations.

• We corroborate the effectiveness of CLEVER by conducting experiments on state-of-the-art models for ImageNet, including ResNet (He et al., 2016), Inception-v3 (Szegedy et al., 2016) and MobileNet (Howard et al., 2017). We also use CLEVER to investigate defended networks against adversarial examples, including the use of defensive distillation (Papernot et al., 2016) and bounded ReLU (Zantedeschi et al., 2017). Experimental results show that our CLEVER score aligns well with the attack-specific robustness indicated by the $\ell_2$ and $\ell_\infty$ distortions of adversarial examples.

## 2 Background and Related work

### 2.1 Attacking Neural Networks using Adversarial Examples

One of the most popular formulations found in the literature for crafting adversarial examples to mislead a neural network is to formulate it as a minimization problem, where the variable $\delta$ to be optimized refers to the perturbation to the original example, and the objective function takes into account unsuccessful adversarial perturbations as well as a specific norm on $\delta$ for assuring similarity. For instance, the success of adversarial examples can be evaluated by their cross-entropy loss (Szegedy et al., 2013; Goodfellow et al., 2015) or model prediction (Carlini & Wagner, 2017b). The norm constraint on $\delta$ can be implemented in a clipping manner (Kurakin et al., 2016b) or treated as a penalty function (Carlini & Wagner, 2017b). The $\ell_p$ norm of $\delta$, defined as $\|\delta\|_p = \left(\sum_{i=1}^{d} |\delta_i|^p\right)^{1/p}$ for any $p \geq 1$, is often used for crafting adversarial examples. In particular, when $p = \infty$, $\|\delta\|_\infty = \max_i |\delta_i|$ measures the maximal variation among all dimensions in $\delta$. When $p = 2$, $\|\delta\|_2$ becomes the Euclidean norm of $\delta$. When $p = 1$, $\|\delta\|_1 = \sum_{i=1}^{d} |\delta_i|$ measures the total variation of $\delta$. The state-of-the-art attack methods for $\ell_\infty$, $\ell_2$ and $\ell_1$ norms are the iterative fast gradient sign method (I-FGSM) (Goodfellow et al., 2015; Kurakin et al., 2016b), Carlini and Wagner's attack (CW attack) (Carlini & Wagner, 2017b), and elastic-net attacks to deep neural networks (EAD) (Chen et al., 2017b), respectively. These attacks fall into the category of white-box attacks since the network model is assumed to be transparent to an attacker. Adversarial examples can also be crafted from a black-box network model using an ensemble approach (Liu et al., 2016), training a substitute model (Papernot et al., 2017), or employing zeroth-order optimization based attacks (Chen et al., 2017c).
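As a small illustration of the three distortion measures discussed above, the following sketch computes the $\ell_1$, $\ell_2$ and $\ell_\infty$ norms of a perturbation. The function name `lp_norm` and the example vector are our own choices for illustration and do not come from any attack library.

```python
import math

def lp_norm(delta, p):
    """Return the l_p norm of a perturbation given as a list of floats (p >= 1)."""
    if p == math.inf:
        # l_inf: the largest change applied to any single dimension.
        return max(abs(v) for v in delta)
    if p < 1:
        raise ValueError("p must satisfy p >= 1")
    return sum(abs(v) ** p for v in delta) ** (1.0 / p)

# Hypothetical perturbation of a 3-dimensional input.
delta = [3.0, -4.0, 0.0]
norms = {p: lp_norm(delta, p) for p in (1, 2, math.inf)}
```

The same perturbation can look large or small depending on the norm, which is why distortions are always reported together with the norm they were measured in.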

### 2.2 Existing Defense Methods

Since the discovery of vulnerability to adversarial examples (Szegedy et al., 2013), various defense methods have been proposed to improve the robustness of neural networks. The rationale for defense is to make a neural network more resilient to adversarial perturbations, while ensuring the resulting defended model still attains similar test accuracy as the original undefended network. Papernot et al. proposed defensive distillation (Papernot et al., 2016), which uses the distillation technique (Hinton et al., 2015) and a modified softmax function at the final layer to retrain the network parameters with the prediction probabilities (i.e., soft labels) from the original network.

Zantedeschi et al. (2017) showed that by changing the ReLU function to a bounded ReLU function, a neural network can be made more resilient. Another popular defense approach is adversarial training, which generates adversarial examples and augments the original training data with them during the network training stage. On MNIST, the adversarially trained model proposed by Madry et al. (2017) can successfully defend against a majority of adversarial examples at the price of increased network capacity. Model ensembles have also been discussed as a way to increase the robustness to adversarial examples (Tramèr et al., 2017; Liu et al., 2017). In addition, detection methods such as feature squeezing (Xu et al., 2017) and example reforming (Meng & Chen, 2017) can also be used to identify adversarial examples. However, the CW attack has been shown to bypass 10 different detection methods (Carlini & Wagner, 2017a). In this paper, we focus on evaluating the intrinsic robustness of a neural network model to adversarial examples. The effect of detection methods is beyond our scope.

### 2.3 Theoretical Robustness Guarantees for Neural Networks

Szegedy et al. (2013) compute the global Lipschitz constant for each layer and use their product to explain the robustness issue in neural networks, but the global Lipschitz constant often gives a very loose bound. Hein & Andriushchenko (2017) gave a robustness lower bound using a local Lipschitz continuous condition and derived a closed-form bound for a multi-layer perceptron (MLP) with a single hidden layer and softplus activation. Nevertheless, a closed-form bound is hard to derive for a neural network with more than one hidden layer.

Wang et al. (2016) utilized terminologies from topology to study robustness. However, no robustness bounds or estimates were provided for neural networks. On the other hand, works done by Ehlers (2017); Katz et al. (2017a, b); Huang et al. (2017) focus on formally verifying the viability of certain properties in neural networks for any possible input, and transform this formal verification problem into satisfiability modulo theory (SMT) and large-scale linear programming (LP) problems. These SMT or LP based approaches have high computational complexity and are only feasible for very small networks.

Intuitively, we can use the distortion of adversarial examples found by a certain attack algorithm as a robustness metric. For example, Bastani et al. (2016) proposed a linear programming (LP) formulation to find adversarial examples and use the distortions as the robustness metric. They observe that the LP formulation can find adversarial examples with smaller distortions than other gradient-based attacks like L-BFGS (Szegedy et al., 2013). However, the distortion found by these algorithms is an upper bound of the true minimum distortion and depends on specific attack algorithms. These methods differ from our proposed robustness measure CLEVER, because CLEVER is an estimation of the lower bound of the minimum distortion and is independent of attack algorithms. Additionally, unlike LP-based approaches which are impractical for large networks, CLEVER is computationally feasible for large networks like Inception-v3. The concept of minimum distortion and upper/lower bound will be formally defined in Section 3.

## 3 Analysis of Formal Robustness Guarantees for a Classifier

In this section, we provide formal robustness guarantees of a classifier in Theorem 3.2. Our robustness guarantees are general since they only require a mild assumption on Lipschitz continuity of the classification function. For differentiable classification functions, our results are consistent with the main theorem in (Hein & Andriushchenko, 2017) but are obtained in a much simpler and more intuitive manner. (The authors in Hein & Andriushchenko (2017) implicitly assume Lipschitz continuity and use the Mean Value Theorem and Hölder's Inequality to prove their main theorem. Here we provide a simple and direct proof with the Lipschitz continuity assumption and without involving the Mean Value Theorem and Hölder's Inequality.) Furthermore, our robustness analysis can be easily extended to non-differentiable classification functions (e.g. neural networks with ReLU) as in Lemma 3.3, whereas the analysis in Hein & Andriushchenko (2017) is restricted to differentiable functions. Specifically, Corollary 3.2.1 shows that the robustness analysis in (Hein & Andriushchenko, 2017) is in fact a special case of our analysis. We start our analysis by defining the notion of adversarial examples, minimum $\ell_p$ distortions, and lower/upper bounds. All the notations are summarized in Table 1.

###### Definition 3.1 (perturbed example and adversarial example).

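Let $x_0 \in \mathbb{R}^d$ be an input vector of a $K$-class classification function $f: \mathbb{R}^d \to \mathbb{R}^K$ and the prediction is given as $c(x_0) = \arg\max_{1 \leq i \leq K} f_i(x_0)$. Given $x_0$, we say $x_a$ is a perturbed example of $x_0$ with noise $\delta \in \mathbb{R}^d$ and $\ell_p$-distortion $\Delta_p$ if $x_a = x_0 + \delta$ and $\Delta_p = \|\delta\|_p$. An adversarial example is a perturbed example $x_a$ that changes $c(x_0)$. A successful untargeted attack is to find an $x_a$ such that $c(x_a) \neq c(x_0)$, while a successful targeted attack is to find an $x_a$ such that $c(x_a) = t$ given a target class $t \neq c(x_0)$.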

###### Definition 3.2 (minimum adversarial distortion $\Delta_{p,\min}$).

Given an input vector $x_0$ of a classifier $f$, the minimum $\ell_p$ adversarial distortion of $x_0$, denoted as $\Delta_{p,\min}$, is defined as the smallest $\Delta_p$ over all adversarial examples of $x_0$.

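###### Definition 3.3 (lower bound of $\Delta_{p,\min}$).

Suppose $\Delta_{p,\min}$ is the minimum adversarial distortion of $x_0$. A lower bound of $\Delta_{p,\min}$, denoted by $\beta_L$ where $\beta_L \leq \Delta_{p,\min}$, is defined such that any perturbed examples of $x_0$ with $\|\delta\|_p \leq \beta_L$ are not adversarial examples.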

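###### Definition 3.4 (upper bound of $\Delta_{p,\min}$).

Suppose $\Delta_{p,\min}$ is the minimum adversarial distortion of $x_0$. An upper bound of $\Delta_{p,\min}$, denoted by $\beta_U$ where $\beta_U \geq \Delta_{p,\min}$, is defined such that there exists an adversarial example of $x_0$ with $\|\delta\|_p \leq \beta_U$.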

The lower and upper bounds are instance-specific because they depend on the input $x_0$. While $\beta_U$ can be easily given by finding an adversarial example of $x_0$ using any attack method, $\beta_L$ is not easy to find. $\beta_L$ guarantees that the classifier is robust to any perturbations with $\|\delta\|_p \leq \beta_L$, certifying the robustness of the classifier. Below we show how to derive a formal robustness guarantee of a classifier with a Lipschitz continuity assumption. Specifically, our analysis obtains a lower bound of the $\ell_p$ minimum adversarial distortion, $\beta_L = \min_{j \neq c} \frac{f_c(x_0) - f_j(x_0)}{L_q^j}$.

###### Lemma 3.1 (Lipschitz continuity and its relationship with gradient norm (Paulavičius & Žilinskas, 2006)).

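Let $S \subset \mathbb{R}^d$ be a convex bounded closed set and let $h(x): S \to \mathbb{R}$ be a continuously differentiable function on an open set containing $S$. Then, $h(x)$ is a Lipschitz function with Lipschitz constant $L_q$ if the following inequality holds for any $x, y \in S$:

$$|h(x) - h(y)| \leq L_q \|x - y\|_p, \tag{1}$$

where $L_q = \max\{\|\nabla h(x)\|_q : x \in S\}$, $\nabla h(x) = \left(\frac{\partial h(x)}{\partial x_1}, \cdots, \frac{\partial h(x)}{\partial x_d}\right)^\top$ is the gradient of $h(x)$, and $\frac{1}{p} + \frac{1}{q} = 1$, $1 \leq p, q \leq \infty$.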

Given Lemma 3.1, we then provide a formal guarantee for the lower bound $\beta_L$.

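###### Theorem 3.2 (Formal guarantee on lower bound $\beta_L$ for untargeted attack).

Let $x_0 \in \mathbb{R}^d$ and $f: \mathbb{R}^d \to \mathbb{R}^K$ be a multi-class classifier with continuously differentiable components $f_i$ and let $c = \arg\max_{1 \leq i \leq K} f_i(x_0)$ be the class which $f$ predicts for $x_0$. For all $\delta \in \mathbb{R}^d$ with

$$\|\delta\|_p \leq \min_{j \neq c} \frac{f_c(x_0) - f_j(x_0)}{L_q^j}, \tag{2}$$

$\arg\max_{1 \leq i \leq K} f_i(x_0 + \delta) = c$ holds with $\frac{1}{p} + \frac{1}{q} = 1$, $1 \leq p, q \leq \infty$, and $L_q^j$ is the Lipschitz constant for the function $f_c(x) - f_j(x)$ in $\ell_p$ norm. In other words, $\beta_L = \min_{j \neq c} \frac{f_c(x_0) - f_j(x_0)}{L_q^j}$ is a lower bound of minimum distortion.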

The intuition behind Theorem 3.2 is shown in Figure 1 with a one-dimensional example. The function value $g(x) = f_c(x) - f_j(x)$ near point $x_0$ is inside a double cone formed by two lines passing through $(x_0, g(x_0))$ with slopes equal to $\pm L_q$, where $L_q$ is the (local) Lipschitz constant of $g(x)$ near $x_0$. In other words, the function value of $g(x)$ around $x_0$, i.e. $g(x_0 + \delta)$, can be bounded by $g(x_0)$, $\delta$ and the Lipschitz constant $L_q$. When $g(x_0 + \delta)$ is decreased to 0, an adversarial example is found and the minimal change of $\delta$ is $\frac{g(x_0)}{L_q}$. The complete proof is deferred to Appendix A.
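To make the bound in Theorem 3.2 concrete, consider a linear classifier $f(x) = Wx + b$, a case chosen here purely for illustration because the gradient of $f_c(x) - f_j(x)$ is the constant vector $w_c - w_j$, so the cross Lipschitz constant is exactly $\|w_c - w_j\|_q$. The sketch below evaluates the bound under that assumption. The helper names (`dual_exponent`, `lower_bound_linear`) are hypothetical, and the rows of `W` are assumed to be distinct so that no division by zero occurs.

```python
import math

def lp_norm(v, p):
    """l_p norm of a list of floats, with p = math.inf allowed."""
    if p == math.inf:
        return max(abs(a) for a in v)
    return sum(abs(a) ** p for a in v) ** (1.0 / p)

def dual_exponent(p):
    """Return q such that 1/p + 1/q = 1."""
    if p == 1:
        return math.inf
    if p == math.inf:
        return 1.0
    return p / (p - 1.0)

def lower_bound_linear(W, b, x0, p):
    """Evaluate min_{j != c} (f_c(x0) - f_j(x0)) / L_q^j for f(x) = W x + b."""
    scores = [sum(w * x for w, x in zip(row, x0)) + bias for row, bias in zip(W, b)]
    c = max(range(len(scores)), key=scores.__getitem__)  # predicted class
    q = dual_exponent(p)
    bounds = []
    for j, row in enumerate(W):
        if j == c:
            continue
        # The gradient of f_c - f_j is the constant vector w_c - w_j.
        cross_lipschitz = lp_norm([wc - wj for wc, wj in zip(W[c], row)], q)
        bounds.append((scores[c] - scores[j]) / cross_lipschitz)
    return c, min(bounds)

# Two classes in the plane; the decision boundary is the line x1 = x2.
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
x0 = [2.0, 0.0]
predicted, beta_l2 = lower_bound_linear(W, b, x0, 2)
_, beta_linf = lower_bound_linear(W, b, x0, math.inf)
```

For this toy case the $\ell_2$ value equals the Euclidean distance from $x_0$ to the decision boundary, so the bound is tight. For a general network the Lipschitz constant is not available in closed form.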

###### Remark 1.

$L_q^j$ is the Lipschitz constant of the function involving cross terms: $f_c(x) - f_j(x)$, hence we also call it the cross Lipschitz constant following (Hein & Andriushchenko, 2017).

To distinguish our analysis from (Hein & Andriushchenko, 2017), we show in Corollary 3.2.1 that we can obtain the same result as in (Hein & Andriushchenko, 2017) by Theorem 3.2. In fact, the analysis in (Hein & Andriushchenko, 2017) is a special case of our analysis because the authors implicitly assume Lipschitz continuity on $f_i(x)$ when requiring $f_i(x)$ to be continuously differentiable. They use the local Lipschitz constant ($L_{q,x_0}$) instead of the global Lipschitz constant ($L_q$) to obtain a tighter bound on the adversarial perturbation $\delta$.

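###### Corollary 3.2.1 (Formal guarantee on $\beta_L$ for untargeted attack).
(Proof deferred to Appendix B.)

Let $L_{q,x_0}^j$ be the local Lipschitz constant of the function $f_c(x) - f_j(x)$ at $x_0$ over some fixed ball $B_p(x_0, R) := \{x \in \mathbb{R}^d \mid \|x - x_0\|_p \leq R\}$ and let $\delta \in B_p(0, R)$. By Theorem 3.2, we obtain the bound in (Hein & Andriushchenko, 2017):

$$\|\delta\|_p \leq \min\left\{ \min_{j \neq c} \frac{f_c(x_0) - f_j(x_0)}{L_{q,x_0}^j},\; R \right\}. \tag{3}$$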

An important use case of Theorem 3.2 and Corollary 3.2.1 is the bound for targeted attack:

###### Corollary 3.2.2 (Formal guarantee on $\beta_L$ for targeted attack).

Assume the same notation as in Theorem 3.2 and Corollary 3.2.1. For a specified target class $j$, we have $\|\delta\|_p \leq \min\left\{ \frac{f_c(x_0) - f_j(x_0)}{L_{q,x_0}^j},\; R \right\}$.

In addition, we further extend Theorem 3.2 to a special case of non-differentiable functions: neural networks with ReLU activations. In this case the Lipschitz constant used in Lemma 3.1 can be replaced by the maximum norm of the directional derivative, and our analysis above will go through.

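###### Lemma 3.3 (Formal guarantee on $\beta_L$ for ReLU networks).
(Proof deferred to Appendix C.)

Let $h(\cdot)$ be an $l$-layer ReLU neural network with $W_i$ as the weights for layer $i$. We ignore bias terms as they don't contribute to the gradient.

$$h(x) = \sigma(W_l \sigma(W_{l-1} \dots \sigma(W_1 x)))$$

where $\sigma(u) = \max(0, u)$. Let $S \subset \mathbb{R}^d$ be a convex bounded closed set. Then equation (1) holds with $L_q = \sup_{x \in S} \left\{ \left| \sup_{\|d\|_p = 1} D^+ h(x; d) \right| \right\}$, where $D^+ h(x; d) := \lim_{t \to 0^+} \frac{h(x + td) - h(x)}{t}$ is the one-sided directional derivative, and Theorem 3.2, Corollary 3.2.1 and Corollary 3.2.2 still hold.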
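The one-sided directional derivative can be illustrated on a tiny bias-free ReLU network. The sketch below is an assumption-laden toy: the two-layer architecture, its weights, and the finite-difference step `t` are our own choices. It shows that at a kink the derivative depends on the direction, which is why the gradient is replaced by the directional derivative for ReLU networks.

```python
def relu(u):
    return max(0.0, u)

def two_layer_relu(x, W1, w2):
    """h(x) = relu(w2 . relu(W1 x)): a bias-free two-layer ReLU network with scalar output."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return relu(sum(w * h for w, h in zip(w2, hidden)))

def one_sided_derivative(h, x, d, t=1e-6):
    """Finite-difference approximation of D^+ h(x; d) = lim_{t -> 0+} (h(x + t d) - h(x)) / t."""
    shifted = [xi + t * di for xi, di in zip(x, d)]
    return (h(shifted) - h(x)) / t

# Hypothetical weights: h(x) = relu(relu(x1) + relu(x2)), which has a kink at the origin.
W1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [1.0, 1.0]
h = lambda x: two_layer_relu(x, W1, w2)

origin = [0.0, 0.0]
slope_right = one_sided_derivative(h, origin, [1.0, 0.0])   # moving into the active region
slope_left = one_sided_derivative(h, origin, [-1.0, 0.0])   # moving into the inactive region
```

The two slopes differ, so no ordinary gradient exists at the origin, yet both one-sided derivatives are well defined and bounded.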

## 4 The CLEVER Robustness Metric via Extreme Value Theory

In this section, we provide an algorithm to compute the robustness metric CLEVER with the aid of extreme value theory, where CLEVER can be viewed as an efficient estimator of the lower bound $\beta_L$ and is the first attack-agnostic score that applies to any neural network classifier. Recall that in Section 3 we show that the lower bound of network robustness is associated with $g(x_0) = f_c(x_0) - f_j(x_0)$ and its cross Lipschitz constant $L_{q,x_0}^j$, where $g(x_0)$ is readily available at the output of a classifier and $L_{q,x_0}^j$ is defined as $\max_{x \in B_p(x_0, R)} \|\nabla g(x)\|_q$. Although $\nabla g(x)$ can be calculated easily via back propagation, computing $L_{q,x_0}^j$ is more involved because it requires obtaining the maximum value of $\|\nabla g(x)\|_q$ in a ball. Exhaustive search on low dimensional $x$ in $B_p(x_0, R)$ seems already infeasible, not to mention the image classifiers with large feature dimensions of our interest. For instance, the feature dimension is $d = 784$, $3072$ and $150528$ for MNIST, CIFAR and ImageNet respectively.

One approach to compute $L_{q,x_0}^j$ is through sampling a set of points $x^{(i)}$ in a ball $B_p(x_0, R)$ around $x_0$ and taking the maximum value of $\|\nabla g(x^{(i)})\|_q$. However, a significant number of samples might be needed to obtain a good estimate of $\max \|\nabla g(x)\|_q$, and it is unknown how good the estimate is compared to the true maximum. Fortunately, Extreme Value Theory ensures that the maximum value of random variables can only follow one of the three extreme value distributions, which is useful to estimate $\max \|\nabla g(x)\|_q$ with only a tractable number of samples.
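The naive sampling approach described above can be sketched as follows for the $\ell_2$ case. Everything here is an illustrative assumption: the toy function $g(x) = \sum_i \sin(x_i)$ (chosen because its gradient is known in closed form), the uniform sampler for the $\ell_2$ ball, and the function names.

```python
import math
import random

def sample_in_l2_ball(x0, radius, rng):
    """Draw one point uniformly from the l_2 ball of the given radius around x0."""
    d = len(x0)
    direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in direction))
    # A uniform point in a d-dimensional ball has radius distributed as R * U^(1/d).
    r = radius * rng.random() ** (1.0 / d)
    return [xi + r * v / norm for xi, v in zip(x0, direction)]

def grad_norm_l2(x):
    """l_2 norm of the gradient of the toy function g(x) = sum_i sin(x_i)."""
    return math.sqrt(sum(math.cos(xi) ** 2 for xi in x))

def naive_max_grad_norm(x0, radius, n_samples, seed=0):
    """Largest sampled gradient norm: a lower estimate of the local Lipschitz constant."""
    rng = random.Random(seed)
    return max(grad_norm_l2(sample_in_l2_ball(x0, radius, rng)) for _ in range(n_samples))

x0 = [1.0] * 5
estimate = naive_max_grad_norm(x0, radius=0.5, n_samples=2000)
```

The sampled maximum can never exceed the true maximum, so on its own it underestimates the Lipschitz constant. Closing that gap without an impractical number of samples is what motivates the extreme value approach.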

It is worth noting that Wood & Zhang (1996) also applied extreme value theory to estimate the Lipschitz constant. However, there are two main differences between their work and this paper. First, the sampling methodology is entirely different: Wood & Zhang (1996) calculate the slopes between pairs of sample points, whereas we directly take samples of the norm of the gradient as in Lemma 3.1. Second, the functions considered in Wood & Zhang (1996) are only one-dimensional, as opposed to the high-dimensional classification functions considered in this paper. For comparison, we show in our experiments that the approach in Wood & Zhang (1996), denoted as SLOPE in Table 5.3 and Figure (h)h, performs poorly for high-dimensional classifiers such as deep neural networks.

### 4.1 Estimate $L_{q,x_0}^j$ via Extreme Value Theory

When sampling a point $x$ uniformly in $B_p(x_0, R)$, $\|\nabla g(x)\|_q$ can be viewed as a random variable characterized by a cumulative distribution function (CDF). For the purpose of illustration, we derived the CDF for a 2-layer neural network in Theorem D.1. (The theorem and proof are deferred to Appendix D.) For any neural network, suppose we have $n$ samples $\{\|\nabla g(x^{(i)})\|_q\}$, and denote them as a sequence of independent and identically distributed (iid) random variables $Y_1, Y_2, \dots, Y_n$, each with CDF $F_Y(y)$. The CDF of $\max\{Y_1, \dots, Y_n\}$, denoted as $F_Y^n(y)$, is called the limit distribution of $F_Y(y)$. The Fisher-Tippett-Gnedenko theorem says that $F_Y^n(y)$, if it exists, can only be one of the three families of extreme value distributions: the Gumbel class, the Fréchet class and the reverse Weibull class.

###### Theorem 4.1 (Fisher-Tippett-Gnedenko Theorem).

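If there exists a sequence of pairs of real numbers $(a_n, b_n)$ such that $a_n > 0$ and $\lim_{n \to \infty} F_Y^n(a_n y + b_n) = G(y)$, where $G$ is a non-degenerate distribution function, then $G$ belongs to either the Gumbel class (Type I), the Fréchet class (Type II) or the Reverse Weibull class (Type III) with their CDFs as follows:

$$\text{Gumbel class (Type I):}\quad G(y) = \exp\left\{-\exp\left[-\frac{y - a_W}{b_W}\right]\right\},\quad y \in \mathbb{R},$$

$$\text{Fréchet class (Type II):}\quad G(y) = \begin{cases} 0, & \text{if } y < a_W, \\ \exp\left\{-\left(\frac{y - a_W}{b_W}\right)^{-c_W}\right\}, & \text{if } y \geq a_W, \end{cases}$$

$$\text{Reverse Weibull class (Type III):}\quad G(y) = \begin{cases} \exp\left\{-\left(\frac{a_W - y}{b_W}\right)^{c_W}\right\}, & \text{if } y < a_W, \\ 1, & \text{if } y \geq a_W, \end{cases}$$

where $a_W \in \mathbb{R}$, $b_W > 0$ and $c_W > 0$ are the location, scale and shape parameters, respectively.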
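As a quick numerical companion to Theorem 4.1, the sketch below evaluates the standard textbook forms of the Gumbel and reverse Weibull CDFs with location, scale and shape parameters. The function names and the example parameter values are our own. The point to notice is that the reverse Weibull CDF reaches exactly 1 at its location parameter, i.e. it has a finite right end-point, whereas the Gumbel CDF has none.

```python
import math

def gumbel_cdf(y, a_w, b_w):
    """Gumbel (Type I) CDF with location a_w and scale b_w > 0."""
    return math.exp(-math.exp(-(y - a_w) / b_w))

def reverse_weibull_cdf(y, a_w, b_w, c_w):
    """Reverse Weibull (Type III) CDF with location a_w, scale b_w > 0 and shape c_w > 0."""
    if b_w <= 0 or c_w <= 0:
        raise ValueError("scale and shape must be positive")
    if y >= a_w:
        return 1.0  # all probability mass lies at or below the right end-point a_w
    return math.exp(-(((a_w - y) / b_w) ** c_w))

# Hypothetical parameters: right end-point 3.0, scale 0.5, shape 2.0.
a_w, b_w, c_w = 3.0, 0.5, 2.0
at_endpoint = reverse_weibull_cdf(a_w, a_w, b_w, c_w)
one_scale_below = reverse_weibull_cdf(a_w - b_w, a_w, b_w, c_w)
```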

Theorem 4.1 implies that the maximum values of the samples follow one of the three families of distributions. If $g(x)$ has a bounded Lipschitz constant, $\|\nabla g(x^{(i)})\|_q$ is also bounded, thus its limit distribution must have a finite right end-point. We are particularly interested in the reverse Weibull class, as its CDF has a finite right end-point (denoted as $a_W$). The right end-point reveals the upper limit of the distribution, known as the extreme value. The extreme value is exactly the unknown local cross Lipschitz constant $L_{q,x_0}^j$ we would like to estimate in this paper. To estimate $L_{q,x_0}^j$, we first generate $N_s$ samples of $x^{(i)}$ over a fixed ball $B_p(x_0, R)$ uniformly and independently in each batch, with a total of $N_b$ batches. We then compute $\|\nabla g(x^{(i)})\|_q$ and store the maximum value of each batch in a set $S$. Next, with the samples in $S$, we perform a maximum likelihood estimation of the reverse Weibull distribution parameters, and the location estimate $\hat{a}_W$ is used as an estimate of $L_{q,x_0}^j$.
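The batching step of this procedure can be sketched in a few lines. The code below only collects per-batch maxima. The bounded toy random variable standing in for a gradient norm, the batch sizes, and the name `batch_maxima` are assumptions made for illustration. The subsequent maximum likelihood fit is deliberately left out, since it would normally be delegated to a statistics package (for example, SciPy ships a reverse Weibull family as `scipy.stats.weibull_max`, though using it here is our suggestion rather than something stated in the text).

```python
import random

def batch_maxima(sample_value, n_batches, n_per_batch):
    """Return one maximum per batch, each taken over n_per_batch independent draws."""
    return [
        max(sample_value() for _ in range(n_per_batch))
        for _ in range(n_batches)
    ]

# Toy stand-in for a sampled gradient norm: a random variable bounded above by 3.0,
# which plays the role of the unknown local Lipschitz constant.
rng = random.Random(0)
true_bound = 3.0
toy_grad_norm = lambda: true_bound * rng.random() ** 0.5

maxima = batch_maxima(toy_grad_norm, n_batches=20, n_per_batch=50)
largest_seen = max(maxima)  # still a lower estimate of true_bound
```

The batch maxima cluster just below the bound. Fitting a distribution with a finite right end-point to them is what allows the estimate to extrapolate beyond the largest value actually observed.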

### 4.2 Compute CLEVER: a robustness score of neural network classifiers

Given an instance $x_0$, its classifier $f(x_0)$ and a target class $j$, a targeted CLEVER score of the classifier's robustness can be computed via $g(x_0)$ and $L_{q,x_0}^j$. Similarly, untargeted CLEVER scores can be computed. With the proposed procedure of estimating $L_{q,x_0}^j$ described in Section 4.1, we summarize the flow of computing the CLEVER score for both targeted attacks and untargeted attacks in Algorithm 1 and 2, respectively.
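Once an estimate of the cross Lipschitz constant is available, the final scoring step is simple arithmetic. The sketch below follows the form of the bounds in Corollary 3.2.1 and Corollary 3.2.2 (margin divided by the Lipschitz estimate, capped at the sampling radius). The function names, the dictionary layout for per-class estimates, and the numbers are hypothetical, and the Lipschitz estimates are taken as given rather than computed.

```python
def targeted_score(margin, lipschitz_estimate, radius):
    """min(g(x0) / L, R) for one target class, where margin = f_c(x0) - f_j(x0) >= 0."""
    if margin < 0:
        raise ValueError("x0 must be classified as class c, so the margin cannot be negative")
    if lipschitz_estimate <= 0:
        raise ValueError("the Lipschitz estimate must be positive")
    return min(margin / lipschitz_estimate, radius)

def untargeted_score(scores, c, lipschitz_estimates, radius):
    """Smallest targeted score over all classes j != c."""
    return min(
        targeted_score(scores[c] - scores[j], lipschitz_estimates[j], radius)
        for j in range(len(scores))
        if j != c
    )

# Hypothetical classifier outputs and per-target Lipschitz estimates.
scores = [5.0, 3.0, 1.0]          # f_0(x0), f_1(x0), f_2(x0); predicted class c = 0
estimates = {1: 4.0, 2: 2.0}      # estimated cross Lipschitz constant for each target j
score_target_1 = targeted_score(scores[0] - scores[1], estimates[1], radius=5.0)
score_untargeted = untargeted_score(scores, 0, estimates, radius=5.0)
```

Because the score is capped at the radius, a value equal to the radius only says that no closer adversarial example is predicted inside the sampled ball.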

## 5 Experimental Results

### 5.1 Networks and Parameter Setup

We conduct experiments on CIFAR-10 (CIFAR for short), MNIST, and ImageNet data sets. For the former two smaller datasets CIFAR and MNIST, we evaluate CLEVER scores on four relatively small networks: a single hidden layer MLP with softplus activation (with the same number of hidden units as in (Hein & Andriushchenko, 2017)), a 7-layer AlexNet-like CNN (with the same structure as in (Carlini & Wagner, 2017b)), and the 7-layer CNN with defensive distillation (Papernot et al., 2016) (DD) and bounded ReLU (Zantedeschi et al., 2017) (BReLU) defense techniques employed.

For the ImageNet data set, we use three popular deep network architectures: a 50-layer Residual Network (He et al., 2016) (ResNet-50), Inception-v3 (Szegedy et al., 2016) and MobileNet (Howard et al., 2017). They were chosen for the following reasons: (i) they all yield (close to) state-of-the-art performance among equal-sized networks; and (ii) their architectures are significantly different with unique building blocks, i.e., residual block in ResNet, inception module in Inception net, and depthwise separable convolution in MobileNet. Therefore, their diversity in network architectures is appropriate to test our robustness metric. For MobileNet, we set the width multiplier to 1.0. We used public pretrained weights for all ImageNet models (pretrained models can be downloaded at https://github.com/tensorflow/models/tree/master/research/slim).

In all our experiments, we use a single setting of the sampling parameters $N_b$ (number of batches), $N_s$ (samples per batch) and $R$ (ball radius). For targeted attacks, we use 500 test-set images for CIFAR and MNIST and use 100 test-set images for ImageNet; for each image, we evaluate its targeted CLEVER score for three targets: a random target class, a least likely class (the class with lowest probability when predicting the original example), and the top-2 class (the class with largest probability except for the true class, which is usually the easiest target to attack). We also conduct untargeted attacks on MNIST and CIFAR for 100 test-set images, and evaluate their untargeted CLEVER scores. Our experiment code is publicly available at https://github.com/huanzhang12/CLEVER.

### 5.2 Fitting Gradient Norm samples with Reverse Weibull distributions

We fit the cross Lipschitz constant samples in $S$ (see Algorithm 1) with a reverse Weibull class distribution to obtain the maximum likelihood estimates of the location parameter $\hat{a}_W$, scale parameter $\hat{b}_W$ and shape parameter $\hat{c}_W$, as introduced in Theorem 4.1. To validate that the reverse Weibull distribution is a good fit to the empirical distribution of the cross Lipschitz constant samples, we conduct the Kolmogorov-Smirnov goodness-of-fit test (a.k.a. K-S test) to calculate the K-S test statistics $D$ and corresponding $p$-values. The null hypothesis is that the samples in $S$ follow a reverse Weibull distribution.

Figure 6 plots the probability distribution function of the cross Lipschitz constant samples and the fitted reverse Weibull distribution for images from various data sets and network architectures. The estimated MLE parameters, $p$-values, and the K-S test statistics $D$ are also shown. We also calculate the percentage of examples whose estimation has $p$-values greater than 0.05, as illustrated in Figure 6. If the $p$-value is greater than 0.05, the null hypothesis cannot be rejected, meaning that the underlying data samples fit a reverse Weibull distribution well. Figure 6 shows that all numbers are close to 100%, empirically validating the use of the reverse Weibull distribution as an underlying distribution of gradient norm samples. Therefore, the fitted location parameter of the reverse Weibull distribution (i.e., the extreme value), $\hat{a}_W$, can be used as a good estimate of the local cross Lipschitz constant to calculate the CLEVER score. The exact numbers are shown in Table 3 in Appendix E.
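For readers unfamiliar with the K-S statistic, the sketch below computes it from scratch: it is the largest vertical gap between the empirical CDF of the samples and a candidate CDF. The reverse Weibull parameters, the inverse-transform sampler, and the function names are illustrative assumptions. The $p$-value computation is omitted because it requires the Kolmogorov distribution, which a statistics library would normally provide.

```python
import math
import random

def ks_statistic(samples, cdf):
    """Largest gap between the empirical CDF of the samples and the candidate CDF."""
    ys = sorted(samples)
    n = len(ys)
    d = 0.0
    for i, y in enumerate(ys, start=1):
        f = cdf(y)
        # The empirical CDF jumps from (i - 1) / n to i / n at y; check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

def reverse_weibull_cdf(y, a_w, b_w, c_w):
    return 1.0 if y >= a_w else math.exp(-(((a_w - y) / b_w) ** c_w))

def sample_reverse_weibull(a_w, b_w, c_w, rng):
    """Inverse-transform sampling: solve G(y) = u for y, with u in (0, 1]."""
    u = 1.0 - rng.random()
    return a_w - b_w * (-math.log(u)) ** (1.0 / c_w)

# Hypothetical parameters and a synthetic sample drawn from the same distribution.
a_w, b_w, c_w = 3.0, 0.5, 2.0
rng = random.Random(0)
samples = [sample_reverse_weibull(a_w, b_w, c_w, rng) for _ in range(500)]
d_stat = ks_statistic(samples, lambda y: reverse_weibull_cdf(y, a_w, b_w, c_w))
```

A small statistic means the candidate CDF tracks the empirical one closely. In the setting of this section the candidate would be the fitted reverse Weibull distribution rather than the known one used in this synthetic example.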

### 5.3 Comparing CLEVER Score with Attack-specific Network Robustness

We apply the state-of-the-art white-box attack methods, iterative fast gradient sign method (I-FGSM) (Goodfellow et al., 2015; Kurakin et al., 2016b) and Carlini and Wagner's attack (CW) (Carlini & Wagner, 2017b), to find adversarial examples for 11 networks, including 4 networks trained on CIFAR, 4 networks trained on MNIST, and 3 networks trained on ImageNet. For the CW attack, we run 1000 iterations for ImageNet and CIFAR, and 2000 iterations for MNIST, as MNIST has been shown to be more difficult to attack (Chen et al., 2017b). The attack learning rate is individually tuned for each model: 0.001 for Inception-v3 and ResNet-50, 0.0005 for MobileNet and 0.01 for all other networks. For I-FGSM, we run 50 iterations and choose the optimal $\epsilon$ to achieve the smallest distortion for each individual image. For defensively distilled (DD) networks, 50 iterations of I-FGSM are not sufficient; we use 250 iterations for CIFAR-DD and 500 iterations for MNIST-DD to achieve a 100% success rate. For the problem to be non-trivial, images that are classified incorrectly are skipped. We report 100% attack success rates for all the networks, and thus the average distortion of adversarial examples can indicate the attack-specific robustness of each network. For comparison, we compute the CLEVER scores for the same set of images and attack targets. To the best of our knowledge, CLEVER is the first attack-independent robustness score that is capable of handling the large networks studied in this paper, so we directly compare it with the attack-induced distortion metrics in our study.

We evaluate the effectiveness of our CLEVER score by comparing the upper bound $\beta_U$ (found by attacks) and the CLEVER score, where CLEVER serves as an estimated lower bound, $\beta_L$. Table 5.3 compares the average $\ell_2$ and $\ell_\infty$ distortions of adversarial examples found by targeted CW and I-FGSM attacks and the corresponding average targeted CLEVER scores for $\ell_2$ and $\ell_\infty$ norms, and Figure (h)h visualizes the results. Similarly, Table 2 compares untargeted CW and I-FGSM attacks with untargeted CLEVER scores. As expected, CLEVER is smaller than the distortions of adversarial images in most cases. More importantly, since CLEVER is independent of attack algorithms, the reported CLEVER scores can roughly indicate the distortion of the best possible attack in terms of a specific $\ell_p$ distortion. The average distortion found by the CW attack is close to the CLEVER score, indicating CW is a strong attack. In addition, when a defense mechanism (Defensive Distillation or Bounded ReLU) is used, the corresponding CLEVER scores are consistently increased (except for CIFAR-BReLU), indicating that the network is indeed made more resilient to adversarial perturbations. For CIFAR-BReLU, both the CLEVER scores and the $\ell_p$ norm of adversarial examples found by the CW attack decrease, implying that bounded ReLU is an ineffective defense for CIFAR. CLEVER scores can be seen as a security checkpoint for unseen attacks. For example, if there is a substantial gap in distortion between the CLEVER score and the considered attack algorithms, it may suggest the existence of a more effective attack that can close the gap.

Since the CLEVER score is derived from an estimate of the robustness lower bound, we further verify the viability of CLEVER for each example, i.e., whether it is usually smaller than the upper bound found by attacks. Table 5.3 shows the percentage of inaccurate estimations where the CLEVER score is larger than the distortion of adversarial examples found by CW and I-FGSM attacks in three ImageNet networks. We found that the CLEVER score provides an accurate estimation for most of the examples. For MobileNet and ResNet-50, our CLEVER score is a strict lower bound of these two attacks for more than 96% of tested examples. For Inception-v3, the strict lower bound condition holds less often (still for more than 75% of examples), but we found that in the failing cases the attack distortion differs from our CLEVER score by only a fairly small amount. In Figure (k)k we show the empirical CDF of the gap between the CLEVER score and the norm of the adversarial distortion generated by the CW attack for the same set of images in Table 5.3. In Figure (n)n, we plot the distortion and CLEVER scores for each individual image. A positive gap indicates that CLEVER (estimated lower bound) is indeed less than the upper bound found by the CW attack. Most images have a small positive gap, which signifies the near-optimality of the CW attack in terms of distortion, as CLEVER serves as an estimate of the distortion achievable by the best possible attack.
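The per-example check described above can be sketched as follows: count how often the CLEVER score exceeds the attack distortion, and build the empirical CDF of the gap. The numbers below are made-up placeholders, not values from the paper's tables, and the function names are hypothetical.

```python
from bisect import bisect_right

def violation_rate(clever_scores, attack_distortions):
    """Fraction of examples whose CLEVER score exceeds the attack distortion."""
    if len(clever_scores) != len(attack_distortions):
        raise ValueError("inputs must have the same length")
    violations = sum(c > d for c, d in zip(clever_scores, attack_distortions))
    return violations / len(clever_scores)

def empirical_cdf(values):
    """Return a function t -> fraction of values that are <= t."""
    ordered = sorted(values)
    n = len(ordered)
    return lambda t: bisect_right(ordered, t) / n

# Hypothetical per-image numbers, for illustration only.
clever = [0.20, 0.35, 0.10, 0.50]
cw_dist = [0.25, 0.30, 0.40, 0.55]

# Positive gap: the CLEVER score lies below the distortion found by the attack.
gaps = [d - c for c, d in zip(clever, cw_dist)]
rate = violation_rate(clever, cw_dist)
cdf = empirical_cdf(gaps)
print(rate, cdf(0.0))
```

Here `cdf(0.0)` is the share of images with a non-positive gap, so it matches the violation rate whenever no gap is exactly zero.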

### 5.4 Time vs. Estimation Accuracy

In Figure (q)q, we vary the number of samples and compute the CLEVER scores for three large ImageNet models, Inception-v3, ResNet-50 and MobileNet. We observe that 50 or 100 samples are usually sufficient to obtain a reasonably accurate robustness estimate, even though this is a relatively small number of samples. On a single GTX 1080 Ti GPU, the cost of 1 sample is measured as 2.9 s for MobileNet, 5.0 s for ResNet-50 and 8.9 s for Inception-v3, so the computational cost of CLEVER is feasible for state-of-the-art large-scale deep neural networks. Additional figures for MNIST and CIFAR datasets are given in Appendix E.
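For a back-of-the-envelope view of these timings, the sketch below multiplies the reported per-sample cost by the number of samples. It assumes the total cost grows linearly with the sample count, which the text does not state explicitly; the function name is hypothetical.

```python
# Per-sample costs reported in the text (seconds, single GTX 1080 Ti GPU).
COST_PER_SAMPLE_S = {"MobileNet": 2.9, "ResNet-50": 5.0, "Inception-v3": 8.9}

def estimated_cost_seconds(model, num_samples):
    """Rough cost of one CLEVER evaluation, assuming linear scaling in the sample count."""
    return COST_PER_SAMPLE_S[model] * num_samples

for model in COST_PER_SAMPLE_S:
    for n in (50, 100):
        print(f"{model}: {n} samples ~ {estimated_cost_seconds(model, n):.0f} s")
```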

## 6 Conclusion

In this paper, we propose the CLEVER score, a novel and generic metric to evaluate the robustness of a target neural network classifier to adversarial examples. Compared to the existing robustness evaluation approaches, our metric has the following advantages: it (i) is attack-agnostic; (ii) is applicable to any neural network classifier; (iii) comes with strong theoretical guarantees; and (iv) is computationally feasible for large neural networks. Our extensive experiments show that the CLEVER score closely matches the practical robustness indication of a wide range of natural and defended networks.

Acknowledgment. Luca Daniel and Tsui-Wei Weng are partially supported by MIT-Skoltech program and MIT-IBM Watson AI Lab. Cho-Jui Hsieh and Huan Zhang acknowledge the support of NSF via IIS-1719097.

## Appendix

### A Proof of Theorem 3.2

###### Proof.

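According to Lemma 3.1, the assumption that $g(x) := f_c(x) - f_j(x)$ is Lipschitz continuous with Lipschitz constant $L_q^j$ gives

$$|g(x) - g(y)| \le L_q^j \|x - y\|_p. \tag{4}$$

Letting $x = x_0 + \delta$ and $y = x_0$ in (4), we get

$$|g(x_0+\delta) - g(x_0)| \le L_q^j \|\delta\|_p,$$

which can be rearranged into the following form:

$$g(x_0) - L_q^j \|\delta\|_p \le g(x_0+\delta) \le g(x_0) + L_q^j \|\delta\|_p. \tag{5}$$

When $g(x_0+\delta) = 0$, an adversarial example is found. As indicated by (5), $g(x_0+\delta)$ is lower bounded by $g(x_0) - L_q^j \|\delta\|_p$. If $\|\delta\|_p$ is small enough such that $g(x_0) - L_q^j \|\delta\|_p \ge 0$, no adversarial examples can be found:

$$g(x_0) - L_q^j \|\delta\|_p \ge 0 \;\Rightarrow\; \|\delta\|_p \le \frac{g(x_0)}{L_q^j} \;\Rightarrow\; \|\delta\|_p \le \frac{f_c(x_0) - f_j(x_0)}{L_q^j}.$$

Finally, to achieve $\arg\max_i f_i(x_0+\delta) = c$, we take the minimum of the bound on $\|\delta\|_p$ in the preceding display over $j \ne c$. That is, if

$$\|\delta\|_p \le \min_{j \ne c} \frac{f_c(x_0) - f_j(x_0)}{L_q^j},$$

the classifier decision can never be changed and the attack will never succeed. ∎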
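To make the final bound concrete, the sketch below evaluates $\min_{j \ne c} (f_c(x_0) - f_j(x_0)) / L_q^j$ for a toy linear classifier. The weights and the input are hypothetical. The linear case is chosen because the gradient of $f_c - f_j$ is the constant vector $w_c - w_j$, so its $\ell_q$ norm can be used directly as the Lipschitz constant; for a real network this constant would have to be estimated.

```python
import math

def lq_norm(v, q):
    """l_q norm of a vector; pass q = math.inf for the max norm."""
    if q == math.inf:
        return max(abs(t) for t in v)
    return sum(abs(t) ** q for t in v) ** (1.0 / q)

def robustness_lower_bound(logits, c, lipschitz):
    """min over j != c of (f_c(x0) - f_j(x0)) / L_q^j.

    lipschitz[j] is assumed to be a valid Lipschitz constant of f_c - f_j.
    """
    return min((logits[c] - logits[j]) / lipschitz[j]
               for j in range(len(logits)) if j != c)

# Toy linear classifier f(x) = W x with hypothetical weights.
W = [[2.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
x0 = [1.0, 0.5]
logits = [sum(w * x for w, x in zip(row, x0)) for row in W]
c = max(range(len(logits)), key=logits.__getitem__)  # predicted class

q = 2  # dual norm for p = 2
lipschitz = {j: lq_norm([a - b for a, b in zip(W[c], W[j])], q)
             for j in range(len(W)) if j != c}
bound = robustness_lower_bound(logits, c, lipschitz)
print(c, bound)
```

In this example the smallest ratio comes from class 1, so any perturbation with $\ell_2$ norm below the printed value leaves the prediction unchanged.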

### B Proof of Corollary 3.2.1

###### Proof.

By Lemma 3.1 and let , we get , which then gives the bound in Theorem 2.1 of (Hein & Andriushchenko, 2017). ∎

### C Proof of Lemma 3.3

###### Proof.

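For any $x, y$, let $d = \frac{y - x}{\|y - x\|_p}$ be the unit vector pointing from $x$ to $y$ and $r = \|y - x\|_p$. Define the univariate function $u(z) = h(x + zd)$; then $u(0) = h(x)$ and $u(r) = h(y)$. Observing that $D^+h(x+zd;d)$ and $D^+h(x+zd;-d)$ are the right-hand and left-hand derivatives of $u(z)$, we have

$$u'(z) = \begin{cases} D^+h(x+zd;d) \le L_q & \text{if } D^+h(x+zd;d) = D^+h(x+zd;-d) \\ \text{undefined} & \text{if } D^+h(x+zd;d) \ne D^+h(x+zd;-d) \end{cases}$$

For a ReLU network, there can be at most a finite number of points $z \in (0, r)$ such that $u'(z)$ does not exist. This holds because each discontinuity is caused by some ReLU activation, and there are only finitely many combinations. Let $0 = z_0 < z_1 < \cdots < z_{k-1} < z_k = r$ be those points. Then, using the fundamental theorem of calculus on each interval separately, there exists $\bar{z}_i \in (z_{i-1}, z_i)$ for each $i$ such that

$$\begin{aligned} u(r) - u(0) &\le \sum_{i=1}^{k} |u(z_i) - u(z_{i-1})| \\ &\le \sum_{i=1}^{k} |u'(\bar{z}_i)(z_i - z_{i-1})| && \text{(Mean value theorem)} \\ &\le \sum_{i=1}^{k} L_q |z_i - z_{i-1}|_p \\ &= L_q \|x - y\|_p. && \text{($z_i$ are in line $(x, y)$)} \end{aligned}$$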

Theorem 3.2 and its corollaries remain valid after replacing Lemma 3.1 with Lemma 3.3. ∎

### D Theorem D.1 and its proof

###### Theorem D.1 ($F_Y(y)$ of one-hidden-layer neural network).

Consider a neural network with input $x_0 \in \mathbb{R}^d$, a hidden layer with $U$ hidden neurons, and rectified linear unit (ReLU) activation function. If we sample uniformly in a ball $B_p(x_0, R)$, then the cumulative distribution function of $\|\nabla g(x)\|_q$, denoted as $F_Y(y)$, is piece-wise linear with at most $M = \sum_{i=0}^{d} \binom{U}{i}$ pieces, where $g(x) = f_c(x) - f_j(x)$ for some given $c$ and $j$, and $\frac{1}{p} + \frac{1}{q} = 1$.

###### Proof.

The output of a one-hidden-layer neural network can be written as

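$$f_j(x) = \sum_{r=1}^{U} V_{jr} \cdot \sigma\Big(\sum_{i=1}^{d} W_{ri} \cdot x_i + b_r\Big) = \sum_{r=1}^{U} V_{jr} \cdot \sigma(w_r x + b_r),$$

where $\sigma(z) = \max(z, 0)$ is the ReLU activation function, $W$ and $V$ are the weight matrices of the first and second layer respectively, and $w_r$ is the $r$-th row of $W$. Thus, we can compute $g(x)$ and $\|\nabla g(x)\|_q$ below:

$$\begin{aligned} g(x) = f_c(x) - f_j(x) &= \sum_{r=1}^{U} V_{cr} \cdot \sigma(w_r x + b_r) - \sum_{r=1}^{U} V_{jr} \cdot \sigma(w_r x + b_r) \\ &= \sum_{r=1}^{U} (V_{cr} - V_{jr}) \cdot \sigma(w_r x + b_r) \end{aligned}$$

and

$$\|\nabla g(x)\|_q = \Big\| \sum_{r=1}^{U} \mathbb{I}(w_r x + b_r)(V_{cr} - V_{jr}) w_r^\top \Big\|_q,$$

where $\mathbb{I}(z)$ is a univariate indicator function:

$$\mathbb{I}(z) = \begin{cases} 1 & \text{if } z > 0, \\ 0 & \text{if } z \le 0. \end{cases}$$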

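As illustrated in Figure 24, the hyperplanes $w_r x + b_r = 0$, $r \in \{1, \ldots, U\}$, divide the $d$-dimensional space $\mathbb{R}^d$ into different regions, with the interior of each region satisfying a different set of inequality constraints, e.g. $w_1 x + b_1 > 0$ and $w_2 x + b_2 < 0$. Given $x$, we can identify which region it belongs to by checking the sign of $w_r x + b_r$ for each $r$. Notice that the gradient norm is the same for all the points in the same region, i.e. for any $x_1$, $x_2$ satisfying $\mathbb{I}(w_r x_1 + b_r) = \mathbb{I}(w_r x_2 + b_r)$ for all $r$, we have $\|\nabla g(x_1)\|_q = \|\nabla g(x_2)\|_q$. Since there can be at most $M = \sum_{i=0}^{d} \binom{U}{i}$ different regions for a $d$-dimensional space with $U$ hyperplanes, $\|\nabla g(x)\|_q$ can take at most $M$ different values.

Therefore, if we perform uniform sampling in a ball $B_p(x_0, R)$ centered at $x_0$ with radius $R$ and denote $\|\nabla g(x)\|_q$ as a random variable $Y$, the probability distribution of $Y$ is discrete and its CDF is piece-wise constant with at most $M$ pieces. Without loss of generality, assume there are $M_0 \le M$ distinct values for $Y$ and denote them as $m_{(1)}, m_{(2)}, \ldots, m_{(M_0)}$ in increasing order. The CDF of $Y$, denoted as $F_Y(y)$, is then the following:

$$F_Y(m_{(i)}) = F_Y(m_{(i-1)}) + \frac{V_d\big(\{x \mid \|\nabla g(x)\|_q = m_{(i)}\} \cap B_p(x_0, R)\big)}{V_d\big(B_p(x_0, R)\big)}, \quad i = 1, \ldots, M_0,$$

where $F_Y(m_{(0)}) = 0$ with $m_{(0)} < m_{(1)}$, and $V_d(E)$ is the volume of $E$ in a $d$-dimensional space. ∎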
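The counting argument in this proof can be checked numerically on a small example. The sketch below builds a random one-hidden-layer ReLU network, samples uniformly in an $\ell_2$ ball, and counts the distinct gradient norms, which should never exceed $\sum_{i=0}^{d} \binom{U}{i}$. The network sizes, the random weights, and the choice $p = q = 2$ are assumptions made for illustration.

```python
import math
import random

def grad_norm(x, W, b, v_diff, q=2):
    """l_q norm of the gradient of g(x) = sum_r v_diff[r] * relu(w_r . x + b_r)."""
    grad = [0.0] * len(x)
    for w_r, b_r, v_r in zip(W, b, v_diff):
        # Indicator I(w_r x + b_r): the neuron contributes only when active.
        if sum(wi * xi for wi, xi in zip(w_r, x)) + b_r > 0:
            for i in range(len(x)):
                grad[i] += v_r * w_r[i]
    return sum(abs(t) ** q for t in grad) ** (1.0 / q)

def max_regions(U, d):
    """Upper bound on the number of regions cut out by U hyperplanes in d dimensions."""
    return sum(math.comb(U, i) for i in range(d + 1))

rng = random.Random(0)
d, U, R = 2, 3, 1.0
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(U)]
b = [rng.gauss(0, 0.3) for _ in range(U)]
v_diff = [rng.gauss(0, 1) for _ in range(U)]  # plays the role of V_cr - V_jr
x0 = [0.0] * d

def sample_l2_ball(center, radius):
    """Uniform sample in an l_2 ball via rejection from the bounding cube."""
    while True:
        pt = [rng.uniform(-radius, radius) for _ in center]
        if sum(t * t for t in pt) <= radius * radius:
            return [c + t for c, t in zip(center, pt)]

norms = {round(grad_norm(sample_l2_ball(x0, R), W, b, v_diff), 10)
         for _ in range(5000)}
print(len(norms), "distinct gradient norms; bound:", max_regions(U, d))
```

Because every point in a region shares the same activation pattern, the sampled norms collapse to a handful of values, which is what makes the CDF a step function.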

#### E.1 Percentage of examples having p-value > 0.05

Table 3 shows the percentage of examples where the null hypothesis cannot be rejected by the K-S test, indicating that the maximum gradient norm samples fit the reverse Weibull distribution well.

#### E.2 CLEVER vs. number of samples

Figure 31 shows the CLEVER score with different numbers of samples for MNIST and CIFAR models. For most models except MNIST-BReLU, reducing the number of samples changes the CLEVER scores only very slightly. For MNIST-BReLU, increasing the number of samples improves the estimated lower bound, suggesting that a larger number of samples is preferred. In practice, we can start with a relatively small number of samples and then try larger numbers to see if the CLEVER scores change significantly. If the CLEVER scores stay roughly the same despite the increase, we can conclude that the smaller number of samples is sufficient.
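One possible way to automate this check is sketched below: keep doubling the number of samples until the score changes by less than a relative tolerance. The doubling schedule, the 5% tolerance, and the `estimate` callable are assumptions made for this example; `fake_estimate` is a synthetic stand-in, not a real CLEVER computation.

```python
def choose_num_samples(estimate, start, max_samples, rel_tol=0.05):
    """Double the sample count until the score changes by at most rel_tol.

    estimate(n) is a hypothetical callable returning a CLEVER-style score
    computed from n samples.
    """
    n = start
    score = estimate(n)
    while 2 * n <= max_samples:
        next_score = estimate(2 * n)
        if abs(next_score - score) <= rel_tol * abs(score):
            return n, score  # doubling did not change the score much
        n, score = 2 * n, next_score
    return n, score  # budget exhausted; keep the largest count tried

def fake_estimate(n):
    """Synthetic score that settles as n grows (illustration only)."""
    return 0.5 + 10.0 / n

n_chosen, score = choose_num_samples(fake_estimate, start=50, max_samples=1000)
print(n_chosen, score)
```

With this synthetic estimator the loop stops at 200 samples, since going to 400 changes the score by less than 5%.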