Interpreting and Evaluating Neural Network Robustness

05/10/2019 ∙ by Fuxun Yu, et al. ∙ Northeastern University George Mason University Clarkson University 0

Recently, adversarial deception becomes one of the most considerable threats to deep neural networks. However, compared to extensive research in new designs of various adversarial attacks and defenses, the neural networks' intrinsic robustness property is still lack of thorough investigation. This work aims to qualitatively interpret the adversarial attack and defense mechanism through loss visualization, and establish a quantitative metric to evaluate the neural network model's intrinsic robustness. The proposed robustness metric identifies the upper bound of a model's prediction divergence in the given domain and thus indicates whether the model can maintain a stable prediction. With extensive experiments, our metric demonstrates several advantages over conventional adversarial testing accuracy based robustness estimation: (1) it provides a uniformed evaluation to models with different structures and parameter scales; (2) it over-performs conventional accuracy based robustness estimation and provides a more reliable evaluation that is invariant to different test settings; (3) it can be fast generated without considerable testing cost.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In the past few years, Neural Networks (NNs) have achieved superiors success in various domains, e.g.

, computer vision 

[Szegedy et al.2016], speech recognition [Hinton et al.2012], autonomous systems [Huval et al.2015], etc. However, the recent appearance of adversarial attacks [Kurakin et al.2016] greatly challenges the security of neural network applications: by crafting and injecting human-imperceptible noises into test inputs, neural networks’ prediction results can be arbitrarily manipulated [Ian J. Goodfellow2014]. Until now, the emerging pace, effectiveness, and efficiency of new attacks always take an early lead to the defense solutions [Carlini and Wagner2017], and the key factors of the adversarial vulnerabilities are still unclear, leaving the neural network robustness study in a vicious cycle.

In this work, we aim to qualitatively interpret neural network models’ adversarial vulnerability and robustness, and establish a quantitative metric for the model-intrinsic robustness evaluation. To interpret the robustness, we adopt the loss visualization technique [Goodfellow et al.2015]

, which was widely used in model convergence studies. As adversarial attacks leverage perturbations in inputs, we switch the loss visualization from its original parameter space into the input space and illustrate how a neural network is deceived by adversarial perturbations. Based on the interpretation, we design a robustness evaluation metric to measure a neural network’s maximum prediction divergence within a constrained perturbation range. We further optimize the metric evaluation process to keep its consistency under extrinsic factor variance,

e.g., model reparameterization [Dinh et al.2017].

Specifically, we have the following contributions:

  • We interpret the adversarial vulnerability and robustness by defining and visualizing a new loss surface called decision surface. Compared to the cross-entropy based loss surface, the decision surface contains the implicit decision boundary and provides better visualization effect;

  • We testify that adversarial deception is caused by the neural network’s neighborhood under-fitting. Our visualization shows that adversarial examples are naturally-existed points lying in the close neighborhood of the inputs. However, the neural network fails to classify them, which caused the adversarial example phenomenon;

  • We propose a robustness evaluation metric. Combined with a new normalization method, the metric can invariantly reflect a neural network’s intrinsic robustness property regardless of attacks and defenses;

  • We reveal that under certain cases, e.g.

    , defensive distillation, the commonly-used PGD adversarial testing accuracy can give unreliable robustness estimation, while our metric could reflect the model robustness correctly.

Extensive evaluation results show that our defined robustness metric could well indicate the model-intrinsic robustness across different datasets, various architectures, multiple adversarial attacks, and different defense methods.

2 Background and Related Work

2.1 Adversarial Attacks and Robustness

Adversarial examples were firstly introduced by [Szegedy et al.2013], which revealed neural networks’ vulnerability to adversarial noises and demonstrated the gap between the artificial cognition and human visual perception. Since then, various adversarial attacks were proposed, such as L-BFGS attack [Kurakin et al.2016], FGSM attack [Ian J. Goodfellow2014], C&W attack [Carlini and Wagner2017], black-box attack [Papernot et al.2017], etc.

Driven by the appearance of adversarial attacks, corresponding defense techniques also emerged, including adversarial training [Ian J. Goodfellow2014], defensive distillation [Papernot et al.2016], gradient regularization [Ross and Doshi-Velez2018]

, adversarial logit pairing 

[Kannan et al.2018], etc. Among those, MinMax robustness optimization [Madry et al.2018] is considered as one of the most potent defenses, which boosts model accuracy by integrating the worst-case adversarial examples into the model training.

Currently, testing accuracy under adversarial attacks is used to evaluate the model robustness. However, it is highly affected by the attack specifications and can’t comprehensively reflect the actual robustness regarding model-intrinsic properties. For example, one commonly used way to evaluate the model robustness is adopting the testing accuracy under projected gradient descent (PGD) attack as an estimation. However, our experiments demonstrate that such a robustness estimation is highly unreliable: a model with a high PGD testing accuracy could be easily broken by other attacks.

In this work, we aim to provide an intrinsic robustness property evaluation metric that is invariant from the specifications of models, attacks, and defenses.

2.2 Neural Network Loss Visualization

Neural network loss visualization is considered as one of the most useful approaches in neural network analysis own to its intuitive interpretation. Proposed by [Goodfellow et al.2015], the loss visualization is utilized to analyze model training and convergence. Later, [Keskar et al.2017] further revealed that flat local minima is the key to model generalization in parameter space. However, a model reparameterization issue was identified by [Dinh et al.2017] that the model parameter scaling may distort the geometry properties.

In this work, we adopt the concept of the loss visualization to analyze the neural network’s loss behaviors under adversarial perturbations. Meanwhile, we will also provide a normalization method to solve the model reparameterization problem and derive our scaling-invariant robustness metric.

2.3 Visualization Space Selection

Besides of solving the reparameterization issue, the loss visualization needs further customization for the adversarial perturbation analysis. As the loss visualization mainly evaluates a neural network’s generalization ability, it focuses on the parameter space to analyze the model training and convergence in previous works. However, such an analysis focus doesn’t fit well in the adversarial attacks and defenses, whose action scope is in input space

. On the other hand, loss function in the

input space measures the network’s loss variations w.r.t the input perturbations. It naturally shows the influence of adversarial perturbations and is suitable for studying the robustness to adversarial perturbations. Therefore, we extend the previous methods into the input space.

Fig. 1 showed two examples of the visualized loss surface of an ResNet model in the parameter space and the input space, which illustrate the difference between the two visualization spaces. Although the loss surface in the parameter space can show a flat minima, its significant non-smooth variations in the input space demonstrate the loss is highly sensitive to input perturbations, which can be adversarial vulnerabilities. In this work, we will adopt the input space as the default visualization setting for robustness interpretation.

Figure 1: (a) ResNet’s Loss Surface in Parameter Space (b) ResNet’s Loss Surface in the Input Space: The loss surface demonstrates significant non-smooth variation.

3 Adversarial Robustness Interpretation

In this section, we will use the customized loss visualization to reveal the mechanism of adversarial perturbation and neural network robustness.

3.1 Neural Network Loss Visualization

Loss Visualization Basis: The prediction of a neural network can be evaluated by its loss function , where is the model parameter set (weight and bias) and is the input. As the inputs x are usually constructed in a high-dimensional space, direct visualization analysis on the loss surface is impossible. To solve this issue, the loss visualization projects the high-dimensional loss surface into a low-dimensional space to visualize it (e.g.

a 2D hyper-plane). During the projection, two vectors

and are selected and normalized as the base vectors for - hyper-plane. Given an starting input point

, the points around it can be interpolated, and the corresponding loss values can be calculated as:


where, the original point in the function denotes the original image, and can be treated as the unit perturbation added into the image, and the coordinate denotes the perturbation intensity. In the loss visualization, a point’s coordinate also denotes its divergence degree from the original point along and direction. After sampling sufficient points’ loss values, the function with high-dimensional inputs could be projected to the chosen hyper-plane.

Decision Surface Construction: As the loss visualization is mostly used to analyze model convergence, the loss function is usually represented by the cross-entropy loss, which constructs a conventional loss surface in the visualization. However, one critical limitation of the cross-entropy based loss surface is that, it cannot qualitatively show the explicit decision boundary of one input test, and less helpful for adversarial deception analysis.

Therefore, we propose a decision surface to replace the loss surface in the loss visualization:



is the logit output before the softmax layer, and

is the true class index of the input . The decision function evaluates the confidence of prediction. In the correct prediction cases, should always be positive, while denotes a wrong prediction. Specifically, indicates the equal confidence for both correct and wrong prediction, which is the decision boundary of model. Consequently, the visualization surface constructed by the function is defined the decision. Different from the cross-entropy based loss surface, the decision surface demonstrates explicit decision boundaries, and assist the adversarial analysis.

3.2 Visualizing Adversarial Vulnerability

Experimental Analysis: Based on the loss visualization, we project a neural network’s loss behavior into 2D hyper-planes. By comparing the model’s 4 different types loss behavior in decision surface, we provide a experimental analysis for the adversarial vulnerability.

As shown in Fig. 2, the visualized hyper-planes have the central points ast the original neural network inputs, and their x-axes share the same input divergence direction – . Meanwhile, each hyper-plane has a dedicated input divergence direction – along the y-axis, which indicate 4 kinds of perturbations, including random noise, cross-entropy based non-targeted FGSM attack [Kurakin et al.2016], Least-likely targeted FGSM attack [Kurakin et al.2016], and non-targeted C&W attack [Carlini and Wagner2017]. Specifically values in the three adversarial attacks can be determined as:



is normal distribution,

is the logit output, is the true class label, is least likely class label (both one-hot).

In Fig. 2, we use arrows to show the shortest distance to cross the decision boundary =0. As projected in Fig. 2(a), when the input is diverged by the perturbation along a random direction, it will take much longer distance to cross the decision boundary. This explains the common sense that natural images with small random noises won’t degrade neural network accuracy significantly. By contrast, for the adversarial attacks projected in Fig. 2(b)(d), the attacks find aggressive directions ( direction shown in y-axis), towards which the decision boundary is in the close neighborhood around the original input. Therefore, adding those small perturbations that even human can’t perceive into input can mislead the model decision and generates adversarial examples.

Vulnerability Interpretation: The above experimental analysis reveals the nature of adversarial examples: Although a neural network seems to converge well after the model training (the demonstrated model achieves 90% accuracy on CIFAR10), there still exist large regions of image points that the neural network fails to classify correctly (as shown by the large regions beyond the decision boundary in Fig. 2(b)(d)). What’s worse, some of these regions are extremely close to the original input point (even within distance).

Base on these analysis, we could conclude that, rather than being “generated” by attackers, the adversarial examples are “naturally existed” already that models fail to learn correctly. To fix such intrinsic vulnerability of neural networks, the essential and ultimate robustness enhancement should focus on solving the “neighborhood under-fitting” issue.

Figure 2:

Adversarial vulnerability demonstration when loss surface in input space is projected onto different hyperplanes.

3.3 Interpreting Adversarial Robustness

To verify our geometric robustness theory, we compare two pairs of robust and natural models trained on MNIST and CIFAR10 respectively. These models are released from the adversarial attacking challenges 111, and built with the same structure but different robustness degrees (natural training and MinMax training [Madry et al.2018]).

To verify our theory, we visualize the models’ decision surfaces for interpretation: (1) As shown in Fig. 3, dramatic differences between the natural and robust decision surfaces can be observed: Natural (vulnerable) model’s decision surfaces show sharp peaks and large slopes, where the decision confidence could quickly drop to negative areas (wrong classification regions). (2) By comparison, on robust decision surfaces (shown in Fig. 3(c)(d)), all neighborhood points around the original input point are located on a high plateau with (correct classification regions). (3) The surface in the neighborhood is rather flat with negligible slopes until it reaches approximately constraints, which is exactly the adversarial attack constraint used in robust training. Similar phenomenon could be observed in Fig. 4 on CIFAR10.

Such robust model’s loss geometry verifies our previous conclusion that, fixing the neighborhood under-fitting issue is the essential robustness enhancement solution for neural networks. And a flat and wide plateau around the original point on decision surface is one of the most desired properties of a robust model.

Figure 3: Decision surfaces of the natural and robust models on MNIST. (a)-(b): natural model surfaces in random and adversarial projection; (c)-(d): robust model surfaces in random and adversarial projection (each unit denotes 0.05 perturbation step size)
Figure 4: Decision Surface of natural and robust model on CIFAR10 (step size = 1). As assumed, natural model’s surface shows sharp peaks and cliffs while robust model’s shows flat plateau.

4 Adversarial Robustness Evaluation

4.1 Formal Definition of Robustness Metric

As aforementioned, the decision surface of a robust model should have a flat neighborhood around the input point . Intuitively explanation is that a robust model should have good prediction stability–its prediction does not have significant change with small perturbations. In fact, models are not always robust–the predictions of a model on clean and noisy inputs are not always the same and can diverge to a large extent with small adversarial noises. As such, given a feasible perturbation set, the maximum divergence between the original prediction and the worst-case adversarial prediction could be used to denote the model’s vulnerability degree (i.e., the inverses of model robustness).

Based on this definition, firstly, we calculate the divergence between two predictions on an original input and an adversarial input with perturbations in a defined range. Specifically, we use the Kullback–Leibler divergence, which is known as KL Divergence (

) and is a common evaluation metric on measuring the divergence between two probability distributions. The formal robustness could be estimated by:


where is the prediction results from the evaluated model. A lower divergence indicates the model is more robust as a more stable prediction is maintained. The final robustness metric is defined inversely proportional to the maximum since the largest divergence will generate the smallest robustness score . To obtain the term in Eq. 4, we use the gradient ascent algorithm to directly optimize the KL-divergence, which demonstrates accurate and stable estimations that we will show in Sec. 5.

4.2 Invariant Normalization against
       Model Reparameterization

The robustness metric defined in previous works has a problem called “model re-parameterization”: on the condition that weights and biases are enlarged by the same coefficients simultaneously, a neural network model’s prediction results and its robustness property will not change, while the defined KL divergence can have dramatic change [Dinh et al.2017].

To solve this problem, we design a simple but effective normalization method: the basic idea is to add a scale-invariant normalization layer after the logit layer output. Since the neural network before the logit layer is piecewise-linear, we could then use normalization to safely remove the scaling effect of model reparameterization. The basic process is as follows: firstly, we attain a confidence vector of the logit layer, which can contain either positive or negative values; then we divide them by the max-absolute-value to normalize the confidence vector to the range of (-1, 1) and re-center them into positive range (0, 2). Owning to the max division, the final confidence vector will not change even when the parameters are linearly scaled up (or down). Finally, we use a simple sum-normalization to transform the confidence vector to a valid probability distribution. The overall normalization is:


Here is the final normalized probability distribution, is the normalized confidence vector, is the original logit layer output, and is the input. By the above normalization method, we could successfully alleviate the model reparameterization effect, which is shown in Sec. 5.

5 Robustness Evaluation Experiments

In this section, we evaluate our proposed robustness metric in various experimental settings and compare it with conventional evaluation based on adversarial testing accuracy.

5.1 Experiment Setup

To test the generality of our metric for neural networks’ robustness evaluation, we adopt three common datasets (i.e.

MNIST, CIFAR10, and ImageNet) and different models for the experiment, including FcNet, LeNet, ConvNet, ResNet18, ResNet152, and DenseNet.

To further test our metric on neural networks with different robustness degrees, the following defense settings are applied: No Defense, Adversarial Training [Kurakin et al.2016], Gradient Regularization Training [Ross and Doshi-Velez2018], Defensive Distillation [Papernot et al.2016], Gradient Inhibition [Liu et al.2018] and MinMax Training [Madry et al.2018]333

The gradient regularization and MinMax training is re-implemented with Pytorch, which may cause small deviations from the original reported accuracy.


Correspondingly, the robustness verification is based on referencing the adversarial testing accuracies from two currently strongest attacks: 30-step PGD (PGD-30) attack based on cross-entropy loss and 30-step CW (CW-30) attacks based on C&W loss. The adversarial perturbations are constrained by the -norm as 0.3/1.0, 8.0/255.0, 16.0/255.0 on MNIST, CIFAR10, and ImageNet respectively.

Model Defense (x)
FcNet No Defense 73.36 0.73% 0.2%
AdvTrain 80.43 4.43% 2.12%
MinMax 297.2 82.9% 80.3%
LeNet No Defense 93.8 2.82% 1.01%
AdvTrain 264.7 51.8% 46.2%
MinMax 958.4 92.3% 90.3%

*AdvTrain: [Kurakin et al.2016], MinMax: [Madry et al.2018].

Table 1: Robustness Metric Evaluation on MNIST

5.2 Robustness Metric Evaluation

MNIST Experiments: On MNIST dataset, the results are shown in Table 1: (1) The results firstly demonstrate that our metric could well reflect different robustness degrees on the same neural network model. For example, three FcNet models show increasing robustness in , which aligns well with their reference accuracies from both PGD-30 and CW-30 attack; (2) The results also show the generality of our metric on FcNet and LeNet models.

CIFAR10 Experiments: Table 2 shows the experimental results on CIFAR10, including three common neural network models (i.e. ConvNet, ResNet18, and DenseNet), as well as three robustness settings (i.e. No defense, Gradient Regularization, and MinMax Training). The experiment results show that our metric has the same scale with the referenced adversarial testing accuracies, implying our metric’s good generality on complex neural network models and different defenses.

To better illustrate a neural network model’s robustness, we visualized three ResNet18 models with different robustness degrees in Fig. 5. As the robustness degree increases, the models’ loss surfaces become more and more smooth. Our empirical visualization results imply that the smoother decision surface in the input space indicates better adversarial robustness, which coincidentally matches the parameter space generalization hypothesis [Keskar et al.2017].

Figure 5: Different models’ loss visualizations: model with higher robustness demonstrates more smooth and stable geometry.

ImageNet Experiments: In the experiment on MNIST and CIFAR10, our proposed robustness metric aligns well with adversarial testing accuracies of PGD-30 and CW-30. However, when we evaluate the MinMax model on ImageNet, two reference accuracies demonstrate certain inconsistency:

The MinMax model is released as the base model of the state-of-the-art defense on ImageNet by [Xie et al.2018]. To conduct the MinMax training, the reported time needed is about 52 hours on 128 V100 GPUs. Despite that, the reported accuracy showed very good robustness estimation of the model, which can achieve 42.6% under 2000-iteration PGD attacks. However, when we more thoroughly evaluated the model by CW-30 attack, we found the model’s testing accuracy is only 12.5% under the attack.

We call such a case as “unreliable estimation” in PGD-based adversarial testing, whose robustness estimation cannot generalize to all attacks. We will discuss this case and several other similar ones in details in Sec. 5.3, and reveal the current deficiency of adversarial testing based robustness estimation.

Model Defense
ConvNet No Defense 58.3 0.0% 0.0%
GradReg 86.5 16.0% 14.8%
MinMax 182.6 39.6% 38.7%
ResNet18 No Defense 67.9 0.0% 0.0%
GradReg 77.8 18.7% 17.5%
MinMax 162.7 44.3% 43.1%
DenseNet No Defense 59.1 0.1% 0.0%
GradReg 77.9 18.6% 17.2%
MinMax 142.4 39.1% 38.8%

*GradReg: [Ross and Doshi-Velez2018], MinMax: [Madry et al.2018].

Table 2: Robustness Metric Evaluation on CIFAR10

5.3 Our Metric vs. Adversarial Testing Accuracy

As mentioned above, the adversarial testing accuracy from different adversarial attacks may demonstrate certain inconsistency, and therefore mislead the robustness estimation. In addition to the ImageNet example, we also include another two cases that the adversarial testing accuracies yield unreliable robustness estimation: defensive distillation [Papernot et al.2016] and gradient inhibition [Liu et al.2018].

To demonstrate the unreliability of these cases, we train three new models on MNIST and CIFAR10 respectively, using natural training, defensive distillation, and gradient inhibition methods. For the ImageNet model, we use a public released model 444MinMax model is obtained in following link: /facebookresearch/imagenet-adversarial-training., which can achieve a state-of-the-art accuracy 45.5% against PGD-30 attack (within 16/255).

The overall experimental results are shown in Table. 3, which shown that though all these defenses can achieve high PGD-30 adversarial testing accuracy, they actually bring very limited robustness improvement:

On MNIST and CIFAR10, the distillation and gradient inhibition defenses provide the models with high adversarial testing accuracy against both FGSM and PGD-30 attacks (even higher than state-of-the-art MinMax methods), which seemly indicates these models are significantly robust. However, when measured by our metric, we have the opposite conclusion: these models are merely as robust as no-defense models and incomparable to the robust models trained by MinMax. To further verify this conclusion, we test these models with more adversarial settings and the testing accuracy dramatically degrades to almost zero in all the tests.

Dataset Defense
MNIST No Defense 23.4% 3.5% 89.8 0.5%
Distillation 97.3%* 97.1%* 70.5 0.0%
GradInhib 98.3%* 97.8%* 87.0 0.0%
MinMax 98.3% 92.3% 958.4 90.3%
CIFAR10 No Defense 7.6% 0.1% 58.3 0.0%
Distillation 72.6%* 72.3%* 60.5 0.0%
GradInhib 79.8%* 79.7%* 70.0 0.1%
MinMax 55.7% 39.6% 182.6 38.7%
ImageNet No Defense 15.5% 7.3% 1.710 4.6%
MinMax 46.9% 45.7% 2.310 12.5%

*Distillation: [Papernot et al.2016], GradInhib: [Liu et al.2018], MinMax: [Madry et al.2018] *Bold accuracies denote the unreliable robustness estimation cases.

Table 3: Unreliable Cases of Adversarial Testing Accuracy

The tests above further prove our statement: the adversarial testing accuracy based on PGD-30 may yield unreliable robustness estimation, which cannot reflect the model’s intrinsic robustness. This is because the distillation and gradient inhibition both rely on the input gradient vanishing to achieve robustness enhancement, which is mainly provided by the nonlinear softmax and negative log loss. Since C&W attack doesn’t rely on the cross-entropy loss, it can easily crack those two defenses. Such a case also applies to the ImageNet model trained with MinMax defenses as shown in the last two rows of Table. 3.

In contrast, our robustness metric can successfully reflect the model true robustness property with different defenses. Under all the above cases, the robustness metric gives reliable robustness estimation, remaining un-affected by defense methods and the unreliable PGD adversarial testing accuracy.

5.4 Reparemeterization Invariance Evaluation

The reliability of our proposed metric is also reflected in its invariance from the model parameter scaling.

Previous work [Keskar et al.2017] tried to define a metric, named -sharpness, to evaluate the loss surface’s geometry properties. We adopt its original definition and apply it into our input space to evaluate sharpness of input space loss surface, which can empirically reflect the adversarial generalization as aforementioned.

The experiment results are shown in Table. 4, where denotes the -sharpness, denotes our robustness metric based on softmax layer without normalization, and denotes our robustness metric with normalization. For the test cases, Org. indicates the tests with the original model without reparemeterization, *100 and /100 denote the model’s logit layer weights and biases are scaled accordingly. Please note that, such scaling won’t introduce accuracy and robustness change in practice  [Dinh et al.2017].

The experiments show that, both -sharpness and un-normalized give very distinct robustness estimations influenced by the reparameterization. By contrast, the normalization method successfully alleviates the scaling influence and enables our metric to keep a stable estimation under model reparameterization. Therefore, our metric could thus be used to more precisely capture one model’s robustness degree without being affected by model reparameterization.

Model Metric No Defense Model MinMax Model
Org. *100 /100 Org. *100 /100
ConvNet 22.7 109.6 0.095 0.43 3.20 0.004
0.96 0.012 1677.8 39.6 5.33 377443.3
58.3 59.5 57.9 182.5 183.1 177.36
ResNet18 15.4 87.4 0.048 0.085 5.63 0.005
0.963 0.0097 3178.8 17.11 0.158 128089
110.9 110.8 102.5 193.0 192.62 172.5
Table 4: Robustness Metrics Comparison under Reparemeterization

5.5 Efficiency of the Robustness Metric

Here we show the efficiency of our metric compared to adversarial testing methods. Since we are evaluating the model properties, theoretically it should be invariant to how many input we choose. Here we show that as the test-batch-size increases, the calculated robustness metric gradually converge to a stable robustness estimation which is close to the whole test set average robustness. Fig. 6 shows the relation with the batch size and the robustness deviation between batches with same batch-size. We can see that on both datasets, as the batch size increases, the robustness measurement become more accurate since they have much smaller deviations. With the batch-size equals to 1000 (or less), we could get the model’s robustness estimation with less than 10% deviation on MNIST and 5% on CIFAR10, which demonstrate higher efficiency than accuracy testing running on the whole test set.

5.6 Robustness Estimation Grading

Based on our experiments, we could derive a rough relationship between different robustness evaluation score and the adversarial accuracy. For example, on MNIST dataset within common threat model (), we can define the model robustness by three levels: Vulnerable (), Fairly Robust () and Robust (). In such a case, the corresponding robust metric range will be (), (), (), which could be used to quickly grade a neural network’s robustness. The robustness grading for CIFAR and ImageNet cannot be well developed yet due to the limited robustness currently (40% and 15%).

Figure 6: The robustness measurement is increasingly stable with the increasing batch size (100 to 1000).

6 Conclusion

In this work, through visualizing and interpreting neural networks’ decision surface in input space, we show that adversarial examples are essentially caused by neural networks’ neighborhood under-fitting issue. Oppositely, robust models manage to smoothen their neighborhood and relieve such under-fitting effect. Guided by such observation, we propose a model intrinsic robustness evaluation metric based on the model predictions’ maximum KL-divergence in a given neighborhood constrain. Combined with our new-designed normalization layer, the robustness metric shows multiple advantages than previous methods, including: great generality across dataset/models/attacks/defenses, invariance under reparameterization, and excellent computing efficiency.