1. Introduction
Recent years have seen rapid development of deep learning techniques as well as applications in a variety of domains such as computer vision (ResNet, ; vgg, ) and natural language processing (bert, ). There is a growing trend to apply deep learning to safety-critical tasks, such as face recognition (face_recognition, ), self-driving cars (selfdriving, ) and malware detection (malware, ). Unfortunately, deep neural networks (DNN) have been shown to be vulnerable to attacks and to lack robustness. For instance, they are easily subject to adversarial perturbation (cw, ; fgsm, ), i.e., a DNN makes a wrong decision given a carefully crafted small perturbation on the original input. Such attacks have been demonstrated successfully in the physical world (DBLP:journals/corr/KurakinGB16, ). This suggests that DNN, just like software systems, must be properly analyzed and tested before they are applied in safety-critical systems.
The software engineering community welcomed the challenge and opportunity. Multiple software testing approaches, e.g., differential testing (deepxplore, ), mutation testing (DeepMutation, ; ouricse19, ) and concolic testing (DeepConcolic, ), have been adapted to the context of testing DNN. Inspired by the noticeable success of code coverage criteria in testing traditional software systems, multiple coverage criteria (metric and criterion are used interchangeably), e.g., neuron coverage (deepxplore, ; DeepTest, ) and its extensions DeepGauge (DeepGauge, ), MC/DC (DeepCover, ), and Surprise Adequacy (SurpriseAdequacy, ), have been proposed. Coverage criteria quantitatively measure how well a DNN is tested and offer guidelines on how to create new test cases. The underlying assumption is that a DNN which is better tested, i.e., with higher coverage, is more likely to be robust.
This assumption, however, is often not examined or only evaluated with limited DNN models and structures, making it unclear whether the results generalize. Furthermore, how a test suite improves the quality of a DNN is different from that of a software system. A software system is improved by fixing bugs revealed by a test suite. A DNN is typically improved by retraining with the test suite. While existing studies show that retraining often improves a DNN’s accuracy to some extent (deepxplore, ; DeepConcolic, ), it is not clear whether there is a correlation between the coverage of the test suite and the improvement, i.e., does a set of inputs with higher coverage imply better improvement (on DNN robustness)?
Inspired by the work in (DBLP:conf/icse/InozemtsevaH14, ), we conduct an empirical study to evaluate whether coverage is correlated with robustness of DNN and additional metrics which are associated with the quality of DNN (DeepSec, ). In particular, we would like to answer the following research questions.

Are there correlations between testing coverage criteria and the robustness of DNN?

Are there correlations among different coverage criteria themselves?

Are there correlations between the improvement of coverage criteria and the improvement in terms of robustness after the DNN is retrained?

Are there metrics that are strongly correlated to the robustness of DNN or the robustness improvement after retraining?
Based on the answers to the above questions, we aim to provide practical guidelines for developing testing methods which contribute towards improving the robustness of DNN.
Conducting such an empirical study is highly nontrivial. First, we need a large set of real-world DNN for the study. However, training realistic DNN often takes a significant amount of time and resources. For instance, it takes GPU hours to train a ResNet101 model. Our study trained state-of-the-art DNN models (seed models trained with the original dataset and models retrained using the original dataset augmented with adversarial samples) with a variety of architectures on two popular datasets, i.e., MNIST (MNIST_Data, ) and CIFAR10 (Cifar, ). Obtaining these models took a total of 150 GPU hours.
Second, we need to obtain adversarial samples by attacking the trained original models. We adopt state-of-the-art attack methods, i.e., FGSM (fgsm, ), JSMA (jsma, ) and C&W (cw, ), to attack the original models, in order to obtain different adversarial sample sets and train different DNN models. Some of the adversarial attack methods, e.g., JSMA and C&W, are known to be time-consuming. It takes us a total of GPU hours to obtain adversarial samples for all the original models with the attack methods.
Last but not least, we need a systematic and automatic way of evaluating the coverage, robustness, and other associated metrics, which is not always straightforward. For instance, there are multiple definitions of robustness in the literature (Robustness, ), (Clever, ), some of which are complicated and expensive to compute (e.g., it took GPU hours to compute a CLEVER score (Clever, ) for GoogLeNet22). In this work, we develop a self-contained toolkit called DRTest (Deep Robustness Testing), which calculates a comprehensive set of metrics on DNN, including 1) testing coverage criteria proposed for DNN, 2) robustness metrics for DNN, and 3) a set of attack and defense metrics for DNN. A total of GPU hours are spent on computing these metrics based on the above-mentioned models.
Our empirical study is conducted as follows. For each dataset, we first train diverse seed models (with state-of-the-art architectures), attack each seed model with different attacking methods to generate adversarial samples (with varying attack parameters), augment the training dataset with the generated adversarial samples, and retrain the model. We apply DRTest to calculate a range of metrics for every model. Afterwards, we apply a standard correlation analysis algorithm, Kendall’s rank correlation coefficient (Kendall, ), to analyze the correlations between the metrics.
In summary, we make the following contributions.

We conducted an empirical study to systematically investigate the correlation between coverage, robustness and related metrics for DNN. Based on the empirical study results, we discuss potential research directions on DNN testing.

We implemented a self-contained and extensible toolkit which calculates a large set of metrics that can be used to quantitatively measure different aspects of DNN.

We publish online our models, adversarial samples, retrained models as well as DRTest, which can be used as a benchmark for future proposals on methods for DNN testing.
We organize the remainder of the paper as follows. Section 2 introduces the background knowledge of this work. Section 3 presents our research methodology. Section 4 shows details on our implementations. Section 5 reports our findings on the research questions. We present related works in Section 6 and conclude in Section 7.
2. Preliminaries
In this section, we briefly review preliminaries related to this work, which include Deep Neural Networks (DNN), adversarial attacks on DNN, testing methods for DNN, and robustness of DNN.
2.1. Deep Neural Networks
A DNN is an artificial neural network with multiple layers between the input and output layers. It can be denoted as a tuple where

is the input layer;

is a set of hidden layers and the output layer, each of which contains neurons, and the neuron in layer is denoted as and its value is ;

is a set of activation functions;

: is a set of transitions between layers. The output of each neuron is computed by applying the activation function to the weighted sum of its inputs, and the weights represent the strength of the connections between two linked neurons.
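The layer-by-layer computation described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function names, the bias terms and the toy 2-2-2 network are assumptions made for this example and are not taken from DRTest.

```python
def relu(z):
    """Rectified linear unit, one common choice of activation function."""
    return max(0.0, z)


def identity(z):
    """Linear activation, used here for the output layer."""
    return z


def dense_layer(inputs, weights, biases, activation):
    """Compute one fully connected layer.

    weights[l][i] is the strength of the connection from input neuron i
    to output neuron l; biases[l] is an assumed additive offset.
    """
    outputs = []
    for row, bias in zip(weights, biases):
        weighted_sum = sum(w * v for w, v in zip(row, inputs)) + bias
        outputs.append(activation(weighted_sum))
    return outputs


def forward(x, layers):
    """Propagate input x through a list of (weights, biases, activation)."""
    values = x
    for weights, biases, activation in layers:
        values = dense_layer(values, weights, biases, activation)
    return values


def classify(x, layers):
    """Return the index of the largest output value as the label."""
    scores = forward(x, layers)
    return max(range(len(scores)), key=lambda i: scores[i])


# Hypothetical 2-2-2 network used only for illustration.
toy_layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -0.25], relu),
    ([[1.0, 0.0], [0.0, 2.0]], [0.0, 0.0], identity),
]
```

For the input `[1.0, 0.5]`, both hidden neurons take the value 0.5, and the output layer assigns the larger score to the second label.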
In this work, we focus on DNN classifiers, which map a set of inputs to a finite set of labels. Given an input, a DNN classifier transforms information layer by layer and outputs a label for the input. In this work, we try to cover a wide range of (including state-of-the-art) DNN architectures. We briefly introduce them in the following.
LeNet (lenet, ) is one of the most representative DNN architectures. As shown in Fig. 1, the basic modules include the convolution layers (Conv), the pooling layers (Pool) and the fully connected layers (FC). Conv aims to extract different local features and Pool makes the extracted features insensitive to transformations such as translation, rotation and scaling. FC then maps the distributed feature representations from Conv and Pool to the label space.
VGG (vgg, ) is an advanced architecture for extracting CNN features from images. Compared to LeNet, VGG utilizes smaller convolution kernels (e.g., 3×3 or 1×1) and pooling kernels (e.g., 2×2), which significantly increases the expressive power.
GoogLeNet (googlenet, ), unlike most popular DNN models which obtain better accuracy by increasing the depth of the network, introduces an inception module with a parallel topology to expand the width of the model instead. The inception module helps to extract richer features and reduce dimensions using 1×1 convolution kernels, and aggregates convolution results of multiple sizes to obtain features from different scales and accelerate convergence.
ResNet (ResNet, ) improves traditional sequential CNNs by addressing the vanishing gradient problem that arises when expanding the number of layers. It utilizes shortcuts (also called skip connections), which add up the input and output of a layer and then pass the sum to the next layer as input.
2.2. Adversarial Attack
Since Szegedy et al. discovered that DNNs are intrinsically vulnerable to adversarial samples (i.e., sample inputs which are generated through perturbation with the intention to trick a DNN into wrong decisions) (Intriguing, ), many attacking approaches have been developed to craft adversarial samples. In the following, we briefly introduce popular attacking algorithms that we adopt in our work.
FGSM Goodfellow et al. proposed the Fast Gradient Sign Method (FGSM) (fgsm, ), the first and fastest attacking algorithm, which attempts to maximize the change of probability of the sample’s original label by following the gradient of its loss. The implementation of FGSM is straightforward and efficient:
$$x' = x + \epsilon \cdot \mathrm{sign}\left(\nabla_x J(x, c_x)\right) \tag{1}$$
where $J$ is the loss function for training, $c_x$ is the prediction of $x$ and $\epsilon$ is a hyperparameter to control the degree of perturbation.
JSMA Jacobian-based Saliency Map Attack (JSMA) (jsma, ) is a targeted attack method. First, it calculates a saliency map based on the Jacobian matrix of a given sample. Each value of the map represents the impact of the corresponding pixel on the target prediction. Then it greedily picks the most influential features each time and maximizes their values until it either successfully generates an adversarial sample or the number of modified pixels exceeds the bound. We refer the readers to (jsma, ) for details.
C&W Carlini et al. (cw, ) aim to craft adversarial samples with high confidence and small perturbation under a given distance metric by directly solving the following optimization problem:
$$\arg\min_{\delta} \; \|\delta\|_p + c \cdot f(x + \delta, t) \tag{2}$$
where $\|\delta\|_p$ is the perturbation according to p-norm measurements, e.g., $L_0$, $L_2$ and $L_\infty$; $f$ is a loss term that is small when $x + \delta$ is classified as $t$; $t$ is the target label and $c$ is a hyperparameter to balance the objectives. In order to prevent adversarial samples from taking illegal values, they devised a group of clip functions and loss functions. Readers can refer to (cw, ) for details.
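To make the gradient-sign step of FGSM concrete, the following sketch applies it to a two-feature logistic regression model, for which the gradient of the cross-entropy loss with respect to the input has a closed form. The model, its weights, the clipping range [0, 1] and all function names are assumptions chosen for illustration; attacks on real DNN obtain the gradient by automatic differentiation instead.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Probability of class 1 under a logistic regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def loss_gradient_wrt_input(w, b, x, y):
    """Gradient of the cross-entropy loss with respect to the input x.

    For logistic regression this is (p - y) * w, where p is the
    predicted probability of class 1 and y is the label (0 or 1).
    """
    p = predict_proba(w, b, x)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, epsilon, lo=0.0, hi=1.0):
    """One FGSM step: move each feature by epsilon along the gradient sign.

    The result is clipped to [lo, hi], an assumed valid input range.
    """
    grad = loss_gradient_wrt_input(w, b, x, y)
    return [min(hi, max(lo, xi + epsilon * sign(g))) for xi, g in zip(x, grad)]


# Hypothetical model and input, used only for illustration.
w, b = [2.0, -3.0], 0.0
x, y = [0.6, 0.3], 1
x_adv = fgsm(w, b, x, y, epsilon=0.1)
```

In this toy setting the original input is assigned to class 1, while the perturbed input, which differs by at most 0.1 per feature, falls on the other side of the decision boundary.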
2.3. Testing Deep Neural Networks
A variety of traditional software testing methods such as differential testing (differential1, ; differential2, ) and concolic testing (MCDC, ) have been adapted to the context of testing DNN (deepxplore, ; DeepConcolic, ) to find adversarial samples (in the hope of revealing bugs in DNN). Note that in the setting of DNN testing, a test case is a sample input. In the following, we review some recently proposed coverage criteria for DNN.
Neuron Coverage Neuron coverage (deepxplore, ) is the first coverage criterion proposed for testing DNN, which quantifies the percentage of neurons activated by at least one test case in the test suite. The authors also proposed a differential testing method to generate test cases to improve neuron coverage.
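A minimal sketch of this criterion is given below. It assumes that the activation values of all neurons have already been recorded for every test case; the function name and the `threshold` parameter are illustrative choices, and any per-layer scaling of activations applied by a concrete tool is omitted.

```python
def neuron_coverage(activation_traces, threshold=0.0):
    """Fraction of neurons activated by at least one test case.

    activation_traces[t][n] is the output of neuron n on test case t.
    A neuron counts as activated when its output exceeds `threshold`
    (an assumed, user-chosen value).
    """
    num_neurons = len(activation_traces[0])
    covered = set()
    for trace in activation_traces:
        for n, value in enumerate(trace):
            if value > threshold:
                covered.add(n)
    return len(covered) / num_neurons


# Two hypothetical test cases over four neurons.
traces = [
    [0.9, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.7, 0.0],
]
coverage = neuron_coverage(traces, threshold=0.5)
```

Here neurons 0 and 2 exceed the threshold in at least one test case, so half of the neurons are covered. Adding a test case that activates neuron 1 would raise the coverage to three quarters.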
DeepGauge Later, Ma et al. proposed DeepGauge (DeepGauge, ), which extends neuron coverage with coverage criteria defined on the activation values at two different levels. For instance, neuron-level coverage first obtains the upper and lower bounds of the values at each neuron during the training stage, divides this range into sections, and then evaluates whether each section is covered or a boundary has been crossed by the test suite. The layer-level coverage concerns how many neurons have been among the top active neurons of their layer at least once (TKNC), or which patterns formed by sequences of top active neurons on each layer are present (TKNP).
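The neuron-level criteria can be sketched as follows, under the assumption that activation traces are available for both the training set and the test suite. The function names are hypothetical, and the formulas follow our reading of the definitions in (DeepGauge, ): k-multisection coverage counts covered (neuron, section) pairs, boundary coverage counts neurons pushed above or below their training range, and strong activation coverage counts only the upper side.

```python
def neuron_bounds(train_traces):
    """Per-neuron lower and upper bounds observed on the training set."""
    num_neurons = len(train_traces[0])
    lows = [min(t[n] for t in train_traces) for n in range(num_neurons)]
    highs = [max(t[n] for t in train_traces) for n in range(num_neurons)]
    return lows, highs


def k_multisection_coverage(test_traces, lows, highs, k):
    """Fraction of (neuron, section) pairs hit by the test suite."""
    covered = set()
    for trace in test_traces:
        for n, v in enumerate(trace):
            lo, hi = lows[n], highs[n]
            if hi > lo and lo <= v <= hi:
                # Map the value to one of k equal-width sections.
                section = min(k - 1, int((v - lo) / (hi - lo) * k))
                covered.add((n, section))
    return len(covered) / (k * len(lows))


def boundary_coverage(test_traces, lows, highs):
    """Return (NBC, SNAC) for the given test traces."""
    upper, lower = set(), set()
    for trace in test_traces:
        for n, v in enumerate(trace):
            if v > highs[n]:
                upper.add(n)
            elif v < lows[n]:
                lower.add(n)
    num_neurons = len(lows)
    nbc = (len(upper) + len(lower)) / (2 * num_neurons)
    snac = len(upper) / num_neurons
    return nbc, snac


# Hypothetical traces for a model with two neurons.
lows, highs = neuron_bounds([[0.0, 1.0], [1.0, 3.0]])
tests = [[0.1, 1.2], [0.9, 3.5], [0.6, 2.9]]
knc = k_multisection_coverage(tests, lows, highs, k=4)
nbc, snac = boundary_coverage(tests, lows, highs)
```

In this example five of the eight (neuron, section) pairs are covered, and only the second neuron is driven above its training range.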
Surprise Adequacy Based on the idea that a good test suite should be ‘surprising’ compared to the training set, Kim et al. (SurpriseAdequacy, ) defined two measures of how surprising a testing input is with respect to the training set. One is based on kernel density estimation and evaluates the likelihood of the testing input with respect to the training set. The other measures the Euclidean distance between the neuron activation trace of a given input and those of the training set. Readers are referred to (SurpriseAdequacy, ) for details.
2.4. Robustness of Deep Neural Networks
Given the existence of adversarial samples, adversarial robustness becomes an important desired property of a DNN which measures its resilience against adversarial perturbations. Following the definitions proposed by Katz et al. (katz2017reluplex, ), adversarial robustness can be categorized into local adversarial robustness and global adversarial robustness depending on different contexts.
Definition 2.1 (Local Adversarial Robustness). Given a sample input $x$, a DNN $D$ and a perturbation threshold $\epsilon$, $D$ is $\epsilon$-local robust at $x$ iff for any sample input $x'$ such that $\|x - x'\|_p \le \epsilon$, we have $D(x) = D(x')$, where $\|\cdot\|_p$ is the $p$-norm to measure the distance between two sample inputs.
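Definition 2.2 (Global Adversarial Robustness). For any two sample inputs, a DNN and two thresholds, the DNN is globally robust iff, whenever the distance between the two inputs is within the first threshold, the difference between the corresponding outputs of the DNN is within the second threshold.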
Local robustness measures the robustness on a specific input, while global robustness measures the robustness on all inputs.
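Local robustness can be falsified, although not proven, by searching for a counterexample inside the perturbation ball. The sketch below does this by random sampling under the infinity norm. The sampling strategy, the choice of norm, the toy classifier and all names are assumptions for illustration; returning `None` only means that no violation was found among the sampled points.

```python
import random


def find_local_counterexample(classify, x, epsilon, trials=1000, seed=0):
    """Search for x' with max_i |x'_i - x_i| <= epsilon and a different label.

    Returns such an x' if one is sampled, otherwise None. A None result
    does NOT establish robustness; it only reports that sampling failed.
    """
    rng = random.Random(seed)
    label = classify(x)
    for _ in range(trials):
        candidate = [xi + rng.uniform(-epsilon, epsilon) for xi in x]
        if classify(candidate) != label:
            return candidate
    return None


def toy_classifier(x):
    """Hypothetical classifier: label 1 iff the two features sum above 1."""
    return 1 if x[0] + x[1] > 1.0 else 0


far_from_boundary = find_local_counterexample(toy_classifier, [0.2, 0.2], 0.1)
near_boundary = find_local_counterexample(toy_classifier, [0.48, 0.48], 0.1)
```

For the first input, no perturbation of size 0.1 can reach the decision boundary, so the search returns `None`. For the second input, which lies close to the boundary, a label-changing perturbation is found.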
Verifying whether a DNN satisfies local or global robustness is an active research area (katz2017reluplex, ; AI2, ; Formalguarantees, ), but existing methods do not scale to state-of-the-art DNNs (especially for global robustness). Thus, multiple metrics have been proposed in the literature in order to empirically evaluate the adversarial robustness of a DNN (Clever, ; Robustness, ; virmaux2018lipschitz, ; fazlyab2019efficient, ). In the following, we introduce two widely used adversarial robustness metrics, one local (Clever, ) and one global (Robustness, ), which we adopt in this work.
Global Lipschitz Constant The Lipschitz constant (Robustness, ) measures the sensitivity of a model to adversarial samples. The Lipschitz constant of a function is only related to the parameters of that function. In our context, the function is in the form of a DNN. Its Lipschitz constant can be calculated recursively layer-by-layer from the output layer all the way to the input layer, taking into consideration the shortcuts in ResNet and the inception modules in GoogLeNet. For example, the Lipschitz constant of a DNN which has a structure similar to LeNet and VGG is the product of the Lipschitz constants of all the hidden layers and the output layer.
As an example, we introduce how Lipschitz Constant is calculated for a fully connected layer. Readers are referred to (Parseval, ) for the calculation of convolution and aggregation layers. Let and be two inputs of layer ; and be their corresponding outputs; and be the parameter of the connection between the neuron in layer and the neuron in layer ; and be the number of neurons of layer . The Lipschitz Constant for layer is defined as (so that layer satisfies ).
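The product rule for sequential networks can be illustrated with the sketch below. It makes several assumptions that are not taken from the text: distances are measured in the 2-norm, so that the Lipschitz constant of a fully connected layer is the spectral norm of its weight matrix; activation functions are 1-Lipschitz (as ReLU is); and the network has no shortcuts or inception modules. The spectral norm is estimated by power iteration, and all names are hypothetical.

```python
import math


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def spectral_norm(w, iterations=200):
    """Largest singular value of w, via power iteration on w^T w."""
    wt = transpose(w)
    v = [1.0] * len(w[0])
    for _ in range(iterations):
        u = matvec(wt, matvec(w, v))
        norm = math.sqrt(sum(c * c for c in u))
        if norm == 0.0:
            return 0.0
        v = [c / norm for c in u]
    wv = matvec(w, v)
    return math.sqrt(sum(c * c for c in wv))


def lipschitz_upper_bound(weight_matrices):
    """Product of per-layer spectral norms for a sequential network.

    With 1-Lipschitz activations this product upper-bounds the
    Lipschitz constant of the whole network in the 2-norm.
    """
    bound = 1.0
    for w in weight_matrices:
        bound *= spectral_norm(w)
    return bound


# Hypothetical two-layer network; weights[l][i] connects input i to output l.
layers = [
    [[3.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.0], [0.0, 2.0]],
]
bound = lipschitz_upper_bound(layers)
```

The two layers have spectral norms 3 and 2, so the resulting bound for this toy network is 6.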
CLEVER Score Another robustness metric we adopt is the CLEVER score (Cross-Lipschitz Extreme Value for nEtwork Robustness) (Clever, ), which is a recently proposed attack-independent robustness score for large-scale networks.
Given a sample input $x$ and a DNN $D$, we say $x'$ is a perturbed example of $x$ with perturbation $\delta$ if $x' = x + \delta$. Let $\|\delta\|_p$ denote the $p$-norm of $\delta$. An adversarial example is a perturbed example that satisfies $D(x') \neq D(x)$. The minimum adversarial distortion of $x$, denoted as $\Delta_{\min}$, is defined as the smallest $\|\delta\|_p$ over all adversarial examples of $x$. The idea is to approximately calculate a lower bound of $\Delta_{\min}$ for a given sample utilizing extreme value theory. The lower bound, denoted by $\beta_L$ where $\beta_L \le \Delta_{\min}$, is defined such that any perturbed example of $x$ with $\|\delta\|_p \le \beta_L$ is not an adversarial example. The CLEVER score has been experimentally evaluated, which shows that it is consistent with other robustness evaluation metrics, e.g., attack-induced distortion metrics. Readers are referred to (Clever, ) for details.
3. Methodology
3.1. Experiment Design
The overall workflow of our experiment is shown in Figure 2. We follow a common DNN testing process (e.g., as in (deepxplore, ; DeepGauge, )), as shown at the top of the figure, whilst extracting a variety of metrics (as shown in the middle of the figure) which are used for correlation analysis (as shown at the bottom). We start with training a model from a training set using state-of-the-art training methods. Afterwards, various adversarial attacks (fgsm, ; jsma, ; cw, ) are applied to generate new test cases. The last step is to augment the training set with the new test cases and obtain a retrained model.
We collect four different groups of metrics to characterize different components of the process, i.e., (1) a set of testing coverage metrics of both the original and the retrained models, (2) a set of attack metrics of different kinds of adversarial attacks on the original models, (3) a set of robustness metrics of both the original models and the retrained models, and (4) a set of defense metrics which measure the differences between the retrained model and the original model. We repeat the above (attack and retrain) process for the seed models, obtain in total models, calculate the corresponding metrics and then conduct correlation analysis on all these metrics. In the following, we illustrate the challenges and our design choices of each part in detail.
Adversarial Attacks We adopt three state-of-the-art DNN attack methods, i.e., FGSM (fgsm, ), CW (cw, ) and JSMA (jsma, ), which are introduced in Section 2.2, to generate adversarial samples. These attack methods are commonly used by previous DNN testing approaches, e.g., (DeepGauge, ; SurpriseAdequacy, ). The generated adversarial samples are combined with the original datasets as new (training and testing) datasets for model retraining.
Model Retraining For each original model, we obtain three sets of adversarial samples, one for each attack method. During model retraining, we combine the original training set with one set of the adversarial samples to obtain a new training set. We retrain the original model with the new training set to obtain a retrained model. As a result, we obtain three retrained models for each original model, one for each attacking method. We follow the standard partition of 60000/10000 for training and testing on the MNIST dataset and 50000/10000 for the CIFAR10 dataset.
Metric Calculation As our objective is to investigate the correlations between coverage, robustness and other metrics associated with DNN, we conduct a thorough survey of existing metrics. The collected metrics are categorized into four groups, i.e., testing metrics, robustness metrics, attack metrics and defense metrics. They are summarized in Table 1. Note that the attack metrics measure to what extent the attacks are successful and imperceptible, whereas the defense metrics mainly measure how well the retrained models preserve the accuracy of the original model. For brevity, we refer the readers to the original papers for details. We calculate the values of all metrics based on their original definitions and use the default parameters according to their original papers.
Metric Type  Metric Name         Description
Testing      NC                  Neuron Coverage (deepxplore, )
Testing      KNC                 K-multisection Neuron Coverage (DeepGauge, )
Testing      SNAC                Strong Neuron Activation Coverage (DeepGauge, )
Testing      NBC                 Neuron Boundary Coverage (DeepGauge, )
Testing      TKNC                Top-k Dominant Neuron Coverage (DeepGauge, )
Testing      TKNP                Top-k Dominant Neuron Patterns Coverage (DeepGauge, )
Testing      LSA/DSA             Surprise adequacy to training set (SurpriseAdequacy, )
Robustness   Lipschitz constant  The global Lipschitz constant (Robustness, )
Robustness   CL1/CL2/CLi         CLEVER score with L1/L2/L∞ norm (Clever, )
Attack       MR                  Misclassification Ratio (DeepSec, )
Attack       ACAC                Average Confidence of Adversarial Class (DeepSec, )
Attack       ACTC                Average Confidence of True Class (DeepSec, )
Attack       ALD                 Average Lp Distortion (DeepSec, )
Attack       ASS                 Average Structural Similarity (ass, )
Attack       PSD                 Perturbation Sensitivity Distance (psd, )
Attack       NTE                 Noise Tolerance Estimation (psd, )
Attack       RGB                 Robustness to Gaussian Blur (DeepSec, )
Attack       RIC                 Robustness to Image Compression (DeepSec, )
Attack       CC                  Computation Cost (DeepSec, )
Defense      CAV                 Classification Accuracy Variance (DeepSec, )
Defense      CRR/CSR             Classification Rectify/Sacrifice Ratio (DeepSec, )
Defense      CCV                 Classification Confidence Variance (DeepSec, )
Defense      COS                 Classification Output Stability (DeepSec, )
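As an illustration of how the first three attack metrics in Table 1 can be computed, the sketch below follows our reading of their definitions in (DeepSec, ) for untargeted attacks: MR is the fraction of adversarial examples that are misclassified, while ACAC and ACTC average, over the misclassified examples only, the confidence of the predicted (wrong) class and of the true class, respectively. The function name and the toy probabilities are assumptions for this example.

```python
def attack_metrics(probabilities, true_labels):
    """Return (MR, ACAC, ACTC) for a set of adversarial examples.

    probabilities[i] is the model's output distribution on adversarial
    example i, and true_labels[i] is the label of the original sample.
    """
    successes = []
    for probs, y in zip(probabilities, true_labels):
        predicted = max(range(len(probs)), key=lambda c: probs[c])
        if predicted != y:
            # Keep the confidence of the wrong class and of the true class.
            successes.append((probs[predicted], probs[y]))
    n = len(successes)
    mr = n / len(true_labels)
    acac = sum(p for p, _ in successes) / n if n else 0.0
    actc = sum(q for _, q in successes) / n if n else 0.0
    return mr, acac, actc


# Four hypothetical adversarial examples for a two-class model.
probs = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
labels = [0, 0, 0, 1]
mr, acac, actc = attack_metrics(probs, labels)
```

Three of the four examples are misclassified here, so MR is 0.75. A stronger attack in the sense of these metrics has a higher ACAC and a lower ACTC.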
Correlation Analysis We conduct correlation analysis, a statistical technique that shows whether and how strongly pairs of variables are correlated, on the metrics. We are particularly interested in observing which metrics are correlated with the robustness of a DNN model. The resulting correlation coefficient is a single value between -1 and 1, where 1 (and -1) means the most positively (and negatively) correlated, and 0 means no correlation. In this work, we adopt a commonly used correlation coefficient, Kendall’s rank correlation coefficient (Kendall, ), which is a rank-based correlation that measures the monotonic relationship between two variables, to measure the correlations between different metrics. Note that compared to alternative methods like the Pearson product-moment correlation coefficient (Pearson, ), Kendall’s rank correlation coefficient does not require that the dataset follows a normal distribution or that the correlation is linear. Since we adopt two popular datasets, MNIST and CIFAR10, to train different families of DNN models, we calculate the correlations of different metrics for the two datasets separately, in order to avoid the potential impact of the training data.
4. Implementation and Configurations
Our system is implemented based on the TensorFlow framework (tensorflow, ) and the architecture is shown in Figure 3. There are four layers, i.e., the data layer, the algorithm layer, the measurement layer and the analysis layer. Our implementation is designed to be extensible, i.e., each layer can be extended with new models and algorithms with little impact on the other layers. Our implementation, including all the data and algorithms, is open source on GitHub (https://github.com/icse2020/DRTest).
The data layer maintains all data used in our study, including the original models, the adversarial samples generated for the original models, and the retrained models. It interacts with all other layers. We use well-known models for image classification tasks in our experiment. To cover a range of different deep learning model structures, we adopt four different model families, including LeNet family models (LeNet1, LeNet4 and LeNet5 (lenet, )), VGG family models (VGG11, VGG13, VGG16, VGG19 (vgg, )), ResNet family models (ResNet18, ResNet34, ResNet50, ResNet101 (ResNet, )) and GoogLeNet family models (GoogLeNet12, GoogLeNet16, GoogLeNet22 (googlenet, )). In total, we have 14 model structures, which are representative image classification models.
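dataset  attack method  model family  parameter        success rate
MNIST    FGSM           LeNet         0.2, 0.3, 0.4    0.94
MNIST    FGSM           VGG           0.2, 0.3, 0.4    0.81
MNIST    FGSM           ResNet        0.2, 0.3, 0.4    0.83
MNIST    FGSM           GoogLeNet     0.2, 0.3, 0.4    0.71
MNIST    CW             LeNet         9, 10, 11        0.91
MNIST    CW             VGG           9, 10, 11        0.81
MNIST    CW             ResNet        9, 10, 11        0.91
MNIST    CW             GoogLeNet     9, 10, 11        0.90
MNIST    JSMA           LeNet         0.09, 0.1, 0.11  0.89
MNIST    JSMA           VGG           0.09, 0.1, 0.11  0.25
MNIST    JSMA           ResNet        0.09, 0.1, 0.11  0.75
MNIST    JSMA           GoogLeNet     0.09, 0.1, 0.11  0.52
CIFAR10  FGSM           VGG           0.01, 0.02, 0.03 0.76
CIFAR10  FGSM           ResNet        0.01, 0.02, 0.03 0.65
CIFAR10  FGSM           GoogLeNet     0.01, 0.02, 0.03 0.75
CIFAR10  CW             VGG           0.1, 0.2, 0.3    0.88
CIFAR10  CW             ResNet        0.1, 0.2, 0.3    0.90
CIFAR10  CW             GoogLeNet     0.1, 0.2, 0.3    0.90
CIFAR10  JSMA           VGG           0.09, 0.1, 0.11  0.80
CIFAR10  JSMA           ResNet        0.09, 0.1, 0.11  0.79
CIFAR10  JSMA           GoogLeNet     0.09, 0.1, 0.11  0.75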
We adopt two popular publicly available datasets, i.e., MNIST (MNIST_Data, ) and CIFAR10 (Cifar, ), to train DNN models in our work. MNIST is a set of handwritten digit images. It contains 70,000 images in total. Each image in the MNIST dataset is single-channel of size 28×28. CIFAR10 is a set of color images. It contains 10 classes, each of which has 6,000 images, and the input size of each image is 32×32×3.
The algorithm layer contains a set of algorithms for attacking DNN as well as algorithms for defending DNN through retraining. For each trained model, we adopt three state-of-the-art attack methods (i.e., FGSM, CW and JSMA) to generate adversarial samples. The principle of choosing parameters for each attack is to balance the imperceptibility and the success rate of generating adversarial samples. For MNIST, we adopt the same parameters from cleverhans (cleverhans, ) for all three attacks. For CIFAR10, we slightly changed the parameters of FGSM and CW in order to obtain better imperceptibility. The parameters chosen include the attack step size for FGSM, the initial trade-off constant for tuning the relative importance of the size of the perturbation and the confidence of classification for CW, and the maximum percentage of perturbed features for JSMA.
To further avoid bias introduced by hyperparameters, we run each attack method on the original dataset multiple times, each time with a different hyperparameter configuration. Then we combine the successful adversarial samples generated from these runs as the adversarial sample set for model retraining. Table 2 shows the details of the hyperparameter configurations for each attack method, and the column "parameter" summarizes the hyperparameter configurations used in each run of attack.
During training and retraining, we adopt a learning rate of , a batch size of for all models in the two datasets. For MNIST, a test accuracy above is accepted in both training and retraining. For CIFAR10, a test accuracy above is accepted during training process and a test accuracy above is required for retraining.
The measurement layer contains all implementations for calculating the metrics shown in Table 1. We calculate four robustness values, i.e., the Global Lipschitz Constant (Lipz) and the CLEVER scores (CL1, CL2 and CLi), for each model. Note that LeNet is not feasible for CIFAR10. In our experiment, since calculating the CLEVER score is extremely time-consuming for GoogLeNet, we reduce the number of images and the sampling parameter, as it is reported that a reduced number of samples is usually sufficient to obtain a reasonably accurate robustness estimation (Clever, ). We calculate the coverage criteria of different DNN models with the same test suite (i.e., the original test suite of MNIST or CIFAR10) and obtain one set of values of each coverage criterion on MNIST and one on CIFAR10.
Defense Metrics are calculated for all the defense-enhanced models, i.e., models after adversarial training, according to their original definitions (DeepSec, ). We obtain one set of values for each defense metric on MNIST and one on CIFAR10. Attack Metrics are calculated for the generated adversarial examples of each attack method; all parameters of the attack metrics are set based on their original definitions (DeepSec, ). We likewise obtain one set of values for each attack metric on MNIST and one on CIFAR10.
We additionally calculate a set of difference metrics, which are denoted as Metricdiff. For instance, Lipzdiff is the Lipschitz Constant of the retrained model minus that of the original model. We obtain one set of such differences for each robustness metric on MNIST and one on CIFAR10. Similarly, we calculate coverage difference metrics by subtracting the coverage achieved by the augmented test set (i.e., the original test set plus the adversarial samples) from that of the original test set, again obtaining one set of values for each coverage metric on MNIST and one on CIFAR10.
The analysis layer implements the correlation analysis algorithm (Kendall, ). We first plot the data to observe the trend and then decide on the correlation analysis method to use. By observing the data plot, we found that the data does not show a linear trend. Therefore, we choose the Kendall’s rank correlation coefficient (Kendall, ), which does not assume that the data follows a normal distribution or the variables have a linear correlation.
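The difference between a rank-based and a linear correlation coefficient can be seen in the following sketch, which implements Kendall's tau (the tau-b variant, which accounts for ties) and Pearson's coefficient in plain Python and compares them on data that is monotonic but not linear. The implementation is a simple quadratic-time illustration written for this example and is not the code used in DRTest; in practice a statistics library would typically be used and would also report a p-value.

```python
import math


def kendall_tau_b(xs, ys):
    """Kendall's rank correlation (tau-b) between two equal-length lists."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: counted in neither term
            elif dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x)
                      * (concordant + discordant + ties_y))
    # With zero variance in either variable the coefficient is undefined.
    return (concordant - discordant) / denom if denom else float("nan")


def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical data: y grows monotonically but exponentially with x.
xs = [1, 2, 3, 4, 5, 6]
ys = [math.exp(x) for x in xs]
tau = kendall_tau_b(xs, ys)
r = pearson_r(xs, ys)
```

Because every pair of points is ordered consistently, Kendall's coefficient reaches its maximum of 1, whereas Pearson's coefficient is noticeably lower since the relationship is not linear. When one variable is constant, the function returns NaN, reflecting that no valid coefficient exists.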
All experiments were conducted using four GPU servers. Server 1 has 1 Intel Xeon 3.50GHz CPU, 64GB system memory and 2 NVIDIA GTX 1080Ti GPU. Server 2 has 2 Intel Xeon 2.50GHz CPU, 126GB system memory and 4 NVIDIA GTX 1080Ti GPU. Server 3 has 2 Intel Xeon 2.50GHz CPU, 96GB system memory and 4 NVIDIA GTX 1080Ti GPU. Server 4 has 1 Intel Xeon 2.50GHz CPU, 119GB system memory and 2 Tesla P100 GPU. We remark that the GPUs on the four servers are not always fully occupied; 6 GPUs are used on average during the experiment period.
In total, the experiment took more than GPU hours to finish. Table 3 shows the time spent on different steps, i.e., on generating adversarial examples, training and retraining, as well as metric calculations for each dataset on each model. The unit is GPU hour. The time for correlation calculation is negligible compared to the other steps. The most time-consuming step is the metric calculation, which took 1350 GPU hours for the ResNet family on CIFAR10. The most time-consuming metrics are the coverage criteria, whose cost varies significantly depending on the model structure. Adversarial sample generation is also time-consuming.
dataset  model family  generate AE  train & retrain  metric calc
MNIST    LeNet         <0.5         <0.5             <0.5
MNIST    VGG           160          6                420
MNIST    ResNet        240          12               1200
MNIST    GoogLeNet     120          25               300
CIFAR10  VGG           540          12               550
CIFAR10  ResNet        450          45               1350
CIFAR10  GoogLeNet     300          50               330
5. Findings
5.1. Research Questions
RQ1: Are there any correlations between existing test coverage criteria and the robustness of the DNN models?
To answer the question, we conduct correlation analysis on the coverage metrics and the robustness metrics of all models on the original test set. The results are shown in Figure 4. The number and the color represent the strength of the correlation. The correlation value is a number between and . Positive number (and blue color) indicates positively correlated and negative number (and red color) indicates negative correlated. The larger the absolute number is, the stronger the correlation is. The darker the color is, the stronger the correlation is. We measure the pvalue of the sample data set we have and regard pvalue greater than as insignificant. An “X” mark means that we cannot make a decision because pvalue is larger than (i.e., insignificant) and a question mark “?” means that there are no valid results since the standard variation of the data is . The same notations are used in subsequent figures as well. We summarize the results in the following two aspects. According to the definition of correlation in Guildford scale (Guildfordscale, ), an absolute value of less than means that the (positive or negative) correlation is low; an absolute value of  means that the correlation is moderate; and otherwise the correlation is high or very high (i.e.,  or above , respectively).
We have the following observations based on Fig. 4. First, there is no significant or negative correlation between coverage and robustness metrics. In particularly, neural coverage is negatively correlated (i.e., with a value between and ) with the CLEVER score and is not significantly correlated with Lipschitz constant for both MNIST and CIFAR10. Moreover, KNC, TKNC and LSA also show negative correlations with CLEVER score on CIFAR10. It suggests that a DNN is less robust if the test set has a larger neuron coverage (although the strength of the correlation is weak), which is unexpected. Second, there is no significant correlation between any of the other coverage and any of the robustness metrics on the MNIST dataset. For the CIFAR10 dataset, positive correlation is only observed between SNAC and the CLEVER score, and the strength is low. This result suggests that a DNN model which achieves high coverage is not necessary robust and vice versa.
We further investigate the correlation among the test coverage criteria themselves. It can be observed from Fig. 4 that NC, KNC, TKNC, LSA and DSA are positively correlated with each other. NBC and SNAC are correlated with each other with medium or high strength, whereas they have no (or a weak negative) correlation with the other metrics. The results are consistent with the observations reported in (DeepGauge, ) and (SurpriseAdequacy, ), which propose these coverage criteria. This suggests that although the coverage criteria are defined differently, they are in general correlated (except for the boundary coverage criteria).
We have the following answer to RQ1.
Different coverage criteria are correlated with each other. There is limited correlation between the coverage criteria and the robustness metrics.
RQ2: Does retraining with new test cases which improve coverage criteria improve the robustness of a DNN model?
To answer this question, we conduct correlation analysis on the difference in coverage criteria and the difference in robustness metrics before and after retraining. The results are shown in Fig. 5. We observe that there is no correlation between the difference in any coverage criterion and the difference in any robustness metric, except for a negative correlation between TKNCdiff and the CLEVER scores for all the CIFAR10 models. This result casts a shadow over existing testing approaches, which are designed to generate test cases that achieve high coverage in the hope that such test cases can be used to improve the adversarial robustness of DNN models.
We thus have the following answer to RQ2.
Retraining with new test cases which improve the coverage criteria does not necessarily improve the model robustness.
RQ3: Are there metrics that are strongly correlated to the improvement of model robustness?
The above results show that existing test coverage criteria have limited correlation with the robustness of DNN models, and that testing methods based on improving coverage do not necessarily improve the robustness of DNN models. The question is then: are there metrics which are correlated with the improvement of model robustness? To answer the question, we systematically conduct correlation analysis between all metrics (or the metrics' differences before and after retraining) and the improvement of model robustness.
The correlations between the defense metrics and the improvement of robustness are shown in Fig. 5. We observe that there are positive correlations between the difference of the CLEVER scores and all the defense metrics on CIFAR10. In particular, the correlation is of medium level for CRR and CAV. CRR and CAV measure how much the defense-enhanced model preserves the functionality of the original model (DeepSec, ). Intuitively, this indicates that a defense method leads to a larger robustness improvement if the original model is better preserved by the defense-enhanced model. Furthermore, given the huge cost of computing robustness metrics, such positive correlations potentially provide a lightweight way of estimating the effectiveness of a model enhancement method.
We additionally analyze the correlation between the attack metrics and the improvement of coverage criteria. We have the following observations from the results shown in Fig. 6. There are correlations between the differences of TKNP and KNC and the attack metrics. Furthermore, NTE is positively correlated with KNCdiff, NBCdiff and SNACdiff. RGB is positively correlated with NCdiff, NBCdiff and SNACdiff. Intuitively, NTE and RGB measure the robustness of adversarial samples, which implies that more robust adversarial samples contribute more to the improvement of coverage metrics. Lastly, there is no correlation between the robustness differences and the attack metrics for the CIFAR10 dataset. For the MNIST dataset, we observe negative correlations between the CLEVER score differences and ACAC, ALD2, RIC and NTE, and positive correlations with ASS and ACTC. These observations indicate that more confident, perceptible and robust adversarial samples contribute more to improving the coverage criteria.
We have the answer to RQ3.
Some defense metrics are positively correlated to the improvement of model robustness.
RQ4: Are the correlation results consistent across different datasets, model families and correlation analysis methods?
This question examines whether the correlation results are universal or may vary across different datasets, model families or correlation analysis methods. To answer this question, we systematically repeat the correlation analysis using data obtained from different datasets and model families. For the sake of space, we omit the details and refer the readers to the supplementary materials available at the online repository.
Overall, while the correlation between testing coverage and robustness on MNIST and CIFAR10 is mostly consistent, we do observe that some correlations vary slightly across the two datasets. For instance, the attack metrics (except ALDinf) show correlation with CLdiff on MNIST but not on CIFAR10. The defense metrics show strong correlation with robustness and robustnessdiff on CIFAR10, which is not the case on MNIST.
There are also inconsistent correlation results across different model families. The correlation results on the MNIST, LeNet and VGG families are consistent, which is expected since they have similar model structures. However, it is surprising that models in the GoogLeNet family often show opposite correlation results to those of the MNIST, LeNet and VGG families, especially for the correlation between the attack metrics and the improvement of model robustness. This can be explained by the fact that GoogLeNet has a rather different architecture from MNIST, LeNet and VGG (GoogLeNet tends to have more neurons in a layer instead of having more layers).
The above-mentioned inconsistency suggests that the correlation may depend on the dataset and, more noticeably, the model architecture, which further complicates the picture.
Lastly, we apply different correlation analysis algorithms (including the Pearson product-moment correlation (Pearson, ) and Spearman’s rank-order correlation (spearman, )) to observe whether the results are consistent. Overall, although the results are not identical, the differences are not significant and the conclusions (e.g., whether two metrics are positively or negatively correlated, or strongly or weakly correlated) remain consistent. We choose to present the Kendall correlation coefficients in this work as they require the fewest assumptions on the underlying data. The results of the other correlation analysis algorithms are presented in the supplementary materials online.
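To make the remark about assumptions concrete, the following sketch compares the three coefficients on data with a monotone but nonlinear relationship. Pearson's coefficient measures linear association, whereas the rank-based Spearman and Kendall coefficients depend only on the ordering of the values. The helper names and the toy data are illustrative assumptions and are not taken from the study; the Kendall variant shown is the simple tau-a, which assumes no ties.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rank-order correlation: Pearson applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_a(x, y):
    """Kendall's tau-a (assumes no ties): (concordant - discordant) / pairs."""
    pairs = list(combinations(range(len(x)), 2))
    s = 0
    for i, j in pairs:
        prod = (x[i] - x[j]) * (y[i] - y[j])
        s += (prod > 0) - (prod < 0)
    return s / len(pairs)


if __name__ == "__main__":
    # Toy data: y grows exponentially in x, so the relation is monotone
    # but far from linear.
    x = [1, 2, 3, 4, 5, 6]
    y = [math.exp(v) for v in x]
    print("Pearson :", pearson(x, y))
    print("Spearman:", spearman(x, y))
    print("Kendall :", kendall_tau_a(x, y))
```

On such data the two rank-based coefficients report a perfect monotone association while Pearson's coefficient is noticeably lower, which is one reason a rank-based measure is a cautious default when the relationship between metrics is not known to be linear.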
We have our answer to RQ4.
The correlation results are consistent across different correlation analysis algorithms but may vary across different datasets or model families.
5.2. Explanation
In the following, we aim to interpret and ‘explain’ the above-mentioned results. These explanations must, however, be taken with a grain of salt, as they should be properly examined in the future.
First, the reason that existing coverage criteria are not correlated with robustness may simply be that these coverage criteria are too weak to differentiate robust and non-robust DNN models. It has been shown that high neuron coverage can easily be achieved with a small number of samples (DeepCover, ), and similar conclusions are drawn by Odena et al. (tensorfuzz, ) for the coverage criteria proposed in DeepGauge, such as neuron boundary coverage. This finding is confirmed by another recent work (ccmisleadingICSE19, ), which reports that adversarial examples are pervasively distributed in the space divided by coverage criteria. The same work (ccmisleadingICSE19, ) also suggests that using structural coverage to measure neural network robustness can be questionable.
Second, our results suggest that retraining with the test cases does not necessarily improve robustness. For software systems, a test case which reveals a bug naturally leads to bug fixing, which “definitely” improves the ‘robustness’ of the system. This is not certain for DNN models, because the retrained model could be rather different from the original model, i.e., it is like a new model, due to how such models are trained (i.e., through optimization techniques which embody a lot of non-determinism and carry little theoretical guarantee).
Third, we consider it intuitive that defense metrics are correlated with robustness, as these defense metrics are indeed less formal ways of measuring robustness (i.e., in terms of how well a DNN model defends against adversarial attacks).
As for the answer to RQ4, we view the consistency across different correlation analysis algorithms positively, as it shows that our results are not an artifact of a particular ‘biased’ correlation analysis algorithm. The second part of the answer may suggest that a testing method may have to be tailored to different DNN architectures.
5.3. Discussion
The results discussed so far are mostly negative, i.e., only several defense metrics are correlated with the improvement of model robustness, and existing testing methods designed around coverage have limited effectiveness in improving the robustness of DNN models. The results question the usefulness of the coverage criteria proposed for DNN models. Indeed, a DNN that is well tested (and improved by retraining) through existing testing methods might yield a new model with higher empirical accuracy on the testing set. However, the new model is not necessarily more robust than the original model against adversarial perturbations. In fact, a recent finding shows that DNN model robustness may be at odds with accuracy, since robust classifiers learn fundamentally different feature representations than standard classifiers (tsipras2018robustness, ). For DNN models to be deployed in safety-critical applications, we believe that robustness is as important a property as accuracy, if not more so. The real question thus remains: how should we test DNN models and make use of the testing results so that the robustness of the DNN models is improved? Or are there ways to improve the robustness of DNN models in general?
To this question, we do not have a clear answer, and it remains an open question to us. It is possible that there are other coverage criteria which are correlated with model robustness, or whose associated testing methods can help improve model robustness. It is however important that, no matter what coverage criterion is proposed, it is thoroughly analyzed to show its effect on model robustness.
Our view is that finding adversarial samples should not be the end of DNN testing. Rather, testing of DNN models should be designed with the model enhancement methods in mind, i.e., a testing method should produce test cases which are useful to the model enhancement methods. For instance, given the positive correlation between robustness and the defense metrics, we might want to generate test cases which contribute to improving defense metrics such as CAV and CCV.
5.4. Threats to validity
First, there may be threats to validity due to the selected datasets and model structures. In this work, we regard each DNN model as a program of the same functionality and calculate different metrics on these models. We assume the metrics are valid across different DNN model structures and conduct correlation analysis on the obtained metrics. However, some metrics are not applicable to certain model structures (e.g., MC/DC is not applicable to ResNet and GoogLeNet). Besides, since each model family has a limited number of models and datasets to analyse, the results may be biased towards these specific datasets and model structures, even though we adopt the most popular datasets and state-of-the-art model structures.
Second, there may be threats to validity due to the limited number of datasets, models and attack methods adopted. In this work, we use different DNN model structures, adversarial attack methods, models, datasets, and different metrics. While we are working on more datasets, model structures, etc., we could not significantly increase the scale due to the huge cost (more than GPU hours) of the empirical study. For more statistically significant results, more data points are helpful (or even necessary). We thus call upon the open source community to jointly upscale our study. To make sure that our correlation analysis results are valid, we only report results beyond a certain significance level, as measured by the p-value (pvalue, ).
Third, the evaluation of DNN model robustness in general is still an open and challenging research problem (zhang2018adversarial, ). Although we adopt the most popular robustness metrics, there might still be a threat to validity as to what extent these metrics actually reflect the robustness of the models.
6. Related Work
In this section, we review related works, with a focus on recent progress on 1) testing approaches which propose different testing criteria for DNN models, 2) different robustness metrics to evaluate the quality of DNN models, and 3) state-of-the-art adversarial attacks and defense methods.
Testing of deep learning models
Several recent papers proposed different coverage criteria for evaluating the effectiveness of a test set, along with different methods to generate test cases to improve the coverage criteria. For instance, DeepXplore (deepxplore, ) proposed the first testing criterion for DNN models, i.e., Neuron Coverage (NC), which calculates the percentage of activated neurons (w.r.t. an activation function) among all neurons. Later, DeepGauge (DeepGauge, ) extended the idea and proposed a series of more fine-grained multi-granularity testing criteria at both the neuron level and the layer level. Inspired by the MC/DC test criteria from traditional software testing, Sun et al. proposed four test criteria based on syntactic connections between neurons in adjacent layers and a concolic testing strategy to systematically improve the MC/DC coverage of DNN models (DeepConcolic, ). More recently, two surprise adequacy criteria (SurpriseAdequacy, ) were proposed to measure the level of ‘surprise’ of a new test case with respect to the training set, e.g., by measuring the distance between their activation vectors. Our work implemented and reviewed most of the above-mentioned coverage criteria for a comprehensive evaluation. Note that some are omitted as they are extremely costly to compute.
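As a minimal illustration of the neuron coverage idea described above, the sketch below counts a neuron as covered if its output exceeds a threshold for at least one test input, and reports the covered fraction. The function name, the flat list-of-lists input layout and the explicit `threshold` parameter are assumptions for illustration; published implementations may additionally scale activations per layer before thresholding.

```python
def neuron_coverage(activations, threshold):
    """Fraction of neurons activated by at least one test input.

    activations: one row per test input, one column per neuron
                 (hypothetical layout: all layers flattened into one row).
    threshold:   a neuron counts as activated when its output exceeds this.
    """
    if not activations or not activations[0]:
        raise ValueError("need at least one test input and one neuron")
    num_neurons = len(activations[0])
    covered = [False] * num_neurons
    for row in activations:
        if len(row) != num_neurons:
            raise ValueError("all rows must have one value per neuron")
        for idx, value in enumerate(row):
            if value > threshold:
                covered[idx] = True
    return sum(covered) / num_neurons
```

Because a neuron only needs to be activated once, coverage computed this way can only stay the same or grow as inputs are added, which is consistent with the observation elsewhere in the paper that high neuron coverage is reachable with few samples.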
Robustness of deep learning models
In the machine learning and formal verification communities, multiple metrics are used to measure the robustness of DNN models. The Lipschitz constant was proved to be useful as a metric for feedforward neural networks by Xu and Mannor (Robustness, ). Szegedy et al. (Intriguing, ) leveraged the product of the Lipschitz constants of each layer as a measure of DNN robustness, and Parseval Networks (Parseval, ) were proposed to achieve improved robustness by maintaining a small Lipschitz constant at every hidden layer. Adversarial manipulation, which looks at the required distortion of adversarial samples, is another direction. Hein and Andriushchenko gave a formal guarantee on the robustness of a classifier by obtaining a robustness lower bound using a local Lipschitz continuity condition (Formalguarantees, ). Recently, Weng et al. (Clever, ) extended their work and proposed a robustness metric called the CLEVER score, which is calculated using extreme value theory. Our work adopted one recent metric from each direction.
Attack and defense for deep learning models
There is a large body of work on adversarial attack and defense in recent years, of which we are only able to cover the most relevant. In particular, we adopted three state-of-the-art attacks to generate adversarial samples, i.e., a gradient-based approach (the FGSM method (fgsm, )), a saliency-map-based approach (JSMA (jsma, )), and an optimization-based approach (the C&W attack (cw, )). On the defense side, multiple approaches are available to obtain a relatively robust model in the training phase or to detect adversarial samples at runtime. For instance, adversarial training takes adversarial samples into consideration during training (scale, ). Another relevant direction is robust training, which tries to train a robust DNN model by considering all possible perturbations in the training phase (madry, ). Besides, mutation testing is adopted to detect adversarial samples at runtime (ouricse19, ). Essentially, testing is complementary to these defense works.
7. Conclusion
In this work, we conducted a systematic and quantitative empirical study on state-of-the-art DNN models to investigate the relevance and effectiveness of recently proposed testing criteria and approaches for deep neural networks. Our study is based on a self-contained toolkit which implements all the testing coverage criteria, two robustness metrics and a large set of measurable metrics along the adversarial attack and defense pipeline. Our results, obtained from correlation analysis on all these metrics from different perspectives, suggest that existing testing coverage criteria have limited correlation with the robustness (or the improvement of the robustness) of DNN models. Furthermore, we provide potential directions to improve DNN testing in general through correlation analysis of robustness metrics and other kinds of metrics.
While our results are mostly negative, we believe it is important that future proposed testing criteria and methods undergo similar evaluation so as to provide evidence of their relevance. Our models, adversarial samples, and programs for calculating the metrics are publicly available and can be used as a benchmark for evaluating future research in this direction.
References
 [1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. Tensorflow: A system for large-scale machine learning. In 12th Symposium on Operating Systems Design and Implementation, pages 265–283, 2016.
 [2] J B. Stroud. Fundamental statistics in psychology and education. Journal of Educational Psychology, 42:318, 05 1951.
 [3] Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, et al. End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316, 2016.
 [4] Chad Brubaker, Suman Jana, Baishakhi Ray, Sarfraz Khurshid, and Vitaly Shmatikov. Using frankencerts for automated adversarial testing of certificate validation in SSL/TLS implementations. In IEEE Symposium on Security and Privacy, pages 114–129, 2014.
 [5] Nicholas Carlini and David A. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pages 39–57, 2017.
 [6] Moustapha Cissé, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In Proceedings of the 34th International Conference on Machine Learning, pages 854–863, 2017.
 [7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2019.
 [8] Mahyar Fazlyab, Alexander Robey, Hamed Hassani, Manfred Morari, and George J Pappas. Efficient and accurate estimation of lipschitz constants for deep neural networks. arXiv preprint arXiv:1906.04893, 2019.
 [9] Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri, and Martin Vechev. AI2: Safety and robustness certification of neural networks with abstract interpretation. In IEEE Symposium on Security and Privacy, 2018.
 [10] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations, 2015.
 [11] Kelly Hayhurst, Dan Veerhusen, John Chilenski, and Leanna Rierson. A practical tutorial on modified condition/decision coverage. Technical report, NASA, 2001.

 [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [13] Matthias Hein and Maksym Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. In Advances in Neural Information Processing Systems, pages 2266–2276, 2017.
 [14] Laura Inozemtseva and Reid Holmes. Coverage is not strongly correlated with test suite effectiveness. In 36th International Conference on Software Engineering, pages 435–445, 2014.
 [15] Guy Katz, Clark Barrett, David L Dill, Kyle Julian, and Mykel J Kochenderfer. Reluplex: An efficient smt solver for verifying deep neural networks. In International Conference on Computer Aided Verification, pages 97–117. Springer, 2017.
 [16] Maurice G Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.
 [17] Jinhan Kim, Robert Feldt, and Shin Yoo. Guiding deep learning system testing using surprise adequacy. In Proceedings of the 41st International Conference on Software Engineering, 2019.
 [18] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 [19] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. In 5th International Conference on Learning Representations, 2017.
 [20] Alexey Kurakin, Ian J. Goodfellow, and Samy Bengio. Adversarial machine learning at scale. In 5th International Conference on Learning Representations, 2017.

 [21] Yann LeCun. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
 [22] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [23] Zenan Li, Xiaoxing Ma, Chang Xu, and Chun Cao. Structural coverage criteria for neural networks could be misleading. In Proceedings of the 41st International Conference on Software Engineering: New Ideas and Emerging Results, pages 89–92. IEEE Press, 2019.
 [24] Xiang Ling, Shouling Ji, Jiaxu Zou, Jiannan Wang, Chunming Wu, Bo Li, and Ting Wang. Deepsec: A uniform platform for security analysis of deep learning model. In IEEE Symposium on Security and Privacy, 2019.

 [25] Bo Luo, Yannan Liu, Lingxiao Wei, and Qiang Xu. Towards imperceptible and robust adversarial example attacks against neural networks. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, pages 1652–1659, 2018.
 [26] Lei Ma, Felix Juefei-Xu, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Chunyang Chen, Ting Su, Li Li, Yang Liu, et al. Deepgauge: Multi-granularity testing criteria for deep learning systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, pages 120–131. ACM, 2018.
 [27] Lei Ma, Fuyuan Zhang, Jiyuan Sun, Minhui Xue, Bo Li, Felix Juefei-Xu, Chao Xie, Li Li, Yang Liu, Jianjun Zhao, et al. Deepmutation: Mutation testing of deep learning systems. In 29th International Symposium on Software Reliability Engineering, pages 100–111, 2018.
 [28] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In 6th International Conference on Learning Representations, 2018.
 [29] William M. McKeeman. Differential testing for software. Digital Technical Journal, 10(1):100–107, 1998.
 [30] Augustus Odena and Ian Goodfellow. Tensorfuzz: Debugging neural networks with coverage-guided fuzzing. arXiv preprint arXiv:1807.10875, 2018.
 [31] Nicolas Papernot, Fartash Faghri, Nicholas Carlini, Ian Goodfellow, Reuben Feinman, Alexey Kurakin, Cihang Xie, Yash Sharma, Tom Brown, and Aurko Roy. Technical report on the cleverhans v2.1.0 adversarial examples library. arXiv preprint arXiv:1610.00768, 2016.
 [32] Nicolas Papernot, Patrick D. McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In European Symposium on Security and Privacy, pages 372–387, 2016.
 [33] Karl Pearson. Notes on the history of correlation. Biometrika, 13(1):25–45, 1920.
 [34] Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. Deepxplore: Automated whitebox testing of deep learning systems. In Proceedings of the 26th Symposium on Operating Systems Principles, pages 1–18. ACM, 2017.
 [35] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
 [36] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, 2015.
 [37] Charles Spearman. “General intelligence,” objectively determined and measured. The American Journal of Psychology, 15(2):201–292, 1904.
 [38] Youcheng Sun, Xiaowei Huang, and Daniel Kroening. Testing deep neural networks. arXiv preprint arXiv:1803.04792, 2018.
 [39] Youcheng Sun, Min Wu, Wenjie Ruan, Xiaowei Huang, Marta Kwiatkowska, and Daniel Kroening. Concolic testing for deep neural networks. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, pages 109–119, 2018.
 [40] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
 [41] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In 2nd International Conference on Learning Representations, 2014.
 [42] Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. Deeptest: Automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th international conference on software engineering, pages 303–314, 2018.
 [43] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152, 2018.
 [44] Aladin Virmaux and Kevin Scaman. Lipschitz regularity of deep neural networks: analysis and efficient estimation. In Advances in Neural Information Processing Systems, pages 3835–3844, 2018.
 [45] Jingyi Wang, Guoliang Dong, Jun Sun, Xinyu Wang, and Peixin Zhang. Adversarial sample detection for deep neural network through model mutation testing. In Proceedings of the 41st International Conference on Software Engineering, 2019.
 [46] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
 [47] Ronald L Wasserstein, Nicole A Lazar, et al. The asa’s statement on p-values: Context, process, and purpose. The American Statistician, 70(2):129–133, 2016.
 [48] Tsui-Wei Weng, Huan Zhang, Pin-Yu Chen, Jinfeng Yi, Dong Su, Yupeng Gao, Cho-Jui Hsieh, and Luca Daniel. Evaluating the robustness of neural networks: An extreme value theory approach. In 6th International Conference on Learning Representations, 2018.
 [49] Huan Xu and Shie Mannor. Robustness and generalization. Machine learning, 86(3):391–423, 2012.
 [50] Zhenlong Yuan, Yongqiang Lu, Zhaoguo Wang, and Yibo Xue. Droidsec: Deep learning in android malware detection. In Conference of the ACM Special Interest Group on Data Communication, pages 371–372, 2014.
 [51] Jiliang Zhang and Xiaoxiong Jiang. Adversarial examples: Opportunities and challenges. arXiv preprint arXiv:1809.04790, 2018.