I Introduction
DEEP learning models have achieved remarkable performance across a wide range of applications [21, 5, 20]; however, they are susceptible to adversarial examples [47]. These elaborately designed perturbations are imperceptible to humans but can easily lead DNNs to wrong predictions, threatening both digital and physical deep learning applications [22, 28, 27].
Since deep learning has been integrated into various security-sensitive applications (e.g., autonomous driving, healthcare, etc.), the safety problem caused by adversarial examples has attracted extensive attention from the perspectives of both adversarial attack (generating adversarial examples that mislead DNNs) and defense (building models that are robust to adversarial examples) [17, 41, 22, 14, 4, 30, 34, 26, 7, 56, 10, 50, 24]. To improve model robustness against adversarial examples, a long line of adversarial defense methods has been proposed, e.g., defensive distillation [42] and input transformation [53]. However, most current adversarial defenses conduct incomplete evaluations, which are far from providing a comprehensive understanding of the limitations of these defenses. Thus, these defenses are quickly shown to be successfully attacked, resulting in the "arms race" phenomenon between attack and defense [37, 38, 39, 3]. For example, by evaluating only on simple white-box attacks, most adversarial defenses pose a false sense of robustness by introducing gradient masking, which can be easily circumvented and defeated [3]. Therefore, it is of great significance, and also a great challenge, to conduct rigorous and extensive evaluations of adversarial robustness, both for navigating the research field and for facilitating trustworthy deep learning in practice.

To rigorously evaluate the adversarial robustness of DNNs, a number of works have been proposed [8, 58]. However, most of these works focus on providing practical advice or benchmarks for model robustness evaluation, ignoring the significance of evaluation metrics. By adopting simple evaluation metrics (e.g., attack success rate, classification accuracy), most current studies can only use model outputs to conduct incomplete evaluations. For instance, the classification accuracy against an attack under a specific perturbation magnitude is the primary and most commonly reported evaluation metric, which is far from satisfactory for measuring a model's intrinsic behavior in the adversarial setting. Therefore, such incomplete evaluations cannot provide a comprehensive understanding of the strengths and limitations of these defenses.
In this work, with the hope of facilitating future research, we establish a model robustness evaluation framework containing a comprehensive, rigorous, and coherent set of evaluation metrics. These metrics can fully evaluate model robustness and provide deep insights into building robust models. This paper focuses on the robustness of deep learning models on the most commonly studied image classification tasks, with respect to norm-bounded adversaries and several other corruptions. As illustrated in Figure 1, our evaluation framework can be roughly divided into two parts: data-oriented and model-oriented, which focus on the two key factors of adversarial learning (i.e., data and model). Since model robustness is evaluated based on a set of perturbed examples, we first use data-oriented metrics regarding neuron coverage and data imperceptibility to measure the integrity of the test examples (i.e., whether the conducted evaluation covers most of the neurons within a model); meanwhile, we evaluate model robustness via model-oriented metrics that consider both model structures and behaviors in the adversarial setting (e.g., decision boundary, model neurons, corruption performance, etc.). Our framework contains 23 evaluation metrics in total.
To fully demonstrate the effectiveness of the evaluation framework, we conduct large-scale experiments on multiple datasets (i.e., CIFAR-10 and SVHN) using different models with different adversarial defense strategies. From the experimental results, we conclude that: (1) though showing high performance on some simple and intuitive metrics such as adversarial accuracy, some defenses are weak on more rigorous and insightful metrics; (2) besides norm-bounded adversarial examples, more diversified attacks (e.g., corruption attacks) should be performed to conduct comprehensive evaluations; (3) apart from model robustness evaluation, the proposed metrics shed light on model robustness and are also beneficial to the design of adversarial attacks and defenses. All evaluation experiments are conducted on our new adversarial robustness evaluation platform, referred to as AISafety, which fully supports our comprehensive evaluation. We hope our platform will help follow-up researchers better understand adversarial examples and further improve model robustness.
Our contributions can be summarized as follows:

We establish a comprehensive evaluation framework for model robustness containing 23 metrics, which can fully evaluate model robustness and provide deep insights into building robust models;

Based on our framework, we provide an open-sourced platform named AISafety, which supports continuous integration of user-specific algorithms and language-independent models;

We conduct large-scale experiments using AISafety and provide preliminary suggestions for the evaluation of model robustness as well as the design of adversarial attacks/defenses in the future.
The rest of the paper is organized as follows: Section II introduces the related work; Section III defines and details our evaluation metrics; Section IV demonstrates the experiments; Section V provides additional discussions and suggestions; Section VI introduces our open-sourced platform; and Section VII summarizes the contributions and concludes the paper.
II Related Work
In this section, we provide a brief overview of existing works on adversarial attacks and defenses, as well as adversarial robustness evaluation.
II-A Adversarial attacks and defenses
Adversarial examples are inputs intentionally designed to mislead DNNs [47, 17]. Given a DNN $f$ and an input image $x$ with the ground truth label $y$, an adversarial example $x_{adv}$ satisfies

$$f(x_{adv}) \neq y, \quad \text{s.t.} \quad D(x, x_{adv}) \leq \epsilon,$$

where $D(\cdot, \cdot)$ is a distance metric. Commonly, $D$ is measured by the $\ell_p$ norm ($p \in \{1, 2, \infty\}$).
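As a concrete illustration, the $\ell_p$ constraint above can be checked with a few lines of code; the helper names below are our own and not part of any attack library:

```python
import math

def lp_distance(x, x_adv, p):
    """l_p distance between a clean example and its adversarial counterpart.

    x, x_adv: flat lists of pixel values; p: 1, 2, or math.inf.
    """
    diffs = [abs(a - b) for a, b in zip(x, x_adv)]
    if p == math.inf:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)

def is_valid_adversarial(x, x_adv, pred_adv, y_true, p, eps):
    """An adversarial example must flip the prediction while staying
    within the l_p ball of radius eps around the clean input."""
    return pred_adv != y_true and lp_distance(x, x_adv, p) <= eps
```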
In the past years, great efforts have been devoted to generating adversarial examples in different scenarios and tasks [17, 38, 41, 14, 57, 27, 30]. Adversarial attacks can be divided into two types: white-box attacks, in which adversaries have complete knowledge of the target model and can fully access it; and black-box attacks, in which adversaries have limited knowledge of the target classifier and cannot directly access it. Specifically, most white-box attacks craft adversarial examples based on the input gradient, e.g., the fast gradient sign method (FGSM) [17], the projected gradient descent method (PGD) [34], the Carlini & Wagner method (C&W) [38], DeepFool [45], etc. Black-box methods can be roughly divided into transfer-based attacks [14], score-based attacks [9, 25, 51], and decision-based attacks [6].

Meanwhile, to improve model robustness against adversarial examples, various defense approaches have been proposed, including defensive distillation [42], input transformation [53, 15, 26], robust training [34, 10], and certified defense [12, 2, 1]. Among adversarial defenses, adversarial training has been widely studied and demonstrated to be the most effective [17, 34]. Specifically, adversarial training minimizes the worst-case loss within some perturbation region for classifiers, by augmenting the training set with adversarial examples as follows:

$$\min_{\theta} \mathbb{E}_{(x, y) \sim \mathcal{D}} \left[ \max_{\delta \in \mathcal{S}} \mathcal{L}(f_{\theta}(x + \delta), y) \right],$$
where the perturbation $\delta$ is bounded within the set $\mathcal{S} = \{\delta : \|\delta\|_p \leq \epsilon\}$ with radius $\epsilon$, and $\mathcal{L}$ represents the loss function.
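The inner maximization of adversarial training is typically approximated with PGD. The following is a minimal sketch for a toy one-dimensional linear classifier with logistic loss, where the input gradient is available in closed form and no autodiff is needed; function and parameter names are illustrative, not the paper's implementation:

```python
import math

def pgd_linear(x, y, w, b, eps=0.03, alpha=0.01, steps=10):
    """PGD inner maximization for a toy linear classifier score = w*x + b
    with logistic loss. y is +1 or -1; the l_inf ball of radius eps
    around x is enforced by clipping after every step."""
    x_adv = x
    for _ in range(steps):
        score = w * x_adv + b
        # d/dx of log(1 + exp(-y*score)) = -y * w * sigmoid(-y*score)
        grad = -y * w / (1.0 + math.exp(y * score))
        x_adv = x_adv + alpha * (1 if grad > 0 else -1)  # ascend the loss
        x_adv = min(max(x_adv, x - eps), x + eps)        # project to the ball
    return x_adv
```

With `w > 0` and `y = +1`, the loss grows as the input decreases, so the sketch walks the input to the lower edge of the $\ell_\infty$ ball.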
Besides, corruptions such as snow and blur frequently occur in the real world, which also presents critical challenges for building robust deep learning models. Suppose we have a set of corruption functions $C$ in which each $c \in C$ performs a different kind of corruption. The average-case model performance on small, general, classifier-agnostic corruptions can then be used to define model corruption robustness as follows:

$$\mathbb{E}_{c \sim C} \left[ \mathbb{P}_{(x, y) \sim \mathcal{D}} \left( f(c(x)) = y \right) \right].$$
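This average-case definition can be estimated empirically by averaging accuracy over the corruption functions; a minimal sketch with illustrative names:

```python
def corruption_robustness(model, corruptions, dataset):
    """Average-case accuracy of `model` under a set of corruption functions.

    model: callable x -> predicted label
    corruptions: list of callables x -> corrupted x
    dataset: list of (x, y) pairs
    """
    total = correct = 0
    for c in corruptions:
        for x, y in dataset:
            total += 1
            correct += int(model(c(x)) == y)
    return correct / total
```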
A concerning fact is that most proposed defenses conduct incomplete or incorrect evaluations and are quickly shown to be successfully attacked, due to the limited understanding of these defenses [37, 38, 39, 3]. Consequently, conducting rigorous and comprehensive evaluations of model robustness becomes particularly important.
II-B Model robustness evaluation
To comprehensively evaluate the robustness of DNNs, a number of works have been proposed [8, 52, 33, 58]. DEEPSEC [52], a uniform platform for adversarial robustness analysis, was proposed to measure the vulnerability of deep learning models. Specifically, the platform incorporates 16 adversarial attacks with 10 attack utility metrics, and 13 adversarial defenses with 5 defensive utility metrics. Unlike prior works, [8] discussed the methodological foundations, reviewed commonly accepted best practices, and suggested new methods for evaluating defenses against adversarial examples. In particular, they provided principles for performing defense evaluations and a specific checklist for avoiding common evaluation pitfalls. Moreover, [33] proposed a set of multi-granularity metrics for deep learning systems, aiming to render a multi-faceted portrayal of the testbed (i.e., testing coverage). More recently, [58] established a comprehensive benchmark to evaluate adversarial robustness on image classification tasks, incorporating 15 attack methods, 16 defense methods, and 2 evaluation metrics.
However, these studies mainly focus on establishing open-source libraries of adversarial attacks and defenses, and fail to provide a comprehensive evaluation that considers the several aspects of a deep learning model against different types of noise.

TABLE I: Summary of the evaluation metrics in our framework. Metrics are grouped by orientation (data vs. model); the original table additionally marks each metric's applicability to model behaviors vs. structures and to white-box vs. black-box settings.

Data-oriented: KMNCov [33], NBCov [33], SNACov [33], ALD [52], ASS [59], PSD [32]

Model-oriented: CA, AAW, AAB, ACAC [13], ACTC [13], NTE [32], mCE [19], RmCE [19], mFR [19], CAV [52], CRR/CSR [52], CCV [52], COS [52], EBD [29], EBD2, ENI [29], Neuron Sensitivity [61], Neuron Uncertainty
III Evaluation Metrics
To mitigate the problems caused by incomplete evaluation, we establish a multi-view model robustness evaluation framework consisting of 23 evaluation metrics in total. As shown in Table I, our evaluation metrics can be roughly divided into two parts: data-oriented and model-oriented. We illustrate them in this section.
III-A Data-Oriented Evaluation Metrics
Since model robustness is evaluated based on a set of perturbed examples, the quality of the test data plays a critical role in robustness evaluation. Thus, we use data-oriented metrics considering both neuron coverage and data imperceptibility to measure the integrity of the test examples.
In traditional software engineering, researchers design and seek a series of representative test data from the whole large input space to detect software bugs. Test adequacy (often quantified by coverage criteria) is a key factor in measuring whether the software has been comprehensively tested [36]. Inspired by this, DeepGauge [33] introduced coverage criteria into neural networks and proposed Neuron Coverage, which leverages the output values of neurons and their corresponding boundaries obtained from training data to approximate the major function region and the corner-case region at the neuron level.
III-A1 Neuron Coverage
We first use coverage criteria for DNNs to measure whether the generated test set (e.g., adversarial examples) covers a sufficient number of neurons within a model.
$k$-Multisection Neuron Coverage (KMNCov). Given a neuron $n$, KMNCov measures how thoroughly a given set of test inputs $T$ covers the range of neuron output values $[low_n, high_n]$, estimated from the training data. Specifically, we divide the range $[low_n, high_n]$ into $k$ sections of equal size ($k > 0$), and $S_i^n$ denotes the $i$-th section, $1 \leq i \leq k$. Let $\phi(x, n)$ denote a function that returns the output of neuron $n$ under a given input sample $x$. We use $\phi(x, n) \in S_i^n$ to denote that the $i$-th section of neuron $n$ is covered by the input $x$. For a given test set $T$, KMNCov is defined as the ratio of the sections covered by $T$ to the overall number of sections:

$$\text{KMNCov}(T, k) = \frac{\sum_{n \in N} \left| \{ S_i^n \mid \exists x \in T : \phi(x, n) \in S_i^n \} \right|}{k \times |N|}, \quad (1)$$

where $N$ is the set of neurons of the model. Note that for a neuron $n$ and input $x$, if $\phi(x, n) \in [low_n, high_n]$, we say the DNN is located in its major function region; otherwise, it is located in the corner-case region.
Neuron Boundary Coverage (NBCov). NBCov measures how many corner-case regions have been covered by the given test input set $T$. Given an input $x$, a DNN is located in its corner-case region when $\phi(x, n) \in (-\infty, low_n) \cup (high_n, +\infty)$. Thus, NBCov is defined as the ratio of the covered corner cases to the total number of corner cases ($2 \times |N|$):

$$\text{NBCov}(T) = \frac{|\text{UpperCornerNeuron}| + |\text{LowerCornerNeuron}|}{2 \times |N|}, \quad (2)$$

where $\text{UpperCornerNeuron} = \{ n \in N \mid \exists x \in T : \phi(x, n) \in (high_n, +\infty) \}$ is the set of neurons pushed above their training-time range, and $\text{LowerCornerNeuron} = \{ n \in N \mid \exists x \in T : \phi(x, n) \in (-\infty, low_n) \}$ is the set of neurons pushed below it.
Strong Neuron Activation Coverage (SNACov). This metric measures the coverage of upper-corner cases (i.e., how many upper-corner cases have been covered by the given test set). It is the ratio of the covered upper-corner cases to the total number of upper-corner cases ($|N|$):

$$\text{SNACov}(T) = \frac{|\text{UpperCornerNeuron}|}{|N|}. \quad (3)$$
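The three coverage criteria can be computed jointly from per-neuron activation traces; the sketch below assumes the activations and the training-time ranges have already been collected, and all names are illustrative:

```python
def coverage_metrics(acts, ranges, k=10):
    """KMNCov, NBCov, and SNACov for a set of test inputs.

    acts:   dict neuron -> list of activation values over the test set
    ranges: dict neuron -> (low, high) activation range from training data
            (assumed low < high)
    k:      number of equal sections per neuron (for KMNCov)
    """
    covered_sections = 0
    upper = lower = 0
    for n, values in acts.items():
        low, high = ranges[n]
        width = (high - low) / k
        hit = set()
        n_upper = n_lower = False
        for v in values:
            if v > high:
                n_upper = True          # upper-corner case
            elif v < low:
                n_lower = True          # lower-corner case
            else:
                hit.add(min(int((v - low) / width), k - 1))
        covered_sections += len(hit)
        upper += int(n_upper)
        lower += int(n_lower)
    n_neurons = len(acts)
    kmncov = covered_sections / (k * n_neurons)
    nbcov = (upper + lower) / (2 * n_neurons)
    snacov = upper / n_neurons
    return kmncov, nbcov, snacov
```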
III-A2 Data Imperceptibility
In the adversarial learning literature, the visual imperceptibility of the generated perturbation is one of the key factors that influence robustness evaluation. Thus, we introduce several metrics to evaluate the visual imperceptibility of data by considering the magnitude of perturbations.
Average $\ell_p$ Distortion (ALD). Most adversarial attacks generate adversarial examples by constructing additive $\ell_p$-norm-bounded perturbations (e.g., $p \in \{1, 2, \infty\}$). To measure the visual perceptibility of generated adversarial examples, we use ALD, the average normalized $\ell_p$ distortion:

$$\text{ALD} = \frac{1}{N} \sum_{i=1}^{N} \frac{\| x_{adv}^i - x^i \|_p}{\| x^i \|_p}, \quad (4)$$

where $N$ denotes the number of adversarial examples. The smaller the ALD, the more imperceptible the adversarial examples.
Average Structural Similarity (ASS). To evaluate the imperceptibility of adversarial examples, we further use SSIM, which is considered effective for measuring human visual perception. ASS is defined as the average SSIM similarity between all successful adversarial examples and the corresponding clean examples, i.e.,

$$\text{ASS} = \frac{1}{N} \sum_{i=1}^{N} \text{SSIM}(x_{adv}^i, x^i), \quad (5)$$

where $N$ denotes the number of successful adversarial examples. The higher the ASS, the more imperceptible the adversarial examples.
Perturbation Sensitivity Distance (PSD). Based on the contrast masking theory [23, 31], PSD is proposed to evaluate human perception of perturbations:

$$\text{PSD} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{m} \delta_{i,j} \, \text{Sen}(R(x_{i,j})), \quad (6)$$

where $m$ is the total number of pixels, $x_{i,j}$ represents the $j$-th pixel of the $i$-th example, $\delta_{i,j}$ is the corresponding perturbation, $R(x_{i,j})$ stands for the square region surrounding $x_{i,j}$, and $\text{Sen}(R(x_{i,j})) = 1 / \text{std}(R(x_{i,j}))$. Evidently, the smaller the PSD, the more imperceptible the adversarial examples.
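For illustration, ALD is straightforward to compute, and a single-window variant of SSIM, a simplification of the full sliding-window index used for ASS, can be written as follows (names and constants are illustrative):

```python
import math

def ald(clean, adv, p=2):
    """Average l_p distortion, normalized by the norm of the clean input.

    clean, adv: lists of flat pixel-value lists, paired by index."""
    def norm(v):
        if p == math.inf:
            return max(abs(a) for a in v)
        return sum(abs(a) ** p for a in v) ** (1.0 / p)
    return sum(norm([a - c for a, c in zip(x, xa)]) / norm(x)
               for x, xa in zip(clean, adv)) / len(clean)

def global_ssim(x, y, c1=1e-4, c2=9e-4):
    """Single-window SSIM over the whole image (no sliding window):
    a simplified stand-in for the full SSIM index."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((a - my) ** 2 for a in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
```

An identical pair scores an SSIM of 1 and an ALD of 0, matching the intuition that zero perturbation is perfectly imperceptible.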
III-B Model-Oriented Evaluation Metrics
To evaluate model robustness, the most intuitive direction is to measure model performance in the adversarial setting. Given an adversary $\mathcal{A}$, it uses a specific attack method to generate an adversarial example $x_{adv} = \mathcal{A}(x)$ for a clean example $x$ within the perturbation magnitude $\epsilon$ under the $\ell_p$ norm.
In particular, we aim to analyze and evaluate model robustness from both dynamic and static views (i.e., model behaviors and structures). By inspecting model outputs towards noises, we can directly measure model robustness through studying the behaviors; meanwhile, by investigating model structures, we can provide more detailed insights into model robustness.
III-B1 Model Behaviors
We first summarize evaluation metrics in terms of model behaviors as follows.

Task Performance
Clean Accuracy (CA). Model accuracy on clean examples is one of the most important properties in the adversarial setting: a classifier achieving high accuracy against adversarial examples but low accuracy on clean examples still cannot be employed in practice. CA is defined as the percentage of clean examples that are correctly classified by the classifier into the ground truth classes. Formally,

$$\text{CA} = \frac{1}{|T|} \sum_{(x, y) \in T} \mathbb{1}\left( f(x) = y \right), \quad (7)$$

where $T$ is the test set and $\mathbb{1}(\cdot)$ is the indicator function.

Adversarial Performance
Adversarial Accuracy on White-box Attacks (AAW). In the untargeted attack scenario, AAW is defined as the percentage of adversarial examples generated in the white-box setting that are still classified into the ground truth class; for targeted attacks, robustness can analogously be measured by the percentage of white-box adversarial examples that are not classified into the adversary-specified target class. In the rest of the paper, we mainly focus on untargeted attacks. Thus, AAW can be defined as:

$$\text{AAW} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left( f(x_{adv}^i) = y_i \right). \quad (8)$$
Adversarial Accuracy on Black-box Attacks (AAB). Similar to AAW, AAB is defined as the percentage of adversarial examples classified correctly by the classifier; in contrast, the adversarial examples are generated by black-box or gradient-free attacks.
Average Confidence of Adversarial Class (ACAC). Besides model prediction accuracy, the prediction confidence on adversarial examples gives further indications of model robustness. ACAC is defined as the average prediction confidence towards the incorrect (adversarial) class over successful adversarial examples:

$$\text{ACAC} = \frac{1}{n} \sum_{i=1}^{n} P(x_{adv}^i)_{\hat{y}_i}, \quad (9)$$

where $n$ is the number of adversarial examples that attack successfully, $\hat{y}_i = f(x_{adv}^i)$ is the misclassified label, and $P(x)_j$ denotes the prediction confidence of the classifier towards class $j$.
Average Confidence of True Class (ACTC). In addition to ACAC, we use ACTC to further evaluate to what extent the attacks escape from the ground truth. In other words, ACTC is the average prediction confidence on successful adversarial examples towards the ground truth labels, i.e.,

$$\text{ACTC} = \frac{1}{n} \sum_{i=1}^{n} P(x_{adv}^i)_{y_i}. \quad (10)$$
Noise Tolerance Estimation (NTE). Moreover, given the generated adversarial examples, we further calculate the gap between the probability of the misclassified class and the maximum probability of all other classes as follows:

$$\text{NTE} = \frac{1}{n} \sum_{i=1}^{n} \left[ P(x_{adv}^i)_{\hat{y}_i} - \max_{j \neq \hat{y}_i} P(x_{adv}^i)_j \right], \quad (11)$$

where $\hat{y}_i = f(x_{adv}^i)$ and $n$ is the number of successful adversarial examples.
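ACAC, ACTC, and NTE can all be computed in one pass over the softmax outputs of the attacked examples; a minimal sketch with illustrative names:

```python
def confidence_metrics(probs, labels):
    """ACAC, ACTC, and NTE over successfully attacked examples.

    probs:  list of softmax vectors for adversarial examples
    labels: ground-truth class indices
    Only misclassified examples (successful attacks) are counted."""
    acac = actc = nte = 0.0
    n = 0
    for p, y in zip(probs, labels):
        pred = max(range(len(p)), key=lambda j: p[j])
        if pred == y:          # attack failed, skip this example
            continue
        n += 1
        acac += p[pred]        # confidence of the adversarial class
        actc += p[y]           # confidence of the true class
        runner_up = max(p[j] for j in range(len(p)) if j != pred)
        nte += p[pred] - runner_up
    return acac / n, actc / n, nte / n
```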

Corruption Performance
To further comprehensively measure model robustness against different corruptions, we introduce evaluation metrics following [19].
mCE. This metric denotes the mean corruption error of a model [19]. Different from the original paper, which normalizes errors by a baseline model, we simply calculate the error rate of the classifier $f$ on each corruption type $c$ at each severity level $s$, denoted $E_{s,c}^f$, and compute the Corruption Error (CE) and mCE as follows:

$$\text{CE}_c^f = \frac{1}{S} \sum_{s=1}^{S} E_{s,c}^f, \qquad \text{mCE} = \frac{1}{|C|} \sum_{c \in C} \text{CE}_c^f, \quad (12)$$

where $S$ denotes the number of severity levels. Thus, mCE is the average of the Corruption Errors over the different corruptions.
Relative mCE. A more nuanced corruption robustness measure is the relative mCE (RmCE) [19]. If a classifier withstands most corruptions, the gap between its corruption error and its clean-data error is minuscule. RmCE is calculated as:

$$\text{RmCE} = \frac{1}{|C|} \sum_{c \in C} \frac{1}{S} \sum_{s=1}^{S} \left( E_{s,c}^f - E_{clean}^f \right), \quad (13)$$

where $E_{clean}^f$ is the error rate of $f$ on clean examples.
mFR. Hendrycks et al. [19] introduced mFR to capture the classification differences between two adjacent frames in a noise sequence for a specific image. Denote the $m$ noise sequences by $\mathcal{S} = \{ (x_1^{(i)}, \ldots, x_n^{(i)}) \}_{i=1}^{m}$, where each sequence is created with a specific noise type $p$. The Flip Probability of network $f$ is

$$\text{FP}_p^f = \frac{1}{m(n-1)} \sum_{i=1}^{m} \sum_{j=2}^{n} \mathbb{1}\left( f(x_j^{(i)}) \neq f(x_{j-1}^{(i)}) \right). \quad (14)$$

The Flip Rate (FR) for each noise type is then obtained from the corresponding Flip Probability, and mFR is the average FR over all noise types.
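Given per-severity error rates, mCE and RmCE as defined above reduce to simple averages; an illustrative sketch (names are ours):

```python
def mce_rmce(err, clean_err):
    """mCE and relative mCE without baseline normalization.

    err:       dict corruption -> list of error rates, one per severity level
    clean_err: error rate on clean data
    """
    # CE per corruption: mean error over severity levels
    ce = {c: sum(e) / len(e) for c, e in err.items()}
    mce = sum(ce.values()) / len(ce)
    # RmCE: mean excess error over the clean baseline
    rmce = sum(sum(x - clean_err for x in e) / len(e)
               for e in err.values()) / len(err)
    return mce, rmce
```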

Defense Performance
In addition to the basic metrics, we further explore to what extent model performance is influenced when defense strategies are applied to the model.
CAV. Classification Accuracy Variance (CAV) evaluates the impact of defenses on accuracy. We expect the defense-enhanced model $F_d$ to maintain the classification accuracy of the original model $F$ on normal test examples as much as possible. It is defined as follows:

$$\text{CAV} = \text{Acc}(F_d, T) - \text{Acc}(F, T), \quad (15)$$

where $\text{Acc}(F, T)$ denotes the accuracy of model $F$ on dataset $T$.
CRR/CSR. CRR is the percentage of test examples that are misclassified by $F$ but correctly classified by $F_d$; inversely, CSR is the percentage of test examples that are correctly classified by $F$ but misclassified by $F_d$. They are defined as follows:

$$\text{CRR} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left( F(x_i) \neq y_i \wedge F_d(x_i) = y_i \right), \quad (16)$$

$$\text{CSR} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left( F(x_i) = y_i \wedge F_d(x_i) \neq y_i \right), \quad (17)$$

where $N$ is the number of test examples.
CCV. Defense strategies may not negatively influence accuracy, yet the prediction confidence of correctly classified examples may decrease. Classification Confidence Variance (CCV) measures the confidence variance induced by robust models:

$$\text{CCV} = \frac{1}{n} \sum_{i=1}^{n} \left| P(x_i)_{y_i} - P_d(x_i)_{y_i} \right|, \quad (18)$$

where $P(x_i)_{y_i}$ and $P_d(x_i)_{y_i}$ denote the prediction confidence of models $F$ and $F_d$ towards the ground truth class $y_i$, and $n$ is the number of examples correctly classified by both $F$ and $F_d$.
COS. Classification Output Stability (COS) uses the JS divergence to measure the similarity of classification outputs between the original model and the robust model. It averages the JS divergence over all correctly classified test examples:

$$\text{COS} = \frac{1}{n} \sum_{i=1}^{n} \text{JSD}\left( P(x_i) \,\|\, P_d(x_i) \right), \quad (19)$$

where $P(x_i)$ and $P_d(x_i)$ denote the prediction distributions of $F$ and $F_d$ on $x_i$, respectively, and $n$ is the number of examples correctly classified by both $F$ and $F_d$.
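A sketch of the defense-comparison metrics: CAV, CRR, and CSR from hard predictions, plus the JS divergence used by COS (function names are illustrative):

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two probability vectors."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(u, v):
        return sum(a * math.log(a / b) for a, b in zip(u, v) if a > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def defense_metrics(preds_f, preds_fd, labels):
    """CAV, CRR, and CSR from hard predictions of the original model F
    and the defense-enhanced model F_d on the same test set."""
    n = len(labels)
    acc_f = sum(p == y for p, y in zip(preds_f, labels)) / n
    acc_fd = sum(p == y for p, y in zip(preds_fd, labels)) / n
    crr = sum(pf != y and pd == y
              for pf, pd, y in zip(preds_f, preds_fd, labels)) / n
    csr = sum(pf == y and pd != y
              for pf, pd, y in zip(preds_f, preds_fd, labels)) / n
    return acc_fd - acc_f, crr, csr
```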
III-B2 Model Structures
We further provide evaluation metrics with respect to model structures as follows.

Boundary-based
Empirical Boundary Distance (EBD). The minimum distance from data points to the decision boundary reflects model robustness to small noise [11, 16]. EBD calculates the minimum distance to the model decision boundary in a heuristic way; a larger EBD value indicates a stronger model. Given a learnt model $f$ and a point $x$ with class label $y$, it first generates a set $V$ of $k$ random orthogonal directions [18]. Then, for each direction $v_i \in V$, it estimates the root mean square (RMS) distance $d_i$ moved along $v_i$ until the model's prediction changes, i.e., $f(x + d_i v_i) \neq y$. Among $\{d_1, \ldots, d_k\}$, $d_{min}(x)$ denotes the minimum distance moved to change the prediction for instance $x$. The Empirical Boundary Distance is then defined as follows:

$$\text{EBD} = \frac{1}{m} \sum_{x} d_{min}(x), \quad (20)$$

where $m$ denotes the number of instances used.
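EBD can be approximated with a simple line search along each direction; the sketch below uses a fixed step size instead of the RMS estimate described above, and is purely illustrative:

```python
def boundary_distance(model, x, direction, step=0.01, max_steps=1000):
    """Distance moved along `direction` (a unit vector) until the model's
    prediction on x changes; None if no flip within the search range."""
    y0 = model(x)
    for i in range(1, max_steps + 1):
        x_probe = [a + i * step * d for a, d in zip(x, direction)]
        if model(x_probe) != y0:
            return i * step
    return None

def ebd(model, points, directions):
    """Empirical Boundary Distance: for each point, the minimum flip
    distance over a set of (ideally orthogonal) directions, averaged
    over all points."""
    total = 0.0
    for x in points:
        dists = [d for d in (boundary_distance(model, x, v)
                             for v in directions) if d is not None]
        total += min(dists)
    return total / len(points)
```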
Empirical Boundary Distance 2 (EBD2). Additionally, we introduce EBD2, which calculates the minimum distance to the model decision boundary for each class. Given a learnt model $f$ and a dataset, for each of the classes, the metric estimates the distance required to change the model prediction. Specifically, we use iterative adversarial attacks (e.g., BIM) in practice and take the number of attack steps used as the distance.

Consistency-based
Empirical Noise Insensitivity (ENI). [55] first introduced the concept of learning algorithm robustness from the idea that if two samples are "similar", their test errors should be very close. ENI measures model robustness against noise from the view of the Lipschitz constant; a lower value indicates a stronger model. We first select clean examples randomly, then generate polluted examples from each clean example via various methods, e.g., adversarial attack, Gaussian noise, blur, etc. The change in the model's loss between a clean example and its corresponding polluted examples measures the model's insensitivity and stability to generalized small noise within constraint $c$:

$$\text{ENI} = \frac{1}{N} \sum_{i=1}^{N} \frac{\left| \mathcal{L}(f(x_i), y_i) - \mathcal{L}(f(x_i'), y_i) \right|}{\| x_i - x_i' \|_{\infty}}, \quad \text{s.t.} \ \| x_i - x_i' \|_{\infty} \leq c, \quad (21)$$

where $x_i$, $x_i'$, and $y_i$ denote the clean example, the corresponding polluted example, and the class label, respectively, and $\mathcal{L}$ represents the loss function of model $f$.

Neuron-based
Neuron Sensitivity. Intuitively, for a model with strong robustness, namely one insensitive to adversarial examples, a clean example and its corresponding adversarial example share similar representations in the hidden layers of the model [55]. Neuron Sensitivity is the deviation of the feature representations in hidden layers between clean examples and the corresponding adversarial examples, which measures model robustness from the perspective of neurons. Specifically, given a benign example $x$ from the dataset and its corresponding adversarial example $x_{adv}$, we obtain the dual-pair set $\mathcal{B} = \{(x, x_{adv})\}$ and calculate the sensitivity of the $i$-th neuron at the $l$-th layer as follows:

$$\sigma(f_l^i; \mathcal{B}) = \frac{1}{|\mathcal{B}|} \sum_{(x, x_{adv}) \in \mathcal{B}} \frac{\| f_l^i(x) - f_l^i(x_{adv}) \|_1}{\dim(f_l^i(x))}, \quad (22)$$

where $f_l^i(x)$ and $f_l^i(x_{adv})$ respectively represent the outputs of the $i$-th neuron at the $l$-th layer towards the clean example and the corresponding adversarial example during the forward pass, and $\dim(\cdot)$ denotes the dimension of a vector.
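Equation (22) reduces to a normalized $\ell_1$ deviation averaged over the dual pairs; a minimal sketch with illustrative names:

```python
def neuron_sensitivity(acts_clean, acts_adv):
    """Mean l1 deviation (normalized by dimension) between a neuron's
    feature-map outputs on clean/adversarial dual pairs.

    acts_clean, acts_adv: lists of equal-length activation vectors,
    one pair per example."""
    total = 0.0
    for fc, fa in zip(acts_clean, acts_adv):
        total += sum(abs(a - b) for a, b in zip(fc, fa)) / len(fc)
    return total / len(acts_clean)
```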
Neuron Uncertainty. Model uncertainty has been widely investigated in safety-critical applications to characterize the confidence and uncertainty of model predictions. Motivated by the fact that model uncertainty is commonly induced by predictive variance, we use the variance of a neuron's output to calculate Neuron Uncertainty:

$$\text{NU}(f_l^i) = \text{Var}_{x} \left[ f_l^i(x) \right]. \quad (23)$$
IV Experiments
In this section, we evaluate model robustness using our proposed evaluation framework. We conduct experiments on image classification benchmarks CIFAR10 and SVHN.
IV-A Experiment Setup
Architectures and hyperparameters. We use WRN-28-10 [60] for CIFAR-10 and VGG16 [46] for SVHN. For fair comparisons, we keep the architecture and main hyperparameters the same for all baselines on each dataset.

Adversarial attacks. To evaluate model robustness, we follow existing guidelines [34, 8] and incorporate multiple adversarial attacks for different perturbation types. Specifically, we adopt the PGD attack [34], the C&W attack [38], the boundary attack (BA) [6], SPSA [51], and NATTACK [25]. We set the perturbation magnitude for $\ell_1$ attacks to 12, for $\ell_2$ attacks to 0.5, and for $\ell_\infty$ attacks to 0.03, on both CIFAR-10 and SVHN. Note that, to check whether obfuscated gradients have been introduced, we adopt both white-box and black-box or gradient-free adversarial attacks. More complete details of all attacks, including hyperparameters, can be found in the Supplementary Material.
Corruption attacks. To assess corruption robustness, we evaluate models on CIFAR-10-C and CIFAR-10-P [19]. These two datasets are the first choice for benchmarking static and dynamic model robustness against common corruptions and noise sequences at different levels of severity [19]. They are created from the test set of CIFAR-10 using 75 different corruption variants (e.g., Gaussian noise, Poisson noise, pixelation, etc.). For SVHN, we use the code provided by [19] to generate the corrupted examples.
IV-B Model-Oriented Evaluation
We first evaluate model robustness by measuring the modeloriented evaluation metrics.
IV-B1 Model Behaviors
We first evaluate model robustness with respect to behaviors.
For adversarial robustness, we report CA, AAW, AAB, ACAC, ACTC, and NTE. The experimental results regarding CA and AAW can be found in Table II; the results of AAB are shown in Table III; and the results in terms of ACAC, ACTC, and NTE are shown in Figure 2. Besides standard black-box attacks (NATTACK, SPSA, and BA), we also generate adversarial examples using an Inception-V3 model and then perform transfer attacks on the target model (denoted "PGD-$\ell_1$", "PGD-$\ell_2$", and "PGD-$\ell_\infty$" in Table III).
For corruption robustness, the results of mCE, relative mCE, and mFR can be found in Figure 3. Moreover, the results of CAV, CRR/CSR, CCV, and COS are presented in Table IV.
From the above experimental results, we draw several conclusions: (1) TRADES achieves the highest adversarial robustness for almost all adversarial attacks in both black-box and white-box settings; however, it is vulnerable to corruptions; (2) models trained on one specific perturbation type are vulnerable to other norm-bounded perturbations (e.g., $\ell_\infty$-trained models are weak against $\ell_1$ and $\ell_2$ adversarial examples); and (3) according to Figures 2(b) and 2(d), standard adversarially trained models (SAT and PAT) are still vulnerable from a more rigorous perspective, showing high confidence in adversarial classes and low confidence in true classes.
IV-B2 Model Structures
We then evaluate model robustness with respect to structures. The results of EBD and EBD2 are illustrated in Table V and Figure 5; the results of Empirical Noise Insensitivity, Neuron Sensitivity, and Neuron Uncertainty can be found in Figures 6 and 7, respectively.
In summary, we can draw several interesting observations: (1) in most cases, models with higher adversarial accuracy show better structural robustness; (2) though showing the highest adversarial accuracy, TRADES does not have the largest EBD value, as shown in Table V.


IV-C Data-Oriented Evaluation
We then report the data-oriented evaluation metrics. For each dataset, given a test set of 10,000 images randomly selected from each class, we adversarially perturb these images using FGSM and PGD, respectively. We then compute and report the neuron-coverage-related metrics (KMNCov, NBCov, SNACov, TKNCov) on these test sets. The results can be found in Table VII. Further, we show the results of ALD, ASS, and PSD on these test sets in Table IX.
In summary, we draw the following conclusions: (1) adversarial examples generated under certain norms show significantly higher neuron coverage than those of the other perturbation types, which indicates that such attacks cover more "paths" of a DNN when performing tests or evaluations; (2) meanwhile, these attacks are also more imperceptible to human vision according to Table IX (lower ALD and PSD, and higher ASS, compared with the other attacks).
V Discussions and Suggestions
Having conducted extensive experiments on these datasets using our comprehensive evaluation framework, we now take a further step and provide additional suggestions for the evaluation of model robustness as well as the design of adversarial attacks/defenses in the future.
V-A Evaluate Model Robustness Using More Attacks
Most studies in the adversarial learning literature [29, 63, 54, 53] evaluate model robustness primarily with norm-bounded PGD attacks, which have been shown to be among the most effective and representative adversarial attacks. However, according to our experimental results, we suggest providing more comprehensive evaluations with different types of attacks:
(1) Evaluate model robustness against adversarial attacks bounded by multiple $\ell_p$ norms. As shown in Tables II and III, most adversarial defenses are designed to counteract a single type of perturbation (e.g., small $\ell_\infty$ noise) and offer no guarantees for other perturbations (e.g., $\ell_1$, $\ell_2$), sometimes even increasing model vulnerability to them [49, 35]. Thus, to fully evaluate adversarial robustness, we suggest using $\ell_1$, $\ell_2$, and $\ell_\infty$ attacks.
(2) Evaluate model robustness against corruption attacks as well as adversarial attacks. In addition to adversarial examples, corruptions such as snow and blur frequently occur in the real world, which also presents critical challenges for building strong deep learning models. According to our studies, deep learning models behave distinctly sub-human on input images with different corruptions. Meanwhile, adversarially robust models may also be vulnerable to corruptions, as shown in Figure 3. Therefore, we suggest taking both adversarial robustness and corruption robustness into consideration when measuring model robustness against noise.
(3) Perform black-box or gradient-free adversarial attacks, such as NATTACK, SPSA, etc. Black-box attacks are effective for revealing whether obfuscated gradients [3] have been introduced into a specific defense. Moreover, black-box attacks are also shown to cover more neurons when performing tests, as shown in Table VII.
V-B Evaluate Model Robustness Considering Multiple Views
To mitigate the problems caused by incomplete evaluations, we suggest evaluating model robustness with more rigorous metrics that consider multi-view robustness.
(1) Consider model behaviors with respect to more profound metrics, e.g., prediction confidence. For example, though showing high adversarial accuracy, SAT and NAT are vulnerable in that they show high confidence in adversarial classes and low confidence in true classes, similar to vanilla models.
(2) Evaluate model robustness in terms of model structures, e.g., decision boundary distance. For example, though it ranks first among the baselines in adversarial accuracy, TRADES is not strong enough in terms of Neuron Sensitivity and EBD compared to other baselines.
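As an illustration of such a structure-level view, a simplified sketch of the idea behind Neuron Sensitivity follows: the mean absolute change of each neuron's activation between clean inputs and their adversarial counterparts (the exact normalization in [61] may differ):

```python
import numpy as np

def neuron_sensitivity(acts_clean, acts_adv):
    """Per-neuron sensitivity to adversarial perturbations.

    acts_clean / acts_adv: (n_samples, n_neurons) activations of one
    layer on clean inputs and their adversarial versions. Neurons with
    large values react strongly to the perturbation, hinting at
    structural weak points of the model.
    """
    return np.mean(np.abs(acts_clean - acts_adv), axis=0)
```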
V-C Design of Attacks and Defenses
Besides model robustness evaluation, the proposed metrics are also beneficial to the design of adversarial attacks and defenses. Most of these metrics provide deep insight into model behaviors or structures under noise, which researchers can exploit when designing attack or defense methods. Regarding the metrics based on model structures, we can develop new attacks or defenses by impairing or enhancing them, since these metrics capture structural patterns that manifest model robustness. For example, to improve model robustness, we can constrain the values of ENI, Neuron Sensitivity, and Neuron Uncertainty during training.
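One way such a constraint could enter training is as a regularization term added to the task loss. A minimal sketch (the weight `lam` is a hypothetical trade-off parameter, not a value from our experiments):

```python
import numpy as np

def regularized_loss(task_loss, acts_clean, acts_adv, lam=0.1):
    """Sketch of turning a structural metric into a defense objective.

    Penalizes the average Neuron Sensitivity (mean absolute activation
    change under perturbation) alongside the ordinary task loss, so
    that training pushes neurons to react less to adversarial noise.
    """
    sensitivity = np.mean(np.abs(acts_clean - acts_adv))
    return task_loss + lam * sensitivity
```

In a real training loop the activations would come from a forward pass on clean and adversarial batches, and the sum would be differentiated through the network.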
VI An Open-Sourced Platform
To fully support our multi-view evaluation and facilitate further research, we provide an open-sourced platform, referred to as AISafety (https://git.openi.org.cn/OpenI/AISafety), based on PyTorch. Our platform has several highlights:
(1) Multi-language environment. To give users flexibility, our platform supports language-independent models (e.g., Java, C, Python, etc.). To achieve this, we establish standardized input and output systems with a uniform format. Specifically, we use Docker containers to encapsulate model input and output so that users can load their models freely.
(2) High extendibility. Our platform supports continuous integration of user-specific algorithms and models. In other words, users can introduce their own attack, defense, and evaluation methods by simply inheriting the base classes through several public interfaces.
(3) Multiple scenarios. Our platform integrates multiple real-world application scenarios, e.g., auto-driving, automatic checkout, and interactive robots.
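The inheritance pattern behind the extendibility highlight might look as follows; the class and method names here are illustrative, not the actual AISafety interfaces:

```python
class Attack:
    """Hypothetical base class a platform could expose as its public
    interface; user attacks plug in by subclassing it."""

    def generate(self, x):
        """Return adversarial versions of the inputs x."""
        raise NotImplementedError


class MyAttack(Attack):
    """A user-defined attack integrated by inheriting the base class."""

    def __init__(self, eps):
        self.eps = eps

    def generate(self, x):
        # Illustrative placeholder: shift every input by eps; a real
        # attack would compute input-specific perturbations.
        return [xi + self.eps for xi in x]
```

Because the platform only calls the base-class interface, any subclass dropped in by a user participates in the evaluation pipeline unchanged.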
Our platform consists of five main components: Attack module, Defense module, Evaluation module, Prediction module, and Database module. As shown in Figure 8, the Prediction module executes the model (possibly trained using defense strategies from the Defense module) on a specific dataset in the Database module, using attacks from the Attack module and evaluation metrics selected from the Evaluation module.
(1) Attack module is used for generating adversarial examples and corruption attacks, which contains 15 adversarial attacks and 19 corruption attacks.
(2) Defense module provides 10 adversarial defense strategies which could be used to improve model robustness.
(3) Evaluation module evaluates model robustness considering both the data and the model aspects of the issue, and contains 23 different evaluation metrics.
(4) Prediction module executes the models and standardizes the model inputs and outputs using specific attacks and evaluation metrics.
(5) Database module collects several datasets and pretrained models, which can be used for evaluation.
To make the platform flexible and user-friendly, we decouple each part of the evaluation process. Users can customize their evaluation by switching attack methods, defense methods, evaluation metrics, and models through simple parameter modification.
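Such parameter-driven decoupling can be sketched as a registry lookup, where each component is selected by name from the configuration; all names below are illustrative, not the AISafety configuration schema:

```python
def run_evaluation(config, registry):
    """Sketch of a decoupled, parameter-driven evaluation pipeline.

    Each component (attack, metric, ...) is looked up by name, so users
    switch methods by editing the config dict only, never the code.
    """
    attack = registry["attack"][config["attack"]]
    metric = registry["metric"][config["metric"]]
    adversarial_data = attack(config["data"])
    return metric(adversarial_data)
```

A user swaps attacks or metrics simply by changing the corresponding string in `config`, which is the kind of "simple parameter modification" described above.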
In contrast to other open-sourced platforms, as shown in Table X, AISafety enjoys advantages such as static/dynamic analysis and robustness evaluation.
VII Conclusion
Most current defenses conduct only incomplete evaluations, which are far from providing a comprehensive understanding of their limitations. Thus, most proposed defenses are quickly shown to be successfully attacked, resulting in the "arms race" phenomenon between attack and defense. To mitigate this problem, we establish a model robustness evaluation framework containing a comprehensive, rigorous, and coherent set of evaluation metrics, which can fully evaluate model robustness and provide deep insights into building robust models. Our framework primarily focuses on the two key factors of adversarial learning (i.e., data and model), and provides 23 evaluation metrics considering multiple aspects such as neuron coverage, data imperceptibility, decision boundary distance, and adversarial performance. We conduct large-scale experiments on multiple datasets, including CIFAR-10 and SVHN, using different models and defenses with our open-sourced platform AISafety, and provide additional suggestions for model robustness evaluation as well as attack/defense design.
The objective of this work is to provide a comprehensive framework that enables more rigorous evaluations of model robustness. We hope our paper helps fellow researchers better understand adversarial examples and further improve model robustness.
References
 [1] (2018) Certified defenses against adversarial examples. In International Conference on Learning Representations, Cited by: §IIA.
 [2] (2018) Semidefinite relaxations for certifying robustness to adversarial examples. In Advances in Neural Information Processing Systems, Cited by: §IIA.
 [3] (2018) Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420. Cited by: §I, §IIA, §VA.
 [4] (2017) Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397. Cited by: §I.
 [5] (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §I.
 [6] (2018) Decision-based adversarial attacks: reliable attacks against black-box machine learning models. In International Conference on Learning Representations, Cited by: §IIA, §IVA.
 [7] (2018) Thermometer encoding: one hot way to resist adversarial examples. In International Conference on Learning Representations, Cited by: §I.
 [8] (2019) On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705. Cited by: §I, §IIB, §IVA.
 [9] (2017) ZOO: zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In 10th ACM Workshop on Artificial Intelligence and Security, Cited by: §IIA.
 [10] (2017) Parseval networks: improving robustness to adversarial examples. In International Conference on Machine Learning, Cited by: §I, §IIA.
 [11] (1995) Support-vector networks. Machine learning. Cited by: §IIIB2.
 [12] (2020) Provable robustness against all adversarial ℓp-perturbations for p ≥ 1. In International Conference on Learning Representations, Cited by: §IIA.
 [13] (2005) Prediction confidence for associative classification. Cited by: TABLE I.
 [14] (2018) Boosting adversarial attacks with momentum. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §I, §IIA.
 [15] (2016) A study of the effect of JPG compression on adversarial images. arXiv preprint arXiv:1608.00853. Cited by: §IIA.
 [16] (2018) Large margin deep networks for classification. In Advances in Neural Information Processing Systems, Cited by: §IIIB2.
 [17] (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §I, §IIA, §IIA, §IIA, §IVA.
 [18] (2018) Decision boundary analysis of adversarial examples. In International Conference on Learning Representations, Cited by: §IIIB2.
 [19] (2019) Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, Cited by: TABLE I, §IIIB1, §IIIB1, §IIIB1, §IIIB1, §IVA.
 [20] (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Processing Magazine. Cited by: §I.
 [21] (2012) ImageNet classification with deep convolutional neural networks. In International Conference on Neural Information Processing Systems, Cited by: §I.
 [22] (2016) Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533. Cited by: §I, §I.
 [23] (1980) Contrast masking in human vision. JOSA 70 (12), pp. 1458–1471. Cited by: §IIIA2.
 [24] (2021) Understanding adversarial robustness via critical attacking route. Information Sciences. Cited by: §I.
 [25] (2019) NATTACK: learning the distributions of adversarial examples for an improved blackbox attack on deep neural networks. In International Conference on Machine Learning, Cited by: §IIA, §IVA.
 [26] (2018) Defense against adversarial attacks using highlevel representation guided denoiser. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §I, §IIA.
 [27] (2020) Spatiotemporal attacks for embodied agents. In European Conference on Computer Vision, Cited by: §I, §IIA.
 [28] (2019) Perceptualsensitive gan for generating adversarial patches. In 33rd AAAI Conference on Artificial Intelligence, Cited by: §I.
 [29] (2019) Training robust deep neural networks via adversarial noise propagation. arXiv preprint arXiv:1909.09034. Cited by: TABLE I, §VA.
 [30] (2020) Biasbased universal adversarial patch attack for automatic checkout. In European Conference on Computer Vision, Cited by: §I, §IIA.
 [31] (2010) Just noticeable difference for images with decomposition model for separating edge and textured regions. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §IIIA2.
 [32] (2018) Towards imperceptible and robust adversarial example attacks against neural networks. Proceedings of the AAAI Conference on Artificial Intelligence 32 (1). Cited by: TABLE I.
 [33] (2018) DeepGauge: multigranularity testing criteria for deep learning systems. In 33rd ACM/IEEE International Conference on Automated Software Engineering, Cited by: §IIB, TABLE I, §IIIA.
 [34] (2018) Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, Cited by: §I, §IIA, §IIA, §IVA, §IVA.
 [35] (2020) Adversarial robustness against the union of multiple perturbation model. In International Conference on Machine Learning, Cited by: §VA.
 [36] (2004) The art of software testing. In Chichester, Cited by: §IIIA.
 [37] (2016) Defensive distillation is not robust to adversarial examples. arXiv preprint arXiv:1607.04311. Cited by: §I, §IIA.
 [38] (2017) Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, Cited by: §I, §IIA, §IIA, §IVA.
 [39] (2019) Is AmI (attacks meet interpretability) robust to adversarial examples. arXiv preprint arXiv:1902.02322. Cited by: §I, §IIA.
 [40] (2016) CleverHans v2.0.0: an adversarial machine learning library. arXiv preprint arXiv:1610.00768. Cited by: TABLE X.
 [41] (2016) Practical black-box attacks against deep learning systems using adversarial examples. arXiv preprint arXiv:1602.02697. Cited by: §I, §IIA.
 [42] (2015) Distillation as a defense to adversarial perturbations against deep neural networks. arXiv preprint arXiv:1511.04508. Cited by: §I, §IIA.
 [43] (2017) DeepXplore: automated white-box testing of deep learning systems. In Proceedings of the 26th Symposium on Operating Systems Principles, Cited by: TABLE X.
 [44] (2017) Foolbox: a python toolbox to benchmark the robustness of machine learning models. Cited by: TABLE X.
 [45] (2016) DeepFool: a simple and accurate method to fool deep neural networks. In IEEE International Conference on Computer Vision, Cited by: §IIA.
 [46] (2015) Very deep convolutional networks for largescale image recognition. International Conference on Learning Representations. Cited by: §IVA.
 [47] (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §I, §IIA.
 [48] (2018) DeepTest: automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th International Conference on Software Engineering, Cited by: TABLE X.
 [49] (2019) Adversarial training and robustness for multiple perturbations. In Advances in Neural Information Processing Systems, Cited by: §VA.
 [50] (2017) Ensemble adversarial training: attacks and defenses. arXiv preprint arXiv:1705.07204. Cited by: §I.
 [51] (2018) Adversarial risk and the dangers of evaluating against weak attacks. In International Conference on Machine Learning, Cited by: §IIA, §IVA.
 [52] (2019) DEEPSEC: a uniform platform for security analysis of deep learning model. In IEEE Symposium on Security and Privacy (SP), Cited by: §IIB, TABLE I, TABLE X.
 [53] (2018) Mitigating adversarial effects through randomization. In International Conference on Learning Representations, Cited by: §I, §IIA, §IVA, §VA.
 [54] (2020) INTRIGUING properties of adversarial training at scale. In International Conference on Learning Representations, Cited by: §VA.
 [55] (2012) Robustness and generalization. Machine learning. Cited by: §IIIB2, §IIIB2.
 [56] (2018) Deep defense: training dnns with improved adversarial robustness. In Advances in Neural Information Processing Systems, Cited by: §I.
 [57] (2019) Efficient decision-based black-box adversarial attacks on face recognition. In IEEE International Conference on Computer Vision, Cited by: §IIA.
 [58] (2020) Benchmarking adversarial robustness on image classification. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §I, §IIB, TABLE X.
 [59] (2004) Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing. Cited by: TABLE I.
 [60] (2016) Wide residual networks. In The British Machine Vision Conference, Cited by: §IVA.
 [61] (2021) Interpreting and improving adversarial robustness of deep neural networks with neuron sensitivity. IEEE Transactions on Image Processing 30, pp. 1291–1304. Cited by: TABLE I.
 [62] (2019) Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning, Cited by: §IVA.
 [63] (2019) Interpreting adversarially trained convolutional neural networks. arXiv preprint arXiv:1905.09797. Cited by: §VA.