Towards Characterizing Adversarial Defects of Deep Learning Software from the Lens of Uncertainty

by   Xiyue Zhang, et al.

Over the past decade, deep learning (DL) has been successfully applied to many industrial domain-specific tasks. However, current state-of-the-art DL software still suffers from quality issues, which raises great concern, especially in safety- and security-critical scenarios. Adversarial examples (AEs), inputs on which DL software makes incorrect decisions, represent a typical and important type of defect that urgently needs to be addressed. Such defects arise either through intentional attacks or through physical-world noise perceived by input sensors, potentially hindering further industrial deployment. The intrinsic uncertainty of deep learning decisions can be a fundamental cause of such incorrect behavior. Although some testing, adversarial attack, and defense techniques have recently been proposed, a systematic study uncovering the relationship between AEs and DL uncertainty is still lacking. In this paper, we conduct a large-scale study to bridge this gap. We first investigate the capability of multiple uncertainty metrics in differentiating benign examples (BEs) and AEs, which enables characterizing the uncertainty patterns of input data. We then identify and categorize the uncertainty patterns of BEs and AEs, and find that while BEs and AEs generated by existing methods do follow common uncertainty patterns, other uncertainty patterns are largely missed. Based on this, we propose an automated testing technique to generate multiple types of uncommon AEs and BEs that existing techniques largely miss. Our further evaluation reveals that the uncommon data generated by our method are hard to defend against: the average defense success rate of existing defense techniques is reduced by 35%. Our results call for attention to the necessity of generating more diverse data for evaluating quality assurance solutions of DL software.





1. Introduction

Fueled by the availability of domain-specific big data and hardware acceleration, deep learning (DL) has experienced a remarkable performance leap over the past few years, achieving competitive results in many cutting-edge applications (e.g., image processing (Russakovsky et al., 2015), speech recognition (Hinton et al., 2012), sentiment analysis (Dohaiha et al., 2018), e-commerce recommendation (Zhang et al., 2019), and video game control (Mnih et al., 2015)). However, state-of-the-art DL software still suffers from quality issues. A deep neural network (DNN) that achieves high prediction accuracy can still be vulnerable to adversarial examples (AEs) (Biggio et al., 2013). For example, an image-recognition DL system can be easily fooled by pixel-level noise (The BBC, 2016) or noise perceived in physical-world situations (Tian et al., 2018; Eykholt et al., 2018). Such quality and reliability issues, if not properly addressed or confined, could hinder the wider adoption of DL software, especially in applications with stringent safety and security requirements (e.g., autonomous driving, medical diagnosis).

The incorrect decisions of DL software can be traced back to several typical sources and patterns (e.g., generalization issues, robustness issues (Katz et al., 2017)). To date, AEs remain one of the most notable types of DL defects, revealing the quality and robustness issues of DL software. Although the arms race among recently proposed adversarial attack (Goodfellow et al., 2015; Moosavi-Dezfooli et al., 2016; Kurakin et al., 2016a; Carlini and Wagner, 2017), defense (Gong et al., 2017; Wang et al., 2018; Hazan et al., 2016; Papernot et al., 2016; Xu et al., 2017; Guo et al., 2017; Prakash et al., 2018), and testing (Pei et al., 2017; Tian et al., 2018; Xie et al., 2019a; Du et al., 2019) techniques continuously escalates, most of these techniques are rather ad hoc. Thus far, research efforts on understanding and interpreting AEs and benign examples (BEs) are still at an early stage (Tao et al., 2018). Uncertainty provides a new perspective for characterizing AEs and BEs, towards understanding and designing better quality assurance techniques for DL software, e.g., testing and adversarial attack/defense.

(a) Prediction Confidence
(b) Bayesian Uncertainty
Figure 1. Prediction Confidence and Bayesian Uncertainty.

To bridge this gap, in this paper we study 4 state-of-the-art Bayesian uncertainty metrics (based on the statistical analysis of multi-shot executions) along with the one-shot execution metric, i.e., the prediction confidence score (PCS) of the DL software under analysis (see Fig. 1 and definitions in § 2). We perform a large-scale comparative study to investigate the sensitivity/capability of these metrics in differentiating AEs and BEs, which are important indicators for characterizing DL runtime behaviors. We find that PCS and the variation ratio in terms of the original prediction (VRO) (§ 2) are among the best candidates with such capability, and we select them for further input data characterization. Then, an obvious question arises: what is the relation between the AEs/BEs generated by state-of-the-art adversarial attack/testing techniques and these uncertainty metrics? In particular, do these AEs/BEs follow some patterns in terms of uncertainty? Our in-depth analysis reveals that AEs/BEs generated by existing techniques (Goodfellow et al., 2015; Moosavi-Dezfooli et al., 2016; Kurakin et al., 2016b; Carlini and Wagner, 2017; Odena et al., 2019; Xie et al., 2019a) largely fall into two common patterns (we refer to such data as common input samples): (1) AEs tend to have low PCS and high VRO, and (2) BEs often come with high PCS and low VRO.

Based on the above observation, further questions naturally arise: What do the uncommon inputs look like, and can we generate them automatically? And do these uncommon inputs differ from common ones, e.g., are they even more challenging for DL software to handle correctly?

To answer these questions, we propose a genetic algorithm (GA) based automated test generation technique that iteratively generates uncommon input samples guided by uncertainty metrics. We implement the proposed technique as a tool named KuK (to Know the UnKnown), and demonstrate its effectiveness in generating uncommon inputs on a large benchmark, including three datasets (MNIST, CIFAR10, ImageNet) across four different DL model architectures (LeNet-5, NIN, ResNet-20, MobileNet). In line with existing adversarial defense techniques, our comparative experiments against state-of-the-art adversarial attack/testing techniques also reveal that uncommon samples can be more challenging for DL software to handle correctly in many cases. Such uncommon samples represent a new type of hazard and potential defect for DL software, which so far lacks investigation and should draw further attention in future quality assurance solution design.
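As a rough illustration of how such an uncertainty-guided search can work, the following is a minimal genetic-loop sketch. It is our reading of the general approach, not KuK's actual implementation: the `mutate` and `fitness` callables are placeholders for, e.g., small input perturbations and closeness of the (PCS, VRO) profile to a target uncommon pattern.

```python
import random

def ga_search(seeds, mutate, fitness, pop_size=20, generations=50):
    """Skeleton of an uncertainty-guided genetic search. In the real
    setting, `mutate` would apply small input perturbations and
    `fitness` would reward (PCS, VRO) profiles matching a target
    uncommon pattern (e.g., high PCS *and* high VRO)."""
    population = list(seeds)
    for _ in range(generations):
        # Mutation: perturb randomly chosen members of the population.
        offspring = [mutate(random.choice(population)) for _ in range(pop_size)]
        # Selection: keep the pop_size fittest candidates.
        population = sorted(population + offspring, key=fitness, reverse=True)[:pop_size]
    return population

# Toy demo: "inputs" are scalars and fitness prefers values near 1.0.
random.seed(0)
best = ga_search([0.0],
                 mutate=lambda x: x + random.uniform(-0.1, 0.1),
                 fitness=lambda x: -abs(x - 1.0))
print(round(best[0], 3))  # close to 1.0 after 50 generations
```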

In summary, this paper investigates the following research questions with the support of a large-scale study:


  • RQ1: What is the capability of the state-of-the-art uncertainty metrics in differentiating AEs and BEs?

  • RQ2: Do the AEs and BEs generated by the state-of-the-art adversarial attack/testing techniques follow common uncertainty patterns? If so, what are such common patterns?

  • RQ3: Is it possible to generate the uncommon data that are missed by the state-of-the-art attack/testing techniques? Is KuK useful in generating such uncommon data?

  • RQ4: To what extent are the uncommon samples defended by existing adversarial defense techniques compared with the common ones?

Through answering RQ1 and RQ2, we aim to characterize behaviors of DL software on AEs and BEs from the uncertainty perspective, and investigate whether some common patterns are followed by data inputs generated through current state-of-the-art adversarial attack/testing techniques. Our study confirms the existence of the common patterns for such generated data (i.e., low PCS and high VRO for AEs, high PCS and low VRO for BEs). It also identifies new uncommon sample categories which are missed by existing techniques.

RQ3 and RQ4 focus on the feasibility of obtaining the uncommon samples (RQ3) and the impact of such samples on the quality and reliability of DL software (RQ4). Our evaluation results confirm that uncommon samples can be generated with proper testing guidance. Such uncommon samples, representing a new type of generated test data, bypass a variety of adversarial defense techniques with higher success rates. It is therefore important to generate such uncommon inputs to reveal the hidden defects of DL software for vulnerability analysis and further enhancement, especially in safety- and security-critical scenarios. We believe such uncommon data could be an important clue towards building trustworthy DL solutions, and should draw special attention in further quality assurance solution design.

The contributions of this paper are summarized as follows:


  • We perform an empirical study on four state-of-the-art Bayesian uncertainty metrics and one prediction confidence metric to investigate their ability to differentiate BEs and AEs, i.e., data inputs that can/cannot be correctly handled by a DL software. Among these metrics, PCS and VRO outperform the others in achieving higher differentiating accuracy.

  • We perform a systematic study on the BEs and AEs generated by six state-of-the-art adversarial attack/testing techniques (i.e., FGSM (Goodfellow et al., 2015), BIM (Kurakin et al., 2016b), Deepfool (Moosavi-Dezfooli et al., 2016), C&W (Carlini and Wagner, 2017), DeepHunter (Xie et al., 2019a), TensorFuzz (Odena et al., 2019)) to identify potential uncertainty patterns in terms of PCS and VRO. Our results reveal that (1) AEs and BEs from existing techniques largely follow two common patterns while the other patterns (denoted as uncommon patterns) are largely missed by existing methods, and (2) the characteristics of AEs generated by the testing tools differ from those by the adversarial attack techniques.

  • We propose a GA-based automated testing technique for DL software, implemented as a tool KuK, towards generating input samples with diverse uncertainty patterns. In particular, our evaluation on three datasets (MNIST, CIFAR10, ImageNet) across four different model architectures (LeNet-5, NIN, ResNet-20, MobileNet) demonstrates its effectiveness in generating uncommon input samples that are largely missed by existing techniques.

  • We further investigate how adversarial defense techniques (Gong et al., 2017; Wang et al., 2018; Hazan et al., 2016; Papernot et al., 2016; Xu et al., 2017; Guo et al., 2017; Prakash et al., 2018) react to the uncommon samples, in line with the samples generated by adversarial attack techniques. Our results indicate that current defense techniques are often biased towards particular patterns of samples. The generated uncommon samples can bypass such defense techniques with a high success rate, potentially posing severe threats to the quality and reliability of DL software. For example, on the NIN model, the uncommon data achieve a 97.5% success rate in bypassing the mutation-based defense, while the common data achieve only 5.5%.

2. Preliminaries and overview

2.1. Deep Neural Networks

Definition 1 (Deep Neural Network).

A deep neural network (DNN) is a function f that maps an input x ∈ X to a predictive probability vector f(x) ∈ [0, 1]^{|C|}. The output label of f on input data x is label(x) = argmax_{c ∈ C} f(x)_c, where C is the set of classes and f(x)_c is the predictive probability of class c.

In general, a DNN learns to extract features from the distribution of training data layer by layer, and provides the decision on each candidate class with some probabilistic confidence. A higher predictive probability value of a class often indicates higher prediction confidence on that decision label.

Definition 2 (Benign Example).

A benign example (BE) x with ground truth label y is an input sample such that the prediction decision of a DNN f is consistent with the ground truth label: label(x) = y.

Benign examples refer to those inputs that can be correctly handled by a given DNN model f.

Definition 3 (Adversarial Example).

An adversarial example (AE) x' is an input similar to a benign example x, obtained by adding some minor perturbation δ (i.e., x' = x + δ), but resulting in a different prediction decision of the DNN f (i.e., label(x') ≠ label(x)).

Existing attack methods usually generate AEs by manipulating the output of the logit layer or softmax layer of a DNN f, gradually decreasing the output probability of the ground truth label while increasing the probability of other labels. In this way, the prediction decision is shifted to another label. From direct observation of the prediction results of such AEs, we identify that a typical unreliable prediction is usually accompanied by two classes with close probability values.

Definition 4 (Prediction Confidence Score).

Given a DNN f and an input x, the prediction confidence score (PCS) of the input x on f is defined as

PCS(x) = f(x)_{c1} − f(x)_{c2},

where C is the set of classes, and c1, c2 ∈ C are the classes with the highest and second-highest predictive probabilities, respectively.

Intuitively, PCS depicts the probability difference between the two classes with the highest probabilities, which provides an uncertainty proxy from the aspect of distance to the geometric decision boundary. For an input x, the smaller the PCS(x), the closer x is to the decision boundary between the top-two classes. As a result, it is more likely to cross the boundary under noise perturbations. In the following sections, we use PCS(X) to denote {PCS(x) | x ∈ X}, where X is a set of inputs.
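Concretely, PCS can be computed from a single softmax output; a minimal NumPy sketch, assuming `probs` is the model's predictive probability vector for one input:

```python
import numpy as np

def pcs(probs):
    """Prediction confidence score (Definition 4): the gap between the
    two highest class probabilities of a single softmax output."""
    top2 = np.sort(np.asarray(probs))[-2:]
    return float(top2[1] - top2[0])

# A confident prediction has a large gap; a near-boundary one does not.
print(pcs([0.90, 0.05, 0.05]))  # ~0.85: far from the decision boundary
print(pcs([0.48, 0.47, 0.05]))  # ~0.01: near the top-two boundary
```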

2.2. Bayesian Uncertainty Measures

Besides prediction-confidence-based metrics, Bayesian methods have recently been proposed to estimate a DNN's uncertainty through multi-shot execution analysis (Gal, 2016) (see Fig. 1). From the principled Bayesian perspective, DNNs are not regarded as deterministic functions with fixed parameters. Instead, the parameters ω of a DNN are treated as random variables that obey a prior distribution p(ω). The posterior distribution p(ω | D) is then approximated given a training data set D, based on which uncertainty estimates can be obtained. The relationship between the prediction confidence score and Bayesian uncertainty estimates is shown in Fig. 1. Intuitively, the prediction confidence score reflects a distance abstraction to the fixed geometric boundary w.r.t. a point-estimate neural network whose parameters are fixed, while the Bayesian method ensembles a set of networks whose weights follow some probability distribution.

However, due to the complexity of DNNs, it is often impractical to sample infinitely many weights from the distribution and perform the corresponding runtime executions. To this end, the state-of-the-art approach obtains uncertainty estimates using the dropout technique over multiple runs (Monte Carlo dropout (Gal and Ghahramani, 2016)). Although dropout was originally proposed as a regularization method during training to avoid over-fitting, for uncertainty estimation dropout is applied at test time, sampling weights from the distribution to obtain DNN instances. As a result, it ensembles a number of neural networks with different weights: each prediction execution randomly drops out some units in the DNN, which may lead to different prediction results. This makes it possible to obtain uncertainty estimates efficiently and scales to real-world neural networks. Specifically, there are three commonly-used metrics to estimate uncertainty (Gal, 2016), i.e., variation ratio, predictive entropy and mutual information.
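In practice, Monte Carlo dropout amounts to keeping dropout active at inference time (e.g., calling a Keras model with `training=True`) and collecting the outputs of T stochastic passes. The following self-contained toy sketch mimics this multi-shot sampling with a hypothetical one-layer "model"; it is only an illustration of the sampling scheme, not a real DNN:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def dropout_forward(x, W, b, p=0.5):
    """One stochastic forward pass: each input unit is dropped with
    probability p, as Monte Carlo dropout does at test time."""
    mask = rng.random(x.shape) >= p
    return softmax(W @ (x * mask) / (1.0 - p) + b)  # inverted-dropout scaling

def mc_predict(x, W, b, T=50):
    """T stochastic passes yield T probability vectors: the multi-shot
    executions from which the Bayesian metrics below are computed."""
    return np.stack([dropout_forward(x, W, b) for _ in range(T)])

# Toy 3-class "model" on a 4-dimensional input.
x = np.array([1.0, -0.5, 0.3, 0.8])
W = rng.standard_normal((3, 4))
b = np.zeros(3)
P = mc_predict(x, W, b)
print(P.shape)  # (50, 3): one probability vector per pass
```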

2.2.1. Variation Ratio

Variation ratio measures the dispersion from the dominant class of the prediction (i.e., the predicted class with the highest frequency in multiple predictions).

Definition 5 (Variation Ratio).

Given a model f and an input x, the variation ratio (VR) of the input is defined as

VR(x) = 1 − (max_{c ∈ C} Σ_{t=1}^{T} 1[y_t = c]) / T,

where f_d is the model with dropout enabled, y_t denotes the t-th prediction result by f_d, and T is the total number of prediction executions by f_d. The maximizing class represents the dominant label among the predictions.

Another variant of the variation ratio in terms of the original prediction (the prediction of the model under analysis) is defined as follows:

Definition 6 (Variation Ratio for Original Prediction).

Given a model f and an input x, the variation ratio for original prediction (denoted as VRO) of the input is defined as

VRO(x) = 1 − (Σ_{t=1}^{T} 1[y_t = y_o]) / T,

where y_o is the prediction result from the original model f (with dropout disabled) and y_t denotes the t-th prediction by the dropout-enabled model f_d.

Intuitively, VR measures the general uncertainty of the decision with the highest frequency (i.e., whether most predictions agree on the same result), while VRO represents the stability around the prediction of model f (i.e., whether the majority of predictions agree with the original result). The higher the VR or VRO, the more uncertain the prediction.
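Both metrics reduce to simple counting over the T dropout predictions; a small NumPy sketch (the label arrays are illustrative):

```python
import numpy as np

def variation_ratio(dropout_labels):
    """VR (Definition 5): 1 minus the frequency of the dominant label
    among the T dropout predictions."""
    labels = np.asarray(dropout_labels)
    _, counts = np.unique(labels, return_counts=True)
    return 1.0 - counts.max() / labels.size

def variation_ratio_original(dropout_labels, original_label):
    """VRO (Definition 6): 1 minus the fraction of dropout predictions
    that agree with the original (deterministic) model's label."""
    labels = np.asarray(dropout_labels)
    return 1.0 - float(np.mean(labels == original_label))

# The original model predicts class 1 while all dropout runs predict 0:
runs = [0, 0, 0, 0, 0]
print(variation_ratio(runs))              # 0.0 -> VR: looks certain
print(variation_ratio_original(runs, 1))  # 1.0 -> VRO: maximally uncertain
```

The example shows why VRO can be more sensitive than VR: when the dropout runs agree with each other but disagree with the original model, VR reports certainty while VRO flags the instability.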

2.2.2. Predictive Entropy

Predictive entropy originates from information theory and measures the average amount of information contained in a stochastic source of predictive output. When all classes are predicted with equal probability (a uniform distribution), the prediction carries the most information, indicating high uncertainty. In contrast, when one class is predicted with a high probability value (e.g., 0.9), the prediction is relatively certain.

Definition 7 (Predictive Entropy).

Given all the predictive probability distributions p^(1)(x), …, p^(T)(x) of an input x across T predictions of the dropout-enabled model f_d, the predictive entropy is defined as

PE(x) = − Σ_{c ∈ C} p̄_c log p̄_c,   with p̄_c = (1/T) Σ_{t=1}^{T} p_c^(t),

where p_c^(t) denotes the probability value of class c in the t-th prediction of model f_d.

2.2.3. Mutual Information

Mutual information quantifies the amount of information obtained about one random variable through observations of the other. In the case of DNNs, the two random variables are the prediction y and the model parameters ω, whose posterior distribution is approximated through T stochastic forward passes of the dropout-enabled model f_d. We use ω_t to denote an instance of the model parameters sampled from the posterior distribution.

Definition 8 (Mutual Information).

Given the predictive probability distributions p^(1)(x), …, p^(T)(x) of an input x in T predictions, the mutual information of input x is defined as

MI(x) = PE(x) + (1/T) Σ_{t=1}^{T} Σ_{c ∈ C} p_c^(t) log p_c^(t),

where PE(x) is approximated in a similar way to predictive entropy, based on the probability vectors obtained through the T stochastic forward passes.
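Both quantities can be computed directly from the T probability vectors; a NumPy sketch (the small `eps` guards the logarithm against zero probabilities):

```python
import numpy as np

def predictive_entropy(P, eps=1e-12):
    """PE (Definition 7): entropy of the class probabilities averaged
    over the T stochastic passes. P has shape (T, num_classes)."""
    p_mean = np.asarray(P).mean(axis=0)
    return float(-(p_mean * np.log(p_mean + eps)).sum())

def mutual_information(P, eps=1e-12):
    """MI (Definition 8): PE minus the mean per-pass entropy, i.e. the
    disagreement *between* passes rather than the overall spread."""
    P = np.asarray(P)
    per_pass = -(P * np.log(P + eps)).sum(axis=1).mean()
    return predictive_entropy(P, eps) - float(per_pass)

# Passes that agree on a uniform output: high PE, (near-)zero MI.
agree = [[0.5, 0.5], [0.5, 0.5]]
# Passes that flip confidently between classes: high PE *and* high MI.
flip = [[1.0, 0.0], [0.0, 1.0]]
print(predictive_entropy(agree), mutual_information(agree))
print(predictive_entropy(flip), mutual_information(flip))
```

The two cases illustrate the difference between the metrics: PE is high whenever the averaged distribution is spread out, while MI is high only when the individual passes disagree with each other.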

2.3. Overview

Figure 2. The overview of our study and its application.

In this paper, we aim to understand the capability of different uncertainty portrayals, on which behaviors of AEs and BEs are further characterized. Fig. 2 shows an overview of our work, summarized as two major components: (1) an empirical study about the uncertainty metrics and characteristics of the existing data inputs, and (2) the data generation algorithm that generates input samples with uncommon patterns and potential applications. Specifically, we first perform an empirical study towards understanding the capability of different uncertainty metrics on distinguishing AEs and BEs (§ 3.2). Then, we propose a way of categorizing AEs and BEs based on two perspectives: the prediction confidence score from the perspective of the single-shot model execution and the uncertainty estimates from the statistical perspective of multi-shot execution. Next, we study the uncertainty patterns of the BEs and AEs generated by the existing adversarial attack/testing tools (§ 3.4).

From the empirical study, we find that the existing AEs and BEs follow specific uncertainty patterns. We then propose a genetic-algorithm-based approach to generate uncommon data inputs (§ 4), whose uncertainty patterns differ from the patterns that existing data fall into. We identify and analyze the importance of data with diverse uncertainty patterns from a testing perspective. Finally, we evaluate the capability of such uncommon data in bypassing a variety of defense techniques (§ 5.2).

3. Empirical Study

In this section, we first perform a comparative study of the capability of different uncertainty metrics in distinguishing AEs and BEs. We then conduct a follow-up investigation of the characteristics of existing AEs/BEs from the uncertainty perspective.

3.1. Subject Dataset and Data Preparation

3.1.1. Datasets

Dataset    DNN Model   #Neuron   #Layer   Acc.(%)
MNIST      LeNet-5     268       9        99.0
CIFAR-10   NIN         1,418     12       88.2
CIFAR-10   ResNet-20   2,570     70       86.9
ImageNet   MobileNet   38,904    87       87.1*

* The reported top-5 test accuracy of the pretrained DNN model in (Howard et al., 2017).

Table 1. Subject datasets and DNN models.

We selected three popular publicly available datasets (i.e., MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky et al., ), and ImageNet (Russakovsky et al., 2015)) as the evaluation subject datasets (see Table 1).

MNIST is for handwritten digit image recognition, containing 60,000 training images and 10,000 test images, i.e., 70,000 images in total, categorized into 10 classes (handwritten digits from 0 to 9). Each MNIST image is of size 28×28 with a single channel.

CIFAR-10 is a collection of images for general-purpose image classification, including 50,000 training images and 10,000 test images in 10 different classes (e.g., airplane, bird, cat and ship), with 6,000 images per class. Each CIFAR-10 image is of size 32×32 with three channels. The classification task of CIFAR-10 is generally more difficult than that of MNIST due to the data size and complexity.

ImageNet is a large-scale, practice-sized dataset used as the database for the large-scale visual recognition challenge (ILSVRC) towards general-purpose image classification. The complexity of ImageNet is characterized by a training set of over 1 million images, together with a validation set of 50,000 images and a separate test set. Each image is typically resized to 224×224 for model input.

For each dataset, we study several popular DNN models used in previous work  (LeCun et al., 1998; Lin et al., 2013; He et al., 2016; Howard et al., 2017), which achieve competitive test accuracy (Table 1).

The DL models we study in this paper all have convolutional architectures. However, our approach is generic and could be applied to other network architectures such as recurrent neural networks, since it relies only on the uncertainty nature of DNNs; its feasibility for a given architecture depends only on whether the uncertainty metrics can be computed. The only requirement on the DL model is a classification setting, so that the model output is a probability distribution over a set of classes and the uncertainty metrics can be obtained. In short, our approach applies to DNN classification tasks regardless of the model architecture.

3.1.2. Adversarial Example Generation Tools

We chose four state-of-the-art adversarial attack tools, i.e., FGSM (Goodfellow et al., 2015) (Fast Gradient Sign Method), BIM (Kurakin et al., 2016b) (Basic Iterative Method), Deepfool (Moosavi-Dezfooli et al., 2016) and C&W (Carlini and Wagner, 2017) attacks, to generate adversarial examples. Specifically, we used the existing Python tool-set foolbox (Rauber et al., 2017) to perform these attacks and each attack is configured with the default setting.

We also selected two state-of-the-art automated testing tools for deep neural networks, i.e., DeepHunter (Xie et al., 2019a) and TensorFuzz (Odena et al., 2019), both of which adopt coverage-guided fuzzing techniques. For DeepHunter, we generated AEs with k-Multisection Neuron Coverage (KMNC) and Neuron Coverage (NC) as the guidance. Following the configuration in (Xie et al., 2019a; Pei et al., 2017), we set k as 1,000 for KMNC and the threshold as 0.75 for NC. TensorFuzz is configured with the default setting used in (Odena et al., 2019).

Model DH-KMNC DH-NC TensorFuzz Adv_attacks
LeNet-5 2,980 4,607 1,436 49,000
NIN 5,715 9,110 3,848 49,000
ResNet-20 6,960 9,249 2,465 49,000
MobileNet 3,314 20,541 13,999 49,000
Table 2. Number of adversarial examples generated by the testing tools and adversarial attacks.

3.1.3. Data Preparation

Overall, we prepared the following three sets of data: one set of benign examples, one set of AEs generated by the attack methods, and one set of AEs generated by testing tools (see Table 2):


  • BenignData. For each dataset, we randomly sampled test data that the models predict correctly, and used them as the benign dataset for each model.

  • AttackAdv. For each input in BenignData, we generated four types of AEs with the four attack methods, resulting in 49,000 AEs per model (Column Adv_attacks in Table 2).

  • TestAdv. For each dataset, we randomly sampled 500 benign examples as the initial seeds. Then, we ran DeepHunter and TensorFuzz for each model with 5,000 iterations. The columns DH-KMNC, DH-NC, and TensorFuzz show the number of adversarial examples generated by DeepHunter with KMNC and NC guidance, and Tensorfuzz.

3.2. RQ1: Empirical Study on Uncertainty Metrics

The objective of RQ1 is to study the relationship between uncertainty metrics and adversarial examples. In particular, we analyze the effectiveness of the uncertainty metrics in distinguishing AEs and BEs. We adopt AUC-ROC (Fawcett, 2006) to evaluate the classification performance of each metric. We use this score as the evaluation criterion because it measures performance without depending on a preset threshold. Specifically, AUC-ROC is a performance measure for classification problems across various threshold settings: ROC is short for receiver operating characteristic curve, and AUC represents the degree (or measure) of separability. It tells us to what extent the evaluated metric is capable of distinguishing between the two classes. In our metric effectiveness evaluation, the higher the AUC-ROC score, the better a metric distinguishes AEs from BEs.
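For intuition, AUC-ROC equals the probability that a randomly chosen positive (here, an AE) receives a higher metric score than a randomly chosen negative (a BE). A threshold-free NumPy sketch using this rank-statistic view, over hypothetical VRO scores:

```python
import numpy as np

def auc_roc(scores_pos, scores_neg):
    """AUC-ROC via the Mann-Whitney U statistic: the probability that a
    randomly drawn positive scores higher than a randomly drawn
    negative, with ties counted as one half."""
    pos = np.asarray(scores_pos, dtype=float)[:, None]
    neg = np.asarray(scores_neg, dtype=float)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())

# Hypothetical VRO values: AEs (positives) tend to score higher than BEs.
vro_ae = [0.9, 0.8, 0.7, 0.4]
vro_be = [0.1, 0.2, 0.3, 0.5]
print(auc_roc(vro_ae, vro_be))  # 0.9375: VRO separates the two sets well
```

In the study itself a library routine (e.g., scikit-learn's `roc_auc_score`) would typically be used; the two formulations are equivalent.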

Table 3 shows the AUC-ROC scores achieved on different metrics across different data. To be specific, we used BEs and AEs generated from each attack to calculate the AUC-ROC scores. Overall, PCS achieves the best performance as it is a direct measure of the prediction confidence of the target model. Interestingly, we found that AEs often have low prediction confidence.

From the Bayesian uncertainty perspective, we found that VRO achieves the best performance. On the MNIST and CIFAR-10 datasets, all AUC-ROC scores are over 97%. On ImageNet, the minimum AUC-ROC score is 77.47%, obtained when differentiating BEs from AEs generated by Deepfool, while the scores on the other attacks are all above 84%. From our in-depth analysis, we found that VRO (Definition 6) captures the difference between the prediction of the original model and the multiple predictions of the randomized dropout-enabled model. The other metrics, in contrast, represent uncertainty based on the stability of the multi-shot predictions alone, without considering the model under analysis. For example, given an input, suppose the prediction of the original model is incorrect while all predictions of the dropout-enabled model are correct; the values of VR and VRO would then be 0 and 1, respectively. Intuitively, VRO shows that the model is quite uncertain, but VR indicates that it is very certain. In other words, VRO is more sensitive in capturing the uncertain behavior of the target model than the other uncertainty metrics.

Answer to RQ1: PCS of the original model is often effective in distinguishing BEs from AEs generated by existing attacks. Among the Bayesian uncertainty metrics, VRO, which compares the prediction stability between the original model and multiple dropout-enabled predictions, is often more effective than the others.

3.3. Characterizing Data Behavior

PCS and Bayesian uncertainty depict the prediction results of a DNN from different angles. In particular, existing studies demonstrate that high prediction confidence is not equal to low Bayesian uncertainty, and vice versa (Gal, 2016).

The prediction confidence (i.e., PCS) represents the confidence of a single-shot model execution, while Bayesian uncertainty is measured by the statistical results of multi-shot model executions. The results of RQ1 reveal that, among the Bayesian uncertainty metrics, VRO stands out in capturing the different behaviors of BEs and AEs. Therefore, we adopt a two-dimensional metric, (PCS, VRO), to characterize the BEs and AEs on a specific DNN model. Based on this, the data is classified into four patterns (see Fig. 2 (a)): low PCS and high VRO (LH), high PCS and high VRO (HH), low PCS and low VRO (LL), and high PCS and low VRO (HL). This categorization provides a way to understand and analyze the behaviors of AEs and BEs.
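The four-way categorization is a simple double thresholding on the (PCS, VRO) pair; a sketch with illustrative 0.5 cut-offs (the thresholds here are placeholders, not the study's calibration):

```python
def uncertainty_pattern(pcs, vro, pcs_thresh=0.5, vro_thresh=0.5):
    """Map a (PCS, VRO) pair to one of the four patterns LH/HH/LL/HL.
    The 0.5 thresholds are illustrative only."""
    p = "H" if pcs >= pcs_thresh else "L"  # first letter: PCS level
    v = "H" if vro >= vro_thresh else "L"  # second letter: VRO level
    return p + v

print(uncertainty_pattern(0.95, 0.05))  # 'HL': typical benign example
print(uncertainty_pattern(0.02, 0.80))  # 'LH': typical attack-generated AE
print(uncertainty_pattern(0.90, 0.90))  # 'HH': an "uncommon" pattern
```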

3.4. RQ2: Categorization of Existing Data

Model Attacks PCS VRO VR PE MI
LeNet-5 BIM 99.98% 99.06% 90.72% 83.56% 81.52%
C&W 100.00% 99.08% 90.23% 82.86% 81.61%
Deepfool 99.44% 98.31% 93.47% 86.46% 84.78%
FGSM 99.98% 98.74% 95.52% 90.09% 87.36%
NIN BIM 99.95% 99.46% 88.57% 86.73% 86.95%
C&W 99.90% 99.44% 87.93% 85.99% 86.48%
Deepfool 99.79% 99.18% 91.44% 88.64% 88.51%
FGSM 99.43% 98.80% 93.97% 91.69% 91.54%
ResNet-20 BIM 99.97% 98.20% 86.74% 84.62% 85.10%
C&W 99.88% 98.28% 85.87% 83.93% 84.59%
Deepfool 99.80% 97.85% 88.02% 85.70% 86.23%
FGSM 99.23% 97.28% 90.74% 88.25% 88.24%
MobileNet BIM 99.95% 86.67% 84.25% 68.77% 68.37%
C&W 96.80% 84.49% 82.73% 67.80% 67.35%
Deepfool 79.36% 77.47% 77.13% 74.03% 72.32%
FGSM 97.79% 87.93% 86.49% 72.75% 71.84%
Table 3. The AUC-ROC scores of classification models with different metrics.
Model Metric Benign AEs from Attacks AEs from Testing tools
BIM C&W Deepfool FGSM DH-KMNC DH-NC TensorFuzz
LeNet-5 PCS 0.990 / 0.004 0.018 / 0.001 0.002 / 8.865e-06 0.186 / 0.114 0.012 / 1e-4 0.561/0.119 0.592/0.112 0.579/0.111
VRO 0.312 / 0.017 0.733 / 0.008 0.735 / 0.008 0.697 / 0.009 0.716 / 0.009 0.631/0.019 0.625/0.020 0.630/0.020
NIN PCS 0.953 / 0.023 0.007 / 0.0004 0.013 / 0.0003 0.027 / 0.007 0.083 / 0.005 0.571/0.113 0.608/0.110 0.384/ 0.093
VRO 0.054 / 0.016 0.777 / 0.025 0.777 / 0.027 0.720 / 0.026 0.679 / 0.029 0.380/0.065 0.352/0.063 0.615/0.039
ResNet-20 PCS 0.945 / 0.026 0.005 / 0.0002 0.016 / 0.0005 0.023 / 0.006 0.096 / 0.006 0.548/0.108 0.574/0.109 0.392/0.093
VRO 0.072 / 0.023 0.865 / 0.010 0.869 / 0.010 0.850 / 0.010 0.828 / 0.012 0.398/0.080 0.388/0.080 0.457/0.062
MobileNet PCS 0.788 / 0.082 0.002 / 1.598e-05 0.085 / 0.020 0.415 / 0.135 0.059 / 0.002 0.601/0.160 0.659/0.119 0.614/0.126
VRO 0.337 / 0.062 0.679 / 0.015 0.652 / 0.017 0.596 / 0.054 0.696 / 0.016 0.454/0.080 0.451/0.077 0.551/0.083
Table 4.

Results (mean / variance) of benign & adversarial data.

Based on the categorization method, we perform a study towards understanding the characteristics of BEs and AEs generated by adversarial attacks and testing tools. To compute the VRO, we follow the parameter configuration suggested in (Gal and Ghahramani, 2016) and set the number of predictions T (see Definition 6) as 50 for MNIST, and 100 for CIFAR-10 and ImageNet, respectively.

Table 4 summarizes the quantitative results for BEs and AEs, w.r.t. the two-dimensional metrics, on the four models. Note that the data in the columns Benign, AEs from Attacks and AEs from Testing tools come from the three sets of data presented in § 3.1.3. In each cell, the two values are the mean and variance of the corresponding PCS or VRO results, respectively.

Overall, benign data mostly have high PCS and low VRO. The mean PCS values for the four models are 0.990, 0.953, 0.945 and 0.788, respectively, while the mean VRO values are 0.312, 0.054, 0.072 and 0.337. The results are largely in line with our expectations, i.e., BEs are expected to be predicted by the model with high confidence and low uncertainty. For MobileNet, the PCS is relatively smaller and the VRO relatively larger than for the other models. The reason is that MobileNet handles a more complex task, i.e., image classification on a large-scale dataset (ImageNet), for which it is more difficult to train a high-quality DNN model. The MobileNet used here is among the state-of-the-art models for image classification on ImageNet, with a top-5 accuracy of 87.1%. This result indicates that BEs usually belong to the HL type.

For AEs generated by different attacks, the metric results are mostly the opposite of those of BEs, i.e., AEs generated by attacks usually come with low PCS and high VRO. Except for the AEs generated by Deepfool on ImageNet, all mean PCS values are rather low (almost all below 0.1) and the mean VRO values are relatively high (all larger than 0.652). This indicates that AEs generated by state-of-the-art adversarial attacks usually fall into the category of low confidence and high uncertainty, i.e., AEs usually belong to the LH type.

For the adversarial data generated by testing tools, we found that the obtained metrics fall between those of BEs and those of AEs generated by adversarial attacks. The mean PCS values are smaller than those of BEs but often larger than those of AEs from attacks; conversely, the mean VRO values are larger than those of BEs but smaller than those of AEs from attacks. For example, for BEs, TensorFuzz AEs and C&W AEs on ResNet-20, the PCS values are 0.945, 0.392 and 0.016, respectively, while the VRO values are 0.072, 0.457 and 0.869. Compared with BEs, we can still reach a similar conclusion that AEs from testing tools also belong to the LH type.

Even so, the results of AEs from adversarial attacks and from testing tools differ, which might be caused by their different test generation methods. Current adversarial attacks usually adopt gradient-based or optimization-based techniques to gradually decrease the predictive probability (or the logit value) of the true label until the decision changes; for example, when the probability of the true label drops to 0.49 and the probability of another label reaches 0.51, an adversarial example is found and the attack stops. DeepHunter and TensorFuzz instead adopt mutation-based techniques to generate new tests. Random mutation cannot guarantee that the predictive probability of the true label decreases gradually; for example, a single mutation may change the probability from 0.99 to 0.10, resulting in a higher PCS. In summary, although both adversarial attacks and testing tools can generate AEs, the generated AEs behave differently in terms of PCS and VRO. This further confirms the difference between testing and adversarial attack from the perspective of software engineering: testing can not only simulate real-world scenarios to uncover potential issues of a DL software for deployment, but also generate more diverse data to capture the behaviors of the DNN systematically in the applied context.

Comparing the results of BEs and AEs, we also observe a tentative inverse correlation between PCS and VRO: the PCS of BEs is often large while their VRO is small; the PCS of AEs from adversarial attacks is often small while their VRO is large; and for AEs generated by testing tools, the PCS tends to be larger and the VRO smaller. This is reasonable, since high prediction confidence usually reflects a relatively certain prediction.

Answer to RQ2: BEs and AEs usually belong to the HL type and the LH type, respectively. Compared with state-of-the-art adversarial attacks, testing tools generate, to some extent, different AEs.

4. Uncommon data generation

The results of RQ2 reveal that BEs and AEs usually belong to the HL type and the LH type, respectively. However, several questions remain, i.e., whether there exist: 1) data samples with high PCS and high VRO (the HH type), 2) data with low PCS and low VRO (the LL type), 3) BEs with low PCS and high VRO (LH BEs), and 4) AEs with high PCS and low VRO (HL AEs). Such samples have the potential to uncover unknown behaviors of a DNN, and are largely missed by existing methods. To answer these questions, we developed a tool, KuK, to generate such uncommon data. (We refer to these data as "uncommon" in the sense that they are rarely uncovered by existing techniques, not that they rarely exist; such data could occur widely in the real world.)

Since PCS and VRO on existing data usually follow an inverse correlation, generating such uncommon data is non-trivial. In fact, generating test data of specific types can be cast as a complex optimization problem. In this paper, we leverage the Genetic Algorithm (GA) (Miller et al., 1995) to provide a solution.

Fig. 2 b) shows the workflow of our algorithm. The inputs of KuK include a seed (i.e., an initial image; although we mainly focus on the image domain in this paper, the approach also generalizes to other domains), a model, the dropout-enabled model and a target type. The output is a set of data samples that satisfy the objective. We elaborate on each step as follows.

Population Initialization. Given an input image, we first generate a set of images by randomly adding noise to it. To generate high-quality images (i.e., images recognizable by humans), we abandon affine transformations (e.g., translation and rotation (Xie et al., 2019a)), as the crossover may otherwise generate invalid images. We use a norm-based constraint to bound the allowable changes between the original seed and its generated counterparts.

Objective and Fitness Calculation. In each iteration, we check whether some generated images (in the population) satisfy the objective, which is specifically designed for each type based on PCS and VRO. The test generation continues until some desirable data inputs are obtained. To satisfy the objective, we design a set of piecewise fitness functions to generate different types of uncommon data such that the higher the corresponding fitness value, the better the input.

We treat the population as a set of images and evaluate each image individually. Given an input image and the model, the objectives and the fitness functions for the four target types are defined as follows:


  • For the LL type, the objective requires that both the PCS and the VRO of an input fall below two configurable thresholds, and the fitness function is defined piecewise. If the minimum PCS of the population is still larger than the PCS threshold, the fitness rewards decreasing the PCS until some inputs fall below the threshold; due to the observed inverse correlation between PCS and VRO, the VRO tends to increase as the PCS decreases. In the other case, the fitness keeps the PCS below the threshold while rewarding a smaller VRO.

  • For the HH type, the objective requires that both the PCS and the VRO of an input exceed two configurable thresholds, and the fitness function is again piecewise. If the maximum PCS of the population is smaller than the threshold, the fitness rewards increasing the PCS until some PCSs exceed it. Afterwards, the fitness keeps the PCS above the threshold while attempting to increase the VRO.

  • To generate AEs that belong to the HL type, the objective requires that an input is an AE whose PCS is above the high-PCS threshold and whose VRO is below the low-VRO threshold. The fitness function includes an indicator term that is 1 if the input is an AE and 0 otherwise. Generating HL AEs is extremely challenging, since HL is the typical pattern of BEs. To address this, we design a three-step approach. If all images in the population are BEs (step 1), we aim to generate AEs by decreasing the PCS, as is commonly done by state-of-the-art attacks. Whenever some inputs become AEs and all PCSs fall below the threshold (step 2), we increase the PCS while keeping a high priority on remaining adversarial (i.e., an AE receives a higher fitness value through the indicator term); for example, if an AE turns back into a BE while achieving a high PCS, its fitness value decreases. In the last step, we keep both the AE status and a high PCS as the priority, and then decrease the VRO.

  • To generate BEs that belong to the LH type, the objective requires that an input is a BE whose PCS is below the low-PCS threshold and whose VRO is above the high-VRO threshold. The fitness function includes an indicator term that is 1 if the input is a BE and 0 otherwise. Similarly, generating LH BEs is challenging, as LH is the typical pattern of AEs. In the first step, if all PCSs are larger than the threshold, we decrease the PCS while keeping a high priority on remaining benign. In the second step, once some BEs with low PCSs exist, we increase the VRO while keeping a high priority on remaining benign with low PCS.
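The symbolic definitions of these fitness functions are elided above; as a rough illustration, the piecewise shape for the LL objective can be sketched as follows (the phase structure, bound handling, and function name are our assumptions, not the paper's exact formulation):

```python
def ll_fitness(pcs_value, vro_value, pcs_bound, population_min_pcs):
    """Illustrative piecewise fitness for the LL objective (low PCS,
    low VRO).  Higher fitness means a better candidate."""
    if population_min_pcs > pcs_bound:
        # Phase 1: no input is below the PCS bound yet -- reward low PCS.
        return -pcs_value
    # Phase 2: reward staying under the PCS bound (indicator term)
    # plus a smaller VRO.
    under_bound = 1.0 if pcs_value < pcs_bound else 0.0
    return under_bound + (1.0 - vro_value)
```

The indicator term keeps phase-2 candidates that satisfy the PCS constraint strictly above those that drift back over the bound, which mirrors the priority ordering described for the HL-AE and LH-BE cases.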

Crossover and Mutation. For the crossover, we adopt the tournament selection strategy to select two tournaments. From each tournament, we select the image with the largest fitness value. The two selected images are then crossed over by randomly exchanging corresponding pixels. After the crossover, each image is randomly mutated by adding white noise, to increase the diversity of the population. The test generation continues until the objective is satisfied or the given computation budget (e.g., a time limit) is exhausted.
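The selection, crossover, and mutation operators can be sketched as follows; the signatures, parameter values, and the L-infinity clipping are illustrative assumptions rather than KuK's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def tournament_select(population, fitness, k=5):
    """Pick the fittest individual among k randomly drawn candidates."""
    idx = rng.choice(len(population), size=k, replace=False)
    best = max(idx, key=lambda i: fitness[i])
    return population[best]

def crossover(parent_a, parent_b, rate=0.5):
    """Randomly exchange corresponding pixels between two parent images."""
    mask = rng.random(parent_a.shape) < rate
    return np.where(mask, parent_a, parent_b)

def mutate(image, seed, radius, mutation_rate=0.005, noise_scale=0.05):
    """Add white noise to a random subset of pixels, then clip the result
    back into the allowed ball around the original seed."""
    mask = rng.random(image.shape) < mutation_rate
    noisy = image + mask * rng.normal(0.0, noise_scale, image.shape)
    return np.clip(noisy, seed - radius, seed + radius)
```

Clipping after mutation keeps every offspring within the norm constraint established at population initialization, so crossover and mutation cannot drift the population away from the seed.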

5. Evaluation

We implemented the proposed test generation tool, KuK, in Python based on Keras (Chollet and others, 2015) (2.2.4) with TensorFlow (Abadi et al., 2016) (1.12.0) as the backend. In this section, we aim to evaluate the usefulness of KuK in generating uncommon data (RQ3) and the effectiveness of these data in bypassing defense techniques (RQ4). All experiments were run on a server with Ubuntu 16.04, a 28-core 2.0GHz Xeon CPU, 196 GB RAM and 4 NVIDIA Tesla V100 16G GPUs.

5.1. RQ3: Usefulness of Test Data Generation

Figure 3. The distribution of different types of data generated on NIN by KuK. The orange color represents AEs, while green color represents BEs. Star, circle, triangle and diamond represent the data of HH, LL, LH and HL, respectively.

Setting. We adopt KuK to generate different types of uncommon data on four widely used DL models: LeNet-5, NIN, ResNet-20 and MobileNet. In the genetic algorithm, the population size is set to 100, the crossover rate to 0.5 and the mutation rate to 0.005. For the mutation process, the noise is bounded by a fixed norm radius. For each dataset, we randomly select 200 BEs as the initial seeds and generate the four types of uncommon data from each seed. The maximum number of iterations of the genetic algorithm is set to 50.

Threshold Selection. To perform the categorization, we need to set an upper bound for low PCS/VRO and a lower bound for high PCS/VRO, i.e., the configurable threshold parameters in the objectives and fitness functions. In Table 4, the results of BEs (HL) and of AEs (LH) generated by adversarial attacks are the extreme cases, which guide the threshold selection.


  • High PCS. For the lower bound of high PCS, we set 0.7, since the mean PCSs of BEs in all four models are above 0.7 (the minimum, 0.788, occurs for MobileNet). Intuitively, if the PCS of a data sample is above 0.7, we regard it as a sample with high PCS.

  • Low VRO. For the upper bound of low VRO, we set 0.4, since the VROs of BEs in all four models are below 0.4 (e.g., 0.312 and 0.337 for LeNet-5 and MobileNet, respectively). If the VRO of a data sample is below 0.4, we categorize it as low VRO.

  • Low PCS. For the upper bound of low PCS, most of the PCSs of AEs in Table 4 are below 0.1, while the samples generated by Deepfool on MobileNet have a larger value of 0.415. As a compromise, we set the bound to 0.3: if the PCS of a data sample is below 0.3, we regard it as falling into the low-PCS category.

  • High VRO. For the lower bound of high VRO, we set 0.6, because almost all VROs of AEs in the four models are larger than 0.6 (the one exception is 0.596, for MobileNet). If the VRO is above 0.6, we regard it as high VRO.
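Putting the four thresholds together, the resulting categorization can be expressed as a small helper (the function name is ours; the bounds are those selected above):

```python
def categorize(pcs_value, vro_value):
    """Map a (PCS, VRO) pair to one of the four uncertainty patterns using
    the selected thresholds (high PCS > 0.7, low PCS < 0.3, high VRO > 0.6,
    low VRO < 0.4).  Values between a pair of bounds stay uncategorized."""
    if pcs_value > 0.7:
        p = "H"
    elif pcs_value < 0.3:
        p = "L"
    else:
        return None
    if vro_value > 0.6:
        v = "H"
    elif vro_value < 0.4:
        v = "L"
    else:
        return None
    return p + v
```

The deliberate gap between the low and high bounds (0.3 vs. 0.7 for PCS, 0.4 vs. 0.6 for VRO) keeps borderline samples out of every category, so each pattern captures clearly separated behavior.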

Type Objective (PCS/VRO): LL (<0.3 / <0.4), HH (>0.7 / >0.6), LH (<0.3 / >0.6), HL (>0.7 / <0.4)
| Model | | LL Ben | LL Adv | HH Ben | HH Adv | LH Ben | HL Adv |
| LeNet-5 | Total | 124 | 18 | 130 | 46 | 172 | 1 |
| | PCS/VRO | 0.067/0.324 | 0.030/0.343 | 0.947/0.720 | 0.903/0.803 | 0.082/0.609 | 0.99/0.339 |
| NIN | Total | 176 | 22 | 10 | 184 | 58 | 165 |
| | PCS/VRO | 0.065/0.095 | 0.065/0.204 | 0.960/0.846 | 0.922/0.910 | 0.073/0.666 | 0.981/0.245 |
| ResNet-20 | Total | 168 | 32 | 11 | 181 | 67 | 93 |
| | PCS/VRO | 0.065/0.105 | 0.064/0.224 | 0.960/0.851 | 0.921/0.910 | 0.058/0.655 | 0.978/0.253 |
| MobileNet | Total | 83 | 5 | 70 | 11 | 152 | 74 |
| | PCS/VRO | 0.234/0.348 | 0.192/0.348 | 0.845/0.714 | 0.784/0.785 | 0.092/0.719 | 0.915/0.328 |
Table 5. Results of different types of data generated by KuK.

Results. Fig. 3 depicts the distribution of the 200 generated data samples on the two-dimensional plane (due to the page limit, the results of the other models are put on our website (Website, 2019)). Note that for some seeds we failed to generate uncommon data satisfying the objective; for each such seed, we plot the best result from the population (e.g., for the HH type, the data sample with the maximum sum of PCS and VRO). The results show that KuK is able to generate inputs with diverse uncertainty patterns.

Table 5 shows the number of uncommon data inputs that satisfy the objective, together with the mean PCS and VRO values. Row Type Objective shows the objective setting for each uncommon uncertainty pattern in terms of the upper and lower bounds of PCS and VRO. For each model, rows Total and PCS/VRO give the total number of uncommon data generated for each type and the mean PCS and VRO values, respectively. For the LL and HH types, the results of generated BEs and AEs are shown separately (columns Ben and Adv).

The results demonstrate that KuK is effective in generating LL and HH data inputs, which are rarely covered by existing methods. For example, for the NIN model, KuK generated 198 (99%) LL data in total, of which 176 are BEs and 22 are AEs. For ResNet-20, KuK generated LL data for all of the 200 seeds. The quantitative results for the LL data show that LL data tend to be BEs (the number of LL BEs is much larger than the number of LL AEs). Considering that natural BEs usually belong to HL (see Table 4), this indicates that a low VRO is the better indicator of the characteristics of BEs. For the HH data, there is no such obvious trend: for LeNet-5 and MobileNet, the number of HH BEs is larger than the number of HH AEs, while the opposite holds for NIN and ResNet-20.

For LH BEs and HL AEs, generation is more challenging, since these patterns are completely opposite to the characteristics of the common data. The results show that KuK is still useful for generating such uncommon data: we generated 172 (86%), 58 (29%), 67 (33.5%) and 152 (76%) LH BEs for LeNet-5, NIN, ResNet-20 and MobileNet, respectively. For HL AEs, KuK generated only one for LeNet-5, and 165 (82.5%), 93 (46.5%) and 74 (37%) for the other models, respectively. These results indicate that generating LH BEs and HL AEs is more difficult, but KuK can still generate them for a portion of the seeds.

The PCS and VRO values in Table 5 are consistent with those in Table 4. For example, although we set the lower bound of high PCS to 0.7 in the fitness functions and objectives, KuK still generated very high PCSs for the HH and HL data: the mean PCS is larger than 0.9 for LeNet-5, NIN and ResNet-20, consistent with the high PCSs of BEs in Table 4. Even though we set the upper bound of low VRO to 0.4, KuK still generated data with very low VRO, e.g., for NIN and ResNet-20, the VROs of the LL data are 0.095 and 0.105, respectively.

Comparing the results among the four models, we find that the difficulty of generating uncommon data varies across models. For example, KuK is effective in generating LL and HH data for LeNet-5, NIN and ResNet-20, but generated only 88 (44%) LL and 81 (40.5%) HH data for MobileNet. For LeNet-5 and MobileNet, it is easier to generate LH BEs than HL AEs, while the opposite holds for NIN and ResNet-20. These differences show that the uncommon data generated by KuK can be used to characterize the different behaviors of the models.

Answer to RQ3: KuK is useful for generating different types of uncommon data. The LH BEs and HL AEs are often more difficult to generate.

5.2. RQ4: Evaluation on Defense Techniques

To demonstrate the usefulness of the uncommon data generated in Table 5, this experiment studies whether these data can bypass existing defense techniques.

Setting. Since different defense techniques are proposed for different subject datasets, we selected popular techniques based on the datasets. For the MNIST and CIFAR-10 datasets, we selected the following defense techniques: the binary activation classifier (Gong et al., 2017), mutation-based adversarial attack detection (Wang et al., 2018), defensive distillation (Papernot et al., 2016), label smoothing (Hazan et al., 2016), and feature squeezing (Xu et al., 2017). For ImageNet, we selected mutation-based adversarial attack detection (Wang et al., 2018), input transformations (Guo et al., 2017) and pixel deflection (Prakash et al., 2018). Due to the space limit, we put the configuration details and an introduction of each defense technique on our website (Website, 2019).

To validate the performance of the defense techniques, we selected 1) the common data, including 9,000 BEs and 9,000 AEs generated by the existing adversarial attacks (§ 3.1.3), and 2) the uncommon data from Table 5. We use the success rate to evaluate the capability of a defense technique, i.e., the number of BEs/AEs that are correctly identified divided by the total number.

Results. Tables 6 and 7 show the performance of the defense techniques on the four models. Column/Row Comm reports the success rate on the existing common data. The results show that the defense techniques are very effective in identifying BEs and AEs among the common data, especially on the smaller models: except for feature squeezing on ResNet-20 (88.1%), the success rates of the defense techniques are above 90% on NIN, ResNet-20 and LeNet-5, and above 97% on LeNet-5 in particular. For the larger model MobileNet, the success rate is relatively low, as defense is more difficult for a complex model.

Column/Row UnCo reports the success rate on the uncommon data generated by KuK. Overall, the existing defense techniques perform poorly on the uncommon data. The success rates of the binary classifier and mutation-based detection drop sharply, e.g., on NIN, to 25.5% and 5.5%, respectively. The reason is that these two techniques mainly depend on PCS- and VRO-like characteristics for detection: the binary classifier is trained on the logit-layer values, which are closely related to the prediction confidence, and the label change ratio in mutation-based detection is similar to VRO. Since the uncommon data differ substantially from the common data w.r.t. these two metrics, a technique that performs well on common data can have a low success rate on the uncommon data.
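The label change ratio used by mutation-based detection (Wang et al., 2018) can be sketched as follows; the callable interfaces are our assumptions, not that work's actual API:

```python
def label_change_ratio(predict, mutate, x, n_mutants=50):
    """Sketch of mutation-based detection: the fraction of randomly
    mutated copies of x whose predicted label differs from the original
    prediction.  A high ratio flags x as likely adversarial, which is
    why the metric behaves much like VRO."""
    original = predict(x)
    changed = sum(predict(mutate(x)) != original for _ in range(n_mutants))
    return changed / n_mutants
```

An HL AE, by construction, keeps a stable label under such perturbations (low VRO), so its ratio stays low and the detector treats it as benign.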

For the other defense techniques, the reduction in success rate is smaller than for the binary classifier and mutation-based detection. For example, the success rates drop to 78.6% and 76.2% for defensive distillation and label smoothing on NIN. The reason is that these techniques retrain a new model, while the attacks are generated against the original one, making this a more challenging transfer-attack scenario. For example, defensive distillation retrains a more robust model by reducing the gradients; in this case, some data that are uncommon for the original model become common w.r.t. the retrained model because of the weight variation. Nevertheless, the results show that the uncommon data still exhibit stronger transferability.

We also found that the uncommon data on LeNet-5 are not effective against defensive distillation and label smoothing. Our investigation suggests two reasons: 1) most of the data become common in the new model, which confirms the usefulness of uncommon data in characterizing the different behaviors of multiple models; and 2) most of the uncommon data generated on LeNet-5 are BEs (see Table 5); the success rates are reduced to 90.8% and 79.9% if we only use the uncommon AEs.

| | NIN | | ResNet-20 | | LeNet-5 | |
| | Comm | UnCo | Comm | UnCo | Comm | UnCo |
| #Data | 18,000 | 615 | 18,000 | 552 | 18,000 | 491 |
| binary classifier | 0.944 | 0.255 | 0.958 | 0.159 | 0.986 | 0.303 |
| mutation-based | 0.975 | 0.055 | 0.985 | 0.101 | 0.97 | 0.390 |
| distillation | 0.93 | 0.786 | 0.913 | 0.773 | 0.985 | 0.963 |
| label smoothing | 0.936 | 0.762 | 0.921 | 0.748 | 0.981 | 0.934 |
| feature squeezing | 0.905 | 0.663 | 0.881 | 0.637 | 0.973 | 0.327 |
Table 6. Success rate of the defense techniques on the generated data for NIN, ResNet-20 and LeNet-5.

Answer to RQ4: The uncommon data inputs are not well defended by existing defense techniques, while the common data are relatively easy to defend. In particular, the binary classifier and mutation-based detection approaches are much less useful on the uncommon data (e.g., mutation-based detection achieves success rates of only 5.5% and 10.1% on NIN and ResNet-20, respectively).

6. Threat to Validity

The selection of the subject datasets and DNN models could be a threat to validity. We counter this by using three publicly available datasets of diverse scales, and popular pre-trained DNN models that achieve competitive performance.

The selection of the thresholds for the categorization may affect the results of Table 5. We carefully selected the thresholds based on the results of Table 4. Furthermore, the results of Table 5 are largely consistent with those of Table 4, indicating that the threshold selection does not substantially affect the conclusions drawn from Table 5.

A further threat lies in the randomness involved in computing the VRO (i.e., the number of dropout-enabled predictions in Definition 6). Previous work (Carlini and Wagner, 2017) found that the result is not sensitive to this number as long as it is greater than 20 on CIFAR-10 and MNIST. We follow the configuration in the existing paper (Gal and Ghahramani, 2016), i.e., 50, 100 and 100 for MNIST, CIFAR-10 and ImageNet, respectively, and verified that these values are sufficient for computing a stable VRO.

7. Related work

In this section, we summarize the most relevant work to ours.

Attack and defense. Ever since deep learning models were demonstrated to be vulnerable to even small perturbations of the input data (Szegedy et al., 2013), a sequence of attack techniques has been developed along this line. To date, multiple types of attacks have been proposed, including FGSM (Goodfellow et al., 2015), JSMA (Papernot et al., 2016), BIM (Kurakin et al., 2016a), DeepFool (Moosavi-Dezfooli et al., 2016) and C&W (Carlini and Wagner, 2017); a parallel line of research focuses on improving the robustness of deep learning models. Goodfellow et al. (2015) presented a method of introducing nonlinear model families into the training process. Defensive distillation was introduced to reduce the effectiveness of adversarial samples (Papernot et al., 2016), but was then broken by the C&W attack (Carlini and Wagner, 2017). Meanwhile, a set of recent defense techniques was surveyed and shown to be defeated by constructing new loss functions (Carlini and Wagner, 2017). A more recent work (Madry et al., 2018) exploited the framework of robust optimization for adversarial training of networks to resist a wide range of attacks. Besides datasets like MNIST and CIFAR-10, defense techniques (Guo et al., 2017; Prakash et al., 2018) have also been proposed to handle real-world large-scale datasets like ImageNet. However, a study of the characteristics of BEs and AEs generated by these methods is still lacking, which we attempt to provide from the uncertainty perspective.

Uncertainty measures. In general, a deep learning model is trained on a dataset and results in a set of fixed parameters, which sets up a deterministic function mapping an input to a probability distribution. The Bayesian approach, however, does not view deep learning models as deterministic functions; instead, it treats the parameters as random variables (MacKay David, 1992). As a representative work addressing the scalability of obtaining uncertainty measures, (Gal and Ghahramani, 2016) proposed a dropout-based solution, which allows uncertainty estimates to be computed for existing deep learning models with a good trade-off between uncertainty quality and computational complexity. Existing applications of uncertainty measures mainly focus on the adversary detection problem. For example, (Feinman et al., 2017) used both density estimates and Bayesian uncertainty estimates to learn a regression model for adversarial example detection. Further, an empirical study of two uncertainty measures, predictive entropy and mutual information, examined their effectiveness for detection (Smith and Gal, 2018). In our work, by contrast, we perform a comparative study of both single-shot and multi-shot execution uncertainty estimates to uncover the uncertainty patterns that existing AEs/BEs follow.

| | #Data | mutation-based | input trans | deflection |
| Comm | 18,000 | 0.865 | 0.702 | 0.782 |
| UnCo | 395 | 0.549 | 0.660 | 0.688 |
Table 7. Success rate of the defense techniques on the generated data for MobileNet.

Testing and debugging. Researchers have attempted to leverage decades of advances in the software engineering community to seek solutions towards more secure and robust DL systems, and have produced a set of fruitful results. These approaches share a different spirit from those in the DL community, and early studies have evidenced them to be unique and promising. Testing criteria emerged as an early research focus: a series of measurements has been adapted to evaluate the quality of DL test datasets, including neuron coverage (Pei et al., 2017), multi-granularity coverage criteria (Ma et al., 2018a), MC/DC test criteria (Sun et al., 2018a), combinatorial testing criteria (Ma et al., 2019), surprise adequacy (Kim et al., 2019) and uncertainty-based metrics (Ma et al., 2019). Classic testing methodologies have also been incorporated into DNN testing, including differential testing (Pei et al., 2017), coverage-guided testing (Xie et al., 2019a; Odena et al., 2019; Tian et al., 2018), mutation testing (Ma et al., ) and concolic testing (Sun et al., 2018b). Advanced test generation methods (Zhang et al., 2018a; Du et al., 2019; Xie et al., 2019b) have also been proposed to achieve better testing for different applications. As with samples generated by adversarial attacks, a study of the relationship between samples generated by testing techniques and uncertainty is still lacking. In addition, the results of this paper confirm the difference between testing and adversarial attacks in obtaining samples with different uncertainty behaviors.

Recent efforts have also been made to debug DL models (Ma et al., 2018b), and to study DL program bugs (Zhang et al., 2018b), library bugs (Pham et al., 2019) and DL software bugs across different frameworks and platforms (Guo et al., 2019). The results of this paper provide a new angle for characterizing DL model defects, which could be useful for quality assurance activities beyond testing.

8. Conclusion and Future Work

This paper performed an empirical study to characterize data inputs from the perspective of uncertainty. We first presented an empirical study on the capability of uncertainty metrics in differentiating AEs and BEs. Then, we performed a systematic study of the characteristics of BEs and AEs generated by existing attack/testing methods in terms of uncertainty metrics. The results reveal that existing BEs and AEs largely fall into two uncertainty patterns in terms of PCS and VRO. Based on these findings, we proposed a GA-based automated test generation technique to generate data with more diverse uncertainty patterns, especially uncommon samples. The results demonstrated the usefulness of the generated data in bypassing defense techniques. In the future, we plan to perform more in-depth investigations into applying the uncommon data towards robustness enhancement. We believe further understanding of these uncommon data is crucial for building reliable and trustworthy DL solutions.

We thank the anonymous reviewers for their comprehensive feedback. This research was supported (in part) by the National Research Foundation, Prime Minister's Office, Singapore under its National Cybersecurity R&D Program (Award No. NRF2018NCR-NCR005-0001), the National Satellite of Excellence in Trustworthy Software Systems (Award No. NRF2018NCR-NSOE003-0001); JSPS KAKENHI Grants No. 19K24348, 19H04086, 18H04097, Qdai-jump Research Program No. 01277; the National Natural Science Foundation of China under grants No. 61772038 and 61532019; and the Guangdong Science and Technology Department (Grant No. 2018B010107004). We also gratefully acknowledge the support of the NVIDIA AI Tech Center (NVAITC) to our research.


  • M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: §5.
  • B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli (2013) Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases, pp. 387–402. Cited by: §1.
  • N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pp. 39–57. Cited by: 2nd item, §1, §1, §3.1.2, §7.
  • N. Carlini and D. Wagner (2017) Adversarial examples are not easily detected: bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, AISec '17, pp. 3–14. External Links: ISBN 978-1-4503-5202-4 Cited by: §6, §7.
  • F. Chollet et al. (2015) Keras. GitHub. Cited by: §5.
  • H. H. Dohaiha, P. Prasad, A. Maag, and A. Alsadoon (2018) Deep learning for aspect-based sentiment analysis: a comparative review. Expert Systems With Applications. Cited by: §1.
  • X. Du, X. Xie, Y. Li, L. Ma, Y. Liu, and J. Zhao (2019) Deepstellar: model-based quantitative analysis of stateful deep learning systems. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 477–487. Cited by: §1, §7.
  • K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. Song (2018) Robust physical-world attacks on deep learning visual classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1625–1634. Cited by: §1.
  • T. Fawcett (2006) An introduction to ROC analysis. Pattern Recognition Letters 27 (8), pp. 861–874. Cited by: §3.2.
  • R. Feinman, R. R. Curtin, S. Shintre, and A. B. Gardner (2017) Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410. Cited by: §7.
  • Y. Gal and Z. Ghahramani (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050–1059. Cited by: §2.2, §3.4, §6, §7.
  • Y. Gal (2016) Uncertainty in deep learning. Ph.D. Thesis, University of Cambridge. Cited by: §2.2, §2.2, §3.3.
  • Z. Gong, W. Wang, and W. Ku (2017) Adversarial and clean data are not twins. arXiv preprint arXiv:1704.04960. Cited by: 4th item, §1, §5.2.
  • I. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In International Conference on Learning Representations, External Links: Link Cited by: 2nd item, §1, §1, §3.1.2, §7.
  • C. Guo, M. Rana, M. Cisse, and L. van der Maaten (2017) Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117. Cited by: 4th item, §1, §5.2, §7.
  • Q. Guo, S. Chen, X. Xie, L. Ma, Q. Hu, H. Liu, Y. Liu, J. Zhao, and X. Li (2019) An empirical study towards characterizing deep learning development and deployment across different frameworks and platforms. In 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), pp. 810–822. Cited by: §7.
  • T. Hazan, G. Papandreou, and D. Tarlow (2016) Perturbations, optimization, and statistics. MIT Press. Cited by: 4th item, §1, §5.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §3.1.1.
  • G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Processing Magazine 29 (6), pp. 82–97. External Links: ISSN 1053-5888 Cited by: §1.
  • A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §3.1.1, Table 1.
  • G. Katz, C. W. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer (2017) Reluplex: an efficient SMT solver for verifying deep neural networks. In Computer Aided Verification - 29th International Conference, CAV 2017, Heidelberg, Germany, July 24-28, 2017, Proceedings, Part I, pp. 97–117. Cited by: §1.
  • J. Kim, R. Feldt, and S. Yoo (2019) Guiding deep learning system testing using surprise adequacy. In Proceedings of the 41st International Conference on Software Engineering, ICSE ’19, pp. 1039–1049. Cited by: §7.
  • A. Krizhevsky, V. Nair, and G. Hinton. CIFAR-10 (Canadian Institute for Advanced Research). External Links: Link Cited by: §3.1.1.
  • A. Kurakin, I. Goodfellow, and S. Bengio (2016a) Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533. Cited by: §1, §7.
  • A. Kurakin, I. Goodfellow, and S. Bengio (2016b) Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236. Cited by: 2nd item, §1, §3.1.2.
  • Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §3.1.1, §3.1.1.
  • M. Lin, Q. Chen, and S. Yan (2013) Network in network. arXiv preprint arXiv:1312.4400. Cited by: §3.1.1.
  • L. Ma, F. Juefei-Xu, M. Xue, B. Li, L. Li, Y. Liu, and J. Zhao (2019) DeepCT: tomographic combinatorial testing for deep learning systems. In 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER), Vol. , pp. 614–618. External Links: ISSN 1534-5351 Cited by: §7.
  • L. Ma, F. Juefei-Xu, F. Zhang, J. Sun, M. Xue, B. Li, C. Chen, T. Su, L. Li, Y. Liu, J. Zhao, and Y. Wang (2018a) DeepGauge: multi-granularity testing criteria for deep learning systems. In Proc. of the 33rd ACM/IEEE Intl. Conf. on Automated Software Engineering, ASE 2018, pp. 120–131. Cited by: §7.
  • L. Ma, F. Zhang, J. Sun, M. Xue, B. Li, F. Juefei-Xu, C. Xie, L. Li, Y. Liu, J. Zhao, and Y. Wang (2018) DeepMutation: mutation testing of deep learning systems. In 29th IEEE International Symposium on Software Reliability Engineering (ISSRE), Memphis, USA, Oct. 15-18, 2018, pp. 100–111. Cited by: §7.
  • S. Ma, Y. Liu, W. Lee, X. Zhang, and A. Grama (2018b) MODE: automated neural network model debugging via state differential analysis and input selection. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 175–186. Cited by: §7.
  • W. Ma, M. Papadakis, A. Tsakmalis, M. Cordy, and Y. L. Traon (2019) Test selection for deep learning systems. arXiv preprint arXiv:1904.13195. Cited by: §7.
  • D. J. C. MacKay (1992) A practical Bayesian framework for backpropagation networks. Neural Computation. Cited by: §7.
  • A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, Cited by: §7.
  • B. L. Miller, D. E. Goldberg, et al. (1995) Genetic algorithms, tournament selection, and the effects of noise. Complex systems 9 (3), pp. 193–212. Cited by: §4.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533. Cited by: §1.
  • S. Moosavi-Dezfooli, A. Fawzi, and P. Frossard (2016) DeepFool: a simple and accurate method to fool deep neural networks. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2574–2582. Cited by: 2nd item, §1, §1, §3.1.2, §7.
  • A. Odena, C. Olsson, D. Andersen, and I. Goodfellow (2019) TensorFuzz: debugging neural networks with coverage-guided fuzzing. In International Conference on Machine Learning, pp. 4901–4911. Cited by: 2nd item, §1, §3.1.2, §7.
  • N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami (2016) Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pp. 582–597. External Links: ISSN 2375-1207 Cited by: 4th item, §1, §5.2, §7.
  • N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami (2016) The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 372–387. Cited by: §7.
  • K. Pei, Y. Cao, J. Yang, and S. Jana (2017) DeepXplore: automated whitebox testing of deep learning systems. In SOSP, pp. 1–18. Cited by: §1, §3.1.2, §7.
  • H. V. Pham, T. Lutellier, W. Qi, and L. Tan (2019) CRADLE: cross-backend validation to detect and localize bugs in deep learning libraries. In Proceedings of the 41st International Conference on Software Engineering, ICSE ’19, Piscataway, NJ, USA, pp. 1027–1038. External Links: Link, Document Cited by: §7.
  • A. Prakash, N. Moran, S. Garber, A. DiLillo, and J. Storer (2018) Deflecting adversarial attacks with pixel deflection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8571–8580. Cited by: 4th item, §1, §5.2, §7.
  • J. Rauber, W. Brendel, and M. Bethge (2017) Foolbox: a Python toolbox to benchmark the robustness of machine learning models. arXiv preprint arXiv:1707.04131. Cited by: §3.1.2.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. External Links: Document Cited by: §1, §3.1.1.
  • L. Smith and Y. Gal (2018) Understanding measures of uncertainty for adversarial example detection. arXiv preprint arXiv:1803.08533. Cited by: §7.
  • Y. Sun, X. Huang, and D. Kroening (2018a) Testing deep neural networks. arXiv preprint arXiv:1803.04792. Cited by: §7.
  • Y. Sun, M. Wu, W. Ruan, X. Huang, M. Kwiatkowska, and D. Kroening (2018b) Concolic testing for deep neural networks. External Links: Document, arXiv:1805.00089 Cited by: §7.
  • C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §7.
  • G. Tao, S. Ma, Y. Liu, and X. Zhang (2018) Attacks meet interpretability: attribute-steered detection of adversarial samples. In Advances in Neural Information Processing Systems 31, pp. 7728–7739. Cited by: §1.
  • The BBC (2016) AI image recognition fooled by single pixel change. External Links: Link Cited by: §1.
  • Y. Tian, K. Pei, S. Jana, and B. Ray (2018) DeepTest: automated testing of deep-neural-network-driven autonomous cars. In ICSE, pp. 303–314. Cited by: §1, §1, §7.
  • J. Wang, G. Dong, J. Sun, X. Wang, and P. Zhang (2018) Adversarial sample detection for deep neural network through model mutation testing. arXiv preprint arXiv:1812.05793. Cited by: 4th item, §1, §5.2.
  • M. U. P. Website (2019) External Links: Link Cited by: §5.1, §5.2.
  • X. Xie, L. Ma, F. Juefei-Xu, M. Xue, H. Chen, Y. Liu, J. Zhao, B. Li, J. Yin, and S. See (2019a) DeepHunter: a coverage-guided fuzz testing framework for deep neural networks. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2019, New York, NY, USA, pp. 146–157. External Links: ISBN 978-1-4503-6224-5, Link, Document Cited by: 2nd item, §1, §1, §3.1.2, §4, §7.
  • X. Xie, L. Ma, H. Wang, Y. Li, Y. Liu, and X. Li (2019b) DiffChaser: detecting disagreements for deep neural networks. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 5772–5778. Cited by: §7.
  • W. Xu, D. Evans, and Y. Qi (2017) Feature squeezing: detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155. Cited by: 4th item, §1, §5.2.
  • M. Zhang, Y. Zhang, L. Zhang, C. Liu, and S. Khurshid (2018a) DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE 2018, New York, NY, USA, pp. 132–142. External Links: ISBN 978-1-4503-5937-5, Link, Document Cited by: §7.
  • S. Zhang, L. Yao, A. Sun, and Y. Tay (2019) Deep learning based recommender system: a survey and new perspectives. ACM Comput. Surv. 52 (1), pp. 5:1–5:38. External Links: ISSN 0360-0300 Cited by: §1.
  • Y. Zhang, Y. Chen, S. Cheung, Y. Xiong, and L. Zhang (2018b) An empirical study on TensorFlow program bugs. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2018, pp. 129–140. Cited by: §7.