Input Prioritization for Testing Neural Networks

01/11/2019 ∙ by Taejoon Byun, et al. ∙ University of Minnesota ∙ Collins Aerospace

Deep neural networks (DNNs) are increasingly being adopted for sensing and control functions in a variety of safety- and mission-critical systems such as self-driving cars, autonomous air vehicles, medical diagnostics, and industrial robotics. Failures of such systems can lead to loss of life or property, which necessitates stringent verification and validation for providing high assurance. Though formal verification approaches are being investigated, testing remains the primary technique for assessing the dependability of such systems. Due to the nature of the tasks handled by DNNs, the cost of obtaining test oracle data (the expected output, a.k.a. label, for a given input) is high, which significantly impacts the amount and quality of testing that can be performed. Thus, prioritizing input data for testing DNNs in meaningful ways to reduce the cost of labeling can go a long way in increasing testing efficacy. This paper proposes using gauges of the DNN's sentiment derived from the computation performed by the model as a means to identify inputs that are likely to reveal weaknesses. We empirically assessed the efficacy of three such sentiment measures for prioritization (confidence, uncertainty, and surprise) and compared their effectiveness in terms of fault-revealing capability and retraining effectiveness. The results indicate that sentiment measures can effectively flag inputs that expose unacceptable DNN behavior. For MNIST models, the average percentage of inputs correctly flagged ranged from 88

I Introduction

Deep neural networks (DNN) are beginning to gain adoption in safety and mission-critical applications as a means for realizing higher levels of autonomous operation. However, this brings with it considerable risk because failures of such systems can be damaging to life, property or environment. Recent accidents caused by autonomous vehicles or self-driving features of automobiles [1] highlight an urgent need for rigorous yet scalable methods of assuring that DNNs will function in an acceptable fashion in all circumstances. As with traditional safety-critical software, one may be able to formally model and algorithmically verify that essential properties hold for a DNN. However, formal verification methods are often hampered by the difficulty of scaling up to larger models. Thus, as is typical for all software, testing remains the most practical and readily applicable approach for verifying DNNs.

It is standard practice in the DNN development process to use testing to evaluate the learned model. This is typically done in two phases. The first phase is model validation with a validation set; this process is tightly integrated with the training process to tune the hyper-parameters and select the most appropriate model among the candidates. The second is the testing phase, executed with a separate test set, which is independent of the training and validation sets, to evaluate the trained model at the final stage of development. The development process is deemed complete only when desired properties such as adversarial robustness or generalizability to unseen data are satisfied on the test set.

The second phase of testing, however, can be very expensive due to the large input space that testing has to cover, and due to the cost of labeling those inputs, which is needed to determine the correctness of the outputs produced by a DNN over this input space. The cost of labeling is typically much higher than that of collecting test inputs because, for many of the tasks that DNNs are designed to handle, the input data is abundant and easy to collect, but the oracle typically cannot be automated; if it could be, there would be no need for the DNN in the first place. Thus, it is prudent to find ways to minimize the labeling effort for new test inputs.

One way to achieve this is to prioritize those inputs that are likely to reveal the weaknesses of a trained model, so that labeling effort can be focused only on the prioritized inputs. We hypothesize that this priority can be determined by deriving some additional information about the computation performed by the DNN (its sentiment) when processing the inputs. Higher-priority inputs are those for which the DNN expresses a stronger relevant sentiment. In particular, we study three sentiments: confidence, defined as the predicted probability associated with the output label in DNNs that use softmax output layers; surprise, defined as the distance of the neuron activation pattern on an input from the activation patterns on the training data; and uncertainty, defined for Bayesian neural networks based on the probability distribution of the DNN's prediction.

These metrics are useful for prioritizing inputs that help us to more efficiently: (a) address model weakness with reduced labeling cost, (b) assess model accuracy with a reduced test suite, and (c) retrain more effectively with fewer, prioritized inputs. Furthermore, prioritization may also be useful in the context of run-time monitoring of DNN components, as a means to determine when to trigger system safety or back-up mechanisms to mitigate potentially erroneous DNN outputs.

We empirically assess how these metrics perform as indicators of a test input's value using examples of DNNs for image classification and image regression. Our initial results show that the measures can prioritize inputs that lead to erroneous DNN outputs, with average percentage of faults detected (APFD) scores of 74.9% to 94.8%, indicating that sentiment-based metrics provide a meaningful basis for prioritization.

II Related Work

Testing of DNNs is attracting considerable interest from the research community. For an overview, see [2]. Broadly speaking, many approaches to testing DNNs have focused on ways to generate test data that expose a weakness such as lack of robustness [3, 4, 5, 6]—for example, generating an adversarial input that minimally perturbs a known input in a way such that the expected output does not change, but leads the DNN to change its output. Generation techniques range from introduction of adversarial perturbations [7, 3, 6] to domain-relevant transformations [8, 9]. Ideas from software testing such as statement, branch [3], condition and MC/DC coverage [5] have been suitably modified and adopted to define various forms of neuron coverage [10, 11] and show how those metrics can be used to guide test generation [6, 12, 10]. In these approaches, the test oracle data (expected output) is determined based, in general, on a known metamorphic relation between the reference input and the synthesized input.

While synthesizing test data is indeed quite useful and effective in identifying faults, it is, in general, not clear how to systematically identify all these faults or determine the likelihood of the exposed faulty behavior manifesting when the system is fielded. Recent work in the area of adversarial input generation would seem to suggest that DNNs provide a target-rich environment for attacks.

In the present work, we instead look for ways to rank data in some form based on their utility for learning the DNN’s weaknesses, without regard to how this data is obtained. This is particularly useful if we have a rich input space for which determining the correct output is tedious. If we aim for generalizability of the DNN—i.e. performance metrics obtained during training and validation being truly indicative of the DNN’s performance when fielded—we need a way to determine for which inputs generalizability may be adversely impacted without knowing what the expected output should be.

Work in the area of active learning has studied techniques for minimizing the cost of labeling in an interactive learning scheme [13, 14], which we leverage in this paper for test input prioritization. Especially in a pool-based sampling [15] scenario, training data is sampled from a large pool of unlabeled data such that the accuracy of the model improves most effectively with the least amount of data. Unlike active learning, where the goal is to find training data that provides the most information to the model under development, this work evaluates the prioritization techniques in the context of testing, where the primary goal is to find fault-revealing inputs efficiently.

III White-box Test Input Prioritization

The key idea of test input prioritization is to capture information available from a DNN that represents sentiments such as confidence, uncertainty, or surprise on an input presented to the DNN. The relative value of each test input can be judged based on the model's sentiment, and higher priority can be assigned to uncertain or surprising inputs, since those are more likely to reveal erroneous behaviors in the model. Although sentiments such as uncertainty are usually not provided by a typical DNN unless explicitly modeled, multiple techniques exist that capture these sentiments by inspecting the internal computation of the neural network. This section introduces three such techniques.
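To make the workflow concrete, the prioritization step itself amounts to scoring each unlabeled input with one of these measures and sorting the inputs by score. The sketch below is illustrative only; the names prioritize and score_fn are ours and not part of the tool described later.

```python
import numpy as np

def prioritize(inputs, score_fn):
    """Rank test inputs by a sentiment score; higher score = higher priority.

    `inputs` is a NumPy array of test inputs and `score_fn` maps the batch
    to one score per input (e.g., softmax entropy, dropout uncertainty,
    or surprise).
    """
    scores = np.asarray(score_fn(inputs))   # shape: (num_inputs,)
    order = np.argsort(scores)[::-1]        # descending: most suspicious first
    return inputs[order], scores[order]
```

Labeling effort can then be spent on a prefix of the returned ordering.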

III-A Softmax Output as Confidence Prediction

Softmax is a logistic function that squashes a $K$-dimensional vector $\mathbf{z}$ of real values to a $K$-dimensional vector $\sigma(\mathbf{z})$ of real values where each entry is in the range $(0, 1)$ and the entries add up to $1$: $\sigma(\mathbf{z})_i = e^{z_i} / \sum_{j=1}^{K} e^{z_j}$, for $i = 1, \dots, K$. It is typically used as the last layer of a neural network for a multi-class classification task so that the output can represent the categorical probability distribution of the classes. When available, the priority score can be computed directly from the softmax output while incurring a minimal computational overhead. As an instantiation of the scoring function, we borrow the notion of entropy to summarize the distribution and assign a single score to an unseen test input $x$:

    score(x) = -\sum_{i=1}^{C} p_i \log p_i    (1)

where $C$ is the number of output classes and $p_i$ is the softmax output for class $i$. Intuitively speaking, the score is lower for a certain classification where only one $p_i$ is high, and higher for an uncertain classification where the predicted distribution is spread out, thus assigning high scores to inputs that are more uncertain.
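As a concrete illustration, Equation 1 can be computed in a few lines of NumPy; the function name and the clipping constant below are ours, added only to avoid taking the logarithm of zero.

```python
import numpy as np

def softmax_entropy_score(probs, eps=1e-12):
    """Priority score of Equation 1: the entropy of the softmax output.

    `probs` is the predicted class distribution p_1..p_C for one input.
    The score is low for a confident (peaked) prediction and high for an
    uncertain (spread-out) one.
    """
    probs = np.clip(probs, eps, 1.0)
    return float(-np.sum(probs * np.log(probs)))

# A confident prediction scores lower than an uncertain one:
softmax_entropy_score(np.array([0.97, 0.01, 0.01, 0.01]))  # ~0.17
softmax_entropy_score(np.array([0.40, 0.30, 0.20, 0.10]))  # ~1.28
```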

An obvious limitation of softmax-based prioritization is that it can be applied only to classification models where a softmax layer is used. However, a more fundamental limitation is that the predicted probability does not reflect the model's confidence, as demonstrated by Gal and Ghahramani [16] and also shown in the case of adversarial input attacks [17, 18, 19]. For instance, an adversarially perturbed input that looks just like an ostrich to human eyes can be classified as a panda with 99% confidence. These limitations call for prioritization techniques that are more reliable and also apply to regression models.

III-B Bayesian Uncertainty

Model uncertainty is the degree to which a model is uncertain about its prediction for a given input. An uncertain prediction can be due to a lack of training data, known as epistemic uncertainty, or due to the inherent noise in the input data, known as aleatoric uncertainty [20]; we do not distinguish the two in this paper because they cannot be told apart unless a neural network is explicitly modeled to predict them as outputs [20]. As it is practically impossible for a machine-learning model to achieve 100% accuracy, model uncertainty is immensely useful for engineering a more robust learning-enabled component. In order to obtain the model's uncertainty along with the prediction, we need mathematically grounded techniques based on Bayesian probability theory. We briefly introduce Bayesian neural networks and a technique to approximate them using existing neural networks. The uncertainty estimated by these techniques can then be used as a score to prioritize test inputs.

III-B1 Bayesian Neural Network

A typical (non-Bayesian) neural network has deterministic parameters that are optimized to have fixed values. A Bayesian neural network (BNN) [21], on the other hand, treats parameters as random variables which can encode distributions. For training, Bayesian inference [22] is used to update the posterior over the weights $\mathbf{W}$ given the data $\mathbf{X}$ and $\mathbf{Y}$: $p(\mathbf{W} \mid \mathbf{X}, \mathbf{Y})$, which captures the set of plausible model parameters given the data. To make the training of the weights tractable, the weights are often fitted to a simple distribution such as the Gaussian, and the parameters (mean and variance in the case of the Gaussian distribution) of the distributions are optimized during training [23]. The likelihood of the prediction is often defined as a Gaussian with mean given by the model output: $p(y \mid x, \mathbf{W}) = \mathcal{N}(f^{\mathbf{W}}(x), \sigma^2)$, where $f^{\mathbf{W}}(x)$ denotes the random output of the BNN [20] for an input $x$ and $\mathcal{N}$ a normal distribution.

III-B2 Uncertainty in Bayesian Neural Networks

For a classification task, the likelihood of predicting an output $y = c$ for an input $x$ is approximated as:

    p(y = c \mid x, \mathbf{X}, \mathbf{Y}) \approx \frac{1}{T} \sum_{t=1}^{T} \mathrm{softmax}\big(f^{\widehat{\mathbf{W}}_t}(x)\big)    (2)

with $T$ samples $\widehat{\mathbf{W}}_t$. The uncertainty of the resulting probability vector $\mathbf{p}$ is then summarized as its entropy: $H(\mathbf{p}) = -\sum_{c} p_c \log p_c$. For regression, the uncertainty is captured by the predictive variance, which is approximated as:

    \widehat{\mathrm{Var}}(y) \approx \sigma^2 + \frac{1}{T} \sum_{t=1}^{T} f^{\widehat{\mathbf{W}}_t}(x)^{\top} f^{\widehat{\mathbf{W}}_t}(x) - \mathbb{E}(y)^{\top} \mathbb{E}(y)    (3)

with $T$ samples and the predicted mean $\mathbb{E}(y) \approx \frac{1}{T} \sum_{t=1}^{T} f^{\widehat{\mathbf{W}}_t}(x)$ [20]. In other words, the predictive variance is obtained by passing the input to the model $T$ times and computing the variance among the sampled outputs.
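A minimal sketch of these Monte Carlo estimates is given below, assuming the $T$ sampled outputs have already been collected (see the next subsection); the aleatoric noise term $\sigma^2$ of Equation 3 is omitted, and the function names are ours.

```python
import numpy as np

def classification_uncertainty(mc_probs):
    """Equation 2 plus the entropy summary: average the T sampled softmax
    vectors, then take the entropy of the averaged distribution."""
    p = np.clip(mc_probs.mean(axis=0), 1e-12, 1.0)   # mc_probs: (T, num_classes)
    return float(-np.sum(p * np.log(p)))

def regression_uncertainty(mc_preds):
    """Equation 3 without the sigma^2 noise term: the variance among the
    T sampled outputs, summed over the output dimensions."""
    return float(mc_preds.var(axis=0).sum())          # mc_preds: (T, output_dim)
```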

III-B3 Monte-Carlo Dropout as a Bayesian Approximation

Dropout is a simple regularization technique that prevents neural network models from over-fitting to the training data [24]. It works during the training phase by randomly dropping out some neurons in specified layers with a given probability, so that the model parameters are changed only for the sampled neurons. Since the model parameters are adjusted only by an infinitesimal amount in each iteration, the cost converges after sufficient training iterations, even with the variance introduced by the random selection of neurons. At test time, dropout is disabled so that every neuron participates in making a deterministic prediction. This simple technique is shown to be very effective in improving the performance of neural networks on supervised learning tasks. It was later discovered by Gal and Ghahramani [16] that a dropout network can approximate a Gaussian process [25]. They proved that an arbitrary neural network with dropout applied before every weight layer is mathematically equivalent to an approximation of a probabilistic Gaussian process. They also showed that any deep neural network that uses dropout layers can be changed to produce uncertainty estimations by simply turning on the dropout at test time (unlike the typical use case where dropout is turned off), and the likelihood can be approximated with Monte Carlo integration. The uncertainty of the model can then be estimated in the same way as in Equations 2 and 3; the only difference is that the weight $\widehat{\mathbf{W}}_t$ varies by sample and follows the dropout distribution such that $\widehat{\mathbf{W}}_t \sim q(\mathbf{W})$, where $q(\mathbf{W})$ is the dropout distribution. We refer more curious readers to the works by Gal and Ghahramani [16] and Kendall and Gal [20].
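In a Keras setting, the Monte Carlo sampling described above reduces to calling a trained model with dropout kept active; the sketch below assumes a tf.keras model that contains Dropout layers and is not taken from the authors' implementation.

```python
import numpy as np

def mc_dropout_samples(model, x, num_samples=100):
    """Draw Monte Carlo dropout samples from a trained tf.keras model.

    Calling the model with training=True keeps its Dropout layers active at
    test time, so each forward pass uses a different random mask, which
    approximates sampling weights from the dropout distribution q(W).
    """
    return np.stack([np.asarray(model(x, training=True))
                     for _ in range(num_samples)])    # shape: (T, batch, out_dim)
```

For a single input, the (T, num_classes) slice of the result is what the estimators sketched after Equations 2 and 3 expect.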

III-C Input Surprise

Surprise Adequacy (SA, in short) is a test adequacy criterion defined by Kim et al. [26] to assess the adequacy of a test suite for testing deep learning systems. Informally, SA is achieved when a set of test inputs demonstrates varying degrees of the model's surprise, measured by a likelihood or distance function, relative to the training data. The rationale is that a good test suite shall demonstrate a diverse and representative behavior of a trained model, and that surprise can be a good representation of such diversity in behavior.

Unlike other coverage criteria introduced for testing neural networks so far, such as neuron coverage [3], MC/DC-inspired criteria [5], or other structural criteria [11], SA is more fine-grained and unique in that it can assess the quality of each input individually. For example, SA can measure the relative surprise of an input with respect to the training data and give it a numeric score; the higher the score, the more surprising the input is to the model. Our take on SA is that a surprising input may be more likely to reveal an erroneous behavior in the trained model, since a high surprise may indicate that the model is not well prepared for the input, and thus should be given a high score.

Kim et al. [26] defined two ways of measuring surprise. Both make use of the activation trace of a neural network during classification. For the classification of every input, each neuron in the neural network gets activated with a certain value. The vector of activation values seen when classifying an input is termed the activation trace of that input. Given a set of activation traces for a known set of inputs, the surprise of a new input with respect to the known inputs can be computed by comparing the activation trace of the new input with those of the known inputs. Kim et al. proposed two ways of making such comparisons.

  1. A probability density function can be computed for the set of activation traces of the known inputs. For a new input, we can compute the sum of differences between its estimated density and the densities of the known inputs. The higher this sum, the more surprising the new input is. This method is termed Likelihood-based Surprise Adequacy (LSA).

  2. Another method for comparing activation traces is to use a distance function, creating another surprise adequacy criterion called Distance-based Surprise Adequacy (DSA). Let $\alpha(x)$ denote the activation trace of an input $x$ and $c_x$ the class predicted for $x$. Given the set of known inputs $T$ and a new input $x$, DSA computation first finds the known input $x_a$ whose activation trace is the closest neighbor of $\alpha(x)$ among inputs with the same predicted class as $x$. Next, it finds the known input $x_b$ whose activation trace is closest to $\alpha(x_a)$ but whose predicted class differs from the one predicted for $x$. $x_a$ and $x_b$ are computed as:

    x_a = \operatorname*{argmin}_{x_i \in T,\, c_{x_i} = c_x} \lVert \alpha(x) - \alpha(x_i) \rVert    (4)
    x_b = \operatorname*{argmin}_{x_i \in T,\, c_{x_i} \neq c_x} \lVert \alpha(x_a) - \alpha(x_i) \rVert    (5)

    with $dist_a$ and $dist_b$ defined as:

    dist_a = \lVert \alpha(x) - \alpha(x_a) \rVert    (6)
    dist_b = \lVert \alpha(x_a) - \alpha(x_b) \rVert    (7)

    Finally, the value of DSA for the new input $x$ is computed as:

    DSA(x) = dist_a / dist_b    (8)

LSA is computationally more expensive and requires more parameter tuning than DSA. One parameter is the small set of layers that needs to be chosen for LSA. Activation traces for LSA will then only consist of activation values of neurons in these selected layers. Another parameter is the value for variance used to filter out neurons whose activation values were below a certain threshold. DSA, while still being sensitive to layer selection, benefits more than LSA from choosing deeper layers in the network and has fewer parameters that need to be tuned. For these reasons, we implemented DSA and compared it with techniques mentioned in the previous two subsections.
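For concreteness, a direct (unoptimized) sketch of the DSA computation in Equations 4 to 8 is shown below; activation traces are assumed to have been extracted beforehand, for example with an auxiliary Keras model whose output is the chosen hidden layer, and all names are ours rather than the tool's.

```python
import numpy as np

def dsa_scores(train_ats, train_labels, test_ats, test_preds):
    """Distance-based Surprise Adequacy for each new input (Equations 4-8).

    train_ats:    (N, d) activation traces of the known (training) inputs
    train_labels: (N,)   predicted classes of the known inputs
    test_ats:     (M, d) activation traces of the new inputs
    test_preds:   (M,)   predicted classes of the new inputs
    """
    scores = []
    for at, c in zip(test_ats, test_preds):
        same = train_ats[train_labels == c]
        diff = train_ats[train_labels != c]
        # Eq. 4 / 6: nearest same-class trace x_a and its distance dist_a
        d_same = np.linalg.norm(same - at, axis=1)
        x_a = same[np.argmin(d_same)]
        dist_a = d_same.min()
        # Eq. 5 / 7: nearest different-class trace to x_a and distance dist_b
        dist_b = np.linalg.norm(diff - x_a, axis=1).min()
        # Eq. 8: the ratio is the surprise score
        scores.append(dist_a / dist_b)
    return np.array(scores)
```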

IV Experiment

We experimentally assessed the efficacy of the input prioritization techniques in two use-case scenarios. The first use-case is labeling cost minimization. The second use-case is retraining the model with the selected fraction of the prioritized inputs as in active learning, which is a natural next step for utilizing prioritized inputs. With these scenarios in mind, we propose the following research questions.

  • RQ1. Can we effectively prioritize test inputs that reveal erroneous behavior in the model?

  • RQ2. Can the prioritized inputs be used to retrain the model effectively?

The efficacy for RQ1 is measured by the cumulative percentage of error-revealing inputs after prioritization. The efficacy for RQ2 is measured by comparing the accuracy of the model retrained with prioritized inputs against that of a baseline model retrained with randomly selected inputs.

We answer these research questions for each prioritization technique and compare their relative performance. As concrete instantiations of the prioritization techniques, we compare: 1) softmax, 2) dropout Bayesian with 10 and 100 Monte-Carlo samples, and 3) Distance-based Surprise Adequacy (DSA) measured over the last one and last two layers. For the techniques that require Monte-Carlo sampling, we compare 10 and 100 samples to assess the trade-off between sample size and prioritization efficacy. For DSA, we measure the distance of activation traces taken from the last one or last two layers. Although the efficacy of DSA can be higher when the activation traces are taken from the middle layers of the neural network, according to Kim et al. [26], our choice of layers was limited because of the high space and time complexity of the DSA algorithm. The time complexity of the DSA algorithm is quadratic in the length of the hidden-layer output vector, and the output vectors of hidden layers deeper than the last two were typically too long to be handled efficiently given our hardware constraints.

IV-A Systems Under Test

To simulate a realistic testing scenario where a trained model is scrutinized with additional test data, we chose two representative systems for image classification and image regression. The first system is a digit classification system trained with the 60,000-image MNIST [27] training dataset. We test the system with the EMNIST [28] dataset, an extension of MNIST that is compatible with its predecessor. The second system is called TaxiNet, which is designed for an aircraft in ground operation to predict the distance to the center line and the heading-angle deviation from the center line while taxiing. It was designed and developed by our industrial partner as a research prototype to assess the applicability of learning-enabled components in the safety-critical domain. The data collection and training were done by ourselves.

To avoid the high cost of operating an actual aircraft in the real environment, we collected the dataset in the X-Plane 11 simulation environment, wherein the graphics and the dynamics of the environment and the aircraft are accurately modeled. For a preliminary assessment, we fixed the runway to KMWH-04 and the aircraft to a Cessna 208B Grand Caravan, while varying the position and angle of the aircraft together with the weather condition. We used 40,000 samples for training with some realistic image augmentations (such as brightness, contrast, blur, and vertical affine transformation) turned on in order to maximize the utility of the training data and create a more robust model.

IV-B Model Configuration

The accuracy of a neural network depends on many factors, including the amount and quality of training data, the structure of the network, and the training process. As the performance of our proposed prioritization techniques may also depend on these factors, we treated the structural configuration as an independent variable. However, since it is infeasible to compare the effect of all the independent variables on the prioritization techniques, we configured a number of representative neural networks with different structures. We controlled the other hyper-parameters, such as learning rate and mini-batch size, to be constant across different configurations so that the effect of the structure alone can be studied. The hyper-parameters are configured according to known good practices at the time of writing this paper so that we can objectively simulate a realistic testing scenario.

Model | Trainable Parameters | Model Structure                                                        | Training Epochs | Validation Accuracy | EMNIST Test Accuracy
A     | 594,922              | 2 Conv2D - MaxPool - 2 Conv2D - MaxPool - Flatten - Dropout - 2 Dense  | 82              | 99.16%              | 95.74%
B     | 177,706              | 2 Conv2D - MaxPool - Flatten - Dropout - 3 Dense                       | 93              | 98.90%              | 89.66%
C     | 728,170              | 2 Conv2D - MaxPool - Flatten - Dropout - 3 Dense                       | 138             | 98.81%              | 86.14%
D     | 111,514              | Dense - Dropout - 3 Dense                                              | 102             | 97.74%              | 72.90%
TABLE I: Four digit classification models trained with the MNIST dataset [27] and tested with the EMNIST dataset [28].

For the digit classification task, we configured four networks as described in Table I. For all layers except the last one, ReLU (rectified linear unit) was used as the activation function, and L2 kernel regularization was applied to prevent the parameters from over-fitting. During training, we check-pointed an epoch only when the validation accuracy (on the 10,000-image validation set) improved over the previous epochs, and stopped the training when the validation accuracy did not improve for more than twenty consecutive epochs.
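This checkpointing and stopping rule corresponds to standard Keras callbacks; the following sketch assumes a compiled tf.keras classifier and is not the exact training script used in the experiment.

```python
from tensorflow import keras

callbacks = [
    # Save a checkpoint only when the validation accuracy improves.
    keras.callbacks.ModelCheckpoint("best_model.h5", monitor="val_accuracy",
                                    save_best_only=True),
    # Stop once the validation accuracy has not improved for 20 epochs.
    # (Older Keras versions name the monitored metric "val_acc".)
    keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=20,
                                  restore_best_weights=True),
]
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=200, callbacks=callbacks)
```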

For the taxiing task, we compare two different networks, named MobileNet and SimpleNet, supplied by our industry partner. MobileNet is a convolutional neural network inspired by MobileNetV2 [29]. The structure of the network is similar to what is described in Table 2 of Sandler et al.'s paper, and it is relatively compact in size, with 2,358,642 trainable parameters. SimpleNet is also a convolutional neural network, with a simpler structure but 4,515,338 trainable parameters. It has five sets of convolution, batch normalization, and activation layers back-to-back, followed by a dropout layer and four dense layers. Both networks implement L2 regularization and are trained with a stochastic gradient-descent algorithm with weight decay.

IV-C Efficacy Measure

An ideal prioritization technique would consistently assign high scores to all the error-revealing inputs and low scores to all the rest. For example, if there were 20 prioritized test inputs among which 5 were error-revealing, the first five inputs should all reveal errors and the rest should not. If we draw a graph of the cumulative sum that represents the cumulative number of errors revealed by executing each prioritized input, the graph will be monotonically increasing until it hits 5, which is the total number of error-revealing inputs in the given test suite (the orange line in Figure 1). A test suite without prioritization would produce a line like the blue one. In practice, the efficacy of a prioritization technique will be somewhere between random selection and ideal prioritization, since it cannot be determined in advance whether a prediction will be correct or not, producing a curve that looks like the green line in Figure 1. The efficacy of a prioritization technique can then be captured by computing the area under the curve for each technique and taking its ratio to the area under the curve of the ideal prioritization criterion. This is a slight modification of the Average Percentage of Faults Detected (APFD) measure, which is typically used for measuring the efficacy of test prioritization [30].

Fig. 1: Cumulative sum of the errors found by test suites prioritized by each technique (ideal, poor, technique1). The x-axis corresponds to the test cases, where priority is higher for the earlier ones. The y-axis is the total number of errors found by executing the test cases up to that point. The efficacy score of technique1 is the ratio of its area under the curve to the area under the ideal curve: AUC(technique1) / AUC(ideal).
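The modified APFD score can be computed directly from the prioritized order; a minimal sketch follows (names are ours, and at least one error-revealing input is assumed).

```python
import numpy as np

def apfd_like_score(errors_in_priority_order):
    """Ratio of the area under the cumulative-error curve of a prioritized
    test suite to the area under the curve of an ideal prioritization.

    `errors_in_priority_order` is a boolean array: True if the i-th input
    (ordered from highest to lowest priority) reveals an error.
    """
    e = np.asarray(errors_in_priority_order, dtype=int)
    observed = np.cumsum(e)                    # cumulative errors, as prioritized
    ideal = np.cumsum(np.sort(e)[::-1])        # all error-revealing inputs first
    return observed.sum() / ideal.sum()        # ratio of the two areas
```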

IV-D Implementation and Experiment Environment

We implemented the three prioritization techniques (softmax, dropout Bayesian, and Surprise Adequacy) in Python on top of Keras [31], one of the most popular machine-learning libraries. Our tool is thus compatible with any trained model that abides by Keras' Model interface. The surprise adequacy measurement is implemented in C++ to better utilize lower-level performance optimizations and thread-based parallelization. Every feature is integrated seamlessly and provided as a Python API. The tool is publicly available on GitHub at http://www.github.com/bntejn/keras-prioritizer.

The experiments are performed on Ubuntu 16.04 running on an Intel i5 CPU, 32GB DDR3 RAM, an SSD, and a single NVIDIA GTX 1080-Ti GPU.

V Results and Discussion

We ran our prioritization tool on the test datasets for all the trained models of MNIST and TaxiNet and measured the prioritization effectiveness in terms of misbehavior identification and retraining improvement.

V-A RQ1: Effectiveness of prioritization in identifying erroneous behavior

Fig. 2: The cumulative sum of the error-revealing inputs by test inputs of decreasing priority, for panels (a) MNIST-A, (b) MNIST-B, (c) MNIST-C, (d) MNIST-D, (e) TaxiNet-MobileNet, and (f) TaxiNet-SimpleNet. The x-axis represents the test cases sorted in decreasing order of priority. The y-axis shows the cumulative sum of error-revealing inputs. An ideal prioritization should sort every error-revealing input to the front, drawing a highly convex curve. A poor prioritization, on the other hand, will produce a curve with lower convexity.

The effectiveness of prioritization in identifying misbehavior is illustrated in Figure 2 and summarized in Table II. For each model we present the validation accuracy, test accuracy, and the score of prioritization as Average Percentage of Faults Detected (APFD), as described in Section IV-C. The accuracy of classification is presented as the percentage of correct classifications, and the accuracy of regression is presented as the mean absolute error (MAE). The MAE for TaxiNet is defined as $\frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$, where $n$, the length of the output vector, is two for TaxiNet. The output of TaxiNet is normalized, so the MAE is bounded, and a lower error is more desirable. We also present the accuracy as a percentage; the correctness of an output is determined by a fixed error threshold on the MAE.
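For the regression model, the per-input correctness check described above can be sketched as follows; the threshold is left as a parameter since its exact value is not restated here, and the function name is ours.

```python
import numpy as np

def regression_errors(y_true, y_pred, threshold):
    """Mark TaxiNet outputs as erroneous when the per-input MAE, averaged
    over the two normalized outputs, exceeds a fixed threshold."""
    mae = np.abs(y_true - y_pred).mean(axis=1)   # shape: (num_inputs,)
    return mae > threshold                        # True = error-revealing input
```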

Fig. 3: Representative inputs of high vs. low priority ((a) high-priority input, (b) low-priority input): a high utility score was assigned to inputs that produce high uncertainty, in this case due to the lack of a visible center line in the runway image taken around an intersection.
Dataset | Architecture         | Validation Accuracy | Test Accuracy   | Sentiment Measures: Softmax | Dropout (10) | Dropout (100) | DSA (last 1) | DSA (last 2)
MNIST   | A (CNN + Batch norm) | 99.16%              | 95.74%          | 94.80                       | 93.20        | 93.57         | 94.26        | 92.80
MNIST   | B (CNN)              | 98.90%              | 89.66%          | 91.10                       | 90.87        | 91.21         | 90.99        | 87.06
MNIST   | C (CNN)              | 98.81%              | 86.14%          | 89.30                       | 89.09        | 89.35         | 88.98        | 89.26
MNIST   | D (fully-connected)  | 97.74%              | 72.90%          | 87.90                       | 87.58        | 88.02         | 88.13        | 87.77
TaxiNet | MobileNet            | 0.0394 (99.90%)     | 0.0764 (97.73%) | _                           | 82.84        | 86.16         | _            | _
TaxiNet | SimpleNet            | 0.0575 (99.66%)     | 0.1243 (90.53%) | _                           | 74.91        | 77.56         | _            | _
TABLE II: The efficacy of test input prioritization in misbehavior identification

The efficacy of prioritization, presented as APFD scores, ranged from 74.91 to 94.80 over the different models, which suggests that test input prioritization works in highlighting error-revealing inputs, regardless of the type of task and the structure of the network. Among the different techniques, the efficacy of softmax, dropout Bayesian, and DSA were all similar for the same model, but the efficacy of the dropout Bayesian method was higher with more samples, as larger samples can more accurately depict the posterior distribution.

An interesting observation is that the efficacy of the prioritization metrics is correlated with the test accuracy of the model, or more precisely, the difference between the validation accuracy and the test accuracy of the model. The APFD was consistently high for the well-performing models and consistently low for the worse-performing models, regardless of the choice of sentiment measure. One plausible cause for this phenomenon is covariate shift [32], a situation in which the distribution of the input data shifts from the training dataset to the test dataset. A stark decrease in the test accuracy for some models suggests that the data distribution shifted from the training data to the test data, and some models (such as A and B) are relatively robust to the shift while the others are not. Model A, for instance, implements batch normalization, a technique known to reduce internal covariate shift [33], and was more robust to the covariate shift in the input distribution, which contributed to a higher prioritization effectiveness. In conclusion to our first research question: Prioritized inputs can effectively identify erroneous behavior in a trained model. The prioritization is more effective when the model has higher test accuracy.

V-B RQ2: Effectiveness of prioritization in retraining

We assess the utility of the prioritized inputs when a model is retrained on the training dataset augmented with the prioritized inputs. This evaluation is similar to the active learning scenario, but differs in that the model under test is already well-trained. We sample only 1% of the amount of training data from the test set, which equals 600 inputs for the MNIST models and 400 for the TaxiNet models. The baseline approach we compare against is random selection with the same sample size; we hypothesize that the prioritization techniques perform better than random selection. Hyper-parameters other than the augmented training data were kept constant across retraining runs.
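The retraining setup can be sketched as selecting the top-scoring inputs up to the 1% budget, labeling them, and appending them to the training data before retraining with unchanged hyper-parameters; the names below are ours.

```python
import numpy as np

def augment_with_prioritized(x_train, y_train, x_test, y_test, scores, frac=0.01):
    """Append the highest-priority test inputs (with their newly obtained
    labels) to the training set; `frac` is relative to the training-set size,
    e.g. 600 inputs for MNIST and 400 for TaxiNet."""
    budget = int(frac * len(x_train))
    top = np.argsort(scores)[::-1][:budget]       # highest scores first
    x_aug = np.concatenate([x_train, x_test[top]])
    y_aug = np.concatenate([y_train, y_test[top]])
    return x_aug, y_aug
```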

Dataset | Architecture | Validation Accuracy (baseline) | Test Accuracy (baseline) | Sentiment Measures: Softmax | Dropout (10) | Dropout (100) | DSA (last 1) | DSA (last 2)
MNIST   | A            | 99.23%                         | 97.48%                   | 97.95%                      | 98.09%       | 98.06%        | 98.18%       | 98.44%
MNIST   | B            | 98.90%                         | 95.66%                   | 96.78%                      | 97.41%       | 96.26%        | 97.35%       | 96.48%
MNIST   | C            | 98.70%                         | 95.02%                   | 94.65%                      | 95.70%       | 94.50%        | 95.02%       | 94.45%
MNIST   | D            | 97.80%                         | 90.57%                   | 89.93%                      | 87.58%       | 88.26%        | 91.87%       | 92.00%
TaxiNet | MobileNet    | 0.0336 (100.00%)               | 0.0364 (99.86%)          | _                           | 99.84%       | 99.89%        | _            | _
TaxiNet | SimpleNet    | 0.0502 (99.85%)                | 0.0522 (99.05%)          | _                           | 99.71%       | 99.56%        | _            | _
TABLE III: The efficacy of input prioritization in retraining

Table III shows that the relative efficacy of retraining follows a similar trend to the error-revealing efficacy presented in Table II. The prioritized inputs could improve the accuracy of the retrained models more effectively than randomly sampled inputs in most cases, and the effect was more pronounced for MNIST models A and B than for models C and D. When models B and C are compared, model B consistently performed better when retrained with prioritized inputs, while model C almost always performed worse than when retrained with randomly sampled inputs. One hypothetical explanation could be that a well-architected DNN model with better generalization benefits more from learning the corner cases, whereas a suboptimal DNN learns more from general cases. However, the exact reason for this phenomenon cannot be determined from the limited experiment; a future investigation is necessary. In conclusion to the second research question: Sentiment measures can prioritize inputs that augment the training dataset so that better accuracy can be achieved. But random sampling was found to be more effective for the models that achieve low test accuracy.

V-C Threats to Validity

In the experiment, we evaluated the sentiment measures with both an image classification task and an image regression task, and configured several DNNs with various structural features. Despite our effort, the configured DNNs were inevitably limited in number and variety, and our empirical findings might not generalize to other types of DNNs such as recurrent neural networks. Nevertheless, the DNNs we evaluated are of realistic sizes and implement some of the most widely used techniques applied in practical deep learning [34].

The second experiment for answering RQ2 was performed without the statistical rigor required for hypothesis testing due to the prohibitive cost of retraining a large model multiple times. We present the result and the finding as a preliminary assessment of the sentiment measures in the context of testing which calls for more rigorous empirical assessment in future work.

VI Conclusion and Future Work

This paper presented techniques for mitigating the oracle problem in testing DNNs by prioritizing error-revealing inputs based on white-box measures of the DNN's sentiment: softmax confidence, Bayesian uncertainty, and input surprise. We evaluated the three techniques on two example systems for image classification and image regression, and on multiple versions of the DNNs configured with different architectures. The experiment showed that the sentiment measures can prioritize error-revealing inputs with APFD scores of 74.9 to 94.8, indicating that input prioritization based on sentiment measures is a viable approach for effectively identifying weaknesses of trained models with reduced labeling cost.

We firmly believe that more attention should be paid to techniques that can facilitate field testing of safety-critical DNNs, which can be a laborious process and that test prioritization is an important step towards that goal, providing practical utility and good scalability. Further research is still warranted for assessing the representativeness and completeness of test sets with respect to the operational environments of DNN-based systems.

Acknowledgment

This work is supported by AFRL and DARPA under contract FA8750-18-C-0099.

References

  • [1] A. Singhvi and K. Russell, “Inside the self-driving Tesla fatal accident,” The New York Times, 2016.
  • [2] H. B. Braiek and F. Khomh, “On Testing Machine Learning Programs,” 2018.
  • [3] K. Pei, Y. Cao, J. Yang, and S. Jana, “DeepXplore: Automated Whitebox Testing of Deep Learning Systems,” in Proceedings of the 26th Symposium on Operating Systems Principles - SOSP ’17.   New York, New York, USA: ACM Press, 2017, pp. 1–18.
  • [4] J. Wang, J. Sun, P. Zhang, and X. Wang, “Detecting Adversarial Samples for Deep Neural Networks through Mutation Testing,” pp. 1–10, may 2018.
  • [5] Y. Sun, X. Huang, and D. Kroening, “Testing Deep Neural Networks,” mar 2018.
  • [6] Y. Sun, M. Wu, W. Ruan, X. Huang, M. Kwiatkowska, and D. Kroening, “Concolic Testing for Deep Neural Networks,” pp. 1–30, apr 2018.
  • [7] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami, “Practical Black-Box Attacks against Machine Learning,” feb 2016.
  • [8] Y. Tian, K. Pei, S. Jana, and B. Ray, “DeepTest: Automated Testing of Deep-Neural-Network-driven Autonomous Cars,” aug 2017.
  • [9] M. Zhang, Y. Zhang, L. Zhang, C. Liu, and S. Khurshid, “DeepRoad: GAN-based Metamorphic Autonomous Driving System Testing,” 2018.
  • [10] X. Xie, L. Ma, F. Juefei-Xu, H. Chen, M. Xue, B. Li, Y. Liu, J. Zhao, J. Yin, and S. See, “Coverage-Guided Fuzzing for Deep Neural Networks,” no. Dl, pp. 1–25, 2018.
  • [11] L. Ma, F. Juefei-Xu, F. Zhang, J. Sun, M. Xue, B. Li, C. Chen, T. Su, L. Li, Y. Liu, J. Zhao, and Y. Wang, “DeepGauge: Multi-Granularity Testing Criteria for Deep Learning Systems,” mar 2018.
  • [12] A. Odena and I. Goodfellow, “TensorFuzz: Debugging Neural Networks with Coverage-Guided Fuzzing,” 2018.
  • [13] D. Cohn, L. Atlas, and R. Ladner, “Improving Generalization with Active Learning,” Machine Learning, vol. 15, no. 2, pp. 201–221, 1994.
  • [14] B. Settles, “Active learning,” Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 6, no. 1, pp. 1–114, 2012.
  • [15] D. D. Lewis and W. A. Gale, “A sequential algorithm for training text classifiers,” in Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval.   Springer-Verlag New York, Inc., 1994, pp. 3–12.
  • [16] Y. Gal and Z. Ghahramani, “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning,” in Proceedings of the 33rd International Conference on Machine Learning (ICML), vol. 48, 2016.
  • [17] A. Nguyen, J. Yosinski, and J. Clune, “Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images,” dec 2014.
  • [18] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and Harnessing Adversarial Examples,” pp. 1–9, Dec. 2014.
  • [19] A. Subramanya, S. Srinivas, and R. V. Babu, “Confidence estimation in Deep Neural networks via density modelling,” 2017.
  • [20] A. Kendall and Y. Gal, “What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?” no. Nips, 2017.
  • [21] M. D. Richard and R. P. Lippmann, “Neural network classifiers estimate bayesian a posteriori probabilities,” Neural computation, vol. 3, no. 4, pp. 461–483, 1991.
  • [22] R. M. Neal, Bayesian learning for neural networks.   Springer Science & Business Media, 2012, vol. 118.
  • [23] A. Graves, “Practical variational inference for neural networks,” in Advances in neural information processing systems, 2011, pp. 2348–2356.
  • [24] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
  • [25] A. C. Damianou and N. D. Lawrence, “Deep Gaussian Processes,” vol. 31, 2012.
  • [26] J. Kim, R. Feldt, and S. Yoo, “Guiding Deep Learning System Testing using Surprise Adequacy,” aug 2018.
  • [27] Y. LeCun, “The MNIST database of handwritten digits,” http://yann.lecun.com/exdb/mnist/, 1998.
  • [28] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik, “EMNIST: an extension of MNIST to handwritten letters,” CoRR, vol. abs/1702.05373, 2017.
  • [29] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen, “Inverted residuals and linear bottlenecks: Mobile networks for classification, detection and segmentation,” CoRR, vol. abs/1801.04381, 2018.
  • [30] G. Rothermel, R. H. Untch, C. Chu, and M. J. Harrold, “Test case prioritization: an empirical study,” Proceedings IEEE International Conference on Software Maintenance - 1999 (ICSM’99). ’Software Maintenance for Business Change’ (Cat. No.99CB36360), pp. 179–188, 1999.
  • [31] F. Chollet et al., “Keras: The python deep learning library,” Astrophysics Source Code Library, 2018.
  • [32] M. Sugiyama and M. Kawanabe, Machine learning in non-stationary environments: Introduction to covariate shift adaptation.   MIT press, 2012.
  • [33] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” CoRR, vol. abs/1502.03167, 2015.
  • [34] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio, Deep learning.   MIT press Cambridge, 2016, vol. 1.