Data Sanity Check for Deep Learning Systems via Learnt Assertions

09/06/2019 ∙ by Haochuan Lu, et al. ∙ 0

Deep learning (DL) techniques have demonstrated satisfactory performance in many tasks, even in safety-critical applications. Reliability is hence a critical consideration for DL-based systems. However, the statistical nature of DL makes it quite vulnerable to invalid inputs, i.e., those cases that are not considered in the training phase of a DL model. This paper proposes to perform data sanity check to identify invalid inputs, so as to enhance the reliability of DL-based systems. To this end, we design and implement a tool that detects behavior deviation of a DL model when processing an input case and considers it the symptom of an invalid input case. Via a light, automatic instrumentation of the target DL model, this tool extracts the data flow footprints and conducts an assertion-based validation mechanism.




I Introduction

In recent years, deep learning (DL) techniques have shown great effectiveness in a wide range of tasks. A large number of DL-based applications and systems have been proposed to support people’s daily life and industrial production [13, 16, 1], even in safety-critical applications. The image recognition module of an auto-driving vehicle [13], for instance, determines what operation should be taken according to the real-time images captured by cameras. In such safety-critical scenarios, any system misbehavior may cause severe incidents. Reliability is hence of great significance for practical DL-based systems.

It is widely-accepted that every software system has its valid input domain[55, 50, 29, 52]. Inputs staying in such a domain, namely, valid inputs, are expected to receive proper execution results. Unfortunately, in real circumstances, there is no guarantee the inputs are always valid. Anomalous, unexpected inputs may arrive and result in unpredictable misbehavior, which in turn degrades reliability. In particular, this is a severe threat to the reliability of DL-based systems. The statistical-inference nature of DL models makes them quite vulnerable to invalid input cases.

Take classification tasks for example. A DL model is trained on cases belonging to a set of target categories (e.g., different traffic signs). The cases in such target categories form the valid input domain. In practice, however, cases belonging to other categories may be taken as inputs. Without a proper check, the DL model will still process an invalid case and generate a classification result, indicating that the case belongs to one of the target categories (i.e., a traffic sign). This result is obviously wrong and may incur risks. Suppose a traffic sign recognition system in an auto-driving vehicle receives a commercial logo by the street as an input. If the logo is mistakenly classified as a certain traffic sign, the operations the vehicle takes accordingly may lead to severe accidents. Hence, it is critical to develop a mechanism for DL-based systems that can detect invalid inputs, so as to enhance their reliability.

Intuitively, we can use the output of the DL model, which to some extent indicates the likelihood of the result, to identify ambiguous input cases [18]. Moreover, we can include an additional class that indicates an input is not in the target categories, i.e., it is an invalid input [7, 43]. However, we find that these solutions result in poor detection capability. They may also influence the accuracy of the original model. In this paper, we propose SaneDL, a tool that provides systematic data Sanity check for Deep Learning-based systems.

SaneDL serves as a lightweight, flexible, and adaptive plugin for DL models, which can achieve effective detection of invalid inputs during system runtime. The key notion of the SaneDL design is that a DL model will behave differently given valid input cases and invalid ones. The behavior deviation is the symptom of invalid input cases [14]. Since a DL model is not complicated in its control flow, we use data flow to model its behaviors. Via a light, automatic instrumentation of the target network, SaneDL extracts the data flow footprints. It then processes the footprints with an assertion-based mechanism. Such assertions are typically constructed via AutoEncoder [19], which is highly compatible with the DL model. Similar to traditional assertions for general programs, which are inserted between code blocks, the assertions are inserted in certain network layers, so as to detect behavior deviations from a data flow perspective. Invalid input cases are thus identified effectively.

We summarize the contributions of this paper as follows.

  • We approach reliability enhancement of DL systems via data sanity check. We propose a tool, namely SaneDL, to perform data sanity check for DL-based systems. SaneDL detects behavior deviation of a DL model to identify invalid input cases. SaneDL is an assertion-based approach, where the assertions are automatically learned and inserted. To our knowledge, SaneDL is the first assertion-based tool that can automatically detect invalid input cases for DL systems. Our work can shed light on other practices in improving DL reliability.

  • SaneDL adopts a non-intrusive design that will not degrade the performance of the target DL model. In other words, the SaneDL mechanism does not require the adaptation of the target DL model. As a result, SaneDL can be easily applied to existing DL-based systems.

  • We show the effectiveness and efficiency of invalid input detection via real-world invalid input cases, rather than the manipulated faulty samples typically used in the community. This indicates that the SaneDL mechanism is promising. In addition, our experimental method may shed light on other DL tool evaluation work in the software engineering community.

We organize the rest of the paper as follows. Section II introduces the preliminary knowledge of this work. In Section III, we discuss the motivations and challenges. Section IV elaborates the SaneDL design in detail. We present our experimental results with real-world case studies in Section V to evaluate the effectiveness of our tool. The related work is introduced in Section VI. Section VII provides further discussions and we conclude the paper in Section VIII.

II Preliminaries

To facilitate our following discussions, we introduce some background knowledge in this section. First, we will illustrate the assertion mechanism for data sanity check in traditional software development. Then, as our tool focuses on DL-based systems, we will briefly introduce deep neural networks.

II-A Assertion based Data Sanity Check

Data sanity check is a mechanism that examines whether the data fed to a system are those expected during the system design [8]. Even though a system’s logical functionalities are properly developed, severe system failures may still occur when undesirable input data are fed into the system [10, 37]. A proper data sanity check can help block invalid inputs and, to some extent, guarantee that a system works as expected.

Assertions have proved to be a very effective method of data sanity check in traditional software development [20, 54, 41]. Developers design system-specific conditions and insert statements that check these conditions into the target system. The conditions are checked at runtime when the system processes its input data. Applied as an instant data sanity check mechanism, assertions require that every condition be satisfied when a valid input is given. An invalid input causes one of the assertions to fail and can thus be identified. Figure 1 gives an example of how assertions indicate invalid inputs.

Fig. 1: Example for assertions used to prevent invalid input for a function computing .
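As an illustration of the traditional assertion mechanism described above, the following is a minimal Python sketch. The function computed in Figure 1 is not recoverable here, so the sketch assumes a hypothetical square-root function; the names `checked_sqrt` and `is_valid_input` are illustrative only.

```python
import math


def checked_sqrt(x):
    """Hypothetical target function guarded by manually written assertions."""
    # Assertions encode the valid input domain: real, non-negative numbers.
    assert isinstance(x, (int, float)), "input must be a real number"
    assert x >= 0, "input must be non-negative"
    return math.sqrt(x)


def is_valid_input(x):
    """Return True if every assertion holds for x, False if one fails."""
    try:
        checked_sqrt(x)
        return True
    except AssertionError:
        return False


print(is_valid_input(9.0))   # valid input passes the assertions
print(is_valid_input(-1.0))  # invalid input triggers an assertion failure
```

Note that Python skips `assert` statements when run with the `-O` flag, so a production check would normally raise an explicit exception instead.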

It’s worth noting that the effectiveness of assertions is strongly related to the design of conditions. Conditions are usually manually worked out according to the system internal logic and developers’ understanding on system requirements.

This paper also proposes to perform data sanity check with a specifically tailored assertion mechanism for DL-based systems. Unlike traditional assertions, which are typically built manually, the assertion mechanism of SaneDL is based on the mechanism of deep neural networks and is generated automatically. By evaluating the effectiveness of SaneDL, we confirm that assertion-based data sanity check is of great practical significance in DL-based systems.

II-B Deep Neural Network

Deep neural networks (DNNs) have been proven quite effective in many application scenarios, including data analysis [35, 4, 58], computer vision [30, 38], and natural language processing [27, 25]. A DNN is designed as an emulation of the human nervous system. It generally includes two critical activities, connection and activation. Figure 2 gives a classical DNN structure. It shows that a DNN is organized in layers, each of which consists of multiple neurons. The general computation process of a DNN follows a feed-forward manner, i.e., the outputs of the neurons in a layer form a vector, which is sent out as an input for the connected neurons in the next layer. A neuron, specifically, is a computation unit that processes its input with a certain function (e.g., linear combination) and applies an activation function (e.g., the ReLU function [34]) to generate its output. The input data of a sample, in this way, flow through the network, layer by layer, and generate a final result in the last layer.
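The feed-forward computation described above can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the networks used in this paper: the weights are hand-picked, ReLU is applied in every layer, and the helper names (`linear`, `relu`, `forward`) are hypothetical.

```python
def relu(vector):
    """ReLU activation applied element-wise."""
    return [max(0.0, value) for value in vector]


def linear(weights, bias, vector):
    """Linear combination: one weight row and one bias per output neuron."""
    return [
        sum(w * v for w, v in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def forward(layers, x):
    """Feed x through the layers and keep every layer's output vector."""
    outputs = []
    current = x
    for weights, bias in layers:
        current = relu(linear(weights, bias, current))
        outputs.append(current)
    return outputs


# Hand-picked toy parameters (assumption): a 2 -> 2 -> 1 network.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 1.0]], [-0.5]),
]
footprints = forward(layers, [2.0, 1.0])
print(footprints)  # [[1.0, 1.5], [2.0]]
```

Keeping the per-layer outputs, rather than only the final result, is what later makes the data flow of the network observable.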

Fig. 2: A classical structure of deep neural network for classification.

During the training phase, the final result is generally used to compute a loss value, which is minimized via certain optimization strategies. The loss evaluates the distance between the network results and the ideal results of a target class. Typically, a set of properly collected data samples is used to conduct the training process of a DNN. Its aim is to find the set of network parameters that minimizes the loss values for the data samples. This training-based nature has an inherent limitation: the network can only perform well when the input data share similar characteristics with the training set.

(a) Valid inputs and normal workflow
(b) Invalid input and misbehavior
Fig. 3: Motivating examples for a traffic sign recognition system
(a) Target categories in the training set of Fashion-MNIST
(b) Invalid inputs that appear in the testing set of Fashion-MNIST
Fig. 4: Examples of invalid inputs in a properly fetched public dataset

Finally, when a DNN is applied to a classification task (e.g., image classification), the number of neurons in the last layer determines how many categories the neural network can classify. The output of a neuron in the last layer indicates the likelihood that the input case belongs to the neuron’s corresponding class. Generally, the neuron with the largest output value is taken as the classification result, as shown in Figure 2. As a result, a DNN model is forced to classify every given input case into a pre-defined class, even when the case actually belongs to a class other than those indicated by the output neurons. Unexpected system behavior may occur consequently. In this paper, we propose SaneDL to address this issue: SaneDL conducts a data sanity check to indicate whether an input case belongs to an untrained class. We find that such a mechanism can significantly enhance the reliability of DL-based systems.
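The forced classification described above can be seen in a small sketch. It assumes a softmax over the last-layer outputs, which the text does not specify, and uses made-up class names and logits for illustration.

```python
import math


def softmax(logits):
    """Turn last-layer outputs into likelihoods that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, class_names):
    """Pick the class whose output neuron has the largest value."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]


# Hypothetical traffic-sign classes and nearly flat logits, as an
# unrelated image (e.g., a commercial logo) might produce.
classes = ["stop", "yield", "speed_limit"]
label, confidence = classify([0.2, 0.1, 0.3], classes)
print(label, round(confidence, 3))
```

Even though no class is clearly favored, the model still returns one of the pre-defined labels; nothing in this step can express "none of the above".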

III Problem Statement

A DNN model is trained with a set of training data samples. A developer will typically train the DNN model with samples that are representative of those encountered in real deployment. In other words, she expects that the input samples in real deployment fall in the same categories as the training samples. For example, a model to recognize flowers can only recognize flowers, but not animals. To facilitate our discussions, we call the expected samples the valid inputs, and the unexpected samples belonging to different categories the invalid inputs.

Unfortunately, when deployed in production environment, a DNN model will inevitably meet invalid inputs. However, current DNN design practice generally does not consider the existence of invalid inputs. As a result, even a sophisticated DNN model may make obvious mistakes. For example, a DNN model to recognize flowers may consider a dog as a sunflower.

The lack of a sanity check on input data may result in severe consequences in safety-critical applications. Let us consider a traffic sign recognition DNN model in auto-driving vehicles [45]. Figure 3(a) shows its normal workflow when valid traffic signs are given as inputs. Even if the DNN model recognizes traffic signs perfectly by classifying them into the correct classes, it may still produce severe errors. For example, a commercial logo on the roadside will sometimes be fed into the DNN model, as shown in Figure 3(b). Without a sanity check of inputs, the DNN model will produce a classification result accordingly. In other words, it will mistakenly recognize the logo as a specific traffic sign. The auto-driving vehicle may consequently suffer an accident.

Invalid inputs are not merely hypothetical. Some public datasets widely used in deep learning studies do contain obviously invalid samples, i.e., those totally irrelevant to the training samples. For example, the Fashion-MNIST [57] dataset provided by version 2.1.6 of Keras, a widely used DNN development tool [11], mistakenly contains samples that should belong to the MNIST [12] dataset in its testing set. (Fashion-MNIST is a public dataset providing pictures of fashionable clothing. MNIST is another public dataset consisting of images of handwritten digits.) We show such a case in Figure 4. With invalid samples during model testing, a DNN model will produce errors and show poor accuracy. However, even for in-house data, it is quite labor-intensive to pinpoint invalid samples manually, not to mention locating the invalid samples that may be met in real deployment.

Therefore, an effective sanity check mechanism of input data is highly critical to improve the reliability of a DNN model. Such a mechanism can automatically detect if a given input is invalid. Follow-up approaches are then possible to minimize the influence of the invalid inputs. For example, manual operation can be enforced to override DNN decisions. The aim of this work is to provide our experience in designing such a sanity check framework, namely, SaneDL.

However, although a sanity check mechanism of input data is of great importance to the reliability of DNN-based software, designing such a mechanism is not an easy task. The features of DNN introduce many specific difficulties to data sanity check. We present several challenges faced by such a sanity check as follows.

First of all, the state of the art still lacks a good interpretation of how a DNN works, for example, why a neuron in a DNN is activated and what its role is in the whole task. A set of inputs may share the same classification result, while the computation processes through the network vary from case to case. Without an accurate understanding of how a DNN works, it is very difficult to build a principle based on the execution dynamics to examine the validity of the input data.

Secondly, data sanity check is by nature an additional supportive component, which must not influence the original DNN model’s performance on valid inputs. A DL system’s performance relies heavily on the preset network structure and trained parameters. Any slight change to these components can greatly affect the performance. Hence, data sanity check should be a carefully designed mechanism that works in a non-intrusive manner. Straightforward approaches such as model reconstruction do not meet this requirement.

Last but not least, it is worth noting that data sanity check usually takes the form of a set of very brief inspections, rather than an exhaustive round of redundant testing. As many DL-based systems are deployed to handle real-time tasks, the additional sanity check on the input data should be accomplished with a low computation overhead. Hence, we expect a cost-efficient tool to perform such a process.

We will elaborate how SaneDL’s special design helps in handling these challenges in the next section.

IV Methodologies

In this section, we introduce the design of SaneDL. We first give an overview of SaneDL and then discuss its technical details.

IV-A Overview

Figure 5 shows the framework of SaneDL. For a pre-trained DNN model, the execution process of SaneDL follows the workflow below. First, a series of AutoEncoder-based assertions are inserted into the structure of the pre-trained DNN. For a given input, these assertions take the intermediate results of the neural network as inputs and evaluate their degree of anomaly via the reconstruction losses. We check each assertion by using properly preset thresholds to constrain the losses. The assertion results then determine the validity of the given input.

Fig. 5: Overview of SaneDL

It is worth noting that SaneDL is designed in a white-box manner, which considers a deep neural network as a sequential iterative program. Following the introduction in Section II, we summarize the computation process of a neural network with $n$ layers as the following functional expressions, with $x$ as the input data, $f_i$ as the connection functions, $\sigma_i$ as the activation functions, and $y_i$ as the output of layer $i$:

$$y_0 = x, \qquad y_i = \sigma_i\big(f_i(y_{i-1})\big), \quad i = 1, \dots, n.$$

This formulation makes it possible to retrieve the intermediate results (i.e., $y_i$) from the computation dataflow of the network. Similar to checking runtime internal states in conventional program testing, we focus on analyzing such intermediate data to discover abnormal behavior of the neural network. Specifically, our working process is built on the belief that the intermediate results of invalid inputs and those of training data show different feature patterns. We describe in detail how we evaluate such differences to indicate invalid inputs through the key steps mentioned above.

IV-B AutoEncoder based assertions

For conventional software, developers typically use assertions to check whether the system runtime satisfies certain conditions. Similarly, we use an unsupervised model, namely AutoEncoder, to construct assertions that set constraints on the intermediate result of each layer in the deep neural network.

AutoEncoder is a specifically tailored neural network capable of restoring its input data. Figure 6 presents a sample architecture of the AutoEncoder. Generally, it shares a similar working process with the classic neural network described in Section II. However, instead of giving classification results, such a network aims at reconstructing an output vector that is similar to the input vector (i.e., serving as a function whose output approximates its input). As shown in Figure 6, its network structure is organized in a symmetric form, which can be divided into two phases. Input data are first encoded into compressed vectors in the encoding phase, and then reconstructed back to their original sizes in the decoding phase. AutoEncoder requires that the reconstructed data be as close to the original input data as possible. A Mean Square Error (MSE) loss [3], which is calculated and minimized during training, is used to represent the distance between the input data and the reconstructed output. In conclusion, AutoEncoder can restore input data that follow the feature patterns seen in training at a low loss.

Fig. 6: Architecture of AutoEncoder

We introduce AutoEncoder (AE for short) in SaneDL to build assertions that conduct data sanity check. Specifically, given a pre-trained neural network, we insert AE networks between its layers. These AEs are trained using the intermediate results generated by the corresponding layers of the network. The AEs can then reconstruct the intermediate results of valid inputs. For invalid inputs, the reconstruction deviates greatly and yields a high loss value, because the intermediate results of invalid inputs show abnormal patterns that do not fit the AEs. In practice, for every input fed to the network, both the network output and the loss of every inserted AE are calculated simultaneously. We further set constraints by giving certain thresholds. Losses above the thresholds reveal assertion failures, which in turn indicate that the input is invalid. Algorithm 1 formally summarizes the checking process of a given AE-based assertion.

We choose AE as an essential module of SaneDL for three reasons. First, AE is an effective approach that turns data anomalies into directly observable deviating behaviors (i.e., failing to perform reconstruction). It is much more feasible to conduct assertions based on the deviation of certain behaviors than on an understanding of the intermediate results of neural networks. AE bridges this gap by providing access to the degree of anomaly of internal states via its reconstruction performance, which helps to achieve a better sanity checking effect.

Second, AE presents a high compatibility with the deep neural network. Given that AE is a neural network itself, it can be perfectly inserted between layers of a deep neural network. AE takes the output of its front layer as its input. Therefore, AE is a well-fitted extension for deep neural network to achieve additional functionalities.

Third, AE introduces low computation overhead in practice. Due to the conciseness of its structure, AE can be pre-trained and executed at a very low time cost. We compared AE with several other mechanisms that can also be used in such scenarios. Results show that AE achieves the best performance.

In conclusion, from a software engineering perspective, the inserted AEs work as assertions that ensure the system’s intermediate results follow a certain set of constraints. Invalid inputs are likely to break such constraints by generating abnormal intermediate results that the AEs fail to reconstruct. Given its good compatibility and cost-efficiency as well, we consider such AE-based assertions an effective approach to data sanity check.

IV-C Threshold setting and validity verification

With assertions inserted, we further introduce how we select appropriate thresholds for each assertion, as well as how we verify an input’s validity according to the assertion results.

IV-C1 Threshold setting

To achieve accurate detection of invalid inputs, it is vital to set a proper threshold on each assertion that constrains the valid upper bound of the corresponding loss. As the design philosophy of SaneDL is to exploit the differences between valid and invalid inputs via intermediate results, we choose to set the thresholds relative to the losses generated by the intermediate results of valid inputs. For a certain assertion, we calculate the mean loss by feeding the underlying AE with the intermediate results of the training data (i.e., the available valid inputs). The threshold is then set at a relative scale of this mean loss. Formally, it can be expressed as the following equation:

$$\mathit{thres} = \alpha \cdot \frac{1}{N} \sum_{i=1}^{N} \mathrm{MSE}\big(x_i, \mathrm{AE}(x_i)\big)$$

Here $x_i$ stands for the input of the AE, which is the intermediate result at a certain layer for the $i$-th case in the training set. $N$ is the total number of samples in the training set. $\alpha$ represents the scale coefficient, which determines the assertion threshold as a certain scale of the mean loss of valid inputs.

Instead of setting a different value of $\alpha$ for each assertion, we find in practice that a single, properly chosen value of $\alpha$ works well enough to conduct input sanity check. Therefore, we only need to preset one $\alpha$ rather than exhaustively fine-tune a series of values. The selection can be adjusted according to different scenarios. Since the threshold grows with $\alpha$, selecting a smaller $\alpha$ detects more real invalid inputs but may mistakenly reject some valid inputs as well, while selecting a larger $\alpha$ is more cautious in flagging inputs as invalid and may miss some true invalid cases. Our practice shows that selecting this scale coefficient in an interval from 2 to 4 (i.e., $\alpha \in [2, 4]$) results in good performance.
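The threshold rule described above (a scale coefficient times the mean reconstruction loss on valid data) can be sketched as follows. The `toy_autoencoder` is a hand-written stand-in for a trained AE and, like the function names and the sample data, is an assumption made only for illustration.

```python
def mse(original, reconstructed):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def assertion_threshold(autoencoder, valid_intermediates, scale):
    """Threshold = scale coefficient * mean loss over valid intermediate results."""
    losses = [mse(v, autoencoder(v)) for v in valid_intermediates]
    mean_loss = sum(losses) / len(losses)
    return scale * mean_loss


def toy_autoencoder(vector):
    """Stand-in for a trained AE: compress to the mean, then expand back."""
    average = sum(vector) / len(vector)
    return [average] * len(vector)


# Hypothetical intermediate results of three valid training cases.
valid = [[1.0, 1.0], [2.0, 4.0], [3.0, 3.0]]
thres = assertion_threshold(toy_autoencoder, valid, scale=3.0)
print(thres)
```

With a scale coefficient of 3, an input is flagged only when its reconstruction loss is more than three times the average loss observed on valid data.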

Algorithm 1 Assertion Check
Input: the intermediate result x given by the layer where the assertion is inserted, the assertion’s underlying AutoEncoder AE, preset threshold thres
Output: a boolean value result that represents the assertion result, i.e., pass or fail
function Assertion_Check(x, AE, thres)
     // get the reconstructed vector x' for x using AE
     x' ← AE(x)
     // compute MSE loss between x and x' as the reconstruction loss
     loss ← MSE(x, x')
     // check if loss exceeds the threshold
     if loss > thres then
         // if so, assertion failed
         result ← fail
     else
         // otherwise, it passed
         result ← pass
     end if
     return result
end function
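A runnable counterpart to Algorithm 1 might look like the sketch below. The `line_autoencoder` is a toy linear AE with hand-set tied weights rather than a trained model, and the threshold value is arbitrary; both are assumptions for illustration.

```python
def mse(original, reconstructed):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def assertion_check(intermediate, autoencoder, thres):
    """Return True if the assertion passes, False if it fails."""
    reconstructed = autoencoder(intermediate)   # reconstruct the vector
    loss = mse(intermediate, reconstructed)     # reconstruction loss
    return loss <= thres                        # fail when loss exceeds thres


DIRECTION = (0.6, 0.8)  # unit vector spanning the "valid" pattern (assumption)


def line_autoencoder(vector):
    """Toy linear AE: encode to one number along DIRECTION, decode back."""
    code = sum(v * d for v, d in zip(vector, DIRECTION))
    return [code * d for d in DIRECTION]


print(assertion_check([3.0, 4.0], line_autoencoder, thres=0.1))   # fits the pattern
print(assertion_check([4.0, -3.0], line_autoencoder, thres=0.1))  # does not fit
```

The first vector lies along the direction the toy AE can represent, so it is reconstructed almost exactly; the second is orthogonal to it, so the reconstruction collapses toward zero and the loss is large.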

IV-C2 Validity verification

After receiving the result of each assertion, SaneDL judges the input’s validity accordingly. In traditional software testing, it is common for a program to abort and raise an exception once an assertion failure occurs. SaneDL follows the same notion. If the reconstruction loss of any AE exceeds the corresponding threshold, the assertion fails and the input is detected as an invalid case.

Algorithm 2 Input Validity Verification
Input: the intermediate results Y of a test sample, a series of AutoEncoders AE, the index list L of layers where assertions are inserted, a series of preset thresholds T
Output: a boolean value validity that indicates whether the sample is valid (False means the sample is an anomaly)
// default validity set to True
validity ← True
for each layer index l in L do
     // check the corresponding assertion inserted at layer l
     if Assertion_Check(Y[l], AE[l], T[l]) = fail then
         // if assertion failed, set validity to False
         validity ← False
     end if
end for

We illustrate the input validity verification process in Algorithm 2. Specifically, for every assertion, we feed the output of the layer where it is inserted to its underlying AE. If the resulting reconstruction loss is larger than the preset threshold, the assertion fails, and SaneDL raises an exception to indicate that this input case is invalid. Only inputs that pass all the assertions are verified as valid inputs that the deep learning system can properly handle. Such a mechanism checks that, for a given input, the dataflow of the DNN follows the pre-trained feature patterns throughout its computation process. This conforms to the concept of reliability enhancement in conventional software testing, which requires every sub-process of a software system to work according to pre-defined requirements.
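The verification loop of Algorithm 2 can be sketched in Python as below. The AEs are toy stand-ins, and the layer indices, thresholds, and sample vectors are made up for illustration; stopping at the first failed assertion is a design choice of this sketch, not necessarily of SaneDL.

```python
def mse(original, reconstructed):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)


def mean_autoencoder(vector):
    """Stand-in for a trained AE: compress to the mean, then expand back."""
    average = sum(vector) / len(vector)
    return [average] * len(vector)


def verify_input(intermediates, autoencoders, layer_indices, thresholds):
    """Return True only if every inserted assertion passes."""
    for autoencoder, index, thres in zip(autoencoders, layer_indices, thresholds):
        layer_output = intermediates[index]
        loss = mse(layer_output, autoencoder(layer_output))
        if loss > thres:
            return False  # one failed assertion marks the input as invalid
    return True


# Assertions on layers 1 and 2 only (hypothetical choice of deeper layers).
aes = [mean_autoencoder, mean_autoencoder]
layers = [1, 2]
thresholds = [0.05, 0.05]

valid_case = [[9.0, 0.0, 1.0], [2.0, 2.1], [5.0, 5.0, 4.9]]
invalid_case = [[0.0, 0.0, 0.0], [2.0, 2.1], [1.0, 9.0, 5.0]]
print(verify_input(valid_case, aes, layers, thresholds))
print(verify_input(invalid_case, aes, layers, thresholds))
```

Layer 0 carries no assertion here, so its very uneven output in `valid_case` does not affect the verdict.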

IV-D Layer selection

Fig. 7: A sample structure of CNN

The implementation of SaneDL supports inserting assertions at any layer of the network. Intuitively, it is reasonable to insert assertions at all layers throughout the network to form an integrated data sanity check. However, our practice shows that it is better to insert the assertions at the layers that provide more representative feature patterns. It is widely accepted that a DNN’s classification functionality works in a gradually progressive manner as its computation proceeds through deeper layers [59]. In the superficial layers, a DNN only learns simple, intuitive features from the given input, while the deeper layers synthesize the previously learned results and provide more abstract features, which are more indicative of the final classification results. Hence, a proper selection strategy is to choose the last few layers of the DNN to insert the assertions. We believe that the outputs of these layers represent the model’s deep understanding of the input data. Our practice shows that such a selection leads to good performance.

On the other hand, a proper selection also improves the efficiency of SaneDL. A complex neural network usually consists of layers of large size (e.g., the convolutional neural network (CNN) [26] shown in Figure 7). The superficial layers give outputs of high dimension, which would require an equally high-dimensional AE network with massive neurons if an assertion were inserted there. As the computation proceeds through the network, however, the scale of the layer output is gradually reduced (e.g., via pooling layers), so deeper layers provide compressed outputs of low dimension. Selecting such layers to insert assertions avoids exhaustive computation on a large AE network and therefore reduces the computation overhead.
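To make the efficiency argument above concrete, the sketch below counts the parameters of a single-hidden-layer AE for two layer sizes. The layer dimensions and the 8x compression ratio are hypothetical and are not taken from the networks used in this paper.

```python
def autoencoder_parameter_count(input_dim, bottleneck_dim):
    """Weights and biases of a one-hidden-layer AE (encoder plus decoder)."""
    encoder = input_dim * bottleneck_dim + bottleneck_dim
    decoder = bottleneck_dim * input_dim + input_dim
    return encoder + decoder


# Hypothetical output sizes: an early convolutional feature map versus
# a late fully-connected layer.
early_conv_dim = 64 * 32 * 32
late_fc_dim = 256

early = autoencoder_parameter_count(early_conv_dim, early_conv_dim // 8)
late = autoencoder_parameter_count(late_fc_dim, late_fc_dim // 8)
print(early, late)  # 1073815552 16672
```

Under these assumptions, an AE attached to the early layer would need roughly a billion parameters, against a few thousand for the late layer.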

V Experiment

We present the effectiveness of SaneDL in this section. We first introduce the environment setup of the experiments. Then, we evaluate its performance in detail on detection accuracy, non-intrusive practice, and computation efficiency.

V-A Experimental Setup

Our experiments are designed on the problem settings described in the motivating examples of the previous section. Specifically, we build two experimental target scenarios. The first aims at detecting invalid inputs in the test set of a public dataset (Scenario I, hereinafter), and the second focuses on protecting a traffic sign recognition system from being misled by street commercial logos (Scenario II, hereinafter). For Scenario I, we build a test set that combines cases from Fashion-MNIST and MNIST, so as to test whether SaneDL can successfully enhance a model pre-trained on Fashion-MNIST. Similarly, for Scenario II, we train traffic sign recognition models on GTSRB (the German Traffic Sign Recognition Benchmark, a dataset consisting of traffic signs of different categories [22]), and then test the embedded SaneDL with a hybrid test set. This hybrid test set is built by combining extra cases from FlickrLogos-27 (a dataset containing real images of various brand logos that appear in the wild [24]) with the original test set.

We build pre-trained models in different network structures. For Fashion-MNIST, we use LeNet[28] and AlexNet[26]. And for GTSRB, the AlexNet and VGG16[47] are utilized. The structures of these selected networks are ascending in scale. We successfully embed SaneDL to these different models through proper insertions. It is expected that SaneDL should have a wide availability to network structures in different scales.

Last, our experimental environment is a personal computer with an Intel(R) Core(TM) i7-7700 CPU @ 4.00 GHz, 32 GB of memory, and an Nvidia GTX-1080 graphics card. The computation time collected is based on this hardware setting.

V-B Detection Accuracy

V-B1 Evaluation Metric

As our goal is to detect invalid inputs within a test set, an effective detection mechanism is expected to find as many invalid cases as possible while falsely rejecting as few valid cases as possible. Therefore, we consider the True Positive Rate (TPR) and the False Positive Rate (FPR) as two proper metrics to evaluate the performance of SaneDL. Formally, the two metrics can be expressed as the equations below, where $TP$ stands for true positives, $TN$ for true negatives, $FP$ for false positives, and $FN$ for false negatives:

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}$$

In our experiment settings, TPR marks the ratio of invalid cases detected by SaneDL, and FPR marks the ratio of valid cases that are falsely detected. Hence, the higher TPR and the lower FPR represent the better system performance.

Furthermore, it is worth noting that both TPR and FPR are computed under a certain setting of the scale coefficient for the threshold. To evaluate the overall performance under different thresholds, the ROC-AUC score is a widely accepted choice [23]. It measures the area under the ROC curve, where a high value indicates good overall performance. We also adopt it as one of our evaluation metrics.
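The three metrics can be computed from per-sample reconstruction losses as in the sketch below. It treats invalid inputs as the positive class, computes ROC-AUC through its standard rank interpretation (the probability that a random invalid case has a higher loss than a random valid one), and uses made-up loss values; none of these numbers come from the experiments.

```python
def tpr_fpr(losses, is_invalid, thres):
    """TPR and FPR when a sample is flagged invalid if its loss exceeds thres."""
    pairs = list(zip(losses, is_invalid))
    tp = sum(1 for loss, invalid in pairs if invalid and loss > thres)
    fn = sum(1 for loss, invalid in pairs if invalid and loss <= thres)
    fp = sum(1 for loss, invalid in pairs if not invalid and loss > thres)
    tn = sum(1 for loss, invalid in pairs if not invalid and loss <= thres)
    return tp / (tp + fn), fp / (fp + tn)


def roc_auc(losses, is_invalid):
    """Threshold-free score: fraction of (invalid, valid) pairs ranked correctly."""
    positives = [loss for loss, invalid in zip(losses, is_invalid) if invalid]
    negatives = [loss for loss, invalid in zip(losses, is_invalid) if not invalid]
    score = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                score += 1.0
            elif p == n:
                score += 0.5  # ties count as half
    return score / (len(positives) * len(negatives))


# Hypothetical losses: three valid cases followed by three invalid ones.
losses = [0.1, 0.2, 0.3, 0.9, 0.25, 1.2]
is_invalid = [False, False, False, True, True, True]

tpr, fpr = tpr_fpr(losses, is_invalid, thres=0.28)
auc = roc_auc(losses, is_invalid)
print(tpr, fpr, auc)
```

Moving `thres` trades TPR against FPR, whereas the AUC summarizes the ranking quality across all thresholds.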

V-B2 Experiment Data

Figures 8 and 9 show our experiment data. For Scenario I, we select images of handwritten digits as the extra invalid inputs for the model. Each category of invalid inputs contains about 1000 cases. For Scenario II, we choose five representative logos that often appear on street signs and set them as the invalid inputs for the traffic sign recognition system. Each category of invalid inputs contains about 200 cases.

(a) Categories of valid inputs in Scenario I
(b) Categories of injected invalid inputs in Scenario I
Fig. 8: Categories of experiment data for invalid input detection in Scenario I
(a) Categories of valid inputs for Scenario II
(b) Categories of injected invalid inputs for Scenario II
Fig. 9: Categories of experiment data for invalid input detection in Scenario II

V-B3 Detection Results

Table I and Table II show the detection results in the two scenarios. For each test setting, we inject invalid cases by randomly replacing some cases in the original test set with invalid input cases. We first inject cases of each invalid class separately, to show the detection effectiveness of SaneDL on each invalid class. Then, we randomly choose a subset of all invalid cases, containing various classes of data, to perform the injection. This evaluates SaneDL’s overall performance in the more complex scenario where different kinds of invalid inputs are mixed.
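The injection procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' script: the function name `inject_invalid`, the fixed seed, and the string placeholders standing in for images are hypothetical, and the number of replaced cases is a free parameter because the text does not state it.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def inject_invalid(
    test_set: Sequence[T],
    invalid_pool: Sequence[T],
    n_inject: int,
    seed: int = 0,
) -> Tuple[List[T], List[int]]:
    """Replace n_inject randomly chosen test cases with randomly chosen
    invalid cases. Returns the mixed test set and the ground-truth labels
    (1 = injected invalid case, 0 = original valid case)."""
    if n_inject > len(test_set) or n_inject > len(invalid_pool):
        raise ValueError("n_inject exceeds the number of available cases")
    rng = random.Random(seed)
    positions = rng.sample(range(len(test_set)), n_inject)
    replacements = rng.sample(list(invalid_pool), n_inject)

    mixed = list(test_set)
    labels = [0] * len(test_set)
    for pos, case in zip(positions, replacements):
        mixed[pos] = case
        labels[pos] = 1
    return mixed, labels


# Placeholders for images: ten valid cases and a pool of five invalid ones.
valid_cases = [f"fashion_{i}" for i in range(10)]
invalid_cases = [f"digit_{i}" for i in range(5)]

mixed, labels = inject_invalid(valid_cases, invalid_cases, n_inject=3)
print(sum(labels), "of", len(mixed), "cases are injected invalid inputs")
```

Restricting `invalid_pool` to one class gives the per-class settings, while passing a pool that mixes all invalid classes corresponds to the "Random" setting.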

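Dataset: Fashion-MNIST & MNIST

| Model | Injected Categories | TPR | FPR | AUC |
|---|---|---|---|---|
| LeNet | 0, 1 | 0.9702 | 0.0156 | 0.9952 |
| LeNet | 2, 3 | 0.9863 | 0.016 | 0.9974 |
| LeNet | 4, 5 | 0.9739 | 0.0153 | 0.9964 |
| LeNet | 6, 7 | 0.9884 | 0.0144 | 0.9972 |
| LeNet | 8, 9 | 0.9919 | 0.0156 | 0.9978 |
| LeNet | Random | 0.9818 | 0.0155 | 0.9969 |
| AlexNet | 0, 1 | 0.9995 | 0.0225 | 0.9991 |
| AlexNet | 2, 3 | 0.998 | 0.0213 | 0.9989 |
| AlexNet | 4, 5 | 0.9995 | 0.0212 | 0.9991 |
| AlexNet | 6, 7 | 1 | 0.0214 | 0.9989 |
| AlexNet | 8, 9 | 1 | 0.0216 | 0.9984 |
| AlexNet | Random | 0.9975 | 0.0221 | 0.9987 |

TABLE I: Evaluation results on invalid input detection in Scenario I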
Dataset: GTSRB & Flickr

| Model | Injected Categories | TPR | FPR | AUC |
|---|---|---|---|---|
| AlexNet | Adidas | 0.8516 | 0.1008 | 0.9531 |
| AlexNet | Apple | 0.9339 | 0.1005 | 0.9763 |
| AlexNet | CocaCola | 0.9857 | 0.1012 | 0.9857 |
| AlexNet | Pepsi | 0.9743 | 0.0948 | 0.9943 |
| AlexNet | Texaco | 1 | 0.097 | 0.9972 |
| AlexNet | Random | 0.9451 | 0.1002 | 0.9808 |
| VGG | Adidas | 0.8516 | 0.0857 | 0.945 |
| VGG | Apple | 0.9027 | 0.0818 | 0.9641 |
| VGG | CocaCola | 0.9964 | 0.0798 | 0.9947 |
| VGG | Pepsi | 0.9383 | 0.0838 | 0.9774 |
| VGG | Texaco | 0.9672 | 0.0758 | 0.982 |
| VGG | Random | 0.9268 | 0.0801 | 0.9716 |

TABLE II: Evaluation results on invalid input detection in Scenario II

In particular, we follow the guidance and insert assertions at the fully-connected layers. We also fine-tune the scale coefficient for the threshold separately for the two scenarios, i.e., for Fashion-MNIST test set validation and for traffic sign recognition system protection. These settings lead to the best TPR and FPR results in the tables, and the ROC-AUC score further evaluates the overall performance. As the tables show, SaneDL achieves a high TPR, a low FPR, and a high ROC-AUC score in both scenarios, which we consider satisfactory.

V-B4 Comparisons

To further show the effectiveness of SaneDL, we conduct comparative experiments against existing approaches that can be adapted to our problem setting. We consider two approaches. One is the softmax probability filtering (SPF) proposed in [18], which treats low softmax outputs as unconfident results to be filtered out. The other is a One-Class SVM (OCSVM) based approach for manipulated corner case detection presented in [56]. Both approaches perform well on their own target tasks, and we consider them useful reference points for our scenarios. We run these two approaches together with SaneDL in each of our two target scenarios, to test their performance on detecting real-world invalid inputs for pre-trained models.
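The idea behind the SPF baseline, as summarized above, can be sketched in a few lines: an input is flagged when the largest softmax probability falls below a confidence threshold. This is only an illustration of that idea, not the implementation from [18]; the function names, the example logits, and the threshold of 0.9 are assumptions.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def spf_is_invalid(logits: Sequence[float], min_confidence: float) -> bool:
    """Flag the input when the model's top softmax probability is low."""
    return max(softmax(logits)) < min_confidence


# Hypothetical logits from a three-class classifier.
confident_logits = [8.0, 0.5, 0.1]   # one class clearly dominates
uncertain_logits = [1.0, 0.9, 1.1]   # no class stands out

print(spf_is_invalid(confident_logits, min_confidence=0.9))
print(spf_is_invalid(uncertain_logits, min_confidence=0.9))
```

Because this filter only looks at the final output, an invalid input that the model happens to classify with high confidence passes unnoticed, which is one plausible reason a check on intermediate layer outputs can behave differently.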

(a) ROC curves of different approaches for invalid input detection applied in Scenario I
(b) ROC curves of different approaches for invalid input detection applied in Scenario II
Fig. 10: Comparison between SaneDL and existing approaches

Figure 10 shows the ROC curves of the three approaches according to their detection performance. In both of our target scenarios, the curve of SaneDL entirely covers those of the other two approaches, which means that SaneDL achieves a higher TPR at any FPR. The other two approaches, though performing well on their own target tasks, fail to give satisfactory results in these situations. Hence, in terms of the ROC-AUC score, SaneDL performs remarkably better than the two existing approaches.

V-C Retaining Original System Performance

By design, SaneDL works as a lightweight plugin for pre-trained DNN models. It performs data sanity checks in a non-intrusive manner, which means that the original performance of the pre-trained model is not affected. Several approaches from the artificial intelligence domain instead reconstruct the neural network with a new structure and retrain it to support an additional category for unseen data[7, 43]. However, reconstructing a deep learning system requires substantial additional human effort in collecting extra data, developing additional code, and fine-tuning parameters. Setting aside the infeasibility, introducing distractor data into the training set can also harm the system’s previous performance.

| Approach | Original Accuracy | Accuracy After Enhancement |
|---|---|---|
| Model Reconstruction | 0.9059 | 0.8830 |
| SaneDL | 0.9059 | 0.9059 |

TABLE III: Influence to model accuracy caused by different approaches

Table III shows a case where a distractor set is added as an additional class to retrain a Fashion-MNIST model. The accuracy of the model drops by about 2 percentage points after the reconstruction (from 0.9059 to 0.8830). It is worth noting that this accuracy is computed after removing the cases that are falsely classified into the additional category. This means that the additional distractor data can introduce extra confusion, which prevents the model from learning indicative features from the data of the target categories. In contrast, SaneDL only inserts assertions in the post-training phase and leverages the intermediate results to detect invalid inputs. No further changes are applied to the network structure or parameters. Hence, the results show that the system performance is not affected, which indicates that SaneDL is a non-intrusive and effective tool.

V-D Computation Efficiency

As previously introduced, an auxiliary tool for data sanity check should achieve its functionality with low computation overhead. Specifically, the computation overhead consists of both preprocessing cost and detection cost. SaneDL, designed as an efficient tool for data sanity check in DNNs, requires low computation cost in both the preprocessing and the detection phase. We compare SaneDL with the OCSVM approach mentioned above to show SaneDL’s advantages in efficiency. For SaneDL, the preprocessing cost is the training time of the AE based assertions, and the detection cost is the additional time it takes to verify the validity of given inputs. Similarly, OCSVM has the same two sources of computation cost: generating its one-class SVM based detector and making decisions on manipulated corner cases. We collect the computation time of these two phases for each approach when applying them in Scenario I. In particular, during the detection phase, we feed 12,000 cases as input to each approach and collect the overall computation time. The results are shown in Table IV.

In the preprocessing phase, the computation time of SaneDL (286s) is roughly 4.5 times lower than that of the OCSVM approach of DeepValidation[56] (1276s). This is because fitting an SVM on the high-dimensional data of a deep learning system is computationally expensive, while the AE based assertions are quite compatible with DNNs. As for the detection cost, SaneDL verifies 12,000 input cases in only one second, which is about 0.5% of the time that the OCSVM approach takes (186s). On average, SaneDL therefore checks one input in under 100 microseconds. It is highly efficient at detecting invalid inputs, and can even be applied to deep learning systems with real-time demands.

| Approach | Preprocessing Cost | Detecting Cost |
|---|---|---|
| OCSVM | 1276s | 186s |
| SaneDL | 286s | 1s |

TABLE IV: Computation cost for different approaches

VI Related Work

VI-A Input Validation and Sanitization in Software

In real production environments, input validation is usually required to prevent unexpected inputs from violating a system’s pre-defined functionalities. Existing approaches seek better ways to perform sanity checks for specific scenarios[17, 5, 2, 9, 6, 46, 44, 39]. In particular, some focus on checking user inputs for command based systems (CBS)[53]. For example, Hayes et al. propose a proof-of-concept tool, namely MICASA, to perform syntactic validation of erroneous user commands issued to a system[17]. Another line of work aims at verifying inputs for web applications to prevent potential threats[2, 9, 6, 39, 44]. Scholte et al. present IPAAS, which uses input data type detection to prevent the exploitation of XSS and SQL injection vulnerabilities in web applications[44]. Park et al. propose WAIDS, which adopts an anomaly intrusion detection model to protect web applications from input validation attacks[39]. Alkhalaf et al. design an automated differential repair technique that synthesizes redundant input validation code from multiple apps, in order to enhance the validation and sanitization checks[2]. All of this work shows that input validation plays an important role in maintaining software reliability.

Similarly, SaneDL is an input validation technique, designed for deep learning systems. Such systems also suffer considerably from the misbehaviour caused by invalid inputs. As previously introduced, inputs for deep learning systems are complex real-world cases (e.g., images), which carry less explicit semantic information on which to base a validation. Therefore, we build compatible assertions for neural networks, and achieve better input validation by checking the entire computation process.

VI-B Deep Learning Testing

Recent studies in the SE community recognize the inadequacy of traditional deep learning testing via statistical results (e.g., accuracy), which can hardly expose specific defects of a neural network. Therefore, several studies regard a deep neural network as a sophisticated piece of software and apply systematic testing mechanisms based on software engineering domain knowledge[48, 36, 33, 15, 32, 31, 49]. For example, Ma et al. propose DeepMutation, which adapts mutation testing from software engineering to evaluate the quality of test sets for deep learning[32]. Sun et al. present the first concolic testing approach for deep neural networks to efficiently find adversarial examples[49]. While these studies successfully adapt software engineering concepts to deep learning, they only address in-house testing before the system is deployed in practice.

Test case generation is another effective way to extend testing, since it makes more synthesized test samples available. Such generation is also welcomed in testing deep learning systems to enhance performance and reliability[60, 40, 51]. Tian et al. propose DeepTest, which generates cases through a set of transformations on original images to satisfy a neuron coverage criterion[51]. Zhang et al. present DeepRoad, which leverages a generative adversarial network (GAN) to generate images under different weather conditions to improve a deep learning based autonomous driving system[60]. Notably, one inspiring concurrent work by Wu et al. regards such misleading generated samples as corner cases of a deep learning system[56]. They introduce a transformation based strategy to generate corner cases, and detect these corner cases via an SVM based approach after injecting them into the test set. However, such cases can be the product of excessive manipulation, which violates the internal features of the original cases. Such manipulated inputs therefore rarely appear in real practice. In fact, what systems encounter most are true real-world cases without any artificial perturbation. Such cases can also lead to system misbehavior when they fall outside the valid input domain. Hence, our tool, SaneDL, focuses mainly on handling this situation. We conduct our experiments on detecting true real-world invalid input samples, and the satisfactory results reveal SaneDL’s value for improving reliability in practice.

VII Discussion

VII-A Concerns on Generalization Performance

Generalization is a critical property for machine learning models. It expects the model to extend its capability to new cases rather than to overfit the training set[42]. In this light, one may argue that the strict assertion mechanism of SaneDL can potentially harm the generalization performance of a DNN model. First of all, however, our experiments show a low FPR in practice, which indicates little influence on the handling of valid inputs. Moreover, the boundary between proper generalization and unacceptable misbehavior is fuzzy. For instance, considering the DNN model for recognizing flowers mentioned before, we may think it acceptable to recognize a bunch of chives as a daffodil, but recognizing a dog as a sunflower is definitely regarded as ridiculous misbehavior. Although these opinions are reasonable from a human perspective, those images are simply ordinary cases fed to the deep learning system. It is hard to tell the difference between these two faulty classification results from the system’s perspective. Furthermore, in critical scenarios, e.g., a traffic sign recognition system, any faulty behaviour is beyond tolerance. A classification regarded as a successful generalization may also incur unacceptable misbehavior in this case. Hence, SaneDL mainly focuses on reliability enhancement, which aims to ensure that no input leads to potential system risks. Such enhancement does require some sacrifice in generalization performance, but we show that it is valuable in critical scenarios by ensuring that the system’s essential functionalities work as expected.

VII-B Extra Strategy for Data Sanity Check

The validity verification of SaneDL follows the assertion mechanism in traditional software testing, where the computation process aborts and raises an exception once an assertion fails. However, in some cases, an overall consideration of every assertion result is needed to achieve a comprehensive detection of bugs. Some existing test frameworks (e.g., Selenium, an automated testing framework for web services) already support such a mild assertion mechanism. Similarly, for deep learning systems, different severity levels of data sanity check can be considered. Specifically, we can collect the loss from every AE based assertion and sum them up to obtain an overall loss. We then check whether the overall loss is under a certain threshold to determine the validity of the given input. In this case, some layer outputs are permitted to fail their individual assertions, and only the constraint on the overall loss determines the validity of the input. We also test this strategy in practice, and find that it is more cautious about flagging inputs as invalid and shows a lower FPR. Therefore, it can also be a choice when applied to non-critical scenarios.
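The difference between the strict mechanism and the milder overall-loss strategy can be sketched as follows. This is an illustration under assumptions, not SaneDL's code: the per-assertion losses are taken as given numbers, the names `strict_check`, `mild_check`, and `passes` are hypothetical, and choosing the overall threshold as the sum of the per-layer thresholds is only one possible choice.

```python
from typing import Callable, Sequence


class InvalidInputError(Exception):
    """Raised when an input is judged invalid by the sanity check."""


def strict_check(losses: Sequence[float], thresholds: Sequence[float]) -> None:
    """Abort at the first assertion whose loss exceeds its own threshold."""
    for index, (loss, limit) in enumerate(zip(losses, thresholds)):
        if loss > limit:
            raise InvalidInputError(f"assertion {index} failed: {loss} > {limit}")


def mild_check(losses: Sequence[float], overall_threshold: float) -> None:
    """Sum all assertion losses and compare only the total to one threshold."""
    overall_loss = sum(losses)
    if overall_loss > overall_threshold:
        raise InvalidInputError(f"overall loss {overall_loss} > {overall_threshold}")


def passes(check: Callable[..., None], *args) -> bool:
    """Run a check and report whether the input was accepted."""
    try:
        check(*args)
        return True
    except InvalidInputError:
        return False


# Hypothetical AE losses at three instrumented layers for one input.
layer_losses = [0.10, 0.45, 0.05]
layer_thresholds = [0.30, 0.30, 0.30]
overall_threshold = sum(layer_thresholds)  # assumed choice of overall threshold

print(passes(strict_check, layer_losses, layer_thresholds))  # False
print(passes(mild_check, layer_losses, overall_threshold))   # True
```

In this toy input the second layer violates its own assertion, so the strict check rejects it, while the summed loss stays below the overall threshold and the mild check accepts it. That is the mechanism by which the milder strategy trades some detections for a lower FPR.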

VIII Future Work and Conclusions

Deep learning systems have proven highly effective in real life, but they may suffer from invalid inputs encountered in complex circumstances. This work presents our efforts on handling invalid inputs that deep learning systems may come across in practice. We propose a white-box verification framework, namely SaneDL, to perform systematic data sanity check for deep learning systems via an assertion-based mechanism. It works in an efficient, non-intrusive manner to detect invalid inputs. Our experiments show its good performance in two practical problem scenarios, where the invalid input cases are real-world cases rather than manually generated toy cases. We find that SaneDL outperforms existing approaches. It can serve as a good supplement to a DL-based system to enhance its reliability.

In this work, we only apply SaneDL to DNNs with sequential structures and a feed-forward computation process. Networks with recurrent structures (e.g., LSTM[21]) are currently out of scope. Although such networks can be unrolled into a sequential form, which makes it possible to apply SaneDL to them, it remains of interest to explore the applicability of SaneDL to recurrent DNNs, and to further check its effectiveness in more scenarios, including natural language processing and audio processing. We expect SaneDL to be a general tool that can help promote the use of DL-based systems in real life, in particular in safety-critical scenarios.

References


  • [1] R. B. Abdessalem, S. Nejati, L. C. Briand, and T. Stifter (2016) Testing advanced driver assistance systems using multi-objective search and neural networks. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering, ASE, pp. 63–74. Cited by: §I.
  • [2] M. Alkhalaf, A. Aydin, and T. Bultan (2014) Semantic differential repair for input validation and sanitization. In International Symposium on Software Testing and Analysis, ISSTA, pp. 225–236. Cited by: §VI-A.
  • [3] D. M. Allen (1971) Mean square error of prediction as a criterion for selecting variables. Technometrics 13 (3), pp. 469–475. Cited by: §IV-B.
  • [4] G. Andrew, R. Arora, J. A. Bilmes, and K. Livescu (2013) Deep canonical correlation analysis. In Proceedings of the 30th International Conference on Machine Learning, ICML, pp. 1247–1255. Cited by: §II-B.
  • [5] A. Avancini (2012) Security testing of web applications: A research plan. In 34th International Conference on Software Engineering, ICSE, pp. 1491–1494. Cited by: §VI-A.
  • [6] D. Balzarotti, M. Cova, V. Felmetsger, N. Jovanovic, E. Kirda, C. Kruegel, and G. Vigna (2008) Saner: composing static and dynamic analysis to validate sanitization in web applications. In IEEE Symposium on Security and Privacy (S&P), pp. 387–401. Cited by: §VI-A.
  • [7] A. Bendale and T. E. Boult (2016) Towards open set deep networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1563–1572. Cited by: §I, §V-C.
  • [8] J. E. Boritz (2005) IS practitioners’ views on core concepts of information integrity. International Journal of Accounting Information Systems 6 (4), pp. 260–279. Cited by: §II-A.
  • [9] G. Buehrer, B. W. Weide, and P. A. G. Sivilotti (2005) Using parse tree validation to prevent SQL injection attacks. In Proceedings of the 5th International Workshop on Software Engineering and Middleware, SEM, pp. 106–113. Cited by: §VI-A.
  • [10] C. Cadar, V. Ganesh, P. M. Pawlowski, D. L. Dill, and D. R. Engler (2008) EXE: automatically generating inputs of death. ACM Trans. Inf. Syst. Secur. 12 (2), pp. 10:1–10:38. Cited by: §II-A.
  • [11] F. Chollet et al. (2015) Keras. Cited by: §III.
  • [12] L. Deng (2012) The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Process. Mag. 29 (6), pp. 141–142. Cited by: §III.
  • [13] J. Duan and V. Malichenko (2015) Real time road edges detection and road signs recognition. In International Conference on Control, Automation and Information Sciences, ICCAIS, pp. 107–112. Cited by: §I.
  • [14] D. Engler, D. Y. Chen, S. Hallem, A. Chou, and B. Chelf (2001) Bugs as deviant behavior: a general approach to inferring errors in systems code. In ACM SIGOPS Operating Systems Review, Vol. 35, pp. 57–72. Cited by: §I.
  • [15] D. Gopinath, G. Katz, C. S. Pasareanu, and C. Barrett (2017) DeepSafe: A data-driven approach for checking adversarial robustness in neural networks. Cited by: §VI-B.
  • [16] H. Greenspan, B. van Ginneken, and R. M. Summers (2016) Guest editorial deep learning in medical imaging: overview and future promise of an exciting new technique. IEEE Trans. Med. Imaging 35 (5), pp. 1153–1159. Cited by: §I.
  • [17] J. H. Hayes and A. J. Offutt (1999) Increased software reliability through input validation analysis and testing. In Proceedings 10th International Symposium on Software Reliability Engineering (Cat. No. PR00443), pp. 199–209. Cited by: §VI-A.
  • [18] D. Hendrycks and K. Gimpel (2016) A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136. Cited by: §I, §V-B4.
  • [19] G. E. Hinton and R. R. Salakhutdinov (2006) Reducing the dimensionality of data with neural networks. science 313 (5786), pp. 504–507. Cited by: §I.
  • [20] C. A. R. Hoare (2003) Assertions: A personal perspective. IEEE Annals of the History of Computing 25 (2), pp. 14–25. Cited by: §II-A.
  • [21] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §VIII.
  • [22] S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, and C. Igel (2013) Detection of traffic signs in real-world images: the German Traffic Sign Detection Benchmark. In International Joint Conference on Neural Networks. Cited by: §V-A.
  • [23] J. Huang and C. X. Ling (2005) Using AUC and accuracy in evaluating learning algorithms. IEEE Trans. Knowl. Data Eng. 17 (3), pp. 299–310. Cited by: §V-B1.
  • [24] Y. Kalantidis, L. G. Pueyo, M. Trevisiol, R. van Zwol, and Y. Avrithis (2011) Scalable triangulation-based logo recognition. In Proceedings of the 1st International Conference on Multimedia Retrieval, ICMR, pp. 20. Cited by: §V-A.
  • [25] Y. Kim (2014) Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882. Cited by: §II-B.
  • [26] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §IV-D, §V-A.
  • [27] S. Lai, L. Xu, K. Liu, and J. Zhao (2015) Recurrent convolutional neural networks for text classification. In Twenty-ninth AAAI Conference on Artificial Intelligence. Cited by: §II-B.
  • [28] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §V-A.
  • [29] N. Li, T. Xie, M. Jin, and C. Liu (2010) Perturbation-based user-input-validation testing of web applications. Journal of Systems and Software 83 (11), pp. 2263–2274. Cited by: §I.
  • [30] Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In 2015 IEEE International Conference on Computer Vision, ICCV, pp. 3730–3738. Cited by: §II-B.
  • [31] L. Ma, F. Juefei-Xu, F. Zhang, J. Sun, M. Xue, B. Li, C. Chen, T. Su, L. Li, Y. Liu, J. Zhao, and Y. Wang (2018) DeepGauge: multi-granularity testing criteria for deep learning systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE, pp. 120–131. Cited by: §VI-B.
  • [32] L. Ma, F. Zhang, J. Sun, M. Xue, B. Li, F. Juefei-Xu, C. Xie, L. Li, Y. Liu, J. Zhao, and Y. Wang (2018) DeepMutation: mutation testing of deep learning systems. In 29th IEEE International Symposium on Software Reliability Engineering, ISSRE, pp. 100–111. Cited by: §VI-B.
  • [33] S. Ma, Y. Liu, W. Lee, X. Zhang, and A. Grama (2018) MODE: automated neural network model debugging via state differential analysis and input selection. In Proceedings of the ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/SIGSOFT FSE, pp. 175–186. Cited by: §VI-B.
  • [34] V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML), pp. 807–814. Cited by: §II-B.
  • [35] M. M. Najafabadi, F. Villanustre, T. M. Khoshgoftaar, N. Seliya, R. Wald, and E. Muharemagic (2015) Deep learning applications and challenges in big data analytics. J. Big Data 2, pp. 1. Cited by: §II-B.
  • [36] A. Odena and I. J. Goodfellow (2018) TensorFuzz: debugging neural networks with coverage-guided fuzzing. Cited by: §VI-B.
  • [37] P. Oehlert (2005) Violating assumptions with fuzzing. IEEE Security & Privacy 3 (2), pp. 58–62. Cited by: §II-A.
  • [38] W. Ouyang and X. Wang (2013) Joint deep learning for pedestrian detection. In IEEE International Conference on Computer Vision, ICCV, pp. 2056–2063. Cited by: §II-B.
  • [39] Y. Park and J. Park (2008) Web application intrusion detection system for input validation attack. In Third International Conference on Convergence and Hybrid Information Technology, Vol. 2, pp. 498–504. Cited by: §VI-A.
  • [40] K. Pei, Y. Cao, J. Yang, and S. Jana (2017) DeepXplore: automated whitebox testing of deep learning systems. In Proceedings of the 26th Symposium on Operating Systems Principles, pp. 1–18. Cited by: §VI-B.
  • [41] G. K. Saha (2006) Application semantic driven assertions toward fault tolerant computing. Ubiquity 2006 (June), pp. 1:1–1:27. Cited by: §II-A.
  • [42] C. Schaffer (1994) A conservation law for generalization performance. In Machine Learning, Proceedings of the Eleventh International Conference, pp. 259–265. Cited by: §VII-A.
  • [43] W. J. Scheirer, A. de Rezende Rocha, A. Sapkota, and T. E. Boult (2013) Toward open set recognition. IEEE transactions on pattern analysis and machine intelligence 35 (7), pp. 1757–1772. Cited by: §I, §V-C.
  • [44] T. Scholte, W. K. Robertson, D. Balzarotti, and E. Kirda (2012) Preventing input validation vulnerabilities in web applications through automated type analysis. In 36th Annual IEEE Computer Software and Applications Conference, COMPSAC, pp. 233–243. Cited by: §VI-A.
  • [45] P. Sermanet and Y. LeCun (2011) Traffic sign recognition with multi-scale convolutional networks. In The 2011 International Joint Conference on Neural Networks, IJCNN 2011, San Jose, California, USA, July 31 - August 5, 2011, pp. 2809–2813. Cited by: §III.
  • [46] L. K. Shar and H. B. K. Tan (2012) Predicting common web application vulnerabilities from input validation and sanitization code patterns. In IEEE/ACM International Conference on Automated Software Engineering, ASE, pp. 310–313. Cited by: §VI-A.
  • [47] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §V-A.
  • [48] Y. Sun, X. Huang, and D. Kroening (2018) Testing deep neural networks. Cited by: §VI-B.
  • [49] Y. Sun, M. Wu, W. Ruan, X. Huang, M. Kwiatkowska, and D. Kroening (2018) Concolic testing for deep neural networks. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE, pp. 109–119. Cited by: §VI-B.
  • [50] K. Taneja, N. Li, M. R. Marri, T. Xie, and N. Tillmann (2010) MiTV: multiple-implementation testing of user-input validators for web applications. In 25th IEEE/ACM International Conference on Automated Software Engineering, ASE, pp. 131–134. Cited by: §I.
  • [51] Y. Tian, K. Pei, S. Jana, and B. Ray (2018) DeepTest: automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th International Conference on Software Engineering, ICSE, pp. 303–314. Cited by: §VI-B.
  • [52] P. Tsankov, M. T. Dashti, and D. A. Basin (2013) Semi-valid input coverage for fuzz testing. In International Symposium on Software Testing and Analysis, ISSTA, pp. 56–66. Cited by: §I.
  • [53] A. von Mayrhauser, J. Walls, and R. T. Mraz (1994) Sleuth: A domain-based testing tool. In Proceedings IEEE International Test Conference, TEST: The Next 25 Years, pp. 840–849. Cited by: §VI-A.
  • [54] J. Wagner, V. Kuznetsov, G. Candea, and J. Kinder (2015) High system-code security with low overhead. In 2015 IEEE Symposium on Security and Privacy, SP, pp. 866–879. Cited by: §II-A.
  • [55] L. J. White and E. I. Cohen (1980) A domain strategy for computer program testing. IEEE Transactions on Software Engineering (3), pp. 247–257. Cited by: §I.
  • [56] W. Wu, H. Xu, S. Zhong, M. R. Lyu, and I. King (2019) Deep validation: toward detecting real-world corner cases for deep neural networks. In 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). Cited by: §V-B4, §VI-B.
  • [57] H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §III.
  • [58] J. Xie, R. B. Girshick, and A. Farhadi (2016) Unsupervised deep embedding for clustering analysis. In Proceedings of the 33rd International Conference on Machine Learning, ICML, pp. 478–487. Cited by: §II-B.
  • [59] M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Cited by: §IV-D.
  • [60] M. Zhang, Y. Zhang, L. Zhang, C. Liu, and S. Khurshid (2018) DeepRoad: gan-based metamorphic testing and input validation framework for autonomous driving systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE, pp. 132–142. Cited by: §VI-B.