Machine learning with deep neural networks (i.e., deep learning, or DL) has proven effective in many specific tasks, including object recognition and natural language processing, and even in traditional domains, e.g., biomedical science, business strategy, and education. Recent years have witnessed the widespread adoption of DL in daily life. The hardware industry has also provided many DL solutions for mobile devices. For example, Apple Inc. has added dedicated chips to its smartphones to handle DL-based tasks, such as the virtual assistant and photo airbrushing. As a result, including DL-based functionality in traditional software design has become a trend. More and more developers, including those who are not machine learning experts, use DL models in their software.
Fortunately, modern DL software frameworks, e.g., [1], make it quite convenient to develop DL applications without much professional DL knowledge. For example, a DL-based image object recognition application can be implemented within tens of lines of code using TensorFlow. Thanks to the good design of such frameworks, the implementation of a DL application rarely introduces traditional software bugs, since the code typically does not contain complicated control flow.
However, DL applications still frequently exhibit undesirable behaviors, leading to unexpected mistakes. For example, an Amazon Echo device randomly blasted disturbing music. Even in safety-critical applications, such as DL-based self-driving cars, DL defects are inevitable and may jeopardize lives. How to reduce such mistakes, in other words, how to improve the reliability of DL-based applications, has become an urgent research problem.
Unfortunately, machine learning techniques, including DL, are essentially based on statistical models. People can always find or generate examples that cause DL applications to make mistakes. Although many methods have been proposed to make DL applications robust [38, 35], even state-of-the-art strategies cannot guarantee 100% correctness. In other words, DL applications inevitably make mistakes due to the nature of DL.
On the other hand, human factors may also lead to poor performance of DL applications. In traditional software, the execution results depend on the input and the code logic. Similarly, the execution of a DL application is determined by the input as well as the DL model, which is obtained by training a deep neural network with a set of training data. Hence, model defects such as a bad network design or an improper training process with improper data can also cause DL applications to misbehave. However, for DL practitioners, especially inexperienced ones, it is very hard, if not impossible, to figure out whether the low accuracy of a DL model is an inevitable result or is caused by model defects that can be fixed.
This paper presents DeepMorph (Deep neural networks Tomography), a tool to analyze the root cause of DL model defects. We approach the problem by first carefully analyzing the root causes of low accuracy. DL algorithms generally assume that the training data have a distribution similar to that of the data encountered in production environments. However, the distribution of the latter is hard to obtain in reality. The gap between the distributions may degrade the performance of a DL model. In addition, the training data may be unreliable, especially when labeled manually. Mislabeled training data may also degrade DL model performance. Finally, different network structures differ in their power to draw statistical rules from the training data. If a developer adopts an improper network structure and trains a DL model accordingly, its performance may not be satisfactory.
DeepMorph is specifically tailored for detecting these typical defects. We consider the execution of a DL model as a functional composition of the functions defined by each network layer in the model. Hence, similar to the functional programming paradigm, the computation of each layer is idempotent in the sense that its output depends only on its input. This allows us to model the execution process by extracting internal data flow footprints, i.e., the intermediate outputs of every layer. We find that such footprints provide insights to locate the root cause effectively, which can directly guide a developer in improving the DL model.
The contributions of this work are highlighted as follows.
We systematically analyze the execution process of DL models from a software engineering perspective. We interpret how the intermediate outputs of the hidden layers describe the execution process of a DL model.
We present a new method to extract the internal data flow footprints, which can be used to reason about poor performance and locate the corresponding defects.
We implement DeepMorph based on TensorFlow, a widely adopted DL software framework. We show the effectiveness of DeepMorph in locating model defects with different DL models trained on four popular datasets.
The rest of this paper is organized as follows. We introduce related work in Section II. We present background knowledge, in particular, that on DL techniques, in Section III. Section IV introduces the motivation of this work with an example. We elaborate our DeepMorph design in Section V. Section VI provides the details of our experimental study. We provide further discussions and conclude the work in Section VII.
II Related Work
In recent years, machine learning with deep neural networks has surged in popularity in many application areas [18, 43, 6]. Much work has also proven the effectiveness of DL in solving traditional software engineering problems [24, 21, 9, 17]. Feng et al. propose an LSTM-based network to perform anomaly detection in time-series execution data. Nie et al. propose to predict GPU errors in HPC systems with DL.
Meanwhile, with the wide adoption of machine learning-based applications, the reliability and security of machine learning-based software also attract much research attention [2, 14, 25, 49]. Rajabi et al. leverage out-distribution learning to augment convolutional neural networks, so that the networks can reject adversarial examples and reduce the misclassification rate. Lu et al. conduct attacks on MagNet, a network supposed to be resilient to such attacks, and analyze the defects in designing threat models. Wen et al. propose the detector-corrector network to detect adversarial examples in DL models. DL techniques are inherently probabilistic, so DL models inevitably make mistakes at times. This line of work mainly focuses on understanding and addressing such inherent defects in DL models. Our work, in contrast, intends to locate model defects introduced by improper network design or faulty training data, and to direct developers to further improve their DL models.
Pei et al. first propose DeepXplore, which relies on neuron coverage as a coverage metric to systematically test deep learning systems. Similar to DeepXplore, Ma et al. further define both neuron-level and layer-level coverage criteria to help gauge the testing quality of DNNs. However, it is still unclear whether such coverage criteria are appropriate to measure the comprehensiveness of DNN testing, due to the low neuron activation rate. These efforts mainly focus on designing coverage criteria to address DNN model testing problems. Tian et al. implement DeepTest for automatically generating test cases through image transformations and detecting erroneous behaviors of DL-based self-driving cars. Zhang et al. implement DeepRoad, which applies Generative Adversarial Networks (GANs) and uses metamorphic testing to validate inputs for self-driving cars. Both DeepTest and DeepRoad intend to test DL-based self-driving vehicles by generating new test cases. Ma et al. propose MODE, which improves the training performance of DL models through input selection. MODE assumes that DNN bugs are caused by data, and performs a differential analysis to locate useful features for selecting better inputs for training DL models. In this paper, we present DeepMorph, which considers three typical model defects and locates them by analyzing the data flow footprints inside DNN models. In contrast to existing approaches, DeepMorph focuses on locating the root cause of poor model performance, instead of aiming solely at performance improvement.
Deep neural networks (DNNs) are designed to automatically draw statistical rules from training data. In this section, we introduce preliminary knowledge on DNNs with a focus on their typical structures.
A DNN usually consists of a number of neurons with a layered, connected structure, where a neuron is the basic computational unit. Figure 1(a) shows the structure of a simple neural network. Each connection between two neurons, shown as a line in the figure, represents a data flow link in the network, which is decorated with a weight parameter.
A neuron can then calculate the weighted sum of all its inputs, add a bias, and finally evaluate its output with an activation function, as shown in Figure 1(b). Specifically, the output is $y = \phi(\sum_{i} w_i x_i + b)$, where $x_i$ are the inputs of the neuron, $w_i$ the corresponding weights, $b$ the bias, and $\phi$ the activation function. The activation function $\phi$ is typically a non-linear function, which introduces non-linearity into a network. A network can thus perform non-linear calculation. In other words, it is capable of fitting a complicated function, which is critical for dealing with complicated data such as images, audio, and videos. The most widely used activation function is the ReLU (Rectified Linear Unit) function.
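The computation of a single neuron can be illustrated with a minimal Python sketch. The function names and the numeric values are hypothetical and chosen only for illustration; they do not come from any particular framework.

```python
def relu(v):
    """Rectified Linear Unit: keep positive values, clamp the rest to zero."""
    return max(0.0, v)


def neuron_output(inputs, weights, bias, activation=relu):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum + bias)


# Hypothetical values, chosen only for illustration.
positive = neuron_output([1.0, 2.0], [0.5, 0.25], bias=0.5)   # 0.5 + 0.5 + 0.5 = 1.5
negative = neuron_output([1.0, 2.0], [-1.0, -1.0], bias=0.5)  # -3.0 + 0.5 < 0, so ReLU yields 0.0
```

The second call shows the non-linearity: a negative pre-activation value is clamped to zero instead of being passed on.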
A layer in a network can be considered as a set of neurons located at the same level in the network structure, where neurons in the same layer do not connect to each other. A network includes an input layer, one or more hidden layers, and an output layer. The input layer directly connects to the input data (i.e., a vector), where each neuron connects to one dimension of the data. The hidden layers are typically designed to extract multiple levels of representation and abstraction from the input data. Data flow from the input layer, through the neurons in between, to the output layer so as to yield a result. For example, an image can go through a network, which outputs the object it contains (e.g., a flower, a bird, or an aircraft).
To deal with multi-class classification problems (i.e., determining which of the $N$ classes the input belongs to), the output layer commonly uses the softmax activation function. It is shown below, where $z$ is a vector of the inputs to the softmax function, and $i$ is the index of each target class.
$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{k=1}^{N} e^{z_k}} \tag{1}$$
The output of this function is a vector of size $N$. The value of each element $\sigma(z)_i$ ($i$ = 1, 2, …, $N$) of the vector is in the interval [0, 1]. Moreover, $\sum_{i=1}^{N} \sigma(z)_i = 1$. In this way, $\sigma(z)_i$ represents how likely the input belongs to class $i$.
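The softmax function can be sketched in a few lines of Python. This is a minimal illustration rather than a framework implementation; subtracting the maximum score is a common numerical-stability measure that we assume here, and the input scores are hypothetical.

```python
import math


def softmax(z):
    """Map a list of real-valued scores to likelihoods that sum to 1."""
    shift = max(z)  # subtracting the max avoids overflow and leaves the result unchanged
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for a three-class problem.
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))  # 0-based index of the most likely class
```

Every element of `probs` lies in [0, 1] and the elements sum to 1, so the vector can be read as the likelihood of each class.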
Essentially, a DL practitioner will typically design the structure of the proposed DNN, including the layers, how the layers are connected, and the activation functions. The network structure is specifically tailored for a problem domain. For example, the VGG-16 structure is designed for object image recognition. The training process of a neural network is to find the weights so that the network best produces the expected results, typically with backpropagation.
Finally, it is straightforward to see that a network, together with the weight parameters between neurons, determines how it processes the input to generate the output. We call a network (including the weight parameters between neurons and the network structure) a DL model, or in short, a model. The performance of a model, i.e., whether it can produce correct output given an input, is largely determined by the weight parameters and the network structure. Since the parameters are obtained during the training process with the training data, the performance is ultimately determined by the network structure and the training data.
DL models generally cannot achieve 100% accuracy due to their basis: they draw rules statistically from the training data. When deployed in a production environment, a model may encounter inputs that do not follow the statistical rules, and consequently produce faulty results.
When a DL developer builds a model for a specific task, it is quite hard for her to understand whether there is room to improve its performance. For example, suppose she builds a deep neural network model to recognize flowers, and the model achieves 80% accuracy in the production environment (i.e., 20% of the inputs are misclassified). It is in fact a very challenging problem for her to tell whether and how this accuracy can be improved. She may make mistakes in building the model, which bring defects to the model. Such defects can then possibly be fixed to improve the accuracy.
This problem has become more severe as deep learning has surged in popularity. More and more developers, who may not be experienced machine learning practitioners, have included deep learning techniques in their applications.
We further illustrate this problem with a simple motivating example. Let us suppose a developer builds a simple DNN model to recognize digits. The developer trains the network with MNIST, a widely adopted handwritten digits dataset. With modern deep learning software frameworks, e.g., TensorFlow, it is quite convenient for her to build her own models. Suppose she builds a five-layer (excluding the trivial input layer by convention) network with TensorFlow. To facilitate our discussions, we name these layers $L_1$, …, $L_5$ respectively, where $L_1$ connects to the input layer and $L_5$ is the output layer.
We show her code in Figure 2, which contains three parts: loading the MNIST dataset, defining the network structure, and training the model with the data. We can see that the implementation is quite easy to follow. But simple as it is, the code may still produce a model with low accuracy in recognizing digits. For example, the model may be trained with improper training cases, which bear different statistical characteristics from those met in practice. As a result, the network is misled by the training data, and generates a model with wrong statistical rules. Poor accuracy is then inevitable in practice.
For illustration purposes, suppose that some images in the training data that actually contain digit ’4’ are wrongly labeled as ’9’. Note that in typical machine learning applications, training data are usually labeled manually. Errors in training data introduced by human factors are not rare.
After the developer trains the model with the data that contain errors, the model is applied to recognize handwritten digits. Figure 3 shows a set of errors the model produces, where images containing digit ’4’ are misclassified as containing digit ’9’. We can see that with these faulty cases only, it is still hard for the developer to know whether she has mistakenly introduced defects to the model, and if so, how to locate the root cause. Current tools, e.g., DeepXplore and DeepTest, generally focus on generating effective test cases. They do not help the developer pinpoint the defects either.
As we have discussed in Section III, an output is produced after data flow from the input layer, through every hidden layer, to the output layer. Motivated by this nature of neural networks, we consider that the internal data flow footprints (i.e., the output of each layer) may shed light on locating the root cause of the faulty cases.
To this end, we compare how similar a faulty case (i.e., digit ’4’ wrongly identified as ’9’) is to the cases correctly identified as digit ’4’ and to those correctly identified as digit ’9’. For the output of every layer from these three kinds of cases, we use the t-distributed stochastic neighbor embedding (t-SNE) algorithm, which is commonly used for data exploration and visualization, to visualize these high-dimensional data.
The visualization results are shown in Figure 4, where every shape represents the intermediate output of one test case from the corresponding layer, and the distance between two shapes reflects the similarity of the outputs. We can find that the intermediate outputs of correct test cases are intermingled in layer , separable in layer , and totally separated in layer . We pay attention to the outputs of the faulty cases that are wrongly identified as ’9’ instead of ’4’, especially from layer to since the outputs of later layers are more distinguishable than those of earlier ones.
It is worth noting that the output of faulty case 1 in each layer is closer to the outputs of the correct ’9’ cases than to those of the correct ’4’ cases. In other words, the model has wrongly considered the ’4’ in the faulty case as a ’9’ from the very beginning (i.e., the first layer), throughout every hidden layer, and eventually recognizes it as a ’9’ in the output layer. Although the original case looks like a ’4’ to a human being (see Figure 3), the model has persistently considered the faulty case more similar to a ’9’. Such a process of yielding the wrong result ’9’ indicates that the model is quite certain that the faulty case is a ’9’, although it is a ’4’ to a human being. The developer may then conclude that the model has been trained with similar cases (where a ’4’ is labeled as ’9’), so that the model considers the faulty case a ’9’ throughout all the hidden layers.
On the other hand, the outputs of faulty case 2 are more similar to the correct ’9’ cases in layer to , and become closer to the correct ’4’ cases in layer and . Indeed, the model extracts informative features from these faulty cases, but the extracted features may not be effective enough to produce the correct results. The developer may consider improving the representation power of the model by changing the model structure.
To this end, she can then be directed to inspect the training cases, remove those wrongly-labeled data, and train a new model with new structure using the polished training data.
In conclusion, when a DL model suffers from poor performance, it is hard for its developer, especially an inexperienced one, to figure out whether the results are reasonable or caused by her mistakes. The above motivating example demonstrates that it is promising to pinpoint the root cause with the internal data flow footprints. This work aims at designing and implementing a tool that can automatically analyze the defects in DL models by considering the internal data flow footprints, and direct developers to effectively enhance the model.
V Locating Model Defects with Data Flow Footprints
Suppose a developer builds a DL model for a specific task and the model suffers from poor performance in the production environment. It is possible that she made mistakes in building the model, introducing defects to the model. DeepMorph is a tool designed to help the developer analyze whether there is a potential defect that causes the poor performance of the model. As we discussed in the motivating example in Section IV, it is promising to locate the root cause of poor model performance by analyzing the internal data flow footprints. This section illustrates how we realize an automatic approach to this end. We first show the design overview of DeepMorph.
V-A DeepMorph design overview
Figure 5 overviews the design of DeepMorph. When the performance of a deep neural network is lower than expected, DeepMorph first builds the softmax-instrumented model via adding auxiliary softmax layers to the target model. The softmax-instrumented model is used to learn the execution pattern of the training cases for each target class. Then DeepMorph feeds the faulty cases to the softmax-instrumented model, which extracts data flow footprint specifics from the intermediate outputs of hidden layers in the target model.
The footprint specifics are capable of representing the classification process, and allow DeepMorph to compare the footprints against the execution pattern of each target class. By examining, layer by layer, how inputs are misclassified, DeepMorph can then reason about the defect that causes the faulty cases.
Next, we discuss the defects considered in this work in Section V-B. We then discuss why internal data flow footprints of a DNN model can be used to model the execution of a model in Section V-C. How DeepMorph models and analyzes the data flow footprints is presented in Section V-D, followed by how such data can be used towards defect localization in Section V-E.
V-B Defects of DL models
As discussed in Section III, the process of building a DL model typically includes preparing training data and designing the structure of a deep neural network. It is natural to see that the structure design, including the layers, how the layers are connected, and the activation functions, has a direct impact on the model performance. In other words, poor performance may result from a bad structure, which we name a structure defect.
A structure defect may result in an underfitting or overfitting problem. Underfitting appears when the model is not powerful enough to deal with complicated data. A simple example is using a shallow network to deal with complicated machine learning tasks (e.g., object recognition in real-life images). Such a network structure, even trained with correct training data, may not be capable of drawing suitable statistical rules from the data. On the other hand, overfitting refers to a model that fits the training data too well. The model may learn the details and noise in the training data, which do not apply to new data and negatively impact the ability of the model to generalize. Poor performance is then inevitable.
An inexperienced developer may instinctively blame the network structure when the model performs poorly. However, a structure defect may not be the sole cause of poor model performance. The training data determine the weight parameters of the model via the training process. The parameters, together with the network structure, determine the correctness of the output given an input. Hence, the model performance is determined by both the network structure and the training data. Errors in training data also cause poor performance because statistical machine learning, including DL, requires that the training data bear a distribution similar to that of the data encountered in practice. However, data in the production environment are hard to obtain and comprehensively analyze beforehand. If the distribution of the training data differs from that of the data encountered in the production environment, the statistical rules learned from the training data cannot be correctly applied in the production environment, which leads to poor performance.
For example, it is found that a DL-based self-driving car tends to make mistakes on rainy or foggy days [47, 53]. The reason is that most of the training data are collected on sunny days. The model is biased by such data, resulting in poor performance of the DL-based self-driving car in bad weather.
Moreover, as we have discussed, training data are in general labeled manually. Human labeling is unreliable in nature. In other words, the training data may contain incorrect labels. Such unreliable training data can mislead the training process and introduce incorrect statistical rules to the model as well.
For example, suppose $a$ and $b$ are similar cases, both actually belonging to class $C_1$. If $a$ is in the training data while being falsely labeled as class $C_2$, the model may learn the features of $a$ during the training process, and falsely consider them the features of class $C_2$. When the model meets $b$ in the production environment, it may conclude that $b$ also belongs to class $C_2$. Hence, unreliable training data may also lead to poor performance, as we have shown in the motivating example in Section IV. It is worth noting that locating such defects is not the focus of current DL testing tools, e.g., DeepXplore and MODE.
Based on the above discussions, we summarize three representative types of model defects considered in this paper as follows.
Structure Defect (SD): The improper network structure leads the model to learn inappropriate features from the training data.
Insufficient Training Data (ITD): The distribution of the training data is different from that of the data in production environment.
Unreliable Training Data (UTD): The training set contains falsely labeled cases.
For the SD case, the developer can improve the structure, e.g., with a more powerful network structure, and retrain the model. For the ITD case, the developer can retrain the model with more data similar to the faulty cases, e.g., with data augmentation techniques, to improve the performance. For the UTD case, the developer can carefully inspect the training data, correct the errors, and then retrain the model. A possible way to reduce human inspection effort is to focus on the training data that are similar to the faulty cases.
V-C Execution footprints of DL models
To locate the root cause of poor model performance, we first analyze how an output is produced by a model. To facilitate the discussion in this paper, suppose the network has $n+2$ layers, including $n$ hidden layers denoted by $L_j$ ($1 \le j \le n$), the input layer $L_0$, and the output layer $L_{n+1}$. Let $W_j$ denote the weights of layer $L_j$, $x_j$ the input to layer $L_j$, $b_j$ the bias, and $\phi_j$ the activation function.
As introduced in Section III, the computation of a layer $L_j$ ($1 \le j \le n+1$) can be presented as follows.
$$f_j(x_j) = \phi_j(W_j x_j + b_j)$$
Note that the output of each layer is fed as the input to its next layer, and the input data are fed from the input layer, processed through every hidden layer, to the output layer. The entire computational process of a DL model can be treated as the functional composition of the functions $f_j$. In other words, since $x_{j+1} = f_j(x_j)$, the output $y$ of the model given an input $x_1$ can be written as:
$$y = f_{n+1}(f_n(\cdots f_2(f_1(x_1)) \cdots))$$
Similar to the functional programming paradigm, where the return value of a function depends only on its arguments, the output of each layer in a DL model is determined solely by its input. In this sense, we say that every layer in a DL model is idempotent: feeding the same input always yields the same output. Generally speaking, we can consider that each hidden layer conducts some computation to extract specific features from its input via the weighted sum of its input plus the bias, and then leverages the activation function to remove some redundant information while preserving useful features. For example, the ReLU activation function sets all negative values to zero and keeps all positive values.
Take a classification task (e.g., recognizing a handwritten digit) as an example. Since a hidden layer takes the output of its previous layer as input, as the data flow through the network, we can consider that the output data of every hidden layer are made “distinguishable” with more distinct features in terms of the classification task. It is expected that the features extracted eventually by the last hidden layer can best help the output layer produce a correct result.
To understand the execution process of a program, traditional approaches generally resort to code coverage metrics (e.g., whether a line of code is executed). But the execution process of a DL model in general cannot be formulated as code coverage, due to its structural specifics. In the functional programming paradigm, observing whether a function produces the correct result is a common method to locate bugs. Inspired by the idempotent character of the layers in DL models, we can consider that the intermediate results of the hidden layers, given an input, encapsulate all the execution information of a model. In this regard, the intermediate results are the data flow footprints of how a model processes an input, which can tell why the model produces a correct or incorrect result. This is the basis for DeepMorph to perform root cause analysis of poor model performance.
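The view of a model as a composition of layer functions, whose intermediate results form the data flow footprints, can be sketched as follows. This is an illustrative sketch in plain Python, not the DeepMorph implementation; the helper names (`make_layer`, `run_with_footprints`) and the toy two-layer network are hypothetical.

```python
def relu_vec(v):
    """Element-wise ReLU on a list of numbers."""
    return [max(0.0, a) for a in v]


def make_layer(weights, biases, activation):
    """Return f(x) = activation(W x + b) as a function of x only."""
    def layer(x):
        pre = [sum(w * a for w, a in zip(row, x)) + b
               for row, b in zip(weights, biases)]
        return activation(pre)
    return layer


def run_with_footprints(layers, x):
    """Apply the layers in order and record the output of every layer."""
    footprints = []
    for layer in layers:
        x = layer(x)          # the output of one layer is the input of the next
        footprints.append(x)  # intermediate result kept as a footprint
    return x, footprints


# Hypothetical toy network: one hidden layer followed by one output layer.
layers = [
    make_layer([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0], relu_vec),
    make_layer([[1.0, 1.0]], [0.5], relu_vec),
]
output, footprints = run_with_footprints(layers, [2.0, 1.0])
```

Because every layer depends only on its input, running the same input twice yields identical footprints, which is what makes them usable as a record of the execution.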
Finally, it is worth noting that the intermediate outputs of every layer typically contain a large amount of data, which is too complicated for direct analysis. DeepMorph needs to determine which information in the footprints is effective for the root cause analysis task. Next, we elaborate how DeepMorph collects effective information from the data flow footprints and accordingly performs root cause analysis of the faulty cases, with a discussion of how DeepMorph addresses the above challenge.
V-D Modeling execution footprints
A straightforward approach to analyzing a faulty case is to compare how the model processes the faulty case with how it processes correct cases. As we have discussed, the intermediate results of each hidden layer reflect the execution process of a model. Hence, we propose to make use of the intermediate outputs of the hidden layers for defect localization. Specifically, DeepMorph compares how the model processes a faulty case and the training cases by extracting footprint specifics from such data flow footprints.
Let us suppose there are $m$ training cases for a classification task, each labeled as one of the $N$ classes. Class $i$ ($1 \le i \le N$) contains training case set $T_i$. The data flow footprints of the training cases in $T_i$ form a set $F_i$, and those of the faulty cases are denoted by $F'$.
To compare the execution processes for the faulty cases with those of the training cases, one may immediately suggest comparing the distribution of $F_i$ with $F'$. Specifically, we can compare the distributions of the intermediate outputs of $T_i$ from each hidden layer with those of the faulty cases to investigate how the cases are eventually misclassified. As discussed in Section V-C, according to the functional nature of deep neural networks, it is expected that each hidden layer extracts more and more subtle features that can help the output layer correctly perform the machine learning task. The output data of the hidden layers, from the first to the last, should tend to be more distinguishable in terms of the classification task. Therefore, comparing the features of the faulty case extracted in each layer with those of each $T_i$ in the training data can help us understand whether the features of the faulty case are correctly extracted. For example, we can see whether such features are getting less distinguishable or whether they persistently mislead the model to misclassification.
A widely adopted way to compare a data item (i.e., that of the faulty case) to the distribution of a set of data items (i.e., all the data of the training cases) is to calculate their Mahalanobis distance. However, calculating the Mahalanobis distance is computation-intensive, especially when there is a large number of training cases and the data volume of the intermediate results is huge. Moreover, such a distance best models how close a point is to a distribution under the assumption that the distribution is Gaussian, which may not hold for the intermediate results.
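To make the cost concrete, the following sketch computes the Mahalanobis distance for two-dimensional data in plain Python, with the inverse of the 2x2 covariance matrix written out by hand. It is a simplified illustration under our own assumptions (two dimensions, sample covariance, hypothetical data); for $d$-dimensional intermediate results, the covariance matrix has $d \times d$ entries and must be estimated and inverted, which is where the computational burden comes from.

```python
import math


def mahalanobis_2d(point, samples):
    """Mahalanobis distance from a 2-D point to the distribution of 2-D samples."""
    n = len(samples)
    mean_x = sum(s[0] for s in samples) / n
    mean_y = sum(s[1] for s in samples) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((s[0] - mean_x) ** 2 for s in samples) / (n - 1)
    syy = sum((s[1] - mean_y) ** 2 for s in samples) / (n - 1)
    sxy = sum((s[0] - mean_x) * (s[1] - mean_y) for s in samples) / (n - 1)
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - mean_x, point[1] - mean_y
    # d^2 = [dx dy] * inverse(cov) * [dx dy]^T, with the 2x2 inverse expanded.
    squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(max(squared, 0.0))


# Hypothetical samples forming a square around (1, 1).
samples = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
at_center = mahalanobis_2d((1.0, 1.0), samples)
off_center = mahalanobis_2d((3.0, 1.0), samples)
```

For these samples the variance along each axis is 4/3 and the covariance is 0, so a point two units away from the mean along one axis has a squared distance of 4 / (4/3) = 3.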
To deal with the huge volume of the intermediate results, we propose to perform additional feature extraction on them. The underlying idea of the Mahalanobis-distance proposal is that we should track whether the features of the faulty cases are correctly extracted in the hidden layers. Inspired by the functionality of the softmax activation function typically used in the output layer, which produces the likelihood that the input belongs to each class (see Equation (1)), we use a softmax-based approach to extract the footprint specifics of faulty cases.
Specifically, DeepMorph connects an auxiliary softmax layer to each hidden layer in the target model, and builds the softmax-instrumented model so as to extract the footprint specifics. Figure 6 shows, via a simple example, how such auxiliary softmax layers are connected to the original deep neural network. For an $N$-class classification task, an auxiliary softmax layer contains $N$ neurons. Each hidden layer is fully connected to one particular auxiliary softmax layer. In other words, each neuron in the auxiliary softmax layer connects to all neurons in the hidden layer.
We can consider that the softmax-instrumented model is a new neural network that outputs the likelihood, in the view of each hidden layer, that the input case belongs to each target class. We adopt such likelihood data to model whether the features of an input case are correctly extracted by the hidden layers. Just like other DL models, we first need to train the parameters of the softmax-instrumented model.
Note that, because our focus is to compare the intermediate results of the training data with those of the faulty case in each hidden layer of the original model, we only need to train the parameters of the auxiliary softmax layers with the training data.
Specifically, DeepMorph freezes all parameters of the original model during the training process. DeepMorph then uses the same training data to train the softmax-instrumented model. When the training data are processed by each hidden layer, the intermediate outputs flow not only to the next layer, but also to the corresponding auxiliary softmax layer. Since the parameters of the original model are frozen, the intermediate results can be calculated correctly and then used to train the corresponding auxiliary softmax layer with the labels of the training data. In this way, DeepMorph trains the softmax-instrumented model, which can then be used to extract the footprint specifics of the faulty cases.
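The training step for a single auxiliary softmax layer can be sketched in NumPy as follows. This is a simplified illustration under our own assumptions rather than the DeepMorph implementation, which is built on TensorFlow: the frozen hidden layer is emulated by a fixed random projection followed by `tanh`, the optimizer is plain gradient descent, and all names (`train_auxiliary_softmax`, `frozen_W`, the toy data) are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract the row maximum for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_auxiliary_softmax(hidden, labels, n_classes, lr=0.5, epochs=300):
    """Fit a fully connected softmax layer on frozen hidden-layer outputs."""
    n, d = hidden.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(epochs):
        probs = softmax(hidden @ W + b)
        grad = (probs - onehot) / n          # gradient of the mean cross-entropy w.r.t. the logits
        W -= lr * hidden.T @ grad            # only the auxiliary layer is updated
        b -= lr * grad.sum(axis=0)
    return W, b

rng = np.random.default_rng(0)
# Toy training data: two well-separated classes in four dimensions.
x = np.vstack([rng.normal(-2, 1, size=(100, 4)), rng.normal(2, 1, size=(100, 4))])
y = np.array([0] * 100 + [1] * 100)

# Assumption: stand-in for a frozen hidden layer of the original model (never updated).
frozen_W = rng.normal(size=(4, 6))
hidden = np.tanh(x @ frozen_W)

W, b = train_auxiliary_softmax(hidden, y, n_classes=2)
likelihoods = softmax(hidden @ W + b)        # one likelihood vector per input case
accuracy = float((likelihoods.argmax(axis=1) == y).mean())
```

Because the hidden activations are computed once from fixed weights, the gradient never reaches the original model, which mirrors the freezing described above.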
Specifically, given an input, each auxiliary softmax layer outputs a vector with one element per target class, where each element represents the likelihood that the input case belongs to the corresponding class, according to the intermediate results of the hidden layer to which the auxiliary layer is connected.
Note that the output layer uses the softmax activation function as well, so the output of the original DL model is a vector of the same form. We consider that the outputs of all the auxiliary softmax layers, together with the final output, describe the data flow footprint specifics of the input. In other words, they describe how the DL model processes the input. We let DFS denote such footprint specifics, i.e., the sequence of the auxiliary softmax outputs, ordered by hidden layer, followed by the final output of the model.
In Section V-E, we will elaborate how DFS can help locate model defects.
V-E Footprint-based defect localization
Note that each element of the DFS is a vector of likelihood values, which indicates to what extent the model considers the input case to belong to each target class, according to the features extracted by the corresponding layer. Given the DFS of an input case and the true class of that case, we further define the value-rank of each vector in the DFS as the rank of the likelihood value of the true class within that vector. For example, a value-rank of 2 means that, according to the outputs of the corresponding layer, the true class is deemed the second most likely class for the input case. By calculating the value-rank of every vector in the DFS, DeepMorph obtains the value-rank list, which tracks how the classification results vary layer by layer when the input case is processed by a DL model. Algorithm 1 elucidates the procedure for calculating the value-rank list given the DFS of a faulty case in DeepMorph.
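The value-rank computation can be sketched in a few lines of Python. This is our reading of the definition above, not a transcription of Algorithm 1: the function names, the toy DFS, and the handling of ties (tied classes share the better rank) are assumptions.

```python
def value_rank(likelihoods, true_class):
    """Rank (1 = most likely) of the true class within one likelihood vector."""
    target = likelihoods[true_class]
    # Assumption: classes with an equal likelihood share the better rank.
    return 1 + sum(1 for p in likelihoods if p > target)

def value_rank_list(dfs, true_class):
    """Value-rank of the true class for every layer output in the DFS."""
    return [value_rank(vec, true_class) for vec in dfs]

# Hypothetical DFS for a 3-class task: two auxiliary softmax outputs, then the final output.
dfs = [
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
    [0.1, 0.2, 0.7],
]
ranks = value_rank_list(dfs, true_class=2)
print(ranks)  # [2, 1, 1]
```

Here the true class is ranked second by the first layer and first by the remaining layers.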
We now discuss how DeepMorph uses the value-rank list to localize model defects. Let us first suppose the input case is eventually correctly identified as belonging to its true class by the DL model. The values in the value-rank list can: 1) remain 1 constantly, or 2) be overall ascending (i.e., moving towards rank 1) and reach 1 at the output layer. Figure 7(a) shows examples of such trends, plotting the value-rank against the layer number. The first case means the DL model can extract distinct features to successfully classify the case to its true class even from the lower layers. It indicates the DL model is quite certain that the input belongs to that class. The second case means the DL model can gradually extract distinct features to successfully classify the case. Both cases indicate that the DL model performs as expected: as the data flow through the network, the output data of every layer are made more “distinguishable”, with more distinct features in terms of the classification task.
If the input case is eventually misclassified by the DL model, i.e., identified as not belonging to its true class, the trend of the value-ranks in the list can exhibit three possible scenarios: 1) The value-ranks are ascending, but do not reach 1 at the output layer. 2) The value-ranks are descending. 3) The value-ranks remain at a relatively constant value other than 1, or oscillate. Examples of these cases are shown in Figure 7(b)-(d), respectively.
The first case indicates that the hidden layers have gradually extracted useful features from the input case, but these features are still not sufficient even at the output layer. As a result, the input case is still misclassified. Since the hidden layers have gradually extracted more and more distinct features, if the model could be improved (e.g., by adding hidden layers), it is quite likely that the value-rank would eventually reach 1. Hence, DeepMorph considers that such a trend indicates that the model structure can be improved. In other words, DeepMorph will report that the DL model has an SD defect.
In the second case, the overall trend of the value-ranks in the list is descending. The DL model makes mistakes from the very beginning, and becomes even worse towards the end. We consider that the DL model is totally confused about the input case. The statistical rules it has learned from the training data cannot be applied to the faulty case. This phenomenon is what we discuss in the motivating example, which shows that the model has been trained with similar cases that carry a label other than the true class. Consequently, DeepMorph considers that such a trend indicates a UTD defect.
As for the third case, the trend is constant or oscillating, which implies that the statistical rules learned by the model are not obvious for these cases. This suggests that such cases are rare in the training data. In this regard, DeepMorph considers that such a trend indicates an ITD defect.
Algorithm 2 demonstrates the implementation of defect localization in DeepMorph. DeepMorph counts the ascending and descending steps between every two consecutive value-ranks in the list, and thus determines the overall trend of the value-ranks. For example, if the number of ascending value-rank pairs is less than the manually set threshold, DeepMorph determines that the overall trend is descending and considers the faulty case to be caused by UTD. Since DL models may consist of tens to hundreds of layers, the thresholds are set according to the total number of layers in the target model. By default, the thresholds for ascending and descending are both set to the same value in DeepMorph, which is derived from the total number of layers in the target model.
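A simplified decision rule in the spirit of this description is sketched below. It is not Algorithm 2 itself: the exact comparison logic, the function name `classify_trend`, and the threshold values in the example are our assumptions, and we interpret "ascending" as the true class moving towards rank 1.

```python
def classify_trend(ranks, asc_threshold, desc_threshold):
    """Map the value-rank list of a faulty case to a defect type (SD, UTD, or ITD)."""
    pairs = list(zip(ranks, ranks[1:]))
    # Assumption: "ascending" means the true class moves towards rank 1,
    # i.e., the numeric rank decreases from one layer to the next.
    ascending = sum(1 for prev, cur in pairs if cur < prev)
    descending = sum(1 for prev, cur in pairs if cur > prev)
    if ascending >= asc_threshold and descending < desc_threshold:
        return "SD"    # features improve layer by layer but never reach rank 1
    if descending >= desc_threshold and ascending < asc_threshold:
        return "UTD"   # the model gets worse as the data flow through the layers
    return "ITD"       # roughly constant or oscillating trend

# Hypothetical thresholds; in practice they depend on the number of layers.
print(classify_trend([6, 5, 4, 3, 2], 3, 3))  # SD
print(classify_trend([2, 3, 5, 6, 8], 3, 3))  # UTD
print(classify_trend([4, 3, 4, 3, 4], 3, 3))  # ITD
```

Passing the thresholds as explicit parameters keeps the rule independent of any particular default.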
To sum up, DeepMorph analyzes the trends of the value-ranks to perform root cause analysis of bad performance. We evaluate how DeepMorph works in our experimental study next.
We implement DeepMorph over TensorFlow, a widely adopted DL software framework. Note that DeepMorph depends only on general features of the framework, such as obtaining the outputs of hidden layers. These features are not specific to TensorFlow. In other words, the DeepMorph mechanism can be easily implemented over other DL software frameworks, e.g., PaddlePaddle, PyTorch, and Caffe2.
Our experiments are designed with a focus on answering the following research questions.
RQ1: How effective is DeepMorph in locating model defects in controlled environments?
RQ2: How does DeepMorph perform in practical production scenarios?
RQ1 intends to comprehensively study how DeepMorph performs in locating manually injected DL defects. To answer RQ1, we manually inject defects (i.e., ITD, UTD, and SD) into different kinds of DL models, and test whether DeepMorph is capable of localizing the injected defect correctly.
RQ2 aims at investigating whether DeepMorph is effective in locating model defects in practical production scenarios. To answer RQ2, DeepMorph is applied to analyze properly trained DL models. Based on the defect reported by DeepMorph, we modify the models accordingly and evaluate whether DeepMorph is helpful in improving model performance.
VI-A Experimental Setup
All experiments are implemented in Python 3.5.4 with TensorFlow 1.13, and conducted on a server running Ubuntu 16.04 with an i9-7960X CPU, 128 GB of memory, and one Nvidia GTX 1080Ti GPU.
All the DL models are trained using the Adam Optimizer with learning rate
for 100 epochs and the batch size is set to 128. We do not apply any data augmentation during training.
For each experimental setting, we train the corresponding DL model with a set of training data. The DL model is then applied to the test data, which emulate the data encountered in production environments. Given the target DL model, the training set, and the faulty cases found in the test data, DeepMorph first builds the softmax-instrumented model and trains the auxiliary softmax layers with the training data. The softmax-instrumented model processes the faulty cases and produces the footprint specifics of these cases. DeepMorph obtains the possible root cause of each faulty case according to the trend of its value-rank list, as discussed in Section V-E. Then, over all faulty cases, it produces the ratio of each type of defect. The defect with the highest ratio is considered to be the dominant defect of the target DL model.
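The final aggregation step can be illustrated with a short Python sketch. The function names and the list of per-case diagnoses are hypothetical; only the idea of reporting per-defect ratios and taking the largest one comes from the description above.

```python
from collections import Counter

def defect_ratios(diagnoses, defect_types=("ITD", "UTD", "SD")):
    """Fraction of faulty cases attributed to each defect type."""
    if not diagnoses:
        raise ValueError("at least one faulty case is required")
    counts = Counter(diagnoses)
    return {d: counts[d] / len(diagnoses) for d in defect_types}

def dominant_defect(ratios):
    """Defect type with the highest ratio."""
    return max(ratios, key=ratios.get)

# Hypothetical per-case root causes for five faulty cases.
diagnoses = ["ITD", "SD", "ITD", "UTD", "ITD"]
ratios = defect_ratios(diagnoses)
print(ratios)                   # {'ITD': 0.6, 'UTD': 0.2, 'SD': 0.2}
print(dominant_defect(ratios))  # ITD
```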
VI-B Locating model defects in controlled environments
To answer RQ1, we employ two standard datasets for image classification, MNIST and Cifar-10, both of which have 10 target classes labeled 0-9. We consider four typical DL classifier implementations from GitHub. For MNIST, we utilize LeNet and AlexNet, which have 5 and 8 layers, respectively. For Cifar-10, we use AlexNet, ResNet-34, and DenseNet-40. Compared to LeNet and AlexNet, ResNet and DenseNet are more modern DL model structures with shortcut connections. To study how DeepMorph performs in locating each defect, we manually inject the defects into these DL models and conduct our experimental study, as elaborated below.
ITD happens when the data distribution differs noticeably between the training data and the test data. We can therefore emulate it by randomly removing part of the training data of some specific classes. In our experiments, we inject ITD by removing 40%, 60%, and 80%, respectively, of the training data that belong to the classes labeled 0-4, while the test data remain unchanged. In this way, the training data are biased compared with the test data.
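As an illustration of this injection step, the following Python sketch removes a given fraction of the training cases of selected classes. The function name `inject_itd`, the fixed seed, and the toy label list are our assumptions; the actual experiments operate on the MNIST and Cifar-10 training sets.

```python
import random

def inject_itd(labels, target_classes, remove_fraction, seed=0):
    """Return the indices of the training cases kept after biasing the data."""
    rng = random.Random(seed)
    dropped = set()
    for cls in target_classes:
        idx = [i for i, y in enumerate(labels) if y == cls]
        # Randomly drop the requested fraction of this class.
        dropped.update(rng.sample(idx, round(len(idx) * remove_fraction)))
    return [i for i in range(len(labels)) if i not in dropped]

labels = [i % 10 for i in range(1000)]   # toy labels: 100 cases per class, classes 0-9
kept = inject_itd(labels, target_classes=range(5), remove_fraction=0.6)
kept_labels = [labels[i] for i in kept]
print(kept_labels.count(0), kept_labels.count(7))  # 40 100
```

The test data are left untouched, so only the training distribution becomes biased.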
UTD refers to unreliable training data and happens when humans make labeling mistakes. To best simulate UTD in practice, we manually choose two classes in each dataset that share similar features. We then relabel part of the training data of one class as the other. Specifically, for MNIST, we randomly choose 30%, 50%, and 70%, respectively, of the training data with label 4 and change their labels to 9. For Cifar-10, we randomly choose the same proportions of the training data containing a cat (labeled 3) and relabel them as a dog (labeled 5).
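The relabeling can be sketched as follows. The function name `inject_utd`, the fixed seed, and the toy label list are hypothetical and serve only to illustrate the procedure on a small example.

```python
import random

def inject_utd(labels, source_class, target_class, fraction, seed=0):
    """Relabel a random fraction of one class as another class."""
    rng = random.Random(seed)
    noisy = list(labels)
    idx = [i for i, y in enumerate(labels) if y == source_class]
    for i in rng.sample(idx, round(len(idx) * fraction)):
        noisy[i] = target_class
    return noisy

labels = [i % 10 for i in range(1000)]   # toy labels: 100 cases per class, classes 0-9
noisy = inject_utd(labels, source_class=4, target_class=9, fraction=0.3)
print(noisy.count(4), noisy.count(9))    # 70 130
```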
We inject SD by manually removing three kinds of layers, namely convolution (Conv) layers, fully-connected (FC) layers, and batch normalization (BN) layers, from the original network structures, with the aim of degrading the models via a weaker network structure. We then train the models with all the correct training data. We conduct two experiments for each model structure, since LeNet and AlexNet originally have no BN layer, while ResNet and DenseNet have only one FC layer, which serves as the output layer. Specifically, we remove the second Conv layer or the first FC layer from LeNet and AlexNet. Both ResNet and DenseNet consist of Conv layer blocks, so we remove the last block or the BN layers in the last three blocks.
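|Injected defect||Dataset||Model||ITD||UTD||SD|
|SD (Lack of Conv)||MNIST||LeNet||0.280||0.091||0.629|
|SD (Lack of Conv)||MNIST||AlexNet||0.238||0.174||0.588|
|SD (Lack of Conv)||Cifar-10||AlexNet||0.145||0.351||0.504|
|SD (Lack of Conv)||Cifar-10||ResNet-34||0.433||0.086||0.481|
|SD (Lack of Conv)||Cifar-10||DenseNet-40||0.452||0.013||0.535|
|SD (Lack of FC)||MNIST||LeNet||0.377||0.038||0.585|
|SD (Lack of FC)||MNIST||AlexNet||0.32||0.146||0.534|
|SD (Lack of FC)||Cifar-10||AlexNet||0.160||0.317||0.523|
|SD (Lack of BN)||Cifar-10||ResNet-34||0.370||0.136||0.494|
|SD (Lack of BN)||Cifar-10||DenseNet-40||0.324||0.167||0.509|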
The results reported by DeepMorph on models with different injected defects are shown in Table I. We can see that, in all cases, DeepMorph is able to locate the injected defect effectively. According to the ratios reported by DeepMorph, the injected defect always has the largest ratio. For example, after injecting ITD into the MNIST dataset, the reported ratio of ITD reaches at least 0.679 for LeNet, which is the largest compared with those of UTD (at most 0.089) and SD (at most 0.232). This indicates that DeepMorph can successfully identify the injected defects.
VI-C Evaluation on practical models
To fairly evaluate the effectiveness of DeepMorph in practical production scenarios, we consider another two datasets for image classification: Fashion-MNIST and SVHN. To better simulate the usage scenario of DeepMorph, we utilize a seven-layer CNN as the network designed by the developer, and VGG-16, a widely adopted network in computer vision applications, as the reference network. The architecture of the seven-layer CNN is shown in Table II. We train the DL classifiers without injecting any defects.
|Conv + Relu||(3×3) kernel, 64 filters|
|Conv + Relu + Max Pooling (2×2)||(3×3) kernel, 64 filters|
|Conv + Relu||(3×3) kernel, 128 filters|
|Conv + Relu + Max Pooling (2×2)||(3×3) kernel, 128 filters|
|FC + Relu||256|
|FC + Relu||256|
|FC + Softmax||10|
Then we apply DeepMorph directly to these models. The performance of the DL models, measured by their accuracy on the test data, together with the defect ratios reported by DeepMorph, is shown in Table III.
We can see that the defect ratio of UTD is quite low for every DL model, since the training data are unmodified. The CNN-7 trained on Fashion-MNIST is reported as having SD, while the remaining models are all reported as suffering from ITD.
To evaluate how DeepMorph can help a developer remove these defects, we fix the detected defects accordingly. The CNN-7 on the Fashion-MNIST dataset achieves 0.917 accuracy on the test data, but DeepMorph still reports SD. We therefore consider that this model suffers from overfitting and add BN layers to it. We apply data augmentation to address the ITD in the other three models. The accuracy of the improved models is shown in the “New Acc.” column of Table III. We can see that the performance of every model is significantly improved, which shows that fixing the defect indicated by DeepMorph is effective.
This experiment shows that DeepMorph can achieve good performance even in locating unknown defects in practical scenarios. With the direction provided by DeepMorph, the developer can effectively improve DL model performance.
|Dataset||Model||Acc.||Defect Ratio Reported||New Acc.|
VII Discussions and Conclusion
This paper aims at addressing the model defects of DL applications. We argue that the model defects, i.e., those caused by improper network structure and improper network parameters (caused originally by improper training data), can be located with a white-box approach.
We formulate a DL model as a functional composition of hidden layers, and analyze its execution with data flow footprints, i.e., the intermediate outputs of the hidden layers. We attempt to interpret the model execution as how the distinct features of an input case towards the DL task can be extracted layer by layer gradually. Accordingly, we propose to examine model defects according to the feature extraction process, with the help of data flow footprints.
We demonstrate the effectiveness of our proposal by implementing a tool, namely DeepMorph. Based on a set of experiments on DL models with injected defects, we evaluate the tool on different kinds of DL classifiers. We also evaluate the tool with realistic models without defect injection. The results show that DeepMorph is very promising in locating model defects. Moreover, it can guide DL model developers towards improving their models.
Finally, it is worth noting that DeepMorph is not evaluated on all popular network structures, e.g., those based on recurrent neural networks (RNNs). We consider that DeepMorph can directly use the same methods to extract data flow footprints from an RNN model, as RNNs are similarly executed with hidden layers, although more experimental study on RNNs and the corresponding footprint analysis remains to be conducted in future work.
-  (2016) TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI, pp. 265–283. Cited by: §I.
-  (2018) Fairness and transparency of machine learning for trustworthy cloud services. In Proceedings of the 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, DSN Workshops, pp. 188–193. Cited by: §II.
-  (https://en.wikipedia.org/wiki/Apple-A12) Apple a12. Cited by: §I.
-  (https://www.techrepublic.com/article/the-10-biggest-ai-failures-of-2017/) Artificial ignorance. Cited by: §I.
-  (https://www.biomedcentral.com/collections/ai) Artificial intelligence in biomedical imaging. Cited by: §I.
-  (2014) Neural machine translation by jointly learning to align and translate. arxiv abs/1409.0473. Cited by: §I, §II.
-  (https://caffe2.ai/) Caffe2: a new lightweight, modular, and scalable deep learning framework. Cited by: §VI.
-  (2017) Towards evaluating the robustness of neural networks. In Proceedings of the IEEE Symposium on Security and Privacy, SP, pp. 39–57. Cited by: §I.
-  (2018) From UI design image to GUI skeleton: a neural machine translator to bootstrap mobile GUI implementation. In Proceedings of the 40th International Conference on Software Engineering, ICSE, pp. 665–676. Cited by: §II.
-  (https://github.com/BIGBALLON/cifar-10-cnn) Convolutional neural networks for cifar-10. Cited by: §VI-B.
-  (2000) The mahalanobis distance. Chemometrics and intelligent laboratory systems 50 (1), pp. 1–18. Cited by: §V-D.
-  (2018) Identifying implementation bugs in machine learning based image classifiers using metamorphic testing. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA, pp. 118–128. Cited by: §II.
-  (2017) Multi-level anomaly detection in industrial control systems via package signatures and LSTM networks. In Proceedings of the 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN, pp. 261–272. Cited by: §II.
-  (2018) Model, data and reward repair: trusted machine learning for markov decision processes. In Proceedings of the 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, DSN Workshops, pp. 194–199. Cited by: §II.
-  (2016) Deep learning. Adaptive computation and machine learning, MIT Press. Cited by: §I, §III, §IV, §V-B.
-  (2017) DeepSafe: A data-driven approach for checking adversarial robustness in neural networks. External Links: Cited by: §II.
-  (2018) Deep code search. In Proceedings of the 40th International Conference on Software Engineering, ICSE, pp. 933–944. Cited by: §II.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 770–778. Cited by: §I, §II, §VI-B.
-  (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 2261–2269. Cited by: §I, §VI-B.
-  (1989-09) Conception, evolution, and application of functional programming languages. ACM Computing Surveys 21, pp. 359–411. Cited by: §I.
-  (2018) Obfuscated VBA macro detection using machine learning. In Proceedings of the 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN, pp. 490–501. Cited by: §II.
-  (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, NIPS, pp. 1097–1105. Cited by: §VI-B.
-  (1989) Handwritten digit recognition with a back-propagation network. In Proceedings of the Advances in Neural Information Processing Systems, NIPS, pp. 396–404. Cited by: §III, §VI-B.
-  (2017) Implicit smartphone user authentication with sensors and contextual machine learning. In Proceedings of the 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN, pp. 297–308. Cited by: §II.
-  (2018) On the limitation of magnet defense against l1-based adversarial examples. In Proceedings of the 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, DSN Workshops, pp. 200–214. Cited by: §II.
-  (2018) DeepGauge: multi-granularity testing criteria for deep learning systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE, pp. 120–131. Cited by: §II.
-  (2018) DeepMutation: mutation testing of deep learning systems. In Proceedings of the 29th IEEE International Symposium on Software Reliability Engineering, ISSRE, pp. 100–111. Cited by: §II.
-  (2018) MODE: automated neural network model debugging via state differential analysis and input selection. In Proceedings of the 2018 ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/SIGSOFT FSE, pp. 175–186. Cited by: §II.
-  (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: §IV.
-  (2010) Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML), pp. 807–814. Cited by: §III.
-  (2011) Reading digits in natural images with unsupervised feature learning. Cited by: §VI-C.
-  (2018) Machine learning models for GPU error prediction in a large scale HPC system. In Proceedings of the 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN, pp. 95–106. Cited by: §II.
-  (2018) TensorFuzz: debugging neural networks with coverage-guided fuzzing. External Links: Cited by: §II.
-  (http://www.paddlepaddle.org/en) PaddlePaddle: an easy-to-use, easy-to-learn deep learning platform. Cited by: §VI.
-  (2018) SoK: security and privacy in machine learning. In Proceedings of the IEEE European Symposium on Security and Privacy, EuroS&P, pp. 399–414. Cited by: §I.
-  (2017) DeepXplore: automated whitebox testing of deep learning systems. In Proceedings of the 26th Symposium on Operating Systems Principles, SOSP, pp. 1–18. Cited by: §II, §V-C.
-  (https://pytorch.org/) Pytorch: an open source deep learning platform that provides a seamless path from research prototyping to production deployment. Cited by: §VI.
-  (2018) Towards dependable deep convolutional neural networks (cnns) with out-distribution learning. arXiv preprint arXiv:1804.08794. Cited by: §I, §II.
-  (2018) Reachability analysis of deep neural networks with provable guarantees. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI, pp. 2651–2659. Cited by: §II.
-  (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §III, §VI-C.
-  (2018) Testing deep neural networks. External Links: Cited by: §II.
-  (2018) Concolic testing for deep neural networks. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE, pp. 109–119. Cited by: §II.
-  (2012) LSTM neural networks for language modeling. In Proceedings of the 13th Annual Conference of the International Speech, INTERSPEECH, pp. 194–197. Cited by: §I, §II.
-  (https://www.tensorflow.org/tutorials/images/image-recognition) TensorFlow tutorials. Cited by: §I.
-  (https://www.cs.toronto.edu/~kriz/cifar.html) The cifar-10 dataset. Cited by: §VI-B.
-  The mnist database. Cited by: §IV.
-  (2018) DeepTest: automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th International Conference on Software Engineering, ICSE, pp. 303–314. Cited by: §II, §V-B.
-  (2001) The art of data augmentation. Journal of Computational and Graphical Statistics 10 (1), pp. 1–50. Cited by: §V-B.
-  (2018) DCN: detector-corrector network against evasion attacks on deep neural networks. In Proceedings of the 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, DSN Workshops, pp. 215–221. Cited by: §II.
-  (2017-08-28)(Website) External Links: Cited by: §VI-C.
-  (2015) Understanding neural networks through deep visualization. CoRR abs/1506.06579. External Links: Cited by: §II.
-  (2016) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530. Cited by: §IV, §V-B, §V-B.
-  (2018) DeepRoad: gan-based metamorphic testing and input validation framework for autonomous driving systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ASE, pp. 132–142. Cited by: §II, §V-B.