Machine learning software, in particular deep neural network (DNN) software, is based on inductive methods that discern valuable information from a vast amount of given data. The quality of such software is usually judged by the prediction performance that the obtained approximation functions exhibit on incoming data. The functional behavior of the resultant inference functions depends on trained learning models, which learning programs compute from training datasets given as their input.
The quality of DNN software depends on both the learning programs and the training datasets; either or both can be a source of degraded quality. Learning programs produce inappropriate trained learning models if they are not faithful implementations of well-designed machine learning algorithms. Moreover, problematic datasets, for example those suffering from sample selection bias, have negative impacts on trained learning models.
Although the learning programs and datasets are sources that affect the quality of inference functions, their influence is more or less indirect. Trained learning models determine the quality directly, but have not been treated as first-class citizens in the study of quality issues. The models are important numerical data, but are intermediate in that they are synthesized by learning programs and then transferred to inference programs.
This paper adopts a hypothesis that distortions in the trained learning models manifest themselves as faults resulting in quality degradation. Although such distortions are difficult to measure directly, relative distortion degrees can be defined. Moreover, this paper proposes a new way of generating datasets that exhibit the characteristics of dataset diversity, which is expected to be effective in testing machine learning software in various ways.
2 Machine Learning Software
2.1 Learning Programs
Our goal is to synthesize, inductively from a large dataset, an approximate input-output relation classifying multi-dimensional vector data into one of categories. The given dataset is a set of number of pairs, , where a supervisor tag takes a value of . A pair in means that is classified as .
A learning model is given as a multi-dimensional non-linear function, differentiable with respect to both the learning parameters and the input data. Learning aims at obtaining a set of learning parameter values by solving a numerical optimization problem.
The function denotes distances between its two parameters, representing how much a calculated output differs from its accompanying supervisor tag value .
We denote a function to calculate as , which is a program to solve the above optimization problem with its input dataset . Moreover, we denote the empirical distribution of as . is a collection of learning parameter values to minimize the mean of under .
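As a reading aid, the optimization problem described above can be written out as follows. The symbol names are assumptions chosen for illustration: $y(W; x)$ for the learning model, $\ell$ for the distance (loss) function, $\langle x^{n}, t^{n} \rangle$ for the $N$ pairs of the training dataset, and $W^{*}$ for the resulting learning parameter values.

```latex
W^{*} \;=\; \operatorname*{arg\,min}_{W} \; \frac{1}{N} \sum_{n=1}^{N} \ell\bigl( y(W; x^{n}),\, t^{n} \bigr)
```

The right-hand side is the mean of the loss under the empirical distribution of the training dataset, which is what the learning program is said to minimize.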
From the viewpoint of software quality, the learning program is concerned with product quality, because it must be a faithful implementation of a machine learning method, the supervised learning method in this case.
2.2 Inference Programs
We introduce another function which, using a trained learning model, calculates inference results for incoming data.
For classification problems, the inference results are often expressed as the probability that the data belongs to a category; that is, the function gives, for each category, the probability that the data is classified into it. This is readily implemented if we choose Softmax as the activation function of the output layer of the learning model.
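The role of Softmax in turning output-layer signals into category probabilities can be sketched as below. This is a minimal illustration, not the implementation used in the experiments; the function names `softmax` and `infer` are hypothetical.

```python
import numpy as np

def softmax(z):
    """Map a vector of output-layer signals to category probabilities."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # subtracting the max avoids overflow
    return e / e.sum()

def infer(z):
    """Return the most probable category index and the probability vector."""
    p = softmax(z)
    return int(np.argmax(p)), p
```

Each entry of the returned vector can be read as the probability that the incoming data belongs to the corresponding category, and the entries sum to one.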
The prediction performance of is, indeed, defined compactly as the accuracy of classification results for a dataset different from used to calculate ; . For a specified , is a set-valued function to obtain a subset of data vectors in .
If we express as a size of a set , then an accuracy is defined as a ratio as below.
Given a obtained by , the prediction performance of is defined in terms of for a dataset different from . For an individual incoming data , a function of is a good performance measure.
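The accuracy defined above, the size of the correctly classified subset divided by the size of the dataset, can be sketched as follows. The names `correct` and `accuracy`, and the representation of a dataset as a list of (data, tag) pairs, are assumptions made for this illustration.

```python
def correct(predict, dataset):
    """Subset of data whose predicted category equals the supervisor tag."""
    return [x for x, t in dataset if predict(x) == t]

def accuracy(predict, dataset):
    """Ratio of correctly classified data to all data in the dataset."""
    return len(correct(predict, dataset)) / len(dataset)
```

Here `predict` stands for the inference program combined with a fixed set of trained learning parameters.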
2.3 Quality Issues
2.3.1 Loss and Accuracy
An NN learning problem is a non-convex optimization problem, and thus reaching a globally optimal solution is not guaranteed. The learning program is iterated over epochs to search for solutions, and is taken as converged when the value of the loss does not change much between two consecutive epochs. The learning parameter values at this converged epoch are taken as the training result. The derived values may not be optimal, and thus an accuracy is calculated to ensure that they are appropriate. Moreover, they may be over-fitted to the training dataset.
In the course of the iteration, the learning parameter values are extracted at each epoch. The iteration is continued until the accuracy becomes satisfactory. The accuracies with both the training dataset and the testing dataset are monitored to ensure that the training process goes as desired. If the accuracy with the testing dataset is not much different from the accuracy with the training dataset, we may think that the learning result does not have the over-fitting problem.
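The convergence check and the over-fitting check described above can be sketched as follows. The tolerance `tol` and both function names are assumptions for this illustration; the text states no concrete threshold.

```python
def converged_epoch(losses, tol=1e-3):
    """First epoch e >= 1 whose loss differs from the previous one by less than tol."""
    for e in range(1, len(losses)):
        if abs(losses[e] - losses[e - 1]) < tol:
            return e
    return None  # no two consecutive epochs were close enough

def overfit_gap(train_accuracy, test_accuracy):
    """Training accuracy minus testing accuracy; a large gap suggests over-fitting."""
    return train_accuracy - test_accuracy
```

In practice both quantities would be recorded per epoch, so that the loss curve and the two accuracy curves can be inspected together.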
Figure 1 shows loss and accuracy graphs, measured during the experiments, as epochs proceed (we return to the experiments in Section 4). The graphs, for example in Figure 1 (a), demonstrate that the search converges at a desirable point in the solution space, because the loss decreases and becomes almost stable below a certain threshold, and the accuracies with both the training and testing datasets reach a satisfactory level higher than 0.95. Figure 1 suggests that the loss together with the accuracy may be good indicators for deciding whether NN training processes behave well.
2.3.2 Sources of Faults
Intuitively, NN machine learning software shows good quality if its prediction performance is acceptable. The graphs in Figure 1, however, show a counterexample, in which the learning task uses the MNIST dataset for classifying handwritten numbers.
Figure 1(a) shows the graphs of loss and accuracy for a probably correct implementation of an NN learning program, while Figure 1(b) shows those for a bug-injected program. The two loss graphs converge in mostly the same way. The two accuracy graphs are similar as well, although the program of Figure 1(b) has faults in it.
The MNIST dataset is a standard benchmark and is supposed to be well prepared and free from sample selection bias. A bug-injected program accepts an MNIST training dataset and calculates a set of trained learning parameters. Intuitively, this result is inappropriate, because a bug-injected program produces it. However, the accuracy graphs show no clear sign of faults in the prediction results of the inference program, although its behavior is completely determined by the probably inappropriate trained parameters.
A question arises as to how faults in the learning program affect the trained learning parameters, which leads to a further question of whether such faults in the trained parameters are observable.
3 Distortion Degrees
3.1 Observations and Hypotheses
Firstly, we introduce a few new concepts and notations. For two datasets and , a relation denotes that is more distorted than . For two sets of trained learning parameters and of the same machine learning model , a relation denotes that is more distorted than . A question here is how to measure such degrees of the distortion. We assume a certain observer function and a relation with a certain small threshold such that . The distortion relation is defined in terms of , . We introduce below three hypotheses referring to these notions.
[Hyp-1] Given a training dataset , a machine learning program , either correct or faulty, derives its optimal solution . For the training dataset and a testing dataset , if both are appropriately sampled or both follow the same empirical distribution, then is almost the same as .
[Hyp-2] For a training program and two training datasets ( or ), if , then .
[Hyp-3] For two training datasets ( or ) such that and a certain appropriate , if is correct with respect to its functional specifications, then two results, , are almost the same, written as (or ). However, if is a faulty implementation, then .
The accuracy graph in Figure 1 is an instance of [Hyp-1]. In Figure 1 (a), the accuracy graphs for the training and testing datasets mostly overlap, and the same is true in Figure 1 (b), which illustrates that the accuracy is satisfactory even if the learning program is buggy.
Moreover, the example in  is an instance of [Hyp-2] for the following reason. A training dataset is obtained by adding a kind of disturbance signal to so that . With an appropriate observer function , is falsified where .
3.2 Generating Distorted Dataset
We propose a new test data generation method. We first explain the L-BFGS method, which illustrates a simple way to calculate adversarial examples.
Given a dataset of , . An adversarial example is a solution of an optimization problem;
Such data is visually close to its seed for human eyes, but actually carries faint noise, added so as to induce misinference.
Consider an optimization problem, in which a seed is and its target label is .
The method is equivalent to constructing new data with small noise added. Because the inferred label is not changed, the new data is not adversarial, but is distorted from the seed. When the value of the hyper-parameter is chosen to be very small, the distortion from the seed is large. On the other hand, if the value is appropriate, the effect of the noise can be small so that the data stays close to the original.
By applying the method to all the elements in , a new dataset is obtained to be . We introduce a function to generate such a dataset from and .
Now, (for and ).
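The generation method can be sketched as below, under several loudly stated assumptions. The objective is taken to be the loss toward the seed's own tag plus a hyper-parameter `lam` times the squared distance from the seed; this form is an assumption, chosen to agree with the remark that a very small hyper-parameter value yields a large distortion. A linear softmax model (`W`, `b`) stands in for the trained learning model, and plain gradient descent replaces the L-BFGS solver. All function names are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def objective(x, x_seed, t, W, b, lam):
    """Cross-entropy toward tag t plus lam times squared distance from the seed."""
    p = softmax(W @ x + b)
    return -np.log(p[t]) + lam * np.sum((x - x_seed) ** 2)

def distort(x_seed, t, W, b, lam, lr=0.1, steps=200):
    """Gradient descent on objective(), starting from the seed itself."""
    x = np.array(x_seed, dtype=float)
    onehot = np.zeros(W.shape[0])
    onehot[t] = 1.0
    for _ in range(steps):
        p = softmax(W @ x + b)
        # gradient of the cross-entropy term plus gradient of the distance term
        grad = W.T @ (p - onehot) + 2.0 * lam * (x - x_seed)
        x -= lr * grad
    return x

def generate(dataset, W, b, lam):
    """Apply distort() to every (x, t) pair, keeping each tag unchanged."""
    return [(distort(x, t, W, b, lam), t) for x, t in dataset]
```

Calling `generate` again on its own output, after retraining the model on that output, would correspond to the repeated generation discussed in the text; the retraining step is omitted here.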
3.3 Some Properties
This section presents some properties that generated datasets satisfy; where is equal to be a given training dataset .
[Prop-1] serves the same machine learning task as does.
We have that . As the optimization problem with indicates, , an element of does not deviate much from in , and is almost the same as in special cases. Therefore, serves as the same machine learning task as does. Similarly, serves as the same machine learning task as does. By induction, serves as the same machine learning task as does, although the deviation may be large.
The distortion relation is satisfied by construction if we take as a starting criterion.
[Prop-3] is more over-fitted to than .
In the optimization problem, if the loss is small, in can be considered to be well-fitted to because the data reconstruct the supervisor tag well. We make the loss this small by carefully choosing an appropriate hyper-parameter value.
[Prop-4] There exists a certain such that, for all to satisfy a relation , and . is a dataset different from , but follows the same empirical distribution .
From Prop-3, we can see is over-fitted to if . Because and both and follow the empirical distribution , we have . Furthermore, implies and thus .
[Prop-5] The dataset and trained learning model reach respectively and if we repeatedly conduct the training and the dataset generation in an interleaved manner.
If we choose a to be sufficiently larger than , we have, from Prop-4, , which may imply that . From Prop-3, is over-fitted to , and thus we have , which implies that we can choose a representative from them. Using this dataset, we have that , and that is a representative.
4 A Case Study
4.1 MNIST Classification Problem
The MNIST dataset is a standard problem of classifying handwritten numbers. It consists of a training dataset of 60,000 vectors and a testing dataset of 10,000. Both are randomly selected from a pool of vectors, and thus are considered to follow the same empirical distribution. The machine learning task is to classify an input sheet, or vector data, into one of ten categories from 0 to 9. A sheet is presented as 28×28 pixels, each taking a value between 0 and 255 to represent gray scale. Pixel values represent handwritten strokes, and a number appears as a specific pattern of these pixel values.
In the experiments, the learning model is a classical neural network with a hidden layer and an output layer. The activation function for neurons in the hidden layer is ReLU; its output is linear for positive input values and a constant zero for negative ones. A softmax activation function is introduced so that the inference program returns the probabilities that incoming data belongs to each of the ten categories.
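A forward pass through such a model can be sketched as follows. The hidden-layer size of 100 and the random, untrained parameter values are hypothetical placeholders; the text does not state the sizes or values used in the experiments.

```python
import numpy as np

def relu(z):
    """Linear for positive inputs, zero otherwise."""
    return np.maximum(z, 0.0)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def forward(x, W1, b1, W2, b2):
    """Hidden layer with ReLU, then output layer with softmax."""
    h = relu(W1 @ x + b1)
    return softmax(W2 @ h + b2)

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 28 * 28, 100, 10  # n_hidden is a hypothetical choice
W1 = rng.normal(0.0, 0.01, (n_hidden, n_in))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0.0, 0.01, (n_out, n_hidden))
b2 = np.zeros(n_out)

x = rng.integers(0, 256, n_in) / 255.0  # one sheet, gray scale scaled to [0, 1]
p = forward(x, W1, b1, W2, b2)          # ten category probabilities
```

Scaling the pixel values to the unit interval is a common preprocessing step and is likewise an assumption here.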
We prepared two learning programs, ProgPC and ProgBI. The former is a probably correct implementation of a learning algorithm, and the latter is a bug-injected version of the former. We conducted two experiments in parallel, one using ProgPC and the other using ProgBI, and compared the results. Below, we use notations such as where is either or .
4.2.1 Training with MNIST dataset
We conducted training with the MNIST training dataset. Figure 1 shows several graphs, obtained in the training processes, that illustrate their behavior. Both accuracy graphs in Figure 1 show that and are mostly the same. In addition, and are indistinguishable. The above observation is consistent with [Hyp-1].
4.2.2 Generating Distorted Datasets
We generated distorted datasets with the method described in Section 3.2. We introduce here short-hand notations such as ; .
Figure 2 shows a fragment of . We recognize that all the data are not so clear as those of the original MNIST dataset and thus are considered distorted. We may consider them as , which is an instance of [Prop-2]. Furthermore, for human eyes, Figure 2 (b) for the case with is more distorted than Figure 2 (a) of , which may be described as .
4.2.3 Training with Distorted Datasets
Secondly, for the MNIST testing dataset , is lower than , while reaches close to . Together with the fact of , the above implies , which is consistent with [Hyp-2].
Thirdly, we consider how much the accuracies differ. We define the relation where . Let be defined in terms of with a certain . Comparing the graphs in Figure 1(a) and Figure 3(a), we observe, for , is about . Contrarily, for from Figure 1(b) and Figure 3(b), is about . If we choose a threshold to be about , the two cases are distinguishable.
Moreover, we define the relation for . As we know that ProgPC is probably correct and ProgBI is bug-injected, we have the following: (a) , and (b) . These are, indeed, consistent with [Hyp-3].
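The threshold-based comparison used in this subsection can be sketched as follows. The two relation names are hypothetical, and reading "more distorted" as "not almost the same under the observer" is an assumption made for this illustration.

```python
def almost_same(obs_a, obs_b, eps):
    """Two observed values are almost the same if they differ by at most eps."""
    return abs(obs_a - obs_b) <= eps

def more_distorted(obs_a, obs_b, eps):
    """Read here as the negation of almost_same for the chosen observer."""
    return not almost_same(obs_a, obs_b, eps)
```

With accuracy as the observer, the two arguments would be the accuracies obtained from the models trained on the original and on the distorted dataset, and `eps` would be chosen between the differences observed for the two programs.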
4.2.4 Accuracy for Distorted Testing Datasets
We generated distorted datasets from the MNIST testing dataset ; . We, then, checked the accuracy , whose monitored results are shown in Figure 4. and are not distinguishable, because both and are constructed in the same way with and thus their empirical distributions are the same. The graphs are consistent again with [Hyp-1].
5.1 Neuron Coverage
As explained in Section 3.1, the distortion relation () between trained learning parameters is calculated in terms of observer functions. However, depending on the observer, the resultant distortion degree may be different. In an extreme case, a certain observer is not adequate to differentiate distortions. A question arises as to whether such distortion degrees can be measured directly. We study whether neuron coverage can be used as such a measure.
A neuron is said to be activated if its output signal is larger than a given threshold when a set of input signals is presented; . The weight values s are constituents of trained . Activated Neurons below refer to a set of neurons that are activated when a vector data is input to a trained learning model as .
In the above, denotes the size of a set . Using this neuron coverage as a criterion is motivated by an empirical observation that different input-output pairs result in different degrees of neuron coverage .
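Neuron coverage for a single layer can be sketched as below, assuming it is the ratio of activated neurons to all neurons in that layer. The layer is assumed to use ReLU with weights `W1` and biases `b1`; the function names and the default threshold of zero are choices made for this illustration.

```python
import numpy as np

def activated_neurons(x, W1, b1, threshold=0.0):
    """Indices of neurons whose ReLU output exceeds the threshold for input x."""
    out = np.maximum(W1 @ x + b1, 0.0)
    return set(np.flatnonzero(out > threshold).tolist())

def neuron_coverage(x, W1, b1, threshold=0.0):
    """Size of the activated set divided by the number of neurons in the layer."""
    return len(activated_neurons(x, W1, b1, threshold)) / W1.shape[0]
```

For a whole dataset, the activated sets of the individual inputs could be united before taking the ratio; which of the two readings is intended depends on the original definition.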
5.1.1 Results of Experiment
We focus on the neurons constituting the hidden layer in our NN learning model. As its activation function is ReLU, we choose as the threshold. Figure 5 is a graph to show the numbers of input vectors leading to the chosen percentages of inactive neurons, . These input vectors constitute the MNIST testing dataset of the size .
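A graph of this kind can be produced by counting, for each input vector, the fraction of hidden-layer neurons that stay inactive, and then binning the inputs by that percentage. The sketch below assumes one input vector per row of `X`, a ReLU hidden layer with parameters `W1` and `b1`, and a bin width of 10 percentage points; all names and the bin width are hypothetical.

```python
import numpy as np

def inactive_ratio(X, W1, b1, threshold=0.0):
    """Per-input fraction of hidden neurons whose output does not exceed the threshold."""
    out = np.maximum(X @ W1.T + b1, 0.0)
    return 1.0 - (out > threshold).mean(axis=1)

def histogram_by_percent(ratios, step=10):
    """Number of input vectors falling into each percentage bin of inactive neurons."""
    edges = np.arange(0, 100 + step, step)
    counts, _ = np.histogram(ratios * 100.0, bins=edges)
    return edges, counts
```

Plotting `counts` against `edges` for the two trained models side by side gives a comparison of the kind discussed in this subsection.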
According to Figure 5, the graph for the case of ProgPC, the ratio of inactive neurons is almost ; namely, of neurons in the hidden layer are activated to have effects on the classification results. However, the ProgBI graph shows that about of them are inactive and do not contribute to the results. To put it differently, this difference in the ratios of inactive neurons implies that the trained learning model of ProgBI is distorted from of ProgPC, .
From the generation method of the distorted dataset, we have and . may also be satisfied, which is in accordance with the visual inspection of Figure 2. Furthermore, because of [Hyp-2] (Section 3.1), is true. It is consistent with the situation shown in Figure 5 in that activated neurons in are fewer than those in .
Figure 3 can be understood from a viewpoint of the neuron coverage. The empirical distribution of MNIST testing dataset is the same as that of MNIST training dataset . Because of the distortion relationships on training datasets, the distribution of is different from those of ( or ). Moreover, Figure 3 shows that is smaller than . Therefore, we see that the relationship is satisfied. Because of [Hyp-2], it implies . Therefore, the difference seen in Figure 3 is consistent with the situation shown in Figure 5.
In summary, neuron coverage would be a good candidate for a metric to quantitatively define the distortion degrees of trained learning models. However, because this view is based on the MNIST dataset experiments only, further studies are desirable.
5.2 Test Input Generation
We now see how the dataset or data generation method in Section 3.2 is used in software testing. Because the program is categorized as untestable, Metamorphic Testing (MT) is now a standard practice for testing machine learning programs. We point out here that generating appropriate data is desirable for conducting effective testing. In the MT framework, given initial test data, a translation function automatically generates new follow-up test data.
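The MT framework can be sketched in a few lines. The harness below is a generic illustration with hypothetical names: `translate` is the translation function that derives the follow-up test data, and `relation` is the metamorphic relation expected to hold between the two outputs.

```python
def metamorphic_test(program, x, translate, relation):
    """Run the program on x and on its follow-up data, then check the relation."""
    x_follow = translate(x)
    return relation(program(x), program(x_follow))

def scale_by_two(v):
    """Example translation: multiply every component by two."""
    return [2.0 * a for a in v]

def same_output(y, y_follow):
    """Example relation: the output must not change."""
    return y == y_follow
```

No expected output for `x` itself is needed; a violation of the relation alone signals a fault, which is why MT suits programs without a test oracle.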
For testing machine learning software, either the learning program (whether a training program is a faithful implementation of machine learning algorithms) or the inference program (whether an inference program shows acceptable prediction performance against incoming data), generating a variety of data to show Dataset Diversity is a key issue. The function introduced in Section 3.2 can be used to generate such a follow-up dataset for use in the MT framework. In particular, corner-case testing would be possible by carefully choosing such a group of biased datasets.
DeepTest employs Data Augmentation methods to generate test data. Zhou and Sun adopt fuzz generation to satisfy application-specific properties. Both works center on generating test data for negative testing, but do not refer to the statistical distribution of datasets.
DeepRoad adopts an approach with Generative Adversarial Networks (GAN) to synthesize various weather conditions as driving scenes. GAN is formulated as a two-player zero-sum game. Given a dataset with its empirical distribution, the Nash equilibrium, solved with Mixed Integer Linear Programming (MILP), results in a DNN-based generative model that emits new data following that distribution. Thus, such new data preserve the characteristics of the original machine learning problem. Consequently, we regard the GAN-based approach as a method to enlarge the coverage of test scenes within what is anticipated at training time.
Machine Teaching  is an inverse problem of machine learning, and is a methodology to obtain a dataset to optimally derive a given trained learning model. The method is formalized as a two-level optimization problem, which is generally difficult to solve. We regard the machine teaching as a method to generate unanticipated dataset. Obtained datasets can be used in negative testing.
Our method uses an optimization problem with one objective function for generating datasets that are not far from what is anticipated, but probably are biased to build up the dataset diversity.
6 Concluding Remarks
We introduced a notion of distortion degrees, which would manifest themselves as faults and failures in machine learning programs, and studied their characteristics in terms of neuron coverage. However, further study is needed on how to measure the distortion degrees rigorously, which would make it possible to debug programs with the measurement results. If the measurement is lightweight and can be conducted on machine learning systems in operation, we will be able to diagnose systems at operation time.
The work is supported partially by JSPS KAKENHI Grant Number JP18H03224, and is partially based on results obtained from a project commissioned by the NEDO.
-  T.Y. Chen, F.-C. Kuo, H. Liu, P.-L. Poon, D. Towey, T.H. Tse, and Z.Q. Zhou : Metamorphic Testing: A Review of Challenges and Opportunities, ACM Computing Surveys 51(1), Article No.4, pp.1-27, 2018.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio : Generative Adversarial Nets, In Adv. NIPS 2014, pp.2672-2680, 2014.
-  I. Goodfellow, Y. Bengio, and A. Courville : Deep Learning, The MIT Press 2016.
-  S. Haykin : Neural Networks and Learning Machines (3ed.), Pearson India 2016.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner: Gradient-based learning applied to document recognition, In Proceedings of the IEEE, 86(11), pp.2278-2324, 1998.
-  S. Nakajima and H.N. Bui : Dataset Coverage for Testing Machine Learning Computer Programs, In Proc. 23rd APSEC, pp.297-304, 2016.
-  S. Nakajima : Quality Assurance of Machine Learning Software, In Proc. IEEE GCCE 2018, pp.601-604, 2018.
-  S. Nakajima : Dataset Diversity for Metamorphic Testing of Machine Learning Software, In Post-Proc. 8th SOFL+MSVL, pp.21-38, 2018.
-  S. Nakajima and T.Y. Chen: Generating Biased Dataset for Metamorphic Testing of Machine Learning Programs, In Proc. IFIP-ICTSS 2019, pp.56-64, 2019.
-  K. Pei, Y. Cao, J. Yang, and S. Jana : DeepXplore: Automated Whitebox Testing of Deep Learning Systems, In Proc. 26th SOSP, pp.1-18, 2017.
-  J. Quinonero-Candela, M. Sugiyama, A. Schwaighofer, and N.D. Lawrence (eds.) : Dataset Shift in Machine Learning, The MIT Press 2009.
-  S. Segura, D. Towey, Z.Q. Zhou and T.Y. Chen: Metamorphic Testing: Testing the Untestable, IEEE Software (in press).
-  C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus : Intriguing properties of neural networks, In Proc. ICLR, 2014.
-  Y. Tian, K. Pei, S. Jana, and B. Ray : DeepTest: Automated Testing of Deep-Neural-Network-driven Autonomous Cars, In Proc. 40th ICSE, pp.303-314, 2018.
-  D. Warde-Farley and I. Goodfellow: Adversarial Perturbations of Deep Neural Networks, in Perturbation, Optimization and Statistics, The MIT Press 2016.
-  X. Xie, J.W.K. Ho, C. Murphy, G. Kaiser, B. Xu, and T.Y. Chen : Testing and Validating Machine Learning Classifiers by Metamorphic Testing, J. Syst. Softw., 84(4), pp.544-558, 2011.
-  M. Zhang, Y. Zhang, L. Zhang, C. Liu, and S. Khurshid: DeepRoad: GAN-Based Metamorphic Testing and Input Validation Framework for Autonomous Driving Systems, In Proc. 33rd ASE, pp.132-142, 2018.
-  Z.Q. Zhou and L. Sun: Metamorphic Testing of Driverless Cars, Comm. ACM, vol.62, no.3, pp.61-67, 2019.
-  X. Zhu : Machine Teaching: An Inverse Problem to Machine Learning and an Approach Toward Optimal Education, In Proc. 29th AAAI, pp.4083-4087, 2015.