Software Engineering for AI/ML -- An Annotated Bibliography
This paper provides a comprehensive survey of Machine Learning Testing (ML testing) research. It covers 128 papers on testing properties (e.g., correctness, robustness, and fairness), testing components (e.g., the data, learning program, and framework), testing workflow (e.g., test generation and test evaluation), and application scenarios (e.g., autonomous driving, machine translation). The paper also analyses trends concerning datasets, research trends, and research focus, concluding with research challenges and promising research directions in ML testing.
The prevalent applications of machine learning arouse natural concerns about trustworthiness. Safety-critical applications such as self-driving systems [1, 2] and medical treatments increase the importance of behaviour relating to correctness, robustness, privacy, efficiency and fairness. Software testing refers to any activity that aims to detect the differences between existing and required behaviour. With the recent rapid rise in interest and activity, testing has been demonstrated to be an effective way to expose problems and potentially improve the trustworthiness of machine learning systems.
For example, DeepXplore, a differential white-box testing technique for deep learning, revealed thousands of incorrect corner-case behaviours in autonomous driving learning systems; Themis, a fairness testing technique for detecting causal discrimination, detected significant ML model discrimination towards gender, marital status, or race for as many as 77.2% of the individuals in the datasets studied.
In fact, some aspects of the testing problem for machine learning systems are shared with well-known solutions already widely studied in the software engineering literature. Nevertheless, the statistical nature of machine learning systems and their ability to make autonomous decisions raise additional, and challenging, research questions for software testing [6, 7].
Machine learning testing poses challenges that arise from the fundamentally different nature and construction of machine learning systems, compared to traditional (relatively more deterministic and less statistically-orientated) software systems. For instance, a machine learning system inherently follows a data-driven programming paradigm where the decision logic is obtained via a training procedure from training data under the machine learning algorithm’s architecture . The model’s behaviour may evolve over time, in response to the frequent provision of new data . While this is also true of traditional software systems, the core underlying behaviour of a traditional system does not typically change in response to new data, in the way that a machine learning system can.
Testing machine learning also suffers from a particularly pernicious instance of the Oracle Problem . Machine learning systems are difficult to test because they are designed to provide an answer to a question for which no previous answer exists . As Davis and Weyuker say, for these kinds of systems ‘There would be no need to write such programs, if the correct answer were known’. Much of the literature on testing machine learning systems seeks to find techniques that can tackle the Oracle problem, often drawing on traditional software testing approaches.
The behaviours of interest for machine learning systems are also typified by emergent properties, the effects of which can only be fully understood by considering the machine learning system as a whole. This makes testing harder, because it is less obvious how to break the system into smaller components that can be tested, as units, in isolation. From a testing point of view, this emergent behaviour has a tendency to migrate testing challenges from the unit level to the integration and system level. For example, low accuracy/precision of a machine learning model is typically a composite effect, arising from a combination of the behaviours of different components such as the training data, the learning program, and even the learning framework/library .
Errors may propagate to become amplified or suppressed, inhibiting the tester’s ability to decide where the fault lies. These challenges also apply in more traditional software systems, where, for example, previous work has considered failed error propagation [12, 13] and the subtleties introduced by fault masking [14, 15]. However, these problems are far-reaching in machine learning systems, since they arise out of the nature of the machine learning approach and fundamentally affect all behaviours, rather than arising as a side effect of traditional data and control flow .
For these reasons, machine learning systems are sometimes regarded as ‘non-testable’ software. Rising to these challenges, the literature has seen considerable progress and a notable upturn in interest and activity: Figure 1 shows the cumulative number of publications on the topic of testing machine learning systems between 2007 and June 2019. From this figure, we can see that 85% of papers have appeared since 2016, testifying to the emergence of a new software testing domain of interest: machine learning testing.
In this paper, we use the term ‘Machine Learning Testing’ (ML testing) to refer to any activity aimed at detecting differences between existing and required behaviours of machine learning systems. ML testing is different from testing approaches that use machine learning or those that are guided by machine learning, which should be referred to as ‘machine learning-based testing’. This nomenclature accords with previous usages in the software engineering literature. For example, the literature uses the terms ‘state-based testing’ and ‘search-based testing’ [17, 18] to refer to testing techniques that make use of concepts of state and search space, whereas we use the terms ‘GUI testing’ and ‘unit testing’ to refer to test techniques that tackle challenges of testing GUIs (Graphical User Interfaces) and code units.
This paper seeks to provide a comprehensive survey of ML testing. We draw together the aspects of previous work that specifically concern software testing, while simultaneously covering all types of approaches to machine learning that have hitherto been tackled using testing. The literature is organised according to four different aspects: the testing properties (such as correctness, robustness, and fairness), machine learning components (such as the data, learning program, and framework), testing workflow (e.g., test generation, test execution, and test evaluation), and application scenarios (e.g., autonomous driving and machine translation). Some papers address multiple aspects; we discuss such papers under each relevant aspect (in different sections), so that the coverage of each aspect is complete.
Additionally, we summarise research distribution (e.g., among testing different machine learning categories), trends, and datasets. We also identify open problems and challenges for the emerging research community working at the intersection between techniques for software testing and problems in machine learning testing. To ensure that our survey is self-contained, we aimed to include sufficient material to fully orientate software engineering researchers who are interested in testing and curious about testing techniques for machine learning applications. We also seek to provide machine learning researchers with a complete survey of software testing solutions for improving the trustworthiness of machine learning systems.
There has been previous work that discussed or surveyed aspects of the literature related to ML testing. Hains et al., Ma et al., and Huang et al. surveyed secure deep learning, in which the focus was deep learning security with testing as one aspect of guarantee techniques. Masuda et al. outlined their collected papers on software quality for machine learning applications in a short paper. Ishikawa discussed the foundational concepts that might be used in any and all ML testing approaches. Braiek and Khomh discussed defect detection in machine learning data and/or models in their review of 39 papers. As far as we know, no previous work has provided a comprehensive survey particularly focused on machine learning testing.
In summary, the paper makes the following contributions:
1) Definition. The paper defines Machine Learning Testing (ML testing), overviewing the concepts, testing workflow, testing properties, and testing components related to machine learning testing.
2) Survey. The paper provides a comprehensive survey of 128 machine learning testing papers, across various publishing areas such as software engineering, artificial intelligence, systems and networking, and data mining.
3) Analyses. The paper analyses and reports data on the research distribution, datasets, and trends that characterise the machine learning testing literature. We observed a pronounced imbalance in the distribution of research efforts: among the 128 papers we collected, over 110 of them tackle supervised learning testing, three of them tackle unsupervised learning testing, and only one paper tests reinforcement learning. Additionally, most of them (85) centre around correctness and robustness, but only a few papers test interpretability, privacy, or efficiency.
4) Horizons. The paper identifies challenges, open problems, and promising research directions for ML testing, with the aim of facilitating and stimulating further research.
This section reviews the fundamental terminology in machine learning so as to make the survey self-contained.
Machine Learning (ML) is a type of artificial intelligence technique that makes decisions or predictions from data [27, 28]. A machine learning system is typically composed of the following elements or terms. Dataset: A set of instances for building or evaluating a machine learning model.
At the top level, the data could be categorised as:
Training data: the data used to ‘teach’ (train) the algorithm to perform its task.
Validation data: the data used to tune the hyper-parameters of a learning algorithm.
Test data: the data used to validate machine learning model behaviour.
Learning program: the code written by developers to build and validate the machine learning system.
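The three data categories above are typically obtained by partitioning a single collected dataset. A minimal sketch in Python (the 60/20/20 split ratio and the helper name are illustrative assumptions, not prescriptions from the survey):

```python
import random

def split_dataset(instances, ratios=(0.6, 0.2, 0.2), seed=42):
    """Partition a dataset into training, validation, and test data.

    The 60/20/20 default is only illustrative; practitioners choose
    ratios to suit dataset size and hyper-parameter tuning needs.
    """
    assert abs(sum(ratios) - 1.0) < 1e-9
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)      # shuffle to avoid ordering bias
    n = len(shuffled)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    train = shuffled[:n_train]                 # used to 'teach' the algorithm
    val = shuffled[n_train:n_train + n_val]    # used to tune hyper-parameters
    test = shuffled[n_train + n_val:]          # used to validate model behaviour
    return train, val, test

train, val, test = split_dataset(range(100))
```

The shuffle before splitting matters: if the collected data is ordered (e.g., by time or class), an unshuffled split yields unrepresentative test data.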
In the remainder of this section, we give definitions for other ML terminology used throughout the paper.
Instance: a piece of data recording the information about an object.
Feature: a measurable property or characteristic of a phenomenon being observed to describe the instances.
Label: value or category assigned to each data instance.
Test error: the ratio of test-data instances for which the predicted conditions differ from the real conditions.
Generalisation error: the expected ratio of instances, over any valid data, for which the predicted conditions differ from the real conditions.
Model: the learned machine learning artefact, trained from the training data by the learning program and frameworks, that encodes the decision or prediction logic.
There are different types of machine learning. From the perspective of training data characteristics, machine learning includes:
Supervised learning: a type of machine learning that learns from training data with labels as learning targets. It is the most widely used type of machine learning .
Unsupervised learning: a learning methodology that learns from training data without labels and relies on understanding the data itself.
Reinforcement learning: a type of machine learning where the data are in the form of sequences of actions, observations, and rewards, and the learner learns how to take actions to interact in a specific environment so as to maximise the specified rewards.
Let $X = \{x_1, \ldots, x_m\}$ be the set of unlabelled training data. Let $Y = \{c(x_1), \ldots, c(x_m)\}$ be the set of labels corresponding to each piece of training data $x_i$. Let concept $c$ be the mapping from $X$ to $Y$ (the real pattern). The task of supervised learning is to learn a mapping pattern, i.e., a model $h$, based on $X$ and $Y$ so that the learned model $h$ is similar to its true concept $c$ with a very small generalisation error. The task of unsupervised learning is to learn patterns or clusters from the data without knowing the existence of labels. Reinforcement learning requires a set of states $S$, actions $A$, a transition probability $P(s' \mid s, a)$, and a reward probability. It learns how to take actions from $A$ under different states in $S$ so as to obtain the best reward.
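As a concrete (toy) illustration of supervised learning's goal of approximating the concept $c$ from labelled pairs, the sketch below 'learns' a one-nearest-neighbour model from training data. The concept, data, and labels are invented for illustration; this is a minimal sketch, not a reference implementation:

```python
def train_1nn(X, Y):
    """Supervised learning in miniature: memorise (x, c(x)) pairs,
    then predict the label of the nearest memorised instance."""
    memory = list(zip(X, Y))

    def h(x):
        # The learned model: look up the closest training instance.
        nearest_x, nearest_y = min(memory, key=lambda pair: abs(pair[0] - x))
        return nearest_y

    return h

# Toy concept c: the sign of the input (labels are known only for training data).
X = [-3.0, -1.5, -0.2, 0.4, 1.8, 2.9]
Y = ['neg', 'neg', 'neg', 'pos', 'pos', 'pos']
h = train_1nn(X, Y)
```

The learned $h$ generalises to unseen inputs (e.g., 2.5 or -2.0) by proximity to the memorised training data, illustrating how the model's logic is obtained from data rather than hand-coded.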
Machine learning can be applied to the following typical tasks :
1) classification: to assign a category to each data instance; E.g., image classification, handwriting recognition.
2) regression: to predict a value for each data instance; E.g., temperature/age/income prediction.
3) clustering: to partition instances into homogeneous regions; E.g., pattern recognition, market/image segmentation.
4) dimension reduction: to reduce the training complexity; E.g., dataset representation, data pre-processing.
5) control: to control actions to maximise rewards; E.g., game playing.
Figure 3 shows the relationship between different categories of machine learning and the five machine learning tasks. Among the five tasks, classification and regression belong to supervised learning; clustering and dimension reduction belong to unsupervised learning. Reinforcement learning is widely adopted for control tasks, such as controlling an AI game player to maximise the rewards for a game agent.
In addition, machine learning can be classified into classic machine learning and deep learning. Algorithms like Decision Tree, SVM, and Naive Bayes all belong to classic machine learning. Deep learning applies Deep Neural Networks (DNNs) that use multiple layers of nonlinear processing units for feature extraction and transformation. Typical deep learning algorithms often follow widely used neural network structures like Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). The scope of this paper covers both classic machine learning and deep learning.
This section gives a definition and analyses of ML testing. It describes the testing workflow (how to test), testing properties (why to test), and testing components (where and what to test).
A software bug refers to an imperfection in a computer program that causes a discordance between the existing and the required conditions. In this paper, we use the term ‘bug’ to refer to the differences between existing and required behaviours of an ML system. (Related papers may use other terms such as ‘defect’ or ‘issue’; this paper uses ‘bug’ as a representative of all such related terms, considering that ‘bug’ has a more general meaning.)
An ML bug refers to any imperfection in a machine learning item that causes a discordance between the existing and the required conditions.
Having defined ML bugs, in this paper, we define ML testing as the activities aimed to detect ML bugs.
Machine Learning Testing (ML testing) refers to any activities designed to reveal machine learning bugs.
The definitions of machine learning bugs and ML testing indicate three aspects of machine learning: the required conditions, the machine learning items, and the testing activities. A machine learning system may have different types of ‘required conditions’, such as correctness, robustness, and privacy. An ML bug may exist in the data, the learning program, or the framework. The testing activities may include test input generation, test oracle identification, test adequacy evaluation, and bug triage. In this survey, we refer to the above three aspects as testing properties, testing components, and testing workflow, respectively, according to which we collect and organise the related work.
Note that a test input in ML testing can take much more diverse forms than in traditional software testing, because not only the code but also the data may contain bugs. When trying to detect bugs in data, one may even use a toy training program as a test input to check some must-hold data properties.
ML testing workflow is about how to conduct ML testing with different testing activities. In this section, we first briefly introduce the role of ML testing when building ML models, then present the key procedures and activities in ML testing. We introduce more details of the current research related to each procedure in Section 5.
Figure 4 shows the life cycle of deploying a machine learning system with ML testing activities involved. At the very beginning, a prototype model is generated based on historical data; before deploying the model online, one needs to conduct offline testing, such as cross-validation, to make sure that the model meets the required conditions. After deployment, the model makes predictions, yielding new data that can be analysed via online testing to evaluate how the model interacts with user behaviours.
There are several reasons that make online testing essential. First, offline testing usually relies on test data, while test data usually fail to fully represent future data; second, offline testing is not able to test some circumstances that may be problematic in real application scenarios, such as data loss and call delays. In addition, offline testing has no access to some business metrics such as open rate, reading time, and click-through rate.
In the following, we present an ML testing workflow adapted from the classic software testing workflow. Figure 5 shows the workflow, including both offline testing and online testing.
The workflow of offline testing is shown by the top of Figure 5. At the very beginning, developers need to conduct requirement analysis to define the expectations of the users for the machine learning system under test. In requirement analysis, specifications of a machine learning system are analysed and the whole testing procedure is planned. After that, test inputs are either sampled from the collected data or generated based on a specific purpose. Test oracles are then identified or generated (see Section 5.2 for more details of test oracles in machine learning). When the tests are ready, they need to be executed for developers to collect results. The test execution process involves building a model with the tests (when the tests are training data) or running a built model against the tests (when the tests are test data), as well as checking whether the test oracles are violated. After the process of test execution, developers may use some evaluation metrics to check the quality of tests, i.e., the ability of the tests to expose ML problems.
The test execution results yield a bug report to help developers duplicate, locate, and solve the bug. The identified bugs are labelled with different severities and assigned to different developers. Once a bug is repaired, regression testing is conducted to make sure the repair solves the reported problem and does not introduce new problems. If no bugs are identified, the offline testing process ends, and the model is deployed.
Offline testing tests the model with historical data, outside the real application environment. It also lacks the collection of user-behaviour data. Online testing complements these shortcomings of offline testing.
The workflow of online testing is shown by the bottom of Figure 5. Usually the users are split into different groups to conduct control experiments, to help find out which model is better, or whether the new model is superior to the old model under certain application contexts.
A/B testing is one of the key types of online testing of machine learning to validate the deployed models. It is a split testing technique to compare two versions of a system (e.g., web pages) as experienced by customers. When performing A/B testing on ML systems, the sampled users are split into two groups, which use the new and the old ML models respectively.
MAB (Multi-Armed Bandit) is another online testing approach. It first conducts A/B testing for a short time to find the best model, then puts more resources on the chosen model.
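A multi-armed bandit can be sketched with the epsilon-greedy rule, one common bandit strategy: explore all models briefly (the A/B-style comparison), then route most traffic to the empirically best one. The reward rates below are illustrative assumptions:

```python
import random

def epsilon_greedy(models, reward_fn, rounds=1000, explore=100, eps=0.1, seed=1):
    """Explore all models briefly, then mostly exploit the best one."""
    rng = random.Random(seed)
    totals = {m: 0.0 for m in models}   # cumulative reward per model
    counts = {m: 0 for m in models}     # times each model was served
    for t in range(rounds):
        if t < explore:
            m = models[t % len(models)]          # round-robin exploration phase
        elif rng.random() < eps:
            m = rng.choice(models)               # occasional further exploration
        else:                                    # exploit the best average reward
            m = max(models, key=lambda k: totals[k] / max(counts[k], 1))
        totals[m] += reward_fn(m, rng)
        counts[m] += 1
    return counts

# Hypothetical reward rates: model 'B' succeeds more often than model 'A'.
reward = lambda m, rng: 1.0 if rng.random() < (0.3 if m == 'A' else 0.6) else 0.0
counts = epsilon_greedy(['A', 'B'], reward)
```

Compared with a fixed A/B split, the bandit wastes less traffic on the inferior model once the evidence accumulates.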
To build a machine learning model, an ML software developer usually needs to collect data, label the data, design learning program architecture, and implement the proposed architecture based on specific frameworks. The procedure of machine learning model development requires interaction with several components such as data, learning program, and learning framework, while each component may contain bugs.
Figure 6 shows the basic procedure of building an ML model and the major components involved in the process. Data are collected and pre-processed for use; the learning program is the code that is run to train the model; the framework (e.g., Weka, scikit-learn, and TensorFlow) offers algorithms and other libraries for developers to choose from when writing the learning program.
Thus, when conducting ML testing, developers may need to try to find bugs in every component, including the data, the learning program, and the framework. In particular, error propagation is a more serious problem in ML development because the components are more tightly coupled with each other than in traditional software, which indicates the importance of testing each of the ML components. We introduce bug detection in each ML component in the following.
Bug Detection in Data. The behaviour of a machine learning system largely depends on data. Bugs in data affect the quality of the generated model, and can be amplified to yield more serious problems over a period of time. Bug detection in data checks problems such as whether the data is sufficient for training or testing a model (also called completeness of the data), whether the data is representative of future data, whether the data contains a lot of noise such as biased labels, whether there is skew between training data and test data, and whether there is data poisoning or adversarial information that may affect the model's performance.
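Some of these data checks can be automated with simple assertions. The sketch below tests two of the listed problems, label imbalance and train/test feature skew; the helper names and the thresholds (5% minimum class fraction, two standard deviations of mean shift) are illustrative assumptions:

```python
from collections import Counter
from statistics import mean, stdev

def check_label_balance(labels, min_fraction=0.05):
    """Return classes too rare for reliable training (threshold is illustrative)."""
    counts = Counter(labels)
    n = len(labels)
    return [lbl for lbl, c in counts.items() if c / n < min_fraction]

def check_feature_skew(train_values, test_values, max_shift=2.0):
    """Flag train/test skew: test mean far from the train mean,
    measured in units of the training data's standard deviation."""
    shift = abs(mean(test_values) - mean(train_values)) / stdev(train_values)
    return shift > max_shift

train_f = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8]
test_f = [5.0, 5.2, 4.9]          # clearly drawn from a different distribution
rare = check_label_balance(['cat'] * 98 + ['dog'] * 2)
skewed = check_feature_skew(train_f, test_f)
```

Such lightweight checks are the data-testing analogue of assertions in traditional unit tests: they encode must-hold data properties.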
Bug Detection in Frameworks. Machine Learning requires a lot of computations. As shown by Figure 6, ML frameworks offer algorithms to help write the learning program, and platforms to help train the machine learning model, making it easier for developers to build solutions for designing, training and validating algorithms and models for complex problems. They play a more important role in ML development than in traditional software development. ML Framework testing thus checks whether the frameworks of machine learning have bugs that may lead to problems in the final system .
Bug Detection in Learning Program. A learning program can be classified into two parts: the algorithm designed by the developer or chosen from the framework, and the actual code that developers write to implement, deploy, or configure the algorithm. A bug in the learning program may arise either because the algorithm is designed, chosen, or configured improperly, or because the developers make typos or errors when implementing the designed algorithm.
Testing properties refer to why to test in ML testing: what conditions ML testing needs to guarantee for a trained model. This section lists the properties that the literature is most concerned with, including basic functional requirements (i.e., correctness and overfitting degree) and non-functional requirements (i.e., efficiency, robustness, fairness, and interpretability). (Following the more general understanding in the software engineering community [49, 50], we regard robustness as a non-functional requirement.)
These properties are not strictly independent of each other when considering their root causes, yet they manifest as different external behaviours of an ML system and deserve to be treated independently in ML testing.
Correctness measures the probability that the ML system under test ‘gets things right’.
Let $D$ be the distribution of future unknown data. Let $x$ be a data item belonging to $D$. Let $h$ be the machine learning model that we are testing. $h(x)$ is the predicted label of $x$; $c(x)$ is the true label. The model correctness $E(h)$ is the probability that $h(x)$ and $c(x)$ are identical:
$$E(h) = \Pr_{x \sim D}[h(x) = c(x)]$$
Achieving acceptable correctness is the fundamental requirement of an ML system. The real performance of an ML system should be evaluated on future data. Since future data are often not available, the current best practice usually splits the data into training data and test data (or training data, validation data, and test data), and uses test data to simulate future data. This data split approach is called cross-validation.
Let $X = \{x_1, \ldots, x_n\}$ be the set of unlabelled test data sampled from $D$. Let $h$ be the machine learning model under test. Let $Y' = \{h(x_1), \ldots, h(x_n)\}$ be the set of predicted labels corresponding to each test item $x_i$. Let $Y = \{y_1, \ldots, y_n\}$ be the true labels, where each $y_i$ corresponds to the label of $x_i$. The empirical correctness of model $h$ (denoted $\hat{E}(h)$) is:
$$\hat{E}(h) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\big(h(x_i) = y_i\big)$$
where $\mathbb{I}$ is the indicator function: $\mathbb{I}(p)$ returns 1 if predicate $p$ is true, and returns 0 otherwise.
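Empirical correctness is simply the fraction of test items whose predicted label matches the true label; a direct transcription in Python (the toy model and labels are invented for illustration):

```python
def empirical_correctness(h, X, Y):
    """Empirical correctness: (1/n) * sum of indicator(h(x_i) == y_i)
    over the test data."""
    assert len(X) == len(Y)
    n = len(X)
    hits = sum(1 for x, y in zip(X, Y) if h(x) == y)   # indicator function
    return hits / n

# Toy model: predict 'pos' for non-negative inputs, 'neg' otherwise.
h = lambda x: 'pos' if x >= 0 else 'neg'
X = [-2.0, -0.5, 0.0, 1.0, 3.0]
Y = ['neg', 'pos', 'pos', 'pos', 'pos']   # one label disagrees with h
score = empirical_correctness(h, X, Y)    # 4 of 5 predictions match
```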
A machine learning model comes from the combination of a machine learning algorithm and the training data. It is important to ensure that the adopted machine learning algorithm is no more complex than necessary. Otherwise, the model may fail to perform well on, and generalise to, new data.
We define the problem of detecting whether the machine learning algorithm’s capacity fits the data as the identification of Overfitting Degree. The capacity of the algorithm is usually approximated by VC-dimension  or Rademacher Complexity  for classification tasks.
Let $D$ be the training data distribution. Let $R(D)$ be the simplest required capacity of a machine learning algorithm for $D$. Let $R'(D, A)$ be the capacity of the machine learning algorithm $A$ under test. The overfitting degree $f$ is the difference between $R'(D, A)$ and $R(D)$:
$$f = R'(D, A) - R(D)$$
The overfitting degree aims to measure how much a machine learning algorithm fails to fit future data or predict future observations reliably because it is ‘overfitted’ to currently available training data.
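Since algorithm capacity is hard to measure directly, overfitting is often diagnosed indirectly by comparing training correctness with held-out correctness: a memorising (high-capacity) model shows a large gap. The two toy models below and their gap-based diagnostic are an illustrative sketch, not the capacity-based definition above:

```python
def accuracy(h, data):
    return sum(1 for x, y in data if h(x) == y) / len(data)

def memoriser(train):
    """High-capacity model: memorises training labels, guesses elsewhere."""
    table = dict(train)
    return lambda x: table.get(x, 'neg')

def majority(train):
    """Low-capacity model: always predicts the most common training label."""
    labels = [y for _, y in train]
    top = max(set(labels), key=labels.count)
    return lambda x: top

train = [(1, 'pos'), (2, 'neg'), (3, 'pos'), (4, 'neg'), (5, 'pos')]
test = [(6, 'pos'), (7, 'pos'), (8, 'pos')]

# Train/test accuracy gap: large for the memoriser, small for the baseline.
gap_hi = accuracy(memoriser(train), train) - accuracy(memoriser(train), test)
gap_lo = accuracy(majority(train), train) - accuracy(majority(train), test)
```

The memoriser achieves perfect training accuracy but fails on unseen inputs, the signature of an algorithm whose capacity exceeds what the data requires.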
Robustness is defined by the IEEE standard glossary of software engineering terminology [54, 55] as: ‘The degree to which a system or component can function correctly in the presence of invalid inputs or stressful environmental conditions’. In a similar spirit to this definition, we define the robustness of ML as follows:
Let $S$ be a machine learning system. Let $E(S)$ be the correctness of $S$. Let $S'$ be the machine learning system with perturbations on any machine learning components such as the data, the learning program, or the framework. The robustness of a machine learning system is a measurement of the difference between $E(S)$ and $E(S')$:
$$r = E(S) - E(S')$$
Robustness thus measures the resilience of an ML system towards perturbations.
A popular sub-category of robustness is called adversarial robustness. For adversarial robustness, the perturbations are designed to be hard to detect. Referring to the work of Katz et al. , we classify adversarial robustness into local adversarial robustness and global adversarial robustness. Local adversarial robustness is defined as follows.
Let $x$ be a test input for an ML model $h$. Let $x'$ be another test input generated via conducting adversarial perturbation on $x$. Model $h$ is $\delta$-local robust at input $x$ if for any $x'$:
$$\|x - x'\|_p \le \delta \Rightarrow h(x) = h(x')$$
Local adversarial robustness concerns the robustness at one specific test input, while global adversarial robustness requests robustness against all inputs. We define global adversarial robustness as follows.
Let $x$ be a test input for an ML model $h$. Let $x'$ be another test input generated via conducting adversarial perturbation on $x$. Model $h$ is $\delta$-global robust if for any $x$ and $x'$:
$$\|x - x'\|_p \le \delta \Rightarrow h(x) = h(x')$$
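Local adversarial robustness at a given input can be probed (though not proven) by sampling perturbations within the $\delta$-ball and checking that the predicted label never changes. The brute-force sketch below uses the $\infty$-norm and a toy 2-D linear classifier; note that sampling can only falsify robustness, not certify it, which is why formal techniques such as Katz et al.'s are needed:

```python
import random

def probe_local_robustness(h, x, delta, samples=1000, seed=0):
    """Sample perturbed inputs x' with ||x - x'||_inf <= delta.
    Return a counterexample if h(x') != h(x), else None.
    Finding no counterexample does NOT guarantee robustness."""
    rng = random.Random(seed)
    label = h(x)
    for _ in range(samples):
        x_prime = tuple(v + rng.uniform(-delta, delta) for v in x)
        if h(x_prime) != label:
            return x_prime          # robustness violated by this perturbation
    return None

# Toy linear classifier: sign of x1 + x2.
h = lambda x: 'pos' if x[0] + x[1] >= 0 else 'neg'

far = probe_local_robustness(h, (5.0, 5.0), delta=0.1)    # far from the boundary
near = probe_local_robustness(h, (0.05, 0.0), delta=0.1)  # near the boundary
```

The input far from the decision boundary survives all sampled perturbations, while the input near the boundary yields a label-flipping perturbation, i.e., a violation of $\delta$-local robustness at that point.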
The security of an ML system is the system’s resilience against potential harm, danger, or loss made via manipulating or illegally accessing ML components.
Security and robustness are closely related. An ML system with low robustness may be insecure: if it is less robust in resisting perturbations in the data to be predicted, the system may easily suffer from adversarial attacks (i.e., fooling the ML system by generating adversarial examples: special test inputs modified from original inputs that look the same as the originals to humans); if it is less robust in resisting perturbations in the training data, the system may easily suffer from data poisoning (i.e., changing the predictive behaviour by modifying the training data).
Nevertheless, low robustness is just one cause of high security risk. Beyond perturbation attacks, security issues also include other aspects such as model stealing or extraction. This survey focuses on testing techniques for detecting ML security problems, which narrows the security scope to robustness-related security. We combine the introduction of robustness and security in Section 6.3.
Machine learning is widely adopted to predict individual habits or behaviours to maximise the benefits of many companies. For example, Netflix offered a large dataset (with the ratings of around 500,000 users) for an open competition to predict user ratings for films. Nevertheless, sensitive information about individuals in the training data can be easily leaked. Even if the data is anonymised (i.e., to encrypt or remove personally identifiable information from datasets to keep the individual anonymous), there is a high risk of linkage attack, which means the individual information may still be recovered to some extent via linking the data instances among different datasets.
We define privacy in machine learning as the ML system’s ability to preserve private data information. For the formal definition, we use the most popular differential privacy taken from the work of Dwork .
Let $A$ be a randomised algorithm. Let $D_1$ and $D_2$ be two training data sets that differ only in one instance. Let $S$ be a subset of the output set of $A$. $A$ gives $\epsilon$-differential privacy if
$$\Pr[A(D_1) \in S] \le e^{\epsilon} \cdot \Pr[A(D_2) \in S]$$
In other words, $\epsilon$-differential privacy ensures that an observer should not gain much more information from $D_1$ than from $D_2$ when they differ in only one instance.
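A classic way to achieve $\epsilon$-differential privacy for a numeric query is the Laplace mechanism: add noise scaled to the query's sensitivity divided by $\epsilon$. The sketch below applies it to a counting query, whose sensitivity is 1 because neighbouring datasets change the count by at most one; the function names and data are illustrative:

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample from Laplace(0, scale) via inverse transform sampling."""
    u = rng.random() - 0.5
    sign = 1 if u >= 0 else -1
    return -scale * sign * math.log(1 - 2 * abs(u))

def private_count(dataset, predicate, epsilon, seed=0):
    """Release a count with epsilon-differential privacy (Laplace mechanism).
    A counting query has sensitivity 1: datasets differing in one
    instance change the true count by at most 1."""
    rng = random.Random(seed)
    true_count = sum(1 for x in dataset if predicate(x))
    return true_count + laplace_noise(1.0 / epsilon, rng)

data = list(range(100))
noisy = private_count(data, lambda x: x < 50, epsilon=1.0)
```

Smaller $\epsilon$ means more noise and stronger privacy; the released count is close to, but deliberately not equal to, the true count of 50.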
Data privacy has been regulated by the EU General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), making the protection of data privacy a hot research topic. Nevertheless, current research on data privacy mainly focuses on how to provide privacy-preserving machine learning, instead of detecting privacy violations. We discuss privacy-related research opportunities and research directions in Section 10.
The efficiency of a machine learning system refers to its construction or prediction speed. An efficiency problem happens when the system executes slowly or even infinitely during the construction or the prediction phase.
With the exponential growth of data and the complexity of systems, efficiency is an important feature to consider for model selection and framework selection, sometimes even more important than accuracy. For example, given a large VGG-19 model which is hundreds of MB in size, to deploy such a model to a mobile device, certain optimisation, compression, and device-oriented customisation must be performed to make execution on a mobile device feasible in a reasonable time, but accuracy is often sacrificed.
Machine learning is a statistical method and is widely adopted to make decisions, such as income prediction and medical treatment prediction. It learns what human beings teach it (i.e., in the form of training data), while human beings may have cognitive biases, which further affect the data collected or labelled and the algorithm designed, leading to bias issues. It is thus important to ensure that the decisions made by a machine learning system are made in the right way and for the right reason, to avoid problems concerning human rights, discrimination, law, and other ethical issues.
The characteristics that are sensitive and need to be protected against unfairness are called protected characteristics, protected attributes, or sensitive attributes. Examples of legally recognised protected classes include race, colour, sex, religion, national origin, citizenship, age, pregnancy, familial status, disability status, veteran status, and genetic information.
Fairness is often domain specific. Regulated domains include credit, education, employment, housing, and public accommodation. (Public-accommodation regulation prohibits discrimination ‘in a place of public accommodation on the basis of sexual orientation, gender identity, or gender expression’.)
Formulating fairness is the first step towards solving fairness problems and building fair machine learning models. The literature proposes many definitions of fairness, but no firm consensus has yet been reached. Considering that the definitions themselves are a research focus of fairness in machine learning, we discuss how the literature formulates and measures different types of fairness in detail in Section 6.5.
Machine learning models are often applied to assist or make decisions in medical treatment, income prediction, or personal credit assessment. It is essential for human beings to understand the logic and reasons behind the decisions made by a machine learning system, so that they can build trust in those decisions and make them more socially acceptable, understand the causes of decisions to avoid discrimination and gain knowledge, transfer models to other situations, and avoid safety risks as much as possible [63, 64, 65].
The motives and definitions of interpretability are diverse and still contested. Moreover, unlike fairness, a mathematical definition of ML interpretability remains elusive. Referring to the work of Biran and Cotton as well as that of Miller, we define the interpretability of ML as follows.
ML Interpretability refers to the degree to which an observer can understand the cause of a decision made by an ML system.
Interpretability has two aspects: transparency (how the model works) and post hoc explanations (other information that can be derived from the model). Interpretability is also demanded by regulations such as the GDPR, which grants users a legal ‘right to explanation’: the right to ask for an explanation of an algorithmic decision made about them. A thorough introduction to ML interpretability can be found in the book by Christoph Molnar.
Traditional software testing and ML testing are different in many aspects. To understand the unique features of ML testing, we summarise the primary differences between traditional software testing and ML testing in Table I.
| Characteristics | Traditional Testing | ML Testing |
| --- | --- | --- |
| Component to test | code | data and code (learning program, framework) |
| Behaviour under test | usually fixed | changes over time |
| Test input | input data | data or code |
| Test oracle | defined by developers | unknown |
| Adequacy criteria | coverage/mutation score | unknown |
| False positives in bugs | rare | prevalent |
| Tester | developer | data scientist, algorithm designer, developer |
1) Component to test (where the bug may exist): traditional software testing detects bugs in the code, while ML testing detects bugs in the data, the learning program, and the framework, each of which plays an essential role in building an ML model.
2) Behaviours under test: the behaviours of traditional software code are usually fixed once the requirements are fixed, while the behaviours of an ML model may frequently change as the training data is updated.
3) Test input: the test inputs in traditional software testing are usually the input data used when testing code; in ML testing, however, test inputs may take various forms, and one may even need to use a piece of code as a test input to test the data. Note that we separate the definitions of ‘test input’ and ‘test data’. Test inputs in ML testing may include, but are not limited to, test data. When testing the learning program, a test case may be a single instance from the test data or a toy training set; when testing the data, the test input could be a learning program.
4) Test oracle: traditional software testing usually assumes the presence of a test oracle. The output can be verified against the expected values by the developer, and thus the oracle is usually determined beforehand. Machine learning, however, is used to generate answers for questions whose answers are usually unknown in advance: ‘There would be no need to write such programs, if the correct answer were known’, as Davis and Weyuker put it. Thus, oracles for ML testing are more difficult to obtain, and are usually pseudo oracles such as metamorphic relations.
5) Test adequacy criteria: test adequacy criteria provide a quantitative measurement of the degree to which the target software is tested. To date, many adequacy criteria have been proposed and widely adopted in industry, e.g., line coverage, branch coverage, and dataflow coverage. However, due to fundamental differences in programming paradigm and logic representation between machine learning software and traditional software, new test adequacy criteria are required to take the characteristics of machine learning software into consideration.
6) False positives in detected bugs: due to the difficulty of obtaining reliable oracles, ML testing tends to yield more false positives among the reported bugs.
7) Roles of testers: the bugs in ML testing may exist not only in the learning program, but also in the data or the algorithm, and thus data scientists or algorithm designers could also play the role of testers.
This section introduces the scope, the paper collection approach, an initial analysis of the collected papers, and the organisation of our survey.
An ML system may include both hardware and software. The scope of our paper is software testing (as defined in the introduction) applied to machine learning.
We apply the following inclusion criteria when collecting papers. If a paper satisfies any one or more of the following criteria, we will include it. When speaking of related ‘aspects of ML testing’, we refer to the ML properties, ML components, and ML testing procedure introduced in Section 2.
1) The paper introduces/discusses the general idea of ML testing or one of the related aspects of ML testing.
2) The paper proposes an approach, study, or tool/framework that targets testing one of the ML properties or components.
3) The paper presents a dataset or benchmark especially designed for the purpose of ML testing.
4) The paper introduces a set of measurement criteria that could be adopted to test one of the ML properties.
Some papers concern traditional validation of ML model performance, such as introductions to precision, recall, and cross-validation. We do not include these papers because such validation has a long research history and has been studied thoroughly and maturely. Nevertheless, for completeness, we include this knowledge when introducing the background, to set the context. We also exclude papers that adopt machine learning techniques for traditional software testing, as well as those that target ML problems without using testing techniques as a solution.
To collect as many papers across different research areas as possible, we started with exact keyword searches on popular scientific databases, including Google Scholar, DBLP, and arXiv. The keywords used are listed below. ‘ML properties’ denotes the set of ML testing properties: correctness, overfitting degree, robustness, efficiency, privacy, fairness, and interpretability. We used each element of this set plus ‘test’ or ‘bug’ as a search query. Similarly, ‘ML components’ denotes the set of ML components: data, learning program/code, and framework/library. Altogether, we conducted searches across the three repositories before March 8, 2019.
machine learning + test|bug|trustworthiness
deep learning + test|bug|trustworthiness
neural network + test|bug|trustworthiness
ML properties+ test|bug
ML components+ test|bug
Machine learning techniques have been applied in various domains across different research areas, so authors may use very diverse terms. To ensure high coverage of ML testing related papers, we therefore also performed snowballing on each of the related papers found by keyword search. We checked the related work sections of these studies and continued adding related work that satisfies the inclusion criteria introduced in Section 4.1, until we reached closure.
To make the survey more comprehensive and accurate, we emailed the authors of the papers collected via querying and snowballing, asking them to send us any related machine learning testing papers we had not yet included. We also asked them to check whether our descriptions of their work were accurate.
| Query | Hits | After title filter | After full-text filter |
| --- | --- | --- | --- |
| machine learning test | 211 | 17 | 13 |
| machine learning bug | 28 | 4 | 4 |
| machine learning trustworthiness | 1 | 0 | 0 |
| deep learning test | 38 | 9 | 8 |
| deep learning bug | 14 | 1 | 1 |
| deep learning trustworthiness | 2 | 1 | 1 |
| neural network test | 288 | 10 | 9 |
| neural network bug | 22 | 0 | 0 |
| neural network trustworthiness | 5 | 1 | 1 |
Table II shows the details of the paper collection results. The papers collected from Google Scholar and arXiv turned out to be subsets of those from DBLP, so we present only the DBLP results. Keyword search and snowballing yielded 109 papers across six research areas by May 15th, 2019. By June 4th, 2019 we had received over 50 replies from the cited authors, and added another 19 papers based on their feedback. Altogether, we collected 128 papers.
Figure 7 shows the distribution of papers across research venues. Among all the papers, 37.5% were published in software engineering venues such as ICSE, FSE, ASE, ICST, and ISSTA; 11.7% were published in systems and networking venues; surprisingly, only 13.3% were published in artificial intelligence venues such as AAAI, CVPR, and ICLR. Additionally, 25.0% of the papers have not yet been published at peer-reviewed venues (the arXiv portion). This distribution indicates that ML testing is most widely published in software engineering venues, but appears across a diverse range of venues.
We present the literature review from two aspects: 1) a literature review of the collected papers, and 2) a statistical analysis of the collected papers, datasets, and tools. The sections and their corresponding contents are presented in Table III.
| Aspect | Section | Topic |
| --- | --- | --- |
| Testing Workflow | 5.1 | Test Input Generation |
| | 5.2 | Test Oracle Identification |
| | 5.3 | Test Adequacy Evaluation |
| | 5.4 | Test Prioritisation and Reduction |
| | 5.5 | Bug Report Analysis |
| | 5.6 | Debug and Repair |
| Testing Properties | 6.3 | Robustness and Security |
| Testing Components | 7.1 | Bug Detection in Data |
| | 7.2 | Bug Detection in Program |
| | 7.3 | Bug Detection in Framework |
| Application Scenario | 8.1 | Autonomous Driving |
| | 8.3 | Natural Language Inference |
| Statistical Analysis | 9.2 | Analysis of Categories |
| | 9.3 | Analysis of Properties |
| | 9.5 | Open-source Tool Support |
1) Literature Review. The papers in our collection are organised and presented from four angles. We introduce work on the testing workflow in Section 5. In Section 6 we classify the papers by the ML problems they target, including functional properties like correctness and overfitting degree, and non-functional properties like robustness, fairness, privacy, and interpretability. Section 7 introduces testing techniques for detecting bugs in data, learning programs, and ML frameworks, libraries, or platforms. Section 8 introduces testing techniques applied in particular application scenarios such as autonomous driving and machine translation.
We emphasise that the four angles reflect different focuses of ML testing, and each provides a complete organisation of the collected papers (see the discussion in Section 3.1). That is, a single ML testing paper may fit several angles when viewed from different perspectives; we place such a paper in every section where it applies.
2) Statistical Analysis and Summary. We analyse and compare the number of research papers on different machine learning categories (supervised/unsupervised/reinforcement learning), machine learning structures (classic/deep learning), testing properties in Section 9. We also summarise the datasets and tools adopted so far in ML testing.
The four different angles of presenting the related work as well as the statistical summary, analysis, and comparison, enable us to observe the research focus, trend, challenges, opportunities, and directions of ML testing. These contents are presented in Section 10.
This section organises ML testing research based on the testing workflow as shown by Figure 5.
ML testing includes both offline testing and online testing; current research centres on offline testing. Procedures not covered by our paper collection, such as requirements analysis and regression testing, as well as those belonging to online testing, are discussed as research opportunities in Section 10.
We organise the test input generation research based on the techniques they adopt.
Test inputs for ML testing can be classified into two categories: adversarial inputs and natural inputs. Adversarial inputs are perturbed versions of the original inputs. They may not belong to the normal data distribution (i.e., they may rarely exist in practice), but can expose robustness or security flaws. Natural inputs, by contrast, are inputs that belong to the data distribution of a practical application scenario. Here we introduce the related work that aims to generate natural inputs via domain-specific test input synthesis.
Pei et al. proposed DeepXplore, a white-box differential testing technique to generate test inputs for deep learning systems. Inspired by test coverage in traditional software testing, the authors proposed neuron coverage to drive test generation (we discuss different coverage criteria for ML testing in Section 5.2.3). The test inputs are expected to achieve high neuron coverage. Additionally, the inputs need to expose differences among different DNN models, and to resemble real-world data as much as possible. A joint optimisation algorithm iteratively uses gradient search to find a modified input that satisfies all of these goals. The evaluation of DeepXplore indicates that it covers 34.4% and 33.2% more neurons than the same number of randomly picked inputs and adversarial inputs, respectively.
To create useful and effective data for autonomous driving systems, DeepTest performed greedy search with nine different realistic image transformations: changing brightness, changing contrast, translation, scaling, horizontal shearing, rotation, blurring, a fog effect, and a rain effect. Three styles of image transformation are provided in OpenCV (https://github.com/itseez/opencv): linear, affine, and convolutional. The evaluation of DeepTest uses the Udacity self-driving car challenge dataset. It detected more than 1,000 erroneous behaviours on CNNs and RNNs with low false positive rates (examples of the detected erroneous behaviours are available at https://deeplearningtest.github.io/deepTest/).
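As an illustration, one such transformation (a brightness change) can be sketched in a few lines of pure Python on a greyscale pixel grid. This is a toy stand-in for DeepTest's OpenCV-based pipeline; the function name and the list-of-lists image representation are our own:

```python
def change_brightness(image, delta):
    """Shift every pixel of a greyscale image by `delta`, clamped to [0, 255].

    A toy stand-in for one of DeepTest's realistic image transformations;
    a real pipeline would apply OpenCV transformations to NumPy arrays.
    """
    return [[max(0, min(255, p + delta)) for p in row] for row in image]

seed = [[100, 150], [200, 250]]
brighter = change_brightness(seed, 30)   # -> [[130, 180], [230, 255]]
```

A metamorphic oracle then checks that the model's prediction (e.g., a steering angle) does not change significantly between `seed` and `brighter`.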
Generative adversarial networks (GANs) are algorithms that learn models approximating the manifolds and distribution of a given set of data. GANs have been successfully applied to advanced image transformations (e.g., style transformation, scene transformation) that look at least superficially authentic to human observers. Zhang et al. applied GANs to deliver driving-scene-based test generation with various weather conditions. They sampled images from the Udacity Challenge dataset and YouTube videos (snowy or rainy scenes), and fed them into the UNIT framework (a recent DNN-based method for image-to-image transformation) for training. The trained model takes the whole set of Udacity images as seed inputs and yields transformed images as generated tests.
Zhou et al.  proposed DeepBillboard to generate real-world adversarial billboards that can trigger potential steering errors of autonomous driving systems.
To test audio-based deep learning systems, Du et al. designed a set of transformations tailored to audio inputs, considering background noise and volume variation. They first abstracted and extracted a probabilistic transition model from an RNN. Based on this, stateful testing criteria are defined and used to guide test generation for stateful machine learning systems.
To test an image classification platform that classifies biological cell images, Ding et al. built a testing framework for biological cell classifiers. The framework iteratively generates new images and uses metamorphic relations for testing. For example, they generate new images by adding artificial mitochondria to the biological cell images, or by increasing their number or changing their shape, which causes easy-to-identify changes in the classification results.
Fuzz testing is a traditional automatic testing technique that generates random data as program inputs to detect crashes, memory leaks, failed (built-in) assertions, and so on, with many successful applications to system security and vulnerability detection. As another widely used test generation technique, search-based test generation often uses metaheuristic search to guide the fuzzing process towards more efficient and effective test generation [80, 17, 81]. These two techniques have also proved effective in exploring the input space of ML testing:
Odena et al. presented TensorFuzz. TensorFuzz uses a simple nearest-neighbour hill-climbing approach to explore achievable coverage over the valid input space of TensorFlow graphs, in order to discover numerical errors, disagreements between neural networks and their quantised versions, and undesirable behaviour in RNNs.
DLFuzz, proposed by Guo et al., is another fuzz test generation tool, built on the implementation of DeepXplore with neuron coverage as guidance. DLFuzz aims to generate adversarial examples, so its generation process does not require functionally similar deep learning systems for cross-referencing, as DeepXplore and TensorFuzz do; instead it only needs to apply minimal changes to the original inputs to find new inputs that improve neuron coverage but yield predictions different from those for the original inputs. A preliminary evaluation on MNIST and ImageNet shows that, compared with DeepXplore, DLFuzz generates 135% to 584.62% more inputs with 20.11% less time consumption.
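Stripped of the DNN-specific details, the coverage-guided loop that tools such as TensorFuzz and DLFuzz share can be sketched generically. This is our own simplified abstraction, not either tool's actual implementation; here each distinct program output stands in for a coverage signal:

```python
import random

def coverage_guided_fuzz(program, seed, mutate, coverage, budget, rng):
    """Generic coverage-guided fuzzing loop (simplified abstraction).

    Inputs that trigger previously unseen coverage are kept in the corpus
    and mutated further; everything else is discarded.
    """
    corpus = [seed]
    seen = {coverage(program(seed))}
    for _ in range(budget):
        child = mutate(rng.choice(corpus), rng)
        cov = coverage(program(child))
        if cov not in seen:        # new behaviour: keep this input
            seen.add(cov)
            corpus.append(child)
    return corpus

# Toy usage: fuzz x*x, treating each distinct output as new coverage.
corpus = coverage_guided_fuzz(
    program=lambda x: x * x,
    seed=0,
    mutate=lambda x, r: x + r.choice([-1, 1]),
    coverage=lambda out: out,
    budget=50,
    rng=random.Random(0),
)
```

Real tools replace the output-based coverage with neuron coverage over the model's activations, and the mutation operator with domain-specific input perturbations.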
Xie et al. presented DeepHunter, a metamorphic-transformation-based coverage-guided fuzzing technique, which leverages both neuron coverage and the coverage criteria presented by DeepGauge. DeepHunter uses a fine-grained metamorphic mutation strategy to generate tests, which helps reduce the false positive rate; it also demonstrates an advantage in achieving high coverage and bug detection capability.
Wicker et al. proposed feature-guided test generation. They adopted the Scale-Invariant Feature Transform (SIFT) to identify features that represent an image with a Gaussian mixture model, then transformed the problem of finding adversarial examples into a two-player turn-based stochastic game. They used Monte Carlo Tree Search to identify the elements of an image most vulnerable to attack, as the means of generating adversarial examples. Their experiments show that this black-box approach is competitive with some state-of-the-art white-box methods.
Instead of targeting supervised learning, Uesato et al. proposed to evaluate reinforcement learning with adversarial example generation. Detecting catastrophic failures is expensive because such failures are rare. To alleviate this cost, the authors proposed a failure probability predictor to estimate the probability that the agent fails, which was demonstrated to be both effective and efficient.
There are also fuzzers for application scenarios other than image classification. Zhou et al. combined fuzzing and metamorphic testing to test the LiDAR obstacle-perception module of real-life self-driving cars, and reported previously unknown fatal software faults eight days before Uber's deadly crash in Tempe, AZ, in March 2018. Jha et al. investigated how to generate the most effective test cases (i.e., the faults most likely to lead to violations of safety conditions) by modelling fault injection as a Bayesian network. Their evaluation, based on two production-grade autonomous vehicle systems from NVIDIA and Baidu, revealed many situations where faults lead to safety violations.
Udeshi and Chattopadhyay generate inputs for text classification tasks, fuzzing with consideration of both the grammar under test and the distance between inputs. Nie et al. and Wang et al. mutated the sentences of NLI (Natural Language Inference) tasks to generate test inputs for robustness testing. Chan et al. generated adversarial examples for DNC to expose its robustness problems. Udeshi et al. focused on individual fairness, generating test inputs that highlight the discriminatory nature of the model under test. We give details of these domain-specific fuzz testing techniques in Section 8.
Tuncali et al. proposed a framework for testing autonomous driving systems. In their work they compared three test generation strategies: random fuzz test generation, covering-array fuzz test generation, and covering-array search-based test generation (using a simulated annealing algorithm). The results indicate that the search-based strategy performs best in detecting glancing behaviours.
Symbolic execution is a program analysis technique to test whether certain properties can be violated by the software under test . Dynamic Symbolic Execution (DSE, also called concolic testing) is a technique used to automatically generate test inputs that achieve high code coverage. DSE executes the program under test with random test inputs and performs symbolic execution in parallel to collect symbolic constraints obtained from predicates in branch statements along the execution traces. The conjunction of all symbolic constraints along a path is called a path condition. When generating tests, DSE randomly chooses one test input from the input domain, then uses constraint solving to reach a target branch condition in the path . DSE has been found to be accurate and effective, and has been the fundamental technique of some vulnerability discovery tools .
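To make the DSE loop concrete, the following toy sketch (entirely our own construction, not any published tool) runs a small program concretely, records the branch predicates along the executed path, then negates the last predicate and searches for an input satisfying the flipped path condition. Brute-force search over a small domain stands in for a real constraint solver:

```python
def run(x):
    """Toy program under test; records (predicate, taken) pairs along its path."""
    path = [(lambda v: v > 10, x > 10)]
    if x > 10:
        path.append((lambda v: v % 2 == 0, x % 2 == 0))
        return ('A' if x % 2 == 0 else 'B'), path
    return 'C', path

def flip_last(path):
    """Path condition with the last branch negated (the prefix is kept)."""
    *prefix, (last, taken) = path
    return lambda v: all(p(v) == t for p, t in prefix) and last(v) != taken

def solve(constraint, domain):
    """Brute-force stand-in for a constraint solver such as Z3."""
    return next((v for v in domain if constraint(v)), None)

def concolic_explore(seed, domain):
    """Grow a test suite by repeatedly flipping the last branch of each path."""
    tests, covered, frontier = [seed], set(), [seed]
    while frontier:
        result, path = run(frontier.pop())
        covered.add(result)
        v = solve(flip_last(path), domain)
        if v is not None and run(v)[0] not in covered:
            tests.append(v)
            frontier.append(v)
    return tests, covered

tests, covered = concolic_explore(3, range(30))   # tests == [3, 11, 12]
```

Starting from the single input 3, the loop discovers inputs reaching all three program outcomes. A real DSE engine flips every branch along a path (not just the last) and solves the resulting constraints symbolically; the challenges listed below explain why this is hard for ML code.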
In ML testing, the model’s performance is decided, not only by the code, but also by the data, and thus symbolic execution has two application scenarios: either on the data or on the code.
Ramanathan and Pullum applied symbolic analysis to abstract the data in order to generate more effective bug-exposing tests. They proposed a combination of symbolic and statistical approaches to efficiently find test cases that reveal errors in ML systems. The idea is to abstract the data distance-theoretically using symbols, to help search for test inputs where minor changes in the input cause the algorithm to fail. The evaluation of an implementation of a k-means algorithm indicates that the approach is able to detect subtle errors such as bit-flips. The examination of false positives may also be a future research interest.
Applying symbolic execution to machine learning code raises many challenges. Gopinath et al. listed three such challenges for neural networks, which apply to other ML models as well: (1) the networks have no explicit branching; (2) the networks may be highly non-linear, with no well-developed solvers for the resulting constraints; and (3) there are scalability issues, because the structures of ML models are usually very complex and beyond the capabilities of current symbolic reasoning tools.
Considering these challenges, Gopinath et al. introduced DeepCheck. It transforms a Deep Neural Network (DNN) into a program to enable symbolic execution to find pixel attacks that have the same activation pattern as the original image. In particular, the activation functions in a DNN follow an if-else branch structure, so an activation pattern can be viewed as a path through the translated program. DeepCheck is able to create 1-pixel and 2-pixel attacks by identifying the pixels or pixel pairs whose modification causes the neural network to misclassify the image.
Similarly, Agarwal et al. apply LIME, a local explanation tool that approximates a model with linear models, decision trees, or falling rule lists, to help obtain the path used in symbolic execution. Their evaluation on 8 open-source fairness benchmarks shows that their algorithm generates 3.72 times more successful test cases than the random test generation approach THEMIS.
Sun et al. presented DeepConcolic, a dynamic symbolic execution testing method for DNNs. Concrete execution is used to direct the symbolic analysis towards particular conditions of MC/DC-style criteria, by concretely evaluating given properties of the ML models. DeepConcolic explicitly takes coverage requirements as input, and yields over 10% higher neuron coverage than DeepXplore for the evaluated models.
Murphy et al.  generated data with repeating values, missing values, or categorical data for testing two ML ranking applications. Breck et al.  used synthetic training data that adhere to schema constraints to trigger the hidden assumptions in the code that do not agree with the constraints. Zhang et al.  used synthetic data with known distributions to test overfitting. Nakajima and Bui  also mentioned the possibility of generating simple datasets with some predictable characteristics that can be adopted as pseudo oracles.
Test oracle identification is one of the key problems in ML testing: it enables the judgement of whether a bug exists. This is the so-called ‘oracle problem’.
In ML testing, the oracle problem is challenging, because many machine learning algorithms are probabilistic programs. In this section, we list several popular types of test oracle that have been studied for ML testing, i.e., metamorphic relations, cross-referencing, and model evaluation metrics.
Metamorphic relations were proposed by Chen et al. to ameliorate the test oracle problem in traditional software testing. A metamorphic relation refers to the relationship between changes to the software inputs and changes to the outputs across multiple program executions. For example, to test an implementation of the function sin(x), one may check how the output changes when the input is changed from x to π − x. If sin(x) differs from sin(π − x), this observation signals an error, without needing to examine the specific values computed by the implementation. The relation sin(x) = sin(π − x) thus plays the role of a test oracle (also named a ‘pseudo oracle’) to help bug detection.
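A minimal sketch of this pseudo oracle in Python; the deliberately buggy truncated-Taylor implementation is our own illustrative example:

```python
import math

def satisfies_sin_relation(sin_impl, x, tol=1e-9):
    """Metamorphic relation sin(x) == sin(pi - x), used as a pseudo oracle."""
    return abs(sin_impl(x) - sin_impl(math.pi - x)) <= tol

def buggy_sin(x):
    # deliberately faulty: truncated Taylor series, inaccurate away from 0
    return x - x ** 3 / 6

ok = satisfies_sin_relation(math.sin, 1.2345)   # True: the relation holds
bad = satisfies_sin_relation(buggy_sin, 2.5)    # False: the bug is exposed
```

Note that the check never compares against a known-correct value of sin(x); the relation between two executions is the entire oracle.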
In ML testing, metamorphic relations are widely studied as a way to tackle the oracle problem. Many metamorphic relations are based on transformations of the training or test data that are expected to yield an unchanged, or predictably changed, predictive output. The data transformations studied come at different granularities. Some transformations make coarse-grained changes, such as enlarging the dataset or changing the data order, without changing any individual data instance; we call these ‘coarse-grained data transformations’. Other transformations make smaller changes to individual data instances, such as mutating the attributes, labels, or image pixels of an instance; we refer to these as ‘fine-grained data transformations’. Related work on each type of transformation is introduced below.
Coarse-grained Data Transformation. As early as 2008, Murphy et al. discussed properties of machine learning algorithms that may be adopted as metamorphic relations. Six transformations of the input data are introduced: additive, multiplicative, permutative, invertive, inclusive, and exclusive. The corresponding changes are: adding a constant to numerical values; multiplying numerical values by a constant; permuting the order of the input data; reversing the order of the input data; removing part of the input data; and adding additional data. Their analysis covered MartiRank, SVM-Light, and PAYL. Although unevaluated in the 2008 paper, this work provided a foundation for determining the relations and transformations that can be used for metamorphic testing of machine learning.
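For instance, the permutative relation can be checked directly against any learner whose training should be order-insensitive. The tiny nearest-centroid learner below is our own toy, not one of the systems studied by Murphy et al.:

```python
def train_centroids(data):
    """Per-class mean of a 1-D feature; training should be order-insensitive."""
    sums, counts = {}, {}
    for x, y in data:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict(model, x):
    return min(model, key=lambda label: abs(model[label] - x))

train = [(0.1, 'a'), (0.3, 'a'), (0.9, 'b'), (1.1, 'b')]
permuted = list(reversed(train))          # the permutative transformation

# Metamorphic relation: permuting the training data must not change predictions.
for x in (0.0, 0.2, 0.5, 1.0):
    assert predict(train_centroids(train), x) == predict(train_centroids(permuted), x)
```

A violation of this relation for an implementation that should be order-insensitive would indicate a bug, with no labelled oracle needed.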
Ding et al. proposed 11 metamorphic relations to test deep learning systems. At the dataset level, the metamorphic relations were likewise based on training or test data transformations that are not supposed to affect classification accuracy, such as adding 10% more training images to each category of the training set, or removing one category of data from the dataset. The evaluation is based on the classification of biological cell images.
Murphy et al. presented function-level metamorphic relations. The evaluation on 9 machine learning applications indicated that function-level properties were 170% more effective than application-level properties.
Fine-grained Data Transformation. The metamorphic relations of Murphy et al. introduced above are general relations suitable for both supervised and unsupervised learning algorithms. In 2009, Xie et al. proposed the use of metamorphic relations specific to a given model to test the implementations of supervised classifiers. The paper presents five types of metamorphic relations that enable the prediction of expected output changes (such as changes in classes, labels, or attributes) from particular changes to the input. Manual analysis of the implementations of KNN and Naive Bayes from Weka indicates that not all metamorphic relations are proper or necessary. The differences in the metamorphic relations between SVMs and neural networks are also discussed. Dwarakanath et al. applied metamorphic relations to image classification with SVMs and deep learning systems. The changes to the data include changing the feature or instance order, linear scaling of the test features, normalising or scaling up the test data, and changing the convolution operation order of the data. The proposed MRs are able to find 71% of the injected bugs. Sharma and Wehrheim considered fine-grained data transformations, such as changing feature names or renaming feature values, to test fairness. They studied 14 classifiers, none of which was found to be sensitive to feature-name shuffling.
Zhang et al. proposed Perturbed Model Validation (PMV), which combines metamorphic relations and data mutation to detect overfitting. PMV mutates the training data by injecting noise, creating perturbed training datasets, and then checks how quickly the training accuracy decreases as the noise degree increases: the faster the training accuracy decreases, the lower the degree of overfitting.
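PMV's intuition can be illustrated with two extreme learners of our own construction (not PMV's actual experiments): a memoriser that fits any labelling, so its training accuracy survives label noise, and a majority-class learner that cannot fit injected noise:

```python
import random

def perturb_labels(data, noise_rate, labels, rng):
    """Flip each label to a different one with probability `noise_rate`."""
    out = []
    for x, y in data:
        if rng.random() < noise_rate:
            y = rng.choice([l for l in labels if l != y])
        out.append((x, y))
    return out

def training_accuracy(fit, data):
    model = fit(data)
    return sum(model(x) == y for x, y in data) / len(data)

def memoriser(data):
    table = dict(data)                  # fits any labelling perfectly
    return lambda x: table[x]

def majority(data):
    ys = [y for _, y in data]
    top = max(set(ys), key=ys.count)    # cannot adapt to per-instance noise
    return lambda x: top

rng = random.Random(0)
clean = [(i, 'pos' if i < 5 else 'neg') for i in range(10)]
noisy = perturb_labels(clean, 0.5, ['pos', 'neg'], rng)

acc_overfit = training_accuracy(memoriser, noisy)   # stays at 1.0 under noise
acc_simple = training_accuracy(majority, noisy)     # drops below 1.0
```

In PMV the signal is the slope: plotting training accuracy against the injected noise rate, a steep decrease indicates little overfitting, while a flat curve like the memoriser's indicates that the model is fitting noise.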
Al-Azani and Hassine studied the metamorphic relations of Naive Bayes, k-Nearest Neighbour, and their ensemble classifier. It turns out that metamorphic relations necessary for Naive Bayes and k-Nearest Neighbour may not be necessary for their ensemble classifier.
Tian et al. and Zhang et al. stated that the steering angle of an autonomous vehicle should stay the same, or at least not change significantly, for images transformed to simulate different weather conditions. Ramanagopal et al. used the classification consistency of similar images as a test oracle for testing self-driving cars. The evaluation indicates a precision of 0.94 when detecting errors in unlabelled data.
Additionally, Xie et al. proposed METTLE, a metamorphic testing approach for validating unsupervised learning. METTLE has six types of metamorphic relations of different granularities, specially designed for unsupervised learners; these relations manipulate the instance order, distinctness, density, or attributes of the data, or inject outliers. The evaluation is based on synthetic data generated by scikit-learn, and shows that METTLE is practical and effective in validating unsupervised learners. Nakajima et al. [107, 120] discussed the possibilities of using metamorphic relations of different granularities to find problems in SVMs and neural networks, such as manipulating the instance or attribute order, reversing labels and changing attribute values, or manipulating the pixels of images.
Metamorphic Relations between Different Datasets. The consistency relations between or among different datasets can also be regarded as metamorphic relations that can be applied to detect data bugs. Kim et al.  and Breck et al.  studied the metamorphic relation between training data and new data: if the training data and new data have different distributions, the training data may be problematic. Breck et al.  also studied the metamorphic relations among different datasets that are close in time: these datasets are expected to share some characteristics, because it is uncommon to have frequent drastic changes to the data-generation code.
Frameworks to Apply Metamorphic Relations. Murphy et al.  implemented a framework called Amsterdam to automate the process of using metamorphic relations to detect ML bugs. The framework reduces false positives by setting thresholds when comparing results. They also developed Corduroy , which extended the Java Modelling Language to let developers specify metamorphic properties and generate test cases for ML testing.
Cross-Referencing is another type of test oracle in ML testing, including differential testing and N-version programming. Differential testing is a traditional software testing technique that detects bugs by observing whether similar applications yield different outputs for identical inputs [125, 11]. Differential testing is one of the major test oracles for detecting compiler bugs . It is closely related to N-version programming : N-version programming aims to generate multiple functionally equivalent programs based on one specification, so that the combination of different versions is more fault-tolerant and robust.
Davis and Weyuker  discussed the possibilities of differential testing for ‘non-testable’ programs. The idea is that if multiple implementations of an algorithm yield different outputs on the same input, then at least one of the implementations contains a defect. Alebiosu et al.  evaluated this idea on machine learning, and successfully found 5 faults in 10 Naive Bayes implementations and 4 faults in 20 k-nearest neighbour implementations.
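A minimal sketch of this differential-testing idea, using two hypothetical independent nearest-neighbour implementations as each other's pseudo oracle:

```python
import numpy as np

def nn_v1(train_X, train_y, x):
    """Implementation A: explicit loop over training instances."""
    best, best_d = 0, float("inf")
    for i, t in enumerate(train_X):
        d = sum((a - b) ** 2 for a, b in zip(t, x))
        if d < best_d:
            best, best_d = i, d
    return train_y[best]

def nn_v2(train_X, train_y, x):
    """Implementation B: vectorised with numpy."""
    d = ((train_X - x) ** 2).sum(axis=1)
    return train_y[int(d.argmin())]

rng = np.random.default_rng(2)
train_X = rng.normal(size=(40, 3))
train_y = rng.integers(0, 2, size=40)
tests = rng.normal(size=(15, 3))

# Differential testing: any input on which the two implementations
# disagree is a pseudo-oracle violation worth investigating.
disagreements = [x for x in tests
                 if nn_v1(train_X, train_y, x) != nn_v2(train_X, train_y, x)]
assert disagreements == [], "implementations disagree: potential defect"
```

Here an empty disagreement list does not prove correctness; it only fails to expose an inconsistency between the versions.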
Pham et al.  also adopted multiple implementations to test ML, but focused on the implementation of deep learning libraries. They proposed CRADLE, the first approach that focuses on finding and localising bugs in deep learning software libraries. The evaluation was conducted on three libraries (TensorFlow, CNTK, and Theano), 11 datasets (including ImageNet, MNIST, and KGS Go game), and 30 pre-trained models. CRADLE detected 104 unique inconsistencies and 12 bugs.
DeepXplore  and DLFuzz  used differential testing as test oracles to find effective test inputs. Those test inputs causing different behaviours among different algorithms or models are preferred during test generation.
Most differential testing relies on multiple implementations or versions, whereas Qin et al.  used the behaviours of ‘mirror’ programs, generated from the training data, as pseudo oracles. A mirror program is a program generated from the training data, so that its behaviours represent the training data. If the mirror program behaves similarly on the test data, this indicates that the behaviour extracted from the training data suits the test data as well.
Some work has presented definitions or statistical measurements of non-functional properties of ML systems, including robustness , fairness [131, 132, 133], and interpretability [64, 134]. These measurements are not direct oracles for testing, but are essential for testers to understand and evaluate the property under test, and provide actual statistics that can be compared with the expected ones. For example, the definitions of different types of fairness [131, 132, 133] (more details are in Section 6.5.1) specify the conditions an ML system must satisfy to be considered fair. These definitions can be adopted directly to detect fairness violations.
Besides these popular test oracles in ML testing, there are also some domain-specific rules that can be applied to design test oracles. We discussed several domain-specific rules that could be adopted as oracles to detect data bugs in Section 7.1.1. Kang et al.  discussed two types of model assertions for the task of car detection in videos: a flickering assertion to detect flickering of the car bounding box, and a multi-box assertion to detect nested car bounding boxes. For example, if a car bounding box contains other boxes, the multi-box assertion fails. They also proposed automatic fix rules to set a new predictive result when a test assertion fails.
There has also been discussion of the possibility of evaluating the ML learning curve in lifelong machine learning as an oracle . An ML system passes the test oracle if it can grow and increase its knowledge level over time.
Test adequacy evaluation aims to find out whether the existing tests have good fault-revealing ability, and provides an objective confidence measurement of the testing activities. Adequacy criteria can also be adopted to guide test generation. Popular test adequacy evaluation techniques in traditional software testing include code coverage and mutation testing, both of which have also been adopted in ML testing.
In traditional software testing, code coverage measures the degree to which the source code of a program is executed by a test suite . The higher the coverage a test suite achieves, the more probable it is that hidden bugs will be uncovered. In other words, covering a code fragment is a necessary condition for detecting the defects hidden in it. It is thus often desirable to create test suites that achieve higher coverage.
Unlike in traditional software, code coverage is seldom a demanding criterion for ML testing, since the decision logic of an ML model is not written manually but rather learned from training data. For example, in the study of Pei et al. , 100% traditional code coverage is easily achieved with a single randomly chosen test input. Instead, researchers have proposed various types of coverage for ML models beyond code coverage.
Neuron coverage. Pei et al.  proposed the first coverage criterion particularly designed for deep learning testing: neuron coverage. Neuron coverage is calculated as the ratio of the number of unique neurons activated by all test inputs to the total number of neurons in a DNN, where a neuron is considered activated if its output value is larger than a user-specified threshold.
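Neuron coverage as just described can be computed directly from a matrix of neuron activations collected while running the test inputs; the toy two-layer network and threshold below are illustrative assumptions:

```python
import numpy as np

def neuron_coverage(activations, threshold=0.0):
    """activations: (num_inputs, num_neurons) matrix of neuron outputs.
    A neuron counts as covered if at least one input drives its output
    above the threshold."""
    covered = (activations > threshold).any(axis=0)
    return covered.sum() / activations.shape[1]

rng = np.random.default_rng(3)
# toy fully-connected network with tanh activations
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 6))
inputs = rng.normal(size=(10, 4))       # ten test inputs
h1 = np.tanh(inputs @ W1)
h2 = np.tanh(h1 @ W2)
acts = np.hstack([h1, h2])              # activation trace of all 14 neurons

cov = neuron_coverage(acts, threshold=0.5)
assert 0.0 <= cov <= 1.0
```

In practice, the activation matrix would be collected via instrumentation hooks while executing the real model on the test suite.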
Ma et al.  extended the concept of neuron coverage. They first profile a DNN based on the training data to obtain the activation behaviour of each neuron against the training data. Based on this, they propose more fine-grained criteria, namely k-multisection neuron coverage, neuron boundary coverage, and strong neuron activation coverage, to represent the major functional behaviour and corner-case behaviour of a DNN.
MC/DC coverage variants. Sun et al.  proposed four test coverage criteria tailored to the distinct features of DNNs, inspired by the MC/DC coverage criterion . MC/DC observes the change of a Boolean variable, while their proposed criteria observe a sign, value, or distance change of a neuron, in order to capture the causal changes in the test inputs. The approach assumes the DNN to be a fully-connected network, and considers neither the context of a neuron in its own layer nor different neuron combinations within the same layer .
Layer-level coverage. Ma et al.  also presented layer-level coverage criteria, which consider the top hyperactive neurons and their combinations (or sequences) to characterise the behaviours of a DNN. The coverage was evaluated to perform better together with neuron coverage on the MNIST and ImageNet datasets. In their follow-up work [141, 142], they further proposed combinatorial testing coverage, which checks the combinatorial activation status of the neurons in each layer via the fraction of neuron activation interactions in a layer. Sekhon and Fleming  defined a coverage criterion that looks for 1) all pairs of neurons in the same layer having all possible value combinations, and 2) all pairs of neurons in consecutive layers having all possible value combinations.
While the aforementioned criteria to some extent capture the behaviours of feed-forward neural networks, they do not explicitly characterise stateful machine learning systems such as recurrent neural networks (RNNs). RNN-based ML systems have achieved great success in applications that handle sequential inputs, e.g., speech audio, natural language, and cyber-physical control signals. Towards analysing such stateful ML systems, Du et al.  proposed the first set of testing criteria specialised for RNN-based stateful deep learning systems. They first abstracted a stateful deep learning system as a probabilistic transition system. Based on this modelling, they proposed criteria based on the states and traces of the transition system, to capture its dynamic state transition behaviours.
Limitations of Coverage Criteria. Although there are different types of coverage criteria, most of them focus on DNNs. Sekhon and Fleming  examined the existing testing methods for DNNs and discussed the limitations of these criteria.
To date, most proposed coverage criteria are based on the structure of a DNN. Li et al.  pointed out the limitations of structural coverage criteria for deep networks, caused by the fundamental differences between neural networks and human-written programs. Their initial experiments with natural inputs found no strong correlation between the number of misclassified inputs in a test set and its structural coverage. Due to the black-box nature of a machine learning system, it is not clear how such criteria relate directly to the system's decision logic.
In traditional software testing, mutation testing evaluates the fault-revealing ability of a test suite by injecting faults [144, 137]. The ratio of detected faults to all injected faults is called the Mutation Score.
In ML testing, the behaviour of an ML system depends not only on the learning code, but also on the data and model structure. Ma et al.  proposed DeepMutation, which mutates DNNs at the source level or model level to introduce minor perturbations to the decision boundary of a DNN. Based on this, a mutation score is defined as the ratio of test instances whose results change to the total number of instances.
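A simplified sketch of model-level mutation scoring in this spirit is given below; the linear model, perturbation scale, and averaging over mutants are assumptions for illustration, not DeepMutation's actual operators:

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.normal(size=(2, 3))             # original model: a linear classifier
X = rng.normal(size=(50, 2))            # test instances
orig_pred = (X @ W).argmax(axis=1)

def mutation_score(W, X, orig_pred, n_mutants=20, scale=0.05):
    """Fraction of (mutant, input) results that differ from the original
    model's predictions: a rough model-level mutation score."""
    killed = 0
    for _ in range(n_mutants):
        # model-level mutation: small random perturbation of the weights
        mutant = W + rng.normal(scale=scale, size=W.shape)
        killed += ((X @ mutant).argmax(axis=1) != orig_pred).sum()
    return killed / (n_mutants * len(X))

score = mutation_score(W, X, orig_pred)
assert 0.0 <= score <= 1.0
```

Test instances near the decision boundary flip more often under such perturbations, which is why mutation-based adequacy relates to the boundary more directly than structural coverage does.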
Shen et al.  proposed five mutation operators for DNNs and evaluated the properties of mutation on the MNIST dataset. They pointed out that domain-specific mutation operators are needed to enhance mutation analysis.
Compared with structural coverage criteria, mutation-testing-based criteria are more directly relevant to the decision boundary of a DNN. For example, an input that is near the decision boundary of a DNN can more easily detect inconsistencies between the DNN and its mutants.
Kim et al.  introduced surprise adequacy to measure the coverage of the discretised input surprise range for deep learning systems. They argued that a ‘good’ test input should be ‘sufficiently but not overly surprising’ compared with the training data. Two measurements of surprise were introduced: one based on Kernel Density Estimation (KDE), to approximate the likelihood of the system having seen a similar input during training, and one based on the distance (e.g., Euclidean distance) between the vectors representing the neuron activation traces of the given input and the training data.
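The distance-based measurement can be approximated as the distance from an input's activation trace to its nearest neighbour among the training traces; the sketch below uses synthetic traces and is a simplification of the published measure:

```python
import numpy as np

def distance_surprise(train_traces, input_trace):
    """Distance-based surprise: Euclidean distance from the input's
    activation trace to its nearest neighbour in the training traces
    (a simplification of distance-based surprise adequacy)."""
    d = np.linalg.norm(train_traces - input_trace, axis=1)
    return d.min()

rng = np.random.default_rng(5)
# activation traces recorded while the model processed the training set
train_traces = rng.normal(0.0, 1.0, size=(200, 16))

in_dist = rng.normal(0.0, 1.0, size=16)   # input similar to training data
out_dist = rng.normal(8.0, 1.0, size=16)  # clearly out-of-distribution input

# A surprising input sits far from everything seen during training.
assert distance_surprise(train_traces, out_dist) > \
       distance_surprise(train_traces, in_dist)
```

Surprise adequacy then asks a test suite to cover a range of discretised surprise values, not only the familiar inputs.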
To ensure the functionality of an ML system, there may be some typical rules that are necessary to examine. Breck et al.  offered 28 test aspects to consider and a scoring system used by Google. Their focus is to measure how well a given machine learning system is tested. The 28 test aspects are classified into four types: 1) tests for the ML model itself, 2) tests for the ML infrastructure used to build the model, 3) tests for the ML data used to build the model, and 4) tests that check whether the ML system works correctly over time. Most of them are must-check rules that can be applied to guide test generation. For example, the training process should be reproducible; all features should be beneficial; there should be no other model that is simpler but performs better than the current one. They also mentioned that randomly generating input data and training the model for a single step of gradient descent is quite powerful for detecting library mistakes. Their research indicates that although ML testing is complex, there are shared properties from which to design basic test cases for the fundamental functionality of an ML system.
Test input generation in ML must cover a very large input space; at the same time, every test instance needs to be labelled so as to judge predictive accuracy. These two aspects lead to high test generation costs. Byun et al.  used DNN metrics like cross entropy, surprisal, and Bayesian uncertainty to prioritise test inputs, and experimentally showed that these are good indicators of inputs that expose unacceptable behaviours, which are also useful for retraining.
Generating test inputs is also computationally expensive. Zhang et al.  proposed to reduce costs by identifying the test instances that denote the more effective adversarial examples. The approach is a test prioritisation technique that ranks test instances based on their sensitivity to noise, because an instance that is more sensitive to noise is more likely to yield adversarial examples.
Li et al.  focused on test data reduction in operational DNN testing. They proposed a sampling technique guided by the neurons in the last hidden layer of a DNN, using a cross-entropy-minimisation-based distribution approximation technique. The evaluation was conducted on pre-trained models with three image datasets: MNIST, Udacity challenge, and ImageNet. The results show that, compared with random sampling, their approach needs only half the test inputs to achieve a similar level of precision.
Thung et al.  were the first to study machine learning bugs by analysing the bug reports of machine learning systems. 500 bug reports from Apache Mahout, Apache Lucene, and Apache OpenNLP were studied. The explored problems included bug frequency, bug categories, bug severity, and bug resolution characteristics such as bug-fix time, effort, and file number. The results indicated that incorrect implementation accounts for the most ML bugs: 22.6% of bugs are due to incorrect implementation of defined algorithms. Implementation bugs are also the most severe, and take longer to fix. In addition, 15.6% of bugs are non-functional bugs, and 5.6% are data bugs.
Zhang et al.  studied 175 TensorFlow bugs, based on bug reports from GitHub and StackOverflow. They studied the symptoms and root causes of the bugs, the existing challenges in detecting them, and how these bugs are handled. They classified TensorFlow bugs into exceptions or crashes, low correctness, low efficiency, and unknown. The major causes were found to lie in algorithm design and implementation, such as TensorFlow API misuse (18.9%), unaligned tensors (13.7%), and incorrect model parameters or structure (21.7%).
Banerjee et al.  analysed the bug reports of autonomous driving systems from 12 autonomous vehicle manufacturers that drove a cumulative total of 1,116,605 AV miles in California. They used NLP technologies to classify the causes of disengagements (i.e., failures that cause the control of the vehicle to switch from the software to the human driver) into 10 types. Based on their report analysis, issues in machine learning systems and decision control are the primary cause of 64% of all disengagements.
Data Resampling. The generated test inputs introduced in Section 5.1 can not only expose ML bugs, but have also been studied as additions to the training data to improve the model's correctness through retraining. For example, DeepXplore achieves up to 3% improvement in classification accuracy by retraining a deep learning model on generated inputs, and DeepTest  improves the model's accuracy by 46%.
Ma et al.  identified the neurons responsible for misclassification, calling them ‘faulty neurons’. They resampled the training data that influence such faulty neurons to help improve model performance.
Debugging Framework Development. Cai et al.  presented tfdbg, a debugger for ML models built on TensorFlow, containing three key components: 1) the Analyzer, which makes the structure and intermediate state of the runtime graph visible; 2) the NodeStepper, which enables clients to pause, inspect, or modify at a given node of the graph; and 3) the RunStepper, which enables clients to take higher-level actions between iterations of model training. Vartak et al.  proposed the MISTIQUE system to capture, store, and query model intermediates to help debugging. Krishnan and Wu  presented PALM, a tool that explains a complex model with a two-part surrogate model: a meta-model that partitions the training data, and a set of sub-models that approximate the patterns within each partition. PALM helps developers find out which training data impact a prediction the most, and target the subset of training data that accounts for the incorrect predictions, to assist debugging.
Fix Understanding. Fixing bugs in many machine learning systems is difficult because bugs can occur at multiple points in different components. Nushi et al.  proposed a human-in-the-loop approach that simulates potential fixes in different components through human computation tasks: humans were asked to simulate improved component states. The improvements of the system are recorded and compared, to provide guidance to designers about how they can best improve the system.
There has also been some work on providing testing tools or frameworks that help developers implement testing activities in a testing workflow. There is a test framework to generate and validate test inputs for security testing . Dreossi et al.  presented a CNN testing framework that consists of three main modules: an image generator, a collection of sampling methods, and a suite of visualisation tools. Tramer et al.  proposed a comprehensive testing tool to help developers test and debug fairness bugs with an easily interpretable bug report. Nishi et al.  proposed a testing framework covering different evaluation aspects such as allowability, achievability, robustness, avoidability, and improvability. They also discussed different levels of ML testing, such as system, software, component, and data testing.
ML properties concern the required conditions we should care about in ML testing, and are usually connected with the behaviour of an ML model after training. Poor performance on a property, however, may be due to bugs in any of the ML components (see more in Section 7).
This section presents related work on testing both functional and non-functional ML properties. Functional properties include correctness (Section 6.1) and overfitting (Section 6.2). Non-functional properties include robustness and security (Section 6.3), efficiency (Section 6.4), and fairness (Section 6.5).
Correctness concerns the fundamental functional accuracy of an ML system. Classic machine learning validation is the most well-established and widely-used technology for correctness testing. Typical machine learning validation approaches are cross-validation and bootstrap. The principle is to isolate test data via data sampling to check whether the trained model fits new cases. There are several approaches to cross-validation. In hold-out cross-validation , the data are split into two parts: one part as the training data and the other as test data (sometimes a validation set is also needed to help train the model, in which case the validation set is isolated from the training set). In k-fold cross-validation, the data are split into k equal-sized subsets: one subset is used as the test data and the remaining ones as the training data. The process is then repeated k times, with each of the k subsets serving as the test data. In leave-one-out cross-validation, k-fold cross-validation is applied with k set to the total number of data instances. In bootstrapping, the data are sampled with replacement , so the test data may contain repeated instances.
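The k-fold procedure just described can be sketched as follows; the majority-class learner is a hypothetical stand-in for the model under validation:

```python
import numpy as np

def k_fold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k disjoint, roughly equal folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def cross_validate(X, y, train_and_score, k=5):
    """Each fold serves once as the test set; the rest is training data."""
    scores = []
    for fold in k_fold_indices(len(X), k):
        mask = np.ones(len(X), dtype=bool)
        mask[fold] = False
        scores.append(train_and_score(X[mask], y[mask], X[fold], y[fold]))
    return float(np.mean(scores))

def majority_learner(train_X, train_y, test_X, test_y):
    """Toy learner: predict the majority class of the training fold."""
    pred = np.bincount(train_y).argmax()
    return (test_y == pred).mean()

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 2))
y = (rng.random(100) < 0.7).astype(int)   # roughly 70% of labels are class 1
acc = cross_validate(X, y, majority_learner, k=5)
assert 0.0 <= acc <= 1.0
```

Setting k to the number of instances turns the same code into leave-one-out cross-validation.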
There are several widely-adopted correctness measurements, such as accuracy, precision, recall, and AUC. There has been work  analysing the disadvantages of each measurement criterion. For example, accuracy does not distinguish between the types of errors made (false positives versus false negatives); precision and recall may be misleading when the data are very unbalanced. An implication of this work is that performance metrics should be chosen carefully, and it is essential to consider their meanings when adopting them. Chen et al.  studied the variability of both training data and test data when assessing the correctness of an ML classifier. They derived analytical expressions for the variance of the estimated performance and provided an open-source software implementation with an efficient computation algorithm. They also studied the performance of different statistical methods when comparing AUC, and found that the t-test has the best performance .
To better capture correctness problems, Qin et al.  proposed to generate a mirror program from the training data, and then use the behaviours of this mirror program as a correctness oracle. The mirror program is expected to exhibit similar behaviours on the test data.
There has been a study of the prevalence of correctness problems among all reported ML bugs: Zhang et al.  studied 175 TensorFlow bug reports from StackOverflow QA (Question and Answer) pages and from GitHub projects. Among the 175 bugs, 40 concern poor correctness.
When a model is too complex for the data, even the noise of the training data is fitted by the model . Overfitting happens easily, especially when the training data are insufficient [173, 174, 175]. Overfitting may lead to high correctness on the existing training data yet low correctness on unseen data.
Cross-validation is traditionally considered a useful way to detect overfitting. However, it is not always clear how much overfitting is acceptable, and cross-validation may be unlikely to detect overfitting if the test data are unrepresentative of potential unseen data.
Zhang et al.  introduced Perturbed Model Validation (PMV) to detect overfitting (and also underfitting). PMV injects noise into the training data, re-trains the model against the perturbed data, then uses the rate at which training accuracy decreases to detect overfitting/underfitting. The intuition is that an overfitted learner tends to fit the noise in the training sample, while an underfitted learner has low training accuracy regardless of the presence of injected noise. Thus, both overfitting and underfitting tend to be insensitive to noise, exhibiting a small accuracy decrease rate against the noise degree on perturbed data. PMV was evaluated on four real-world datasets (breast cancer, adult, connect-4, and MNIST) and nine synthetic datasets in the classification setting. The results reveal that PMV has much better performance and provides a more recognisable signal for detecting both overfitting and underfitting than cross-validation.
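PMV's intuition can be sketched with two toy learners: a memoriser that stands in for an overfitted model, and a fixed threshold rule that cannot fit label noise. Both learners and the noise degrees are illustrative assumptions, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
X = rng.normal(size=(n, 1))
y = (X[:, 0] > 0).astype(int)            # clean, learnable concept

def train_acc_memoriser(X, y):
    """Overfitted learner: memorises every training label, so its
    training accuracy is perfect even on noisy labels."""
    return 1.0

def train_acc_threshold(X, y):
    """Well-fitted learner: the fixed rule x > 0; it cannot fit noise."""
    return ((X[:, 0] > 0).astype(int) == y).mean()

def pmv_decrease(train_acc, X, y, noise_degrees=(0.0, 0.1, 0.2, 0.3)):
    """PMV signal: drop in training accuracy from the clean data to
    the most heavily perturbed data."""
    accs = []
    for p in noise_degrees:
        noisy = y.copy()
        flip = rng.random(len(y)) < p     # inject label noise at degree p
        noisy[flip] = 1 - noisy[flip]
        accs.append(train_acc(X, noisy))
    return accs[0] - accs[-1]

# The overfitted memoriser barely reacts to noise; the well-fitted
# model's training accuracy drops roughly with the noise degree.
assert pmv_decrease(train_acc_memoriser, X, y) < \
       pmv_decrease(train_acc_threshold, X, y)
```

The flat accuracy curve of the memoriser is exactly the small decrease rate that PMV flags as a symptom of overfitting.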
An ML system usually gathers new data after deployment, which will be added to the training data to improve correctness. The test data, however, cannot be guaranteed to represent the future data. DeepMind presented an overfitting detection approach that generates adversarial examples from test data . If the reweighted error estimate on the adversarial examples is sufficiently different from that of the original test set, overfitting is detected. They evaluated their approach on ImageNet and CIFAR-10.
Gossmann et al.  studied the threat of the test data reuse practice in the medical domain with extensive simulation studies, and found that repeated reuse of the same test data inadvertently results in overfitting under all considered simulation settings.
Kirk  mentioned that training time can be used as a complexity proxy for an ML model; among algorithms with equal correctness, it is better to choose the one with the smaller training time.
Ma et al.  treated data as one of the important causes of low model performance and overfitting. They tried to relieve the overfitting problem by re-sampling the training data. The approach was found to improve test accuracy from 75% to 93% on average, in an evaluation on three image classification datasets.
Unlike correctness or overfitting, robustness is a non-functional characteristic of a machine learning system. A natural way to measure robustness is to check the correctness of the system in the presence of noise ; a robust system should maintain performance when noise is present.
Moosavi-Dezfooli et al.  proposed DeepFool, which computes the perturbations (added noise) that ‘fool’ deep networks, so as to quantify their robustness. Bastani et al.  presented three metrics to measure robustness: 1) pointwise robustness, indicating the minimum input change under which a classifier fails to be robust; 2) adversarial frequency, indicating how often changing an input changes a classifier's results; and 3) adversarial severity, indicating the distance between an input and its nearest adversarial example.
Carlini and Wagner  created a set of attacks that can be used to construct an upper bound on the robustness of a neural network. Tjeng et al.  proposed to use the distance between a test input and its closest adversarial example to measure robustness. Ruan et al.  provided global robustness lower and upper bounds based on the test data to quantify robustness. Gopinath et al.  proposed DeepSafe, a data-driven approach for assessing DNN robustness: inputs that are clustered into the same group should share the same label.
More recently, Mangal et al.  proposed the definition of probabilistic robustness. Their work used abstract interpretation to approximate the behaviour of a neural network and to compute an over-approximation of the input regions where the network may exhibit non-robust behaviour.
Banerjee et al.  explored the use of Bayesian deep learning to model the propagation of errors inside deep neural networks, thereby mathematically modelling the sensitivity of neural networks to hardware errors without performing extensive fault injection experiments.
The existence of adversarial examples allows attacks that may lead to serious consequences in safety-critical applications such as self-driving cars. There is a whole separate literature on adversarial example generation that deserves a survey of its own, so this paper does not attempt to cover it fully, but focuses on those promising aspects that could be fruitful areas for future research at the intersection of traditional software testing and machine learning.
Carlini and Wagner  developed adversarial example generation approaches using distance metrics to quantify similarity. The approach succeeded in generating adversarial examples for all images on the recently proposed defensively distilled networks.
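To give a flavour of gradient-based adversarial example generation, the sketch below applies a fast-gradient-sign-style perturbation to a hand-set logistic regression; the weights, input, and step size are illustrative assumptions, and this is not the Carlini-Wagner attack itself:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A fixed logistic-regression "model" (weights chosen for illustration).
w = np.array([2.0, -1.5])
b = 0.1

def predict(x):
    return int(sigmoid(w @ x + b) > 0.5)

def fgsm(x, y, eps):
    """Fast-gradient-sign perturbation: move the input a small step in
    the direction that increases the loss for the true label y."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w        # d(cross-entropy)/dx for logistic regression
    return x + eps * np.sign(grad_x)

x = np.array([0.4, 0.2])        # clean input, classified as 1
y = predict(x)
adv = fgsm(x, y, eps=0.5)

# A robustness check: a large enough perturbation flips the prediction.
assert predict(adv) != y
```

The smallest eps for which the prediction flips is one concrete measure of pointwise robustness for this input.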
Adversarial input generation has been widely adopted to test the robustness of autonomous driving systems [1, 72, 85, 75, 86]. There have also been research efforts on generating adversarial inputs for NLI models [91, 92] (Section 8.3), malware detection , and the Differentiable Neural Computer (DNC) .
Papernot et al. [185, 186] designed a library to standardise the implementation of adversarial example construction. They pointed out that standardising adversarial example generation is very important because ‘benchmarks constructed without a standardised implementation of adversarial example construction are not comparable to each other’: it is not easy to tell whether a good result is caused by a high level of robustness or by the differences in the adversarial example construction procedure.
There has been research on generating perturbations in the system and recording the system's reactions to them. Jha et al.  presented AVFI, which uses application/software fault injection to approximate hardware errors in the sensors, processors, or memory of autonomous vehicle (AV) systems in order to test their robustness. They also presented Kayotee , a fault-injection-based tool that systematically injects faults into the software and hardware components of autonomous driving systems. Compared with AVFI, Kayotee is capable of characterising error propagation and masking using a closed-loop simulation environment, and can also inject bit flips directly into GPU and CPU architectural state. DriveFI , further presented by Jha et al., is a fault-injection engine that mines situations and faults that maximally impact AV safety.
Tuncali et al.  considered the closed-loop behaviour of the whole system to support adversarial example generation for autonomous driving systems, not only in image space, but also in configuration space.
The empirical study of Zhang et al.  on TensorFlow bug-related artefacts (from StackOverflow QA pages and GitHub) found that nine of the 175 ML bugs (5.1%) are efficiency problems. This proportion is not high; the reason may be either that efficiency problems rarely occur or that these issues are difficult to detect.
Kirk  pointed out that the training efficiency of different machine learning algorithms can be used to compare their complexity.
Spieker and Gotlieb  studied three training data reduction approaches, whose goal is to find a smaller subset of the original training data with similar characteristics during model training, so that model building speed can be improved for faster machine learning testing.
Fairness is a relatively recently emerging non-functional characteristic. According to the work of Barocas and Selbst , there are five major causes of unfairness.
1) Skewed sample: once some initial bias happens, such bias may compound over time.
2) Tainted examples: the data labels are biased because of biased labelling activities of human beings.
3) Limited features: features may be less informative or less reliably collected, misleading the model in building the connection between the features and the labels.
4) Sample size disparity: if the data from the minority group and the majority group are highly imbalanced, it is less likely to model the minority group well.
5) Proxies: some features are proxies of sensitive attributes (e.g., neighbourhood), and may cause bias in the ML model even if sensitive attributes are excluded.
Research on fairness focuses on measuring, discovering, understanding, and coping with the observed differences regarding different groups or individuals in performance. Measuring and discovering the differences are actually defining and discovering fairness bugs. Such bugs can offend and even harm users, and cause programmers and businesses embarrassment, mistrust, loss of revenue, and even legal violations .
Several definitions of fairness have been proposed in the literature, but with no firm consensus yet reached [191, 192, 193, 194]. Even so, these definitions can be used as oracles to detect fairness violations in ML testing.
To help illustrate the formalisation of ML fairness, we use X to denote a set of individuals and Y to denote the true label set when making decisions regarding each individual in X. Let h be the trained machine learning predictive model under test. Let A be the set of sensitive attributes, and Z be the remaining attributes.
1) Fairness Through Unawareness. Fairness Through Unawareness (FTU) means that an algorithm is fair so long as the protected attributes are not explicitly used in the decision-making process . It is a relatively low-cost way to define and ensure fairness. Nevertheless, the non-sensitive attributes in Z may sometimes contain information correlated with the sensitive attributes, which may still lead to discrimination [191, 195]. Excluding sensitive attributes may also impact model accuracy and yield less effective predictive results .
2) Group Fairness. A model under test has group fairness if groups selected based on sensitive attributes have an equal probability of decision outcomes. There are several types of group fairness.
Demographic Parity is a popular group fairness measurement. It is also named Statistical Parity or Independence Parity. It requires that a decision should be independent of the protected attributes. Let g1 and g2 be the two groups into which a sensitive attribute divides the individuals. A model under test satisfies Demographic Parity if P(ŷ = 1 | G = g1) = P(ŷ = 1 | G = g2), where ŷ denotes the predicted label.
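To make this concrete, the demographic parity condition can be estimated from a sample of model decisions. The following sketch is illustrative (the function name and the idea of a gap tolerance are my own choices, not from the survey):

```python
def demographic_parity_gap(predictions, groups):
    """Absolute difference in positive-decision rates between two groups.

    predictions: list of 0/1 model outputs.
    groups: parallel list of group labels (exactly two distinct values).
    """
    labels = sorted(set(groups))
    assert len(labels) == 2, "expects exactly two groups"
    rates = []
    for g in labels:
        outcomes = [p for p, grp in zip(predictions, groups) if grp == g]
        rates.append(sum(outcomes) / len(outcomes))
    return abs(rates[0] - rates[1])

# A gap of 0 means the decision rate is independent of the group; in
# practice a small tolerance (e.g. 0.05) would serve as the test oracle.
```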
Equalised Odds is another group fairness approach, proposed by Hardt et al. A model under test satisfies Equalised Odds if the prediction ŷ is independent of the protected attribute when the true label is fixed as y: P(ŷ = 1 | Y = y, G = g1) = P(ŷ = 1 | Y = y, G = g2) for groups g1 and g2 divided by a sensitive attribute.
When the target label is set to be positive, Equalised Odds becomes Equal Opportunity. It requires that the true positive rate should be the same for all the groups. A model satisfies Equal Opportunity if the prediction ŷ is independent of the protected attribute when the target class is fixed as positive: P(ŷ = 1 | Y = 1, G = g1) = P(ŷ = 1 | Y = 1, G = g2).
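Both conditions can be estimated from labelled test data. This sketch (function names and helper are illustrative, not from the survey) computes the Equal Opportunity gap, i.e. the difference in true positive rates; Equalised Odds would additionally compare false positive rates on the negative class:

```python
def true_positive_rate(preds, truths, groups, group):
    """TPR of one group: P(pred = 1 | truth = 1, group).
    Assumes the group contains at least one positive example."""
    pos = [(p, t) for p, t, g in zip(preds, truths, groups)
           if g == group and t == 1]
    return sum(p for p, _ in pos) / len(pos)

def equal_opportunity_gap(preds, truths, groups):
    """Difference in true positive rates between two groups; a gap of 0
    means the model satisfies Equal Opportunity on this sample."""
    a, b = sorted(set(groups))
    return abs(true_positive_rate(preds, truths, groups, a)
               - true_positive_rate(preds, truths, groups, b))
```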
3) Counter-factual Fairness. Kusner et al. introduced Counter-factual Fairness. A model satisfies Counter-factual Fairness if its output remains the same when the protected attribute is flipped to a counter-factual value and the other variables are modified as determined by the assumed causal model. This measurement of fairness additionally provides a mechanism to interpret the causes of bias.
4) Individual Fairness. Dwork et al. proposed using a task-specific similarity metric to describe the pairs of individuals that should be regarded as similar. According to Dwork et al., a model with individual fairness should give similar predictive results for similar individuals: the distance between the model's outputs for two individuals should be no larger than d(x1, x2), where d is a distance metric over individuals that measures their similarity.
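This is a Lipschitz-style property and can be checked pairwise. In the following illustrative sketch (function names and the toy models are mine, not Dwork et al.'s), Euclidean distance stands in for both the similarity metric and the output distance:

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def violates_individual_fairness(model, x1, x2, d_in=euclidean, d_out=euclidean):
    """Check the Lipschitz condition for one pair of individuals:
    the distance between outputs must not exceed the distance
    between the inputs under the similarity metric."""
    return d_out(model(x1), model(x2)) > d_in(x1, x2)

# A toy 'model' that halves its input satisfies the condition everywhere;
# one that doubles its input violates it for any distinct pair.
def halve(x):
    return [v / 2 for v in x]

def double(x):
    return [v * 2 for v in x]
```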
Analysis and Comparison of Fairness Metrics. Although there are many existing definitions of fairness, each has its advantages and disadvantages. Which fairness definition is the most suitable remains controversial. There is thus some work surveying and analysing the existing fairness metrics, or investigating and comparing their performance based on experimental results, as introduced below.
Gajane and Pechenizkiy  surveyed how fairness is defined and formalised in the literature. Corbett-Davies and Goel  studied three types of fairness definitions: anti-classification, classification parity, and calibration. They pointed out the deep statistical limitations of each type with examples. Verma and Rubin  explained and illustrated the existing most prominent fairness definitions based on a common, unifying dataset.
Saxena et al. investigated people's perceptions of three of the fairness definitions. About 200 participants recruited from Amazon's Mechanical Turk were asked to state their preference among three rules for allocating money between two individuals who had each applied for a loan. The results demonstrate a clear preference for the rule that allocates resources in proportion to the applicants' loan repayment rates.
Galhotra et al. [5, 198] proposed Themis, which considers group fairness using causal analysis. It defines fairness scores as measurement criteria for fairness and uses random test generation techniques to evaluate the degree of discrimination (based on the fairness scores). Themis was also reported to be more efficient on systems that exhibit more discrimination.
Themis generates tests randomly for group fairness, while Udeshi et al. proposed Aequitas, focusing on test generation to uncover discriminatory inputs and inputs essential to understanding individual fairness. The generation approach first randomly samples the input space to discover the presence of discriminatory inputs, then searches the neighbourhood of these inputs to find more of them. Beyond detecting fairness bugs, Aequitas also retrains the machine learning models to reduce discrimination in the decisions they make.
Agarwal et al. used symbolic execution together with local explainability to generate test inputs. The key idea is to use a local explanation, specifically Local Interpretable Model-agnostic Explanations (LIME), which produces decision trees corresponding to an input that can provide paths for symbolic execution, to identify whether the factors that drive a decision include protected attributes. The evaluation indicates that the approach generates 3.72 times more successful test cases than Themis across 12 benchmarks.
Tramer et al. first proposed the concept of 'fairness bugs'. They consider a statistically significant association between a protected attribute and an algorithmic output to be a fairness bug, which they term 'Unwarranted Associations'. They proposed the first comprehensive testing tool, aiming to help developers test and debug fairness bugs with an 'easily interpretable' bug report. The tool is applicable to various areas including image classification, income prediction, and health care prediction.
Sharma and Wehrheim investigated where unfairness comes from by checking whether the algorithm under test is sensitive to training data changes. They mutated the training data in various ways to generate new datasets, such as changing the order of rows and columns and shuffling feature names and values. 12 out of 14 classifiers were found to be sensitive to these changes.
Manual Assessment of Interpretability. The existing work on empirically assessing the interpretability property usually includes human beings in the loop. That is, manual assessment is currently the major approach to evaluating interpretability. Doshi-Velez and Kim gave a taxonomy of evaluation (testing) approaches for interpretability: application-grounded, human-grounded, and functionally-grounded. Application-grounded evaluation involves human beings in experiments with a real application scenario. Human-grounded evaluation involves real humans but uses simplified tasks. Functionally-grounded evaluation requires no human experiments but uses a quantitative metric as a proxy for explanation quality; for example, a proxy for the explanation of a decision tree model may be the depth of the tree.
Friedler et al. introduced two types of interpretability: global interpretability means understanding the entirety of a trained model; local interpretability means understanding the results of a trained model on a specific input and the corresponding output. They asked 1,000 users to produce the expected output changes of a model given an input change, and then recorded accuracy and completion time over varied models. Decision trees and logistic regression models were found to be more locally interpretable than neural networks.
Automatic Assessment of Interpretability. Cheng et al. presented a metric to understand the behaviours of an ML model. The metric measures whether the learned model has actually learned the object in an object-identification scenario, by occluding the surroundings of the objects.
Christoph  proposed to measure the interpretability based on the category of ML algorithms. He mentioned that ‘the easiest way to achieve interpretability is to use only a subset of algorithms that create interpretable models’. He identified several models with good interpretability, including linear regression, logistic regression and decision tree models.
Zhou et al. defined the concepts of metamorphic relation patterns (MRPs) and metamorphic relation input patterns (MRIPs) that can be adopted to help end users understand how an ML system really works. They conducted case studies of various systems, including large commercial websites, Google Maps navigation, Google Maps location-based search, image analysis for face recognition (including Facebook, MATLAB, and OpenCV), and the Google video analysis service Cloud Video Intelligence.
Evaluation of Interpretability Improvement Methods. Machine learning classifiers are widely used in many medical applications, yet the clinical meaning of the predictive outcome is often unclear. Chen et al.  investigated several interpretability improving methods which transform classifier scores to the probability of disease scale. They showed that classifier scores on arbitrary scales can be calibrated to the probability scale without affecting their discrimination performance.
This section organises the related work in ML testing according to the ML component under test: the data, the learning program, or the framework.
Data is a ‘component’ to be tested in ML testing, since the performance of the ML system largely depends on the data. Furthermore, as pointed out by Breck et al. , it is important to detect data bugs early because predictions from the trained model are often logged and used to generate more data, constructing a feedback loop that may amplify even small data bugs over time.
Nevertheless, data testing is challenging. According to the study of Amershi et al., the management and evaluation of data is among the most challenging tasks when developing an AI application at Microsoft. Breck et al. mentioned that data generation logic often lacks visibility in the ML pipeline, and that data are often stored in a raw-value format (e.g., CSV) that strips out the semantic information that could help identify bugs; moreover, the resilience of some ML algorithms to noisy data, together with problems in correctness metrics, makes data problems harder to observe.
Rule-based Data Bug Detection. Hynes et al. proposed the data linter, an ML tool inspired by code linters, to automatically inspect ML datasets. They considered three types of data problems: 1) miscoding data, such as mistyping a number or date as a string; 2) outliers and scaling, such as uncommon list lengths; 3) packaging errors, such as duplicate values, empty examples, and other data organisation issues.
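A minimal sketch of such rule-based checks (my own illustration, not the data linter's actual implementation) might flag numbers stored as strings, empty examples, and duplicates in a list of records:

```python
def lint_dataset(rows):
    """Flag three kinds of data problems in a list of dict records:
    numbers stored as strings (miscoding), empty examples, and
    duplicate examples. Returns (row_index, problem) pairs."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        if not row or all(v in (None, '') for v in row.values()):
            problems.append((i, 'empty example'))
        for key, value in row.items():
            if isinstance(value, str):
                try:
                    float(value)  # parses -> a number hiding in a string
                    problems.append((i, f'number stored as string: {key}'))
                except ValueError:
                    pass
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            problems.append((i, 'duplicate example'))
        seen.add(fingerprint)
    return problems
```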
Cheng et al.  presented a series of metrics to evaluate whether the training data have covered all important scenarios.
Performance-based Data Bug Detection. To solve the problems in training data, Ma et al. proposed MODE. MODE identifies the 'faulty neurons' in neural networks that are responsible for the classification errors, and tests the training data via data resampling to analyse whether the faulty neurons are influenced. MODE improves test effectiveness from 75% to 93% on average, based on evaluation using the MNIST, Fashion-MNIST, and CIFAR-10 datasets.
Metzen et al. proposed to augment DNNs with a small sub-network specially designed to distinguish genuine data from data containing adversarial perturbations. Wang et al. used DNN model mutation to expose adversarial examples, since they found that adversarial samples are more sensitive to perturbations. The evaluation was based on the MNIST and CIFAR-10 datasets. The approach detects 96.4%/90.6% of adversarial samples with 74.1/86.1 mutations for MNIST/CIFAR-10.
Adversarial example detection can be regarded as bug detection in the test data. Carlini and Wagner surveyed ten proposals that are designed for detecting adversarial examples and compared their efficacy. They found that detection approaches rely on the loss functions and can thus be bypassed by constructing new loss functions. They concluded that adversarial examples are significantly harder to detect than previously appreciated.
The training instances and the instances that the model predicts should be consistent in aspects such as features and distributions. Kim et al. proposed two measurements to evaluate the skew between training and test data: one is based on Kernel Density Estimation (KDE) to approximate the likelihood of the system having seen a similar input during training; the other is based on the distance between vectors representing the neuron activation traces of the given input and the training data (e.g., Euclidean distance).
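As a hedged illustration of the first measurement (the names and the reduction to one dimension are mine, not Kim et al.'s), a Gaussian KDE over scalar summaries of activation traces can approximate how 'seen' a new input is:

```python
import math

def kde_likelihood(x, training_points, bandwidth=1.0):
    """Gaussian kernel density estimate of a scalar input under the
    training data; a low value suggests the system has not seen a
    similar input during training."""
    n = len(training_points)
    total = sum(math.exp(-((x - t) / bandwidth) ** 2 / 2)
                for t in training_points)
    return total / (n * bandwidth * math.sqrt(2 * math.pi))

# An input close to the training points gets a much higher likelihood
# than one far away, which can serve as a skew/surprise signal.
```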
The skew between training data and serving data (the data that the ML model predicts on after deployment) has also been investigated. To detect skew in features, a key-join feature comparison is performed. To quantify skew in distributions, the authors argued that general measures such as KL divergence or cosine similarity may not work, because product teams have difficulty understanding the natural meaning of these metrics. Instead, they proposed using the largest change in probability for a value in the two distributions as a measurement of their distance.
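This largest-change metric is simple to state as code. The following sketch (my own illustration) computes it from raw value counts of a categorical feature in two data batches:

```python
def max_probability_change(counts_a, counts_b):
    """Largest absolute change in the probability of any single value
    between two categorical distributions, given raw value counts
    (dicts mapping value -> count)."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    values = set(counts_a) | set(counts_b)
    return max(abs(counts_a.get(v, 0) / total_a - counts_b.get(v, 0) / total_b)
               for v in values)
```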
Breck et al. proposed a data validation system for detecting data bugs. The system applies constraints (e.g., type, domain, valency) to find bugs in a single batch (within the training data or new data), and quantifies the distance between training data and new data. Their system is deployed as an integral part of TFX (an end-to-end machine learning platform at Google). The deployment in production shows that the system helps early detection and debugging of data bugs. They also summarised the types of data bugs, among which new feature columns, unexpected string values, and missing feature columns are the three most common.
Krishnan et al. [208, 209] proposed a model training framework, ActiveClean, that allows for iterative data cleaning while preserving provable convergence properties. ActiveClean suggests a sample of data to clean based on the data’s value to the model and the likelihood that it is ‘dirty’. The analyst can then apply value transformations and filtering operations to the sample to ‘clean’ the identified dirty data.
In 2017, Krishnan et al.  presented a system named BoostClean to detect domain value violations (i.e., when an attribute value is outside of an allowed domain) in training data. The tool utilises the available cleaning resources such as Isolation Forest  to improve a model’s performance. After resolving the problems detected, the tool is able to improve prediction accuracy by up to 9% in comparison to the best non-ensemble alternative.
ActiveClean and BoostClean may involve human beings in the loop of the testing process. Schelter et al. focused on the automatic 'unit' testing of large-scale datasets. Their system provides a declarative API that combines common as well as user-defined quality constraints for data testing. Krishnan and Wu also targeted automatic data cleaning and proposed AlphaClean. They used a greedy tree search algorithm to automatically tune the parameters for data cleaning pipelines. With AlphaClean, the user can focus on defining cleaning requirements and let the system find the best configuration under the defined requirements. The evaluation was conducted on three datasets, demonstrating that AlphaClean finds solutions up to 9X better than state-of-the-art parameter tuning methods.
Training data testing is also regarded as a part of a whole machine learning workflow in the work of Baylor et al. . They developed a machine learning platform that enables data testing, based on a property description schema that captures properties such as the features present in the data and the expected type or valency of each feature.
There are also data cleaning technologies such as statistical or learning approaches from the domain of traditional database and data warehousing. These approaches are not specially designed or evaluated for ML, but they can be re-purposed for ML testing .
Bug detection in the learning program checks whether the algorithm is correctly implemented and configured, e.g., whether the model architecture is well designed and whether there are coding errors.
Unit Tests for the ML Learning Program. McClure introduced ML unit testing with TensorFlow's built-in testing functions to help ensure that 'code will function as expected' and to build developers' confidence.
Schaul et al. developed a collection of unit tests specially designed for stochastic optimisation. The tests are small-scale and isolated, with well-understood difficulty. They can be adopted at the beginning of the learning stage to test the learning algorithms and detect bugs as early as possible.
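A unit test in this spirit can be written for a toy stochastic optimiser. This sketch is illustrative (the quadratic objective and the tolerance are my own choices, not one of Schaul et al.'s tests): a small, isolated problem with a known optimum serves as the oracle.

```python
import random

def sgd_minimise(grad, x0, steps=500, lr=0.1, noise=0.01, seed=0):
    """Noisy gradient descent on a scalar function, standing in for a
    stochastic optimiser under test."""
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        x -= lr * (grad(x) + rng.gauss(0, noise))
    return x

def test_sgd_finds_quadratic_minimum():
    # f(x) = (x - 3)^2 has gradient 2(x - 3) and a known minimum at 3.
    x = sgd_minimise(lambda x: 2 * (x - 3), x0=0.0)
    assert abs(x - 3) < 0.1
```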
Algorithm Configuration Examination. Sun et al. and Guo et al. identified operating system, language, and hardware compatibility issues. Sun et al. studied 329 real bugs from three machine learning frameworks: Scikit-learn, Paddle, and Caffe. Over 22% of the bugs were found to be compatibility problems due to incompatible operating systems, language versions, or conflicts with hardware. Guo et al. investigated deep learning frameworks such as TensorFlow, Theano, and Torch, comparing the learning accuracy, model size, and robustness of different models classifying the MNIST and CIFAR-10 datasets.
The study of Zhang et al. indicates that the most common learning program bug arises when the TensorFlow API changes but the implementation has not been updated accordingly. Additionally, 23.9% (38 of 159) of the bugs in the TensorFlow-based ML projects they studied arise from problems in the learning program.
Karpov et al. also highlighted testing algorithm parameters in all neural network testing problems. The parameters include the number of neurons and their types based on the neuron layer types, the ways the neurons interact with each other, the synapse weights, and the activation functions. However, the work currently remains unevaluated.
Algorithm Selection Examination. Developers usually have more than one learning algorithm to choose from. Fu and Menzies compared deep learning and classic learning on the task of linking Stack Overflow questions, and found that classic learning algorithms (such as refined SVM) could achieve similar (and sometimes better) results at a lower cost. Similarly, the work of Liu et al. found that the k-Nearest Neighbours (KNN) algorithm achieves similar results to deep learning for the task of commit message generation.
Mutant Simulations of Learning Program Faults. Murphy et al. [111, 122] used mutants to simulate programming errors to investigate whether the proposed metamorphic relations are effective in detecting errors. They introduced three types of mutation operators: switching comparison operators, switching mathematical operators, and introducing off-by-one errors for loop variables.
Dolby et al. extended WALA to support static analysis of the behaviour of tensors in TensorFlow learning programs written in Python. They defined and tracked tensor types for machine learning, and changed WALA to produce a dataflow graph to abstract possible program behaviours.
The current research on framework testing focuses on studying and detecting framework relevant bugs.
Xiao et al. focused on the security vulnerabilities of popular deep learning frameworks including Caffe, TensorFlow, and Torch. They examined the implementations of these frameworks, finding their dependencies to be very complex, and identified multiple vulnerabilities in these implementations. The most common vulnerabilities are software bugs that cause programs to crash, enter an infinite loop, or exhaust all the memory.
Guo et al.  tested deep learning frameworks, including TensorFlow, Theano, and Torch, by comparing their runtime behaviour, training accuracy, and robustness under identical algorithm design and configuration. The results indicate that runtime training behaviours are quite different for each framework, while the prediction accuracies remain similar.
Low Efficiency is a problem for ML frameworks, which may directly lead to inefficiency of the models built on them. Sun et al.  found that approximately 10% of the reported framework bugs concern low efficiency. These bugs are usually reported by users. Compared with other types of bugs, they may take longer for developers to resolve.
Many learning algorithms are implemented inside ML frameworks. Implementation bugs in ML frameworks may cause neither crashes, errors, nor efficiency problems , making their detection challenging.
Challenges in Detecting Implementation Bugs. Thung et al. studied machine learning bugs as early as 2012. Their results, on 500 bug reports from three machine learning systems, indicated that approximately 22.6% of the bugs are due to incorrect implementation of defined algorithms. Cheng et al. injected implementation bugs into classic machine learning code in Weka and observed the resulting performance changes. They found that 8% to 40% of the logically non-equivalent executable mutants (injected implementation bugs) were statistically indistinguishable from their original versions.
Solutions towards Detecting Implementation Bugs. Some work has used multiple implementations or differential testing to detect bugs. For example, Alebiosu et al. found five faults in 10 Naive Bayes implementations and four faults in 20 k-nearest neighbour implementations. Pham et al. found 12 bugs in three libraries (i.e., TensorFlow, CNTK, and Theano), 11 datasets (including ImageNet, MNIST, and the KGS Go game), and 30 pre-trained models (see Section 5.2.2 for more details).
However, not every algorithm has multiple implementations, in which case metamorphic testing may be helpful. Murphy et al. [10, 109] were the first to discuss the possibilities of applying metamorphic relations to machine learning implementations. They listed several transformations of the input data that should not affect the outputs, such as multiplying numerical values by a constant, permuting or reversing the order of the input data, and adding additional data. Their case studies, on three machine learning applications, found metamorphic testing to be an efficient and effective approach to testing ML applications.
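To make the idea concrete, here is a minimal illustration (not Murphy et al.'s code) of one such relation: permuting the order of the training data should not change a deterministic learner's prediction. A deliberately simple nearest-centroid classifier serves as the subject under test:

```python
class NearestCentroid:
    """A deliberately simple classifier used only to illustrate the relation."""
    def fit(self, xs, ys):
        sums, counts = {}, {}
        for x, y in zip(xs, ys):
            acc = sums.setdefault(y, [0.0] * len(x))
            for i, v in enumerate(x):
                acc[i] += v
            counts[y] = counts.get(y, 0) + 1
        self.centroids = {y: [v / counts[y] for v in acc]
                          for y, acc in sums.items()}
        return self

    def predict(self, x):
        def dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, c))
        return min(self.centroids, key=lambda y: dist(self.centroids[y]))

def satisfies_permutation_mr(xs, ys, test_x):
    """Metamorphic relation: permuting (here, reversing) the order of the
    training data must not change the prediction for test_x."""
    original = NearestCentroid().fit(xs, ys).predict(test_x)
    reordered = NearestCentroid().fit(xs[::-1], ys[::-1]).predict(test_x)
    return original == reordered
```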
Xie et al. focused on supervised learning. They proposed using more specific metamorphic relations to test the implementations of supervised classifiers, discussing five types of potential metamorphic relations on KNN and Naive Bayes with randomly generated data. In 2011, they further evaluated their approach using mutated machine learning code. Of the 43 faults injected into Weka by MuJava, the metamorphic relations were able to reveal 39, indicating their effectiveness. In their work, the test inputs were randomly generated data; more evaluation with real-world data is needed.
Dwarakanath et al. applied metamorphic relations to find implementation bugs in image classification. For classic machine learning such as SVM, they applied mutations such as changing feature or instance orders and linearly scaling the test features. For deep learning models such as residual networks, since the data features are not directly available, they proposed normalising or scaling the test data, or changing the convolution operation order of the data. These changes are intended to leave the model performance unchanged when there are no implementation bugs; otherwise, implementation bugs are exposed. For evaluation, they used MutPy to inject mutants simulating implementation bugs, of which the proposed MRs were able to find 71%.
Machine learning has been widely adopted in different areas such as autonomous driving and machine translation. This section introduces the domain-specific testing approaches in three typical application domains.
Testing autonomous vehicles has a long history. Wegener and Bühler compared different fitness functions for evaluating tests of autonomous car parking systems.
Most of the current autonomous vehicle systems that have been put into the market are semi-autonomous vehicles, which require a human driver to serve as a fall-back in the case of failure , as was the case with the work of Wegener and Bühler . A failure that causes the human driver to take control of the vehicle is called a disengagement.
Banerjee et al. investigated the causes and impacts of 5,328 disengagements from the data of 12 AV manufacturers for 144 vehicles that drove a cumulative 1,116,605 autonomous miles, 42 (0.8%) of which led to accidents. They classified the causes of disengagements into 10 types. Fully 64% of the disengagements were found to be caused by bugs in the machine learning system, among which low performance of image classification (e.g., improper detection of traffic lights, lane markings, holes, and bumps) was the dominant cause, accounting for 44% of all reported disengagements. A further 20% were due to bugs in the control and decision framework, such as improper motion planning.
Pei et al. used gradient-based differential testing to generate test inputs to detect potential DNN bugs, leveraging neuron coverage as a guideline. Tian et al. proposed using a set of image transformations to generate tests, simulating the potential noise that a real-world camera could encounter. Zhang et al. proposed DeepRoad, a GAN-based approach to generate test images for real-world driving scenes. Their approach supports two weather conditions (i.e., snowy and rainy), with images generated from pictures in YouTube videos. Zhou et al. proposed DeepBillboard, which generates real-world adversarial billboards that can trigger potential steering errors of autonomous driving systems. It demonstrates the possibility of generating continuous and realistic physical-world tests for practical autonomous driving systems.
Wicker et al. used feature-guided Monte Carlo Tree Search to identify the elements of an image that are most vulnerable for a self-driving system, in order to generate adversarial examples. Jha et al. accelerated the process of finding 'safety-critical' faults by analytically modelling the injection of faults into an AV system as a Bayesian network. The approach trains the network to identify safety-critical faults automatically. The evaluation was based on two production-grade AV systems from NVIDIA and Baidu, indicating that the approach can find many situations where faults lead to safety violations.
Uesato et al. aimed to find catastrophic failures in safety-critical agents, such as autonomous driving agents, in reinforcement learning. They demonstrated the limitations of traditional random testing, then proposed a predictive adversarial example generation approach to predict failures and estimate reliable risks. The evaluation on the TORCS simulator indicates that the proposed approach is both effective and efficient, requiring fewer Monte Carlo runs.
To test whether an algorithm can lead to a problematic model, Dreossi et al. proposed to generate training data as well as test data. Focusing on Convolutional Neural Networks (CNN), they built a tool to generate natural images and visualise the gathered information to detect blind spots or corner cases in the autonomous driving scenario. Although there is currently no evaluation, the tool has been made available at https://github.com/shromonag/FalsifyNN.
Tuncali et al. presented a framework that supports both system-level testing and the testing of the properties of an ML component. The framework also supports fuzz test input generation and search-based testing using approaches such as Simulated Annealing and Cross-Entropy optimisation.
While many other studies investigated DNN model testing for research purposes, Zhou et al.  combined fuzzing and metamorphic testing to test LiDAR, which is an obstacle-perception module of real-life self-driving cars, and detected real-life fatal bugs.
Machine translation automatically translates text or speech from one language to another. The BLEU (BiLingual Evaluation Understudy) score  is a widely-adopted measurement criterion to evaluate machine translation quality. It assesses the correspondence between a machine’s output and that of a human.
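The core of BLEU is modified n-gram precision, sketched below for a single reference. This is a simplification (full BLEU combines several n-gram orders geometrically and applies a brevity penalty), and the function name is my own:

```python
from collections import Counter

def modified_precision(candidate, reference, n):
    """Modified n-gram precision: each candidate n-gram is credited at
    most as many times as it appears in the reference, which prevents a
    candidate from gaming the score by repeating a common word."""
    cand = [tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1)]
    ref = Counter(tuple(reference[i:i + n])
                  for i in range(len(reference) - n + 1))
    if not cand:
        return 0.0
    clipped = sum(min(count, ref[g]) for g, count in Counter(cand).items())
    return clipped / len(cand)
```

For instance, the degenerate candidate 'the the the the the the the' against the reference 'the cat is on the mat' scores only 2/7 on unigrams, because 'the' appears twice in the reference.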
Zhou et al. [123, 124] used metamorphic relations in their tool ‘MT4MT’ to evaluate the translation consistency of machine translation systems. The idea is that some changes to the input should not affect the overall structure of the translated output. Their evaluation showed that Google Translate outperformed Microsoft Translator for long sentences whereas the latter outperformed the former for short and simple sentences. They hence suggested that the quality assessment of machine translations should consider multiple dimensions and multiple types of inputs.
The work of Zheng et al. [228, 229, 230] proposed two algorithms for detecting two specific types of machine translation violations: (1) under-translation, where some words/phrases from the original text are missing in the translation, and (2) over-translation, where some words/phrases from the original text are unnecessarily translated multiple times. The algorithms are based on a statistical analysis of both the original texts and the translations, to check whether there are violations of one-to-one mappings in words/phrases.
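As a toy illustration of the over-translation check (my own simplification, not Zheng et al.'s algorithm, which rests on a statistical analysis of both texts), assume translated words have already been aligned back to source tokens; a word that appears more often than any one-to-one mapping allows is then suspicious:

```python
from collections import Counter

def over_translated_words(source_tokens, translation_tokens):
    """Flag words that occur more often in the (aligned) translation
    than a one-to-one mapping from the source could justify."""
    src = Counter(source_tokens)
    out = Counter(translation_tokens)
    # A word is suspicious if its output count exceeds its source count
    # (words absent from the source must repeat to be flagged).
    return {w for w, c in out.items() if c > max(src.get(w, 0), 1)}
```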
A Natural Language Inference (NLI) task judges the inference relationship between a pair of natural language sentences. For example, the sentence 'A person is in the room' can be inferred from the sentence 'A girl is in the room'.
Several works have tested the robustness of NLI models. Nie et al.  generated sentence mutants (called ‘rule-based adversaries’ in the paper) to test whether the existing NLI models have semantic understanding. Surprisingly, seven state-of-the-art NLI models (with diverse architectures) were all unable to recognise simple semantic differences when the word-level information remains unchanged.
Similarly, Wang et al. mutated the inference target pair by simply swapping the premise and the hypothesis. The heuristic is that a good NLI model should report comparable accuracy between the original and swapped test sets for contradiction pairs and neutral pairs, and lower accuracy on the swapped test set for entailment pairs (since the hypothesis may or may not be true given the premise).
This section analyses the research distribution among different testing properties and machine learning categories. It also summarises the datasets (name, description, size, and usage scenario of each dataset) that have been used in ML testing.
Figure 8 shows several big events in ML testing. As early as 2007, Murphy et al. mentioned the idea of testing machine learning applications. They classified machine learning applications as 'non-testable' programs, considering the difficulty of obtaining test oracles. Their testing mainly refers to the detection of implementation bugs, described as 'to ensure that an application using the algorithm correctly implements the specification and fulfils the users' expectations'. Afterwards, Murphy et al. discussed the properties of machine learning algorithms that may be adopted as metamorphic relations to detect implementation bugs.
In 2009, Xie et al.  also applied metamorphic testing on supervised learning applications.
In 2017, Pei et al. published the first white-box testing paper on deep learning systems. This has become a milestone of machine learning testing, pioneering coverage criteria for DNNs. Inspired by this paper, a number of machine learning testing techniques have emerged, such as DeepTest, DeepGauge, DeepConcolic, and DeepRoad. A number of software testing techniques have been applied to ML testing, such as various testing coverage criteria [72, 138, 85], mutation testing, combinatorial testing, metamorphic testing, and fuzz testing.
This section introduces and compares the research status of each machine learning category.
To explore the research trend of ML testing, we classify the papers into two categories: those targeting only deep learning and those designed for general machine learning (including deep learning).
Among all 128 papers, 52 (40.6%) present testing techniques that are specially designed for deep learning; the remaining 76 papers cater to general machine learning.
We further investigated the number of papers in each category for each year, to observe whether there is a trend of moving from testing general machine learning to testing deep learning. Figure 9 shows the results. Before 2017, papers mostly focused on general machine learning; from 2018, testing of both general machine learning and deep learning notably increases.
We further classified the papers based on the three machine learning categories: 1) supervised learning testing, 2) unsupervised learning testing, and 3) reinforcement learning testing. Perhaps surprisingly, almost all the work we identified in this survey focuses on testing supervised machine learning. Among the 128 related papers, there are currently only three papers testing unsupervised machine learning: Murphy et al.  introduced metamorphic relations that work for both supervised and unsupervised learning algorithms. Ramanathan and Pullum  proposed a combination of symbolic and statistical approaches to test the k-means clustering algorithm. Xie et al.  designed metamorphic relations for unsupervised learning. One paper focused on reinforcement learning testing: Uesato et al.  proposed a predictive adversarial example generation approach to predict failures and estimate reliable risks in reinforcement learning.
We then tried to understand whether the imbalanced testing distribution among categories is due to an imbalance in their research popularity. To approximate the research popularity of each category, we searched the terms ‘supervised learning’, ‘unsupervised learning’, and ‘reinforcement learning’ in Google Scholar and Google. Table IV shows the search hits. The last column shows the number/ratio of papers that touch each machine learning category in ML testing. For example, 115 out of the 128 papers address supervised learning testing. The table shows that the testing popularity of the different categories is clearly disproportionate to their overall research popularity. In particular, reinforcement learning has more search hits than supervised learning, but we did not observe any related work that directly tests reinforcement learning.
There may be several reasons for this observation. First, supervised learning is a widely-known learning scenario associated with classification, regression, and ranking problems . It is natural that researchers would emphasise the testing of widely-applied, known and familiar techniques at the beginning. Second, supervised learning usually has labels in the dataset. It is thereby easier to judge and analyse test effectiveness.
[Table IV: Category | Scholar hits | Google hits | Testing hits]
Nevertheless, many opportunities clearly remain for research in the widely studied areas of unsupervised learning and reinforcement learning (we discuss more in Section 10).
ML has different tasks such as classification, regression, clustering, and dimension reduction (see more in Section 2). The research focus across tasks is also highly imbalanced: among the papers we identified, almost all focus on classification.
We counted the number of papers concerning each ML testing property. Figure 10 shows the results. The properties in the legend are ranked by the number of papers that specifically focus on testing that property (‘general’ refers to papers that discuss or survey ML testing generally).
As the figure shows, around one-third (32.1%) of the papers test correctness. Another third focus on robustness and security problems. Fairness testing ranks third among the properties, with 13.8% of the papers.
In contrast, for overfitting detection, interpretability testing, and efficiency testing, only three papers exist for each in our collection. For privacy, there are some papers discussing how to ensure privacy , but we did not find papers that systematically ‘test’ privacy or detect privacy violations.
| Dataset | Description | Size | Testing properties |
| --- | --- | --- | --- |
| MNIST | Images of handwritten digits | 60,000+10,000 | correctness, overfitting, robustness |
| Fashion MNIST | MNIST-like dataset of fashion images | 70,000 | correctness, overfitting |
| CIFAR-10 | General images with 10 classes | 50,000+10,000 | correctness, overfitting, robustness |
| ImageNet | Visual recognition challenge dataset | 14,197,122 | correctness, robustness |
| IRIS flower | The Iris flowers | 150 | overfitting |
| SVHN | House numbers | 73,257+26,032 | correctness, robustness |
| Fruits 360 | Dataset with 65,429 images of 95 fruits | 65,429 | correctness, robustness |
| Handwritten Letters | Colour images of Russian letters | 1,650 | correctness, robustness |
| Balance Scale | Psychological experimental results | 625 | overfitting |
| DSRC | Wireless communications between vehicles and road side units | 10,000 | overfitting, robustness |
| Udacity challenge | Udacity Self-Driving Car Challenge images | 101,396+5,614 | robustness |
| Nexar traffic light challenge | Dashboard camera images | 18,659+500,000 | robustness |
| MSCOCO | Object recognition dataset | 160,000 | correctness |
| Autopilot-TensorFlow | Recorded to test the NVIDIA Dave model | 45,568 | robustness |
| KITTI | Six different scenes captured by a VW Passat station wagon equipped with four video cameras | 14,999 | robustness |
Tables V to VIII show the details of some widely-adopted datasets used in ML testing research. In each table, the first column gives the name and link of the dataset. The next few columns give a brief description, the size (training set size + test set size, if applicable), the testing problem(s), and the application scenario of each dataset. (These tables do not list data cleaning datasets, because much data can be dirty and some evaluations involve hundreds of datasets .)
Table V shows the datasets used for image classification tasks. Most of these datasets are very large (e.g., more than 14 million images in ImageNet). The last six rows show the datasets collected for autonomous driving system testing. Most image datasets are adopted to test the correctness, overfitting, and robustness of ML systems.
Table VI shows the datasets related to natural language processing. The contents are usually texts, sentences, or files of text, used in scenarios such as robustness and correctness testing.
| Dataset | Description | Size | Testing properties |
| --- | --- | --- | --- |
| bAbI | Questions and answers for NLP | 1,000+1,000 | robustness |
| Tiny Shakespeare | Samples from actual Shakespeare | 100,000 characters | correctness |
| Stack Exchange Data Dump | Stack Overflow questions and answers | 365 files | correctness |
| SNLI | Stanford Natural Language Inference Corpus | 570,000 | robustness |
| MultiNLI | Crowd-sourced collection of sentence pairs annotated with textual entailment information | 433,000 | robustness |
| DMV failure reports | AV failure reports from 12 manufacturers in California (Bosch, Delphi Automotive, Google, Nissan, Mercedes-Benz, Tesla Motors, BMW, GM, Ford, Honda, Uber, and Volkswagen) | continuously updated | correctness |
The datasets used to make decisions are introduced in Table VII. They usually consist of records with personal information, and are thus widely adopted to test the fairness of ML models.
We also calculated how many datasets an ML testing paper typically uses in its evaluation (for those papers with an evaluation). Figure 11 shows the results. Surprisingly, most papers use only one or two datasets; one reason might be that training and testing machine learning models is costly. There is one paper with as many as 600 datasets, but that paper used them to evaluate data cleaning techniques, which have relatively low cost .
We also discuss research directions of building dataset and benchmarks for ML testing in Section 10.
| Dataset | Description | Size | Testing properties |
| --- | --- | --- | --- |
| German Credit | Descriptions of customers with good and bad credit risks | 1,000 | fairness |
| Adult | Census income | 48,842 | fairness |
| Bank Marketing | Bank client subscription term deposit data | 45,211 | fairness |
| US Executions | Records of every execution performed in the United States | 1,437 | fairness |
| Fraud Detection | European credit card transactions | 284,807 | fairness |
| Berkeley Admissions Data | Graduate school applicants to the six largest departments at the University of California, Berkeley in 1973 | 4,526 | fairness |
| Broward County COMPAS | Scores used to determine whether to release a defendant | 18,610 | fairness |
| MovieLens Datasets | People’s preferences for movies | 100k-20m | fairness |
| Zestimate | Data about homes and Zillow’s in-house price predictions | 2,990,000 | correctness |
| FICO scores | United States credit worthiness | 301,536 | fairness |
| Law school success | Information concerning law students from 163 law schools in the United States | 21,790 | fairness |
| VirusTotal | Malicious PDF files | 5,000 | robustness |
| Contagio | Clean and malicious files | 28,760 | robustness |
| Drebin | Applications from different malware families | 123,453 | robustness |
| Chess | Chess game data: King+Rook versus King+Pawn on a7 | 3,196 | correctness |
| Waveform | CART book’s generated waveform data | 5,000 | correctness |
There are several tools specially designed for ML testing. Angell et al. presented Themis , an open-source tool for testing group discrimination (http://fairness.cs.umass.edu/). There is also an ML testing framework for TensorFlow, named mltest (https://github.com/Thenerdstation/mltest), for writing simple ML unit tests. Similarly, torchtest (https://github.com/suriyadeepan/torchtest) is a framework for writing unit tests for PyTorch-based ML systems. Dolby et al.  extended WALA to enable static analysis of machine learning code that uses TensorFlow.
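In the spirit of mltest and torchtest, a typical ML unit test asserts that one training step actually changes the parameters and does not increase the loss. A framework-agnostic sketch with a toy logistic-regression "model" (the model and step function are illustrative, not part of either tool's API):

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny logistic-regression "model": parameters w, b.
X = rng.normal(size=(64, 3))
y = (X[:, 0] > 0).astype(float)
w, b = np.zeros(3), 0.0

def loss(w, b):
    # Binary cross-entropy of the sigmoid model on (X, y).
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

def sgd_step(w, b, lr=0.1):
    # One full-batch gradient descent step.
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    return w - lr * grad_w, b - lr * grad_b

# The unit test: parameters must change and the loss must decrease.
before = loss(w, b)
w2, b2 = sgd_step(w, b)
assert not np.allclose(w2, w), "parameters did not update"
assert loss(w2, b2) < before, "training step did not reduce the loss"
print("ok")
```

Such checks catch common implementation bugs (frozen weights, wrong gradient sign) without needing any test oracle for the model's predictions.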
Overall, unlike traditional testing, the existing tool support in ML testing is immature. There is still enormous space for tool-support improvement for ML testing.
ML testing has experienced rapid progress. Nevertheless, ML testing is still at an early stage, with many challenges and open questions ahead.
Challenges in Test Input Generation. Although a series of test input generation techniques have been proposed (see more in Section 5.1), test input generation remains a big challenge because of the enormous behaviour space of ML models.
Search-based test generation (SBST)  uses meta-heuristic optimising search techniques, such as genetic algorithms, to automatically generate test inputs. It is one of the key automatic test generation techniques in traditional software testing. Besides generating test inputs for testing functional properties like program correctness, SBST has also been used to explore tensions in algorithmic fairness during requirements analysis [194, 268]. There are huge research opportunities in applying SBST to generate test inputs for testing ML systems, since SBST has the advantage of being able to search very large input spaces.
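The core SBST loop can be sketched as a simple genetic algorithm over candidate input vectors. The fitness function below is a hypothetical distance-to-decision-boundary measure; in practice any fitness (branch distance, model surprise, coverage gain) could be plugged in:

```python
import random

def search_test_inputs(fitness, dims=2, pop_size=20, generations=50, seed=42):
    """Minimal genetic algorithm: evolve input vectors that maximise `fitness`."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(dims)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)      # fitter individuals first
        survivors = pop[: pop_size // 2]          # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, dims) if dims > 1 else 0  # one-point crossover
            child = a[:cut] + b[cut:]
            i = rng.randrange(dims)               # Gaussian mutation of one gene
            child[i] += rng.gauss(0, 0.1)
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

# Hypothetical fitness: drive inputs towards a decision boundary x0 + x1 = 1.
best = search_test_inputs(lambda x: -abs(x[0] + x[1] - 1.0))
print(best)  # a point close to the boundary
```

For ML testing, the fitness function is where the research challenge lies: it must reward inputs that are both behaviour-revealing and realistic.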
The existing test input generation techniques focus mostly on generating adversarial inputs to test the robustness of an ML system. However, adversarial examples are often criticised for not representing real input data that could occur in practice. Thus, an interesting research direction is how to generate natural test inputs and how to automatically measure the naturalness of the generated inputs.
We have seen some related work that tries to generate test inputs that are as natural as possible in the autonomous driving scenario, such as DeepTest , DeepHunter , and DeepRoad , yet the generated images can still suffer from unnaturalness: sometimes even human beings cannot recognise the images generated by these tools. It is both interesting and challenging to explore whether test data that is meaningless to human beings should be regarded as valid in ML testing.
Challenges on Test Assessment Criteria. There has been a lot of work exploring how to assess the quality or adequacy of test data (see more in Section 5.3). However, there is still a lack of systematic evaluation of how the different assessment metrics correlate with each other, or how they correlate with the fault-revealing ability of tests, a question that has been thoroughly studied in traditional software testing .
Challenges in The Oracle Problem. The oracle problem remains a challenge in ML testing. Metamorphic relations are effective pseudo oracles but, in most cases, are proposed by human beings and may yield false positives. A big challenge is thus to automatically identify and construct reliable test oracles for ML testing.
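For example, one widely used metamorphic relation states that a learner's predictions should be unaffected by permuting the order of its training data. A sketch of such a pseudo oracle, with a simple nearest-centroid learner standing in for the model under test (the learner is illustrative):

```python
import numpy as np

def train_and_predict(train_X, train_y, test_X):
    """Stand-in learner: nearest-centroid classification."""
    classes = np.unique(train_y)
    centroids = np.array([train_X[train_y == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(test_X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
test_X = rng.normal(size=(20, 4))

# Metamorphic relation: predictions are invariant under a permutation
# of the training set -- a pseudo oracle needing no ground-truth labels.
perm = rng.permutation(len(X))
original = train_and_predict(X, y, test_X)
follow_up = train_and_predict(X[perm], y[perm], test_X)
assert np.array_equal(original, follow_up), "metamorphic relation violated"
print("MR holds")
```

Note that a relation like this must hold exactly for the learner above, but for learners sensitive to data order (e.g., SGD-trained networks) it holds only approximately, which is precisely where false positives creep in.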
Murphy et al.  observed that flaky tests are likely to arise in metamorphic testing whenever floating point calculations are involved. Flaky test detection is a very challenging problem in traditional software testing . It is even more challenging in ML testing because of the oracle problem.
Even without flaky tests, pseudo oracles may be inaccurate, leading to many false positives. There is a need to explore how to yield more accurate test oracles and how to reduce the false positives among the reported issues. We could even use ML algorithms to learn to detect false-positive oracles when testing ML algorithms .
Challenges in Testing Cost Reduction. In traditional software testing, cost is a major problem, yielding many cost reduction techniques such as test selection, test prioritisation, and test execution result prediction. In ML testing, the cost problem could be even more serious, especially when testing the ML component, because ML component testing usually requires retraining the model or repeating the prediction process, as well as generating data to explore the enormous model behaviour space.
A possible research direction for reducing cost is to represent an ML model in some kind of intermediate state that makes it easier to test.
We could also apply traditional cost reduction techniques such as test prioritisation or test minimisation to reduce the number of test cases while preserving test effectiveness.
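Test minimisation, for instance, maps directly onto greedy set cover over whatever coverage entities (neurons, branches, input regions) the tests exercise. A sketch, assuming per-test coverage sets are available (the suite and entity ids are illustrative):

```python
def minimise(suite):
    """Greedy set-cover test minimisation.

    `suite` maps test id -> set of covered entities (e.g. neurons).
    Repeatedly keep the test adding the most still-uncovered entities.
    """
    selected, covered = [], set()
    goal = set().union(*suite.values())
    while covered != goal:
        best = max(suite, key=lambda t: len(suite[t] - covered))
        if not suite[best] - covered:
            break  # no test adds coverage (defensive; unreachable here)
        selected.append(best)
        covered |= suite[best]
    return selected

suite = {
    "t1": {1, 2, 3},
    "t2": {3, 4},
    "t3": {1, 2, 3, 4},   # subsumes t1 and t2
    "t4": {5},
}
print(minimise(suite))  # ['t3', 't4'] achieves full coverage with 2 tests
```

The open question for ML testing is which coverage entities make a minimised suite retain its fault-revealing ability.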
More and more ML solutions are being deployed to diverse devices and platforms (e.g., mobile devices, IoT edge devices). Given the resource limitations of target devices, how to effectively test ML models on diverse devices, as well as how to test the deployment process, is also a challenge.
Many research opportunities remain in ML testing.
Testing More Application Scenarios. Much current research focuses on supervised learning, in particular the classification problem. More research is highly desirable on testing unsupervised learning and reinforcement learning.
The testing tasks mostly centre around image classification. There are also opportunities in many other areas, such as speech recognition and natural language processing.
Testing More ML Categories and Tasks. We observed a pronounced imbalance in the testing of different machine learning categories and tasks, as demonstrated by Table IV. There are both challenges and research opportunities in testing unsupervised learning and reinforcement learning systems.
Transfer learning, which focuses on storing knowledge gained while solving one problem and applying it to a different but related problem , has recently been gaining popularity. Transfer learning testing is therefore also important.
Testing More Properties. As Figure 10 shows, most work tests robustness and correctness, while fewer than 3% of the papers study efficiency, overfitting detection, or interpretability. We did not find any papers that systematically test data privacy violations.
In particular, for interpretability testing, the existing approaches still mainly rely on manual assessment, which checks whether human beings can understand the logic or predictive results of an ML model. It would be interesting to investigate the automatic assessment of interpretability and the detection of interpretability violations.
There has been discussion of whether machine learning testing and traditional software testing have different requirements in the assurance levels of different properties . It might also be interesting to explore which properties are the most important for machine learning systems, and thus deserve more research and testing effort.
Presenting More Testing Benchmarks. A huge number of datasets have been adopted in the existing ML testing papers. Nevertheless, as Tables V to VIII show, these datasets are usually those adopted for building machine learning systems. As far as we know, there are very few benchmarks like CleverHans (https://github.com/tensorflow/cleverhans) that are specially designed for ML testing research (in this case, adversarial example construction).
We hope that in the future more benchmarks specially designed for ML testing will be presented. For example, a repository of machine learning programs with real bugs could provide a good benchmark for bug-fixing techniques, like Defects4J (https://github.com/rjust/defects4j) in traditional software testing.
Testing More Types of Testing Activities. From the introduction in Section 5, as far as we know, requirements analysis for ML systems is still absent from ML testing. As demonstrated by Finkelstein et al. [194, 268], good requirements analysis may tackle many non-functional properties such as fairness.
The existing work is mostly about offline testing; online testing deserves more research effort as well. However, since researchers in academia may have limited access to online models and data, it may be necessary to cooperate with industry to conduct online testing.
According to the work of Amershi et al. , data testing is especially important and certainly deserves more research effort. Additionally, there are many opportunities for regression testing, bug report analysis, and bug triage in ML testing.
Because of the black-box nature of machine learning algorithms, ML testing results are often much more difficult, sometimes even impossible, for developers to understand than in traditional software testing. Visualisation of testing results might be particularly helpful in ML testing, assisting developers in understanding bugs and supporting bug localisation and repair.
Mutation Testing in Machine Learning Systems. There have been some studies discussing mutating machine learning code [122, 223], but no work has explored how to better design mutation operators for machine learning code so that the mutants better simulate real-world machine learning bugs. We believe this could be another research opportunity.
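One possible direction is data-level mutation operators that inject realistic faults into the training set, for example label noise; a mutant is "killed" when the retrained model behaves measurably differently. A hedged sketch (the operator and flip fraction are illustrative, not taken from any existing tool):

```python
import numpy as np

def label_error_mutant(y, fraction=0.05, seed=0):
    """Data-level mutation operator: flip `fraction` of binary labels."""
    rng = np.random.default_rng(seed)
    y_mut = y.copy()
    # Choose distinct label positions to corrupt.
    idx = rng.choice(len(y), size=max(1, int(fraction * len(y))), replace=False)
    y_mut[idx] = 1 - y_mut[idx]
    return y_mut

y = np.zeros(100, dtype=int)
y_mut = label_error_mutant(y)
print(int((y != y_mut).sum()))  # exactly 5 labels flipped
```

Designing such operators so that the injected faults match the distribution of real-world ML bugs is exactly the open problem noted above.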
We have provided a comprehensive overview and analysis of research work on ML testing. The survey presented the definitions and current research status of different ML testing properties, testing components, and testing workflows. It also summarised the datasets used for experiments and the available open-source testing tools and frameworks, and analysed research trends, directions, opportunities, and challenges in ML testing. We hope this survey helps software engineering and machine learning researchers become familiar with the research status of ML testing quickly and thoroughly, and orients more researchers towards the pressing problems of ML testing.
Before submitting, we sent the paper to those whom we cited, to check our comments for accuracy and omissions. This also provided one final stage in the systematic trawl of the literature for relevant work. Many thanks to those members of the community who kindly provided comments and feedback on an earlier draft of this paper.