DeepCover: Uncover the truth behind AI
Deep neural networks (DNNs) have a wide range of applications, and software employing them must be thoroughly tested, especially in safety-critical domains. However, traditional software testing methodology, including test coverage criteria and test case generation algorithms, cannot be applied directly to DNNs. This paper bridges this gap. First, inspired by the traditional MC/DC coverage criterion, we propose a set of four test criteria that are tailored to the distinct features of DNNs. Our novel criteria are incomparable and complement each other. Second, for each criterion, we give an algorithm for generating test cases based on linear programming (LP). The algorithms produce a new test case (i.e., an input to the DNN) by perturbing a given one. They encode the test requirement and a fragment of the DNN by fixing the activation pattern obtained from the given input example, and then minimize the difference between the new and the current inputs. Finally, we validate our method on a set of networks trained on the MNIST dataset. The utility of our method is shown experimentally with four objectives: (1) bug finding; (2) DNN safety statistics; (3) testing efficiency; and (4) DNN internal structure analysis.
Artificial intelligence, and specifically algorithms that implement deep neural networks (DNNs), can deliver human-level results in some specialist tasks such as full-information two-player games, image classification, and some areas of natural language processing. There is now a prospect of a wide-scale deployment of DNNs, including in safety-critical applications such as self-driving cars. This naturally raises the question of how software implementing this technology should be tested, validated and ultimately certified to meet the requirements of the relevant safety standards.
The software industry relies on testing as its primary means to provide stakeholders with information about the quality of the software product or service under test. Research in software engineering has resulted in a broad range of approaches to testing software ([5, 6, 7] give comprehensive reviews). In white-box testing, the structure of a program is exploited to (perhaps automatically) generate test cases according to test requirements. Code coverage criteria (or metrics) have been designed to guide the generation of test cases and to evaluate the completeness of a test suite. E.g., a test suite with 100% statement coverage exercises all instructions at least once. While it is arguable whether this ensures correct functionality, high coverage can increase users' confidence (or trust) in the program. Structural coverage metrics are used as a means of assessment in a number of high-tier safety standards.
Artificial intelligence systems are typically implemented in software. This includes AI that uses DNNs, either as a monolithic end-to-end system or as a system component. However, (white-box) testing for traditional software cannot be directly applied to DNNs, because the software that implements DNNs does not have a suitable structure. In particular, DNNs do not have a traditional flow of control, and thus it is not obvious how to define criteria such as branch coverage for them.
In this paper, we bridge this gap by proposing a novel (white-box) testing methodology for DNNs, including both test coverage criteria and test case generation algorithms. Technically, DNNs contain not only an architecture, which bears some similarity to traditional software programs, but also a large set of parameters, which are calculated by the training procedure. Any approach to testing DNNs needs to consider their distinct features, such as the syntactic connections between neurons in adjacent layers (neurons in a given layer interact with each other and then pass information to higher layers), the ReLU activation functions, and the semantic relationship between layers (e.g., neurons in deeper layers represent more complex features [9, 10]).
The contributions of this paper are three-fold. First, we propose four test criteria, inspired by the MC/DC test criterion from traditional software testing, that fit the distinct features of DNNs mentioned above. MC/DC was developed by NASA and has been widely adopted. It is used in avionics software development guidance to ensure adequate testing of applications with the highest criticality. Two coverage criteria for DNNs, neuron coverage and safety coverage, have been proposed recently. Our experiments show that neuron coverage is too coarse: 100% coverage can be achieved by a simple test suite comprised of a few input vectors from the training dataset. On the other hand, safety coverage is black-box and too fine, and it is computationally too expensive to compute a test suite in reasonable time. Moreover, our four proposed criteria are incomparable with each other, and complement each other in guiding the generation of test cases.
Second, we develop an automatic test case generation algorithm for each of our criteria. The algorithms produce a new test case (i.e., an input vector) by perturbing a given one using linear programming (LP). They encode the test requirement and a fragment of the DNN by fixing the activation pattern obtained from the given input vector, and then optimize over an objective that is to minimize the difference between the new and the current input vector. LP can be solved efficiently in practice, and thus, our test case generation algorithms can generate a test suite with low computational cost.
Finally, we implement our testing approaches in a software tool named DeepCover (available from https://github.com/theyoucheng/deepcover), and validate it by conducting experiments on a set of DNNs trained on the MNIST dataset. We observe that (1) the generated test suites are effective in detecting safety bugs (i.e., adversarial examples) in the DNNs; (2) our approach provides metrics, including coverage, adversarial ratio and adversarial quality, to quantify the safety/robustness of a DNN; (3) the testing is efficient; and (4) internal behaviors of DNNs can be investigated through the experiments. Overall, the method is able to handle networks of non-trivial size (10,000 hidden neurons). This compares very favourably with current approaches based on SMT, MILP and SAT, which can only handle networks with a few hundred hidden neurons.
A (feedforward and deep) neural network, or DNN, is a tuple N = (L, T, Phi), where each of its elements is defined as follows.
L = {L_k | k in {0, ..., K}} is a set of layers such that L_0 is the input layer, L_K is the output layer, and the layers other than the input and output layers are called hidden layers.
Each layer L_k consists of s_k nodes, which are also called neurons.
The i-th neuron of layer k is denoted by n_{k,i}.
T, a subset of L x L, is a set of connections between layers such that, except for the input and output layers, each layer has an incoming connection and an outgoing connection.
Phi = {phi_1, ..., phi_K} is a set of activation functions, one phi_k for each non-input layer.
We use v_{k,i} to denote the value of n_{k,i}.
Except for input neurons, every node is connected to the nodes in the preceding layer by pre-defined weights, such that for all k and i with 1 <= k <= K and 1 <= i <= s_k,

  u_{k,i} = sum_{1 <= j <= s_{k-1}} w_{k-1,j,i} * v_{k-1,j} + b_{k,i}    (1)

where w_{k-1,j,i} is the pre-trained weight for the connection between n_{k-1,j} (i.e., the j-th node of layer k-1) and n_{k,i} (i.e., the i-th node of layer k), and b_{k,i} is the so-called bias for node n_{k,i}. We note that this definition can express both fully-connected functions and convolutional functions.
Values of neurons in hidden layers need to pass through a Rectified Linear Unit (ReLU), so the final activation value of each neuron of a hidden layer is defined as

  v_{k,i} = ReLU(u_{k,i}) = max(u_{k,i}, 0)    (2)
ReLU is by far the most popular and effective activation function for neural networks.
Finally, for any input, the neural network assigns a label, that is, the index of the node of the output layer with the largest value:

  label = argmax_{1 <= i <= s_K} v_{K,i}    (3)

Let {1, ..., s_K} be the set of labels.
Figure 1 shows a simple neural network with four layers. Its input space is R^{s_0}, where R is the set of real numbers.
Owing to the use of the ReLU as in (2), the behavior of a neural network is highly non-linear. Given a neuron n_{k,i}, its ReLU is said to be activated iff its value u_{k,i} is strictly positive; otherwise, when u_{k,i} <= 0, the ReLU is deactivated.
Given one particular input x, we say that the neural network is instantiated, and we use N[x] to denote this instance of the network. Given a network instance N[x], the activation value of each node n_{k,i} of the network and the final classification label are fixed, and they are denoted as v_{k,i}[x] and label[x], respectively.
When the input of a neural network is given, the activation or deactivation of each ReLU operator in the network is also determined. We write u_{k,i}[x] for the value before applying the ReLU and v_{k,i}[x] for the value after applying the ReLU. Moreover, we write

  sign_{k,i}[x] = +1 if u_{k,i}[x] > 0, and sign_{k,i}[x] = -1 otherwise.    (4)

When there is no ReLU operator for a neuron, sign_{k,i}[x] simply returns the sign (positive or negative) of its activation value.
Let N be a network whose architecture is given in Figure 1. Assume that the weights for the first three layers are as given in the figure and that all biases are 0. For a given input x, the activation values and ReLU signs of all neurons then follow from Equations (1), (2) and (4).
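As an illustration of Equations (1)-(4), the following sketch computes a forward pass and the resulting sign pattern for a small fully-connected ReLU network. The weights here are randomly generated for exposition; they are assumptions of this sketch, not the weights of the network in Figure 1.

```python
import numpy as np

def forward(weights, biases, x):
    """Forward pass through a fully-connected ReLU network.

    weights[k] has shape (s_k, s_{k+1}); biases[k] has shape (s_{k+1},).
    Returns the per-layer pre-activation values u, the sign pattern
    (+1 if a ReLU is activated, -1 otherwise), and the output vector.
    """
    v, us, signs = x, [], []
    for k, (W, b) in enumerate(zip(weights, biases)):
        u = v @ W + b                        # weighted sum, Eq. (1)
        us.append(u)
        signs.append(np.where(u > 0, 1, -1))  # sign pattern, Eq. (4)
        last = (k == len(weights) - 1)
        v = u if last else np.maximum(u, 0)   # ReLU on hidden layers, Eq. (2)
    return us, signs, v

# A 4-layer toy network: 2 inputs, two hidden layers of 3 neurons, 2 outputs.
shapes = [(2, 3), (3, 3), (3, 2)]
rng = np.random.default_rng(0)
Ws = [rng.normal(size=s) for s in shapes]
bs = [np.zeros(s[1]) for s in shapes]
us, signs, out = forward(Ws, bs, np.array([1.0, -1.0]))
label = int(np.argmax(out))                  # Eq. (3): index of largest output
```

The sign pattern collected here is exactly the activation pattern that the LP-based test generation of Section 4 fixes.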
Let N be a set of neural networks, R the set of requirements, and T the set of test suites. A test adequacy criterion, or a test coverage metric, is a function M : N x R x T -> [0, 1]. Intuitively, M(N, R, T) quantifies the degree of adequacy to which a network N is tested by a test suite T with respect to a requirement R. Usually, the greater the number M(N, R, T), the more adequate the testing. Throughout this paper, we may use criterion and metric interchangeably. In this section, we develop a set of test requirements and their corresponding criteria. The generation of a test suite from a requirement is discussed in the next section.
Our new criteria for DNNs are inspired by established practices in software testing, in particular MC/DC test coverage, but are designed for the specific features of neural networks. Modified Condition/Decision Coverage (MC/DC) is a method of ensuring adequate testing for safety-critical software. At its core is the idea that if a choice can be made, all the possible factors (conditions) that contribute to that choice (decision) must be tested. For traditional software, both the conditions and the decision are usually Boolean variables or Boolean expressions. For example, the decision

  d = (a or b) and c    (5)

contains the three conditions a, b and c. The following six test cases provide 100% MC/DC coverage:
1. a=true,  b=true,  c=true
2. a=false, b=false, c=false
3. a=false, b=false, c=true
4. a=false, b=true,  c=true
5. a=false, b=true,  c=false
6. a=true,  b=false, c=true
The first two test cases already satisfy both condition coverage (i.e., all possibilities of the conditions are exploited) and decision coverage (i.e., all possibilities of the decision are exploited). The last four cases are needed because, for MC/DC, each condition should evaluate to true and false at least once and should also independently affect the decision outcome.
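The MC/DC independence requirement can be checked mechanically. The sketch below assumes the decision d = (a or b) and c together with the six test cases above, and verifies that, for each condition, some pair of test cases differs only in that condition while flipping the decision.

```python
from itertools import combinations

def d(a, b, c):
    """The example decision: (a or b) and c."""
    return (a or b) and c

tests = [
    (True,  True,  True),
    (False, False, False),
    (False, False, True),
    (False, True,  True),
    (False, True,  False),
    (True,  False, True),
]

def independently_affects(pos):
    """MC/DC independence: some pair of tests differs only in condition
    `pos` and flips the decision outcome."""
    return any(
        t1[pos] != t2[pos]
        and all(t1[q] == t2[q] for q in range(3) if q != pos)
        and d(*t1) != d(*t2)
        for t1, t2 in combinations(tests, 2)
    )

for pos, name in enumerate("abc"):
    print(name, "independently affects the decision:", independently_affects(pos))
```

For instance, test cases 3 and 6 differ only in a and flip d, while cases 4 and 5 do the same for c.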
Our instantiation of the concepts "decision" and "condition" for DNNs is inspired by the similarity between Equation (1) and Equation (5) and by the distinct features of DNNs. For example, it has been claimed that neurons in a deeper layer represent more complex features [9, 10]. Therefore, the information represented by a neuron in the next layer can be seen as a summary (implemented by the layer function, the weights, and the bias) of the information in the current layer. The core idea of our criteria is to ensure that not only the presence of a feature is tested but also the effects of less complex features on a more complex feature. More specifically, we consider every neuron n_{k+1,j} with 1 <= k < K as a decision and say that its conditions are the neurons in layer k, i.e., {n_{k,i} | 1 <= i <= s_k}.
Definition 2 (Neuron pair). A neuron pair alpha = (n_{k,i}, n_{k+1,j}) consists of two neurons in adjacent layers k and k+1 such that 1 <= k < K, 1 <= i <= s_k, and 1 <= j <= s_{k+1}. Given a network N, we write O(N) for the set of its neuron pairs.
Our new criteria are obtained by giving different ways of instantiating the changes of the conditions and the decision. Unlike for Boolean variables or expressions, where the change is trivial to define (i.e., true -> false or false -> true), for neurons there are many different ways of defining when a decision is affected by the changes of the conditions. Before giving definitions for "affected" in Section 3.3, we clarify what it means for a neuron to "change".
First, the change observed on a neuron's activation value can be either a sign change or a value change.
Definition 3 (Sign change). Given a neuron n_{k,i} and two test cases x_1 and x_2, we say that the sign change of n_{k,i} is exploited by x_1 and x_2, denoted as sc(n_{k,i}, x_1, x_2), if sign_{k,i}[x_1] != sign_{k,i}[x_2]. We write not sc(n_{k,i}, x_1, x_2) when the condition is not satisfied.
Definition 4 (Value change). Given a neuron n_{k,i} and two test cases x_1 and x_2, we say that the value change of n_{k,i} is exploited with respect to a value function g by x_1 and x_2, denoted as vc(g, n_{k,i}, x_1, x_2), if g(v_{k,i}[x_1], v_{k,i}[x_2]) = true and not sc(n_{k,i}, x_1, x_2). Moreover, we write not vc(g, n_{k,i}, x_1, x_2) when the condition is not satisfied.
The function g represents the change between the two values v_{k,i}[x_1] and v_{k,i}[x_2]. It can be instantiated as, e.g., |v_{k,i}[x_1] - v_{k,i}[x_2]| >= d (absolute change), or a change by at least a factor d (relative change), for some real number d.
Second, for the set of neurons in a layer k, we can quantify the degree of their changes in terms of distance.
Definition 5 (Distance change). Given the set P_k of neurons in layer k and two test cases x_1 and x_2, we say that the distance change of P_k is exploited with respect to a distance function h by x_1 and x_2, denoted as dc(h, P_k, x_1, x_2), if
h(v_k[x_1], v_k[x_2]) = true, and
for all n_{k,i} in P_k, not sc(n_{k,i}, x_1, x_2).
We write not dc(h, P_k, x_1, x_2) when any of the conditions is not satisfied.
The distance function h can be instantiated as, e.g., a norm-based distance ||v_k[x_1] - v_k[x_2]||_p <= d for a real number d and a distance measure ||.||_p, or a structural similarity distance, such as SSIM. The distance measure ||.||_p could be the L^1 norm (Manhattan distance), the L^2 norm (Euclidean distance), the L^infinity norm (Chebyshev distance), etc.
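One possible instantiation of the three change predicates is sketched below. Here g is taken as a relative change by a factor d and h as a bounded Chebyshev (L-infinity) distance; both are illustrative assumptions, not fixed definitions of the method.

```python
import numpy as np

def sc(v1, v2):
    """Sign change (Def. 3): the two activation values differ in sign."""
    return (v1 > 0) != (v2 > 0)

def vc(v1, v2, d=2.0):
    """Value change (Def. 4): no sign change, and (as one assumed
    instantiation of g) a relative change by at least a factor d."""
    big = abs(v1) >= d * abs(v2) or abs(v2) >= d * abs(v1)
    return not sc(v1, v2) and big

def dc(vec1, vec2, bound=0.5):
    """Distance change (Def. 5): no neuron in the layer changes sign and
    (as one assumed instantiation of h) the layer's activation vectors
    stay within Chebyshev distance `bound`."""
    v1, v2 = np.asarray(vec1), np.asarray(vec2)
    no_sign_change = not any(sc(a, c) for a, c in zip(v1, v2))
    return no_sign_change and float(np.max(np.abs(v1 - v2))) <= bound

print(sc(0.3, -0.1))                 # True: the sign flips
print(vc(4.0, 1.5))                  # True: same sign, >= 2x change
print(dc([0.2, 1.0], [0.4, 0.8]))    # True: signs kept, distance 0.2
```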
In this section, we present four covering methods for a given neuron pair and two test cases.
Definition 6 (SS Cover). A neuron pair alpha = (n_{k,i}, n_{k+1,j}) is SS-covered by two test cases x_1, x_2, denoted as SS(alpha, x_1, x_2), if the following conditions are satisfied by the network instances N[x_1] and N[x_2]:
sc(n_{k,i}, x_1, x_2);
not sc(n_{k,l}, x_1, x_2) for all 1 <= l <= s_k with l != i;
sc(n_{k+1,j}, x_1, x_2).
The SS Cover is designed to provide evidence that the change of a condition neuron n_{k,i}'s activation sign independently affects the sign of the decision neuron n_{k+1,j} in the next layer. Intuitively, the first condition says that the sign change of neuron n_{k,i} is exploited by x_1 and x_2. The second says that, except for n_{k,i}, the signs of all neurons at the k-th layer do not change between x_1 and x_2, and the third says that the sign change of neuron n_{k+1,j} is exploited by x_1 and x_2.
(The inputs in the accompanying example table are simple numbers for the purpose of exposition; the test generation algorithms that we introduce later can find much better test pairs for the required sign or value changes.)
The SS Cover is very close to MC/DC: instead of observing the change of a Boolean variable (i.e., true -> false or false -> true), we observe a sign change of a neuron. However, the behavior of a neural network has additional complexity, and a direct adoption of MC/DC-style coverage does not necessarily capture all the important information in the network. The following three additional coverage criteria are designed to complement the SS Cover.
First, the sign of the decision neuron n_{k+1,j} can be altered between two test cases even when none of the neurons in layer k changes its sign.
Definition 7 (DS Cover). Given a distance function h, a neuron pair alpha = (n_{k,i}, n_{k+1,j}) is DS-covered by two test cases x_1, x_2, denoted as DS(h, alpha, x_1, x_2), if the following conditions are satisfied by the network instances N[x_1] and N[x_2]:
dc(h, P_k, x_1, x_2);
sc(n_{k+1,j}, x_1, x_2).
Intuitively, the first condition describes the distance change of the neurons in layer k and the second condition requests the sign change of the neuron n_{k+1,j}. Note that, in dc(h, P_k, x_1, x_2), according to Definition 5, we have the condition h(v_k[x_1], v_k[x_2]) = true on the distance between the activation vectors of layer k. This is to ensure that the change in layer k is small (and therefore x_1 and x_2 are likely to have the same label): there might exist a completely different test case for which the sign change of n_{k+1,j} is satisfied, but such a pair of test cases is not meaningful for testing purposes. Note that, for two pairs (n_{k,i}, n_{k+1,j}) and (n_{k,l}, n_{k+1,j}) with the same second component n_{k+1,j}, the coverage of either of them implies the coverage of the other.
Until now, we have seen the sign change of a decision neuron as the equivalent of the change of a decision in MC/DC. This view may still be limited. For DNNs, a key safety problem related to their high non-linearity is that an insignificant (or imperceptible) change to the input (e.g., an image) may lead to a significant change to the output (e.g., its label). We expect our criteria to provide guidance to the test case generation algorithms for discovering unsafe cases by working with two adjacent layers, which are finer-grained than the input-output relation. We notice that a label change in the output layer is the direct result of changes to the activation values in the penultimate layer. Therefore, in addition to the sign change, a change of the value of the decision neuron is also significant.
Given a value function g, a neuron pair alpha = (n_{k,i}, n_{k+1,j}) is SV-covered by two test cases x_1, x_2, denoted as SV(g, alpha, x_1, x_2), if the following conditions are satisfied by the network instances N[x_1] and N[x_2]:
sc(n_{k,i}, x_1, x_2);
not sc(n_{k,l}, x_1, x_2) for all 1 <= l <= s_k with l != i;
vc(g, n_{k+1,j}, x_1, x_2).
The first and second conditions are the same as those in Definition 6. The difference is in the third condition, in which, instead of considering the sign change sc(n_{k+1,j}, x_1, x_2), we consider the value change vc(g, n_{k+1,j}, x_1, x_2) with respect to a value function g. Intuitively, the SV Cover observes a significant change of a decision neuron's value caused by independently modifying the sign of one of its condition neurons.
Similarly, we obtain the following definition by replacing the sign change of the decision in Definition 7 with the value change.
Given a distance function h and a value function g, a neuron pair alpha = (n_{k,i}, n_{k+1,j}) is DV-covered by two test cases x_1, x_2, denoted as DV(h, g, alpha, x_1, x_2), if the following conditions are satisfied by the network instances N[x_1] and N[x_2]:
dc(h, P_k, x_1, x_2);
vc(g, n_{k+1,j}, x_1, x_2).
Similar to the DS Cover, the DV cover of a pair (n_{k,i}, n_{k+1,j}) implies the DV cover of any other pair (n_{k,l}, n_{k+1,j}) with the same decision neuron. Intuitively, the DV Cover targets the scenario in which there is no sign change for a neuron pair, but the decision neuron's value changes significantly.
(Continuation of Example 2) For suitable instantiations of the distance function h and the value function g, neuron pairs are DV-covered by a pair of inputs, which can be verified by checking the two conditions of the definition.
Given a network N and a covering method F, a test requirement R_F is to find a test suite T such that T covers all neuron pairs in O(N) with respect to F, i.e., for all alpha in O(N), there exist x_1, x_2 in T such that F(alpha, x_1, x_2).
Due to the one-to-one correspondence between F and R_F, we may refer to the requirement simply by its covering method F.
Intuitively, a test requirement asks that all neuron pairs are covered by at least two test cases in T with respect to the covering method F. When the requirement cannot be satisfied completely by a given test suite T, we may want to compute the degree to which it is satisfied.
Given a network N and a covering method F, the test criterion M_F for a test suite T is as follows:

  M_F(N, T) = |{alpha in O(N) | exists x_1, x_2 in T : F(alpha, x_1, x_2)}| / |O(N)|

Due to the one-to-one correspondence between F and M_F, we may write M(N, R_F, T) as M_F(N, T). Intuitively, M_F computes the percentage of the neuron pairs that are covered by test cases in T with respect to the covering method F.
|                       | sign change (Def. 3) | distance change (Def. 5) |
| sign change (Def. 3)  | SS Cover             | DS Cover                 |
| value change (Def. 4) | SV Cover             | DV Cover                 |
(Columns: change of the conditions; rows: change of the decision.)
Finally, instantiating F with the covering methods in {SS, DS, SV, DV}, we obtain four test criteria M_SS, M_DS, M_SV, and M_DV. Table 2 summarizes the criteria with respect to their key ingredients, i.e., the changes to the conditions and the decision.
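The coverage criterion just defined can be computed by straightforward enumeration. In the sketch below, `covered` is a placeholder for any of the four covering methods; the toy covering method at the end is purely illustrative.

```python
from itertools import combinations

def coverage(neuron_pairs, test_suite, covered):
    """Fraction of neuron pairs alpha for which some pair of tests in the
    suite satisfies the covering method `covered(alpha, x1, x2)`
    (a stand-in for SS/DS/SV/DV)."""
    hit = sum(
        1 for alpha in neuron_pairs
        if any(covered(alpha, x1, x2) for x1, x2 in combinations(test_suite, 2))
    )
    return hit / len(neuron_pairs)

# Toy illustration with an artificial covering method: only the first two
# "pairs" can be covered, and only by tests far enough apart.
pairs = ["p0", "p1", "p2", "p3"]
suite = [0.1, 0.2, 0.9]
toy_cover = lambda alpha, x1, x2: alpha in ("p0", "p1") and abs(x1 - x2) > 0.5
print(coverage(pairs, suite, toy_cover))   # 0.5: two of four pairs covered
```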
For each test requirement, we introduce an automatic test case generation algorithm to compute the test suite T. As we explain later in Section 5.2, the generation of test suites for the criteria is non-trivial: a random testing approach will not achieve a test suite with high adequacy. In this paper, we consider approaches based on constraint solving; the algorithms are based on Linear Programming (LP). We remark that, as is usual in conventional software testing, our test requirements/criteria are broader than any particular test case generation algorithm, and there may exist alternative algorithms for a given requirement/criterion.
Given a network N and an input x, the activation pattern of N[x] is a function ap[x] mapping the set of hidden neurons to the signs {+1, -1}. We may write ap[x] as ap if x is clear from the context. For an activation pattern ap, we use ap_{k,i} to denote the sign of the i-th neuron at layer k.
The function represented by a DNN is highly non-linear and cannot, in general, be encoded with linear programming (LP). Therefore, other constraint solvers, such as SMT [17, 18, 19], MILP [20, 21, 19, 22] and SAT, have been considered. However, it is computationally hard, if not impossible, for such direct encodings to handle large networks, because their underlying constraint solving problems are intractable (at least NP-hard). In this paper, for the efficient generation of a test case, we consider (1) an LP-based approach that fixes the activation pattern according to a given input x, and (2) encoding a prefix of the network, instead of the entire network, with respect to a given neuron pair.
In the following, we explain the LP encoding of a DNN instance (Section 4.1), introduce a few operations on the given activation pattern (Section 4.2), discuss a safety requirement (Section 4.3), and then present the algorithms based on them (Section 4.4).
The variables used in the LP model are primed to distinguish them from concrete values. All variables are real-valued. Given an input x, the input variable x', whose value is to be synthesized with LP, is required to have an activation pattern identical to that of x, i.e., ap[x'] = ap[x].
We use u'_{k,i} and v'_{k,i} to denote the valuations of a neuron before and after the application of the ReLU, respectively. Then we have the following set of constraints:

  { u'_{k,i} > 0 and v'_{k,i} = u'_{k,i} | sign_{k,i}[x] = +1 } union { u'_{k,i} <= 0 and v'_{k,i} = 0 | sign_{k,i}[x] = -1 }    (8)

Moreover, the activation value of each neuron is determined by the activation values of the neurons in the prior layer. Therefore, we add the following set of constraints:

  u'_{k,i} = sum_{1 <= j <= s_{k-1}} w_{k-1,j,i} * v'_{k-1,j} + b_{k,i}    (9)

Please note that the resulting LP model, denoted C[x], represents a symbolic set of inputs that have an activation pattern identical to that of x. Further, we can specify an optimization objective obj and call an LP solver to find an optimal x' (if one exists).
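A minimal sketch of this fixed-pattern encoding, for a single fully-connected layer and using scipy's `linprog`: the activation pattern of a given input x is kept, optionally with the signs of selected neurons negated (mirroring the sign-negation operation used by the test generation algorithms), while the L-infinity distance to x is minimised. The names `pattern_lp` and the strict-inequality slack `eps` are our own assumptions; the paper's tool encodes a prefix of a deep network in the same spirit.

```python
import numpy as np
from scipy.optimize import linprog

def pattern_lp(W, b, x, flip=(), eps=1e-4):
    """Find x' with the same hidden activation pattern as x, except that
    the signs of the neurons in `flip` are negated, minimising the
    L-infinity distance ||x' - x||.  LP variables: x' (n entries) and t."""
    n, m = W.shape
    sign = (x @ W + b) > 0              # activation pattern of x
    for j in flip:
        sign[j] = not sign[j]           # negate the chosen neurons' signs
    A, ub = [], []
    for j in range(m):                  # either u_j >= eps or u_j <= 0
        if sign[j]:
            A.append(np.concatenate([-W[:, j], [0.0]])); ub.append(b[j] - eps)
        else:
            A.append(np.concatenate([W[:, j], [0.0]])); ub.append(-b[j])
    for i in range(n):                  # |x'_i - x_i| <= t
        e = np.zeros(n + 1); e[i], e[-1] = 1.0, -1.0
        A.append(e); ub.append(x[i])
        e = np.zeros(n + 1); e[i], e[-1] = -1.0, -1.0
        A.append(e); ub.append(-x[i])
    c = np.zeros(n + 1); c[-1] = 1.0    # objective: minimise the distance t
    res = linprog(c, A_ub=np.array(A), b_ub=np.array(ub),
                  bounds=[(None, None)] * n + [(0.0, None)])
    return res.x[:n] if res.success else None

# Flip the sign of hidden neuron 1 of a toy 2-neuron layer.
W, b, x = np.eye(2), np.zeros(2), np.array([1.0, -1.0])
x2 = pattern_lp(W, b, x, flip=[1])
```

Strict inequalities u > 0 are approximated by u >= eps, since LP solvers only accept non-strict constraints.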
We define a few operations on the set of constraints C[x]. Note that, for each neuron n_{k,i}, in Expression (8) we have either u'_{k,i} > 0 or u'_{k,i} <= 0, but not both. We write taken(n_{k,i}) to denote the expression that is taken, and nottaken(n_{k,i}) to denote the other expression, which is not taken. Let sigma(C[x], n_{k,i}) be the same set of constraints as C[x], except that the constraint taken(n_{k,i}) is substituted by nottaken(n_{k,i}); intuitively, sigma negates the activation sign of the neuron n_{k,i}.
We assume that the value function g and the distance function h are linearly encodable; without loss of generality, we still use g and h to denote their linear encodings, respectively. Further operations add such linear constraints to C[x]. For example, when requiring a neuron n_{k,i} to be activated with values within given bounds, we can add the corresponding interval constraints on u'_{k,i}. Furthermore, we can generalize the above notation and define sigma(C[x], S) for a set S of operations. It is obvious that the ordering of the operations does not matter as long as there does not exist a neuron whose value is affected by more than one operation in S.
In this section, we discuss a safety requirement that is independent of the test criteria. It is used to check automatically whether a given test case exposes a bug. The neural network represents a function f^, which approximates a function f that models the human perception capability in labeling input examples. Therefore, a straightforward safety requirement is to require that for all test cases x we have f^(x) = f(x). However, a direct use of such a requirement is not practical, because the number of inputs to be labeled by humans can be too large, if not infinite.
Given a finite set X of correctly labeled inputs, the safety requirement is to ensure that, for all inputs x' that are close to an input x in X, we have f^(x') = f^(x).
Ideally, the closeness between two inputs x_1 and x_2 would be measured with respect to the human perception capability. In practice, this has been approximated by various approaches, including norm-based distance measures. In this paper, the closeness of two inputs x_1 and x_2 is concretised as the L^infinity norm, i.e., ||x_1 - x_2||_infinity <= b for some bound b. Specifically, we define the predicate close(x_1, x_2) iff ||x_1 - x_2||_infinity <= b.
We remark that our algorithms can work with any definition of closeness as long as it can be encoded as a set of linear constraints. The test generation algorithms in this paper enable the computation of x_1 and x_2 such that ||x_1 - x_2||_infinity can be upper bounded by a small number. A pair of close inputs is called an adversarial example if only one of them is correctly labeled by the DNN. In our experiments, to exhibit the behavior of adversarial examples, we instantiate b with a large enough number, and study the percentage of adversarial examples with respect to the distance (as illustrated in Figure 2 for one of the criteria).
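The closeness predicate and the adversarial-example check can be sketched as follows; `predict` and `oracle` are assumed, illustrative callables standing in for the DNN and for the human (ground-truth) labelling.

```python
import numpy as np

def close(x1, x2, b):
    """Closeness predicate: L-infinity distance at most b."""
    return float(np.max(np.abs(np.asarray(x1) - np.asarray(x2)))) <= b

def adversarial_pair(x1, x2, b, predict, oracle):
    """A close pair of inputs is an adversarial example when exactly one
    of the two inputs is labelled correctly by the DNN."""
    if not close(x1, x2, b):
        return False
    return (predict(x1) == oracle(x1)) != (predict(x2) == oracle(x2))

# Toy illustration: the "network" thresholds the first component at 0.5,
# while the ground truth thresholds it at 0.55, so inputs in between are
# misclassified.
predict = lambda x: int(x[0] > 0.5)
oracle = lambda x: int(x[0] > 0.55)
print(adversarial_pair([0.60, 0.0], [0.52, 0.0], b=0.1,
                       predict=predict, oracle=oracle))  # True
```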
Our testing criteria defined in Section 3 require that the (sign or value) change of every decision neuron be supported by (sign or distance) changes of its condition neurons. Given a neuron pair and a testing criterion, we look for two activation patterns that together exhibit the changes required by the corresponding testing criterion. The inputs matching these patterns are then added to the final test suite.
First, we define a routine that calls an LP solver; it takes as input a set of constraints C and an optimization objective obj, and returns a valuation for the input variable x' if one exists. For instance, a solver call can be written as x' = LP(C[x], obj). If the call returns successfully, x' satisfies the linear constraints of C[x] with respect to the optimization objective obj.
The test suite generation framework is given in Algorithm 1: for every neuron pair alpha in O(N), it calls a procedure to find an input pair that satisfies the test requirement, according to the covering method F. We recall that O(N) is the set of neuron pairs of a DNN (see Definition 2).
To obtain an input pair of tests for each corresponding neuron pair, we assume the existence of a set X of correctly labeled inputs. An input in X serves as the reference activation pattern; by modifying this pattern, we obtain another activation pattern such that the two together can support the testing conditions specified by the covering method for a neuron pair (n_{k,i}, n_{k+1,j}).
Algorithm 2 tries to find a pair of inputs that satisfies the requirements of the SS Cover for a neuron pair: the sign change of n_{k,i} must independently change the sign of n_{k+1,j} as well. In particular, the LP call in Algorithm 2 comes with the constraints that a new activation pattern must share the partial encoding of the reference input's activation pattern up to layer k+1 (by Definition 13), but with the signs of the activations of n_{k,i} and n_{k+1,j} negated (by Definition 14). As a matter of fact, this encoding is even stronger than the requirement of the SS Cover, which does not constrain the neurons in the layers prior to layer k. The encoding in the LP call is a compromise for efficiency, as it is computationally unaffordable to consider combinations of even a subset of neuron activations. Consequently, the LP call may fail because the proposed activation pattern is infeasible, and this explains why we need to prepare a set X of inputs for potentially multiple LP calls.
The procedures that find input pairs subject to the other test requirements largely follow the same structure as in the SS Cover case, with the addition of distance change or value change constraints. Note that, when implementing test generation for the DS and DV criteria, there is no need to go through every neuron pair, as these metrics do not require the individual change of a particular condition neuron as long as the overall behavior of all condition neurons of the decision neuron falls into a certain distance change (and they do not exhibit a sign change).
In our implementation, every time the procedure is called we randomly select 40 correctly labeled inputs to construct the set X. The objective of minimizing ||x_1 - x_2||_infinity is used in all LP calls, to find good adversarial examples with respect to the safety requirement. Moreover, we use particular instantiations of the value function g, with different parameters for the SV and DV criteria. We admit that such choices are experimental. Meanwhile, for generality and to speed up the experiments, we leave the distance function h unspecified. Providing a specific h may require more LP calls to find a test pair (because h imposes an additional constraint), but the resulting pair can be better with respect to h.
For the two criteria M_SS and M_SV, we need to call the procedure |O(N)| times. We note that |O(N)| = sum_{1 <= k < K} s_k * s_{k+1}. Therefore, when the size of the network is large, the generation of a test suite may take a long time. To work around this, we may consider an alternative definition of O(N) that includes a pair (n_{k,i}, n_{k+1,j}) only when the weight w_{k,i,j} is one of the largest among the weights into the decision neuron n_{k+1,j}.