
Concolic Testing for Deep Neural Networks

by   Youcheng Sun, et al.

Concolic testing alternates between CONCrete program execution and symbOLIC analysis to explore the execution paths of a software program and to increase code coverage. In this paper, we develop the first concolic testing approach for Deep Neural Networks (DNNs). More specifically, we utilise quantified linear arithmetic over rationals to express test requirements that have been studied in the literature, and then develop a coherent method to perform concolic testing with the aim of better coverage. Our experimental results show the effectiveness of the concolic testing approach in both achieving high coverage and finding adversarial examples.



1 Introduction

Deep neural networks (DNNs) have been instrumental in solving a range of hard problems in AI, e.g., the ancient game of Go, image classification, and natural language processing. As a result, many potential applications are envisaged. However, major concerns have been raised about the suitability of this technique for safety- and security-critical systems, where faulty behaviour carries the risk of endangering human lives or financial damage. To address these concerns, a (safety or security) critical system comprising DNN-based components needs to be validated thoroughly.

The software industry relies on testing as a primary means to provide stakeholders with information about the quality of the software product or service under test [1]. So far, there have been only a few attempts to test DNNs systematically [2, 3, 4, 5, 6]. These are based either on concrete execution, e.g., Monte Carlo tree search [2] or gradient-based search [3, 4, 6], or on symbolic execution in combination with solvers for linear arithmetic [5]. Together with these test-input generation algorithms, several test coverage criteria have been presented, including neuron coverage [3], a criterion inspired by MC/DC [5], and criteria that capture particular neuron activation values to identify corner cases [6]. None of these approaches implement concolic testing [7, 8], which combines concrete execution and symbolic analysis to explore the execution paths of a program that are hard to cover by techniques such as random testing.

We hypothesise that concolic testing is particularly well-suited for DNNs. The input space of a DNN is usually high dimensional, which makes random testing difficult. For instance, a DNN for image classification takes tens of thousands of pixels as input. Moreover, owing to the widespread use of the ReLU activation function for hidden neurons, the number of “execution paths” in a DNN is simply too large to be completely covered by symbolic execution. Concolic testing can mitigate this complexity by directing the symbolic analysis to particular execution paths, through concretely evaluating given properties of the DNN.

In this paper, we present the first concolic testing method for DNNs. The method is parameterised using a set of coverage requirements, which we express using Quantified Linear Arithmetic over Rationals (QLAR). For a given set of coverage requirements, we incrementally generate a set of test inputs to improve coverage by alternating between concrete execution and symbolic analysis. Given an unsatisfied test requirement, we identify a test input within our current test suite that is close to satisfying the requirement, according to an evaluation based on concrete execution. After that, symbolic analysis is applied to obtain a new test input that satisfies the requirement. The new test input is then added to the test suite. This process is iterated until we reach a satisfactory level of coverage.

Finally, the generated test suite is passed to a robustness oracle, which determines whether the test suite includes adversarial examples [9], i.e., pairs of test cases that disagree on their classification labels when close to each other with respect to a given distance metric. The lack of robustness has been viewed as a major weakness of DNNs, and the discovery of adversarial examples and the robustness problem are studied actively in several domains, including machine learning, automated verification, cyber security, and software testing.

Overall, the main contributions of this paper are threefold:

  1. We develop the first concolic testing method for DNNs.

  2. We evaluate the method with a broad range of test coverage requirements, including Lipschitz continuity [10, 11, 2, 12, 13] and several structural coverage metrics [3, 5, 6]. We show experimentally that our new algorithm supports this broad range of properties in a coherent way.

  3. We implement the concolic testing method in the software tool DeepConcolic. Experimental results show that DeepConcolic achieves high coverage and that it is able to discover a significant number of adversarial examples.

2 Related Work

We briefly review existing efforts for assessing the robustness of DNNs and the state of the art in concolic testing.

2.1 Robustness of DNNs

Current work on the robustness of DNNs can be categorised as offensive or defensive. Offensive approaches focus on heuristic search algorithms (mainly guided by the forward gradient or cost gradient of the DNN) to find adversarial examples that are as close as possible to a correctly classified input. On the other hand, the goal of defensive work is to increase the robustness of DNNs. There is an arms race between offensive and defensive techniques.

In this paper we focus on defensive methods. A promising approach is automated verification, which aims to provide robustness guarantees for DNNs. The main relevant techniques include a layer-by-layer exhaustive search [14], methods that use constraint solvers [15], global optimisation approaches [13] and abstract interpretation [16, 17] to over-approximate a DNN’s behaviour. Exhaustive search suffers from the state-space explosion problem, which can be alleviated by Monte Carlo tree search [2]. Constraint-based approaches are limited to small DNNs with hundreds of neurons. Global optimisation improves over constraint-based approaches through its ability to work with large DNNs, but its capacity is sensitive to the number of input dimensions that need to be perturbed. The results of over-approximating analyses can be pessimistic because of false alarms.

The application of traditional testing techniques to DNNs is difficult, and work that attempts to do so is more recent, e.g., [2, 3, 4, 5, 6]. Methods inspired by software testing methodologies typically employ coverage criteria to guide the generation of test cases; the resulting test suite is then searched for adversarial examples by querying an oracle. The coverage criteria considered include neuron coverage [3], which resembles traditional statement coverage. A set of criteria inspired by MC/DC coverage [18] is used in [5]; Ma et al. [6] present criteria that are designed to capture particular values of neuron activations. Tian et al. [4] study the utility of neuron coverage for detecting adversarial examples in DNNs for the Udacity-Didi Self-Driving Car Challenge.

We now discuss algorithms for test input generation. Wicker et al. [2] aim to cover the input space by exhaustive mutation testing with theoretical guarantees, while in [3, 4, 6] gradient-based search algorithms are applied to solve optimisation problems, and Sun et al. [5] apply linear programming. None of these considers concolic testing or offers a general means for modeling test coverage requirements, as we do in this paper.

2.2 Concolic Testing

Concretely executing a program with particular inputs, which includes random testing, permits testing a large number of inputs at low cost. However, without guidance, the generated test cases may be restricted to a subset of the execution paths of the program, and the probability of exploring execution paths that contain bugs can be extremely low. In symbolic execution [19, 20, 21], an execution path is encoded symbolically. Modern constraint solvers can determine the feasibility of the encoding effectively, although performance still degrades as the size of the symbolic representation increases. Concolic testing [7, 8] is an effective approach to automated test input generation. It is a hybrid software testing technique that alternates between concrete execution, i.e., testing on particular inputs, and symbolic execution, a classical technique that treats program variables as symbolic values [22].

Concolic testing has been applied routinely in software testing, and a wide range of tools is available, e.g., [7, 8, 23]. It starts by executing the program with a concrete input. At the end of the concrete run, another execution path must be selected heuristically. This new execution path is then encoded symbolically and the resulting formula is solved by a constraint solver, to yield a new concrete input. The concrete execution and the symbolic analysis alternate until a desired level of structural coverage is reached.

The key factor that affects the performance of concolic testing is the heuristics used to select the next execution path. While there are simple approaches such as random search and depth-first search, more carefully designed heuristics can achieve better coverage [23, 24]. Automated generation of search heuristics for concolic testing is an active area of research [25, 26].

                    DeepConcolic        DeepXplore [3]      DeepTest [4]      DeepCover [5]       DeepGauge [6]
Coverage criteria   NC, SSC, NBC, etc.  NC                  NC                MC/DC               NBC, etc.
Test generation     concolic            dual-optimisation   greedy search     symbolic execution  gradient descent methods
DNN inputs          single              multiple            single            single              single
Image inputs        single/multiple     multiple            multiple          multiple            multiple
Distance metric     L∞- and L0-norm     L1-norm             Jaccard distance  L∞-norm             L∞-norm
Table 1: Comparison with different coverage-driven DNN testing methods

2.3 Comparison with Related Work

We briefly summarise the similarities and differences between our concolic testing method, named DeepConcolic, and other existing coverage-driven DNN testing methods: DeepXplore [3], DeepTest [4], DeepCover [5], and DeepGauge [6]. The details are presented in Table 1, where NC, SSC, and NBC are short for Neuron Coverage, SS Coverage, and Neuron Boundary Coverage, respectively. In addition to the concolic nature of DeepConcolic, we observe the following differences.

  • DeepConcolic is generic, and is able to take coverage requirements as input; the other methods are ad hoc, and are tailored to specific requirements.

  • DeepXplore requires a set of DNNs to explore multiple gradient directions. The other methods, including DeepConcolic, need a single DNN only.

  • In contrast to the other methods, DeepConcolic can achieve good coverage by starting from a single input; the other methods need a non-trivial set of inputs.

  • To date, there is no consensus on the best distance metric. DeepConcolic can be parameterised with the desired norm-based distance metric.

Moreover, DeepConcolic features a clean separation between the generation of test inputs and the test oracle. This is a good fit for traditional test case generation. The other methods use the oracle as part of their objectives to guide the generation of test inputs.

3 Deep Neural Networks

A (feedforward and deep) neural network, or DNN, is a tuple consisting of a set of layers, a set of connections between layers, and a set of activation functions. Each layer consists of a number of neurons; the i-th neuron of layer k is denoted by n[k][i], and we use v[k][i] to denote its value. Values of neurons in hidden layers need to pass through a Rectified Linear Unit (ReLU) [27]. For convenience, we explicitly denote the activation value before the ReLU as u[k][i], such that

    v[k][i] = ReLU(u[k][i]) = u[k][i] if u[k][i] > 0, and 0 otherwise.    (1)

ReLU is the most popular activation function for neural networks.

Except for input neurons, every neuron n[k][i] is connected to the neurons of the preceding layer by pre-defined weights, such that

    u[k][i] = Σ_j w[k-1][j][i] · v[k-1][j] + b[k][i]    (2)

where w[k-1][j][i] is the pre-trained weight for the connection between n[k-1][j] (i.e., the j-th neuron of layer k-1) and n[k][i] (i.e., the i-th neuron of layer k), and b[k][i] is the bias of n[k][i].

Finally, for any input, the neural network assigns a label: the index of the neuron of the output layer that has the largest value.

Due to the existence of ReLU, the neural network is a highly non-linear function. In this paper, we use a variable x to range over all possible inputs in the input domain and use t to denote a concrete input. Given a particular input t, we say that the DNN is instantiated, and we refer to the resulting instance of the network.

  • Given a network instance for an input t, the activation values of each neuron of the network before and after the ReLU are denoted u[k][i](t) and v[k][i](t), respectively, and the final classification label is denoted label(t). We write u[k](t) and v[k](t) for the vectors of activation values of all neurons in layer k.

  • When the input is given, the activation or deactivation of each ReLU operator in the DNN is determined.

We remark that, while for simplicity the definition focuses on DNNs with fully connected and convolutional layers, as shown in the experiments (Section 10) our method also applies to other popular layers, e.g., maxpooling, used in state-of-the-art DNNs.
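As a concrete illustration of the definitions above, the following sketch (the layer sizes, weights, and the helper names `forward` and `label` are our own, not from the paper) computes the pre- and post-ReLU activation values of Equations (1) and (2) and the classification label for a tiny fully connected network:

```python
# u[k][i] = sum_j w[k-1][j][i] * v[k-1][j] + b[k][i]   (pre-ReLU value, Eq. 2)
# v[k][i] = max(u[k][i], 0)                            (post-ReLU value, Eq. 1)

def forward(x, weights, biases):
    """Return per-layer pre-ReLU (us) and post-ReLU (vs) activation values."""
    us, vs = [], [list(x)]      # the input layer has no ReLU
    v = list(x)
    for k, (W, b) in enumerate(zip(weights, biases)):
        u = [sum(W[j][i] * v[j] for j in range(len(v))) + b[i]
             for i in range(len(b))]
        us.append(u)
        last = k == len(weights) - 1
        v = u if last else [max(ui, 0.0) for ui in u]  # no ReLU on output layer
        vs.append(v)
    return us, vs

def label(x, weights, biases):
    """Classification label: index of the output neuron with the largest value."""
    _, vs = forward(x, weights, biases)
    out = vs[-1]
    return max(range(len(out)), key=lambda i: out[i])

# A hypothetical 2-2-2 network, for illustration only.
W1 = [[1.0, -1.0], [0.5, 1.0]]   # W1[j][i]: weight from input j to hidden neuron i
W2 = [[1.0, 0.0], [0.0, 1.0]]
b1, b2 = [0.0, 0.0], [0.0, 0.0]
```

For example, `label([1.0, 0.0], [W1, W2], [b1, b2])` selects the output neuron with the largest value for that input.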

4 Test Coverage for DNNs

4.1 Activation Patterns

A software program has a set of concrete execution paths. Similarly, a DNN has a set of linear behaviours called activation patterns [5].

Definition 1 (Activation Pattern)

Given a network and an input t, the activation pattern is a function ap[t] that maps the set of hidden neurons to {true, false}. We write ap for ap[t] if t is clear from the context. For an activation pattern ap, the value ap(n[k][i]) denotes whether the ReLU operator of the neuron n[k][i] is activated or not. Formally,

    ap[t](n[k][i]) = true  if and only if  u[k][i](t) > 0.    (3)

Intuitively, ap[t](n[k][i]) = true if the ReLU of the neuron n[k][i] is activated, and ap[t](n[k][i]) = false otherwise.

Given a DNN instance for an input t, each ReLU operator’s behaviour (i.e., each ap[t](n[k][i])) is fixed, and this results in the particular activation pattern ap[t], which can be encoded using a Linear Programming (LP) model [5].

Computing a test suite that covers all activation patterns of a DNN is intractable owing to the large number of neurons in practically relevant DNNs. Therefore, we identify a subset of the activation patterns according to certain coverage criteria, and then generate test inputs that cover these activation patterns.
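Definition 1 can be sketched directly in code. The sketch below assumes the hidden layers' pre-ReLU values are already available; the function name and data layout are illustrative:

```python
def activation_pattern(hidden_pre_relu):
    """Map each hidden neuron (k, i) to True iff its ReLU is activated,
    i.e., its pre-ReLU value u[k][i] is positive (Definition 1)."""
    return {(k, i): u > 0
            for k, layer in enumerate(hidden_pre_relu)
            for i, u in enumerate(layer)}

def same_pattern(p1, p2):
    """Two instances are in the same linear region iff every ReLU behaves identically."""
    return p1 == p2
```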

4.2 Formalizing Test Coverage Criteria

We use a specific fragment of Quantified Linear Arithmetic over Rationals (QLAR) to express the coverage requirements on the test suite for a given DNN. This enables us to give a single test input generation algorithm (Section 8) for a variety of coverage criteria. We denote the set of formulas in our fragment by DR.

Definition 2

Given a network, we use one set of variables that range over all inputs of the network and a second set of variables that range over the rationals. We fix the following syntax for DR formulas:


where constants are rationals and variables range over inputs and rational values as above. We distinguish coverage requirements, Boolean formulas, and arithmetic formulas. We call the logic DR+ if the negation operator is not allowed. We also fix a set of coverage requirement formulas to be covered.

The existentially quantified formula expresses that there exists an input for which the Boolean formula is true, while the universally quantified formula expresses that it is true for all inputs. The two-input variants have a similar meaning, except that they quantify over a pair of inputs. The counting expression is true if the number of true Boolean expressions in the given set stands in the specified relation to the given bound. The other operators in Boolean and arithmetic formulas have their standard meaning.

Although the syntax does not include variables that specify an activation pattern, we may write


to require that two inputs have, respectively, the same and different activation behaviour on a neuron. These conditions can be expressed in the syntax above using the expression in Equation (3). Moreover, some norm-based distances between two inputs can be expressed using our syntax. For example, we can use the set of constraints


to constrain the Chebyshev distance (the L∞-norm) between two inputs x1 and x2, with one constraint per dimension of the input vector.
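The Chebyshev constraint set amounts to a per-dimension bound on the difference of the two inputs; a minimal sketch (the helper names are ours):

```python
def chebyshev(x1, x2):
    """L-infinity (Chebyshev) distance: the largest per-dimension difference."""
    return max(abs(a - b) for a, b in zip(x1, x2))

def within_ball(x1, x2, d):
    """The constraint set from the text: |x1[i] - x2[i]| <= d for every dimension i."""
    return all(abs(a - b) <= d for a, b in zip(x1, x2))
```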


We define the satisfaction of a coverage requirement by a test suite.

Definition 3

Given a set of test inputs and a coverage requirement, the satisfaction relation is defined as follows.

  • An existentially quantified requirement is satisfied if there exists some test t in the test suite such that the Boolean formula, with the occurrences of the input variable replaced by t, evaluates to true.

  • A requirement quantifying over two inputs is satisfied if there exist two tests in the test suite that jointly make the Boolean formula true.

The cases for universally quantified formulas are similar. For the evaluation of a Boolean expression over a given input, we have:

  • a comparison of arithmetic expressions is true if the comparison relation holds between their values;

  • a conjunction is true if both conjuncts are true;

  • a negation is true if the negated formula is not true;

  • a counting expression is true if the number of true expressions in its set stands in the specified relation to the bound.

For the evaluation of an arithmetic expression over an input,

  • the neuron activation terms u[k][i](x) and v[k][i](x) derive their values from the activation patterns of the DNN for the test, and scalar multiplication with a rational coefficient has the standard meaning;

  • addition, subtraction, and absolute value have the standard semantics.

Note that the test suite is finite. It is trivial to extend the definition of the satisfaction relation to an infinite subspace of inputs.


Given a network, a DR requirement formula, and a test suite, checking whether the suite satisfies the formula can be done in time polynomial in the size of the test suite. Determining whether there exists a test suite that satisfies the formula is NP-complete.

4.3 Test Coverage Metrics

Now we can define test coverage criteria by providing a set of requirements on the test suite. The coverage metric is defined in the standard way as the percentage of the test requirements that are satisfied by the test cases in the test suite.

Definition 4 (Coverage Metric)

Given a network, a set R of test coverage requirements expressed as DR formulas, and a test suite T, the test coverage metric is

    M(R, T) = |{r ∈ R : T ⊨ r}| / |R|
The coverage is used as a proxy metric for the confidence in the safety of the DNN under test.
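The coverage metric of Definition 4 is straightforward to compute once a satisfaction check is available; a minimal sketch (the helper names and the shape of `satisfies` are our own choices):

```python
def coverage(requirements, test_suite, satisfies):
    """Coverage metric of Definition 4: the fraction of requirements that the
    test suite satisfies. `satisfies(T, r)` plays the role of the check T |= r."""
    met = sum(1 for r in requirements if satisfies(test_suite, r))
    return met / len(requirements)
```

For example, with hypothetical threshold requirements "some test exceeds r", a suite of two tests may satisfy three out of four requirements, giving coverage 0.75.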

5 Specific Coverage Requirements

In this section, we give DR formulas for several important coverage criteria for DNNs, including Lipschitz continuity [10, 11, 2, 12, 13] and test coverage criteria from the literature [3, 5, 6]. The criteria we consider have syntactical similarity with structural test coverage criteria in conventional software testing. Lipschitz continuity is semantic, specific to DNNs, and has been shown to be closely related to the theoretical understanding of convolutional DNNs [10] and the robustness of both DNNs [2, 13] and Generative Adversarial Networks [11]. These criteria have been studied in the literature using a variety of formalisms and approaches.

Each test coverage criterion gives rise to a set of test coverage requirements. In the following, we discuss the three coverage criteria from [3, 5, 6], respectively. We write ‖t1 − t2‖ for the distance between two inputs t1 and t2 with respect to a given distance metric. The metric can be, e.g., a norm-based metric such as the L0-norm (the Hamming distance), the L2-norm (the Euclidean distance), or the L∞-norm (the Chebyshev distance), or a structural similarity distance, such as SSIM [28]. In the following, we fix a distance metric and simply write ‖t1 − t2‖. Section 10 elaborates on the particular metrics we use in our experiments.

We may consider requirements for a set of input subspaces. Given a positive real bound, we can generate a finite set of subspaces of the input domain such that every input lies within the bound of some subspace. The subspaces can be overlapping. Usually, every subspace can be represented by a box constraint, and can therefore be expressed with a Boolean expression as follows.
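For illustration, a one-dimensional sketch of such an overlapping box partition (the step size, interval, and helper name are our assumptions; real DNN inputs are high-dimensional):

```python
def box_subspaces(lo, hi, step):
    """Cover the interval [lo, hi] with overlapping boxes of width 2*step,
    so that every input is within `step` of the centre of some box."""
    centres = []
    c = lo
    while c <= hi:
        centres.append(c)
        c += step
    return [(c - step, c + step) for c in centres]
```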


5.1 Lipschitz Continuity

In [9, 13], Lipschitz continuity has been shown to hold for a large class of DNNs, including DNNs for image classification.

Definition 5 (Lipschitz Continuity)

A network is said to be Lipschitz continuous if there exists a real constant c > 0 such that, for all inputs x1 and x2:

    ‖N(x1) − N(x2)‖ ≤ c · ‖v[1](x1) − v[1](x2)‖

where N(x) denotes the output of the network on input x. Recall that v[1](x) denotes the vector of activation values of the neurons in the input layer. The value c is called the Lipschitz constant, and the smallest such c is called the best Lipschitz constant.

Since the computation of the best Lipschitz constant is an NP-hard problem and a smaller Lipschitz constant can significantly improve the performance of verification algorithms [2, 29, 13], it is interesting to determine whether a given value c is a Lipschitz constant, either for the entire input space or for some subspace. Testing for Lipschitz continuity can be guided using the following requirements.

Definition 6 (Lipschitz Coverage)

Given a real constant c > 0 and an integer m ≥ 1, the set of requirements for Lipschitz coverage is


where the m formulas range over given input subspaces.

Intuitively, for each subspace, this requirement expresses the existence of two inputs that refute that c is a Lipschitz constant for that subspace. It is typically impossible to obtain full Lipschitz coverage, because some requirements may be unsatisfiable. Thus, the goal for a test case generation algorithm is to produce a test suite that satisfies the criterion to the largest extent possible.
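Whether a pair of tests refutes a candidate Lipschitz constant can be checked directly from their concrete executions. A minimal sketch (scalar network outputs and the helper names are our simplification):

```python
def refutes_lipschitz(x1, x2, y1, y2, c, dist):
    """True iff the pair (x1, x2), with network outputs (y1, y2), witnesses
    that c is NOT a Lipschitz constant (Definition 6): the output distance
    exceeds c times the input distance under the metric `dist`."""
    return abs(y1 - y2) > c * dist(x1, x2)
```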

5.2 Neuron Coverage

Neuron Coverage (NC) [3] is an adaptation of statement coverage in conventional software testing to DNNs. It is defined as follows.

Definition 7

Neuron coverage for a DNN requires a test suite such that, for every hidden neuron, there exists a test case in the suite that activates that neuron.

This is formalised with one requirement per hidden neuron, each of which expresses that there is a test input that activates the neuron.

Definition 8 (NC Requirements)

The set of coverage requirements for Neuron Coverage is


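Measuring neuron coverage over a test suite reduces to checking, per neuron, whether any test activates it. A minimal sketch (the data layout, with one activation-pattern dictionary per test, is our assumption):

```python
def neuron_coverage(patterns):
    """Neuron coverage of a test suite: the fraction of hidden neurons activated
    by at least one test (Definition 7). `patterns` holds one activation pattern
    per test, each mapping neuron ids to booleans."""
    neurons, activated = set(), set()
    for ap in patterns:
        for n, active in ap.items():
            neurons.add(n)
            if active:
                activated.add(n)
    return len(activated) / len(neurons)
```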
5.3 Modified Condition/Decision (MC/DC) Coverage

In [5], a family of four test criteria is proposed, inspired by MC/DC coverage in conventional software testing. We restrict the discussion here to Sign-Sign Coverage (SSC). According to [5], each neuron can be seen as a decision for which the neurons in the previous layer are conditions that define its activation value, as in Equation (2). Adapting MC/DC to DNNs, we must show that all condition neurons can determine the outcome of the decision neuron independently. In the case of SSC coverage, we say that the value of a decision or condition neuron changes if the sign of its activation value changes. Consequently, the requirements for SSC coverage are defined as follows.

Definition 9 (SSC Requirements)

For SSC coverage, we first define a requirement for a pair of a condition neuron and a decision neuron in adjacent layers:


and we get


That is, for each pair of neurons in two adjacent layers, we need two inputs such that the sign change of the condition neuron independently affects the sign change of the decision neuron. The other neurons at the condition layer are required to maintain their signs between the two inputs, to ensure that the change is independent. The idea of SS Coverage (and of all other criteria in [5]) is that not only the existence of a feature needs to be tested, but also the effects of less complex features on a more complex feature.
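The SSC condition for a single condition/decision pair can be checked on the activation patterns of two tests. A minimal sketch (neuron ids as (layer, index) tuples are our convention):

```python
def ssc_covered(ap1, ap2, cond, dec, layer_k):
    """Sign-Sign Coverage check (Definition 9): between the two tests, the
    condition neuron `cond` and the decision neuron `dec` both change activation
    sign, while all other neurons in the condition layer `layer_k` keep theirs."""
    others_stable = all(ap1[n] == ap2[n]
                        for n in ap1 if n[0] == layer_k and n != cond)
    return ap1[cond] != ap2[cond] and ap1[dec] != ap2[dec] and others_stable
```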

5.4 Neuron Boundary Coverage

Neuron Boundary Coverage (NBC) [6] aims at covering neuron activation values that exceed a given bound. It can be formulated as follows.

Definition 10 (NBC Requirements)

Given a set of upper bounds and a set of lower bounds on the neuron activation values, the requirements are


where the bounds give, for each neuron, an upper and a lower bound on its activation value.
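The NBC requirement for a single neuron can be evaluated directly on the neuron's activation values across the test suite; a minimal sketch (helper name is ours):

```python
def nbc_satisfied(values, hi, lo):
    """Neuron Boundary Coverage for one neuron (Definition 10): some test drives
    its activation value above the upper bound `hi` or below the lower bound `lo`.
    `values` lists the neuron's activation value for each test in the suite."""
    return any(v > hi or v < lo for v in values)
```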

6 Overview of our Approach

This section gives an overview of our method for generating a test suite for a given DNN. Our method alternates between concrete evaluation of the activation patterns of the DNN and symbolic generation of new inputs. The pseudocode for our method is given as Algorithm 1. It is visualised in Figure 1.


Figure 1: Overview of our concolic testing method

Algorithm 1 takes as inputs a DNN, an input for the DNN, a heuristic, and a set of coverage requirements, and produces a test suite as output. The test suite initially contains only the given test input. The algorithm removes a requirement from the requirement set once it is satisfied by the test suite.



Input: a DNN, a seed input t0, a set R of coverage requirements, and a heuristic δ
Output: a test suite T

 1: T ← {t0} and F ← ∅
 2: // F collects the requirements for which test generation has failed
 3: while R \ F ≠ ∅ do
 4:   for each r ∈ R do
 5:     if T ⊨ r then R ← R \ {r}
 6:   while true do
 7:     (t, r) ← ranking(T, R, δ)
 8:     t′ ← symbolic_analysis(t, r)
 9:     if valid(t′) then
10:       T ← T ∪ {t′}
11:       break
12:     else if cost exceeded then
13:       F ← F ∪ {r}
14:       break
Algorithm 1 Concolic Testing for DNNs

The ranking function (Line 7), whose details are given in Section 7, looks for a pair of input and requirement that, according to our concrete evaluation, is the most promising candidate for a new test case that satisfies the requirement. (For some requirements, ranking may return two inputs; here, for simplicity, we describe the case of a single input. The generalisation to two inputs is straightforward.) The heuristic is a transformation function that maps a requirement formula to an optimisation problem. This step relies on concrete execution.

After obtaining the pair, symbolic input generation (Line 8), whose details are in Section 8, is applied to obtain a new concrete input. Then a validity check (Line 9), whose details are given in Section 9, is applied to determine whether the new input is valid. If so, the test is added to the test suite. Otherwise, ranking and symbolic input generation are repeated until a given computational cost is exceeded, after which test generation for the requirement is deemed to have failed. This is recorded in the set of failed requirements.

The algorithm terminates when either all test requirements have been satisfied or no further requirement can be satisfied. It then returns the current test suite.

Finally, as illustrated in Figure 1, the test suite generated by Algorithm 1 is passed to an oracle in order to evaluate the robustness of the DNN. The details of the oracle are given in Section 9.
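The loop of Algorithm 1 can be sketched as follows. This is a simplified skeleton under our own naming: requirements are modelled as predicates over a single test input, and the computational cost budget is a fixed attempt count.

```python
def concolic_test(seed, requirements, ranking, symbolic_analysis, valid,
                  max_attempts=10):
    """Skeleton of the concolic loop: alternate concrete ranking and symbolic
    input generation until every requirement is satisfied or marked as failed."""
    T = [seed]
    failed = set()

    def pending():
        return [r for r in requirements
                if r not in failed and not any(r(t) for t in T)]

    while pending():
        t, r = ranking(T, pending())           # concrete execution picks (t, r)
        for _ in range(max_attempts):
            t_new = symbolic_analysis(t, r)    # symbolic analysis proposes t'
            if valid(t_new):
                T.append(t_new)                # T := T ∪ {t'}
                break
        else:
            failed.add(r)                      # cost exceeded: give up on r
    return T
```

With a toy instantiation, e.g. numeric "inputs", threshold requirements, and a symbolic step that nudges the chosen input, the loop grows the suite until both requirements are witnessed.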

7 Ranking Coverage Requirements

This section presents our approach for Line 7 of Algorithm 1. Given the set of requirements that have not yet been satisfied, a heuristic, and the current set of test inputs, the goal is to select a concrete input together with a requirement, both of which will be used later by the symbolic approach (Section 8) to compute the next concrete input. The selection is done by means of a series of concrete executions.

The general idea is as follows. For every unsatisfied requirement, we transform the requirement into an optimisation problem, using arg-max and arg-min operators that are evaluated by concretely executing the tests in the current suite. As the set may contain more than one requirement, we return the pair of input and requirement such that


Note that, when evaluating existentially quantified formulas, if an input is returned, we may also need the value of the arithmetic objective; we record this value together with the returned input and the requirement formula.

The transformed formula consists of an optimisation objective together with a set of constraints. We give several examples in Section 7.1. In the following, we extend the semantics of Definition 3 to formulas with arg-max and arg-min operators. Intuitively, arg max (arg min, resp.) determines the input, among those satisfying the Boolean formula, that maximises (minimises) the value of the arithmetic formula. Formally,

  • the evaluation of an arg-max formula over a single input returns an input from the test suite that satisfies the Boolean formula and whose objective value is greater than or equal to that of every other input in the suite satisfying the formula;

  • the evaluation of an arg-max formula over two inputs returns two inputs from the test suite that satisfy the Boolean formula and jointly maximise the objective.

The cases for arg-min formulas are similar, replacing maximisation by minimisation. As in Definition 3, the semantics is given for a set of test cases, and we can adapt it to a continuous input subspace.

7.1 Heuristics

We present the heuristics we use for the coverage requirements discussed in Section 5. We remark that, since these are heuristics, alternatives exist; the following definitions work well in our experiments.

7.1.1 Lipschitz Continuity

When a Lipschitz requirement as in Equation (10) is not satisfied by the current test suite, we transform it into the following optimisation problem:


That is, the aim is to find the best pair of inputs in the test suite, i.e., the pair that comes closest to violating the Lipschitz condition. As described above, we also record the corresponding objective value.

7.1.2 Neuron Cover

When a neuron coverage requirement as in Equation (11) is not satisfied by the current test suite, we transform it into the following requirement:


We obtain the input that has the maximal (suitably scaled) activation value for the target neuron.

The coefficient in the objective is a per-layer constant. It is motivated by the following observation: as signals propagate through the DNN, the activation values at different layers can be of different orders of magnitude. For example, if the minimum activation values of neurons at two layers differ greatly in magnitude, then a neuron with a small negative activation value in one layer may still be closer to being activated than a neuron with a positive value in the other. Consequently, we define a layer factor for each layer that normalises the average activation values of neurons at different layers to the same order of magnitude. It is estimated by sampling a sufficiently large input dataset.
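A simple instantiation of this idea, under our own assumptions (the factor is the reciprocal of the mean absolute activation value per layer, estimated from sampled activations), might look as follows:

```python
def layer_factors(sampled_activations):
    """Estimate one normalisation factor per layer: the reciprocal of the mean
    absolute activation value, so layers become comparable in magnitude."""
    factors = []
    for layer_vals in sampled_activations:     # one list of sampled values per layer
        mean_abs = sum(abs(v) for v in layer_vals) / len(layer_vals)
        factors.append(1.0 / mean_abs if mean_abs else 1.0)
    return factors

def most_promising(neuron_values, factors):
    """Among not-yet-activated neurons (u <= 0), pick the one whose scaled
    pre-ReLU value is largest, i.e., closest to being activated."""
    best = None
    for (k, i), u in neuron_values.items():
        if u <= 0:
            scaled = u * factors[k]
            if best is None or scaled > best[1]:
                best = ((k, i), scaled)
    return best[0]
```

Note how the same raw value −1.0 at a large-magnitude layer ranks as closer to activation than at a small-magnitude layer.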

7.1.3 SS Coverage

In SS Coverage, given a decision neuron, the concrete evaluation aims to select one of its condition neurons in the preceding layer such that the test input generated later negates the signs of the decision and condition neurons while the remaining condition neurons preserve their respective signs. This is achieved by the following transformation:


Intuitively, given the decision neuron, Equation (18) selects the condition neuron that is closest to a change of its activation sign (i.e., the one with the smallest objective value).

7.1.4 Neuron Boundary Coverage

We transform the requirement in Equation (19) into the following optimisation problem when it is not satisfied by the current test suite; it selects the neuron that is closest to either its upper or its lower bound.


8 Symbolic Generation of New Concrete Inputs

This section presents our approach for Line 8 of Algorithm 1. That is, given a concrete input and a requirement, we need to find the next concrete input by symbolic analysis; the new input will then be added to the test suite (Line 10 of Algorithm 1). The symbolic analysis techniques we consider include the linear programming approach of [5], the global optimisation approach of [30], and a new optimisation algorithm introduced below. We regard optimisation algorithms as symbolic analysis methods because, similarly to constraint solving methods, they work with a set of test cases in a single run.

To simplify the presentation, the following description may, for each algorithm, focus on some specific coverage requirements, but we remark that all algorithms can work with all the requirements given in Section 5.

8.1 Symbolic Analysis using Linear Programming

As explained in Section 4, given an input t, the DNN instance maps t to an activation pattern that can be modeled using Linear Programming (LP). In particular, the following linear constraints [5] characterise a set of inputs that exhibit the same ReLU behaviour as t:

    u[k][i] = Σ_j w[k-1][j][i] · v[k-1][j] + b[k][i]    (20)

    for activated neurons (ap[t](n[k][i]) = true):   u[k][i] > 0 and v[k][i] = u[k][i];
    for deactivated neurons (ap[t](n[k][i]) = false): u[k][i] ≤ 0 and v[k][i] = 0.    (21)

The continuous variables u[k][i] and v[k][i] are the unknowns of the LP model.

  • The activation value of each neuron is encoded by the linear constraint in (20), which is a symbolic version of Equation (2) that calculates a neuron’s activation value.

  • Given a particular input, the activation pattern (Definition 1) is known: each ReLU is either activated or not. Following (3) and the definition of ReLU in (1), the linear constraints in (21) encode, for every neuron, ReLU activation or deactivation.

The linear model given by (20) and (21) represents the set of inputs that result in the same activation pattern as the one encoded. Consequently, the symbolic analysis for finding a new input from a pair of input and requirement is equivalent to finding a new activation pattern. Note that, to make sure the obtained test case is meaningful, an objective is added to the LP model that minimises the distance between the new and the original input. Thus, the use of LP requires that the distance metric is linear: this applies, for instance, to the L∞-norm in (6), but not to the L2-norm.
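To illustrate constraints (20) and (21) without invoking an LP solver, the following sketch builds them as a Python predicate over a candidate input; a real implementation would instead hand the same linear constraints, with the u and v values as continuous variables, to an LP solver (the function name and data layout are ours):

```python
def pattern_constraints(weights, biases, pattern):
    """Return a predicate checking whether an input x satisfies the constraints
    (20)-(21) for a fixed activation pattern: u is the weighted sum (20), and
    each neuron's sign and post-ReLU value match the pattern (21)."""
    def holds(x):
        v = list(x)
        for k, (W, b) in enumerate(zip(weights, biases)):
            u = [sum(W[j][i] * v[j] for j in range(len(v))) + b[i]
                 for i in range(len(b))]
            for i, ui in enumerate(u):
                if pattern[(k, i)] and not ui > 0:        # activated: u > 0
                    return False
                if not pattern[(k, i)] and not ui <= 0:   # deactivated: u <= 0
                    return False
            v = [ui if pattern[(k, i)] else 0.0 for i, ui in enumerate(u)]
        return True
    return holds
```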

8.1.1 Neuron Coverage

The symbolic analysis of neuron coverage takes the input test case and requirement on the activation of neuron , and returns a new test such that the test requirement is satisfied by the network instance . We have the activation pattern of the given , and can build up a new activation pattern such that


This activation pattern specifies the following conditions.

  • The activation sign of the targeted neuron is negated: this encodes the goal of activating it.

  • In the new activation pattern, the neurons in the layers before the targeted neuron preserve their activation signs. Though there may exist multiple activation patterns that activate the targeted neuron, one particular combination of activation signs must be pre-determined for LP modelling.

  • All other neurons are irrelevant, as the sign of the targeted neuron is only affected by the activation values of neurons in the preceding layers.

Finally, the new activation pattern defined in (22) is encoded as an LP model using (20) and (21); if a feasible solution exists, the new test input that satisfies the requirement can be extracted from it.
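As an illustration, for a hypothetical single-layer network the negated-sign requirement amounts to one extra linear constraint on top of the model of (20) and (21), and the LP returns the closest input (under a linear distance) that flips the neuron:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical single-layer network u = W x; neuron 0 is active at x0 and the
# requirement asks for an input that deactivates it.
W = np.array([[1.0, -1.0],
              [0.5,  2.0]])
x0 = np.array([1.0, 0.5])          # (W @ x0)[0] = 0.5 > 0: neuron 0 is active

c = np.array([0.0, 0.0, 1.0])      # variables (x_1, x_2, eps); minimise eps
A_ub, b_ub = [], []
for i in range(2):                 # |x_i - x0_i| <= eps
    row = np.zeros(3); row[i], row[2] = 1.0, -1.0
    A_ub.append(row.copy()); b_ub.append(x0[i])
    row[i] = -1.0
    A_ub.append(row); b_ub.append(-x0[i])
A_ub.append(np.array([W[0, 0], W[0, 1], 0.0]))   # negated sign: (W x)_0 <= 0
b_ub.append(0.0)

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None), (None, None), (0, None)])
x_new = res.x[:2]                  # closest input (distance res.x[2]) with the flip
```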

8.1.2 SS Coverage

To satisfy an SS Coverage requirement, we need to find a new test case such that, compared with the original input, the activation signs of the two neurons in the requirement pair are negated, while the signs of the other neurons in the given layer remain unchanged.

To achieve this, the following activation pattern is constructed.

8.1.3 Neuron Boundary Coverage

In case of the neuron boundary coverage, the symbolic analysis aims to find an input such that the activation value of the given neuron either exceeds its upper bound or falls below its lower bound.

To achieve this, while preserving the DNN activation pattern , we add one of the following constraints to the LP program.

  • If : ;

  • otherwise: .
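A sketch of this construction for a hypothetical single-layer network: the constraints preserving the activation pattern are kept, and one extra row pushes the chosen neuron above an assumed upper bound h (all numbers are made up):

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical numbers: keep the ReLU signs of x0 in u = W x, but additionally
# force neuron 0 above an assumed recorded upper bound h.
W = np.array([[1.0, -1.0],
              [0.5,  2.0]])
x0 = np.array([1.0, 0.5])
h = 2.0                            # assumed upper bound observed for neuron 0
pattern = (W @ x0) >= 0

c = np.array([0.0, 0.0, 1.0])      # variables (x_1, x_2, eps); minimise eps
A_ub, b_ub = [], []
for i in range(2):                 # |x_i - x0_i| <= eps
    row = np.zeros(3); row[i], row[2] = 1.0, -1.0
    A_ub.append(row.copy()); b_ub.append(x0[i])
    row[i] = -1.0
    A_ub.append(row); b_ub.append(-x0[i])
for j in range(2):                 # preserve the activation pattern of x0
    sgn = 1.0 if pattern[j] else -1.0
    A_ub.append(np.append(-sgn * W[j], 0.0)); b_ub.append(0.0)
A_ub.append(np.append(-W[0], 0.0))           # boundary constraint: (W x)_0 >= h
b_ub.append(-h)

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None), (None, None), (0, None)])
# res.x[:2] keeps x0's ReLU signs while pushing neuron 0 beyond h.
```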

8.2 Symbolic Analysis using Global Optimisation

The symbolic analysis for finding a new input can also be implemented by solving the global optimisation problem in [30]. That is, by specifying the test requirement as an optimisation objective, we apply global optimisation to compute a test case that satisfies the test coverage requirement.

  • For Neuron Coverage, the objective is to find an input such that the specified neuron becomes activated, i.e., its entry in the activation pattern becomes true.

  • In case of SS Coverage, given the neuron pair and the original input , the optimisation objective becomes

  • Regarding the Neuron Boundary Coverage, depending on whether the higher bound or lower bound for the activation of is considered, the objective of finding a new input is either or .

Readers are referred to [30] for the details of the algorithm.
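As a rough illustration of this idea (not the algorithm of [30] itself), a coverage requirement can be encoded as a penalty objective and handed to an off-the-shelf derivative-free optimiser; the network, the target neuron, and the penalty weight below are all hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

# Sketch under assumptions: the requirement "deactivate neuron 0" is turned
# into a penalty objective for a generic derivative-free optimiser.
W = np.array([[1.0, -1.0],
              [0.5,  2.0]])
x0 = np.array([1.0, 0.5])          # neuron 0 is active: (W @ x0)[0] = 0.5 > 0

def objective(x):
    # The penalty is zero exactly when neuron 0 is deactivated; the second
    # term keeps the candidate close to x0 (weight 0.1 chosen arbitrarily).
    return max((W @ x)[0], 0.0) + 0.1 * np.max(np.abs(x - x0))

res = minimize(objective, x0, method="Nelder-Mead")
# res.x is a candidate test input that (approximately) deactivates neuron 0.
```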

8.3 Lipschitz Test Case Generation

Given a coverage requirement as in Equation (10) for a subspace , we let be the representative point of the subspace to which and belong. The optimisation problem is to generate two inputs and such that


where and denote norm metrics such as the -norm, -norm or -norm, and is the radius of a norm ball (for the and -norm) or the size of a hypercube (for the -norm) centered on . The constant is a hyper-parameter of the algorithm.

The above problem can be efficiently solved by a novel alternating compass search scheme. Specifically, we alternate between solving the following two optimisation problems through relaxation [31], i.e., maximising a lower bound of the Lipschitz constant instead of directly maximising the Lipschitz constant itself. To do so, we reformulate the original non-linear proportional optimisation as a linear problem when both norm metrics are the -norm.

8.3.1 Stage One

We solve


The objective above enables the algorithm to search for an optimal in the space of a norm ball or hypercube centered on with radius , maximising the norm distance of and . The constraint implies that . Thus, a smaller yields a larger Lipschitz constant, considering that , i.e., is the lower bound of . Therefore, the search for a trace that minimises increases the Lipschitz constant.

To solve the problem above we use the compass search method [32], which is efficient, derivative-free, and guaranteed to provide first-order global convergence. Because we aim to find an input pair that refutes the given Lipschitz constant rather than the largest possible Lipschitz constant, in each iteration we check whether the test requirement already holds for the current pair. If it does, we have found an input pair that satisfies the test requirement; otherwise, we continue the compass search until convergence or until a satisfying input pair is generated. If Equation (24) is convergent and we can find an optimal as

but we still cannot find a satisfiable input pair, we perform the Stage Two optimisation.
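The following sketch conveys the flavour of Stage One on a hypothetical two-input network: a plain compass search over an L-infinity ball that stops early once the observed ratio refutes the claimed Lipschitz constant. This is an illustrative simplification, not the paper's exact formulation:

```python
import numpy as np

# Hypothetical toy network: a ReLU layer summed to a scalar output.
W = np.array([[3.0, -2.0],
              [1.0,  4.0]])
def f(x):
    return np.maximum(W @ x, 0.0).sum()

def compass_refute(x_center, radius, c, step=0.25, tol=1e-4):
    """Compass search inside the L_inf ball around x_center for an input x whose
    ratio |f(x) - f(x_center)| / ||x - x_center||_inf exceeds the claimed c."""
    def ratio(x):
        d = np.max(np.abs(x - x_center))
        return abs(f(x) - f(x_center)) / d if d > 0 else 0.0
    x = x_center + np.array([radius / 2, -radius / 2])   # arbitrary start point
    while step > tol:
        if ratio(x) > c:                    # requirement satisfied: stop early
            return x, ratio(x)
        best = x
        for d in np.vstack([np.eye(2), -np.eye(2)]) * step:
            cand = np.clip(x + d, x_center - radius, x_center + radius)
            if ratio(cand) > ratio(best):
                best = cand
        if ratio(best) > ratio(x):
            x = best                        # accept the improving compass move
        else:
            step /= 2                       # contract, as in compass search
    return x, ratio(x)

x_best, lip = compass_refute(np.array([1.0, 1.0]), radius=1.0, c=4.0)
# lip is a lower bound on the local Lipschitz constant; lip > 4 would refute c.
```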

8.3.2 Stage Two

We solve


Similarly, we use derivative-free compass search to solve the above problem and check at each optimisation iterate whether the test requirement holds. If it does, we return the input pair that satisfies the test requirement; otherwise, we continue the optimisation until convergence or until a satisfying input pair is generated. If Equation (25) is convergent and we still cannot find such an input pair, we modify the objective function again by substituting the new best input into Equation (25) and continue the search and satisfiability checking procedure.

8.3.3 Stage Three

If the search fails to make progress in Stage Two, we treat the whole search procedure as convergent: we have failed to find an input pair that refutes the given Lipschitz constant. In this case, we return the best input pair found so far, together with the largest Lipschitz constant observed. Note that the returned constant is smaller than the given one.

In summary, the proposed method is an alternating optimisation scheme based on compass search. Starting from the given input, we search a norm ball or hypercube for an input that satisfies the requirement, following an optimisation trajectory on the norm-ball space (this step is symbolic execution); if we cannot find one, we modify the optimisation objective function by replacing the centre with the best concrete input found in this optimisation run, initiating another optimisation trajectory. This process is repeated until we have gradually covered the entire space of the norm ball.

9 Test Oracle

We provide details about the validity checking performed for the generated test inputs (Line 9 of Algorithm 1) and how the test suite is finally used to quantify the safety of the DNN.

Definition 11 (Valid test input)

We are given a set of inputs for which we assume to have a correct classification (e.g., the training dataset). Given a real number , a test input is said to be valid if


Intuitively, a test case is valid if it is close to some of the inputs for which we have a classification. Given a test input , we write for the input that has the smallest distance to among all inputs in .

To quantify the quality of the DNN using a test suite , we use the following robustness criterion.

Definition 12 (Robustness Oracle)

Given a set of classified inputs, a test case passes the robustness oracle if


Whenever we identify a test input that fails to pass this oracle, it serves as evidence that the DNN lacks robustness.
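The validity check and the robustness oracle can be sketched as follows, with a hypothetical classifier and toy reference inputs standing in for the DNN and the training data:

```python
import numpy as np

# Sketch of Definitions 11 and 12 with a hypothetical classifier `predict` and
# two toy reference inputs with assumed correct classes.
X_ref = np.array([[0.0, 0.0],
                  [1.0, 1.0]])
y_ref = np.array([0, 1])
predict = lambda x: int(x.sum() > 1.0)   # stand-in for the DNN under test

def nearest(x):
    """Index of the reference input closest to x (L_inf distance)."""
    return int(np.argmin(np.max(np.abs(X_ref - x), axis=1)))

def is_valid(x, b):
    """Definition 11: x is valid if some reference input is within distance b."""
    return bool(np.max(np.abs(X_ref[nearest(x)] - x)) <= b)

def passes_oracle(x):
    """Definition 12: the DNN must assign x the class of its nearest reference."""
    return predict(x) == y_ref[nearest(x)]

t = np.array([0.9, 0.9])           # a generated test input, close to class 1
valid, robust = is_valid(t, b=0.2), passes_oracle(t)
```

A test input for which `valid` holds but `robust` does not would be evidence of non-robustness in the sense of Definition 12.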

10 Experimental Results

We have implemented the concolic testing approach presented in this paper in a tool named DeepConcolic; the implementation and all data in this section are available online. We compare it with other tools for testing DNNs. The experiments are run on a machine with a 24-core Intel(R) Xeon(R) E5-2620 v3 CPU at 2.4 GHz and 125 GB of memory, with a timeout of 12 h. All coverage results are averaged over 10 or more runs.

10.1 Comparison with DeepXplore

We now compare DeepConcolic and DeepXplore [3] on DNNs obtained from the MNIST and CIFAR-10 datasets. We remark that DeepXplore has been applied to further datasets.

For each tool, we start neuron coverage testing from a randomly sampled input image. Note that, since DeepXplore requires more than one DNN, we designate our trained DNN as the target model and use the two default models provided by DeepXplore as the others. Table 2 gives the neuron coverage obtained by the two tools. We observe that DeepConcolic yields much higher neuron coverage than DeepXplore in any of its three modes of operation (‘light’, ‘occlusion’, and ‘blackout’). On the other hand, DeepXplore is much faster and terminates within seconds.

              DeepConcolic            DeepXplore
           -norm      -norm       light    occlusion  blackout
MNIST      97.60%     95.91%      80.77%   82.68%     81.61%
CIFAR-10   84.98%     98.63%      77.56%   81.48%     83.25%

Table 2: Neuron coverage of DeepConcolic and DeepXplore
Figure 2: Adversarial images