Deep neural networks have achieved impressive experimental results in image classification, matching the cognitive ability of humans in complex tasks with thousands of classes. Many applications are envisaged, including their use as perception modules and end-to-end controllers for self-driving cars. Let $f$ be a function that assigns to each image a label from $C$, where $C$ is a (finite) set of class labels, and suppose that $f$ models the human perception capability; then a neural network classifier is a function $\hat{f}$ which approximates $f$ from training examples. For example, a perception module of a self-driving car may input an image from a camera and must correctly classify the type of object in its view, irrespective of aspects such as the angle of its vision and image imperfections. Therefore, though they clearly include imperfections, all four pairs of images in Figure 1 should arguably be classified as automobiles, since they appear so to a human eye.
Classifiers employed in vision tasks are typically multi-layer networks, which propagate the input image through a series of linear and non-linear operators. They are high-dimensional, often with millions of dimensions, non-linear and potentially discontinuous: even a small network, such as that trained to classify hand-written images of digits 0-9, has over 60,000 real-valued parameters and 21,632 neurons (dimensions) in its first layer. At the same time, the networks are trained on a finite data set and expected to generalise to previously unseen images. To increase the probability of correctly classifying such an image, regularisation techniques such as dropout are typically used, which improve the smoothness of the classifiers, in the sense that images that are close (within a small distance $\epsilon$) to a training point are assigned the same class label.
Unfortunately, it has been observed in [13, 36] that deep neural networks, including highly trained and smooth networks optimised for vision tasks, are unstable with respect to so-called adversarial perturbations. Such adversarial perturbations are (minimal) changes to the input image, often imperceptible to the human eye, that cause the network to misclassify the image. Examples include not only artificially generated random perturbations, but also (more worryingly) modifications of camera images that correspond to resizing, cropping or change in lighting conditions. They can be devised without access to the training set and are transferable, in the sense that an example misclassified by one network is also misclassified by a network with a different architecture, even if it is trained on different data. Figure 1 gives adversarial perturbations of automobile images that are misclassified as a bird, frog, airplane or horse by a highly trained state-of-the-art network. This obviously raises potential safety concerns for applications such as autonomous driving and calls for automated verification techniques that can verify the correctness of their decisions.
Safety of AI systems is receiving increasing attention, in view of their potential to cause harm in safety-critical situations such as autonomous driving. Typically, decision making in such systems is either solely based on machine learning, through end-to-end controllers, or involves some combination of logic-based reasoning and machine learning components, where an image classifier produces a classification, say speed limit or a stop sign, that serves as input to a controller. A recent trend towards “explainable AI” has led to approaches that learn not only how to assign the classification labels, but also additional explanations of the model, which can take the form of a justification explanation (why this decision has been reached, for example identifying the features that supported the decision) [17, 31]. In all these cases, the safety of a decision can be reduced to ensuring the correct behaviour of a machine learning component. However, safety assurance and verification methodologies for machine learning are little studied.
The main difficulty with image classification tasks, which play a critical role in perception modules of autonomous driving controllers, is that they do not have a formal specification in the usual sense: ideally, the performance of a classifier should match the perception ability and class labels assigned by a human. Traditionally, the correctness of a neural network classifier is expressed in terms of risk, defined as the probability of misclassification of a given image, weighted with respect to the input distribution of images. Similar (statistical) robustness properties of deep neural network classifiers, which compute the average minimum distance to a misclassification and are independent of the data point, have been studied and can be estimated using tools such as DeepFool and cleverhans. However, we are interested in the safety of an individual decision, and to this end focus on the key property of the classifier being invariant to perturbations at a given point. This notion is also known as pointwise robustness [18, 12] or local adversarial robustness.
Contributions. In this paper we propose a general framework for automated verification of safety of classification decisions made by feed-forward deep neural networks. Although we work concretely with image classifiers, the techniques can be generalised to other settings. For a given image $x$ (a point in a vector space), we assume that there is a (possibly infinite) region $\eta$ around that point that incontrovertibly supports the decision, in the sense that all points in this region must have the same class. This region is specified by the user and can be given as a small diameter, or the set of all points whose salient features are of the same type. We next assume that there is a family of operations $\Delta$, which we call manipulations, that specify modifications to the image under which the classification decision should remain invariant in the region $\eta$. Such manipulations can represent, for example, camera imprecisions, change of camera angle, or replacement of a feature. We define a network decision to be safe for input $x$ and region $\eta$ with respect to the set of manipulations $\Delta$ if applying the manipulations on $x$ will not result in a class change for $\eta$. We employ discretisation to enable a finite exhaustive search of the high-dimensional region for adversarial misclassifications. The discretisation approach is justified in the case of image classifiers since they are typically represented as vectors of discrete pixels (vectors of 8 bit RGB colours). To achieve scalability, we propagate the analysis layer by layer, mapping the region and manipulations to the deeper layers. We show that this propagation is sound, and is complete under the additional assumption of minimality of manipulations, which holds in discretised settings. In contrast to existing approaches [36, 28], our framework can guarantee that a misclassification is found if it exists.
Since we reduce verification to a search for adversarial examples, we can achieve safety verification (if no misclassifications are found for all layers) or falsification (in which case the adversarial examples can be used to fine-tune the network or shown to a human tester).
We implement the techniques using Z3 in a tool called DLV (Deep Learning Verification) and evaluate them on state-of-the-art networks, including regularised and deep learning networks. This includes image classification networks trained for classifying hand-written images of digits 0-9 (MNIST), 10 classes of small colour images (CIFAR10), 43 classes of the German Traffic Sign Recognition Benchmark (GTSRB) and 1000 classes of colour images used for the well-known ImageNet large-scale visual recognition challenge (ILSVRC). We also perform a comparison of the DLV falsification functionality on the MNIST dataset against two existing methods, focusing on the search strategies and statistical robustness estimation. The perturbed images in Figure 1 are found automatically using our tool for the network trained on the CIFAR10 dataset.
This invited paper is an extended and improved version of , where an extended version including appendices can also be found.
2 Background on Neural Networks
We consider feed-forward multi-layer neural networks, henceforth abbreviated as neural networks. Perceptrons (neurons) in a neural network are arranged in disjoint layers, with each perceptron in one layer connected to the next layer, but no connection between perceptrons in the same layer. Each layer $L_k$ of a network is associated with an $n_k$-dimensional vector space $D_{L_k} \subseteq \mathbb{R}^{n_k}$, in which each dimension corresponds to a perceptron. We write $P_k$ for the set of perceptrons in layer $L_k$, and $n_k = |P_k|$ is the number of perceptrons (dimensions) in layer $L_k$.
Formally, a (feed-forward and deep) neural network $N$ is a tuple $(L, T, \Phi)$, where $L = \{L_k \mid k \in \{0, \dots, n\}\}$ is a set of layers such that layer $L_0$ is the input layer and $L_n$ is the output layer, $T \subseteq L \times L$ is a set of sequential connections between layers such that, except for the input and output layers, each layer has an incoming connection and an outgoing connection, and $\Phi = \{\phi_k \mid k \in \{1, \dots, n\}\}$ is a set of activation functions $\phi_k : D_{L_{k-1}} \to D_{L_k}$, one for each non-input layer. Layers other than input and output layers are called the hidden layers.
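The network is fed an input $x$ (point in $D_{L_0}$) through its input layer, which is then propagated through the layers by successive application of the activation functions. An activation for point $x$ in layer $k$ is the value of the corresponding function, denoted $\alpha_{x,k} = \phi_k(\phi_{k-1}(\dots \phi_1(x))) \in D_{L_k}$, where $\alpha_{x,0} = x$. For perceptron $p \in P_k$ we write $\alpha_{x,k}(p)$ for the value of its activation on input $x$. For every activation $\alpha_{x,k}$ and layer $k' < k$, we define $Pre_{k'}(\alpha_{x,k}) = \{\alpha_{y,k'} \in D_{L_{k'}} \mid \alpha_{y,k} = \alpha_{x,k}\}$ to be the set of activations in layer $k'$ whose corresponding activation in layer $L_k$ is $\alpha_{x,k}$. The classification decision is made based on the activations in the output layer by, e.g., assigning to $x$ the class $\arg\max_{p \in P_n} \alpha_{x,n}(p)$. For simplicity, we use $\alpha_{x,n}$ to denote the class assigned to input $x$, and thus $\alpha_{x,n} = \alpha_{y,n}$ expresses that two inputs $x$ and $y$ have the same class.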
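The propagation of an input through the layers, and the class assignment from the output layer, can be illustrated with a minimal sketch. The helper names `dense_relu`, `activations` and `classify`, the ReLU layers and the toy weights are assumptions made for illustration; they are not taken from the networks studied in this paper.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def dense_relu(weights: Sequence[Sequence[float]], biases: Sequence[float]) -> Layer:
    """Build one activation function computing ReLU(W a + b) (illustrative choice)."""
    def phi(a: Vector) -> Vector:
        return [max(0.0, sum(w * v for w, v in zip(row, a)) + b)
                for row, b in zip(weights, biases)]
    return phi


def activations(layers: Sequence[Layer], x: Vector) -> List[Vector]:
    """Return the activations of x in every layer; the first entry is x itself."""
    acts = [list(x)]
    for phi in layers:
        acts.append(phi(acts[-1]))
    return acts


def classify(layers: Sequence[Layer], x: Vector) -> int:
    """Assign to x the index of the largest output-layer activation."""
    output = activations(layers, x)[-1]
    return max(range(len(output)), key=lambda p: output[p])


# Toy two-layer network; the weights are hypothetical, not trained.
toy_network = [
    dense_relu([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),
    dense_relu([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
]
```

Two inputs then have the same class exactly when `classify` returns the same index for both.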
The neural network classifier represents a function $\hat{f}$ which approximates $f$, a function that models the human perception capability in labelling images with labels from $C$, from training examples. Image classification networks, for example convolutional networks, may contain many layers, which can be non-linear, and work in high dimensions, which for the image classification problems can be of the order of millions. Digital images are represented as 3D tensors of pixels (width, height and depth, the latter to represent colour), where each pixel is a discrete value in the range 0..255. The training process determines real values for weights used as filters that are convolved with the activation functions. Since it is difficult to approximate $f$ with few samples in the sparsely populated high-dimensional space, to increase the probability of classifying correctly a previously unseen image, various regularisation techniques such as dropout are employed. They improve the smoothness of the classifier, in the sense that points that are $\epsilon$-close to a training point (potentially infinitely many of them) classify the same.
In this paper, we work with the code of the network and its trained weights.
3 Safety Analysis of Classification Decisions
In this section we define our notion of safety of classification decisions for a neural network, based on the concept of a manipulation of an image, essentially perturbations that a human observer would classify the same as the original image. Safety is defined for an individual classification decision and is parameterised by the class of manipulations and a neighbouring region around a given image. To ensure finiteness of the search of the region for adversarial misclassifications, we introduce so-called “ladders”, nondeterministically branching and iterated application of successive manipulations, and state the conditions under which the search is exhaustive.
Safety and Robustness
Our method assumes the existence of a (possibly infinite) region $\eta$ around a data point (image) $x$ such that all points in the region are indistinguishable by a human, and therefore have the same true class. This region is understood as supporting the classification decision and can usually be inferred from the type of the classification problem. For simplicity, we identify such a region via its diameter $d$ with respect to some user-specified norm, which intuitively measures the closeness to the point $x$. As in existing definitions of pointwise robustness, a network $\hat{f}$ approximating human capability $f$ is said to be not robust at $x$ if there exists a point $y$ in the region $\eta = \{z \in D_{L_0} \mid \|z - x\| \le d\}$ of the input layer such that $\hat{f}(x) \ne \hat{f}(y)$. The point $y$, at a minimal distance from $x$, is known as an adversarial example. Our definition of safety for a classification decision (abbreviated safety at a point) follows the same intuition, except that we work layer by layer, and therefore will identify such a region $\eta_k$, a subspace of $D_{L_k}$, at each layer $L_k$, for $k \in \{0, \dots, n\}$, and successively refine the regions through the deeper layers. We justify this choice based on the observation [11, 23, 24] that deep neural networks are thought to compute progressively more powerful invariants as the depth increases. In other words, they gradually transform images into a representation in which the classes are separable by a linear classifier.
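For each activation $\alpha_{x,k}$ of point $x$ in layer $L_k$, the region $\eta_k(\alpha_{x,k})$ contains activations that the human observer believes to be so close to $\alpha_{x,k}$ that they should be classified the same as $x$.
Intuitively, safety for network $N$ at a point $x$ means that the classification decision is robust at $x$ against perturbations within the region $\eta_k(\alpha_{x,k})$. Note that, while the perturbation is applied in layer $L_k$, the classification decision is based on the activation in the output layer $L_n$.
Definition 1 [General Safety] Let $\eta_k(\alpha_{x,k})$ be a region in layer $L_k$ of a neural network $N$ such that $\alpha_{x,k} \in \eta_k(\alpha_{x,k})$. We say that $N$ is safe for input $x$ and region $\eta_k(\alpha_{x,k})$, written as $N, \eta_k \models x$, if for all activations $\alpha_{y,k}$ in $\eta_k(\alpha_{x,k})$ we have $\alpha_{y,n} = \alpha_{x,n}$.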
A key concept of our framework is the notion of a manipulation, an operator that intuitively models image perturbations, for example bad angles, scratches or weather conditions, the idea being that the classification decisions in a region of images close to it should be invariant under such manipulations. The choice of the type of manipulation is dependent on the application and user-defined, reflecting knowledge of the classification problem to model perturbations that should or should not be allowed. Judicious choice of families of such manipulations and appropriate distance metrics is particularly important. For simplicity, we work with operators $\delta_k : D_{L_k} \to D_{L_k}$ over the activations in the vector space of layer $k$, and consider the Euclidean ($L^2$) and Manhattan ($L^1$) norms to measure the distance between an image and its perturbation through $\delta_k$, but the techniques generalise to other norms discussed in [18, 19, 12]. More specifically, applying a manipulation $\delta_k$ to an activation $\alpha_{x,k}$ will result in another activation $\delta_k(\alpha_{x,k})$ such that the values of some or all dimensions are changed. We therefore represent a manipulation as a hyper-rectangle, defined for two activations $\alpha_{x,k}$ and $\alpha_{y,k}$ of layer $L_k$ by
$$rec(\alpha_{x,k}, \alpha_{y,k}) = \times_{p \in P_k} [\min(\alpha_{x,k}(p), \alpha_{y,k}(p)),\ \max(\alpha_{x,k}(p), \alpha_{y,k}(p))].$$
The main challenge for verification is the fact that the region $\eta_k$ contains potentially an uncountable number of activations. Our approach relies on discretisation in order to enable a finite exploration of the region to discover and/or rule out adversarial perturbations.
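For an activation $\alpha_{x,k}$ and a set $V$ of manipulations, we denote by $rec(V, \alpha_{x,k})$ the polyhedron which includes all hyper-rectangles that result from applying some manipulation in $V$ on $\alpha_{x,k}$, i.e., $rec(V, \alpha_{x,k}) = \bigcup_{\delta \in V} rec(\alpha_{x,k}, \delta(\alpha_{x,k}))$. Let $\Delta_k$ be the set of all possible manipulations for layer $L_k$. To ensure region coverage, we define valid manipulation as follows.
Definition 2 Given an activation $\alpha_{x,k}$, a set of manipulations $V(\alpha_{x,k}) \subseteq \Delta_k$ is valid if $\alpha_{x,k}$ is an interior point of $rec(V(\alpha_{x,k}), \alpha_{x,k})$, i.e., $\alpha_{x,k}$ is in $rec(V(\alpha_{x,k}), \alpha_{x,k})$ and does not belong to the boundary of $rec(V(\alpha_{x,k}), \alpha_{x,k})$.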
Figure 2 presents an example of valid manipulations in two-dimensional space: each arrow represents a manipulation, each dashed box represents a (hyper-)rectangle of the corresponding manipulation, and the activation $\alpha_{x,k}$ is an interior point of the space formed by the dashed boxes.
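The hyper-rectangle of a manipulation, and the union of hyper-rectangles induced by a set of manipulations, can be sketched as simple membership tests. The function names `in_rec`, `in_rec_union` and `shift`, and the four diagonal moves, are hypothetical and chosen to mirror a two-dimensional picture such as Figure 2; the sketch checks membership only and does not decide whether a point is interior.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Manipulation = Callable[[Vector], Vector]


def in_rec(a: Vector, b: Vector, point: Vector) -> bool:
    """True if `point` lies in the hyper-rectangle spanned by the corners a and b."""
    return all(min(u, v) <= p <= max(u, v) for u, v, p in zip(a, b, point))


def in_rec_union(manipulations: Sequence[Manipulation], a: Vector, point: Vector) -> bool:
    """True if `point` lies in the union of the hyper-rectangles of all manipulations of a."""
    return any(in_rec(a, delta(a), point) for delta in manipulations)


def shift(dx: float, dy: float) -> Manipulation:
    """Hypothetical two-dimensional manipulation that moves an activation by (dx, dy)."""
    return lambda a: [a[0] + dx, a[1] + dy]


# Four diagonal moves whose rectangles surround the activation on all sides.
moves = [shift(1, 1), shift(-1, 1), shift(1, -1), shift(-1, -1)]
```

With these four moves the activation `[0.0, 0.0]` is surrounded by the union of rectangles, which is the situation the validity condition asks for.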
Since we work with discretised spaces, which is a reasonable assumption for images, we introduce the notion of a minimal manipulation. If applying a minimal manipulation, it suffices to check for misclassification just at the end points, that is, $\alpha_{x,k}$ and $\delta_k(\alpha_{x,k})$. This allows an exhaustive, albeit impractical, exploration of the region in unit steps.
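A manipulation $\delta_k^1(\alpha_{y,k})$ is finer than $\delta_k^2(\alpha_{x,k})$, written as $\delta_k^1(\alpha_{y,k}) \le \delta_k^2(\alpha_{x,k})$, if any activation in the hyper-rectangle of the former is also in the hyper-rectangle of the latter. It is implied in this definition that $\alpha_{y,k}$ is an activation in the hyper-rectangle of $\delta_k^2(\alpha_{x,k})$. Moreover, we write $\delta_{k,k'}(\alpha_{x,k})$ for $\phi_{k'}(\dots \phi_{k+1}(\delta_k(\alpha_{x,k})))$, representing the corresponding activation in layer $k' \ge k$ after applying manipulation $\delta_k$ on the activation $\alpha_{x,k}$, where $\delta_{k,k}(\alpha_{x,k}) = \delta_k(\alpha_{x,k})$.
Definition 3 A manipulation $\delta_k$ on an activation $\alpha_{x,k}$ is minimal if there do not exist manipulations $\delta_k^1$ and $\delta_k^2$ and an activation $\alpha_{y,k}$ such that $\delta_k^1(\alpha_{x,k}) \le \delta_k(\alpha_{x,k})$, $\alpha_{y,k} = \delta_k^1(\alpha_{x,k})$, $\delta_k(\alpha_{x,k}) = \delta_k^2(\alpha_{y,k})$, and $\alpha_{y,n} \ne \alpha_{x,n}$ and $\alpha_{y,n} \ne \delta_{k,n}(\alpha_{x,k})$.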
Intuitively, a minimal manipulation does not have a finer manipulation that results in a different classification. However, it is possible to have different classifications before and after applying the minimal manipulation, i.e., it is possible that $\delta_{k,n}(\alpha_{x,k}) \ne \alpha_{x,n}$. It is not hard to see that the minimality of a manipulation implies that the class change in its associated hyper-rectangle can be detected by checking the class of the end points $\alpha_{x,k}$ and $\delta_k(\alpha_{x,k})$.
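Recall that we apply manipulations in layer $L_k$, but check the classification decisions in the output layer. To ensure finite, exhaustive coverage of the region, we introduce a continuity assumption on the mapping from space $D_{L_k}$ to the output space $D_{L_n}$, adapted from the concept of bounded variation. Given an activation $\alpha_{x,k}$ with its associated region $\eta_k(\alpha_{x,k})$, we define a “ladder” on $\eta_k(\alpha_{x,k})$ to be a set $ld$ of activations containing $\alpha_{x,k}$ and finitely many, possibly zero, activations from $\eta_k(\alpha_{x,k})$. The activations in a ladder can be arranged into an increasing order $\alpha_{x,k} = \alpha_{x_0,k} < \alpha_{x_1,k} < \dots < \alpha_{x_j,k}$ such that every activation $\alpha_{x_t,k} \in ld$ appears once and has a successor $\alpha_{x_{t+1},k}$ such that $\alpha_{x_{t+1},k} = \delta_k(\alpha_{x_t,k})$ for some manipulation $\delta_k \in V(\alpha_{x_t,k})$. For the greatest element $\alpha_{x_j,k}$, its successor should be outside the region $\eta_k(\alpha_{x,k})$, i.e., $\alpha_{x_{j+1},k} \notin \eta_k(\alpha_{x,k})$. Given a ladder $ld$, we write $ld(t)$ for its $(t+1)$-th activation, $ld[0..t]$ for the prefix of $ld$ up to the $(t+1)$-th activation, and $last(ld)$ for the greatest element of $ld$. Figure 3 gives a diagrammatic explanation of the ladders.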
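Definition 4 Let $\mathcal{L}(\eta_k(\alpha_{x,k}))$ be the set of ladders in $\eta_k(\alpha_{x,k})$. Then the total variation of the region $\eta_k(\alpha_{x,k})$ on the neural network with respect to $\mathcal{L}(\eta_k(\alpha_{x,k}))$ is
$$V(N; \eta_k(\alpha_{x,k})) = \sup_{ld \in \mathcal{L}(\eta_k(\alpha_{x,k}))} \sum_{\alpha_{x_t,k} \in ld \setminus \{last(ld)\}} diff_n(\alpha_{x_t,n}, \alpha_{x_{t+1},n})$$
where $diff_n : D_{L_n} \times D_{L_n} \to \{0, 1\}$ is given by $diff_n(\alpha_{x,n}, \alpha_{y,n}) = 0$ if $\alpha_{x,n} = \alpha_{y,n}$ and 1 otherwise. We say that the region $\eta_k(\alpha_{x,k})$ is a bounded variation if $V(N; \eta_k(\alpha_{x,k})) < \infty$, and are particularly interested in the case when $V(N; \eta_k(\alpha_{x,k})) = 0$, which is called a 0-variation.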
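For a finite set of ladders the total variation reduces to counting class changes between consecutive activations and taking the maximum. The sketch below assumes that each ladder is given as an ordered list of activations and that a `classify` callable returns the output-layer class; the names `diff`, `ladder_variation` and `total_variation` and the toy classifier are illustrative.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def diff(c1: int, c2: int) -> int:
    """Return 0 if the two classes agree and 1 otherwise."""
    return 0 if c1 == c2 else 1


def ladder_variation(ladder: Sequence[Vector], classify: Callable[[Vector], int]) -> int:
    """Count the class changes between consecutive activations of one ladder."""
    classes = [classify(a) for a in ladder]
    return sum(diff(c, d) for c, d in zip(classes, classes[1:]))


def total_variation(ladders: Sequence[Sequence[Vector]],
                    classify: Callable[[Vector], int]) -> int:
    """Maximum variation over a finite set of ladders (the supremum in the finite case)."""
    return max((ladder_variation(ld, classify) for ld in ladders), default=0)


def toy_classify(a: Vector) -> int:
    """Hypothetical classifier: class 1 once the first coordinate reaches 2."""
    return 1 if a[0] >= 2 else 0
```

A region is a 0-variation for the given ladders exactly when `total_variation` returns 0.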
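The set $\mathcal{L}(\eta_k(\alpha_{x,k}))$ is complete if, for any ladder $ld \in \mathcal{L}(\eta_k(\alpha_{x,k}))$ of $j+1$ activations, any element $ld(t)$ for $0 \le t \le j$, and any manipulation $\delta_k \in V(ld(t))$, there exists a ladder $ld' \in \mathcal{L}(\eta_k(\alpha_{x,k}))$ such that $ld'[0..t] = ld[0..t]$ and $ld'(t+1) = \delta_k(ld(t))$. Intuitively, a complete ladder is a complete tree, on which each node represents an activation and each branch of a node corresponds to a valid manipulation. From the root $\alpha_{x,k}$, every path of the tree leading to a leaf is a ladder. Moreover, the set $\mathcal{L}(\eta_k(\alpha_{x,k}))$ is covering if the polyhedra of all activations in it cover the region $\eta_k(\alpha_{x,k})$, i.e.,
$$\eta_k(\alpha_{x,k}) \subseteq \bigcup_{ld \in \mathcal{L}(\eta_k(\alpha_{x,k}))} \bigcup_{\alpha_{y,k} \in ld} rec(V(\alpha_{y,k}), \alpha_{y,k}).$$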
Based on the above, we have the following definition of safety with respect to a set of manipulations. Intuitively, we iteratively and nondeterministically apply manipulations to explore the region $\eta_k(\alpha_{x,k})$, and safety means that no class change is observed by successive application of such manipulations.
Definition 5 [Safety wrt Manipulations] Given a neural network $N$, an input $x$ and a set $\Delta_k$ of manipulations, we say that $N$ is safe for input $x$ with respect to the region $\eta_k$ and manipulations $\Delta_k$, written as $N, \eta_k, \Delta_k \models x$, if the region $\eta_k(\alpha_{x,k})$ is a 0-variation for the set $\mathcal{L}(\eta_k(\alpha_{x,k}))$ of its ladders, which is complete and covering.
Theorem 3.1 Given a neural network $N$, an input $x$, and a region $\eta_k$, we have that $N, \eta_k \models x$ implies $N, \eta_k, \Delta_k \models x$ for any set of manipulations $\Delta_k$.
In the opposite direction, we require the minimality assumption on manipulations.
Theorem 3.2 Given a neural network $N$, an input $x$, a region $\eta_k(\alpha_{x,k})$ and a set $\Delta_k$ of manipulations, we have that $N, \eta_k, \Delta_k \models x$ implies $N, \eta_k \models x$ if the manipulations in $\Delta_k$ are minimal.
Theorem 3.2 means that, under the minimality assumption over the manipulations, an exhaustive search through the complete and covering ladder tree from $\mathcal{L}(\eta_k(\alpha_{x,k}))$ can find adversarial examples, if any, and enable us to conclude that the network is safe at a given point if none are found. Though computing minimal manipulations is not practical, in discrete spaces by iterating over increasingly refined manipulations we are able to rule out the existence of adversarial examples in the region. This contrasts with partial exploration according to, e.g., [25, 12]; for comparison see Section 7.
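An exhaustive search of a discretised region can be sketched as a breadth-first traversal of the ladder tree. This is a minimal illustration under stated assumptions rather than the DLV implementation: activations live on an integer grid, the manipulations are unit steps along single dimensions (which stand in for minimal manipulations), and the region is a box in the maximum norm. The names `unit_steps`, `linf_region`, `find_adversarial` and `toy_classify` are hypothetical.

```python
from collections import deque
from typing import Callable, List, Optional, Sequence

Point = List[int]
Manipulation = Callable[[Point], Point]


def unit_steps(n: int) -> List[Manipulation]:
    """Unit-step manipulations: plus or minus 1 along each of the n dimensions."""
    def step(i: int, s: int) -> Manipulation:
        return lambda a: a[:i] + [a[i] + s] + a[i + 1:]
    return [step(i, s) for i in range(n) for s in (-1, 1)]


def linf_region(x: Point, radius: int) -> Callable[[Sequence[int]], bool]:
    """Region of all grid points within `radius` of x in the maximum norm."""
    return lambda a: all(abs(u - v) <= radius for u, v in zip(a, x))


def find_adversarial(x: Point, classify: Callable[[Point], int],
                     in_region: Callable[[Sequence[int]], bool],
                     manipulations: Sequence[Manipulation]) -> Optional[Point]:
    """Explore every grid point reachable inside the region; return a class change or None."""
    target = classify(x)
    seen = {tuple(x)}
    queue = deque([tuple(x)])
    while queue:
        a = queue.popleft()
        for delta in manipulations:
            b = tuple(delta(list(a)))
            if b in seen or not in_region(b):
                continue  # already visited, or the ladder has left the region
            if classify(list(b)) != target:
                return list(b)
            seen.add(b)
            queue.append(b)
    return None


def toy_classify(a: Point) -> int:
    """Hypothetical classifier whose decision boundary is a[0] + a[1] = 5."""
    return 1 if a[0] + a[1] >= 5 else 0
```

Returning `None` corresponds to concluding safety for the explored region under the minimality assumption; a returned point is an adversarial example.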
4 The Verification Framework
In this section we propose a novel framework for automated verification of safety of classification decisions, which is based on search for an adversarial misclassification within a given region. The key distinctive features of our framework compared to existing work are: a guarantee that a misclassification is found if it exists; the propagation of the analysis layer by layer; and working with hidden layers, in addition to input and output layers. Since we reduce verification to a search for adversarial examples, we can achieve safety verification (if no misclassifications are found for all layers) or falsification (in which case the adversarial examples can be used to fine-tune the network or shown to a human tester).
4.1 Layer-by-Layer Analysis
We first consider how to propagate the analysis layer by layer, which will involve refining manipulations through the hidden layers. To facilitate such analysis, in addition to the activation function $\phi_k : D_{L_{k-1}} \to D_{L_k}$ we also require a mapping $\psi_k : D_{L_k} \to D_{L_{k-1}}$ in the opposite direction, to represent how a manipulated activation of layer $L_k$ affects the activations of layer $L_{k-1}$. We can simply take $\psi_k$ as the inverse function of $\phi_k$. In order to propagate safety of regions $\eta_k(\alpha_{x,k})$ at a point $x$ into deeper layers, we assume the existence of functions $\eta_k$ that map activations to regions, and impose the following restrictions on the functions $\phi_k$ and $\psi_k$, shown diagrammatically in Figure 4.
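Definition 6 The functions $\{\eta_0, \eta_1, \dots, \eta_n\}$ and $\{\psi_1, \dots, \psi_n\}$ mapping activations to regions are such that
1. $\eta_k(\alpha_{x,k}) \subseteq D_{L_k}$, for $k = 0, \dots, n$,
2. $\alpha_{x,k} \in \eta_k(\alpha_{x,k})$, for $k = 0, \dots, n$, and
3. $\eta_{k-1}(\alpha_{x,k-1}) \subseteq \psi_k(\eta_k(\alpha_{x,k}))$ for all $k = 1, \dots, n$.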
Intuitively, the first two conditions state that each function $\eta_k$ assigns a region around the activation $\alpha_{x,k}$, and the last condition that mapping the region $\eta_k$ from layer $L_k$ to $L_{k-1}$ via $\psi_k$ should cover the region $\eta_{k-1}$. The aim is to compute functions $\eta_{k+1}, \dots, \eta_n$ based on $\eta_k$ and the neural network.
The size and complexity of a deep neural network generally means that determining whether a given set $\Delta_k$ of manipulations is minimal is intractable. To partially counter this, we define a refinement relation between safety wrt manipulations for consecutive layers in the sense that $N, \eta_k, \Delta_k \models x$ is a refinement of $N, \eta_{k-1}, \Delta_{k-1} \models x$ if all manipulations $\delta_{k-1}$ in $\Delta_{k-1}$ are refined by a sequence of manipulations $\delta_k$ from the set $\Delta_k$. Therefore, although we cannot theoretically confirm the minimality of $\Delta_k$, the manipulations are refined layer by layer and, in discrete settings, this process can be bounded from below by the unit step. Moreover, we can work gradually from a specific layer inwards until an adversarial example is found, finishing processing when reaching the output layer.
The refinement framework is given in Figure 5.
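The arrows represent the implication relations between the safety notions and are labelled with conditions if needed. The goal of the refinements is to find a chain of implications to justify $N, \eta_0 \models x$. The fact that $N, \eta_k \models x$ implies $N, \eta_{k-1} \models x$ is due to the constraints in Definition 6 when $\psi_k = \phi_k^{-1}$. The fact that $N, \eta_k \models x$ implies $N, \eta_k, \Delta_k \models x$ follows from Theorem 3.1. The implication from $N, \eta_k, \Delta_k \models x$ to $N, \eta_k \models x$ under the condition that $\Delta_k$ is minimal is due to Theorem 3.2.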
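We now define the notion of refinability of manipulations between layers. Intuitively, a manipulation in layer $L_{k-1}$ is refinable in layer $L_k$ if there exists a sequence of manipulations in layer $L_k$ that implements the manipulation in layer $L_{k-1}$.
Definition 7 A manipulation $\delta_{k-1}(\alpha_{y,k-1})$ is refinable in layer $L_k$ if there exist activations $\alpha_{x_0,k}, \dots, \alpha_{x_j,k} \in D_{L_k}$ and valid manipulations $\delta_k^1 \in V(\alpha_{x_0,k}), \dots, \delta_k^j \in V(\alpha_{x_{j-1},k})$ such that $\alpha_{y,k} = \alpha_{x_0,k}$, $\delta_{k-1,k}(\alpha_{y,k-1}) = \alpha_{x_j,k}$, and $\alpha_{x_t,k} = \delta_k^t(\alpha_{x_{t-1},k})$ for $1 \le t \le j$. Given a neural network $N$ and an input $x$, the manipulations $\Delta_k$ are a refinement by layer of $\eta_{k-1}$ and $\Delta_{k-1}$ if, for all $\alpha_{y,k-1} \in \eta_{k-1}(\alpha_{x,k-1})$, all its valid manipulations $\delta_{k-1}(\alpha_{y,k-1})$ are refinable in layer $L_k$.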
We have the following theorem stating that the refinement of safety notions is implied by the “refinement by layer” relation.
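Assume a neural network $N$ and an input $x$. For all layers $k \ge 1$, if manipulations $\Delta_k$ are a refinement by layer of $\eta_{k-1}$ and $\Delta_{k-1}$, then we have that $N, \eta_k, \Delta_k \models x$ implies $N, \eta_{k-1}, \Delta_{k-1} \models x$.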
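We note that any adversarial example of safety wrt manipulations $N, \eta_k, \Delta_k \models x$ is also an adversarial example for general safety $N, \eta_k \models x$. However, an adversarial example $\alpha_{x,k}$ for $N, \eta_k \models x$ at layer $k$ needs to be checked to see if it is an adversarial example of $N, \eta_0 \models x$, i.e. for the input layer. Recall that $Pre_0(\alpha_{x,k})$ is not necessarily unique. This is equivalent to checking the emptiness of $Pre_0(\alpha_{x,k}) \cap \eta_0(\alpha_{x,0})$. If we start the analysis with a hidden layer $k > 0$ and there is no specification for $\eta_0$, we can instead consider checking the emptiness of $\{\alpha_{y,0} \in Pre_0(\alpha_{x,k}) \mid \alpha_{y,n} \ne \alpha_{x,n}\}$.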
4.2 The Verification Method
We summarise the theory developed thus far as a search-based recursive verification procedure given below. The method is parameterised by the region $\eta_k$ around a given point and a family of manipulations $\Delta_k$. The manipulations are specified by the user for the classification problem at hand, or alternatively can be selected automatically, as described in Section 4.4. The vector norm to identify the region can also be specified by the user and can vary by layer. The method can start in any layer, with analysis propagated into deeper layers, and terminates when a misclassification is found. If an adversarial example is found by manipulating a hidden layer, it can be mapped back to the input layer, see Section 4.5.
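Algorithm 1 Given a neural network $N$ and an input $x$, recursively perform the following steps, starting from some layer. Let $k$ be the current layer under consideration.
1. determine a region $\eta_k$ such that, if $k$ is not the starting layer, then $\eta_k$ and $\eta_{k-1}$ satisfy Definition 6;
2. determine a manipulation set $\Delta_k$ such that, if $k$ is not the starting layer, then $\Delta_k$ is a refinement by layer of $\eta_{k-1}$ and $\Delta_{k-1}$ according to Definition 7;
3. verify whether $N, \eta_k, \Delta_k \models x$,
(a) if $N, \eta_k, \Delta_k \models x$, then report that $N$ is safe at $x$ with respect to $\eta_k(\alpha_{x,k})$ and $\Delta_k$, and continue to layer $k+1$;
(b) if $N, \eta_k, \Delta_k \not\models x$, then report an adversarial example.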
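The control flow of the recursive procedure above can be sketched as a loop over layers. The three callbacks `select_region`, `select_manipulations` and `search` are hypothetical placeholders for the region selection, manipulation selection and 0-variation check; how they are realised (for example with an SMT solver) is not shown here.

```python
from typing import Any, Callable, Optional, Tuple


def verify_layers(num_layers: int, start: int,
                  select_region: Callable[[int, Any], Any],
                  select_manipulations: Callable[[int, Any, Any], Any],
                  search: Callable[[int, Any, Any], Optional[Any]]
                  ) -> Tuple[str, int, Optional[Any]]:
    """Walk from layer `start` to the output layer, stopping at the first misclassification."""
    region, manipulations = None, None
    for k in range(start, num_layers + 1):
        # Step 1: pick a region, constrained by the previous layer's region if there is one.
        region = select_region(k, region)
        # Step 2: pick manipulations that refine those of the previous layer.
        manipulations = select_manipulations(k, region, manipulations)
        # Step 3: search the region; None means no class change was found in layer k.
        counterexample = search(k, region, manipulations)
        if counterexample is not None:
            return ("unsafe", k, counterexample)
    return ("safe", num_layers, None)
```

The returned tuple records the verdict, the last layer analysed, and the adversarial example if one was found.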
We implement Algorithm 1 by utilising satisfiability modulo theories (SMT) solvers. The SMT problem is a decision problem for logical formulas with respect to combinations of background theories expressed in classical first-order logic with equality. For checking refinement by layer, we use the theory of linear real arithmetic with existential and universal quantifiers, and for verification within a layer (0-variation) we use the same theory but without universal quantification. The details of the encoding and the approach taken to compute the regions and manipulations are included in Section 4.4. To enable practical verification of deep neural networks, we employ a number of heuristics described in the remainder of this section.
4.3 Feature Decomposition and Discovery
While Theorems 3.1 and 3.2 provide a finite way to verify safety of neural network classification decisions, the high dimensionality of the region can make any computational approach impractical. We therefore use the concept of a feature to partition the region into a set of features, and exploit their independence and low dimensionality. This allows us to work with state-of-the-art networks that have hundreds, and even thousands, of dimensions.
Intuitively, a feature defines for each point in the high-dimensional space $D_{L_k}$ the most explicit salient feature it has, e.g., the red-coloured frame of a street sign in Figure 10. Formally, for each layer $L_k$, a feature function $f_k : D_{L_k} \to \mathcal{P}(D_{L_k})$ assigns a small region for each activation $\alpha_{x,k}$ in the space $D_{L_k}$, where $\mathcal{P}(D_{L_k})$ is the set of subspaces of $D_{L_k}$. The region $f_k(\alpha_{x,k})$ may have lower dimension than that of $D_{L_k}$. It has been argued that natural data, for example natural images and sound, forms a high-dimensional manifold, which embeds tangled manifolds to represent their features. Feature manifolds usually have lower dimension than the data manifold, and a classification algorithm is to separate a set of tangled manifolds. By assuming that the appearance of features is independent, we can manipulate them one by one regardless of the manipulation order, and thus reduce the problem of size $O(2^{d_1 + \dots + d_m})$ into a set of smaller problems of size $O(2^{d_1}), \dots, O(2^{d_m})$.
The analysis of activations in hidden layers, as performed by our method, provides an opportunity to discover the features automatically. Moreover, defining the feature $f_k$ on each activation as a single region corresponding to a specific feature is without loss of generality: although an activation may include multiple features, the independence relation between features suggests the existence of a total relation between these features. The function $f_k$ essentially defines for each activation one particular feature, subject to certain criteria such as explicit knowledge, but features can also be explored in parallel.
Every feature is identified by a pre-specified number of dimensions. Let be the set of dimensions selected according to some heuristic. Then we have that
Moreover, we need a set of features to partition the region as follows.
A set of regions is a partition of , written as , if for and .
Given such a partition , we define a function by
which contains one point for each feature. Then, we reduce the checking of 0-variation of a region to the following problems:
checking whether the points in have the same class as , and
checking the 0-variation of all features in .
In the above procedure, the checking of points in can be conducted either by following a pre-specified sequential order (single-path search) or by exhaustively searching all possible orders (multi-path search). In Section 5 we demonstrate that single-path search according to the prominence of features can enable us to find adversarial examples, while multi-path search may find other examples whose distance to the original input image is smaller.
4.4 Selection of Regions and Manipulations
The procedure summarised in Algorithm 1 is typically invoked for a given image in the input layer but, provided that insight into the hidden layers is available, it can start from any layer in the network. The selection of regions can be automated, as described below.
For the first layer $k$ to be considered, the region $\eta_k(\alpha_{x,k})$ is defined by first selecting the subset of $dims_k$ dimensions from $P_k$ whose activation values are furthest away from the average activation value of the layer. (Footnote 1: We also considered other approaches, including computing derivatives up to several layers, but for the experiments we conduct they are less effective.) Intuitively, the knowledge represented by these activations is more explicit than the knowledge represented by the other dimensions, and manipulations over more explicit knowledge are more likely to result in a class change. Let $avg_k = (\sum_{p \in P_k} \alpha_{x,k}(p)) / n_k$ be the average activation value of layer $L_k$. We let $dims_k(\alpha_{x,k})$ be the first $dims_k$ dimensions $p \in P_k$ with the greatest values $|\alpha_{x,k}(p) - avg_k|$ among all dimensions, and then define
$$\eta_k(\alpha_{x,k}) = \times_{p \in dims_k(\alpha_{x,k})} [\alpha_{x,k}(p) - s_p m_p,\ \alpha_{x,k}(p) + s_p m_p],$$
i.e., a $|dims_k(\alpha_{x,k})|$-polytope containing the activation $\alpha_{x,k}$, where $s_p$ represents a small span and $m_p$ represents the number of such spans. Let $V_k = \{s_p, m_p \mid p \in dims_k(\alpha_{x,k})\}$ be a set of variables.
Let be a function mapping from to such that , and be the set of such functions. Let a manipulation be
for activation . That is, each manipulation changes a subset of the dimensions by the span , according to the directions given in . The set is defined by collecting the set of all such manipulations. Based on this, we can define a set of ladders, which is complete and covering.
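The automated selection of dimensions and the construction of span-based manipulations can be sketched as follows. This is an illustration under assumptions, not the DLV code: the direction of change for each selected dimension is assumed to be one of -1, 0 or +1 with at least one non-zero entry, spans are given per dimension in a dictionary, and the names `select_dims` and `make_manipulations` are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Manipulation = Callable[[Vector], Vector]


def select_dims(activation: Vector, num_dims: int) -> List[int]:
    """Pick the dimensions whose values are furthest from the layer's average activation."""
    mean = sum(activation) / len(activation)
    ranked = sorted(range(len(activation)),
                    key=lambda p: abs(activation[p] - mean), reverse=True)
    return sorted(ranked[:num_dims])


def make_manipulations(dims: Sequence[int], spans: Dict[int, float]) -> List[Manipulation]:
    """One manipulation per non-zero direction vector over the selected dimensions.

    Assumption: each selected dimension moves by -1, 0 or +1 spans.
    """
    result: List[Manipulation] = []
    for directions in product((-1, 0, 1), repeat=len(dims)):
        if not any(directions):
            continue  # skip the all-zero direction, which would change nothing

        def delta(a: Vector, directions=directions) -> Vector:
            b = list(a)  # copy, so the input activation is left untouched
            for p, d in zip(dims, directions):
                b[p] += d * spans[p]
            return b

        result.append(delta)
    return result
```

For `d` selected dimensions this yields `3**d - 1` manipulations, which is one reason the number of selected dimensions needs to stay small in practice.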
4.4.1 Determining the region according to
Given and the functions and , we can automatically determine a region satisfying Definition 6 using the following approach. According to the function , the activation value of perceptron is computed from activation values of a subset of perceptrons in . We let be such a set of perceptrons. The selection of dimensions in depends on and , by requiring that, for every , there is at least one dimension such that . We let
Therefore, the restriction of Definition 6 can be expressed with the following formula:
We omit the details of rewriting and into Boolean expressions, which follow from standard techniques. Note that this expression includes variables in and . The variables in are fixed for a given . Because such a region always exists, a simple iterative procedure can be invoked to gradually increase the size of the region represented with variables in to eventually satisfy the expression.
4.4.2 Determining the manipulation set according to , , and
The values of the variables obtained from the satisfiability of Eqn (7) yield a definition of manipulations using Eqn (5). However, the obtained values for span variables do not necessarily satisfy the “refinement by layer” relation as defined in Definition 7. Therefore, we need to adapt the values for the variables while, at the same time, retaining the region . To do so, we could rewrite the constraint in Definition 7 into a formula, which can then be solved by an SMT solver. But, in practice, we notice that such precise computations easily lead to overly small spans , which in turn result in an unacceptable amount of computation needed to verify the relation .
To reduce computational cost, we work with a weaker “refinable in layer ” notion, parameterised with respect to precision . Given two activations and , we use to represent their distance.
A manipulation is refinable in layer with precision if there exists a sequence of activations and valid manipulations such that , , , and for . Given a neural network and an input , the manipulations are a refinement by layer of with precision if, for all , all its legal manipulations are refinable in layer with precision .
Comparing with Definition 7, the above definition replaces with and . Intuitively, instead of requiring a manipulation to reach the activation precisely, this definition allows for each to be within the hyper-rectangle . To find suitable values for according to the approximate “refinement-by-layer” relation, we use a variable to represent the maximal number of manipulations of layer used to express a manipulation in layer . The value of (and variables and in ) are automatically adapted to ensure the satisfiability of the following formula, which expresses the constraints of Definition 9:
It is noted that and for are employed when expressing . The manipulation is obtained from by considering the corresponding relation between dimensions in and .
4.5 Mapping Back to Input Layer
When manipulating the hidden layers, we may need to map back an activation in layer to the input layer to obtain an input image that resulted in misclassification, which involves computation of