
Geometric Path Enumeration for Equivalence Verification of Neural Networks

by Samuel Teuber, et al.

As neural networks (NNs) are increasingly introduced into safety-critical domains, there is a growing need to formally verify NNs before deployment. In this work we focus on the formal verification problem of NN equivalence which aims to prove that two NNs (e.g. an original and a compressed version) show equivalent behavior. Two approaches have been proposed for this problem: Mixed integer linear programming and interval propagation. While the first approach lacks scalability, the latter is only suitable for structurally similar NNs with small weight changes. The contribution of our paper has four parts. First, we show a theoretical result by proving that the epsilon-equivalence problem is coNP-complete. Secondly, we extend Tran et al.'s single NN geometric path enumeration algorithm to a setting with multiple NNs. In a third step, we implement the extended algorithm for equivalence verification and evaluate optimizations necessary for its practical use. Finally, we perform a comparative evaluation showing use-cases where our approach outperforms the previous state of the art, both, for equivalence verification as well as for counter-example finding.





I Introduction

With the success of deep neural networks (NNs) in recent years, there has been an increasing trend to introduce machine learning based approaches into many domains – including safety-critical areas, such as airborne collision avoidance [8] and autonomous cars [3]. Therefore, there is a growing interest in verification of NNs. This spurred research on new verification methods [9, 15, 17, 16, 10, 14, 12]. The literature can broadly be classified into adversarial robustness verification (e.g. the work by Singh et al. [15]), functional property verification (e.g. the work by Katz et al. [9] on the verification of ACAS Xu) and equivalence verification (e.g. Kleine Büning et al. [10] and Paulsen et al. [14]). The main application of equivalence verification is NN compression. As NNs grow ever larger and computing becomes ever more ubiquitous, resource restrictions require compressing large NNs into smaller models. Cheng et al. [4] give an extensive survey of such compression techniques. Furthermore, equivalence verification can be deployed to examine the influence of certain NN-based pre-processing steps (cf. [12]) or in cases where performing multiple verification tasks on a large NN would be too expensive (cf. [10]). In what follows, we refer to the method of Paulsen et al. [13] as ReluDiff and that of Kleine Büning et al. [10] as MilpEquiv.

One approach which has previously been shown to yield good results for adversarial and functional verification on NNs is Geometric Path Enumeration (GPE) [16, 1]. However, this algorithm was initially devised as an approach operating on a single NN. In this work we extend GPE to a setting with multiple NNs and implement its extension for the problem of equivalence verification. We explore which (sometimes previously used) optimizations yield good results when applied to the equivalence problem. While our work in this paper is specific to the problem of equivalence verification, the extended GPE algorithm can also be used for other verification tasks involving multiple NNs.

I-A Contribution

In this work, we focus on the problem of equivalence verification for (potentially) structurally differing NNs. Our contributions are as follows:

  • We prove that the ε-equivalence problem for NNs is coNP-complete.

  • We extend the GPE algorithm (Tran et al. [16]) to a setting with multiple NNs and apply it to the equivalence verification problem.

  • We evaluate several optimizations for this setting which increase efficiency on practical problems.

  • We perform a comparative evaluation of our algorithm (on ACAS Xu and modified MNIST benchmarks) and show that it outperforms MilpEquiv in four of our five ε-equivalence benchmarks and in counterexample finding.

I-B Overview

The structure of our paper is as follows: In Section II, we present related work in the field of NN equivalence and introduce the basic notions of GPE relevant to our paper. Before explaining the idea behind our extension of GPE to multiple NNs in Section IV, we show that the ε-equivalence problem is coNP-complete in Section III. Starting from a naive algorithm, we then evaluate optimizations to enable efficient equivalence verification using GPE in Section V. We further explore these optimizations, in particular the question of good refinement heuristics, in Section VI. Finally, in Section VII, we evaluate our algorithm and show advantages and disadvantages compared to the current state of the art represented by MilpEquiv [10].

II Preliminaries & Related Work

II-A Feed-Forward NNs

NNs consist of interconnected units called neurons. A neuron computes a non-linear function of its input values according to y = σ(∑_i w_i x_i + b), where σ is called the activation function and the w_i are the weights. b is commonly referred to as the bias of the neuron. In this paper, we focus on the rectified linear unit activation function, ReLU(x) = max(0, x), which is one of the most commonly used activation functions in modern NNs [6]. Outputs of neurons are connected as input to other neurons, resulting in a directed graph. In this paper, we focus on feed-forward NNs, where the underlying graph is acyclic. Neurons are organized in layers, where neurons in layer i take inputs only from the directly preceding layer i−1. The first layer—called input layer—is just a place holder for the inputs to be fed into the NN, the subsequent layers are called hidden layers, while the last layer—the output layer—holds the function value computed by the NN. We refer to the input space dimension as n and to the output dimension as m.
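As a concrete illustration, such a feed-forward ReLU NN can be evaluated in a few lines of NumPy. The layer layout and the toy weights below are our own illustrative choices, not values from the paper:

```python
import numpy as np

def relu(x):
    # Rectified linear unit: max(0, x), applied element-wise
    return np.maximum(0.0, x)

def forward(layers, x):
    """Evaluate a feed-forward ReLU NN given as a list of (W, b) pairs.
    ReLU is applied after every layer except the output layer."""
    for i, (W, b) in enumerate(layers):
        x = W @ x + b
        if i < len(layers) - 1:
            x = relu(x)
    return x

# A toy 2-2-1 network with hand-picked weights (illustrative only)
layers = [
    (np.array([[1.0, -1.0], [0.5, 0.5]]), np.array([0.0, -0.25])),
    (np.array([[1.0, 2.0]]), np.array([0.1])),
]
y = forward(layers, np.array([1.0, 0.5]))  # y[0] == 1.6
```

Here n = 2 and m = 1; the hidden layer computes [0.5, 0.5] after ReLU, and the output layer maps this to 1.6.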

II-B Verifying NN Equivalence

There has recently been a line of work which proposes various compression techniques for NNs (for a full review see Cheng et al. [4]). While such techniques have been shown to be useful in practice, most lack a formal proof of correctness and rely only on empirical evidence. The usage of such techniques thus raises the question of how to prove that two NNs N1 (reference) and N2 (test) with their corresponding mathematical functions f1, f2 are equivalent, i.e. that they produce the same results.

Kleine Büning et al. [10] examined three possible definitions of equivalence on a given subset of inputs I ⊆ R^n, two of which we review here:

Definition 1 (ε-Equivalence [10] / Differential Equivalence [13]).

Two NNs N1 and N2 are ε-equivalent with respect to a norm ||·||_p, if ||f1(x) − f2(x)||_p ≤ ε for all x ∈ I.

Definition 2 (Top-1-Equivalence [10]).

Two NNs N1 and N2 are top-1-equivalent, if argmax_i f1(x)_i = argmax_i f2(x)_i for all x ∈ I.

While Definition 1 is particularly suitable for regression tasks as it can show a very strong form of equivalence, the latter definition is more relaxed and especially useful for classification tasks.

Paulsen et al. [14] and Kleine Büning et al. [10] proposed two fundamentally different approaches to NN equivalence verification. While Paulsen et al. [14] proposed a technique called ReluDiff/NeuroDiff which uses (symbolic) interval propagation on NNs with similar weight configurations (e.g. produced through float truncation) to prove ε-equivalence, the MilpEquiv technique [10] is based on mixed integer linear programming (MILP) and encodes (potentially) structurally different NNs together with the desired equivalence property into an optimization problem. ReluDiff/NeuroDiff was shown to be efficient for cases where a NN's weights had been truncated. However, this approach does not work at all for structurally differing NNs and suffers in performance when weight differences are larger. For such NNs, MilpEquiv was shown to work well for a number of small instances (e.g. two 64-pixel-input MNIST NNs with a total of 84 nodes). In particular, MilpEquiv is able to provide a maximal radius around a data point for which the property still holds.

II-C Geometric Path Enumeration

GPE is a methodology originally proposed by Tran et al. [16] for verifying safety properties of NNs. Given a NN N computing a function f, a set of input instances I and an unsafe output specification U defined as a set of linear constraints, safety verification is concerned with the question whether there exist any instances x ∈ I such that f(x) ∈ U.

Instead of pushing single data points through the NN and checking whether they satisfy the required safety property, GPE feeds the entire set I into the NN and then evaluates whether any parts of the output sets lie inside U. The sets are represented through generalized star sets or zonotopes which we define below:

Definition 3 (Generalized Star Set [16]).

A generalized star set is a tuple Θ = (c, V, P) where c ∈ R^d is the center, V ∈ R^{d×k} is the generator matrix, and P ⊆ R^k is a set defined through a conjunction of linear constraints (i.e. a polytope). The set represented by Θ is then defined as: [Θ] = { c + Vα | α ∈ P }.

Alternatively, it is possible to use zonotopes which further restrict the type of predicate allowed:

Definition 4 (Zonotopes [1]).

A zonotope is a generalized star set with the further restriction that P may only be defined through interval constraints (i.e. P only enforces a lower and upper bound for each dimension).

We refer to zonotopes and generalized star sets as set data structures. Initially, the GPE algorithm converts the provided input space I into either of these data structures and then propagates the sets through the NN. The affine transformation deployed through a dense layer of a NN can be exactly represented by zonotopes and generalized star sets through application of the weight matrix to c and V (and addition of the bias to c).

ReLU nodes require a different type of transformation since ReLU is only a piece-wise linear function: To this end, the data structures can either be split by introduction of an additional hyperplane into the linear constraint predicate P (exact GPE) or the ReLU function can be over-approximated [2, 15] (approximate GPE). Note that optimization over generalized star sets is much more expensive than optimization over zonotopes, which can be computed with a closed-form solution.
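To make Definitions 3 and 4 and these propagation rules concrete, here is a minimal star-set sketch in NumPy. The class layout and method names are ours (not those of the NNEquiv implementation), the predicate is stored as A α ≤ d, and no feasibility checks are performed:

```python
import numpy as np

class StarSet:
    """Generalized star set (c, V, P) with polytope predicate P: A @ alpha <= d.
    Represents { c + V @ alpha | A @ alpha <= d }."""
    def __init__(self, c, V, A, d):
        self.c, self.V, self.A, self.d = c, V, A, d

    def affine(self, W, b):
        # A dense layer x -> W x + b acts directly on center and generators;
        # the predicate (A, d) is unchanged.
        return StarSet(W @ self.c + b, W @ self.V, self.A, self.d)

    def split_relu(self, i):
        """Exact ReLU split on output dimension i: returns the branch where
        x_i >= 0 (kept as-is) and the branch where x_i <= 0 (dim i zeroed)."""
        row_V, row_c = self.V[i], self.c[i]
        # Branch 1: x_i >= 0  <=>  -row_V @ alpha <= row_c
        pos = StarSet(self.c, self.V,
                      np.vstack([self.A, -row_V]), np.append(self.d, row_c))
        # Branch 2: x_i <= 0  <=>  row_V @ alpha <= -row_c, output i set to 0
        c2, V2 = self.c.copy(), self.V.copy()
        c2[i], V2[i, :] = 0.0, 0.0
        neg = StarSet(c2, V2,
                      np.vstack([self.A, row_V]), np.append(self.d, -row_c))
        return pos, neg

# Unit box predicate alpha in [-1, 1]^2, identity generators
box_A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
box_d = np.ones(4)
s = StarSet(np.zeros(2), np.eye(2), box_A, box_d)
pos, neg = s.affine(np.eye(2), np.array([0.5, 0.0])).split_relu(0)
```

Each split adds one hyperplane to the predicate, which is exactly the source of the exponential worst-case growth discussed later.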

III NN Equivalence and NP-Completeness

Katz et al. [9] have previously shown that the satisfiability problem for linear input and output constraints of a single NN with ReLU nodes is NP-complete. We refer to this decision problem as Net-Verify. In this section, we show that the ε-equivalence problem for NNs is coNP-complete. Since disproving ε-equivalence is NP-complete, the task of proving ε-equivalence is coNP-complete.

Theorem 1 (ε-Net-Equiv is coNP-complete).

Let N1, N2 be two arbitrary ReLU NNs and let I be some common input space of the two NNs. Determining whether ||f1(x) − f2(x)||_p ≤ ε holds for all x ∈ I is coNP-complete for any p-norm ||·||_p.

The full proof can be found in Appendix A. In essence, the proof consists of a reduction from Net-Verify to ε-Net-Equiv. In order to reduce a Net-Verify instance consisting of a NN N and a linear constraint specification U, we encode it as follows: The first NN consists only of N. The second NN consists of N and a suitable encoding of the linear constraints U. We then show that ε-equivalence can be disproven iff some input of N satisfies the given specification U.

IV Extending GPE to multiple NNs

The most trivial approach to extend GPE to multiple NNs would be to stitch multiple NNs into a single composite NN and then execute regular GPE on this composite NN. However, the composite NN's weight matrices would be considerably larger, which would increase the computational load. Furthermore, NNs with a different number of layers would have to be padded for this approach. This would nullify any performance gains which could otherwise be achieved through the reduced NN size.

Instead, we propose to propagate star sets through both NNs sequentially. By carefully selecting the constraint sets of the propagated sets, we can ensure that there remains a point-wise correspondence between the output data structures of GPE for the two (or more) NNs considered. To make our approach clear, we introduce transfer functions as a way of reasoning about exact propagation of set data structures.

Definition 5 (Transfer Function).

Let N be a NN computing a function f. A transfer function T is a function which, given an input data structure Θ, produces a set of output data structures T(Θ) s.t. every Θ' ∈ T(Θ) satisfies [Θ'] ⊆ f([Θ]) and the union of all [Θ'] within T(Θ) equals f([Θ]).

Using these transfer functions, we show that there is a correspondence between the output sets of two NNs in GPE:

Theorem 2 (NN Output Correspondence).

Let N1, N2 be two NNs with their corresponding transfer functions T1, T2, and let Θ = (c, V, P) be some input data structure. For any Θ1 = (c1, V1, P1) ∈ T1(Θ) and Θ2 = (c2, V2, P2) ∈ T2((c, V, P1)):

{ (f1(c + Vα), f2(c + Vα)) | α ∈ P2 } = { (c1 + V1α, c2 + V2α) | α ∈ P2 }

An over-approximation would produce additional, spurious points in the output of a transfer function and may therefore produce spurious output tuples. In this case the right side of Theorem 2 becomes a superset. This in turn gives rise to the modified GPE algorithm outlined in Algorithm 1. We begin by feeding our input data structure Θ into the first NN. The propagation step function for the data structures (step) is the same as in the single-NN GPE algorithm in Section II-C. For every output star set Θ1 = (c1, V1, P1), we restrict the input data structure according to the predicate of the output of the first NN, i.e. we continue with (c, V, P1). Then we feed this data structure into the second NN to obtain Θ2 = (c2, V2, P2). In the end, we can compare the two output tuples (c1, V1) and (c2, V2) constrained by the predicate P2.

Note that both considered output sets are therefore constrained by P2 (not P1). This is the essential insight that allows our approach to produce point-wise correspondences between the outputs of the two NNs.

Require: Input Θ, NNs N1, N2
Ensure: Verification result (equiv or nonequiv)
  W ← {Θ}  {Working list of set data structures}
  while not W.empty() do
     Θ' ← step(W)*  {Propagate by one neuron}
     if Θ' finished network N1 then
        store (c1, V1)  {Store output from N1 for comparison}
     end if
     if Θ' finished network N2 then
        (c1, V1)  {Output of N1}
        (c2, V2)  {Output of N2}
        if not is_equiv((c1, V1), (c2, V2), P2)* then
           return not equivalent
        end if
     end if
  end while
  return equivalent
Algorithm 1: High-level path enumeration algorithm for equivalence checking. * indicates the step uses LP solving.
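For intuition about the sequential propagation behind Algorithm 1, the following toy sketch (our own simplification, with helper names we invented) handles purely affine NNs, where no ReLU splits occur and the predicate passes through unchanged. It illustrates how both outputs remain parameterized over the same α, and hence over the same input point:

```python
import numpy as np

def propagate_affine(layers, c, V):
    # Affine-only "transfer function": one output star per input star,
    # predicate unchanged.
    for W, b in layers:
        c, V = W @ c + b, W @ V
    return c, V

def output_pairs(net1, net2, c, V, alphas):
    """Propagate the same star (c, V, P) through both NNs sequentially and
    evaluate the point-wise output pairs for sample predicate points alpha."""
    c1, V1 = propagate_affine(net1, c, V)
    c2, V2 = propagate_affine(net2, c, V)  # the same predicate constrains both
    return [(c1 + V1 @ a, c2 + V2 @ a) for a in alphas]

net1 = [(np.array([[2.0]]), np.array([0.0]))]
net2 = [(np.array([[2.0]]), np.array([0.1]))]  # slightly perturbed copy
pairs = output_pairs(net1, net2, np.zeros(1), np.eye(1), [np.array([0.5])])
# Both entries of pairs[0] correspond to the *same* input point c + V @ alpha
```

In the real algorithm the second propagation starts from (c, V, P1) and splits may refine the predicate further to P2, but the point-wise correspondence via the shared α is exactly the one shown here.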

IV-A Equivalence on Set Data Structures

For our equivalence verification approach it is necessary to define an equivalence check is_equiv which verifies whether two set data structures satisfy ε-equivalence or top-1 equivalence. First, we present how ε-equivalence with the Chebyshev norm can be proven for zonotopes. Afterwards, we show how star sets can be used to prove ε-equivalence and top-1 equivalence.


In order to prove ε-equivalence with the Chebyshev norm, we need to bound the maximum deviation between the two NN outputs by ε. That is, given the two output zonotopes (c1, V1, P) and (c2, V2, P), we want to find the maximal deviation:

max_{α ∈ P} ||(c1 + V1α) − (c2 + V2α)||_∞ = max_{α ∈ P} ||(c1 − c2) + (V1 − V2)α||_∞

As can be seen from the reformulation above, we can find the maximal deviation by solving optimization problems for each dimension of the differential zonotope (c1 − c2, V1 − V2, P).

Recalling that zonotopes can be optimized with a closed-form solution, this enables a quick check of the desired ε-equivalence property. However, since zonotopes only approximate the output set, one may need to fall back to the use of star sets if equivalence cannot be established using zonotopes. In this case, we can reuse the same formula from above to obtain a differential star set which is then optimized using LP solving.
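Assuming the standard normalized predicate α ∈ [−1, 1]^k, the per-dimension optimum of the differential zonotope has the closed form |c_i| + Σ_j |V_ij|, so no LP solver is needed. A sketch with made-up numbers (not from the paper's benchmarks):

```python
import numpy as np

def max_abs_deviation(c1, V1, c2, V2):
    """Closed-form bound on max_alpha ||(c1 + V1 a) - (c2 + V2 a)||_inf for the
    differential zonotope (c1 - c2, V1 - V2) with alpha in [-1, 1]^k."""
    c, V = c1 - c2, V1 - V2
    # Per output dimension i: max |c_i + V_i @ a| = |c_i| + sum_j |V_ij|
    return np.max(np.abs(c) + np.sum(np.abs(V), axis=1))

c1, V1 = np.array([1.0, 0.0]), np.array([[0.5, 0.0], [0.0, 0.3]])
c2, V2 = np.array([0.9, 0.1]), np.array([[0.4, 0.0], [0.0, 0.3]])
dev = max_abs_deviation(c1, V1, c2, V2)  # 0.2: |0.1| + |0.1| in dimension 0
is_eps_equivalent = dev <= 0.25  # the check succeeds for eps = 0.25
```

If this bound exceeds ε, the result is inconclusive (zonotopes over-approximate), and the fallback to star sets with LP solving described above applies.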

Top-1 Equivalence

For top-1 equivalence there are two possible approaches, which both rely on propagated star sets. We can reuse the MilpEquiv encoding and employ a MILP solver. Alternatively, we can use a simplex (LP) solver. In the latter case we split up the output star set Θ1 = (c1, V1, P1) of N1:

For every output dimension i we generate a polytope P_i. Additional constraints ensure that output i is the maximum among the outputs of N1 in P_i. Note that the union of P_1 to P_m covers all of P1. We then examine the outputs of N2 for every P_i. Since i is always the maximum of N1 for this part of the output space, we want to ensure that i is also always the maximum of N2. Therefore, we compute the maximal difference between output dimension i and the other dimensions of N2 in P_i. If all of these differences are below 0, we can guarantee top-1 equivalence. This procedure produces m star sets and m(m−1) optimization operations in total.
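The LP variant can be sketched as follows. The paper's implementation uses its own LP setup; here we illustrate the idea with scipy.optimize.linprog, a shared box predicate A α ≤ d, and output stars (c1, V1) and (c2, V2) — all names and the example data are ours:

```python
import numpy as np
from scipy.optimize import linprog

def top1_equiv(c1, V1, c2, V2, A, d):
    """LP-based top-1 check (sketch): for each output dimension i of N1, build
    P_i = {alpha | A alpha <= d, dimension i is N1's maximum} and verify that
    N2's dimension i also dominates every j != i on all of P_i."""
    m = len(c1)
    for i in range(m):
        # (c1[j] + V1[j] a) <= (c1[i] + V1[i] a)  <=>  (V1[j]-V1[i]) a <= c1[i]-c1[j]
        rows = np.array([V1[j] - V1[i] for j in range(m) if j != i])
        rhs = np.array([c1[i] - c1[j] for j in range(m) if j != i])
        A_i, d_i = np.vstack([A, rows]), np.concatenate([d, rhs])
        for j in range(m):
            if j == i:
                continue
            # maximize (c2[j] + V2[j] a) - (c2[i] + V2[i] a) over P_i
            res = linprog(-(V2[j] - V2[i]), A_ub=A_i, b_ub=d_i,
                          bounds=[(None, None)] * A.shape[1])
            # res.status == 0: optimum found; infeasible P_i regions are empty
            if res.status == 0 and (c2[j] - c2[i]) - res.fun > 0:
                return False  # N2's argmax can differ from i inside P_i
    return True

box_A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
box_d = np.ones(4)
# A NN is trivially top-1 equivalent to itself
same = top1_equiv(np.zeros(2), np.eye(2), np.zeros(2), np.eye(2), box_A, box_d)
```

With m output dimensions, the outer loop creates the m restricted polytopes and the inner loop issues the m(m−1) optimizations described above.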

IV-B Challenges and Limitations of the Approach

While the techniques outlined above permit a straightforward extension of GPE to multiple NNs, and thus allow achieving equivalence verification, the approach comes with a number of pitfalls which should be avoided. The most obvious is probably the possibility of exponential growth in the number of star sets. As previously noted, the exact GPE approach based on star sets splits the star sets on ReLU nodes. Tran et al. [16] rightly note that the observed growth usually falls drastically behind the worst case; however, the increase in ReLU nodes through the processing of two NNs at once certainly leads to an increase in necessary splits. This is particularly the case for ReLU nodes which cut off very similar hyperplanes (such as two ReLU nodes at the same position in a NN with truncated weights). This can not only double the work, but may also lead to precision problems with LP solvers, which tend to show problematic behavior when encountering a problem with a very small feasible set (in one case, the solver would return drastically differing maximum values for the same optimization problem depending on the previous requests, or would suddenly deem the problem infeasible). To avoid such numerical problems we thus use 64-bit floats by default and always ensure that feasibility is checked at least once by an exact (i.e. rational) LP solver before a branch is declared infeasible. While this can mitigate most numerical problems, the approach is weaker than ReluDiff/NeuroDiff for the specific use case of weight truncation for structurally similar NNs (e.g. truncation from 32-bit to 16-bit floats) – this is best left to the approach presented by Paulsen et al. [14]. Although these initial improvements help in making GPE for equivalence possible, this approach is not yet scalable. Hence, we devote the next section to various optimizations.

V Optimizing GPE for two NNs

The approach presented above is not yet scalable. In particular, we identify two bottlenecks: The number of splits and the time taken for LP optimization. Therefore, we consider a number of optimizations, some of which have previously been used by Bak [2].

V-A Zonotope Propagation

As an initial optimization we reused the zonotope propagation technique presented by Bak et al. [1], which reduces the number of LP optimizations necessary through a zonotope based bounds computation. We refer to this first version of the algorithm as NNEquiv-E (for exact). As can be seen in Figure 3 later on, this approach produces a total runtime of 54,390s on our 9 benchmark instances.

V-B Zonotope Over-Approximation

To further optimize the algorithm we can either reduce the time spent per zonotope or we can try to reduce the number of zonotopes which have to be considered. In order to achieve the second objective, we can over-approximate certain splits through a methodology first presented by Singh et al. [15] and later reused by Bak [2]: The idea is to introduce an additional dimension to the zonotope and use it to over-approximate the ReLU node by a parallelogram. Over-approximation errors accumulate across layers (Bak [2] refers to this as an error snowball). To make the parallelogram as tight as possible and minimize the over-approximation error, we use the bounds computed through LP solving (instead of the looser zonotope bounds) if there were any exact splits beforehand. In an abstraction-refinement setting, we would start by propagating over-approximating zonotopes through both NNs and then check whether the equivalence property can be established. If the property does not hold, we refine one over-approximated node by splitting the zonotope and propagating the split zonotopes further through the NNs.

In Figure 1 we compare the share of nodes per layer whose bounds contain the value 0 for an exact approach in comparison to the propagation of an over-approximating zonotope. Any such node can be considered a split candidate which could be used for refinement. Each refinement can then help in reducing the over-approximation error and in establishing the desired property. As can clearly be seen in the plot, the over-approximation approach produces a lot more split candidates than the exact approach. Not all of the split candidates encountered for the over-approximation would actually have to be refined in the worst case. This is because many of the split candidates are only artifacts of previous over-approximations. We refer to these split candidates as ghost splits. These ghost splits cannot be easily distinguished from actual, necessary splits. The only guaranteed non-ghost split is the first split candidate encountered, while all later split candidates might be artifacts of over-approximation.

Thus, the simplest refinement strategy would be to refine only this node. We refer to this strategy as NNEquiv-F (First); it reduces the runtime on our benchmark set to 2,489s (cf. Figure 3). However, this approach still leaves room for improvement, as we explain in Section VI.

Fig. 1: Comparison of exact star set propagation (Exact) and propagation of an over-approximating zonotope (Overapprox) for ε-equivalence over ACAS_1_1: Net1 corresponds to N1 and Net2 corresponds to N2 (Overapprox Net2 is hidden behind Overapprox Net1)

V-C LP approximation

With the introduction of over-approximation we encounter an additional problem: Splitting hyperplanes no longer depend on the input variables only, but also on the dimensions introduced through the over-approximation. This raises the question how to handle the additional dimensions in the propagated star set: Since equally increasing the dimensionality of the LP problem leads to increased solver runtimes, we instead opted to over-approximate the LP problem. Classically, for an (n+k)-dimensional zonotope with initial input dimensionality n, we observe a hyperplane cut of the following form:

∑_{i=1}^{n} a_i α_i + ∑_{i=n+1}^{n+k} a_i α_i ≤ b (1)

We can now over-approximate this inequality by computing c_min = min_α ∑_{i=n+1}^{n+k} a_i α_i through zonotope optimization and constraining the LP problem with the following inequality:

∑_{i=1}^{n} a_i α_i ≤ b − c_min (2)

Since any solution for Equation 1 is also a solution for Equation 2, the second inequality is an over-approximation and can be used to reduce the number of dimensions the LP solver has to handle despite the over-approximation of the zonotope. Note that we need to take this over-approximation into account for minimization/maximization tasks: Since the LP solver only optimizes over the first n dimensions, we need to add the optimization result of the over-approximating zonotope for the remaining k dimensions. We refer to this version as NNEquiv-A (for approximate LP). Figure 3 shows that this approach reduces the runtime to 1,631s.
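Assuming the usual normalization where every over-approximation dimension ranges over [−1, 1], c_min has a closed form and the dimension reduction of Equation 2 becomes a two-liner. The function name and numbers below are our own illustration:

```python
import numpy as np

def reduce_cut(a, b, n):
    """Over-approximate the cut sum_i a_i alpha_i <= b over (n+k) zonotope
    dimensions by a cut over only the first n (input) dimensions:
        sum_{i<n} a_i alpha_i <= b - c_min,
    where c_min = min of sum_{i>=n} a_i alpha_i over alpha_i in [-1, 1],
    computed in closed form since those dimensions are interval-bounded."""
    c_min = -np.sum(np.abs(a[n:]))  # minimum of a_i * alpha_i is -|a_i|
    return a[:n], b - c_min

# Two input dimensions plus one dimension from over-approximation
a = np.array([1.0, -2.0, 0.5])
a_red, b_red = reduce_cut(a, 1.0, n=2)  # b_red = 1.0 - (-0.5) = 1.5
```

Every point satisfying the original three-dimensional cut also satisfies the reduced two-dimensional one, which is exactly the soundness direction needed for an over-approximation.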

VI The Branch Tree Exploration Problem

Given the introduced over-approximation of splits, it becomes necessary to define a strategy that decides which over-approximations are refined if it turns out that the property cannot be established with the current over-approximation. The problem of refinement heuristics has previously been studied for single NNs by Bak [2], who experimentally showed that a classic refinement-loop approach, which over-approximates everything and step by step refines over-approximations starting at the beginning of the NN (i.e. NNEquiv-F/A), sometimes performs worse than exact analysis. While we were able to reproduce this problem for some benchmark instances, we observed an improvement for others. A good approach thus seems to be to begin propagation with an exact strategy which splits on every encountered neuron, but eventually transitions into over-approximation.

We proceed with a formal analysis of different strategies and their (dis)advantages. For this we consider the binary trees that are implicitly explored by a GPE algorithm: For given NNs and input space I, the implicit tree explored by GPE consists of vertices S ∪ O where S are the inner nodes of the tree representing splits and O are the leafs of the tree representing the output set data structures. The execution of an exact GPE algorithm implicitly produces a set of paths of the form (s_1, …, s_k, o) with s_i ∈ S and o ∈ O that are (for now) explored sequentially. We denote this set of paths as P. For the exact case, the number of explored paths is fixed to the number of leafs |O|. Since GPE produces a partitioning of the input space I, we can associate a part of the input space to every leaf and to every inner node. For its execution GPE needs to descend into each leaf and execute a check function on each leaf to prove equivalence. descend refers to the operations necessary to process a star set up to its next split. check refers to the operations necessary to prove equivalence on an output star set. Since the descend function is executed once for each of the edges E of the tree and the check function is executed once for each of the leaves, the execution time of NNEquiv-E is bounded by |E| · t_descend + |O| · t_check.

Omitting the option of reordering the inner nodes and thus producing a smaller tree, we must either reduce |E| and |O| or t_descend and t_check to reduce solving times. In many cases, the considered property cannot only be proven on the part of the input space associated to a leaf, but there also exists some inner node with an associated part of the input space which is already sufficiently partitioned to show the property using over-approximation. For a given equivalence property we can define a function d which returns the number d(π) of necessary steps in the given path π for the property to be verifiable on the input space part associated to element d(π) of path π. The exploration of the tree induced by the set of paths

P_min = { π[1 .. d(π)] | π ∈ P }

would then be sufficient to prove equivalence (π[i .. j] denotes the prefix path of π from step i to step j).

Fig. 2: ACAS_1_1-retrain: Descend depth at point of successful equivalence proof (running percentile window width: 479)

NNEquiv-F/A manages to obtain this minimal number of paths – however, at the cost of much more time spent on each path. In particular, check is not only executed for each leaf, but also for each inner node along a path. Ignoring the over-approximation costs, this produces the following lower bound for the cost of NNEquiv-F/A:

∑_{π ∈ P_min} d(π) · (t_descend + t_check)

Even when assuming the omitted over-approximation steps to be free, NNEquiv-F/A becomes less effective than NNEquiv-E if asymptotically ∑_{π ∈ P_min} d(π) grows faster than |O|, i.e. if the reduction of paths in P_min is insignificant in comparison to the check time for the additional checks. While there are cases where NNEquiv-F/A is effective, this is not guaranteed to be the case – especially for larger NNs with higher values for d(π) (which increases the number of checks) and expensive check functions.

However, this formal framework allows us to define the (virtual) optimal run which takes the minimal amount of work for a given tree: An algorithm which has an oracle for d and always over-approximates at the right node. This approach has a runtime of

∑_{π ∈ P_min} (d(π) · t_descend + t_check)

Since |P_min| ≤ |O| and the omitted over-approximation time tends to be smaller than the descend time, this approach can provide the optimum achievable through heuristics for d. In fact, we simulated such virtual runs using a pre-computed oracle by computing d using NNEquiv-A and descending only the minimum necessary number of steps for each path. In our evaluation we refer to this approach as NNEquiv-O. As expected, NNEquiv-O produced the best results of all variants considered in our work, running for only 635s on our benchmark set. This is not a practical algorithm, but it provides a lower bound for the time achievable using heuristics.

It is thus important to find a good heuristic which estimates d(π). Such heuristics are much more difficult to analyze theoretically because they are particularly dependent on the distribution of the encountered paths. Therefore, we only explore two heuristics experimentally, which shows that heuristics have a significant impact on the runtime.

Figure 2 plots the depth at which GPE was successful in proving equivalence for a path in an ACAS Xu NN (i.e. the values of d(π)). Besides the data in grey, we plot a number of running percentiles over the depth values.

A strategy which we found to be inefficient is the use of a running maximum over the number of refinements needed by previous paths. This strategy is referred to as NNEquiv-M (for maximum) and drastically increases the runtime to 19,191s, presumably by over-estimating the number of refinements and thus increasing the number of paths considered.

Since Figure 2 suggests that there are phases in which the NN needs deeper or shallower refinements, we considered a heuristic which predicts a refinement depth equal to the depth of the previous path minus 1. This accounts for the possible phases of the depth and also ensures that the algorithm is optimistic in the sense that it always tries to reduce the number of refinement steps. This can then reduce the number of considered paths. We refer to this heuristic as NNEquiv-L; it reduces the runtime on the benchmark set by another 5% to 1,553s. While the methodology of over-approximation using zonotopes is the same for NNEquiv-A and NNEquiv-O/L/M, the approaches differ in the strategy deciding where the over-approximation is refined.
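The NNEquiv-L prediction rule can be sketched as follows. The driver loop is our own simplified model of the refinement process (if the predicted depth is insufficient, the path is refined further until the proof succeeds), and the depth values are illustrative, not measured:

```python
def predict_depth(prev_depth, min_depth=1):
    """NNEquiv-L-style heuristic (sketch): predict the refinement depth of the
    next path as the previous path's depth minus one, staying optimistic."""
    return max(min_depth, prev_depth - 1)

depths = []
d = 5  # depth of the first (fully explored) path
for actual in [4, 4, 6, 2]:  # true required depths of subsequent paths
    d = predict_depth(d)     # optimistic first guess
    d = max(d, actual)       # refine deeper until the proof succeeds
    depths.append(d)
# depths == [4, 4, 6, 5]
```

Note how an over-prediction (the final 5 where 2 would suffice) decays only by one per path, which is the price of this simple rule.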

VII Experimental Evaluation

To evaluate our approach, we implemented the GPE-based equivalence verification technique in Python, using parts of a pre-existing (single-NN) GPE implementation by Bak et al. [1]. We will refer to our implementation as NNEquiv (our implementation is available on GitHub). Our evaluation aims at answering the following questions:

  • (E1) Do the proposed optimizations make the algorithm more efficient?

  • (E2) How does NNEquiv compare to previous work such as MilpEquiv [10] and ReluDiff [13]? (Unfortunately, there is no artifact for NeuroDiff [14] which we could have evaluated.)

  • (E3) How does the tightness of the ε-equivalence constraint influence solving behavior?

Fig. 3: Time in seconds taken for our equivalent benchmarks per version. Note that the upper half of the y-axis has a logarithmic scale for improved visibility of the results.

VII-A Experimental Setup

The benchmark landscape for the task of equivalence verification is still very limited. Paulsen et al. [13] proposed a number of benchmark NNs consisting of pairs of NNs differing only in the bit-width of the weights (32 bit vs. 16 bit). As discussed before, we see this as a restricted use case and are more interested in generic NNs with varying structures and weights. This is why we omit a comparison on these NNs where the approach by Paulsen et al. [14] is clearly faster and more precise. Structurally differing NNs have been previously proposed by Kleine Büning et al. [10] who examined 3 NNs of differing layer depths for digit classification on an MNIST [11] data set with reduced resolution [5] (8x8 pixels).

In order to evaluate and compare the approaches, we proceeded as follows: First, we decided to look at two types of NNs: Image classification on the 8x8 pixel MNIST data set and NNs used in control systems in the context of an Airborne Collision Avoidance System (ACAS Xu [8]). Then, based on the original ACAS Xu NNs 1_1 and 1_2, we constructed a total of 4 mirror NNs through retraining (ACAS_1_1-retrain, ACAS_1_2-retrain) and student-teacher training [7] for smaller NNs (ACAS_1_1-student, ACAS_1_2-student). In addition to the smallest and largest MNIST 8x8 NNs considered in previous work (MNIST_small-top, MNIST_medium-top), we constructed two larger MNIST models using student-teacher training (MNIST_large-top, MNIST_larger-top). Moreover, we constructed a second version of MNIST_large-top for ε-equivalence verification (MNIST_large-epsilon). All NNs were trained using variants of student-teacher training and were trained in such a way that they were likely to be top-1 or ε-equivalent in some parts of the input space. More details on the properties of the 9 considered benchmark NNs are available online (an overview table of all benchmarks is published with the artifact). The input space considered for verification is a sensitive choice, as it can have a significant and varying impact on the performance of different verification techniques. For the case of GPE, the algorithm's performance tends to degrade with increasing input space size due to the growth in necessary splits. Therefore, for each individual benchmark, we decided to look at an input size which was hard to handle for NNEquiv-E. This has two reasons: First, it allows us to evaluate the impact of the optimizations presented above in their ability to decrease runtimes. Secondly, it permits comparing the performance of NNEquiv to the performance of MilpEquiv on instances which are difficult for our approach. The entire experimental setup can be found online (on GitHub).

We used a machine with 4 AMD EPYC 7281 16-core processors (i.e. 64 cores in total) and a total of 64 GB of RAM. All experiments were run with a single thread, a memory limit of 512 MB (the limit was irrelevant in practice, as no experiment reached it), and a timeout of 3 hours. The experiments were run in parallel with up to 24 processes at once. All times given in the subsequent sections are the median of 3 runs.

VII-B Comparison of NNEquiv versions

Figure 3 shows that the proposed optimizations help to reduce the runtime of the algorithm (note that the upper half of the y-axis has a logarithmic scale for improved visibility of the results). On the one hand, we can observe that refinement heuristics can, in principle, both improve and worsen the result of the approach (as seen with NNEquiv-L and NNEquiv-M). On the other hand, we see that there is still significant room for improvement through the development of better refinement heuristics; this optimization would be complementary to further optimizations which could be developed.

VII-C Comparison to previous work

Benchmark            Property  NNEquiv-L  MilpEquiv
ACAS_1_1-retrain     ε         167.45     TO
ACAS_1_1-student     ε         84.85      TO
ACAS_1_2-retrain     ε         326.59     TO
ACAS_1_2-student     ε         109.46     320.07
MNIST_large-epsilon  ε         35.90      19.97
MNIST_small-top      top-1     14.39      3.51
MNIST_medium-top     top-1     94.51      3.85
MNIST_large-top      top-1     13.02      25.85
MNIST_larger-top     top-1     706.56     386.04
TABLE I: Runtime comparison (in seconds) for NNEquiv-L and MilpEquiv (TO = timeout)

The comparison to MilpEquiv is shown in Table I. NNEquiv outperforms MilpEquiv on the ACAS instances, where MilpEquiv even runs into a timeout for three of the four verification tasks. In particular, this seems to be the case for larger NNs with low-dimensional inputs. The superior performance of MilpEquiv for MNIST_large-epsilon seems to be caused by the LP solver in NNEquiv, which is an order of magnitude slower in solving optimization tasks for MNIST than for ACAS Xu. As this cannot be explained by the number of constraints, we suspect the problem is related to the larger input dimensionality of the MNIST case (64 inputs in comparison to 5 inputs for ACAS Xu). The ACAS-retrain NNs have the same structure as the original ACAS NNs, allowing us to compare NNEquiv to ReluDiff. While ReluDiff was able to quickly verify equivalence for the truncated NNs, whose mean absolute weight difference is very small, it was significantly slower than our approach on the retrain instances, which exhibit a larger mean weight difference, for loose bounds ε, and even timed out for smaller values of ε. This suggests that the applicability of ReluDiff is not only restricted to structurally similar NNs, but that its performance also heavily depends on small weight differences. Regarding question (E2), we note that our approach is applicable to a broader class of NNs than ReluDiff and solved instances where both other approaches timed out.
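The contrast between truncated and retrained NNs can be made concrete by computing the mean absolute weight difference between two structurally identical networks. The following toy sketch (hypothetical helper name, synthetic weights, not the paper's code) illustrates why bit-width truncation produces far smaller differences than retraining:

```python
import numpy as np

def mean_abs_weight_diff(weights_a, weights_b):
    """Mean absolute difference over all corresponding weight entries of
    two structurally identical NNs, given as lists of weight matrices."""
    diffs = [np.abs(wa - wb).ravel() for wa, wb in zip(weights_a, weights_b)]
    return float(np.concatenate(diffs).mean())

rng = np.random.default_rng(0)
# Toy "original" network: two random weight matrices.
w_orig = [rng.normal(size=(4, 4)), rng.normal(size=(4, 2))]
# Truncation: round weights to float16 precision (tiny perturbation).
w_trunc = [w.astype(np.float16).astype(np.float64) for w in w_orig]
# Retraining stand-in: add noticeable noise to every weight.
w_retrain = [w + rng.normal(scale=0.1, size=w.shape) for w in w_orig]

# Truncation changes weights far less than retraining does.
print(mean_abs_weight_diff(w_orig, w_trunc)
      < mean_abs_weight_diff(w_orig, w_retrain))  # True
```

This gap in weight difference is exactly the regime distinction observed above: ReluDiff's interval propagation stays precise when the per-weight perturbation is tiny, but loses precision once the differences grow.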

VII-D Influence of ε-equivalence tightness

Fig. 4: Solving time in seconds for ε-equivalence with varying ε: Solving times increase with tighter properties; however, NNEquiv-L outperforms MilpEquiv. ReluDiff is outperformed by NNEquiv-L for retrained NNs.

Concerning question (E3), we evaluated the performance of the approaches as we vary the tightness ε of ε-equivalence. Note that we did not attempt to prove equivalence for the tightest bounds for ACAS_1_2-student, as we found this NN not to be 0.005-equivalent to the original NN. Intuitively, a proof for a tighter bound requires more work, as the approach either needs to refine more over-approximations (in the case of NNEquiv-L) or perform further branch-and-bound operations (in the case of MilpEquiv). We can observe this behavior in Figure 4, which plots the runtime of MilpEquiv and NNEquiv-L as we tighten the bound ε. Taking into account the log scales on both axes, we can observe that NNEquiv-L is at least one order of magnitude faster in proving equivalence for the ACAS Xu NNs for tight bounds. In particular, MilpEquiv produces timeouts for 3 of the 4 considered NNs once ε becomes sufficiently small. We therefore suspect that our approach is better at handling very tight constraints in large NNs with low-dimensional input. This could be due to the fact that GPE can use additional NN information (layer structure etc.) for its refinement decisions, which is not readily available to the branch-and-bound algorithm in the backend of MilpEquiv. For comparison, we plotted the performance of ReluDiff on the NNs with truncated weights and on the retrained NNs: as can be seen in Figure 4, the approach by Paulsen et al. [13] behaves similarly with respect to tightness. Additionally, we see that the approach is less efficient for the retrained NNs, where equivalence for small ε cannot be established.

VII-E Finding Counterexamples

Our technique can also be used to find counterexamples, showing that two NNs are not equivalent at a certain point. This information can be useful for further training NNs after a failed equivalence proof. To this end, we compared the counterexample-finding capabilities of NNEquiv-L with those of MilpEquiv. To account for possibly easy instances, we looked at a large number of non-equivalent input spaces for each of our benchmark NNs, which we know to be equivalent on other parts of the input space. To generate counterexamples, we randomly sampled input points and selected those with differing NN outputs. Using this technique, we produced 100 distinct non-equivalent input points for each of our 9 benchmarks. These points were used as centers of ε-balls which represented our input spaces.
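The sampling procedure described above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the function name and the toy stand-in networks are assumptions, and top-1 disagreement is used as the non-equivalence criterion.

```python
import numpy as np

def sample_disagreement_points(f1, f2, low, high, n_samples, n_points, seed=0):
    """Randomly sample inputs in the box [low, high] and keep points where
    the two NNs' top-1 predictions differ (candidate counterexample centers)."""
    rng = np.random.default_rng(seed)
    found = []
    for _ in range(n_samples):
        x = rng.uniform(low, high)
        if int(np.argmax(f1(x))) != int(np.argmax(f2(x))):
            found.append(x)
            if len(found) == n_points:
                break
    return found

# Toy stand-ins for two NNs that disagree whenever x[0] != 0.
f1 = lambda x: np.array([x[0], 0.0])
f2 = lambda x: np.array([-x[0], 0.0])

points = sample_disagreement_points(
    f1, f2, np.full(2, -1.0), np.full(2, 1.0), n_samples=1000, n_points=5)
print(len(points))  # 5: disagreements are everywhere in this toy setup
```

Each collected point would then serve as the center of an ε-ball, and the verifier is asked to find a concrete counterexample within that input region.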

We then evaluated NNEquiv-L and MilpEquiv on the same input space radii as in Section VII-B with the objective of finding counterexamples. Since counterexample extraction is much faster than proving equivalence, we set a timeout of 2 minutes. We found that NNEquiv-L was significantly faster and extracted a counterexample for 890 of the 900 considered benchmarks, while even a version of MilpEquiv without the expensive bounds computation executed during initialization (named MilpEquiv-B) was only able to find 315 counterexamples; note that this bounds computation improves performance for equivalence proofs, but may degrade performance for counterexample finding. Moreover, the time per counterexample was significantly lower for NNEquiv-L, making this approach an interesting technique for retraining NNs via counterexamples. Looking at the behavior of MilpEquiv's solver backend, the reason for our superior performance seems to lie in the time MilpEquiv needs to find an initial feasible solution: MilpEquiv first has to resolve the integer-based node encodings, which NNEquiv-L resolves automatically through the propagation of sets. NNEquiv-L additionally has the potential to extract polytopes of non-equivalent input space subsets, which could allow for even more efficient sampling.

                        NNEquiv-L  MilpEquiv  MilpEquiv-B
#Solved                 890        305        315
Time (incl. TO)         3,597s     75,989s    72,515s
Time/Solved (excl. TO)  2.69s      14.91s     7.21s
TABLE II: Comparison of counterexample finding capabilities of NNEquiv-L, MilpEquiv, and MilpEquiv-B (no node bounds)

VIII Conclusion and Future Work

We proposed an approach extending Geometric Path Enumeration [16] to multiple NNs. Employing this method, we presented an equivalence verification algorithm which we optimized with four techniques: zonotope propagation, zonotope over-approximation, LP approximation, and refinement heuristics. Our evaluation shows that the optimizations increase the approach's efficiency and that it can verify the equivalence of NNs which were not verifiable by MilpEquiv [10] and ReluDiff [13]. Our approach significantly outperforms the state of the art in counterexample finding, solving 890 instances in comparison to at most 315 instances solved by the MilpEquiv variants. In addition, we proved the coNP-completeness of the ε-equivalence problem and presented a formal way of reasoning about refinement heuristics in the context of GPE.

In terms of efficiency, one could further explore possible refinement heuristics and consider parallelized (possibly GPU-based) implementations. Moreover, while GPE can increase the confidence in NNs, the role of numerical stability in the verification approach needs further investigation. Furthermore, an integration of MILP constraints into the GPE propagation could be explored, resulting in an algorithm in between NNEquiv and MilpEquiv. Additionally, we see a need for a larger body of equivalence benchmarks that allows a conclusive evaluation of equivalence verification algorithms.


  • [1] S. Bak, H. Tran, K. Hobbs, and T. T. Johnson (2020) Improved Geometric Path Enumeration for Verifying ReLU Neural Networks. In International Conference on Computer Aided Verification, pp. 66–96. Cited by: §I, §V-A, §VII, Definition 4.
  • [2] S. Bak (2021) Nnenum: verification of relu neural networks with optimized abstraction refinement. In NASA Formal Methods - 13th International Symposium, NFM 2021, Proceedings, A. Dutle, M. M. Moscato, L. Titolo, C. A. Muñoz, and I. Perez (Eds.), Vol. 12673, pp. 19–36. External Links: Document Cited by: §II-C, §V-B, §V, §VI.
  • [3] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al. (2016) End to end learning for self-driving cars. arXiv preprint arXiv:1604.07316. Cited by: §I.
  • [4] Y. Cheng, D. Wang, P. Zhou, and T. Zhang (2020) A Survey of Model Compression and Acceleration for Deep Neural Networks. Cited by: §I, §II-B.
  • [5] D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. External Links: Link Cited by: §VII-A.
  • [6] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press. Cited by: §II-A.
  • [7] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. Cited by: §VII-A.
  • [8] K. D. Julian, J. Lopez, J. S. Brush, M. P. Owen, and M. J. Kochenderfer (2016) Policy compression for aircraft collision avoidance systems. In 2016 IEEE/AIAA 35th Digital Avionics Systems Conference (DASC), pp. 1–10. Cited by: §I, §VII-A.
  • [9] G. Katz, C. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer (2017) Reluplex: An efficient SMT solver for verifying deep neural networks. In International Conference on Computer Aided Verification, pp. 97–117. Cited by: §I, §III.
  • [10] M. Kleine Büning, P. Kern, and C. Sinz (2020) Verifying Equivalence Properties of Neural Networks with ReLU Activation Functions. In Principles and Practice of Constraint Programming - 26th International Conference, CP 2020, Louvain-la-Neuve, Belgium, Proceedings, Cited by: §I-B, §I, §II-B, §II-B, item (E2), §VII-A, §VIII, Definition 1, Definition 2.
  • [11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §VII-A.
  • [12] N. Narodytska, S. P. Kasiviswanathan, L. Ryzhyk, M. Sagiv, and T. Walsh (2018) Verifying Properties of Binarized Deep Neural Networks. In AAAI. Cited by: §I.
  • [13] B. Paulsen, J. Wang, and C. Wang (2020) ReluDiff: differential verification of deep neural networks. In ICSE ’20: 42nd International Conference on Software Engineering, Seoul, South Korea, G. Rothermel and D. Bae (Eds.), pp. 714–726. External Links: Document Cited by: §I, item (E2), §VII-A, §VII-D, §VIII, Definition 1.
  • [14] B. Paulsen, J. Wang, J. Wang, and C. Wang (2020) NEURODIFF: scalable differential verification of neural networks using fine-grained approximation. In 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020, Melbourne, Australia, pp. 784–796. External Links: Document Cited by: §I, §II-B, §IV-B, §VII-A, footnote 3.
  • [15] G. Singh, T. Gehr, M. Mirman, M. Püschel, and M. Vechev (2018) Fast and effective robustness certification. Advances in Neural Information Processing Systems 2018-Decem (Nips), pp. 10802–10813. External Links: ISSN 10495258 Cited by: §I, §II-C, §V-B.
  • [16] H. Tran, D. M. Lopez, P. Musau, X. Yang, L. V. Nguyen, W. Xiang, and T. T. Johnson (2019) Star-based reachability analysis of deep neural networks. In International Symposium on Formal Methods, pp. 670–686. Cited by: item (C2), §I, §I, §II-C, §IV-B, §VIII, Definition 3.
  • [17] S. Wang, K. Pei, J. Whitehouse, J. Yang, and S. Jana (2018) Efficient formal safety analysis of neural networks. In Advances in Neural Information Processing Systems, pp. 6367–6377. Cited by: §I.