I Introduction
With the success of deep neural networks (NNs) in recent years, there has been an increasing trend to introduce machine learning based approaches into many domains, including safety-critical areas such as airborne collision avoidance [8] and autonomous cars [3]. Consequently, there is a growing interest in the verification of NNs, which has spurred research on new verification methods [9, 15, 17, 16, 10, 14, 12]. The literature can broadly be classified into adversarial robustness verification (e.g. the work by Singh et al. [15]), functional property verification (e.g. the work by Katz et al. [9] on the verification of ACAS Xu) and equivalence verification (e.g. Kleine Büning et al. [10] and Paulsen et al. [14]). The main application of equivalence verification is in the space of NN compression: As NNs grow ever larger and computing becomes ever more ubiquitous, resource restrictions require compressing large NNs into smaller models. Cheng et al. [4] give an extensive survey of such compression techniques. Furthermore, equivalence verification can be deployed to examine the influence of certain NN-based preprocessing steps (cf. [12]) or in cases where performing multiple verification tasks on a large NN would be too expensive (cf. [10]). In what follows, we refer to the method of Paulsen et al. [13] as ReluDiff and to that of Kleine Büning et al. [10] as MilpEquiv.

One approach which has previously been shown to yield good results for adversarial and functional verification of NNs is Geometric Path Enumeration (GPE) [16, 1]. However, this algorithm was initially devised as an approach operating on a single NN. In this work, we extend GPE to a setting with multiple NNs and implement this extension for the problem of equivalence verification. We explore which (sometimes previously used) optimizations yield good results when applied to the equivalence problem. While our work in this paper is specific to the problem of equivalence verification, the extended GPE algorithm can also be used for other verification tasks involving multiple NNs.
I-A Contribution
In this work, we focus on the problem of equivalence verification for (potentially) structurally differing NNs. Our contributions are as follows:

We prove that the equivalence problem for NNs is coNP-complete.

We extend the GPE algorithm (Tran et al. [16]) to a setting with multiple NNs and apply it to the equivalence verification problem.

We evaluate several optimizations for this setting which increase efficiency on practical problems.

We perform a comparative evaluation of our algorithm (on ACAS Xu and modified MNIST benchmarks) and show that it outperforms MilpEquiv in four of our five equivalence benchmarks and in counterexample finding.
I-B Overview
The structure of our paper is as follows: In Section II, we present related work in the field of NN equivalence and introduce the basic notions of GPE relevant to our paper. Before explaining the idea behind our extension of GPE to multiple NNs in Section IV, we show that the equivalence problem is coNP-complete in Section III. Starting from a naive algorithm, we then evaluate optimizations to enable efficient equivalence verification using GPE in Section V. We further explore these optimizations, in particular the question of good refinement heuristics, in Section VI. Finally, in Section VII, we evaluate our algorithm and show advantages and disadvantages with respect to the current state of the art represented by MilpEquiv [10].

II Preliminaries & Related Work
II-A Feed-Forward NNs
NNs consist of interconnected units called neurons. A neuron computes a nonlinear function y = σ(Σ_j w_j x_j + b) of its input values x_1, …, x_k, where σ is called the activation function and the w_j are the weights.
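As a concrete sketch, the following evaluates a small feed-forward network layer by layer (illustrative weights of our own choosing, not taken from the paper; ReLU activations on the hidden layers are assumed):

```python
import numpy as np

def relu(x):
    # Rectified linear unit, applied element-wise.
    return np.maximum(0.0, x)

def forward(weights, biases, x):
    """Evaluate a feed-forward NN: each layer computes W @ x + b,
    followed by the activation function on all hidden layers."""
    for k, (W, b) in enumerate(zip(weights, biases)):
        x = W @ x + b
        if k < len(weights) - 1:  # no activation on the output layer
            x = relu(x)
    return x

# Illustrative 2-2-1 network (hypothetical weights).
W = [np.array([[1.0, -1.0], [0.5, 2.0]]), np.array([[1.0, 1.0]])]
b = [np.array([0.0, -1.0]), np.array([0.5])]
y = forward(W, b, np.array([1.0, 2.0]))
```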
The constant b is commonly referred to as the bias of the neuron. In this paper, we focus on the rectified linear unit activation function, ReLU(x) = max(0, x), which is one of the most commonly used activation functions in modern NNs [6]. Outputs of neurons are connected as inputs to other neurons, resulting in a directed graph. In this paper, we focus on feed-forward NNs, where the underlying graph is acyclic. Neurons are organized in layers, where neurons in layer k take inputs only from the directly preceding layer k − 1. The first layer, called the input layer, is just a placeholder for the inputs to be fed into the NN; the subsequent layers are called hidden layers, while the last layer, the output layer, holds the function value computed by the NN. We refer to the input space dimension as n_in and to the output dimension as n_out.

II-B Verifying NN Equivalence
There has recently been a line of work proposing various compression techniques for NNs (for a full review see Cheng et al. [4]). While such techniques have been shown to be useful in practice, most lack a formal proof of correctness and rely only on empirical evidence. The usage of such techniques thus raises the question of how to prove that two NNs N_1 (reference) and N_2 (test) with their corresponding mathematical functions f_1, f_2 are equivalent, i.e. that they produce the same results.
Kleine Büning et al. [10] examined three possible definitions of equivalence on a given subset of inputs I, two of which we review here:
Definition 1 (ε-Equivalence [10] / Differential Equivalence [13]).
Two NNs N_1 and N_2 are ε-equivalent with respect to a norm ‖·‖, if ‖f_1(x) − f_2(x)‖ ≤ ε for all x ∈ I.
Definition 2 (Top-1-Equivalence [10]).
Two NNs N_1 and N_2 are top-1-equivalent, if argmax(f_1(x)) = argmax(f_2(x)) for all x ∈ I, where argmax(y) denotes the index of the largest component of y.
While Definition 1 is particularly suitable for regression tasks as it can show a very strong form of equivalence, the latter definition is more relaxed and especially useful for classification tasks.
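Both definitions are straightforward to check pointwise; such sampling can only refute equivalence, never prove it over a whole input set, which is precisely what the verification machinery below is for. A minimal sketch with hypothetical toy functions f1, f2:

```python
import numpy as np

def eps_equivalent_at(f1, f2, x, eps):
    # Definition 1 for the Chebyshev norm: ||f1(x) - f2(x)||_inf <= eps
    return float(np.max(np.abs(f1(x) - f2(x)))) <= eps

def top1_equivalent_at(f1, f2, x):
    # Definition 2: both networks predict the same top-1 class
    return int(np.argmax(f1(x))) == int(np.argmax(f2(x)))

# Hypothetical toy "networks" whose first outputs differ by 0.05.
f1 = lambda x: np.array([x[0] + 1.0, x[0]])
f2 = lambda x: np.array([x[0] + 1.05, x[0]])
x0 = np.array([0.0])
```

At x0 the two functions are 0.1-equivalent and top-1-equivalent, but not 0.01-equivalent.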
Paulsen et al. [14] and Kleine Büning et al. [10] proposed two fundamentally different approaches to NN equivalence verification. Paulsen et al. [14] proposed a technique called ReluDiff/NeuroDiff which uses (symbolic) interval propagation on NNs with similar weight configurations (e.g. produced through float truncation) to prove equivalence, while the MilpEquiv technique [10] is based on mixed integer linear programming (MILP) and encodes (potentially) structurally different NNs together with the desired equivalence property into an optimization problem. ReluDiff/NeuroDiff was shown to be efficient for cases where a NN's weights had been truncated. However, this approach does not work at all for structurally differing NNs and suffers in performance when weight differences are larger. For such NNs, MilpEquiv was shown to work well for a number of small instances (e.g. two MNIST NNs with 64-pixel inputs and a total of 84 nodes). In particular, MilpEquiv is able to provide a maximal radius around a data point for which the property still holds.
II-C Geometric Path Enumeration
GPE is a methodology originally proposed by Tran et al. [16] for verifying safety properties of NNs. Given a NN N computing a function f, a set of input instances I and an unsafe output specification U defined as a set of linear constraints, safety verification is concerned with the question whether there exist any instances x ∈ I such that f(x) ∈ U.
Instead of pushing single data points through the NN and checking whether they satisfy the required safety property, GPE feeds the entire set I into the NN and then evaluates whether any part of the output sets lies inside U. The sets are represented through generalized star sets or zonotopes, which we define below:
Definition 3 (Generalized Star Set [16]).
A generalized star set is a tuple Θ = (c, V, P) where c ∈ R^n is the center, V ∈ R^{n×m} is the generator matrix, and P ⊆ R^m is a predicate defined through a conjunction of linear constraints (i.e. a polytope). The set represented by Θ is then defined as: [[Θ]] = { c + Vα | α ∈ P }.
Alternatively, it is possible to use zonotopes which further restrict the type of predicate allowed:
Definition 4 (Zonotopes [1]).
A zonotope is a generalized star set with the further restriction that P may only be defined through interval constraints (i.e. P only enforces a lower and an upper bound for each dimension of α).
We refer to zonotopes and generalized star sets as set data structures. Initially, the GPE algorithm converts the provided input space into either of these data structures and then propagates the sets through the NN. The affine transformation applied by a dense layer of a NN can be represented exactly by zonotopes and generalized star sets through application of the layer's weight matrix and bias to c and V.
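A minimal sketch of these data structures, restricted to the zonotope case (interval predicate on α), with the exact affine-layer transformation and the closed-form box optimization; the class and method names are ours, not the implementation's:

```python
import numpy as np

class Zonotope:
    """Set { c + V @ a : lb <= a <= ub }, i.e. a star set whose
    predicate is restricted to interval constraints (Defs. 3 and 4)."""
    def __init__(self, c, V, lb, ub):
        self.c, self.V, self.lb, self.ub = c, V, lb, ub

    def affine(self, W, b):
        # A dense layer x -> W @ x + b is exact on star sets:
        # W @ (c + V a) + b = (W c + b) + (W V) a, predicate unchanged.
        return Zonotope(W @ self.c + b, W @ self.V, self.lb, self.ub)

    def maximize(self, d):
        # Closed-form max of d^T y over the set: each predicate
        # dimension independently picks the bound that helps.
        g = self.V.T @ d
        return float(d @ self.c + np.where(g >= 0, self.ub, self.lb) @ g)

# Unit box input set, pushed through one illustrative dense layer.
Z = Zonotope(np.zeros(2), np.eye(2), -np.ones(2), np.ones(2))
Z1 = Z.affine(np.array([[1.0, 2.0], [0.0, 1.0]]), np.array([1.0, 0.0]))
```

For the first output dimension, y_1 = 1 + α_1 + 2α_2, so the closed-form maximum over the box is 4.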
ReLU nodes require a different type of transformation since they are only piecewise linear: To this end, the data structures can either be split by introducing an additional hyperplane into the linear constraint predicate (exact GPE), or the ReLU function can be overapproximated [2, 15] (approximate GPE). Note that optimization over generalized star sets requires LP solving and is much more expensive than optimization over zonotopes, which can be computed with a closed-form solution.

III NN Equivalence and coNP-Completeness
Katz et al. [9] have previously shown that the satisfiability problem for linear input and output constraints of a single NN is NP-complete. We refer to this decision problem as NetVerify. In this section, we show that the equivalence problem for NNs is coNP-complete: since disproving equivalence is NP-complete, the task of proving equivalence is coNP-complete.
Theorem 1 (NetEquiv is coNP-complete).
Let N_1, N_2 be two arbitrary ReLU NNs and let I be some common input space of the two NNs. Determining whether ‖f_1(x) − f_2(x)‖_p ≤ ε holds for all x ∈ I is coNP-complete for any p-norm ‖·‖_p.
The full proof can be found in Section A. In essence, the proof consists of a reduction from NetVerify to NetEquiv. In order to reduce a NetVerify instance consisting of a NN N and a linear constraint specification φ, we encode it as follows: The first NN N_1 consists only of N. The second NN N_2 consists of N and a suitable encoding of the linear constraints φ. We then show that equivalence of N_1 and N_2 can be disproven iff some input of N satisfies the given specification φ.
IV Extending GPE to multiple NNs
The most obvious approach to extending GPE to multiple NNs would be to stitch the NNs together into a single composite NN and then execute regular GPE on this composite NN. However, the composite NN's weight matrices would be considerably larger, which would increase the computational load. Furthermore, NNs with a different number of layers would have to be padded for this approach. This would nullify any performance gains which could otherwise be achieved through the reduced size of the compressed NN.
Instead, we propose to propagate star sets through both NNs sequentially. By carefully selecting the constraint sets of the propagated sets, we can ensure that there remains a pointwise correspondence between the output data structures of GPE for the two (or more) NNs considered. To make our approach clear, we introduce transfer functions as a way of reasoning about exact propagation of set data structures.
Definition 5 (Transfer Function).
Let N be a NN computing a function f. A transfer function T_N is a function which, given an input data structure Θ, produces a finite set of output data structures T_N(Θ) = {Θ'_1, …, Θ'_k} s.t. [[Θ'_i]] ⊆ f([[Θ]]) for all i and the union of all [[Θ'_i]] equals f([[Θ]]).
Using these transfer functions, we show that there is a correspondence between the output sets of two NNs in GPE:
Theorem 2 (NN Output Correspondence).
Let N_1, N_2 be two NNs with their corresponding transfer functions T_1, T_2, and let Θ = (c, V, P) be some input data structure. For any Θ'_1 = (c'_1, V'_1, P'_1) ∈ T_1(Θ) and any Θ'_2 = (c'_2, V'_2, P'_2) ∈ T_2(Θ|_{P'_1}): { (c'_1 + V'_1 α, c'_2 + V'_2 α) | α ∈ P'_2 } ⊆ { (f_1(x), f_2(x)) | x ∈ [[Θ]] }
An overapproximation would produce additional, spurious points in the output of the transfer function and may therefore produce spurious output tuples; in this case the right side of Theorem 2 becomes a superset. This in turn gives rise to the modified GPE algorithm outlined in Algorithm 1. We begin by feeding our input data structure Θ into the first NN. The propagation step function for the data structures (step) is the same as in the single-NN GPE algorithm in Section II-C. For every output star set Θ'_1 with predicate P'_1, we restrict the input data structure according to the predicate of the output of the first NN, i.e. Θ|_{P'_1}. Then we feed this data structure into the second NN to obtain the output star sets Θ'_2 with predicates P'_2. In the end, we can compare the two output tuples constrained by the predicate P'_2.
Note that both considered output sets are therefore constrained by P'_2 (not P'_1). This is the essential insight that allows our approach to produce pointwise correspondences between the outputs of the two NNs.
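The pointwise correspondence can be illustrated for the ReLU-free (purely affine) case, where no splits occur and the predicate is shared unchanged between both propagations; this is a simplified sketch of our own, as the full algorithm additionally restricts the predicate at every split:

```python
import numpy as np

def propagate_affine(c, V, layers):
    # Exact star-set propagation through affine layers only: the
    # predicate is untouched, so the same alpha indexes both outputs.
    for W, b in layers:
        c, V = W @ c + b, W @ V
    return c, V

# Shared input star: the box [-1, 1]^2 with c0 = 0, V0 = I.
c0, V0 = np.zeros(2), np.eye(2)
net1 = [(np.array([[1.0, 1.0]]), np.array([0.0]))]   # f1(x) = x1 + x2
net2 = [(np.array([[1.0, 1.0]]), np.array([0.1]))]   # f2(x) = x1 + x2 + 0.1
c1, V1 = propagate_affine(c0, V0, net1)
c2, V2 = propagate_affine(c0, V0, net2)

# The same alpha yields both networks' outputs on the same input x.
alpha = np.array([0.3, -0.7])
x = c0 + V0 @ alpha
out1, out2 = c1 + V1 @ alpha, c2 + V2 @ alpha
```

Evaluating both output stars at the same α recovers the tuple (f_1(x), f_2(x)) for the concrete input x = c + Vα.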
IV-A Equivalence on Set Data Structures
For our equivalence verification approach, it is necessary to define an equivalence check is_equiv which verifies whether two set data structures satisfy ε-equivalence or top-1 equivalence. First, we present how ε-equivalence with the Chebyshev norm can be proven for zonotopes. Afterwards, we show how star sets can be used to prove ε-equivalence and top-1 equivalence.
ε-Equivalence
In order to prove ε-equivalence with the Chebyshev norm, we need to bound the maximum deviation between the two NN outputs by ε. That is, given the two output zonotopes Z_1 = (c_1, V_1, P) and Z_2 = (c_2, V_2, P), we want to find the maximal deviation:

max_{α ∈ P} ‖(c_1 + V_1 α) − (c_2 + V_2 α)‖_∞ = max_{α ∈ P} ‖(c_1 − c_2) + (V_1 − V_2) α‖_∞

As can be seen from the reformulation above, we can find the maximal deviation over the output by solving two optimization problems (maximum and minimum) for each dimension of the differential zonotope Z_Δ = (c_1 − c_2, V_1 − V_2, P).
Recalling that zonotopes can be optimized with a closed-form solution, this enables a quick check of the desired equivalence property. However, since zonotopes only overapproximate the output set, one may need to fall back to star sets if equivalence cannot be established using zonotopes. In this case, we can reuse the same formula as above to obtain a differential star set, which is then optimized using LP solving.
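For zonotope outputs sharing one box predicate, the Chebyshev-norm check thus reduces to closed-form optimization of every dimension of the differential zonotope. A sketch under that shared-predicate assumption:

```python
import numpy as np

def max_deviation(c1, V1, c2, V2, lb, ub):
    """Maximal Chebyshev-norm deviation between two output zonotopes:
    optimize each dimension of the differential zonotope
    (c1 - c2, V1 - V2) over the shared box lb <= a <= ub."""
    dc, dV = c1 - c2, V1 - V2
    hi = dc + np.where(dV >= 0, dV * ub, dV * lb).sum(axis=1)  # per-dim max
    lo = dc + np.where(dV >= 0, dV * lb, dV * ub).sum(axis=1)  # per-dim min
    return float(np.max(np.maximum(np.abs(hi), np.abs(lo))))

def is_equiv(c1, V1, c2, V2, lb, ub, eps):
    # Sound but incomplete: a deviation bound <= eps proves equivalence.
    return max_deviation(c1, V1, c2, V2, lb, ub) <= eps

dev = max_deviation(np.array([0.0]), np.array([[1.0, 0.0]]),
                    np.array([0.1]), np.array([[0.9, 0.0]]),
                    -np.ones(2), np.ones(2))
```

In the example, the differential zonotope is −0.1 + 0.1 α_1 over α_1 ∈ [−1, 1], so the maximal deviation is 0.2.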
Top-1 Equivalence
For top-1 equivalence, there are two possible approaches which both rely on propagated star sets. We can reuse the MilpEquiv encoding and employ a MILP solver. Alternatively, we can use a simplex (LP) solver. In the latter case, we split up the output star set of the first NN:
For every output dimension i, we generate a polytope P_i. Additional constraints ensure that output i is the maximum among the outputs of N_1 in P_i. Note that the union of P_1 to P_{n_out} covers the entire output star set. We then examine the outputs of N_2 for every P_i. Since output i is always the maximum of N_1 for this part of the output space, we want to ensure that i is also always the maximal output of N_2. Therefore, we compute the maximal difference between output dimension i and each other dimension j ≠ i of N_2 in P_i. If all of these differences lie below 0, we can guarantee top-1 equivalence. This procedure produces n_out star sets and n_out (n_out − 1) optimization operations in total.
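The constraint construction behind this procedure can be sketched as follows: given output stars (c1, V1) and (c2, V2) over a shared polytope A α ≤ b, we build, for every class i, the rows making i maximal for the reference NN plus the objective directions whose maxima must stay below 0. This is a construction-only sketch of our own (the LP solving step is omitted), with hypothetical names:

```python
import numpy as np

def top1_lp_pieces(c1, V1, c2, V2, A, b):
    """For each output class i: a polytope P_i (rows A_i, b_i) where
    class i is maximal for the reference NN, plus the objectives
    f2_j - f2_i (j != i) that must all stay below 0 on P_i."""
    n_out = len(c1)
    pieces = []
    for i in range(n_out):
        # f1_i >= f1_k  <=>  (V1[k] - V1[i]) @ alpha <= c1[i] - c1[k]
        rows = np.array([V1[k] - V1[i] for k in range(n_out) if k != i])
        rhs = np.array([c1[i] - c1[k] for k in range(n_out) if k != i])
        A_i, b_i = np.vstack([A, rows]), np.concatenate([b, rhs])
        objs = [(V2[j] - V2[i], c2[j] - c2[i])  # maximize obj @ alpha + const
                for j in range(n_out) if j != i]
        pieces.append((A_i, b_i, objs))
    return pieces

# Illustrative 3-class output stars over a 2-dimensional box predicate.
A = np.vstack([np.eye(2), -np.eye(2)])   # box |alpha_i| <= 1
b = np.ones(4)
c1, V1 = np.zeros(3), np.arange(6.0).reshape(3, 2)
c2, V2 = np.zeros(3), np.arange(6.0).reshape(3, 2)
pieces = top1_lp_pieces(c1, V1, c2, V2, A, b)
```

The counts match the text: n_out polytopes and n_out(n_out − 1) optimization objectives in total.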
IV-B Challenges and Limitations of the Approach
While the techniques outlined above permit a straightforward extension of GPE to multiple NNs, and thus make equivalence verification possible, the approach comes with a number of pitfalls which should be avoided. The most obvious is the possibility of exponential growth in the number of star sets. As previously noted, the exact GPE approach based on star sets splits the star sets at ReLU nodes. Tran et al. [16] rightly note that the observed growth usually falls drastically short of the worst case; however, the increase in the number of ReLU nodes through the processing of two NNs at once certainly leads to an increase in necessary splits. This is particularly the case for ReLU nodes which cut off very similar hyperplanes (such as two ReLU nodes at the same position in a NN and its weight-truncated counterpart). This can not only double the work, but may also lead to precision problems with LP solvers, which tend to show problematic behavior when encountering a problem with a very small feasible set (in one case, the solver would return drastically differing maximum values for the same optimization problem depending on the previous requests, or would suddenly deem the problem infeasible). To avoid such numerical problems, we use 64-bit floats by default and always ensure that feasibility is checked at least once by an exact (i.e. rational) LP solver before a branch is declared infeasible. While this mitigates most numerical problems, the approach remains weaker than ReluDiff/NeuroDiff for the specific use case of weight truncation on structurally similar NNs (e.g. truncation from 32-bit to 16-bit floats); such cases are best left to the approach presented by Paulsen et al. [14]. Although these initial measures make GPE for equivalence possible, the approach is not yet scalable. Hence, we devote the next section to various optimizations.

V Optimizing GPE for two NNs
The approach presented above is not yet scalable. In particular, we identify two bottlenecks: The number of splits and the time taken for LP optimization. Therefore, we consider a number of optimizations, some of which have previously been used by Bak [2].
V-A Zonotope Propagation
As an initial optimization, we reuse the zonotope propagation technique presented by Bak et al. [1], which reduces the number of necessary LP optimizations through a zonotope-based bounds computation. We refer to this first version of the algorithm as NNEquivE (for exact). As can be seen in Figure 3 later on, this approach produces a total runtime of 54,390s on our 9 benchmark instances.
V-B Zonotope Over-Approximation
To further optimize the algorithm, we can either reduce the time spent per zonotope or reduce the number of zonotopes which have to be considered. In order to achieve the second objective, we can overapproximate certain splits through a methodology first presented by Singh et al. [15] and later reused by Bak [2]: The idea is to introduce an additional dimension to the zonotope and use it to overapproximate the ReLU node by a parallelogram. Overapproximation errors accumulate across layers (Bak [2] refers to this as an error snowball). To make the parallelogram as tight as possible and minimize the overapproximation error, we use the bounds computed through LP solving (instead of the looser zonotope bounds) if there were any exact splits beforehand. In an abstraction-refinement setting, we would start by propagating overapproximating zonotopes through both NNs and then check whether the equivalence property can be established. If the property does not hold, we refine one overapproximated node by splitting the zonotope and propagating the split zonotopes further through the NNs.
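The parallelogram over-approximation of a single unstable ReLU (pre-activation bounds l < 0 < u) can be sketched as follows, in the style of Singh et al. [15]; one fresh predicate dimension with range [−1, 1] is introduced (function name and layout are ours):

```python
import numpy as np

def relu_overapprox(c, V, i, l, u):
    """Over-approximate y = ReLU(x_i) on dimension i of the zonotope
    (c, V) by the parallelogram y = lam * x_i + mu + mu * e, with
    e in [-1, 1], lam = u / (u - l) and mu = -lam * l / 2."""
    lam = u / (u - l)
    mu = -lam * l / 2.0
    c2 = c.copy()
    V2 = np.hstack([V, np.zeros((len(c), 1))])  # one fresh generator
    c2[i] = lam * c2[i] + mu
    V2[i, :-1] *= lam
    V2[i, -1] = mu
    return c2, V2

# One unstable neuron with bounds [-1, 1]: lam = 0.5, mu = 0.25.
c2, V2 = relu_overapprox(np.zeros(1), np.eye(1), 0, -1.0, 1.0)
```

At the upper bound x = 1, the approximated set reaches 0.25 + 0.5 + 0.25 = 1, so it soundly covers ReLU(1) = 1.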
In Figure 1, we compare the share of nodes per layer whose pre-activation bounds contain 0 for an exact approach in comparison to the propagation of an overapproximating zonotope. Any such node can be considered a split candidate which could be used for refinement. Each refinement can then help in reducing the overapproximation error and in establishing the desired property. As can clearly be seen in the plot, the overapproximation approach produces many more split candidates than the exact approach. Not all of the split candidates encountered for the overapproximation would actually have to be refined, even in the worst case: many of them are only artifacts of previous overapproximations. We refer to these split candidates as ghost splits. Ghost splits cannot easily be distinguished from actual, necessary splits. The only guaranteed non-ghost split is the first split candidate encountered, while all later split candidates might be artifacts of overapproximation.
Thus, the simplest refinement strategy is to refine only this first node. We refer to this strategy as NNEquivF (for first); it reduces the runtime on our benchmark set to 2,489s (cf. Figure 3). However, this approach still leaves room for improvement, as we explain in Section VI.
V-C LP Approximation
With the introduction of overapproximation, we encounter an additional problem: Splitting hyperplanes no longer depend on the input variables only, but also on the dimensions introduced through the overapproximation. This raises the question of how to handle the additional dimensions in the propagated star set: Since equally increasing the dimensionality of the LP problem leads to increased solver runtimes, we instead opted to overapproximate the LP problem. Classically, for an n-dimensional zonotope with initial input dimensionality p, we observe a hyperplane cut of the following form:

Σ_{i=1}^{p} a_i α_i + Σ_{i=p+1}^{n} a_i α_i ≤ b   (1)

We can now overapproximate this inequality by computing d_min = min_{α ∈ P} Σ_{i=p+1}^{n} a_i α_i through zonotope optimization and constraining the LP problem with the following inequality:

Σ_{i=1}^{p} a_i α_i ≤ b − d_min   (2)
Since any solution of Equation 1 is also a solution of Equation 2, the second inequality is an overapproximation and can be used to reduce the number of dimensions the LP solver has to handle despite the overapproximation of the zonotope. Note that we need to take this overapproximation into account for minimization/maximization tasks: since the LP solver only optimizes over the first p dimensions, we need to add the optimization result of the overapproximating zonotope for the remaining dimensions. We refer to this version as NNEquivA (for approximate LP). Figure 3 shows that this approach reduces the runtime to 1,631s.
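A sketch of this reduction: the tail of the cut over the over-approximation dimensions is bounded once via the closed-form box minimum, after which the LP only sees the first p dimensions (assuming box bounds lb, ub on all predicate dimensions; names are ours):

```python
import numpy as np

def reduce_cut(a, b, p, lb, ub):
    """Replace the cut a @ alpha <= b (Equation 1) over n dimensions by
    a cut over the first p input dimensions only (Equation 2), using
    d_min = min of the tail term over the box bounds lb, ub."""
    tail = a[p:]
    # closed-form minimum of the tail term over the box predicate
    d_min = float(np.where(tail >= 0, tail * lb[p:], tail * ub[p:]).sum())
    return a[:p], b - d_min

a_red, b_red = reduce_cut(np.array([1.0, 2.0, -1.0]), 1.0, 1,
                          -np.ones(3), np.ones(3))
```

Because the tail term is always at least d_min, any α satisfying Equation 1 projects to a solution of Equation 2.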
VI The Branch Tree Exploration Problem
Given the introduced overapproximation of splits, it becomes necessary to define a strategy that decides which overapproximations are refined if it turns out that the property cannot be established with the current overapproximation. The problem of refinement heuristics has previously been studied for single NNs by Bak [2], who experimentally showed that a classic refinement loop, which overapproximates everything and step by step refines overapproximations starting at the beginning of the NN (i.e. NNEquivF/A), sometimes performs worse than exact analysis. While we were able to reproduce this problem for some benchmark instances, we observed an improvement for others. A promising approach is to begin propagation with an exact strategy which splits on every encountered neuron, but eventually transitions into overapproximation.
We proceed with a formal analysis of different strategies and their (dis)advantages. For this, we consider the binary trees implicitly explored by a GPE algorithm: For given NNs and input space I, the implicit tree explored by GPE consists of vertices V = S ∪ L, where S are the inner nodes of the tree representing splits and L are the leaves of the tree representing the output set data structures. The execution of an exact GPE algorithm implicitly produces a set of paths of the form (s_1, …, s_k, l) with s_i ∈ S and l ∈ L that are (for now) explored sequentially. We denote this set of paths as B. For the exact case, the number of explored paths is fixed to the number of leaves |L|. Since GPE produces a partitioning of the input space I, we can associate a part of the input space with every leaf and with every inner node. For its execution, GPE needs to descend into each leaf and execute a check function on each leaf to prove equivalence. descend refers to the operations necessary to process a star set up to its next split; check refers to the operations necessary to prove equivalence on an output star set. Since the descend function is executed once for each of the |V| − 1 edges of the tree and the check function is executed once for each of the |L| leaves, the execution time of NNEquivE is bounded by O((|V| − 1) · T_descend + |L| · T_check).
Omitting the option of reordering the inner nodes and thus producing a smaller tree, we must either reduce the number of descend and check executions or reduce T_descend and T_check to reduce solving times. In many cases, the considered property cannot only be proven on the part of the input space associated with a leaf, but there also exists some inner node with an associated part of the input space which is already sufficiently partitioned to show the property using overapproximation. For a given equivalence property, we can define a function r which returns the number of necessary steps r(b) in a given path b for the property to be verifiable on the input space part associated with element r(b) of path b. The exploration of the tree induced by the set of paths

B_min = { b[1, r(b)] | b ∈ B }

would then be sufficient to prove equivalence (b[i, j] denotes the prefix path of b from step i to step j).
NNEquivF/A manages to obtain this minimal number of paths, however at the cost of much more time spent on each path. In particular, check is executed not only for each leaf, but also for each inner node along a path. Ignoring the overapproximation costs, this produces the following lower bound for the cost of NNEquivF/A:

Ω( Σ_{b ∈ B_min} r(b) · (T_descend + T_check) )
Even when assuming the omitted overapproximation steps to be free, NNEquivF/A becomes less effective than NNEquivE if, asymptotically, Σ_{b ∈ B_min} r(b) · T_check dominates the cost of NNEquivE, i.e. if the reduction of paths in B_min is insignificant in comparison to the check time for the additional inner nodes. While there are cases where NNEquivF/A is effective, this is not guaranteed to be the case, especially for larger NNs with higher values of r(b) (which increase the number of check executions) and expensive check functions.
However, this formal framework allows us to define the (virtual) optimal run which takes the minimal amount of work for a given tree: an algorithm which has an oracle for r and always overapproximates at the right node. This approach has a runtime of

O( Σ_{b ∈ B_min} r(b) · T_descend + |B_min| · T_check ).
Since |B_min| ≤ |L| and the omitted overapproximation time tends to be smaller than the descend time, this approach can provide the optimum achievable through heuristics for r. In fact, we simulated such virtual runs using a precomputed oracle: we computed r using NNEquivA and descended only the minimum necessary number of steps for each path. In our evaluation, we refer to this approach as NNEquivO. As expected, NNEquivO produced the best results of all variants considered in our work, running only 635s on our benchmark set. This is not a practical algorithm, but it provides a lower bound for the time achievable using heuristics.
It is thus important to find a good heuristic which estimates r. Such heuristics are much more difficult to analyze theoretically because they depend heavily on the distribution of the encountered paths. Therefore, we explore two heuristics experimentally, which show that heuristics have a significant impact on the runtime. Figure 2 plots the depth at which GPE was successful in proving equivalence for each path in an ACAS Xu NN (i.e. the values of r(b)). Besides the data in grey, we plot a number of running percentiles over the depth values.
A strategy which we found to be inefficient is the use of a running maximum over the number of refinements needed by previous paths. This strategy, referred to as NNEquivM (for maximum), drastically increases the runtime to 19,191s, presumably by overestimating the number of refinements and thus increasing the number of paths considered.
Since Figure 2 suggests that there are phases in which the NN needs deeper or shallower refinement, we considered a heuristic which predicts a refinement depth equal to the depth of the previous path minus 1. This accounts for the possible phases of the depth and also ensures that the algorithm is optimistic in the sense that it always tries to reduce the number of refinement steps, which can then reduce the number of considered paths. We refer to this heuristic as NNEquivL; it reduces the runtime on the benchmark set by another 5% to 1,553s. While the methodology of overapproximation using zonotopes is the same for NNEquivA and NNEquivO/L/M, the approaches differ in the strategy deciding where the overapproximation is refined.
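The NNEquivL prediction rule itself is a one-liner; the surrounding cost simulation below is our own hypothetical model (each path pays the deeper of the predicted depth and its true depth r(b)), not the paper's implementation:

```python
def predict_l(prev_depth):
    # NNEquivL: previous path's refinement depth minus one (optimistic).
    return max(prev_depth - 1, 1)

def simulate(true_depths, predict=predict_l):
    """Total descended steps under a simple cost model: a prediction
    that is too deep wastes exact steps (NNEquivM's failure mode),
    while a too-shallow one is refined on to the true depth r(b)."""
    steps, prev = 0, 1
    for r in true_depths:
        steps += max(predict(prev), r)  # pay the deeper of the two
        prev = r
    return steps
```

For the depth sequence [3, 3, 2] the rule predicts depths 1, 2, 2 and the simulated cost is 3 + 3 + 2 = 8 steps.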
VII Experimental Evaluation
To evaluate our approach, we implemented the GPE-based equivalence verification technique in Python, using parts of a preexisting (single-NN) GPE implementation by Bak et al. [1]. We refer to our implementation as NNEquiv (available on GitHub: https://github.com/samysweb/nnequiv). Our evaluation aims at answering the following questions:

(E1) Do the proposed optimizations make the algorithm more efficient?

(E2) How does our approach compare to previous work?

(E3) How does the tightness of the equivalence constraint influence solving behavior?
VII-A Experimental Setup
The benchmark landscape for the task of equivalence verification is still very limited. Paulsen et al. [13] proposed a number of benchmark NNs consisting of pairs of NNs differing only in the bit width of the weights (32 bit vs. 16 bit). As discussed before, we see this as a restricted use case and are more interested in generic NNs with varying structures and weights. This is why we omit a comparison on these NNs, where the approach by Paulsen et al. [14] is clearly faster and more precise. Structurally differing NNs have previously been proposed by Kleine Büning et al. [10], who examined 3 NNs of differing layer depths for digit classification on an MNIST [11] data set with reduced resolution [5] (8x8 pixels).
In order to evaluate and compare the approaches, we proceeded as follows: First, we decided to look at two types of NNs: image classification on the 8x8 pixel MNIST data set and NNs used in control systems in the context of an Airborne Collision Avoidance System (ACAS Xu [8]). Then, based on the original ACAS Xu NNs 1_1 and 1_2, we constructed a total of 4 mirror NNs through retraining (ACAS_1_1retrain, ACAS_1_2retrain) and student-teacher training [7] for smaller NNs (ACAS_1_1student, ACAS_1_2student). In addition to the smallest and largest MNIST 8x8 NNs considered in previous work (MNIST_smalltop, MNIST_mediumtop), we constructed two larger MNIST models using student-teacher training (MNIST_largetop, MNIST_largertop). Moreover, we constructed a second version of MNIST_largetop for ε-equivalence verification (MNIST_largeepsilon). All NNs were trained using variants of student-teacher training in such a way that they were likely to be top-1 or ε-equivalent in some parts of the input space. More details on the properties of the 9 considered benchmark NNs are available online (an overview table of all benchmarks can be found at https://github.com/samysweb/nnequivexperiments/blob/main/benchmarks.md). The input space considered for verification is a sensitive choice, as it can have a significant and varying impact on the performance of different verification techniques. For GPE, the algorithm's performance tends to degrade with increasing input space size due to the growth in necessary splits. Therefore, for each individual benchmark, we decided to look at an input size which was hard to handle for NNEquivE. This has two reasons: First, it allows us to evaluate the ability of the optimizations presented above to decrease runtimes. Second, it permits comparing the performance of NNEquiv to the performance of MilpEquiv on instances which are difficult for our approach.
The entire experimental setup can be found online on GitHub: https://github.com/samysweb/nnequivexperiments
We used a machine with 4 AMD EPYC 7281 16-core processors (i.e. 64 cores in total) and a total of 64GB of RAM. All experiments were run with a single thread, a memory limit of 512MB (the memory limit was irrelevant in practice, as no experiment hit it), and a timeout of 3 hours. The experiments were run in parallel, up to 24 processes at once. All times given in the subsequent sections are the median of 3 runs.
VII-B Comparison of NNEquiv Versions
Figure 3 shows that the proposed optimizations help in reducing the runtime of the algorithm (note that the upper half of the y-axis has a logarithmic scale for improved visibility of the results). On the one hand, we can observe that heuristics for r can, in principle, both improve and worsen the result of the approach (as seen with NNEquivL and NNEquivM). On the other hand, we see that there is still significant room for improvement through the development of better refinement heuristics; this optimization would be supplementary to any further optimizations which could be developed.
VII-C Comparison to Previous Work
Table I: Runtimes (in seconds) of NNEquivL and MilpEquiv; TO denotes a timeout.

Benchmark           Property  NNEquivL  MilpEquiv
ACAS_1_1retrain               167.45    TO
ACAS_1_1student               84.85     TO
ACAS_1_2retrain               326.59    TO
ACAS_1_2student               109.46    320.07
MNIST_largeepsilon            35.90     19.97
MNIST_smalltop      top-1     14.39     3.51
MNIST_mediumtop     top-1     94.51     3.85
MNIST_largetop      top-1     13.02     25.85
MNIST_largertop     top-1     706.56    386.04
The comparison to MilpEquiv is shown in Table I. NNEquiv outperforms MilpEquiv on the ACAS instances, where MilpEquiv even runs into a timeout for three of the four verification tasks. In particular, this seems to be the case for larger NNs with low-dimensional inputs. The superior performance of MilpEquiv for the case of MNIST_largeepsilon seems to be caused by the LP solver in NNEquiv, which is a magnitude slower for solving optimization tasks on MNIST in comparison to ACAS Xu. As this cannot be explained by the number of constraints, we suspect this is a problem related to the larger input dimensionality of the MNIST case (64 inputs in comparison to 5 inputs for ACAS Xu). The ACAS retrain NNs have the same structure as the original ACAS NNs, allowing us to compare NNEquiv to ReluDiff. While ReluDiff was able to quickly verify equivalence for the truncated NNs, which exhibit a very small mean absolute weight difference, it was significantly slower than our approach on the retrain instances, which exhibit a considerably larger mean weight difference, and even timed out for smaller values of ε. This suggests that the applicability of ReluDiff is not only restricted to structurally similar NNs, but that its performance also heavily depends on small weight differences. Regarding question (E2), we note that our approach is applicable to a broader class of NNs than ReluDiff and solved instances where both other approaches timed out.
VII-D Influence of equivalence tightness
Concerning question (E3), we evaluated the performance of the approaches as we varied the tightness of equivalence for . Note that we did not prove equivalence for for ACAS_1_2 student, as we found this NN not to be 0.005-equivalent to the original NN. Intuitively, a proof for a tighter bound requires more work, as the approach either needs to refine more overapproximations (in the case of NNEquivL) or perform further branch-and-bound operations (in the case of MilpEquiv). We can observe this behavior in Figure 4, which plots the runtime of MilpEquiv and NNEquivL as we tighten the bound. Taking into account the log scales on both axes, we observe that NNEquivL is at least an order of magnitude faster in proving equivalence for ACAS Xu NNs for . In particular, MilpEquiv produces timeouts for 3 of the 4 considered NNs once . We therefore suspect that our approach is better at handling very tight constraints in large NNs with low-dimensional input. This could be due to the fact that GPE can use additional NN information (layer structure etc.) for its refinement decisions, which is not readily available to the branch-and-bound algorithm in the backend of MilpEquiv. For comparison, we plotted the performance of ReluDiff on NNs with truncated weights and on retrained NNs: as can be seen in Figure 4, the approach by Paulsen et al. [13] behaves similarly with respect to tightness. Additionally, we see that the approach is less efficient for retrained NNs, where equivalence for cannot be established.
VII-E Finding Counterexamples
Our technique can also be used to find counterexamples, showing that two NNs are not equivalent at a certain point. This information can be used to further train NNs after a failed equivalence proof. To this end, we compared the counterexample-finding capabilities of NNEquivL with those of MilpEquiv. To account for possibly easy instances, we considered a large number of non-equivalent input spaces for each of our benchmark NNs, which we know to be equivalent on other parts of the input space. To generate counterexamples, we randomly sampled input points and selected those with differing NN outputs. Using this technique, we produced 100 distinct non-equivalent input points for each of our 9 benchmarks. These points were used as centers of balls which represented our input spaces.
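The sampling step described above can be sketched as follows. This is a minimal illustration under our own assumptions: `nn1`/`nn2` are hypothetical stand-ins for the benchmark NNs (any callables mapping an input vector to an output vector), the input space is a box, and disagreement is tested on the top-1 class.

```python
import numpy as np

def sample_counterexamples(nn1, nn2, low, high, n_samples, rng):
    """Randomly sample points from the box [low, high] and keep those
    where the two networks disagree on the top-1 class."""
    found = []
    for _ in range(n_samples):
        x = rng.uniform(low, high)
        if np.argmax(nn1(x)) != np.argmax(nn2(x)):
            found.append(x)
    return found

# Usage sketch with two toy "networks" that disagree on part of [0, 1]:
# nn1 prefers class 1 for x[0] < 0.5, nn2 prefers class 1 for x[0] < 0.4,
# so they disagree exactly on 0.4 <= x[0] < 0.5.
rng = np.random.default_rng(0)
nn1 = lambda x: np.array([x[0], 0.5])
nn2 = lambda x: np.array([x[0], 0.4])
cexs = sample_counterexamples(nn1, nn2, np.zeros(1), np.ones(1), 1000, rng)
```

Each returned point witnesses non-equivalence of the two toy networks and could serve as the center of a ball for a subsequent verification query.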
We then evaluated NNEquivL and MilpEquiv on the same input space radii as in Section VII-B, with the objective of finding counterexamples. Since counterexample extraction is much faster than an equivalence proof, we set a timeout of 2 minutes. We found that NNEquivL was significantly faster and extracted a counterexample for 890 of the 900 considered benchmarks, while even a version of MilpEquiv without the expensive bounds computation executed in the initialization (named MilpEquivB)^7 was only able to find 315 counterexamples. The time per counterexample was also significantly lower for NNEquivL, making this approach an interesting technique for retraining NNs via counterexamples. Looking at the behavior of MilpEquiv's solver backend, it seems that the reason for our superior performance lies in the time MilpEquiv needs to find an initial feasible solution: MilpEquiv first has to resolve the integer-based node encodings, which are automatically resolved by NNEquivL through the propagation of sets. Moreover, NNEquivL has the potential to extract polytopes of non-equivalent input space subsets, which could allow for even more efficient sampling.

^7 While this optimization step improves performance for equivalence proofs, it may degrade performance for counterexample finding.
                        NNEquivL  MilpEquiv  MilpEquivB
#Solved                 890       305        315
Time (incl. TO)         3,597s    75,989s    72,515s
Time/Solved (excl. TO)  2.69s     14.91s     7.21s
VIII Conclusion and Future Work
We proposed an approach extending Geometric Path Enumeration [16] to multiple NNs. Employing this method, we presented an equivalence verification algorithm which we optimized using four techniques: zonotope propagation, zonotope overapproximation, LP approximation, and refinement heuristics. Our evaluation shows that these optimizations increase the approach's efficiency and that it can verify the equivalence of NNs which were not verifiable by MilpEquiv [10] or ReluDiff [13]. Our approach significantly outperforms the state of the art in counterexample finding, solving 890 instances compared to the 315 instances solved by MilpEquiv. In addition, we proved the coNP-completeness of the equivalence problem and presented a formal way of reasoning about refinement heuristics in the context of GPE.
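Of the four optimization techniques named above, zonotope propagation admits a compact illustration for the affine part of a layer. The sketch below shows only the standard textbook construction under our own naming, not the paper's implementation: a zonotope {c + G a : ||a||_inf <= 1} maps exactly through x -> W x + b, and its interval bounds (usable, e.g., to decide whether a ReLU is stably active or inactive) follow from the generator matrix.

```python
import numpy as np

def affine_zonotope(center, generators, W, b):
    """Propagate a zonotope {c + G a : ||a||_inf <= 1} exactly through
    an affine layer x -> W x + b: the center is mapped affinely and
    each generator (column of G) is mapped linearly."""
    return W @ center + b, W @ generators

def zonotope_bounds(center, generators):
    """Interval bounds of the zonotope: center +/- the per-dimension
    sum of absolute generator entries."""
    radius = np.sum(np.abs(generators), axis=1)
    return center - radius, center + radius

# Usage sketch: a box of radius 0.5 around (1, 0) through one affine layer.
c, G = np.array([1.0, 0.0]), 0.5 * np.eye(2)
W, b = np.array([[1.0, 1.0], [0.0, 1.0]]), np.zeros(2)
c2, G2 = affine_zonotope(c, G, W, b)
lo, hi = zonotope_bounds(c2, G2)
```

Note that the affine step is exact; overapproximation only enters when a ReLU's input interval straddles zero and the nonlinearity must be abstracted.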
In terms of efficiency, one could further explore possible refinement heuristics and consider parallelized (possibly GPU-based) implementations. Moreover, while GPE can increase the confidence in NNs, the role of numerical stability in the verification approach has to be investigated further. Furthermore, an integration of MILP constraints into GPE propagation could be explored, resulting in an algorithm in between NNEquiv and MilpEquiv. Additionally, we see a need for a larger body of equivalence benchmarks which would allow a conclusive evaluation of equivalence verification algorithms.
References
 [1] (2020) Improved Geometric Path Enumeration for Verifying ReLU Neural Networks. In International Conference on Computer Aided Verification, pp. 66–96.
 [2] (2021) nnenum: Verification of ReLU Neural Networks with Optimized Abstraction Refinement. In NASA Formal Methods, 13th International Symposium, NFM 2021, Proceedings, A. Dutle, M. M. Moscato, L. Titolo, C. A. Muñoz, and I. Perez (Eds.), Vol. 12673, pp. 19–36.
 [3] (2016) End to End Learning for Self-Driving Cars. arXiv preprint arXiv:1604.07316.
 [4] (2020) A Survey of Model Compression and Acceleration for Deep Neural Networks.
 [5] (2017) UCI Machine Learning Repository. University of California, Irvine, School of Information and Computer Sciences.
 [6] (2016) Deep Learning. MIT Press. http://www.deeplearningbook.org
 [7] (2015) Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531.
 [8] (2016) Policy Compression for Aircraft Collision Avoidance Systems. In 2016 IEEE/AIAA 35th Digital Avionics Systems Conference (DASC), pp. 1–10.
 [9] (2017) Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks. In International Conference on Computer Aided Verification, pp. 97–117.
 [10] (2020) Verifying Equivalence Properties of Neural Networks with ReLU Activation Functions. In Principles and Practice of Constraint Programming, 26th International Conference, CP 2020, Louvain-la-Neuve, Belgium, Proceedings.
 [11] (1998) Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
 [12] (2018) Verifying Properties of Binarized Deep Neural Networks. In AAAI.
 [13] (2020) ReluDiff: Differential Verification of Deep Neural Networks. In ICSE '20: 42nd International Conference on Software Engineering, Seoul, South Korea, G. Rothermel and D. Bae (Eds.), pp. 714–726.
 [14] (2020) NeuroDiff: Scalable Differential Verification of Neural Networks Using Fine-Grained Approximation. In 35th IEEE/ACM International Conference on Automated Software Engineering, ASE 2020, Melbourne, Australia, pp. 784–796.
 [15] (2018) Fast and Effective Robustness Certification. In Advances in Neural Information Processing Systems, pp. 10802–10813.
 [16] (2019) Star-Based Reachability Analysis of Deep Neural Networks. In International Symposium on Formal Methods, pp. 670–686.
 [17] (2018) Efficient Formal Safety Analysis of Neural Networks. In Advances in Neural Information Processing Systems, pp. 6367–6377.