 # Flexible graph matching and graph edit distance using answer set programming

The graph isomorphism, subgraph isomorphism, and graph edit distance problems are combinatorial problems with many applications. Heuristic exact and approximate algorithms for each of these problems have been developed for different kinds of graphs: directed, undirected, labeled, etc. However, additional work is often needed to adapt such algorithms to different classes of graphs, for example to accommodate both labels and property annotations on nodes and edges. In this paper, we propose an approach based on answer set programming. We show how each of these problems can be defined for a general class of property graphs with directed edges, and labels and key-value properties annotating both nodes and edges. We evaluate this approach on a variety of synthetic and realistic graphs, demonstrating that it is feasible as a rapid prototyping approach.


## 1 Introduction

Graphs are a pervasive and widely applicable data structure in computer science. To name just a few examples, graphs can represent symbolic knowledge structures extracted from Wikipedia, provenance records describing how a computer system executed to produce a result, or chemical structures in a scientific knowledge base. In many settings, it is of interest to solve graph matching problems, for example to determine when two graphs have the same structure, or when one graph appears in another, or to measure how similar two graphs are.

Given two graphs, possibly with labels or other data associated with nodes and edges, the graph isomorphism problem (GI) asks whether the two graphs have the same structure, that is, whether there is an invertible mapping from one graph to another that preserves and reflects edges and any other constraints. The subgraph isomorphism problem (SUB) asks whether one graph is isomorphic to a subgraph of another. Finally, the graph edit distance problem (GED) asks for the length of a shortest sequence of edit steps, such as insertions, deletions, or updates to nodes or edges, that transforms one graph into another.

These are well-studied problems. Each is in the class NP, with SUB and GED being NP-complete, while the lower bound of the complexity of GI is an open problem. Approximate and exact algorithms for graph edit distance, based on heuristics or on reduction to other NP-complete problems, have been proposed [10, 16, 20, 8]. Moreover, for special cases such as database querying, there are algorithms for subgraph isomorphism that can provide good performance in practice when matching small query subgraphs against graph databases.

However, there are circumstances in which none of the available techniques is directly suitable. For example, many of the algorithms considered so far assume graphs of a specific form, for example with unordered edges, or unlabeled nodes and edges. In contrast, many typical applications use graphs with complex structure, such as property graphs: directed multigraphs in which nodes and edges can both be labeled and annotated with sets of key-value pairs (properties). Adapting an existing algorithm to deal with each new kind of graph is nontrivial. Furthermore, some applications involve searching for isomorphisms, subgraph isomorphisms, or edit scripts subject to additional constraints [21, 7].

In this paper we advocate the use of answer set programming (ASP) to specify and solve these problems. Property graphs can be represented uniformly as sets of logic programming facts, and each of the graph matching problems we have mentioned can be specified using ASP in a uniform way. Concretely, we employ the Clingo ASP solver, but our approach relies only on standard ASP features.

For each of the problems we consider, it is clear in principle that it should be possible to encode using ASP, because ASP subsumes the NP-complete SAT problem. Our contribution is to show how to encode each of these problems directly in a way that produces immediately useful results, rather than via encoding as SAT or other problems and decoding the results. For GI and SUB, the encoding is rather direct and the ASP specifications can easily be read as declarative specifications of the respective problems; however, the standard formulation of the graph edit distance problem is not as easy to translate to a logic program because it involves searching for an edit script whose maximum length depends on the input. Instead, we consider an indirect (but still natural) approach which searches for a partial matching between the two graphs that minimizes the edit distance, and derives an edit script (if needed) from this matching. The proof of correctness of this encoding is our main technical contribution.

We provide experimental evidence of the practicality of our declarative approach, drawing on experience with a nontrivial application: generalizing and comparing provenance graphs. In this previous work, we needed to solve two problems: (1) given two graphs with the same structure but possibly different property values (e.g. timestamps), identify the general structure common to all of the graphs, and (2) given a background graph and a slightly larger foreground graph, match the background graph to the foreground graph and “subtract” it, leaving the unmatched part. We showed there that our ASP approach to approximate graph isomorphism and subgraph isomorphism can solve these problems fast enough that they were not the bottleneck in the overall system. In this paper, we conduct further experimental evaluation of our approach to graph isomorphism, subgraph isomorphism, and graph edit distance on synthetic graphs and real graphs used in a recent Graph Edit Distance Contest (GEDC) and our recent work.

## 2 Background

### Property graphs

We consider (directed) multigraphs $G = (V, E, \mathit{src}, \mathit{tgt}, \mathit{lab})$ where $V$ and $E$ are disjoint sets of node identifiers and edge identifiers, respectively, $\mathit{src}, \mathit{tgt} : E \to V$ are functions identifying the source and target of each edge, and $\mathit{lab} : V \cup E \to \Sigma$ is a function assigning each vertex and edge a label from some set $\Sigma$. Note that multigraphs can have multiple edges with the same source and target. Familiar definitions of ordinary directed or undirected graphs can be recovered by imposing further constraints, if desired.

A property graph is a directed multigraph extended with an additional partial function $\mathit{prop} : (V \cup E) \times \Gamma \rightharpoonup D$ where $\Gamma$ is a set of keys and $D$ is a set of data values. For the purposes of this paper we assume that all identifiers, labels, keys and values are represented as Prolog atoms.

We consider a partial function with range $D$ to be a total function with range $D_\bot = D \cup \{\bot\}$ where $\bot$ is a special token not appearing in $D$. We consider $D_\bot$ to be partially ordered by the least partial order $\sqsubseteq$ satisfying $\bot \sqsubseteq d$ for all $d \in D$.

### Isomorphisms

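A homomorphism from property graph $G_1 = (V_1, E_1, \mathit{src}_1, \mathit{tgt}_1, \mathit{lab}_1, \mathit{prop}_1)$ to $G_2 = (V_2, E_2, \mathit{src}_2, \mathit{tgt}_2, \mathit{lab}_2, \mathit{prop}_2)$ is a function $h : G_1 \to G_2$ mapping $V_1$ to $V_2$ and $E_1$ to $E_2$, such that:

• for all $v \in V_1$, $\mathit{lab}_2(h(v)) = \mathit{lab}_1(v)$, and

• for all $e \in E_1$, $\mathit{lab}_2(h(e)) = \mathit{lab}_1(e)$, $\mathit{src}_2(h(e)) = h(\mathit{src}_1(e))$ and $\mathit{tgt}_2(h(e)) = h(\mathit{tgt}_1(e))$, and

• for all $x \in V_1 \cup E_1$ and all keys $k$, if $\mathit{prop}_1(x,k)$ is defined then $\mathit{prop}_2(h(x),k) = \mathit{prop}_1(x,k)$.

(Essentially, $h$ is a pair of functions $(h_V : V_1 \to V_2, h_E : E_1 \to E_2)$, but we abuse notation slightly here by writing $h$ for both.) As usual, an isomorphism is an invertible homomorphism whose inverse is also a homomorphism, and $G_1$ and $G_2$ are isomorphic ($G_1 \cong G_2$) if an isomorphism between them exists. Note that the labels of nodes and edges must match exactly, that is, we regard labels as integral to nodes and edges, while properties must match only if defined in $G_1$.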
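The homomorphism and isomorphism conditions can be checked directly for a candidate mapping. The following is a minimal Python sketch, not taken from the paper: it assumes a hypothetical encoding of a property graph as a triple `(nodes, edges, props)` of dictionaries, and the function names are illustrative only.

```python
def is_homomorphism(h, g1, g2):
    """Check that dict h (ids of g1 -> ids of g2) is a property graph homomorphism.

    Assumed (hypothetical) encoding: g = (nodes, edges, props) with
    nodes = {id: label}, edges = {id: (src, tgt, label)}, props = {(id, key): value}.
    """
    (n1, e1, p1), (n2, e2, p2) = g1, g2
    # Every node maps to a node with the same label.
    if any(h.get(v) not in n2 or n2[h[v]] != lab for v, lab in n1.items()):
        return False
    # Every edge maps to an edge with the same label, source and target.
    for e, (s, t, lab) in e1.items():
        if h.get(e) not in e2 or e2[h[e]] != (h[s], h[t], lab):
            return False
    # Properties defined in g1 must have the same value on the image in g2.
    return all(p2.get((h[x], k)) == d for (x, k), d in p1.items())


def is_isomorphism(h, g1, g2):
    """h is an isomorphism if it is invertible and both directions are homomorphisms."""
    inverse = {y: x for x, y in h.items()}
    return (len(inverse) == len(h)
            and is_homomorphism(h, g1, g2)
            and is_homomorphism(inverse, g2, g1))


# Two small example graphs; g2 carries one extra property on node d.
g1 = ({"a": "person", "b": "file"},
      {"x": ("a", "b", "wrote")},
      {("a", "name"): "alice"})
g2 = ({"c": "person", "d": "file"},
      {"y": ("c", "d", "wrote")},
      {("c", "name"): "alice", ("d", "size"): "big"})
h = {"a": "c", "b": "d", "x": "y"}

print(is_homomorphism(h, g1, g2))  # the extra property in g2 is allowed
print(is_isomorphism(h, g1, g2))   # but the inverse direction fails on it
```

In this example `h` is a homomorphism but not an isomorphism, which reflects the asymmetry noted above: properties only need to match where they are defined in the source graph.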

### Subgraph isomorphism

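A subgraph $G'$ of $G$ is a property graph satisfying:

• $V' \subseteq V$ and $E' \subseteq E$

• $\mathit{src}'(e) = \mathit{src}(e) \in V'$ and $\mathit{tgt}'(e) = \mathit{tgt}(e) \in V'$ for all $e \in E'$

• $\mathit{lab}'(x) = \mathit{lab}(x)$ when $x \in V' \cup E'$

• $\mathit{prop}'(x,k) = \mathit{prop}(x,k)$ when $x \in V' \cup E'$ and $\mathit{prop}'(x,k)$ is defined

In other words, the vertex and edge sets of $G'$ are subsets of those of $G$ that still form a meaningful graph, the labels are the same as in $G$, and the properties defined in $G'$ are the same as in $G$ (but some properties in $G$ may be omitted).

We say that $G_1$ is subgraph isomorphic to $G_2$ if there is a subgraph $G_2'$ of $G_2$ to which $G_1$ is isomorphic. Equivalently, $G_1$ is subgraph isomorphic to $G_2$ if there is an injective homomorphism $h : G_1 \to G_2$. If such a homomorphism exists, then it maps $G_1$ to an isomorphic subgraph of $G_2$, whereas if $G_1$ is isomorphic to a subgraph $G_2'$ of $G_2$ then the isomorphism between $G_1$ and $G_2'$ extends to an injective homomorphism from $G_1$ to $G_2$.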
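The characterization via injective homomorphisms suggests a naive search procedure. The sketch below is a brute-force Python illustration only (exponential time, suitable for tiny graphs); it is not the paper's ASP encoding, and the triple-of-dictionaries graph encoding and function names are assumptions made for the example.

```python
from itertools import permutations


def injective_homomorphisms(g1, g2):
    """Yield every injective homomorphism from g1 to g2 as a dict (brute force).

    Assumed (hypothetical) encoding: g = (nodes, edges, props) with
    nodes = {id: label}, edges = {id: (src, tgt, label)}, props = {(id, key): value}.
    """
    (n1, e1, p1), (n2, e2, p2) = g1, g2
    vs, es = list(n1), list(e1)
    for vimg in permutations(n2, len(vs)):        # injective node maps
        h = dict(zip(vs, vimg))
        if any(n1[v] != n2[h[v]] for v in vs):    # node labels must agree
            continue
        for eimg in permutations(e2, len(es)):    # injective edge maps
            h.update(zip(es, eimg))
            edges_ok = all(e2[h[e]] == (h[s], h[t], lab)
                           for e, (s, t, lab) in e1.items())
            props_ok = all(p2.get((h[x], k)) == d for (x, k), d in p1.items())
            if edges_ok and props_ok:
                yield dict(h)


def subgraph_isomorphic(g1, g2):
    """True if some injective homomorphism from g1 to g2 exists."""
    return next(injective_homomorphisms(g1, g2), None) is not None


# A one-edge chain and a directed 3-cycle, with a single node and edge label.
chain = ({"a": "n", "b": "n"}, {"x": ("a", "b", "l")}, {})
cycle = ({"u": "n", "v": "n", "w": "n"},
         {"p": ("u", "v", "l"), "q": ("v", "w", "l"), "r": ("w", "u", "l")},
         {})

print(subgraph_isomorphic(chain, cycle))
print(subgraph_isomorphic(cycle, chain))
```

The chain embeds into the cycle once per cycle edge, while the cycle cannot embed into the chain because no injective node map exists.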

### Graph edit distance

We consider edit operations:

• insertion of a node, edge, or property

• deletion of a node, edge, or property

• in-place update of the property value on a given node or edge $x$ with a given key $k$ to a value $d$

The meaning of each of these operations is defined in Table 1, where we write $G$ for the graph before the edit and $G'$ for the updated graph. Each row of the table describes how each part of $G'$ is defined in terms of $G$. In addition, the edit operations have the following preconditions: before an insertion, the inserted node, edge, or property must not already exist; before a deletion, a deleted node must not be the source or target of any edge, and a deleted node or edge must not have any properties; before an update, the updated property must already exist on the affected node or edge. If these preconditions are not satisfied, the edit operation is not allowed on $G$.

We write $op(G)$ for the result of $op$ acting on $G$. More generally, if $ops$ is a list of operations then we write $ops(G)$ for the result of applying the operations to $G$. Given graphs $G_1, G_2$ we define the graph edit distance between $G_1$ and $G_2$ as $\mathrm{GED}(G_1,G_2) = \min\{|ops| \mid ops(G_1) = G_2\}$, that is, the shortest length of an edit script modifying $G_1$ to $G_2$.
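To make the edit operations and their preconditions concrete, here is a minimal Python sketch of applying an edit script to a property graph. The operation names (`ins_node`, `del_edge`, and so on) and the triple-of-dictionaries graph encoding are hypothetical choices for this illustration. The checks that a deleted or updated object exists, and that the endpoints of an inserted edge exist, are assumptions added here; the remaining checks follow the preconditions stated in the text.

```python
def apply_op(graph, op):
    """Return the graph after one edit operation; raise ValueError if not allowed.

    Assumed (hypothetical) encoding: graph = (nodes, edges, props) with
    nodes = {id: label}, edges = {id: (src, tgt, label)}, props = {(id, key): value}.
    """
    nodes, edges, props = (dict(part) for part in graph)  # work on copies
    kind, args = op[0], op[1:]

    def require(cond):
        if not cond:
            raise ValueError(f"precondition failed for {op}")

    def has_props(x):
        return any(y == x for (y, _) in props)

    if kind == "ins_node":
        v, lab = args
        require(v not in nodes and v not in edges)
        nodes[v] = lab
    elif kind == "ins_edge":
        e, s, t, lab = args
        require(e not in edges and e not in nodes and s in nodes and t in nodes)
        edges[e] = (s, t, lab)
    elif kind == "ins_prop":
        x, k, d = args
        require((x in nodes or x in edges) and (x, k) not in props)
        props[(x, k)] = d
    elif kind == "del_node":
        (v,) = args
        require(v in nodes and not has_props(v))
        require(all(v not in (s, t) for (s, t, _) in edges.values()))
        del nodes[v]
    elif kind == "del_edge":
        (e,) = args
        require(e in edges and not has_props(e))
        del edges[e]
    elif kind == "del_prop":
        x, k = args
        require((x, k) in props)
        del props[(x, k)]
    elif kind == "upd_prop":
        x, k, d = args
        require((x, k) in props)
        props[(x, k)] = d
    else:
        raise ValueError(f"unknown operation {op}")
    return nodes, edges, props


def apply_script(graph, ops):
    """Apply a list of edit operations in order."""
    graph = tuple(dict(part) for part in graph)
    for op in ops:
        graph = apply_op(graph, op)
    return graph


g1 = ({"a": "person", "b": "file"}, {"x": ("a", "b", "wrote")}, {("a", "name"): "alice"})
g2 = ({"a": "person", "c": "file"}, {"y": ("a", "c", "read")}, {("a", "name"): "bob"})
script = [("del_edge", "x"), ("del_node", "b"), ("upd_prop", "a", "name", "bob"),
          ("ins_node", "c", "file"), ("ins_edge", "y", "a", "c", "read")]
print(apply_script(g1, script) == g2)
```

The script above has length 5, so it gives an upper bound of 5 on the edit distance between these two example graphs; the sketch does not establish that it is minimal. Reordering it so that node `b` is deleted before edge `x` violates a precondition and is rejected.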

Computing the graph edit distance between two graphs (even without labels or properties) is an NP-complete problem. Here we consider a particular setting where the edit operations all have equal cost, but in general different weights can be assigned to different edit operations. We can consider a slight generalization as follows: given a weighting function $w$ mapping edit operations to positive rational numbers, the weighted graph edit distance between $G_1$ and $G_2$ is $\mathrm{GED}_w(G_1,G_2) = \min\{\sum_{op \in ops} w(op) \mid ops(G_1) = G_2\}$. The unweighted graph edit distance is the special case $w(op) = 1$, so this problem is also NP-complete.

### Answer set programming

We assume familiarity with general logic programming concepts (e.g. familiarity with Prolog or Datalog). To help make the paper accessible to readers not already familiar with answer set programming, we illustrate some programming techniques that differ from standard logic programming via a short example: coloring the nodes of an undirected graph with the minimum number of colors. Graph 3-coloring is a standard example of ASP, but we will adopt a slightly nonstandard approach to illustrate some key techniques we will rely on later. We will use the concrete syntax of the Clingo ASP solver, which is part of the Potassco framework [13, 12]. Examples given here and elsewhere in the paper can be run verbatim using the Clingo interactive online demo.

Figure 1 shows an example graph where edge relationships correspond to land borders between some countries. The edges are defined using an association list notation; for example e(be,(lu;nl)) abbreviates two edges e(be,lu) and e(be,nl). Listing 1 defines graph 3-coloring declaratively. The first line states that the edge relation is symmetric and the second defines the node relation to consist of all sources (and by symmetry targets) of edges. Line 3 defines a relation color/1 to hold for values 1,2,3. Lines 4–5 define when a graph is 3-colorable, by defining when a relation c/2 is a valid 3-coloring. Line 4 says that c/2 represents a (total) function from nodes to colors, i.e. for every node there is exactly one associated color. Line 5 says that for each edge, the associated colors of the source and target must be different. Here, we are using the not operator solely to illustrate its use, but we could have done without it, writing C = D instead.

Listing 1 is a complete program that can be used with Figure 1 to determine that the example graph is not 3-colorable. What if we want to find the least $k$ such that a graph is $k$-colorable? We cannot leave the number of colors undefined, since ASP requires a finite search space, but we could manually change the ‘3’ on line 3 to various values of $k$, starting with the maximum and decreasing until the minimum possible is found.

Instead, using minimization constraints, we can modify the 3-coloring program above to compute a minimal $k$-coloring (that is, find a coloring minimizing the number of colors) purely declaratively by adding the clauses shown in Listing 2. Line 1 defines the set of colors simply to be the set of node identifiers (plus the three colors we already had, but this is harmless). Line 2 associates a cost of 1 with each used color. Finally, line 3 imposes a minimization constraint: to minimize the sum of the costs of the colors. Thus, using a single Clingo specification we can automatically find the minimum number of colors needed for this (or any) undirected graph. The 4-coloring shown in Figure 1 was found this way.
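For readers who prefer an imperative reading, the same idea (one candidate color per node, then minimize the number of colors actually used) can be sketched in Python by exhaustive search. This is only an illustration of the specification, not the Clingo program. Since the edge set of Figure 1 is not reproduced here, the border graph below is a hypothetical stand-in.

```python
from itertools import product


def min_coloring(edge_list):
    """Return a proper coloring using as few colors as possible (brute force)."""
    nodes = sorted({v for edge in edge_list for v in edge})
    palette = range(len(nodes))  # one candidate color per node always suffices
    candidates = (dict(zip(nodes, cs)) for cs in product(palette, repeat=len(nodes)))
    # Keep only colorings where the two ends of every edge differ.
    proper = (c for c in candidates if all(c[x] != c[y] for x, y in edge_list))
    # Minimize the number of distinct colors in use.
    return min(proper, key=lambda c: len(set(c.values())))


# Hypothetical example graph: land borders between five countries.
borders = [("be", "lu"), ("be", "nl"), ("be", "de"), ("be", "fr"),
           ("lu", "de"), ("lu", "fr"), ("de", "fr"), ("de", "nl")]
best = min_coloring(borders)
print(len(set(best.values())), best)
```

In this stand-in graph the four countries be, lu, de and fr are pairwise adjacent, so no 3-coloring exists and the search settles on four colors.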

## 3 Specifying graph matching and edit distance

In this section we give ASP specifications defining each problem. We first consider how to represent graphs as flat collections of facts, suitable for use in a logic programming setting. We choose one among several reasonable representations: given $G = (V, E, \mathit{src}, \mathit{tgt}, \mathit{lab}, \mathit{prop})$ and given three predicate names $n, e, p$ we define the following relations:

$$\mathrm{Rel}_G(n,e,p) = \{n(v,\mathit{lab}(v)) \mid v \in V\} \cup \{e(e,\mathit{src}(e),\mathit{tgt}(e),\mathit{lab}(e)) \mid e \in E\} \cup \{p(x,k,d) \mid x \in V \cup E, \mathit{prop}(x,k) = d \neq \bot\}$$

Clearly, we can recover the original graph from this representation.
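As an illustration of this representation, the following Python sketch prints the facts for a small property graph. The dictionary encoding of the graph and the default predicate names `n1`, `e1`, `p1` are hypothetical, and the sketch assumes that all identifiers, labels, keys and values are already valid lowercase atoms (no quoting or escaping is attempted).

```python
def graph_to_facts(nodes, edges, props, n="n1", e="e1", p="p1"):
    """Render a property graph as a list of fact strings.

    Assumed (hypothetical) encoding: nodes = {id: label},
    edges = {id: (src, tgt, label)}, props = {(id, key): value}.
    Undefined properties are simply absent from props, so no fact is emitted for them.
    """
    facts = [f"{n}({v},{lab})." for v, lab in sorted(nodes.items())]
    facts += [f"{e}({x},{s},{t},{lab})." for x, (s, t, lab) in sorted(edges.items())]
    facts += [f"{p}({x},{k},{d})." for (x, k), d in sorted(props.items())]
    return facts


nodes = {"a": "person", "b": "file"}
edges = {"x": ("a", "b", "wrote")}
props = {("a", "name"): "alice", ("x", "time"): "t1"}
for fact in graph_to_facts(nodes, edges, props):
    print(fact)
```

Calling the function a second time with different predicate names (for example `n2`, `e2`, `p2`) yields the facts for a second graph, so that both can be loaded into one program without clashing.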

In the following problem specifications, we always consider two graphs, say $G_1$ and $G_2$, and to avoid confusion between them we use two sets of relation names to encode them, thus $\mathrm{Rel}_{G_1}(n_1,e_1,p_1) \cup \mathrm{Rel}_{G_2}(n_2,e_2,p_2)$ represents two graphs. We also assume without loss of generality that the sets of vertex and edge identifiers of the two graphs are all disjoint, i.e. $(V_1 \cup E_1) \cap (V_2 \cup E_2) = \emptyset$, to avoid any possibility of confusion among them.

We now show how to specify homomorphisms and isomorphisms among graphs. The Clingo code in Listing 3 defines when a graph homomorphism exists from $G_1$ to $G_2$. We refer to this program, extended with suitable representations of $G_1$ and $G_2$, as the homomorphism specification. The binary relation $h$, representing the homomorphism, is specified using two constraints. The first says that $h$ maps nodes of $G_1$ to nodes of $G_2$ with the same label, while the second additionally specifies that $h$ maps edges of $G_1$ to those of $G_2$ preserving source, target, and label. Notice in particular that the cardinality constraint ensures that $h$ represents a total function from $V_1$ to $V_2$, so in any model satisfying the first clause, every node in $G_1$ is matched to one in $G_2$, which means that the body of the second clause is satisfiable for each edge. The third clause simply constrains $h$ so that any properties of nodes or edges in $G_1$ must be present on the matching node or edge in $G_2$.

Next, to define when $h$ is a graph isomorphism, we add the symmetric clauses shown in Listing 4. We refer to the combination of Listings 3 and 4 as the isomorphism specification. Since the two listings together imply that $h$ represents a homomorphism in the forward direction and simultaneously represents a homomorphism from $G_2$ to $G_1$ in the backward direction, these four clauses suffice to specify that $h$ is an isomorphism.

To specify subgraph isomorphism, we simply require that $h$ is an injective homomorphism from $G_1$ to $G_2$, as shown in Listing 5. We refer to the specification in Listing 5 as the subgraph isomorphism specification. The two additional constraints specify that the inverse of $h$ is a partial homomorphism. This is equivalent to $h$ being an injective homomorphism.

Finally we consider the specification of the graph edit distance problem. On the surface, this seems challenging, since the graph edit distance is defined as the length of a minimal edit script mapping one graph to another, and there are infinitely many possible edit scripts. However, there is clearly always an upper bound on the edit distance: consider an edit script that just deletes $G_1$ and inserts $G_2$, and take the length of this script as the bound. So, given two graphs and this upper bound we could proceed by specifying a search space over edit scripts of bounded length, defining the meaning of each edit operator, and seeking to minimize the number of steps necessary to get from $G_1$ to $G_2$. However, this encoding seems rather heavyweight, and requires preprocessing to determine the bound.

Instead, we follow a different strategy, analogous to the approach adopted for graph coloring earlier. The strategy is based on the observation that the graph edit distance is closely related to the maximum common subgraph problem, that is, given two graphs $G_1$, $G_2$, find the largest graph that is subgraph isomorphic to both. If we identify such a graph then (as we shall show) we can read off an edit script that maps $G_1$ to $G_2$, which first deletes unmatched structure from $G_1$, then updates properties in-place, and finally inserts new structure needed in $G_2$. Furthermore, to identify the maximum common subgraph, we do not need to construct a new graph separate from $G_1$ and $G_2$; instead, we can think of the maximum common subgraph as an isomorphic pair of subgraphs of $G_1$ and $G_2$. So in other words, we will search for a partial isomorphism between $G_1$ and $G_2$, use it as a basis for extracting an edit script, and minimize its cost.

Listing 6 accomplishes this. The first four lines specify that $h$ must be a partial isomorphism, by dropping the requirement that $h$ must match all nodes/edges on one side with those of another, and dropping the hard constraint that properties must match. Lines 6–7 define when a node must be deleted or inserted. Nodes that are in $G_1$ and not matched in $G_2$ must be deleted, and conversely those that are in $G_2$ and not matched in $G_1$ must be inserted. Lines 9–10 similarly specify when edges must be inserted or deleted. Lines 12–18 define when a property is updated in-place, deleted, or inserted. If a property key is present on an object in $G_1$ and on the matching object in $G_2$ but with a different value, then the key’s value needs to be updated. If it is present in $G_1$ but not present on the matching object in $G_2$ then it is deleted. Likewise, if it is present in $G_1$ but the associated object is deleted then the property also must be deleted. Dually, properties are inserted if they are present in $G_2$ but not in $G_1$, either because the matching object does not have that property or because there is no matching object because the property is on an inserted object. Lines 20–28 specify the costs associated with each of the edit operations. We assign each operation a cost of 1. It would also be possible to assign different (integer) costs to different kinds of updates, or even to specify different costs depending on labels, keys, or values.
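The matching-based view of edit distance can also be sketched outside ASP. The Python code below is a brute-force illustration under stated assumptions, not the paper's Listing 6: it enumerates every partial injective matching between two tiny graphs, keeps those that are partial isomorphisms (matched nodes have equal labels, matched edges have equal labels and matched endpoints), charges a unit cost for each deletion, insertion and property update implied by the matching, and returns the minimum. The graph encoding and all function names are hypothetical.

```python
def partial_injections(xs, ys):
    """Yield every partial injective map from list xs to list ys as a dict."""
    if not xs:
        yield {}
        return
    x, rest = xs[0], xs[1:]
    for m in partial_injections(rest, ys):
        yield m                        # leave x unmatched
        for y in ys:
            if y not in m.values():
                yield {**m, x: y}      # match x to a still unused y


def matching_cost(h, g1, g2):
    """Unit-cost count of the edit operations implied by a partial matching h."""
    (n1, e1, p1), (n2, e2, p2) = g1, g2
    inv = {y: x for x, y in h.items()}
    # Unmatched nodes/edges of g1 are deleted; unmatched ones of g2 are inserted.
    cost = sum(x not in h for x in [*n1, *e1]) + sum(y not in inv for y in [*n2, *e2])
    for (x, k), d in p1.items():
        if x not in h or (h[x], k) not in p2:
            cost += 1                  # property deleted
        elif p2[(h[x], k)] != d:
            cost += 1                  # property updated in place
    for (y, k) in p2:
        if y not in inv or (inv[y], k) not in p1:
            cost += 1                  # property inserted
    return cost


def ged(g1, g2):
    """Minimum matching cost over all partial isomorphisms (exponential search)."""
    (n1, e1, _), (n2, e2, _) = g1, g2
    best = None
    for hv in partial_injections(list(n1), list(n2)):
        if any(n1[v] != n2[w] for v, w in hv.items()):
            continue                   # matched nodes must share a label
        for he in partial_injections(list(e1), list(e2)):
            # Matched edges must share a label and have matched endpoints.
            if all(e2[f] == (hv.get(e1[e][0]), hv.get(e1[e][1]), e1[e][2])
                   for e, f in he.items()):
                cost = matching_cost({**hv, **he}, g1, g2)
                best = cost if best is None else min(best, cost)
    return best                        # the empty matching is always valid


g1 = ({"a": "person", "b": "file"}, {"x": ("a", "b", "wrote")}, {("a", "name"): "alice"})
g2 = ({"c": "person", "d": "file"}, {"y": ("c", "d", "read")}, {("c", "name"): "bob"})
print(ged(g1, g2))
```

For these two example graphs the best matching pairs `a` with `c` and `b` with `d`; the edges cannot be matched because their labels differ, so the implied script deletes one edge, updates one property and inserts one edge.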

## 4 Correctness

We first state the intended correctness properties for the homomorphism, isomorphism, and subgraph isomorphism problems:

###### Theorem 4.1
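1. There exists a homomorphism $h : G_1 \to G_2$ if and only if the homomorphism specification for $G_1$ and $G_2$ is satisfiable.

2. There exists an isomorphism $h : G_1 \to G_2$ if and only if the isomorphism specification for $G_1$ and $G_2$ is satisfiable.

3. There exists an injective homomorphism $h : G_1 \to G_2$ witnessing a subgraph isomorphism if and only if the subgraph isomorphism specification for $G_1$ and $G_2$ is satisfiable.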

###### Proof

See Appendix A. ∎

Next we turn to graph edit distance. To assist with the reasoning, we define the following canonical form:

###### Definition 1 (Edit script canonical form)

An edit script is in canonical form if it is a concatenation of (possibly empty) sequences of operations in the following order:

• sequences of property deletions, edge deletions, and node deletions, respectively;

• a sequence of property updates;

• sequences of node insertions, edge insertions, and property insertions, respectively.

Edit scripts obtained from the graph edit distance specification are in this form. Moreover, any valid edit script can be converted to a canonical one by applying a set of rewrite rules, as shown in Figure 2. We first consider marked versions of each edit operation. A marked operation has the same effect as its unmarked counterpart when applied to a graph; the mark only indicates which operation is actively being rewritten. The idea here is that if we have a canonical edit script and wish to add a new edit operation, we mark the new operation and use the rewrite rules to canonicalize the extended script. The rules are applied in order and at each step, the first matching rule is applied. Note that there is a catch-all rule, which only applies if none of the other rules do. Essentially, the rewrite rules consider all of the possible pairs of adjacent operations that can appear in a non-canonical form, with the first element marked. In each case, they show how to simplify the edit script by either moving the marked operation closer to the end, or removing the mark. Removal can happen as a result of either cancellation of the marked operation by another operation (e.g. a delete undoing an insert), or by removing the mark once it has reached an appropriate place for it in the canonical form.
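The ordering required by the canonical form is easy to check mechanically. The short Python sketch below tests only that ordering; the operation names are hypothetical, and it does not implement the rewrite rules of Figure 2. Note that simply sorting a non-canonical script by this order would not in general produce a valid script, which is why the rewrite rules (including cancellation) are needed.

```python
# Hypothetical operation names, listed in the order required by the canonical form.
CANONICAL_ORDER = ["del_prop", "del_edge", "del_node",
                   "upd_prop",
                   "ins_node", "ins_edge", "ins_prop"]


def is_canonical(ops):
    """True if the operation kinds appear in canonical (non-decreasing) order."""
    ranks = [CANONICAL_ORDER.index(op[0]) for op in ops]
    return ranks == sorted(ranks)


script = [("del_edge", "x"), ("del_node", "b"), ("upd_prop", "a", "name", "bob"),
          ("ins_node", "c", "file"), ("ins_edge", "y", "a", "c", "read")]
print(is_canonical(script))
print(is_canonical([("ins_node", "c", "file"), ("del_node", "b")]))
```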

###### Lemma 1

If $ops$ is an edit script mapping $G_1$ to $G_2$, then there is a canonical edit script $ops'$ mapping $G_1$ to $G_2$ such that $|ops'| \le |ops|$.

###### Proof

See Appendix A. ∎

###### Theorem 4.2

The specification in Listing 6 always has a solution, and the edit script described by the insertion, deletion and update predicates is a valid, canonical script mapping $G_1$ to $G_2$. Moreover, the cost of the optimal solution to this specification equals the graph edit distance between $G_1$ and $G_2$.

###### Proof

For the first part, we observe that the empty relation $h = \emptyset$ always solves the specification if we ignore the minimization constraint. Therefore, the cost of this solution is an upper bound. Moreover, if we apply the edit operations described by the insert, delete and update relations in the order required by the canonical form, then each edit operation is valid, all structure present in $G_1$ and not $G_2$ is removed, all properties whose values differ in $G_1$ and $G_2$ are updated, and all structure present in $G_2$ and not $G_1$ is inserted. Therefore, the corresponding edit script maps $G_1$ to $G_2$.

To show that the minimum cost obtained from solving the specification coincides with the graph edit distance between $G_1$ and $G_2$, one direction is easy: for any $h$ (including the one corresponding to a minimum cost solution) the collection of edit operations resulting from $h$ is a valid edit script so its length must be greater than or equal to the minimum over all valid scripts. To show the reverse direction, we use Lemma 1. Given a minimum-length edit script that is not in canonical form, we can rewrite it to one that is canonical, with equal cost (since the original script was already minimum-length). ∎

## 5 Discussion

We have argued that using ASP offers considerable flexibility. To illustrate this claim, we consider three modifications to our approach.

### Weighted graph edit distance

If the operations have different (integer) weights, implemented using a suitable modification to the cost predicates in the specification, then the same argument as above suffices to show that a minimum-weight canonical script always exists to be found by the ASP specification. The key point is that weights are defined on individual edit operations, and the rewrite rules only permute or delete operations, so they preserve or decrease weight.

### Relabeling

We have treated labels as hard constraints: it is not possible to change the label of a node in $G_1$ to a different label in $G_2$, short of deleting the node and inserting a new one with a different label. On the other hand, properties are soft constraints in the sense that we may delete or update a property value without also being obliged to delete and re-create the underlying node or edge structure. It is natural to consider an in-place relabeling operation as well. Such behavior can be encoded on top of the already-developed framework by using a single “blank” label for nodes and edges and introducing an unused property key called “label” instead; now this can be updated in-place like other properties. Alternatively, we can accommodate this behavior more directly as shown in Listing 7.

The first four lines relax the constraint that node and edge labels have to be preserved by $h$. The next two lines define the relabel_node and relabel_edge predicates to detect when two matched nodes or edges have different labels. Finally, the node_cost and edge_cost predicates are extended to charge a cost of 1 per relabeling.

### Ad hoc constraints

The use of ASP opens up many other possibilities for controlling or constraining the various isomorphism or edit distance problems. One example which we found useful in previous work was to modify the definitions of isomorphism or subgraph isomorphism to treat properties as soft constraints and minimize the number of mismatched properties.

Another potentially interesting class of constraints is to allow “access control” constraints on the possible edit scripts, for example specifying that certain nodes or edges in one graph cannot be modified and so must be matched with equivalent constructs in the other graph. This is similar to the approximate constrained subgraph matching problem.

## 6 Evaluation

Graph matching and edit distance are widely studied problems and a thorough comparison of our approach with state-of-the-art algorithms is beyond the scope of this paper. Moreover, we do not claim that our approach is faster, only that it is easy to implement and modify, rendering it suitable for rapid prototyping situations. Nevertheless, in this section we summarize a preliminary evaluation that supports a claim that our approach is fast enough to be useful for rapid prototyping. Our experiments were run on a 2.6 GHz Intel Core i7 MacBook Pro machine with 8 GB RAM and using Clingo v5.2.0.

First, we consider the various problems on synthetic graphs, such as $k$-cycles and $k$-chains (linear sequences of edges), with only one possible node and edge label and no properties. These problems are not representative of typical real problems, but illustrate some general trends. We considered each of the problems: homomorphism (HOM), isomorphism (ISO), subgraph isomorphism (SUB), and graph edit distance (GED). We first considered comparisons where $G_1$ and $G_2$ are $k$-cycles or $k$-chains, for several values of $k$. We found the running times for each of these problems to be relatively stable independent of whether the comparison was between two $k$-chains, a $k$-chain with a $k$-cycle, or two $k$-cycles, so we have averaged across all four scenarios. We also considered randomly generated graphs with $n$ nodes and each edge generated with probability 0.1, for several values of $n$. We attempted each problem with a running time limit of 30 seconds; the results are shown in Figure 3. Unsurprisingly, the HOM instances are solved fastest, and GED slowest.

Figure 3: Synthetic results: (a) chains and cycles (b) randomly generated graphs

Second, we consider some real graphs from the Mutagenesis dataset (MUTA), a standard dataset used for evaluating graph edit distance algorithms, for example in a recent graph edit distance competition (GEDC). In the contest, eight algorithms were run on different problems for up to 30 seconds, and compared in terms of time, accuracy (for approximate algorithms), and success rate (for exact algorithms). We modified the GED specification to allow node and edge relabeling and to use the same weight function as in the second (and more challenging) configuration used in the contest, for which even the best algorithm (called F2) was not able to deal with graphs of size larger than 30. We consider three datasets MUTA-10, MUTA-20 and MUTA-30, each consisting of ten chemical structure graphs of size 10, 20 or 30 respectively. We also consider a dataset MUTA-MIXED which consists of ten graphs of varying sizes. We considered all unordered pairs of the graphs in each subset and attempted to find the GED with a timeout of 30s. Table 2 shows the results compared with the four exact algorithms reported in the contest. The first two algorithms, F2 and F24threads, are implementations of a binary linear programming encoding of graph edit distance, the first being the plain single-threaded algorithm using a linear programming solver and the second a variant running with four threads. The other two, DF and PDFS, are sequential and parallel implementations of a depth-first, branch-and-bound algorithm [3, 2].

Table 2 illustrates that our approach is competitive with DF and slightly worse than PDFS, but does not match the performance of the two F2 algorithms. These results should be taken with a grain of salt, since we have not replicated the GEDC results on our (slightly faster) hardware. Memory did not appear to be a bottleneck for our approach.

We have implemented and used variations of the isomorphism and subgraph isomorphism specifications for property graphs in a provenance graph analysis system called ProvMark. In this earlier work, we found that for graphs of up to around 100 nodes and edges, and a few hundred properties, these problems are usually solvable within a few seconds. However, these problems may not be representative of other scenarios.

The specifications we used to define approximate subgraph isomorphism problems in ProvMark are similar to those presented here, but we subsequently experimented with several different approaches with different performance. Here, we compare the performance of ProvMark on subgraph isomorphism problems over two representative example graphs considered in our previous experiments: the graph generalization and comparison problems resulting from benchmarking the creat and execve system calls using the CamFlow provenance recording system. See the ProvMark work for further details and the Clingo code of the previous approaches.

Table 3 shows the running times of the old and new versions of approximate subgraph isomorphism. The code for both specifications is in Appendix C. The problem sizes (that is, the numbers of nodes, edges, and properties of the two graphs) are shown under “Size”. The “Old Time” column corresponds to the time obtained using the old approach and “New Time” shows the time obtained using the code in Listing 5 modified to allow approximate property matching. The “Speedup” column shows the ratio of the old time to the new time. In most cases, the speedup is around a factor of two. As future work, we plan to use graph edit distance with the results of the ProvMark system, for example for clustering or regression testing across runs.

## 7 Related Work

The lower bound of the complexity of graph isomorphism is a well-known open problem, but subgraph isomorphism and graph edit distance are NP-complete. A number of practical algorithms for graph isomorphism have been studied, however, including NAUTY, which has also been integrated with Prolog. However, most such algorithms consider graphs with vertex labels but not edge labels or properties, so are not directly applicable to property graph isomorphism. Subgraph isomorphism has been studied extensively over the past years; one survey summarizes the state-of-the-art algorithms for solving partial or simplified versions of the problem. Subgraph isomorphism is also studied for graph databases, where the query subgraph is usually small but the other graph may be very large. Lee et al. evaluated five such algorithms on query graphs of up to 24 edges and databases of up to tens of thousands of nodes and edges. Approximate subgraph matching with constraints has also been studied, particularly in biomedical settings, and it would be interesting to investigate whether our approach is competitive with their CSP-based algorithm. Graph edit distance has also been studied extensively, with much attention on approximate algorithms that can provide results quickly.

While several approaches to graph matching and edit distance have been based on expressing these problems as constraint satisfaction problems, satisfiability, or linear programming problems, to the best of our knowledge there is no previous work based on answer set programming. Moreover, our approach easily accommodates richer graph structure such as hard or soft label constraints, properties, and multiple edges between pairs of nodes, whereas the algorithms we have seen generally consider ordinary graphs (without properties and with at most one edge between two nodes).

## 8 Conclusions

The graph edit distance problem is a widely studied problem that has many applications. Exact solutions to it, and to related problems such as graph isomorphism and subgraph isomorphism, are challenging to compute efficiently due to their NP-completeness or unresolved complexity (in the case of graph isomorphism). There are a number of proposed algorithms in the literature, with one of the most effective based on a reduction to binary linear programming. In this paper, we investigated an alternative approach using answer set programming (ASP), specifically the Clingo solver. This approach may not be competitive with the best known techniques in terms of performance, but has the potential advantage that it is straightforward to modify the problem specification to accommodate different kinds of graphs, cost metrics or other variations, or to accommodate ad hoc constraints that can also be expressed using ASP. Our approach has already proved useful for a real application, and our experimental evaluation suggests that it is also competitive with two out of four exact algorithms from a graph edit distance competition.

Our work may be valuable to others interested in rapid prototyping of graph matching or edit distance problems using declarative programming. Additional work could be done to facilitate this, for example using Clingo’s Python wrapper library. Graph matching and edit distance problems may also be an interesting class of challenge problems for developers of ASP solvers.

## Acknowledgments

Effort sponsored by the Air Force Office of Scientific Research, Air Force Material Command, USAF, under grant number FA8655-13-1-3006. The U.S. Government and University of Edinburgh are authorised to reproduce and distribute reprints for their purposes notwithstanding any copyright notation thereon. Cheney was also supported by ERC Consolidator Grant Skye (grant number 682315). This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under contract FA8650-15-C-7557.

## References

•  Zeina Abu-Aisheh, Benoit Gaüzère, Sébastien Bougleux, Jean-Yves Ramel, Luc Brun, Romain Raveaux, Pierre Héroux, and Sébastien Adam. Graph edit distance contest: Results and future challenges. Pattern Recognition Letters, 100:96–103, 2017.
•  Zeina Abu-Aisheh, Romain Raveaux, Jean-Yves Ramel, and Patrick Martineau. An exact graph edit distance algorithm for solving pattern recognition problems. In Proceedings of the International Conference on Pattern Recognition Applications and Methods (ICPRAM 2015), pages 271–278, 2015.
•  Zeina Abu-Aisheh, Romain Raveaux, Jean-Yves Ramel, and Patrick Martineau. A parallel graph edit distance algorithm. Expert Syst. Appl., 94:41–57, 2018.
•  Vikraman Arvind and Jacobo Torán. Isomorphism testing: Perspectives and open problems. Bulletin of the EATCS, 86:66–84, 2005.
•  Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary G. Ives. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference on The Semantic Web and 2nd Asian Semantic Web Conference (ISWC 2007 + ASWC 2007), pages 722–735, 2007.
•  Horst Bunke. On a relation between graph edit distance and maximum common subgraph. Pattern Recognition Letters, 18(8):689–694, 1997.
•  Sheung Chi Chan, James Cheney, Pramod Bhatotia, Thomas Pasquier, Ashish Gehani, Hassaan Irshad, Lucian Carata, and Margo Seltzer. ProvMark: A provenance expressiveness benchmarking system. In Proceedings of the 20th International Middleware Conference (Middleware ’19), pages 268–279. ACM, 2019.
•  Xiaoyang Chen, Hongwei Huo, Jun Huan, and Jeffrey Scott Vitter. An efficient algorithm for graph edit distance computation. Knowl.-Based Syst., 163:762–775, 2019.
•  Michael Frank and Michael Codish. Logic programming with graph automorphism: Integrating nauty with prolog (tool description). TPLP, 16(5-6):688–702, 2016.
•  Xinbo Gao, Bing Xiao, Dacheng Tao, and Xuelong Li. A survey of graph edit distance. Pattern Analysis and Applications, 13(1):113–129, 2010.
•  Michael R. Garey and David S. Johnson. Computers and intractability: a guide to the theory of NP-completeness. W. H. Freeman, 1979.
•  Martin Gebser, Roland Kaminski, Benjamin Kaufmann, Patrick Lühne, Philipp Obermeier, Max Ostrowski, Javier Romero, Torsten Schaub, Sebastian Schellhorn, and Philipp Wanko. The Potsdam answer set solving collection 5.0. KI-Künstliche Intelligenz, 32(2-3):181–182, 2018.
•  Martin Gebser, Benjamin Kaufmann, Roland Kaminski, Max Ostrowski, Torsten Schaub, and Marius Thomas Schneider. Potassco: The Potsdam answer set solving collection. AI Commun., 24(2):107–124, 2011.
•  Jeron Kazius, Ross McGuire, and Roberta Bursi. Derivation and validation of toxicophores for mutagenicity prediction. J. Med. Chem., 48(1):312–320, 2005.
•  Jinsoo Lee, Wook-Shin Han, Romans Kasperovics, and Jeong-Hoon Lee. An in-depth comparison of subgraph isomorphism algorithms in graph databases. PVLDB, 6(2):133–144, 2012.
•  Julien Lerouge, Zeina Abu-Aisheh, Romain Raveaux, Pierre Héroux, and Sébastien Adam. New binary linear programming formulation to compute the graph edit distance. Pattern Recognition, 72:254–265, 2017.
•  B. D. McKay. Practical graph isomorphism. Congressus Numerantium, 30:45–87, 1981.
•  Brendan D. McKay and Adolfo Piperno. Practical graph isomorphism, II. Journal of Symbolic Computation, 60:94–112, 2014.
•  Thomas Pasquier, Xueyuan Han, Mark Goldstein, Thomas Moyer, David M. Eyers, Margo Seltzer, and Jean Bacon. Practical whole-system provenance capture. In Proceedings of the 2017 Symposium on Cloud Computing (SoCC 2017), pages 405–418, 2017.
•  Kaspar Riesen. Structural Pattern Recognition with Graph Edit Distance - Approximation Algorithms and Applications. Springer, 2015.
•  Stéphane Zampelli, Yves Deville, and Pierre Dupont. Approximate constrained subgraph matching. In Proceedings of the 11th International Conference on Principles and Practice of Constraint Programming (CP 2005), pages 832–836, 2005.

## Appendix A Proofs

###### Proof (of Theorem 4.1)

We outline the proof of part 1. We slightly abuse notation by considering both as a function and a binary relation (i.e. the graph of the function). If is a homomorphism, then for any node , holds for exactly one whose label must match that of , so line 1 of the homomorphism specification holds. Likewise, for any edge where and and , then where and and . Therefore there is exactly one such that holds and . Finally, by similar reasoning, because preserves property values, line 3 holds. Conversely, if the specification holds of , then line 1 implies that is a total, label-preserving function on node identifiers, and line 2 implies that is a total, source, target, and label-preserving function on edge identifiers. Finally, again line 3 implies that preserves property keys and values.

The proofs of part 2 and 3 are straightforward: for part 2 it suffices to observe that is a homomorphism from to (read in the forward direction) and from to (in the backward direction) if and only if it is an isomorphism. For part 3, by part 1 the solution must be a graph homomorphism and the additional constraint ensures that it is injective, witnessing subgraph isomorphism.
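The three conditions used in the proof can be made concrete with a small Python sketch. This is not the ASP specification from the paper: the `PropertyGraph` class, the dictionary encoding, and all function names are assumptions introduced for illustration, and node and edge identifiers are assumed to be disjoint so that a single dictionary can hold the whole mapping.

```python
from dataclasses import dataclass, field


@dataclass
class PropertyGraph:
    """Hypothetical encoding of a property graph (not the paper's ASP facts)."""
    nodes: dict                                  # node id -> label
    edges: dict                                  # edge id -> (source, target, label)
    props: dict = field(default_factory=dict)    # (node or edge id, key) -> value


def is_homomorphism(h, g1, g2):
    """Check that the dict h is a homomorphism from g1 to g2."""
    # Condition 1: h is total on nodes and preserves node labels.
    for n, label in g1.nodes.items():
        if h.get(n) not in g2.nodes or g2.nodes[h[n]] != label:
            return False
    # Condition 2: h is total on edges and preserves source, target and label.
    for e, (src, tgt, label) in g1.edges.items():
        if h.get(e) not in g2.edges:
            return False
        if g2.edges[h[e]] != (h.get(src), h.get(tgt), label):
            return False
    # Condition 3: every property of g1 appears with the same value in g2.
    for (x, key), value in g1.props.items():
        image = (h.get(x), key)
        if image not in g2.props or g2.props[image] != value:
            return False
    return True


def is_subgraph_embedding(h, g1, g2):
    """Part 3: an injective homomorphism witnesses subgraph isomorphism."""
    return len(set(h.values())) == len(h) and is_homomorphism(h, g1, g2)


def is_isomorphism(h, g1, g2):
    """Part 2: h is a homomorphism forwards and its inverse is one backwards."""
    if len(set(h.values())) != len(h):
        return False
    inverse = {y: x for x, y in h.items()}
    return is_homomorphism(h, g1, g2) and is_homomorphism(inverse, g2, g1)
```

The checker only validates a given candidate mapping. Finding one is the search problem that the paper delegates to the ASP solver.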

We prove Lemma 1 as the third part of the following lemma:

###### Lemma 2
1. If maps to and then maps to and .

2. If is a canonical edit script mapping to and is an edit operation mapping to then rewrites to a canonical edit script mapping to with .

3. If is an edit script mapping to , then there is a canonical edit script mapping to such that .

###### Proof
1. The proof is straightforward for each rule; in most cases, the two operations commute. The interesting cases are:

• : If the updated property is immediately deleted, it has the same effect as just deleting.

• : If the inserted property is immediately updated, it has the same effect as just inserting the updated value.

• , , : If a node, edge, or property is inserted and immediately deleted, the two operations cancel out.

2. Let be a canonical edit script mapping to , and a marked edit operation mapping to . Given an edit script with at most one marked operation, define the *-length of an edit script to be 0 if it contains no marked operation and if it is of the form . That is, the *-length is the length of the marked suffix of the edit script, or 0 if there is no mark. All of the rules in Fig. 2 decrease the *-length of the edit script, as well as preserving or decreasing the length, so we can rewrite to a normal form. Moreover, clearly the rewrite rules preserve the order of the operations aside from the marked one, and in the cases where the mark is removed, the edit operation is in a position that is allowed in a canonical edit script (because all of the cases where a marked operation violates canonical form are covered by other rules). Thus, after rewriting to a normal form, is a canonical edit script.

3. We proceed by induction on the length of . If it is empty, there is nothing to prove. Otherwise, suppose it is of the form . By induction, there must exist equivalent to with . Consider the marked edit script . By part 2, this normalizes to an unmarked edit script that is equivalent to with .
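The three "interesting cases" from part 1 of the proof can be illustrated with a short Python sketch. It covers only those cancellation and merging cases for adjacent operations, not the full rewrite system of Fig. 2 (in particular, it never commutes operations), and the tuple encoding of edit operations and the function names are assumptions made for this example.

```python
def merge_adjacent(first, second):
    """Return the replacement for an adjacent pair of operations, or None.

    Hypothetical encoding of operations:
      ("ins_node", n, label), ("del_node", n),
      ("ins_edge", e, src, tgt, label), ("del_edge", e),
      ("ins_prop", x, key, value), ("upd_prop", x, key, value),
      ("del_prop", x, key)
    """
    kind1, kind2 = first[0], second[0]
    same_prop = first[1:3] == second[1:3]
    # Update followed by deletion of the same property: just delete.
    if kind1 == "upd_prop" and kind2 == "del_prop" and same_prop:
        return [second]
    # Insertion followed by update: insert the updated value directly.
    if kind1 == "ins_prop" and kind2 == "upd_prop" and same_prop:
        return [("ins_prop", first[1], first[2], second[3])]
    # Insertion followed by deletion of the same item: both cancel out.
    if kind1 == "ins_prop" and kind2 == "del_prop" and same_prop:
        return []
    cancelling = {("ins_node", "del_node"), ("ins_edge", "del_edge")}
    if (kind1, kind2) in cancelling and first[1] == second[1]:
        return []
    return None


def simplify(script):
    """Apply merge_adjacent until no adjacent pair can be simplified."""
    script = list(script)
    changed = True
    while changed:
        changed = False
        for i in range(len(script) - 1):
            merged = merge_adjacent(script[i], script[i + 1])
            if merged is not None:
                script[i:i + 2] = merged   # replaces two operations by at most one
                changed = True
                break
    return script
```

Every rewrite replaces two operations by at most one, so the length of the script never increases, which mirrors the length bound used in the lemma.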

## Appendix B Graph edit distance contest problem

Listing 8 shows the specification of the weighted edit distance problem corresponding to the second set of parameter settings in the Graph Edit Distance Contest . The first six lines define the costs as constants. Next the usual specification that is a partial isomorphism (ignoring labels) is given. Finally we specify the costs associated with node and edge relabeling, insertion, and deletion. (It turns out to improve performance slightly to omit the intermediate insertion, deletion and relabeling predicates shown in Listings 6 and 7.)
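To illustrate how the six cost constants combine, the following Python sketch computes the cost induced by one given partial mapping between two labeled graphs. It is not Listing 8: the cost values are placeholders rather than the contest's parameter settings, the dictionary encoding and names are assumptions, and the mapping is assumed to be a valid injective partial mapping. The edit distance itself is the minimum of this cost over all such mappings, which is the part the ASP solver optimizes.

```python
# Placeholder costs; the actual contest parameter values are not reproduced here.
COSTS = {
    "node_sub": 2, "node_ins": 4, "node_del": 4,
    "edge_sub": 1, "edge_ins": 2, "edge_del": 2,
}


def mapping_cost(node_map, edge_map, g1, g2, costs=COSTS):
    """Cost of the edit script induced by a partial mapping from g1 to g2.

    Each graph is a pair (nodes, edges) with
      nodes: node id -> label
      edges: edge id -> (source, target, label)
    """
    nodes1, edges1 = g1
    nodes2, edges2 = g2
    total = 0
    # Unmapped nodes of g1 are deleted; mapped nodes are relabeled if needed.
    for n, label in nodes1.items():
        if n not in node_map:
            total += costs["node_del"]
        elif nodes2[node_map[n]] != label:
            total += costs["node_sub"]
    # Nodes of g2 outside the image of the mapping are inserted.
    total += costs["node_ins"] * len(set(nodes2) - set(node_map.values()))
    # The same three cases apply to edges.
    for e, (_, _, label) in edges1.items():
        if e not in edge_map:
            total += costs["edge_del"]
        elif edges2[edge_map[e]][2] != label:
            total += costs["edge_sub"]
    total += costs["edge_ins"] * len(set(edges2) - set(edge_map.values()))
    return total
```

For example, mapping a two-node graph into a three-node graph pays for one node insertion plus any relabelings of the matched nodes and edges.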

## Appendix C Approximate subgraph isomorphism

Two specifications defining approximate subgraph isomorphism are shown in Listings 9 and 10. The code in Listing 9 is the same as that given by Chan et al. . The code in Listing 10 is a variant of Listing 5 that removes the constraint that properties match exactly and instead associates a cost to each property of not matched in , which must be minimized. This is the same approach as followed in Listing 9; the differences are in how the label preservation and edge preservation constraints are defined. Chan et al.  encoded these constraints using several (possibly redundant) clauses, whereas in our version, the functionality, label preservation, and edge preservation are captured by just four constraints.
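The idea of replacing the exact property constraint by a cost to be minimized can be sketched in Python as a brute-force search over tiny graphs. This is not Listing 9 or Listing 10: it assumes a unit cost per unmatched property, properties on nodes only, and edges encoded as a set of `(source, target, label)` triples, and all names are hypothetical.

```python
from itertools import permutations


def unmatched_properties(h, props1, props2):
    """Count properties of the first graph that have no equal counterpart under h."""
    return sum(
        1
        for (x, key), value in props1.items()
        if (h[x], key) not in props2 or props2[(h[x], key)] != value
    )


def best_approximate_match(nodes1, edges1, props1, nodes2, edges2, props2):
    """Return (cost, mapping) minimizing unmatched properties, or None.

    nodes: node id -> label; edges: set of (source, target, label) triples;
    props: (node id, key) -> value. Labels and edges must match exactly.
    """
    best = None
    ids1 = list(nodes1)
    # Enumerate injective assignments of nodes of graph 1 to nodes of graph 2.
    for image in permutations(nodes2, len(ids1)):
        h = dict(zip(ids1, image))
        if any(nodes1[n] != nodes2[h[n]] for n in ids1):
            continue  # a node label is not preserved
        if any((h[s], h[t], label) not in edges2 for (s, t, label) in edges1):
            continue  # an edge is not preserved
        cost = unmatched_properties(h, props1, props2)
        if best is None or cost < best[0]:
            best = (cost, h)
    return best
```

Exhaustive enumeration is exponential in the size of the first graph, so this only serves to show what is being minimized; the ASP encodings leave the search to the solver.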