 # Computing Optimal Assignments in Linear Time for Graph Matching

Finding an optimal assignment between two sets of objects is a fundamental problem arising in many applications, including the matching of ‘bag-of-words’ representations in natural language processing and computer vision. Solving the assignment problem typically requires cubic time and its pairwise computation is expensive on large datasets. In this paper, we develop an algorithm which can find an optimal assignment in linear time when the cost function between objects is represented by a tree distance. We employ the method to approximate the edit distance between two graphs by matching their vertices in linear time. To this end, we propose two tree distances, the first of which reflects discrete and structural differences between vertices, and the second of which can be used to compare continuous labels. We verify the effectiveness and efficiency of our methods using synthetic and real-world datasets.

## 1 Introduction

Vast amounts of data are now available for machine learning, including text documents, images, graphs and many more. Learning from such data typically involves computing a similarity or distance function between the data objects. Since many of these datasets are large, the efficiency of the comparison methods is critical. This is particularly challenging since taking the structure adequately into account is often a hard problem. For example, no polynomial-time algorithm is known even for the basic task of deciding whether two graphs have the same structure (Johnson, 2005). Therefore, the pragmatic approach of describing such data with a ‘bag-of-words’ or ‘bag-of-features’ is commonly used. In this representation, a series of objects are identified in the data and each object is described by a label or feature. The labels are placed in a bag where the order in which they appear does not matter.

In the most basic form, such bags can be represented by histograms or feature vectors, two of which are compared by counting the number of co-occurrences of a label in the bags. This is a common approach not only for images and text, but also for graph comparison, where a number of graph kernels have been proposed which use different substructures as elements of the bag (Vishwanathan et al., 2010). In the more general case, two bags of features are compared by summing over all pairs of features weighted by a similarity function between the features. However, in both cases, the method is not ideal, since each feature corresponds to a specific element in the data object, and so can correspond to no more than one element in a second data object. The co-occurrence counting method allows each feature to match to multiple features in the other dataset.

Another approach is to explicitly compute the best assignment between the features of two bags. The pyramid match kernel was proposed to approximate correspondences between bags of features in $\mathbb{R}^d$ by employing a space-partitioning tree structure and counting how often points fall into the same bin (Grauman & Darrell, 2007a, b). For graphs, the optimal assignment kernel was proposed, which establishes a correspondence between the vertices of two graphs (Fröhlich et al., 2005). However, solving the assignment problem in general has running time $O(n^3)$, where $n$ is the size of the bags, and can be slow on large datasets. Moreover, it was soon realised that this approach does not lead to valid kernels in the general case (Vert, 2008). Therefore, Johansson & Dubhashi (2015) derived kernels from optimal assignments by first sampling a fixed set of so-called landmarks and representing graphs by their optimal assignment similarities to landmarks. Kriege et al. (2016) demonstrated that a specific choice of a weight function (derived from a hierarchy) does in fact generate a valid kernel for the optimal assignment method and allows computation in linear time. That approach, however, is not designed to actually construct the assignment.

In this paper, we show how an assignment of minimum cost can be computed in linear time when the costs of assigning individual features are determined by a tree distance. Our method has applications beyond computing the optimal assignment between fixed sets and can be used for graph matching. The graph edit distance is one of the most widely accepted approaches to graph matching with applications ranging from cheminformatics to computer vision (Stauffer et al., 2017). It is defined as the minimum cost of edit operations required to transform one graph into another graph. The concept was proposed for pattern recognition tasks more than 30 years ago (Sanfeliu & Fu, 1983). However, its computation is NP-hard, since it generalises the classical NP-complete maximum common subgraph problem (Bunke, 1997; Garey & Johnson, 1979). Recent binary linear programming formulations in combination with highly-optimised general purpose solvers are the most efficient exact approaches, but are still limited to small graphs (Lerouge et al., 2017). Even a restricted special case of the graph edit distance is APX-hard, i.e., there is a constant $c > 1$, such that no polynomial-time algorithm can approximate it within the factor $c$, unless P = NP (Lin, 1994). However, heuristics based on bipartite matching or the assignment problem turned out to be effective tools (Riesen & Bunke, 2009) and are widely used in practice (Stauffer et al., 2017). The approach requires cubic running time, which is still not feasible for large graphs. Therefore, it has been proposed to use non-exact algorithms for solving the assignment problem. Simple greedy algorithms reduce the running time to $O(n^2)$ (Riesen et al., 2015a, b). For large graphs the quadratic running time is still problematic in practice. Moreover, many applications require a large number of distance computations, e.g., for solving classification tasks or performing similarity search in graph databases (Zeng et al., 2009).

#### Our contribution.

We consider the optimal assignment problem under a cost function that is a tree distance. Using general costs, the input to the assignment problem is encoded by a matrix of quadratic size. The tree distance, however, can be represented compactly by a weighted tree of linear size. For this case, we propose an exact linear time algorithm for constructing an optimal assignment from the tree. We show how to embed the optimal assignment costs in an $\ell_1$ space without distortion. On this basis, we develop an algorithm for approximating the graph edit distance in linear time. To this end, we propose techniques for generating trees representing cost metrics for approximating the graph edit distance: (i) based on Weisfeiler-Lehman refinement to quantify discrete and structural differences, (ii) based on hierarchical clustering for continuous vertex attributes.

We show experimentally that our linear time algorithm scales to large graphs and datasets. Our approach outperforms both exact and approximate methods for computing the graph edit distance in terms of running time and provides state-of-the-art classification accuracy. For some datasets with discrete labels our method even beats these approaches in terms of accuracy.

## 2 Fundamentals

We summarise basic concepts and results on tree distances, the assignment problem and the graph edit distance.

### 2.1 Tree metrics and ultrametrics

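A dissimilarity function $d \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a metric on $\mathcal{X}$, if it is (i) non-negative, (ii) symmetric, (iii) zero iff two objects are equal and (iv) satisfies the triangle inequality. A metric $d$ on $\mathcal{X}$ is an ultrametric if $d(x, y) \le \max\{d(x, z), d(z, y)\}$ for all $x, y, z \in \mathcal{X}$, and is a tree metric if $d(v, w) + d(x, y) \le \max\{d(v, x) + d(w, y),\, d(v, y) + d(w, x)\}$ for all $v, w, x, y \in \mathcal{X}$.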

The relation between restricted classes of distances and path lengths in trees has been investigated in detail, e.g., in phylogenetics (Semple & Steel, 2003). A weighted tree $T$ with positive real-valued edge weights $w \colon E(T) \to \mathbb{R}_{>0}$ represents the distance $d \colon V(T) \times V(T) \to \mathbb{R}_{\ge 0}$ defined as $d(u, v) = \sum_{e \in P(u, v)} w(e)$, where $P(u, v)$ denotes the unique path from $u$ to $v$, for all $u, v \in V(T)$. For every ultrametric $d$ on $\mathcal{X}$ there is a rooted tree $T$ with leaves $\mathcal{X}$ and positive real-valued edge weights, such that (i) $d$ is the path length between leaves in $T$, (ii) all paths from any leaf to the root have equal length. For every tree metric $d$ on $\mathcal{X}$ there is a tree $T$ with $\mathcal{X} \subseteq V(T)$ and positive real-valued edge weights, such that $d$ corresponds to the path lengths in $T$. Note that an ultrametric always is a tree metric. For clarity of notation we distinguish between the elements of $\mathcal{X}$ and the nodes of a tree $T$ by introducing a bijective map $\varphi \colon \mathcal{X} \to V(T)$. We will refer to both a label and the associated node by the same letter, with the meaning clear from the context. We consider the distance defined as $d(x, y) = \sum_{e \in P(\varphi(x), \varphi(y))} w(e)$.
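As a small illustration of these definitions, the following sketch computes path lengths in a toy weighted tree and checks the ultrametric and four-point conditions by brute force. The tree, the adjacency encoding and the function names (`tree_distance`, `is_tree_metric`, `is_ultrametric`) are assumptions made for this example only.

```python
from itertools import product


def tree_distance(adj, u, v):
    """Length of the unique u-v path in a weighted tree {node: {neighbour: weight}}."""
    stack = [(u, None, 0.0)]
    while stack:
        node, prev, dist = stack.pop()
        if node == v:
            return dist
        for nbr, w in adj[node].items():
            if nbr != prev:
                stack.append((nbr, node, dist + w))
    raise ValueError("nodes are not connected")


def is_tree_metric(points, d, eps=1e-9):
    """Four-point condition, checked over all quadruples (brute force)."""
    return all(
        d(v, w) + d(x, y) <= max(d(v, x) + d(w, y), d(v, y) + d(w, x)) + eps
        for v, w, x, y in product(points, repeat=4)
    )


def is_ultrametric(points, d, eps=1e-9):
    """Strong triangle inequality, checked over all triples (brute force)."""
    return all(
        d(x, y) <= max(d(x, z), d(z, y)) + eps
        for x, y, z in product(points, repeat=3)
    )


# Toy tree: root r with children p and c; p has the leaves a and b.
# All leaves have the same distance 2 to the root.
edges = [("r", "p", 1.0), ("p", "a", 1.0), ("p", "b", 1.0), ("r", "c", 2.0)]
adj = {}
for u, v, w in edges:
    adj.setdefault(u, {})[v] = w
    adj.setdefault(v, {})[u] = w


def d(x, y):
    return tree_distance(adj, x, y)
```

On this toy tree the path lengths form a tree metric on all nodes, but they are an ultrametric only when restricted to the leaves `a`, `b` and `c`, which matches the statement that ultrametrics correspond to rooted trees with equidistant leaves.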

### 2.2 The assignment problem

The assignment problem is a well-studied classical combinatorial problem (Burkard et al., 2012). Given a triple $(A, B, c)$, where $A$ and $B$ are sets of distinct objects with $|A| = |B| = n$ and $c \colon A \times B \to \mathbb{R}$ a cost function, the problem asks for a one-to-one correspondence $M$ between $A$ and $B$ with minimum costs. The cost of $M$ is $c(M) = \sum_{(a, b) \in M} c(a, b)$. Assuming an arbitrary, but fixed ordering of the elements of $A$ and $B$, an assignment instance can also be given by a cost matrix $C \in \mathbb{R}^{n \times n}$, where $c_{i,j} = c(a_i, b_j)$ and $a_i$ is the $i$th element of $A$ and $b_j$ is the $j$th element of $B$. Note that in this case the input is of size $\Theta(n^2)$, whereas otherwise the input size depends on the representation of the cost function $c$. The assignment problem is equivalent to finding a minimum weight perfect matching in a complete bipartite graph on the two sets $A$ and $B$ with edge weights according to $c$. Unless $C$ is sparse or contains only integral values from a bounded interval, the best known algorithms require cubic time, which is achieved by the well-known Hungarian method.

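Here we consider the assignment problem for two sets of objects where the objects are labelled by elements of $\mathcal{X}$. There exists a labelling $\tau \colon A \cup B \to \mathcal{X}$ associating each object $a$ with a label $\tau(a)$. Furthermore, we may associate objects with tree nodes using the map $\varrho = \varphi \circ \tau$. We then have $c(a, b) = d(\varrho(a), \varrho(b))$. We denote by $D^c_{\mathrm{OA}}(A, B)$ the cost of an optimal assignment between $A$ and $B$, and the assignment problem as the quadruple $(A, B, T, \varrho)$.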

### 2.3 The graph edit distance

The graph edit distance measures the minimum cost required to transform a graph into another graph by adding, deleting and substituting vertices and edges. Each edit operation $o$ is assigned a cost $c(o)$, which may depend on the attributes associated with the affected vertices and edges. A sequence of edit operations $(o_1, \dots, o_k)$ that transforms a graph $G$ into another graph $H$ is called an edit path from $G$ to $H$. We denote the set of all possible edit paths from $G$ to $H$ by $\Upsilon(G, H)$. Let $G$ and $H$ be attributed graphs, the graph edit distance from $G$ to $H$ is defined by

$$
d(G, H) = \min \left\{ \sum_{i=1}^{k} c(o_i) \;\middle|\; (o_1, \dots, o_k) \in \Upsilon(G, H) \right\}.
$$

In order to obtain a meaningful measure of dissimilarity for graphs, a cost function must be tailored to the particular attributes that are present in the considered graphs.

#### Approximating the graph edit distance by assignments.

Computing the graph edit distance is an NP-hard problem and solving practical instances by exact approaches is often not feasible. Therefore, Riesen & Bunke (2009) proposed to derive a suboptimal edit path between graphs from an optimal assignment of their vertices, where the assignment costs also encode the local edge structure. For two graphs $G$ and $H$ with vertices $U = \{u_1, \dots, u_n\}$ and $V = \{v_1, \dots, v_m\}$, an assignment cost matrix $C$ is created according to

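$$
C = \left[\begin{array}{cccc|cccc}
c_{1,1} & c_{1,2} & \cdots & c_{1,m} & c_{1,\epsilon} & \infty & \cdots & \infty \\
c_{2,1} & c_{2,2} & \cdots & c_{2,m} & \infty & c_{2,\epsilon} & \ddots & \vdots \\
\vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \ddots & \infty \\
c_{n,1} & c_{n,2} & \cdots & c_{n,m} & \infty & \cdots & \infty & c_{n,\epsilon} \\
\hline
c_{\epsilon,1} & \infty & \cdots & \infty & 0 & 0 & \cdots & 0 \\
\infty & c_{\epsilon,2} & \ddots & \vdots & 0 & 0 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \infty & \vdots & \ddots & \ddots & 0 \\
\infty & \cdots & \infty & c_{\epsilon,m} & 0 & \cdots & 0 & 0
\end{array}\right],
$$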

where the entries are estimations of the cost for substituting, deleting and inserting vertices in $G$. In more detail, the entry $c_{i,\epsilon}$ is the cost for deleting $u_i$ increased by the costs for deleting the edges incident to $u_i$. The entry $c_{\epsilon,j}$ is the cost for inserting $v_j$ and all edges incident to $v_j$. Finally, $c_{i,j}$ is the cost made up of the cost for substituting the vertex $u_i$ by $v_j$ and the cost of an optimal assignment between the incident edges w.r.t. the edge substitution, deletion and insertion costs. An optimal assignment for $C$ allows deriving an edit path between $G$ and $H$. Its cost is not necessarily the minimum possible, but Riesen & Bunke (2009) show experimentally that this procedure leads to a sufficiently good approximation of the graph edit distance for many real-world problems.
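The block structure of this cost matrix can be sketched as follows. This is a simplified illustration and not the implementation of Riesen & Bunke (2009): it assumes unlabelled edges, so that the optimal assignment between incident edges reduces to the degree difference times the edge cost, and it uses cost 0 or 1 for vertex substitution. The names `bp_cost_matrix`, `tau_v` and `tau_e` are hypothetical, and the tiny instance is solved by brute force rather than by the Hungarian method.

```python
from itertools import permutations

INF = float("inf")


def bp_cost_matrix(labels_g, adj_g, labels_h, adj_h, tau_v=1.0, tau_e=1.0):
    """(n+m) x (m+n) cost matrix with substitution, deletion and insertion blocks."""
    U, V = list(labels_g), list(labels_h)
    n, m = len(U), len(V)
    C = [[INF] * (m + n) for _ in range(n + m)]
    for i, u in enumerate(U):
        for j, v in enumerate(V):
            # upper left block: vertex substitution plus unmatched incident edges
            subst = 0.0 if labels_g[u] == labels_h[v] else 1.0
            C[i][j] = subst + tau_e * abs(len(adj_g[u]) - len(adj_h[v]))
        # upper right block (diagonal): delete u_i and its incident edges
        C[i][m + i] = tau_v + tau_e * len(adj_g[u])
    for j, v in enumerate(V):
        # lower left block (diagonal): insert v_j and its incident edges
        C[n + j][j] = tau_v + tau_e * len(adj_h[v])
        for i in range(n):
            C[n + j][m + i] = 0.0  # lower right block: dummy to dummy
    return C


def brute_force_assignment_cost(C):
    """Minimum-cost assignment by enumeration; only feasible for tiny matrices."""
    size = len(C)
    return min(sum(C[r][p[r]] for r in range(size)) for p in permutations(range(size)))


# G: path C-O, H: path C-O-C (adjacency lists; labels are arbitrary strings)
labels_g, adj_g = {0: "C", 1: "O"}, {0: [1], 1: [0]}
labels_h, adj_h = {0: "C", 1: "O", 2: "C"}, {0: [1], 1: [0, 2], 2: [1]}
C = bp_cost_matrix(labels_g, adj_g, labels_h, adj_h)
approx_cost = brute_force_assignment_cost(C)
```

For this toy pair the best assignment substitutes both vertices of $G$ and inserts one vertex of $H$ together with its edge.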

The costs derived by Riesen & Bunke (2009) are directly related to edit costs of various operations on the graph, but unfortunately these costs are not suitable for our optimal assignment strategy, which must utilise a tree metric. For this reason, we use a different set of costs, described in Section 4. The optimal assignment is recovered using the method described in the next section. This assignment then induces an edit path which is used to compute a good approximation to the edit distance.

## 3 Optimal assignments under a tree metric

We consider the assignment problem under the assumption that the costs are derived from a tree metric and propose an efficient algorithm for constructing a solution. For a dataset of sets of objects we obtain a distortion-free embedding of the pairwise optimal assignment costs into an $\ell_1$ space.

### 3.1 Structural results

Let $(A, B, T, \varrho)$ be an assignment instance as described in Section 2.2. We associate the objects of $A$ and $B$ with the nodes of the tree $T$ according to the map $\varrho$, cf. Figure 1. An assignment $M$ between the objects of $A$ and $B$ is associated with a collection of paths $\mathcal{P}$ in $T$, such that there is a bijection between pairs $(a, b) \in M$ and paths $P(\varrho(a), \varrho(b)) \in \mathcal{P}$. In particular, the cost of the assignment equals the sum of weighted path lengths, i.e.,

$$
c(M) = \sum_{P \in \mathcal{P}} \sum_{e \in P} w(e). \tag{1}
$$

We do not construct the set $\mathcal{P}$ explicitly, but use this notion to develop efficient methods and prove their correctness. Using Eq. (1), we can attribute the total costs of an optimal assignment to the individual edges by counting how often they occur on shortest paths. Deleting an edge $uv$ yields two connected components, one containing the node $u$ and the other containing $v$. Let $A_{\overleftarrow{uv}}$ and $A_{\overrightarrow{uv}}$ denote the number of objects in $A$ associated by $\varrho$ with nodes in the connected component containing $u$ and $v$, respectively, cf. Figure 1. The quantities $B_{\overleftarrow{uv}}$ and $B_{\overrightarrow{uv}}$ are defined analogously for $B$.

###### Lemma 3.1.

Let $\mathcal{P}$ be the collection of shortest paths associated with an optimal assignment between $A$ and $B$ under a cost function represented by the weighted tree $T$. Each edge $uv$ in $T$ appears $|A_{\overleftarrow{uv}} - B_{\overleftarrow{uv}}|$ times in a path in $\mathcal{P}$.

###### Proof.

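Splitting $T$ at an edge $uv$ defines a bipartition of $A$ and $B$. Since $|A| = |B|$ holds, it follows that $|A_{\overleftarrow{uv}} - B_{\overleftarrow{uv}}| = |A_{\overrightarrow{uv}} - B_{\overrightarrow{uv}}|$ for every edge $uv$. When $uv$ appears in $|A_{\overleftarrow{uv}} - B_{\overleftarrow{uv}}|$ assignment paths, the maximum number of assignments are made within each partition, with all of the smaller of $A_{\overleftarrow{uv}}$ and $B_{\overleftarrow{uv}}$ assigned within the partition. We may assign at most $\min\{A_{\overleftarrow{uv}}, B_{\overleftarrow{uv}}\}$ objects within the connected component containing $u$, and at least the remaining $|A_{\overleftarrow{uv}} - B_{\overleftarrow{uv}}|$ objects must be assigned to objects in the connected component containing $v$. Therefore, $uv$ appears at least this number of times in shortest paths in $\mathcal{P}$.

It remains to be shown that the assignment cannot be optimal when the edge $uv$ is contained in more paths. Assume $\mathcal{P}$ corresponds to an optimal solution $M$ and contains the edge $uv$ more than $|A_{\overleftarrow{uv}} - B_{\overleftarrow{uv}}|$ times. Then there exist at least one element of $A$ and one element of $B$ on each side of $uv$ that are assigned across the partitions, i.e., there are assignment paths $P(a, b')$ and $P(a', b)$ in $\mathcal{P}$ that both contain $uv$. Note that $a$, $b$ are in the connected component containing $u$ and $a'$, $b'$ in the component containing $v$. Consider the paths $P(a, b)$ and $P(a', b')$, which both do not contain $uv$. The collection of paths $\mathcal{P}' = (\mathcal{P} \setminus \{P(a, b'), P(a', b)\}) \cup \{P(a, b), P(a', b')\}$ also defines an assignment $M'$, where the edges are contained in at most the same number of paths, with the exception of the edge $uv$, which appears in two paths less. Since $w(uv) > 0$, the associated solution has cost $c(M') < c(M)$ and hence $\mathcal{P}$ cannot correspond to an optimal solution, contradicting the assumption. ∎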

This result allows us to compute the optimal assignment cost as a weighted sum over the edges in the tree representing the cost metric.

###### Theorem 3.2.

Let $(A, B, T, \varrho)$ be an assignment instance with tree edge weights $w$. The cost of an optimal assignment is

$$
D^c_{\mathrm{OA}}(A, B) = \sum_{uv \in E(T)} |A_{\overleftarrow{uv}} - B_{\overleftarrow{uv}}| \cdot w(uv).
$$

###### Proof.

Directly follows from Eq. (1) and Lemma 3.1. ∎
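The formula of Theorem 3.2 can be checked numerically on a toy instance. The sketch below assumes a rooted tree encoded by hypothetical `parent` and `weight` dictionaries (the weight of a node is the weight of the edge to its parent) and objects given directly by their tree nodes. It compares the edge-wise sum with a brute-force search over all assignments.

```python
from itertools import permutations


def oa_cost_tree(parent, weight, nodes_a, nodes_b):
    """Sum over all tree edges of |A_sub - B_sub| * w, with counts taken below the edge."""
    surplus = {v: 0 for v in parent}  # (#objects of A) - (#objects of B) in the subtree
    for v in nodes_a:
        surplus[v] += 1
    for v in nodes_b:
        surplus[v] -= 1
    children = {v: [] for v in parent}
    root = None
    for v, p in parent.items():
        if p is None:
            root = v
        else:
            children[p].append(v)
    order, stack = [], [root]
    while stack:  # pre-order traversal: parents appear before their children
        v = stack.pop()
        order.append(v)
        stack.extend(children[v])
    cost = 0.0
    for v in reversed(order):  # children are processed before their parents
        p = parent[v]
        if p is not None:
            cost += abs(surplus[v]) * weight[v]
            surplus[p] += surplus[v]
    return cost


def path_length(parent, weight, u, v):
    """Weighted length of the u-v path via the lowest common ancestor."""
    dist_u, d, x = {}, 0.0, u
    while x is not None:
        dist_u[x] = d
        if parent[x] is not None:
            d += weight[x]
        x = parent[x]
    d, x = 0.0, v
    while x not in dist_u:
        d += weight[x]
        x = parent[x]
    return d + dist_u[x]


def oa_cost_brute_force(parent, weight, nodes_a, nodes_b):
    return min(
        sum(path_length(parent, weight, a, b) for a, b in zip(nodes_a, perm))
        for perm in permutations(nodes_b)
    )


parent = {"r": None, "p": "r", "a": "p", "b": "p", "c": "r"}
weight = {"p": 1.0, "a": 1.0, "b": 1.0, "c": 2.0}
A, B = ["a", "a", "c"], ["b", "c", "p"]
```

Each edge is visited once, so the edge-wise sum takes time linear in the tree size plus the number of objects, whereas the brute-force check is only feasible for very small sets.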

### 3.2 Constructing an optimal assignment

In order to compute an optimal assignment, and not just its cost, we again associate the objects of $A$ and $B$ with the nodes of the tree $T$. Then we pick an arbitrary leaf and assign the maximum possible number of elements between the subsets of $A$ and $B$ associated with this leaf. The remaining objects are passed to its neighbour and the considered leaf is deleted. Iterating the approach until the tree is eventually empty yields an assignment between all objects of $A$ and $B$. Algorithm 1 implements this approach.
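The leaf-pruning procedure described above can be sketched as follows. This is an illustrative reading of the prose description, not a transcription of Algorithm 1: the function names, the adjacency encoding and the `node_of` map from objects to tree nodes are assumptions. Python lists are used for simplicity, so passing objects to the neighbour is not a constant-time operation here; linked lists would be needed to match the stated running time.

```python
from collections import deque


def optimal_assignment(adj, node_of, objs_a, objs_b):
    """Pair objects of A and B by repeatedly processing and deleting leaves."""
    bag_a = {v: [] for v in adj}
    bag_b = {v: [] for v in adj}
    for a in objs_a:
        bag_a[node_of[a]].append(a)
    for b in objs_b:
        bag_b[node_of[b]].append(b)
    degree = {v: len(adj[v]) for v in adj}
    removed = set()
    leaves = deque(v for v in adj if degree[v] <= 1)
    matching = []
    while leaves:
        v = leaves.popleft()
        # pair as many objects as possible at the current leaf
        while bag_a[v] and bag_b[v]:
            matching.append((bag_a[v].pop(), bag_b[v].pop()))
        removed.add(v)
        for u in adj[v]:
            if u in removed:
                continue
            # u is the only remaining neighbour: pass the unmatched objects on
            bag_a[u].extend(bag_a[v])
            bag_b[u].extend(bag_b[v])
            degree[u] -= 1
            if degree[u] == 1:
                leaves.append(u)
    return matching


def dist(adj, u, v):
    """Weighted length of the unique u-v path (used only to evaluate the result)."""
    stack = [(u, None, 0.0)]
    while stack:
        node, prev, d = stack.pop()
        if node == v:
            return d
        stack.extend((n, node, d + w) for n, w in adj[node].items() if n != prev)
    raise ValueError("nodes are not connected")


edges = [("r", "p", 1.0), ("p", "a", 1.0), ("p", "b", 1.0), ("r", "c", 2.0)]
adj = {}
for u, v, w in edges:
    adj.setdefault(u, {})[v] = w
    adj.setdefault(v, {})[u] = w
node_of = {"a1": "a", "a2": "a", "a3": "c", "b1": "b", "b2": "c", "b3": "p"}
A, B = ["a1", "a2", "a3"], ["b1", "b2", "b3"]
M = optimal_assignment(adj, node_of, A, B)
cost = sum(dist(adj, node_of[a], node_of[b]) for a, b in M)
```

Note that the construction itself never reads the edge weights. They only matter for evaluating the cost of the resulting assignment, which is consistent with Lemma 3.1 depending only on the object counts on both sides of each edge.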

###### Theorem 3.3.

Algorithm 1 computes an optimal assignment in time $O(n + t)$, where $n = |A| = |B|$ is the input size and $t = |V(T)|$ the size of the tree $T$.

###### Proof.

Since $|A| = |B|$ and every object of $A$ is associated with exactly one object of $B$, the algorithm constructs an assignment $M$. The cost of the assignment corresponds to the number of objects that are passed to neighbours along the weighted edges. Whenever a node $v$ is processed in the while-loop, it has exactly one remaining neighbour $u$. Since $v$ is deleted after the end of the iteration, objects are passed along the edge $uv$ only in this iteration. After calling the procedure PairElements, either the set of unassigned objects of $A$ at $v$ or the set of unassigned objects of $B$ at $v$ or both are empty. Since all objects in the connected component of $T \setminus uv$ that contains $v$ must have been passed to $v$ in previous iterations, exactly $|A_{\overrightarrow{uv}} - B_{\overrightarrow{uv}}|$ objects are passed to $u$. This is the number of occurrences of $uv$ in every optimal solution according to Lemma 3.1. Therefore, $M$ is an optimal assignment.

The total running time over all iterations for the procedure PairElements is $O(n)$, the size of the assignment. All the other individual operations within the while-loop can be implemented in constant time when using linked lists to store and pass the objects. Therefore the while-loop and the entire algorithm run in $O(n + t)$ total time. ∎

Every optimal assignment can be obtained by Algorithm 1, depending on the order in which the objects are retrieved by GetAndRemoveElement.

### 3.3 Improving the running time

We consider the setting where the map $\varrho$ and the weighted tree $T$ encoding the cost metric are fixed and the distance $D^c_{\mathrm{OA}}$ should be computed for a large number of pairs. The individual assignment instances possibly only populate a small fraction of the nodes of $T$ and only a small subtree may be relevant for Algorithm 1. We show that this subtree can be identified efficiently.

Given a tree $T$ and a set $U \subseteq V(T)$, let $T_U$ denote the minimal subtree of $T$ with $U \subseteq V(T_U)$.

###### Lemma 3.4.

Given a tree $T$ and a set $U \subseteq V(T)$, the subtree $T_U$ can be computed in time $O(|V(T_U)|)$ after a preprocessing step of time $O(|V(T)|)$.

###### Proof.

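In the preprocessing step we pick an arbitrary node of $T$ as root and compute the depth of every vertex w.r.t. the root using breadth-first search. Let $d_{\min} = \min_{u \in U} \operatorname{depth}(u)$. For every node $u \in U$, we (i) add $u$ to the result set $R$, and (ii) if the parent $p$ of $u$ is not in $R$ and $\operatorname{depth}(u) > d_{\min}$, set $u$ to $p$ and continue with step (i). Let $L = \{v \in R \mid \operatorname{depth}(v) = d_{\min}\}$. If $|L| > 1$, add the parents of all $v \in L$ to $R$, replace $L$ by these parents and decrease $d_{\min}$ by one. Repeat this step until $|L| = 1$. Eventually, we have $V(T_U) = R$, and $T_U$ is the subtree of $T$ induced by $R$. Every node in $T_U$ is processed only once and the running time is $O(|V(T_U)|)$. ∎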

Assuming that the tree $T$ and the depth of all nodes are given, the result directly improves the running time of Algorithm 1 to $O(n + |V(T_U)|)$, where $U$ is the set of nodes populated by the objects of $A$ and $B$.
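The subtree computation of Lemma 3.4 can be sketched as follows: a breadth-first search fixes parents and depths once, after which each query climbs from the given nodes towards the root. The function names and the adjacency-list encoding are assumptions for this example, and the handling of the top level follows one plausible reading of the proof.

```python
from collections import deque


def preprocess(adj, root):
    """Breadth-first search from the root: parent and depth of every node."""
    parent, depth = {root: None}, {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in depth:
                parent[u], depth[u] = v, depth[v] + 1
                queue.append(u)
    return parent, depth


def minimal_subtree(parent, depth, targets):
    """Node set of the smallest subtree that contains all target nodes."""
    d_min = min(depth[u] for u in targets)
    result = set()
    for u in targets:
        # climb until an already visited node or the level d_min is reached
        while u not in result:
            result.add(u)
            if depth[u] == d_min:
                break
            u = parent[u]
    # join the remaining branches level by level until they meet in one node
    frontier = {v for v in result if depth[v] == d_min}
    while len(frontier) > 1:
        frontier = {parent[v] for v in frontier}
        result |= frontier
    return result


adj = {
    "r": ["p", "c"], "p": ["r", "a", "b"], "a": ["p", "a1"], "b": ["p"],
    "c": ["r", "c1"], "a1": ["a"], "c1": ["c"],
}
parent, depth = preprocess(adj, "r")
```

Every node added to the result lies on a path between two target nodes or is a target itself, so the work per query is proportional to the size of the returned subtree.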

### 3.4 Embedding optimal assignment costs

We show how sets can be embedded in a vector space such that the Manhattan distance between these vectors equals the optimal assignment costs between sets w.r.t. a given cost function. Let $\mathcal{D}$ be a dataset of sets with $|A| = n$ for all $A \in \mathcal{D}$. Let the cost function $c$ between the objects be determined by a tree distance represented by the weighted tree $T$, and let $\varrho$ be the map from the objects to the nodes of the tree. We consider the following map $\phi_c$ from $\mathcal{D}$ to points in $\mathbb{R}^d$ with $d = |E(T)|$ and components indexed by the edges of $T$:

$$
\phi_c(A) = \left[ A_{\overleftarrow{uv}} \cdot w(uv) \right]_{uv \in E(T)}.
$$

The Manhattan distance between these vectors is equal to the optimal assignment costs between the sets.

###### Theorem 3.5.

Let $c$ be a cost function and $\phi_c$ defined as above, then $D^c_{\mathrm{OA}}(A, B) = \lVert \phi_c(A) - \phi_c(B) \rVert_1$.

###### Proof.

We calculate

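$$
\begin{aligned}
\lVert \phi_c(A) - \phi_c(B) \rVert_1
&= \sum_{uv \in E(T)} \left| A_{\overleftarrow{uv}} \cdot w(uv) - B_{\overleftarrow{uv}} \cdot w(uv) \right| \\
&= \sum_{uv \in E(T)} \left| A_{\overleftarrow{uv}} - B_{\overleftarrow{uv}} \right| \cdot w(uv) \\
&= D^c_{\mathrm{OA}}(A, B),
\end{aligned}
$$

where the last equality follows from Theorem 3.2. ∎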

This makes the optimal assignment costs available to fast indexing methods and nearest-neighbour search algorithms, e.g., following the locality sensitive hashing paradigm.
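A minimal sketch of the embedding, assuming a rooted tree given by hypothetical `parent` and `weight` dictionaries and sets given directly by the tree nodes of their objects. For every edge the count is taken on the side below the edge, which is one consistent choice of orientation. The counts are obtained by walking from each object to the root, which is simple but not linear-time.

```python
def embed(parent, weight, nodes):
    """Vector with one component per edge (v, parent[v]): objects below the edge times w."""
    below = {v: 0 for v in parent if parent[v] is not None}
    for v in nodes:
        while parent[v] is not None:  # every edge on the way to the root counts the object
            below[v] += 1
            v = parent[v]
    return {v: below[v] * weight[v] for v in below}


def manhattan(x, y):
    return sum(abs(x[k] - y[k]) for k in x)


parent = {"r": None, "p": "r", "a": "p", "b": "p", "c": "r"}
weight = {"p": 1.0, "a": 1.0, "b": 1.0, "c": 2.0}
phi_a = embed(parent, weight, ["a", "a", "c"])
phi_b = embed(parent, weight, ["b", "c", "p"])
```

Once the vectors are stored, each pairwise distance is a plain Manhattan distance, so standard $\ell_1$ indexing structures can be applied without touching the tree again.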

## 4 Approximating the graph edit distance in linear time

We combine our assignment algorithm with the idea of Riesen & Bunke (2009) to approximate the graph edit distance detailed in Section 2.3. To this end, we propose two methods for constructing a tree distance, such that the optimal assignment between the vertices of two graphs w.r.t. these distances is suitable for approximating the graph edit distance. In order to quantify discrete and structural differences, we propose to use the Weisfeiler-Lehman method and, for graphs with continuous labels, hierarchical clustering. Note that both approaches can be combined to form a tree taking both discrete and continuous labels into account. The tree distances we consider are in fact ultrametrics. We discuss the general limitations of ultrametrics in representing distances as used for the cost matrix of Eq. (2.3) in the appendix, Section B. In order to compare graphs of different size we introduce artificial vertices that represent vertex insertion and deletion operations.

### 4.1 Weisfeiler-Lehman trees for discrete and structural differences

Weisfeiler-Lehman refinement (WL), also known as colour refinement or naïve vertex classification, is a classical heuristic for graph isomorphism testing. It iteratively refines partitions of the vertices of a graph, where the vertices in the same cell are said to have the same colour. In each iteration two vertices with the same colour obtain different new colours if their neighbourhood differs w.r.t. the current colouring. More formally, given a parameter $h$ and a graph $G$ with initial colours $\tau_0$, a sequence $(\tau_1, \dots, \tau_h)$ of refined colours is computed, where $\tau_i$ is obtained from $\tau_{i-1}$ by the following procedure. For every vertex $v \in V(G)$, sort the multiset of colours $\{\!\!\{\tau_{i-1}(u) \mid uv \in E(G)\}\!\!\}$ to obtain a unique sequence of colours and add $\tau_{i-1}(v)$ as first element. Assign a new colour $\tau_i(v)$ to every vertex $v$ by employing an injective mapping from colour sequences to new colours. It was observed by Kriege et al. (2016) that colour refinement applied to a set of graphs under the same injective mapping yields a hierarchy of partitions of the vertices, which forms a tree. Let $\varrho$ associate the vertices with the node representing their final colour after $h$ iterations. Then the path length between two nodes in the tree represents the number of refinement steps in which the associated vertices have different colours. Assuming that $h$ is fixed, we obtain a linear running time for approximating the graph edit distance with Theorem 3.3 and Lemma 3.4.
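The refinement procedure and the resulting colour hierarchy can be sketched as follows. The graph encoding (a pair of a label dictionary and an adjacency-list dictionary), the function name `wl_tree` and the artificial `"root"` node are assumptions for this example. Edge weights of the tree are left out.

```python
def wl_tree(graphs, h):
    """Run h rounds of colour refinement on all graphs with one shared injective map.

    Returns the colour hierarchy as a parent map and, for every graph,
    a dictionary from vertices to the tree node of their final colour.
    """
    parent = {"root": None}  # tree node -> parent tree node
    colour_of = {}           # injective map from signatures to tree nodes

    def node_for(signature, parent_node):
        if signature not in colour_of:
            colour_of[signature] = len(colour_of)
            parent[colour_of[signature]] = parent_node
        return colour_of[signature]

    # iteration 0: initial colours are the vertex labels
    colourings = [
        {v: node_for((0, label), "root") for v, label in labels.items()}
        for labels, _ in graphs
    ]
    for i in range(1, h + 1):
        # new colour = own colour followed by the sorted colours of the neighbours
        colourings = [
            {
                v: node_for((i, col[v], tuple(sorted(col[u] for u in adj[v]))), col[v])
                for v in adj
            }
            for (_, adj), col in zip(graphs, colourings)
        ]
    return parent, colourings


path = ({0: "C", 1: "C", 2: "C"}, {0: [1], 1: [0, 2], 2: [1]})
triangle = ({0: "C", 1: "C", 2: "C"}, {0: [1, 2], 1: [0, 2], 2: [0, 1]})
parent, (col_path, col_triangle) = wl_tree([path, triangle], h=1)
```

After one iteration the end vertices of the path share a colour that differs from the middle vertex, while all triangle vertices receive the colour of the middle vertex, since each of them has two neighbours of the same initial colour.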

The WL method is used successfully to derive efficient and expressive graph kernels (Shervashidze et al., 2011; Kriege et al., 2016). The prediction accuracy obtained with such kernels crucially depends on the number of refinement iterations $h$, which is typically determined by expensive grid search. Note that our approach to approximate the graph edit distance is not sensitive to a particular choice of $h$, but can be expected to always benefit from more iterations. If $h$ is chosen sufficiently large, we can in fact show that a slight modification of our algorithm guarantees that the computed edit distance is zero if and only if $G$ and $H$ are isomorphic, assuming that $G$ is amenable to the WL procedure. This means that WL succeeds in distinguishing $G$ from any non-isomorphic graph $H$ (Arvind et al., 2015). Then, there is a choice for the module PairElements such that Algorithm 1 is guaranteed to construct an isomorphism between two copies of $G$, i.e., the algorithm output is optimal. These choices for PairElements are explicitly constructed in the appendix, Section A.

### 4.2 Hierarchical clustering for continuous labels

We apply the bisecting $k$-means algorithm (Steinbach et al., 2000a, b) to obtain a hierarchical clustering of the continuous vertex labels of all graphs. This is then used as the tree defining the assignment costs. We use Lloyd’s algorithm (Lloyd, 1982) for each $2$-means problem and perform bisection steps until the desired number of leaves is created. Assuming that the number of reassignments in $k$-means is constant as well as the dimension of the node labels, we obtain a linear running time for creating the tree. With Theorem 3.3 and Lemma 3.4 this again yields a linear time method for approximating the graph edit distance.
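A minimal sketch of building such a cluster hierarchy is given below. Several details are assumptions not taken from the text: the deterministic initialisation of the two centres, the fixed number of Lloyd iterations, the rule of always splitting the largest leaf, and all function names. Edge weights for the resulting tree are not assigned here.

```python
def sqdist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def two_means(points, iters=20):
    """Lloyd's algorithm for two clusters; requires at least two distinct points."""
    c0, c1 = min(points), max(points)  # assumed deterministic initialisation
    for _ in range(iters):
        left = [p for p in points if sqdist(p, c0) <= sqdist(p, c1)]
        right = [p for p in points if sqdist(p, c0) > sqdist(p, c1)]
        c0, c1 = mean(left), mean(right)
    return left, right


def bisecting_tree(points, num_leaves):
    """Hierarchy as parent map plus the members of every cluster and the leaf ids."""
    members, parent, leaves = {0: list(points)}, {0: None}, [0]
    while len(leaves) < num_leaves:
        node = max(leaves, key=lambda v: len(members[v]))  # split the largest leaf
        if len(set(members[node])) < 2:
            break  # the largest leaf cannot be split any further
        leaves.remove(node)
        for part in two_means(members[node]):
            child = len(parent)
            parent[child], members[child] = node, part
            leaves.append(child)
    return parent, members, leaves


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
parent, members, leaves = bisecting_tree(points, num_leaves=3)
```

Each vertex label is then associated with the leaf cluster that contains it, which plays the role of the map from objects to tree nodes used by the assignment algorithm.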

## 5 Experimental Evaluation

Our goal in this section is to answer the following questions experimentally.

1. How does our approach scale w.r.t. the graph and dataset size compared to other methods?

2. How accurately does it approximate the graph edit distance for common datasets?

3. How does it perform regarding runtime and accuracy in classification tasks?

4. How does our method compare to other approaches for graph classification?

### 5.1 Method

We have implemented the following methods for computing or approximating the graph edit distance in Java using the same code base where possible.

• Exact: Binary linear programming approach to compute the graph edit distance exactly (Lerouge et al., 2017). We implemented the most efficient formulation (F2) and solved all instances using Gurobi 7.5.2.
• BP: Approximate graph edit distance using the assignment problem as proposed by Riesen & Bunke (2009).
• Greedy: The greedy graph edit distance proposed by Riesen et al. (2015a), solving the assignment problem by a row-wise greedy algorithm.
• Linear: Our approach based on assignments under a tree distance. For graphs with discrete labels we used Weisfeiler-Lehman trees, and bisecting $k$-means clustering otherwise.

The experiments were conducted using Java v1.8.0 on an Intel Core i7-3770 CPU at 3.4GHz (Turbo Boost disabled) with 16GB of RAM. The methods BP, Greedy and Linear use a single processor only, the Gurobi solver for the Exact method was allowed to use all four cores with additional Hyper-Threading.

We used the graph classification benchmark sets contained in the IAM Graph Database (Riesen & Bunke, 2008) and the repository of benchmark datasets for graph kernels (Kersting et al., 2016). Please note that the statistics of the IAM datasets may differ from the datasets used in (Riesen & Bunke, 2009). The datasets AIDS, Mutagenicity and NCI1 represent small molecules and have discrete labels only. The Letter datasets have continuous vertex labels representing 2D coordinates and differ w.r.t. the level of distortion: low (L), medium (M) and high (H). In order to systematically investigate the runtime we generated random graphs according to the Erdős–Rényi model with a fixed edge probability. We used the predefined train, test and validation sets when available or generated them randomly using a fixed fraction of the objects for each set, balanced by class label. We performed $k$-nearest neighbours classification based on the graph edit distance. The costs for vertex insertion and deletion were both set to $\tau_{\mathrm{vertex}}$ and the costs for insertion and deletion of edges were set to $\tau_{\mathrm{edge}}$. The costs for substituting vertices or edges are determined by the Euclidean distance in case of continuous labels. In case of discrete labels we assume cost 0 for equal labels and 1 otherwise. We use the validation set to select the parameters $k$, $\tau_{\mathrm{vertex}}$ and $\tau_{\mathrm{edge}}$ by grid search. The approach resembles the experimental settings used by Riesen & Bunke (2009). The reported runtimes were obtained using either the selected parameters or a fixed parameter setting. For the Linear method, the runtimes include the time for constructing the tree.

For comparison with other approaches to graph classification, we used two graph kernels as a baseline. The GraphHopper kernel (GH) (Feragen et al., 2013) supports graphs with discrete and continuous labels by applying a Dirac or Gaussian kernel. The Weisfeiler-Lehman optimal assignment kernel (WLOA) (Kriege et al., 2016) supports only graphs with discrete labels. We used the $C$-SVM implementation LIBSVM (Chang & Lin, 2011), selecting the hyperparameters using the validation set.

### 5.2 Results

We report on our experimental results and answer our research questions.

#### Q1

Figure 2 shows the growth of the runtime with increasing graph and dataset size. Our method is the only one of those studied that scales to large graphs. The number of distance computations and thus the runtime of all methods grows quadratically with the dataset size. Even for the small random graphs on 15 vertices we generated, our method is more than one order of magnitude faster than other approximate methods for datasets of moderate size.

#### Q2

To compare how accurately the graph edit distance is computed, we have selected 500 pairs of graphs at random from each IAM dataset and computed their graph edit distance by all four methods. Figure 3 shows how the distance computed by the Linear method compares to the distances obtained by the other three methods. Points below the diagonal line represent pairs of graphs for which the edit distance computed by Linear is actually smaller than the one computed by the competing approach. Compared to the Greedy approach, the Linear method appears to give slightly better results on average. For the datasets Mutagenicity, Letter (L), (M) and (H) there are more points below the diagonal than above the diagonal. When comparing to BP, this is still the case for the Mutagenicity dataset, but not for the Letter datasets. This can be explained by the fact that continuous distances for several points cannot be represented by a tree metric without distortion. In order to compare with the exact method on Mutagenicity and AIDS, we introduced a timeout of 100 seconds for each distance computation. This was necessary since hard instances may require several hours or more. In case of a timeout the best solution found so far is used, which is not guaranteed to be optimal. The Linear method shows a clear divergence from the optimal solutions, in particular for pairs of graphs with a high (exact) edit distance. However, it is likely that non-optimal solutions in this case do not harm a nearest neighbours classification.

#### Q3

Table 1 summarises the results of the classification experiments. The Linear approach provides a high classification accuracy comparable to BP and Greedy. For the datasets Mutagenicity and NCI1 it even performs better than the other approaches. This can be explained by the ability of the Weisfeiler-Lehman tree to exploit more graph structure than BP. For the Letter datasets, the Linear method is on a par with the other methods for the version with low distortion, but performs slightly worse when the distortion increases. This observation is in accordance with the approximation quality achieved for the datasets, cf. Figure 3. The Linear method clearly outperforms all other approaches in terms of runtime. This becomes particularly clear for the dataset Mutagenicity, which contains large graphs with 30.32 vertices on average.

#### Q4

The GraphHopper kernel performs worse than our Linear approach w.r.t. running time and classification accuracy. WLOA can only be applied to the molecular datasets with discrete labels. For these it performs exceptionally well regarding both accuracy and runtime. The result suggests that the notion of similarity provided by the graph edit distance is less suitable for this classification task.

## 6 Conclusion

We have shown that optimal assignments can be computed efficiently for tree metric costs. Although this is a severe restriction, we designed such cost functions suitable for the challenging problem of graph matching. Our approach allows embedding the optimal assignment costs in an $\ell_1$ space. It remains future work to exploit this property, e.g., for efficient nearest neighbour search in graph databases.

## Acknowledgements

This work was supported by the German Science Foundation (DFG) within the Collaborative Research Center SFB 876 “Providing Information by Resource-Constrained Data Analysis”, project A6 “Resource-efficient Graph Mining”.

## References

• Arvind et al. (2015) Arvind, V., Köbler, J., Rattan, G., and Verbitsky, O. On the power of color refinement. In Kosowski, A. and Walukiewicz, I. (eds.), Fundamentals of Computation Theory, pp. 339–350. Springer International Publishing, 2015. ISBN 978-3-319-22177-9.
• Bunke (1997) Bunke, H. On a relation between graph edit distance and maximum common subgraph. Pattern Recognition Letters, 18(8):689–694, 1997. ISSN 0167-8655.
• Burkard et al. (2012) Burkard, R. E., Dell’Amico, M., and Martello, S. Assignment Problems. SIAM, 2012.
• Chang & Lin (2011) Chang, C.-C. and Lin, C.-J. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2:27:1–27:27, May 2011. ISSN 2157-6904. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
• Feragen et al. (2013) Feragen, A., Kasenburg, N., Petersen, J., Bruijne, M. D., and Borgwardt, K. Scalable kernels for graphs with continuous attributes. In Burges, C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems 26, pp. 216–224, 2013. Erratum available at http://image.diku.dk/aasa/papers/graphkernels_nips_erratum.pdf.
• Fröhlich et al. (2005) Fröhlich, H., Wegner, J. K., Sieker, F., and Zell, A. Optimal assignment kernels for attributed molecular graphs. In Proceedings of the 22nd international conference on Machine learning, ICML ’05, pp. 225–232, New York, NY, USA, 2005. ACM. ISBN 1-59593-180-5.
• Garey & Johnson (1979) Garey, M. R. and Johnson, D. S. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman, 1979. ISBN 0-7167-1044-7.
• Grauman & Darrell (2007a) Grauman, K. and Darrell, T. The pyramid match kernel: Efficient learning with sets of features. J. Mach. Learn. Res., 8:725–760, May 2007a. ISSN 1532-4435.
• Grauman & Darrell (2007b) Grauman, K. and Darrell, T. Approximate correspondences in high dimensions. In Schölkopf, B., Platt, J. C., and Hoffman, T. (eds.), Advances in Neural Information Processing Systems 19, pp. 505–512. MIT Press, 2007b.
• Johansson & Dubhashi (2015) Johansson, F. D. and Dubhashi, D. Learning with similarity functions on graphs using matchings of geometric embeddings. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, pp. 467–476, New York, NY, USA, 2015. ACM. ISBN 978-1-4503-3664-2.
• Johnson (2005) Johnson, D. S. The NP-completeness column. ACM Trans. Algorithms, 1(1):160–176, July 2005. ISSN 1549-6325.
• Kersting et al. (2016) Kersting, K., Kriege, N. M., Morris, C., Mutzel, P., and Neumann, M. Benchmark data sets for graph kernels, 2016.
• Kriege et al. (2016) Kriege, N. M., Giscard, P.-L., and Wilson, R. On valid optimal assignment kernels and applications to graph classification. In Advances in Neural Information Processing Systems 29 (NIPS), pp. 1623–1631. Curran Associates, Inc., 2016.
• Lerouge et al. (2017) Lerouge, J., Abu-Aisheh, Z., Raveaux, R., Héroux, P., and Adam, S. New binary linear programming formulation to compute the graph edit distance. Pattern Recognition, 72(Supplement C):254–265, 2017. ISSN 0031-3203.
• Lin (1994) Lin, C.-L. Hardness of approximating graph transformation problem. In Du, D.-Z. and Zhang, X.-S. (eds.), Algorithms and Computation, pp. 74–82, Berlin, Heidelberg, 1994. Springer Berlin Heidelberg. ISBN 978-3-540-48653-4.
• Lloyd (1982) Lloyd, S. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, March 1982. ISSN 0018-9448.
• Riesen & Bunke (2008) Riesen, K. and Bunke, H. IAM graph database repository for graph based pattern recognition and machine learning. In da Vitoria Lobo, N., Kasparis, T., Roli, F., Kwok, J. T., Georgiopoulos, M., Anagnostopoulos, G. C., and Loog, M. (eds.), Structural, Syntactic, and Statistical Pattern Recognition: Joint IAPR International Workshop, SSPR & SPR 2008, Orlando, USA, December 4-6, 2008. Proceedings, pp. 287–297, Berlin, Heidelberg, 2008. Springer Berlin Heidelberg. ISBN 978-3-540-89689-0.
• Riesen & Bunke (2009) Riesen, K. and Bunke, H. Approximate graph edit distance computation by means of bipartite graph matching. Image and Vision Computing, 27(7):950–959, 2009. ISSN 0262-8856. 7th IAPR-TC15 Workshop on Graph-based Representations (GbR 2007).
• Riesen et al. (2015a) Riesen, K., Ferrer, M., Dornberger, R., and Bunke, H. Greedy graph edit distance. In Perner, P. (ed.), Machine Learning and Data Mining in Pattern Recognition, pp. 3–16, Cham, 2015a. Springer International Publishing. ISBN 978-3-319-21024-7.
• Riesen et al. (2015b) Riesen, K., Ferrer, M., Fischer, A., and Bunke, H. Approximation of graph edit distance in quadratic time. In Liu, C.-L., Luo, B., Kropatsch, W. G., and Cheng, J. (eds.), Graph-Based Representations in Pattern Recognition, pp. 3–12, Cham, 2015b. Springer International Publishing. ISBN 978-3-319-18224-7.
• Sanfeliu & Fu (1983) Sanfeliu, A. and Fu, K.-S. A distance measure between attributed relational graphs for pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics, 13(3):353–362, 1983.
• Semple & Steel (2003) Semple, C. and Steel, M. Phylogenetics. Oxford lecture series in mathematics and its applications. Oxford University Press, 2003.
• Shervashidze et al. (2011) Shervashidze, N., Schweitzer, P., van Leeuwen, E. J., Mehlhorn, K., and Borgwardt, K. M. Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12:2539–2561, 2011.
• Stauffer et al. (2017) Stauffer, M., Tschachtli, T., Fischer, A., and Riesen, K. A survey on applications of bipartite graph edit distance. In Foggia, P., Liu, C.-L., and Vento, M. (eds.), Graph-Based Representations in Pattern Recognition, pp. 242–252, Cham, 2017. Springer International Publishing. ISBN 978-3-319-58961-9.
• Steinbach et al. (2000a) Steinbach, M., Karypis, G., and Kumar, V. A comparison of document clustering techniques. In KDD Workshop on Text Mining, 2000a.
• Steinbach et al. (2000b) Steinbach, M., Karypis, G., and Kumar, V. A comparison of document clustering techniques. Technical report, Department of Computer Science and Engineering, University of Minnesota, 2000b.
• Vert (2008) Vert, J.-P. The optimal assignment kernel is not positive definite. CoRR, abs/0801.4061, 2008.
• Vishwanathan et al. (2010) Vishwanathan, S. V. N., Schraudolph, N. N., Kondor, R. I., and Borgwardt, K. M. Graph kernels. Journal of Machine Learning Research, 11:1201–1242, 2010.
• Zeng et al. (2009) Zeng, Z., Tung, A. K. H., Wang, J., Feng, J., and Zhou, L. Comparing stars: On approximating graph edit distance. Proc. VLDB Endow., 2(1):25–36, August 2009. ISSN 2150-8097.

## Appendix A Modification of PairElements for WL-amenable graphs

These results are due to two properties of WL-amenable graphs. To introduce these, we will need to consider the elements of the stable partition of the vertex set of $G$ at the end of the WL-procedure, called cells. For any cell $X$, the graph $G[X]$ denotes the induced subgraph of $G$ with vertex set $X$.

First, the stable partition of an amenable graph coincides with the orbit partition of the automorphism group of this graph. In other words, two vertices $u$ and $v$ of $G$ have the same colour in the stable partition if and only if some automorphism of $G$ maps $u$ to $v$ (Arvind et al., 2015). This guarantees that all isomorphisms between copies of $G$, i.e. automorphisms of $G$, relate vertices belonging to the same cell. Second, $G[X]$ is of only five possible types (see Lemma 3 of Arvind et al., 2015): i) empty; ii) complete; iii) a 5-cycle; iv) a matching graph (multiple disjoint copies of $K_2$); or v) the complement of a matching graph. Note that for two isomorphic graphs PairElements is called exactly for the matching cells of the stable partitions and no elements are passed to neighbours.

We modify the function to first determine the type of $G[X]$. If it is empty or complete, any bijection between the vertices of the two copies is an isomorphism. If $G[X]$ is a 5-cycle, the algorithm picks one vertex at random on each of the two 5-cycles to be matched, matches these two vertices, then selects and matches their right (or left) neighbours, and proceeds in this way until all vertices are matched. If $G[X]$ is a matching graph, the algorithm picks one vertex at random in each of the two copies, matches these two vertices and then matches their unique neighbours, repeating until all vertices are matched. If $G[X]$ is none of the above, it is the complement of a matching graph. Then the algorithm picks one vertex at random in each of the two copies, matches them, and then matches their unique non-neighbours, i.e. the only other vertices they are not adjacent to. It proceeds in this way until all vertices are matched.

In the end, the algorithm has constructed a valid automorphism of $G$, or equivalently an isomorphism between two copies of $G$. All automorphisms can be constructed this way by modifying the module PairElements such that all possible matchings are realised rather than a single random one, by adding a loop over all possible choices in cases i)–v).

The procedure PairAdj shown in Algorithm 2 replaces PairElements in Algorithm 1 and implements this idea.
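As a rough illustration of the case distinction described above, the following Python sketch pairs the vertices of two copies of a single cell graph according to its type. It is not the PairAdj procedure of Algorithm 2. The adjacency-dictionary representation and the names `cell_type`, `cycle_order`, `partner_pairs` and `pair_cell` are assumptions made for this example only. For the complement of a matching graph, the sketch pairs each vertex with its unique non-neighbour.

```python
import random


def cell_type(adj):
    """Classify a regular cell graph given as {vertex: set of neighbours}."""
    n = len(adj)
    d = len(next(iter(adj.values())))  # cells are regular, so one degree suffices
    if d == 0:
        return "empty"
    if d == n - 1:
        return "complete"
    if n == 5 and d == 2:
        return "cycle5"
    if d == 1:
        return "matching"
    if d == n - 2:
        return "co-matching"  # complement of a matching graph
    raise ValueError("not one of the five cell types")


def cycle_order(adj, start):
    """List the vertices of a cycle in traversal order, beginning at start."""
    order, prev, cur = [start], None, start
    while len(order) < len(adj):
        nxt = next(w for w in adj[cur] if w != prev)
        order.append(nxt)
        prev, cur = cur, nxt
    return order


def partner_pairs(adj, complement):
    """Group vertices into pairs: neighbours in a matching graph,
    non-neighbours in the complement of a matching graph."""
    everyone, seen, pairs = set(adj), set(), []
    for u in adj:
        if u in seen:
            continue
        (v,) = (everyone - adj[u] - {u}) if complement else adj[u]
        pairs.append((u, v))
        seen.update((u, v))
    return pairs


def pair_cell(adj_a, adj_b, rng=random):
    """Return an isomorphism {vertex of A: vertex of B} between two cell copies."""
    kind = cell_type(adj_a)
    if kind != cell_type(adj_b) or len(adj_a) != len(adj_b):
        raise ValueError("the two cells are not copies of the same graph")
    if kind in ("empty", "complete"):
        targets = list(adj_b)
        rng.shuffle(targets)  # any bijection works for these two types
        return dict(zip(adj_a, targets))
    if kind == "cycle5":
        walk_a = cycle_order(adj_a, rng.choice(list(adj_a)))
        walk_b = cycle_order(adj_b, rng.choice(list(adj_b)))
        return dict(zip(walk_a, walk_b))  # match along the two cycles
    is_complement = kind == "co-matching"
    pairs_a = partner_pairs(adj_a, is_complement)
    pairs_b = partner_pairs(adj_b, is_complement)
    rng.shuffle(pairs_b)
    mapping = {}
    for (u, u_partner), (v, v_partner) in zip(pairs_a, pairs_b):
        mapping[u], mapping[u_partner] = v, v_partner
    return mapping
```

The sketch returns one random isomorphism per call. Enumerating all of them, as the text describes, would replace the random choices by loops over every possible choice.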

## Appendix B Limitations of ultrametrics

The cost matrix of Eq. (2.3) is easily seen not to be in accordance with the strong triangle inequality. Consider the cost matrix obtained for a graph with vertices compared to itself. Let and and consider . We have and and not specified by . However, and, thus, a contradiction to the strong triangle inequality, unless . Therefore, we have to modify the definition of the cost matrix. The entries in the upper right and lower left corners were introduced with the argument that every node can be inserted and deleted at most once (Riesen & Bunke, 2009). This, however, is already guaranteed, since the assignment is a bijection. We simplify the cost matrix as follows:

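$$
C_{\text{ultra}} = \left[\begin{array}{cccc|cccc}
c_{1,1} & c_{1,2} & \cdots & c_{1,m} & \tau & \tau & \cdots & \tau \\
c_{2,1} & c_{2,2} & \cdots & c_{2,m} & \tau & \tau & \ddots & \vdots \\
\vdots & \vdots & \ddots & \vdots & \vdots & \ddots & \ddots & \tau \\
c_{n,1} & c_{n,2} & \cdots & c_{n,m} & \tau & \cdots & \tau & \tau \\
\hline
\tau & \tau & \cdots & \tau & 0 & 0 & \cdots & 0 \\
\tau & \tau & \ddots & \vdots & 0 & 0 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \tau & \vdots & \ddots & \ddots & 0 \\
\tau & \cdots & \tau & \tau & 0 & \cdots & 0 & 0
\end{array}\right], \tag{2}
$$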

where $\tau$ is the cost for vertex deletion and insertion. Moreover, we assume that the vertex substitution costs for , and can be represented by an ultrametric tree . We can extend this tree for vertex insertion and deletion as defined by by adding a node to the root, where the edge to the parent has weight . Just like the matrix (2) contains additional rows and columns for vertex insertion and deletion, we associate artificial vertices with the node via the map . Note that these can be matched at zero cost at , which represents the bottom right submatrix of (2) filled with $0$.
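To make the block structure of matrix (2) concrete, the following Python sketch assembles it from an $n \times m$ table of substitution costs and a single insertion/deletion cost $\tau$. The function names `build_c_ultra` and `brute_force_assignment_cost` are hypothetical. The brute-force solver enumerates all permutations and is only meant for checking tiny examples; it is not the linear-time method developed in this paper.

```python
from itertools import permutations


def build_c_ultra(sub_costs, tau):
    """Assemble the (n+m) x (m+n) matrix of Eq. (2).

    sub_costs is an n x m list of lists with the substitution costs c[i][j];
    tau is the common cost for deleting or inserting a vertex.
    """
    n, m = len(sub_costs), len(sub_costs[0])
    # Top rows: substitution costs followed by an n x n block of tau (deletion).
    top = [list(row) + [tau] * n for row in sub_costs]
    # Bottom rows: an m x m block of tau (insertion) followed by zeros.
    bottom = [[tau] * m + [0] * n for _ in range(m)]
    return top + bottom


def brute_force_assignment_cost(matrix):
    """Minimum assignment cost by exhaustive search (tiny matrices only)."""
    size = len(matrix)
    return min(
        sum(matrix[i][perm[i]] for i in range(size))
        for perm in permutations(range(size))
    )


# Two vertices compared against one vertex, with tau = 2.
c_ultra = build_c_ultra([[1.0], [3.0]], tau=2.0)
best_cost = brute_force_assignment_cost(c_ultra)
```

In this small example, the cheapest assignment substitutes the first vertex at cost 1 and deletes the second at cost $\tau = 2$, while the remaining artificial row is matched to an artificial column at zero cost.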

## Appendix C Dataset statistics

The statistics of the graph datasets used are summarised in Table 2. All datasets are publicly available for download (Riesen & Bunke, 2008; Kersting et al., 2016).