 # Wasserstein-based Graph Alignment

We propose a novel method for comparing non-aligned graphs of different sizes, based on the Wasserstein distance between graph signal distributions induced by the respective graph Laplacian matrices. Specifically, we cast a new formulation for the one-to-many graph alignment problem, which aims at matching a node in the smaller graph with one or more nodes in the larger graph. By integrating optimal transport in our graph comparison framework, we generate both a structurally-meaningful graph distance, and a signal transportation plan that models the structure of graph data. The resulting alignment problem is solved with stochastic gradient descent, where we use a novel Dykstra operator to ensure that the solution is a one-to-many (soft) assignment matrix. We demonstrate the performance of our novel framework on graph alignment and graph classification, and we show that our method leads to significant improvements with respect to the state-of-the-art algorithms for each of these tasks.


## 1 Introduction

The importance of graphs has recently increased in various tasks in different application domains, such as molecule modeling, brain connectivity analysis, or social network inference. Even if this development is partially fostered by powerful mathematical tools to model structural data, important questions are still largely open. In particular, it remains challenging to align, classify, predict or cluster graphs, since the notion of similarity between graphs is not straightforward. In many cases (e.g., dynamically changing graphs, multilayer graphs, etc.), even a consistent enumeration of the vertices cannot be trivially chosen for all graphs under study.

When two graphs are not aligned a priori, graph matching must be performed prior to any comparison, leading to the challenging problem of estimating an unknown assignment between their vertices. Since this problem is NP-hard, there exist several relaxations that can be solved by minimizing a suitable distance between graphs under the quadratic assignment model, such as the $\ell_2$-norm between the graph adjacency matrices (Yu et al., 2018), or the Gromov-Wasserstein distance (Xu et al., 2019). However, these approaches may yield solutions that are unable to capture the importance of edges with respect to the overall structure of the graph. An alternative that seems more appropriate for graph comparison is based on the Wasserstein distance between the graph signal distributions (Petric Maretic et al., 2019), but it is currently limited to graphs of the same size.

In this paper, we consider the challenging alignment problem for graphs of different sizes. In particular, we build on (Petric Maretic et al., 2019) and formulate graph matching as a one-to-many soft-assignment problem, where we consider the Wasserstein distance to measure the goodness of graph alignment in a structurally meaningful way. To accommodate the nonconvexity of the problem, we propose a stochastic formulation based on a novel Dykstra operator to implicitly ensure that the solution is a one-to-many soft-assignment matrix. This allows us to devise an efficient algorithm based on stochastic gradient descent, which naturally integrates Bayesian exploration in the optimization process, so as to help find better local minima. We illustrate the benefits of our new graph comparison framework in representative tasks such as graph alignment and graph classification on synthetic and real datasets. Our results show that the Wasserstein distance combined with the one-to-many graph assignment outperforms both the Gromov-Wasserstein and Euclidean distances in these tasks, suggesting that our approach outputs a structurally meaningful distance to efficiently align and compare graphs. These are important elements in graph analysis, comparison, or graph signal prediction tasks.

The paper is structured as follows. Section 3 presents the graph alignment problem with optimal transport, as well as the formulation of the one-to-many assignment problem. Section 4 introduces our new Dykstra operator and proposes an algorithm for solving the resulting optimization problem via stochastic gradient descent. In Section 5, the performance of the proposed approach is assessed on synthetic and real data, and compared to different state-of-the-art methods. Finally, Section 6 concludes the paper.

## 2 Related work

Numerous methods have been developed for graph alignment, whose goal is to match the vertices of two graphs such that the similarity of the resulting aligned graphs is maximized. This problem is typically formulated under the quadratic assignment model (Yan et al., 2016; Jiang et al., 2017), which is generally thought to be essential for obtaining a good matching, despite being NP-hard. The main body of research in graph matching is thus focused on devising more accurate and/or faster algorithms to solve this problem approximately (Neuhaus et al., 2006).

In order to deal with the NP-hardness of graph alignment, spectral clustering based approaches (Caelli and Kosinov, 2004; Srinivasan et al., 2007) relax permutation matrices into semi-orthogonal ones, at the price of a suboptimal matching accuracy. Alternatively, semi-definite programming can be used to relax the permutation matrices into semi-definite ones (Schellewald and Schnörr, 2005). Spectral properties have also been used to inspect graphs and define different classes of graphs for which convex relaxations are tight (Aflalo et al., 2015; Fiori and Sapiro, 2015; Dym et al., 2017). Based on the assumption that the space of doubly-stochastic matrices is a convex hull of the set of permutation matrices, the graph matching problem was relaxed into a nonconvex quadratic problem (Cho et al., 2010; Zhou and Torre, 2016). A related approach was recently proposed to approximate discrete graph matching in the continuous domain by using nonseparable functions (Yu et al., 2018). Along similar lines, a Gumbel-Sinkhorn network was proposed to infer permutations from data (Mena et al., 2018; Emami and Ranka, 2018) and align graphs with the Sinkhorn operator (Sinkhorn, 1964) to predict a soft permutation matrix.

Closer to our framework, some recent works studied the graph alignment problem from an optimal transport perspective. Flamary et al. (2014) proposed a method to compute an optimal transportation plan by controlling the displacement of vertex pairs. Gu et al. (2015) defined a spectral distance by assigning a probability measure to the nodes via the spectrum representation of each graph, and by using Wasserstein distances between probability measures. This approach however does not take into account the full graph structure in the alignment problem. Later, Nikolentzos et al. (2017) proposed instead to use the Wasserstein distance for matching the graph embeddings represented as bags of vectors.

Another line of work looked at more specific graphs. Mémoli (2011) investigated the Gromov-Wasserstein distance for object matching, Peyré et al. (2016) proposed an efficient algorithm to compute the Gromov-Wasserstein distance and the barycenter of pairwise dissimilarity matrices, and Xu et al. (2019) devised a scalable version of the Gromov-Wasserstein distance for graph matching and classification. More recently, Vayer et al. (2018) built on this work to propose a distance for graphs and signals living on them, which is a combination of the Gromov-Wasserstein distance of graph distance matrices and the Wasserstein distance of graph signals. However, while the above methods solve the alignment problem using optimal transport, the simple distances between aligned graphs do not take into account their global structure, and the methods do not consider the transportation of signals between graphs.

## 3 Problem Formulation

Despite recent advances in the analysis of graph data, it remains challenging to define a meaningful distance between graphs. Moreover, a major difficulty with graph representations is the lack of node alignment, which is necessary for direct quantitative comparisons between graphs. We propose to use the Wasserstein distance to compare graphs (Petric Maretic et al., 2019), since it has been shown to take into account global structural differences between graphs. Then, we formulate graph alignment as the problem of finding the assignment matrix that minimizes the distance between graphs of different sizes.

### 3.1 Preliminaries

##### Optimal transport

Let $\nu$ and $\mu$ be two arbitrary probability measures on two spaces $\mathcal{X}$ and $\mathcal{Y}$. The Wasserstein distance (also referred to as the Kantorovich-Monge-Rubinstein distance), arising from the Monge and Kantorovich optimal transport problem, can be defined as finding a map $T:\mathcal{X}\to\mathcal{Y}$ that minimizes

$$W_2^2(\nu,\mu)=\inf_{T_\#\nu=\mu}\int_{\mathcal{X}}\|x-T(x)\|^2\,d\nu(x), \tag{1}$$

where $T_\#\nu=\mu$ means that $T$ pushes forward the mass from $\nu$ to $\mu$. Intuitively, $T$ can be seen as a function that preserves positivity and total mass, i.e., moving an entire probability mass on $\mathcal{X}$ to an entire probability mass on $\mathcal{Y}$. Equation (1) can be seen as the minimal cost needed to transport one probability measure to another with respect to the quadratic cost $\|x-T(x)\|^2$.

The Wasserstein distance between Gaussian distributions has an explicit expression in terms of their mean vectors and covariance matrices. With the zero-mean distributions $\nu=\mathcal{N}(0,\Sigma_1)$ and $\mu=\mathcal{N}(0,\Sigma_2)$, the above distance can be written as (Takatsu and others, 2011)

$$W_2^2(\nu,\mu)=\mathrm{Tr}(\Sigma_1+\Sigma_2)-2\,\mathrm{Tr}\Big(\sqrt{\Sigma_1^{1/2}\,\Sigma_2\,\Sigma_1^{1/2}}\Big), \tag{2}$$

and the optimal map that takes $\nu$ to $\mu$ is

$$T(x)=\Sigma_1^{-1/2}\big(\Sigma_1^{1/2}\,\Sigma_2\,\Sigma_1^{1/2}\big)^{1/2}\,\Sigma_1^{-1/2}\,x. \tag{3}$$
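The closed forms for zero-mean Gaussians can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the helper names (`psd_sqrt`, `gaussian_w2_squared`, `gaussian_ot_map`) and the two covariance matrices are illustrative assumptions. The transport map is written with inverse square roots of $\Sigma_1$ as outer factors, so that $T\,\Sigma_1\,T^\top=\Sigma_2$.

```python
import numpy as np

def psd_sqrt(S):
    """Square root of a symmetric positive semi-definite matrix."""
    S = 0.5 * (S + S.T)                      # remove round-off asymmetry
    w, U = np.linalg.eigh(S)
    return (U * np.sqrt(np.clip(w, 0.0, None))) @ U.T

def gaussian_w2_squared(S1, S2):
    """Squared 2-Wasserstein distance between N(0, S1) and N(0, S2)."""
    r = psd_sqrt(S1)
    return np.trace(S1 + S2) - 2.0 * np.trace(psd_sqrt(r @ S2 @ r))

def gaussian_ot_map(S1, S2):
    """Linear map T with T S1 T^T = S2 (S1 is assumed invertible here)."""
    r = psd_sqrt(S1)
    r_inv = np.linalg.inv(r)
    return r_inv @ psd_sqrt(r @ S2 @ r) @ r_inv

# Hypothetical covariance matrices, chosen only for illustration.
S1 = np.array([[2.0, 0.5], [0.5, 1.0]])
S2 = np.array([[1.0, -0.2], [-0.2, 3.0]])

T = gaussian_ot_map(S1, S2)
print("W2^2(S1, S2):", gaussian_w2_squared(S1, S2))
print("T S1 T^T recovers S2:", np.allclose(T @ S1 @ T.T, S2))
```

The last line checks the push-forward property: if $x\sim\mathcal{N}(0,\Sigma_1)$, then $Tx$ has covariance $T\Sigma_1T^\top$, which should equal $\Sigma_2$.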
##### Smooth graph signals

Let $G=(V,E)$ be a graph defined on a set $V$ of $|V|$ vertices, with (non-negative) similarity edge weights. We denote by $W\in\mathbb{R}^{|V|\times|V|}$ the weighted adjacency matrix of $G$, and by $D$ the diagonal matrix of vertex degrees $D_{ii}=\sum_j W_{ij}$ for all $i=1,\dots,|V|$. The Laplacian matrix of $G$ is thus defined as $L=D-W$.

We further assume that each vertex of the graph is associated with a scalar feature, forming a graph signal. We denote this graph signal as a vector $x\in\mathbb{R}^{|V|}$. Following (Rue and Held, 2005), we interpret graphs as key elements that drive the probability distributions of signals, and thus we consider that a graph signal follows a normal distribution with zero mean and covariance matrix $L^\dagger$:

$$x\sim\nu^{G}=\mathcal{N}(0,L^\dagger), \tag{4}$$

where $\dagger$ denotes the pseudoinverse operator. The above formulation means that the graph signal varies slowly between strongly connected nodes (Dong et al., 2016). This assumption is verified for most common graph and network datasets. It is further used in many graph inference algorithms that implicitly represent a graph through its smooth signals (Dempster, 1972; Friedman et al., 2008; Dong et al., 2018). Furthermore, the smoothness assumption is used as regularization in many graph applications, such as robust principal component analysis (Shahid et al., 2015) and label propagation (Zhu et al., 2003).
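To make the smooth-signal model concrete, the sketch below builds the Laplacian of a small path graph (an arbitrary example, not taken from the paper) and draws signals from $\mathcal{N}(0,L^\dagger)$. The quantity $x^\top L x$ measures how much a signal varies across edges; under this model its expectation equals the rank of $L$, which is 3 for a connected graph on 4 nodes.

```python
import numpy as np

# Weighted adjacency matrix of a 4-node path graph (illustrative choice).
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(W.sum(axis=1))        # diagonal degree matrix
L = D - W                         # combinatorial graph Laplacian

# Factor the pseudoinverse: L^+ = A A^T, with A built from the eigenpairs of L.
w, U = np.linalg.eigh(L)
nonzero = w > 1e-10
inv_w = np.zeros_like(w)
inv_w[nonzero] = 1.0 / w[nonzero]
A = U * np.sqrt(inv_w)

# Draw 1000 signals x = A z with z ~ N(0, I), so that x ~ N(0, L^+).
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 4)) @ A.T

# Average Laplacian quadratic form x^T L x over the samples.
smoothness = np.einsum("si,ij,sj->s", X, L, X).mean()
print("mean x^T L x:", smoothness)
```

Sampling through the eigendecomposition avoids passing a singular covariance matrix to a generic Gaussian sampler.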

### 3.2 One-to-many assignment problem

Assume that we are given two graphs $G_1$ and $G_2$ with the same number of nodes, and that we have knowledge of the one-to-one mapping between their vertices.

Following (Petric Maretic et al., 2019), instead of comparing graphs directly, we look at their signal distributions, which are governed by the graphs. Specifically, we measure the dissimilarity between two aligned graphs $G_1$ and $G_2$ through the Wasserstein distance of the respective distributions $\nu^{G_1}$ and $\mu^{G_2}$, which can be calculated explicitly as

$$W_2^2(\nu^{G_1},\mu^{G_2})=\mathrm{Tr}\big(L_1^\dagger+L_2^\dagger\big)-2\,\mathrm{Tr}\Big(\sqrt{L_1^{\dagger/2}\,L_2^\dagger\,L_1^{\dagger/2}}\Big). \tag{5}$$

The advantage of this distance over more traditional graph distances (e.g., graph edit distance) is that it takes into account the importance of an edge to the graph structure. This makes it possible to better capture topological features in the distance metric. Another advantage is that the Wasserstein distance comes with a transport map that can transfer signals from one graph to the other. Hence, the mapping of signals over graphs yields

$$T(x)=L_1^{\dagger/2}\big(L_1^{\dagger/2}\,L_2^\dagger\,L_1^{\dagger/2}\big)^{\dagger/2}\,L_1^{\dagger/2}\,x, \tag{6}$$

which represents the signal $x$, originally living on graph $G_1$, adapted to the structure of graph $G_2$.
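As an illustration of the distance in Equation (5), the sketch below evaluates it for two small aligned graphs. It is a minimal NumPy example under stated assumptions: the graphs (a path and a cycle on four nodes) and the helper names are hypothetical, and the pseudoinverses are computed directly with `numpy.linalg.pinv`.

```python
import numpy as np

def psd_sqrt(S):
    """Square root of a symmetric positive semi-definite matrix."""
    S = 0.5 * (S + S.T)
    w, U = np.linalg.eigh(S)
    return (U * np.sqrt(np.clip(w, 0.0, None))) @ U.T

def laplacian(W):
    """Combinatorial Laplacian L = D - W of a weighted adjacency matrix."""
    return np.diag(W.sum(axis=1)) - W

def graph_w2_squared(L1, L2):
    """Equation (5) for two aligned graphs with the same number of nodes."""
    S1, S2 = np.linalg.pinv(L1), np.linalg.pinv(L2)
    r = psd_sqrt(S1)
    return np.trace(S1 + S2) - 2.0 * np.trace(psd_sqrt(r @ S2 @ r))

# Two aligned 4-node graphs: a path, and the same path closed into a cycle.
W_path = np.array([[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [0, 0, 1, 0]], dtype=float)
W_cycle = W_path.copy()
W_cycle[0, 3] = W_cycle[3, 0] = 1.0

L_path, L_cycle = laplacian(W_path), laplacian(W_cycle)
print("path vs path :", graph_w2_squared(L_path, L_path))
print("path vs cycle:", graph_w2_squared(L_path, L_cycle))
```

The first value should vanish up to round-off, since the two signal distributions coincide.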

The above Wasserstein distance requires the two graphs to be of the same size. However, we want to compare graphs of different sizes as well, which represents a common setting in practice. Throughout the rest of this work, we will consider two graphs $G_1=(V_1,E_1)$ and $G_2=(V_2,E_2)$, and we arbitrarily pick $G_1$ as the graph with the smaller number of nodes.

We now compare graphs of different sizes by looking for the one-to-many assignment between their vertices, similarly to (Zaslavskiy et al., 2010). This is illustrated in the toy example of Figure 1, where every vertex of the smaller graph $G_1$ is assigned to one or more vertices in the larger graph $G_2$, and every vertex of $G_2$ is assigned to exactly one vertex in $G_1$. Let $k_{\max}$ be the maximum number of nodes in $G_2$ matched to a single node in $G_1$. Such a one-to-many assignment can be described by a matrix $P\in\mathbb{R}^{|V_1|\times|V_2|}$ satisfying the constraints

$$\mathcal{C}_{\rm hard}=\Big\{P\in\{0,1\}^{|V_1|\times|V_2|}\;\Big|\;P\,\mathbf{1}_{|V_2|}\in[1,k_{\max}]^{|V_1|},\;\;P^\top\mathbf{1}_{|V_1|}=\mathbf{1}_{|V_2|}\Big\}. \tag{7}$$

In words, the matrix $P$ only takes values zero or one, which corresponds to a hard assignment. Moreover, the sum of each matrix row has to be between $1$ and $k_{\max}$, ensuring that every vertex of $G_1$ is matched to at least one and at most $k_{\max}$ vertices of $G_2$. Finally, the sum of each matrix column has to be exactly one, so that every vertex of $G_2$ is matched to exactly one vertex of $G_1$. To ensure that $\mathcal{C}_{\rm hard}$ is a nonempty constraint set, we require that

$$1\le k_{\max}\le 1+|V_2|-|V_1|. \tag{8}$$
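The row and column conditions described above are easy to verify programmatically. The following sketch uses a hypothetical helper, `is_hard_assignment`, and a made-up example with $|V_1|=3$ and $|V_2|=4$, for which the bound on $k_{\max}$ gives $k_{\max}\le 2$.

```python
import numpy as np

def is_hard_assignment(P, k_max):
    """Check the one-to-many hard-assignment constraints on a |V1| x |V2| matrix."""
    P = np.asarray(P)
    binary = np.isin(P, (0, 1)).all()          # entries are zero or one
    rows = P.sum(axis=1)                        # matches per node of the smaller graph
    cols = P.sum(axis=0)                        # matches per node of the larger graph
    rows_ok = (rows >= 1).all() and (rows <= k_max).all()
    cols_ok = (cols == 1).all()
    return bool(binary and rows_ok and cols_ok)

# Node 0 of the smaller graph is matched to nodes 0 and 1 of the larger graph.
P = np.array([[1, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1]])

print(is_hard_assignment(P, k_max=2))   # valid: one row sums to 2, columns sum to 1
print(is_hard_assignment(P, k_max=1))   # invalid: the first row exceeds k_max
```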

Given the true assignment matrix $P_*\in\mathcal{C}_{\rm hard}$, the larger graph $G_2$ can be aligned to the smaller graph $G_1$ by transforming its Laplacian matrix as $P_*L_2P_*^\top$ (Zaslavskiy et al., 2010), yielding an associated distribution of signals:

$$\mu^{G_2}_{P_*}=\mathcal{N}\big(0,(P_*L_2P_*^\top)^\dagger\big). \tag{9}$$

The graph alignment with the one-to-many assignment solution thus naturally leads to the use of $W_2^2(\nu^{G_1},\mu^{G_2}_{P_*})$ of Equation (5) for evaluating the distance between graphs that originally have different sizes. (It is not a distance in the theoretical sense. For brevity, we will use the term “distance” with an abuse of terminology.)
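A small worked example may help to see what the transformation of the larger Laplacian by a hard assignment matrix does. In the sketch below (an illustrative example, not from the paper), the first two nodes of a 4-node path are merged; the transformed matrix is again a valid Laplacian, namely that of a 3-node path. Its rows sum to zero because every column of the assignment matrix sums to one.

```python
import numpy as np

# Laplacian of the larger graph: a path on 4 nodes (0 - 1 - 2 - 3).
W2 = np.array([[0, 1, 0, 0],
               [1, 0, 1, 0],
               [0, 1, 0, 1],
               [0, 0, 1, 0]], dtype=float)
L2 = np.diag(W2.sum(axis=1)) - W2

# Hard one-to-many assignment: nodes 0 and 1 of the larger graph are merged.
P = np.array([[1, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)

L2_aligned = P @ L2 @ P.T
print(L2_aligned)
# [[ 1. -1.  0.]
#  [-1.  2. -1.]
#  [ 0. -1.  1.]]
```

The edge inside the merged group disappears, while edges leaving the group are summed into the merged node.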

Of course, the true assignment matrix is often unknown beforehand. We are thus interested in estimating the best alignment, or equivalently in finding the assignment matrix $P$ that minimizes the distance between two graphs $G_1$ and $G_2$, leading to the optimization problem

$$\underset{P\in\mathcal{C}_{\rm hard}}{\mathrm{minimize}}\;\;W_2^2\big(\nu^{G_1},\mu^{G_2}_{P}\big). \tag{10}$$

The main difficulty in solving Problem (10) arises from the constraint $\mathcal{C}_{\rm hard}$ defined in (7), since it leads to a discrete optimization problem with a factorial number of feasible solutions. To circumvent this issue, we propose a relaxation of the one-to-many assignment problem in the next section.

## 4 Optimization algorithm

To deal with the nonconvexity of the alignment problem in Equation (10), we rely on two main ideas. Firstly, we relax the binary constraint into the unit interval, so that $P$ becomes a soft-assignment matrix belonging to the set

$$\mathcal{C}_{\rm soft}=\Big\{P\in[0,1]^{|V_1|\times|V_2|}\;\Big|\;P\,\mathbf{1}_{|V_2|}\in[1,k_{\max}]^{|V_1|},\;\;P^\top\mathbf{1}_{|V_1|}=\mathbf{1}_{|V_2|}\Big\}. \tag{11}$$

Secondly, we enforce the relaxed constraints implicitly using the Dykstra operator

$$\mathcal{A}_\tau:\mathbb{R}^{|V_1|\times|V_2|}\to\mathcal{C}_{\rm soft}, \tag{12}$$

which transforms a rectangular matrix into a soft-assignment matrix, as explained in Section 4.1. This operator can be injected into the cost function to remove all the constraints, thus yielding the new unconstrained optimization problem

$$\underset{\tilde{P}\in\mathbb{R}^{|V_1|\times|V_2|}}{\mathrm{minimize}}\;\;W_2^2\big(\nu^{G_1},\mu^{G_2}_{\mathcal{A}_\tau(\tilde{P})}\big). \tag{13}$$

Problem (13) is highly nonconvex, which may cause gradient descent to converge towards a local minimum. As we will see in Section 4.2, using the Dykstra operator will allow us to devise a stochastic formulation that can be efficiently solved with a variant of gradient descent integrating Bayesian exploration in the optimization process, possibly helping the algorithm to find better local minima.

### 4.1 Dykstra operator

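Given a rectangular matrix $\tilde{P}\in\mathbb{R}^{|V_1|\times|V_2|}$ and a small constant $\tau>0$, the Dykstra operator normalizes the rows and columns of $\exp(\tilde{P}/\tau)$ to obtain a one-to-many assignment matrix, where a node in the smaller graph is matched to one or more (but at most $k_{\max}$) nodes in the larger graph. It is defined as

$$\mathcal{A}_\tau(\tilde{P})=\underset{P\in\mathcal{C}_{\rm soft}}{\mathrm{argmax}}\;\Big[\langle P,\tilde{P}\rangle-\tau\sum_{ij}P_{ij}\log(P_{ij})\Big]. \tag{14}$$

This operator can be efficiently computed by the Dykstra algorithm (Dykstra, 1983) with Bregman projections (Bauschke and Lewis, 2000). Indeed, Problem (14) can be written as a Kullback-Leibler (KL) projection (Benamou et al., 2015)

$$\mathcal{A}_\tau(\tilde{P})=\underset{P\in\mathcal{C}^{(0)}\cap\,\mathcal{C}^{(1)}}{\mathrm{argmin}}\;\mathrm{KL}\big(P\,\big|\,\exp(\tilde{P}/\tau)\big), \tag{15}$$

with

$$\begin{aligned}\mathcal{C}^{(0)}&=\big\{\Xi\in\mathbb{R}_+^{|V_1|\times|V_2|}\;\big|\;\Xi\,\mathbf{1}_{|V_2|}\in[1,k_{\max}]^{|V_1|}\big\},\\ \mathcal{C}^{(1)}&=\big\{\Xi\in\mathbb{R}_+^{|V_1|\times|V_2|}\;\big|\;\Xi^\top\mathbf{1}_{|V_1|}=\mathbf{1}_{|V_2|}\big\}.\end{aligned} \tag{16}$$

The Dykstra algorithm starts by initializing

$$P^{[0]}=\exp(\tilde{P}/\tau)\quad\text{and}\quad Q^{[0]}=Q^{[-1]}=\mathbf{1}_{|V_1|\times|V_2|}, \tag{17}$$

and then iterates for every $t=0,1,\dots$

$$P^{[t+1]}=\mathcal{P}^{\rm KL}_{\mathcal{C}^{(t\bmod 2)}}\big(P^{[t]}\odot Q^{[t-1]}\big), \tag{18}$$

$$Q^{[t+1]}=Q^{[t-1]}\odot\frac{P^{[t]}}{P^{[t+1]}}, \tag{19}$$

where all operations are meant entry-wise, and $\odot$ denotes the entry-wise (Hadamard) product of matrices. The KL projections are defined, for every $\Xi\in\mathbb{R}_+^{|V_1|\times|V_2|}$, as follows

$$\mathcal{P}^{\rm KL}_{\mathcal{C}^{(0)}}(\Xi)=\mathrm{diag}\Bigg(\bigg[\frac{\max\{1,\min\{\sum_j\Xi_{ij},k_{\max}\}\}}{\sum_j\Xi_{ij}}\bigg]_i\Bigg)\,\Xi, \tag{20}$$

$$\mathcal{P}^{\rm KL}_{\mathcal{C}^{(1)}}(\Xi)=\Xi\,\mathrm{diag}\Bigg(\bigg[\frac{1}{\sum_i\Xi_{ij}}\bigg]_j\Bigg). \tag{21}$$

In the limit $\tau\to 0$, the operator $\mathcal{A}_\tau$ yields a one-to-many assignment matrix. It is also differentiable (Luise et al., 2018), and can thus be used in a cost function optimized by gradient descent, as we will see in Section 4.2.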
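The iterations in Equations (17)-(21) translate almost line by line into code. The sketch below is a plain NumPy reading of those equations, not the authors' PyTorch implementation: the function name `dykstra_operator`, the default of 20 iterations, and the random test input are assumptions made for illustration. An even number of iterations is used so that the last step is the column projection.

```python
import numpy as np

def dykstra_operator(P_tilde, tau, k_max, n_iter=20):
    """Soft one-to-many assignment via Dykstra iterations with KL projections."""
    P = np.exp(P_tilde / tau)            # P^[0]
    Q_prev = np.ones_like(P)             # Q^[t-1], starts as Q^[-1]
    Q_curr = np.ones_like(P)             # Q^[t],   starts as Q^[0]
    for t in range(n_iter):
        X = P * Q_prev                   # P^[t] (.) Q^[t-1]
        if t % 2 == 0:
            # Row projection: rescale each row so its sum lies in [1, k_max].
            rows = X.sum(axis=1, keepdims=True)
            P_next = X * (np.clip(rows, 1.0, k_max) / rows)
        else:
            # Column projection: rescale each column so it sums to one.
            P_next = X / X.sum(axis=0, keepdims=True)
        Q_next = Q_prev * P / P_next     # Q^[t+1] = Q^[t-1] (.) P^[t] / P^[t+1]
        P, Q_prev, Q_curr = P_next, Q_curr, Q_next
    return P

rng = np.random.default_rng(0)
P_tilde = rng.standard_normal((3, 5))    # |V1| = 3, |V2| = 5, so k_max <= 3
P_soft = dykstra_operator(P_tilde, tau=1.0, k_max=3, n_iter=1000)

print("column sums:", P_soft.sum(axis=0))
print("row sums   :", P_soft.sum(axis=1))
```

For very small values of $\tau$ the exponential may overflow; this sketch does not include any numerical stabilization.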

#### 4.1.1 Connection to Sinkhorn

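In the special case where the two graphs have the same size $|V_1|=|V_2|=|V|$, the condition in (8) leads to $k_{\max}=1$, and thus $\mathcal{C}_{\rm soft}$ reduces to the space of doubly-stochastic matrices. The Dykstra operator then reverts to a Sinkhorn operator (Sinkhorn, 1964; Cuturi, 2013; Genevay et al., 2018; Mena et al., 2018; Petric Maretic et al., 2019). Given a square matrix $\tilde{P}\in\mathbb{R}^{|V|\times|V|}$ and a small constant $\tau>0$, the Sinkhorn operator normalizes the rows and columns of $\exp(\tilde{P}/\tau)$ so as to obtain a doubly stochastic matrix. Formally, it is defined as

$$\mathcal{S}_\tau(\tilde{P})=\underset{P\in\mathcal{C}_{\rm doubly}}{\mathrm{argmax}}\;\Big[\langle P,\tilde{P}\rangle-\tau\sum_{ij}P_{ij}\log(P_{ij})\Big], \tag{22}$$

where $\mathcal{C}_{\rm doubly}$ is the set of doubly stochastic matrices

$$\mathcal{C}_{\rm doubly}=\Big\{P\in[0,1]^{|V|\times|V|}\;\Big|\;P\,\mathbf{1}_{|V|}=\mathbf{1}_{|V|},\;\;P^\top\mathbf{1}_{|V|}=\mathbf{1}_{|V|}\Big\}. \tag{23}$$

It is well known that the above operator can be computed with the following iterations

$$\begin{aligned}P^{[0]}&=\exp(\tilde{P}/\tau),\\ L^{[t]}&=\mathrm{diag}\big(P^{[t]}\mathbf{1}_{|V|}\big)^{-1},\\ R^{[t]}&=\mathrm{diag}\big(\mathbf{1}_{|V|}^\top L^{[t]}P^{[t]}\big)^{-1},\\ P^{[t+1]}&=L^{[t]}P^{[t]}R^{[t]}.\end{aligned} \tag{24}$$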
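The Sinkhorn iterations alternate a row normalization (left scaling) and a column normalization (right scaling). A minimal NumPy sketch is given below; the function name, the iteration count and the random input are illustrative assumptions.

```python
import numpy as np

def sinkhorn_operator(P_tilde, tau, n_iter=50):
    """Approximately doubly stochastic matrix from alternating normalizations."""
    P = np.exp(P_tilde / tau)                       # P^[0]
    for _ in range(n_iter):
        P = P / P.sum(axis=1, keepdims=True)        # left scaling: rows sum to one
        P = P / P.sum(axis=0, keepdims=True)        # right scaling: columns sum to one
    return P

rng = np.random.default_rng(0)
P_ds = sinkhorn_operator(rng.standard_normal((4, 4)), tau=1.0, n_iter=500)

print("row sums   :", P_ds.sum(axis=1))
print("column sums:", P_ds.sum(axis=0))
```

Each pass of the loop corresponds to one update of $P^{[t]}$ in the iterations above, with the two diagonal scalings applied in sequence.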

In the limit $\tau\to 0$, the operator $\mathcal{S}_\tau$ yields a permutation matrix (Mena et al., 2018). It is also differentiable (Luise et al., 2018), and can thus be used in a cost function optimized by gradient descent, as we will see in Section 4.2.

Figure 2: Alignment and detection of communities in structured graphs, showing the recovery of local changes, as well as the global graph structure. The experiment is performed on distorted stochastic block model graphs as a function of the percentage of fused nodes. The graph $G_2$ is a stochastic block model graph with four communities and 24 nodes. The graph $G_1$ is a random distorted version of $G_2$, where the edges are collapsed until the target percentage of nodes is fused. We compare three different distances: Gromov-Wasserstein (GW), the $\ell_2$ distance defined as $\|L_1-PL_2P^\top\|_2$ and solved using our stochastic algorithm, and the proposed distance $W_2^2(\nu^{G_1},\mu^{G_2}_{P})$. The first plot shows the $\ell_2$ distance between aligned graphs (the closer to 0 the better), while the second one shows the community detection performance using spectral clustering in terms of Normalized Mutual Information (NMI, the closer to 1 the better).

### 4.2 Stochastic formulation

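With the help of the Dykstra operator, the cost function in Problem (13) becomes differentiable, and can thus be optimized by gradient descent. However, the nonconvex nature of the problem may cause gradient descent to converge towards a local minimum. Instead of directly solving Problem (13), we propose to optimize the expectation w.r.t. the parameters $\theta$ of some distribution $q_\theta$, yielding

$$\underset{\theta}{\mathrm{minimize}}\;\;\mathbb{E}_{\tilde{P}\sim q_\theta}\Big\{W_2^2\big(\nu^{G_1},\mu^{G_2}_{\mathcal{A}_\tau(\tilde{P})}\big)\Big\}. \tag{25}$$

The optimization of the expectation w.r.t. the parameters $\theta$ aims at shaping the distribution $q_\theta$ so as to put all its mass on a minimizer of the original cost function, thus integrating the use of Bayesian exploration in the optimization process, possibly helping the algorithm to find better local minima.

A standard choice for $q_\theta$ in continuous optimization is the multivariate normal distribution, leading to $\theta=(\eta,\sigma)$ with $\eta$ and $\sigma$ being $|V_1|\times|V_2|$ matrices of entry-wise means and standard deviations. We leverage the reparameterization trick (Kingma and Welling, 2014; Figurnov et al., 2018), which boils down to setting

$$\tilde{P}_{ij}=\eta_{ij}+\sigma_{ij}\,\epsilon_{ij}\quad\text{with}\quad\epsilon_{ij}\sim\mathcal{N}(0,1). \tag{26}$$

The problem of Equation (25) can thus be reformulated as

$$\underset{\eta,\sigma}{\mathrm{minimize}}\;\;J(\eta,\sigma):=\mathbb{E}_{\epsilon\sim q_{\rm unit}}\Big\{W_2^2\big(\nu^{G_1},\mu^{G_2}_{\mathcal{A}_\tau(\eta+\sigma\odot\epsilon)}\big)\Big\}, \tag{27}$$

where $q_{\rm unit}$ denotes the multivariate normal distribution with zero mean and unit variance. The advantage of this reformulation is that the gradient of the above stochastic function can be approximated by sampling from the parameterless distribution $q_{\rm unit}$, yielding

$$\nabla J(\eta,\sigma)\approx\sum_{\epsilon\sim q_{\rm unit}}\nabla W_2^2\big(\nu^{G_1},\mu^{G_2}_{\mathcal{A}_\tau(\eta+\sigma\odot\epsilon)}\big). \tag{28}$$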

The problem can be thus solved by stochastic gradient descent (Khan et al., 2017). Our approach is summarized in Algorithm 1.
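The sampled gradient of the stochastic formulation can be illustrated on a toy problem. In the sketch below, a simple quadratic cost with a known gradient stands in for the Wasserstein cost composed with the Dykstra operator, which would require automatic differentiation; the function `reparam_gradients`, the toy cost and all numerical values are assumptions made for illustration only. The gradients are averaged over the samples rather than summed, which only changes the scale.

```python
import numpy as np

def reparam_gradients(grad_f, eta, sigma, n_samples, rng):
    """Monte Carlo gradients of E[f(eta + sigma * eps)] w.r.t. eta and sigma."""
    g_eta = np.zeros_like(eta)
    g_sigma = np.zeros_like(sigma)
    for _ in range(n_samples):
        eps = rng.standard_normal(eta.shape)     # sample from the unit normal
        g = grad_f(eta + sigma * eps)            # gradient of f at the sampled point
        g_eta += g                               # chain rule: d(P_tilde)/d(eta) = 1
        g_sigma += g * eps                       # chain rule: d(P_tilde)/d(sigma) = eps
    return g_eta / n_samples, g_sigma / n_samples

# Toy stand-in cost f(P) = ||P - target||^2, with gradient 2 (P - target).
target = np.array([[1.0, 0.0, -1.0],
                   [0.5, -0.5, 0.0]])
grad_f = lambda P: 2.0 * (P - target)

eta = np.zeros((2, 3))
sigma = 0.5 * np.ones((2, 3))
rng = np.random.default_rng(0)
g_eta, g_sigma = reparam_gradients(grad_f, eta, sigma, n_samples=20000, rng=rng)

print("gradient w.r.t. eta  :", np.round(g_eta, 2))
print("gradient w.r.t. sigma:", np.round(g_sigma, 2))
```

For this toy cost the exact gradients are $2(\eta-\text{target})$ and $2\sigma$, so the estimates can be compared against them. In a stochastic gradient descent loop, these estimates would be used to update $\eta$ and $\sigma$ at every iteration.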

Under mild assumptions, the algorithm converges almost surely to a critical point, which is not guaranteed to be the global minimum, as the problem is nonconvex. The computational complexity of a naive implementation is cubic in the number of nodes per iteration, due to the matrix square-root operation, but faster options exist to approximate this operation (Lin and Maji, 2017). Moreover, the computation of pseudo-inverses can be avoided by adding a small diagonal shift to the Laplacian matrices and directly computing the inverse matrices, which is orders of magnitude faster.
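The effect of the diagonal shift can be checked on a small example. The sketch below compares the distance computed from pseudo-inverses with the one computed from inverses of shifted Laplacians. The shift value `eps`, the two graphs and the helper names are assumptions for illustration. For two connected graphs of the same size, the large $1/\texttt{eps}$ terms along the constant vector appear in both covariances and cancel in the distance, so the two values should be close.

```python
import numpy as np

def psd_sqrt(S):
    """Square root of a symmetric positive semi-definite matrix."""
    S = 0.5 * (S + S.T)
    w, U = np.linalg.eigh(S)
    return (U * np.sqrt(np.clip(w, 0.0, None))) @ U.T

def w2_from_covariances(S1, S2):
    """Squared 2-Wasserstein distance between N(0, S1) and N(0, S2)."""
    r = psd_sqrt(S1)
    return np.trace(S1 + S2) - 2.0 * np.trace(psd_sqrt(r @ S2 @ r))

def laplacian(W):
    return np.diag(W.sum(axis=1)) - W

# Two connected 4-node graphs: a path and a cycle (illustrative choice).
W_path = np.array([[0, 1, 0, 0],
                   [1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [0, 0, 1, 0]], dtype=float)
W_cycle = W_path.copy()
W_cycle[0, 3] = W_cycle[3, 0] = 1.0
L1, L2 = laplacian(W_path), laplacian(W_cycle)

eps = 1e-4                      # small diagonal shift (hypothetical value)
I = np.eye(4)
d_pinv = w2_from_covariances(np.linalg.pinv(L1), np.linalg.pinv(L2))
d_shift = w2_from_covariances(np.linalg.inv(L1 + eps * I), np.linalg.inv(L2 + eps * I))

print("pseudo-inverse:", d_pinv)
print("shifted inverse:", d_shift)
```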

## 5 Experiments

We now analyse the performance of our new algorithm in two parts. Firstly, we assess the performance achieved by our approach for graph alignment and community detection in structured graphs, testing the preservation of both local and global graph properties. We investigate the influence of distance on alignment recovery and compare to methods using different definitions of graph distance for graph alignment. Secondly, we extend our analysis to graph classification, where we compare our approach with several state-of-the-art methods.

Prior to running experiments, we determined the Dykstra operator parameter $\tau$ and the SGD step size with grid search, while the sampling size was fixed empirically. In all experiments, we set , and . We set the maximal number of Dykstra iterations to 20, and we run stochastic gradient descent for 1000 iterations. As our algorithm seems robust to different initializations, we used random initialization in all our experiments. The algorithm was implemented in PyTorch with the AMSGrad method (Reddi et al., 2018).

Figure 3: Alignment and detection of communities in structured graphs, showing the recovery of local changes, as well as the global graph structure. The experiment is performed on random instances of stochastic block model graphs as a function of the graph size. The graph $G_2$ is a stochastic block model graph with four communities and 24 nodes. The graph $G_1$ is a random stochastic block model graph with four communities and a different number of nodes. We compare three different methods: Gromov-Wasserstein (GW), the $\ell_2$ distance defined as $\|L_1-PL_2P^\top\|_2$ and solved using our stochastic algorithm, and the proposed distance $W_2^2(\nu^{G_1},\mu^{G_2}_{P})$. The first plot shows the $\ell_2$ distance between aligned graphs (the closer to 0 the better), while the second one shows the community detection performance using spectral clustering in terms of Normalized Mutual Information (NMI, the closer to 1 the better).

### 5.1 Graph alignment and community detection

In this section, we test our proposed approach for graph alignment and recovery of communities in structured graphs. Namely, apart from the direct comparison of two graph matrices, we evaluate the preservation of global properties by comparing the clustering of nodes into communities. We consider two experimental settings. In the first one (Figure 2), we generate a stochastic block model graph $G_2$ with 24 nodes and 4 communities. The graph $G_1$ is a noisy version of $G_2$ constructed by randomly collapsing edges, merging two connected nodes into one, until a target percentage of nodes is merged. We then generate a random permutation to change the order of the nodes in graph $G_1$.

In the second experimental setting (Figure 3), the graph $G_2$ is again generated as a stochastic block model with four communities. For each graph size, six graphs $G_1$ are created as random instances of stochastic block model graphs with the same number of communities, but with a different number of vertices and edges. Apart from the number of communities, there is no direct connection between $G_1$ and $G_2$.

We investigate the influence of a distance metric on alignment recovery. We compare three different methods for graph alignment, namely the proposed method based on the Wasserstein distance between graphs, the proposed stochastic algorithm with the Euclidean distance ($\ell_2$), and the state-of-the-art Gromov-Wasserstein distance (Peyré et al., 2016) for graphs (GW), using the Euclidean distance between shortest path matrices, as proposed in (Vayer et al., 2018). We repeat each experiment 50 times, after adjusting parameters for all compared methods, and show the results in Figures 2 and 3.

We now evaluate the structure recovery of the community-based models through spectral clustering. Namely, after alignment estimation, we cluster the nodes in both graphs. A good alignment should detect and preserve communities, keeping the nodes in the same clusters, close to their original neighbours, even when the exact neighbours are not recovered. We evaluate the quality of community recovery with normalized mutual information (NMI) between the clusters in the original graph and the recovered clusters. We further evaluate the alignment quality by checking the difference between the two graphs in terms of the $\ell_2$ norm. While it is not the best possible distance measure for graphs, it is used here as a complementary measure to the NMI, not taking any special structural information into account. It can also be seen as an unbiased metric to compare the two methods performing the best in terms of NMI.

As shown in Figure 2, the proposed approach manages to capture the structural information and outperforms methods based on different distance metrics, especially under large perturbations. In Figure 3, we observe an increase in performance in terms of NMI for both $W_2^2$ and $\ell_2$. The emergence of this phenomenon despite the growing size difference between compared graphs suggests that our assignment matrix has the ability to fuse nodes into meaningful groups, forming well-defined clusters.

### 5.2 Graph classification

We now tackle the task of graph classification on two different datasets: PTC (Kriege et al., 2016) and IMDB-B (Yanardag and Vishwanathan, 2015). We randomly sample 100 graphs from each dataset. The graphs have a different number of nodes and edges. We use $W_2^2$ to align graphs and compute graph distances, and eventually use a simple non-parametric 1-NN classification algorithm to classify graphs. We compare the classification performance with methods where the same 1-NN classifier is used with different state-of-the-art methods for graph alignment: GW (Peyré et al., 2016; Vayer et al., 2018), GA (Gold and Rangarajan, 1996), IPFP (Leordeanu et al., 2009), RRWM (Cho et al., 2010), NetLSD (Tsitsulin et al., 2018), and the proposed stochastic algorithm with the Euclidean distance ($\ell_2$) instead of the Wasserstein distance in Eq. (25). We present the accuracy scores in Table 1, where the classification with the proposed $W_2^2$ clearly outperforms the other methods in terms of general accuracy. Furthermore, we analyse the performance of $W_2^2$, GW and $\ell_2$ on several examples from the two datasets.

Figure 4: PTC dataset with two classes. Each row presents a set of graph examples, from left to right: $G_1$, $G_2$, $G_3$ and $G_4$. $G_1$ and $G_2$ belong to class 0. $G_3$ and $G_4$ belong to class 1. Each table provides two kinds of distances: intra-class ($D(G_1,G_2)$ and $D(G_3,G_4)$) and inter-class ($D(G_1,G_3)$). We evaluate three different methods in terms of distances in order to classify the graphs (e.g. $D(G_1,G_2)\le D(G_1,G_3)$ or $D(G_3,G_4)\le D(G_1,G_3)$).

Figure 5: IMDB-B dataset with two classes. Each row presents a set of graph examples, from left to right: $G_1$, $G_2$, $G_3$ and $G_4$. $G_1$ and $G_2$ belong to class 0. $G_3$ and $G_4$ belong to class 1. Each table provides two kinds of distances: intra-class ($D(G_1,G_2)$ and $D(G_3,G_4)$) and inter-class ($D(G_1,G_3)$). We evaluate three different methods in terms of distances in order to classify the graphs (e.g. $D(G_1,G_2)\le D(G_1,G_3)$ or $D(G_3,G_4)\le D(G_1,G_3)$).
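Once pairwise graph distances are available, the 1-NN classification step only needs a distance matrix. The sketch below shows this step on made-up numbers: the function `one_nn_predict`, the distance values and the labels are all hypothetical, and the train/test protocol is an assumption, since it is not detailed here.

```python
import numpy as np

def one_nn_predict(D_test_train, y_train):
    """Label each test graph with the label of its closest training graph."""
    nearest = np.argmin(D_test_train, axis=1)   # index of the nearest training graph
    return np.asarray(y_train)[nearest]

# Hypothetical distances between 2 test graphs (rows) and 3 training graphs (columns).
D_test_train = np.array([[0.1, 2.0, 1.5],
                         [1.8, 0.2, 0.3]])
y_train = np.array([0, 1, 1])

y_pred = one_nn_predict(D_test_train, y_train)
print(y_pred)   # [0 1]
```

Any of the compared distances can be plugged into the same classifier by replacing the entries of the distance matrix.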

### PTC dataset

The PTC dataset contains the molecular structure of the NTP dataset. Figure 4 presents a set of graph examples from two different classes (0 and 1). In the first example (first row), $W_2^2$ outperforms both GW and $\ell_2$ in separating the two classes. The distinguishing feature between the compared graphs is the number of nodes that form the ring, which has been captured by $W_2^2$, thanks to the soft permutation applied to the larger graph.

The second example shows in a very intuitive way how and GW are able to capture structural similarities in graphs, even when those largely vary in size. This is especially clear when comparing the almost two times larger , and , , with structurally very similar and , and an easy-to-imagine assignment of one node in the graph to several nodes in the graph . However, it is not always as simple to understand the similarities. The third row shows an example in which all the three methods fail to find structural similarities with graphs in the same class.

### IMDB-B dataset

The IMDB-B dataset contains two classes: comedy and science-fiction movies, with several examples shown in Figure 5. The striking difference between examples 2 and 3 shows that, while taking into account the global graph structure can be crucial in distinguishing some samples (second row), it remains a challenging dataset with very similar graphs often belonging to different clusters (third row). This possibly explains the low accuracy across all examined methods. However, example 1 shows the high flexibility of the assignment matrix proposed in our algorithm, where the one-to-many assignment is able to detect that a graph is very close to a graph with 2 communities, even if it technically has 3. This combination of putting emphasis on structural information and allowing for flexibility might be the reason why $W_2^2$ still manages to outperform the other investigated methods.

## 6 Conclusion

In this paper, we have proposed a new method to align graphs of different sizes. Equipped with an optimal transport based approach to compute the distance between two smooth graph distributions associated to each graph, we have formulated a new one-to-many alignment problem to find a soft assignment matrix that minimizes the “mass” transportation from a fixed distribution to a permuted and partially merged distribution. The resulting nonconvex optimization problem is solved efficiently with a novel stochastic gradient descent algorithm. It allows us to align and compare graphs, and it outputs a structurally meaningful distance. We have shown the performance of the proposed method in the context of graph alignment and graph classification. Our results show that the proposed algorithm outperforms state-of-the-art alignment methods for structured graphs.

## References

• Y. Aflalo, A. Bronstein, and R. Kimmel (2015) On convex relaxation of graph isomorphism. Proceedings of the National Academy of Sciences 112 (10), pp. 2942–2947. Cited by: §2.
• H. H. Bauschke and A. S. Lewis (2000) Dykstra's algorithm with Bregman projections: a convergence proof. Optimization 48 (4), pp. 409–427. Cited by: §4.1.
• J.-D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré (2015) Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing 37 (2), pp. A1111–A1138. Cited by: §4.1.
• T. Caelli and S. Kosinov (2004) An eigenspace projection clustering method for inexact graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (4), pp. 515–519. Cited by: §2.
• M. Cho, J. Lee, and K. M. Lee (2010) Reweighted random walks for graph matching. In European Conference on Computer Vision, pp. 492–505. Cited by: §2, §5.2.
• M. Cuturi (2013) Sinkhorn distances: lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), pp. 2292–2300. Cited by: §4.1.1.
• A. P. Dempster (1972) Covariance selection. Biometrics, pp. 157–175. Cited by: §3.1.
• X. Dong, D. Thanou, P. Frossard, and P. Vandergheynst (2016) Learning Laplacian matrix in smooth graph signal representations. IEEE Transactions on Signal Processing 64 (23), pp. 6160–6173. Cited by: §3.1.
• X. Dong, D. Thanou, M. Rabbat, and P. Frossard (2018) Learning graphs from data: a signal representation perspective. Preprint arXiv:1806.00848. Cited by: §3.1.
• R. L. Dykstra (1983) An algorithm for restricted least squares regression. Journal of the American Statistical Association 78 (384), pp. 837–842. Cited by: §4.1.
• N. Dym, H. Maron, and Y. Lipman (2017) DS++: a flexible, scalable and provably tight relaxation for matching problems. arXiv preprint arXiv:1705.06148. Cited by: §2.
• P. Emami and S. Ranka (2018) Learning permutations with Sinkhorn policy gradient. Preprint arXiv:1805.07010. Cited by: §2.
• M. Figurnov, S. Mohamed, and A. Mnih (2018) Implicit reparameterization gradients. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 441–452. Cited by: §4.2.
• M. Fiori and G. Sapiro (2015) On spectral properties for graph matching and graph isomorphism problems. Information and Inference: A Journal of the IMA 4 (1), pp. 63–76. Cited by: §2.
• R. Flamary, N. Courty, A. Rakotomamonjy, and D. Tuia (2014) Optimal transport with Laplacian regularization. In NIPS 2014, Workshop on Optimal Transport and Machine Learning, Montréal, Canada. Cited by: §2.
• J. Friedman, T. Hastie, and R. Tibshirani (2008) Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9 (3), pp. 432–441. Cited by: §3.1.
• A. Genevay, G. Peyré, and M. Cuturi (2018) Learning generative models with Sinkhorn divergences. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, A. Storkey and F. Perez-Cruz (Eds.), Proceedings of Machine Learning Research, Vol. 84, Playa Blanca, Lanzarote, Canary Islands, pp. 1608–1617. Cited by: §4.1.1.
• S. Gold and A. Rangarajan (1996) A graduated assignment algorithm for graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence 18 (4), pp. 377–388. Cited by: §5.2.
• J. Gu, B. Hua, and S. Liu (2015) Spectral distances on graphs. Discrete Applied Mathematics 190-191, pp. 56–74. Cited by: §2.
• B. Jiang, J. Tang, C. Ding, Y. Gong, and B. Luo (2017) Graph matching via multiplicative update algorithm. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 3187–3195. Cited by: §2.
• M. E. Khan, W. Lin, V. Tangkaratt, Z. Liu, and D. Nielsen (2017) Variational adaptive-Newton method for explorative learning. Preprint arXiv:1711.05560. Cited by: §4.2.
• D. P. Kingma and M. Welling (2014) Auto-encoding variational Bayes. Preprint arXiv:1312.6114. Cited by: §4.2.
• N. M. Kriege, P.-L. Giscard, and R. Wilson (2016) On valid optimal assignment kernels and applications to graph classification. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Eds.), pp. 1623–1631. Cited by: §5.2.
• M. Leordeanu, M. Hebert, and R. Sukthankar (2009) An integer projected fixed point method for graph matching and map inference. In Advances in Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta (Eds.), pp. 1114–1122. Cited by: §5.2.
• T.-Y. Lin and S. Maji (2017) Improved bilinear pooling with CNNs. In British Machine Vision Conference, London, UK. Cited by: §4.2.
• G. Luise, A. Rudi, M. Pontil, and C. Ciliberto (2018) Differential properties of Sinkhorn approximation for learning with Wasserstein distance. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 5859–5870. Cited by: §4.1.1, §4.1.
• F. Mémoli (2011) Gromov–Wasserstein distances and the metric approach to object matching. Foundations of Computational Mathematics 11 (4), pp. 417–487. Cited by: §2.
• G. Mena, D. Belanger, S. Linderman, and J. Snoek (2018) Learning latent permutations with Gumbel-Sinkhorn networks. In International Conference on Learning Representations. Cited by: §2, §4.1.1.
• M. Neuhaus, K. Riesen, and H. Bunke (2006) Fast suboptimal algorithms for the computation of graph edit distance. In SSPR/SPR, D. Yeung, J. T. Kwok, A. L. N. Fred, F. Roli, and D. de Ridder (Eds.), Lecture Notes in Computer Science, Vol. 4109, pp. 163–172. Cited by: §2.
• G. Nikolentzos, P. Meladianos, and M. Vazirgiannis (2017) Matching node embeddings for graph similarity. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §2.
• H. Petric Maretic, M. El Gheche, G. Chierchia, and P. Frossard (2019) GOT: an optimal transport framework for graph comparison. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 13876–13887. Cited by: §1, §1, §3.2, §3, §4.1.1.
• G. Peyré, M. Cuturi, and J. Solomon (2016) Gromov-Wasserstein averaging of kernel and distance matrices. In International Conference on Machine Learning, M. F. Balcan and K. Q. Weinberger (Eds.), Proceedings of Machine Learning Research, Vol. 48, New York, New York, USA, pp. 2664–2672. Cited by: §2, §5.1, §5.2.
• S. J. Reddi, S. Kale, and S. Kumar (2018) On the convergence of Adam and beyond. In International Conference on Learning Representations. Cited by: §5.
• H. Rue and L. Held (2005) Gaussian Markov random fields: theory and applications. Chapman and Hall/CRC. Cited by: §3.1.
• C. Schellewald and C. Schnörr (2005) Probabilistic subgraph matching based on convex relaxation. In Energy Minimization Methods in Computer Vision and Pattern Recognition, A. Rangarajan, B. Vemuri, and A. L. Yuille (Eds.), Berlin, Heidelberg, pp. 171–186. Cited by: §2.
• N. Shahid, V. Kalofolias, X. Bresson, M. Bronstein, and P. Vandergheynst (2015) Robust principal component analysis on graphs. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2812–2820. Cited by: §3.1.
• R. Sinkhorn (1964) A relationship between arbitrary positive matrices and doubly stochastic matrices. The Annals of Mathematical Statistics 35 (2), pp. 876–879. Cited by: §2, §4.1.1.
• P. Srinivasan, T. Cour, and J. Shi (2007) Balanced graph matching. In Advances in Neural Information Processing Systems, B. Schölkopf, J. C. Platt, and T. Hoffman (Eds.), pp. 313–320. Cited by: §2.
• A. Takatsu et al. (2011) Wasserstein geometry of Gaussian measures. Osaka Journal of Mathematics 48 (4), pp. 1005–1026. Cited by: §3.1.
• A. Tsitsulin, D. Mottin, P. Karras, A. Bronstein, and E. Müller (2018) NetLSD: hearing the shape of a graph. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2347–2356. Cited by: §5.2.
• T. Vayer, L. Chapel, R. Flamary, R. Tavenard, and N. Courty (2018) Optimal transport for structured data. Preprint arXiv:1805.09114. Cited by: §2, §5.1, §5.2.
• H. Xu, D. Luo, and L. Carin (2019) Scalable Gromov-Wasserstein learning for graph partitioning and matching. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 3046–3056. Cited by: §1, §2.
• J. Yan, X. Yin, W. Lin, C. Deng, H. Zha, and X. Yang (2016) A short survey of recent advances in graph matching. In International Conference on Multimedia Retrieval, New York, NY, USA, pp. 167–174. Cited by: §2.
• P. Yanardag and S.V.N. Vishwanathan (2015) Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’15, New York, NY, USA, pp. 1365–1374. Cited by: §5.2.
• T. Yu, J. Yan, Y. Wang, W. Liu, and B. Li (2018) Generalizing graph matching beyond quadratic assignment model. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 853–863. Cited by: §1, §2.
• M. Zaslavskiy, F. Bach, and J.-P. Vert (2010) Many-to-many graph matching: a continuous relaxation approach. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 515–530. Cited by: §3.2, §3.2.
• F. Zhou and F. D. Torre (2016) Factorized graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (9), pp. 1774–1789. Cited by: §2.
• X. Zhu, Z. Ghahramani, and J. D. Lafferty (2003) Semi-supervised learning using Gaussian fields and harmonic functions. In International Conference on Machine Learning, pp. 912–919. Cited by: §3.1.