Scalable Gromov-Wasserstein Learning for Graph Partitioning and Matching

05/18/2019 ∙ by Hongteng Xu, et al. ∙ 0

We propose a scalable Gromov-Wasserstein learning (S-GWL) method and establish a novel and theoretically-supported paradigm for large-scale graph analysis. The proposed method is based on the fact that Gromov-Wasserstein discrepancy is a pseudometric on graphs. Given two graphs, the optimal transport associated with their Gromov-Wasserstein discrepancy provides the correspondence between their nodes and achieves graph matching. When one of the graphs has isolated but self-connected nodes (i.e., a disconnected graph), the optimal transport indicates the clustering structure of the other graph and achieves graph partitioning. Using this concept, we extend our method to multi-graph partitioning and matching by learning a Gromov-Wasserstein barycenter graph for multiple observed graphs; the barycenter graph plays the role of the disconnected graph, and since it is learned, so is the clustering. Our method combines a recursive $K$-partition mechanism with a regularized proximal gradient algorithm, whose time complexity is $\mathcal{O}(K(E+V)\log_K V)$ for graphs with $V$ nodes and $E$ edges. To our knowledge, our method is the first attempt to make Gromov-Wasserstein discrepancy applicable to large-scale graph analysis and unify graph partitioning and matching into the same framework. It outperforms state-of-the-art graph partitioning and matching methods, achieving a trade-off between accuracy and efficiency.

1 Introduction

Gromov-Wasserstein distance sturm2006geometry ; memoli2011gromov was originally designed for metric-measure spaces. It measures distances between distributions in a relational way and derives an optimal transport between the samples in distinct spaces. Recently, the work in chowdhury2018gromov proved that this distance can be extended to Gromov-Wasserstein discrepancy (GW discrepancy) peyre2016gromov , which defines a pseudometric for graphs. Accordingly, the optimal transport between two graphs indicates the correspondence between their nodes. This work theoretically supports the applications of GW discrepancy to structural data analysis, e.g., 2D/3D object matching memoli2004comparing ; memoli2009spectral ; bronstein2010gromov , molecule analysis vayer2018fused ; vayer2018optimal , network alignment xu2019gromov , etc.

Unfortunately, although GW discrepancy-based methods are attractive theoretically, they are often inapplicable to large-scale graphs because of their high computational complexity. Additionally, these methods are designed for two-graph matching, ignoring the potential of GW discrepancy for other tasks, like graph partitioning and multi-graph matching. As a result, the partitioning and the matching of large-scale graphs still typically rely on heuristic methods girvan2002community ; clauset2004finding ; vijayan2015magna++ ; malod2015graal , whose performance is often sub-optimal, especially in noisy cases.

Figure 1: (a) Graph matching. (b) Graph partitioning. (c) Multi-graph matching. (d) Multi-graph partitioning. (e) Comparisons on accuracy and efficiency. Panels (a)-(d) illustrate graph partitioning and matching in the GWL framework. For panel (e), we test various graph matching methods in 10 trials on an Intel i7 CPU. In each trial, the source graph has 2,000 nodes and the target graph has 100 more noisy nodes and corresponding edges. The graphs follow either a Gaussian random partition model brandes2003experiments or a Barabási-Albert model barabasi2016network . The GW discrepancy-based methods obtain higher node correctness than the other baselines, and our S-GWL achieves a trade-off between accuracy and efficiency.

Focusing on the issues above, we design a scalable Gromov-Wasserstein learning (S-GWL) method and establish a new and unified paradigm for large-scale graph partitioning and matching. As illustrated in Figure 1(a), given two graphs, the optimal transport associated with their Gromov-Wasserstein discrepancy provides the correspondence between their nodes. Similarly, graph partitioning corresponds to calculating the Gromov-Wasserstein discrepancy between an observed graph and a disconnected graph, as shown in Figure 1(b). The optimal transport connects each node of the observed graph with an isolated node of the disconnected graph, yielding a partitioning. In Figures 1(c) and 1(d), taking advantage of the Gromov-Wasserstein barycenter in peyre2016gromov , we achieve multi-graph matching and partitioning by learning a “barycenter graph”. For arbitrary two or more graphs, the correspondence (or the clustering structure) among their nodes can be established indirectly through their optimal transports to the barycenter graph.

The four tasks in Figures 1(a)-1(d) are explicitly unified in our Gromov-Wasserstein learning (GWL) framework, which corresponds to the same GW discrepancy-based optimization problem. To improve its scalability, we introduce a recursive mechanism to the GWL framework, which recursively applies $K$-way partitioning to decompose large graphs into a set of aligned sub-graph pairs, and then matches each pair of sub-graphs. When calculating GW discrepancy, we design a regularized proximal gradient method that considers the prior information of nodes and performs updates by solving a series of convex sub-problems. The sparsity of edges further helps us reduce computations. These acceleration strategies yield our S-GWL method: for graphs with $V$ nodes and $E$ edges, its time complexity is $\mathcal{O}((E+V)\log V)$ and memory complexity is $\mathcal{O}(E+V)$. To our knowledge, our S-GWL is the first to make GW discrepancy applicable to large-scale graph analysis. Figure 1(e) illustrates the effectiveness of S-GWL on graph matching, with more results presented in Section 5.

2 Graph Analysis Based on Gromov-Wasserstein Learning

Denote a measure graph as $G(\mathcal{V}, C, \mu)$, where $\mathcal{V} = \{v_i\}_{i=1}^{|\mathcal{V}|}$ is the set of nodes, $C = [c_{ij}] \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}$ is the adjacency matrix, and $\mu = [\mu_i]$ is a Borel probability measure defined on $\mathcal{V}$. In practice, $\mu$ is an empirical distribution of nodes, which can be estimated by normalized node degree. A $K$-way graph partitioning aims to decompose a graph $G$ into $K$ sub-graphs by clustering its nodes, i.e., $\{G_k = G(\mathcal{V}_k, C_k, \mu_k)\}_{k=1}^{K}$, where $\cup_k \mathcal{V}_k = \mathcal{V}$ and $\mathcal{V}_k \cap \mathcal{V}_{k'} = \emptyset$ for $k \neq k'$. Given two graphs $G_s$ and $G_t$, graph matching aims to find a correspondence between their nodes, i.e., $\pi: \mathcal{V}_s \mapsto \mathcal{V}_t$. Many real-world networks are modeled using graph theory, and graph partitioning and matching are important for community detection karypis1998fast ; girvan2002community and network alignment sharan2006modeling ; singh2008global ; zhang2015multiple , respectively. In this section, we propose a Gromov-Wasserstein learning framework to unify these two problems.

2.1 Gromov-Wasserstein discrepancy between graphs

Our GWL framework is based on Gromov-Wasserstein distance:

Definition 2.1.

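Let $(X, d_X, \mu_X)$ and $(Y, d_Y, \mu_Y)$ be two metric measure spaces, where $(X, d_X)$ is a compact metric space and $\mu_X$ is a Borel probability measure on $X$ (with $(Y, d_Y, \mu_Y)$ defined in the same way). For $p \in [1, \infty)$, the $p$-th order Gromov-Wasserstein distance $d_{GW}(\mu_X, \mu_Y)$ is defined as

$$\inf_{\pi \in \Pi(\mu_X, \mu_Y)} \Big( \iint L(x, y, x', y')^p \, d\pi(x, y) \, d\pi(x', y') \Big)^{1/p},$$

where $L(x, y, x', y') = |d_X(x, x') - d_Y(y, y')|$ is the loss function and $\Pi(\mu_X, \mu_Y)$ is the set of all probability measures on $X \times Y$ with $\mu_X$ and $\mu_Y$ as marginals.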

When the distance “$d$” in the definition is not a strict metric, but an arbitrary distance measurement, as is common in practice, we extend the Gromov-Wasserstein distance to GW discrepancy. GW discrepancy calculates distances between pairs of samples within each domain and measures how these distances compare to those in the other domain. Such a property reveals the potential of GW discrepancy to measure the discrepancy between graphs (e.g., networks xu2019gromov , molecules vayer2018fused and 3D meshes bronstein2010gromov ), in which the topology of each graph is given while the correspondence between their nodes is unknown. Specifically, the work in chowdhury2018gromov proves that GW discrepancy is a pseudometric on graphs.

Theorem 2.2 (chowdhury2018gromov ).

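Denote the collection of measure graphs as $\mathcal{G}$. For each $p \in [1, \infty)$ and each $G_s, G_t \in \mathcal{G}$, the Gromov-Wasserstein discrepancy between $G_s$ and $G_t$ is

$$d_{gw}(G_s, G_t) := \min_{T \in \Pi(\mu_s, \mu_t)} \Big( \sum_{i,j \in \mathcal{V}_s} \sum_{i',j' \in \mathcal{V}_t} |c^s_{ij} - c^t_{i'j'}|^p \, T_{ii'} T_{jj'} \Big)^{1/p}, \tag{1}$$

which is a pseudometric on $\mathcal{G}$.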

According to the definition, GW discrepancy measures the discrepancy between two graphs, and the optimal transport $T = [T_{ii'}]$ is a joint distribution of the graphs’ nodes: $T_{ii'}$ indicates the probability that the node $v_i^s \in \mathcal{V}_s$ corresponds to the node $v_{i'}^t \in \mathcal{V}_t$. As shown in Figure 1(a), the optimal transport achieves a probabilistic assignment of the source nodes to the target ones.
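As a small illustration of the quantity being minimized, the following sketch evaluates the sum inside the GW discrepancy for the square loss (order two) at a given transport matrix. The function name `gw_objective` and the NumPy-based layout are assumptions made for this example, not part of the original method; the sketch expands the square so that the four-index sum reduces to matrix products.

```python
import numpy as np


def gw_objective(C_s, C_t, T):
    """Evaluate sum_{i,j,k,l} (C_s[i,j] - C_t[k,l])**2 * T[i,k] * T[j,l].

    C_s: (n, n) adjacency matrix of the source graph.
    C_t: (m, m) adjacency matrix of the target graph.
    T:   (n, m) nonnegative transport matrix (a joint distribution of nodes).
    """
    mu_s = T.sum(axis=1)  # marginal over source nodes
    mu_t = T.sum(axis=0)  # marginal over target nodes
    # Expanding the square gives two marginal terms and one cross term.
    term_s = mu_s @ (C_s ** 2) @ mu_s
    term_t = mu_t @ (C_t ** 2) @ mu_t
    cross = np.sum((C_s @ T @ C_t.T) * T)
    return term_s + term_t - 2.0 * cross


# Matching a two-node graph to itself with the identity coupling costs nothing.
C = np.array([[0.0, 1.0], [1.0, 0.0]])
T_identity = np.diag([0.5, 0.5])
value = gw_objective(C, C, T_identity)
```

Taking the square root of this value at the minimizing transport would give the order-two discrepancy itself; the sketch only evaluates the objective and does not perform the minimization.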

Besides graph matching, this paradigm is also suitable for graph partitioning. Recall that most existing graph partitioning methods obey the modularity maximization principle girvan2002community ; clauset2004finding : for each partitioned sub-graph, its internal edges should be dense, while its external edges connecting with other sub-graphs should be sparse. This principle implies that if we treat each sub-graph as a “super node” karypis1998fast ; wang2011detecting ; paresfluid , an ideal partitioning should correspond to a disconnected graph with isolated, but self-connected super nodes. Therefore, we achieve $K$-way partitioning by calculating the GW discrepancy between the observed graph $G$ and a disconnected graph, i.e., $d_{gw}(G, G_{dc})$, where $G_{dc} = G(\mathcal{V}_{dc}, \mathrm{diag}(\mu_{dc}), \mu_{dc})$ and $|\mathcal{V}_{dc}| = K$. Here, $\mu_{dc}$ is a node distribution, whose derivation is in Appendix A.1, and $\mathrm{diag}(\mu_{dc})$ is the adjacency matrix of $G_{dc}$. As shown in Figure 1(b), the optimal transport is a $|\mathcal{V}| \times K$ matrix. The maximum in each row of the matrix indicates the cluster of a node.
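The last step of the paragraph above, reading clusters off the optimal transport, can be sketched as follows. The helper name `clusters_from_transport` and the toy transport matrix are hypothetical; the sketch assumes the transport has one row per observed node and one column per isolated node of the disconnected graph.

```python
import numpy as np


def clusters_from_transport(T):
    """Assign each node (row) to the cluster (column) with the largest mass."""
    labels = np.argmax(T, axis=1)
    clusters = {k: np.flatnonzero(labels == k).tolist() for k in range(T.shape[1])}
    return labels, clusters


# Toy transport between a 3-node graph and a 2-node disconnected graph.
T = np.array([[0.30, 0.00],
              [0.25, 0.05],
              [0.00, 0.40]])
labels, clusters = clusters_from_transport(T)
```

Ties in a row are broken toward the lower column index by `np.argmax`, which is a property of this sketch rather than something the text specifies.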

2.2 Gromov-Wasserstein barycenter graph for analysis of multiple graphs

Distinct from most graph matching methods gold1996graduated ; cordella2004sub ; sharan2006modeling ; cour2007balanced , which mainly focus on two-graph matching, our GWL framework can be readily extended to multi-graph cases, by introducing the Gromov-Wasserstein barycenter (GWB) proposed in peyre2016gromov . Given a set of graphs $\{G_m\}_{m=1}^{M}$, their Gromov-Wasserstein barycenter is a barycenter graph defined as

$$G(\bar{\mathcal{V}}, \bar{C}, \bar{\mu}) := \arg\min_{\bar{G}} \sum_{m=1}^{M} \omega_m \, d_{gw}^2(G_m, \bar{G}), \tag{2}$$

where $\omega = [\omega_m]$ contains predefined weights, and $\bar{G} = G(\bar{\mathcal{V}}, \bar{C}, \bar{\mu})$ is the barycenter graph with a predefined number of nodes. The barycenter graph minimizes the weighted average of its GW discrepancy to observed graphs. As shown in Figure 1(c), the barycenter graph works as a “reference” connecting with the observed graphs. For each node in the barycenter graph, we can find its matched nodes in different graphs with the help of the corresponding optimal transport. These matched nodes construct a node set, and any two nodes in the set form a correspondence. The collection of all the node sets achieves multi-graph matching.

We can also use the Gromov-Wasserstein barycenter graph to achieve multi-graph partitioning, with the learned barycenter graph playing the role of the aforementioned disconnected graph. Given two or more graphs, whose nodes may have unobserved correspondences, existing partitioning methods karypis1998fast ; girvan2002community ; clauset2004finding ; blondel2008fast ; paresfluid only partition them independently because they are designed for clustering nodes in a single graph. As a result, the first cluster of a graph may correspond to the second cluster of another graph. Without the correspondence between clusters, we cannot reduce the search space in matching tasks. Although this correspondence can be estimated by matching two coarse graphs that treat the clusters as their nodes, this strategy not only introduces additional computations but also leads to more uncertainty on matching, because different graphs are partitioned independently without leveraging structural information from each other. By learning a barycenter graph for multiple graphs, we can partition them and align their clusters simultaneously. As shown in Figure 1(d), when applying $K$-way multi-graph partitioning, we initialize a disconnected graph with $K$ isolated nodes as the barycenter graph, and then learn it by solving (2). For each node of the barycenter graph, its matched nodes in each observed graph belong to the same cluster.

3 Scalable Gromov-Wasserstein Learning

Based on Gromov-Wasserstein discrepancy and the barycenter graph, we have established a GWL framework for graph partitioning and matching. To make this framework scalable to large graphs, we propose a regularized proximal gradient method to calculate GW discrepancy and integrate multiple acceleration strategies to greatly reduce the computational complexity of GWL.

3.1 Regularized proximal gradient method

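Inspired by the work in xie2018fast ; xu2019gromov , we calculate the GW discrepancy in (1) based on a proximal gradient method, which decomposes a complicated non-convex optimization problem into a series of convex sub-problems. For simplification, we set $p = 2$ in (1). Given two graphs $G_s = G(\mathcal{V}_s, C_s, \mu_s)$ and $G_t = G(\mathcal{V}_t, C_t, \mu_t)$, in the $n$-th iteration, we update the current optimal transport $T^{(n)}$ by

$$\begin{aligned} T^{(n+1)} &= \arg\min_{T \in \Pi(\mu_s, \mu_t)} \sum_{i,j \in \mathcal{V}_s} \sum_{i',j' \in \mathcal{V}_t} |c^s_{ij} - c^t_{i'j'}|^2 \, T^{(n)}_{ii'} T_{jj'} + \gamma \, \mathrm{KL}(T \,\|\, T^{(n)}) \\ &= \arg\min_{T \in \Pi(\mu_s, \mu_t)} \langle L(C_s, C_t, T^{(n)}), T \rangle + \gamma \, \mathrm{KL}(T \,\|\, T^{(n)}) \\ &= \arg\min_{T \in \Pi(\mu_s, \mu_t)} \langle L(C_s, C_t, T^{(n)}) - \gamma \log T^{(n)}, T \rangle + \gamma H(T). \end{aligned} \tag{3}$$

Here, $L(C_s, C_t, T) = (C_s \odot C_s) \mu_s 1^\top + 1 \mu_t^\top (C_t \odot C_t)^\top - 2 C_s T C_t^\top$, derived based on peyre2016gromov , and $\langle \cdot, \cdot \rangle$ represents the inner product of two matrices. The Kullback-Leibler (KL) divergence, i.e., $\mathrm{KL}(T \,\|\, T^{(n)}) = \sum_{i,i'} T_{ii'} \log (T_{ii'} / T^{(n)}_{ii'}) - T_{ii'} + T^{(n)}_{ii'}$, is added as the proximal term. The third equality indicates that this can be reformulated as an optimal transport problem with an entropy regularizer benamou2015iterative ; peyre2016gromov , where $H(T) = \sum_{i,i'} T_{ii'} \log T_{ii'}$. We can solve (3) via the Sinkhorn-Knopp algorithm sinkhorn1967concerning ; cuturi2013sinkhorn with linear convergence altschuler2017near . As demonstrated in xu2019gromov , the global convergence of this proximal gradient method is guaranteed, so repeating (3) leads to a stable optimal transport, denoted as $\hat{T}$. Additionally, this method is robust to the hyperparameter $\gamma$, achieving better convergence and numerical stability than the entropy-based method in peyre2016gromov .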

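Learning the barycenter graph is also based on the proximal gradient method. Given $M$ graphs $\{G_m\}_{m=1}^{M}$, we estimate their barycenter graph via alternating optimization. In the $n$-th iteration, given the previous barycenter graph $\bar{G}^{(n)} = G(\bar{\mathcal{V}}, \bar{C}^{(n)}, \bar{\mu})$, we update the $M$ optimal transports via solving (3). Given the updated optimal transports $\{T_m^{(n+1)}\}_{m=1}^{M}$, we update the adjacency matrix of the barycenter graph by

$$\bar{C}^{(n+1)} = \frac{1}{\bar{\mu} \bar{\mu}^\top} \sum_{m=1}^{M} \omega_m \, (T_m^{(n+1)})^\top C_m T_m^{(n+1)}, \tag{4}$$

where $T_m^{(n+1)} \in \mathbb{R}^{|\mathcal{V}_m| \times |\bar{\mathcal{V}}|}$ and the division is elementwise. The weights $\omega$, the number of the nodes $|\bar{\mathcal{V}}|$ and the node distribution $\bar{\mu}$ are predefined.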
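The adjacency update of the barycenter graph can be sketched in a few lines. This is a minimal sketch under an assumption: it uses the closed-form square-loss update from peyre2016gromov (a weighted sum of the transported adjacency matrices, divided elementwise by the outer product of the barycenter node distribution), with each transport stored as a matrix whose rows index the observed graph and whose columns index the barycenter graph. The name `update_barycenter` and its arguments are hypothetical.

```python
import numpy as np


def update_barycenter(adjacencies, transports, weights, mu_bar):
    """One adjacency update of the barycenter graph for the square loss.

    adjacencies: list of (n_m, n_m) adjacency matrices of the observed graphs.
    transports:  list of (n_m, K) transports from each graph to the barycenter.
    weights:     list of nonnegative weights, one per graph.
    mu_bar:      (K,) node distribution of the barycenter graph.
    """
    K = mu_bar.shape[0]
    acc = np.zeros((K, K))
    for C_m, T_m, w_m in zip(adjacencies, transports, weights):
        acc += w_m * (T_m.T @ C_m @ T_m)
    # Elementwise division by the outer product of the node distribution.
    return acc / np.outer(mu_bar, mu_bar)


# Sanity check: one graph, transported onto itself, is its own barycenter.
C = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
mu = np.array([0.25, 0.5, 0.25])
C_bar = update_barycenter([C], [np.diag(mu)], [1.0], mu)
```

The division requires every entry of the barycenter node distribution to be strictly positive, which holds for any distribution estimated from node degrees of connected nodes.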

Different from the work in xu2019gromov ; peyre2016gromov , we use the following initialization strategies to achieve a regularized proximal gradient method and estimate optimal transports with few iterations.

Optimal transports For graph analysis, we can leverage prior knowledge to get a better regularization of optimal transport. Recall that the node distribution of a graph is estimated by the normalized node degree, which reflects the local topology of nodes, e.g., the density of neighbors. Generally, the nodes with similar local topology should be matched with a high probability. Therefore, given two node distributions $\mu_s$ and $\mu_t$, we construct a node-based cost matrix $C_{node} \in \mathbb{R}^{|\mathcal{V}_s| \times |\mathcal{V}_t|}$, whose element is $c_{ij} = |\mu_s^i - \mu_t^j|$, and add a regularization term $\langle C_{node}, T \rangle$ to (3). As a result, in the learning phase, we replace the $L(C_s, C_t, T^{(n)})$ in (3) with $L(C_s, C_t, T^{(n)}) + \tau C_{node}$.

Barycenter graphs When learning GWB, the work in peyre2016gromov fixed the node distribution to be uniform. In practice, however, both the node distribution of the barycenter graph and its optimal transports to observed graphs are unknown. In such a situation, we need to first estimate the node distribution $\bar{\mu}$. Without loss of generality, we assume that the node distribution of the barycenter graph is sorted, i.e., $\bar{\mu}_1 \geq \dots \geq \bar{\mu}_{|\bar{\mathcal{V}}|}$. We estimate the node distribution via the weighted average of the sorted and re-sampled node distributions of observed graphs:

$$\bar{\mu} = \sum_{m=1}^{M} \omega_m \, \mathrm{interpolate}_{|\bar{\mathcal{V}}|}(\mathrm{sort}(\mu_m)), \tag{5}$$

where $\mathrm{sort}(\cdot)$ sorts the elements of the input vector in descending order, and $\mathrm{interpolate}_{|\bar{\mathcal{V}}|}(\cdot)$ samples $|\bar{\mathcal{V}}|$ values from the input vector via bilinear interpolation. Given the node distribution, we initialize the optimal transports via the method mentioned above.
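The estimate in (5) can be sketched as below: sort each observed node distribution in descending order, re-sample it to the barycenter size, and take the weighted average. Two details are assumptions of this sketch rather than statements from the text: the re-sampling uses one-dimensional linear interpolation on an evenly spaced grid (via `np.interp`), and the result is renormalized so that it sums to one. The function name is hypothetical.

```python
import numpy as np


def estimate_barycenter_distribution(node_distributions, weights, num_nodes):
    """Weighted average of sorted, re-sampled node distributions."""
    grid = np.linspace(0.0, 1.0, num_nodes)
    mu_bar = np.zeros(num_nodes)
    for mu, w in zip(node_distributions, weights):
        sorted_mu = np.sort(mu)[::-1]  # descending order
        positions = np.linspace(0.0, 1.0, sorted_mu.shape[0])
        mu_bar += w * np.interp(grid, positions, sorted_mu)
    # Assumption: renormalize, since re-sampling changes the total mass.
    return mu_bar / mu_bar.sum()


mu_a = np.array([0.1, 0.4, 0.2, 0.3])
mu_b = np.array([0.05, 0.3, 0.15, 0.1, 0.25, 0.15])
mu_bar = estimate_barycenter_distribution([mu_a, mu_b], [0.5, 0.5], 3)
```

Because each sorted input is non-increasing, the interpolated and averaged vector is non-increasing as well, matching the sorted-distribution assumption stated in the text.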

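Algorithm 1: Regularized proximal gradient method for the optimal transport between $G_s$ and $G_t$
1:  Set $n = 0$, $a = \mu_s$.
2:  Calculate $C_{node}$ with $\mu_s$ and $\mu_t$.
3:  Initialize $T^{(0)} = \mu_s \mu_t^\top$.
4:  While not converge
5:    $\Phi = \exp(-(L(C_s, C_t, T^{(n)}) + \tau C_{node}) / \gamma) \odot T^{(n)}$.
6:    $b = \mu_t \oslash (\Phi^\top a)$, and $a = \mu_s \oslash (\Phi b)$.
7:    $T^{(n+1)} = \mathrm{diag}(a) \, \Phi \, \mathrm{diag}(b)$, then $n = n + 1$.
8:  Output: $\hat{T} = T^{(n)}$.

Algorithm 2: Learning the Gromov-Wasserstein barycenter graph of $\{G_m\}_{m=1}^{M}$
1:  Set $n = 0$.
2:  Initialize $\bar{\mu}$ via (5). $\bar{C}^{(0)} = \mathrm{diag}(\bar{\mu})$.
3:  While not converge
4:   For $m = 1, \dots, M$
5:    Calculate $T_m^{(n+1)}$ for the pair $(G_m, \bar{G}^{(n)})$ via Algorithm 1.
6:   Calculate $\bar{C}^{(n+1)}$ via (4).
7:   $n = n + 1$.
8:  Output: $\hat{T}_m = T_m^{(n)}$ for $m = 1, \dots, M$.

Algorithms 1 and 2 show the details of our method, where “$\odot$” and “$\oslash$” represent elementwise multiplication and division, respectively. The GWL framework for the tasks in Figures 1(a)-1(d) is implemented based on these two algorithms, with details in Appendix A.1.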
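A proximal gradient loop of this kind can be sketched in NumPy as follows. This is a simplified sketch, not the authors' implementation: it assumes the square loss, runs a fixed number of outer and inner (Sinkhorn) iterations instead of testing convergence, leaves out the node-based regularizer, and uses hypothetical names (`gw_cost`, `proximal_gw`) and default values chosen only for the example.

```python
import numpy as np


def gw_cost(C_s, C_t, T, mu_s, mu_t):
    """Square-loss cost matrix L(C_s, C_t, T) used in each sub-problem."""
    f_s = (C_s ** 2) @ mu_s  # depends on source nodes only
    f_t = (C_t ** 2) @ mu_t  # depends on target nodes only
    return f_s[:, None] + f_t[None, :] - 2.0 * C_s @ T @ C_t.T


def proximal_gw(C_s, C_t, mu_s, mu_t, gamma=0.1, outer_iters=30, inner_iters=20):
    """Approximate the optimal transport by repeated KL-proximal updates."""
    T = np.outer(mu_s, mu_t)  # independent coupling as the starting point
    for _ in range(outer_iters):
        L = gw_cost(C_s, C_t, T, mu_s, mu_t)
        # The KL proximal term turns the previous transport into a kernel prior.
        # Subtracting L.min() only rescales the kernel and helps avoid underflow.
        kernel = np.exp(-(L - L.min()) / gamma) * T
        a = np.ones_like(mu_s)
        for _ in range(inner_iters):  # Sinkhorn-Knopp scaling
            b = mu_t / (kernel.T @ a)
            a = mu_s / (kernel @ b)
        T = a[:, None] * kernel * b[None, :]
    return T


# Match a 4-node path graph with itself, using normalized degrees as masses.
C = np.array([[0.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
mu = C.sum(axis=1) / C.sum()
T_hat = proximal_gw(C, C, mu, mu)
```

Since the row scaling is applied last, the row sums of the returned matrix equal the source distribution up to rounding, while the column sums match the target distribution only approximately after finitely many inner iterations.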

3.2 A recursive $K$-partition mechanism for large-scale graph matching

Assume that the observed graphs have comparable size, whose numbers of nodes and edges are denoted as $V$ and $E$, respectively. When using the proximal gradient method directly to calculate the GW discrepancy between two graphs, the time complexity, in the worst case, is $\mathcal{O}(V^3)$ because the $L(C_s, C_t, T^{(n)})$ in (3) involves $C_s T^{(n)} C_t^\top$. Even if we consider the sparsity of edges and implement sparse matrix multiplications, the time complexity is still as high as $\mathcal{O}(EV)$.

To improve the scalability of our GWL framework, we introduce a recursive $K$-partition mechanism, recursively decomposing observed large graphs to a set of aligned small graphs. As shown in Figure 2(a), given two graphs, we first calculate their barycenter graph (with $K$ nodes) and achieve their joint $K$-way partitioning. For each node of the barycenter graph, the corresponding sub-graphs extracted from the observed two graphs construct an aligned sub-graph pair, shown as the dotted frames connected with grey circles in Figure 2(a). Taking these aligned sub-graph pairs as inputs, we further calculate their barycenter graphs and decompose them into more and smaller sub-graph pairs. Repeating the above step, we finally calculate the GW discrepancy between the sub-graphs in each pair, and find the correspondence between their nodes. Note that this recursive mechanism is also applicable to multi-graph matching: for multiple graphs, in the final step we calculate the GWB among the sub-graphs in each set. The details of our S-GWL method are provided in Appendix A.2.

3.3 Complexity analysis

In Table 1, we compare the time and memory complexity of our S-GWL method with other graph-matching methods. The methods based on the Hungarian algorithm kuhn1955hungarian have time complexity $\mathcal{O}(V^3)$ gold1996graduated ; pachauri2013solving ; yan2015consistency . The time complexity of GHOST patro2012global depends on the largest node degree in a graph, denoted as $d$. Both the Hungarian algorithm and GHOST take the graph affinity matrix as input, so their memory complexity in the worst case is $\mathcal{O}(V^4)$. MI-GRAAL kuchaiev2011integrative , HubAlign hashemifar2014hubalign and NETAL neyshabur2013netal are relatively efficient in time. CPD+Emb first learns $D$-dimensional node embeddings grover2016node2vec , and then registers the embeddings by the CPD method myronenko2010point . The memory complexity of these four methods is $\mathcal{O}(V^2)$. For GW discrepancy-based methods, the GWL+Emb in xu2019gromov achieves graph matching and node embedding jointly, which combines the adjacency matrix of graph with the distance matrix of node embeddings. Because the sparsity of the edges is broken, its time complexity is $\mathcal{O}(V^3)$ and memory complexity is $\mathcal{O}(V^2)$. As mentioned above, the time complexity of GWL is $\mathcal{O}(EV)$. However, because the $L(C_s, C_t, T^{(n)})$ in (3) is a dense matrix, its memory complexity is still $\mathcal{O}(V^2)$. Our S-GWL method combines the recursive mechanism with the regularized proximal gradient method and implements the $C_s T^{(n)} C_t^\top$ in (3) by sparse matrix multiplications. Ideally, we can apply $R = \lfloor \log_K V \rfloor$ recursions. In the $r$-th recursion, for $r = 0, \dots, R-1$, we calculate $K^r$ barycenter graphs for $K^r$ sub-graph pairs. The sub-graphs in each pair have $\mathcal{O}(V / K^r)$ nodes. As a result, we have

Proposition 3.1.

Suppose that we have $M$ graphs, each of which has $V$ nodes and $E$ edges. With the help of the recursive $K$-partition mechanism, the time complexity of our S-GWL method is $\mathcal{O}(MK(E+V)\log_K V)$, and its memory complexity is $\mathcal{O}(M(E+VK))$.

Figure 2: (a) An illustration of S-GWL (scheme of our S-GWL method). (b) Comparisons on runtime.

Table 1: Comparisons for graph matching methods (Hungarian, GHOST, MI-GRAAL, HubAlign, NETAL, CPD+Emb, GWL+Emb, GWL and S-GWL) on time and memory complexity.

Choosing $K = 2$ and ignoring the number of graphs, we obtain the complexity shown in Table 1. Our S-GWL has lower computational time complexity and memory requirements than many existing methods. Figure 2(b) visualizes the runtime of GWL and S-GWL on matching synthetic graphs. The S-GWL methods with different configurations (i.e., the number of partitions $K$ and that of recursions $R$) are consistently faster than GWL. More detailed analysis is provided in Appendix A.3.

4 Related Work

Gromov-Wasserstein learning GW discrepancy has been applied in many matching problems, e.g., registering 3D objects memoli2009spectral ; memoli2011gromov and matching vocabulary sets between different languages alvarez2018gromov . Focusing on graphs, a fused Gromov-Wasserstein distance is proposed in vayer2018optimal ; vayer2018fused , combining GW discrepancy with Wasserstein discrepancy villani2008optimal . The work in xu2019gromov further takes node embedding into account, learning the GW discrepancy between two graphs and their node embeddings jointly. The appropriateness of these methods is supported by chowdhury2018gromov , which proves that GW discrepancy is a pseudometric on measure graphs. Recently, an adversarial learning method based on GW discrepancy was proposed in bunne2018 , which jointly trains two generative models in incomparable spaces. The work in peyre2016gromov further proposes Gromov-Wasserstein barycenters for clustering distributions and interpolating shapes. Currently, GW discrepancy is mainly calculated based on Sinkhorn iterations sinkhorn1967concerning ; cuturi2013sinkhorn ; benamou2015iterative ; peyre2016gromov , whose application to large-scale graphs is challenging because of their high complexity. Our S-GWL method is the first attempt to make GW discrepancy applicable to large-scale graph analysis.

Graph partitioning and graph matching Graph partitioning is important for community detection in networks. Many graph partitioning methods have been proposed, such as Metis karypis1998fast , EdgeBetweenness girvan2002community , FastGreedy clauset2004finding , Label Propagation raghavan2007near , Louvain blondel2008fast and Fluid Community paresfluid . All of these methods explore the clustering structure of nodes heuristically based on the modularity-maximization principle girvan2002community ; clauset2004finding . Graph matching is important for network alignment sharan2006modeling ; singh2008global ; zhang2015multiple and 2D/3D object registration myronenko2010point ; yan2015matrix ; jun2017sequential ; yu2018generalizing . Traditional methods formulate graph matching as a quadratic assignment problem (QAP) and solve it based on the Hungarian algorithm gold1996graduated ; pachauri2013solving ; yan2015matrix ; yan2015consistency , which are only applicable to small graphs. For large graphs like protein networks, many heuristic methods have been proposed, such as GRAAL kuchaiev2010topological , IsoRank singh2008global , PISwap chindelevitch2013optimizing , MAGNA++ vijayan2015magna++ , NETAL neyshabur2013netal , HubAlign hashemifar2014hubalign , and GHOST patro2012global , which mainly focus on two-graph matching and are sensitive to the noise in graphs. With the help of GW discrepancy, our work establishes a unified framework for graph partitioning and matching, that can be readily extended to multi-graph cases.

5 Experiments

We compare our S-GWL method with state-of-the-art methods for graph partitioning and matching. All the methods are run on an Intel i7 CPU with 4GB memory. Implementation details and a further set of experimental results are provided in Appendix B.

5.1 Graph partitioning

We first verify the performance of the GWL framework on graph partitioning, comparing it with the following four baselines: Metis karypis1998fast , FastGreedy clauset2004finding , Louvain blondel2008fast , and Fluid Community paresfluid . We consider synthetic and real-world data. Similar to yang2016comparative , we compare these methods in terms of adjusted mutual information (AMI) and runtime. Each synthetic graph is a Gaussian random partition graph with $V$ nodes and $K$ clusters. The size of each cluster is drawn from a normal distribution. The nodes are connected within clusters with probability $p_{in}$ and between clusters with probability $p_{out}$. The ratio $p_{out}/p_{in}$ indicates the clearness of the clustering structure, and accordingly the difficulty of partitioning. We consider three configurations of these parameters, corresponding to the three rows of Table 2. Under each configuration, we simulate multiple graphs. For each method, its average performance on these graphs is listed in Table 2. GWL outperforms the alternatives consistently on AMI. Additionally, as shown in Table 2, GWL has time complexity comparable to other methods, especially when the graph is sparse. According to the runtime in practice, GWL is faster than most baselines except Metis, likely because Metis is implemented in the C language while GWL and other methods are based on Python.

Method    Metis             FastGreedy        Louvain           Fluid             GWL
          AMI     Time      AMI     Time      AMI     Time      AMI     Time      AMI     Time
          0.413   1.744     0.247   55.435    0.747   22.889    0.776   21.580    0.812   13.033
          0.009   2.340     0.064   65.441    0.574   95.114    0.577   111.043   0.590   12.740
          0.002   3.592     0.002   80.322    0.005   290.846   0.005   203.225   0.012   12.901
Table 2: Comparisons for graph partitioning methods on AMI, time complexity and runtime (seconds).

Method            Metis           FastGreedy      Louvain         Fluid           GWL
Dataset           Raw     Noisy   Raw     Noisy   Raw     Noisy   Raw     Noisy   Raw     Noisy
EU-Email          0.421   0.246   0.312   0.118   0.434   0.272   —       0.338   0.459   0.349
Indian-Village    0.834   0.513   0.882   0.275   0.880   0.633   —       0.401   0.857   0.664
  • “—”: Fluid is inapplicable when the networks have disconnected nodes or sub-graphs.

Table 3: Comparisons for graph partitioning methods on AMI.

Table 3 lists the performance of different methods on two real-world datasets. The first dataset is the email network from a large European research institution snapnets . The network contains 1,005 nodes and 25,571 edges. An edge $(u, v)$ in the network means that person $u$ sent person $v$ at least one email, and each node in the network belongs to exactly one of 42 departments at the research institute. The second dataset is the interactions among 1,991 villagers in 12 Indian villages banerjee2013diffusion . Furthermore, to verify the robustness of GWL to noise, we not only consider the raw data of these two datasets but also create their noisy versions by adding 10% more noisy edges between different communities (i.e., departments and villages). Experimental results show that GWL is at least comparable to its competitors on raw data, and it is more robust to noise than other methods.

5.2 Graph matching

For two-graph matching, we compare our S-GWL method with the following baselines: PISwap chindelevitch2013optimizing , GHOST patro2012global , MI-GRAAL kuchaiev2011integrative , MAGNA++ vijayan2015magna++ , HubAlign hashemifar2014hubalign , NETAL neyshabur2013netal , CPD+Emb grover2016node2vec ; myronenko2010point , the GWL framework based on Algorithm 1, and the GWL+Emb in xu2019gromov . We test all methods on both synthetic and real-world data. For each method, given the learned correspondence set $\mathcal{S}$ and the ground-truth correspondence set $\mathcal{S}_{real}$, we calculate node correctness as $NC = |\mathcal{S} \cap \mathcal{S}_{real}| / |\mathcal{S}| \times 100\%$. The runtime of each method is recorded as well.
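Node correctness can be computed with plain set operations. The sketch below assumes that each correspondence is a (source node, target node) pair and that the score is the share of learned pairs that also appear in the ground truth, expressed as a percentage; the function name and the toy pairs are hypothetical.

```python
def node_correctness(learned_pairs, true_pairs):
    """Percentage of learned (source, target) pairs found in the ground truth."""
    learned = set(learned_pairs)
    truth = set(true_pairs)
    if not learned:
        return 0.0
    return 100.0 * len(learned & truth) / len(learned)


# Four learned pairs, two of which agree with the ground truth.
learned = [(0, 0), (1, 2), (2, 1), (3, 3)]
truth = [(0, 0), (1, 1), (2, 2), (3, 3)]
nc = node_correctness(learned, truth)
```

Returning zero for an empty learned set is a choice made for this sketch to avoid a division by zero.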

In the synthetic dataset, each source graph obeys a Gaussian random partition model brandes2003experiments or Barabási-Albert model barabasi2016network . For each source graph, we generate a target graph by adding noisy nodes and noisy edges to the source graph. Figure 1(e) compares our S-GWL with the baselines when the source graph has 2,000 nodes and the target graph has 100 more noisy nodes. For each method, its average node correctness and runtime on matching 10 synthetic graph pairs are plotted. Compared with existing heuristic methods, GW discrepancy-based methods (GWL+Emb, GWL and S-GWL) obtain much higher node correctness. GWL+Emb achieves the highest node correctness, with runtime comparable to many baselines. Our GWL framework does not learn node embeddings when matching graphs, so it is slightly worse than GWL+Emb on node correctness but achieves about 10 times acceleration. Our S-GWL method further accelerates GWL with the help of the recursive mechanism. It obtains high node correctness and makes its runtime comparable to the fastest methods (HubAlign and NETAL).

In addition to graph matching on synthetic data, we also consider two real-world matching tasks. The first task is matching the protein-protein interaction (PPI) network of yeast with its noisy version. The PPI network of yeast contains 1,004 proteins and their 4,920 high-confidence interactions. Its noisy version contains $q\%$ more low-confidence interactions, with $q \in \{5, 15, 25\}$. The dataset is available on https://www3.nd.edu/~cone/MAGNA++/. The second task is matching user accounts in different communication networks. The dataset is available on http://vacommunity.org/VAST+Challenge+2018+MC3, which records the communications among a company’s employees. Following the work in xu2019gromov , we extract employees and their call-network and email-network. For each communication network, we construct a dense version and a sparse one: the dense version keeps all the communications (edges) among the employees, while the sparse version only preserves the communications happening more than a given number of times. We test different methods on i) matching yeast’s PPI network with its 5%, 15% and 25% noisy versions; and ii) matching the employee call-network with their email-network in both sparse and dense cases. Table 4 shows the performance of various methods in these two tasks. Similar to the experiments on synthetic data, the GW discrepancy-based methods outperform other methods on node correctness, especially for highly-noisy graphs, and our S-GWL method achieves a good trade-off between accuracy and efficiency.

Dataset Yeast 5% noise Yeast 15% noise Yeast 25% noise MC3 sparse MC3 dense
Method NC Time NC Time NC Time NC Time NC Time
PISwap 0.10 15.80 0.10 18.31 0.00 22.09 6.32 10.27 0.00 11.81
GHOST 11.06 25.67 0.40 30.22 0.30 35.54 21.27 17.86 0.03 22.90
MI-GRAAL 18.03 189.21 6.87 202.77 5.18 240.03 35.53 72.89 0.64 197.65
MAGNA++ 48.13 603.29 25.04 630.60 13.61 624.17 7.88 425.16 0.09 447.86
HubAlign 50.00 3.27 35.16 3.50 12.85 3.89 36.21 2.11 3.86 2.29
NETAL 6.87 1.91 0.90 2.06 1.00 2.09 36.87 1.23 1.77 1.30
CPD+Emb 3.59 103.22 2.09 110.19 2.00 108.62 4.35 87.54 0.48 95.68
GWL+Emb 55.91 1243.51 44.05 1337.88 33.12 1389.72 40.45 608.76 4.23 831.80
GWL 55.68 148.29 43.33 176.66 32.97 153.45 34.21 89.43 3.96 93.94
S-GWL 53.30 15.27 41.93 15.46 31.87 15.41 36.92 8.39 4.03 9.01
Table 4: Comparisons for graph matching methods on node correctness (%) and runtime (second).
Method        3 graphs          4 graphs          5 graphs          6 graphs
              NC@1    NC@all    NC@1    NC@all    NC@1    NC@all    NC@1    NC@all
MultiAlign    62.97   45.19     —       —         —       —         —       —
GWL           63.84   46.22     68.73   39.14     71.61   31.57     76.49   28.39
S-GWL         60.06   43.33     68.53   38.45     73.21   33.27     76.99   29.68
Table 5: Comparisons for multi-graph matching methods on yeast networks.

Given the PPI network of yeast and its 5 noisy versions, we test GWL and S-GWL for multi-graph matching. We consider several existing multi-graph matching methods and find that the methods in pachauri2013solving ; yan2015matrix ; yan2015consistency are not applicable for the graphs with hundreds of nodes because i) their time complexity is at least $\mathcal{O}(V^3)$, and ii) they suffer from inadequate memory on our machine (with 4GB memory) because their memory complexity in the worst case is $\mathcal{O}(V^4)$. The IsoRankN in liao2009isorankn can align multiple PPI networks jointly, but it needs confidence scores of protein pairs as input, which are not available for our dataset. The only applicable baseline we are aware of is the MultiAlign in zhang2015multiple . However, it can only achieve three-graph matching. Table 5 lists the performance of various methods. Given learned correspondence sets, each of which is a set of matched nodes from different graphs, NC@1 represents the percentage of the sets containing at least one pair of correctly-matched nodes, and NC@all represents the percentage of the sets in which any two nodes are matched correctly. Both GWL and S-GWL obtain performance comparable to MultiAlign on three-graph matching, and GWL is the best. When the number of graphs increases, NC@1 increases while NC@all decreases for both GWL and S-GWL, and S-GWL becomes better than GWL.
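The two multi-graph scores can be sketched as follows. The sketch assumes each learned correspondence set is given as a tuple of ground-truth identities, one per graph, so that two nodes are correctly matched exactly when their identities are equal; this encoding and the function name are assumptions made for the example.

```python
from itertools import combinations


def nc_at_1_and_all(correspondence_sets):
    """Return (NC@1, NC@all) in percent for a list of correspondence sets."""
    if not correspondence_sets:
        return 0.0, 0.0
    hit_one = 0
    hit_all = 0
    for identities in correspondence_sets:
        pair_matches = [a == b for a, b in combinations(identities, 2)]
        if any(pair_matches):
            hit_one += 1  # at least one correctly matched pair
        if pair_matches and all(pair_matches):
            hit_all += 1  # every pair in the set is correctly matched
    total = len(correspondence_sets)
    return 100.0 * hit_one / total, 100.0 * hit_all / total


# Four correspondence sets over three graphs.
sets = [(1, 1, 1), (2, 2, 5), (3, 4, 6), (7, 7, 7)]
nc_one, nc_all = nc_at_1_and_all(sets)
```

With more graphs per set, a single correct pair becomes easier to find while full agreement becomes harder, which is consistent with the trend that NC@1 rises and NC@all falls in Table 5.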

6 Conclusion and Future Work

We have developed a scalable Gromov-Wasserstein learning method, achieving large-scale graph partitioning and matching in a unified framework, with theoretical support. Experiments show that our approach outperforms state-of-the-art methods in many situations. In the future, we will further study the influence of hyperparameters on the rate of convergence and set the hyperparameters adaptively according to observed data. Additionally, our S-GWL method can decompose a large graph into many independent small graphs, so we plan to further accelerate it by parallel processing and/or distributed learning.

References

  • (1) J. Altschuler, J. Weed, and P. Rigollet. Near-linear time approximation algorithms for optimal transport via sinkhorn iteration. In Advances in Neural Information Processing Systems, pages 1964–1974, 2017.
  • (2) D. Alvarez-Melis and T. Jaakkola. Gromov-Wasserstein alignment of word embedding spaces. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1881–1890, 2018.
  • (3) A. Banerjee, A. G. Chandrasekhar, E. Duflo, and M. O. Jackson. The diffusion of microfinance. Science, 341(6144):1236498, 2013.
  • (4) A.-L. Barabási et al. Network science. Cambridge university press, 2016.
  • (5) J.-D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.
  • (6) V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre. Fast unfolding of communities in large networks. Journal of statistical mechanics: theory and experiment, 2008(10):P10008, 2008.
  • (7) U. Brandes, M. Gaertler, and D. Wagner. Experiments on graph clustering algorithms. In European Symposium on Algorithms, pages 568–579. Springer, 2003.
  • (8) A. M. Bronstein, M. M. Bronstein, R. Kimmel, M. Mahmoudi, and G. Sapiro. A Gromov-Hausdorff framework with diffusion geometry for topologically-robust non-rigid shape matching. International Journal of Computer Vision, 89(2-3):266–286, 2010.
  • (9) C. Bunne, D. Alvarez-Melis, A. Krause, and S. Jegelka. Learning generative models across incomparable spaces. NeurIPS Workshop on Relational Representation Learning, 2018.
  • (10) L. Chindelevitch, C.-Y. Ma, C.-S. Liao, and B. Berger. Optimizing a global alignment of protein interaction networks. Bioinformatics, 29(21):2765–2773, 2013.
  • (11) S. Chowdhury and F. Mémoli. The Gromov-Wasserstein distance between networks and stable network invariants. arXiv preprint arXiv:1808.04337, 2018.
  • (12) A. Clauset, M. E. Newman, and C. Moore. Finding community structure in very large networks. Physical review E, 70(6):066111, 2004.
  • (13) L. P. Cordella, P. Foggia, C. Sansone, and M. Vento. A (sub) graph isomorphism algorithm for matching large graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(10):1367–1372, 2004.
  • (14) T. Cour, P. Srinivasan, and J. Shi. Balanced graph matching. In NIPS, pages 313–320, 2007.
  • (15) M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pages 2292–2300, 2013.
  • (16) M. Girvan and M. E. Newman. Community structure in social and biological networks. Proceedings of the national academy of sciences, 99(12):7821–7826, 2002.
  • (17) S. Gold and A. Rangarajan. A graduated assignment algorithm for graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(4):377–388, 1996.
  • (18) A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In KDD, pages 855–864, 2016.
  • (19) S. Hashemifar and J. Xu. Hubalign: An accurate and efficient method for global alignment of protein–protein interaction networks. Bioinformatics, 30(17):i438–i444, 2014.
  • (20) S.-H. Jun, S. W. Wong, J. Zidek, and A. Bouchard-Côté. Sequential graph matching with sequential monte carlo. In AISTATS, pages 1075–1084, 2017.
  • (21) G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on scientific Computing, 20(1):359–392, 1998.
  • (22) O. Kuchaiev, T. Milenković, V. Memišević, W. Hayes, and N. Pržulj. Topological network alignment uncovers biological function and phylogeny. Journal of the Royal Society Interface, page rsif20100063, 2010.
  • (23) O. Kuchaiev and N. Pržulj. Integrative network alignment reveals large regions of global network similarity in yeast and human. Bioinformatics, 27(10):1390–1396, 2011.
  • (24) H. W. Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97, 1955.
  • (25) J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
  • (26) C.-S. Liao, K. Lu, M. Baym, R. Singh, and B. Berger. Isorankn: spectral methods for global alignment of multiple protein networks. Bioinformatics, 25(12):i253–i258, 2009.
  • (27) N. Malod-Dognin and N. Pržulj. L-GRAAL: Lagrangian graphlet-based network aligner. Bioinformatics, 31(13):2182–2189, 2015.
  • (28) F. Mémoli. Spectral Gromov-Wasserstein distances for shape matching. In ICCV Workshops, pages 256–263, 2009.
  • (29) F. Mémoli. Gromov-Wasserstein distances and the metric approach to object matching. Foundations of computational mathematics, 11(4):417–487, 2011.
  • (30) F. Mémoli and G. Sapiro. Comparing point clouds. In Proceedings of the 2004 Eurographics/ACM SIGGRAPH symposium on Geometry processing, pages 32–40, 2004.
  • (31) A. Myronenko and X. Song. Point set registration: Coherent point drift. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(12):2262–2275, 2010.
  • (32) B. Neyshabur, A. Khadem, S. Hashemifar, and S. S. Arab. NETAL: A new graph-based method for global alignment of protein–protein interaction networks. Bioinformatics, 29(13):1654–1662, 2013.
  • (33) D. Pachauri, R. Kondor, and V. Singh. Solving the multi-way matching problem by permutation synchronization. In Advances in neural information processing systems, pages 1860–1868, 2013.
  • (34) F. Parés, D. Garcia-Gasulla, A. Vilalta, J. Moreno, E. Ayguadé, J. Labarta, U. Cortés, and T. Suzumura. Fluid communities: A competitive and highly scalable community detection algorithm. Complex Networks & Their Applications VI, pages 229–240, 2018.
  • (35) R. Patro and C. Kingsford. Global network alignment using multiscale spectral signatures. Bioinformatics, 28(23):3105–3114, 2012.
  • (36) G. Peyré, M. Cuturi, and J. Solomon. Gromov-Wasserstein averaging of kernel and distance matrices. In International Conference on Machine Learning, pages 2664–2672, 2016.
  • (37) U. N. Raghavan, R. Albert, and S. Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical review E, 76(3):036106, 2007.
  • (38) R. Sharan and T. Ideker. Modeling cellular machinery through biological network comparison. Nature biotechnology, 24(4):427, 2006.
  • (39) R. Singh, J. Xu, and B. Berger. Global alignment of multiple protein interaction networks with application to functional orthology detection. Proceedings of the National Academy of Sciences, 2008.
  • (40) R. Sinkhorn and P. Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2):343–348, 1967.
  • (41) K.-T. Sturm et al. On the geometry of metric measure spaces. Acta mathematica, 196(1):65–131, 2006.
  • (42) T. Vayer, L. Chapel, R. Flamary, R. Tavenard, and N. Courty. Fused Gromov-Wasserstein distance for structured objects: theoretical foundations and mathematical properties. arXiv preprint arXiv:1811.02834, 2018.
  • (43) T. Vayer, L. Chapel, R. Flamary, R. Tavenard, and N. Courty. Optimal transport for structured data. arXiv preprint arXiv:1805.09114, 2018.
  • (44) V. Vijayan, V. Saraph, and T. Milenković. MAGNA++: Maximizing accuracy in global network alignment via both node and edge conservation. Bioinformatics, 31(14):2409–2411, 2015.
  • (45) C. Villani. Optimal transport: Old and new, volume 338. Springer Science & Business Media, 2008.
  • (46) L. Wang, T. Lou, J. Tang, and J. E. Hopcroft. Detecting community kernels in large social networks. In 2011 IEEE 11th International Conference on Data Mining, pages 784–793. IEEE, 2011.
  • (47) Y. Xie, X. Wang, R. Wang, and H. Zha. A fast proximal point method for Wasserstein distance. arXiv preprint arXiv:1802.04307, 2018.
  • (48) H. Xu, D. Luo, H. Zha, and L. Carin. Gromov-wasserstein learning for graph matching and node embedding. arXiv preprint arXiv:1901.06003, 2019.
  • (49) J. Yan, J. Wang, H. Zha, X. Yang, and S. Chu. Consistency-driven alternating optimization for multigraph matching: A unified approach. IEEE Transactions on Image Processing, 24(3):994–1009, 2015.
  • (50) J. Yan, H. Xu, H. Zha, X. Yang, H. Liu, and S. Chu. A matrix decomposition perspective to multiple graph matching. In ICCV, pages 199–207, 2015.
  • (51) Z. Yang, R. Algesheimer, and C. J. Tessone. A comparative analysis of community detection algorithms on artificial networks. Scientific reports, 6:30750, 2016.
  • (52) T. Yu, J. Yan, Y. Wang, W. Liu, et al. Generalizing graph matching beyond quadratic assignment model. In NIPS, pages 861–871, 2018.
  • (53) J. Zhang and P. S. Yu. Multiple anonymized social networks alignment. In ICDM, pages 599–608, 2015.

Appendix A Details of Algorithms

A.1 The GWL framework for different tasks

Based on Algorithms 1 and 2, our GWL framework achieves the graph partitioning and matching tasks in Figures 1(a)-1(d). The schemes of GWL for these tasks are shown in Algorithms 3-6.

0:  , , hyperparameter .
1:  Initialize correspondence set .
2:  .
3:  For
4:   Find , then .
5:  return  
Algorithm 3
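As a rough illustration of how a correspondence set can be read off an optimal transport matrix, the sketch below assigns each source node to the target node that receives the largest transport mass. This row-wise argmax rule, the function name `correspondences_from_transport`, and the toy matrix are assumptions made for illustration; the sketch does not compute the optimal transport itself.

```python
import numpy as np


def correspondences_from_transport(transport):
    """For every source node (row), pick the target node (column) with the largest mass."""
    transport = np.asarray(transport, dtype=float)
    best_targets = transport.argmax(axis=1)
    return [(i, int(j)) for i, j in enumerate(best_targets)]


# Hypothetical 3 x 3 transport matrix between two small graphs.
toy_transport = np.array([
    [0.30, 0.02, 0.01],
    [0.02, 0.01, 0.30],
    [0.01, 0.30, 0.02],
])
pairs = correspondences_from_transport(toy_transport)
```

Note that a row-wise argmax does not force the matching to be one-to-one; two source nodes may select the same target node.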
0:  , hyperparameter , the number of clusters .
1:  Initialize a node distribution via (5):
2:  Construct a disconnected graph , where .
3:  .
4:  Initialize for .
5:  For
6:   Find , then .
7:  For
8:   Construct a adjacency matrix by selecting rows and columns: .
9:   Construct a node distribution by selecting elements and normalizing them: .
10:  return  
Algorithm 4
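The last steps of the partitioning scheme (selecting rows and columns of the adjacency matrix, then selecting and normalizing elements of the node distribution) can be sketched as follows. The sketch assumes that the transport matrix to the disconnected graph is already available and that each node is assigned to the cluster receiving its largest mass; the function name `partition_graph` and the toy inputs are hypothetical.

```python
import numpy as np


def partition_graph(adjacency, node_dist, transport):
    """Split a graph into sub-graphs using a transport matrix to K isolated nodes.

    adjacency: (V, V) array; node_dist: (V,) array summing to 1;
    transport: (V, K) array whose column k holds the mass sent to cluster k.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    node_dist = np.asarray(node_dist, dtype=float)
    transport = np.asarray(transport, dtype=float)
    labels = transport.argmax(axis=1)  # assumed assignment rule
    parts = []
    for k in range(transport.shape[1]):
        idx = np.flatnonzero(labels == k)
        if idx.size == 0:
            continue  # a cluster may end up empty
        sub_adj = adjacency[np.ix_(idx, idx)]             # select rows and columns
        sub_dist = node_dist[idx] / node_dist[idx].sum()  # select and renormalize
        parts.append((idx, sub_adj, sub_dist))
    return parts


# Toy graph: two disjoint edges, 0-1 and 2-3, partitioned into K = 2 clusters.
toy_adj = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
toy_dist = [0.25, 0.25, 0.25, 0.25]
toy_transport = [[0.25, 0.0], [0.20, 0.05], [0.0, 0.25], [0.05, 0.20]]
parts = partition_graph(toy_adj, toy_dist, toy_transport)
```

Each returned triple holds the selected node indices, the sub-graph's adjacency matrix, and its renormalized node distribution, so it can be fed to a further matching or partitioning step.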
0:  A graph set , hyperparameter
1:  Initialize correspondence set , , .
2:  .
3:  For
4:   
5:   For
6:    Find , then .
7:   .
8:  return  .
Algorithm 5
0:  A graph set , hyperparameter , the number of clusters .
1:  Initialize .
2:  .
3:  Initialize for and .
4:  For
5:   For
6:    Find , then .
7:   For
8:    , and .
9:  return  , where .
Algorithm 6

A.2 The scheme of S-GWL

Based on Algorithms 3, 5 and 6, we show the scheme of our S-GWL method for (multi-) graph matching in Algorithm 7.

0:  A graph set with graphs, , , , the number of partitions and that of recursions .
1:  Initialize correspondence set .
2:  Initialize the root collection of graph sets as .
3:  For Recursive -partition mechanism
4:   Initialize .
5:   For each graph set
6:    .
7:    .
8:  For each graph set
9:   If Two-graph matching
10:    , where .
11:   Else Multi-graph matching
12:    .
13:   .
14:  return  .
Algorithm 7

A.3 Detailed complexity analysis for GWL and S-GWL

Algorithms 3 and 5. Suppose that we have a source graph with nodes and edges and a target graph with nodes and edges. The most time- and memory-consuming operation in Algorithm 3 is the in (3). Because is with size and is with size , the worst-case time complexity of this step is and its memory complexity is . Taking advantage of the sparsity of edges, can be implemented by sparse matrix multiplications (e.g., saving , as “csr” matrices in Python), which reduces the time complexity and memory cost to and , respectively (see Footnote 1). Assuming that the two graphs are of comparable size, we ignore the number of graphs and the subscripts and rewrite the time and memory complexity as and , as shown in the “GWL” column of Table 1.

Footnote 1: The memory complexity should actually be . Because edges are sparse, we ignore the edge-related terms.
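The benefit of storing adjacency matrices in "csr" format can be sketched with SciPy. The sketch assumes that the expensive step is a triple product of the form (source adjacency) x (transport) x (target adjacency, transposed), which is typical of Gromov-Wasserstein computations; the function name `sparse_triple_product` and the random toy graphs are hypothetical.

```python
import numpy as np
from scipy import sparse


def sparse_triple_product(adj_s, transport, adj_t):
    """Compute adj_s @ transport @ adj_t.T with the adjacency matrices stored as CSR."""
    adj_s = sparse.csr_matrix(adj_s)
    adj_t = sparse.csr_matrix(adj_t)
    left = adj_s @ transport       # sparse @ dense gives a dense array
    return (adj_t @ left.T).T      # equals left @ adj_t.T, again sparse @ dense


# Toy check against the dense computation on small random sparse graphs.
rng = np.random.default_rng(0)
dense_s = (rng.random((6, 6)) < 0.3).astype(float)
dense_t = (rng.random((4, 4)) < 0.3).astype(float)
toy_transport = np.full((6, 4), 1.0 / 24.0)
fast = sparse_triple_product(dense_s, toy_transport, dense_t)
reference = dense_s @ toy_transport @ dense_t.T
```

Each sparse-dense product costs on the order of the number of nonzero entries of the sparse factor times the number of columns of the dense factor, which is where the edge-dependent terms in the complexity come from.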

Algorithm 5 is a natural extension of Algorithm 3 based on GWB. Suppose that we have graphs. We assume that these graphs and the target barycenter graph are of comparable size. The computational time complexity of Algorithm 5 is and its memory complexity is .

Algorithms 4 and 6. The main difference between Algorithm 4 and Algorithm 3 is that the target graph is much smaller than the source graph, i.e., and , because the target graph is disconnected and its number of nodes indicates the number of partitions in the source graph. According to the analysis above, the time and memory complexity of Algorithm 4 is and , respectively (see Footnote 2). Ignoring the subscripts, we obtain the complexity shown in Table 2.

Footnote 2: Even if edges are sparse, is often comparable to . Therefore, unlike in the analysis for Algorithms 3 and 5, here we do not ignore .

Similarly, Algorithm 6 is an extension of Algorithm 4 for graphs, whose time and memory complexity is and , respectively.

Algorithm 7. Given graphs with comparable sizes, each of which has about nodes and edges, we can apply recursions. In the -th recursion, the collection (in Algorithm 7) contains sub-graph sets. If we assume that each partitioning operation partitions a graph into sub-graphs of comparable size, the -th sub-graph in each set has nodes and edges. For each sub-graph set, we calculate its barycenter graph by Algorithm 6; thus, its time and memory complexity are and , respectively. At the end of the recursions, we obtain sub-graph sets. Each sub-graph is very small, with size . As long as is comparable to , the computations in lines 8-13 of Algorithm 7 can be ignored compared with the computations in the recursions.

In summary, we run recursions, and in the -th recursion we need to calculate barycenter graphs. The overall time complexity of S-GWL is , and its memory complexity is , as shown in Proposition 3.1. Choosing and ignoring the number of graphs, we obtain the complexity shown in Table 1.
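The way sub-graph sizes shrink across recursions can be tabulated with a short sketch. It assumes perfectly balanced partitions and counts recursions from 0, so that recursion r holds K^r sub-graph sets whose graphs each have roughly V / K^r nodes and E / K^r edges; these expressions, the function name `recursion_schedule`, and the graph sizes in the example are assumptions of this sketch.

```python
def recursion_schedule(num_nodes, num_edges, k, num_recursions):
    """List (recursion, number of sub-graph sets, nodes per sub-graph, edges per sub-graph).

    Assumes every partitioning step splits a graph into k equally sized parts.
    """
    schedule = []
    for r in range(num_recursions):
        num_sets = k ** r
        schedule.append((r, num_sets, num_nodes / num_sets, num_edges / num_sets))
    return schedule


# Hypothetical graph with 1,000 nodes and 5,000 edges, K = 2 partitions, R = 3 recursions.
schedule = recursion_schedule(1000, 5000, k=2, num_recursions=3)
leaf_sets = 2 ** 3            # sub-graph sets left after the last recursion
leaf_nodes = 1000 / leaf_sets  # nodes per remaining sub-graph
```

Under these assumptions the sub-graphs matched at the end are eight times smaller than the original graphs, which is why the final matching step is cheap relative to the recursions.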

A.4 Usefulness of node prior

With the help of prior knowledge of the nodes (i.e., ), our regularized proximal gradient method achieves a stable optimal transport within a few iterations and converges faster than the entropy-based method in peyre2016gromov and the vanilla proximal gradient method in xu2019gromov . Figure 3 illustrates the improvements on convergence achieved by our method. Given two synthetic graphs with 1,000 nodes, we calculate their GW discrepancy by different methods. Our method reaches a lower GW discrepancy with fewer iterations, and its superiority is consistent with respect to changes of the hyperparameter .

Figure 3 (panels (a)-(c)): Illustrations of the improvements on convergence achieved by our proximal gradient method regularized by node prior (i.e., “prior + proximal”) compared with the entropy-based method in peyre2016gromov and the vanilla proximal gradient method in xu2019gromov .

Appendix B More Experimental Results

B.1 Implementation details

For each baseline, we list its source and language below:

All the baselines are tested under their default settings. For our GWL framework and S-GWL method, the weight of the regularizer is fixed as 1 in our work. The remaining hyperparameters are set empirically in different experiments, which are shown in Table 6.

Experiments
Synthetic partitioning (Table 2) 1e-2
EU-Email partitioning (Table 3) 5e-7
Indian-Village partitioning (Table 3) 5e-5
Synthetic matching (Figure 4) 2e-1 2 3
Yeast graph matching (Table 4) 2e-2 2 3
MC3 network matching (Table 4) 1e-3 2 3
Yeast multi-graph matching (Table 5) 1e-1 8 1
Yeast-Human matching (Table 7) 5e-2 2 4
Table 6: The settings of hyperparameters in different experiments.

B.2 Performance on some challenging cases

Although our GWL framework and S-GWL method perform well in most of our experiments, we find some challenging cases that point out our future research direction.

(a) Gaussian Partition: Accuracy vs. efficiency
(b) Barabási-Albert: Accuracy vs. efficiency
(c) Gaussian Partition: Acceleration
(d) Barabási-Albert: Acceleration
Figure 4: The performance of our method on different kinds of graphs.

Matching Barabási-Albert (BA) graphs. Figure 1(e) shows the matching results averaged over 10 trials. In five of these trials, we match synthetic graphs generated by the Gaussian random partition model. In the remaining five trials, we match synthetic graphs generated by the Barabási-Albert (BA) model. The overall performance shown in Figure 1(e) demonstrates the superiority of our S-GWL method. This result is mainly attributable to the experiments on Gaussian partition graphs. Specifically, when matching Gaussian partition graphs, all the GW discrepancy-based methods achieve very high node correctness, and the speed of our method is almost the same as that of the fastest method, HubAlign, as shown in Figure 4(a). For BA graphs, Figure 4(b) indicates that although GW discrepancy-based methods still outperform many baselines, they lag behind the state-of-the-art methods in node correctness.

Additionally, the BA graphs also have a negative influence on our recursive mechanism. Gaussian partition graphs are relatively easy to partition into several sub-graphs of comparable size. In such a situation, our recursive mechanism is most effective and achieves over 100 times acceleration. For BA graphs, however, the sub-graphs we obtain often differ greatly in size, and the largest sub-graph determines the runtime of our S-GWL method. As a result, our S-GWL method achieves only about 10–20 times acceleration.

Currently, we are working to improve the accuracy and speed of our method on BA graphs. To solve this problem, we may need to use some node information, e.g., by introducing node embeddings into our S-GWL method.

Matching incomparable graphs. The second challenging case is matching graphs of very different sizes. This case is common in bioinformatics, e.g., matching the PPI networks of different species. When the networks differ greatly in size, the performance of GW discrepancy-based methods degrades. For example, in Table 7, we match the PPI network of yeast to that of human. This yeast network has 2,340 proteins (nodes), while the human network has 9,141 proteins. Because the ground-truth correspondence between these proteins is unknown, we use edge correctness to evaluate our method. Specifically, edge correctness is the percentage of yeast's edges that appear in the human network under the learned matching.
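Edge correctness as described above can be computed with a few lines of code. This is a minimal sketch under the assumption that both graphs are undirected and that the learned matching is given as a dictionary from source nodes to target nodes; the function name `edge_correctness` and the toy graphs are hypothetical.

```python
def edge_correctness(source_edges, target_edges, mapping):
    """Percentage of source edges whose mapped endpoints form an edge in the target graph."""
    # Store target edges as unordered pairs so that (u, v) and (v, u) are the same edge.
    target = {frozenset(edge) for edge in target_edges}
    preserved = 0
    for u, v in source_edges:
        if u in mapping and v in mapping:
            if frozenset((mapping[u], mapping[v])) in target:
                preserved += 1
    return 100.0 * preserved / len(source_edges)


# Toy example: a triangle in the source graph mapped onto a path in the target graph.
toy_source_edges = [(1, 2), (2, 3), (3, 1)]
toy_target_edges = [("a", "b"), ("b", "c"), ("c", "d")]
toy_mapping = {1: "a", 2: "b", 3: "c"}
ec = edge_correctness(toy_source_edges, toy_target_edges, toy_mapping)
```

Two of the three source edges are preserved in the toy example. Because the measure only needs the two edge lists and the matching, it can be evaluated without ground-truth node correspondences.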

Experimental results show that both GWL and S-GWL outperform most of their competitors except HubAlign and NETAL. The main reason for this phenomenon, in our opinion, is the constraint of optimal transport. The constraint implies that each node in the target graph is assigned to a source node with a probability as long as its probability in is nonzero. When the number of target nodes is much larger than that of source nodes, the real correspondence will be oversmoothed because each source node transports to too many target nodes. To overcome this issue, we need a preprocessing step that removes potentially useless nodes from the large graph, which is another direction for our future work.
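A back-of-the-envelope calculation illustrates the oversmoothing effect. The sketch assumes uniform node distributions on both graphs, purely for illustration (the node distributions used in the experiments may differ): each source node then carries mass 1/|V_s| while each target node can absorb at most 1/|V_t|, so every source node must spread its mass over at least ceil(|V_t| / |V_s|) target nodes. The function name `min_targets_per_source` is hypothetical.

```python
def min_targets_per_source(num_source, num_target):
    """Lower bound on the number of target nodes each source node sends mass to.

    Assumes uniform marginals: source mass 1/num_source per node and a capacity of
    1/num_target per target node, so the bound is ceil(num_target / num_source).
    """
    return -(-num_target // num_source)  # integer ceiling division


# Yeast (2,340 proteins) matched to human (9,141 proteins).
bound = min_targets_per_source(2340, 9141)
```

With the yeast and human network sizes, each yeast protein must be transported to at least four human proteins under this assumption, so no transport matrix satisfying the constraints can be close to a one-to-one correspondence.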

Method IsoRank PISwap MI-GRAAL GHOST NETAL HubAlign GWL S-GWL
Yeast-Human 2.12 2.16 13.87 17.04 28.65 21.59 19.56 18.89
Table 7: Comparisons for graph matching methods on edge correctness (%).