Gromov-Wasserstein Learning for Graph Matching and Node Embedding

01/17/2019, by Hongteng Xu, et al.

A novel Gromov-Wasserstein learning framework is proposed to jointly match (align) graphs and learn embedding vectors for the associated graph nodes. Using the Gromov-Wasserstein discrepancy, we measure the dissimilarity between two graphs and find their correspondence according to the learned optimal transport. The node embeddings associated with the two graphs are learned under the guidance of the optimal transport; their pairwise distances not only reflect the topological structure of each graph but also yield the correspondence across graphs. These two learning steps are mutually beneficial, and are unified here by minimizing the Gromov-Wasserstein discrepancy with structural regularizers. This framework leads to an optimization problem that is solved by a proximal point method. We apply the proposed method to matching problems in real-world networks, and demonstrate its superior performance compared to alternative approaches.




1 Introduction

Real-world entities and their interactions are often represented as graphs. Given two or more graphs created in different domains, graph matching aims to find a correspondence across them. This task is important for many applications, e.g., matching the protein networks of different species (Sharan & Ideker, 2006; Singh et al., 2008), linking accounts in different social networks (Zhang & Philip, 2015), and feature matching in computer vision (Cordella et al., 2004; Zanfir & Sminchisescu, 2018). However, because it is NP-hard, graph matching is challenging and often solved heuristically. Further complicating matters, the observed graphs may be noisy (e.g., containing unreliable edges), which leads to unsatisfying matching results from traditional methods.

Figure 1: An illustration of the proposed method.

A problem related to graph matching is the learning of node embeddings, which aims to learn a latent vector for each graph node; the collection of embeddings approximates the topology of the graph, with similar/related nodes nearby in embedding space. Learning suitable node embeddings is beneficial for graph matching, as one may seek to align two or more graphs according to the metric structure associated with their node embeddings. Although graph matching and node embedding are highly related tasks, in practice they are often treated and solved independently. Existing node embedding methods (Perozzi et al., 2014; Tang et al., 2015; Grover & Leskovec, 2016) are designed for a single graph, and applying such methods separately to multiple graphs does not share information across the graphs and, hence, is less helpful for graph matching. Most graph matching methods rely purely on topological information (e.g., the adjacency matrices of the graphs) and ignore the potential functionality of node embeddings (Kuchaiev et al., 2010; Neyshabur et al., 2013; Nassar et al., 2018). Although some methods consider first deriving embeddings for each graph and then learning a transformation between the embeddings, their results are often unsatisfying because their embeddings are predefined and the transformations are limited to orthogonal projections (Grave et al., 2018) or rigid/non-rigid deformations (Myronenko & Song, 2010).

This paper considers the joint goal of graph matching and learning node embeddings, seeking to achieve improvements in both tasks. As illustrated in Fig. 1, to address this goal we propose a novel Gromov-Wasserstein learning framework. The dissimilarity between two graphs is measured by the Gromov-Wasserstein discrepancy (GW discrepancy), which compares the distance matrices of different graphs in a relational manner, and learns an optimal transport between the nodes of different graphs. The learned optimal transport indicates the correspondence between the graphs. The embeddings of the nodes from different graphs are learned jointly: the distance between the embeddings within the same graph should approach the distance matrix derived from data, and the distance between the embeddings across different graphs should reflect the correspondence indicated by the learned optimal transport. As a result, the objectives of graph matching and node embedding are unified as minimizing the Gromov-Wasserstein discrepancy (Peyré et al., 2016) between two graphs, with structural regularizers. This framework leads to an optimization problem that is solved via an iterative process. In each iteration, the embeddings are used to estimate distance matrices when learning the optimal transport, and the learned optimal transport regularizes the learning of embeddings in the next iteration.

There are two important benefits to tackling graph matching and node embedding jointly. First, the observed graphs often contain spurious edges or miss some useful edges, leading to noisy adjacency matrices and unreliable graph matching results. Treating the distance between learned node embeddings as information complementary to the observed edges, we can approximate the topology of each graph more robustly and, accordingly, match noisy graphs. Second, as shown in Figure 1, our method regularizes the GW discrepancy and learns the embeddings of the different graphs on the same manifold (the GW discrepancy itself is applicable to embeddings on different manifolds, even those with different dimensions, but placing the embeddings on the same manifold, which the proposed regularizers encourage, reduces the difficulty of matching), instead of learning an explicit transformation between the embeddings with predefined constraints. Therefore, the proposed method is more flexible and has a lower risk of model misspecification (e.g., imposing incorrect constraints on the transformation); the distance between the embeddings of different graphs can be calculated directly without any additional transformation. We test our method on real-world matching problems and analyze its performance, including its convergence, consistency and scalability. Experiments show that our method obtains encouraging matching results, with comparisons made to alternative approaches.

2 Gromov-Wasserstein Learning Framework

Assume we have two sets of entities (nodes), a source set and a target set; without loss of generality, we assume the source set is no larger than the target set. For each set, we observe a collection of interactions between its entities, where each interaction connects a pair of entities and is weighted by the number of times it appears. Accordingly, the data of these entities can be represented as two graphs, a source graph and a target graph, and we focus on the following two tasks: (i) find a correspondence between the two graphs; (ii) obtain node embeddings for both graphs. As discussed above, these two tasks are unified in a framework based on the Gromov-Wasserstein discrepancy.

2.1 Gromov-Wasserstein discrepancy

Gromov-Wasserstein discrepancy, proposed in (Peyré et al., 2016), is a natural extension of the Gromov-Wasserstein distance (Mémoli, 2011). Specifically, the Gromov-Wasserstein distance is defined as follows:

Definition 2.1.

Let (X, d_X, μ_X) and (Y, d_Y, μ_Y) be two metric measure spaces, where (X, d_X) is a compact metric space and μ_X is a Borel probability measure on X (with (Y, d_Y, μ_Y) defined in the same way). The Gromov-Wasserstein distance d_GW(μ_X, μ_Y) is

d_GW(μ_X, μ_Y) = inf_{π ∈ Π(μ_X, μ_Y)} ∬_{X×Y} ∬_{X×Y} L(d_X(x, x'), d_Y(y, y')) dπ(x, y) dπ(x', y'),

where L(·, ·) is the loss function and Π(μ_X, μ_Y) is the set of all probability measures on X × Y with μ_X and μ_Y as marginals.

This defines an optimal transport-like distance (Villani, 2008) by comparing the metric spaces directly: it calculates distances between pairs of samples within each domain and measures how these distances compare to those in the other domain. In other words, it does not require one to directly compare samples across different spaces, and the two spaces can have different dimensions. When the underlying metrics are replaced with dissimilarity measurements rather than strict distance metrics, and the loss function is defined more flexibly, e.g., as the mean-square-error (MSE) or the KL-divergence, we relax the Gromov-Wasserstein distance to the proposed Gromov-Wasserstein discrepancy. These relaxations make the proposed Gromov-Wasserstein learning framework suitable for a wide range of machine learning tasks, including graph matching.

In graph matching, a metric measure space corresponds to a graph paired with a distance/dissimilarity matrix derived from its interaction set, i.e., each element of the matrix is a function of the corresponding interaction counts. The empirical distribution of nodes counts the appearance of each node in the interaction set. Given a source graph and a target graph with node distributions μ_s and μ_t, the Gromov-Wasserstein discrepancy between them is defined as

d_GW(μ_s, μ_t) = min_{T ∈ Π(μ_s, μ_t)} Σ_{i,i',j,j'} L(c^s_{ii'}, c^t_{jj'}) T_{ij} T_{i'j'} = min_{T ∈ Π(μ_s, μ_t)} ⟨L(C_s, C_t, T), T⟩.   (1)

Here, L(·, ·) is an element-wise loss function, with typical choices the square loss L(a, b) = (a − b)^2 and the KL-divergence L(a, b) = a log(a/b) − a + b. C_s = [c^s_{ii'}] and C_t = [c^t_{jj'}] are the distance matrices of the two graphs, L(C_s, C_t, T) = [Σ_{i',j'} L(c^s_{ii'}, c^t_{jj'}) T_{i'j'}], and ⟨·, ·⟩ represents the inner product of matrices; T is the optimal transport between the nodes of the two graphs, and its element T_{ij} represents the probability that source node v^s_i matches target node v^t_j. By choosing the largest T_{ij} for each node, we find the correspondence that minimizes the GW discrepancy between the two graphs.

However, such a graph matching strategy raises several issues. First, for each graph, its observed interaction set can be noisy, which leads to an unreliable distance matrix. Minimizing the GW discrepancy based on such distance matrices has a negative influence on matching results. Second, the Gromov-Wasserstein discrepancy compares different graphs relationally based on their edges (i.e., the distances between pairs of nodes within each graph), while most existing graph matching methods consider the information of nodes and edges jointly (Neyshabur et al., 2013; Vijayan et al., 2015; Sun et al., 2015). Therefore, to make a successful graph matching method, we further consider the learning of node embeddings and derive the proposed Gromov-Wasserstein learning framework.

2.2 Proposed model

We propose to not only learn the optimal transport indicating the correspondence between graphs but also simultaneously learn the node embeddings X_s and X_t for each graph, which leads to a regularized Gromov-Wasserstein discrepancy. The corresponding optimization problem is

min_{X_s, X_t, T ∈ Π(μ_s, μ_t)} ⟨L(C_s(X_s), C_t(X_t), T), T⟩ + α ⟨K(X_s, X_t), T⟩ + β R(X_s, X_t).   (2)

The first term in (2) corresponds to the GW discrepancy defined in (1), which measures the relational dissimilarity between the two graphs. The difference here is that the proposed distance matrices consider both the information of the observed data and that of the embeddings:

C_s(X_s) = (1 − α) C_s + α K(X_s),   C_t(X_t) = (1 − α) C_t + α K(X_t).   (3)

Here K(X) is a distance matrix whose elements measure the distances between node embeddings, and α ∈ [0, 1] is a hyperparameter controlling the contribution of the embedding-based distance to the proposed distance.

The second term in (2) represents the Wasserstein discrepancy between the nodes of the two graphs. Similar to the first term, the distance matrix K(X_s, X_t) is also derived from the node embeddings, i.e., its elements measure the distances between the embeddings of source nodes and those of target nodes, and its contribution is controlled by the same hyperparameter α. This term measures the absolute dissimilarity between the two graphs, connecting the target optimal transport with the node embeddings. By adding this term, the optimal transport minimizes both the Gromov-Wasserstein discrepancy, based directly on observed data, and the Wasserstein discrepancy, based on the embeddings (which are themselves indirectly a function of the data). Furthermore, the embeddings of different graphs can be learned jointly under the guidance of the optimal transport: the distance between the embeddings of different graphs should be consistent with the relationship indicated by the optimal transport.

Because the target optimal transport is often sparse, purely considering its guidance leads to overfitting or trivial solutions when learning embeddings. To mitigate this problem, the third term in (2) is a regularizer of the embeddings, based on the prior information provided by the observed distance matrices. In this work, we require the embedding-based distance matrices to be close to the observed ones:

R(X_s, X_t) = L(K(X_s), C_s) + L(K(X_t), C_t) [+ L(K(X_s, X_t), C_st)],   (4)

where the loss function L is the same as that used in (1). Note that if we observe partial correspondences between the graphs, we can calculate a distance matrix for the nodes across the graphs, denoted C_st, and require the cross-graph distance between the embeddings to match C_st, as shown in the optional (bracketed) term of (4). This term is available only when C_st is given.

The proposed method unifies (optimal transport-based) graph matching and node embedding in the same framework, and makes them beneficial to each other. For the original GW discrepancy term, introducing the embedding-based distance matrices can suppress the noise in the data-driven distance matrices, improving robustness. Additionally, based on node embeddings, we can calculate the Wasserstein discrepancy between graphs, which further regularizes the target optimal transport directly. When learning node embeddings, the Wasserstein discrepancy term works as the regularizer of node embeddings — the values of the learned optimal transport indicate which pairs of nodes should be close to each other.

3 Learning Algorithm

3.1 Learning optimal transport

Although (2) is a complicated nonconvex optimization problem, we can solve it effectively by alternately learning the optimal transport and the embeddings. In particular, the proposed method applies nested iterative optimization. In the m-th outer iteration, given the current embeddings X_s^(m) and X_t^(m), we solve the following sub-problem:

min_{T ∈ Π(μ_s, μ_t)} ⟨L(C_s(X_s^(m)), C_t(X_t^(m)), T), T⟩ + α_m ⟨K(X_s^(m), X_t^(m)), T⟩.   (5)

This sub-problem is still nonconvex because of the quadratic term ⟨L(C_s, C_t, T), T⟩. We solve it iteratively with the help of a proximal point method. Inspired by the method in (Xie et al., 2018), in the n-th inner iteration we update the target optimal transport via

T^(n+1) = argmin_{T ∈ Π(μ_s, μ_t)} ⟨L(C_s(X_s^(m)), C_t(X_t^(m)), T^(n)), T⟩ + α_m ⟨K(X_s^(m), X_t^(m)), T⟩ + γ KL(T ‖ T^(n)).   (6)

Here, a proximal term based on the Kullback-Leibler (KL) divergence, KL(T ‖ T^(n)) = Σ_{ij} T_{ij} log(T_{ij} / T^(n)_{ij}) − T_{ij} + T^(n)_{ij}, is added as a regularizer. We use projected gradient descent to solve (6), in which both the gradient and the projection are based on the KL metric. When the learning rate is set as 1/γ, the projected gradient descent is equivalent to solving the following optimal transport problem with an entropy regularizer (Benamou et al., 2015; Peyré et al., 2016):

min_{T ∈ Π(μ_s, μ_t)} ⟨Ĉ^(n), T⟩ + γ ⟨T, log T⟩,   (7)

where Ĉ^(n) = L(C_s(X_s^(m)), C_t(X_t^(m)), T^(n)) + α_m K(X_s^(m), X_t^(m)) − γ log T^(n). This problem can be solved via the Sinkhorn-Knopp algorithm (Sinkhorn & Knopp, 1967; Cuturi, 2013) with linear convergence.
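To make the Sinkhorn-Knopp step concrete, here is a minimal sketch (our illustration, not the authors' implementation) that solves an entropic optimal transport problem of this form for a generic cost matrix:

```python
import numpy as np

def sinkhorn(cost, mu_s, mu_t, gamma=0.5, n_iter=500):
    """Solve an entropic OT problem by Sinkhorn-Knopp scaling
    (a minimal sketch; `gamma` is the entropy-regularization weight)."""
    K = np.exp(-cost / gamma)              # Gibbs kernel
    b = np.ones_like(mu_t)
    for _ in range(n_iter):
        a = mu_s / (K @ b)                 # scale rows toward mu_s
        b = mu_t / (K.T @ a)               # scale columns toward mu_t
    return a[:, None] * K * b[None, :]     # transport plan with the prescribed marginals

# usage: uniform marginals over 3 source and 4 target nodes
rng = np.random.default_rng(0)
mu_s, mu_t = np.full(3, 1 / 3), np.full(4, 1 / 4)
T = sinkhorn(rng.random((3, 4)), mu_s, mu_t)
```

The alternating row/column scalings converge linearly, and the returned plan satisfies both marginal constraints.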

In summary, we decompose (5) into a series of updating steps. Each updating step (6) can be solved via projected gradient descent, which amounts to solving a regularized optimal transport problem (7). Essentially, the proposed method can be viewed as a special case of successive upper-bound minimization (SUM) (Razaviyayn et al., 2013), whose global convergence is guaranteed:

Proposition 3.1.

Every limit point of the sequence generated by our proximal point method is a stationary point of problem (5).

Note that besides our proximal point method, another way to solve (5) involves replacing the KL-divergence in (6) with an entropy regularizer and minimizing an entropic GW discrepancy via iterative Sinkhorn projection (Peyré et al., 2016). However, its performance (e.g., its convergence and numerical stability) is more sensitive to the choice of the regularization hyperparameter. The details of our proximal point method, the proof of Proposition 3.1, and its comparison with the Sinkhorn method (Peyré et al., 2016) are shown in the Supplementary Material.
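Putting the pieces together, the proximal point iteration described above can be sketched as follows (a simplified illustration for the square loss only, without the embedding terms; all function and variable names are ours):

```python
import numpy as np

def gw_proximal_point(Cs, Ct, mu_s, mu_t, gamma=0.05, outer=50, inner=200):
    """Sketch of the proximal point method for the (unregularized) GW
    discrepancy: each update is an entropic OT step whose Gibbs kernel is
    the previous plan times exp(-cost/gamma), followed by Sinkhorn-Knopp
    projections. Uses the square-loss factorization of Peyre et al. (2016)."""
    T = np.outer(mu_s, mu_t)                          # feasible initial plan
    for _ in range(outer):
        # linearized GW cost at the current plan, for the square loss
        cost = (Cs**2) @ mu_s[:, None] + (mu_t[None, :] @ (Ct**2).T) \
               - 2.0 * Cs @ T @ Ct.T
        G = T * np.exp(-cost / gamma)                 # proximal Gibbs kernel
        b = np.ones_like(mu_t)
        for _ in range(inner):                        # Sinkhorn-Knopp scaling
            a = mu_s / (G @ b)
            b = mu_t / (G.T @ a)
        T = a[:, None] * G * b[None, :]
    return T

# usage: match a toy graph with a permuted copy of itself
x = np.array([0.0, 1.0, 2.5, 4.5, 7.0, 9.0])          # nodes on a line
Cs = np.abs(x[:, None] - x[None, :]); Cs /= Cs.max()  # normalized distances
perm = np.array([2, 0, 3, 5, 1, 4])                   # hidden correspondence
Ct = Cs[np.ix_(perm, perm)]
mu = np.full(6, 1 / 6)
T = gw_proximal_point(Cs, Ct, mu, mu)
```

On this toy example the learned plan concentrates on the hidden permutation, so taking the largest entry of each column recovers the correspondence.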

The parameter α controls the influence of the node embeddings on the GW discrepancy and the Wasserstein discrepancy. When training the proposed model from scratch, the embeddings X_s and X_t are initialized randomly and thus are unreliable in the beginning. Therefore, we initialize α with a small value and increase it with the number of outer iterations. We apply a simple linear strategy to adjust α: with the maximum number of outer iterations set as M, in the m-th iteration we set α_m = m/M.

3.2 Updating embeddings

Given the optimal transport T̂ learned in the m-th outer iteration, we update the embeddings by solving the following optimization problem:

min_{X_s, X_t} ⟨L(C_s(X_s), C_t(X_t), T̂), T̂⟩ + α ⟨K(X_s, X_t), T̂⟩ + β R(X_s, X_t),   (8)

i.e., problem (2) with the optimal transport held fixed. This problem can be solved effectively by (stochastic) gradient descent. In summary, the proposed learning algorithm is shown in Algorithm 1.


1:  Input: the distance matrices {C_s, C_t}, the node distributions {μ_s, μ_t}, the embedding dimension D, the numbers of outer/inner iterations {M, N}.
2:  Output: the embeddings X_s, X_t and the optimal transport T̂.
3:  Initialize X_s, X_t randomly and T̂ = μ_s μ_t^⊤.
4:  For m = 0, …, M − 1
5:   Set α_m = m/M.
6:   For n = 0, …, N − 1
7:    Update the optimal transport T^(n+1) via solving (6).
8:   Obtain X_s, X_t via solving (8).
9:   Set T̂ = T^(N).
10:  Graph matching:
11:  Initialize the correspondence set as ∅.
12:  For each target node v^t_j
13:   Find i = argmax_i T̂_{ij}. Add the pair (v^s_i, v^t_j) to the correspondence set.
Algorithm 1 Gromov-Wasserstein Learning (GWL)

3.3 Implementation details and analysis

Distance matrix The distance matrix plays an important role in our Gromov-Wasserstein learning framework. For a graph, the data-driven distance matrix should reflect its structure. Based on the fact that the counts of interactions in many real-world graphs are characterized by Zipf's law (Powers, 1998), we treat the counts as the weights of edges and define each element of the data-driven distance matrix as a decreasing function of the corresponding count, e.g., c_{ii'} = 1/(w_{ii'} + 1), where w_{ii'} is the number of interactions between the pair of nodes.
This definition assigns a short distance to pairs of nodes with many interactions. Additionally, we hope that the embedding-based distance matrix can fit the data-driven distance matrix easily. In the following experiments, we test two kinds of embedding-based distance: 1) a cosine-based distance, derived from the cosine similarity between embeddings; and 2) a radial basis function (RBF)-based distance, derived from the Euclidean distance between embeddings. In each case, the scaling constants are chosen such that the maximum distance approaches a fixed value. The following experiments show that both distances work well in various matching tasks.
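The two embedding-based distances can be instantiated as below (a sketch under assumed scalings, since the exact constants used in the paper are not specified here):

```python
import numpy as np

def cosine_distance(X, Y):
    """1 - cosine similarity between rows of X and rows of Y; one plausible
    instantiation of the cosine-based distance (scaling constant omitted)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return 1.0 - Xn @ Yn.T

def rbf_distance(X, Y, sigma=1.0):
    """1 - exp(-||x - y||^2 / sigma^2): near 0 for close embeddings and
    approaching 1 as they move apart (an assumed RBF-based form)."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return 1.0 - np.exp(-sq / sigma ** 2)

# example: two orthogonal 2-d embeddings
X = np.array([[1.0, 0.0], [0.0, 1.0]])
D_cos = cosine_distance(X, X)
D_rbf = rbf_distance(X, X)
```

Both functions return zero distance between identical embeddings and grow monotonically as the embeddings become less similar.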

Complexity and Scalability When learning the optimal transport, one of the most time-consuming steps is computing the loss matrix L(C_s, C_t, T), which involves a tensor-matrix multiplication. Fortunately, as shown in (Peyré et al., 2016), when the loss function can be written as L(a, b) = f_1(a) + f_2(b) − h_1(a) h_2(b) for suitable functions f_1, f_2, h_1, h_2, which is satisfied by our MSE/KL loss, the loss matrix can be calculated as f_1(C_s) μ_s 1^⊤ + 1 μ_t^⊤ f_2(C_t)^⊤ − h_1(C_s) T h_2(C_t)^⊤. Because T tends to become sparse quickly during the learning process, the cost of computing L(C_s, C_t, T) scales with the number of nonzeros of T rather than with the full size of the coupling. For D-dimensional node embeddings, computing an embedding-based distance matrix costs time quadratic in the number of nodes and linear in D. Additionally, we can apply the inexact proximal point method (Xie et al., 2018; Chen et al., 2018a), running a one-step Sinkhorn-Knopp projection in each inner iteration. When learning the node embeddings, we can apply stochastic gradient descent to solve (8); in our experiments, the objective of (8) converges quickly after a few epochs, so the cost of computing the embedding-based distance sub-matrices within each node batch is negligible compared to that of learning the optimal transport. In summary, both the learning of the optimal transport and that of the node embeddings can be done in parallel on GPUs.
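The square-loss factorization used above can be checked against a brute-force evaluation of the tensor-matrix product (a sketch; variable names are ours):

```python
import numpy as np

def gw_loss_matrix(Cs, Ct, T):
    """Fast evaluation of L(Cs, Ct, T) for the square loss, using the
    factorization of Peyre et al. (2016): because the marginals of T are
    fixed, the separable terms reduce to constant row/column vectors, and
    only the cross term needs the product Cs @ T @ Ct.T."""
    mu_s, mu_t = T.sum(axis=1), T.sum(axis=0)
    const = (Cs**2) @ mu_s[:, None] + (mu_t[None, :] @ (Ct**2).T)
    return const - 2.0 * Cs @ T @ Ct.T

def gw_loss_matrix_naive(Cs, Ct, T):
    """Brute-force reference, quartic in the number of nodes."""
    n, m = T.shape
    E = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            E[i, j] = ((Cs[i, :, None] - Ct[j, None, :]) ** 2 * T).sum()
    return E

# example data: the two evaluations agree entry-by-entry
rng = np.random.default_rng(0)
Cs, Ct = rng.random((4, 4)), rng.random((5, 5))
T = rng.random((4, 5)); T /= T.sum()
E_fast = gw_loss_matrix(Cs, Ct, T)
E_naive = gw_loss_matrix_naive(Cs, Ct, T)
```

The fast version replaces a quartic summation with two matrix-vector products and one chained matrix product, which is what makes large-scale GW learning practical.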

According to the above analysis, the proposed method has lower complexity than many existing graph matching methods. For example, GRAAL and its variants (Malod-Dognin & Pržulj, 2015) have substantially higher complexity, making them much slower than the proposed method. Additionally, the complexity of our method is independent of the number of edges. Compared to other well-known alternatives, e.g., NETAL, our method has at least comparable complexity for dense graphs.

4 Related Work

Gromov-Wasserstein learning The Gromov-Wasserstein discrepancy extends optimal transport (Villani, 2008) to the case in which the target domains are not registered well. It can also be viewed as a relaxation of the Gromov-Hausdorff distance (Mémoli, 2008; Bronstein et al., 2010) when a pairwise distance between entities is defined. The GW discrepancy is suitable for solving matching problems like shape and object matching (Mémoli, 2009, 2011). Beyond graphics and computer vision, its potential for other applications has recently been investigated, e.g., matching vocabulary sets between different languages (Alvarez-Melis & Jaakkola, 2018) and matching weighted directed networks (Chowdhury & Mémoli, 2018). The work in (Peyré et al., 2016) considers the Gromov-Wasserstein barycenter and proposes a fast Sinkhorn projection-based algorithm to compute the GW discrepancy (Cuturi, 2013). Similar to our method, the work in (Vayer et al., 2018) proposes a fused Gromov-Wasserstein distance, combining the GW discrepancy with the Wasserstein discrepancy. However, it does not consider the learning of embeddings, and it requires the distance between entities in different domains to be known, which is inapplicable to our matching problems. In (Bunne et al., 2018), an adversarial learning method is proposed to learn a pair of generative models for incomparable spaces, using the GW discrepancy as the objective function. This method imposes an orthogonality assumption on the transformation between a sample and its embedding, and it is designed for fuzzy matching between distributions rather than for graph matching tasks that require point-to-point correspondences.

Graph matching Graph matching has been studied extensively, with a wide range of applications. Focusing on protein-protein interaction (PPI) networks, many methods have been proposed, including methods based on local neighborhood information like GRAAL (Kuchaiev et al., 2010) and its variants MI-GRAAL (Kuchaiev & Pržulj, 2011) and L-GRAAL (Malod-Dognin & Pržulj, 2015), as well as methods based on global structural information, like IsoRank (Singh et al., 2008), MAGNA++ (Vijayan et al., 2015), NETAL (Neyshabur et al., 2013), HubAlign (Hashemifar & Xu, 2014) and WAVE (Sun et al., 2015). Among these methods, MAGNA++ and WAVE consider both edge and node information. Besides bioinformatics, network alignment techniques are also applied to computer vision (Jun et al., 2017; Yu et al., 2018), document analysis (Bayati et al., 2009) and social network analysis (Zhang & Philip, 2015). For small graphs, e.g., the graphs of feature points in computer vision, graph matching is often solved as a quadratic assignment problem (Yan et al., 2015). For large graphs, e.g., social networks and PPI networks, existing methods either depend on a heuristic searching strategy or leverage domain knowledge for specific cases. None of these methods consider graph matching and node embedding jointly from the viewpoint of Gromov-Wasserstein discrepancy.

Node embedding Node embedding techniques have been widely used to represent and analyze graph/network structures. The representative methods include LINE (Tang et al., 2015), Deepwalk (Perozzi et al., 2014), and node2vec (Grover & Leskovec, 2016). Most of these embedding methods first generate sequential observations of nodes through a random-walk procedure, and then learn the embeddings by maximizing the coherency between each observation and its context (Mikolov et al., 2013). The distance between the learned embeddings can reflect the topological structure of the graph. More recently, many new embedding methods have been proposed, e.g., the anonymous walk embedding (Ivanov & Burnaev, 2018) and the mixed membership word embedding (Foulds, 2018), which help to improve the representations of complicated graphs and their nodes. However, none of these methods consider jointly learning embeddings for multiple graphs.

5 Experiments

We apply the Gromov-Wasserstein learning (GWL) method to both synthetic and real-world matching tasks, and compare it with state-of-the-art methods. In our experiments, we fix the hyperparameters across tasks: the numbers of outer and inner iterations, the weights of the regularizers, and the loss function, which is the MSE loss. When solving (8), we use Adam (Kingma & Ba, 2014) with a fixed learning rate, number of epochs, and batch size. The proposed methods based on the cosine and RBF distances are denoted GWL-C and GWL-R, respectively. Additionally, to highlight the benefit of jointly performing graph matching and node-embedding learning, we consider a baseline that purely minimizes the GW discrepancy based on data-driven distance matrices (denoted GWD).

5.1 Synthetic data

We verify the feasibility of our GWL method by first considering a synthetic dataset. We simulate the source graph by repeatedly selecting pairs of nodes at random and generating interactions between them; the source interaction set is the union of all simulated interactions. The target graph is constructed by first adding noisy nodes to the source graph and then generating noisy edges among the nodes via the same simulation method.

Figure 2: The performance of our method on synthetic data.

We vary the graph size and the noise level over a range of configurations. Under each configuration, we simulate the source graph and the target one in multiple trials. For each trial, we apply our method (and its baseline GWD) to match the graphs and calculate node correctness as our measurement: given the learned correspondence set and the ground-truth set of correspondences, the percent node correctness is the percentage of ground-truth correspondences recovered by the learned set. To analyze the rationality of the learned node embeddings, we construct the learned correspondence set in two ways: for each target node, we find its matched node either by maximizing the learned optimal transport (as shown in line 13 of Algorithm 1) or by minimizing the embedding-based distance. Additionally, the corresponding GW discrepancy is calculated as well. Assuming that the results in different trials are Gaussian distributed, we calculate a confidence interval for each measurement.
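The node-correctness measurement described above can be sketched as follows (hypothetical node labels):

```python
def node_correctness(learned, truth):
    """Percentage of ground-truth node correspondences that appear in the
    learned correspondence set (a sketch of the measurement used here)."""
    learned, truth = set(learned), set(truth)
    return 100.0 * len(learned & truth) / len(truth)

# example: one of the two ground-truth pairs is recovered
pairs_pred = [("s1", "t1"), ("s2", "t3")]
pairs_true = [("s1", "t1"), ("s2", "t2")]
nc = node_correctness(pairs_pred, pairs_true)
```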

Figure 2 visualizes the performance of our GWL-C method and its baseline GWD, which demonstrates the feasibility of our method. (The performance of GWL-R is almost the same as that of GWL-C, so here we just show the results of GWL-C.) In particular, when the target graph is identical to the source one, the proposed Gromov-Wasserstein learning framework achieves almost perfect node correctness, and the GW discrepancy approaches zero. As the noise in the target graph becomes more serious, the GW discrepancy increases accordingly, indicating that the GW discrepancy indeed reflects the dissimilarity between the graphs. Although GWD is comparable to our GWL-C at low noise levels, it becomes much worse as the noise level grows. This phenomenon supports our claim that learning node embeddings improves the robustness of graph matching. Moreover, we find that the node correctness based on the optimal transport (blue curves) and that based on the embeddings (orange curves) are almost the same. This demonstrates that the embeddings of the different graphs lie on the same manifold, and their distances indicate the correspondences between the graphs.

In the above experiments with synthetic data, we demonstrate the feasibility of our method and illustrate the advantages of jointly performing graph matching and node-embedding learning. In the experiments below, we compare against many state-of-the-art methods on real data.

5.2 MC3: Matching communication networks

MC3 is a dataset used in Mini-Challenge 3 of the VAST Challenge 2018, which records the communication behavior among a company's employees on different networks. The communications are categorized into two types: phone calls and emails between employees. According to the type of communication, we obtain two networks, denoted CallNet and EmailNet. Because each employee has two independent accounts in these two networks, we aim to link the accounts belonging to the same employee. We test our method on a subset of the MC3 dataset containing a set of employees and their communications through phone calls and emails. In this subset, each selected employee has at least one employee in a network (either CallNet or EmailNet) with frequent communications with him/her, which ensures that each node has at least one reliable edge. Additionally, for each network, we can control the density of its edges by thresholding the count of interactions: keeping only the edges corresponding to frequent communications yields two sparse graphs, while keeping all communications and the corresponding edges yields two dense graphs. Generally, experience indicates that matching dense graphs is much more difficult than matching sparse ones.

We compare our methods (GWL-R and GWL-C) with well-known graph matching methods: the graduated assignment algorithm (GAA) (Gold & Rangarajan, 1996), the low-rank spectral alignment (LRSA) (Nassar et al., 2018), TAME (Mohammadi et al., 2017), GRAAL, MI-GRAAL, MAGNA++, HugAlign, and NETAL. These alternatives achieve state-of-the-art performance on matching large-scale graphs, e.g., protein networks. Table 1 lists the matching results obtained by the different methods; for GWD, GWL-R and GWL-C, we show the node correctness calculated based on the learned optimal transport. For the alternative methods, performance on sparse and dense graphs is inconsistent. For example, GRAAL works almost as well as our GWL-R and GWL-C on sparse graphs, but its matching results become much worse on dense graphs. The baseline GWD is inferior to most graph-matching methods on node correctness, because it purely minimizes the GW discrepancy based on the information of pairwise interactions (i.e., edges); moreover, GWD relies merely on data-driven distance matrices, which are sensitive to the noise in the graphs. However, when we take node embeddings into account, the proposed GWL-R and GWL-C outperform GWD and the other considered approaches consistently, on both sparse and dense graphs.

(a) Stability and convergence
(b) Learned embeddings
Figure 3: Visualization of typical experimental results.
Method      Call→Email (Sparse), NC (%)   Call→Email (Dense), NC (%)
GAA         34.22                         0.53
LRSA        38.20                         2.93
TAME        37.39                         2.67
GRAAL       39.67                         0.48
MI-GRAAL    35.53                         0.64
MAGNA++     7.88                          0.09
HugAlign    36.21                         3.86
NETAL       36.87                         1.77
GWD         23.16±0.46                    1.77±0.22
GWL-R       39.64±0.57                    3.80±0.23
GWL-C       40.45±0.53                    4.23±0.27
Table 1: Communication network matching results.

To demonstrate the convergence and the stability of our method, we run GWD, GWL-R and GWL-C in multiple trials with different initializations. For each method, node correctness is calculated based on both the optimal transport and the embedding-based distance matrix. The confidence interval of the node correctness is estimated as well, as shown in Table 1. We find that the proposed method has good stability and outperforms the other methods with high confidence. Figure 3(a) visualizes the GW discrepancy and the node correctness with respect to the number of outer iterations, with confidence intervals shown as well. We find that the GW discrepancy decreases while the two kinds of node correctness increase and become consistent as iterations proceed, which means that the embeddings we learn, and their distances, indeed reflect the correspondence between the two graphs. Figure 3(b) visualizes the learned embeddings with the help of t-SNE (Maaten & Hinton, 2008). We find that the learned node embeddings of the different graphs lie on the same manifold, and overlapping embeddings indicate matched pairs.

5.3 MIMIC-III: Procedure recommendation

Besides typical graph matching, our method has potential for other applications, like recommendation systems. Such systems recommend items to users according to the distance/similarity between their embeddings. Traditional methods (Rendle et al., 2009; Chen et al., 2018b) learn the embeddings of users and items purely based on their interactions. Recent work (Monti et al., 2017; Ying et al., 2018) shows that considering the user network and/or the item network helps improve recommendation results. Such a strategy is also applicable to our Gromov-Wasserstein learning framework: given the network of users, the network of items, and the observed interactions between them (i.e., partial correspondences between the graphs), we learn the embeddings of users and items, and the optimal transport between them, by minimizing the GW discrepancy between the networks. Because the learned embeddings lie on the same manifold, we can calculate the distance between a user and an item directly via the cosine-based or the RBF-based distance. Accordingly, we recommend to each user the items with the shortest distances. For our method, the only difference between the recommendation task and the previous graph matching task is that some cross-graph interactions are observed, so we take the optional regularizer in (4) into account; the cross-graph distance matrix in (4) is calculated via (3).
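As a small illustration of this distance-based recommendation step, the sketch below (hypothetical helper names; cosine-based distance assumed) ranks items by their embedding distance to a user:

```python
import numpy as np

def recommend(user_emb, item_embs, k):
    """Return the indices of the k items whose embeddings are closest to
    the user's, under the cosine-based distance (an illustrative sketch)."""
    u = user_emb / np.linalg.norm(user_emb)
    V = item_embs / np.linalg.norm(item_embs, axis=1, keepdims=True)
    dist = 1.0 - V @ u                 # cosine-based distance to each item
    return np.argsort(dist)[:k]        # k nearest items

# example: the user embedding nearly coincides with item 0
user = np.array([1.0, 0.0])
items = np.array([[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]])
top2 = recommend(user, items, 2)
```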

We test the feasibility of our method on the MIMIC-III dataset (Johnson et al., 2016), which contains patient admissions in a hospital. Each admission is represented as a sequence of ICD (International Classification of Diseases) codes of the diseases and the procedures. The diseases (procedures) appearing in the same admission construct the interactions of the disease (procedure) graph. We aim to recommend suitable procedures for patients according to their disease characteristics. To achieve this, we learn the embeddings of the ICD codes for the diseases and the procedures with the help of various methods, and measure the distances between the embeddings. We compare the proposed Gromov-Wasserstein learning method with the following baselines: i) treating the admission sequences as sentences and learning the embeddings of ICD codes via traditional word-embedding methods like Word2Vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014); ii) the distilled Wasserstein learning (DWL) method in (Xu et al., 2018), which trains the embeddings from scratch or fine-tunes Word2Vec’s embeddings based on a Wasserstein topic model; and iii) the GWD method, which minimizes the GW discrepancy purely based on the data-driven distance matrices, and then learns the embeddings regularized by the learned optimal transport. The GWD method is equivalent to applying our GWL method and setting the number of outer iterations . For the GWD method, we also consider the cosine- and RBF-based distances when learning embeddings (i.e., GWD-C and GWD-R).

For fairness of comparison, we use a subset of the MIMIC-III dataset provided by (Xu et al., 2018), which contains patient admissions, corresponding to diseases and procedures. For all the methods, we use of the admissions for training, for validation, and the remainder for testing. In the testing phase, for the -th admission we recommend a list of procedures with length , denoted as , based on its diseases, and evaluate the recommendation against the ground-truth list of procedures, denoted as . Given these lists, we calculate the top- precision, recall, and F1-score. Table 2 shows the results of the various methods for the top-1 and top-5 cases. We find that our GWL method outperforms the alternatives, especially on the top-1 measurements.
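The top-K metrics admit a direct implementation; the sketch below assumes the standard definitions (precision over the K recommendations, recall over the ground-truth list, and their harmonic mean), which correspond to the quantities reported in Table 2:

```python
def topk_metrics(recommended, ground_truth):
    """Top-K precision, recall and F1 for one admission.

    recommended  : list of K recommended procedure codes
    ground_truth : iterable of true procedure codes
    """
    hits = len(set(recommended) & set(ground_truth))
    p = hits / len(recommended)            # precision over the K recommendations
    r = hits / len(set(ground_truth))      # recall over the ground-truth list
    f1 = 2 * p * r / (p + r) if hits else 0.0
    return p, r, f1
```

Per-admission scores are then averaged over the test set.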

We analyze the learned optimal transport between diseases and procedures from a clinical viewpoint. We highlight some matched disease-procedure pairs, e.g., “(d41401) Coronary atherosclerosis of native coronary artery → (p3961) Extracorporeal circulation auxiliary to open heart surgery”, “(dV053) Need for prophylactic vaccination and inoculation against viral hepatitis → (p9955) Prophylactic administration of vaccine against other diseases”, and “(dV3001) Single liveborn, born in hospital, delivered by cesarean section → (p640) Circumcision”. (The descriptions of the ICD codes are listed in the Supplementary Material.) We asked two clinical researchers to check the pairs corresponding to ; they confirmed that for over of the pairs, either the procedures are clearly related to the treatments of the diseases, or the procedures clearly lead to the diseases as side effects or complications (other relationships may be less clear, but are implied by the data). The learned optimal transport and all pairs of ICD codes with their evaluation results are shown in the Supplementary Material.

Method         | Top-1 (%)            | Top-5 (%)
               | P      R      F1     | P      R      F1
Word2Vec       | 39.95  13.27  18.25  | 28.89  46.98  32.59
GloVe          | 32.66  13.01  17.22  | 27.93  44.79  31.47
DWL (Scratch)  | 37.89  12.42  17.16  | 27.39  43.81  30.81
DWL (Finetune) | 40.00  13.76  18.71  | 30.59  48.56  34.28
GWD-R          | 46.29  17.01  22.32  | 31.82  43.81  33.77
GWD-C          | 43.16  15.79  20.77  | 31.42  42.99  33.25
GWL-R          | 46.20  16.93  22.22  | 32.03  44.75  34.18
GWL-C          | 47.46  17.25  22.71  | 32.09  45.64  34.31
Table 2: Top- procedure recommendation results.

6 Conclusions and Future Work

We have proposed a Gromov-Wasserstein learning method that unifies graph matching and the learning of node embeddings in a single framework. We show that such joint learning benefits both objectives, obtaining superior performance in various matching tasks. In the future, we plan to extend our method to multi-graph matching tasks, which may build on the Gromov-Wasserstein barycenter (Peyré et al., 2016) and its learning method. Additionally, to improve the scalability of our method, we will explore new Gromov-Wasserstein learning algorithms.


  • Altschuler et al. (2017) Altschuler, J., Weed, J., and Rigollet, P. Near-linear time approximation algorithms for optimal transport via Sinkhorn iteration. arXiv preprint arXiv:1705.09634, 2017.
  • Alvarez-Melis & Jaakkola (2018) Alvarez-Melis, D. and Jaakkola, T. Gromov-Wasserstein alignment of word embedding spaces. In EMNLP, pp. 1881–1890, 2018.
  • Bayati et al. (2009) Bayati, M., Gerritsen, M., Gleich, D. F., Saberi, A., and Wang, Y. Algorithms for large, sparse network alignment problems. In ICDM, pp. 705–710, 2009.
  • Benamou et al. (2015) Benamou, J.-D., Carlier, G., Cuturi, M., Nenna, L., and Peyré, G. Iterative Bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111–A1138, 2015.
  • Bronstein et al. (2010) Bronstein, A. M., Bronstein, M. M., Kimmel, R., Mahmoudi, M., and Sapiro, G. A Gromov-Hausdorff framework with diffusion geometry for topologically-robust non-rigid shape matching. International Journal of Computer Vision, 89(2-3):266–286, 2010.
  • Bunne et al. (2018) Bunne, C., Alvarez-Melis, D., Krause, A., and Jegelka, S. Learning generative models across incomparable spaces. NeurIPS Workshop on Relational Representation Learning, 2018.
  • Chen et al. (2018a) Chen, L., Dai, S., Tao, C., Zhang, H., Gan, Z., Shen, D., Zhang, Y., Wang, G., Zhang, R., and Carin, L. Adversarial text generation via feature-mover’s distance. In NIPS, pp. 4671–4682, 2018a.
  • Chen et al. (2018b) Chen, X., Xu, H., Zhang, Y., Tang, J., Cao, Y., Qin, Z., and Zha, H. Sequential recommendation with user memory networks. In WSDM, pp. 108–116, 2018b.
  • Chowdhury & Mémoli (2018) Chowdhury, S. and Mémoli, F. The Gromov-Wasserstein distance between networks and stable network invariants. arXiv preprint arXiv:1808.04337, 2018.
  • Cordella et al. (2004) Cordella, L. P., Foggia, P., Sansone, C., and Vento, M. A (sub) graph isomorphism algorithm for matching large graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(10):1367–1372, 2004.
  • Cuturi (2013) Cuturi, M. Sinkhorn distances: Lightspeed computation of optimal transport. In NIPS, pp. 2292–2300, 2013.
  • Foulds (2018) Foulds, J. Mixed membership word embeddings for computational social science. In AISTATS, pp. 86–95, 2018.
  • Gold & Rangarajan (1996) Gold, S. and Rangarajan, A. A graduated assignment algorithm for graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(4):377–388, 1996.
  • Grave et al. (2018) Grave, E., Joulin, A., and Berthet, Q. Unsupervised alignment of embeddings with Wasserstein Procrustes. arXiv preprint arXiv:1805.11222, 2018.
  • Grover & Leskovec (2016) Grover, A. and Leskovec, J. node2vec: Scalable feature learning for networks. In KDD, pp. 855–864, 2016.
  • Hashemifar & Xu (2014) Hashemifar, S. and Xu, J. Hubalign: An accurate and efficient method for global alignment of protein–protein interaction networks. Bioinformatics, 30(17):i438–i444, 2014.
  • Ivanov & Burnaev (2018) Ivanov, S. and Burnaev, E. Anonymous walk embeddings. In ICML, 2018.
  • Johnson et al. (2016) Johnson, A. E., Pollard, T. J., Shen, L., Li-wei, H. L., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., and Mark, R. G. MIMIC-III, a freely accessible critical care database. Scientific data, 3:160035, 2016.
  • Jun et al. (2017) Jun, S.-H., Wong, S. W., Zidek, J., and Bouchard-Côté, A. Sequential graph matching with sequential monte carlo. In AISTATS, pp. 1075–1084, 2017.
  • Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kuchaiev & Pržulj (2011) Kuchaiev, O. and Pržulj, N. Integrative network alignment reveals large regions of global network similarity in yeast and human. Bioinformatics, 27(10):1390–1396, 2011.
  • Kuchaiev et al. (2010) Kuchaiev, O., Milenković, T., Memišević, V., Hayes, W., and Pržulj, N. Topological network alignment uncovers biological function and phylogeny. Journal of the Royal Society Interface, pp. rsif20100063, 2010.
  • Maaten & Hinton (2008) Maaten, L. v. d. and Hinton, G. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
  • Malod-Dognin & Pržulj (2015) Malod-Dognin, N. and Pržulj, N. L-GRAAL: Lagrangian graphlet-based network aligner. Bioinformatics, 31(13):2182–2189, 2015.
  • Mémoli (2008) Mémoli, F. Gromov-Hausdorff distances in Euclidean spaces. In CVPR Workshops, pp. 1–8, 2008.
  • Mémoli (2009) Mémoli, F. Spectral Gromov-Wasserstein distances for shape matching. In ICCV Workshops, pp. 256–263, 2009.
  • Mémoli (2011) Mémoli, F. Gromov-Wasserstein distances and the metric approach to object matching. Foundations of computational mathematics, 11(4):417–487, 2011.
  • Mikolov et al. (2013) Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
  • Mohammadi et al. (2017) Mohammadi, S., Gleich, D. F., Kolda, T. G., and Grama, A. Triangular alignment TAME: A tensor-based approach for higher-order network alignment. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 14(6):1446–1458, 2017.
  • Monti et al. (2017) Monti, F., Bronstein, M., and Bresson, X. Geometric matrix completion with recurrent multi-graph neural networks. In NIPS, pp. 3697–3707, 2017.
  • Myronenko & Song (2010) Myronenko, A. and Song, X. Point set registration: Coherent point drift. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(12):2262–2275, 2010.
  • Nassar et al. (2018) Nassar, H., Veldt, N., Mohammadi, S., Grama, A., and Gleich, D. F. Low rank spectral network alignment. In WWW, pp. 619–628, 2018.
  • Neyshabur et al. (2013) Neyshabur, B., Khadem, A., Hashemifar, S., and Arab, S. S. NETAL: A new graph-based method for global alignment of protein–protein interaction networks. Bioinformatics, 29(13):1654–1662, 2013.
  • Pennington et al. (2014) Pennington, J., Socher, R., and Manning, C. Glove: Global vectors for word representation. In EMNLP, pp. 1532–1543, 2014.
  • Perozzi et al. (2014) Perozzi, B., Al-Rfou, R., and Skiena, S. Deepwalk: Online learning of social representations. In KDD, pp. 701–710, 2014.
  • Peyré et al. (2016) Peyré, G., Cuturi, M., and Solomon, J. Gromov-Wasserstein averaging of kernel and distance matrices. In ICML, pp. 2664–2672, 2016.
  • Powers (1998) Powers, D. M. Applications and explanations of Zipf’s law. In Proceedings of the joint conferences on new methods in language processing and computational natural language learning, pp. 151–160, 1998.
  • Razaviyayn et al. (2013) Razaviyayn, M., Hong, M., and Luo, Z.-Q. A unified convergence analysis of block successive minimization methods for nonsmooth optimization. SIAM Journal on Optimization, 23(2):1126–1153, 2013.
  • Rendle et al. (2009) Rendle, S., Freudenthaler, C., Gantner, Z., and Schmidt-Thieme, L. BPR: Bayesian personalized ranking from implicit feedback. In UAI, pp. 452–461, 2009.
  • Sharan & Ideker (2006) Sharan, R. and Ideker, T. Modeling cellular machinery through biological network comparison. Nature biotechnology, 24(4):427, 2006.
  • Singh et al. (2008) Singh, R., Xu, J., and Berger, B. Global alignment of multiple protein interaction networks with application to functional orthology detection. Proceedings of the National Academy of Sciences, 2008.
  • Sinkhorn & Knopp (1967) Sinkhorn, R. and Knopp, P. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2):343–348, 1967.
  • Sun et al. (2015) Sun, Y., Crawford, J., Tang, J., and Milenković, T. Simultaneous optimization of both node and edge conservation in network alignment via WAVE. In International Workshop on Algorithms in Bioinformatics, pp. 16–39, 2015.
  • Tang et al. (2015) Tang, J., Qu, M., Wang, M., Zhang, M., Yan, J., and Mei, Q. Line: Large-scale information network embedding. In WWW, pp. 1067–1077, 2015.
  • Vayer et al. (2018) Vayer, T., Chapel, L., Flamary, R., Tavenard, R., and Courty, N. Fused Gromov-Wasserstein distance for structured objects: theoretical foundations and mathematical properties. arXiv preprint arXiv:1811.02834, 2018.
  • Vijayan et al. (2015) Vijayan, V., Saraph, V., and Milenković, T. MAGNA++: Maximizing accuracy in global network alignment via both node and edge conservation. Bioinformatics, 31(14):2409–2411, 2015.
  • Villani (2008) Villani, C. Optimal transport: Old and new, volume 338. Springer Science & Business Media, 2008.
  • Xie et al. (2018) Xie, Y., Wang, X., Wang, R., and Zha, H. A fast proximal point method for Wasserstein distance. arXiv preprint arXiv:1802.04307, 2018.
  • Xu et al. (2018) Xu, H., Wang, W., Liu, W., and Carin, L. Distilled Wasserstein learning for word embedding and topic modeling. In NIPS, pp. 1723–1732, 2018.
  • Yan et al. (2015) Yan, J., Xu, H., Zha, H., Yang, X., Liu, H., and Chu, S. A matrix decomposition perspective to multiple graph matching. In ICCV, pp. 199–207, 2015.
  • Ying et al. (2018) Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton, W. L., and Leskovec, J. Graph convolutional neural networks for web-scale recommender systems. arXiv preprint arXiv:1806.01973, 2018.
  • Yu et al. (2018) Yu, T., Yan, J., Wang, Y., Liu, W., et al. Generalizing graph matching beyond quadratic assignment model. In NIPS, pp. 861–871, 2018.
  • Zanfir & Sminchisescu (2018) Zanfir, A. and Sminchisescu, C. Deep learning of graph matching. In CVPR, pp. 2684–2693, 2018.
  • Zhang & Philip (2015) Zhang, J. and Philip, S. Y. Multiple anonymized social networks alignment. In ICDM, pp. 599–608, 2015.

7 Supplementary Material

7.1 The scheme of proposed proximal point method

In the -th outer iteration, we learn the optimal transport iteratively. In particular, in the -th inner iteration, we update the target optimal transport by solving (6) with the Sinkhorn-Knopp algorithm (Sinkhorn & Knopp, 1967; Cuturi, 2013). Algorithm 2 gives the details of our proximal point method in the -th outer iteration, where diag(·) converts a vector to a diagonal matrix, and ⊙ and ⊘ represent element-wise multiplication and division, respectively.
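The inner Sinkhorn-Knopp loop of Algorithm 2 can be sketched in a few lines of numpy. The sketch below assumes the standard proximal construction, in which the KL term folds into the Sinkhorn kernel as exp(-C/γ) ⊙ T⁽ᵐ⁾; variable names are illustrative, not the paper's:

```python
import numpy as np

def proximal_sinkhorn(C, T_prev, mu, nu, gamma=0.1, n_iter=100):
    """One proximal step:  min_T <C, T> + gamma * KL(T || T_prev)
    over couplings with marginals (mu, nu), via Sinkhorn-Knopp scaling."""
    K = np.exp(-C / gamma) * T_prev      # proximal term folds into the kernel
    b = np.ones_like(nu)
    for _ in range(n_iter):
        a = mu / (K @ b)                 # element-wise division ("oslash")
        b = nu / (K.T @ a)
    return np.diag(a) @ K @ np.diag(b)   # diag(a) K diag(b)
```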

7.2 The convergence of each updating step

The proposed proximal point method decomposes a nonconvex optimization problem into a series of convex updating steps. Each updating step corresponds to a regularized optimal transport problem, which is solved via Sinkhorn projections. Altschuler et al. (2017) prove that solving regularized optimal transport via Sinkhorn projections achieves linear convergence. Xie et al. (2018) further prove that linear convergence holds even when only a one-step Sinkhorn projection is applied in each updating step. Therefore, the updating steps of the proposed method achieve linear convergence.

7.3 Global convergence: The proof of Proposition 3.1

Proposition 3.1 Every limit point generated by our proximal point method is a stationary point of problem (5).


When learning the target optimal transport, the original optimization problem (5) has a nonconvex but differentiable objective function


and a closed convex set as the constraint of . As a special case of successive upper-bound minimization (SUM), our proximal point method solves (5) by optimizing a sequence of approximate objective functions: starting from a feasible point , the algorithm generates a sequence according to the following update rule:




Here, is an approximation of the objective at the -th iteration, and is the point generated in the previous iteration.
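Written out in standard SUM/proximal form (the symbols below follow common Gromov-Wasserstein notation and are assumptions rather than the paper's exact equation), the update rule reads:

```latex
T^{(m+1)} \;=\; \arg\min_{T \in \Pi(\mu_s,\, \mu_t)} \; g\bigl(T,\, T^{(m)}\bigr),
\qquad
g\bigl(T,\, T^{(m)}\bigr) \;=\; \bigl\langle L\bigl(T^{(m)}\bigr),\, T \bigr\rangle
\;+\; \gamma\, \mathrm{KL}\bigl(T \,\big\|\, T^{(m)}\bigr),
```

where $\Pi(\mu_s, \mu_t)$ is the set of couplings with the two marginals. The KL proximal term is nonnegative and vanishes exactly at $T = T^{(m)}$, which is what makes $g$ a tight global upper bound of the original objective.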

Obviously, we have

  • C1: is continuous in .

Additionally, because the proximal term is nonnegative and the equality holds only when its two arguments coincide, we have

  • C2: .

  • C3: .

According to Proposition 1 in (Razaviyayn et al., 2013), when conditions C2 and C3 are satisfied for the differentiable function and its global upper bound , we have

  • C4: with , where is the directional derivative of along the direction , and is the directional derivative only with respect to .

According to the Theorem 1 in (Razaviyayn et al., 2013), when the approximate objective function in each iteration satisfies C1-C4, every limit point generated by the proposed method, , is a stationary point of the original problem (5). ∎

1:  Input: , , current embeddings , , the number of inner iterations .
2:  Output: .
3:  Initialize and .
4:  for  do
5:     Calculate the in (7).
6:     Set .
7:      Sinkhorn-Knopp algorithm:
8:     for  do
11:     end for
12:     .
13:  end for
14:  .
Algorithm 2 Proximal Point Method for GW Discrepancy
Figure 4: Comparisons between our method (blue curves) and that in (Peyré et al., 2016) (orange curves); panels (a)–(l) correspond to different hyperparameter settings.

7.4 Connections and comparisons with existing method

Note that when replacing the KL-divergence in (6) with an entropy regularizer , we obtain an entropic GW discrepancy, which can also be solved by the Sinkhorn-Knopp algorithm. Accordingly, the kernel in Algorithm 2 (line 6) is replaced with . In this case, the proposed algorithm reduces to the Sinkhorn projection method in (Peyré et al., 2016).
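In code, the only difference between the two solvers is the Sinkhorn kernel; a minimal sketch (names illustrative) of the substitution:

```python
import numpy as np

def sinkhorn_kernel(C, T_prev, gamma, proximal=True):
    # proximal=True : KL(T || T_prev) regularizer (our proximal point method)
    # proximal=False: entropy regularizer (Peyre et al., 2016)
    K = np.exp(-C / gamma)
    return K * T_prev if proximal else K
```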

For both methods, the number of Sinkhorn iterations and the weight of the (proximal or entropic) regularizer are two significant hyperparameters. Figure 4 shows the empirical convergence of the two methods under different hyperparameters with respect to the number of inner iterations ( in Algorithm 2). We find that the Sinkhorn method can obtain a smaller GW discrepancy than our proximal point method when the weight of the regularizer is very small () and . However, in that regime both methods suffer a high risk of numerical instability. When is enlarged, the stability of our method improves markedly, and it still obtains a small GW discrepancy with a good convergence rate. The Sinkhorn method, on the contrary, converges slowly when . In other words, our method is more robust to the choice of , so we can select from a wide range and easily trade off convergence against stability. Additionally, although increasing slightly improves the stability of our method, e.g., by suppressing the numerical fluctuations after the GW discrepancy converges, this improvement is not commensurate with the added computational cost. Therefore, in practice we set .

   Value Disease Procedure CR1 CR2
  1.00 d4019: Unspecified essential hypertension p9604: Insertion of endotracheal tube
  0.22 d4019: Unspecified essential hypertension p966: Enteral infusion of concentrated nutritional substances
  0.17 d4280: Congestive heart failure, unspecified p966: Enteral infusion of concentrated nutritional substances
  0.64 d4280: Congestive heart failure, unspecified p9671: Continuous invasive mechanical ventilation for less than 96 consecutive hours
  0.36 d42731: Atrial fibrillation p3961: Extracorporeal circulation auxiliary to open heart surgery
  0.18 d42731: Atrial fibrillation p8856: Coronary arteriography using two catheters
  0.16 d42731: Atrial fibrillation p8872: Diagnostic ultrasound of heart
  0.34 d41401: Coronary atherosclerosis of native coronary artery p3961: Extracorporeal circulation auxiliary to open heart surgery
  0.29 d41401: Coronary atherosclerosis of native coronary artery p8856: Coronary arteriography using two catheters
  0.42 d5849: Acute kidney failure, unspecified p9672: Continuous invasive mechanical ventilation for 96 consecutive hours or more
  0.44 d25000: Diabetes mellitus without mention of complication, type II or unspecified type, not stated as uncontrolled p3615: Single internal mammary-coronary artery bypass a
  0.20 d51881: Acute respiratory failure p3893: Venous catheterization, not elsewhere classified
  0.22 d51881: Acute respiratory failure p9904: Transfusion of packed cells
  0.29 d5990: Urinary tract infection, site not specified p3893: Venous catheterization, not elsewhere classified
  0.22 d53081: Esophageal reflux p9390: Non-invasive mechanical ventilation
  0.23 d2720: Pure hypercholesterolemia p3891: Arterial catheterization
  0.48 dV053: Need for prophylactic vaccination and inoculation against viral hepatitis p9955: Prophylactic administration of vaccine against other diseases
  0.53 dV290: Observation for suspected infectious condition p9955: Prophylactic administration of vaccine against other diseases
  0.30 d2859: Anemia, unspecified p9915: Parenteral infusion of concentrated nutritional substances b
  0.24 d486: Pneumonia, organism unspecified p9671: Continuous invasive mechanical ventilation for less than 96 consecutive hours
  0.18 d2851: Acute posthemorrhagic anemia p9904: Transfusion of packed cells
  0.18 d2762: Acidosis p966: Enteral infusion of concentrated nutritional substances
  0.28 d496: Chronic airway obstruction, not elsewhere classified p3722: Left heart cardiac catheterization
  0.16 d99592: Severe sepsis p3893: Venous catheterization, not elsewhere classified
  0.26 d0389: Unspecified septicemia p966: Enteral infusion of concentrated nutritional substances
  0.26 d5070: Pneumonitis due to inhalation of food or vomitus p3893: Venous catheterization, not elsewhere classified
  0.33 dV3000: Single liveborn, born in hospital, delivered without mention of cesarean section p331: Spinal tap
  0.17 d5859: Chronic kidney disease, unspecified p9904: Transfusion of packed cells
  0.16 d412: Old myocardial infarction p8853: Angiocardiography of left heart structures
  0.18 d2875: Thrombocytopenia, unspecified p3893: Venous catheterization, not elsewhere classified
  0.25 d41071: Subendocardial infarction, initial episode of care p3723: Combined right and left heart cardiac catheterization
  0.21 d4240: Mitral valve disorders p9904: Transfusion of packed cells
  0.31 dV3001: Single liveborn, born in hospital, delivered by cesarean section p640: Circumcision
  0.23 d40391: Hypertensive chronic kidney disease, unspecified, with chronic kidney disease stage V or end stage renal disease p3893: Venous catheterization, not elsewhere classified
  0.17 d78552: Septic shock p9904: Transfusion of packed cells
  0.17 d9971: Cardiac complications, not elsewhere classified p8856: Coronary arteriography using two catheters
  0.27 d7742: Neonatal jaundice associated with preterm delivery p9983: Other phototherapy
  0.25 dV502: Routine or ritual circumcision p9955: Prophylactic administration of vaccine against other diseases c
  • a. Usually, people with diabetes also have heart disease, and heart disease can require a coronary artery bypass.

  • b. Someone with a chronic disease can develop anemia of chronic disease, and may also require parenteral nutrition for a specific condition.

  • c. This procedure is not inherently related to the disease, but the two frequently appear together in the same medical record because both happen to newborn babies.

  • The procedure is related to the treatment of the disease.

  • The procedure can lead to the disease as a side effect or complication.

Table 3: Matched disease–procedure pairs based on optimal transport

7.5 MIMIC III: More details of experiments

The enlarged optimal transport between diseases and procedures learned by our method is shown in Figure 5. The pairs whose values in the optimal transport matrix are larger than are listed in Table 3. Additionally, we asked two clinical researchers to evaluate these pairs: for each pair, each researcher independently checked whether the procedure is potentially related to the disease. The columns “CR1” and “CR2” in Table 3 give the evaluation results. For each pair, one mark means that the procedure is potentially related to the treatment of the disease, while the other means that the procedure can lead to the disease as a side effect or complication. We find that 1) the evaluation results from the two clinical researchers are highly consistent; and 2) over of the pairs are reasonable: they correspond to either “diseases and their treatments” or “procedures and their complications”. These results demonstrate that the learned optimal transport is clinically meaningful to some extent, reflecting real relationships between diseases and procedures. Table 4 lists the ICD codes of diseases and procedures with their detailed descriptions.

Figure 5: The optimal transport from diseases to procedures (enlarged version).
  ICD code Disease/Procedure
  d4019 Unspecified essential hypertension
  d41401 Coronary atherosclerosis of native coronary artery
  d4241 Aortic valve disorders
  dV4582 Percutaneous transluminal coronary angioplasty status
  d2724 Other and unspecified hyperlipidemia
  d486 Pneumonia, organism unspecified
  d99592 Severe sepsis
  d51881 Acute respiratory failure
  d5990 Urinary tract infection, site not specified
  d5849 Acute kidney failure, unspecified
  d78552 Septic shock
  d25000 Diabetes mellitus without mention of complication, type II or unspecified type
  d2449 Unspecified acquired hypothyroidism
  d41071 Subendocardial infarction, initial episode of care
  d4280 Congestive heart failure, unspecified
  d4168 Other chronic pulmonary heart diseases
  d412 Old myocardial infarction
  d2761 Hyposmolality and/or hyponatremia
  d2720 Pure hypercholesterolemia
  d2762 Acidosis
  d389 Unspecified septicemia
  d4589 Hypotension, unspecified
  d42731 Atrial fibrillation
  d2859 Anemia, unspecified
  d311 Cutaneous diseases due to other mycobacteria
  dV3001 Single liveborn, born in hospital, delivered by cesarean section
  dV053 Need for prophylactic vaccination and inoculation against viral hepatitis
  d4240 Mitral valve disorders
  dV3000 Single liveborn, born in hospital, delivered without mention of cesarean section
  d7742 Neonatal jaundice associated with preterm delivery
  d42789 Other specified cardiac dysrhythmias
  d5070 Pneumonitis due to inhalation of food or vomitus
  dV502 Routine or ritual circumcision
  d2760 Hyperosmolality and/or hypernatremia
  dV1582 Personal history of tobacco use
  d40390 Hypertensive chronic kidney disease, unspecified, with chronic kidney disease stage I through stage IV, or unspecified
  dV4581 Aortocoronary bypass status
  dV290 Observation for suspected infectious condition
  d5845 Acute kidney failure with lesion of tubular necrosis
  d2875 Thrombocytopenia, unspecified
  d2767 Hyperpotassemia
  d32723 Obstructive sleep apnea (adult)(pediatric)
  dV5861 Long-term (current) use of anticoagulants
  d2851 Acute posthemorrhagic anemia
  d53081 Esophageal reflux
  d496 Chronic airway obstruction, not elsewhere classified
  d40391 Hypertensive chronic kidney disease, unspecified, with chronic kidney disease stage V or end stage renal disease
  d9971 Cardiac complications, not elsewhere classified
  d5119 Unspecified pleural effusion
  d2749 Gout, unspecified
  d5859 Chronic kidney disease, unspecified
  d49390 Asthma, unspecified type, unspecified
  d45829 Other iatrogenic hypotension
  d3051 Tobacco use disorder
  dV5867 Long-term (current) use of insulin
  d5180 Pulmonary collapse
  p9604 Insertion of endotracheal tube
  p9671 Continuous invasive mechanical ventilation for less than 96 consecutive hours
  p3615 Single internal mammary-coronary artery bypass
  p3961 Extracorporeal circulation auxiliary to open heart surgery
  p8872 Diagnostic ultrasound of heart
  p9904 Transfusion of packed cells
  p9907 Transfusion of other serum
  p9672 Continuous invasive mechanical ventilation for 96 consecutive hours or more
  p331 Spinal tap
  p3893 Venous catheterization, not elsewhere classified
  p966 Enteral infusion of concentrated nutritional substances
  p3995 Hemodialysis
  p9915 Parenteral infusion of concentrated nutritional substances
  p8856 Coronary arteriography using two catheters
  p9955 Prophylactic administration of vaccine against other diseases
  p3891 Arterial catheterization
  p9390 Non-invasive mechanical ventilation
  p9983 Other phototherapy
  p640 Circumcision
  p3722 Left heart cardiac catheterization
  p8853 Angiocardiography of left heart structures
  p3723 Combined right and left heart cardiac catheterization
  p5491 Percutaneous abdominal drainage
  p3324 Closed (endoscopic) biopsy of bronchus
  p4513 Other endoscopy of small intestine
Table 4: The map between ICD codes and diseases/procedures