1 Introduction
A wide variety of data are represented as relational data, such as friend links on social networks, hyperlinks between web pages, citations between scientific articles, word occurrences in documents, item purchases by users, and interactions between proteins. In this paper, we consider object matching for relational data, which is the task of finding correspondences between objects in different relational datasets. For example, corresponding objects are words with the same meaning in different languages haghighi2008, the same users in different databases korula2014efficient, related entities in different ontologies doan2002learning; aumueller2005schema, and proteins with the same function in different species kuchaiev2010topological. Many methods have been proposed for object matching when a similarity measure is defined between objects in different domains or correspondence information is given socher2010connecting; zhou2012factorized; terada2012. However, similarity measures might not be defined across different modalities, and correspondence information may be unavailable due to privacy concerns or the high cost of obtaining it.
We propose an unsupervised method to find matchings of objects in different relational datasets without similarity measures or correspondence information. First, the proposed method estimates latent vectors for objects in each relational dataset by making use of representation learning on graphs perozzi2014deepwalk. The latent vectors encode hidden structural information about the relational dataset by modeling neighbor objects with the inner product of the latent vectors, where the neighbors are generated by short random walks over the relations. Then, the latent vectors are linearly projected onto a common latent space shared across all relational datasets by matching the latent vector distributions while preserving the encoded structural information. To represent the distributions effectively, we use the kernel embeddings of distributions sriperumbudur2010hilbert, which hold high-order moment information about a distribution as an element in a reproducing kernel Hilbert space (RKHS). This enables us to calculate the distance between the distributions, which is called the maximum mean discrepancy (MMD) gretton2012kernel, without density estimation. The structural information is preserved by using an orthogonal projection matrix, since it does not change the values of the inner product. We estimate a projection matrix by minimizing the MMD between the latent vector distributions of different relational datasets with an orthogonality regularizer. Objects to be matched are estimated based on the distance in the common latent space. Figure 1 shows an overview of the proposed method.
2 Related work
A number of unsupervised object matching methods have been proposed haghighi2008; quadrianto2010; djuric2012convex; klami2012; klami2013bayesian; yamada2011cross, such as kernelized sorting quadrianto2010 and Bayesian object matching klami2013bayesian. However, these methods are not designed for relational data. In addition, they do not scale to large data, since they find correspondences by estimating a (probabilistic) permutation matrix whose size is quadratic in the number of objects. In contrast, the proposed method scales well, since it estimates a projection matrix whose size is quadratic in the latent space dimensionality, which is much smaller than the number of objects.
The ReMatch method is an unsupervised cluster matching method for relational data iwata2016. ReMatch assigns each object to a cluster that is shared across all datasets, and finds correspondences based on the cluster assignments. Therefore, multiple objects assigned to the same cluster cannot be distinguished, and there would be many ties when objects are ranked based on the estimated correspondence. In contrast, the proposed method estimates different continuous feature representations for different objects.
In natural language processing, methods for word translation without parallel data have been proposed cao2016distribution; zhang2017earth; zhang2017adversarial; lample2018word. With these methods, word embeddings obtained by word2vec mikolov2013distributed are transformed by matching the distributions under an orthogonality constraint xing2015normalized; smith2017offline. Although the proposed method takes a similar approach, there are two clear differences. First, the proposed method is for relational data, whereas these methods are for natural language sentences. Second, the techniques for matching distributions differ. The distributions are matched based on their mean and variance in cao2016distribution, which implicitly assumes that the latent vectors follow Gaussian distributions. The proposed method uses the kernel embeddings of distributions, by which higher-order moments as well as the mean and covariance are considered for matching; the latent vector distributions of relational data are generally not Gaussian, as shown in Figure 3 in our experiments. The earth mover's distance is used as the distribution distance in zhang2017earth, which requires solving a minimization problem. In contrast, the MMD used by the proposed method is calculated in closed form as a weighted sum of kernel functions, without any optimization. Adversarial training, which solves a minimax optimization problem, is used for matching the distributions in zhang2017adversarial; lample2018word. Although adversarial training has been used successfully, especially for image generation goodfellow2014generative, it is well known to be unstable arjovsky2017towards, and it requires training an additional neural network as a discriminator. The MMD has fewer tuning hyperparameters than the adversarial approach, and the proposed method works stably, as empirically shown in Section 5.

The kernel embeddings of distributions and the MMD have been successfully used for various applications, such as statistical independence tests gretton2008kernel; NIPS2012_4727; zaremba2013b, discriminative learning on probability measures muandet2012learning, anomaly detection for group data muandet2013one, density estimation dudik2007maximum, a three-variable interaction test sejdinovic2013kernel, a goodness-of-fit test chwialkowski2016kernel, supervised object matching yoshikawa2015cross, domain adaptation long2015learning, and deep generative modeling li2015generative; li2017mmd.

The proposed method is related to graph matching and network alignment methods, which find correspondences such that matched objects have the same edge relations and/or similar attributes NIPS2006_2960; NIPS2009_3800; NIPS2009_3756; NIPS2013_4925; NIPS2017_6911; kuchaiev2010topological; terada2012; cho2010reweighted; gori2005exact; sharma2010shape. The proposed method differs from them in that it considers not only directly connected edge relations but also the intrinsic hidden structure of the given relational dataset, without attributes, by embedding objects into a latent space. The proposed method builds on recent advances in graph representation learning, such as DeepWalk perozzi2014deepwalk, LINE tang2015line and node2vec grover2016node2vec. These methods obtain node representations that contain structural information about the graph, and have been successfully used for link prediction and node classification. Although spectral methods for graph matching have been proposed knossow2009inexact; umeyama1988eigendecomposition, DeepWalk and node2vec achieved better performance than the spectral methods in graph representation learning perozzi2014deepwalk; grover2016node2vec.
3 Kernel embeddings of distributions
In this section, we explain the kernel embeddings of distributions, which are employed in the proposed method. The kernel embeddings of distributions sriperumbudur2010hilbert embed a probability distribution $P$ on space $\mathcal{X}$ into a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_k$ specified by kernel $k$. In particular, distribution $P$ is represented as an element in the RKHS as follows, $m_P = \mathbb{E}_{x \sim P}[k(\cdot, x)] \in \mathcal{H}_k$. The kernel embedding preserves the properties of the distribution, such as the mean, covariance and higher-order moments, when characteristic kernels, such as Gaussian kernels, are used sriperumbudur2010hilbert.

Although distribution $P$ is unknown in practice, we can estimate the empirical kernel embedding using a set of samples $X = \{x_n\}_{n=1}^{N}$ drawn from the distribution. By interpreting the samples as an empirical distribution $\hat{P} = \frac{1}{N}\sum_{n=1}^{N}\delta_{x_n}(\cdot)$, where $\delta_x(\cdot)$ is the Dirac delta function, the empirical kernel embedding is given by $\hat{m}_P = \frac{1}{N}\sum_{n=1}^{N} k(\cdot, x_n)$, which approximates $m_P$ with an error rate of $O(N^{-1/2})$ smola2007hilbert. Unlike kernel density estimation, the error rate of the kernel embeddings is independent of the dimensionality of the given distribution.
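In practice the embeddings are used only through RKHS inner products, which reduce to averages of kernel evaluations; their differences give the squared MMD used below. A minimal NumPy sketch with a Gaussian kernel (the function names are ours, not from the paper):

```python
import numpy as np

def gaussian_kernel(X, Y, gamma=1.0):
    """Gram matrix of k(x, y) = exp(-gamma * ||x - y||^2) for all row pairs."""
    sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

def embedding_inner_product(X, Y, gamma=1.0):
    """RKHS inner product of the empirical embeddings of X (N x K) and Y (M x K):
    the mean of all pairwise kernel evaluations."""
    return gaussian_kernel(X, Y, gamma).mean()

def mmd_squared(X, Y, gamma=1.0):
    """Squared MMD between the empirical distributions of X and Y."""
    return (embedding_inner_product(X, X, gamma)
            - 2.0 * embedding_inner_product(X, Y, gamma)
            + embedding_inner_product(Y, Y, gamma))
```

Note that no density is ever estimated: only pairwise kernel values between the samples are needed.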
By using the kernel embeddings, we can measure the distance between two distributions. Given two sets of vectors, $X = \{x_n\}_{n=1}^{N}$ and $Y = \{y_m\}_{m=1}^{M}$, we obtain their kernel embeddings $\hat{m}_X$ and $\hat{m}_Y$. Then, the distance between $X$ and $Y$ is given by
$\mathrm{MMD}^2(X, Y) = \|\hat{m}_X - \hat{m}_Y\|_{\mathcal{H}_k}^{2} = \langle \hat{m}_X, \hat{m}_X \rangle - 2\langle \hat{m}_X, \hat{m}_Y \rangle + \langle \hat{m}_Y, \hat{m}_Y \rangle,$   (1)
where $\langle \cdot, \cdot \rangle$ is the inner product in the RKHS, and this distance is called the maximum mean discrepancy (MMD) gretton2008kernel. The inner product is calculated by
$\langle \hat{m}_X, \hat{m}_Y \rangle = \frac{1}{NM}\sum_{n=1}^{N}\sum_{m=1}^{M} k(x_n, y_m),$   (2)
which is represented by an empirical expectation of the kernel. When the number of vectors is large, this computation can be expensive. In that case, we can obtain an unbiased estimate of the MMD by using a Monte Carlo approximation of the expectation NIPS2012_4727; long2015learning, in which vectors are randomly sampled to compute the expectation.

4 Proposed method
4.1 Task
Suppose that we are given $D$ relational datasets, or networks, $\{G_d\}_{d=1}^{D}$. Here, $G_d = (O_d, R_d)$ is the $d$th relational dataset, $O_d$ is the set of objects, $N_d = |O_d|$ is the number of objects, and $R_d$ is the set of relations. The proposed method is applicable to arbitrary kinds of relational data, such as single-type, multi-type and/or weighted relations, provided that all the given datasets are of the same type. The task is to find correspondences between objects in different relational datasets.
4.2 Procedures
Latent vector estimation
With the proposed method, continuous feature representations for objects are obtained in each relational dataset. We obtain the object representations by using a skip-gram mikolov2013efficient based approach for graphs perozzi2014deepwalk; grover2016node2vec. This approach encodes the structural information of the given graph as continuous vectors. In particular, each object $i$ is assumed to have two latent vectors $\mathbf{u}_{di} \in \mathbb{R}^{K}$ and $\mathbf{v}_{di} \in \mathbb{R}^{K}$, where $K$ is the dimensionality of the latent space. The probability that object $j$ is a neighbor of object $i$ in the $d$th dataset is modeled by the inner product of the latent vectors as follows,
$p(j \mid i) = \frac{\exp(\mathbf{u}_{di}^{\top}\mathbf{v}_{dj})}{\sum_{j'=1}^{N_d}\exp(\mathbf{u}_{di}^{\top}\mathbf{v}_{dj'})}.$   (3)
The latent vectors for all objects in the $d$th dataset, $\mathbf{U}_d = \{\mathbf{u}_{di}\}_{i=1}^{N_d}$ and $\mathbf{V}_d = \{\mathbf{v}_{di}\}_{i=1}^{N_d}$, are obtained by maximizing the following likelihood of the neighbors,
$\sum_{i=1}^{N_d}\sum_{j \in C_{di}} \log p(j \mid i),$   (4)
where $C_{di}$ is the set of neighbors of object $i$. The summation over all objects in the denominator of (3) is expensive to compute for large data. We approximate it with negative sampling mikolov2013distributed for efficiency.
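As a sketch, the per-pair negative-sampling surrogate for the softmax in (3) can be written as follows. The names are ours; a full implementation would sample the negatives from a unigram-style noise distribution and update both latent vectors by stochastic gradient:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(u_i, v_pos, v_negs):
    """Negative-sampling surrogate for the softmax in (3): push the inner
    product with the observed neighbor up, and with sampled non-neighbors down."""
    loss = -np.log(sigmoid(u_i @ v_pos))
    for v_neg in v_negs:
        loss -= np.log(sigmoid(-(u_i @ v_neg)))
    return loss
```

Minimizing this loss over all (object, neighbor) pairs avoids the summation over all $N_d$ objects in the denominator of (3).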
The neighbors are generated by short random walks over the relations $R_d$. We conduct multiple random walks starting from every object. A random walk chooses the next object uniformly at random from the objects that have relations with the current object, until the maximum length is reached. The objects that appear within a window of each other in the random walk sequences are considered as neighbors. Random walks have been used successfully for capturing structure in graphs andersen2006local, and are robust to perturbations in the form of noisy or missing relations grover2016node2vec. The neighbors are not restricted to objects directly connected by relations, but are generated depending on the local relational structure.
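The neighbor generation described above can be sketched as follows, with the relations stored as an adjacency list (the helper names and data layout are ours, not from the paper):

```python
import random

def random_walk(adj, start, length, rng):
    """Walk of at most `length` objects, choosing each next object uniformly
    at random among those related to the current one."""
    walk = [start]
    for _ in range(length - 1):
        neighbors = adj[walk[-1]]
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk

def window_neighbors(walks, window):
    """Treat objects within `window` positions of each other in a walk as neighbors."""
    pairs = []
    for walk in walks:
        for i, obj in enumerate(walk):
            for j in range(max(0, i - window), min(len(walk), i + window + 1)):
                if j != i:
                    pairs.append((obj, walk[j]))
    return pairs
```

The resulting pairs serve as the neighbor sets in the likelihood (4).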
Projection matrix estimation
Since the object representations are obtained independently for each dataset, the obtained representations, $\{\mathbf{u}_{di}\}$ and $\{\mathbf{v}_{di}\}$, are not related across different datasets. We therefore project the representations onto a common latent space shared across all datasets by matching the distributions while preserving the encoded structural information. We assume the following linear transformation with an orthogonal projection matrix $\mathbf{W}_d \in \mathbb{R}^{K \times K}$,

$\tilde{\mathbf{u}}_{di} = \mathbf{W}_d \mathbf{u}_{di}, \qquad \tilde{\mathbf{v}}_{di} = \mathbf{W}_d \mathbf{v}_{di},$   (5)

where $\tilde{\mathbf{u}}_{di}$ and $\tilde{\mathbf{v}}_{di}$ are the transformed latent vectors of $\mathbf{u}_{di}$ and $\mathbf{v}_{di}$, respectively. Here, without loss of generality, we can fix the transformation for the first dataset to the identity matrix, $\mathbf{W}_1 = \mathbf{I}$. By assuming the orthogonality $\mathbf{W}_d^{\top}\mathbf{W}_d = \mathbf{I}$, we preserve the encoded structural information, since the inner product of the transformed vectors is the same as that of the original vectors, $\tilde{\mathbf{u}}_{di}^{\top}\tilde{\mathbf{v}}_{dj} = \mathbf{u}_{di}^{\top}\mathbf{W}_d^{\top}\mathbf{W}_d\mathbf{v}_{dj} = \mathbf{u}_{di}^{\top}\mathbf{v}_{dj}$, and the relations are modeled using only the inner product in the proposed method in (3).

We would like the transformed vectors to follow the same distribution across all datasets. We employ the kernel embeddings of distributions to represent the distributions of the transformed latent vectors of the $d$th dataset as follows, $\hat{m}_d^{u} = \frac{1}{N_d}\sum_{i=1}^{N_d} k(\cdot, \tilde{\mathbf{u}}_{di})$ and $\hat{m}_d^{v} = \frac{1}{N_d}\sum_{i=1}^{N_d} k(\cdot, \tilde{\mathbf{v}}_{di})$. Then, the distance between the latent vector distributions of datasets $d$ and $d'$ is measured by $\mathrm{MMD}^2(\tilde{\mathbf{U}}_d, \tilde{\mathbf{U}}_{d'})$ and $\mathrm{MMD}^2(\tilde{\mathbf{V}}_d, \tilde{\mathbf{V}}_{d'})$, where $\tilde{\mathbf{U}}_d = \{\tilde{\mathbf{u}}_{di}\}_{i=1}^{N_d}$ and $\tilde{\mathbf{V}}_d = \{\tilde{\mathbf{v}}_{di}\}_{i=1}^{N_d}$, which are calculated using (1) and (2).
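The inner-product preservation that motivates the orthogonality assumption can be checked numerically; a minimal sketch, where the QR decomposition is used only as a convenient way to obtain an orthogonal matrix:

```python
import numpy as np

rng = np.random.RandomState(0)
# An orthogonal matrix from the QR decomposition of a random matrix.
W, _ = np.linalg.qr(rng.randn(5, 5))
u, v = rng.randn(5), rng.randn(5)

# W^T W = I, so the inner product used by the neighbor model is unchanged:
# (W u) . (W v) = u . (W^T W) v = u . v
assert np.allclose(W.T @ W, np.eye(5))
assert np.allclose((W @ u) @ (W @ v), u @ v)
```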
We obtain the orthogonal matrices $\{\mathbf{W}_d\}_{d=2}^{D}$, by which the transformed vectors follow the same distribution, by minimizing the following objective function,
$E(\{\mathbf{W}_d\}_{d=2}^{D}) = \sum_{d=2}^{D}\left\|\mathbf{W}_d^{\top}\mathbf{W}_d - \mathbf{I}\right\|_F^{2} + \lambda \sum_{d < d'} \left[\mathrm{MMD}^2(\tilde{\mathbf{U}}_d, \tilde{\mathbf{U}}_{d'}) + \mathrm{MMD}^2(\tilde{\mathbf{V}}_d, \tilde{\mathbf{V}}_{d'})\right],$   (6)
where the first term handles the orthogonality, the second term handles the distribution matching, and $\lambda > 0$ is a hyperparameter that balances the two. The objective function is minimized by using gradient-based optimization methods.
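A two-dataset sketch of this objective in NumPy, under the assumption of a Gaussian kernel and with function and variable names that are ours: minimizing it with a gradient-based optimizer such as Adam recovers the procedure described above.

```python
import numpy as np

def gaussian_gram(X, Y, gamma=1.0):
    return np.exp(-gamma * ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2))

def mmd_sq(X, Y, gamma=1.0):
    """Biased empirical squared MMD between the rows of X and Y."""
    return (gaussian_gram(X, X, gamma).mean()
            - 2.0 * gaussian_gram(X, Y, gamma).mean()
            + gaussian_gram(Y, Y, gamma).mean())

def objective(W, U1, U2, lam=1.0, gamma=1.0):
    """Two-dataset version of the objective: orthogonality penalty on W plus
    the MMD between the fixed latent vectors U1 and the transformed vectors."""
    ortho = np.sum((W.T @ W - np.eye(W.shape[1])) ** 2)
    return ortho + lam * mmd_sq(U1, U2 @ W.T, gamma)
```

When $W$ is orthogonal and the transformed vectors already follow the same distribution as the first dataset's, both terms vanish.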
Matching
The object correspondences are calculated based on the Euclidean distance between the transformed latent vectors in the common latent space. The ranking of objects in the $d$th dataset to be matched with a given object is obtained in ascending order of the Euclidean distance between their transformed latent vectors.
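This matching step can be sketched as follows (the helper name is ours, not from the paper):

```python
import numpy as np

def match_ranking(z, Z_other):
    """Indices of objects in the other dataset, ordered by ascending
    Euclidean distance from the transformed latent vector z."""
    return np.argsort(np.linalg.norm(Z_other - z, axis=1))
```

The first entry of the returned ranking is the estimated match for the query object.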
4.3 Model
We described the procedures for matching with the proposed method in the previous subsection. In this subsection, we explain the model that the proposed method assumes.
The proposed model assumes that each object in every dataset has two latent vectors in a common latent space. The distance in the common latent space indicates the correspondence: objects located close to each other are considered to be matched. The latent vectors of all datasets are generated from common distributions, one for each type of latent vector. The proposed method does not model these common distributions explicitly. Instead, common distributions are achieved by minimizing the distance between the distributions of different datasets based on the MMD. With this approach, we need neither to assume a parametric form for the distributions nor to estimate them.
The neighbors, which are defined by the relations, are assumed to be modeled using the latent vectors with (3). The neighbor model (3) contains no dataset-specific parameters other than the latent vectors, and the form of the model is the same across all datasets given the latent vectors. With this modeling, all we need to consider for finding correspondences is the latent vector distributions.
Although we could estimate the latent vectors in the common space directly in a one-step approach, the proposed method employs the following two-step approach: 1) estimate individual latent vectors for each dataset, and then 2) transform them into the common space. The two-step approach has advantages over the one-step approach. First, the latent vector estimation is easy to parallelize. Second, we can estimate the latent vectors robustly by separating the structural information encoding from the distribution matching; the one-step approach can deteriorate the encoding quality by enforcing the distribution matching. In our preliminary experiments, the two-step approach performed better than the one-step approach. This robust two-step approach is made possible by modeling the relations using only the inner product of the latent vectors and by introducing the orthogonality regularizer to preserve the encoded information.
5 Experiments
Data
To demonstrate the effectiveness of the proposed method, we used the following two datasets: Wikipedia and Movielens.
The Wikipedia data were multilingual document-word relational datasets in English (EN), German (DE), Italian (IT) and Japanese (JP), where objects were documents and words, and a document had a relation with a word when the document contained the word. The documents were obtained from the following five categories in Wikipedia: ‘Nobel laureates in Physics’, ‘Nobel laureates in Chemistry’, ‘American basketball players’, ‘American composers’ and ‘English footballers’. For each category, we sampled 20 documents that appeared in all languages. We used 1,000 frequent words after removing stopwords for each language. There were 100 document objects and 1,000 word objects, and 9,191 relations on average for each document-word relational dataset in a language.
The Movielens data are a standard benchmark dataset for collaborative filtering herlocker1999algorithmic. The original data contained 943 users, 1,682 movies, and 100,000 ratings. First, we randomly split the users into two sets. Then, two user-item relational datasets were constructed by using the first or second set of users and all items, where objects were users and items, and a user and an item were connected when the user had rated the item. We call this data Movielens-User. There were 471 user objects and 1,682 item objects in each dataset. We also constructed the Movielens-Item data by randomly splitting the items into two sets, where there were 943 user objects and 841 item objects in each dataset.
For the evaluation measure, we used the top-$k$ accuracy, which is the rate at which the correctly corresponding object is included in the top $k$ of the ranking estimated by each method.
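As a concrete sketch, this measure can be computed as follows, given per-query rankings of candidate indices (the helper name and input format are ours):

```python
def top_k_accuracy(rankings, true_matches, k):
    """Fraction of query objects whose true counterpart appears within the
    first k entries of its estimated ranking."""
    hits = sum(true_matches[q] in rankings[q][:k] for q in range(len(rankings)))
    return hits / len(rankings)
```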
Comparing methods
We compared the proposed method with the following five methods: Degree, DeepWalk, CKS, Adversarial and ReMatch. The Degree method finds correspondences between objects based on their degree, i.e., the number of relations of the object, where objects with similar degrees are considered matched.
The DeepWalk method finds correspondences using the Euclidean distance between 50-dimensional continuous feature vectors obtained by DeepWalk perozzi2014deepwalk. DeepWalk is a representation learning method for graphs and not a matching method, but we included it as a baseline since the proposed method is based on it. The DeepWalk method estimates the feature vectors by maximizing the likelihood (4). We used typical values for the DeepWalk hyperparameters: the number of walks starting from each object was 10, the length of each walk was 80, the window size of neighbors in a random-walk sequence was 10, the number of negative samples was 20, the batch size was 4,096, and the Adam optimizer kingma2014adam was used for optimization.
The CKS method is convex kernelized sorting djuric2012convex, an unsupervised object matching method. With the CKS, correspondences are found by stochastically permuting objects so as to maximize the dependence between two datasets, where the Hilbert-Schmidt independence criterion (HSIC) is used to measure the dependence. The ranking of objects to be matched is obtained using the match probabilities estimated by the CKS. The input to the CKS was the 50-dimensional continuous feature vectors obtained by the DeepWalk method, and Gaussian kernels were used for calculating the HSIC. We used the code provided by the authors tuc. Note that with the CKS we found matchings of documents (not words) with the Wikipedia data, of users with the Movielens-User data, and of items with the Movielens-Item data.
The Adversarial method orthogonally transforms the continuous feature vectors obtained by the DeepWalk method onto a common latent space by matching the distributions using an adversarial approach lample2018word; conneau2017word. With the Adversarial method, a neural network called the discriminator is trained to predict the dataset identifier from which each transformed latent vector comes. The projection matrices are optimized so as to prevent the discriminator from predicting the dataset identifiers. We used the code provided by the authors muse, with the default hyperparameters. The Euclidean distance between the 50-dimensional transformed continuous feature vectors was used for matching.
The ReMatch method is an unsupervised cluster matching method for relational data iwata2016. With ReMatch, each object is assigned to a common cluster shared across all datasets based on stochastic block modeling wang1987stochastic; kemp2006learning, where the number of clusters is automatically estimated from the given data using Dirichlet processes. We considered objects assigned to the same cluster to be matched.
The proposed method was trained with the following settings. We obtained 50-dimensional continuous latent vectors with the same settings as the DeepWalk method described above, and transformed them by minimizing (6). For the kernel used to calculate the MMD, we used the Gaussian kernel, $k(\mathbf{x}, \mathbf{y}) = \exp(-\gamma\|\mathbf{x} - \mathbf{y}\|^{2})$. We fixed the hyperparameter $\lambda$ in (6) to the same value for all the datasets. For optimization, we used the Adam optimizer kingma2014adam. We ran the proposed method ten times with different initializations, and selected the result with the lowest value of the objective function (6). The correspondences were found based on the Euclidean distance between the 50-dimensional transformed continuous feature vectors.

Results
Figure 2: Top-$k$ accuracy on (a) Wikipedia EN-DE, (b) Wikipedia EN-IT, (c) Wikipedia EN-JP, (d) Wikipedia DE-IT, (e) Wikipedia DE-JP, (f) Wikipedia IT-JP, (g) Movielens-User and (h) Movielens-Item. The error bars show the standard error.
Figure 2 shows the top-$k$ accuracy with the Wikipedia and Movielens datasets, averaged over ten experiments. The proposed method achieved the highest accuracy on seven of the eight datasets, and the second highest on the Movielens-Item data (h). The accuracy of the DeepWalk method was low, almost the same as random matching. This is reasonable, because DeepWalk obtains the latent vectors of each dataset independently. The CKS, Adversarial and proposed methods used the independent latent vectors estimated by DeepWalk as inputs, and achieved higher accuracy than DeepWalk. This result indicates that these methods found the relationship between datasets in an unsupervised fashion. The performance of the CKS was worse than that of the proposed method. This would be because the CKS finds alignments based on kernel matrices of the latent vectors without using their characteristics, whereas the proposed method exploits, through the orthogonality regularization, the fact that the structural information is encoded in their inner products. The Adversarial method gave lower accuracy than the proposed method. Note that we did not tune the hyperparameters of the Adversarial method, and careful hyperparameter tuning would give better performance. However, it is well known that adversarial training often becomes unstable arjovsky2017towards, and hyperparameter tuning is difficult. In contrast, the proposed method needs no additional neural network, i.e. the discriminator of the Adversarial method, and matches the distributions stably using the MMD. Since ReMatch represents an object by its cluster assignment, objects in the same cluster have identical representations, and the accuracy of ReMatch was low.
Figure 3 shows two-dimensional visualizations of the latent vectors with stochastic neighbor embedding (SNE) maaten2008visualizing before (a,b,c) and after (d,e,f) the transformation. The SNE is a nonlinear dimensionality reduction method. With the Wikipedia data, DeepWalk obtained similar latent vectors for documents in the same category within each language, but dissimilar latent vectors across languages (a). The proposed method successfully found the category alignment, as shown in (d), where documents in the same category in different languages are located close together. Similarly, the latent vectors obtained by DeepWalk with the Movielens data (b,c) were transformed by the proposed method so as to match the distributions (e,f).
Figure 3: Latent vectors obtained with the DeepWalk method (top: (a) Wikipedia EN-DE, (b) Movielens-User, (c) Movielens-Item) and transformed latent vectors obtained with the proposed method (bottom: (d) Wikipedia EN-DE, (e) Movielens-User, (f) Movielens-Item).
Table 1 shows the computational time in seconds. The computational time of the CKS method increased cubically with the number of objects, and it did not scale to large data. On the other hand, the proposed, DeepWalk and Adversarial methods scale well by using stochastic gradient methods.
Table 1: Computational time in seconds.

                 Proposed   DeepWalk   Adversarial      CKS   ReMatch
Wikipedia           1,204        291         3,153      317       297
Movielens-User      4,473        394         3,268   17,764       679
Movielens-Item      5,337        378         3,299  103,950       651
6 Conclusion
We proposed an unsupervised object matching method for relational data. With the proposed method, object representations that contain hidden structural information are obtained for each relational dataset, and the representations are transformed onto a common latent space shared across all datasets by matching the distributions while preserving the structural information. In the experiments, we confirmed that the proposed method can effectively and efficiently find matchings in relational data. For future work, we would like to extend the proposed method to a semi-supervised setting, where a few correspondences are given.
References
 [1] https://astro.temple.edu/~tuc17157.
 [2] https://github.com/facebookresearch/MUSE.
 [3] R. Andersen, F. Chung, and K. Lang. Local graph partitioning using pagerank vectors. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science, pages 475–486. IEEE, 2006.
 [4] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862, 2017.
 [5] D. Aumueller, H.-H. Do, S. Massmann, and E. Rahm. Schema and ontology matching with COMA++. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 906–908. ACM, 2005.
 [6] H. Cao, T. Zhao, S. Zhang, and Y. Meng. A distributionbased model to learn bilingual word embeddings. In Proceedings of the 26th International Conference on Computational Linguistics, pages 1818–1827, 2016.

 [7] M. Cho, J. Lee, and K. M. Lee. Reweighted random walks for graph matching. In European Conference on Computer Vision, pages 492–505. Springer, 2010.
 [8] K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In International Conference on Machine Learning, pages 2606–2615, 2016.
 [9] A. Conneau, G. Lample, M. Ranzato, L. Denoyer, and H. Jégou. Word translation without parallel data. arXiv preprint arXiv:1710.04087, 2017.
 [10] T. Cour, P. Srinivasan, and J. Shi. Balanced graph matching. In Advances in Neural Information Processing Systems, pages 313–320. MIT Press, 2007.

 [11] N. Djuric, M. Grbovic, and S. Vucetic. Convex kernelized sorting. In AAAI Conference on Artificial Intelligence, 2012.
 [12] A. Doan, J. Madhavan, P. Domingos, and A. Halevy. Learning to map between ontologies on the semantic web. In Proceedings of the 11th International Conference on World Wide Web, pages 662–673. ACM, 2002.
 [13] M. Dudík, S. J. Phillips, and R. E. Schapire. Maximum entropy density estimation with generalized regularization and an application to species distribution modeling. Journal of Machine Learning Research, 8(Jun):1217–1260, 2007.
 [14] M. Fiori, P. Sprechmann, J. Vogelstein, P. Muse, and G. Sapiro. Robust multimodal graph matching: Sparse coding meets graph matching. In Advances in Neural Information Processing Systems, pages 127–135, 2013.
 [15] I. Goodfellow, J. PougetAbadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 [16] M. Gori, M. Maggini, and L. Sarti. Exact and approximate graph matching using random walks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(7):1100–1111, 2005.
 [17] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel twosample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
 [18] A. Gretton, K. Fukumizu, C. H. Teo, L. Song, B. Schölkopf, and A. J. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems, pages 585–592, 2008.
 [19] A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, and B. K. Sriperumbudur. Optimal kernel choice for largescale twosample tests. In Advances in Neural Information Processing Systems 25, pages 1205–1213, 2012.
 [20] A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864. ACM, 2016.

 [21] A. Haghighi, P. Liang, T. Berg-Kirkpatrick, and D. Klein. Learning bilingual lexicons from monolingual corpora. In Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 771–779, 2008.
 [22] J. L. Herlocker, J. A. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 230–237. ACM, 1999.
 [23] T. Iwata, J. Lloyd, and Z. Ghahramani. Unsupervised manytomany object matching for relational data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(3):607–619, 2016.
 [24] B. Jiang, J. Tang, C. Ding, Y. Gong, and B. Luo. Graph matching via multiplicative update algorithm. In Advances in Neural Information Processing Systems, pages 3187–3195, 2017.
 [25] C. Kemp, J. B. Tenenbaum, T. L. Griffiths, T. Yamada, and N. Ueda. Learning systems of concepts with an infinite relational model. In AAAI, volume 3, page 5, 2006.
 [26] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [27] A. Klami. Variational Bayesian matching. In Asian Conference on Machine Learning, pages 205–220, 2012.
 [28] A. Klami. Bayesian object matching. Machine Learning, 92(2-3):225–250, 2013.

 [29] D. Knossow, A. Sharma, D. Mateus, and R. Horaud. Inexact matching of large and sparse graphs using Laplacian eigenvectors. In International Workshop on Graph-based Representations in Pattern Recognition, pages 144–153. Springer, 2009.
 [30] N. Korula and S. Lattanzi. An efficient reconciliation algorithm for social networks. Proceedings of the VLDB Endowment, 7(5):377–388, 2014.
 [31] O. Kuchaiev, T. Milenković, V. Memišević, W. Hayes, and N. Pržulj. Topological network alignment uncovers biological function and phylogeny. Journal of the Royal Society Interface, page rsif20100063, 2010.
 [32] G. Lample, A. Conneau, L. Denoyer, H. Jégou, et al. Word translation without parallel data. In Proceedings of the 6th International Conference on Learning Representations, 2018.
 [33] M. Leordeanu, M. Hebert, and R. Sukthankar. An integer projected fixed point method for graph matching and map inference. In Advances in Neural Information Processing Systems, pages 1114–1122, 2009.
 [34] C.L. Li, W.C. Chang, Y. Cheng, Y. Yang, and B. Póczos. MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, pages 2200–2210, 2017.
 [35] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In International Conference on Machine Learning, pages 1718–1727, 2015.
 [36] M. Long, Y. Cao, J. Wang, and M. I. Jordan. Learning transferable features with deep adaptation networks. In Proceedings of the 32nd International Conference on Machine Learning, pages 97–105, 2015.
 [37] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
 [38] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In Proceedings of the 1st International Conference on Learning Representations, 2013.
 [39] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
 [40] K. Muandet, K. Fukumizu, F. Dinuzzo, and B. Schölkopf. Learning from distributions via support measure machines. In Advances in Neural Information Processing Systems, pages 10–18, 2012.
 [41] K. Muandet and B. Schölkopf. One-class support measure machines for group anomaly detection. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence, pages 449–458. AUAI Press, 2013.
 [42] B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710. ACM, 2014.
 [43] J. Petterson, J. Yu, J. J. Mcauley, and T. S. Caetano. Exponential family graph matching and ranking. In Advances in Neural Information Processing Systems, pages 1455–1463, 2009.
 [44] N. Quadrianto, A. J. Smola, L. Song, and T. Tuytelaars. Kernelized sorting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(10):1809–1821, 2010.
 [45] D. Sejdinovic, A. Gretton, and W. Bergsma. A kernel test for three-variable interactions. In Advances in Neural Information Processing Systems, pages 1124–1132, 2013.
 [46] A. Sharma and R. Horaud. Shape matching based on diffusion embedding and on mutual isometric consistency. In NORDIA 2010: Workshop on Non-rigid Shape Analysis and Deformable Image Alignment, pages 29–36. IEEE, 2010.
 [47] S. L. Smith, D. H. Turban, S. Hamblin, and N. Y. Hammerla. Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv preprint arXiv:1702.03859, 2017.
 [48] A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Algorithmic Learning Theory, pages 13–31, 2007.
 [49] R. Socher and L. Fei-Fei. Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 966–973. IEEE, 2010.
 [50] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11:1517–1561, 2010.
 [51] J. Tang, M. Qu, M. Wang, M. Zhang, J. Yan, and Q. Mei. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077, 2015.
 [52] A. Terada and J. Sese. Global alignment of protein-protein interaction networks for analyzing evolutionary changes of network frameworks. In Proceedings of the 4th International Conference on Bioinformatics and Computational Biology, pages 196–201, 2012.
 [53] S. Umeyama. An eigendecomposition approach to weighted graph matching problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(5):695–703, 1988.
 [54] Y. Wang and G. Wong. Stochastic blockmodels for directed graphs. Journal of the American Statistical Association, 82(397):8–19, 1987.
 [55] C. Xing, D. Wang, C. Liu, and Y. Lin. Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1006–1011, 2015.
 [56] M. Yamada and M. Sugiyama. Cross-domain object matching with model selection. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 807–815, 2011.
 [57] Y. Yoshikawa, T. Iwata, H. Sawada, and T. Yamada. Cross-domain matching for bag-of-words data via kernel embeddings of latent distributions. In Advances in Neural Information Processing Systems, pages 1405–1413, 2015.
 [58] W. Zaremba, A. Gretton, and M. Blaschko. B-test: A non-parametric, low variance kernel two-sample test. In Advances in Neural Information Processing Systems, pages 755–763, 2013.
 [59] M. Zhang, Y. Liu, H. Luan, and M. Sun. Adversarial training for unsupervised bilingual lexicon induction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1959–1970, 2017.
 [60] M. Zhang, Y. Liu, H. Luan, and M. Sun. Earth mover’s distance minimization for unsupervised bilingual lexicon induction. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1934–1945, 2017.
 [61] F. Zhou and F. De la Torre. Factorized graph matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 127–134. IEEE, 2012.