A Benchmarking Study of Embedding-based Entity Alignment for Knowledge Graphs

03/10/2020 · by Zequn Sun et al., Nanjing University and The University of Texas at Arlington

Entity alignment seeks to find entities in different knowledge graphs (KGs) that refer to the same real-world object. Recent advances in KG embedding have impelled the advent of embedding-based entity alignment, which encodes entities in a continuous embedding space and measures entity similarities based on the learned embeddings. In this paper, we conduct a comprehensive experimental study of this emerging field. This study surveys 23 recent embedding-based entity alignment approaches and categorizes them based on their techniques and characteristics. We further observe that current approaches use different datasets in evaluation, and the degree distributions of entities in these datasets are inconsistent with real KGs. Hence, we propose a new KG sampling algorithm, with which we generate a set of dedicated benchmark datasets with various heterogeneity and distributions for a realistic evaluation. This study also produces an open-source library, which includes 12 representative embedding-based entity alignment approaches. We extensively evaluate these approaches on the generated datasets, to understand their strengths and limitations. Additionally, for several directions that have not been explored in current approaches, we perform exploratory experiments and report our preliminary findings for future studies. The benchmark datasets, open-source library and experimental results are all accessible online and will be duly maintained.




1 Introduction

Knowledge graphs (KGs) store real-world facts in a structured and machine-readable way. Each fact is organized in the form of a triple like (subject entity, relation, object entity) or (subject entity, attribute, literal value). The structured knowledge supports a variety of applications, e.g., semantic search, question answering and recommender systems [20]. To promote knowledge transfer and fusion, researchers have put persistent effort into resolving the task of entity alignment (a.k.a. entity matching or entity resolution). The goal is to identify entities from different KGs that refer to the same real-world object, e.g., Mount_Everest (http://dbpedia.org/resource/Mount_Everest) in DBpedia [43] and Q513 (http://www.wikidata.org/entity/Q513) in Wikidata [86]. Conventional approaches tackle this task by exploiting a wide range of discriminative features of entities, e.g., names, descriptive annotations, and relational structures [16, 32, 34, 42, 78], as well as involving human-in-the-loop [33, 105]. The major challenge lies in the symbolic and schematic heterogeneity between independently-created KGs, such as multilingualism and different schemata [18].

Recent years have witnessed the rapid development of KG embedding techniques. They embed the symbolic representations of a KG as low-dimensional vectors, such that the semantic relatedness of entities can be captured by the geometrical structures of the embedding space. KG embeddings have been successfully applied to many tasks such as link prediction [5] and relation extraction [91]. The merit lies in the capabilities of resolving the aforementioned heterogeneity and simplifying knowledge reasoning [36, 88]. Motivated by such success, a new research field, called embedding-based entity alignment, has emerged [10] and attracted massive attention recently [8, 9, 25, 29, 66, 80, 81, 84, 89, 92, 102].

We summarize a typical framework for embedding-based entity alignment in Figure 1. It takes as input two different KGs and collects the seed alignment between them using, for instance, the owl:sameAs links [10]. Then, the two KGs and the seed alignment are fed into the embedding and alignment modules, respectively, to capture the correspondence of entity embeddings. There are two typical combination paradigms for module interaction: (i) the embedding module encodes the two KGs in two independent embedding spaces, while the alignment module uses the seed alignment to learn a mapping between them [9, 10, 66, 67]; or (ii) the alignment module guides the embedding module to represent the two KGs in one unified space by forcing the aligned entities in the seed alignment to hold very similar embeddings [8, 47, 80, 81, 84, 89, 102]. Finally, entity similarities are measured by the learned embeddings. We can predict the counterpart of a source entity through nearest-neighbor search among the target entity embeddings using a distance metric such as the Euclidean distance. Besides, to overcome the shortage of seed entity alignment, several approaches [9, 81, 102] deploy semi-supervised learning to iteratively benefit from the new entity alignment found during the training process.
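The prediction step of this framework, nearest-neighbor search over the learned embeddings, can be sketched as follows. This is a toy illustration with hand-set vectors; the entity names and function names are ours, not from any specific approach:

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def predict_alignment(source_emb, target_emb, dist=euclidean):
    """For each source entity, return its nearest target entity
    under the given distance metric (greedy nearest-neighbor search)."""
    return {s: min(target_emb, key=lambda t: dist(s_vec, target_emb[t]))
            for s, s_vec in source_emb.items()}

# Toy unified embedding space: aligned entities end up close together.
source_emb = {"dbp:Mount_Everest": [0.9, 0.1], "dbp:K2": [0.1, 0.9]}
target_emb = {"wd:Q513": [0.85, 0.15], "wd:Q43512": [0.05, 0.95]}
alignment = predict_alignment(source_emb, target_emb)
```

In a real system the embeddings come from the trained embedding module; only the search step is shown here.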

Figure 1: Framework of embedding-based entity alignment

However, as an emerging research topic, there are still some issues with analyzing and evaluating embedding-based entity alignment. First, as far as we know, no prior work has summarized the status quo of this field yet. The latest development of embedding-based entity alignment, as well as its advantages and weaknesses, still remains to be explored. We do not even know how the embedding-based approaches compare to conventional entity alignment approaches. Second, there are also no widely-acknowledged benchmark datasets for a realistic evaluation of embedding-based entity alignment. Arguably, the most popular datasets are DBP15K (used by [8, 45, 77, 80, 81, 89, 92, 93, 94, 96, 103]) and WK3L (used by [10, 47, 66, 67]). The use of different datasets for evaluation makes it difficult to obtain a fair and comprehensive comparison of embedding-based entity alignment approaches. Moreover, we observe that current datasets contain many more high-degree entities (i.e., entities connected with many other entities, which are relatively easy to align) than real-world KGs do. As a result, many approaches can exhibit good performance on these biased datasets. However, such evaluation may not reflect their performance in real-world entity alignment scenarios. Additionally, these datasets only focus on one aspect of heterogeneity, e.g., multilingualism, while overlooking other aspects, e.g., different schemata and scales. This brings difficulties in understanding the generalization and robustness of embedding-based entity alignment. Third, we also find that only a portion of the studies in this field come with source code, which makes it difficult to conduct further research on top of these approaches. Due to these issues, there is a pressing need to conduct a comprehensive and realistic re-evaluation of embedding-based entity alignment approaches with in-depth analyses.

In this paper, we carry out a systematic experimental study of embedding-based entity alignment with an open-source library. Our main contributions are listed as follows:

  • A comprehensive survey. We survey 23 recent approaches for embedding-based entity alignment and categorize their core techniques and characteristics from different aspects. We also review the popular choices for each technical module, providing a brief overview of this field. (Sect. 2)

  • Benchmark datasets. To make a fair and realistic comparison, we construct a set of dedicated benchmark datasets with splits of five folds by sampling real-world KGs DBpedia [43], Wikidata [86] and YAGO [70], in consideration of various aspects of heterogeneity regarding entity degrees, multilingualism, schemata and scales. Particularly, we propose an iterative degree-based sampling algorithm, which can make the degree distribution of entities in a sample approximate its source KG. (Sect. 3)

  • Open-source library. We develop an open-source library, OpenEA (https://github.com/nju-websoft/OpenEA), using Python and TensorFlow. This library integrates 12 representative embedding-based entity alignment approaches belonging to a wide range of technologies. It uses a flexible architecture that makes it easy to integrate a large number of existing KG embedding models (8 representative ones have been implemented) for entity alignment. The library will be duly updated as new approaches emerge, to benefit future research. (Sect. 4)


  • Comprehensive comparison and analysis. We provide a comprehensive comparison of 12 representative embedding-based entity alignment approaches in terms of both effectiveness and efficiency on our datasets. We train and tune each approach from scratch using our open-source library to ensure a fair evaluation. These results offer an overview of the performance of embedding-based entity alignment. To gain insights into the strengths and limitations of each approach, we conduct extensive analysis on their performance from different aspects. (Sect. 5)

  • Exploratory experiments. We carry out three experiments beyond what has been available in the literature. We give the first analysis of the geometric properties of entity embeddings to understand their underlying connections with the final performance. We notice that many KG embedding models have not been exploited for entity alignment, and we explore 8 popular ones among them. We also compare embedding-based approaches with several conventional approaches, to explore their complementarity. (Sect. 6)

  • Future research directions. Based on our survey and experimental findings, we provide a thorough outlook on several promising research directions to facilitate future work, including unsupervised entity alignment, long-tail entity alignment, large-scale entity alignment, entity alignment in non-Euclidean embedding space and the interpretability of embedding-based entity alignment. (Sect. 7)

To the best of our knowledge, this work is the first systematic and comprehensive experimental study on embedding-based entity alignment between KGs. Our experiments reveal the true performance as well as the advantages and shortcomings of current approaches in the realistic entity alignment scenario. The shortcomings that we find, such as the incapacity of relation-based approaches in handling long-tail entities and the poor effectiveness of attribute-based approaches in resolving the heterogeneity of attribute values, call for the reinvestigation of truly effective approaches for real-world entity alignment. We also believe that our in-depth analysis of the geometric properties of entity embeddings opens a new direction to investigate what enables the alignment-oriented embeddings and what supports the entity alignment performance behind the increasingly powerful approaches. Our benchmark datasets, library and experimental results are all publicly available through the GitHub repository under the GPL license, to foster reproducible research. We think that the datasets and library will become a valuable and fundamental resource for future studies. As a growing number of knowledge-driven applications build their capacities on KGs and benefit from KG fusion [36], this work can lead to profound impacts. We also notice some work applying embedding techniques for entity resolution in databases [15, 60]. We believe that our work can foster the collaboration and exchange of research ideas between entity alignment in KGs and entity resolution in databases.

2 Preliminaries

We consider the entity alignment task between two KGs $\mathcal{G}_1$ and $\mathcal{G}_2$. Let $\mathcal{E}_1$ and $\mathcal{E}_2$ denote their entity sets, respectively. The goal is to find the 1-to-1 alignment of entities $\mathcal{A} = \{(e_1, e_2) \in \mathcal{E}_1 \times \mathcal{E}_2 \mid e_1 \equiv e_2\}$ [42, 98], where $\equiv$ denotes an equivalence relation. In many cases, a small subset of the alignment $\mathcal{A}$, called seed alignment, is known beforehand and used as training data.

2.1 Literature Review

2.1.1 Knowledge Graph Embedding

Approaches. Existing KG embedding models can be generally divided into three categories: (i) translational models, e.g., TransE [5], TransH [90], TransR [52] and TransD [35]; (ii) semantic matching models, e.g., RESCAL [55], DistMult [95], ComplEx [83], HolE [63], Analogy [53], SimplE [39], RotatE [79] and TuckER [3]; and (iii) deep models, e.g., ProjE [75], ConvE [14], ConvKB [61], R-GCN [72] and KBGAN [7]. These models have been generally used for link prediction. We refer the interested readers to recent surveys [36, 51, 88] for more details about their techniques and applications. A related research area is network embedding [26], which learns vertex representations to capture their proximity in the network. However, the edges in networks usually carry simplex semantics, such as “friendship” in a social network. This differentiates network embedding from KG embedding in both data models and representation learning techniques.

Datasets & evaluation metrics.

FB15K and WN18 are two benchmark datasets for link prediction in KGs [5]. Recent studies notice that FB15K and WN18 suffer from the test leakage problem and build two new benchmark datasets, FB15K-237 [82] and WN18RR [14], correspondingly. Three metrics are widely used in evaluation: (i) the proportion of correct links in the top-$k$ ranked results (called Hits@$k$, e.g., Hits@10), (ii) the mean rank (MR) of correct links, and (iii) the mean reciprocal rank (MRR). Three efforts in evaluating link prediction models have been reported in [1, 58, 71].
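The three metrics are simple functions of the rank of each correct answer; a minimal sketch (the `ranks` values are hypothetical data, not from any benchmark):

```python
def hits_at_k(ranks, k):
    """Proportion of test cases whose correct answer is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mean_rank(ranks):
    """Average rank of the correct answers (lower is better)."""
    return sum(ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank (higher is better)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Hypothetical ranks of the correct link for three test queries.
ranks = [1, 2, 4]
```

For example, with these ranks Hits@1 is 1/3, MR is 7/3 and MRR is (1 + 1/2 + 1/4)/3.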

2.1.2 Conventional Entity Alignment

Approaches. Conventional approaches address entity alignment mainly from two angles. One is based on equivalence reasoning mandated by OWL semantics [23, 37]. The other is based on similarity computation, which compares the symbolic features of entities [12, 42, 74, 78]. Recent studies also use statistical machine learning [16, 32, 34] and crowdsourcing [33, 105] to improve accuracy. Also, in the database area, detecting duplicate entities, a.k.a. record linkage or entity resolution, has been extensively studied [17, 21]. Due to the paper scope, we do not discuss this area in detail.

Datasets & evaluation metrics. Since 2004, OAEI (Ontology Alignment Evaluation Initiative, http://oaei.ontologymatching.org/) has been the primary venue for work on ontology alignment. It has also organized an evaluation track for entity alignment in recent years. We have not observed any embedding-based systems participating in this track. The preferred evaluation metrics are precision, recall and F1-score.

2.1.3 Embedding-based Entity Alignment

Approaches. Many existing approaches [10, 50, 66, 67, 80, 81, 84, 102] employ translational models (e.g., TransE [5]) to learn entity embeddings for alignment based on relation triples. Some recent approaches [8, 45, 89, 92, 93, 94, 97, 103] employ graph convolutional networks (GCNs) [41, 85]. Besides, some approaches incorporate attribute and value embeddings [9, 29, 80, 84, 92, 93, 96, 100]. We elaborate the techniques of these approaches in Sect. 2.2. There are also some approaches for (heterogeneous information) network alignment [30, 47, 54, 99] or cross-lingual knowledge projection [65], which may be modified for entity alignment as well. Due to space and focus constraints, we do not discuss or compare with them. Besides, it is worth noting that two studies [15, 60] design embedding-based approaches for entity resolution in databases. They represent the attribute values of entities based on word embeddings and compare entities using embedding distances. However, they assume that all entities follow the same schema or that the attribute alignment is a 1-to-1 mapping. As different KGs are often created with different schemata, it is hard to fulfill these requirements. Thus, they cannot be directly applied to entity alignment between KGs. Our work can fill this gap.

Figure 2: Degree distributions and average degrees of two popular datasets DBP15K [80] and WK3L [10] used in current approaches. Both are extracted from DBpedia [43], but their degree distributions are quite different from DBpedia and their average degrees are also much larger.

Datasets & evaluation metrics. To the best of our knowledge, there is no widely-acknowledged benchmark dataset for assessing embedding-based entity alignment approaches. Arguably, the most used datasets are DBP15K [80] and WK3L [10]. However, Figure 2 shows that their degree distributions and average degrees are significantly different from those of real-world KGs. This issue prevents us from a comprehensive and precise understanding of the strengths and limitations of the state of the art. Similar to link prediction, Hits@$k$, MR and MRR are mainly used as evaluation metrics, where Hits@1 should be emphasized, as it is equivalent to precision.

2.2 Categorization of Techniques

We hereby categorize 23 recent embedding-based entity alignment approaches by analyzing their differences in the embedding and alignment modules as well as the mode that they interact. Table 1 summarizes the characteristics of these approaches. For notations, we use capital calligraphic letters to denote sets and boldface letters for vectors and matrices.

2.2.1 Embedding Module

The embedding module seeks to encode a KG into a low-dimensional embedding space. Based on the types of triples that they use, we classify the KG embedding models into two types, i.e., relation embedding and attribute embedding. The former leverages relational learning techniques on KG structures; the latter exploits attribute triples of entities. For entity alignment, the two classes support the computation of relation and attribute similarities of entities, respectively.

Relation embedding is employed by all existing approaches. Below are three representative ways to realize it:

Triple-based embedding captures the local semantics of relation triples. Many KG embedding models fall into this category, which defines an energy function to measure the plausibility of triples. For example, TransE [5] interprets a relation as the translation from its head entity embedding to its tail. The energy function of a relation triple $(h, r, t)$ is

$f(h, r, t) = \|\mathbf{h} + \mathbf{r} - \mathbf{t}\|,$

where $\|\cdot\|$ denotes the $L_1$- or $L_2$-norm of vectors. TransE optimizes the marginal ranking loss to separate positive triples from negative samples by a pre-defined margin. Other choices of loss functions include the logistic loss [63, 83] and the limit-based loss [81, 101]. To generate negative samples, the uniform sampling method replaces either the head or tail entity of a real triple with a random entity. The truncated sampling method [81] constrains the sampling scope within the $\epsilon$-nearest neighbors of the entity to be replaced.
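The energy function, the marginal ranking loss and uniform negative sampling can be sketched in a few lines of plain Python (an illustration, not a trainable implementation):

```python
import math
import random

def transe_energy(h, r, t, norm=1):
    """TransE energy f(h, r, t) = ||h + r - t||; lower means more plausible."""
    diff = [hi + ri - ti for hi, ri, ti in zip(h, r, t)]
    if norm == 1:
        return sum(abs(d) for d in diff)
    return math.sqrt(sum(d * d for d in diff))

def margin_ranking_loss(pos_energy, neg_energy, margin=1.0):
    """Positive triples should score lower than negatives by at least the margin."""
    return max(0.0, margin + pos_energy - neg_energy)

def uniform_negative_sample(triple, entities, rng=random.Random(0)):
    """Uniform sampling: replace the head or tail with a random entity."""
    h, r, t = triple
    e = rng.choice(entities)
    return (e, r, t) if rng.random() < 0.5 else (h, r, e)
```

A perfect triple has zero energy, e.g. `transe_energy([1, 0], [0, 1], [1, 1])` is 0, so its ranking loss against any negative with energy above the margin vanishes.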

Path-based embedding exploits the long-term dependency of relations spanning over relation paths. A relation path is a set of nose-to-tail linked relation triples, e.g., $(e_1, r_1, e_2) \rightarrow (e_2, r_2, e_3)$. IPTransE [102] models relation paths by inferring the equivalence between a direct relation and a multi-hop path. Assume that there is a direct relation $r$ from $e_1$ to $e_3$. IPTransE expects the embedding of $r$ to be similar to the path embedding $\mathbf{p}$, which is encoded as a combination of its constituent relation embeddings:

$\mathbf{p} = \mathbf{r}_1 \circ \mathbf{r}_2,$

where $\circ$ is a sequence composition operation such as sum. $\|\mathbf{r} - \mathbf{p}\|$ is minimized to make them close to each other. However, IPTransE overlooks the entities along the path. Another work, RSN4EA [25], modifies RNNs (recurrent neural networks) to model the sequence of entities and relations together.

Neighborhood-based embedding uses the subgraph structure constituted by a large number of relations between entities. GCNs [6, 13, 41, 72] are well suited for modeling this structure, and have been used for embedding-based entity alignment recently [8, 45, 89, 92, 93, 94, 96]. A GCN consists of multiple graph convolutional layers. Let $\mathbf{A}$ denote the adjacency matrix of a KG and $\mathbf{H}^{(l)}$ be a feature matrix where each row corresponds to an entity. The typical propagation rule from the $l$-th layer to the $(l+1)$-th layer [41] is

$\mathbf{H}^{(l+1)} = \sigma\big(\hat{\mathbf{D}}^{-\frac{1}{2}} \hat{\mathbf{A}} \hat{\mathbf{D}}^{-\frac{1}{2}} \mathbf{H}^{(l)} \mathbf{W}^{(l)}\big),$

with $\hat{\mathbf{A}} = \mathbf{A} + \mathbf{I}$, where $\mathbf{I}$ is an identity matrix, $\hat{\mathbf{D}}$ is the diagonal degree matrix of $\hat{\mathbf{A}}$, $\mathbf{W}^{(l)}$ is the layer-specific weight matrix, and $\sigma$ is the activation function such as tanh.
Approach        Relation   Attribute   Emb. distance   Combination      Learning
MTransE [10]    Triple     -           Euclidean       Transformation   Supervised
IPTransE [102]  Path       -           Euclidean       Sharing          Semi-supervised
JAPE [80]       Triple     Attr.       Cosine          Sharing          Supervised
BootEA [81]     Triple     -           Cosine          Swapping         Semi-supervised
KDCoE [9]       Triple     Literal     Euclidean       Transformation   Semi-supervised
NTAM [47]       Triple     -           Cosine          Swapping         Supervised
GCNAlign [89]   Neighbor   Attr.       Manhattan       Calibration      Supervised
AttrE [84]      Triple     Literal     Cosine          Sharing          Supervised
IMUSE [29]      Triple     Literal     Cosine          Sharing          Supervised
SEA [66]        Triple     -           Cosine          Transformation   Supervised
RSN4EA [25]     Path       -           Cosine          Sharing          Supervised
GMNN [94]       Neighbor   Literal     Cosine          Swapping         Supervised
MuGNN [8]       Neighbor   -           Manhattan       Calibration      Supervised
OTEA [67]       Triple     -           Euclidean       Transformation   Supervised
NAEA [103]      Neighbor   -           Cosine          Swapping         Supervised
AVR-GCN [97]    Neighbor   -           Euclidean       Swapping         Supervised
MultiKE [100]   Triple     Literal     Cosine          Swapping         Supervised
RDGCN [92]      Neighbor   Literal     Manhattan       Calibration      Supervised
KECG [45]       Neighbor   -           Euclidean       Calibration      Supervised
HGCN [93]       Neighbor   Literal     Euclidean       Calibration      Supervised
MMEA [77]       Triple     -           Cosine          Sharing          Supervised
HMAN [96]       Neighbor   Literal     Euclidean       Calibration      Supervised
AKE [50]        Triple     -           Euclidean       Transformation   Supervised
Approaches are ordered by their publishing dates. "Relation" and "Attribute" belong to the embedding module, "Emb. distance" to the alignment module, and "Combination" and "Learning" to the interaction mode.

Table 1: Categorization of popular embedding-based entity alignment approaches published before December 2019

Attribute embedding is used by several approaches [9, 29, 80, 84, 89, 92, 94, 96, 100] to enhance the similarity measure of entities. There are two ways for attribute embedding:

Attribute correlation embedding considers the correlations among attributes. Attributes are regarded as correlated if they are frequently used together to describe an entity. For example, longitude is highly correlated with latitude as they often form a coordinate to describe a location. JAPE [80] exploits such correlations for entity alignment, based on the assumption that similar entities should have correlated attributes. For two attributes $a_1$ and $a_2$, the probability that they are correlated is defined as

$p(a_2 \mid a_1) = \frac{\exp(\mathbf{a}_1 \cdot \mathbf{a}_2)}{\sum_{a \in \mathcal{A}} \exp(\mathbf{a}_1 \cdot \mathbf{a})},$

where $\mathcal{A}$ denotes the attribute set, and the attribute embeddings can be learned by maximizing the probability over all correlated attribute pairs. Here, the attribute correlation embedding does not consider literal values.
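The correlation probability can be traced with toy attribute embeddings. The softmax-over-inner-products form below is a skip-gram-style sketch of the correlation model, not JAPE's actual code, and the embedding values are hand-set for illustration:

```python
import math

def correlation_prob(a1_vec, a2_name, attr_embs):
    """p(a2 | a1): softmax over inner products of attribute embeddings."""
    scores = {name: math.exp(sum(x * y for x, y in zip(a1_vec, vec)))
              for name, vec in attr_embs.items()}
    return scores[a2_name] / sum(scores.values())

# Toy embeddings: longitude and latitude co-occur often, birthDate does not.
attrs = {"longitude": [1.0, 0.0], "latitude": [0.9, 0.1], "birthDate": [-1.0, 0.5]}
p_lat = correlation_prob(attrs["longitude"], "latitude", attrs)
p_birth = correlation_prob(attrs["longitude"], "birthDate", attrs)
```

Training would push correlated pairs like longitude/latitude toward high probability, which is what the inner product captures here.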

Literal embedding introduces literal values into attribute embedding. AttrE [84] proposes a character-level encoder of literals that is capable of dealing with unseen values in the training phase. Let $v = (c_1, c_2, \dots, c_t)$ denote a literal with $t$ characters, where $c_i$ ($1 \le i \le t$) denotes the $i$-th character. AttrE represents $v$ as

$\mathbf{v} = f(\mathbf{c}_1, \mathbf{c}_2, \dots, \mathbf{c}_t),$

where $f$ is a compositional function over character embeddings, such as a sum-based or N-gram-based one.
With this representation, literals are treated as entities and the relation embedding models like TransE can be used to learn from attribute triples. However, the character-based literal embedding may fail in cross-lingual settings.
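The idea can be sketched as follows. The character vectors here are deterministic hashes standing in for learned character embeddings, and averaging is a stand-in for AttrE's compositional function $f$:

```python
import hashlib

def char_vector(ch, dim=8):
    """Deterministic pseudo-embedding for a character (an illustrative
    stand-in for a learned character embedding)."""
    digest = hashlib.md5(ch.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]

def literal_embedding(literal, dim=8):
    """Compose character vectors by averaging -- a simple instance of f.
    Because the encoder is character-level, unseen literals still get
    a representation."""
    if not literal:
        return [0.0] * dim
    vecs = [char_vector(c, dim) for c in literal]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
```

Identical literal values from two KGs map to identical vectors, which is what lets the relation embedding model treat literals like shared entities.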

2.2.2 Alignment Module

The alignment module uses seed alignment as labeled training data to capture the correspondence of entity embeddings. A common process is first measuring the distances between entity embeddings and then finding the entity pairs with a short distance as the alignment. Two keys are picking a distance metric and designing an alignment inference strategy.

Distance metrics. Cosine, Euclidean and Manhattan distances are three widely-used metrics. In high-dimensional spaces, a few vectors (called hubs [69]) may repeatedly occur as the $k$-nearest neighbors of others, the so-called hubness problem [11]. See Sect. 6.1 for more details.

Alignment inference strategies. Greedy search is used by all current approaches. Given $\mathcal{E}_1$ and $\mathcal{E}_2$ to be aligned and a distance metric $\pi$, for each entity $e_1 \in \mathcal{E}_1$, it finds the aligned entity $\hat{e}_2 = \arg\min_{e_2 \in \mathcal{E}_2} \pi(e_1, e_2)$. Differently, collective search [40, 57] aims to find a globally optimal alignment that minimizes $\sum \pi(e_1, e_2)$. It can be modeled as the maximum weight matching problem in a bipartite graph and solved in $O(n^3)$ time (with $n$ the number of entities) using the Kuhn–Munkres algorithm, or reduced to linear time using the heuristic algorithm [31]. Another strategy is the stable marriage problem [56]. The alignment between $\mathcal{E}_1$ and $\mathcal{E}_2$ satisfies a stable marriage if there does not exist a pair of entities that both prefer each other over their current aligned ones. Its solution takes $O(n^2)$ time [19].
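The difference between greedy search and the stable-marriage strategy shows up on a toy distance matrix: greedy search lets two sources collide on the same target, while a Gale–Shapley-style matching resolves the conflict (function names and data are ours, for illustration):

```python
def greedy_alignment(dist):
    """Independent nearest-neighbor search: each source picks its closest
    target; different sources may collide on the same target."""
    return {s: min(dist[s], key=dist[s].get) for s in dist}

def stable_alignment(dist):
    """Gale-Shapley stable marriage on the distance matrix: no unmatched
    pair prefers each other over their current partners."""
    free = list(dist)                                       # unmatched sources
    prefs = {s: sorted(dist[s], key=dist[s].get) for s in dist}
    next_pick = {s: 0 for s in dist}
    engaged = {}                                            # target -> source
    while free:
        s = free.pop()
        t = prefs[s][next_pick[s]]
        next_pick[s] += 1
        if t not in engaged:
            engaged[t] = s
        elif dist[s][t] < dist[engaged[t]][t]:              # t prefers s
            free.append(engaged[t])
            engaged[t] = s
        else:
            free.append(s)
    return {s: t for t, s in engaged.items()}

# Both sources are closest to t1; stable matching resolves the collision.
dist = {"s1": {"t1": 0.1, "t2": 0.8}, "s2": {"t1": 0.2, "t2": 0.3}}
```

Here greedy search aligns both s1 and s2 to t1, whereas the stable matching gives t1 to s1 (the closer source) and t2 to s2.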

2.2.3 Interaction Mode

Combination modes. Four typical designs to reconcile KG embeddings for entity alignment are as follows: Embedding space transformation embeds two KGs in different embedding spaces and learns a transformation matrix $\mathbf{M}$ between the two spaces using seed alignment, to achieve $\mathbf{M}\mathbf{e}_1 \approx \mathbf{e}_2$ for each $(e_1, e_2)$ in the seed alignment. The other combination modes encode two KGs into a unified embedding space. Embedding space calibration minimizes $\|\mathbf{e}_1 - \mathbf{e}_2\|$ for each $(e_1, e_2)$ to calibrate the embeddings of seed alignment. As two special cases, parameter sharing directly configures $\mathbf{e}_1 = \mathbf{e}_2$, and parameter swapping swaps seed entities in their triples to generate extra triples as supervision. For instance, given $(e_1, e_2)$ and a relation triple $(e_1, r, e)$ of $e_1$, parameter swapping produces a new triple $(e_2, r, e)$ and feeds it into KG embedding models as a real triple. Neither parameter sharing nor swapping introduces new loss functions, but the latter produces more triples.
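Parameter swapping is simple enough to sketch directly (a minimal illustration; the triple and seed data are toy values):

```python
def swap_triples(triples, seed_alignment):
    """Parameter swapping: for each seed pair (e1, e2), rewrite triples that
    mention e1 with e2 (and vice versa) to create extra supervision triples."""
    swap = {}
    for e1, e2 in seed_alignment:
        swap[e1], swap[e2] = e2, e1
    extra = []
    for h, r, t in triples:
        if h in swap:
            extra.append((swap[h], r, t))
        if t in swap:
            extra.append((h, r, swap[t]))
    return extra

triples = [("dbp:Everest", "locatedIn", "dbp:Himalayas")]
seed = [("dbp:Everest", "wd:Q513")]
extra = swap_triples(triples, seed)
```

The generated triple ("wd:Q513", "locatedIn", "dbp:Himalayas") is then fed to the embedding module as if it were a real triple.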

Learning strategies. Based on how to process labeled and unlabeled data, learning strategies can be divided below:

Supervised learning leverages the seed alignment as labeled training data. For embedding space transformation, the seed alignment is used to learn the transformation matrix. For embedding space calibration, it is used to let aligned entities have very similar embeddings. However, the acquisition of seed alignment is costly and error-prone, especially for cross-lingual KGs. As found in [10], the inter-language links in Wikipedia only cover a small fraction of entities.

Semi-supervised learning uses unlabeled data in training, e.g., self-training [81, 102] and co-training [9]. The former iteratively proposes new alignment to augment the seed alignment. The latter combines two models learned from disjoint entity features, and the two models alternately enhance each other's alignment learning. Although OTEA [67] and KECG [45] claim to be semi-supervised approaches, their learning strategies do not augment the seed alignment, so we do not treat them as standard semi-supervised learning in this paper.

Unsupervised learning needs no seed alignment. Several methods for cross-lingual word embedding study cross-lingual word alignment without parallel corpora [11, 28]. However, we have not observed any embedding-based entity alignment approaches using unsupervised learning. Although IMUSE [29] claims to be an unsupervised approach, it actually uses a preprocessing method to collect seed alignment with high string similarity on some picked attributes. Its embedding module still needs seed alignment.

3 Dataset Generation

As aforementioned, current widely-used datasets are quite different from real-world KGs. It is difficult for embedding-based approaches to run on the full data of real-world KGs because the candidate space can be very large, which would bring a significant computational burden to evaluating related approaches. Following the link prediction benchmark datasets [5, 14, 82], we also sample real-world KGs and provide two data scales (15K and 100K). In this section, we present a new dataset sampling algorithm for entity alignment, followed by the statistics of our generated datasets.

Input: source KGs $\mathcal{G}_1, \mathcal{G}_2$, reference alignment $\mathcal{A}$, entity size $n$, base step size $\mu$, JS-divergence threshold $\epsilon$
 1   Filter $\mathcal{G}_1, \mathcal{G}_2$ by $\mathcal{A}$;           // only retain entities in reference alignment
 2   Get degree distributions $P_1, P_2$ for $\mathcal{G}_1, \mathcal{G}_2$, resp.;
 3   do                                  // if it fails, run it again
 4          Initialize datasets $\mathcal{S}_1, \mathcal{S}_2$ from $\mathcal{G}_1, \mathcal{G}_2$, resp.;
 5          while $|\mathcal{S}_1| > n$ or $|\mathcal{S}_2| > n$ do
 6                 for $\mathcal{S} \in \{\mathcal{S}_1, \mathcal{S}_2\}$ do
 7                        Get the entity size to be deleted for each degree $d$;
 8                        Get entity deletion probabilities by PageRank;
 9                        Delete entities w.r.t. the probabilities;
10                 Filter $\mathcal{S}_1, \mathcal{S}_2$ by $\mathcal{A}$; update $\mathcal{A}$ accordingly;
11          Get degree distributions $Q_1, Q_2$ for $\mathcal{S}_1, \mathcal{S}_2$, resp.;
12   while $\mathrm{JS}(P_1, Q_1) > \epsilon$ or $\mathrm{JS}(P_2, Q_2) > \epsilon$;
13   return $\mathcal{S}_1, \mathcal{S}_2$;
Algorithm 1 Iterative degree-based sampling (IDS)
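One deletion round of the sampling idea can be sketched in plain Python. This is a simplification: it deletes the lowest-PageRank entities deterministically, whereas the full IDS algorithm also balances per-degree deletion sizes (Line 7) and deletes stochastically. Function names and data are ours:

```python
def pagerank(edges, nodes, damping=0.85, iters=30):
    """Plain power-iteration PageRank over an undirected entity graph."""
    nbrs = {n: [] for n in nodes}
    for a, b in edges:
        nbrs[a].append(b)
        nbrs[b].append(a)
    pr = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        pr = {n: (1 - damping) / len(nodes)
                 + damping * sum(pr[m] / len(nbrs[m]) for m in nbrs[n])
              for n in nodes}
    return pr

def ids_step(edges, nodes, n_delete):
    """One simplified IDS deletion round: remove the n_delete entities with
    the lowest PageRank, keeping high-influence entities intact."""
    pr = pagerank(edges, nodes)
    victims = set(sorted(nodes, key=pr.get)[:n_delete])
    kept = [n for n in nodes if n not in victims]
    kept_edges = [(a, b) for a, b in edges
                  if a not in victims and b not in victims]
    return kept_edges, kept

# Star-shaped toy KG: "c" is the high-influence hub.
nodes = ["c", "l1", "l2", "l3", "l4"]
edges = [("c", "l1"), ("c", "l2"), ("c", "l3"), ("c", "l4")]
kept_edges, kept = ids_step(edges, nodes, 2)
```

The hub survives the deletion round because its PageRank dominates, which is the intended bias of Line 8.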

15K (V1)
Datasets  KGs  #Rel.  #Att.  #Rel. triples  #Att. triples
EN-FR     EN    267    308      47,334        73,121
          FR    210    404      40,864        67,167
EN-DE     EN    215    286      47,676        83,755
          DE    131    194      50,419       156,150
D-W       DB    248    342      38,265        68,258
          WD    169    649      42,746       138,246
D-Y       DB    165    257      30,291        71,716
          YG     28     35      26,638       132,114

15K (V2)
Datasets  KGs  #Rel.  #Att.  #Rel. triples  #Att. triples
EN-FR     EN    193    189      96,318        66,899
          FR    166    221      80,112        68,779
EN-DE     EN    169    171      84,867        81,988
          DE     96    116      92,632       186,335
D-W       DB    167    175      73,983        66,813
          WD    121    457      83,365       175,686
D-Y       DB     72     90      68,063        65,100
          YG     21     20      60,970       131,151

100K (V1)
Datasets  KGs  #Rel.  #Att.  #Rel. triples  #Att. triples
EN-FR     EN    400    466     309,607       497,729
          FR    300    519     258,285       426,672
EN-DE     EN    381    451     335,359       552,750
          DE    196    252     336,240       716,615
D-W       DB    413    493     293,990       451,011
          WD    261    874     251,708       687,860
D-Y       DB    287    379     294,188       523,062
          YG     32     38     400,518       749,787

100K (V2)
Datasets  KGs  #Rel.  #Att.  #Rel. triples  #Att. triples
EN-FR     EN    379    364     649,902       503,922
          FR    287    468     561,391       431,379
EN-DE     EN    323    326     622,588       560,247
          DE    170    189     629,395       793,710
D-W       DB    318    328     616,457       467,103
          WD    239    760     588,203       878,219
D-Y       DB    230    277     576,547       547,026
          YG     31     36     865,265       855,161

Table 2: Dataset statistics

3.1 Iterative Degree-based Sampling

We consider five factors in building our datasets: source KGs, reference alignment, dataset sizes, languages and density, where the last is the most challenging to control. Specifically, we want to generate a certain-sized dataset from a source KG such that the difference between their entity degree distributions does not exceed an expectation. The difficulty lies in that the removal of an entity from the source KG also changes the degrees of its neighboring entities.

In this paper, we propose an iterative degree-based sampling (IDS) algorithm, which simultaneously deletes entities in two source KGs with reference alignment until achieving the desired size, meanwhile retaining a similar degree distribution of the sampled dataset as the source KG. Algorithm 1 describes the sampling procedure. During the iterations, the proportion of entities having degree $d$ in the current dataset, denoted by $q_d$, cannot always equal the original proportion $p_d$. This forces us to adjust the number of entities to be deleted for each degree in proportion to $q_d / p_d$, scaled by the base step size $\mu$ (see Line 7). Furthermore, we prefer not to delete entities having a big influence on the overall degree distribution, such as the ones of high degree. To achieve this, we leverage the PageRank value for measuring the probability of an entity to be deleted (Line 8). For entities with the same PageRank value, we randomly pick some of them to delete. Note that sampling on large graphs has been used in plenty of areas. We refer the interested readers to [44] for more information.

We use the Jensen–Shannon (JS) divergence [49] to assess the difference of two degree distributions (Line 12). Given two degree distributions $P$ and $Q$, their JS-divergence is

$\mathrm{JS}(P \,\|\, Q) = \frac{1}{2} \sum_d p_d \log \frac{p_d}{m_d} + \frac{1}{2} \sum_d q_d \log \frac{q_d}{m_d},$

where $p_d$ and $q_d$ denote the proportions of entities with degree $d$ ($d = 1, 2, \dots$) in $P$ and $Q$, respectively, and $m_d = \frac{1}{2}(p_d + q_d)$. A small JS-divergence between $P$ and $Q$ reveals that they have similar degree distributions. We set our expectation $\epsilon$ to 5% in the dataset generation.
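The JS-divergence check can be sketched as follows, with degree distributions given as dicts mapping each degree to the proportion of entities (the example distributions are hypothetical):

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so bounded by 1) between two
    degree distributions given as {degree: proportion} dicts."""
    degrees = set(p) | set(q)
    m = {d: 0.5 * (p.get(d, 0.0) + q.get(d, 0.0)) for d in degrees}
    def kl(x):
        # KL(x || m), skipping zero-probability degrees.
        return sum(x.get(d, 0.0) * math.log2(x.get(d, 0.0) / m[d])
                   for d in degrees if x.get(d, 0.0) > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

source = {1: 0.5, 2: 0.3, 3: 0.2}    # degree distribution of the source KG
sample = {1: 0.48, 2: 0.32, 3: 0.2}  # a candidate sample
```

With the 5% expectation above, the candidate `sample` would be accepted, since its divergence from `source` is far below 0.05.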

3.2 Dataset Overview

We choose three well-known KGs as our sources: DBpedia (2016-10) [43], Wikidata (20160801) [86] and YAGO 3 [70]. Also, we consider two cross-lingual versions of DBpedia: English–French and English–German. We follow the conventions in [10, 80, 81, 89, 102] to generate datasets of two sizes with 15K and 100K entities, using the IDS algorithm. Specifically, we make use of DBpedia's inter-language links and the owl:sameAs links among the three KGs to retrieve reference entity alignment. To balance efficiency and deletion safety, we use different base step sizes $\mu$ for the 15K and 100K datasets.

The statistics of the datasets are listed in Table 2. We generate two versions of datasets for each pair of source KGs. V1 is obtained by directly using the IDS algorithm. For V2, we first randomly delete entities with low degrees in the source KG to make the average degree doubled, and then execute IDS to fit the new KG. As a result, V2 is twice as dense as V1 and more similar to existing datasets [10, 80]. Figure 3 shows the degree distributions and average degrees of EN-FR-15K (V1, V2). Compared with Figure 2, we see that our datasets are much closer to the source KGs.

For each dataset, we also extract the attribute triples of entities to fulfill the input requirement of some approaches [9, 29, 80, 84, 89, 92, 94, 100]. Considering that DBpedia, Wikidata and YAGO collect data from very similar sources (mainly Wikipedia), the aligned entities usually have identical labels, which would become "tricky" features for entity alignment and bias the evaluation of the real performance. Following the suggestion in [104], we delete entity labels.

By convention, we split a dataset into training, validation and test sets. The details are given in Table 3 and Sect. 5.1.

Figure 3: Degree distributions and average degrees of EN-FR-15K (V1, V2)
#Ref. alignment #Training #Validation #Test
  15K   3,000   1,500 10,500
100K 20,000 10,000 70,000
Table 3: Dataset split for experiments

4 Open-source Library

We use Python and TensorFlow to develop an open-source library, namely OpenEA, for embedding-based entity alignment. The software architecture is illustrated in Figure 4, which follows the framework shown in Figure 1 and implements the workable techniques discussed in Sect. 2.2. The design goals and features of OpenEA include three aspects:

Loose coupling. The implementations of the embedding and alignment modules are independent of each other. OpenEA provides a framework template with pre-defined input and output data structures to assemble these modules into an integral pipeline. Users can freely call and combine different techniques in these modules to develop new approaches.

Functionality and extensibility. OpenEA implements a set of necessary functions as its underlying components, including initialization functions, loss functions and negative sampling methods in the embedding module; combination and learning strategies in the interaction mode; as well as distance metrics and alignment inference strategies in the alignment module. On top of those, OpenEA also provides a set of flexible and high-level functions with configuration options to call these components. In this way, new functions can be easily integrated by adding new configuration options.

Figure 4: Software architecture of OpenEA

Off-the-shelf approaches. To facilitate the usage of OpenEA and support our experimental study, we try our best to integrate or rebuild 12 representative embedding-based entity alignment approaches covering a wide range of techniques, including MTransE [10], IPTransE [102], JAPE [80], KDCoE [9], BootEA [81], GCNAlign [89], AttrE [84], IMUSE [29], SEA [66], RSN4EA [25], MultiKE [100] and RDGCN [92]. MTransE, JAPE, KDCoE, BootEA, GCNAlign, AttrE, RSN4EA, MultiKE and RDGCN are implemented by integrating their source code with our functions, while IPTransE, IMUSE and SEA are rebuilt by ourselves. Moreover, we integrate several relation embedding models that have not been explored for entity alignment yet, including three translational models TransH [90], TransR [52] and TransD [35]; three semantic matching models HolE [63], SimplE [39] and RotatE [79]; as well as two deep models ProjE [75] and ConvE [14]. We also integrate two attribute embedding models AC2Vec [80] and Label2Vec [100], based on the pre-trained multilingual word embeddings fastText [4]. TransH, TransR, TransD and HolE are developed by referring to the open-source toolkit OpenKE [27]; the remaining ones are implemented based on their source code.
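To illustrate the loose-coupling idea behind the library, the following sketch wires a toy embedding module and a toy alignment module into one pipeline. All class and function names here are hypothetical, not OpenEA's actual API, and the one-dimensional "embeddings" are purely illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]

@dataclass
class AlignmentPipeline:
    # Hypothetical composition template: any embedding function can be
    # combined with any alignment-inference function.
    embed: Callable[[List[Triple]], Dict[str, list]]
    align: Callable[[Dict[str, list], Dict[str, list]], List[Tuple[str, str]]]

    def run(self, kg1: List[Triple], kg2: List[Triple]) -> List[Tuple[str, str]]:
        return self.align(self.embed(kg1), self.embed(kg2))

def toy_embed(triples):
    # Toy "embedding": an entity's 1-D vector is its relation-triple count.
    count = {}
    for h, _, t in triples:
        for e in (h, t):
            count[e] = count.get(e, 0) + 1
    return {e: [c] for e, c in count.items()}

def greedy_align(emb1, emb2):
    # Greedy 1-to-1 nearest-neighbor matching on the toy embeddings.
    pairs, used = [], set()
    for e1, v1 in emb1.items():
        cands = [e2 for e2 in emb2 if e2 not in used]
        if cands:
            best = min(cands, key=lambda e2: abs(emb2[e2][0] - v1[0]))
            used.add(best)
            pairs.append((e1, best))
    return pairs
```

Swapping `toy_embed` for a TransE-style model or `greedy_align` for a stable-matching strategy requires no change to the pipeline itself, which is the design goal described above.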

5 Experiments and Results

In this section, we report a comprehensive evaluation of current embedding-based entity alignment approaches using our benchmark datasets and open-source library.

5.1 Experiment Settings

Environment. We carry out the experiments on several workstations, each of which is configured with an Intel Xeon E3 3.3GHz CPU, 128GB memory, an NVIDIA GeForce GTX 1080Ti GPU and Ubuntu 16.04 OS.

Cross-validation. In addition to splitting the datasets into training, validation and test sets, we conduct the experiments with 5-fold cross-validation to ensure an unbiased evaluation. Specifically, we divide the reference entity alignment into five disjoint folds, each accounting for 20% of the total. For each run, we pick one fold (20%) as training data and divide the remainder into validation (10%) and test (70%) data. As found in [10], the inter-language links in the multilingual Wikipedia cover about 15% of entity alignment. Thus, using 20% of the reference alignment as training data both satisfies the need for 5-fold cross-validation and conforms to the actual situation of entity alignment in the real world.
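The splitting scheme above can be sketched as follows (a minimal illustration; the actual fold generation in our library may differ in details such as shuffling):

```python
import random

def five_fold_splits(alignment, seed=42):
    # Yield (train, valid, test) splits: each fold (20%) serves once as
    # training data; the rest is divided into 10% validation and 70% test.
    pairs = list(alignment)
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    fold = n // 5
    for i in range(5):
        train = pairs[i * fold:(i + 1) * fold]
        rest = pairs[:i * fold] + pairs[(i + 1) * fold:]
        n_valid = n // 10
        yield train, rest[:n_valid], rest[n_valid:]
```

Because the folds are disjoint slices of one shuffled list, every reference pair is used for training exactly once across the five runs.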

Comparative approaches and settings. We evaluate all the embedding-based entity alignment approaches implemented in OpenEA. To make a fair comparison, we do our best to unify the experiment settings. Table 4 shows the common hyper-parameters used for all the approaches. As indicated in [38], the batch size influences both performance and running time, so we use a fixed batch size for relation triples to avoid its interference. For other settings specific to each approach, we follow the details reported in the literature as carefully as we can. For the key hyper-parameters and the unreported ones, we try our best to tune them. For example, we constrain the norm of entity embeddings to 1 for many approaches, e.g., IMUSE, because we find that such normalization yields better results. For cross-lingual datasets, we use pre-trained cross-lingual word embeddings [4] to initialize literal embeddings for the approaches using attribute values. All hyper-parameter settings of each approach on our datasets are available online.

                                15K        100K
Batch size for rel. triples     5,000      20,000
Termination condition           Early stop when the Hits@1 score begins to drop
                                on the validation sets, checked every 10 epochs
Max. epochs                     2,000

Table 4: Common hyper-parameters for all the approaches

Evaluation metrics. In our experiments, the default alignment direction is from left to right. Take D-W as an example: we treat DBpedia as the source KG and align it to the target KG Wikidata. Following the conventions, we use Hits@k (k = 1, 5), mean rank (MR) and mean reciprocal rank (MRR) as the evaluation metrics.
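Given the 1-based rank of each source entity's true counterpart in its candidate list, these metrics can be computed as follows (a minimal sketch):

```python
def alignment_metrics(ranks, hits_at=(1, 5)):
    # ranks: 1-based rank of the true counterpart for each source entity.
    n = len(ranks)
    hits = {k: sum(r <= k for r in ranks) / n for k in hits_at}  # Hits@k
    mr = sum(ranks) / n                                          # mean rank
    mrr = sum(1.0 / r for r in ranks) / n                        # mean reciprocal rank
    return hits, mr, mrr
```

Hits@k and MRR are higher-is-better (at most 1), while MR is lower-is-better (at least 1).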

Availability. We release the datasets and the OpenEA library online. The experimental results on the five folds of each dataset using all the metrics are provided in CSV format. All of them will be duly updated as new approaches emerge.

5.2 Main Results and Analysis

Table 5 depicts the Hits@1, Hits@5 and MRR results of the 12 implemented approaches on the generated datasets. Based on the overall performance, we observe that RDGCN, BootEA and MultiKE achieve the top-3 results. This is consistent with our intuition that more recent methods usually outperform early ones. For a comprehensive and thorough understanding, we analyze the results from four angles:

[The numeric results of Table 5 are omitted here. The table reports Hits@1, Hits@5 and MRR of MTransE, IPTransE, JAPE, KDCoE, BootEA, GCNAlign, AttrE, IMUSE, SEA, RSN4EA, MultiKE and RDGCN on EN-FR, EN-DE, D-W and D-Y, each with 15K/100K sizes and V1/V2 versions.] Top-3 results on each dataset are marked in red, blue and cyan, respectively. The same applies to the following tables.

Table 5: Cross-validation results of current representative approaches on the 15K and 100K datasets

Sparse datasets (V1) vs. dense datasets (V2). From Table 5, we find that most relation-based approaches perform better on the dense datasets than on the sparse ones, e.g., IPTransE, BootEA, SEA and RSN4EA. This accords with our intuition that entities in dense datasets are generally involved in more relation triples, which enables these approaches to capture more semantic information. Among the approaches considering attribute triples, KDCoE, GCNAlign, AttrE, IMUSE and RDGCN also perform better on the dense datasets, indicating that the relation embeddings still make contributions. Differently, MultiKE relies on multiple "views" of features, which makes it relatively insensitive to the relation changes. Interestingly, we also see that the performance of two relation-based approaches, MTransE and JAPE, drops on some dense datasets. We believe that this is because they are based on TransE, which has difficulty in modeling multi-mapping relations. The complex structures make them prone to learning very similar embeddings for different entities involved in the same multi-mapping relation [52, 90].

For further analysis, we divide the test alignment of each dataset into multiple groups in terms of alignment degrees. The degree of an alignment is defined as the sum of relation triples of the two involved entities. Figure 5 illustrates the recall results on EN-FR-15K (V1). Obviously, most entities have relatively few relation triples; we call them long-tail entities. We find that all the relation-based approaches perform better when aligning entities with rich relation triples, while their results decline on long-tail entities. This lopsided performance confirms the results on the sparse and dense datasets from another angle. By using additional literals, the lopsided performance of KDCoE, AttrE, IMUSE, MultiKE and RDGCN is alleviated. However, JAPE and GCNAlign, which use attribute correlations, still show lopsided performance for entities with different degrees. The results on other datasets also agree with the above observations. So far, we have not seen any specific solutions for long-tail entities.

Figure 5: Recall w.r.t. alignment degrees on EN-FR-15K (V1)
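The per-degree grouping used in this analysis can be sketched as below; the bucket boundaries here are illustrative, not the exact ones used in Figure 5.

```python
from collections import defaultdict

def recall_by_degree(test_alignment, predictions, degree,
                     buckets=(0, 5, 10, 20, float("inf"))):
    # Group test alignments by degree (sum of relation triples of the two
    # entities) and compute recall of top-1 predictions within each group.
    pred = dict(predictions)              # source entity -> predicted target
    stats = defaultdict(lambda: [0, 0])   # bucket -> [correct, total]
    for src, tgt in test_alignment:
        d = degree[src] + degree[tgt]
        for lo, hi in zip(buckets, buckets[1:]):
            if lo <= d < hi:
                b = (lo, hi)
                break
        stats[b][1] += 1
        if pred.get(src) == tgt:
            stats[b][0] += 1
    return {b: c / t for b, (c, t) in stats.items()}
```

Plotting the returned per-bucket recalls reproduces the long-tail analysis: low-degree buckets are where relation-based approaches lose most of their recall.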

15K datasets vs. 100K datasets. We observe that all the approaches perform better on the 15K datasets than on the 100K datasets, except on D-Y. This is because the 100K datasets have more complex KG structures and a larger candidate space, which negatively affects embedding-based entity alignment. However, as shown in Table 2, D-Y-15K and D-Y-100K have very similar numbers of relations in YAGO, which makes the results differ from those on the other datasets.

Datasets        Fastest          Median            Slowest
All 15K (V1)    GCNAlign: 85     MultiKE: 379      BootEA: 2,141
All 15K (V2)    GCNAlign: 90     MultiKE: 622      RSN4EA: 3,903
All 100K (V1)   MTransE: 647     RDGCN: 3,845      BootEA: 36,416
All 100K (V2)   MTransE: 568     GCNAlign: 5,057   BootEA: 47,735

Table 6: Average running time (sec.) of current approaches

Additionally, Table 6 shows a brief comparison of the average running time over five repetitions. We find that the running time of different approaches varies greatly. BootEA and RSN4EA use much more time than the others, due to the bootstrapping procedure and the path sampling, respectively. In contrast, GCNAlign and MTransE take significantly less time, thanks to their lightweight implementations. Overall, MultiKE achieves a good balance between effectiveness and efficiency. Also, it is not surprising that most approaches use more time on the dense datasets than on the sparse ones, because the dense datasets contain more relation triples for training. However, the running time of MTransE and JAPE on the sparse datasets is longer. We check the logs and find that they are prone to overfitting on the dense datasets, so they are early-stopped within fewer training epochs.

Relations vs. attributes. For the purely relation-based approaches, there is no clear advantage of one relation embedding technique over another. For example, although MTransE and BootEA both use TransE, their performance is at two extremes. We believe that the negative sampling used in BootEA makes a great contribution. The work in [7] also shows that negative sampling can largely affect the expressiveness of KG embeddings. Besides, the bootstrapping strategy of BootEA also contributes a lot to its performance, which will be discussed shortly. As another example, IPTransE and RSN4EA both extend triple-based embedding by linking relation triples into long relation paths, but their results are also significantly different. This is because the recurrent skipping network used by RSN4EA to encode relation paths is more powerful than the shallow composition operation used in IPTransE.

Figure 6: Hits@1 results of JAPE, GCNAlign, KDCoE, AttrE, IMUSE, MultiKE, RDGCN and their degraded variants without attribute embedding

For the approaches using attributes, we also compare them with their degraded variants without attribute embedding. Due to the space limitation, Figure 6 only shows the Hits@1 results on D-W-15K (V1) and D-Y-15K (V2); other datasets show similar results. On the D-Y dataset, we do not observe significant improvement from JAPE and GCNAlign, which use attribute correlations to cluster entities. The problem is that, without pre-aligned attributes, this technique fails to capture the attribute correlations across different KGs. Furthermore, even if the attribute correlations are discovered, this signal is too coarse-grained to determine whether two entities with correlated attributes are aligned (consider, for example, two different persons with the same set of correlated attributes like name, gender and age). Differently, literal embedding brings significant improvement to most approaches except IMUSE, which indicates that literals are a stronger signal for identifying entity alignment than attribute correlations. IMUSE has a preprocessing step that uses literals to find new entity alignment to augment the training data; however, the errors in the new alignment also harm performance. We notice that most approaches fail to be improved by attribute embedding on D-W. The symbolic heterogeneity of attributes in Wikidata (the local names of attributes are special IDs) notably challenges some approaches, as they cannot automatically find high-quality attribute alignment for literal comparison. Overall, we think that attribute heterogeneity has a strong effect on capturing attribute correlations, while literal embeddings contribute to entity alignment.

Figure 7: Precision, recall and F1-score of augmented alignment during iterations on EN-FR-100K (V1)

Semi-supervised learning strategies. To improve performance, some approaches employ semi-supervised strategies such as self-training or co-training as the interaction mode. We further investigate the strengths and limitations of these strategies by analyzing the quality of the augmented seed alignment. Figure 7 shows the precision, recall and F1-score of IPTransE, BootEA and KDCoE during the semi-supervised training iterations on EN-FR-100K (V1); other datasets show similar results. Surprisingly, IPTransE fails to achieve good performance, because it introduces many errors as the self-training continues but has no mechanism to eliminate them. KDCoE propagates new alignment by co-training two orthogonal types of features, i.e., relation triples and textual descriptions. However, this strategy does not bring improvement, because some entities lack textual descriptions. BootEA employs a heuristic editing method to remove wrong alignment. After a period of fluctuations, its precision stays stable while its recall continues growing during self-training, which brings a clear performance boost. Therefore, the design of semi-supervised learning strategies has a strong influence on the final performance.
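The precision, recall and F1-score of an augmented alignment set against the reference alignment, as plotted in Figure 7, can be computed as follows (a minimal sketch):

```python
def augmentation_quality(augmented, reference):
    # Precision/recall/F1 of augmented alignment pairs against the reference.
    augmented, reference = set(augmented), set(reference)
    tp = len(augmented & reference)                      # correct new pairs
    p = tp / len(augmented) if augmented else 0.0
    r = tp / len(reference) if reference else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Tracking these three values per iteration makes the contrast visible: an error-pruning strategy like BootEA's keeps precision stable while recall grows, whereas unchecked self-training lets precision decay.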

6 Exploratory Experiments

6.1 Geometric Analysis

In addition to the performance comparison, we now focus on the geometric properties of entity embeddings, to understand how these embeddings support entity alignment and what the underlying limitations of existing approaches are.

Figure 8: Visualization of the similarities between entities and their top-5 nearest cross-KG neighbors on the D-Y-15K (V1) dataset. The five rows from top to bottom correspond to the similarities from the first to the fifth nearest neighbors, respectively. Darker color indicates larger similarity.

6.1.1 Similarity Distribution

Given entity embeddings, the alignment inference algorithm identifies aligned entities via nearest neighbor search in the embedding space. It is thus interesting to investigate the similarity distribution between each entity and its nearest neighbors in the cross-KG scope. Towards this end, Figure 8 visualizes the average similarities between entities of the source KG and their top-5 nearest neighbors in the target KG, based on the embeddings learned on D-Y-15K (V1). To make the similarities comparable across all the approaches, we use cosine similarity as the metric. The results show two interesting findings:

First, the average similarities between source entities and their top-1 nearest neighbors (top-1 similarities) differ widely across approaches. BootEA, KDCoE, MultiKE and RDGCN yield a very high top-1 similarity, while IPTransE and RSN4EA show the opposite. Intuitively, a high top-1 similarity indicates better quality because it reflects how confidently the entity embeddings capture the alignment information between the two KGs. Most approaches with a high top-1 similarity, such as BootEA, MultiKE and RDGCN, also achieve good performance for entity alignment (see Table 5). For KDCoE, as shown in Figure 7, the low precision of its augmented alignment makes its top-1 entity alignment contain many errors. Hence, its performance is not as good as BootEA's, but it still outperforms many other approaches.

Second, the similarity variances among the top-5 nearest neighbors also differ greatly, which is reflected by the color gradients of the five rows from top to bottom. For example, BootEA, KDCoE, RSN4EA and RDGCN show large variances, while MTransE, IPTransE and JAPE exhibit very slight variances. A small similarity variance means that the five nearest neighbors are not discriminative enough for the source entity to identify its counterpart correctly.

Other datasets also show similar distributions. In summary, the ideal similarity distribution for entity alignment has a high top-1 similarity and a large similarity variance.
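The top-5 cross-KG similarity computation behind Figure 8 can be sketched with a brute-force nearest-neighbor search (cosine similarity, as in our experiments; the implementation is illustrative):

```python
import math

def topk_cosine(src_vecs, tgt_vecs, k=5):
    # For each source embedding, return the cosine similarities of its
    # k nearest target embeddings, in descending order (brute force).
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = (math.sqrt(sum(x * x for x in a))
               * math.sqrt(sum(y * y for y in b)))
        return num / den
    out = {}
    for s, v in src_vecs.items():
        sims = sorted((cos(v, w) for w in tgt_vecs.values()), reverse=True)
        out[s] = sims[:k]
    return out
```

Averaging the returned lists per rank (first, second, …, fifth nearest neighbor) over all source entities yields the five rows of the heat map.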

6.1.2 Hubness and Isolation

Hubness is a common phenomenon in high-dimensional vector spaces [69], where some points (known as hubs) frequently appear as the top-1 nearest neighbors of many other points in the vector space. A high level of hubness indicates a high centrality of points. Another phenomenon is that some outliers are isolated from any point clusters. Both phenomena have negative effects on tasks relying on nearest neighbor search, such as cross-lingual word alignment [11, 64].

Here, we investigate whether embedding-based entity alignment also suffers from the hubness and isolation problems. For a quantitative analysis, we measure the proportions of target entities that appear zero, one and more times as the nearest neighbors of source entities, respectively. Due to the space limitation, Figure 9 shows the results on D-Y-15K (V1); the experiments on other datasets show similar results. Surprisingly, we find that a large proportion of target entities never appear as the top-1 nearest neighbor of any source entity (marked with orange bars). This means that such isolated entities would never be considered if we used the greedy strategy of choosing the top-1 nearest neighbor to form alignment. Consequently, we would miss much correct entity alignment. The entities (blue and gray bars) that appear as the nearest neighbors of more than one source entity also occupy considerable proportions. They would cause many violations of the 1-to-1 mapping constraint and globally increase the uncertainty of alignment inference. We observe that the approaches which yield fewer isolated and hub entities, such as MultiKE and RDGCN, achieve the leading performance of entity alignment, and vice versa. In summary, the ideal case is to have small proportions of isolated and hub entities. This finding also inspires us to roughly estimate the final entity alignment performance through hubness and isolation analysis.

Figure 9: Proportions of target entities that appear zero, one and more times as the nearest neighbors of source entities on D-Y-15K (V1)
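The zero/one/more-times statistics in Figure 9 can be reproduced with a brute-force nearest-neighbor count; a minimal sketch:

```python
import math
from collections import Counter

def hubness_stats(src_vecs, tgt_vecs):
    # Proportions of target entities appearing 0, 1, or >1 times as the
    # top-1 nearest neighbor (by cosine similarity) of source entities.
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = (math.sqrt(sum(x * x for x in a))
               * math.sqrt(sum(y * y for y in b)))
        return num / den
    top1 = Counter()
    for v in src_vecs.values():
        best = max(tgt_vecs, key=lambda t: cos(v, tgt_vecs[t]))
        top1[best] += 1
    n = len(tgt_vecs)
    zero = sum(1 for t in tgt_vecs if top1[t] == 0) / n
    one = sum(1 for t in tgt_vecs if top1[t] == 1) / n
    more = sum(1 for t in tgt_vecs if top1[t] > 1) / n
    return zero, one, more
```

A large `zero` proportion signals isolation, a large `more` proportion signals hubness; the ideal embedding space pushes both toward zero.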

To resolve the hubness and isolation problems, we further explore cross-domain similarity local scaling (CSLS) [11] as an alternative metric. Taking cosine similarity as an example, we have: