
Aligning Cross-Lingual Entities with Multi-Aspect Information

by   Hsiu-Wei Yang, et al.

Multilingual knowledge graphs (KGs), such as YAGO and DBpedia, represent entities in different languages. The task of cross-lingual entity alignment is to match entities in a source language with their counterparts in target languages. In this work, we investigate embedding-based approaches to encode entities from multilingual KGs into the same vector space, where equivalent entities are close to each other. Specifically, we apply graph convolutional networks (GCNs) to combine multi-aspect information of entities, including topological connections, relations, and attributes of entities, to learn entity embeddings. To exploit the literal descriptions of entities expressed in different languages, we propose two uses of a pretrained multilingual BERT model to bridge cross-lingual gaps. We further propose two strategies to integrate GCN-based and BERT-based modules to boost performance. Extensive experiments on two benchmark datasets demonstrate that our method significantly outperforms existing systems.



1 Introduction

English entity: University of Toronto. Japanese entity: トロント大学.

| Attribute (English) | Value | Attribute (Japanese) | Value |
| --- | --- | --- | --- |
| Name | University of Toronto | 大学名 | トロント大学 |
| Type | Public University | 学校種別 | 州立 |
| Found Date | 1827-03-15 | 創立年 | 1827 |
| Campus | Ontario | キャンパス | セントジョージ(トロント) |
| Former Name | King’s College | 旧名 | キングスカレッジ |

Description (English): The University of Toronto is a public research university in Toronto, Ontario, Canada
Description (Japanese): トロント大学 は、オンタリオ州、トロントに本部を置くカナダの州立大学である

Figure 1: An example fragment of two KGs (in English and Japanese) connected by an inter-lingual link (ILL). In addition to the graph structures (top) consisting of entity nodes and typed relation edges, KGs also provide attributes and literal descriptions of entities (bottom).

A growing number of multilingual knowledge graphs (KGs) have been built, such as DBpedia Bizer et al. (2009), YAGO Suchanek et al. (2008); Rebele et al. (2016), and BabelNet Navigli and Ponzetto (2012), which typically represent real-world knowledge as separately-structured monolingual KGs. Such KGs are connected via inter-lingual links (ILLs) that align entities with their counterparts in different languages, exemplified by Figure 1 (top). Highly-integrated multilingual KGs contain useful knowledge that can benefit many knowledge-driven cross-lingual NLP tasks, such as machine translation Moussallem et al. (2018) and cross-lingual named entity recognition Darwish (2013). However, the coverage of ILLs among existing KGs is quite low Chen et al. (2018): for example, less than 20% of the entities in DBpedia are covered by ILLs. The goal of cross-lingual entity alignment is to discover entities from different monolingual KGs that actually refer to the same real-world entities, i.e., discovering the missing ILLs.

Traditional methods for this task apply machine translation techniques to translate entity labels Spohr et al. (2011). The quality of alignments in the cross-lingual scenario heavily depends on the quality of the adopted translation systems. In addition to entity labels, existing KGs also provide multi-aspect information of entities, including topological connections, relation types, attributes, and literal descriptions expressed in different languages Bizer et al. (2009); Xie et al. (2016), as shown in Figure 1 (bottom). The key challenge of addressing such a task thus is how to better model and use provided multi-aspect information of entities to bridge cross-lingual gaps and find more equivalent entities (i.e., ILLs).

Recently, embedding-based solutions Chen et al. (2017b); Sun et al. (2017); Zhu et al. (2017); Wang et al. (2018); Chen et al. (2018) have been proposed to unify multilingual KGs into the same low-dimensional vector space where equivalent entities are close to each other. Such methods only make use of one or two aspects of the aforementioned information. For example, Zhu et al. (2017) relied only on topological features while Sun et al. (2017) and Wang et al. (2018) exploited both topological and attribute features. Chen et al. (2018) proposed a co-training algorithm to combine topological features and literal descriptions of entities. However, combining all of this multi-aspect information about entities (i.e., topological connections, relations and attributes, as well as literal descriptions) remains under-explored.

In this work, we propose a novel approach to learn cross-lingual entity embeddings by using all aforementioned aspects of information in KGs. To be specific, we propose two variants of GCN-based models, namely Man and Hman, that incorporate multi-aspect features, including topological features, relation types, and attributes into cross-lingual entity embeddings. To capture semantic relatedness of literal descriptions, we fine-tune the pretrained multilingual BERT model Devlin et al. (2019) to bridge cross-lingual gaps. We design two strategies to combine GCN-based and BERT-based modules to make alignment decisions. Experiments show that our method achieves new state-of-the-art results on two benchmark datasets. Source code for our models is publicly available at

2 Problem Definition

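In a multilingual knowledge graph $\mathcal{G}$, we use $\mathcal{L}$ to denote the set of languages that $\mathcal{G}$ contains and $\mathcal{G}_i = \{E_i, R_i, A_i, V_i, D_i\}$ to represent the language-specific knowledge graph in language $L_i \in \mathcal{L}$. $E_i$, $R_i$, $A_i$, $V_i$, and $D_i$ are sets of entities, relations, attributes, values of attributes, and literal descriptions, each of which portrays one aspect of an entity. The graph $\mathcal{G}_i$ consists of relation triples $\langle h_i, r_i, t_i \rangle$ and attribute triples $\langle h_i, a_i, v_i \rangle$ such that $h_i, t_i \in E_i$, $r_i \in R_i$, $a_i \in A_i$, and $v_i \in V_i$. Each entity is accompanied by a literal description consisting of a sequence of words in language $L_i$, e.g., $\langle h_i, d_{h,i} \rangle$ and $\langle t_i, d_{t,i} \rangle$, with $d_{h,i}, d_{t,i} \in D_i$.

Given two knowledge graphs $\mathcal{G}_1$ and $\mathcal{G}_2$ expressed in source language $L_1$ and target language $L_2$, respectively, there exists a set of pre-aligned ILLs $I(\mathcal{G}_1, \mathcal{G}_2) = \{(e, u) \mid e \in E_1, u \in E_2\}$ which can be considered training data. The task of cross-lingual entity alignment is to align entities in $\mathcal{G}_1$ with their cross-lingual counterparts in $\mathcal{G}_2$, i.e., discover missing ILLs.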

3 Proposed Approach

In this section, we first introduce two GCN-based models, namely Man and Hman, that learn entity embeddings from the graph structures. Second, we discuss two uses of a multilingual pretrained BERT model to learn cross-lingual embeddings of entity descriptions: PointwiseBert and PairwiseBert. Finally, we investigate two strategies to integrate the GCN-based and the BERT-based modules.

3.1 Cross-Lingual Graph Embeddings

Graph convolutional networks (GCNs) Kipf and Welling (2017) are variants of convolutional networks that have proven effective in capturing information from graph structures, such as dependency graphs Guo et al. (2019b), abstract meaning representation graphs Guo et al. (2019a), and knowledge graphs Wang et al. (2018). In practice, multi-layer GCNs are stacked to collect evidence from multi-hop neighbors. Formally, the $l$-th GCN layer takes as input feature representations $H^{(l-1)}$ and outputs $H^{(l)}$:

$$H^{(l)} = \phi\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l-1)} W^{(l)}\right) \tag{1}$$

where $\tilde{A} = A + I$ is the adjacency matrix (with self-loops added), $I$ is the identity matrix, $\tilde{D}$ is the diagonal node degree matrix of $\tilde{A}$, $\phi(\cdot)$ is ReLU function, and $W^{(l)}$ represents learnable parameters in the $l$-th layer. $H^{(0)}$ is the initial input.
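The propagation rule of Kipf and Welling (2017) described above can be illustrated with a small dependency-free sketch. The helper names (`matmul`, `normalized_adjacency`, `gcn_layer`), the toy graph, and the weights are all illustrative assumptions, not part of the released code; a real implementation would use sparse matrices and a deep learning framework.

```python
import math

def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalized_adjacency(adj):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a square adjacency matrix A."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]  # at least 1 because of the self-loop
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def gcn_layer(a_hat, h, w):
    """One propagation step: ReLU(A_hat @ H @ W)."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]

# Toy graph (hypothetical): three entities connected in a path 0 - 1 - 2.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
a_hat = normalized_adjacency(adj)
h0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # identity matrix as input
w0 = [[0.5, -0.5], [1.0, 0.2], [-0.3, 0.7]]  # made-up weights, 3 -> 2 dimensions
h1 = gcn_layer(a_hat, h0, w0)
```

Stacking a second call to `gcn_layer` on `h1` would let each entity see its two-hop neighborhood, which is the motivation for multi-layer GCNs.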

GCNs can iteratively update the representation of each entity node via a propagation mechanism through the graph. Inspired by previous studies Zhang et al. (2018); Wang et al. (2018), we also adopt GCNs in this work to collect evidence from multilingual KG structures and to learn cross-lingual embeddings of entities. The primary assumptions are: (1) equivalent entities tend to be neighbored by equivalent entities via the same types of relations; (2) equivalent entities tend to share similar or even the same attributes.

Multi-Aspect Entity Features. Existing KGs Bizer et al. (2009); Suchanek et al. (2008); Rebele et al. (2016) provide multi-aspect information of entities. In this section, we mainly focus on the following three aspects: topological connections, relations, and attributes. The key challenge is how to utilize the provided features to learn better embeddings of entities. We discuss how we construct raw features for the three aspects, which are then fed as inputs to our model. We use $X_t$, $X_r$, and $X_a$ to denote the topological connection, relation, and attribute features, respectively.

The topological features contain rich neighborhood proximity information of entities, which can be captured by multi-layer GCNs. As in Wang et al. (2018), we set the initial topological features to $X_t = I$, i.e., an identity matrix serving as index vectors (also known as the featureless setting), so that the GCN can learn the representations of corresponding entities.

In addition, we also consider the relation and attribute features. As shown in Figure 1, the connected relations and attributes of two equivalent entities, e.g., “University of Toronto” (English) and “トロント大学” (Japanese), have a lot of overlap, which can benefit cross-lingual entity alignment. Specifically, they share the same relation types, e.g., “country” and “almaMater”, and some attributes, e.g., “foundDate” and “創立年”. To capture relation information, Schlichtkrull et al. (2018) proposed RGCN with relation-wise parameters. However, with respect to this task, existing KGs typically contain thousands of relation types but few pre-aligned ILLs. Directly adopting RGCN may introduce too many parameters for the limited training data and thus cause overfitting. Wang et al. (2018) instead simply used the unlabeled GCNs Kipf and Welling (2017) with two proposed measures (i.e., functionality and inverse functionality) to encode the information of relations into the adjacency matrix. They also considered attributes as input features in their architecture. However, this approach may lose information about relation types. Therefore, we regard relations and attributes of entities as bag-of-words features to explicitly model these two aspects. Specifically, we construct count-based N-hot vectors $X_r$ and $X_a$ for these two aspects of features, respectively, where the $(i, j)$ entry is the count of the $j$-th relation (attribute) for the corresponding entity $e_i$. Note that we only consider the top-$F$ most frequent relations and attributes to avoid data sparsity issues. Thus, for each entity, both of its relation and attribute features are $F$-dimensional vectors.
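The count-based feature construction described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_count_features`, the tie-breaking rule for equally frequent items, and the toy relation lists are hypothetical and not taken from the paper's code.

```python
from collections import Counter

def build_count_features(entity_items, top_f):
    """Map each entity to a count vector over the top_f most frequent items.

    entity_items: dict of entity -> list of relation (or attribute) names,
    with repeats kept so that the counts are meaningful.
    """
    freq = Counter(item for items in entity_items.values() for item in items)
    # Sort by descending frequency, then by name, so the vocabulary is deterministic.
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [name for name, _ in ranked[:top_f]]
    index = {name: j for j, name in enumerate(vocab)}
    features = {}
    for entity, items in entity_items.items():
        vec = [0] * len(vocab)
        for item in items:
            j = index.get(item)
            if j is not None:  # items outside the top_f vocabulary are dropped
                vec[j] += 1
        features[entity] = vec
    return vocab, features

# Hypothetical relation lists, loosely modeled on the case study entities.
relations = {
    "Casino_Royale_(2006_film)": ["starring", "starring", "distributor"],
    "Daniel_Craig": ["starring"],
    "Columbia_Pictures": ["distributor", "owner"],
}
vocab, x_r = build_count_features(relations, top_f=2)
```

With `top_f=2`, the rare relation `owner` falls outside the vocabulary and is ignored, which mirrors the truncation to the most frequent relations and attributes.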

Man. Inspired by Wang et al. (2018), we propose the Multi-Aspect Alignment Network (Man) to capture the three aspects of entity features. Specifically, three $l$-layer GCNs take as inputs the triple-aspect features (i.e., $X_t$, $X_r$, and $X_a$) and produce the representations $H_t^{(l)}$, $H_r^{(l)}$, and $H_a^{(l)}$ according to Equation 1, respectively. Finally, the multi-aspect entity embedding is:

$$H_m = \left[ H_t^{(l)} \oplus H_r^{(l)} \oplus H_a^{(l)} \right] \tag{2}$$

where $\oplus$ denotes vector concatenation. $H_m$ can then feed into alignment decisions.

Such fusion through concatenation is also known as Scoring Level Fusion, which has proven simple but effective for capturing multi-modal semantics Bruni et al. (2014); Kiela and Bottou (2014); Collell et al. (2017). It is worth noting that the differences between Man and the work of Wang et al. (2018) are twofold. First, we use the same approach as in Kipf and Welling (2017) to construct the adjacency matrix, while Wang et al. (2018) designed a new connectivity matrix as the adjacency matrix for the GCNs. Second, Man explicitly regards the relation type features as model input, while Wang et al. (2018) incorporated such relation information into the connectivity matrix.

Figure 2: Architecture of Hman.
Figure 3: Architecture overview of PointwiseBert (left) and PairwiseBert (right).

Hman. Note that Man propagates relation and attribute information through the graph structure. However, for aligning a pair of entities, we observe that considering the relations and attributes of neighboring entities, besides their own ones, may introduce noise. Merely focusing on relation and attribute features of the current entity could be a better choice. Thus, we propose the Hybrid Multi-Aspect Alignment Network (Hman) to better model such diverse features, shown in Figure 2. Similar to Man, we still leverage the $l$-th layer of a GCN to obtain topological embeddings $H_t^{(l)}$, but exploit feedforward neural networks to obtain the embeddings with respect to relations and attributes. The feedforward neural networks consist of one fully-connected (FC) layer and a highway network layer Srivastava et al. (2015). The reason we use highway networks is consistent with the conclusions of Mudgal et al. (2018), who conducted a design space exploration of neural models for entity matching and found that highway networks are generally better than FC layers in convergence speed and effectiveness.

Formally, these feedforward neural networks are defined as:

$$\begin{aligned} S_f &= \phi\left(W_f^{(1)} X_f + b_f^{(1)}\right) \\ T_f &= \sigma\left(W_f^{t} S_f + b_f^{t}\right) \\ G_f &= \phi\left(W_f^{(2)} S_f + b_f^{(2)}\right) \cdot T_f + S_f \cdot (1 - T_f) \end{aligned} \tag{3}$$

where $f \in \{r, a\}$ and $X_f$ refer to one specific aspect (i.e., relation or attribute) and the corresponding raw features, respectively, $W_f^{(1)}$, $W_f^{(2)}$, $W_f^{t}$ and $b_f^{(1)}$, $b_f^{(2)}$, $b_f^{t}$ are model parameters, $\phi(\cdot)$ is ReLU function, and $\sigma(\cdot)$ is sigmoid function. Accordingly, we obtain the hybrid multi-aspect entity embedding $H_y = [H_t^{(l)} \oplus G_r \oplus G_a]$, to which $\ell_2$ normalization is further applied.
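To make the FC-plus-highway design concrete, the following is a minimal sketch for a single entity's raw feature vector. It follows the standard highway formulation of Srivastava et al. (2015), with a sigmoid transform gate that mixes a ReLU candidate with the carried-over FC output; the parameter names (`w1`, `wt`, `w2`, and so on) and the toy values are assumptions made for illustration.

```python
import math

def affine(w, x, b):
    """Compute W x + b for a weight matrix (list of rows), vector x, and bias b."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def relu(v):
    return [max(0.0, a) for a in v]

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def highway_ffn(x, params):
    """FC layer followed by one highway layer, applied to one feature vector."""
    s = relu(affine(params["w1"], x, params["b1"]))     # FC projection
    t = sigmoid(affine(params["wt"], s, params["bt"]))  # transform gate in (0, 1)
    g = relu(affine(params["w2"], s, params["b2"]))     # candidate update
    # The gate mixes the candidate with the carried-over FC output s.
    return [gi * ti + si * (1.0 - ti) for gi, ti, si in zip(g, t, s)]

# Toy parameters (hypothetical): 3 raw count features projected to 2 dimensions.
params = {
    "w1": [[0.1, 0.2, 0.0], [0.0, -0.1, 0.3]], "b1": [0.0, 0.0],
    "wt": [[0.0, 0.0], [0.0, 0.0]], "bt": [0.0, 0.0],  # gate fixed at 0.5
    "w2": [[1.0, 0.0], [0.0, 1.0]], "b2": [0.0, 0.0],  # identity candidate
}
out = highway_ffn([2.0, 1.0, 0.0], params)
```

Because no neighbor information enters `highway_ffn`, the resulting relation and attribute embeddings depend only on the entity's own features, which is the point of replacing GCN propagation for these two aspects.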

Model Objective. Given two knowledge graphs, $\mathcal{G}_1$ and $\mathcal{G}_2$, and a set of pre-aligned entity pairs $I(\mathcal{G}_1, \mathcal{G}_2)$ as training data, our model is trained in a supervised fashion. During the training phase, the goal is to embed cross-lingual entities into the same low-dimensional vector space where equivalent entities are close to each other. Following Wang et al. (2018), our margin-based ranking loss function is defined as:

$$J = \sum_{(e_1, e_2) \in I} \sum_{(e_1', e_2') \in I'} \left[ \rho(h_{e_1}, h_{e_2}) + \beta - \rho(h_{e_1'}, h_{e_2'}) \right]_+ \tag{4}$$

where $[x]_+ = \max\{0, x\}$, $I'$ denotes the set of negative entity alignment pairs constructed by corrupting the gold pair $(e_1, e_2) \in I$. Specifically, we replace $e_1$ or $e_2$ with a randomly-chosen entity in $E_1$ or $E_2$. $\rho(x, y)$ is the $\ell_1$ distance function, and $\beta > 0$ is the margin hyperparameter separating positive and negative pairs.
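The margin-based objective with corrupted negative pairs can be sketched as below. The sketch assumes an L1 distance, excludes the gold entity when sampling a replacement, and uses hypothetical entity names and embeddings; none of these details should be read as the authors' exact implementation.

```python
import random

def l1_distance(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def corrupt(pair, source_entities, target_entities, rng):
    """Replace one side of a gold pair with a random entity from the same KG."""
    e1, e2 = pair
    if rng.random() < 0.5:
        return rng.choice([e for e in source_entities if e != e1]), e2
    return e1, rng.choice([e for e in target_entities if e != e2])

def margin_ranking_loss(emb, gold_pairs, negatives, margin):
    """Sum of hinge terms [d(pos) + margin - d(neg)]_+ over gold pairs and their negatives."""
    total = 0.0
    for pos in gold_pairs:
        d_pos = l1_distance(emb[pos[0]], emb[pos[1]])
        for neg in negatives[pos]:
            d_neg = l1_distance(emb[neg[0]], emb[neg[1]])
            total += max(0.0, d_pos + margin - d_neg)
    return total

# Toy embeddings (hypothetical): the gold pair is close, the negative is far apart.
emb = {"en:a": [0.0, 0.0], "en:b": [5.0, 5.0], "zh:a": [0.0, 1.0], "zh:b": [5.0, 4.0]}
gold = [("en:a", "zh:a")]
negatives = {("en:a", "zh:a"): [("en:a", "zh:b")]}
loss_small_margin = margin_ranking_loss(emb, gold, negatives, margin=3.0)
loss_large_margin = margin_ranking_loss(emb, gold, negatives, margin=10.0)
sampled = corrupt(gold[0], ["en:a", "en:b"], ["zh:a", "zh:b"], random.Random(0))
```

In the toy data the positive distance is 1 and the negative distance is 9, so a margin of 3 is already satisfied and contributes no loss, while a margin of 10 is violated by 2.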

3.2 Cross-Lingual Textual Embeddings

Existing multilingual KGs Bizer et al. (2009); Navigli and Ponzetto (2012); Rebele et al. (2016) also provide literal descriptions of entities expressed in different languages and contain detailed semantic information about the entities. The key observation is that literal descriptions of equivalent entities are semantically close to each other. However, it is non-trivial to directly measure the semantic relatedness of two entities’ descriptions, since they are expressed in different languages.

Recently, Bidirectional Encoder Representations from Transformers (BERT) Devlin et al. (2019) has advanced the state of the art in various NLP tasks by heavily exploiting pretraining based on language modeling. Of special interest is the multilingual variant, which was trained with Wikipedia dumps of 104 languages. The spirit of BERT in the multilingual scenario is to project words or sentences from different languages into the same semantic space. This aligns well with our objective: bridging gaps between descriptions written in different languages. Therefore, we propose two methods for applying multilingual BERT, PointwiseBert and PairwiseBert, to help make alignment decisions.

PointwiseBert. A simple choice is to follow the basic design of BERT and formulate the entity alignment task as a text matching task. For two entities $e_1$ and $e_2$ from two KGs in $L_1$ and $L_2$, denoting source language and target language, respectively, their textual descriptions are $d_1$ and $d_2$, consisting of word sequences in two languages. The model takes as inputs [CLS] $d_1$ [SEP] $d_2$ [SEP], where [CLS] is the special classification token, from which the final hidden state is used as the sequence representation, and [SEP] is the special token for separating token sequences, and produces the probability of classifying the pair as equivalent entities. The probability is then used to rank all candidate entity pairs, i.e., ranking score. We denote this model as PointwiseBert, shown in Figure 3 (left).

This approach is computationally expensive, since for each entity we need to consider all candidate entities in the target language. One solution, inspired by the work of Shi et al. (2019), is to reduce the search space for each entity with a reranking strategy (see Section 3.3).

PairwiseBert. Due to the heavy computational cost of PointwiseBert, semantic matching between all entity pairs is very expensive. Instead of producing ranking scores for description pairs, we propose PairwiseBert to encode the entity literal descriptions as cross-lingual textual embeddings, where distances between entity pairs can be directly measured using these embeddings.

The PairwiseBert model consists of two components, each of which takes as input the description of one entity (from the source or target language), as depicted in Figure 3 (right). Specifically, the input is designed as [CLS] $d_1$ [SEP] for a source entity (or [CLS] $d_2$ [SEP] for a target entity), which is then fed into PairwiseBert for contextual encoding. We select the hidden state of [CLS] as the textual embedding of the entity description for training and inference. To bring the textual embeddings of cross-lingual entity descriptions into the same vector space, a ranking loss function similar to Equation 4 is used.

3.3 Integration Strategy

Sections 3.1 and 3.2 introduce two modules that separately collect evidence from knowledge graph structures and the literal descriptions of entities, namely graph and textual embeddings. In this section, we investigate two strategies to integrate these two modules to further boost performance.

Reranking. As mentioned in Section 3.2, the PointwiseBert model takes as input the concatenation of two descriptions for each candidate–entity pair, where conceptually we must process every possible pair in the training set. Such a setting would be computationally prohibitive.

One way to reduce the cost of PointwiseBert would be to ignore candidate pairs that are unlikely to be aligned. Rao et al. (2016) showed that uncertainty-based sampling can provide extra improvements in ranking. Following this idea, the GCN-based models (i.e., Man and Hman) are used to generate a candidate pool whose size is much smaller than the entire universe of entities. Specifically, GCN-based models provide top-$q$ candidates of target entities for each source entity (where $q$ is a hyperparameter). Then, the PointwiseBert model produces a ranking score for each candidate–entity pair in the pool to further rerank the candidates. However, the weakness of such a reranking strategy is that performance is bounded by the quality of (potentially limited) candidates produced by Man or Hman.
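The two-stage reranking strategy can be sketched as follows. The function `pair_scorer` is a hypothetical stand-in for the PointwiseBert probability, the pool size is passed as `q`, and the L1 distance for the first stage is an assumption; the embeddings and scores are made up.

```python
def l1_distance(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def rerank(source, source_emb, target_emb, pair_scorer, q):
    """Select a candidate pool with graph embeddings, then reorder it with a pair scorer."""
    # Stage 1: keep the q targets closest to the source in the graph embedding space.
    pool = sorted(
        target_emb,
        key=lambda t: l1_distance(source_emb[source], target_emb[t]),
    )[:q]
    # Stage 2: reorder the pool by the (expensive) pairwise score, highest first.
    return sorted(pool, key=lambda t: pair_scorer(source, t), reverse=True)

# Toy data (hypothetical): "t3" has the best text score but is far away in graph space.
source_emb = {"s": [0.4, 0.0]}
target_emb = {"t1": [0.0, 0.0], "t2": [1.0, 0.0], "t3": [9.0, 9.0]}
scores = {("s", "t1"): 0.2, ("s", "t2"): 0.9, ("s", "t3"): 0.99}
ranked = rerank("s", source_emb, target_emb, lambda s, t: scores[(s, t)], q=2)
```

The example also shows the stated weakness: `t3` never reaches the second stage because it is excluded from the pool, so the final ranking cannot recover from a poor candidate list.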

Weighted Concatenation. With the textual embeddings learned by PairwiseBert denoted as $h^B$ and graph embeddings denoted as $h^G$, a simple way to combine the two modules is by weighted concatenation:

$$h^C = \tau \cdot h^G \oplus (1 - \tau) \cdot h^B \tag{5}$$

where $h^G$ is the graph embeddings learned by either Man or Hman, and $\tau$ is a factor to balance the contribution of each source (where $\tau$ is a hyperparameter).
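A small sketch of weighted concatenation is given below. It assumes that the balance factor `tau` scales the graph embedding and `1 - tau` scales the textual embedding, and that pairs are compared with an L1 distance; both choices, along with the toy vectors, are assumptions made for illustration.

```python
def l1_distance(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def weighted_concat(graph_emb, text_emb, tau):
    """Concatenate tau * graph_emb with (1 - tau) * text_emb."""
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    return [tau * g for g in graph_emb] + [(1.0 - tau) * t for t in text_emb]

# Toy embeddings (hypothetical) for one source entity and one target entity.
src_graph, src_text = [1.0, 2.0], [4.0, 0.0]
tgt_graph, tgt_text = [1.5, 2.0], [3.0, 1.0]
tau = 0.8
src = weighted_concat(src_graph, src_text, tau)
tgt = weighted_concat(tgt_graph, tgt_text, tau)
combined = l1_distance(src, tgt)
```

Under the L1 assumption, the distance between two concatenated vectors equals `tau` times the graph distance plus `1 - tau` times the textual distance, so the balance factor directly sets how much each module contributes to the final ranking.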

3.4 Entity Alignment

After we obtain the embeddings of entities, we leverage $\ell_1$ distance to measure the distance between candidate–entity pairs. A small distance reflects a high probability for an entity pair to be aligned as equivalent entities. To be specific, with respect to the reranking strategy, we select the target entities that have the smallest distances to a source entity in the vector space learned by Man or Hman as its candidates. For weighted concatenation, we employ the $\ell_1$ distance of the representations of a pair derived from the concatenated embedding, i.e., $h^C$, as the ranking score.

4 Experiments

4.1 Datasets and Settings

We evaluate our methods over two benchmark datasets: DBP15K and DBP100K Sun et al. (2017). Table 1 outlines the statistics of both datasets, which contain 15,000 and 100,000 ILLs, respectively. Both are divided into three subsets: Chinese-English (ZH-EN), Japanese-English (JA-EN), and French-English (FR-EN).

Following previous work Sun et al. (2017); Wang et al. (2018), we adopt the same split settings in our experiments, where 30% of the ILLs are used as training and the remaining 70% for evaluation. Hits@k is used as the evaluation metric Bordes et al. (2013); Sun et al. (2017); Wang et al. (2018), which measures the proportion of correctly aligned entities ranked in the top-$k$ candidates, and results in both directions, e.g., ZH-EN and EN-ZH, are reported.
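As a brief illustration of the metric, Hits@k can be computed from ranked candidate lists as follows. The function name `hits_at_k` and the toy rankings are hypothetical; the value is reported as a percentage to match the result tables.

```python
def hits_at_k(ranked_candidates, gold, k):
    """Percentage of source entities whose gold target appears in the top-k candidates.

    ranked_candidates: dict of source -> list of targets, best first.
    gold: dict of source -> correct target.
    """
    hits = sum(1 for s, t in gold.items() if t in ranked_candidates[s][:k])
    return 100.0 * hits / len(gold)

# Toy rankings (hypothetical) for four source entities.
ranked = {
    "a": ["x", "y", "z"],
    "b": ["y", "x", "z"],
    "c": ["z", "y", "x"],
    "d": ["x", "z", "y"],
}
gold = {"a": "x", "b": "x", "c": "x", "d": "y"}
h1, h2, h3 = (hits_at_k(ranked, gold, k) for k in (1, 2, 3))
```

Evaluating the reverse direction (for example EN-ZH instead of ZH-EN) amounts to swapping the roles of source and target when building `ranked` and `gold`.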

In all our experiments, we employ two-layer GCNs and the top 1000 (i.e., $F$=1000) most frequent relation types and attributes are included to build the N-hot feature vectors. For the Man model, we set the dimensionality of topological, relation, and attribute embeddings to 200, 100, and 100, respectively. When training Hman, the hyperparameters are dependent on the dataset sizes due to GPU memory limitations. For DBP15K, we set the dimensionality of topological embeddings, relation embeddings, and attribute embeddings to 200, 100, and 100, respectively. For DBP100K, the dimensionalities are set to 100, 50, and 50, respectively. We adopt SGD to update parameters and the numbers of epochs are set to 2,000 and 50,000 for Man and Hman, respectively. The margin $\beta$ in the loss function is set to 3. The balance factor $\tau$ is determined by grid search, which shows that the best performance lies in the range from 0.7 to 0.8. For simplicity, $\tau$ is set to 0.8 in all associated experiments. Multilingual BERT-base models with 768 hidden units are used in PointwiseBert and PairwiseBert. We additionally append one more FC layer to the representation of [CLS] and reduce the dimensionality to 300. Both BERT models are fine-tuned using the Adam optimizer.

| Dataset | Subset | Language | Entities | Rel. | Attr. | Rel. triples | Attr. triples |
| --- | --- | --- | --- | --- | --- | --- | --- |
| DBP15K | ZH-EN | Chinese | 66,469 | 2,830 | 8,113 | 153,929 | 379,684 |
| DBP15K | ZH-EN | English | 98,125 | 2,317 | 7,173 | 237,674 | 567,755 |
| DBP15K | JA-EN | Japanese | 65,744 | 2,043 | 5,882 | 164,373 | 354,619 |
| DBP15K | JA-EN | English | 95,680 | 2,096 | 6,066 | 233,319 | 497,230 |
| DBP15K | FR-EN | French | 66,858 | 1,379 | 4,547 | 192,191 | 528,665 |
| DBP15K | FR-EN | English | 105,889 | 2,209 | 6,422 | 278,590 | 576,543 |
| DBP100K | ZH-EN | Chinese | 106,517 | 4,431 | 16,152 | 329,890 | 1,404,615 |
| DBP100K | ZH-EN | English | 185,022 | 3,519 | 14,459 | 453,248 | 1,902,725 |
| DBP100K | JA-EN | Japanese | 117,836 | 2,888 | 12,305 | 413,558 | 1,474,721 |
| DBP100K | JA-EN | English | 118,570 | 2,631 | 13,238 | 494,087 | 1,738,803 |
| DBP100K | FR-EN | French | 105,724 | 1,775 | 8,029 | 409,399 | 1,361,509 |
| DBP100K | FR-EN | English | 107,231 | 2,504 | 13,170 | 513,382 | 1,957,813 |

Table 1: Statistics of DBP15K and DBP100K. Rel. and Attr. stand for relations and attributes, respectively.
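DBP15K

| Model | ZH-EN @1 | @10 | @50 | EN-ZH @1 | @10 | @50 | JA-EN @1 | @10 | @50 | EN-JA @1 | @10 | @50 | FR-EN @1 | @10 | @50 | EN-FR @1 | @10 | @50 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Hao et al. (2016) | 21.2 | 42.7 | 56.7 | 19.5 | 39.3 | 53.2 | 18.9 | 39.9 | 54.2 | 17.8 | 38.4 | 52.4 | 15.3 | 38.8 | 56.5 | 14.6 | 37.2 | 54.0 |
| Chen et al. (2017a) | 30.8 | 61.4 | 79.1 | 24.7 | 52.4 | 70.4 | 27.8 | 57.4 | 75.9 | 23.7 | 49.9 | 67.9 | 24.4 | 55.5 | 74.4 | 21.2 | 50.6 | 69.9 |
| Sun et al. (2017) | 41.1 | 74.4 | 88.9 | 40.1 | 71.0 | 86.1 | 36.2 | 68.5 | 85.3 | 38.3 | 67.2 | 82.6 | 32.3 | 66.6 | 83.1 | 32.9 | 65.9 | 82.3 |
| Wang et al. (2018) | 41.2 | 74.3 | 86.2 | 36.4 | 69.9 | 82.4 | 39.9 | 74.4 | 86.1 | 38.4 | 71.8 | 83.7 | 37.2 | 74.4 | 86.7 | 36.7 | 73.0 | 86.3 |
| Man | 46.0 | 79.4 | 90.0 | 41.5 | 75.6 | 88.3 | 44.6 | 78.8 | 90.0 | 43.0 | 77.1 | 88.7 | 43.1 | 79.7 | 91.7 | 42.1 | 79.1 | 90.9 |
| Man w/o te | 21.5 | 55.0 | 79.4 | 20.2 | 53.6 | 78.8 | 15.0 | 44.0 | 69.9 | 14.3 | 44.0 | 70.6 | 10.2 | 34.5 | 59.5 | 10.8 | 35.2 | 60.3 |
| Man w/o re | 45.6 | 79.1 | 89.5 | 41.1 | 75.0 | 87.3 | 44.2 | 78.7 | 89.8 | 43.0 | 76.9 | 88.1 | 42.8 | 79.7 | 91.4 | 42.1 | 78.9 | 90.6 |
| Man w/o ae | 43.7 | 77.1 | 87.8 | 39.2 | 72.9 | 85.5 | 43.2 | 77.6 | 88.4 | 41.2 | 74.9 | 86.6 | 42.9 | 79.6 | 91.0 | 41.5 | 78.9 | 90.5 |
| Hman | 56.2 | 85.1 | 93.4 | 53.7 | 83.4 | 92.5 | 56.7 | 86.9 | 94.5 | 56.5 | 86.6 | 94.6 | 54.0 | 87.1 | 95.0 | 54.3 | 86.7 | 95.1 |
| Hman w/o te | 13.2 | 16.7 | 38.3 | 13.5 | 17.2 | 38.5 | 15.4 | 22.3 | 45.5 | 15.2 | 22.0 | 45.5 | 12.4 | 13.9 | 35.3 | 12.2 | 13.7 | 35.3 |
| Hman w/o re | 50.2 | 78.4 | 86.5 | 49.3 | 78.6 | 87.0 | 52.6 | 81.6 | 89.1 | 52.4 | 81.1 | 89.8 | 52.7 | 84.2 | 91.4 | 52.0 | 83.9 | 91.1 |
| Hman w/o ae | 49.2 | 81.0 | 89.8 | 48.8 | 80.9 | 90.0 | 52.2 | 83.3 | 91.6 | 51.5 | 83.1 | 91.6 | 52.3 | 85.6 | 93.7 | 52.3 | 85.1 | 93.2 |
| Hman w/o hw | 46.8 | 76.1 | 84.1 | 46.0 | 76.2 | 84.6 | 50.5 | 79.5 | 87.5 | 49.9 | 79.1 | 87.5 | 51.9 | 82.7 | 90.9 | 51.6 | 82.5 | 90.6 |

DBP100K

| Model | ZH-EN @1 | @10 | @50 | EN-ZH @1 | @10 | @50 | JA-EN @1 | @10 | @50 | EN-JA @1 | @10 | @50 | FR-EN @1 | @10 | @50 | EN-FR @1 | @10 | @50 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Hao et al. (2016) | - | 16.9 | - | - | 16.6 | - | - | 21.1 | - | - | 20.9 | - | - | 22.9 | - | - | 22.6 | - |
| Chen et al. (2017a) | - | 34.3 | - | - | 29.1 | - | - | 33.9 | - | - | 27.2 | - | - | 44.8 | - | - | 39.1 | - |
| Sun et al. (2017) | 20.2 | 41.2 | 58.3 | 19.6 | 39.4 | 56.0 | 19.4 | 42.1 | 60.5 | 19.1 | 39.4 | 55.9 | 26.2 | 54.6 | 70.5 | 25.9 | 51.3 | 66.9 |
| Wang et al. (2018) | 23.1 | 47.5 | 63.8 | 19.2 | 40.3 | 55.4 | 26.4 | 55.1 | 70.0 | 21.9 | 44.4 | 56.6 | 29.2 | 58.4 | 68.7 | 25.7 | 50.5 | 59.8 |
| Man | 27.2 | 54.2 | 72.8 | 24.7 | 50.2 | 69.0 | 30.0 | 60.4 | 77.3 | 26.6 | 54.4 | 71.2 | 31.6 | 64.0 | 77.3 | 28.8 | 59.3 | 73.4 |
| Man w/o te | 11.8 | 28.6 | 47.7 | 11.2 | 28.3 | 47.9 | 17.4 | 21.7 | 39.4 | 17.2 | 21.6 | 39.8 | 15.4 | 19.4 | 38.2 | 15.1 | 18.8 | 37.1 |
| Man w/o re | 26.5 | 53.4 | 72.1 | 23.9 | 49.2 | 67.9 | 29.8 | 60.3 | 77.1 | 26.3 | 53.9 | 70.6 | 31.0 | 63.2 | 76.4 | 28.4 | 58.4 | 72.2 |
| Man w/o ae | 25.5 | 51.7 | 70.4 | 22.8 | 47.6 | 66.3 | 29.4 | 59.4 | 76.1 | 25.9 | 52.9 | 69.7 | 30.8 | 62.7 | 75.8 | 28.1 | 57.8 | 71.5 |
| Hman | 29.8 | 54.6 | 69.5 | 28.7 | 53.3 | 69.0 | 34.3 | 63.3 | 76.1 | 33.8 | 63.0 | 76.7 | 37.5 | 67.7 | 77.7 | 37.6 | 68.1 | 78.5 |
| Hman w/o te | 16.8 | 20.3 | 39.2 | 17.2 | 21.0 | 39.4 | 13.0 | 11.5 | 27.3 | 13.3 | 11.8 | 28.0 | 10.5 | 13.5 | 11.1 | 10.5 | 13.4 | 11.4 |
| Hman w/o re | 28.0 | 50.3 | 62.3 | 28.2 | 50.6 | 62.9 | 30.3 | 54.9 | 64.8 | 30.2 | 55.9 | 66.9 | 32.8 | 60.3 | 69.1 | 33.3 | 60.9 | 69.8 |
| Hman w/o ae | 25.7 | 46.4 | 57.3 | 25.5 | 64.7 | 57.9 | 29.6 | 55.1 | 66.1 | 29.9 | 56.1 | 67.4 | 32.5 | 59.2 | 67.8 | 32.9 | 59.4 | 68.4 |
| Hman w/o hw | 25.2 | 46.0 | 57.9 | 25.2 | 45.9 | 57.9 | 28.6 | 52.6 | 62.2 | 28.5 | 53.0 | 63.0 | 32.8 | 60.9 | 70.0 | 32.9 | 60.2 | 70.3 |

Table 2: Results of using graph information on DBP15K and DBP100K. @1, @10 and @50 refer to Hits@1, Hits@10 and Hits@50, respectively.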
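DBP15K

| Model | ZH-EN @1 | @10 | @50 | EN-ZH @1 | @10 | @50 | JA-EN @1 | @10 | @50 | EN-JA @1 | @10 | @50 | FR-EN @1 | @10 | @50 | EN-FR @1 | @10 | @50 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Translation | 55.7 | 67.6 | 74.3 | 40.3 | 54.2 | 62.2 | 74.6 | 84.5 | 89.1 | 61.9 | 72.0 | 77.2 | - | - | - | - | - | - |
| JAPE + Translation | 73.0 | 90.4 | 96.6 | 62.7 | 85.2 | 94.2 | 82.8 | 94.6 | 98.3 | 75.9 | 90.7 | 96.0 | - | - | - | - | - | - |
| PairwiseBert | 74.3 | 94.6 | 98.8 | 74.8 | 94.7 | 99.0 | 78.6 | 95.8 | 98.5 | 78.3 | 95.4 | 98.4 | 95.2 | 99.2 | 99.6 | 94.9 | 99.2 | 99.7 |
| Man (Rerank) | 84.2 | 93.6 | 94.8 | 82.1 | 91.8 | 93.1 | 89.4 | 94.0 | 94.8 | 88.2 | 93.3 | 94.0 | 93.1 | 95.2 | 95.4 | 93.1 | 95.3 | 95.4 |
| Hman (Rerank) | 86.5 | 95.9 | 96.9 | 85.8 | 94.1 | 95.3 | 89.0 | 96.0 | 97.3 | 89.0 | 96.0 | 97.5 | 95.3 | 97.7 | 97.8 | 95.2 | 97.9 | 98.1 |
| Man (Weighted) | 85.4 | 98.2 | 99.7 | 83.8 | 97.7 | 99.5 | 90.8 | 98.8 | 99.7 | 89.9 | 98.5 | 99.5 | 96.8 | 99.6 | 99.8 | 96.7 | 99.7 | 99.9 |
| Hman (Weighted) | 87.1 | 98.7 | 99.8 | 86.4 | 98.5 | 99.8 | 93.5 | 99.4 | 99.9 | 93.3 | 99.3 | 99.9 | 97.3 | 99.8 | 99.9 | 97.3 | 99.8 | 99.9 |

DBP100K

| Model | ZH-EN @1 | @10 | @50 | EN-ZH @1 | @10 | @50 | JA-EN @1 | @10 | @50 | EN-JA @1 | @10 | @50 | FR-EN @1 | @10 | @50 | EN-FR @1 | @10 | @50 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PairwiseBert | 65.1 | 85.1 | 92.6 | 66.2 | 85.8 | 92.9 | 67.7 | 86.5 | 93.1 | 67.9 | 86.4 | 93.2 | 93.2 | 97.9 | 98.9 | 93.4 | 98.0 | 98.9 |
| Man (Rerank) | 59.5 | 62.1 | 62.2 | 55.9 | 58.2 | 58.2 | 65.5 | 68.2 | 68.4 | 59.9 | 62.1 | 62.3 | 69.7 | 70.4 | 70.5 | 65.5 | 66.2 | 66.2 |
| Hman (Rerank) | 58.9 | 61.2 | 61.3 | 57.9 | 60.2 | 60.3 | 66.9 | 69.4 | 69.6 | 67.0 | 69.6 | 69.8 | 72.1 | 72.9 | 73.0 | 72.7 | 73.5 | 73.5 |
| Man (Weighted) | 81.4 | 94.9 | 98.2 | 80.5 | 94.1 | 97.7 | 84.3 | 95.4 | 98.3 | 81.5 | 94.2 | 97.6 | 96.2 | 99.3 | 99.7 | 95.7 | 99.1 | 99.6 |
| Hman (Weighted) | 81.1 | 94.3 | 97.8 | 80.3 | 94.5 | 97.9 | 85.2 | 96.1 | 98.4 | 84.6 | 96.1 | 98.5 | 96.5 | 99.4 | 99.7 | 96.5 | 99.5 | 99.8 |

Table 3: Results of using both graph and textual information on DBP15K and DBP100K. @1, @10, and @50 refer to Hits@1, Hits@10, and Hits@50, respectively. Results for Translation and JAPE + Translation are taken from Sun et al. (2017).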

4.2 Results on Graph Embeddings

We first compare Man and Hman against previous systems Hao et al. (2016); Chen et al. (2017a); Sun et al. (2017); Wang et al. (2018). As shown in Table 2, Man and Hman consistently outperform all baselines in all scenarios, especially Hman. It is worth noting that, in this case, Man and Hman use as much information as Wang et al. (2018), while Sun et al. (2017) require extra supervised information (relations and attributes of two KGs need to be aligned in advance). The performance improvements confirm that our model can better utilize topological, relational, and attribute information of entities provided by KGs.

To explain why Hman achieves better results than Man, recall that Man collects relation and attribute information by the propagation mechanism in GCNs where such knowledge is exchanged through neighbors, while Hman uses feedforward networks to capture expressive features directly from the input feature vectors without propagation. As we discussed before, it is not always the case that neighbors of equivalent entities share similar relations or attributes. Propagating such features through linked entities in GCNs may introduce noise and thus harm performance.

Moreover, we perform ablation studies on the two proposed models to investigate the effectiveness of each component. We remove each aspect of features in turn (i.e., topological, relation, and attribute features), as well as the highway layer in Hman, denoted as w/o te (re, ae, and hw). As reported in Table 2, we observe that after removing relation or attribute features, the performance of Hman and Man drops across all datasets. These results indicate that these two aspects of features are useful in making alignment decisions. On the other hand, compared to Man, Hman shows more significant performance drops, which also suggests that the feedforward networks capture relation and attribute features better than GCNs in this scenario. Interestingly, looking at the two variants Man w/o te and Hman w/o te, we can see the former achieves better results. Since Man propagates relation and attribute features via graph structures, it can still implicitly capture topological knowledge of entities even after we remove the topological features. However, Hman loses such structure knowledge when topological features are excluded, and thus its results are worse. From these experiments, we can conclude that topological information plays an indispensable role in making alignment decisions.

4.3 Results with Textual Embeddings

In this section, we discuss empirical results involving the addition of entity descriptions, shown in Table 3. Applying literal descriptions of entities to conduct cross-lingual entity alignment is relatively under-explored. The recent work of Chen et al. (2018) used entity descriptions in their model; however, we are unable to make comparisons with their work, as we do not have access to their code and data. Since we employ BERT to learn textual embeddings of descriptions, we consider systems that also use external resources, like Google Translate, as our baselines. We directly take results reported by Sun et al. (2017), denoted as “Translation” and “JAPE + Translation”.

The PointwiseBert model is used with GCN-based models, which largely reduces the search space, as indicated by Man (Rerank) and Hman (Rerank), where the difference is that the candidate pools are given by Man and Hman, respectively. For DBP15K, we select top-200 candidate target entities as the candidate pool while for DBP100K, top-20 candidates are selected due to its larger size. The reranking method does lead to performance gains across all datasets, where the improvements are dependent on the quality of the candidate pools. Hman (Rerank) generally performs better than Man (Rerank) since Hman recommends more promising candidate pools.

The PairwiseBert model learns the textual embeddings that map cross-lingual descriptions into the same space, which can be directly used to align entities. The results are listed under PairwiseBert in Table 3. We can see that it achieves good results on its own, which also shows the efficacy of using multilingual descriptions. Moreover, such textual embeddings can be combined with graph embeddings (learned by Man or Hman) by weighted concatenation, as discussed in Section 3.3. The results are reported as Man (Weighted) and Hman (Weighted), respectively. As we can see, this simple operation leads to significant improvements and gives excellent results across all datasets. However, it is not always the case that KGs provide descriptions for every entity. For those entities whose descriptions are not available, the graph embeddings would be the only source for making alignment decisions.

| | English | Chinese |
| --- | --- | --- |
| ILL pair | Casino_Royale_(2006_film) (3) | 007大戰皇家賭場 (3) |
| Features | starring, starring, distributor | starring, starring, language |
| Neighbors | Daniel_Craig (1), Eva_Green (4), Columbia_Pictures (9) | 丹尼爾·克雷格 (1), 伊娃·格蓮 (4), 英語 (832) |

Table 4: Case study of the noise introduced by the propagation mechanism.

4.4 Case Study

In this section, we describe a case study to understand the performance gap between Hman and Man. The example in Table 4 provides insights potentially explaining this performance gap. We argue that Man introduces unexpected noise from heterogeneous nodes during the GCN propagation process. We use the number in parentheses (*) after entity names to denote the number of relation features they have.

In this particular example, the two entities “Casino_Royale_(2006_film)” in the source language (English) and “007大戰皇家賭場” in the target language (Chinese) both have three relation features. We notice that the propagation mechanism introduces some neighbors which are unable to find cross-lingual counterparts from the other end, marked in red. Considering the entity “英語” (English), a neighbor of “007大戰皇家賭場”, no counterparts can be found in the neighbors of “Casino_Royale_(2006_film)”. We also observe that “英語” (English) is a pivot node in the Chinese KG and has 832 relations, such as “語言” (Language), “官方語言” (Official Language), and “頻道語言” (Channel Language). In this case, propagating features from neighbors can harm performance. In fact, the feature sets of the ILL pair already convey information that captures their similarity (e.g., the “starring” marked in blue are shared twice). Therefore, by directly using feedforward networks, Hman is able to effectively capture such knowledge.
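The comparison in this case study can be reproduced with a few lines of Python. The feature and neighbor lists are copied from Table 4; the `assumed_links` dictionary, which pairs the two actors across languages, is an assumption introduced for illustration and is not given in the paper.

```python
from collections import Counter

# Relation features of the ILL pair, as listed in Table 4.
en_features = Counter(["starring", "starring", "distributor"])
zh_features = Counter(["starring", "starring", "language"])

# Multiset intersection: features shared by both entities, with multiplicity.
shared_features = en_features & zh_features

# Neighbors of the two entities, as listed in Table 4.
en_neighbors = ["Daniel_Craig", "Eva_Green", "Columbia_Pictures"]
zh_neighbors = ["丹尼爾·克雷格", "伊娃·格蓮", "英語"]

# Hypothetical cross-lingual links between neighbors (assumed for this sketch).
assumed_links = {"Daniel_Craig": "丹尼爾·克雷格", "Eva_Green": "伊娃·格蓮"}

# Neighbors with no counterpart on the other side.
unmatched_en = [e for e in en_neighbors if assumed_links.get(e) not in zh_neighbors]
linked_zh = set(assumed_links.values())
unmatched_zh = [z for z in zh_neighbors if z not in linked_zh]
```

Under these assumptions, the entities' own feature sets already overlap on “starring” twice, while each side has one neighbor without a counterpart (“Columbia_Pictures” and “英語”), which is the kind of unmatched signal that neighbor propagation would mix into the entity representation.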

5 Related Work

KG Alignment. Research on KG alignment can be categorized into two groups: monolingual and multilingual entity alignment. For monolingual entity alignment, the main approaches align two entities by computing the string similarity of entity labels (Scharffe et al., 2009; Volz et al., 2009; Ngomo and Auer, 2011) or graph similarity (Raimond et al., 2008; Pershina et al., 2015; Azmy et al., 2019). Recently, Trsedya et al. (2019) proposed an embedding-based model that incorporates attribute values to learn the entity embeddings.

To match entities in different languages, Wang et al. (2012) leveraged only language-independent information to find possible links across multilingual Wiki knowledge graphs. Recent studies learned cross-lingual embeddings of entities based on TransE (Bordes et al., 2013), which are then used to align entities across languages. Chen et al. (2018) designed a co-training algorithm to alternately learn multilingual entity and description embeddings. Wang et al. (2018) applied GCNs with the connectivity matrix defined on relations to embed entities from multilingual KGs into a unified low-dimensional space.

In this work, we also employ GCNs. However, in contrast to Wang et al. (2018), we regard relation features as input to our models. In addition, we investigate two different ways to capture relation and attribute features.

Multilingual Sentence Representations. Another line of research related to this work is aligning sentences in multiple languages. Recent work (Hermann and Blunsom, 2014; Conneau et al., 2018; Eriguchi et al., 2018) studied cross-lingual sentence classification via zero-shot learning. Johnson et al. (2017) proposed a sequence-to-sequence multilingual machine translation system whose encoder can be used to produce cross-lingual sentence embeddings (Artetxe and Schwenk, 2018). Recently, BERT (Devlin et al., 2019) has advanced the state of the art on multiple natural language understanding tasks. Specifically, multilingual BERT enables learning representations of sentences under multilingual settings. We adopt BERT to produce cross-lingual representations of entity literal descriptions to capture their semantic relatedness, which benefits cross-lingual entity alignment.

6 Conclusion and Future Work

In this work, we focus on the task of cross-lingual entity alignment, which aims to discover mappings of equivalent entities in multilingual knowledge graphs. We propose two GCN-based models and two uses of multilingual BERT to investigate how to better utilize the multi-aspect information of entities provided by KGs, including topological connections, relations, attributes, and entity descriptions. Empirical results demonstrate that our best model consistently achieves state-of-the-art performance across all datasets. In the future, we would like to apply our methods to other multilingual datasets such as YAGO and BabelNet. Also, since literal descriptions of entities are not always available, we will investigate alternative ways to design graph-based models that can better capture structured knowledge for this task.


Acknowledgments

We would like to thank the three anonymous reviewers for their thoughtful and constructive comments. Waterloo researchers are supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada, with additional computational resources provided by Compute Ontario and Compute Canada. Yanyan Zou and Wei Lu are supported by Singapore Ministry of Education Academic Research Fund (AcRF) Tier 2 Project MOE2017-T2-1-156, and are partially supported by SUTD project PIE-SGP-AI-2018-01.


References

  • Artetxe and Schwenk (2018) Mikel Artetxe and Holger Schwenk. 2018. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. arXiv preprint arXiv:1812.10464.
  • Azmy et al. (2019) Michael Azmy, Peng Shi, Jimmy Lin, and Ihab F. Ilyas. 2019. Matching entities across different knowledge graphs with graph embeddings. arXiv preprint arXiv:1903.06607.
  • Bizer et al. (2009) Christian Bizer, Jens Lehmann, Georgi Kobilarov, Sören Auer, Christian Becker, Richard Cyganiak, and Sebastian Hellmann. 2009. DBpedia: A crystallization point for the web of data. Web Semantics: science, services and agents on the world wide web, 7(3):154–165.
  • Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Proceedings of NIPS.
  • Bruni et al. (2014) Elia Bruni, Nam-Khanh Tran, and Marco Baroni. 2014. Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49:1–47.
  • Chen et al. (2018) Muhao Chen, Yingtao Tian, Kai-Wei Chang, Steven Skiena, and Carlo Zaniolo. 2018. Co-training embeddings of knowledge graphs and entity descriptions for cross-lingual entity alignment. In Proceedings of IJCAI.
  • Chen et al. (2017a) Muhao Chen, Yingtao Tian, Mohan Yang, and Carlo Zaniolo. 2017a. Multilingual knowledge graph embeddings for cross-lingual knowledge alignment. In Proceedings of IJCAI.
  • Chen et al. (2017b) Muhao Chen, Tao Zhou, Pei Zhou, and Carlo Zaniolo. 2017b. Multi-graph affinity embeddings for multilingual knowledge graphs. In Proceedings of NIPS Workshop on Automated Knowledge Base Construction.
  • Collell et al. (2017) Guillem Collell, Ted Zhang, and Marie-Francine Moens. 2017. Imagined visual representations as multimodal embeddings. In Proceedings of AAAI.
  • Conneau et al. (2018) Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of EMNLP.
  • Darwish (2013) Kareem Darwish. 2013. Named entity recognition using cross-lingual resources: Arabic as an example. In Proceedings of ACL.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL.
  • Eriguchi et al. (2018) Akiko Eriguchi, Melvin Johnson, Orhan Firat, Hideto Kazawa, and Wolfgang Macherey. 2018. Zero-shot cross-lingual classification using multilingual neural machine translation. arXiv preprint arXiv:1809.04686.
  • Guo et al. (2019a) Zhijiang Guo, Yan Zhang, Zhiyang Teng, and Wei Lu. 2019a. Attention guided graph convolutional networks for relation extraction. In Proceedings of ACL.
  • Guo et al. (2019b) Zhijiang Guo, Yan Zhang, Zhiyang Teng, and Wei Lu. 2019b. Densely connected graph convolutional networks for graph-to-sequence learning. Transactions of the Association for Computational Linguistics.
  • Hao et al. (2016) Yanchao Hao, Yuanzhe Zhang, Shizhu He, Kang Liu, and Jun Zhao. 2016. A joint embedding method for entity alignment of knowledge bases. In Proceedings of China Conference on Knowledge Graph and Semantic Computing.
  • Hermann and Blunsom (2014) Karl Moritz Hermann and Phil Blunsom. 2014. Multilingual models for compositional distributed semantics. In Proceedings of ACL.
  • Johnson et al. (2017) Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, et al. 2017. Google’s multilingual neural machine translation system: Enabling zero-shot translation. TACL, 5:339–351.
  • Kiela and Bottou (2014) Douwe Kiela and Léon Bottou. 2014. Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In Proceedings of EMNLP.
  • Kipf and Welling (2017) Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In Proceedings of ICLR.
  • Moussallem et al. (2018) Diego Moussallem, Matthias Wauer, and Axel-Cyrille Ngonga Ngomo. 2018. Machine translation using semantic web technologies: A survey. Journal of Web Semantics, 51:1–19.
  • Mudgal et al. (2018) Sidharth Mudgal, Han Li, Theodoros Rekatsinas, AnHai Doan, Youngchoon Park, Ganesh Krishnan, Rohit Deep, Esteban Arcaute, and Vijay Raghavendra. 2018. Deep learning for entity matching: A design space exploration. In Proceedings of International Conference on Management of Data.
  • Navigli and Ponzetto (2012) Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217–250.
  • Ngomo and Auer (2011) Axel-Cyrille Ngonga Ngomo and Sören Auer. 2011. LIMES: A time-efficient approach for large-scale link discovery on the web of data. In Proceedings of IJCAI.
  • Pershina et al. (2015) Maria Pershina, Mohamed Yakout, and Kaushik Chakrabarti. 2015. Holistic entity matching across knowledge graphs. In Proceedings of IEEE International Conference on Big Data.
  • Raimond et al. (2008) Yves Raimond, Christopher Sutton, and Mark B. Sandler. 2008. Automatic interlinking of music datasets on the semantic web. In Proceedings of WWW workshop on Linked Data on the Web.
  • Rao et al. (2016) Jinfeng Rao, Hua He, and Jimmy Lin. 2016. Noise-contrastive estimation for answer selection with deep neural networks. In Proceedings of CIKM.
  • Rebele et al. (2016) Thomas Rebele, Fabian Suchanek, Johannes Hoffart, Joanna Biega, Erdal Kuzey, and Gerhard Weikum. 2016. YAGO: A multilingual knowledge base from Wikipedia, WordNet, and GeoNames. In Proceedings of International Semantic Web Conference.
  • Scharffe et al. (2009) François Scharffe, Yanbin Liu, and Chuguang Zhou. 2009. RDF-AI: an architecture for RDF datasets matching, fusion and interlink. In Proceedings of IJCAI workshop on Identity and Reference in Web-based Knowledge Representation.
  • Schlichtkrull et al. (2018) Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. 2018. Modeling relational data with graph convolutional networks. In Proceedings of European Semantic Web Conference.
  • Shi et al. (2019) Peng Shi, Jinfeng Rao, and Jimmy Lin. 2019. Simple attention-based representation learning for ranking short social media posts. In Proceedings of NAACL.
  • Spohr et al. (2011) Dennis Spohr, Laura Hollink, and Philipp Cimiano. 2011. A machine learning approach to multilingual and cross-lingual ontology matching. In Proceedings of International Semantic Web Conference.
  • Srivastava et al. (2015) Rupesh K. Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Training very deep networks. In Proceedings of NIPS.
  • Suchanek et al. (2008) Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2008. YAGO: A large ontology from Wikipedia and WordNet. Web Semantics: Science, Services and Agents on the World Wide Web, 6(3):203–217.
  • Sun et al. (2017) Zequn Sun, Wei Hu, and Chengkai Li. 2017. Cross-lingual entity alignment via joint attribute-preserving embedding. In Proceedings of International Semantic Web Conference.
  • Trsedya et al. (2019) Bayu Distiawan Trsedya, Jianzhong Qi, and Rui Zhang. 2019. Entity alignment between knowledge graphs using attribute embeddings. In Proceedings of AAAI.
  • Volz et al. (2009) Julius Volz, Christian Bizer, Martin Gaedke, and Georgi Kobilarov. 2009. Discovering and maintaining links on the web of data. In Proceedings of International Semantic Web Conference.
  • Wang et al. (2012) Zhichun Wang, Juanzi Li, Zhigang Wang, and Jie Tang. 2012. Cross-lingual knowledge linking across wiki knowledge bases. In Proceedings of WWW.
  • Wang et al. (2018) Zhichun Wang, Qingsong Lv, Xiaohan Lan, and Yu Zhang. 2018. Cross-lingual knowledge graph alignment via graph convolutional networks. In Proceedings of EMNLP.
  • Xie et al. (2016) Ruobing Xie, Zhiyuan Liu, Jia Jia, Huanbo Luan, and Maosong Sun. 2016. Representation learning of knowledge graphs with entity descriptions. In Proceedings of AAAI.
  • Zhang et al. (2018) Yuhao Zhang, Peng Qi, and Christopher D. Manning. 2018. Graph convolution over pruned dependency trees improves relation extraction. In Proceedings of EMNLP.
  • Zhu et al. (2017) Hao Zhu, Ruobing Xie, Zhiyuan Liu, and Maosong Sun. 2017. Iterative entity alignment via joint knowledge embeddings. In Proceedings of IJCAI.