A multilingual knowledge base (KB) such as DBpedia Lehmann et al. (2015), ConceptNet Speer et al. (2017) and Yago Mahdisoltani et al. (2015) stores multiple language-specific knowledge graphs (KGs) that express relations of many concepts and real-world entities. As each KG thereof is either extracted independently from monolingual corpora Lehmann et al. (2015); Mahdisoltani et al. (2015) or contributed by speakers of the language Speer et al. (2017); Mitchell et al. (2018), it is common for different KGs to constitute complementary knowledge Yang et al. (2019); Cao et al. (2019). Hence, aligning and synchronizing language-specific KGs supports AI systems with more comprehensive commonsense reasoning Lin et al. (2019); Li et al. (2019b); Yeo et al. (2018), and benefits various knowledge-driven NLP tasks, including machine translation Moussallem et al. (2018), narrative prediction Chen et al. (2019) and dialogue agents Sun et al. (2019a).
Learning to align multilingual KGs is a non-trivial task, as KGs with distinct surface forms, heterogeneous schemata and inconsistent structures easily cause traditional symbolic methods to fall short Suchanek et al. (2011); Wijaya et al. (2013); Jiménez-Ruiz et al. (2012). Recently, much attention has been paid to methods based on multilingual KG embeddings Chen et al. (2017a, b, 2018); Sun et al. (2017, 2018, 2019b); Zhang et al. (2019). Those methods seek to separately encode the structure of each language-specific KG in an embedding space. Then, based on some seed entity alignment, the entity counterparts in different KGs can be easily matched via distances or transformations of embedding vectors. The principle is that entities with relevant neighborhood information can be characterized with similar embedding representations. Such representations are particularly tolerant to the aforementioned heterogeneity of surface forms and schemata in language-specific KGs Chen et al. (2017a); Sun et al. (2018, 2020).
While multilingual KG embeddings provide a general and tractable way to align KGs, it remains challenging for related methods to precisely infer the correspondence of entities. The challenge is that the seed entity alignment, which serves as the essential training data to learn the connection between language-specific KG embeddings, is often scarce in KBs Chen et al. (2018); Sun et al. (2018). Hence, the lack of supervision often hinders the precision of inferred entity counterparts, and the effect becomes even more significant when KGs scale up and become inconsistent in contents and density Pujara et al. (2017). Several methods also gain auxiliary supervision from profile information of entities, including descriptions Chen et al. (2018); Yang et al. (2019); Zhang et al. (2019) and numerical attributes Sun et al. (2017); Trsedya et al. (2019); Pei et al. (2019a). However, such profile information is not available in many KGs Speer et al. (2017); Mitchell et al. (2018); Bond and Foster (2013), which makes these methods not generally applicable.
Unlike existing models that rely on internal information of KGs, we seek to create embeddings that incorporate both KGs and freely available text corpora, and exploit incidental supervision signals Roth (2017) from text corpora to enhance the alignment learning on KGs. In this paper, we propose a novel embedding model JEANS (Joint Embedding Based Entity Alignment with INcidental Supervision). Particularly, JEANS first performs a grounding process Gupta et al. (2017); Upadhyay et al. (2018) to link entity mentions in each monolingual text corpus to the KG of the same language. Based on the KGs and grounded text in a pair of languages, JEANS conducts two learning processes, i.e. embedding learning and alignment learning. The embedding learning process distributes entities, relations and lexemes of each language in a separate embedding space, in which a KG embedding model and a language model are jointly trained. This process seeks to leverage text contexts to help capture the proximity of entities. On top of that, alignment learning captures the correspondence of entities and lexemes in a self-learning manner Artetxe et al. (2018). Starting from a small amount of seed entity alignment, this process iteratively induces a transformation between language-specific embedding spaces, and infers more alignment of entities and lexemes at each iteration to improve the learning at the next one. Moreover, we also employ the closed-form Procrustes solution Conneau et al. (2018) to strengthen the learning and inference within each iteration. Experimental results on two benchmark datasets confirm the effectiveness of JEANS in leveraging incidental supervision, leading to significant improvement in entity alignment and drastically outperforming existing methods.
2 Related Work
We discuss relevant works in four topics.
Entity alignment. Entity alignment in KBs has been a long-standing problem Shvaiko and Euzenat (2011). Aside from earlier approaches based on symbolic or schematic similarity of entities Suchanek et al. (2011); Wijaya et al. (2013); Jiménez-Ruiz et al. (2012), more recent research addresses this task with multilingual KG embeddings. A representative method of this kind is MTransE Chen et al. (2017a). MTransE jointly learns two model components: a translational embedding model Bordes et al. (2013) that distributes the facts in language-specific KGs into separate embeddings, and a transformation-based alignment model that maps between entity counterparts across embedding spaces.
Following the general principle of MTransE, later approaches are developed along the following three lines. One is to incorporate various embedding learning techniques for KGs. Besides translational techniques, some models employ alternative relation modeling techniques to encode relation facts, such as circular correlation Nickel et al. (2016), Hadamard product Hao et al. (2019) and recurrent skipping networks Guo et al. (2019). Others encode entities with neighborhood aggregation techniques, including GCN Wang et al. (2018); Yang et al. (2019); Cao et al. (2019); Xu et al. (2019); Wu et al. (2019b), RGCN Wu et al. (2019a) and GAT Zhu et al. (2019). Their benefits are mainly to produce entity representations capturing high-order proximity, so as to better suit the alignment task. A few works follow the second line to enhance the alignment learning with semi-supervised learning techniques. Representative ones include co-training Chen et al. (2018), optimal transport Pei et al. (2019b) and bootstrapping Sun et al. (2018); Zhu et al. (2017), which improve the precision of alignment captured with limited supervision. The third line of research seeks to obtain additional supervision from entity profiles, including descriptions Chen et al. (2018); Yang et al. (2019), attributes Sun et al. (2017); Trsedya et al. (2019); Pei et al. (2019a); Yang et al. (2020) and KG schemata Zhang et al. (2019). While those alternative views of entities can effectively bridge the embeddings, the limitation of such methods lies in the unavailability of those views in many KGs Speer et al. (2017); Mitchell et al. (2018); Bond and Foster (2013).
Our method is mainly related to the third line of research. However, instead of leveraging specific intra-KB information, our method introduces supervision signals from text contexts that are freely accessible to almost any KB with the aid of grounding techniques. Meanwhile, our paper also follows the second line to improve alignment learning techniques, and couples two mainstream techniques for embedding learning.
Joint embeddings of entities and text. Fewer efforts have been devoted to jointly characterizing entities and text as embeddings. Wang et al. (2014b) propose to connect a translational embedding of Freebase Bollacker et al. (2008) to an English word embedding based on Wikipedia anchors, therefore providing a joint embedding to enhance link prediction in the KG. Zhong et al. (2015) generalize the approach in Wang et al. (2014b) with distant supervision based on entity descriptions and text corpora. Toutanova et al. (2015) extract dependency paths from sentences and jointly embed them with a KG using DistMult Yang et al. (2015) to support the relation extraction task. Several other approaches focus on jointly embedding words, entities Yamada et al. (2017); Newman-Griffis et al. (2018); Cao et al. (2017); Almasian et al. (2019) and entity types Gupta et al. (2017) appearing in the same textual contexts without considering the relational structure of a KG. These approaches are employed in monolingual NLP tasks including entity linking Gupta et al. (2017); Cao et al. (2017), entity abstraction Newman-Griffis et al. (2018) and factoid QA Yamada et al. (2017). As they focus on a monolingual and supervised scenario, they are essentially different from our goal of helping cross-lingual KG alignment with incidental supervision from unparalleled corpora.
Multilingual word embeddings. Our model component of alignment induction from text is closely connected to multilingual word embeddings. Earlier approaches in this line, whether supervised or weakly supervised, and whether based on a seed lexicon Zou et al. (2013) or parallel corpora Gouws et al. (2015), are systematically summarized in a recent survey Ruder et al. (2017). While a number of methods in this line can be employed in our model to gain additional supervision for entity alignment, we choose to use a combination of the Procrustes solution Conneau et al. (2018) with self-learning to offer precise inference of cross-lingual alignment based on limited seed alignment. Note that recent contextualized embeddings such as M-BERT Pires et al. (2019) and X-ELMo Schuster et al. (2019) are not suitable for our setting, since contextualization causes ambiguity to entity representations.
Incidental supervision. Incidental supervision is a recently introduced learning strategy Roth (2017), which seeks to retrieve supervision signals from data that are not labeled for the target task. This strategy has been applied to tasks including SRL He et al. (2019), controversy prediction Rethmeier et al. (2018) and dataless classification Song and Roth (2015). To the best of our knowledge, the proposed method here is the first of its kind that incorporates incidental supervision in embedding learning or entity alignment.
We hereby begin introducing our method with the formalization of learning resources.
In a KB, $\mathcal{L}$ denotes the set of languages, and $\mathcal{L}^2$ the set of unordered language pairs. $G_l$ is the language-specific KG of language $l \in \mathcal{L}$. $E_l$ and $R_l$ respectively denote the corresponding vocabularies of entities and relations. $T = (h, r, t)$ denotes a triple in $G_l$ such that $h, t \in E_l$ and $r \in R_l$. Boldfaced $\mathbf{h}$, $\mathbf{r}$, $\mathbf{t}$ represent the embedding vectors of head $h$, relation $r$, and tail $t$ respectively. For a language pair $(l_1, l_2) \in \mathcal{L}^2$, $I_E(l_1, l_2)$ denotes a set of entity alignments between $G_{l_1}$ and $G_{l_2}$, such that $e_1 \in E_{l_1}$ and $e_2 \in E_{l_2}$ for each entity pair $(e_1, e_2) \in I_E(l_1, l_2)$. Following the convention of previous work Chen et al. (2018); Sun et al. (2018); Yang et al. (2019), we assume the entity pairs to have a 1-to-1 mapping, as specified in $I_E(l_1, l_2)$. This assumption is congruent with the design of mainstream KBs Lehmann et al. (2015); Mahdisoltani et al. (2015) where disambiguation of entities is granted. Besides the definition of multilingual KGs, we use $D_l$ to denote the text corpus of language $l$. $D_l = \{d_1, d_2, \dots\}$ is a set of documents, where each document $d_i$ is a sequence of tokens from the monolingual lexicon $W_l$. Each token $w$ thereof is originally a lexeme, but may also be an entity surface form after the grounding process, and we also use boldfaced $\mathbf{w}$ to denote its vector. $I_W(l_1, l_2)$ denotes the seed lexicon between $(l_1, l_2)$, such that $w_1 \in W_{l_1}$ and $w_2 \in W_{l_2}$ for each lexeme pair $(w_1, w_2) \in I_W(l_1, l_2)$. Note that $I_W(l_1, l_2)$ only includes the alignment between lexemes, and may optionally serve as external supervision data. To be consistent with previous problem settings of entity alignment Chen et al. (2017a); Sun et al. (2018); Yang et al. (2019), $I_W(l_1, l_2)$ is not necessarily provided for training, but is defined to be compatible with cases where it is available.
JEANS addresses entity alignment in three consecutive processes. A grounding process first links entities of each KG to possible mentions of them in the corresponding monolingual corpus, therefore connecting entities and text tokens of the same language into a shared vocabulary. Then an embedding learning process characterizes the KG and text of each language in a separate embedding space. During this process, we couple both the translational technique Bordes et al. (2013); Chen et al. (2017a, 2018) and the neighborhood aggregation technique Wang et al. (2018); Yang et al. (2019), which are two representative techniques for characterizing a KG. Simultaneously, the monolingual text tokens are encoded with a skip-gram language model Mikolov et al. (2013). On top of the embeddings, starting from a small amount of seed entity alignment and an optional seed lexicon, the alignment learning process iteratively infers more alignment both on KGs and text using self-learning and the Procrustes solution Schönemann (1966). A depiction of JEANS's learning framework is shown in Fig. 1.
In the rest of this section, we introduce the technical details of each process.
3.1 (Noisy) Entity Grounding
The goal of the grounding process is to combine the vocabularies of the KG and the text corpus in each language. This serves as the premise for the embedding learning process to produce a shared representation scheme for entities, relations and lexemes, therefore allowing the alignment learning process to leverage supervision signals for both entities and lexemes. Note that the purpose of entity grounding here is to combine the two data modalities. Hence, we only expect this process to discover enough entity contexts and offer a higher coverage of entity vocabularies, while being tolerant to possible noise in entity recognition and linking. Particularly, we consider two grounding techniques, one using a pre-trained entity discovery and linking (EDL) model, the other based on simple surface form matching (SFM).
Pre-trained EDL model. One technique is to use off-the-shelf EDL models Khashabi et al. (2018); Manning et al. (2014). A typical model of this kind sequentially performs NER to detect entity mentions, and then links each mention to candidate entities from the KG based on symbolic and contextual similarity. Many EDL models are easily trainable on large text corpora with anchors, and offer promising performance of grounding and disambiguation on multiple languages Sil et al. (2018). In this paper, we do not go into details of the design of EDL models. Interested readers are referred to the aforementioned literature.
Surface form matching. If a pre-trained EDL model is not available, a simpler way of combining data is to match KG surface forms with text. This can be efficiently done by building a Completion Trie Hsu and Ottaviano (2013) for multi-token surface forms, and conducting a longest prefix matching Dharmapurikar et al. (2006) between surface forms and sub-sequences of text tokens. While this simple technique does not necessarily disambiguate entity mentions, experiments find it sufficient to combine the two modalities and allow supervision signals from induced lexical alignment to propagate to entities.
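The surface form matching step can be illustrated with a small sketch. It assumes whitespace-tokenized text and uses a plain token-level trie rather than the Completion Trie cited above; the names `TrieNode`, `build_trie` and `ground`, as well as the `E:` entity identifiers, are hypothetical and only serve the illustration.

```python
from typing import Dict, List, Optional


class TrieNode:
    """One node of a token-level trie over KG surface forms."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.entity: Optional[str] = None  # set when a surface form ends here


def build_trie(surface_forms: Dict[str, str]) -> TrieNode:
    """Index surface forms (whitespace-tokenized strings) by their tokens."""
    root = TrieNode()
    for form, entity in surface_forms.items():
        node = root
        for tok in form.split():
            node = node.children.setdefault(tok, TrieNode())
        node.entity = entity
    return root


def ground(tokens: List[str], root: TrieNode) -> List[str]:
    """Replace the longest matching span at each position by its entity id."""
    out: List[str] = []
    i = 0
    while i < len(tokens):
        node, j = root, i
        best_end, best_entity = -1, None
        # Walk the trie as far as the text allows, remembering the last match.
        while j < len(tokens) and tokens[j] in node.children:
            node = node.children[tokens[j]]
            j += 1
            if node.entity is not None:
                best_end, best_entity = j, node.entity
        if best_entity is not None:
            out.append(best_entity)
            i = best_end
        else:
            out.append(tokens[i])
            i += 1
    return out


forms = {
    "New York": "E:New_York",
    "New York City": "E:New_York_City",
    "Paris": "E:Paris",
}
trie = build_trie(forms)
grounded = ground("She moved from Paris to New York City last year".split(), trie)
```

As in the description above, no disambiguation is attempted: a span is simply replaced by the entity whose surface form is the longest match.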
Once the entity vocabulary and lexicon of a language are combined, we assume that entity mentions in $D_l$ are properly tokenized as grounded surface forms in $E_l$. Specifically, we now use $x \in E_l \cup W_l$ to denote a token in the grounded corpus that can either be an entity $e \in E_l$ or a lexeme $w \in W_l$. Given the combined learning resources for each language, we next describe the processes of embedding learning and alignment learning.
3.2 Embedding Learning
The embedding learning process is responsible for capturing the combined KG and text corpus of each language in a shared embedding space $\mathbb{R}^d$. In this process, JEANS jointly learns two model components to respectively encode units of the KG and the text, among which the overlaps use shared representations. We hereby describe these two model components in detail.
3.2.1 KG Embedding
As discussed in Section 2, previous approaches respectively leverage two forms of embedding learning techniques: (i) relation modeling Chen et al. (2017a); Sun et al. (2018) such as vector translations, circular correlation and Hadamard product, which seeks to capture relations as arithmetic operations in the vector space; (ii) neighborhood aggregation Wang et al. (2018); Yang et al. (2019); Cao et al. (2019), which employs graph neural networks (GNN) to encode neighborhood contexts for better capturing the proximity of entities.
The KG embedding model proposed in this work couples both forms of techniques. This aims at capturing both relations and entity proximity, two factors that are both beneficial for producing transferable entity embeddings. To achieve this goal, the encoder first stacks $N$ layers of GCN Kipf and Welling (2016) on the KG. Formally, the $n$-th layer representation $H^{(n)}$ is computed as

$$H^{(n)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(n-1)} W^{(n)}\right),$$

where $\tilde{D}$ is the diagonal degree matrix of the KG, $\tilde{A} = A + I$ is the sum of the adjacency matrix $A$ and an identity $I$, $W^{(n)}$ is a trainable weight matrix, and $\sigma$ is a nonlinear activation. The raw features of entities $H^{(0)}$ can be either entity attributes or randomly initialized. The last layer outputs are regarded as entity embedding representations, i.e. $H^{(N)}$.
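To make the layer-wise propagation concrete, the following is a minimal NumPy sketch of a stacked GCN encoder in the style of Kipf and Welling (2016). It is an illustration under assumptions, not the authors' implementation: the ReLU activation, the toy graph, the feature sizes and the function names are all hypothetical, and node degrees are computed on the adjacency matrix with self-loops added.

```python
import numpy as np


def normalize_adjacency(adj: np.ndarray) -> np.ndarray:
    """Symmetrically normalize (A + I), with degrees taken from A + I."""
    a_tilde = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_tilde.sum(axis=1))
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_forward(adj: np.ndarray, features: np.ndarray, weights: list) -> np.ndarray:
    """Apply one GCN layer per weight matrix (ReLU is an assumed activation)."""
    a_norm = normalize_adjacency(adj)
    h = features
    for w in weights:
        h = np.maximum(a_norm @ h @ w, 0.0)
    return h


# Toy KG with 3 entities on a path 0 - 1 - 2 (relation labels are ignored here).
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
features = rng.normal(size=(3, 8))  # randomly initialized raw features
weights = [rng.normal(size=(8, 8)) * 0.1, rng.normal(size=(8, 4)) * 0.1]
entity_embeddings = gcn_forward(adj, features, weights)  # one row per entity
```

With two weight matrices, each entity representation aggregates information from its two-hop neighborhood, which matches the two-layer setting reported in the experiments.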
We use $\mathbf{E}_l$ to denote the entity representations of language $l$; the following log-softmax loss is then optimized to perform relational modeling with translation vectors in the embedding space of $l$:

$$S_K^l = -\sum_{T \in G_l} \log \frac{\exp\left(b - f_r(h, t)\right)}{\sum_{\hat{T}} \exp\left(b - f_r(\hat{h}, \hat{t})\right)},$$

where $f_r(h, t) = \lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert$ is the plausibility measure of a triple Bordes et al. (2013), and $\hat{T} = (\hat{h}, r, \hat{t})$ is a Bernoulli negative-sampled triple Wang et al. (2014a) created by substituting either the head $h$ or the tail $t$ in $T$. $b$ is a positive bias to adjust the scale of the plausibility measure. All the entity representations optimized in $S_K^l$ are from $\mathbf{E}_l$. Note that we choose the translational technique over other relation modeling techniques because it is more robust in cases where KG structures are sparser Pujara et al. (2017).
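As an illustration of a log-softmax objective over translational plausibility scores, the sketch below scores one positive triple against a set of corrupted ones. The exact form is an assumption: the Euclidean norm, the choice to normalize over negative samples only, and the function names are hypothetical and may differ from the loss actually used.

```python
import numpy as np


def plausibility(h: np.ndarray, r: np.ndarray, t: np.ndarray) -> float:
    """Translational distance ||h + r - t||; lower means more plausible."""
    return float(np.linalg.norm(h + r - t))


def triple_log_softmax_loss(h, r, t, negatives, bias: float = 2.0) -> float:
    """Negative log-softmax of a positive triple against corrupted triples.

    negatives is a list of (h_neg, t_neg) pairs sharing the relation r.
    """
    pos = bias - plausibility(h, r, t)
    neg = np.array([bias - plausibility(hn, r, tn) for hn, tn in negatives])
    return float(np.log(np.exp(neg).sum()) - pos)


h = np.array([0.0, 0.0])
r = np.array([1.0, 0.0])
t_good = np.array([1.0, 0.0])  # h + r lands exactly on this tail
t_poor = np.array([1.0, 1.0])  # off by a distance of 1
negatives = [(h, np.array([4.0, 4.0]))]  # corrupted tail at distance 5

loss_good = triple_log_softmax_loss(h, r, t_good, negatives)
loss_poor = triple_log_softmax_loss(h, r, t_poor, negatives)
```

The loss is lower for the triple whose tail is closer to the translated head, which is the behavior the relational model is trained to produce.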
3.2.2 Text Embedding
In addition to the KG embedding model, the text embedding model seeks to leverage the contextual information of free text to help the embedding better capture the proximity of entities. This model employs the continuous skip-gram language model, which is in line with a number of word embedding methods Mikolov et al. (2013); Bojanowski et al. (2017); Conneau et al. (2018), and is realized by optimizing the following log-softmax loss:

$$S_T^l = -\sum_{x \in E_l \cup W_l} \sum_{x_c \in C_{x, D_l}} \log \frac{\exp\left(-\mathrm{dist}(x, x_c)\right)}{\sum_{x_n} \exp\left(-\mathrm{dist}(x, x_n)\right)}.$$

The text context $C_{x, D_l}$ thereof is the set of tokens that surround a token $x$ in the entity-grounded corpus $D_l$, $\mathrm{dist}(\cdot, \cdot)$ denotes the distance between the vectors of two tokens, and $x_n$ denotes a randomly sampled token in $E_l \cup W_l$.
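The skip-gram component can be sketched in two steps: collecting (token, context) pairs from a grounded token sequence, and scoring a pair against sampled negatives. The sketch is an assumption-laden illustration: the window is taken as `width` positions on each side, the distance is Euclidean, and the function names and example tokens are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def context_pairs(tokens: Sequence[str], width: int) -> List[Tuple[str, str]]:
    """Pair each token with the tokens at most `width` positions away."""
    pairs: List[Tuple[str, str]] = []
    for i, tok in enumerate(tokens):
        lo, hi = max(0, i - width), min(len(tokens), i + width + 1)
        pairs.extend((tok, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two vectors (an assumed choice)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def context_loss(center, context, negatives) -> float:
    """Negative log-softmax of one context token against sampled tokens."""
    pos = -distance(center, context)
    log_norm = math.log(sum(math.exp(-distance(center, n)) for n in negatives))
    return -(pos - log_norm)


# Grounded entities and ordinary lexemes share one token sequence.
sentence = ["E:Paris", "is", "the", "capital", "of", "E:France"]
pairs = context_pairs(sentence, width=2)

loss_near = context_loss((0.0, 0.0), (0.0, 0.0), [(3.0, 4.0)])
loss_far = context_loss((0.0, 0.0), (1.0, 0.0), [(3.0, 4.0)])
```

Because entity tokens appear in the same windows as lexemes, their vectors receive gradient signal from the surrounding text as well as from the KG.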
3.2.3 Embedding Learning Objective
For each language $l$, the goal of embedding learning is to optimize the joint loss $S_J^l = S_K^l + S_T^l$. As mentioned, the grounded entity surface forms in $D_l$ use shared representations in both model components, and hence are optimized with both $S_K^l$ and $S_T^l$. The remaining lexeme, relation and entity representations are optimized alternately by either component. In both model components, the numbers of negative samples of triples and tokens are adjustable hyperparameters.
Note that both model components may adopt alternative techniques, including other KG encoders such as GAT Veličković et al. (2018), multi-channel GCN Cao et al. (2019) and gated GNN Sun et al. (2020), and text embeddings such as GloVe Pennington et al. (2014). As experimenting with different embedding techniques is not a main contribution of this work, we leave them as future work. Specifically, contextualized text representations Peters et al. (2018); Devlin et al. (2019) are not applicable, as contextualization causes ambiguity to token representations.
3.3 Alignment Learning
Once the KG and text units of each language are captured in a shared embedding, the alignment learning process bridges each pair of embeddings. This process seeks to exploit additional alignment labels from text embeddings, and use those to help the alignment of entities. Different from the majority of methods discussed in Section 2 that jointly learn embeddings and alignment, the alignment learning process in JEANS is a retrofitting process Shi et al. (2019); Faruqui et al. (2015). Hence, the embedding of each language is fixed and does not require duplicate training for different language pairs Chen et al. (2017a); Sun et al. (2017).
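Given a pair of languages $(l_1, l_2) \in \mathcal{L}^2$, the objective of alignment learning is to induce a transformation $\mathbf{M}$ between the two embedding spaces. The following loss is minimized

$$S_A^{l_1, l_2} = \sum_{(x_1, x_2) \in I(l_1, l_2)} \lVert \mathbf{M}\mathbf{x}_1 - \mathbf{x}_2 \rVert_2^2,$$

in which $I(l_1, l_2) = I_E(l_1, l_2) \cup I_W(l_1, l_2)$, and the seed lexicon $I_W(l_1, l_2)$ is considered additional supervision data that are optionally provided. Each $\mathbf{x}_1$ ($\mathbf{x}_2$) denotes a fixed representation of either an entity or a lexeme of $l_1$ ($l_2$).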
Starting from a small amount of seed alignment in $I(l_1, l_2)$, JEANS conducts an iterative self-learning process to exploit more alignment labels for both entities and lexemes to improve the learning of $\mathbf{M}$. In each iteration, we follow Conneau et al. (2018) to induce a Procrustes solution for $\mathbf{M}$. To propose new alignment labels, the self-learning technique in JEANS deploys a mutual nearest neighbor (NN) constraint, which requires a suggested pair of matched units to appear in the $K$-NN of each other. More specifically, define $N_K^{l}(\mathbf{x})$ as the $K$-NN of vector $\mathbf{x}$ in the embedding space of $l$; this constraint requires a proposed match $(x_1, x_2)$ to be inserted into $I(l_1, l_2)$ only if $\mathbf{x}_2$ is in $N_K^{l_2}(\mathbf{M}\mathbf{x}_1)$, and $\mathbf{x}_1$ mutually appears in $N_K^{l_1}(\mathbf{M}^{-1}\mathbf{x}_2)$. Besides, we also require $x_1$ and $x_2$ to be of the same type, i.e. both being entities or both being lexemes. Particularly, we only select entities that have not been aligned in $I_E(l_1, l_2)$ to form the newly proposed entity pairs. This respects the 1-to-1 matching constraint of entities defined at the beginning of this section, and effectively reduces the candidate space after each iteration of self-learning. Meanwhile, 1-to-1 matching is not required for lexemes. To mitigate hubness, we also follow Conneau et al. (2018) to employ the Cross-domain Similarity Local Scaling (CSLS) measure.
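The CSLS measure and the mutual nearest neighbor constraint can be sketched as follows. This is a minimal illustration following the CSLS definition of Conneau et al. (2018) on cosine similarities; the function names and the toy data are hypothetical, and the type check and the exclusion of already aligned entities described above are omitted for brevity.

```python
import numpy as np


def csls_matrix(src_mapped: np.ndarray, tgt: np.ndarray, k: int = 10) -> np.ndarray:
    """CSLS scores between mapped source rows and target rows."""
    s = src_mapped / np.linalg.norm(src_mapped, axis=1, keepdims=True)
    t = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    cos = s @ t.T
    k_t, k_s = min(k, cos.shape[1]), min(k, cos.shape[0])
    # Mean similarity of each source vector to its k nearest target vectors.
    r_src = np.sort(cos, axis=1)[:, -k_t:].mean(axis=1)
    # Mean similarity of each target vector to its k nearest source vectors.
    r_tgt = np.sort(cos, axis=0)[-k_s:, :].mean(axis=0)
    return 2 * cos - r_src[:, None] - r_tgt[None, :]


def mutual_nn_pairs(scores: np.ndarray, k: int = 1) -> list:
    """Pairs (i, j) with j in the top-k of row i and i in the top-k of column j."""
    top_for_src = np.argsort(-scores, axis=1)[:, :k]
    top_for_tgt = np.argsort(-scores, axis=0)[:k, :]
    pairs = []
    for i in range(scores.shape[0]):
        for j in top_for_src[i]:
            if i in top_for_tgt[:, j]:
                pairs.append((i, int(j)))
    return pairs


# Toy check: the target space is a shuffled copy of the (already mapped) source.
rng = np.random.default_rng(0)
src = rng.normal(size=(20, 32))
perm = rng.permutation(20)
tgt = np.empty_like(src)
tgt[perm] = src  # source unit i corresponds to target unit perm[i]

pairs = mutual_nn_pairs(csls_matrix(src, tgt, k=5), k=1)
```

Subtracting the two neighborhood averages penalizes hub vectors that are close to many candidates, which is why the measure helps in dense embedding spaces.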
After each iteration, the newly proposed alignment labels are inserted into $I(l_1, l_2)$ to enhance the learning at the next iteration. The iterative self-learning is stopped once the number of proposed entity alignments in an iteration falls below a certain quantity. With more and more matched entities and lexemes being exploited within each iteration, a better $\mathbf{M}$ is induced, whereas the lexical alignment naturally serves as incidental supervision signals for entity alignment.
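The overall self-learning loop can be sketched as below, under simplifying assumptions: all units are treated as one type with 1-to-1 matching, plain cosine similarity with a top-1 mutual nearest neighbor test stands in for CSLS, and the stopping threshold `min_new` and all function names are hypothetical. The closed-form orthogonal Procrustes step uses the SVD, as in Schönemann (1966) and Conneau et al. (2018).

```python
import numpy as np


def procrustes(src: np.ndarray, tgt: np.ndarray) -> np.ndarray:
    """Orthogonal M minimizing sum_i ||M src_i - tgt_i||^2 over aligned rows."""
    u, _, vt = np.linalg.svd(tgt.T @ src)
    return u @ vt


def cosine_scores(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def self_learning(src, tgt, seeds, min_new: int = 1, max_iter: int = 10):
    """Alternate between solving for M and proposing new mutual-NN pairs."""
    aligned = dict(seeds)
    m = np.eye(src.shape[1])
    for _ in range(max_iter):
        s_idx = list(aligned)
        t_idx = [aligned[i] for i in s_idx]
        m = procrustes(src[s_idx], tgt[t_idx])
        scores = cosine_scores(src @ m.T, tgt)
        best_t = scores.argmax(axis=1)  # nearest target for each source unit
        best_s = scores.argmax(axis=0)  # nearest source for each target unit
        taken = set(aligned.values())
        new = [(i, int(j)) for i, j in enumerate(best_t)
               if best_s[j] == i and i not in aligned and int(j) not in taken]
        if len(new) < min_new:  # stop when too few new pairs are proposed
            break
        aligned.update(new)
    return m, aligned


# Toy setting: the target space is an exact rotation of the source space.
rng = np.random.default_rng(0)
src = rng.normal(size=(30, 4))
q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # ground-truth orthogonal map
tgt = src @ q.T
m, aligned = self_learning(src, tgt, seeds={i: i for i in range(6)})
```

In this noise-free toy setting six seed pairs already determine the rotation, so the remaining pairs are recovered in the first iteration; on real embeddings the loop typically needs several rounds.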
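Table 1: Entity alignment results on DBP15k and WK3l60k. A dash (-) marks a missing value.

| Method | DBP15k En-Fr H@1 | DBP15k En-Fr H@10 | DBP15k En-Fr MRR | DBP15k En-Zh H@1 | DBP15k En-Zh H@10 | DBP15k En-Zh MRR | DBP15k En-Ja H@1 | DBP15k En-Ja H@10 | DBP15k En-Ja MRR | WK3l60k En-Fr H@1 | WK3l60k En-Fr H@5 | WK3l60k En-Fr MRR | WK3l60k En-De H@1 | WK3l60k En-De H@5 | WK3l60k En-De MRR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MTransE Chen et al. (2017a) | 0.224 | 0.556 | 0.335 | 0.308 | 0.614 | 0.364 | 0.279 | 0.575 | 0.349 | 0.140 | 0.203 | 0.177 | 0.034 | 0.101 | 0.072 |
| GCN-Align Wang et al. (2018) | 0.373 | 0.745 | 0.532 | 0.413 | 0.744 | 0.549 | 0.399 | 0.745 | 0.546 | 0.215 | 0.378 | 0.293 | 0.138 | 0.246 | 0.190 |
| AlignE Sun et al. (2018) | 0.481 | 0.824 | 0.599 | 0.472 | 0.792 | 0.581 | 0.448 | 0.789 | 0.563 | - | - | - | - | - | - |
| GCN-JE Wu et al. (2019b) | 0.483 | 0.778 | - | 0.459 | 0.729 | - | 0.466 | 0.746 | - | - | - | - | - | - | - |
| RotatE Sun et al. (2019c) | 0.345 | 0.738 | 0.476 | 0.485 | 0.788 | 0.589 | 0.442 | 0.761 | 0.550 | - | - | - | - | - | - |
| KECG Li et al. (2019a) | 0.486 | 0.851 | 0.610 | 0.478 | 0.835 | 0.598 | 0.490 | 0.844 | 0.610 | - | - | - | - | - | - |
| MuGCN Cao et al. (2019) | 0.495 | 0.870 | 0.621 | 0.494 | 0.844 | 0.611 | 0.501 | 0.857 | 0.621 | - | - | - | - | - | - |
| RSN Guo et al. (2019) | 0.516 | 0.768 | 0.605 | 0.508 | 0.745 | 0.591 | 0.507 | 0.737 | 0.590 | - | - | - | - | - | - |
| GMN Xu et al. (2019) | 0.596 | 0.876 | 0.679 | 0.433 | 0.681 | 0.479 | 0.465 | 0.728 | 0.580 | - | - | - | - | - | - |
| AliNet Sun et al. (2020) | 0.552 | 0.852 | 0.657 | 0.539 | 0.826 | 0.628 | 0.549 | 0.831 | 0.645 | - | - | - | - | - | - |
| JAPE Sun et al. (2017) | 0.324 | 0.667 | 0.430 | 0.412 | 0.745 | 0.490 | 0.363 | 0.685 | 0.476 | 0.169 | 0.354 | 0.271 | 0.147 | 0.239 | 0.192 |
| SEA Pei et al. (2019a) | 0.400 | 0.797 | 0.533 | 0.424 | 0.796 | 0.548 | 0.385 | 0.783 | 0.518 | - | - | - | - | - | - |
| HMAN Yang et al. (2019) | 0.543 | 0.867 | - | 0.537 | 0.834 | - | 0.565 | 0.866 | - | - | - | - | - | - | - |
| BootEA Sun et al. (2018) | 0.653 | 0.874 | 0.731 | 0.629 | 0.847 | 0.703 | 0.622 | 0.854 | 0.701 | 0.333 | 0.511 | 0.425 | 0.233 | 0.393 | 0.316 |
| KDCoE Chen et al. (2018) | - | - | - | - | - | - | - | - | - | 0.483 | 0.569 | 0.496 | 0.335 | 0.380 | 0.339 |
| MMR Shi and Xiao (2019) | 0.635 | 0.878 | - | 0.647 | 0.858 | - | 0.623 | 0.847 | - | - | - | - | - | - | - |
| NAEA Zhu et al. (2019) | 0.673 | 0.894 | 0.752 | 0.650 | 0.867 | 0.720 | 0.641 | 0.873 | 0.718 | - | - | - | - | - | - |
| OTEA Pei et al. (2019b) | - | - | - | - | - | - | - | - | - | 0.361 | 0.541 | 0.447 | 0.270 | 0.440 | 0.352 |
| JEANS-SFM w/ seed lexicon | 0.788 | 0.947 | 0.848 | 0.723 | 0.890 | 0.781 | 0.738 | 0.931 | 0.803 | 0.494 | 0.571 | 0.549 | 0.416 | 0.512 | 0.446 |
| JEANS-EDL w/ seed lexicon | 0.789 | 0.954 | 0.850 | 0.736 | 0.915 | 0.815 | 0.736 | 0.937 | 0.810 | 0.484 | 0.560 | 0.549 | 0.413 | 0.498 | 0.433 |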
After the alignment learning process, given a query to find the counterpart of an entity $e \in E_{l_1}$ from $E_{l_2}$, the answer is predicted as the 1-NN entity after applying $\mathbf{M}$ to transform $\mathbf{e}$, denoted $\mathbf{M}\mathbf{e}$. The inference phase by default also adopts CSLS as the distance measure, which is consistent with recent works Sun et al. (2019b, 2020).
In this section, we evaluate JEANS on two benchmark datasets for cross-lingual entity alignment, and compare against a wide selection of recent baseline methods. We also provide a detailed ablation study on the model components of JEANS.
4.1 Experimental Settings
Datasets. Experiments are conducted on DBP15k Sun et al. (2017) and WK3l60k Chen et al. (2018) that are widely used by recent studies on this task. DBP15k contains four language-specific KGs that are respectively extracted from English (En), Chinese (Zh), French (Fr) and Japanese (Ja) DBpedia Lehmann et al. (2015), each of which contains around 65k-106k entities. Three sets of 15k alignment labels are constructed to align entities between each of the other three languages and En. WK3l60k contains larger KGs with around 57k to 65k entities in En, Fr and German (De) KGs, and around 55k reference entity alignment respectively for En-Fr and En-De settings. Statistics of the datasets are given in Appendix A.2.
We also use the text of Wikipedia dumps in the five participating languages in training. For Chinese and Japanese corpora thereof, we obtain the segmented versions respectively from PKUSEG Luo et al. (2019) and MeCab Kudo (2006).
Baseline methods. We compare with a wide selection of recent approaches for entity alignment on multilingual KGs. The baseline methods include (i) those employing different structure embedding techniques, namely MTransE Chen et al. (2017a), GCN-Align Wang et al. (2018), AlignE Sun et al. (2018), GCN-JE Wu et al. (2019b), KECG Li et al. (2019a), MuGCN Cao et al. (2019), RotatE Sun et al. (2019c), RSN Guo et al. (2019) and AliNet Sun et al. (2020); (ii) methods that incorporate auxiliary information of entities, namely JAPE Sun et al. (2017), SEA Pei et al. (2019a), GMN Xu et al. (2019) and HMAN Yang et al. (2019); (iii) semi-supervised alignment learning methods, including BootEA Sun et al. (2018), KDCoE Chen et al. (2018), MMR Shi and Xiao (2019), NAEA Zhu et al. (2019) and OTEA Pei et al. (2019b). Descriptions of these methods are given in Appendix A.1. Note that a few studies incorporate machine translation Wu et al. (2019a, b); Yang et al. (2019) in training, or use pre-aligned word embeddings to delimit candidate spaces Xu et al. (2019). Results for such models are reported for the versions where external supervision is removed, so as to conduct a fair comparison with all the other models, which are trained from scratch using the same alignment labels in the experimental datasets.
Evaluation protocols. The use of the datasets is consistent with previous studies of the baseline methods. On each language pair in DBP15k, around 30% of seed alignment is used for training, and the rest for testing. On WK3l60k, 20% of seed alignment on the En-Fr and En-De settings is respectively used for training. Following the convention, we calculate several ranking metrics on test cases, including the accuracy $H@1$, the proportion of cases that are ranked no larger than $k$ ($H@k$), and the mean reciprocal rank $MRR$. Note that to align with the results in previous studies Sun et al. (2020); Pei et al. (2019b), $k$ is set to 10 on DBP15k and 5 on WK3l60k. For all metrics, higher values indicate better performance.
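For reference, the ranking metrics can be computed from the rank of the gold counterpart in each test query, as in the short sketch below. The function names and the example ranks are hypothetical and only illustrate the definitions.

```python
from typing import List, Sequence


def rank_of_gold(scores: Sequence[float], gold: int) -> int:
    """1-based rank of the gold candidate under descending similarity."""
    return 1 + sum(s > scores[gold] for s in scores)


def hits_at_k(ranks: List[int], k: int) -> float:
    """Proportion of test cases whose gold counterpart is ranked within k."""
    return sum(r <= k for r in ranks) / len(ranks)


def mean_reciprocal_rank(ranks: List[int]) -> float:
    """Average of 1 / rank over all test cases."""
    return sum(1.0 / r for r in ranks) / len(ranks)


ranks = [1, 3, 12, 1, 7]  # made-up ranks for five test queries
h1 = hits_at_k(ranks, 1)
h10 = hits_at_k(ranks, 10)
mrr = mean_reciprocal_rank(ranks)
```

Accuracy corresponds to the case k = 1, so it rewards only exact top-ranked matches, while the mean reciprocal rank also gives partial credit to near misses.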
Model Configurations. We use AMSGrad Reddi et al. (2018) to optimize the training losses of the embedding learning process, for which we set the learning rate to 0.001, the exponential decay rates $\beta_1$ and $\beta_2$ to 0.9 and 0.999, and batch sizes to 512 for both the KG loss $S_K^l$ and the text loss $S_T^l$. Trainable parameters are initialized using Xavier initialization Glorot and Bengio (2010). The embedding dimension $d$ is set to 300, which is often used for bilingual word embedding models trained on Wikipedia corpora Conneau et al. (2018); Gouws et al. (2015), considering that the vocabulary sizes and training data density here are relatively close to those models. The number of GCN layers $N$ is set to 2. We set the negative sample sizes of triples and text contexts to 5, the text context width to 10 and the bias $b$ in $S_K^l$ to 2. Specifically, we evaluate variants of JEANS by adjusting the following technical details: (i) For the grounding process, aside from the simple surface form matching (marked with SFM), we also explore the off-the-shelf Wikification-based EDL model from Upadhyay et al. (2018) (marked with EDL). A grounding performance estimation is given in Appendix A.3; (ii) We consider both CSLS and an alternative distance metric in learning and inference; (iii) We also consider the cases where we introduce an additional 5k seed lexicon provided by Conneau et al. (2018) for each language pair.
We report the entity alignment results in Table 1.
Considering the baseline results on DBP15k, we can see that the simplest variant of JEANS using SFM-based grounding consistently outperforms all baselines on the three cross-lingual settings. Particularly, it leads to 17.0-17.4% of absolute improvement in $H@1$ over the best structure-based baseline, 14.0-22.3% over the best entity profile based one, and 6.30-9.30% over the best semi-supervised one. This shows that JEANS preserves the key merit of a semi-supervised entity alignment method, while effectively enhancing the alignment of KGs by exploiting incidental supervision signals from unaligned text corpora. Considering different grounding techniques, we observe that SFM variants often perform closely to EDL ones. This indicates that simple SFM is enough to combine KG and text corpora for JEANS's embedding learning without EDL-related resources. Meanwhile, introducing the optional seed lexicon leads to an additional improvement of 1.2-2.2% in $H@1$. This shows that JEANS effectively enables the use of available supervision data on lexemes to further enhance entity alignment, although it is not obligatory. The results on WK3l60k generally exhibit similar observations. In comparison to KDCoE, which leverages strong but expensive supervision data of entity descriptions in co-training, JEANS with the 5k seed lexicon still offers better performance based on very accessible resources.
In general, the experiments here show that JEANS promisingly improves state-of-the-art performance for entity alignment, with only the need for unparalleled free text, and no need for additional labels.
4.3 Ablation Study
In Table 2 we report an ablation study for JEANS-SFM based on DBP15k, so as to understand the importance of each incorporated technique.
From the results, we observe that self-learning is the most important factor. Its removal can lead to a drop of 10.1-13.8% in $H@1$, as well as a drastic drop in other metrics. This also explains why semi-supervised baselines (group 3) typically perform better than others. However, even with self-learning, the removal of text can lead to a drop of 2.4% on En-Fr and 4.2% on En-Ja. This shows that the context information JEANS retrieves from free text effectively helps infer the match of entities. On the other hand, the structure encoding of KGs is more important than textual contexts, as removing KGs causes higher performance drops of 6.7-8.8% in $H@1$. Employing GCN leads to a relatively slight performance gain, as jointly learning the relation model and the language model can already capture entity proximity satisfactorily. Changing the distance metric away from CSLS also leads to a decrease of 3.6-6.9% in $H@1$. This shows that CSLS's ability to handle hubness and isolation is also important for similarity inference in the dense embedding space of words and entities. Hence, this metric is also recommended by recent work Sun et al. (2020); Zhang et al. (2019).
5 Conclusion
In this paper, we propose JEANS for entity alignment. Different from previous methods that leverage only the internal information of KGs, JEANS extends the learning to any text corpora that may contain the KG entities. For each language, a noisy grounding process first connects both data modalities, followed by an embedding learning process that couples GCN with relational modeling, and a self-learning-based alignment learning process. Without introducing additional labeled data, JEANS offers significantly improved performance over state-of-the-art models on benchmark datasets. Hence, it shows the effectiveness and feasibility of exploiting incidental supervision from free text for entity alignment.
For future work, aside from experimenting with other embedding learning techniques for KGs and text, we plan to extend JEANS to learn associations on KGs with different specificity Hao et al. (2019). We also seek to extend the representation scheme to hyperbolic spaces Nickel and Kiela (2017); Chen and Quirk (2019), aiming at better capturing associations for hierarchical ontologies.
References
- Word embeddings for entity-annotated texts. In ECIR.
- A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In ACL.
- Enriching word vectors with subword information. TACL 5.
- Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD.
- Linking and extending an open multilingual Wordnet. In ACL.
- Translating embeddings for modeling multi-relational data. In NIPS.
- Bridge text and knowledge by learning multi-prototype entity mention embedding. In ACL.
- Multi-channel graph neural network for entity alignment. In ACL, pp. 1452–1461.
- Incorporating structured commonsense knowledge in story completion. In AAAI.
- Embedding edge-attributed relational hierarchies. In SIGIR.
- Co-training embeddings of knowledge graphs and entity descriptions for cross-lingual entity alignment. In IJCAI.
- Multilingual knowledge graph embeddings for cross-lingual knowledge alignment. In IJCAI.
- Multi-graph affinity embeddings for multilingual knowledge graphs. In AKBC.
- Word translation without parallel data. In ICLR.
- BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL.
- Longest prefix matching using bloom filters. IEEE/ACM TON 14 (2), pp. 397–409.
- Retrofitting word vectors to semantic lexicons. In NAACL.
- Understanding the difficulty of training deep feedforward neural networks. In AISTATS.
- BilBOWA: fast bilingual distributed representations without word alignments. In ICML.
- Learning to exploit long-term relational dependencies in knowledge graphs. In ICML.
- Entity linking via joint encoding of types, descriptions, and context. In EMNLP.
- Universal representation learning of knowledge bases by jointly embedding instances and ontological concepts. In KDD.
- Incidental supervision from question-answering signals. arXiv:1909.00333.
- Space-efficient data structures for top-k completion. In WWW.
- Large-scale interactive ontology matching: algorithms and implementation. In ECAI.
- COGCOMPNLP: your swiss army knife for nlp. In LREC.
- Semi-supervised classification with graph convolutional networks. In ICLR.
- Mecab: yet another part-of-speech and morphological analyzer. http://mecab.sourceforge.jp.
- DBpedia–a large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web 6 (2), pp. 167–195.
- Semi-supervised entity alignment via joint knowledge embedding model and cross-graph model. In EMNLP-IJCNLP.
- Teaching pretrained models with commonsense reasoning: a preliminary kb-based approach. In NeurIPS.
- KagNet: knowledge-aware graph networks for commonsense reasoning. In EMNLP-IJCNLP.
- PKUSEG: a toolkit for multi-domain chinese word segmentation. CoRR abs/1906.11455.
- Yago3: a knowledge base from multilingual Wikipedias. In CIDR.
- The stanford corenlp natural language processing toolkit. In ACL.
- Efficient estimation of word representations in vector space. ICLR.
- Never-ending learning. Communications of the ACM.
- Machine translation using semantic web technologies: a survey. Journal of Web Semantics 51, pp. 1–19.
- Jointly embedding entities and text with distant supervision. In RepL4NLP.
- Holographic embeddings of knowledge graphs. In AAAI.
- Poincaré embeddings for learning hierarchical representations. In NIPS.
- Semi-supervised entity alignment via knowledge graph embedding with awareness of degree difference. In WWW.
- Improving cross-lingual entity alignment via optimal transport. In IJCAI.
- Glove: global vectors for word representation. In EMNLP.
- Deep contextualized word representations. In NAACL.
- How multilingual is multilingual bert? In ACL.
- Sparsity and noise: where knowledge graph embeddings fall short. In EMNLP.
- On the convergence of adam and beyond. In ICLR.
- Learning comment controversy prediction in web discussions using incidentally supervised multi-task cnns. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis.
- Incidental supervision: moving beyond supervised learning. In AAAI.
- A survey of cross-lingual word embedding models. JAIR.
- A generalized solution of the orthogonal procrustes problem. Psychometrika 31 (1), pp. 1–10.
- Cross-lingual alignment of contextual word embeddings, with applications to zero-shot dependency parsing. In NAACL.
- Retrofitting contextualized word embeddings with paraphrases. In EMNLP-IJCNLP.
- Modeling multi-mapping relations for precise cross-lingual entity alignment. In EMNLP-IJCNLP.
- Ontology matching: state of the art and future challenges. TKDE 25 (1), pp. 158–176.
- Multi-lingual entity discovery and linking. In ACL.
- Unsupervised sparse vector densification for short text similarity. In NAACL.
- ConceptNet 5.5: an open multilingual graph of general knowledge. In AAAI.
- Paris: probabilistic alignment of relations, instances, and schema. PVLDB 5 (3).
- DREAM: a challenge data set and models for dialogue-based reading comprehension. TACL 7, pp. 217–231.
- Cross-lingual entity alignment via joint attribute-preserving embedding. In ISWC.
- Bootstrapping entity alignment with knowledge graph embedding. In IJCAI.
- Knowledge graph alignment network with gated multi-hop neighborhood aggregation. In AAAI.
- TransEdge: translating relation-contextualized embeddings for knowledge graphs. In ISWC.
- Rotate: knowledge graph embedding by relational rotation in complex space. In ICLR.
- Representing text for joint embedding of text and knowledge bases. In EMNLP.
- Entity alignment between knowledge graphs using attribute embeddings. AAAI.
- Joint multilingual supervision for cross-lingual entity linking. In EMNLP.
- Graph attention networks. In ICLR.
- Knowledge graph embedding by translating on hyperplanes. In AAAI.
- Knowledge graph and text jointly embedding. In EMNLP.
- Cross-lingual knowledge graph alignment via graph convolutional networks. In EMNLP.
- Pidgin: ontology alignment using web text as interlingua. In CIKM.
- Relation-aware entity alignment for heterogeneous knowledge graphs. In IJCAI.
- Jointly learning entity and relation representations for entity alignment. In EMNLP-IJCNLP.
- Cross-lingual knowledge graph alignment via graph matching neural network. arXiv:1905.11605.
- Learning distributed representations of texts and entities from knowledge base. TACL 5, pp. 397–411.
- Embedding entities and relations for learning and inference in knowledge bases. ICLR.
- Aligning cross-lingual entities with multi-aspect information. In EMNLP-IJCNLP.
- COTSAE: co-training of structure and attribute embeddings for entity alignment. In AAAI.
- Machine-translated knowledge transfer for commonsense causal reasoning. In AAAI.
- Multi-view knowledge graph embedding for entity alignment. In IJCAI.
- Aligning knowledge and text embeddings by entity descriptions. In EMNLP.
- Iterative entity alignment via knowledge embeddings. In IJCAI.
- Neighborhood-aware attentional representation for multilingual knowledge graphs. In IJCAI.
- Bilingual word embeddings for phrase-based machine translation. In EMNLP.
Appendix A Appendices
A.1 Descriptions of Baseline Methods
We provide descriptions of the baseline methods. In accordance with Section 4.1, we separate the descriptions into three groups.
MTransE Chen et al. (2017a) represents a pioneering method on this topic. It jointly learns a translational embedding model Bordes et al. (2013) and an alignment model that captures the correspondence of counterpart entities via transformations or distances of the embedding representations. Based on the methodology of MTransE, GCN-Align Wang et al. (2018) substitutes the translational embedding model with GCN to better capture entities based on their neighborhood structures. MECG Li et al. (2019a) extends the framework of GCN-Align with a regularization term based on relational translation, aiming at differentiating the information of neighboring entities that play different roles in relations. MuGCN Cao et al. (2019) combines multiple channels of GCNs to better model the heterogeneous neighborhood information of entities in different KGs. For the same purpose, AliNet Sun et al. (2020) incorporates a gate mechanism in the neighborhood aggregation process of GAT. Both techniques offer satisfying performance in entity alignment without a transformation between KG-specific embedding spaces. Different from these neighborhood aggregation techniques, RSN Guo et al. (2019) focuses on capturing the long-term relational dependency of entities by incorporating a gated recurrent network with highway links, and offers comparable performance to MuGCN. Besides the above embedding learning techniques, single-graph KG embedding models have also been evaluated for entity alignment in recent studies Guo et al. (2019); Sun et al. (2020), by simply treating the match of entities as a type of relation in the KG. According to these studies, while RotatE Sun et al. (2019c) outperforms other single-graph embedding models, it is significantly outperformed by most of the aforementioned entity alignment methods.
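As a rough illustration of the translational idea that MTransE builds on, the sketch below scores a triple (h, r, t) by the distance between h + r and t, and retrieves an entity counterpart by nearest-neighbor distance in a shared space. The toy vectors, entity names and function names are hypothetical, and only the distance-based alignment variant is shown, so this should be read as a simplified sketch and not as any baseline's actual implementation.

```python
import math


def translational_score(h, r, t):
    """Distance between h + r and t; lower means a more plausible triple."""
    return math.sqrt(sum((a + b - c) ** 2 for a, b, c in zip(h, r, t)))


def euclidean(u, v):
    """L2 distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_counterpart(query, candidates):
    """Return the name of the candidate entity closest to the query vector.

    candidates maps entity names to embedding vectors of another KG that
    are assumed to live in the same space as the query.
    """
    return min(candidates, key=lambda name: euclidean(query, candidates[name]))


# Hypothetical toy embeddings for an English and a French KG.
english = {"Paris": [1.0, 0.0], "France": [1.0, 1.0]}
french = {"Paris_fr": [0.9, 0.1], "France_fr": [1.1, 0.9]}
capital_of = [0.0, 1.0]

plausible = translational_score(english["Paris"], capital_of, english["France"])
implausible = translational_score(english["France"], capital_of, english["Paris"])
match = nearest_counterpart(english["Paris"], french)
```

In this toy setting the correct triple receives a lower score than the reversed one, and the English entity is matched to its closest French candidate.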
Besides different embedding learning techniques, there are approaches that obtain additional supervision signals from profile information of entities that is available in some KBs. JAPE Sun et al. (2017) introduces an auxiliary measure of entity attributes, and uses it to strengthen the cross-lingual learning of MTransE. SEA Pei et al. (2019a) obtains similar auxiliary supervision signals based on centrality measures. HMAN Yang et al. (2019) is a GCN-based model that incorporates various modalities of entity information, including entity names, attributes, and literal descriptions that are also leveraged in KDCoE Chen et al. (2018).
Another line of research focuses on semi-supervised alignment learning to capture the entity alignment based on limited labels. BootEA Sun et al. (2018), MMR Shi and Xiao (2019) and NAEA Zhu et al. (2019) similarly conduct a self-learning approach that iteratively proposes alignment labels on unaligned entities. The main difference among these three models lies in the embedding learning techniques: BootEA is translational, MMR is RGCN-based, and NAEA is GAT-based. KDCoE adopts an iterative co-training process of MTransE with another self-attentive Siamese encoder of entity descriptions, in which both model components alternately propose alignment labels. Different from those iterative learning processes, OTEA Pei et al. (2019b) employs an optimal transport solution that is similar to the Procrustes solution Schönemann (1966) used in this work.
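The general shape of one self-learning round can be sketched as follows. The selection rule used here (mutual nearest neighbors among still-unaligned entities, accepted only below a distance threshold) is an illustrative assumption; BootEA, MMR and NAEA each define their own criteria for proposing labels, and the names `propose_alignments` and `max_dist` are hypothetical.

```python
import math


def euclidean(u, v):
    """L2 distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def propose_alignments(src_emb, tgt_emb, aligned, max_dist=0.5):
    """Propose new alignment labels for one self-learning round.

    src_emb, tgt_emb: dicts mapping entity names to embedding vectors.
    aligned: dict of already known source -> target alignments.
    Returns a dict of newly proposed source -> target pairs.
    """
    free_src = [e for e in src_emb if e not in aligned]
    used_tgt = set(aligned.values())
    free_tgt = [e for e in tgt_emb if e not in used_tgt]
    if not free_src or not free_tgt:
        return {}

    def dist(s, t):
        return euclidean(src_emb[s], tgt_emb[t])

    # Nearest unaligned target for each source, and vice versa.
    nearest_tgt = {s: min(free_tgt, key=lambda t: dist(s, t)) for s in free_src}
    nearest_src = {t: min(free_src, key=lambda s: dist(s, t)) for t in free_tgt}

    proposals = {}
    for s, t in nearest_tgt.items():
        # Keep only confident, mutually nearest pairs.
        if nearest_src[t] == s and dist(s, t) <= max_dist:
            proposals[s] = t
    return proposals
```

In an iterative setup, the proposed pairs would be added to the seed alignment, the embeddings retrained or fine-tuned, and the round repeated until no new pairs pass the threshold.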
A.2 Statistics of the Datasets
A.3 Grounding Performance Estimation
[Table 5: Grounding performance estimation. Columns: Estimation, then Coverage and Avg match for each of the two grounding techniques.]
Due to the lack of ground truths on unlabeled text, it is hard to estimate the precision of entity grounding by the two types of techniques. However, as the requirement of the grounding process is simply to connect two data modalities for training embeddings, we may favor a technique that handles enough entity mentions and offers a higher coverage of the entity vocabularies. The corresponding estimations of these two factors for the two techniques are reported in Table 5. We can observe that, without considering disambiguation, SFM overall covers higher proportions of the entity vocabularies, while pre-trained EDL generally discovers more entity mentions for each entity. However, both techniques are sufficient to support the noisy grounding process and to combine the two data modalities for embedding learning and alignment induction.
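The two factors discussed above could be estimated from the output of a grounding step roughly as follows. The exact definitions are assumptions made for illustration: coverage is taken as the fraction of the entity vocabulary that receives at least one grounded mention, and the average match as the mean number of mentions per covered entity. The function name and input format are hypothetical.

```python
from collections import Counter


def grounding_stats(entity_vocab, grounded_mentions):
    """Estimate coverage and average matches per entity.

    entity_vocab: set of entity names in the KG.
    grounded_mentions: iterable with one entity name per grounded mention
        found in the text corpus (mentions outside the vocabulary are ignored).
    Returns (coverage, avg_match).
    """
    if not entity_vocab:
        return 0.0, 0.0
    counts = Counter(e for e in grounded_mentions if e in entity_vocab)
    coverage = len(counts) / len(entity_vocab)
    avg_match = sum(counts.values()) / len(counts) if counts else 0.0
    return coverage, avg_match
```

For example, with a vocabulary of four entities and mentions grounded to two of them (three mentions in total), the function reports a coverage of 0.5 and an average match of 1.5.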