Jointly Learning Knowledge Embedding and Neighborhood Consensus with Relational Knowledge Distillation for Entity Alignment

by   Xinhang Li, et al.
Tsinghua University

Entity alignment aims at integrating heterogeneous knowledge from different knowledge graphs (KGs). Recent studies employ embedding-based methods that first learn the representations of KGs and then perform entity alignment by measuring the similarity between entity embeddings. However, they fail to make good use of relation semantic information due to the trade-off caused by the different objectives of learning knowledge embedding and neighborhood consensus. To address this problem, we propose Relational Knowledge Distillation for Entity Alignment (RKDEA), a Graph Convolutional Network (GCN) based model equipped with knowledge distillation for entity alignment. We adopt GCN-based models to learn entity representations by considering the graph structure, and incorporate relation semantic information into the GCN via knowledge distillation. Then, we introduce a novel adaptive mechanism to transfer relational knowledge so as to jointly learn entity embedding and neighborhood consensus. Experimental results on several benchmark datasets demonstrate the effectiveness of our proposed model.






1 Introduction

Knowledge Graphs (KGs) provide structured knowledge in the simple and clear triple format <head, relation, tail>. They are essential in supporting many natural language processing applications. Since KGs are constructed separately from heterogeneous resources and languages, they might use different expressions to indicate the same entity. As a result, different KGs often contain complementary contents and cross-lingual links. It is essential to integrate heterogeneous KGs into a unified one, thus increasing the accuracy and robustness of knowledge-driven applications.

Figure 1: An error-prone example of two different entities with the same neighbors.

To this end, much effort has been devoted to the problem of Entity Alignment, which aims at linking entities with the same identities from different KGs. Earlier approaches for entity alignment usually rely on manually created features [24], which is labor-intensive and time-consuming. Recent studies have focused on embedding-based approaches, as they are capable of representing and preserving the structures of KGs in low-dimensional embedding spaces. Generally speaking, there are two categories of approaches: translation-based and GNN-based. The translation-based models [34, 60, 7, 35] extend the idea of trans-family models for knowledge graph embedding, e.g. TransE, to learn the embeddings of entities and relations in KGs. This kind of method is good at learning knowledge embedding but unsatisfactory for entity alignment on sparse graphs. Recently, GNN-based models [44, 32] have employed the Graph Convolutional Network (GCN) [16] to make better use of the pre-aligned seeds, learning entity embeddings from neighbor information so as to resolve such limitations, and have achieved promising results. Some recent works [18] further jointly learn relational knowledge and neighborhood consensus [30] to obtain more robust and accurate predictions. However, since the learning objectives of relational knowledge and neighborhood consensus are different, they lead to different optimization directions. As a result, the model may fail to learn useful information due to overfitting. For example, as Figure 1 shows, existing models tend to wrongly align New York State in English with New York City in Chinese due to the strong hint of neighborhood consensus given by the shared neighbors.
However, in such a situation, the difference in relation semantics between adjoin and locatedIn is more crucial for distinguishing these two different entities, and it cannot be treated properly without balancing the two learning objectives.

To address this problem, we propose Relational Knowledge Distillation for Entity Alignment (RKDEA), a GCN based model with a relational knowledge distillation framework for entity alignment. Following previous studies [46, 47, 48, 26], we use a Highway Gated GCN model to learn the entity embeddings. To decide the proportion of relational knowledge embedding and neighborhood consensus in the training objective, we take advantage of the knowledge distillation [12] mechanism. More specifically, we first separately train two models with the objectives of learning relational knowledge and neighborhood consensus. Next, we take the model with the relational knowledge objective as the teacher and the model with the neighborhood consensus objective as the student. Then we employ a relational distillation method to transfer relation information from the teacher model to the student. To effectively control the overall training objective, we propose an adaptive temperature mechanism, instead of treating the temperature as a static hyper-parameter as previous studies [12, 31, 54, 27] did, to adjust the weight of the two kinds of information. We conduct extensive evaluations on several publicly available datasets that are widely used in previous studies. The experimental results and further analysis demonstrate that RKDEA can better integrate knowledge embedding and neighborhood consensus and thus outperforms state-of-the-art methods by a clear margin.

2 Related Work

2.1 Knowledge Graph Entity Alignment

Knowledge Graphs have a wide scope of application scenarios, such as similarity search [58, 41, 45, 23], information extraction [40] and record de-duplication [19], and can help analyze different kinds of data such as health [59], spatial [57, 21, 50] and text [39] data. To automatically capture deep resemblance of graph structure information between heterogeneous KGs, recent studies have focused on embedding-based approaches. Based on the methodology, they can be categorized into two types: translation-based and GNN-based ones.

Translation-based Approaches   For the representation learning of a single KG, there have been many studies, such as TransE [3], TransH [42], TransR [20] and TransD [14]. These methods utilize a scoring function to model relational knowledge and thereby obtain entity and relation embeddings. Translation-based alignment approaches build on such studies. MTransE [7] applies TransE to the entity alignment task with various transition techniques between different KGs. JAPE [34] presents a way to combine structure and attribute information to jointly embed entities into a unified vector space. This kind of method is capable of capturing complex relation semantics with the help of triple-level modeling. However, it is difficult for them to perceive the structural similarity of neighborhood information.

GNN-based Approaches   Recently, the Graph Neural Network (GNN) has achieved tremendous success in applications related to network embedding. GNN-based entity alignment methods incorporate neighborhood information with GNNs to provide global structural information. GCN-Align [44] directly applies the Graph Convolutional Network (GCN) as the embedding module for entity alignment. MuGNN [5] proposes self and cross-KG attention mechanisms to better capture the structure information in the KGs. RDGCN [46] leverages relations to improve entity alignment with dual graphs. GMNN [49] and NMN [48] incorporate long-distance neighborhood information to strengthen the entity embeddings. SSP [26] jointly models KG global structure and local semantic information via flexible relation representation. KECG [18] trains the model with knowledge embedding and neighborhood consensus objectives alternately.

Nevertheless, all these methods fail to address the problem of balancing the different learning objectives of relational knowledge and neighborhood consensus. Compared with them, our approach jointly learns knowledge embedding and neighborhood consensus in a more structured, fine-grained way via knowledge distillation.

2.2 Knowledge Distillation

Knowledge Distillation (KD) is a branch of transfer learning in which knowledge is transferred from a complex model (teacher) to a concise model (student). Typically, KD aims at transferring a mapping from inputs to outputs learned by the teacher model to the student model. By leveraging KD, the student model can learn implicit knowledge through an extra objective over the teacher's outputs and thus gain better performance. KD was first introduced to neural networks by [12]. [31] employs an additional linear transformation in the middle of the network to obtain a narrower student. [54, 13, 38] transfer the knowledge in attention maps to obtain more robust and comprehensive representations. Recently, some works [53, 2, 9] have demonstrated that distilling models of identical architecture, i.e., self-distillation, can further improve the performance of neural networks.

The challenge for graph representation learning lies in the heterogeneous nature of different graphs. Some recent approaches applying KD have brought convincing results in solving the heterogeneity problem. [17] proposes a novel graph data transfer learning framework with a generalized Spectral CNN. [8] transfers similarities of different structures for metric learning. [27] demonstrates the effectiveness of distance-wise and angle-wise distillation losses in knowledge transfer between different structures. The objective of our work is similar to the above studies, but many issues remain to be addressed to design a reasonable distillation mechanism for the task of entity alignment between KGs.

3 Preliminary

Formally, a knowledge graph is defined as G = (E, R, A, T_R, T_A), where E, R and A indicate the sets of entities, relations and attributes, respectively, and T_R and T_A denote the sets of relation triples and attribute triples, respectively. In this paper, we focus on relation information irrespective of attributes, so the KG can be simplified to G = (E, R, T_R), where T_R ⊆ E × R × E is the set of relation triples.

Given two heterogeneous KGs G_1 = (E_1, R_1, T_1) and G_2 = (E_2, R_2, T_2), Entity Alignment aims at finding entity pairs (e_1, e_2), e_1 ∈ E_1 and e_2 ∈ E_2, that represent the same meaning semantically. In practice, some pre-aligned entity and relation pairs are provided as seed alignments: S = {(e_1, e_2) | e_1 ∈ E_1, e_2 ∈ E_2, e_1 ≡ e_2} and S_R = {(r_1, r_2) | r_1 ∈ R_1, r_2 ∈ R_2, r_1 ≡ r_2}, where ≡ means equivalence in semantics, denote the semantically equivalent pairs.

Figure 2: Overall architecture: Relational Knowledge Distillation for Entity Alignment. The upper part is the knowledge embedding teacher model while the lower part is the neighborhood consensus student model. They share a similar GCN structure but have different training objectives. The relational knowledge of embeddings is transferred from teacher to student via relational knowledge distillation.

4 Methodology

In order to better utilize relational knowledge and neighborhood information, we propose a knowledge distillation based framework to consider knowledge embedding and neighborhood consensus simultaneously.

As shown in Figure 2, our framework consists of three components:

  • A pre-trained two-layer GCN with highway gates as the teacher model to provide relational knowledge, whose objective function is similar to TransE;

  • A two-layer GCN with highway gates as the student model to learn the local graph structure by neighborhood consensus with seed alignments;

  • A knowledge distillation mechanism to transfer relational knowledge from teacher model to student model, specifically an objective of minimizing distance-wise distillation loss.

4.1 Highway Gated GCN

For both the teacher and student models, we utilize a GCN [16] based model to learn the representations of entities and relations. Specifically, we use the highway gated GCN, which can capture long-distance neighborhood information by stacking multiple GCN layers, as the basic building block of our model. The input of the highway gated GCN is an entity feature matrix X ∈ R^(n×d), where n is the number of entities and d is the entity feature dimension. For each GCN layer, the forward propagation is calculated as Equation (1):

H^(l+1) = σ(D̂^(−1/2) Â D̂^(−1/2) H^(l) W^(l))   (1)

where H^(l) is the hidden state of the l-th GCN layer and H^(0) = X; σ is an activation function chosen as ReLU; Â = A + I is the adjacency matrix derived from the connectivity matrix A of the graph and an identity matrix I for self-connections; D̂ denotes the diagonal node degree matrix of Â; W^(l) and d^(l) denote the weights and the dimension of features in layer l, respectively.
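To make the propagation rule concrete, the step in Equation (1) can be sketched in a few lines of NumPy (a toy illustration with made-up adjacency, features and weights, not the authors' implementation):

```python
import numpy as np

def gcn_layer(H, A, W):
    """One GCN propagation step: ReLU(D_hat^-1/2 (A + I) D_hat^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])                 # add self-connections
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1)) # diagonal of D_hat^{-1/2}
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(0.0, A_norm @ H @ W)        # ReLU activation

# toy graph with 3 entities: edges 0-1 and 1-2
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H = np.eye(3)               # one-hot input features (n = d = 3)
W = 0.5 * np.ones((3, 2))   # toy weight matrix mapping d=3 to d'=2
H_next = gcn_layer(H, A, W)
```

The broadcasting expression for `A_norm` is equivalent to the matrix product D̂^(−1/2) Â D̂^(−1/2); stacking several such layers (with highway gates, as described next) yields the full building block.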

Following [33], we utilize layer-wise highway gates in the forward propagation. With the help of stacked GCN layers, rich neighborhood knowledge indicating graph structure information can be captured when learning entity embeddings. The detailed calculation is as Equations (2) and (3):

T(H^(l)) = σ(H^(l) W_T^(l) + b_T^(l))   (2)
H^(l+1) = T(H^(l)) ⊙ H̃^(l+1) + (1 − T(H^(l))) ⊙ H^(l)   (3)

where σ is a sigmoid function; W_T^(l) and b_T^(l) are the weight matrix and bias vector of the transform gate T(H^(l)), respectively; ⊙ denotes element-wise multiplication; H̃^(l+1) is the GCN output of Equation (1); and 1 − T(H^(l)) represents the carry gate for the vanilla input of each layer, opposite to the transform gate for the transformed input.
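A highway-gated layer can be sketched as follows (the zero gate weights are toy values chosen so the gate mixes input and output equally; this is an assumption for illustration, not the trained parameters):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def highway_gate(H_in, H_out, W_T, b_T):
    """Layer-wise highway gate: the transform gate T weighs the GCN output,
    while the carry gate (1 - T) preserves the layer's vanilla input."""
    T = sigmoid(H_in @ W_T + b_T)          # transform gate
    return T * H_out + (1.0 - T) * H_in    # element-wise mixing

H_in = np.array([[0.2, -0.1],
                 [0.5,  0.3]])             # vanilla input of the layer
H_out = np.array([[1.0, 0.0],
                  [0.0, 1.0]])             # GCN-transformed output
W_T = np.zeros((2, 2))                     # zero gate weights -> T = 0.5
b_T = np.zeros(2)
H_next = highway_gate(H_in, H_out, W_T, b_T)  # average of input and output
```

With zero gate parameters the sigmoid evaluates to 0.5 everywhere, so each entry of `H_next` is the mean of the corresponding input and output entries.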

4.2 Knowledge Embedding Model

As shown in Figure 2, the pre-trained teacher model aims at learning the knowledge embedding. In this paper, we choose the objective function of TransE [3] as an example. Note that it could also be replaced with other translation-based methods. The relation triple is denoted as a translation equation h + r ≈ t, where h, r and t represent the head entity, relation and tail entity, respectively. For each triple (h, r, t), we take the normalized distance f(h, r, t) = ||h + r − t||_2 as the scoring function. Following previous studies, we apply negative sampling to generate unreal negative triples in the pre-training process. The objective function of the knowledge embedding teacher model is shown as Equation (4):

L_KE = Σ_{(h,r,t) ∈ T} Σ_{(h',r',t') ∈ T'} [f(h, r, t) + γ − f(h', r', t')]_+   (4)

where h, r and t are the embedding representations of entities and relations, T denotes the aggregation of triples in the two KGs, T' represents the negative sampled triple set derived from T, and γ is a margin hyper-parameter with positive value. For the sake of preserving semantics, we construct negative samples by randomly replacing the head or tail entity of an existing triple with another entity of similar semantics.
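Under these definitions, the TransE scoring function and the margin-based objective of Equation (4) can be sketched as follows (the 2-dimensional embeddings are invented toy values for illustration):

```python
import numpy as np

def transe_score(h, r, t):
    """f(h, r, t) = ||h + r - t||_2: small for plausible triples."""
    return np.linalg.norm(h + r - t)

def ke_loss(pos_triples, neg_triples, gamma):
    """Margin-based ranking loss over paired positive/negative triples."""
    loss = 0.0
    for (h, r, t), (h2, r2, t2) in zip(pos_triples, neg_triples):
        loss += max(0.0, transe_score(h, r, t) + gamma - transe_score(h2, r2, t2))
    return loss

h = np.array([0.0, 0.0])
r = np.array([1.0, 0.0])
t = np.array([1.0, 0.0])      # h + r == t, so the positive score is 0
t_neg = np.array([0.0, 0.0])  # corrupted tail, score 1
loss = ke_loss([(h, r, t)], [(h, r, t_neg)], gamma=1.0)  # -> 0.0
```

Here the negative triple is already pushed a full margin away from the positive one, so the hinge term vanishes.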

4.3 Neighborhood Consensus Model

The neighborhood consensus student model has a similar structure to the knowledge embedding teacher model; the only difference is the learning objective. While the teacher model learns relational knowledge from triples, the student model learns local graph structure information from neighbors. In order to calculate the neighborhood similarities between entities from different KGs, we utilize an energy function based on the distance of neighborhood-aggregated entity embeddings. Specifically, given an entity pair (e_1, e_2), the similarity measure is denoted as d(e_1, e_2) = ||h_{e_1} − h_{e_2}||_2. The learning objective of the neighborhood consensus student model is to minimize the margin-based ranking loss in Equation (5):

L_A = Σ_{(e_1,e_2) ∈ S} Σ_{(e_1',e_2') ∈ S'} [d(e_1, e_2) + γ_A − d(e_1', e_2')]_+   (5)

where [·]_+ denotes the positive part of an element, γ_A is a margin hyper-parameter, and S and S' represent the sets of positive and negative entity pairs, respectively. For negative sampling, we choose the nearest neighbors as the negative corresponding entities rather than random sampling. Specifically, given an existing pair (e_1, e_2), we replace e_1 (e_2) with the entity e_1' (e_2') that is closest to e_1 (e_2) in embedding distance.
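The alignment loss of Equation (5) and the nearest-neighbor negative sampling described above can be sketched as follows (the embeddings are toy values; in the model they would come from the highway gated GCN):

```python
import numpy as np

def nearest_negative(e_emb, candidate_embs):
    """Hard negative sampling: index of the candidate closest in L2 distance."""
    dists = np.linalg.norm(candidate_embs - e_emb, axis=1)
    return int(np.argmin(dists))

def align_loss(pos_pairs, neg_pairs, gamma):
    """Margin-based ranking loss over seed entity pairs."""
    loss = 0.0
    for (e1, e2), (n1, n2) in zip(pos_pairs, neg_pairs):
        d_pos = np.linalg.norm(e1 - e2)   # aligned pair should be close
        d_neg = np.linalg.norm(n1 - n2)   # negative pair should be far
        loss += max(0.0, d_pos + gamma - d_neg)
    return loss

e = np.array([0.0, 0.0])
candidates = np.array([[3.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
idx = nearest_negative(e, candidates)     # picks the nearest candidate
```

Choosing the nearest candidate rather than a random one yields harder negatives, which is exactly the sampling strategy stated above.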

4.4 Relational Knowledge Distillation

To integrate relational knowledge and neighborhood information via knowledge distillation, we need to address two issues: (i) how to learn the structural information along with contents via the distillation approach; (ii) how to design a learning objective that minimizes the difference between the teacher and student models. In particular, we need to dynamically adjust the contributions of knowledge embedding and neighborhood consensus during distillation.

To keep the relational knowledge in the process of distillation, we borrow the idea of the energy function proposed in [27]. The basic idea is to first randomly sample n instances from the training instances; an energy function ψ is then applied to these instances to describe the relationship between them. The loss is calculated as Equation (6):

L_RKD = Σ_{(x_1,…,x_n)} ℓ(ψ(t_1,…,t_n), ψ(s_1,…,s_n))   (6)

where x_1,…,x_n are randomly sampled training instances; ℓ is the distance measure of potential relational knowledge between the teacher and student models; and t_i and s_i are the output representations of input x_i in the teacher and student models, respectively. With such a training loss, the relational knowledge can be kept in the process of distillation.

Following this formulation, we utilize the same L2 distance measure as the teacher model between head and tail entities in triples as the energy function to keep the potential relational information, as shown in Equation (7):

ψ(h, t) = (1/μ) ||h − t||_2   (7)

where μ is a normalization factor to scale the distances of different vector spaces to the same scale. We empirically define μ as Equation (8):

μ = (1/|T|) Σ_{(h,r,t) ∈ T} ||h − t||_2   (8)

where T represents all triples in the two KGs and |T| is the number of triples in T.

In order to improve robustness to outliers, we propose to utilize the Huber loss [15] rather than the MSE loss as the difference measure between the teacher and the student, which is shown as Equation (9):

ℓ_δ(x, y) = ½(x − y)²  if |x − y| ≤ 1;  |x − y| − ½  otherwise   (9)
Therefore, the objective of knowledge distillation is specified as Equation (10):

L_KD = Σ_{(h,r,t) ∈ T} ℓ_δ(ψ(h_T, t_T), ψ(h_S, t_S))   (10)

where the subscripts T and S denote the embedding representations of the teacher and student models, respectively.
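Putting Equations (7)-(10) together, the distillation loss over teacher and student embeddings can be sketched as follows (each model's normalization factor is computed as in Equation (8) over its own space; the toy embeddings are invented):

```python
import numpy as np

def huber(x, y):
    """Huber loss: quadratic near zero, linear in the tails (Equation (9))."""
    d = abs(x - y)
    return 0.5 * d * d if d <= 1.0 else d - 0.5

def rkd_loss(teacher_pairs, student_pairs):
    """Distance-wise relational distillation over (head, tail) embedding pairs.
    Each model's head-tail distances are scaled by its own mean distance
    so that vector spaces of different scales become comparable."""
    mu_t = np.mean([np.linalg.norm(h - t) for h, t in teacher_pairs])
    mu_s = np.mean([np.linalg.norm(h - t) for h, t in student_pairs])
    return sum(huber(np.linalg.norm(ht - tt) / mu_t,
                     np.linalg.norm(hs - ts) / mu_s)
               for (ht, tt), (hs, ts) in zip(teacher_pairs, student_pairs))

# student space is a uniformly scaled copy of the teacher space, so the
# normalized relational structure matches and the loss is zero
teacher = [(np.array([0.0, 0.0]), np.array([1.0, 0.0])),
           (np.array([0.0, 0.0]), np.array([3.0, 0.0]))]
student = [(2 * h, 2 * t) for h, t in teacher]
loss = rkd_loss(teacher, student)  # -> 0.0
```

The zero loss for a uniformly rescaled student illustrates the point of the normalization in Equation (8): the distillation penalizes differences in relational structure, not differences in absolute scale.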

Next, we introduce how to dynamically adjust the contributions of the alignment loss (for neighborhood consensus) and the knowledge distillation loss (for knowledge embedding) in the learning objective of the student model. The balance is controlled with the hyper-parameter Temperature, denoted as τ. As mentioned above, these two kinds of information may be adversarial due to different optimization directions. To address this problem, τ should be dynamically adjusted during the training process. Intuitively, in the early stage of training the model should focus on learning the relational knowledge, where L_KD is more important; while in the late stage, when the two losses become very small, it should concentrate on the alignment loss to avoid overfitting to relational knowledge. Therefore, instead of using a static value of τ, we set an adaptive value of τ as shown in Equation (11):

τ = L̃_KD / (L̃_KD + L̃_A)   (11)

where L̃ denotes the loss value without gradient and τ ∈ (0, 1).

Therefore, the final learning objective of our model is shown as Equation (12):

L = L_A + τ · L_KD   (12)

where L_A denotes the neighborhood consensus alignment loss as Equation (5) shows.
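The adaptive weighting and the final objective can be sketched as below. The ratio form of the temperature here is one plausible reading of the adaptive mechanism described in the text, and in a real implementation the loss values fed to the temperature would be detached from the gradient:

```python
def adaptive_temperature(loss_kd, loss_align, eps=1e-12):
    """Weight of the distillation term, computed from (detached) loss values:
    close to 1 while the distillation loss dominates, shrinking as it is learned."""
    return loss_kd / (loss_kd + loss_align + eps)

def total_loss(loss_align, loss_kd):
    """Final objective: alignment loss plus temperature-weighted KD loss."""
    tau = adaptive_temperature(loss_kd, loss_align)
    return loss_align + tau * loss_kd

# early training: the KD loss dominates, so tau is close to 1
early = total_loss(loss_align=1.0, loss_kd=9.0)
# late training: the KD loss has shrunk, so the alignment loss dominates
late = total_loss(loss_align=0.5, loss_kd=0.05)
```

This reproduces the behavior described above: the distilled relational knowledge steers early training, while its contribution decays automatically once the distillation loss becomes small.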

5 Experiment

5.1 Datasets

Following many previous studies, we evaluate all methods on the popular DBP15K and DWY100K datasets.

  • DBP15K [34] is composed of three cross-lingual datasets derived from DBpedia, representing three language pairs of KGs: DBP_ZH-EN (Chinese to English), DBP_JA-EN (Japanese to English), and DBP_FR-EN (French to English). Each dataset consists of two KGs with hundreds of thousands of relation triples and 15K pre-aligned seed entity pairs along with seed relation pairs.

  • DWY100K [35] contains two large-scale cross-domain datasets derived from DBpedia, Wikidata and YAGO3, denoted as DWY_WD (DBpedia to Wikidata) and DWY_YG (DBpedia to YAGO3). Similar to DBP15K, DWY100K contains 100K seed entity alignments in each dataset.

The detailed statistics are shown in Table 1. In all experiments, we utilize 30% of seed alignments in training, which is consistent with previous studies.

Datasets #Ent. #Rel. #Rel. Triples
DBP_ZH-EN ZH 66,469 2,830 153,929
EN 98,125 2,317 237,674
DBP_JA-EN JA 65,744 2,043 164,373
EN 95,680 2,096 233,319
DBP_FR-EN FR 66,858 1,379 192,191
EN 105,889 2,209 278,590
DWY_WD DBpedia 100,000 330 463,294
Wikidata 100,000 220 448,774
DWY_YG DBpedia 100,000 302 428,952
YAGO3 100,000 31 502,563
Table 1: Statistics of the datasets.

5.2 Baselines

To better verify the effectiveness of our proposed approach, we compare it with several state-of-the-art embedding-based models. For knowledge embedding oriented methods, we choose MTransE [7], IPTransE [60], SEA [28] and RSN4EA [10] as the representatives, which align entities based on relational knowledge. For neighborhood consensus oriented methods, we choose GCN-Align [44], MuGNN [5], KECG [18] and AliNet [36], which apply graph neural networks to aggregate neighborhood information for alignment. Among them, KECG explicitly models relational knowledge and neighborhood information with two learning objectives as our RKDEA does, making it the most closely related to our proposed approach.

Since our work focuses on transferring relational knowledge rather than developing a better model for knowledge graph entity alignment, we only utilize structure information for the baselines for a fair comparison, following the comprehensive survey [55]. Although there are other studies on this topic [35, 29, 25, 51, 37, 22], they mainly focus on utilizing other information, such as attributes and descriptions, or on applying data enhancement, e.g. bootstrapping strategies and machine translation. Such approaches are orthogonal to our work and could also be enhanced on the basis of RKDEA. Therefore, we exclude the comparison with them here.

For the ablation study, we design two variants of our RKDEA: RKDEA (w/o RKD), which does not employ relational knowledge distillation, and RKDEA (w/o Temp.), which does not incorporate the adaptive temperature factor to control the training process.

5.3 Implementation Details

In the experiments, we choose the hyper-parameters by grid search as follows: the learning rate is selected from {0.0001, 0.0005, 0.001, 0.005, 0.01}, and the margin hyper-parameters γ and γ_A are selected from {1.0, 2.0, 3.0} for the knowledge embedding teacher model and the neighborhood consensus student model, respectively. Following previous studies [46, 47, 48], the dimensions of the embedding vectors in both the teacher and student models are set to 300, and we use pre-trained GloVe embeddings of entity names as input features. For DBP15K, the negative samples are updated every 50 epochs; for DWY100K, they are updated every 10 epochs. Following previous work, we use Hits@1, Hits@10 and MRR as the evaluation metrics.

5.4 Results

Models DBP_ZH-EN DBP_JA-EN DBP_FR-EN
Hits@1 Hits@10 MRR Hits@1 Hits@10 MRR Hits@1 Hits@10 MRR
MTransE 0.308 0.614 0.364 0.279 0.575 0.349 0.244 0.556 0.335
IPTransE 0.406 0.735 0.516 0.367 0.693 0.474 0.333 0.685 0.451
SEA 0.424 0.796 0.548 0.385 0.783 0.518 0.400 0.797 0.533
RSN4EA 0.508 0.745 0.591 0.507 0.737 0.590 0.516 0.768 0.605
GCN-Align 0.413 0.744 0.549 0.399 0.745 0.546 0.373 0.745 0.532
MuGNN 0.494 0.844 0.611 0.501 0.857 0.621 0.495 0.870 0.621
KECG 0.478 0.835 0.598 0.490 0.844 0.610 0.486 0.851 0.610
AliNet 0.539 0.826 0.628 0.549 0.831 0.645 0.552 0.852 0.657
RKDEA (w/o RKD) 0.438 0.802 0.564 0.462 0.811 0.574 0.446 0.822 0.583
RKDEA (w/o Temp.) 0.573 0.857 0.677 0.576 0.873 0.681 0.564 0.862 0.673
RKDEA 0.603 0.872 0.703 0.597 0.881 0.698 0.622 0.912 0.721
Table 2: Performance Comparison on DBP15K datasets. The results are split into three parts by full lines. The upper part includes knowledge embedding methods; the middle part includes neighborhood consensus methods; while the bottom part includes three variants of our approach for the purpose of ablation study.
Models DWY_WD DWY_YG
Hits@1 Hits@10 MRR Hits@1 Hits@10 MRR
MTransE 0.281 0.520 0.363 0.252 0.493 0.334
IPTransE 0.349 0.638 0.447 0.297 0.558 0.386
SEA 0.518 0.802 0.616 0.516 0.736 0.592
RSN4EA 0.607 0.793 0.673 0.689 0.878 0.756
GCN-Align 0.506 0.772 0.600 0.597 0.838 0.682
MuGNN 0.616 0.897 0.714 0.741 0.937 0.810
KECG 0.632 0.900 0.728 0.728 0.915 0.798
AliNet 0.690 0.908 0.766 0.786 0.943 0.841
RKDEA (w/o RKD) 0.577 0.848 0.659 0.671 0.889 0.751
RKDEA (w/o Temp.) 0.703 0.921 0.773 0.818 0.961 0.870
RKDEA 0.756 0.973 0.821 0.823 0.971 0.879
Table 3: Performance Comparison on DWY100K datasets. The results are split into three parts by full lines. The upper part includes knowledge embedding methods; the middle part includes neighborhood consensus methods; while the bottom part includes three variants of our approach for the purpose of ablation study.

Tables 2 and 3 show the experimental results on DBP15K and DWY100K, respectively. It can be seen that RKDEA achieves promising performance on both cross-lingual and cross-domain datasets, indicating the effectiveness of our proposed framework. Moreover, RKDEA achieves significant improvement over the compared baseline methods on the DBP15K datasets. The reason is that those baselines fail to incorporate complex relational knowledge due to the sparsity of DBP15K, while our RKDEA is capable of exploiting fine-grained relational knowledge. Although KECG and HyperKA also explicitly learn knowledge embedding and neighborhood consensus, they fail to provide an effective way to integrate the two different objectives. Meanwhile, with the help of knowledge distillation, RKDEA can effectively and flexibly incorporate relational knowledge into the neighborhood consensus model and thus achieves much better performance.

On the large-scale DWY100K datasets, RKDEA also significantly outperforms all other methods. Since DWY100K is much larger than DBP15K, with fewer relations and more similar graph structures, neighborhood consensus plays a more important role, and the performance gain from knowledge distillation is smaller than on DBP15K. Even so, RKDEA still reports reasonable results due to the properly designed techniques.

5.5 Effectiveness of Knowledge Distillation

Models DBP_ZH-EN DBP_JA-EN DBP_FR-EN
Hits@1 Hits@10 MRR Hits@1 Hits@10 MRR Hits@1 Hits@10 MRR
KECG (w/o KE) 0.430 0.791 0.551 0.446 0.807 0.567 0.432 0.815 0.559
KECG (w/ Init.) 0.481 0.823 0.589 0.473 0.823 0.583 0.461 0.826 0.574
KECG 0.478 0.835 0.598 0.490 0.844 0.610 0.486 0.851 0.610
KECG (w/ RKD) 0.513 0.853 0.627 0.516 0.861 0.633 0.535 0.877 0.651
HyperKA (w/o KE) 0.518 0.814 0.623 0.535 0.834 0.640 0.529 0.859 0.645
HyperKA (w/ Init.) 0.569 0.847 0.659 0.551 0.853 0.659 0.572 0.878 0.681
HyperKA 0.572 0.865 0.678 0.564 0.865 0.673 0.597 0.891 0.704
HyperKA (w/ RKD) 0.581 0.868 0.693 0.584 0.879 0.691 0.601 0.894 0.711
Table 4: Results of KECG and HyperKA with different knowledge embedding methods on DBP15K.

To further analyze the importance of relational knowledge and the effectiveness of relational knowledge distillation, we integrate our knowledge distillation technique with the KECG and HyperKA models, producing four variants with different knowledge embedding methods for each model. Specifically, (w/o KE) denotes variants without knowledge embedding, (w/ Init.) denotes variants whose entity embeddings are initialized from a pre-trained knowledge embedding model, and (w/ RKD) denotes variants with our relational knowledge distillation. As shown in Table 4, our distillation method (w/ RKD) yields the best performance for both KECG and HyperKA.

Figure 3: Effectiveness evaluation of incorporating Relational Knowledge Distillation (RKD) on DBP15K. The solid filled columns indicate models without RKD while the slash filled columns indicate models with RKD.

In order to illustrate the performance gain brought by our proposed knowledge distillation approach, Figure 3 shows the performance comparison among the models with and without relational knowledge distillation on DBP15K. The results clearly show that by introducing relational knowledge distillation, all three models, KECG, HyperKA and RKDEA achieve significant performance gain.

These results demonstrate that our proposed methods could be adopted to improve other existing KG alignment models and therefore further prove the potential and effectiveness of our proposed relational knowledge distillation method.

5.6 Impact of Adaptive Temperature Factor

The adaptive temperature mechanism is one of the core contributions of RKDEA. To explore the effect of incorporating the temperature factor, we conduct an ablation study by comparing with RKDEA (w/o Temp.) on both cross-lingual and cross-domain datasets. Figure 4 shows the Hits@1 curves of these two models as training iterations increase. The results illustrate that RKDEA (w/o Temp.) converges faster, while RKDEA achieves higher Hits@1 at the end of the training process.

Figure 4: Impact of temperature factor in training process on DBP and DWY.

In fact, the temperature acts as a weight decay mechanism in the training process. In the early stage of training, when the GNN is not yet well trained, the distilled relational knowledge is instructive for entity alignment and quickly brings the model into a relatively good state. However, as training progresses, relational knowledge and neighborhood information may lead to different objectives. Therefore, if the contribution of distilled relational knowledge stays the same, the model falls into a trade-off between two directions: overfitting the relational knowledge while underfitting the neighborhood consensus, or vice versa. Consequently, involving excessive information can be harmful, while the adaptive temperature mechanism avoids this by controlling the contribution of the distilled relational knowledge.

6 Conclusion

In this paper, we study the problem of entity alignment over heterogeneous KGs. We propose a GCN based framework with knowledge distillation techniques to take advantage of complex relational knowledge by jointly learning entity embedding and neighborhood consensus. With the help of relational knowledge distillation, our model can effectively and flexibly model relational knowledge and neighborhood information. Furthermore, by automatically adjusting the temperature parameter, our proposed model can dynamically control the contributions of the different objectives and avoid overfitting. Experimental results on several popular benchmark datasets show that the proposed solutions outperform state-of-the-art methods by a clear margin.


  • [1] Ba, J., Caruana, R.: Do deep nets really need to be deep? In: NeurIPS. pp. 2654–2662 (2014)
  • [2] Bagherinezhad, H., Horton, M., Rastegari, M., Farhadi, A.: Label refinery: Improving imagenet classification through label progression. CoRR abs/1805.02641 (2018)
  • [3] Bordes, A., Usunier, N., García-Durán, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: NeurIPS. pp. 2787–2795 (2013)
  • [4] Bucila, C., Caruana, R., Niculescu-Mizil, A.: Model compression. In: SIGKDD. pp. 535–541 (2006)
  • [5] Cao, Y., Liu, Z., Li, C., Liu, Z., Li, J., Chua, T.: Multi-channel graph neural network for entity alignment. In: ACL. pp. 1452–1461 (2019)
  • [6] Chen, B., Zhang, J., Tang, X., Chen, H., Li, C.: Jarka: Modeling attribute interactions for cross-lingual knowledge alignment. In: PAKDD. vol. 12084, pp. 845–856 (2020)
  • [7] Chen, M., Tian, Y., Yang, M., Zaniolo, C.: Multilingual knowledge graph embeddings for cross-lingual knowledge alignment. In: IJCAI. pp. 1511–1517 (2017)
  • [8] Chen, Y., Wang, N., Zhang, Z.: Darkrank: Accelerating deep metric learning via cross sample similarities transfer. In: McIlraith, S.A., Weinberger, K.Q. (eds.) AAAI. pp. 2852–2859 (2018)
  • [9] Furlanello, T., Lipton, Z.C., Tschannen, M., Itti, L., Anandkumar, A.: Born-again neural networks. In: ICML. vol. 80, pp. 1602–1611 (2018)
  • [10] Guo, L., Sun, Z., Hu, W.: Learning to exploit long-term relational dependencies in knowledge graphs. In: ICML. vol. 97, pp. 2505–2514 (2019)
  • [11] Hao, Y., Zhang, Y., He, S., Liu, K., Zhao, J.: A joint embedding method for entity alignment of knowledge bases. In: CCKS. vol. 650, pp. 3–14 (2016)
  • [12] Hinton, G.E., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network (2015)
  • [13] Huang, Z., Wang, N.: Like what you like: Knowledge distill via neuron selectivity transfer. CoRR abs/1707.01219 (2017)
  • [14] Ji, G., He, S., Xu, L., Liu, K., Zhao, J.: Knowledge graph embedding via dynamic mapping matrix. In: ACL. pp. 687–696 (2015)
  • [15]

    Karasuyama, M., Takeuchi, I.: Nonlinear regularization path for the modified huber loss support vector machines. In: IJCNN. pp. 1–8 (2010)

  • [16] Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: ICLR (2017)
  • [17]

    Lee, J., Kim, H., Lee, J., Yoon, S.: Transfer learning for deep learning on graph-structured data. In: AAAI. pp. 2154–2160 (2017)

  • [18] Li, C., Cao, Y., Hou, L., Shi, J., Li, J., Chua, T.: Semi-supervised entity alignment via joint knowledge embedding model and cross-graph model. In: EMNLP. pp. 2723–2732 (2019)
  • [19] Li, Y., Li, J., Suhara, Y., Wang, J., Hirota, W., Tan, W.: Deep entity matching: Challenges and opportunities. ACM J. Data Inf. Qual. 13(1), 1:1–1:17 (2021)
  • [20] Lin, Y., Liu, Z., Sun, M., Liu, Y., Zhu, X.: Learning entity and relation embeddings for knowledge graph completion. In: AAAI. pp. 2181–2187 (2015)
  • [21] Liu, Y., Ao, X., Dong, L., Zhang, C., Wang, J., He, Q.: Spatiotemporal activity modeling via hierarchical cross-modal embedding. IEEE Trans. Knowl. Data Eng. 34(1), 462–474 (2022)
  • [22] Liu, Z., Cao, Y., Pan, L., Li, J., Chua, T.: Exploring and evaluating attributes, values, and structures for entity alignment. In: EMNLP. pp. 6355–6364 (2020)
  • [23]

    Lu, J., Lin, C., Wang, J., Li, C.: Synergy of database techniques and machine learning models for string similarity search and join. In: CIKM. pp. 2975–2976 (2019)

  • [24] Mahdisoltani, F., Biega, J., Suchanek, F.M.: YAGO3: A knowledge base from multilingual wikipedias. In: CIDR (2015)
  • [25] Mao, X., Wang, W., Xu, H., Lan, M., Wu, Y.: MRAEA: an efficient and robust entity alignment approach for cross-lingual knowledge graph. In: WSDM. pp. 420–428 (2020)
  • [26] Nie, H., Han, X., Sun, L., Wong, C.M., Chen, Q., Wu, S., Zhang, W.: Global structure and local semantics-preserved embeddings for entity alignment. In: IJCAI. pp. 3658–3664 (2020)
  • [27] Park, W., Kim, D., Lu, Y., Cho, M.: Relational knowledge distillation. In: CVPR. pp. 3967–3976 (2019)
  • [28] Pei, S., Yu, L., Hoehndorf, R., Zhang, X.: Semi-supervised entity alignment via knowledge graph embedding with awareness of degree difference. In: WWW. pp. 3130–3136 (2019)
  • [29] Pei, S., Yu, L., Zhang, X.: Improving cross-lingual entity alignment via optimal transport. In: IJCAI. pp. 3231–3237 (2019)
  • [30] Rocco, I., Cimpoi, M., Arandjelovic, R., Torii, A., Pajdla, T., Sivic, J.: Neighbourhood consensus networks. In: NeurIPS. pp. 1658–1669 (2018)
  • [31] Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: Fitnets: Hints for thin deep nets. In: ICLR (2015)
  • [32] Schlichtkrull, M.S., Kipf, T.N., Bloem, P., van den Berg, R., Titov, I., Welling, M.: Modeling relational data with graph convolutional networks. In: ESWC. pp. 593–607 (2018)
  • [33] Srivastava, R.K., Greff, K., Schmidhuber, J.: Highway networks (2015)
  • [34] Sun, Z., Hu, W., Li, C.: Cross-lingual entity alignment via joint attribute-preserving embedding. In: ISWC. vol. 10587, pp. 628–644 (2017)
  • [35] Sun, Z., Hu, W., Zhang, Q., Qu, Y.: Bootstrapping entity alignment with knowledge graph embedding. In: IJCAI. pp. 4396–4402 (2018)
  • [36] Sun, Z., Wang, C., Hu, W., Chen, M., Dai, J., Zhang, W., Qu, Y.: Knowledge graph alignment network with gated multi-hop neighborhood aggregation. In: AAAI. pp. 222–229 (2020)
  • [37] Tang, X., Zhang, J., Chen, B., Yang, Y., Chen, H., Li, C.: BERT-INT: A bert-based interaction model for knowledge graph alignment. In: IJCAI. pp. 3174–3180 (2020)
  • [38] Tarvainen, A., Valpola, H.: Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In: NeurIPS. pp. 1195–1204 (2017)
  • [39] Tian, B., Zhang, Y., Wang, J., Xing, C.: Hierarchical inter-attention network for document classification with multi-task learning. In: IJCAI. pp. 3569–3575 (2019)
  • [40] Wang, J., Lin, C., Li, M., Zaniolo, C.: Boosting approximate dictionary-based entity extraction with synonyms. Inf. Sci. 530, 1–21 (2020)
  • [41] Wang, J., Lin, C., Zaniolo, C.: Mf-join: Efficient fuzzy string similarity join with multi-level filtering. In: ICDE. pp. 386–397 (2019)
  • [42]

    Wang, Z., Zhang, J., Feng, J., Chen, Z.: Knowledge graph embedding by translating on hyperplanes. In: AAAI. pp. 1112–1119 (2014)

  • [43] Wang, Z., Li, J., Tang, J.: Boosting cross-lingual knowledge linking via concept annotation. In: IJCAI. pp. 2733–2739 (2013)
  • [44] Wang, Z., Lv, Q., Lan, X., Zhang, Y.: Cross-lingual knowledge graph alignment via graph convolutional networks. In: EMNLP. pp. 349–357 (2018)
  • [45] Wu, J., Zhang, Y., Wang, J., Lin, C., Fu, Y., Xing, C.: Scalable metric similarity join using mapreduce. In: ICDE. pp. 1662–1665 (2019)
  • [46] Wu, Y., Liu, X., Feng, Y., Wang, Z., Yan, R., Zhao, D.: Relation-aware entity alignment for heterogeneous knowledge graphs. In: IJCAI. pp. 5278–5284 (2019)
  • [47] Wu, Y., Liu, X., Feng, Y., Wang, Z., Zhao, D.: Jointly learning entity and relation representations for entity alignment. In: EMNLP. pp. 240–249 (2019)
  • [48] Wu, Y., Liu, X., Feng, Y., Wang, Z., Zhao, D.: Neighborhood matching network for entity alignment. In: ACL. pp. 6477–6487 (2020)
  • [49] Xu, K., Wang, L., Yu, M., Feng, Y., Song, Y., Wang, Z., Yu, D.: Cross-lingual knowledge graph alignment via graph matching neural network. In: ACL. pp. 3156–3161 (2019)
  • [50] Yang, J., Zhang, Y., Zhou, X., Wang, J., Hu, H., Xing, C.: A hierarchical framework for top-k location-aware error-tolerant keyword search. In: ICDE. pp. 986–997 (2019)
  • [51] Yang, K., Liu, S., Zhao, J., Wang, Y., Xie, B.: COTSAE: co-training of structure and attribute embeddings for entity alignment. In: AAAI. pp. 3025–3032 (2020)
  • [52] Ye, R., Li, X., Fang, Y., Zang, H., Wang, M.: A vectorized relational graph convolutional network for multi-relational network alignment. In: IJCAI. pp. 4135–4141 (2019)
  • [53] Yim, J., Joo, D., Bae, J., Kim, J.: A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In: CVPR. pp. 7130–7138 (2017)
  • [54]

    Zagoruyko, S., Komodakis, N.: Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In: ICLR (2017)

  • [55] Zeng, K., Li, C., Hou, L., Li, J.Z., Feng, L.: A comprehensive survey of entity alignment for knowledge graphs. AI Open 2, 1–13 (2021)
  • [56] Zhang, Q., Sun, Z., Hu, W., Chen, M., Guo, L., Qu, Y.: Multi-view knowledge graph embedding for entity alignment. In: IJCAI. pp. 5429–5435 (2019)
  • [57] Zhang, Y., Chen, Y., Yang, J., Wang, J., Hu, H., Xing, C., Zhou, X.: Clustering enhanced error-tolerant top-k spatio-textual search. World Wide Web 24(4), 1185–1214 (2021)
  • [58]

    Zhang, Y., Wu, J., Wang, J., Xing, C.: A transformation-based framework for KNN set similarity search. IEEE Trans. Knowl. Data Eng.

    32(3), 409–423 (2020)
  • [59]

    Zhao, K., Zhang, Y., Wang, Z., Yin, H., Zhou, X., Wang, J., Xing, C.: Modeling patient visit using electronic medical records for cost profile estimation. In: Pei, J., Manolopoulos, Y., Sadiq, S.W., Li, J. (eds.) DASFAA. pp. 20–36 (2018)

  • [60] Zhu, H., Xie, R., Liu, Z., Sun, M.: Iterative entity alignment via joint knowledge embeddings. In: IJCAI. pp. 4258–4264 (2017)
  • [61] Zhu, Q., Zhou, X., Wu, J., Tan, J., Guo, L.: Neighborhood-aware attentional representation for multilingual knowledge graphs. In: IJCAI. pp. 1943–1949 (2019)