Multi-teacher Knowledge Distillation for Knowledge Graph Completion

Link prediction based on knowledge graph embedding (KGE) aims to predict new triples to complete knowledge graphs (KGs) automatically. However, recent KGE models tend to improve performance by excessively increasing vector dimensions, which causes enormous training and storage costs in practical applications. To address this problem, we first theoretically analyze the capacity of low-dimensional space for KG embeddings based on the principle of minimum entropy. Then, we propose a novel knowledge distillation framework for knowledge graph embedding that utilizes multiple low-dimensional KGE models as teachers. Under a novel iterative distillation strategy, the MulDE model produces soft labels adaptively according to the training epoch and student performance. The experimental results show that MulDE can effectively improve the performance and training speed of low-dimensional KGE models. The distilled 32-dimensional models are highly competitive with some state-of-the-art (SotA) high-dimensional methods on several commonly used datasets.




1. Introduction

Knowledge graphs (KGs), which describe human knowledge as factual triples in the form of (head entity, relation, tail entity), have shown great potential in various domains (Zhu et al., 2019; Wang et al., 2020). Popular real-world KGs, such as Yago (Suchanek et al., 2007) and DBPedia (Bizer et al., 2009), typically contain tens of millions of entities yet are still far from complete. Because of the high manual cost of discovering new triples (Paulheim, 2018), link prediction based on knowledge graph embedding (KGE) draws considerable attention (Che et al., 2020; Zhang et al., 2020). Given an entity and a relation of a triple (which we call an ‘e-r query’), a typical KGE model first represents entities and relations as trainable continuous vectors. Then, it defines a score function to measure each candidate in the entity set and outputs the best one.

To achieve higher prediction accuracy, recent KGE models generally use high-dimensional embedding vectors of up to 200 or 500 dimensions. However, when a KG has millions or billions of entities, a high-dimensional model requires enormous training costs and storage memory. This prevents downstream AI applications from updating KG embeddings promptly or being deployed on mobile devices such as cell phones. Instead of blindly increasing the embedding dimension, several lines of research have drawn attention to improving low-dimensional models (with, e.g., 8 or 32 dimensions) (Chami et al., 2020), or to compressing pre-trained high-dimensional models (Sachan, 2020). However, the former cannot utilize the ‘high-accuracy’ knowledge of high-dimensional models, while the latter suffers from heavy pre-training costs and cannot continue training when the KG is modified.

Figure 1. The changes of model performance (Hits@10) with the growth of the embedding dimension on two datasets. The Ensemble model adds the triple scores of the four models directly.

Can we transfer ‘high-accuracy’ knowledge to a small model while avoiding training high-dimensional models? We are inspired by a preliminary experiment using different hyperbolic KGE models (Chami et al., 2020). As shown in Fig. 1, when the embedding dimension exceeds 64, the model performance improves only slightly or starts to fluctuate. Besides, the 64-dimensional ensemble is already better than any single higher-dimensional model, so it can act as a lightweight source of ‘high-accuracy’ knowledge. To this end, we employ different low-dimensional hyperbolic models as multiple teachers and train a smaller KGE model in a Knowledge Distillation (KD) process. Knowledge Distillation (Hinton et al., 2015) is a technique for distilling ‘high-accuracy’ knowledge from a pre-trained big model (teacher) to train a small one (student). It has rarely been applied in the knowledge graph domain.

Compared with a single high-dimensional teacher, we argue that there are at least three benefits of utilizing multiple low-dimensional teachers:

  • Reduce pre-training costs. The total parameter count of multiple low-dimensional teachers is lower than that of a high-dimensional model. Besides, the pre-training speed of the former can be further improved by parallel techniques.

  • Guarantee teacher performance. By integrating the results of multiple models, the prediction errors of individual models can correct each other, so that the ensemble’s accuracy exceeds that of some high-dimensional models.

  • Improve the distillation effect. Recent knowledge distillation studies show that a student model acquires knowledge more easily from a teacher of similar size (Mirzadeh et al., ). A teacher dimension close to the lower bound is therefore more suitable than hundreds of dimensions.

In this paper, we first theoretically analyze the capacity of low-dimensional space for KG embeddings based on the principle of minimum entropy, and estimate the approximate lower bound of embedding dimension when representing different scales of KGs. Under the guidance of theory, we pre-train multiple hyperbolic KGE models in low dimensions. Then, we propose a novel framework utilizing

Multi-teacher knowledge Distillation for knowledge graph Embeddings, named MulDE. Considering the difference between the classification problem and the link prediction task, we propose a novel iterative distillation strategy. There are two student components in MulDE, Junior and Senior. In each iteration, the Junior component first selects the top K candidates for each e-r query and sends them to the multiple teachers. Then, the Senior component integrates the teacher results and transfers soft labels to the Junior. Therefore, instead of teachers passing on knowledge in one direction, the student can actively seek knowledge from teachers according to the currently indistinguishable entities.
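The Junior’s first step in an iteration, selecting its K currently highest-scoring (and therefore hardest-to-separate) candidates to ask the teachers, can be sketched in a few lines; `topk_candidates` is a hypothetical helper name, not an identifier from the paper.

```python
import numpy as np

def topk_candidates(junior_scores, k):
    """Pick the indices of the K highest-scoring candidate entities.

    These are the entities the Junior currently finds hardest to
    distinguish, so they are the ones sent to the teachers."""
    scores = np.asarray(junior_scores, dtype=float)
    return np.argsort(-scores)[:k]

# Example: the Junior's scores over four candidate entities
print(topk_candidates([0.1, 0.9, 0.5, 0.7], k=2))  # indices 1 and 3
```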

We conduct experiments on two commonly-used datasets, FB15k-237 and WN18RR. The results show that MulDE can significantly improve the performance of low-dimensional KGE models. The distilled RotH model outperforms the SotA low-dimensional models, and it is comparable to some high-dimensional ones. Compared with conventional single-teacher KD methods, MulDE can accelerate the training process and achieve higher accuracy. According to ablation experiments, we prove the effectiveness of the iterative distillation strategy and the other major modules in MulDE. We compare different teacher settings and conclude that using four different 64-dimensional hyperbolic models is optimal.

The rest of the paper is organized as follows. We briefly introduce the background and preliminaries related to our work in Section 2. Section 3 details the whole framework and basic components of the MulDE model. In Section 4, the theoretical analysis of the embedding dimension is carried out. Section 5 reports the experimental studies, and Section 6 further discusses the experimental investigations. We overview the related work in Section 7 and, finally, give some concluding remarks in Section 8.

2. Preliminaries

This section introduces some definitions and notations used throughout the paper.

Knowledge Graph Embeddings. Let E and R denote the sets of entities and relations. A knowledge graph (KG) G is a collection of factual triples (h, r, t), where h, t ∈ E and r ∈ R. n_e and n_r refer to the number of entities and relations, respectively. A KGE model represents each entity or relation as a d-dimensional continuous vector, and learns a scoring function f(h, r, t) to ensure that plausible triples receive higher scores. A summary table displays the scoring functions and loss functions of six popular KGE models; details about the hyperbolic models used in this work are introduced in Section 3.1.


Link Prediction. Generalized link prediction tasks include entity prediction and relation prediction. In this paper, we focus on the more challenging entity prediction task. Given an e-r query (h, r, ?), the typical link prediction aims to predict the missing entity and output the predicted triple. This task is cast as a learning-to-rank problem. A KGE model outputs the scores of all candidate triples, denoted as {f(h, r, t′) | t′ ∈ E}, in which candidates with higher scores are more likely to be true.

Knowledge Distillation. Knowledge Distillation technologies aim to distill the knowledge from a larger deep neural network into a small network (Hinton et al., 2015). The probability distribution over classes outputted by a teacher model is referred to as the ‘soft label’, which helps the student model mimic the behaviour of the teacher model. A temperature factor T is introduced to control the importance of each soft label. Specifically, as T → 0, the soft labels become one-hot vectors, i.e., the hard labels. With the increase of T, the labels become ‘softer’. As T → ∞, all classes share the same probability.
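The temperature behaviour described above can be illustrated with a small numpy sketch (`soften` is our own name for the temperature-scaled Softmax, not an identifier from the paper):

```python
import numpy as np

def soften(logits, T):
    """Temperature-scaled softmax: T -> 0 approaches a one-hot (hard) label,
    larger T spreads probability mass, T -> inf approaches uniform."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                      # numerical stability
    p = np.exp(z)
    return p / p.sum()

scores = [3.0, 1.0, 0.2]
print(soften(scores, 0.1))    # nearly one-hot on the top class
print(soften(scores, 1.0))    # the usual softmax
print(soften(scores, 100.0))  # nearly uniform over classes
```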

Figure 2. An illustration of the MulDE framework.

3. Methodology

The Multi-teacher Distillation Embedding (MulDE) model utilizes multiple pre-trained low-dimensional models as teachers, extracts knowledge by contrasting different teacher scores, and then supervises the training process of a low-dimensional student model. Figure 2 illustrates the architecture of the MulDE framework, including multiple teachers, the Junior component and the Senior component.

  • Multiple teachers are regarded as the data source of prediction sequences, and have no parameter updates in the training process. We employ four different hyperbolic KGE models as teachers. Details about pre-trained teachers are described in Section 3.1.

  • Junior component is the target low-dimensional KGE model. Given an e-r pair, the Junior sends its top K predicted entities to teachers, and gets the corresponding ‘soft labels’ from the Senior component. We will detail the Junior component in Section 3.2.

  • Senior component acquires knowledge from the teacher models directly, and then generates soft labels through two mechanisms: a relation-specific scaling mechanism and a contrast attention mechanism. The details of the Senior component will be discussed in Section 3.3.

As shown in Fig. 2, unlike traditional one-way guidance, the iterative distillation strategy in MulDE forms a novel circular interaction between the students and the teachers. In each iteration, the Junior component makes a preliminary prediction for an e-r query, and selects the indistinguishable entities (top K) to ask the multiple teachers. In this way, the Junior can effectively correct its front-end prediction results and outperform other models at the same dimension level. Meanwhile, rather than processing fixed teacher scores all the time, the Senior can adjust its parameters continually according to the Junior’s feedback, and produce soft labels adaptively according to the training epoch and the Junior’s performance. The learning procedure of MulDE will be detailed in Section 3.4.

3.1. Pre-trained Teacher Models

To ensure the performance of the low-dimensional teachers, we employ a group of hyperbolic KGE models proposed by Chami et al. (Chami et al., 2020). Based on that previous work, we further add new models to this group. In this section, we describe these models briefly.

Taking the RotH model as an example, it uses a d-dimensional Poincaré ball model with trainable negative curvature. Embedding vectors are first mapped into this hyperbolic space, and a relation vector is regarded as a rotation transformation of entity vectors. Then, a hyperbolic distance function is employed to measure the difference between the transformed head vector and a candidate tail vector. Similarly, Chami et al. propose TransH and RefH. The former uses Möbius addition to imitate TransE (Bordes et al., 2013) in hyperbolic space, while the latter replaces the rotation transformation with a reflection.

Considering the effectiveness of previous models using inner-product transformations, such as DistMult (Yang et al., 2015) and ComplEx (Trouillon et al., 2016), we add another hyperbolic model named ‘DistH’. The scoring functions of these models share the general form

f(h, r, t) = −d_c(g_r(e_h), e_t)² + b_h + b_t,

where d_c is the hyperbolic distance, g_r is the relation-specific transformation (Möbius addition e_h ⊕_c e_r for TransH, a rotation for RotH, a reflection for RefH, and an inner-product-based transformation for DistH), ⊕_c is the Möbius addition operation, ⟨·,·⟩ is the inner product operation, and c is the space curvature.
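To make the two core hyperbolic operations concrete, here is a minimal numpy sketch of Möbius addition and the Poincaré-ball distance for curvature c > 0, following their standard textbook definitions (the function names are ours, not the paper’s):

```python
import numpy as np

def mobius_add(x, y, c):
    """Mobius addition on the Poincare ball with curvature -c (c > 0)."""
    xy = np.dot(x, y)
    x2, y2 = np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den

def hyp_distance(x, y, c):
    """Hyperbolic distance between two points on the Poincare ball;
    as c -> 0 it approaches twice the Euclidean distance."""
    diff = mobius_add(-x, y, c)
    norm = np.linalg.norm(diff)
    return 2.0 / np.sqrt(c) * np.arctanh(np.sqrt(c) * norm)

h = np.array([0.1, 0.2])
t = np.array([0.3, -0.1])
print(hyp_distance(h, t, c=1.0))  # the distance used inside f = -d_c(.)^2 + biases
```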

We select pre-trained low-dimensional models from this group as teachers T_1, …, T_m, where m is the number of teachers. Furthermore, to verify the performance of hyperbolic space in multi-teacher knowledge distillation, we prepare a corresponding Euclidean model ‘ModelE’ for each hyperbolic model ‘ModelH’. We will describe the experimental results in Section 5.

3.2. Junior Component

The Junior component employs a general low-dimensional KGE model without special restrictions. The target of MulDE is to train a higher-performance Junior component through a knowledge distillation process, so that the trained Junior model can be used for faster reasoning instead of high-dimensional models.

Most conventional knowledge distillation methods are used for classification problems. However, the link prediction task is a ‘learning to rank’ problem, which is usually learned by distinguishing a positive target from random negative samples. A straightforward solution is to fit the teachers’ score distributions over the positive and negative samples. We argue that this has at least two drawbacks. First, the teaching task is ‘too easy’ for multiple teachers: every teacher will output a result close to the hard label, without obvious differences. Second, negative sampling rarely ‘hits’ the really indistinguishable entities, so the critical knowledge owned by the teachers can hardly be passed on to the student model.

Soft Label Loss. We design a novel distillation strategy for KGE models. Given an e-r query, the Junior model evaluates all candidate entities in the entity set. Then, it selects the top K candidates with the highest scores s^J. As the beginning of an iteration, this prediction sequence is sent to every teacher model, and the Senior component returns soft labels l after integrating the different teacher outputs. We define the soft-label supervision as a Kullback-Leibler (KL) divergence; the loss function is

L_soft = (1/N) Σ_q KL( σ(l_q) ‖ σ(s^J_q) ),

where σ is the Softmax function and N is the number of e-r queries.

Hard Label Loss. Meanwhile, a hard-label supervision based on conventional negative sampling is employed:

L_hard = (1/N) Σ_q BCE( ŝ_q, y_q ),

where BCE is the binary cross-entropy loss, ŝ_q contains the scores of the positive target and the sampled negative ones, and y_q is a one-hot label vector.

Finally, the Junior loss is formulated as the weighted sum of L_soft and L_hard, with a hyper-parameter λ balancing the two parts:

L_junior = λ L_soft + (1 − λ) L_hard.
We argue that both supervisions are necessary. The hard-label supervision gives the Junior model the opportunity to learn independently, and random negative sampling helps it handle more general queries. Meanwhile, the novel soft-label supervision corrects the top-K predictions of the Junior model, encouraging it to imitate the teachers’ ‘answers’ to these hard questions. In the experiments, we will evaluate four different low-dimensional Junior models, corresponding to the hyperbolic models mentioned in Section 3.1. As a general framework, MulDE can also be applied to previous KGE models, such as DistMult and RotatE.
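A minimal numpy sketch of the combined Junior objective, assuming a λ-weighted sum of the KL soft-label term over the top-K candidates and a binary cross-entropy hard-label term over the positive target plus sampled negatives (the variable names and exact weighting are our assumptions, not the paper’s code):

```python
import numpy as np

def softmax(x):
    z = np.asarray(x, dtype=float) - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def kl_div(p, q, eps=1e-12):
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

def junior_loss(junior_topk, soft_labels, junior_neg, hard_labels, lam=0.5):
    """lam * L_soft + (1 - lam) * L_hard for a single e-r query.

    junior_topk: the Junior's scores for its top-K candidates
    soft_labels: the Senior's integrated scores for the same candidates
    junior_neg:  the Junior's scores for the positive target + negatives
    hard_labels: one-hot vector marking the positive target"""
    l_soft = kl_div(softmax(soft_labels), softmax(junior_topk))
    p = sigmoid(junior_neg)
    l_hard = -float(np.mean(hard_labels * np.log(p + 1e-12)
                            + (1 - hard_labels) * np.log(1 - p + 1e-12)))
    return lam * l_soft + (1 - lam) * l_hard
```

When the Junior’s top-K scores already match the soft labels, only the hard-label term contributes, which is the intended division of labour between the two supervisions.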

3.3. Senior Component

As the bridge between the multiple teachers and the Junior component, the Senior component aims to integrate the different score sequences from the teachers and generate soft labels suitable for the Junior model to learn from. There are two parts of input for the Senior: the top K scores s^J from the Junior, and m score sequences s^1, …, s^m (m is the number of teachers) from the teacher models. All of the above sequences have length K, corresponding to the K candidate entities.

In the senior component, we utilize two mechanisms for knowledge integration:

Relation-specific scaling mechanism. According to previous research (Akrami et al., 2020), different KGE models show performance advantages on different relations. To improve the integrated soft labels, each single teacher should contribute more on the types of relations it excels at. To this end, we define a trainable scaling matrix W ∈ R^{n_r × m}, and assign an adaptive scaling value (ranging from 0 to 1) to each teacher sequence. The i-th scaled sequence is computed as

ŝ^i = sigmoid(W_{r,i}) · s^i,

where r is the index of the relation in the e-r query. Note that all scores in a sequence are adjusted in proportion, so their relative ranks do not change. Attempts have been made to scale the scores of each position individually, but then the loss of the Senior component fluctuates sharply and is hard to converge.

Contrast attention mechanism. Because the Junior model is randomly initialized, it needs to make a basic division of the whole entity set early in training. At this time, the distribution of its scores can differ significantly from that of a trained teacher, and an overly different soft label would block the training process. Based on this idea, we contrast the Junior sequence with each teacher sequence and evaluate their similarity. An attention mechanism is designed to make the integrated sequence more similar to the Junior sequence. The calculation process is as follows:

d_i = KL( σ(s^J) ‖ σ(ŝ^i) ),   a_i = exp(−d_i / T_e) / Σ_j exp(−d_j / T_e),

where KL is the KL divergence, and a higher d_i means a larger difference between s^J and ŝ^i. T_e is a parameter that increases exponentially with the number of training rounds. After the early period, the attention weights a_i of the different teachers tend to be equal, and the soft labels start to focus on model performance instead of whether the Junior can adapt. Experimental results show that early adaptive soft labels can obviously increase the convergence rate and prediction accuracy.

The contrast attention mechanism only takes effect in the early tens of training epochs, and its target, unlike the scaling mechanism, is not higher performance. To train the parameters in W, we use the summation of the scaled sequences Σ_i ŝ^i to compute the loss. A cross-entropy loss evaluates the performance of the scaled scores in the Senior component:

L_senior = CE( σ(Σ_i ŝ^i), y ),

where y is a one-hot vector in which the position of the target entity of the e-r query is 1 and the rest are 0. Besides, if the target entity is not in the candidate set, y is the zero vector.
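The two Senior mechanisms can be sketched together in numpy; here `scale` stands for the already-sigmoid-activated, relation-specific entries of W, and `temp` is the temperature that grows over the epochs so the contrast-attention weights even out (the names and exact parameterization are our assumptions):

```python
import numpy as np

def softmax(x):
    z = np.asarray(x, dtype=float) - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def kl_div(p, q, eps=1e-12):
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def senior_soft_labels(junior_seq, teacher_seqs, scale, temp):
    """Integrate teacher sequences into soft labels.

    scale: per-teacher relation-specific scaling values in (0, 1)
    temp:  contrast-attention temperature; growing it over epochs makes
           the attention weights approach uniform."""
    scaled = [w * np.asarray(s, dtype=float) for s, w in zip(teacher_seqs, scale)]
    # KL distance between the Junior's distribution and each scaled teacher
    d = np.array([kl_div(softmax(junior_seq), softmax(s)) for s in scaled])
    a = np.exp(-d / temp)
    a = a / a.sum()                     # attention weights over teachers
    labels = sum(ai * si for ai, si in zip(a, scaled))
    return labels, a
```

Early in training (small `temp`), a teacher whose score distribution resembles the Junior’s receives nearly all of the attention; later (large `temp`) the teachers are weighted almost equally.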

3.4. Learning Algorithm

There are three roles in the proposed MulDE framework, namely Teacher (pre-trained hyperbolic models), Senior Student (integration mechanisms), and Junior Student (the target low-dimensional model). As the teacher models have no parameter fine-tuning, we use a combined loss function to train the two student components and minimize the loss by exploiting the Adam (Kingma and Ba, 2015) optimizer. The complete loss function is

L = L_junior + L_senior + μ‖Θ‖²,

where the third part is the parameter regularization term.

Require: Existing KG triples; Q: the training e-r queries; E: the total entity set of the KG; T: the target entities for each query
// Pre-training Phase
1:  for i = 1 to m do
2:     Train the teacher model T_i with the training dataset;
3:  end for
// Training Phase
4:  for number of training iterations do
5:     for each e-r query q ∈ Q do
6:        s^J ← JuniorScoreFunction(q, E);
7:        C ← GatherTopKCandidates(s^J);
8:        C⁻ ← RandomNegSamples(q, E);
9:        for i = 1 to m do
10:          s^i ← TeacherScoreFunction(q, C);
11:       end for
12:       ŝ^1, …, ŝ^m ← RelationScaling(s^1, …, s^m);
13:       l ← ContrastAttention(s^J, ŝ^1, …, ŝ^m);
14:    end for
15:    Compute the Senior loss with Σ_i ŝ^i;
16:    Compute the Junior loss with l and C⁻;
17:    Update the parameters of MulDE by gradient descent;
18: end for
Algorithm 1 The Learning Procedure of MulDE

The learning algorithm of MulDE is presented in Algorithm 1. We emphasize the rigor and practicality of this learning procedure. In the pre-training and training phases, neither the pre-trained models nor the MulDE model can access the validation and test datasets. In the execution phase, the model cannot receive the labels of the input data.

4. Theoretical Analysis of Minimum Dimension

As the number of entities far exceeds that of relations, the entity embedding parameters usually account for more than 95% of the total parameters of a KGE model. Given a fixed entity set, reducing the embedding dimension can significantly save storage space. However, there is no theoretical analysis of the minimum embedding dimension required by a knowledge graph. Inspired by recent research estimating the relationship between the dimension of word embeddings and the size of the word vocabulary (Su, 2020), we analyze KG embeddings from the perspective of the minimum entropy principle and estimate a relatively reliable dimension for knowledge graphs of different sizes.

To simplify the derivation, we first define the plausibility of a triple (h, r, t) as a joint probability P(q, t), where q is the e-r query (h, r). For most KGE models, the relation is regarded as a transformation between h and t. Hence, we treat the e-r query as a transformed entity vector, thereby converting the triple sample into a normal pair-wise one. By the Cartesian product, the number of all possible e-r queries is equal to n_e × n_r.

Then, a general KGE framework is defined as the following:

Definition 1. General KGE Framework M.

Given a knowledge graph G, a KGE model M learns entity and relation embeddings in a d-dimensional vector space. The joint probability of a triple is computed as

P(q, t) = exp(−d(f(e_q), g(e_t))) / Σ_{q′, t′} exp(−d(f(e_{q′}), g(e_{t′}))),

where e_q and e_t are the embedding vectors of q and t, d(·,·) is a distance measure function, and f and g refer to any vector transformations. Without loss of generality, d employs the squared Euclidean distance function, i.e., d(x, y) = ‖x − y‖².
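Under Definition 1 with the squared Euclidean distance, the probability of each candidate tail for a transformed e-r query is a distance-based Softmax. A small numpy sketch (normalizing over tails for readability, while the analysis normalizes over all pairs; the function name is ours):

```python
import numpy as np

def triple_prob(query_vec, entity_embs):
    """Probability of each tail entity for a transformed e-r query:
    the nearer an entity embedding (squared Euclidean distance),
    the larger its probability."""
    d2 = np.sum((np.asarray(entity_embs) - np.asarray(query_vec)) ** 2, axis=1)
    z = -d2 - np.max(-d2)          # numerical stability
    p = np.exp(z)
    return p / p.sum()

E = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
print(triple_prob(np.array([0.1, 0.0]), E))  # the nearest entity dominates
```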

From the view of information entropy, the KG itself has a certain degree of uncertainty. An e-r query points to one or more target entities. Meanwhile, given a large-scale entity set, the KG is often incomplete, and inevitably there are hidden triples that have not yet been discovered. When we use a KGE model to encode the KG, the entropy of the embedding result should be equal to or less than this uncertainty. In other words, we should minimize the entropy of the model, so that the KGE model is guaranteed to preserve the information of the KG.

According to the above analysis, we aim to find a sufficient condition ensuring that M’s entropy is smaller than that of the KG G, so that the relationship between the embedding dimension and the KG scale can be depicted.

Theorem 1. The embedding dimension d of M has an approximate lower bound d_min, expressed in terms of an accommodation constant c and the natural logarithms of n_e and n_r, such that the model has enough space to encode a KG G, where n_e is the size of the entity set and n_r is the size of the relation set; the exact form of the bound follows from the derivation in the Appendix.

Proof. The Proof is given in the Appendix.

Theorem 1 shows that when d = 64, the model can encode any KG at the scale of the common datasets in the KGE domain, such as FB15k-237, WN18RR, and YAGO3-10. We further verify this conclusion in the experiments detailed in Section 5. Meanwhile, we can speculate that 128-dimensional embedding vectors can accommodate some large-scale KGs, such as DBpedia and Freebase. Note that Theorem 1 is a rough estimate under a set of approximate conditions. The derived lower bound can only provide preliminary guidance for large-scale KGs, and its accuracy declines as the KG scale shrinks. Conducting a more precise derivation is difficult, and we leave it as future work.

Overall, this theoretical analysis demonstrates that low-dimensional vector space has the capability to accommodate a normal-size KG. It motivates us to utilize low-dimensional teachers to refine a low-dimensional model by Knowledge Distillation technologies, rather than roughly increasing the embedding dimension for higher performance.

Dataset    #Rel  #Ent    #Train   #Valid  #Test
FB15k-237  237   14,541  272,115  17,535  20,466
WN18RR     11    40,943  86,835   3,034   3,134
Table 1. Statistics of datasets used in experiments.
Methods        FB15K237                  WN18RR
               MRR   Hits@10 Hits@1      MRR   Hits@10 Hits@1
High-dimensional models
TransE         0.256 0.456   0.152       0.207 0.476   0.012
DistMult       0.286 0.445   0.202       0.412 0.484   0.372
ComplEx        0.283 0.447   0.202       0.431 0.513   0.395
ConvE          0.316 0.501   0.237       0.430 0.520   0.400
RotatE         0.338 0.533   0.241       0.476 0.571   0.428
TuckER         0.358 0.544   0.266       0.470 0.526   0.443
QuatE          0.348 0.550   0.248       0.488 0.582   0.438
RefH           0.346 0.536   0.252       0.461 0.568   0.404
RotH           0.344 0.535   0.246       0.496 0.586   0.449
AttH           0.348 0.540   0.252       0.486 0.573   0.443
32-dimensional models
RotatE         0.290 0.458   0.208       0.387 0.491   0.330
RefH           0.312 0.489   0.224       0.447 0.518   0.408
RotH           0.314 0.497   0.223       0.472 0.553   0.428
AttH           0.324 0.501   0.236       0.466 0.551   0.419
Our MulDE Models (32-dimensional)
TransH         0.308 0.488   0.217       0.231 0.518   0.081
MulDE-TransH   0.328 0.511   0.236       0.267 0.540   0.094
DistH          0.293 0.474   0.202       0.439 0.511   0.399
MulDE-DistH    0.326 0.509   0.235       0.460 0.545   0.417
RefH           0.302 0.474   0.215       0.453 0.526   0.414
MulDE-RefH     0.325 0.508   0.233       0.479 0.569   0.434
RotH           0.310 0.489   0.221       0.463 0.547   0.416
MulDE-RotH     0.328 0.515   0.237       0.481 0.574   0.433
Table 2. Link prediction results on two datasets. Best score of high-dimensional models underlined and best score of 32-dimensional models in bold; marked scores outperform those of all previous low-dimensional models.

5. Experiments

5.1. Experimental Setup

Datasets. Experimental studies are conducted on two commonly used datasets. WN18RR (Bordes et al., 2014) is a subset of the English lexical database WordNet (Miller, 1995). FB15k-237 (Toutanova and Chen, 2015) is extracted from Freebase and includes knowledge facts about movies, actors, awards, and sports. It was created by removing inverse relations, because many test triples could otherwise be obtained simply by inverting triples in the training set. The statistics of the datasets are given in Table 1. ‘#Rel’ and ‘#Ent’ refer to the number of relations and entities in the dataset, and the other columns refer to the number of triples in the training, validation, and test sets.

Baselines. We implement MulDE by employing four hyperbolic KGE models as student or teacher models, including TransH, DistH, RefH, and RotH. Although AttH (Chami et al., 2020) achieves better performance in one or two metrics, it is a combination of RefH and RotH. To verify the influence of hyperbolic space in multi-teacher knowledge distillation, we also pre-train a corresponding Euclidean model ‘ModelE’ for each hyperbolic model ‘ModelH’.

The compared KGE models fall into two classes: (1) high-dimensional models, including SotA KGE methods in Euclidean or hyperbolic space, whose published optimal results all use more than 200 dimensions; and (2) low-dimensional models, including RotH, RefH and AttH proposed by Chami et al., which benefit from hyperbolic vector space and have shown obvious advantages in low dimensions. To save space, we omit other previous methods.

Implementation Details. We select the hyper-parameters of our model via grid search according to the metrics on the validation set. For the teacher models, we pre-train teachers at several embedding dimensions; according to the theoretical analysis, we set the teacher embedding dimension to 64 in the main experiments, and select the learning rate and the number of negative samples by grid search. For the MulDE framework, we empirically select the Junior embedding dimension, the length K of the prediction sequences, the learning rate, and the balance hyper-parameter λ. All experiments are performed on an Intel Core i7-7700K CPU @ 4.20GHz and an NVIDIA GeForce GTX1080 Ti GPU, and implemented in Python using the PyTorch framework.

Evaluation Metrics.

For the link prediction experiments, we adopt two evaluation metrics: (1) MRR, the mean reciprocal rank of the test triples, and (2) Hits@N, the proportion of correct entities ranked in the top N. Higher MRR and Hits@N mean better performance. Following previous works, we process the output sequence in the ‘Filter’ mode. Note that, for the pre-trained models, we only remove those entities appearing in the training dataset.
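A minimal numpy sketch of the two metrics in ‘Filter’ mode (the helper names are ours): other entities known to be true answers for the same e-r query are masked out before the target’s rank is computed.

```python
import numpy as np

def filtered_rank(scores, target, known_true):
    """Rank of the target entity in 'Filter' mode: other entities already
    known to be correct answers for this e-r query are removed first."""
    s = np.asarray(scores, dtype=float).copy()
    mask = [i for i in known_true if i != target]
    s[mask] = -np.inf
    # rank = 1 + number of candidates scoring strictly above the target
    return 1 + int(np.sum(s > s[target]))

def mrr_hits(ranks, n=10):
    """MRR (mean reciprocal rank) and Hits@N over a list of test ranks."""
    r = np.asarray(ranks, dtype=float)
    return float(np.mean(1.0 / r)), float(np.mean(r <= n))
```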

Methods Dim MRR Hits@10 Hits@1
Teachers 64d 0.487 0.581 0.440
Origin 8d 0.210 0.352 0.140
Junior 8d 0.399 0.483 0.350
Growth 90.05% 37.41% 150.32%
Origin 16d 0.412 0.501 0.358
Junior 16d 0.464 0.547 0.419
Growth 12.60% 9.07% 17.22%
Origin 32d 0.463 0.547 0.416
Junior 32d 0.481 0.574 0.433
Growth 3.91% 4.91% 4.14%
Origin 64d 0.477 0.564 0.429
Junior 64d 0.482 0.579 0.430
Growth 1.05% 2.77% 0.21%
(a) Various Student Dimensions
Methods Dim MRR Hits@10 Hits@1
Origin 32d 0.463 0.547 0.416
Teachers 64d 0.487 0.581 0.440
Junior 32d 0.481 0.574 0.433
Growth -1.23% -1.20% -1.59%
Teachers 128d 0.488 0.582 0.437
Junior 32d 0.480 0.572 0.430
Growth -1.64% -1.71% -1.60%
Teachers 256d 0.483 0.578 0.434
Junior 32d 0.473 0.569 0.424
Growth -2.07% -1.60% -2.30%
Teachers 512d 0.479 0.584 0.423
Junior 32d 0.469 0.574 0.412
Growth -2.09% -1.71% -2.60%
(b) Various Teacher Dimensions
Table 3. The improvements of MulDE-RotH on WN18RR with different teacher and student dimensions. Bold numbers are the better RotH results at the same dimension, while Growth is the rate of increase between the two results in the same region.

5.2. Link Prediction Task

We first evaluate our model in the 32-dimensional vector space, the same low-dimensional setting as Chami et al. (Chami et al., 2020). ‘MulDE-modelH’ denotes the model whose Junior is a 32-dimensional hyperbolic model and whose four teachers, TransH, DistH, RotH and RefH, are pre-trained in the 64-dimensional space. We compare our distilled Junior models with their original results, and also contrast them with SotA models in both low and high dimensions. The experimental results are shown in Table 2. From the table, we can make the following observations:

The four different Junior models trained by MulDE significantly outperform their original performance on the two datasets. The MRR and Hits@10 of all four models improve by 5% on average. In particular, the Hits@10 of RotH increases from 0.547 to 0.574 on WN18RR, and the Hits@1 of DistH improves from 0.202 to 0.235 on FB15k-237. The results illustrate the effectiveness of knowledge distillation for low-dimensional embedding.

Compared with previous low-dimensional models, ‘MulDE-RotH’ achieves SotA results in all metrics on the two datasets. Although the AttH model combines both RotH and RefH, it is weaker than our second-best model, ‘MulDE-RefH’, in 5 metrics. Besides, although the TransH model only uses a simple score function, it exceeds all previous low-dimensional models on the FB15k-237 dataset after knowledge distillation training.

Compared with SotA high-dimensional models, ‘MulDE-RotH’ exceeds multiple 200-dimensional models, including TransE, DistMult, ComplEx, and ConvE, on the two datasets. Although its performance is lower than that of some of the latest models, ‘MulDE-RotH’ shows strong competitiveness in some metrics. Notably, its Hits@10 on WN18RR outperforms that of most SotA models except QuatE and RotH (d=500).

Methods          FB15k-237                 WN18RR
                 MRR    Hits@10  Hits@1    MRR    Hits@10  Hits@1
MulDE            0.328  0.515    0.237     0.481  0.574    0.433
MulDE w/o TopK   0.321  0.502    0.231     0.456  0.534    0.413
MulDE w/o RS     0.325  0.513    0.234     0.476  0.571    0.430
MulDE w/o CA     0.325  0.511    0.233     0.481  0.573    0.422
KD512d w/ TopK   0.322  0.502    0.229     0.469  0.564    0.421
KD64d w/ TopK    0.324  0.506    0.238     0.467  0.553    0.425
KD64d            0.321  0.498    0.230     0.459  0.540    0.424
Table 4. The results of ablation experiments on WN18RR and FB15k-237. The student model of all variants is RotH. ‘TopK’ refers to the iterative distilling strategy, which replaces random candidates with top-K candidates; ‘RS’ and ‘CA’ refer to the two mechanisms in the Senior. ‘KD512d’ and ‘KD64d’ are two single-teacher models with only a 512d or 64d RotH model as the teacher.

On the whole, we can conclude that the MulDE framework can successfully improve the low-dimensional hyperbolic models. The performance of those distilled models is even comparable to some high-dimensional models.

5.3. Results of Different Dimensions

The student dimension of 32 and teacher dimension of 64 are determined according to previous work and our theoretical analysis. We further evaluate the MulDE framework with other dimension settings. For the student embedding dimension, we still focus on low-dimensional models with dimensions below 100, and compare distilled models with the original ones. For the teacher embedding dimension, we select multiple sizes above 64, namely 128, 256, and 512. Then, we measure the performance gap between the 32-dimensional student and the different high-dimensional teachers. The experimental results are shown in Table 3.

The performance of different low-dimensional student models is shown in Table 3 (a). At first, the results show that the accuracy of the original models drops noticeably as the embedding dimension decreases. Although the distilled Junior models outperform their original accuracy, the lower-dimensional models still produce relatively poor results. Besides, the ‘growth’ metrics show that MulDE contributes larger improvements to lower-dimensional models: the MRR and Hits@1 of 8d RotH increase by more than 90% and 150%, respectively, which is much more significant than for 64d RotH. Furthermore, the performance of the 64d distilled models is very close to that of the 32-dimensional ones, which suggests we can utilize 32d models in applications to save more storage, and even 8d models for faster interactions.
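The storage claim can be made concrete with a quick estimate. Assuming float32 entity embedding tables and the standard entity counts of the two benchmarks (40,943 for WN18RR and 14,541 for FB15k-237), a short sketch:

```python
# Rough storage estimate for entity embedding tables (float32, 4 bytes per value).
# Entity counts are the standard statistics of the two benchmark datasets.
def embedding_storage_mb(num_entities: int, dim: int, bytes_per_float: int = 4) -> float:
    return num_entities * dim * bytes_per_float / (1024 ** 2)

for dim in (8, 32, 64, 512):
    wn = embedding_storage_mb(40_943, dim)   # WN18RR entities
    fb = embedding_storage_mb(14_541, dim)   # FB15k-237 entities
    print(f"d={dim:>3}: WN18RR {wn:6.2f} MB, FB15k-237 {fb:6.2f} MB")
```

A 32d table is exactly 16x smaller than a 512d one, independent of entity count, which is the factor behind the storage argument above.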

The results in Table 3 (b) support our theoretical analysis to some extent. First, the performance of the teacher ensemble is similar across different high dimensions; the Hits@10 of the 256d teachers is even worse than that of the 64d ones. This illustrates that once the embedding dimension exceeds a lower bound, the capacity of the model tends to be stable. Besides, as the teacher dimension increases, the MRR and Hits@1 of the Junior model decrease. This indicates that a teacher with an excessively high dimension finds it harder to transfer knowledge to a low-dimensional student. Therefore, the experimental results motivate us to apply 64-dimensional teachers, which achieve higher student performance and save more pre-training costs.

5.4. Ablation studies about Distillation Strategy

To analyze the benefits of our MulDE framework, we conduct a series of ablation experiments to evaluate the different modules. Compared with the conventional knowledge distillation strategy, there are three main improvements in MulDE: the iterative distilling strategy (TopK), the relation-specific scaling mechanism (RS), and the contrast attention mechanism (CA). Therefore, we test the performance of MulDE with each of the three modules removed in turn. Besides that, we compare MulDE with single-teacher distillation, in which the teacher is the same KGE model as the student with a dimension of 64 or 512. To make the comparison stricter, we apply the iterative distilling strategy in these single-teacher models instead of random candidate labels. The experimental results are shown in Table 4.

First, for the iterative distilling strategy, it is clear that the performance of MulDE decreases dramatically when this module is removed. In particular, the Hits@10 on WN18RR drops from 0.574 to 0.534 (-7.0%). This proves the necessity of using top-K labels when applying knowledge distillation to link prediction tasks. In contrast, although the two mechanisms are helpful, the contributions of RS and CA are relatively small. Further improving the Senior component is left for future work.
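The idea behind the top-K soft labels can be sketched in a few lines. The following is a simplified illustration (not the exact MulDE objective): the student matches the teacher's temperature-softened distribution only over the K candidates the teacher ranks highest, rather than over random candidates.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def topk_distill_loss(teacher_scores, student_scores, k=3, temperature=2.0):
    """KL divergence between teacher and student soft labels, restricted
    to the teacher's top-k candidate entities. A simplified sketch, not
    the exact MulDE objective."""
    topk = np.argsort(teacher_scores)[-k:]           # indices of the top-k candidates
    p = softmax(teacher_scores[topk] / temperature)  # teacher soft labels
    q = softmax(student_scores[topk] / temperature)  # student soft labels
    return float(np.sum(p * (np.log(p) - np.log(q))))

teacher = np.array([5.0, 1.0, 4.5, 0.2, 3.9])
aligned = np.array([4.8, 0.9, 4.4, 0.1, 3.7])   # student close to the teacher
shuffled = np.array([0.1, 4.8, 0.2, 4.4, 0.3])  # student disagrees on the top candidates
print(topk_distill_loss(teacher, aligned), topk_distill_loss(teacher, shuffled))
```

A student that agrees with the teacher on the top-ranked candidates incurs a much smaller loss than one that disagrees, so the gradient concentrates on the candidates that matter for ranking.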

Single-teacher knowledge distillation can improve the original RotH model, but its performance is poorer than MulDE's, which is why we propose a novel multi-teacher distillation framework for KGE models. The iterative distilling strategy also positively affects the single-teacher framework, increasing Hits@10 on WN18RR by around 2%. Besides, a higher-dimensional teacher (512d) also has an enhancement effect, but its training cost is unavoidably larger than MulDE's.

Overall, the experimental results indicate the effectiveness of three major modules in MulDE. Compared with the single-teacher strategy, our framework shows apparent improvements. In the next section, we further discuss the influence of different teacher combinations.

Figure 3. The Hits@1 of 32-dimensional RotH as training proceeds on WN18RR dataset. ‘RotH’ is the original training mode, ‘KD’ is the single-teacher KD process, ‘MulDE’ and ‘MulDE-CA’ are our models with and without contrast attention.
Figure 4. The bar charts for the 32-dimensional student with different variants. (a) The MRR on the WN18RR dataset with different settings of vector space. (b) The Hits@10 on the FB15k-237 dataset with different teacher combinations. ‘T’, ‘D’, ‘R1’, and ‘R2’ refer to TransH, DistH, RotH, and RefH, respectively.

6. Discussions

In this section, we discuss several noteworthy questions about MulDE.

Q1: How much do the distillation strategies accelerate student training? Fig. 3 shows the convergence of 32-dimensional RotH under different training processes. As expected, the student model trains faster under the supervision of teachers. We can observe that MulDE converges faster than the original RotH and achieves higher accuracy. In contrast, the single-teacher KD model is much slower and continues to increase before epoch 100. We also analyze the convergence of MulDE without the contrast attention mechanism, i.e., MulDE-CA. Benefiting from contrast attention, the Hits@1 of MulDE in the early period (epoch=10) is around 3% higher than that of MulDE-CA, and it also raises the final accuracy from 0.422 to 0.433. This proves the effectiveness of contrast attention, which can reduce the gap between student scores and teacher scores in the early period.
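One way to picture how attention over multiple teachers can pull the student toward useful soft labels early in training is the following hypothetical sketch. The weighting rule here (a softmax over each teacher's negative KL divergence from the current student) is our illustrative assumption, not the exact contrast attention mechanism of MulDE:

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def attended_soft_labels(teacher_scores_list, student_scores):
    """Blend several teachers' soft labels with attention weights.
    Hypothetical sketch: weights come from each teacher's agreement
    (negative KL divergence) with the current student distribution."""
    q = softmax(student_scores)
    teacher_dists = [softmax(t) for t in teacher_scores_list]
    agreements = np.array(
        [-np.sum(p * (np.log(p) - np.log(q))) for p in teacher_dists]
    )
    weights = softmax(agreements)
    blended = sum(w * p for w, p in zip(weights, teacher_dists))
    return weights, blended

teachers = [np.array([2.0, 0.5, 1.5]), np.array([0.1, 2.5, 0.2])]
student = np.array([1.8, 0.4, 1.4])  # close to the first teacher
w, soft = attended_soft_labels(teachers, student)
print(w)  # the first teacher receives the larger weight
```

Under this weighting, teachers whose score distributions are already close to the student's dominate the blended label, which keeps the early-training gap between student and target small.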

Q2: Whether and how much does hyperbolic space contribute to the results? We evaluate the MulDE framework by employing hyperbolic KGE models because of their performance in low dimensions. To answer this question, we analyze the contributions of hyperbolic models when they serve as teachers and as the student, and also verify the effectiveness of MulDE in Euclidean space. The results are shown in Fig. 4 (a). ‘teaH’ and ‘stuH’ mean using hyperbolic models, while ‘teaE’ and ‘stuE’ mean using the corresponding Euclidean models. The ‘teaH_stuH’ model is equivalent to MulDE-RotH, and the other variants assign different space types to teachers and students. Among the four models, ‘teaE_stuE’ is weaker than the other three, which indicates the advantage of hyperbolic space. Even so, its 0.466 Hits@1 is still equal to that of the previous AttH. Considering the middle two models, we conjecture that using hyperbolic teachers is more effective than using a hyperbolic student.
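The low-dimensional advantage of the hyperbolic family stems from the geometry of the Poincaré ball. Below is a minimal sketch of its standard distance function at curvature -1; the actual RotH score additionally applies relation rotations and a trainable curvature:

```python
import math

def poincare_distance(u, v):
    """Geodesic distance in the Poincare ball (curvature -1).
    u and v are points with Euclidean norm strictly below 1."""
    diff2 = sum((a - b) ** 2 for a, b in zip(u, v))
    nu2 = sum(a * a for a in u)
    nv2 = sum(b * b for b in v)
    arg = 1.0 + 2.0 * diff2 / ((1.0 - nu2) * (1.0 - nv2))
    return math.acosh(arg)

# The same Euclidean gap is "longer" hyperbolically near the boundary,
# which gives low-dimensional hyperbolic models extra room for hierarchies.
print(poincare_distance((0.0, 0.0), (0.1, 0.0)))   # near the origin
print(poincare_distance((0.8, 0.0), (0.9, 0.0)))   # near the boundary
```

Distances blow up toward the rim of the ball, so tree-like structures that need exponentially many well-separated points fit in far fewer dimensions than in Euclidean space.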

Q3: How do different teacher choices contribute to the results? The complete MulDE framework utilizes all four hyperbolic models as teachers, because the whole ensemble obtains better accuracy than the other combinations. We further analyze the contribution of each teacher in the ensemble. As shown in Fig. 4 (b), MulDE outperforms the other four kinds of teacher combinations. In terms of using two teachers, using only TransH and DistH is clearly poorer than using RotH and RefH, which indicates the importance of the latter models. Comparing the two models using three teachers, we find that RotH contributes more than RefH. This is reasonable because the accuracy of the original RotH is already higher than RefH's. The experimental results prove the contribution of every teacher model. A better teacher combination may be found by adding other models, which will be our future work.

7. Related Work

In this section, we discuss recent research improvements in the KGE domain, and introduce relevant Knowledge Distillation research.

Predicting new triples using KGE methods has been an active research topic over the past few years. Benefiting from simple calculation and good performance, early methods such as TransE (Bordes et al., 2013), DistMult (Yang et al., 2015), and ComplEx (Trouillon et al., 2016) have been widely used in various AI tasks. With the rise of deep learning, several CNN-based methods have been proposed, such as ConvE (Dettmers et al., 2018) and ConvKB (Nguyen et al., 2017). These methods perform well on link prediction tasks, but depend heavily on large numbers of parameters. Recently, several non-neural methods have been proposed. RotatE (Sun et al., 2019), inspired by Euler's identity, can infer various relation patterns with a new rotation-based score function. QuatE (Zhang et al., 2019) and OTE (Tang et al., 2020) further improve the RotatE method: the former introduces more expressive hypercomplex-valued representations in the quaternion space, while the latter extends RotatE from the 2-D complex domain to a high-dimensional space with orthogonal transforms to model relations.
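The rotation idea behind RotatE can be sketched compactly: each relation coordinate is a unit-modulus complex rotation, and a triple is scored by how close the rotated head lands on the tail. A simplified sketch of the score function (the published model also involves negative sampling and a margin):

```python
import cmath

def rotate_score(head, relation_phases, tail):
    """Simplified RotatE score: negative L1 distance between the rotated
    head and the tail in complex space. Each relation coordinate is a
    unit-modulus rotation e^{i*phase}, so it preserves vector norms."""
    rotated = [h * cmath.exp(1j * phi) for h, phi in zip(head, relation_phases)]
    return -sum(abs(r - t) for r, t in zip(rotated, tail))

head = [1 + 0j, 0 + 1j]
relation = [cmath.pi / 2, cmath.pi]   # rotate by 90 and 180 degrees
true_tail = [1j, -1j]                 # exactly where the rotation lands
wrong_tail = [1 + 0j, 1 + 0j]
print(rotate_score(head, relation, true_tail))   # near 0, the best possible score
print(rotate_score(head, relation, wrong_tail))  # strictly more negative
```

Because composing two rotations is again a rotation, this family can model composition, symmetry (phase 0 or pi), and inversion (negated phases) of relations.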

Research related to this issue is relatively recent and includes two representative approaches. One possible solution is compressing pre-trained high-dimensional models. Sachan (Sachan, 2020) utilized embedding compression methods to convert high-dimensional continuous vectors into discrete codes. Although this generates a compressed model retaining ‘high-accuracy’ knowledge, a pre-trained high-dimensional model is still necessary. Furthermore, the compressed model with discrete vectors cannot continue training when the KG is modified. Another solution is introducing new theories to improve low-dimensional KGE models directly. For example, Chami et al. (Chami et al., 2020) introduced a hyperbolic embedding space with trainable curvature, and proposed a class of hyperbolic KGE models outperforming previous Euclidean-based methods in low dimensions. However, the limited number of parameters inevitably degrades model performance, and these small models cannot utilize the ‘high-accuracy’ knowledge from high-dimensional models.

Knowledge Distillation (KD) aims to transfer ‘knowledge’ from one machine learning model (i.e., the teacher) to another (i.e., the student). Hinton et al. (Hinton et al., 2015) introduced the first KD framework, which applies the classification probabilities of a trained model as ‘soft labels’ and defines a parameter called ‘Temperature’ to control the ‘soft’ degree of those labels. Inspired by this, several KD-based approaches have been proposed in different research domains. Furlanello et al. (Furlanello et al., 2018) proposed Born-Again Networks, in which the student is parameterized identically to its teacher. Yang et al. (Yang et al., 2019) added an additional loss term to facilitate a few secondary classes to emerge and complement the primary class. Li et al. (Li et al., 2019) distilled human knowledge from a teacher model to enhance the pedestrian attribute recognition task. To the best of our knowledge, our work is the first to apply KD techniques to link prediction in KGs.

8. Conclusion

Recent KGE models tend to apply high-dimensional embedding vectors to improve performance, which makes them hard to deploy in practical applications due to heavy training costs and memory consumption. In this paper, we theoretically analyze the relationship between the embedding dimension and the KG scale, and prove that a low-dimensional vector space has the capacity to represent normal-sized KGs. Then, we propose a novel multi-teacher knowledge distillation framework for knowledge graph embeddings, named MulDE. Utilizing multiple hyperbolic KGE models as teachers, we present a novel iterative distillation strategy to extract high-accuracy knowledge for the low-dimensional student adaptively. The experimental results show that the RotH model distilled by MulDE outperforms SotA low-dimensional models on two commonly-used datasets. Compared with general single-teacher KD methods, MulDE is also able to accelerate student training.

These positive results encourage us to explore the following further research activities in the future:

  • Based on Theorem 1, we will further investigate the lower bound of the embedding dimension in both Euclidean and hyperbolic space, and explore the influence of different relation transformations.

  • To improve the knowledge quality from teachers, we will further improve the knowledge integration in the Senior component, and achieve higher-accuracy soft labels in low dimensions.

  • Regarding the choice of multiple teachers, we will analyze in depth the reason for the increased accuracy of the ensemble, and explore new teacher combinations by incorporating other relation transformations.


  • F. Akrami, M. S. Saeef, Q. Zhang, W. Hu, and C. Li (2020) Realistic re-evaluation of knowledge graph completion methods: an experimental study. In Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, pp. 1995–2010. Cited by: §3.3.
  • C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann (2009) DBpedia - A crystallization point for the web of data. J. Web Semant. 7 (3), pp. 154–165. Cited by: §1.
  • A. Bordes, X. Glorot, J. Weston, and Y. Bengio (2014) A semantic matching energy function for learning with multi-relational data. Machine Learning 94, pp. 233–259. Cited by: §5.1.
  • A. Bordes, N. Usunier, A. García-Durán, J. Weston, and O. Yakhnenko (2013) Translating embeddings for modeling multi-relational data. In Proceedings of Advances in Neural Information Processing Systems (NIPS 2013), pp. 2787–2795. Cited by: §3.1, §7.
  • T. T. Cai, J. Fan, and T. Jiang (2013) Distributions of angles in random packing on spheres. Journal of machine learning research : JMLR 14 (1), pp. 1837–1864. Cited by: Appendix A.
  • I. Chami, A. Wolf, D. Juan, F. Sala, S. Ravi, and C. Ré (2020) Low-dimensional hyperbolic knowledge graph embeddings. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pp. 6901–6914. Cited by: §1, §1, §3.1, §5.1, §5.2, §7.
  • F. Che, D. Zhang, J. Tao, M. Niu, and B. Zhao (2020) ParamE: regarding neural network parameters as relation embeddings for knowledge graph completion. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, New York, NY, USA, February 7-12, pp. 2774–2781. Cited by: §1.
  • T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel (2018) Convolutional 2d knowledge graph embeddings. In Proceedings of the 32th AAAI Conference on Artificial Intelligence, pp. 1811–1818. Cited by: §7.
  • T. Furlanello, Z. C. Lipton, M. Tschannen, L. Itti, and A. Anandkumar (2018) Born-again neural networks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, pp. 1602–1611. Cited by: §7.
  • G. E. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. CoRR abs/1503.02531. Cited by: §1, §2, §7.
  • D. P. Kingma and J. L. Ba (2015) Adam: a method for stochastic optimization. In Proceedings of International Conference on Learning Representations (ICLR 2015), Cited by: §3.4.
  • Q. Li, X. Zhao, R. He, and K. Huang (2019) Pedestrian attribute recognition by joint visual-semantic reasoning and knowledge distillation. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, pp. 833–839. Cited by: §7.
  • G. A. Miller (1995) WORDNET: a lexical database for english. Communications of the ACM 38, pp. 39–41. Cited by: §5.1.
  • S. Mirzadeh, M. Farajtabar, A. Li, N. Levine, A. Matsukawa, and H. Ghasemzadeh (2020) Improved knowledge distillation via teacher assistant. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, Cited by: 3rd item.
  • D. Q. Nguyen, T. D. Nguyen, D. Q. Nguyen, and D. Q. Phung (2017) A novel embedding model for knowledge base completion based on convolutional neural network. In Proceedings of the 2017 Conference of the North American Chapter of the Association for Computational Linguistics, Vol. 2, pp. 327–333. Cited by: §7.
  • H. Paulheim (2018) How much is a triple? Estimating the cost of knowledge graph creation. In Proceedings of the 17th International Semantic Web Conference (ISWC 2018). Cited by: §1.
  • M. Sachan (2020) Knowledge graph embedding compression. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pp. 2681–2691. Cited by: §1, §7.
  • J. Su (2020) Principle of minimum entropy (VI): how to choose the dimension of word vector? Blog post, retrieved Aug 20, 2020. Cited by: Appendix A, §4.
  • F. M. Suchanek, G. Kasneci, and G. Weikum (2007) Yago: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, WWW 2007, Banff, Alberta, Canada, May 8-12, 2007, pp. 697–706. Cited by: §1.
  • Z. Sun, Z. Deng, J. Nie, and J. Tang (2019) RotatE: knowledge graph embedding by relational rotation in complex space. In 7th International Conference on Learning Representations, ICLR 2019, Cited by: §7.
  • Y. Tang, J. Huang, G. Wang, X. He, and B. Zhou (2020) Orthogonal relation transforms with graph context modeling for knowledge graph embedding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pp. 2713–2722. Cited by: §7.
  • K. Toutanova and D. Chen (2015) Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pp. 57–66. Cited by: §5.1.
  • T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard (2016) Complex embeddings for simple link prediction. In Proceedings of the 33th International Conference on Machine Learning, pp. 2071–2080. Cited by: §3.1, §7.
  • X. Wang, Y. Xu, X. He, Y. Cao, M. Wang, and T. Chua (2020) Reinforced negative sampling over knowledge graph for recommendation. In WWW ’20: The Web Conference 2020, Taipei, Taiwan, April 20-24, pp. 99–109. Cited by: §1.
  • B. Yang, W. Yih, X. He, J. Gao, and L. Deng (2015) Embedding entities and relations for learning and inference in knowledge bases. In 3rd International Conference on Learning Representations, ICLR 2015, Cited by: §3.1, §7.
  • C. Yang, L. Xie, S. Qiao, and A. L. Yuille (2019) Training deep neural networks in generations: A more tolerant teacher educates better students. In Proceedings of The Thirty-Third AAAI Conference on Artificial Intelligence, 2019, pp. 5628–5635. Cited by: §7.
  • N. Zhang, S. Deng, Z. Sun, J. Chen, W. Zhang, and H. Chen (2020) Relation adversarial network for low resource knowledge graph completion. In WWW ’20: The Web Conference 2020, Taipei, Taiwan, April 20-24, pp. 1–12. Cited by: §1.
  • S. Zhang, Y. Tay, L. Yao, and Q. Liu (2019) Quaternion knowledge graph embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP-IJCNLP 2019). Cited by: §7.
  • Q. Zhu, X. Zhou, J. Wu, J. Tan, and L. Guo (2019) Neighborhood-aware attentional representation for multilingual knowledge graphs. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, pp. 1943–1949. Cited by: §1.

Appendix A. The Proof of Theorem 1

First, the information entropy of the KG is:

H_KG = -∑_{(q,a)} p(q,a) ln p(q,a),

where each (query, target) pair (q,a) corresponds to one fact. According to Eq. 14, the information entropy of the KGE model is:

H_Model = -E_{(q,a)} [ ln ( e^{-d(q,a)} / ∑_{a'} e^{-d(q,a')} ) ],

where d(·,·) is the distance-based score. Then, we use a sampling approximation to estimate the partition term as follows:

ln ∑_{a'} e^{-d(q,a')} ≈ ln ( N · E_{u,v}[ e^{-d(u,v)} ] ),

where N is the total number of all possible (query, target) pairs. So we have an approximate H_Model:

H_Model ≈ ln N + E_{(q,a)}[ d(q,a) ] + ln E_{u,v}[ e^{-d(u,v)} ].

This problem depends on the approximate solution to E_{u,v}[d(u,v)], which is the expectation of the distance between any two embedding vectors.

According to experimental observation, the elements of the embedding vectors follow a uniform distribution. Here, we assume that the absolute value of each element is about 1, so the length of each n-dimensional embedding vector is approximately √n. Therefore, all embedding vectors are approximately uniformly distributed on an n-dimensional hypersphere with radius √n, and we have:

d(u,v) = √( ‖u‖² + ‖v‖² − 2⟨u,v⟩ ) = √( 2n(1 − cos θ) ),

where θ is the intersection angle of the two embedding vectors. The distribution of the intersection angle between any two vectors in n-dimensional space is known (Cai et al., 2013); its probability density function is:

p_n(θ) = Γ(n/2) / ( √π Γ((n−1)/2) ) · sin^{n−2} θ,  θ ∈ [0, π].

Setting x = cos θ, the density of x is:

p(x) = Γ(n/2) / ( √π Γ((n−1)/2) ) · (1 − x²)^{(n−3)/2},  x ∈ [−1, 1].

Hence, we can compute E_{u,v}[d(u,v)] as:

E_{u,v}[d(u,v)] = ∫_{−1}^{1} √( 2n(1 − x) ) p(x) dx.

Similarly, we can compute the other expectation parts in H_Model, and obtain the final approximate solution of H_Model as a function H_Model(n) of the dimension n.

To make sure the model has enough space to encode a KG, a sufficient condition is H_Model ≥ H_KG. As the information entropy H_KG is hard to compute exactly, we relax the condition to H_Model(n) ≈ H_KG to get an approximate estimation of n.

Note that, if there were no sampling error when computing H_Model, the gap H_Model − H_KG should always be greater than 0. Following the analysis in (Su, 2020), the sampling error arises because a small number of samples cannot estimate the n-dimensional model when n is large enough. In other words, if H_Model(n) ≥ H_KG, the model has an appropriate n to accommodate the KG.

Finally, given a fixed N and distance function, we can obtain an approximate lower bound of n by numerical calculation. In the calculation, we find that this lower bound is approximately linearly dependent on ln N. Hence, we can set a constant to approximate it:

n ≥ α · ln N,

where we define the accommodation constant α.
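The distance expectation used in the proof can be checked numerically. Assuming embedding vectors uniformly distributed on the hypersphere of radius √n, the mean pairwise distance concentrates near √(2n) as n grows; a small Monte-Carlo sanity-check sketch (not part of the formal proof):

```python
import math
import random

def random_sphere_point(n, rng):
    """Uniform point on the n-dimensional hypersphere of radius sqrt(n),
    sampled by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    scale = math.sqrt(n) / norm
    return [x * scale for x in v]

def mean_pairwise_distance(n, samples=2000, seed=0):
    """Monte-Carlo estimate of E[d(u, v)] for independent uniform points."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        u = random_sphere_point(n, rng)
        v = random_sphere_point(n, rng)
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return total / samples

n = 64
print(mean_pairwise_distance(n), math.sqrt(2 * n))  # the two values are close
```

The agreement reflects the concentration of cos θ around 0 in high dimensions, which is what makes the closed-form approximation of the expectation usable.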