1. Introduction
Knowledge graphs (KGs), which describe human knowledge as factual triples in the form of (head entity, relation, tail entity), have shown great potential in various domains (Zhu et al., 2019; Wang et al., 2020). Popular real-world KGs, such as Yago (Suchanek et al., 2007) and DBpedia (Bizer et al., 2009), typically contain tens of millions of entities yet are still far from complete. Because of the high manual cost of discovering new triples (Paulheim, 2018), link prediction based on knowledge graph embedding (KGE) draws considerable attention (Che et al., 2020; Zhang et al., 2020). Given the entity and relation of a triple (we call this an 'er query'), a typical KGE model first represents entities and relations as trainable continuous vectors. It then defines a score function to measure each candidate in the entity set and outputs the best one. To achieve higher prediction accuracy, recent KGE models generally use high-dimensional embedding vectors of up to 200 or 500 dimensions. However, when a KG has millions or billions of entities, a high-dimensional model requires enormous training costs and storage memory. This prevents downstream AI applications from updating KG embeddings promptly or being deployed on mobile devices such as cell phones. Instead of blindly increasing the embedding dimension, several studies draw attention to improving low-dimensional models (such as 8 or 32 dimensions) (Chami et al., 2020), or to compressing pretrained high-dimensional models (Sachan, 2020). However, the former cannot utilize the 'high-accuracy' knowledge of high-dimensional models, while the latter suffers from heavy pretraining costs and cannot continue training when the KG is modified.
Can we transfer 'high-accuracy' knowledge to a small model while avoiding training high-dimensional models? We are inspired by a preliminary experiment using different hyperbolic KGE models (Chami et al., 2020). As shown in Fig. 1, when the embedding dimension exceeds 64, the model performance improves only slightly or starts to fluctuate. Moreover, a 64-dimensional ensemble is already better than any higher-dimensional model, so it can act as a lightweight source of 'high-accuracy' knowledge. To this end, we employ different low-dimensional hyperbolic models as multiple teachers and train a smaller KGE model in a Knowledge Distillation (KD) process. Knowledge Distillation (Hinton et al., 2015) is a technique that distills 'high-accuracy' knowledge from a pretrained big model (teacher) to train a small one (student). It has rarely been applied in the knowledge graph domain.
Compared with a single highdimensional teacher, we argue that there are at least three benefits of utilizing multiple lowdimensional teachers:

Reduce pretraining costs. The total number of parameters of multiple low-dimensional teachers is lower than that of a high-dimensional model. Besides, the pretraining of the former can be further accelerated by parallel techniques.

Guarantee teacher performance. By integrating the results of multiple models, prediction errors can correct each other, so that the ensemble's accuracy exceeds that of some high-dimensional models.

Improve the distilling effect. Recent knowledge distillation studies show that a student model acquires knowledge more easily from a teacher of similar size (Mirzadeh et al.). A teacher dimension close to the lower bound is therefore more suitable than hundreds of dimensions.
In this paper, we first theoretically analyze the capacity of low-dimensional spaces for KG embeddings based on the principle of minimum entropy, and estimate an approximate lower bound on the embedding dimension for KGs of different scales. Guided by this theory, we pretrain multiple hyperbolic KGE models in low dimensions. Then, we propose a novel framework utilizing Multi-teacher knowledge Distillation for knowledge graph Embeddings, named MulDE. Considering the difference between classification problems and the link prediction task, we propose a novel iterative distillation strategy. There are two student components in MulDE, Junior and Senior. In each iteration, the Junior component first selects the top K candidates for each er query and transfers them to the multiple teachers. Then, the Senior component integrates the teacher results and transfers soft labels to the Junior. Therefore, instead of teachers passing on knowledge in one direction, the student can actively seek knowledge from the teachers according to the currently indistinguishable entities. We conduct experiments on two commonly used datasets, FB15k-237 and WN18RR. The results show that MulDE significantly improves the performance of low-dimensional KGE models. The distilled RotH model outperforms the SotA low-dimensional models and is comparable to some high-dimensional ones. Compared with conventional single-teacher KD methods, MulDE accelerates the training process and achieves higher accuracy. Ablation experiments confirm the effectiveness of the iterative distillation strategy and the other major modules in MulDE. We compare different teacher settings and conclude that using four different 64-dimensional hyperbolic models is optimal.
The rest of the paper is organized as follows. Section 2 briefly introduces the background and preliminaries related to our work. Section 3 details the whole framework and basic components of the MulDE model. Section 4 carries out the theoretical analysis of the embedding dimension. Section 5 reports the experimental studies, and Section 6 further discusses the experimental investigations. We overview the related work in Section 7 and, finally, give some concluding remarks in Section 8.
2. Preliminaries
This section introduces some definitions and notations used throughout the paper.
Knowledge Graph Embeddings. Let $\mathcal{E}$ and $\mathcal{R}$ denote the sets of entities and relations; a knowledge graph (KG) $\mathcal{G}$ is a collection of factual triples $(h, r, t)$ where $h, t \in \mathcal{E}$ and $r \in \mathcal{R}$. $|\mathcal{E}|$ and $|\mathcal{R}|$ refer to the numbers of entities and relations, respectively. A KGE model represents each entity or relation as a $d$-dimensional continuous vector $\mathbf{e} \in \mathbb{R}^d$ or $\mathbf{r} \in \mathbb{R}^d$, and learns a scoring function $f(h, r, t)$ to ensure that plausible triples receive higher scores. Table 2.1 displays the scoring functions and loss functions of six popular KGE models. The details of those models are introduced in Section 7.
Link Prediction. Generalized link prediction tasks include entity prediction and relation prediction. In this paper, we focus on the more challenging entity prediction task. Given an er query $(h, r)$, the typical link prediction aims to predict the missing entity $t$ and output the predicted triple. This task is cast as a learning-to-rank problem. A KGE model outputs the scores of all candidate triples, denoted as $\{f(h, r, t') \mid t' \in \mathcal{E}\}$, in which candidates with higher scores are more likely to be true.
Knowledge Distillation. Knowledge Distillation technologies aim to distill the knowledge of a larger deep neural network into a small network (Hinton et al., 2015). The probability distribution over classes output by a teacher model is referred to as the 'soft label', which helps the student model mimic the behaviour of the teacher model. A temperature factor $T$ is introduced to control the importance of each soft label. Specifically, as $T \to 0$, the soft labels become one-hot vectors, i.e., the hard labels. As $T$ increases, the labels become 'softer'. When $T \to \infty$, all classes share the same probability.
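The effect of the temperature can be seen in a small NumPy sketch; the function name `soft_labels` is ours, introduced only for illustration:

```python
import numpy as np

def soft_labels(scores, T):
    """Temperature-scaled softmax: higher T gives a flatter ('softer') distribution."""
    z = np.asarray(scores, dtype=float) / T
    z -= z.max()                 # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

scores = [4.0, 2.0, 1.0]
# As T -> 0, the probability mass collapses onto the top-scored class (hard label);
# as T -> infinity, all classes approach the same probability 1/len(scores).
```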
3. Methodology
The Multi-teacher Distillation Embedding (MulDE) model utilizes multiple pretrained low-dimensional models as teachers, extracts knowledge by contrasting different teacher scores, and then supervises the training process of a low-dimensional student model. Figure 2 illustrates the architecture of the MulDE framework, including the multiple teachers, the Junior component, and the Senior component.

Multiple teachers serve as the data source of prediction sequences and have no parameter updates during training. We employ four different hyperbolic KGE models as teachers. Details about the pretrained teachers are described in Section 3.1.

Junior component is the target low-dimensional KGE model. Given an er query, the Junior sends its top K predicted entities to the teachers and receives the corresponding 'soft labels' from the Senior component. We detail the Junior component in Section 3.2.

Senior component acquires knowledge from the teacher models directly, and then generates soft labels through two mechanisms: a relation-specific scaling mechanism and a contrast attention mechanism. The details of the Senior component are discussed in Section 3.3.
As shown in Fig. 2, unlike traditional one-way guidance, the iterative distilling strategy in MulDE forms a novel circular interaction between the students and teachers. In each iteration, the Junior component makes a preliminary prediction for an er query and selects the indistinguishable entities (top K) to ask the multiple teachers about. In this way, the Junior can effectively correct its preliminary prediction results and outperform other models at the same dimension level. Meanwhile, rather than processing fixed teacher scores all the time, the Senior can continually adjust its parameters according to the Junior's feedback, and adaptively produce soft labels according to the training epoch and the Junior's performance. The learning procedure of MulDE is detailed in Section 3.4.
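The circular interaction above can be sketched at a high level as follows. All names are hypothetical, and the callables are stand-ins for the actual Junior, teacher, and Senior components rather than the paper's implementation:

```python
import numpy as np

def distill_step(junior_score_fn, teacher_score_fns, senior_integrate,
                 query, entities, K):
    """One MulDE-style iteration for a single er query: the Junior picks its
    top-K candidates, the teachers score only those candidates, and the Senior
    integrates the teacher scores into soft labels for the Junior."""
    junior_scores = np.array([junior_score_fn(query, e) for e in entities])
    top_k = np.argsort(-junior_scores)[:K]        # currently indistinguishable candidates
    teacher_scores = [np.array([t(query, entities[i]) for i in top_k])
                      for t in teacher_score_fns]
    soft_labels = senior_integrate(junior_scores[top_k], teacher_scores)
    return top_k, soft_labels
```

The key point of the sketch is that the teachers are queried only on the Junior's top-K candidates, not on the whole entity set.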
3.1. Pretrained Teacher Models
To ensure the performance of the low-dimensional teachers, we employ a group of hyperbolic KGE models proposed by Chami et al. (Chami et al., 2020). Building on that work, we further add new models to this group. In this section, we describe these models briefly.
Taking the RotH model as an example, it uses a $d$-dimensional Poincaré ball with trainable negative curvature. Embedding vectors are first mapped into this hyperbolic space, and a relation vector is regarded as a rotation transformation of entity vectors. Then, a hyperbolic distance function measures the difference between the transformed head vector and a candidate tail vector. Similarly, Chami et al. propose TransH and RefH. The former uses Möbius addition to imitate TransE (Bordes et al., 2013) in hyperbolic space, while the latter replaces the rotation transformation with a reflection.
Considering the effectiveness of previous models using inner-product transformations, such as DistMult (Yang et al., 2015) and ComplEx (Trouillon et al., 2016), we add another hyperbolic model named 'DistH'. The scoring functions of these models are as follows:
(1) $d_c(\mathbf{x}, \mathbf{y}) = \frac{2}{\sqrt{c}} \tanh^{-1}\!\big(\sqrt{c}\,\|{-\mathbf{x}} \oplus_c \mathbf{y}\|\big)$

(2) $f_{TransH}(h, r, t) = -d_c(\mathbf{e}_h \oplus_c \mathbf{e}_r,\ \mathbf{e}_t)^2 + b_h + b_t$

(3) $f_{RotH}(h, r, t) = -d_c(\mathrm{Rot}(\mathbf{e}_r)\,\mathbf{e}_h,\ \mathbf{e}_t)^2 + b_h + b_t$

(4) $f_{RefH}(h, r, t) = -d_c(\mathrm{Ref}(\mathbf{e}_r)\,\mathbf{e}_h,\ \mathbf{e}_t)^2 + b_h + b_t$

(5) $f_{DistH}(h, r, t) = -d_c(\mathrm{diag}(\mathbf{e}_r)\,\mathbf{e}_h,\ \mathbf{e}_t)^2 + b_h + b_t$

where $d_c(\cdot,\cdot)$ is the hyperbolic distance, $\oplus_c$ is the Möbius addition operation, $\mathrm{diag}(\cdot)$ denotes the inner-product (diagonal-matrix) transformation, and $c$ is the space curvature.
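For concreteness, the two hyperbolic operations these scoring functions rely on can be sketched as follows, following the standard Poincaré-ball formulas from Chami et al. The function names are ours, and a production implementation would operate on batched tensors rather than single NumPy vectors:

```python
import numpy as np

def mobius_add(x, y, c):
    """Mobius addition on the Poincare ball with curvature -c (c > 0)."""
    xy = np.dot(x, y)
    x2 = np.dot(x, x)
    y2 = np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + (c ** 2) * x2 * y2
    return num / den

def hyp_distance(x, y, c):
    """Hyperbolic distance: d_c(x, y) = (2/sqrt(c)) * artanh(sqrt(c) * ||(-x) (+)_c y||)."""
    diff = mobius_add(-x, y, c)
    return (2.0 / np.sqrt(c)) * np.arctanh(np.sqrt(c) * np.linalg.norm(diff))
```

Adding the zero vector leaves a point unchanged, and the distance from a point to itself is zero, which are quick sanity checks for the formulas.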
We select pretrained low-dimensional models from this group as teachers $\{T_1, \dots, T_m\}$, where $m$ is the number of teachers. Furthermore, to verify the performance of hyperbolic space in multi-teacher knowledge distillation, we prepare a corresponding Euclidean model 'ModelE' for each hyperbolic model 'ModelH'. We describe the experimental results in Section 5.
3.2. Junior Component
The Junior component employs a general low-dimensional KGE model without special restrictions. The target of MulDE is to train a higher-performance Junior component through a knowledge distillation process, so that the trained Junior model can be used for faster inference instead of the high-dimensional teachers.
Most conventional knowledge distillation methods address classification problems. However, the link prediction task is a learning-to-rank problem, which is usually learned by distinguishing a positive target from random negative samples. A straightforward solution is to fit the teachers' score distributions over the positive and negative samples. We argue that this has at least two drawbacks. First, the teaching task is 'too easy' for multiple teachers: every teacher will output a result close to the hard label without an obvious difference. Second, negative sampling can rarely 'hit' the really indistinguishable entities, so the critical knowledge owned by the teachers is hardly passed on to the student model.
Soft Label Loss. We design a novel distilling strategy for KGE models. Given an er query, the Junior model evaluates all candidate entities in the entity set. Then, it selects the top K candidates with the highest scores $\mathbf{s}^J \in \mathbb{R}^K$. At the beginning of an iteration, this prediction sequence is sent to every teacher model, and the Senior component returns soft labels $\mathbf{s}^S$ after integrating the different teacher outputs. We define the soft label supervision as a Kullback-Leibler (KL) divergence; the loss function is

(6) $\mathcal{L}_{soft} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL}\big(\sigma(\mathbf{s}^S_i)\,\|\,\sigma(\mathbf{s}^J_i)\big)$

where $\sigma$ is the Softmax function and $N$ is the number of er queries.
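A minimal sketch of this soft-label supervision for a single er query follows. The KL direction, with the Senior's distribution as the target, and the function names are our assumptions for illustration:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max()              # numerical stability
    e = np.exp(z)
    return e / e.sum()

def soft_label_loss(junior_scores, senior_scores):
    """KL(senior || junior) over the top-K candidate scores of one er query."""
    p = softmax(senior_scores)   # soft labels produced by the Senior
    q = softmax(junior_scores)   # Junior's current distribution over the top-K
    return float(np.sum(p * (np.log(p) - np.log(q))))
```

The loss is zero when the Junior already matches the Senior's distribution and grows as the two distributions diverge.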
Hard Label Loss. Meanwhile, a hard-label supervision based on conventional negative sampling is employed:

(7) $\mathcal{L}_{hard} = \frac{1}{N} \sum_{i=1}^{N} \ell\big(\mathbf{s}_i, \mathbf{y}_i\big)$

where $\ell$ is a binary cross-entropy loss, $\mathbf{s}_i$ contains the scores of the positive target and the sampled negative ones, and $\mathbf{y}_i$ is a one-hot label vector.
Finally, the Junior loss is formulated as the weighted sum of $\mathcal{L}_{soft}$ and $\mathcal{L}_{hard}$, with a hyperparameter $\alpha$ balancing the two parts:

(8) $\mathcal{L}_{Junior} = \alpha\,\mathcal{L}_{soft} + (1 - \alpha)\,\mathcal{L}_{hard}$
We argue that both supervisions are necessary. The hard-label supervision gives the Junior model the opportunity to learn independently, and random negative sampling helps it handle more general queries. Meanwhile, the novel soft-label supervision corrects the top K predictions of the Junior model, encouraging it to imitate the teachers' 'answers' to these hard questions. In the experiments, we evaluate four different low-dimensional Junior models mentioned in Section 3.1. As a general framework, MulDE can also be applied to previous KGE models, such as DistMult and RotatE.
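Putting the two supervisions together, one plausible reading of the combined Junior loss can be sketched as follows; the exact weighting form and the helper names are our assumptions:

```python
import numpy as np

def bce(scores, labels):
    """Binary cross-entropy over sigmoid scores; labels mark the positive target
    (1) versus the sampled negatives (0)."""
    p = 1.0 / (1.0 + np.exp(-np.asarray(scores, dtype=float)))
    y = np.asarray(labels, dtype=float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def junior_loss(soft_loss, hard_loss, alpha):
    """Weighted sum of soft-label and hard-label supervision; alpha balances the two."""
    return alpha * soft_loss + (1 - alpha) * hard_loss
```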
3.3. Senior Component
As the bridge between the multiple teachers and the Junior component, the Senior component aims to integrate the different score sequences from the teachers and generate soft labels suitable for the Junior model to learn from. The Senior has two inputs: the top K scores $\mathbf{s}^J$ from the Junior, and multiple score sequences $\{\mathbf{s}^1, \dots, \mathbf{s}^m\}$ ($m$ is the number of teachers) from the teacher models. All of the above sequences have length K, corresponding to the K candidate entities.
In the senior component, we utilize two mechanisms for knowledge integration:
Relation-specific scaling mechanism. According to previous research (Akrami et al., 2020), different KGE models show performance advantages on different relations. To improve the integrated soft labels, each teacher should contribute more on the types of relations it excels at. To this end, we define a trainable scaling matrix $W \in \mathbb{R}^{m \times |\mathcal{R}|}$ and assign an adaptive scaling value (ranging from 0 to 1) to each teacher sequence. The $j$-th scaled sequence is computed as:

(9) $\hat{\mathbf{s}}^j = \mathrm{sigmoid}(W_{j, r})\,\mathbf{s}^j$

where $r$ is the index of the relation in the er query. Note that all scores in a sequence are adjusted proportionally, so their relative ranks do not change. We also attempted to scale the score of each position individually, but the loss of the Senior component fluctuates sharply and is hard to converge.
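A sketch of the relation-specific scaling follows; the sigmoid squashing and the (teachers x relations) shape of the matrix are our assumptions, chosen to match the 0-to-1 range described above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def scale_teacher_scores(teacher_scores, W, teacher_idx, rel_idx):
    """Scale teacher j's top-K score sequence by a learned per-(teacher, relation)
    factor in (0, 1). W is a trainable (num_teachers x num_relations) matrix.
    Scaling every score by the same factor preserves the ranking."""
    s = sigmoid(W[teacher_idx, rel_idx])
    return s * np.asarray(teacher_scores, dtype=float)
```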
Contrast attention mechanism. Since the Junior model is randomly initialized, it needs to make a basic division of the whole entity set early in training. At that point, the distribution of its scores is significantly different from that of a trained teacher, and an overly different soft label would block the training process. Based on this idea, we contrast the Junior sequence with each teacher sequence and evaluate their similarity. An attention mechanism is designed to make the integrated sequence more similar to the Junior sequence. The calculation process is as follows:

(10) $d_j = \mathrm{KL}\big(\sigma(\mathbf{s}^J)\,\|\,\sigma(\hat{\mathbf{s}}^j)\big)$

(11) $a_j = \frac{\exp(-d_j / \lambda)}{\sum_{k=1}^{m} \exp(-d_k / \lambda)}$

where $\mathrm{KL}$ is the KL divergence; a higher $d_j$ means $\mathbf{s}^J$ and $\hat{\mathbf{s}}^j$ are more different. $\lambda$ is a parameter that increases exponentially with the number of training epochs. After the early period, the $a_j$ of the different teachers tend to be equal, and the soft labels start to focus on teacher performance instead of whether the Junior can adapt. Experimental results show that early adaptive soft labels obviously increase the convergence rate and prediction accuracy.
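The contrast attention can be sketched as a softmax over negated KL divergences, with a temperature `lam` standing in for the epoch-dependent parameter; the exact functional form is our assumption:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p_scores, q_scores):
    p, q = softmax(p_scores), softmax(q_scores)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def contrast_attention(junior_scores, teacher_score_list, lam):
    """Attend more to teachers whose score distribution is close to the Junior's.
    As lam grows (later epochs), the weights are driven toward uniform."""
    d = np.array([kl(junior_scores, t) for t in teacher_score_list])
    return softmax(-d / lam)
```

With a small `lam`, a teacher that agrees with the Junior dominates; with a large `lam`, all teachers are weighted almost equally, matching the behaviour described above.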
The contrast attention mechanism only takes effect in the first tens of training epochs, and unlike the scaling mechanism its target is not higher performance. In order to train the parameters in $W$, we use the summation of the scaled sequences $\hat{\mathbf{s}}^j$ to compute the loss. A cross-entropy loss evaluates the performance of the scaled scores in the Senior component as follows:

(12) $\mathcal{L}_{Senior} = \ell_{CE}\Big(\sigma\big(\textstyle\sum_{j=1}^{m} \hat{\mathbf{s}}^j\big),\ \mathbf{y}\Big)$

where $\mathbf{y}$ is a one-hot vector in which the position of the target entity of the er query is 1 and the rest are 0. Besides, if the target entity is not in the candidate set, $\mathbf{y}$ is the zero vector.
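A minimal sketch of this Senior loss for one er query follows; the names are ours, and in this reading the all-zero label case simply contributes no loss:

```python
import numpy as np

def senior_loss(scaled_score_lists, target_pos):
    """Cross-entropy of the summed scaled teacher scores against the target
    entity's position in the top-K candidate list. target_pos=None means the
    target is not among the candidates, i.e., a zero label vector."""
    total = np.sum(np.asarray(scaled_score_lists, dtype=float), axis=0)
    z = total - total.max()
    p = np.exp(z) / np.exp(z).sum()
    if target_pos is None:       # zero label vector: no supervision signal
        return 0.0
    return float(-np.log(p[target_pos]))
```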
3.4. Learning Algorithm
There are three roles in the proposed MulDE framework, namely the Teachers (pretrained hyperbolic models), the Senior student (integration mechanisms), and the Junior student (the target low-dimensional model). As the teacher models undergo no parameter fine-tuning, we use a combined loss function to train the two student components, minimized with the Adam (Kingma and Ba, 2015) optimizer. The complete loss function is as follows:

(13) $\mathcal{L} = \mathcal{L}_{Junior} + \mathcal{L}_{Senior} + \eta\,\|\Theta\|_2^2$

where the third part is the parameter regularization term.
The learning algorithm of MulDE is presented in Algorithm 1. We emphasize the preciseness and availability of this learning procedure: in the pretraining and training phases, neither the pretrained models nor the MulDE model can access the validation and test datasets, and in the execution phase, the model cannot receive the labels of the input data.
4. Theoretical Analysis of Minimum Dimension
As the number of entities far exceeds that of relations, entity embedding parameters usually account for more than 95% of the total parameters of a KGE model. Given a fixed entity set, reducing the embedding dimension can significantly save storage space. However, there is no theoretical analysis of the minimum embedding dimension required by a knowledge graph. Inspired by recent research estimating the relationship between the dimension of word embeddings and the size of the vocabulary (Su, 2020), we analyze from the perspective of the minimum entropy principle and estimate a relatively reliable dimension for knowledge graphs of different sizes.
To simplify the derivation, we first define the plausibility of a triple as a joint probability $P(q, t)$, where $q$ is the er query $(h, r)$. For most KGE models, the relation is regarded as a transformation between $h$ and $t$. Hence, we treat the er query as a transformed entity vector, thereby converting the triple sample into a normal pairwise one. According to the Cartesian product, the number of all possible er queries is equal to $|\mathcal{E}| \times |\mathcal{R}|$.
Then, a general KGE framework is defined as the following:
Definition 1. General KGE Framework $M$.
Given a knowledge graph $\mathcal{G}$, a KGE model $M$ learns entity and relation embeddings in a $d$-dimensional vector space. The joint probability of a triple is computed as:

(14) $P(q, t) = \frac{\exp(-f(\mathbf{v}_q, \mathbf{v}_t))}{\sum_{q', t'} \exp(-f(\mathbf{v}_{q'}, \mathbf{v}_{t'}))}$

where $\mathbf{v}_q$ and $\mathbf{v}_t$ are the embedding vectors of $q$ and $t$, $f$ is a distance measure function, and $\mathbf{v}_q$ is obtained from $\mathbf{e}_h$ and $\mathbf{e}_r$ by an arbitrary vector transformation. Without loss of generality, $f$ employs the squared Euclidean distance function, i.e., $f(\mathbf{x}, \mathbf{y}) = \|\mathbf{x} - \mathbf{y}\|^2$.
From the view of information entropy, the KG itself has a certain degree of uncertainty: an er query points to one or more target entities, and, given a large-scale entity set, the KG is often incomplete, so there are inevitably hidden triples that have not yet been discovered. When we use a KGE model to encode it, the entropy of the embedding result should be equal to or less than this uncertainty. In other words, we should minimize the entropy of the model, so that we can ensure the KGE model effectively keeps the information of the KG.
According to the above analysis, we aim to find a sufficient condition ensuring that $M$'s entropy is smaller than that of the KG $\mathcal{G}$, so that the relationship between the embedding dimension and the KG scale can be depicted.
Theorem 1. The embedding dimension $d$ of $M$ has an approximate lower bound $d > \alpha \ln(|\mathcal{E}| \cdot |\mathcal{R}|)$ (with an accommodation constant $\alpha$) that ensures the model has enough space to encode a KG $\mathcal{G}$, where $|\mathcal{E}|$ is the size of the entity set, $|\mathcal{R}|$ is the size of the relation set, and $\ln$ is the natural logarithm.
Proof. The proof is given in the Appendix.
Theorem 1 shows that when $d = 64$, the model can encode any KG whose scale $|\mathcal{E}| \times |\mathcal{R}|$ satisfies the corresponding bound, which includes the common datasets in the KGE domain, such as FB15k-237, WN18RR, and YAGO3-10. We further verify this conclusion in the experiments detailed in Section 5. Meanwhile, we can speculate that 128-dimensional embedding vectors can accommodate some large-scale KGs, such as DBpedia and Freebase. Note that Theorem 1 is a rough estimate under a set of approximate conditions. The derived lower bound can only provide preliminary guidance for large-scale KGs, and its accuracy declines as the KG scale shrinks. Conducting a more precise derivation is difficult, and we leave it as future work. Overall, this theoretical analysis demonstrates that a low-dimensional vector space has the capacity to accommodate a normal-size KG. It motivates us to utilize low-dimensional teachers to refine a low-dimensional model via Knowledge Distillation, rather than roughly increasing the embedding dimension for higher performance.
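As a purely illustrative numeric check (the multiplicative form below and the constant `alpha` are our assumptions for illustration, not values derived in the paper), the $\ln(|\mathcal{E}| \cdot |\mathcal{R}|)$ term is small even for the benchmark KGs:

```python
import math

# Hypothetical numeric reading of Theorem 1: the required dimension grows
# with ln(|E| * |R|). The constant alpha is an assumed value, chosen only
# to illustrate the order of magnitude.
ALPHA = 4.0

def approx_lower_bound(num_entities, num_relations, alpha=ALPHA):
    return alpha * math.log(num_entities * num_relations)

# FB15k-237: ~14,541 entities, 237 relations
# WN18RR:    ~40,943 entities, 11 relations
# Under this reading, both fall comfortably under 64 dimensions.
```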
Table 1. Statistics of the datasets.
Dataset  #Rel  #Ent  #Train  #Valid  #Test
FB15k-237  237  14,541  272,115  17,535  20,466
WN18RR  11  40,943  86,835  3,034  3,134
Table 2. Link prediction results in high and low dimensions.
Type  Dim  Methods  FB15k-237  WN18RR
MRR  Hits@10  Hits@1  MRR  Hits@10  Hits@1
High-dim (d >= 200)
TransE  0.256  0.456  0.152  0.207  0.476  0.012  
DistMult  0.286  0.445  0.202  0.412  0.484  0.372  
ComplEx  0.283  0.447  0.202  0.431  0.513  0.395  
ConvE  0.316  0.501  0.237  0.430  0.520  0.400  
RotatE  0.338  0.533  0.241  0.476  0.571  0.428  
TuckER  0.358  0.544  0.266  0.470  0.526  0.443  
QuatE  0.348  0.550  0.248  0.488  0.582  0.438  
RefH  0.346  0.536  0.252  0.461  0.568  0.404  
RotH  0.344  0.535  0.246  0.496  0.586  0.449  
AttH  0.348  0.540  0.252  0.486  0.573  0.443  
Low-dim (d = 32)
RotatE  0.290  0.458  0.208  0.387  0.491  0.330  
RefH  0.312  0.489  0.224  0.447  0.518  0.408  
RotH  0.314  0.497  0.223  0.472  0.553  0.428  
AttH  0.324  0.501  0.236  0.466  0.551  0.419  
MulDE (d = 32)
TransH  0.308  0.488  0.217  0.231  0.518  0.081  
MulDE-TransH  0.328  0.511  0.236  0.267  0.540  0.094  
DistH  0.293  0.474  0.202  0.439  0.511  0.399  
MulDE-DistH  0.326  0.509  0.235  0.460  0.545  0.417  
RefH  0.302  0.474  0.215  0.453  0.526  0.414  
MulDE-RefH  0.325  0.508  0.233  0.479  0.569  0.434  
RotH  0.310  0.489  0.221  0.463  0.547  0.416  
MulDE-RotH  0.328  0.515  0.237  0.481  0.574  0.433  
5. Experiments
5.1. Experimental Setup
Datasets. Experimental studies are conducted on two commonly used datasets. WN18RR (Bordes et al., 2014) is a subset of the English lexical database WordNet (Miller, 1995). FB15k-237 (Toutanova and Chen, 2015) is extracted from Freebase and includes knowledge facts about movies, actors, awards, and sports. It was created by removing inverse relations, because many test triples could otherwise be obtained simply by inverting triples in the training set. The statistics of the datasets are given in Table 1. '#Rel' and '#Ent' refer to the numbers of relations and entities in the dataset, and the other columns refer to the number of triples in the training, validation, and test sets.
Baselines. We implement MulDE by employing four hyperbolic KGE models as student or teacher models: TransH, DistH, RefH, and RotH. Although AttH (Chami et al., 2020) achieves better performance on one or two metrics, it is a combination of RefH and RotH. To verify the influence of hyperbolic space in multi-teacher knowledge distillation, we also pretrain a corresponding Euclidean model 'ModelE' for each hyperbolic model 'ModelH'.
The compared KGE models fall into two classes: (1) high-dimensional models, including the SotA KGE methods in Euclidean or hyperbolic space, whose published optimal results all use more than 200 dimensions; and (2) low-dimensional models, including RotH, RefH, and AttH proposed by Chami et al., which benefit from hyperbolic vector space and have shown obvious advantages in low dimensions. To save space, we omit other previous methods.
Implementation Details. We select the hyperparameters of our model via grid search according to the metrics on the validation set. For the teacher models, we pretrain teachers with embedding dimensions among {64, 128, 256, 512}; according to the theoretical analysis, we set the teacher embedding dimension to 64 in the main experiments. The learning rate and the number of negative samples are likewise selected by grid search. For the MulDE framework, we empirically select the Junior embedding dimension among {8, 16, 32, 64}, as well as the length K of the prediction sequences, the learning rate, and the balance hyperparameter $\alpha$. All experiments are performed on an Intel Core i7-7700K CPU @ 4.20GHz and an NVIDIA GeForce GTX 1080 Ti GPU, and implemented in Python using the PyTorch framework.
Evaluation Metrics.
For the link prediction experiments, we adopt two evaluation metrics: (1) MRR, the average inverse rank of the test triples, and (2) Hits@N, the proportion of correct entities ranked in the top N. Higher MRR and Hits@N mean better performance. Following previous works, we process the output sequence in the 'Filter' mode. Note that, for the pretrained models, we only remove those entities appearing in the training dataset.
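The two metrics in 'Filter' mode can be sketched in pure Python as follows; the function names are ours:

```python
def filtered_rank(scores, target, known_true):
    """Rank of the target entity after removing ('filtering') the other
    known-true entities. scores: dict entity -> score; known_true: entities
    that form true triples with the same er query."""
    t_score = scores[target]
    better = sum(1 for e, s in scores.items()
                 if e != target and e not in known_true and s > t_score)
    return better + 1

def mrr_and_hits(ranks, n=10):
    """MRR is the mean inverse rank; Hits@N is the fraction of ranks <= N."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(1 for r in ranks if r <= n) / len(ranks)
    return mrr, hits
```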
Table 3. Results with various student (a) and teacher (b) dimensions.
Methods  Dim  MRR  Hits@10  Hits@1 
Teachers  64d  0.487  0.581  0.440 
Origin  8d  0.210  0.352  0.140 
Junior  8d  0.399  0.483  0.350 
Growth  90.05%  37.41%  150.32%  
Origin  16d  0.412  0.501  0.358 
Junior  16d  0.464  0.547  0.419 
Growth  12.60%  9.07%  17.22%  
Origin  32d  0.463  0.547  0.416 
Junior  32d  0.481  0.574  0.433 
Growth  3.91%  4.91%  4.14%  
Origin  64d  0.477  0.564  0.429 
Junior  64d  0.482  0.579  0.430 
Growth  1.05%  2.77%  0.21%  
(a) Various Student Dimensions  
Methods  Dim  MRR  Hits@10  Hits@1 
Origin  32d  0.463  0.547  0.416 
Teachers  64d  0.487  0.581  0.440 
Junior  32d  0.481  0.574  0.433 
Growth  -1.23%  -1.20%  -1.59%  
Teachers  128d  0.488  0.582  0.437 
Junior  32d  0.480  0.572  0.430 
Growth  -1.64%  -1.71%  -1.60%  
Teachers  256d  0.483  0.578  0.434 
Junior  32d  0.473  0.569  0.424 
Growth  -2.07%  -1.60%  -2.30%  
Teachers  512d  0.479  0.584  0.423 
Junior  32d  0.469  0.574  0.412 
Growth  -2.09%  -1.71%  -2.60%  
(b) Various Teacher Dimensions 
5.2. Link Prediction Task
We first evaluate our model in a 32-dimensional vector space, the same as the low-dimensional setting of Chami et al. (Chami et al., 2020). 'MulDE-ModelH' denotes the model whose Junior is a 32-dimensional hyperbolic model and whose four teachers, TransH, DistH, RotH, and RefH, are pretrained in the 64-dimensional space. We evaluate our distilled Junior models against their original results, and also compare with SotA models in both low and high dimensions. The experimental results are shown in Table 2. From the table, we have the following observations:
The four different Junior models trained by MulDE significantly outperform their original performance on both datasets. The MRR and Hits@10 of all four models improve by 5% on average. In particular, the Hits@10 of RotH increases from 0.547 to 0.574 on WN18RR, and the Hits@1 of DistH improves from 0.202 to 0.235 on FB15k-237. The results illustrate the effectiveness of knowledge distillation for low-dimensional embeddings.
Compared with previous low-dimensional models, 'MulDE-RotH' achieves the SotA results in all metrics on both datasets. Although the AttH model combines both RotH and RefH, it is weaker than our next-best model 'MulDE-RefH' on 5 metrics. Besides, the TransH model only uses a simple score function, yet after knowledge distillation training it exceeds all previous low-dimensional models on the FB15k-237 dataset.
Compared with SotA high-dimensional models, 'MulDE-RotH' exceeds multiple 200-dimensional models, including TransE, DistMult, ComplEx, and ConvE, on both datasets. Although its performance is lower than some of the latest models, 'MulDE-RotH' shows strong competitiveness on some metrics. Notably, its Hits@10 on WN18RR outperforms most SotA models except QuatE and RotH (d = 500).
Table 4. Ablation results on FB15k-237 and WN18RR.
Methods  FB15k-237  WN18RR  
MRR  Hits@10  Hits@1  MRR  Hits@10  Hits@1  
MulDE  0.328  0.515  0.237  0.481  0.574  0.433 
MulDE w/o TopK  0.321  0.502  0.231  0.456  0.534  0.413 
MulDE w/o RS  0.325  0.513  0.234  0.476  0.571  0.430 
MulDE w/o CA  0.325  0.511  0.233  0.481  0.573  0.422 
KD-512d w/ TopK  0.322  0.502  0.229  0.469  0.564  0.421 
KD-64d w/ TopK  0.324  0.506  0.238  0.467  0.553  0.425 
KD-64d  0.321  0.498  0.230  0.459  0.540  0.424 
On the whole, we conclude that the MulDE framework successfully improves low-dimensional hyperbolic models. The performance of the distilled models is even comparable to some high-dimensional models.
5.3. Results of Different Dimensions
The student dimension of 32 and teacher dimension of 64 were determined according to previous work and our theoretical analysis. We further evaluate the MulDE framework with other dimensions. For the student embedding dimension, we still focus on low-dimensional models with dimensions lower than 100, and compare the distilled models with the original ones. For the teacher embedding dimension, we select multiple sizes over 64, namely 128, 256, and 512, and measure the performance gap between the 32-dimensional student and the different high-dimensional teachers. The experimental results are shown in Table 3.
The performance of the different low-dimensional student models is shown in Table 3 (a). First, the results show that the accuracy of the original model drops obviously as the embedding dimension decreases. Although the distilled Junior models outperform their original accuracy, the lower-dimensional models still produce relatively poor results. Besides, the 'Growth' rows show that MulDE contributes more improvement to lower-dimensional models: the MRR and Hits@1 of 8d RotH increase by more than 90% and 150%, far more than those of 64d RotH. Furthermore, the performance of the 64d distilled models is very close to that of the 32-dimensional ones, which suggests we can use 32d models in applications to save more storage, and even 8d models for faster interactions.
The results in Table 3 (b) support our theoretical analysis to some extent. First, the performance of the teacher ensemble is similar across different high dimensions; the Hits@10 of the 256d teachers is even worse than that of the 64d ones. This illustrates that once the embedding dimension exceeds the lower bound, the capacity of the model tends to be stable. Besides, as the teacher dimension increases, the MRR and Hits@1 of the Junior model decrease, indicating that a teacher with an overly high dimension has more difficulty transferring knowledge to a low-dimensional student. Therefore, the experimental results motivate us to apply 64-dimensional teachers, which achieve higher performance and save more pretraining costs.
5.4. Ablation studies about Distillation Strategy
To analyze the benefits of the MulDE framework, we conduct a series of ablation experiments to evaluate its different modules. Compared with the conventional knowledge distillation strategy, there are three main improvements in MulDE: the iterative distilling strategy (TopK), the relation-specific scaling mechanism (RS), and the contrast attention mechanism (CA). We therefore test the performance of MulDE without each of the three modules. Besides, we compare MulDE with single-teacher distillation in which the teacher is the same KGE model as the student and has 64 or 512 dimensions. To make the comparison stricter, we apply the iterative distilling strategy to these single-teacher models instead of randomly sampled candidate labels. The experimental results are shown in Table 4.
First, for the iterative distilling strategy, it is clear that the performance of MulDE dramatically decreases when this module is removed; in particular, the Hits@10 on WN18RR drops from 0.574 to 0.534 (7.0%). This proves the necessity of using top K labels when applying knowledge distillation to link prediction tasks. In contrast, although the two mechanisms are helpful, the contributions of RS and CA are relatively small. Further improvement of the Senior component is left as future work.
Single-teacher knowledge distillation can improve the original RotH model, but its performance is poorer than MulDE's. This is the reason why we propose a novel distillation framework for KGE models. The iterative distillation strategy also benefits the single-teacher framework, increasing Hits@10 on WN18RR by around 2%. Besides, a higher-dimensional teacher (512d) also has an enhancement effect, but its training cost is unavoidably larger than MulDE's.
Overall, the experimental results indicate the effectiveness of the three major modules in MulDE. Compared with the single-teacher strategy, our framework shows apparent improvements. In the next section, we further discuss the influence of different teacher combinations.
6. Discussions
In this section, we discuss several notable questions about MulDE.
Q1: How much do the distillation strategies accelerate student training? Fig. 3 shows the convergence of 32-dimensional RotH under different training processes. As expected, the student model trains faster under the supervision of teachers. We can observe that MulDE converges faster than the original RotH and achieves higher accuracy. In contrast, the single-teacher KD model is much slower and keeps improving until around epoch 100. We also analyze the convergence of MulDE without the contrast attention mechanism, i.e., MulDE-CA. Benefiting from contrast attention, the Hits@1 of MulDE in the early period (epoch = 10) is around 3% higher than that of MulDE-CA, and the final accuracy also rises from 0.422 to 0.433. This proves the effectiveness of contrast attention, which reduces the gap between student scores and teacher scores in the early period.
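One plausible reading of the contrast idea can be sketched as follows; this is our own simplification (the absolute-difference gap measure, the `sharpness` parameter, and all names are assumptions, not the exact MulDE mechanism): teachers whose predictions differ most from the student's current predictions are weighted up, so early in training the student receives the strongest signal where it deviates most from the ensemble.

```python
import numpy as np

def contrast_weights(student_probs, teacher_probs_list, sharpness=1.0):
    """Hypothetical contrast-attention weighting: teachers whose score
    distributions differ more from the student's receive larger weights."""
    gaps = np.array([np.abs(t - student_probs).sum() for t in teacher_probs_list])
    w = np.exp(sharpness * gaps)          # softmax over teacher-student gaps
    return w / w.sum()

def ensemble_soft_labels(student_probs, teacher_probs_list, sharpness=1.0):
    """Combine teacher distributions into one soft-label distribution."""
    w = contrast_weights(student_probs, teacher_probs_list, sharpness)
    return sum(wi * t for wi, t in zip(w, teacher_probs_list))
```

As the student converges toward the teachers, the gaps shrink and the weighting flattens, which is consistent with the observation that contrast attention helps mainly in the early period.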
Q2: Whether and how much does the hyperbolic space contribute to the results? We evaluate the MulDE framework with hyperbolic KGE models because of their performance in low dimensions. To answer this question, we analyze the contributions of hyperbolic models when they serve as teachers and as the student, and also verify the effectiveness of MulDE in Euclidean space. The results are shown in Fig. 4 (a). 'teaH' and 'stuH' mean using hyperbolic models, while 'teaE' and 'stuE' mean using the corresponding Euclidean models. The 'teaH_stuH' model is equal to MulDE-RotH, and the other variants assign different space types to teachers and student. Among the four models, 'teaE_stuE' is weaker than the other three, which indicates the advantage of hyperbolic space. Even so, its 0.466 Hits@1 is still equal to that of the previous AttH. Comparing the middle two models, we conjecture that using hyperbolic space for the teachers is more effective than using a hyperbolic student.
Q3: How do different teacher choices contribute to the results? The complete MulDE framework utilizes all four hyperbolic models as teachers, because the whole ensemble obtains better accuracy than the other combinations. We further analyze the contribution of each teacher in the ensemble. As shown in Fig. 4 (b), MulDE outperforms the other four teacher combinations. Among the two-teacher settings, using only TransH and DistH is clearly poorer than using RotH and RefH, which indicates the importance of the latter two models. Comparing the two three-teacher settings, we find that RotH contributes more than RefH. This is reasonable because the accuracy of the original RotH is already higher than RefH's. The experimental results prove the contribution of every teacher model. A better teacher combination may be found by adding other models, which will be our future work.
7. Related Work
In this section, we discuss recent advances in the KGE domain and introduce relevant Knowledge Distillation research.
Predicting new triples using KGE methods has been an active research topic over the past few years. Benefiting from simple calculation and good performance, early methods such as TransE (Bordes et al., 2013), DistMult (Yang et al., 2015), and ComplEx (Trouillon et al., 2016) have been widely used in various AI tasks. With the rise of deep learning, several CNN-based methods have been proposed, such as ConvE (Dettmers et al., 2018) and ConvKB (Nguyen et al., 2017). These methods perform well in link prediction tasks but depend heavily on a larger number of parameters. Recently, several non-neural methods have been proposed. RotatE (Sun et al., 2019), inspired by Euler's identity, can infer various relation patterns with a new rotation-based score function. QuatE (Zhang et al., 2019) and OTE (Tang et al., 2020) further improve RotatE: the former introduces more expressive hypercomplex-valued representations in the quaternion space, while the latter extends RotatE from the 2D complex domain to a high-dimensional space with orthogonal transforms to model relations.
Research related to our problem is relatively recent, including two representative approaches. One possible solution is compressing pre-trained high-dimensional models. Sachan (2020) utilized embedding compression methods to convert high-dimensional continuous vectors into discrete codes. Although this generates a compressed model retaining 'high-accuracy' knowledge, a pre-trained high-dimensional model is necessary. Furthermore, the compressed model with discrete vectors cannot continue training when the KG is modified. Another solution is introducing new theories to improve low-dimensional KGE models directly. For example, Chami et al. (2020) introduced a hyperbolic embedding space with trainable curvature, and proposed a class of hyperbolic KGE models outperforming previous Euclidean-based methods in low dimensions. However, the limited number of parameters inevitably degrades model performance, and the small models cannot utilize the 'high-accuracy' knowledge from high-dimensional models.
Knowledge Distillation (KD) aims to transfer 'knowledge' from one machine learning model (i.e., the teacher) to another one (i.e., the student). Hinton et al. (2015) introduced the first KD framework, which applies the classification probabilities of a trained model as 'soft labels' and defines a 'temperature' parameter to control how soft those labels are. Inspired by this, several KD-based approaches have been proposed in different research domains. Furlanello et al. (2018) propose Born-Again Networks, in which the student is parameterized identically to its teacher. Yang et al. (2019) added an additional loss term to facilitate a few secondary classes emerging to complement the primary class. Li et al. (2019) distill human knowledge from a teacher model to enhance the pedestrian attribute recognition task. To the best of our knowledge, our work is the first to apply KD techniques to link prediction over KGs.
8. Conclusion
Recent KGE models tend to apply high-dimensional embedding vectors to improve performance, but they can hardly be applied in practice due to heavy training costs and memory consumption. In this paper, we theoretically analyze the relationship between the embedding dimension and the KG scale, and prove that a low-dimensional vector space has the capacity to represent normal-size KGs. Then, we propose MulDE, a novel multi-teacher knowledge distillation framework for knowledge graph embeddings. Utilizing multiple hyperbolic KGE models as teachers, we present a novel iterative distillation strategy to adaptively extract high-accuracy knowledge for the low-dimensional student. The experimental results show that the RotH model distilled by MulDE outperforms SotA low-dimensional models on two commonly-used datasets. Compared with general single-teacher KD methods, MulDE also accelerates student training.
These positive results encourage us to explore the following research directions in the future:
- Based on Theorem 1, we will further study the lower bound of the embedding dimension in both Euclidean and hyperbolic spaces, and explore the influence of different relation transformations.
- To improve the quality of teacher knowledge, we will further improve the knowledge integration in the Senior component to achieve higher-accuracy soft labels in low dimensions.
- Regarding the choice of multiple teachers, we will analyze in depth why the ensemble increases accuracy, and discover new teacher combinations by importing other relation transformations.
References
 Realistic re-evaluation of knowledge graph completion methods: an experimental study. In Proceedings of the 2020 International Conference on Management of Data (SIGMOD 2020), Portland, OR, USA, June 14-19, 2020, pp. 1995-2010.
 DBpedia - a crystallization point for the web of data. J. Web Semant. 7(3), pp. 154-165.
 A semantic matching energy function for learning with multi-relational data. Machine Learning 94, pp. 233-259.
 Translating embeddings for modeling multi-relational data. In Proceedings of Advances in Neural Information Processing Systems (NIPS 2013), pp. 2787-2795.
 Distributions of angles in random packing on spheres. Journal of Machine Learning Research 14(1), pp. 1837-1864.
 Low-dimensional hyperbolic knowledge graph embeddings. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Online, July 5-10, 2020, pp. 6901-6914.
 ParamE: regarding neural network parameters as relation embeddings for knowledge graph completion. In The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020), New York, NY, USA, February 7-12, pp. 2774-2781.
 Convolutional 2D knowledge graph embeddings. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pp. 1811-1818.
 Born-again neural networks. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), pp. 1602-1611.
 Distilling the knowledge in a neural network. CoRR abs/1503.02531.
 Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR 2015).
 Pedestrian attribute recognition by joint visual-semantic reasoning and knowledge distillation. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI 2019), pp. 833-839.
 WordNet: a lexical database for English. Communications of the ACM 38, pp. 39-41.
 Improved knowledge distillation via teacher assistant. In The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI 2020).
 A novel embedding model for knowledge base completion based on convolutional neural network. In Proceedings of the 2017 Conference of the North American Chapter of the Association for Computational Linguistics, Vol. 2, pp. 327-333.
 How much is a triple? Estimating the cost of knowledge graph creation. In Proceedings of the 17th International Semantic Web Conference (ISWC 2018).
 Knowledge graph embedding compression. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Online, July 5-10, 2020, pp. 2681-2691.
 Principle of minimum entropy (VI): how to choose the dimension of word vectors? Retrieved from https://spaces.ac.cn/archives/7695 (2020, Aug 20).
 Yago: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web (WWW 2007), Banff, Alberta, Canada, May 8-12, 2007, pp. 697-706.
 RotatE: knowledge graph embedding by relational rotation in complex space. In 7th International Conference on Learning Representations (ICLR 2019).
 Orthogonal relation transforms with graph context modeling for knowledge graph embedding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Online, July 5-10, 2020, pp. 2713-2722.
 Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pp. 57-66.
 Complex embeddings for simple link prediction. In Proceedings of the 33rd International Conference on Machine Learning, pp. 2071-2080.
 Reinforced negative sampling over knowledge graph for recommendation. In WWW '20: The Web Conference 2020, Taipei, Taiwan, April 20-24, pp. 99-109.
 Embedding entities and relations for learning and inference in knowledge bases. In 3rd International Conference on Learning Representations (ICLR 2015).
 Training deep neural networks in generations: a more tolerant teacher educates better students. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019), pp. 5628-5635.
 Relation adversarial network for low resource knowledge graph completion. In WWW '20: The Web Conference 2020, Taipei, Taiwan, April 20-24, pp. 1-12.
 Quaternion knowledge graph embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP-IJCNLP 2019).
 Neighborhood-aware attentional representation for multilingual knowledge graphs. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI 2019), pp. 1943-1949.
Appendix A. The Proof of Theorem 1
First, the information entropy of the KG is:
(A.1) 
According to Eq.14, the information entropy of the KGE model is:
(A.2)  
(A.3) 
Then, we use a sampling approximation to estimate it as follows:
(A.4)  
(A.5) 
, where is the total number of all possible (query, target) pairs. Thus, we obtain the approximation:
(A.6)  
(A.7) 
This problem depends on the approximate solution to , which is the expectation of the distance between any two embedding vectors.
According to experimental observation, the elements of the embedding vectors follow a uniform distribution. Here, we can assume that the absolute value of each element is about 1, so the length of each n-dimensional embedding vector is approximately √n. Therefore, all embedding vectors are uniformly distributed on an n-dimensional hypersphere with radius √n, and we have:
(A.8)  
(A.9)  
(A.10)  
(A.11) 
, where is the intersection angle between the two embedding vectors. The distribution of the intersection angle between any two vectors in n-dimensional space is known (Cai et al., 2013); the probability density function is:
(A.12) 
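For reference, the standard density from Cai et al. (2013) for the angle θ between two independent directions drawn uniformly in n-dimensional space can be written as:

```latex
p(\theta) \;=\; \frac{\Gamma\!\left(\tfrac{n}{2}\right)}
                     {\Gamma\!\left(\tfrac{n-1}{2}\right)\sqrt{\pi}}
              \,\sin^{n-2}\theta, \qquad \theta \in [0, \pi],
```

which concentrates sharply around θ = π/2 as n grows, so random embeddings become nearly orthogonal in high dimensions.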
Setting , we have:
(A.13) 
Hence, we can compute as:
(A.15)  
(A.16) 
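The expectation computations above rely on the hypersphere geometry; a sketch, under the stated assumption that every embedding vector has norm √n and with θ denoting the angle between the two vectors:

```latex
\|u - v\|
  \;=\; \sqrt{\|u\|^{2} + \|v\|^{2} - 2\,u \cdot v}
  \;=\; \sqrt{2n\left(1 - \cos\theta\right)},
```

so the expected distance between two random embeddings reduces to an expectation over the angle θ alone.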
Similarly, we can compute the other expectation terms in , and the final approximate solution of is:
(A.17)  
(A.18) 
To ensure the model has enough space to encode a KG , a sufficient condition is that . As the information entropy is hard to compute, we relax the condition to to obtain an approximate estimate of .
Note that, if there were no sampling error when computing , it should always be greater than 0. Following the analysis in (Su, 2020), the sampling error arises because a small number of samples cannot estimate the -dimensional model when is large enough. In other words, if , then has an appropriate to accommodate the KG.
Finally, given fixed and , we can obtain an approximate lower bound of by numerical calculation. In the calculation, we find that it is linearly dependent on . Hence, we can set a constant to approximate:
(A.19) 
(A.20) 
, where we define an accommodation constant .
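The final numerical step can be illustrated with a toy calculation; the linear-in-logarithm form follows the text above, but the concrete constant `c` and the function name are purely hypothetical assumptions:

```python
import math

def approx_min_dimension(num_entities, c=8.0):
    """Hypothetical illustration of the final approximation: the lower bound
    on the embedding dimension grows linearly with the logarithm of the KG
    scale, scaled by an 'accommodation constant' c (c=8.0 is an assumption)."""
    return math.ceil(c * math.log(num_entities))
```

Under this form, even a KG with millions of entities needs only a modest embedding dimension, consistent with the paper's claim that low-dimensional spaces can represent normal-size KGs.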