Decompressing Knowledge Graph Representations for Link Prediction

by Xiang Kong, et al.

This paper studies the problem of predicting missing relationships between entities in knowledge graphs by learning their representations. Currently, the majority of existing link prediction models employ simple but intuitive scoring functions and relatively small embedding sizes so that they can be applied to large-scale knowledge graphs. However, these properties also restrict the ability to learn more expressive and robust features. Therefore, diverging from most prior works, which focus on designing new objective functions, we propose DeCom, a simple but effective mechanism that boosts the performance of existing link predictors such as DistMult and ComplEx by extracting more expressive features while preventing overfitting, adding just a few extra parameters. Specifically, embeddings of entities and relationships are first decompressed to a more expressive and robust space by decompressing functions, then knowledge graph embedding models are trained in this new feature space. Experimental results on several benchmark knowledge graphs and advanced link prediction systems demonstrate the generalization and effectiveness of our method. In particular, RESCAL + DeCom achieves state-of-the-art performance on the FB15k-237 benchmark across all evaluation metrics. In addition, we show that, compared with DeCom, explicitly increasing the embedding size significantly increases the number of parameters without achieving a comparable performance improvement.




Introduction

Recently, knowledge bases (KBs) such as Freebase [1], WordNet [15] and Yago [21] have proven useful in many tasks, including reading comprehension, recommendation systems and information retrieval. These KBs collect facts about the real world as directed graphs (knowledge graphs), in which entities (nodes) are connected by their relationships (edges). As a result, a fact is represented by a triple $(s, r, o)$, i.e., a relation $r$ between a subject entity $s$ and an object entity $o$.

Although these knowledge graphs contain millions of entities, they are usually incomplete, i.e., some relationships between entities are missing. Accordingly, extensive research has been done on predicting those missing links through learning low-dimensional embedding representations of entities and relations [3, 2, 26, 13, 17, 16, 9, 24, 5, 4, 25, 6, 14, 8, 19, 22]. Considering the large size of knowledge graphs with millions of facts, current popular link predictors tend to be fast and shallow, utilizing simple scoring functions and small embedding sizes, but at the potential expense of learning less expressive features.

In this work, unlike the majority of prior studies, whose goal is to design new scoring functions such as TransE [2], DistMult [26], ComplEx [24] and RotatE [22], we propose a general and effective method that can be applied to these models to boost their performance without explicitly increasing the embedding size or changing the scoring function. In more detail, the original embeddings of entities and relations are mapped to a more expressive and robust space by decompressing functions, and the link prediction models are then trained in this new space. Our method is simple and general enough to be applied to existing link prediction models, and experimental results on different benchmark knowledge graphs and popular link prediction models demonstrate that it boosts their scores with only a small number of extra parameters.

Specifically, our contributions are as follows:

  • We propose DeCom, a simple but effective decompressing method that significantly improves the performance of many existing knowledge graph embedding models.

  • Because it does not need to increase the embedding size explicitly, DeCom helps link prediction models save considerable storage space and GPU memory.

  • By employing a convolutional neural network as the decompressing network, DeCom-based models add only a few extra parameters to the original models and are therefore highly parameter-efficient. Even if a fully connected network is used as the decompressing network, the number of extra parameters is constant and does not grow with the size of the knowledge graph.

  • Experiments on several benchmark knowledge graphs and link prediction models show the effectiveness of our method; among them, RESCAL + DeCom achieves state-of-the-art results on FB15k-237 across all evaluation metrics.

| Model | Scoring Function | DeCom Scoring Function |
|---|---|---|
| DistMult [26] | $\langle e_s, e_r, e_o \rangle$ | $\langle f_s(e_s), f_r(e_r), f_o(e_o) \rangle$ |
| ComplEx [24] | $\mathrm{Re}(\langle e_s, e_r, \bar{e}_o \rangle)$ | $\mathrm{Re}(\langle f_s(e_s), f_r(e_r), \overline{f_o(e_o)} \rangle)$ |

Table 1: Scoring functions of some link prediction models w/ and w/o the decompressing (DeCom) layer, where $\langle \cdot \rangle$ denotes the generalized dot product, $f_s$, $f_r$ and $f_o$ represent the decompressing operations for the subject entity, relation and object entity respectively, $e_s$, $e_r$ and $e_o$ represent the embeddings of the subject entity, relation and object entity respectively, and a bar denotes the complex conjugate.

Figure 1: The architecture of DeCom-based link prediction models. The embeddings of the subject, object and relation, $e_s$, $e_o$ and $e_r$, are first fed into their corresponding DeCom layers to obtain more expressive and robust features; the original scoring function (SF) then computes a score based on these decompressed features. The final score represents the probability of the input triple (s, r, o) being true.


Formally, a knowledge graph consists of a set of entities (vertices) $\mathcal{E}$ and a set of relations (edges) $\mathcal{R}$, and can be represented by triples (facts) $\{(s, r, o)\} \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E}$. Each fact (triple) represents a relationship $r \in \mathcal{R}$ between one subject entity $s \in \mathcal{E}$ and one object entity $o \in \mathcal{E}$. Commonly, there are millions of entities in one knowledge graph, but many of the links (relationships) between them are missing. Completing these missing links is referred to as Knowledge Base Completion (KBC) or, more specifically, Link Prediction.

Most of the literature approaches the link prediction task by learning low-dimensional embedding vectors of knowledge graph entities and relations, known as Knowledge Graph Embedding (KGE). The problem is formalized as finding a scoring function $\phi: \mathcal{E} \times \mathcal{R} \times \mathcal{E} \to \mathbb{R}$, where $\mathcal{E}$ and $\mathcal{R}$ denote the entity and relation sets, which computes a score for each triple $(s, r, o)$ indicating whether this triple should be true or false. Intuitively, a good scoring function should assign higher scores to true triples than to false ones. In some recent models, a non-linearity such as the logistic sigmoid function is applied to the score to give a corresponding probability prediction.

Table 1 lists several popular scoring functions from the literature. In these models, entities and relations are represented by low-dimensional embedding vectors, except for RESCAL where the relations are represented by full-rank matrices.


RESCAL [18] is a powerful link prediction model whose scoring function is a bilinear product between the subject and object entity embeddings and a full-rank matrix for each relation. Due to its large number of parameters, RESCAL suffers from overfitting, and explicitly increasing the embedding dimension increases the number of relation parameters quadratically.


To mitigate the above issue, DistMult [26], a special case of RESCAL, employs a diagonal matrix to represent each relation so that the number of parameters grows linearly with the embedding size. The resulting scoring function is equivalent to the inner product of three vectors. However, DistMult cannot handle asymmetric relations, as $(s, r, o)$ and $(o, r, s)$ are assigned the same score.


To model asymmetric relations, ComplEx [24] extends DistMult from the real space to the complex space. Even though each relation matrix of ComplEx is still diagonal, the subject and object embeddings of the same entity are no longer identical but complex conjugates of each other, which introduces asymmetry into the tensor decomposition and thus enables ComplEx to model asymmetric relations.
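The symmetry argument above can be checked numerically. The following NumPy sketch implements the three scoring functions as they are usually defined in the literature; the function names, the random embeddings and the embedding size of 4 are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def rescal_score(e_s, W_r, e_o):
    """Bilinear product e_s^T W_r e_o with a full relation matrix W_r."""
    return float(e_s @ W_r @ e_o)

def distmult_score(e_s, e_r, e_o):
    """Three-way inner product; equals RESCAL with W_r = diag(e_r)."""
    return float(np.sum(e_s * e_r * e_o))

def complex_score(e_s, e_r, e_o):
    """Real part of the three-way product with the object embedding conjugated."""
    return float(np.real(np.sum(e_s * e_r * np.conj(e_o))))

rng = np.random.default_rng(0)
d = 4  # illustrative embedding size (assumption)
e_s, e_r, e_o = rng.normal(size=(3, d))

# DistMult is RESCAL restricted to a diagonal relation matrix.
same_as_rescal = np.isclose(
    distmult_score(e_s, e_r, e_o), rescal_score(e_s, np.diag(e_r), e_o)
)

# DistMult cannot distinguish (s, r, o) from (o, r, s).
symmetric = np.isclose(
    distmult_score(e_s, e_r, e_o), distmult_score(e_o, e_r, e_s)
)

# With complex-valued embeddings the two directions generally differ.
c_s, c_r, c_o = rng.normal(size=(3, d)) + 1j * rng.normal(size=(3, d))
forward = complex_score(c_s, c_r, c_o)
backward = complex_score(c_o, c_r, c_s)
```

Swapping subject and object leaves the DistMult score unchanged, while the conjugation in ComplEx makes `forward` and `backward` differ whenever the relation embedding has a non-zero imaginary part.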

Trade off between parameter growth and model performance

Due to the large number of entities (vertices) in a knowledge graph, the number of parameters and the computational cost are two essential aspects for evaluating a link prediction model. Specifically, since the numbers of entity and relation embedding parameters are $N_e \times d_e$ and $N_r \times d_r$, where $N_e$ and $N_r$ are the numbers of entities and relations and $d_e$ and $d_r$ are the entity and relation embedding dimensions respectively, a large embedding size leads to an unmanageable number of parameters. For example, applying DistMult with an embedding size of 400 to the whole of Freebase needs more than 100 GB of memory to store its parameters.
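As a rough illustration of this growth, the sketch below counts embedding parameters for vector-valued relations (as in DistMult) and for matrix-valued relations (as in RESCAL). The helper names and the float32 storage assumption are ours, and the counts ignore implementation details such as inverse relations or bias terms, so they are not expected to match the parameter sizes reported in the experiments exactly.

```python
def embedding_params(n_entities, n_relations, d_e, d_r):
    """Embedding parameters when each relation is a vector (e.g. DistMult)."""
    return n_entities * d_e + n_relations * d_r

def rescal_params(n_entities, n_relations, d):
    """RESCAL stores a d x d matrix per relation, so relations grow as d^2."""
    return n_entities * d + n_relations * d * d

def size_in_gb(n_params, bytes_per_param=4):
    """Storage in gigabytes, assuming float32 parameters (assumption)."""
    return n_params * bytes_per_param / 1e9

# FB15k-237 has 14,541 entities and 237 relation types.
N_E, N_R = 14_541, 237

distmult_100 = embedding_params(N_E, N_R, 100, 100)  # 1,477,800
distmult_400 = embedding_params(N_E, N_R, 400, 400)  # 5,911,200 (4x)
rescal_100 = rescal_params(N_E, N_R, 100)            # 3,824,100
rescal_400 = rescal_params(N_E, N_R, 400)            # 43,736,400 (about 11x)

# A hypothetical graph with 50 million entities at dimension 400:
hypothetical_gb = size_in_gb(embedding_params(50_000_000, 0, 400, 400))  # 80.0
```

Entity embeddings dominate the cost on large graphs, which is why the embedding dimension is the main lever on memory.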

As a consequence, as Table 1 shows, these popular scoring functions are all simple and contain only basic operations such as matrix multiplications and vector products. They also tend to keep the dimensionality of entity and relation embeddings relatively low (around 200). As a result, these simple, small and fast models are applicable in real-world scenarios. The drawback is that such relatively low-dimensional embeddings may not have the capacity to model the semantics of the knowledge graph very well (high bias), which harms model performance.

In all, the choice of knowledge graph embedding size, though rarely discussed, is an important problem to be addressed. In this work, we propose DeCom, a simple but effective method to decompress the low-dimensional embedding to a high dimensional space. Furthermore, DeCom has the following attractive aspects:

  • DeCom does not explicitly increase the embedding size so that the model is still able to be scaled in a manageable way.

  • DeCom not only implicitly increases expressiveness but also extracts more robust features from the original embedding, achieving better performance while reducing the risk of overfitting.

  • DeCom learns a more general representation through the decompressing network which could be easily incorporated into many existing link predictors.

Motivation and Approach Overview

Despite the large amount of literature that designs new scoring functions, there has been limited discussion about how large the embedding size should be. Sharma et al. (2018) examine the geometry of knowledge graph embeddings, and their experimental results suggest that for multiplicative methods (DistMult, ComplEx, etc.), increasing the entity and relation embedding size leads to decreasing conicity (a high value of conicity implies that the vectors lie in a narrow cone centered at the origin), which might improve link prediction performance. It has also been proved that several bilinear methods can be fully expressive (i.e., there exists an assignment of values to the embeddings that accurately separates the correct triples from incorrect ones) given a large enough embedding size [11]. However, they also show that the upper bound on the embedding size for full expressiveness is $N_e \cdot N_r$, where $N_e$ is the number of entities and $N_r$ is the number of relations in the knowledge graph, which is not feasible even on a small-scale toy knowledge graph.

The above observations may encourage the use of larger embedding sizes, but scalability is not the only obstacle. In fact, as shown by our experimental results in Table 3 (rows 6 vs. 13 and 14 vs. 21), increasing the embedding size does not guarantee better performance. The same phenomenon has been observed for word embeddings, and Yin and Shen (2018) explain it under the bias-variance tradeoff framework: a larger embedding size leads to decreased bias (a better reconstruction of the factorized co-occurrence matrix) but increased variance (overfitting to the noise in the matrix). The same analysis can be applied to knowledge graph embedding as well. Consider RESCAL, for example: what it essentially does is a tensor decomposition

$$X_k \approx E \, R_k \, E^\top, \quad k = 1, \dots, N_r,$$

where $X \in \{0, 1\}^{N_e \times N_e \times N_r}$ is the tensor that represents the training graph with $N_e$ entities and $N_r$ relations: $X_{ijk} = 1$ if there is a relation $k$ from entity $i$ to entity $j$, and $X_{ijk} = 0$ if there is none. $E \in \mathbb{R}^{N_e \times d}$ is the entity embedding matrix, where its $i$-th row is the embedding vector of the $i$-th entity and $d$ is the embedding size. $R \in \mathbb{R}^{d \times d \times N_r}$ is the relation tensor, where $R_k$ is the embedding matrix of the $k$-th relation. This decomposition can be lossless with a large enough embedding size, but if we do obtain such an embedding, the performance on the validation and test sets will be zero, because the training graph tensor is corrupted from the true graph tensor by randomly flipping some of its entries. Moreover, for the link prediction task, we evaluate exclusively on those corrupted entries. Recent studies often prove that their model is fully expressive (unbiased with large enough $d$), but this may not be relevant to actual model performance, as the variance plays an important role here. Since DistMult is a special case of RESCAL and ComplEx generalizes DistMult to the complex space, this analysis applies to them, and to other multiplicative models, as well.

Motivated by the scalability issue and the bias-variance tradeoff, we propose to use a shallow neural network to decompress low-dimensional embedding vectors into a higher-dimensional space before applying the scoring function. The intuition is that the low-dimensional embeddings store compressed information about entities and relations, and the decompressing network projects this compressed representation into a higher-dimensional space that is easier for simple scoring functions to handle, thus achieving the low bias of high-dimensional embeddings with far fewer parameters. In addition, the decompressing network must learn general information about the knowledge graph, which makes it more robust to noise and lowers its variance. A smaller total number of parameters also suggests that the model is less prone to overfitting.

Detailed Approach

| # | Model | $d_e$ | $d_e'$ | $d_r$ | $d_r'$ | MRR | H@1 | H@3 | H@10 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | RESCAL | 100 | - | - | | 0.255 | 0.185 | 0.278 | 0.397 |
| 2 | + F-DeCom | 100 | 400 | 200 | | 0.353 | 0.260 | 0.388 | 0.535 |
| 3 | + F-DeCom-En | 100 | 200 | - | | 0.349 | 0.260 | 0.381 | 0.526 |
| 4 | + F-DeCom-Rel | 200 | - | 200 | | 0.354 | 0.261 | 0.388 | 0.536 |
| 5 | + vanilla expansion | 400 | - | - | | 0.317 | 0.233 | 0.344 | 0.483 |
| 6 | DistMult | 100 | - | 100 | - | 0.258 | 0.173 | 0.283 | 0.417 |
| 7 | + C-DeCom | 100 | 400 | 100 | 400 | 0.291 | 0.210 | 0.318 | 0.454 |
| 8 | + F-DeCom | 100 | 400 | 100 | 400 | 0.299 | 0.213 | 0.329 | 0.470 |
| 9 | + F-DeCom-En | 100 | 400 | 400 | - | 0.296 | 0.214 | 0.325 | 0.460 |
| 10 | + F-DeCom-Rel | 200 | - | 100 | 200 | 0.281 | 0.201 | 0.307 | 0.442 |
| 11 | + C-DeCom-En | 100 | 400 | 400 | - | 0.278 | 0.198 | 0.306 | 0.442 |
| 12 | + C-DeCom-Rel | 400 | - | 100 | 400 | 0.273 | 0.192 | 0.299 | 0.436 |
| 13 | + vanilla expansion | 400 | - | 400 | - | 0.269 | 0.186 | 0.291 | 0.428 |
| 14 | ComplEx | 100 | - | 100 | - | 0.257 | 0.182 | 0.270 | 0.426 |
| 15 | + C-DeCom | 100 | 400 | 100 | 400 | 0.284 | 0.200 | 0.313 | 0.453 |
| 16 | + F-DeCom | 100 | 200 | 100 | 200 | 0.303 | 0.218 | 0.334 | 0.473 |
| 17 | + F-DeCom-En | 100 | 400 | 100 | 400 | 0.280 | 0.205 | 0.325 | 0.461 |
| 18 | + F-DeCom-Rel | 100 | 400 | 100 | 400 | 0.280 | 0.205 | 0.325 | 0.461 |
| 19 | + C-DeCom-En | 100 | 400 | 100 | 400 | 0.285 | 0.203 | 0.312 | 0.450 |
| 20 | + C-DeCom-Rel | 100 | 400 | 100 | 400 | 0.283 | 0.199 | 0.323 | 0.453 |
| 21 | + vanilla expansion | 400 | - | 400 | - | 0.267 | 0.188 | 0.292 | 0.426 |

Table 2: Performance of different models w/ and w/o decompressing on the test set of the FB15k-237 dataset. C-DeCom and F-DeCom denote the CNN-based and FCNN-based decompressing functions. *-En means that only entity embeddings are decompressed and *-Rel means that only relation embeddings are decompressed. $d_e$ and $d_e'$ denote the dimension of entity features before and after the decompressing layer. $d_r$ and $d_r'$ denote the dimension of relation features before and after the decompressing layer. '-' denotes no decompressing layer in the model. 'vanilla expansion' means explicitly increasing the embedding dimension (the same notation is used in the other tables).

| # | Model | $d_e$ | $d_e'$ | $d_r$ | $d_r'$ | MRR | H@1 | H@3 | H@10 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | RESCAL | 100 | - | - | | 0.441 | 0.417 | 0.452 | 0.487 |
| 2 | + F-DeCom | 100 | 200 | 100 | | 0.457 | 0.427 | 0.469 | 0.515 |
| 3 | + F-DeCom-En | 100 | 400 | - | | 0.451 | 0.424 | 0.464 | 0.500 |
| 4 | + F-DeCom-Rel | 200 | - | 100 | | 0.453 | 0.427 | 0.464 | 0.503 |
| 5 | + vanilla expansion | 400 | - | - | | 0.436 | 0.415 | 0.444 | 0.475 |
| 6 | DistMult | 100 | - | 100 | - | 0.427 | 0.381 | 0.436 | 0.487 |
| 7 | + C-DeCom | 100 | 400 | 100 | 400 | 0.445 | 0.413 | 0.458 | 0.510 |
| 8 | + F-DeCom | 100 | 200 | 100 | 200 | 0.450 | 0.418 | 0.461 | 0.515 |
| 9 | + F-DeCom-En | 100 | 200 | 200 | - | 0.440 | 0.401 | 0.452 | 0.507 |
| 10 | + F-DeCom-Rel | 200 | - | 100 | 200 | 0.442 | 0.396 | 0.449 | 0.508 |
| 11 | + C-DeCom-En | 100 | 400 | 400 | - | 0.431 | 0.392 | 0.447 | 0.502 |
| 12 | + C-DeCom-Rel | 400 | - | 100 | 400 | 0.440 | 0.411 | 0.450 | 0.505 |
| 13 | + vanilla expansion | 400 | - | 400 | - | 0.422 | 0.380 | 0.437 | 0.482 |
| 14 | ComplEx | 100 | - | 100 | - | 0.445 | 0.415 | 0.457 | 0.502 |
| 15 | + C-DeCom | 100 | 400 | 100 | 400 | 0.438 | 0.419 | 0.476 | 0.521 |
| 16 | + F-DeCom | 100 | 400 | 100 | 400 | 0.452 | 0.410 | 0.461 | 0.509 |
| 17 | + F-DeCom-En | 100 | 200 | 200 | - | 0.442 | 0.406 | 0.461 | 0.504 |
| 18 | + F-DeCom-Rel | 200 | - | 100 | 200 | 0.448 | 0.411 | 0.463 | 0.507 |
| 19 | + C-DeCom-En | 100 | 400 | 400 | - | 0.441 | 0.410 | 0.460 | 0.507 |
| 20 | + C-DeCom-Rel | 400 | - | 100 | 400 | 0.448 | 0.420 | 0.455 | 0.511 |
| 21 | + vanilla expansion | 400 | - | 400 | - | 0.440 | 0.411 | 0.452 | 0.510 |

Table 3: Performance of different models w/ and w/o decompressing on the test set of the WN18RR dataset.

| Model | FB15k-237 MRR | FB15k-237 H@1 | FB15k-237 H@3 | FB15k-237 H@10 | WN18RR MRR | WN18RR H@1 | WN18RR H@3 | WN18RR H@10 |
|---|---|---|---|---|---|---|---|---|
| RESCAL | 0.255 | 0.185 | 0.278 | 0.397 | 0.441 | 0.417 | 0.452 | 0.487 |
| + DeCom | 0.353 | 0.260 | 0.388 | 0.535 | 0.457 | 0.427 | 0.469 | 0.515 |
| DistMult (from [6]) | 0.241 | 0.155 | 0.263 | 0.419 | 0.430 | 0.390 | 0.440 | 0.490 |
| DistMult (ours) | 0.258 | 0.173 | 0.283 | 0.417 | 0.427 | 0.381 | 0.436 | 0.487 |
| + DeCom | 0.299 | 0.213 | 0.329 | 0.470 | 0.450 | 0.418 | 0.461 | 0.515 |
| ComplEx (from [6]) | 0.247 | 0.158 | 0.275 | 0.428 | 0.440 | 0.410 | 0.460 | 0.510 |
| ComplEx (ours) | 0.257 | 0.182 | 0.270 | 0.426 | 0.445 | 0.415 | 0.457 | 0.502 |
| + DeCom | 0.303 | 0.218 | 0.334 | 0.473 | 0.452 | 0.410 | 0.461 | 0.509 |
| RotatE (from [22]) | 0.332 | 0.235 | 0.368 | 0.524 | 0.475 | 0.433 | 0.494 | 0.556 |
| - adv sample | 0.297 | 0.205 | 0.328 | 0.480 | - | - | - | - |
| ConvE (from [6]) | 0.325 | 0.237 | 0.356 | 0.501 | 0.430 | 0.400 | 0.440 | 0.520 |

Table 4: Performance of different models w/ and w/o decompressing on the test sets of the FB15k-237 and WN18RR datasets. Results marked "from [6]" and "from [22]" are taken from Dettmers et al. [6] and Sun et al. [22], respectively. "- adv sample" stands for RotatE without adversarial sampling, which should be a fairer comparison. For each DeCom-enhanced result, the best result is selected from all DeCom settings in Tables 2 and 3.


We denote the decompressing functions as $f_s, f_r, f_o: \mathbb{R}^{d} \to \mathbb{R}^{d'}$ for the subject entity $s$, relation $r$ and object entity $o$ respectively, where $d$ and $d'$ are the original and projected embedding sizes respectively. For any scoring function $\phi(e_s, e_r, e_o)$, we can simply incorporate our decompressing functions and change it into $\phi(f_s(e_s), f_r(e_r), f_o(e_o))$. For example, the scoring function of DistMult is $\langle e_s, e_r, e_o \rangle$, and after inserting the decompressing layer it becomes $\langle f_s(e_s), f_r(e_r), f_o(e_o) \rangle$. Table 1 shows more examples of scoring functions w/ and w/o decompressing operations. Figure 1 shows the architecture of the DeCom-based knowledge graph embedding model.
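The composition described above can be written as a small wrapper around any scoring function. This is a minimal sketch under our own assumptions: the names `with_decom`, `W_ent` and `W_rel` are hypothetical, the decompressors are plain linear maps, and the subject and object share one entity decompressor for brevity (the decompressing functions are allowed to differ).

```python
import numpy as np

def distmult(e_s, e_r, e_o):
    """DistMult score: three-way inner product."""
    return float(np.sum(e_s * e_r * e_o))

def with_decom(score_fn, f_s, f_r, f_o):
    """Return a scoring function that first decompresses each embedding."""
    def scored(e_s, e_r, e_o):
        return score_fn(f_s(e_s), f_r(e_r), f_o(e_o))
    return scored

rng = np.random.default_rng(0)
d, d_prime = 4, 16                      # illustrative sizes (assumption)
W_ent = rng.normal(size=(d_prime, d))   # hypothetical entity decompressor weights
W_rel = rng.normal(size=(d_prime, d))   # hypothetical relation decompressor weights

f_ent = lambda e: W_ent @ e             # maps R^d to R^d'
f_rel = lambda e: W_rel @ e

decom_distmult = with_decom(distmult, f_ent, f_rel, f_ent)

e_s, e_r, e_o = rng.normal(size=(3, d))
score = decom_distmult(e_s, e_r, e_o)

# With identity decompressors the wrapper reduces to the original model.
identity = lambda e: e
plain = with_decom(distmult, identity, identity, identity)(e_s, e_r, e_o)
```

Because the wrapper only touches the inputs of the scoring function, the same pattern applies to other scoring functions that take the three embeddings as arguments.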

Decompressing Functions

Theoretically, DeCom could be implemented by any kind of architecture, such as fully connected, convolutional or recurrent neural networks. In this work, we mainly explore decompressing functions based on convolutional neural networks (CNNs) and fully connected neural networks (FCNNs). Because the embedding has no sequential nature, recurrent neural networks are not explored in this work. Furthermore, we point out that the decompressing functions $f_s$, $f_r$ and $f_o$ are independent and do not need to be the same.

CNNs-based DeCom (C-DeCom)

Because of their high parameter efficiency and fast computation, CNNs are well suited to represent decompressing functions. The details of the CNN-based decompressing function are as follows: for a batch of triples, we first look up their embedding vectors in the entity and relation embedding tables. We then feed them into one layer of 1-D CNN followed by batch normalization and dropout, and use the final output to train a knowledge graph embedding model. Here, batch normalization [10] and dropout [20] are employed to speed up training and prevent overfitting. Generally, this method can be easily incorporated into any non-parametric scoring function.
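A minimal NumPy sketch of such a convolutional decompressor is given below. Several details are assumptions on our part, since the text does not specify them: zero "same" padding, odd kernel sizes only, flattening the filter outputs into one vector (so 4 filters turn a 100-dimensional embedding into 400 features), and omitting batch normalization and dropout as one would at inference time in a simplified setting.

```python
import numpy as np

def conv1d_decompress(e, kernels, bias):
    """Apply K one-dimensional filters with zero 'same' padding and flatten.

    e:       (d,) embedding vector
    kernels: (K, k) filter weights, k odd
    bias:    (K,) one bias per filter
    returns: (K * d,) decompressed feature vector
    """
    n_filters, k = kernels.shape
    if k % 2 == 0:
        raise ValueError("this sketch only handles odd kernel sizes")
    padded = np.pad(e, k // 2)
    # windows[i] is the length-k slice of the padded vector centred on position i
    windows = np.stack([padded[i:i + k] for i in range(len(e))])  # (d, k)
    feature_maps = kernels @ windows.T + bias[:, None]            # (K, d)
    return feature_maps.reshape(-1)

rng = np.random.default_rng(0)
e = rng.normal(size=100)             # a 100-dimensional embedding
kernels = rng.normal(size=(4, 3))    # 4 filters of size 3 (illustrative choice)
bias = np.zeros(4)

features = conv1d_decompress(e, kernels, bias)   # shape (400,)
n_extra_params = kernels.size + bias.size        # 4 * 3 + 4 = 16
```

The number of extra parameters depends only on the number and size of the filters, not on the embedding dimension or the number of entities, which is the source of the parameter efficiency noted above.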

FCNNs-based decompressing function (F-DeCom)

Similar to C-DeCom, F-DeCom employs a linear layer to decompress the input features into a higher dimensional feature space.
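For comparison, a linear decompressing layer can be sketched as follows. The class name, the initialization scale and the batch size are illustrative assumptions, and any normalization or dropout applied around the layer is left out.

```python
import numpy as np

class LinearDecompressor:
    """Affine map from d_in to d_out features, shared by all embeddings."""

    def __init__(self, d_in, d_out, rng):
        # Scaled Gaussian initialization (an assumption, not from the paper).
        self.W = rng.normal(scale=d_in ** -0.5, size=(d_out, d_in))
        self.b = np.zeros(d_out)

    def __call__(self, E):
        """E: (batch, d_in) -> (batch, d_out)."""
        return E @ self.W.T + self.b

    @property
    def n_params(self):
        return self.W.size + self.b.size

rng = np.random.default_rng(0)
f_decom = LinearDecompressor(100, 400, rng)

batch = rng.normal(size=(8, 100))    # 8 embeddings looked up for a batch
decompressed = f_decom(batch)        # shape (8, 400)
extra = f_decom.n_params             # 100 * 400 + 400 = 40,400
```

The layer is shared across all entities (or all relations), so its 40,400 parameters stay constant however many entities the knowledge graph contains, whereas raising the embedding size from 100 to 400 directly adds 300 parameters per entity.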


Experiments

We experiment with three bilinear knowledge graph embedding models, i.e., RESCAL [18], DistMult [26] and ComplEx [24], on two benchmark datasets, i.e., FB15k-237 [23] and WN18RR [6], to show that our proposed method consistently boosts the performance of knowledge graph embedding methods.

Experimental Settings

Benchmark Datasets.

FB15k [2] and WN18 [2] are widely used for evaluating knowledge graph embedding methods. Toutanova and Chen [23] show that FB15k contains a large number of inverse relations and that most test triples can be inferred from their reverse relation in the training set, so they delete the reverse relations from FB15k and propose FB15k-237. There are 14,541 entities and 237 kinds of relations in FB15k-237. Similarly, Dettmers et al. [6] remove the reverse relations from WN18 and propose WN18RR. There are 40,934 entities and 11 types of relations in WN18RR. Therefore, in this paper, we evaluate our methods on FB15k-237 and WN18RR.

Evaluation Protocol.

We follow the standard evaluation protocol of this task. For each test triple $(s, r, o)$, we corrupt the subject or the object with the other entities in the knowledge graph, turning it into $(s', r, o)$ or $(s, r, o')$. We then rank the triples and check how highly the ground truth is ranked. Triples that differ from the ground truth but are also correct are filtered out. Mean Reciprocal Rank (MRR) and Hit@N (H@N), where $N \in \{1, 3, 10\}$, are the standard evaluation measures for these datasets and are reported in our experiments.
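The filtered ranking metrics can be sketched in a few lines. The function names and the toy scores are illustrative, and ties are broken optimistically here (only strictly higher scores count against the ground truth), which is one common convention and an assumption on our part.

```python
def filtered_rank(scores, true_idx, known_idx):
    """Rank of the true entity among candidates, ignoring other known-true ones.

    scores:    one score per candidate entity (higher means more plausible)
    true_idx:  index of the ground-truth entity
    known_idx: indices of other entities that also form correct triples
    """
    filtered = set(known_idx) - {true_idx}
    true_score = scores[true_idx]
    better = sum(
        1
        for i, s in enumerate(scores)
        if i != true_idx and i not in filtered and s > true_score
    )
    return better + 1

def mrr(ranks):
    """Mean Reciprocal Rank."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at(ranks, n):
    """Fraction of test queries whose ground truth is ranked in the top n."""
    return sum(r <= n for r in ranks) / len(ranks)

# Toy query (s, r, ?) with 5 candidate objects; entity 4 is the ground truth
# and entity 1 is another correct answer, so it is filtered out.
scores = [0.1, 0.9, 0.8, 0.3, 0.7]
rank = filtered_rank(scores, true_idx=4, known_idx=[1])   # 2

ranks = [rank, 1, 12]           # ranks collected over three toy queries
toy_mrr = mrr(ranks)            # (1/2 + 1 + 1/12) / 3
toy_hits3 = hits_at(ranks, 3)   # 2 of 3 queries
```

Without filtering, entity 1 would push the ground truth down to rank 3 even though it is itself a correct answer.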

Different DeCom strategies

To further explore DeCom, we study several decompressing strategies, detailed as follows:

  • Different Decompressing functions: in this work, two decompressing functions, F-DeCom and C-DeCom, are explored in our experiments.

  • Decompressing objects: Because of the high flexibility of DeCom, decompressing functions could be applied on 1) just entities 2) just relations and 3) both entities and relations.

Hyperparameter Settings.

In order to make our results comparable, for each link predicting baseline model, we keep most of the hyperparameters and training strategies the same between the original model and DeCom-enhanced model. All models are trained for 500 epochs, embedding size is 100, and other hyperparameters are chosen based on the performance on the validation set by grid search.

For DistMult [26] and ComplEx [24], following Dettmers et al. [6], the 1-1 training strategy is employed and Adagrad [7] is used as the optimizer. We regularize these two models by forcing their entity embeddings to have an L2 norm of 1 after each parameter update, and the pairwise margin-based ranking loss (margin = 1.0) [2] is employed. Furthermore, we find that regularizing the entity embeddings after the decompressing layer to have an L2 norm of 1 effectively prevents overfitting and makes the training process stable. The learning rate of Adagrad is chosen from {0.08, 0.10, 0.12}.
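The two ingredients named above, the pairwise margin-based ranking loss and the unit-norm constraint, can be sketched as follows. The function names and the toy scores are illustrative, and the sketch assumes that a higher score means a more plausible triple.

```python
import numpy as np

def margin_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Mean of max(0, margin - pos + neg) over paired positive/negative triples."""
    return float(np.mean(np.maximum(0.0, margin - pos_scores + neg_scores)))

def project_to_unit_norm(E, eps=1e-12):
    """Rescale every row of E to have an L2 norm of 1."""
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    return E / np.maximum(norms, eps)

# Two toy (positive, corrupted) pairs.
pos = np.array([2.0, 0.5])
neg = np.array([0.5, 0.4])
loss = margin_ranking_loss(pos, neg)   # (0 + 0.9) / 2 = 0.45

# Applied to the entity features after every parameter update.
rng = np.random.default_rng(0)
E = project_to_unit_norm(rng.normal(size=(5, 100)))
row_norms = np.linalg.norm(E, axis=1)  # all equal to 1 up to rounding
```

The first pair already satisfies the margin and contributes no loss; only the second pair, where the corrupted triple scores within the margin of the true one, produces a gradient.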

For RESCAL [18], we apply the 1-N [6] training strategy, employ Adam [12] as the optimizer and use binary cross entropy as the loss function. The learning rate of Adam is chosen from {0.01, 0.005, 0.001, 0.0005}. Because RESCAL's relations are represented as full-rank matrices, and it is not intuitive to decompress a low-dimensional vector into a matrix by convolution, we only experiment with fully connected networks for RESCAL.

For each model's corresponding DeCom-enhanced model, the training strategies, such as the optimizer, 1-1 or 1-N training and the hyperparameter grid search ranges, remain the same so that the results are comparable. In addition, the hyperparameters of the decompressing function are selected via grid search according to the performance on the validation set. The grid search ranges for the DeCom layer are as follows: for C-DeCom, the number of kernels is chosen from {2, 3, 4} and the kernel size from {3, 4}; for F-DeCom, the dimension of the decompressed features is chosen from {200, 400}; for RESCAL relations only, the pre-decompress dimension is chosen from {100, 200, 400, 1000, 2000}.

Main Results

Link prediction results on two datasets of three baseline models and their corresponding DeCom-based models are shown in Tables 2, 3 and 4.

DeCom vs. no DeCom: DeCom-based knowledge graph embedding models outperform their corresponding baseline models significantly, which demonstrates the expressive power of DeCom. The DeCom models also outperform the baseline models with explicitly increased embedding size, indicating that they are more robust to overfitting.

C-DeCom vs. F-DeCom: F-DeCom generally obtains better scores but is more prone to overfitting: in row 16 of Table 2 and rows 2 and 8 of Table 3, the best F-DeCom feature size is 200 instead of 400. One reason that F-DeCom achieves higher scores is that it can extract features from all embedding dimensions, whereas C-DeCom can only extract features within the range of its kernels.

DeCom-En vs. DeCom-Rel: From the related rows in Tables 2 and 3, decompressing only the relation features obtains slightly better results. We attribute this to the fact that modeling the relation between entities is more complicated and needs more expressive and robust features from DeCom.

DeCom vs. others: In Table 4 we collect the scores of the best configurations from Tables 2 and 3 and compare them with some other recent works. In particular, DeCom-based RESCAL link prediction models achieve state-of-the-art performance on the FB15k-237 dataset across all metrics.

We further note that DeCom helps the original model achieve a larger improvement on the dataset with a larger number of relations. Specifically, link prediction models with DeCom achieve +16% and +5% averaged improvement on FB15k-237 and WN18RR respectively. We attribute this to the fact that WN18RR is simpler in structure and the original embedding is already able to extract meaningful features from its small number of relations. Explicitly increasing the embedding size also makes baseline performance worse on WN18RR, which suggests that 100 dimensions may be enough. Therefore, models trained on FB15k-237 benefit more from DeCom.

| Model | Embed. Size | DeCom Size | MRR | H@10 | Param. Size | Speed (triples/sec) |
|---|---|---|---|---|---|---|
| RESCAL | 100 | - | 0.258 | 0.402 | 6214300 | 1378.03 |
| RESCAL | 400 | - | 0.324 | 0.487 | 81977200 | 1317.60 |
| RESCAL + F-DeCom | 100 | 400 | 0.356 | 0.537 | 33751300 | 1229.68 |
| DistMult | 100 | - | 0.273 | 0.421 | 1501900 | 1429.06 |
| DistMult | 400 | - | 0.284 | 0.435 | 6007600 | 1400.31 |
| DistMult + C-DeCom | 100 | 400 | 0.314 | 0.484 | 1501942 | 1400.73 |
| DistMult + F-DeCom | 100 | 400 | 0.302 | 0.468 | 1583700 | 1386.90 |
| ComplEx | 100 | - | 0.280 | 0.432 | 3003800 | 1412.71 |
| ComplEx | 400 | - | 0.270 | 0.428 | 12015200 | 1310.36 |
| ComplEx + C-DeCom | 100 | 400 | 0.288 | 0.449 | 3003882 | 1275.31 |
| ComplEx + F-DeCom | 100 | 400 | 0.308 | 0.472 | 3949120 | 1221.56 |

Table 5: Comparison between models with different types of DeCom layers on the validation set of FB15k-237. The speed is the number of triples processed per second at prediction (validation) time. DeCom size is the size of the features after the decompressing layer.

Analysis and Discussion

Parameter and Running Time Efficiency

The decompressing layer maps the original embedding to a more expressive and robust feature space. One natural question is: what if we explicitly increase the embedding size instead? To answer it, we increase the embedding size from 100 to 400 to match the feature size after the decompressing layer and compare the two approaches from different perspectives. The results are shown in Table 5. Models with a decompressing layer not only achieve better performance but are also much more parameter-efficient, at a small cost in prediction speed. Notably, ComplEx with a large embedding size actually performs worse, which we attribute to overfitting caused by too many parameters. Compared with C-DeCom, F-DeCom uses slightly more parameters and decodes more slowly; it obtains better scores for ComplEx, whereas C-DeCom scores higher for DistMult.
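The parameter-efficiency comparison can be made explicit with a little arithmetic on the parameter counts reported in Table 5. The dictionary keys and variable names below are our own labels; the numbers are copied from the table.

```python
# Parameter counts copied from Table 5 (FB15k-237).
params = {
    "RESCAL-100": 6_214_300,
    "RESCAL-400": 81_977_200,
    "RESCAL-100 + F-DeCom": 33_751_300,
    "DistMult-100": 1_501_900,
    "DistMult-400": 6_007_600,
    "DistMult-100 + C-DeCom": 1_501_942,
    "DistMult-100 + F-DeCom": 1_583_700,
}

# Explicitly expanding DistMult from 100 to 400 dimensions quadruples its size.
distmult_expansion = params["DistMult-400"] / params["DistMult-100"]   # 4.0

# The decompressing layers add a fixed number of parameters instead.
c_decom_extra = params["DistMult-100 + C-DeCom"] - params["DistMult-100"]  # 42
f_decom_extra = params["DistMult-100 + F-DeCom"] - params["DistMult-100"]  # 81,800

# RESCAL + F-DeCom needs well under half the parameters of RESCAL at size 400.
rescal_ratio = params["RESCAL-100 + F-DeCom"] / params["RESCAL-400"]   # about 0.41
```

For DistMult, C-DeCom reaches a higher MRR than the 400-dimensional baseline while adding only 42 parameters to the 100-dimensional model.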

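| Model | $d_e$ | $d_e'$ | $d_r$ | $d_r'$ | MRR | H@10 |
|---|---|---|---|---|---|---|
| DistMult | 100 | - | 100 | - | 0.273 | 0.421 |
| + C-DeCom | 100 | 100 | 100 | 100 | 0.291 | 0.452 |
| + F-DeCom | 100 | 100 | 100 | 100 | 0.287 | 0.448 |
| ComplEx | 100 | - | 100 | - | 0.280 | 0.432 |
| + C-DeCom | 100 | 100 | 100 | 100 | 0.283 | 0.441 |
| + F-DeCom | 100 | 100 | 100 | 100 | 0.292 | 0.458 |

Table 6: The robustness comparison between DeCom and the original models on the FB15k-237 validation set.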

Why is DeCom effective?

We think that there are two main reasons to explain the effectiveness of DeCom:

1) Implicitly increasing the feature dimension through the decompressing functions, which improves the model's expressiveness.

2) Learning more robust features. To examine this, we design decompressing functions that keep the size of the input features (the original embedding) and the output features the same. Specifically, we set the output feature size of F-DeCom and C-DeCom to the input embedding dimension, i.e., 100. The results are shown in Table 6. Although there is no increase in embedding size, the DeCom models still improve performance, suggesting that they learn more robust embeddings.

Related Work

To predict the missing links in knowledge graphs, knowledge graph embedding (KGE) methods have been extensively studied in recent years. For example, RESCAL [18] employs a bilinear product between the vector embeddings of the subject and object entities and a full-rank matrix for each relation. TransE [2] implicitly models relations by representing each relation as a bijection between source and target entities. DistMult [26], a special case of RESCAL, uses a diagonal matrix for each relation so that the number of parameters grows linearly. ComplEx [24] extends DistMult to model asymmetric relations by introducing complex embeddings. RotatE [22] models the relation as a rotation from the subject entity to the object entity in the complex vector space. Most prior methods are based on simple operations and shallow neural networks, which makes them fast, scalable and memory-efficient; however, these properties also restrict the expressiveness of the learned features. To mitigate this problem, Dettmers et al. [6] (ConvE) employ 2-D convolution operations on the subject entity and relation embedding vectors after they are reshaped into matrices and concatenated. However, the reshaping and concatenation operations, and the application of 2-D convolution to embeddings, are not intuitive.


In this work, to increase the expressiveness and robustness of shallow link predictors, we propose DeCom, a flexible decompressing mechanism that maps low-dimensional embeddings to a more expressive and robust space by adding just a few extra parameters. DeCom can be easily incorporated into many existing knowledge graph embedding models, and experimental results show that it boosts the performance of many popular link predictors on several knowledge graphs and obtains state-of-the-art results on FB15k-237 across all evaluation metrics.


  • [1] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor (2008) Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp. 1247–1250. Cited by: Introduction.
  • [2] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko (2013) Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, pp. 2787–2795. Cited by: Introduction, Introduction, Benchmark Datasets., Hyperparamerter Settings., Related Work.
  • [3] A. Bordes, J. Weston, R. Collobert, and Y. Bengio (2011) Learning structured embeddings of knowledge bases. In Twenty-Fifth AAAI Conference on Artificial Intelligence, Cited by: Introduction.
  • [4] L. Cai and W. Y. Wang (2017) KBGAN: adversarial learning for knowledge graph embeddings. arXiv preprint arXiv:1711.04071. Cited by: Introduction.
  • [5] R. Das, A. Neelakantan, D. Belanger, and A. McCallum (2016) Chains of reasoning over entities, relations, and text using recurrent neural networks. arXiv preprint arXiv:1607.01426. Cited by: Introduction.
  • [6] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel (2018) Convolutional 2d knowledge graph embeddings. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: Introduction, Hyperparamerter Settings., Experiments.
  • [7] J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (Jul), pp. 2121–2159. Cited by: Hyperparamerter Settings..
  • [8] T. Ebisu and R. Ichise (2018) TorusE: knowledge graph embedding on a lie group. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: Introduction.
  • [9] J. Feng, M. Huang, M. Wang, M. Zhou, Y. Hao, and X. Zhu (2016) Knowledge graph embedding by flexible translation. In Fifteenth International Conference on the Principles of Knowledge Representation and Reasoning, Cited by: Introduction.
  • [10] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: CNNs-based DeCom (C-DeCom).
  • [11] S. M. Kazemi and D. Poole (2018) SimplE embedding for link prediction in knowledge graphs. In Advances in Neural Information Processing Systems, pp. 4284–4295. Cited by: Motivation and Approach Overview.
  • [12] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: Hyperparamerter Settings..
  • [13] D. Krompaß, S. Baier, and V. Tresp (2015) Type-constrained representation learning in knowledge graphs. In International semantic web conference, pp. 640–655. Cited by: Introduction.
  • [14] T. Lacroix, N. Usunier, and G. Obozinski (2018) Canonical tensor decomposition for knowledge base completion. arXiv preprint arXiv:1806.07297. Cited by: Introduction.
  • [15] G. A. Miller (1995) WordNet: a lexical database for english. Communications of the ACM 38 (11), pp. 39–41. Cited by: Introduction.
  • [16] D. Q. Nguyen, K. Sirts, L. Qu, and M. Johnson (2016) STransE: a novel embedding model of entities and relationships in knowledge bases. arXiv preprint arXiv:1606.08140. Cited by: Introduction.
  • [17] M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich (2015) A review of relational machine learning for knowledge graphs. Proceedings of the IEEE 104 (1), pp. 11–33. Cited by: Introduction.
  • [18] M. Nickel, V. Tresp, and H. Kriegel (2011) A three-way model for collective learning on multi-relational data. In ICML, Vol. 11, pp. 809–816. Cited by: Table 1, RESCAL, Hyperparamerter Settings., Experiments, Related Work.
  • [19] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. Van Den Berg, I. Titov, and M. Welling (2018) Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pp. 593–607. Cited by: Introduction.
  • [20] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: CNNs-based DeCom (C-DeCom).
  • [21] F. M. Suchanek, G. Kasneci, and G. Weikum (2007) Yago: a core of semantic knowledge. In Proceedings of the 16th international conference on World Wide Web, pp. 697–706. Cited by: Introduction.
  • [22] Z. Sun, Z. Deng, J. Nie, and J. Tang (2019) RotatE: knowledge graph embedding by relational rotation in complex space. arXiv preprint arXiv:1902.10197. Cited by: Introduction, Introduction, Related Work.
  • [23] K. Toutanova and D. Chen (2015) Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pp. 57–66. Cited by: Experiments.
  • [24] T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard (2016) Complex embeddings for simple link prediction. In International Conference on Machine Learning, pp. 2071–2080. Cited by: Table 1, Introduction, Introduction, ComplEx, Hyperparamerter Settings., Experiments, Related Work.
  • [25] Q. Xie, X. Ma, Z. Dai, and E. Hovy (2017) An interpretable knowledge transfer model for knowledge base completion. arXiv preprint arXiv:1704.05908. Cited by: Introduction.
  • [26] B. Yang, W. Yih, X. He, J. Gao, and L. Deng (2014) Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575. Cited by: Table 1, Introduction, Introduction, DistMult, Hyperparamerter Settings., Experiments, Related Work.