Introduction
Recently, knowledge bases (KBs) such as Freebase [1], WordNet [15], and Yago [21] have proven useful in many tasks, including reading comprehension, recommender systems, and information retrieval. These KBs collect facts about the real world as directed graphs (knowledge graphs), in which entities (nodes) are connected by their relationships (edges). A fact is thus represented by a triple $(s, r, o)$, i.e., a relation $r$ between a subject entity $s$ and an object entity $o$.
Although these knowledge graphs contain millions of entities, they are usually incomplete, i.e., some relationships between entities are missing. Accordingly, extensive research has been done on predicting those missing links by learning low-dimensional embedding representations of entities and relations [3, 2, 26, 13, 17, 16, 9, 24, 5, 4, 25, 6, 14, 8, 19, 22]. Because knowledge graphs are large, with millions of facts, popular link predictors tend to be fast and shallow: they use simple scoring functions and small embedding sizes, at the potential expense of learning less expressive features.
Most prior studies aim to design new scoring functions, such as TransE [2], DistMult [26], ComplEx [24], and RotatE [22]. In contrast, we propose a general and effective method that can be applied to these models to boost their performance without explicitly increasing the embedding size or changing the scoring function. In more detail, the original embeddings of entities and relations are mapped to a more expressive and robust space by decompressing functions, and the link prediction models are then trained in this new space. Our method is simple and general enough to be applied to existing link prediction models, and experimental results on different benchmark knowledge graphs and popular link prediction models demonstrate that it can boost their scores with only a small number of extra parameters.
Specifically, our contributions are as follows:

We propose DeCom, a simple but effective decompressing method that significantly improves the performance of many existing knowledge graph embedding models.

Because it does not need to increase the embedding size explicitly, DeCom helps link prediction models save storage space and GPU memory.

By employing a convolutional neural network as the decompressing network, DeCom-based models add only a few extra parameters to the original models and are thus highly parameter-efficient. Even if a fully connected network is used as the decompressing network, the number of extra parameters is constant and does not grow with the size of the knowledge graph.

Experiments on several benchmark knowledge graphs and link prediction models show the effectiveness of our method; among them, RESCAL + DeCom achieves state-of-the-art results on FB15k-237 across all evaluation metrics.
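Table 1: Scoring functions of popular link prediction models, without and with DeCom.
Model  Scoring Function  DeCom Scoring Function

RESCAL [18]  $e_s^\top W_r e_o$  $f_s(e_s)^\top f_r(W_r) f_o(e_o)$
DistMult [26]  $\langle e_s, w_r, e_o \rangle$  $\langle f_s(e_s), f_r(w_r), f_o(e_o) \rangle$
ComplEx [24]  $\mathrm{Re}(\langle e_s, w_r, \bar{e}_o \rangle)$  $\mathrm{Re}(\langle f_s(e_s), f_r(w_r), \overline{f_o(e_o)} \rangle)$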
Background
Formally, a knowledge graph consists of a set of entities (vertices) $\mathcal{E}$ and a set of relations (edges) $\mathcal{R}$, and can be represented by a set of triples (facts) $\{(s, r, o)\} \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E}$. Each fact (triple) represents a relationship $r$ between one subject entity $s$ and one object entity $o$. Commonly, there are millions of entities in one knowledge graph, but many links (relationships) between them are missing. Completing these missing links is referred to as Knowledge Base Completion (KBC), or more specifically, link prediction.
Most of the literature approaches the link prediction task by learning low-dimensional embedding vectors of knowledge graph entities and relations, known as Knowledge Graph Embedding (KGE). These methods formalize the problem as finding a scoring function $\psi: \mathcal{E} \times \mathcal{R} \times \mathcal{E} \to \mathbb{R}$, which computes a score for each triple $(s, r, o)$ indicating whether the triple is likely to be true or false. Intuitively, a good scoring function should assign higher scores to true triples than to false ones. In some recent models, a nonlinearity such as the logistic sigmoid function is applied to the score to give a corresponding probability prediction.
Table 1 lists several popular scoring functions from the literature. In these models, entities and relations are represented by low-dimensional embedding vectors, except for RESCAL, where the relations are represented by full-rank matrices.
RESCAL
RESCAL [18] is a powerful link prediction model whose scoring function is a bilinear product between the subject and object entity embeddings and a full-rank matrix for each relation. Because of its large number of parameters, RESCAL suffers from overfitting, and explicitly increasing the relation embedding dimension increases its number of parameters quadratically.
DistMult
To mitigate this issue, DistMult [26], a special case of RESCAL, employs a diagonal matrix to represent each relation, so that the number of parameters grows linearly with the embedding size. The resulting scoring function is equivalent to the inner product of three vectors. However, DistMult cannot handle asymmetric relations, because $(s, r, o)$ and $(o, r, s)$ are always assigned the same score.
ComplEx
To model asymmetric relations, ComplEx [24] extends DistMult from the real space to the complex space. Even though each relation matrix of ComplEx is still diagonal, the subject and object embeddings of the same entity are no longer identical but are complex conjugates of each other, which introduces asymmetry into the tensor decomposition and thus enables ComplEx to model asymmetric relations.
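The three scoring functions described above can be sketched in a few lines. The following is a minimal illustration in plain Python, not the authors' implementation; the function names and the toy vectors are assumptions made for this example. It shows that DistMult coincides with RESCAL when the relation matrix is diagonal, that DistMult is symmetric in subject and object, and that ComplEx is not.

```python
def rescal_score(e_s, W_r, e_o):
    """Bilinear score e_s^T W_r e_o with a full relation matrix W_r."""
    return sum(e_s[i] * W_r[i][j] * e_o[j]
               for i in range(len(e_s)) for j in range(len(e_o)))

def distmult_score(e_s, w_r, e_o):
    """Trilinear product <e_s, w_r, e_o>, i.e. RESCAL with a diagonal W_r."""
    return sum(a * b * c for a, b, c in zip(e_s, w_r, e_o))

def complex_score(e_s, w_r, e_o):
    """Re(<e_s, w_r, conj(e_o)>) over complex-valued embeddings."""
    return sum(a * b * c.conjugate() for a, b, c in zip(e_s, w_r, e_o)).real

# Toy real-valued embeddings (illustrative values only).
e_a, e_b, w = [1.0, 2.0], [3.0, -1.0], [0.5, 2.0]
W_diag = [[0.5, 0.0], [0.0, 2.0]]

# Toy complex-valued embeddings (illustrative values only).
c_a, c_b, w_c = [1 + 2j, 0.5 - 1j], [2 - 1j, 1 + 1j], [1j, 1 + 0j]

dm_ab = distmult_score(e_a, w, e_b)
dm_ba = distmult_score(e_b, w, e_a)      # same as dm_ab: DistMult is symmetric
cx_ab = complex_score(c_a, w_c, c_b)
cx_ba = complex_score(c_b, w_c, c_a)     # differs from cx_ab: ComplEx is not
```

Swapping subject and object leaves the DistMult score unchanged, while the ComplEx score changes because the object embedding enters through its complex conjugate.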
Trade-off between parameter growth and model performance
Because of the large number of entities (vertices) in a knowledge graph, the number of parameters and the computational cost are two essential aspects for evaluating a link prediction model. Specifically, since the numbers of entity and relation embedding parameters are $O(n_e d_e)$ and $O(n_r d_r)$, where $n_e$ and $n_r$ are the numbers of entities and relations and $d_e$ and $d_r$ are the entity and relation embedding dimensions respectively, a large embedding size leads to an unmanageable number of parameters. For example, applying DistMult with an embedding size of 400 to the whole of Freebase needs more than 100 GB of memory to store its parameters.
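As a rough worked example of this growth, the sketch below counts only the entity and relation lookup tables of a vector-based model such as DistMult, using the FB15k-237 statistics reported later in the paper (14,541 entities and 237 relations). The helper names and the 4-bytes-per-parameter figure (32-bit floats) are assumptions for illustration; actual implementations may store additional parameters.

```python
def embedding_param_count(n_entities, n_relations, d_entity, d_relation):
    """Parameters held in the entity and relation embedding tables."""
    return n_entities * d_entity + n_relations * d_relation

def embedding_memory_bytes(n_params, bytes_per_param=4):
    """Storage needed for the tables, assuming 32-bit floats by default."""
    return n_params * bytes_per_param

# FB15k-237 sizes as reported in the paper.
N_ENTITIES, N_RELATIONS = 14541, 237

small = embedding_param_count(N_ENTITIES, N_RELATIONS, 100, 100)
large = embedding_param_count(N_ENTITIES, N_RELATIONS, 400, 400)

print(small, embedding_memory_bytes(small))
print(large, embedding_memory_bytes(large))
print(large / small)  # growth is linear in the embedding size
```

The count is dominated by the entity table, which is why the cost scales with the number of entities times the embedding size.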
As a consequence, as Table 1 shows, all these popular scoring functions are simple and contain only basic operations such as matrix multiplications and vector products. They also tend to use relatively low-dimensional entity and relation embeddings (around 200). As a result, these simple, small, and fast models are applicable in real-world scenarios. The drawback is that such low-dimensional embeddings may not have the capacity to model the semantics of the knowledge graph well (high bias), which harms model performance.
In all, the choice of knowledge graph embedding size, though rarely discussed, is an important problem to be addressed. In this work, we propose DeCom, a simple but effective method that decompresses the low-dimensional embedding to a high-dimensional space. Furthermore, DeCom has the following attractive properties:

DeCom does not explicitly increase the embedding size, so the model can still be scaled in a manageable way.

DeCom not only implicitly increases expressiveness but also extracts more robust features from the original embedding, which improves performance and reduces the risk of overfitting.

DeCom learns a more general representation through the decompressing network, which can be easily incorporated into many existing link predictors.
Motivation and Approach Overview
Despite the large body of literature that designs new scoring functions, there has been limited discussion of how large the embedding size should be. Sharma et al. (2018) examine the geometry of knowledge graph embeddings, and their experimental results suggest that for multiplicative methods (DistMult, ComplEx, etc.), increasing the entity and relation embedding size leads to decreasing conicity (a high value of conicity implies that the vectors lie in a narrow cone centered at the origin), which might improve link prediction performance. It has also been proved that several bilinear methods can be fully expressive (i.e., there exists an assignment of values to the embeddings that accurately separates the correct triples from incorrect ones) given a large enough embedding size [11]. However, the upper bound on the embedding size required for full expressiveness depends on both the number of entities and the number of relations in the knowledge graph, which is not feasible even for a small-scale toy knowledge graph.
These observations may encourage the use of larger embedding sizes, but scalability is not the only obstacle. In fact, as shown by our experimental results in Table 3 (rows 6 vs. 13 and 14 vs. 21), increasing the embedding size does not guarantee better performance. The same phenomenon has been observed for word embeddings, and Yin and Shen (2018) explain it under the bias-variance trade-off framework: a larger embedding size leads to decreased bias (a better reconstruction of the factorized co-occurrence matrix) but increased variance (overfitting to the noise in the matrix). The same analysis can be applied to knowledge graph embedding as well. Consider RESCAL, for example: what it essentially does is a tensor decomposition

$$\mathcal{X}_k \approx E \, W_k \, E^\top, \qquad k = 1, \dots, n_r,$$

where $\mathcal{X} \in \{0, 1\}^{n_e \times n_e \times n_r}$ is the tensor that represents the training graph, with $\mathcal{X}_{ijk} = 1$ if there is a relation $k$ from entity $i$ to entity $j$, and $\mathcal{X}_{ijk} = 0$ if there is none. $E \in \mathbb{R}^{n_e \times d}$ is the entity embedding matrix, where $e_i$ is the embedding vector of the $i$-th entity and $d$ is the embedding size. $W \in \mathbb{R}^{d \times d \times n_r}$ is the relation tensor, where $W_k$ is the embedding matrix of the $k$-th relation. This decomposition can be lossless with a large enough embedding size, but if we do obtain such an embedding, the performance on the validation and test sets will be zero, because the training graph tensor is corrupted from the true graph tensor by randomly flipping some of its entries, and the link prediction task evaluates exclusively those corrupted entries. It is popular for recent studies to prove that their model is fully expressive (unbiased with large enough $d$), but this may not be relevant to actual model performance, as the variance plays an important role here. Since DistMult is a special case of RESCAL and ComplEx generalizes DistMult to the complex space, this analysis applies to them, and to other multiplicative models, as well.
Motivated by the scalability issue and the bias-variance trade-off, we propose to use a shallow neural network to decompress low-dimensional embedding vectors to a higher-dimensional space before applying the scoring functions. The intuition is that the low-dimensional embeddings store compressed information about entities and relations, and the decompressing network projects this compressed representation into a higher-dimensional space that is easier for simple scoring functions to handle, thus achieving the low bias of high-dimensional embeddings with far fewer parameters. At the same time, the decompressing network must learn general information about the knowledge graph, which makes it more robust to noise and gives it lower variance. A smaller total number of parameters also suggests that the model is less prone to overfitting.
Detailed Approach
Table 2: Link prediction results on FB15k-237.
#  Model  Embedding sizes  MRR  H@1  H@3  H@10
1  RESCAL  100      0.255  0.185  0.278  0.397  
2  + FDeCom  100  400  200  0.353  0.260  0.388  0.535  
3  + FDeComEn  100  200    0.349  0.260  0.381  0.526  
4  + FDeComRel  200    200  0.354  0.261  0.388  0.536  
5  + vanilla expansion  400      0.317  0.233  0.344  0.483  
6  DistMult  100    100    0.258  0.173  0.283  0.417 
7  + CDeCom  100  400  100  400  0.291  0.210  0.318  0.454 
8  + FDeCom  100  400  100  400  0.299  0.213  0.329  0.470 
9  + FDeComEn  100  400  400    0.296  0.214  0.325  0.460 
10  + FDeComRel  200    100  200  0.281  0.201  0.307  0.442 
11  + CDeComEn  100  400  400    0.278  0.198  0.306  0.442 
12  + CDeComRel  400    100  400  0.273  0.192  0.299  0.436 
13  + vanilla expansion  400    400    0.269  0.186  0.291  0.428 
14  ComplEx  100    100    0.257  0.182  0.270  0.426 
15  + CDeCom  100  400  100  400  0.284  0.200  0.313  0.453 
16  + FDeCom  100  200  100  200  0.303  0.218  0.334  0.473 
17  + FDeComEn  100  400  100  400  0.280  0.205  0.325  0.461 
18  + FDeComRel  100  400  100  400  0.280  0.205  0.325  0.461 
19  + CDeComEn  100  400  100  400  0.285  0.203  0.312  0.450 
20  + CDeComRel  100  400  100  400  0.283  0.199  0.323  0.453 
21  + vanilla expansion  400    400    0.267  0.188  0.292  0.426 
Table 3: Link prediction results on WN18RR.
#  Model  Embedding sizes  MRR  H@1  H@3  H@10
1  RESCAL  100      0.441  0.417  0.452  0.487  
2  + FDeCom  100  200  100  0.457  0.427  0.469  0.515  
3  + FDeComEn  100  400    0.451  0.424  0.464  0.500  
4  + FDeComRel  200    100  0.453  0.427  0.464  0.503  
5  + vanilla expansion  400      0.436  0.415  0.444  0.475  
6  DistMult  100    100    0.427  0.381  0.436  0.487 
7  + CDeCom  100  400  100  400  0.445  0.413  0.458  0.510 
8  + FDeCom  100  200  100  200  0.450  0.418  0.461  0.515 
9  + FDeComEn  100  200  200    0.440  0.401  0.452  0.507 
10  + FDeComRel  200    100  200  0.442  0.396  0.449  0.508 
11  + CDeComEn  100  400  400    0.431  0.392  0.447  0.502 
12  + CDeComRel  400    100  400  0.440  0.411  0.450  0.505 
13  + vanilla expansion  400    400    0.422  0.380  0.437  0.482 
14  ComplEx  100    100    0.445  0.415  0.457  0.502 
15  + CDeCom  100  400  100  400  0.438  0.419  0.476  0.521 
16  + FDeCom  100  400  100  400  0.452  0.410  0.461  0.509 
17  + FDeComEn  100  200  200    0.442  0.406  0.461  0.504 
18  + FDeComRel  200    100  200  0.448  0.411  0.463  0.507 
19  + CDeComEn  100  400  400    0.441  0.410  0.460  0.507 
20  + CDeComRel  400    100  400  0.448  0.420  0.455  0.511 
21  + vanilla expansion  400    400    0.440  0.411  0.452  0.510 
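Table 4: Comparison of the best DeCom configurations with other link prediction models on FB15k-237 and WN18RR.
Model  FB15k-237 (MRR  H@1  H@3  H@10)  WN18RR (MRR  H@1  H@3  H@10)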
RESCAL  0.255  0.185  0.278  0.397  0.441  0.417  0.452  0.487 
+ DeCom  0.353  0.260  0.388  0.535  0.457  0.427  0.469  0.515 
DistMult[]  0.241  0.155  0.263  0.419  0.430  0.390  0.440  0.490 
DistMult (ours)  0.258  0.173  0.283  0.417  0.427  0.381  0.436  0.487 
+ DeCom  0.299  0.213  0.329  0.470  0.450  0.418  0.461  0.515 
ComplEx[]  0.247  0.158  0.275  0.428  0.440  0.410  0.460  0.510 
ComplEx (ours)  0.257  0.182  0.270  0.426  0.445  0.415  0.457  0.502 
+ DeCom  0.303  0.218  0.334  0.473  0.452  0.410  0.461  0.509 
RotatE[]  0.332  0.235  0.368  0.524  0.475  0.433  0.494  0.556 
 adv sample  0.297  0.205  0.328  0.480         
ConvE[]  0.325  0.237  0.356  0.501  0.430  0.400  0.440  0.520 
DeCom
We denote the decompressing functions as $f_s$, $f_r$, $f_o$: $\mathbb{R}^{d} \to \mathbb{R}^{d'}$ for the subject entity $s$, relation $r$, and object entity $o$ respectively, where $d$ and $d'$ are the original and projected embedding sizes respectively. For any scoring function $\psi(e_s, w_r, e_o)$, we can simply incorporate our decompressing functions and change it into $\psi(f_s(e_s), f_r(w_r), f_o(e_o))$. For example, the scoring function of DistMult is $\langle e_s, w_r, e_o \rangle$, and after inserting the decompressing layer it becomes $\langle f_s(e_s), f_r(w_r), f_o(e_o) \rangle$. Table 1 shows more examples of scoring functions with and without decompressing operations. Figure 1 shows the architecture of the DeCom-based knowledge graph embedding model.
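The composition of a scoring function with decompressing functions can be sketched as a simple wrapper. This is an illustrative sketch in plain Python under stated assumptions: the names `with_decom` and `linear_map`, the use of a bias-free-looking toy linear map, and the weight values are all hypothetical and chosen only so the arithmetic is easy to follow.

```python
def distmult_score(e_s, w_r, e_o):
    """DistMult trilinear product <e_s, w_r, e_o>."""
    return sum(a * b * c for a, b, c in zip(e_s, w_r, e_o))

def linear_map(W, b):
    """Return x -> W x + b, where W is a list of d_out rows of length d_in."""
    def f(x):
        return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
                for row, b_i in zip(W, b)]
    return f

def with_decom(score_fn, f_s, f_r, f_o):
    """Wrap a scoring function so it is applied to decompressed embeddings."""
    def decom_score(e_s, w_r, e_o):
        return score_fn(f_s(e_s), f_r(w_r), f_o(e_o))
    return decom_score

# Toy decompression from d = 2 to d' = 3 (illustrative weights only).
# The three functions may differ; one map is reused here for brevity.
f = linear_map([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0])
decom_distmult = with_decom(distmult_score, f, f, f)

score = decom_distmult([1.0, 2.0], [0.5, 2.0], [3.0, -1.0])
print(score)
```

The stored embeddings stay two-dimensional in this toy case; only the scoring function sees the three-dimensional decompressed vectors.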
Decompressing Functions
In principle, DeCom could be implemented by any kind of architecture, such as fully connected, convolutional, or recurrent neural networks. In this work, we mainly explore decompressing functions based on convolutional neural networks (CNNs) and fully connected neural networks (FCNNs). Because embeddings have no sequential nature, recurrent neural networks are not explored in this work. Furthermore, the decompressing functions for the subject entity, relation, and object entity are independent and do not need to be the same.
CNN-based DeCom (CDeCom)
Because of their high parameter efficiency and fast computation, CNNs are suitable for representing decompressing functions. The CNN-based decompressing function works as follows: for a batch of triples, we first look up their embedding vectors in the entity and relation embedding tables. We then feed them into one layer of 1D CNN followed by batch normalization and dropout, and use the final output to train a knowledge graph embedding model. Here, batch normalization [10] and dropout [20] are employed to speed up training and prevent overfitting. In general, this method can be easily incorporated into any non-parametric scoring function.
FCNN-based decompressing function (FDeCom)
Similar to CDeCom, FDeCom employs a linear layer to decompress the input features into a higher-dimensional feature space.
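To make the convolutional variant concrete, here is a minimal sketch of a one-layer 1D convolution used as a decompressing function. The paper does not spell out the padding scheme or how the output channels are combined, so zero "same" padding and concatenation of channels are assumptions of this sketch, as are the function names; batch normalization and dropout are omitted for brevity.

```python
def conv1d_same(x, kernel, bias=0.0):
    """1D cross-correlation with zero padding; output has len(x) entries."""
    k = len(kernel)
    left = (k - 1) // 2
    padded = [0.0] * left + list(x) + [0.0] * (k - 1 - left)
    return [sum(kernel[j] * padded[i + j] for j in range(k)) + bias
            for i in range(len(x))]

def cdecom(x, kernels, biases):
    """Apply every kernel and concatenate the channels: d -> len(kernels) * d."""
    out = []
    for kernel, bias in zip(kernels, biases):
        out.extend(conv1d_same(x, kernel, bias))
    return out

def cdecom_param_count(kernels, biases):
    """Number of trainable values in the convolution itself."""
    return sum(len(kernel) for kernel in kernels) + len(biases)

x = [1.0, 2.0, 3.0, 4.0]                      # toy 4-dimensional embedding
kernels = [[0.0, 1.0, 0.0], [1.0, 1.0, 1.0]]  # two kernels of size 3
biases = [0.0, 0.0]

decompressed = cdecom(x, kernels, biases)
print(decompressed, cdecom_param_count(kernels, biases))
```

Under these assumptions the number of convolution parameters depends only on the number and size of the kernels, not on the number of entities or on the embedding dimension, which is the sense in which a convolutional decompressing function is parameter-efficient.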
Experiments
We experiment with three bilinear knowledge graph embedding models, RESCAL [18], DistMult [26], and ComplEx [24], on two benchmark datasets, FB15k-237 [23] and WN18RR [6], to show that our proposed method consistently boosts the performance of knowledge graph embedding methods.
Experimental Settings
Benchmark Datasets.
FB15k [2] and WN18 [2] are widely used for evaluating knowledge graph embedding methods. Toutanova and Chen (2015) [23] show that FB15k contains a large number of inverse relations and that most test triples can be inferred from their reverse relations in the training set, so they delete the reverse relations from FB15k and propose FB15k-237, which contains 14,541 entities and 237 kinds of relations. Similarly, Dettmers et al. (2018) [6] remove the reverse relations from WN18 and propose WN18RR, which contains 40,934 entities and 11 types of relations. Therefore, in this paper, we evaluate our methods on FB15k-237 and WN18RR.
Evaluation Protocol.
We follow the standard evaluation protocol of this task. For each test triple $(s, r, o)$, we corrupt the subject or the object with every other entity in the knowledge graph, obtaining $(s', r, o)$ or $(s, r, o')$. We then rank the triples by score and check how highly the ground truth is ranked. Corrupted triples that differ from the ground truth but are also correct are filtered out. Mean Reciprocal Rank (MRR) and Hit@N (H@N), where $N \in \{1, 3, 10\}$, are the standard evaluation measures for these datasets and are reported in our experiments.
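The filtered ranking and the two metrics can be sketched as follows. This is an illustrative sketch rather than the evaluation code used in the paper: the function names are hypothetical, higher scores are assumed to be better, and ties are resolved in favour of the ground truth, which is one of several conventions.

```python
def filtered_rank(scores, target, known_true):
    """Rank of `target` among candidates, skipping other known-true candidates.

    scores:     dict mapping candidate entity -> score (higher is better)
    target:     the ground-truth entity of the test triple
    known_true: other entities that also form a correct triple (filtered out)
    """
    target_score = scores[target]
    rank = 1
    for candidate, score in scores.items():
        if candidate == target or candidate in known_true:
            continue
        if score > target_score:
            rank += 1
    return rank

def mean_reciprocal_rank(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at(ranks, n):
    return sum(1 for r in ranks if r <= n) / len(ranks)

# Toy candidate scores for one corrupted position (illustrative values only).
scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
rank_c = filtered_rank(scores, target="c", known_true={"a"})

ranks = [1, rank_c, 4]
print(rank_c, mean_reciprocal_rank(ranks), hits_at(ranks, 3))
```

In the toy case, entity "a" outscores the ground truth "c" but is itself a correct answer, so filtering moves "c" from raw rank 3 to filtered rank 2.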
Different DeCom strategies
To further explore DeCom, we examine several decompressing strategies:

Different decompressing functions: two decompressing functions, FDeCom and CDeCom, are explored in our experiments.

Decompressing objects: because DeCom is highly flexible, decompressing functions can be applied to 1) entities only, 2) relations only, or 3) both entities and relations.
Hyperparameter Settings.
To make our results comparable, for each baseline link prediction model we keep most of the hyperparameters and training strategies the same between the original model and the DeCom-enhanced model. All models are trained for 500 epochs with an embedding size of 100, and the other hyperparameters are chosen by grid search based on performance on the validation set.
For DistMult [26] and ComplEx [24], following Dettmers et al. (2018) [6], the 1-1 training strategy is employed and Adagrad [7] is used as the optimizer. In addition, we regularize these two models by forcing their entity embeddings to have an L2 norm of 1 after each parameter update, and we employ the pairwise margin-based ranking loss (margin = 1.0) [2]. Furthermore, we find that regularizing the entity embeddings after the decompressing layer to have an L2 norm of 1 effectively prevents overfitting and stabilizes training. The learning rate of Adagrad is selected from {0.08, 0.10, 0.12}.
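The two ingredients named here, unit-norm projection and a pairwise margin-based ranking loss, can be sketched as below. The sketch assumes that a higher score means a more plausible triple, so the loss pushes the positive score above the negative score by the margin; the function names and the `eps` guard against division by zero are assumptions of this example.

```python
import math

def l2_normalize(vector, eps=1e-12):
    """Rescale a vector to unit L2 norm (eps avoids division by zero)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]

def margin_ranking_loss(pos_score, neg_score, margin=1.0):
    """Hinge loss that is zero once pos_score exceeds neg_score by the margin."""
    return max(0.0, margin - pos_score + neg_score)

unit = l2_normalize([3.0, 4.0])
print(unit, math.sqrt(sum(x * x for x in unit)))

print(margin_ranking_loss(2.0, 0.5))  # positive already ahead by more than 1
print(margin_ranking_loss(0.2, 0.5))  # negative scores higher: penalised
```

In a training loop, the projection would be applied to the entity embeddings after each parameter update and, as described above, to the output of the decompressing layer.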
For RESCAL [18], we apply the 1-N training strategy [6], employ Adam [12] as the optimizer, and use binary cross entropy as the loss function. The learning rate of Adam is selected from {0.01, 0.005, 0.001, 0.0005}. Because RESCAL's relations are represented as full-rank matrices, and it is not intuitive to decompress a low-dimensional vector into a matrix by convolution, we only experiment with fully connected networks for RESCAL.
For each model's DeCom-enhanced counterpart, the training strategies, such as the optimizer, 1-1 or 1-N training, and the hyperparameter grid search ranges, remain the same so that the results are comparable. In addition, the hyperparameters of the decompressing function are selected via grid search according to performance on the validation set. The grid search ranges for the DeCom layer are as follows: for CDeCom, the number of kernels is in {2, 3, 4} and the kernel size is in {3, 4}; for FDeCom, the dimension of the decompressed features is in {200, 400}; for RESCAL relations only, the pre-decompress dimension is in {100, 200, 400, 1000, 2000}.
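The search space described above can be enumerated mechanically. The sketch below uses the value ranges stated in the text for the Adagrad-trained models; the dictionary keys, the `expand` and `select_best` helpers, and the idea of passing in a validation-scoring callable are assumptions for illustration and do not reproduce the authors' tooling.

```python
from itertools import product

ADAGRAD_LEARNING_RATES = [0.08, 0.10, 0.12]
CDECOM_GRID = {"num_kernels": [2, 3, 4], "kernel_size": [3, 4]}
FDECOM_GRID = {"decompressed_dim": [200, 400]}

def expand(grid):
    """Turn a dict of value lists into a list of concrete configurations."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[key] for key in keys))]

def select_best(configs, validation_score):
    """Pick the configuration with the highest validation score.

    `validation_score` is a caller-supplied function (hypothetical here)
    that trains a model with the given configuration and returns, for
    example, its validation MRR.
    """
    return max(configs, key=validation_score)

cdecom_configs = expand({"lr": ADAGRAD_LEARNING_RATES, **CDECOM_GRID})
fdecom_configs = expand({"lr": ADAGRAD_LEARNING_RATES, **FDECOM_GRID})
print(len(cdecom_configs), len(fdecom_configs))
```

With these ranges the CDeCom search covers 18 configurations and the FDeCom search covers 6 for each Adagrad-trained baseline.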
Main Results
Link prediction results of the three baseline models and their corresponding DeCom-based models on the two datasets are shown in Tables 2, 3, and 4.
DeCom vs. no DeCom: DeCom-based knowledge graph embedding models significantly outperform their corresponding baseline models, which demonstrates the expressive power of DeCom. The DeCom models also outperform the baseline models with explicitly increased embedding size, indicating that they are more robust to overfitting.
CDeCom vs. FDeCom: FDeCom generally obtains better scores but is more prone to overfitting: in row 16 of Table 2 and rows 2 and 8 of Table 3, the best FDeCom feature size is 200 instead of 400. One reason that FDeCom achieves higher scores is that it can extract features from all embedding dimensions, whereas CDeCom can only extract features within the range of its kernels.
DeComEn vs. DeComRel: According to the related rows in Tables 2 and 3, decompressing only the relation features obtains slightly better results. We attribute this to the fact that modeling the relation between entities is more complicated and therefore needs more expressive and robust features from DeCom.
DeCom vs. others: In Table 4 we collect the scores of the best configurations from Tables 2 and 3 and compare them with other recent work. In particular, DeCom-based RESCAL achieves state-of-the-art performance on the FB15k-237 dataset across all metrics.
We further note that DeCom helps the original model achieve a larger improvement on the dataset with a larger number of relations. Specifically, link prediction models with DeCom achieve +16% and +5% average improvement on FB15k-237 and WN18RR respectively. We attribute this to the fact that WN18RR is simpler in structure, so the original embedding can already extract meaningful features from its small number of relations. Explicitly increasing the embedding size also makes baseline performance worse on WN18RR, which suggests that 100 dimensions may be enough. Therefore, models trained on FB15k-237 benefit more from DeCom.
Table 5: Performance, parameter size, and prediction speed of models with and without DeCom.
Model  Embed. Size  DeCom Size  MRR  H@10  Param. Size  Speed (triples/sec)
RESCAL  100  -  0.258  0.402  6214300  1378.03
RESCAL  400  -  0.324  0.487  81977200  1317.60
RESCAL + FDeCom  100  400  0.356  0.537  33751300  1229.68
DistMult  100  -  0.273  0.421  1501900  1429.06
DistMult  400  -  0.284  0.435  6007600  1400.31
DistMult + CDeCom  100  400  0.314  0.484  1501942  1400.73
DistMult + FDeCom  100  400  0.302  0.468  1583700  1386.90
ComplEx  100  -  0.280  0.432  3003800  1412.71
ComplEx  400  -  0.270  0.428  12015200  1310.36
ComplEx + CDeCom  100  400  0.288  0.449  3003882  1275.31
ComplEx + FDeCom  100  400  0.308  0.472  3949120  1221.56
Analysis and Discussion
Parameter and Running Time Efficiency
The decompressing layer maps the original embedding to a more expressive and robust feature space. One natural question is: what if we explicitly increase the embedding size instead? To answer it, we increase the embedding size from 100 to 400 to match the feature size after the decompressing layer and compare the two approaches from different perspectives. The results are shown in Table 5. Models with a decompressing layer not only achieve better performance but are also much more parameter-efficient, at a small cost in prediction speed. Notably, ComplEx with the larger embedding size performs worse than the original; we attribute this to overfitting caused by too many parameters. Compared with CDeCom, FDeCom obtains better scores for ComplEx and RESCAL, with somewhat more parameters and slower decoding speed.
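The parameter figures in Table 5 can be compared directly. The short sketch below only performs arithmetic on the parameter sizes reported in that table; the dictionary keys and helper name are labels chosen for this example.

```python
# Parameter sizes copied from Table 5.
TABLE5_PARAMS = {
    "DistMult-100": 1501900,
    "DistMult-400": 6007600,
    "DistMult+CDeCom": 1501942,
    "DistMult+FDeCom": 1583700,
    "ComplEx-100": 3003800,
    "ComplEx-400": 12015200,
    "ComplEx+CDeCom": 3003882,
    "ComplEx+FDeCom": 3949120,
}

def extra_params(decom_model, base_model):
    """Parameters added on top of the 100-dimensional baseline."""
    return TABLE5_PARAMS[decom_model] - TABLE5_PARAMS[base_model]

print(extra_params("DistMult+CDeCom", "DistMult-100"))
print(extra_params("DistMult+FDeCom", "DistMult-100"))
print(extra_params("ComplEx+CDeCom", "ComplEx-100"))

# How much larger the explicitly expanded model is than the CDeCom model.
ratio = TABLE5_PARAMS["DistMult-400"] / TABLE5_PARAMS["DistMult+CDeCom"]
print(round(ratio, 2))
```

For DistMult, the convolutional decompressing layer adds 42 parameters, whereas expanding the embedding to 400 dimensions roughly quadruples the model size.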
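Table 6: Results when the decompressed feature size equals the original embedding size.
Model  Embedding sizes  MRR  H@10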
DistMult  100    100    0.273  0.421 
+CDeCom  100  100  100  100  0.291  0.452 
+FDeCom  100  100  100  100  0.287  0.448 
ComplEx  100    100    0.280  0.432 
+CDeCom  100  100  100  100  0.283  0.441 
+FDeCom  100  100  100  100  0.292  0.458 
Why is DeCom effective?
We believe there are two main reasons for the effectiveness of DeCom:
1) The decompressing functions implicitly increase the feature dimension, which improves the model's expressiveness.
2) The decompressing functions learn more robust features. To verify this, we design decompressing functions whose output features have the same size as the input features (the original embedding). Specifically, we set the output feature size of FDeCom and CDeCom to the input embedding dimension, i.e., 100. The results are shown in Table 6. Although the embedding size does not increase, the DeCom models still improve performance, suggesting that they learn more robust embeddings.
Related Work
Knowledge graph embedding (KGE) methods for predicting missing links in knowledge graphs have been studied extensively in recent years. For example, RESCAL [18] employs a bilinear product between the vector embeddings of the subject and object entities and a full-rank matrix for each relation. TransE [2] implicitly models relations by representing each relation as a bijection between source and target entities. DistMult [26], a special case of RESCAL, uses a diagonal matrix for each relation so that the number of parameters grows linearly. ComplEx [24] extends DistMult to model asymmetric relations by introducing complex embeddings. RotatE [22] models each relation as a rotation from the subject entity to the object entity in the complex vector space. Most prior methods are based on simple operations and shallow neural networks, which make them fast, scalable, and memory-efficient; however, these properties also restrict the expressiveness of the learned features. To mitigate this problem, Dettmers et al. (2018) [6] (ConvE) employ 2D convolution operations on the subject entity and relation embedding vectors after they are reshaped into matrices and concatenated. However, the reshaping and concatenation operations, and the application of 2D convolution to these embeddings, are not intuitive.
Conclusion
In this work, to increase the expressiveness and robustness of shallow link predictors, we propose DeCom, a flexible decompressing mechanism that maps low-dimensional embeddings to a more expressive and robust space while adding only a few extra parameters. DeCom can be easily incorporated into many existing knowledge graph embedding models, and experimental results show that it boosts the performance of many popular link predictors on several knowledge graphs and obtains state-of-the-art results on FB15k-237 across all evaluation metrics.
References
[1] (2008) Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 1247–1250.
[2] (2013) Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pp. 2787–2795.
[3] (2011) Learning structured embeddings of knowledge bases. In Twenty-Fifth AAAI Conference on Artificial Intelligence.
[4] (2017) KBGAN: adversarial learning for knowledge graph embeddings. arXiv preprint arXiv:1711.04071.
[5] (2016) Chains of reasoning over entities, relations, and text using recurrent neural networks. arXiv preprint arXiv:1607.01426.
[6] (2018) Convolutional 2D knowledge graph embeddings. In Thirty-Second AAAI Conference on Artificial Intelligence.
[7] (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (Jul), pp. 2121–2159.
[8] (2018) TorusE: knowledge graph embedding on a Lie group. In Thirty-Second AAAI Conference on Artificial Intelligence.
[9] (2016) Knowledge graph embedding by flexible translation. In Fifteenth International Conference on the Principles of Knowledge Representation and Reasoning.
[10] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
[11] (2018) SimplE embedding for link prediction in knowledge graphs. In Advances in Neural Information Processing Systems, pp. 4284–4295.
[12] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[13] (2015) Type-constrained representation learning in knowledge graphs. In International Semantic Web Conference, pp. 640–655.
[14] (2018) Canonical tensor decomposition for knowledge base completion. arXiv preprint arXiv:1806.07297.
[15] (1995) WordNet: a lexical database for English. Communications of the ACM 38 (11), pp. 39–41.
[16] (2016) STransE: a novel embedding model of entities and relationships in knowledge bases. arXiv preprint arXiv:1606.08140.
[17] (2015) A review of relational machine learning for knowledge graphs. Proceedings of the IEEE 104 (1), pp. 11–33.
[18] (2011) A three-way model for collective learning on multi-relational data. In ICML, Vol. 11, pp. 809–816.
[19] (2018) Modeling relational data with graph convolutional networks. In European Semantic Web Conference, pp. 593–607.
[20] (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
[21] (2007) Yago: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, pp. 697–706.
[22] (2019) RotatE: knowledge graph embedding by relational rotation in complex space. arXiv preprint arXiv:1902.10197.
[23] (2015) Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pp. 57–66.
[24] (2016) Complex embeddings for simple link prediction. In International Conference on Machine Learning, pp. 2071–2080.
[25] (2017) An interpretable knowledge transfer model for knowledge base completion. arXiv preprint arXiv:1704.05908.
[26] (2014) Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575.