1 Introduction
Knowledge Graphs (KGs) such as NELL [8] and Freebase [2] are repositories of information stored as multi-relational graphs. They are used in many applications such as information extraction, question answering, etc. Such KGs contain world knowledge in the form of relational triples (h, r, t), where entity h is connected to entity t using directed relation r. For example, (DonaldTrump, PresidentOf, USA) would indicate the fact that Donald Trump is the president of the USA. Although current KGs are fairly large, containing millions of facts, they tend to be quite sparse [18]. To overcome this sparsity, Knowledge Base Completion (KBC) or Link Prediction is performed to infer missing facts from existing ones. Low-dimensional vector representations of entities and relations, also called embeddings, have been extensively used for this problem [16; 10]. Most of these methods use a characteristic score function which distinguishes correct triples (high score) from incorrect triples (low score). Some of these methods and their scoring functions are summarized in Table 1.
WN18 and FB15K are two standard benchmark datasets for evaluating link prediction over KGs. Previous research has shown that these two datasets suffer from inverse relation bias [4] (more in Section 4.2.1). Performance on these datasets is largely dependent on the model's ability to predict inverse relations, at the expense of other independent relations. In fact, a simple rule-based model exploiting such bias was shown to achieve state-of-the-art performance on these two datasets [4]. To overcome this shortcoming, several variants, such as FB15K-237 [15], WN18RR [4], FB13 and WN11 [13], have been proposed in the literature.
ComplEx [16] and HolE [10] are two popular KG embedding techniques which achieve state-of-the-art performance on the WN18 and FB15K datasets. However, we observe that such methods do not perform uniformly well across all the other datasets mentioned above. This suggests that using a predefined scoring function, as in ComplEx and HolE, might not be the best option for achieving competitive results on all datasets. Ideally, we would prefer a model that achieves near state-of-the-art performance on any given dataset.
In this paper, we demonstrate that a simple neural-network-based score function can adapt to different datasets and achieve near state-of-the-art performance on multiple datasets. The main contributions of this paper can be summarized as follows.

We quantitatively demonstrate the severity of the inverse relation bias in the standard benchmark datasets.

We empirically show that current state-of-the-art methods do not perform consistently well across different datasets.

We demonstrate that ER-MLP [5], a simple neural-network-based scoring function, has the ability to adapt to different datasets, achieving near state-of-the-art performance consistently. We also consider ER-MLP-2d, a variant of ER-MLP.
Code is available at https://github.com/SrinivasR/AKBC2017Paper14.git.
2 Related Work
Several methods have been proposed for learning KG embeddings. They differ in the way entities and relations are modeled, the score function used for scoring triples, and the loss function used for training. For example, TransE [3] uses real vectors for representing both entities and relations, while RESCAL [11] uses real vectors for entities and real matrices for relations.
Translational Models: One of the initial models for KG embeddings is TransE [3], which models relation r as a translation vector from head entity h to tail entity t for a given triple (h, r, t). A pairwise ranking loss is then used for learning these embeddings. Following the basic idea of translation vectors in TransE, many methods have been proposed to improve performance, among them TransH [17], TransR [7], TransA [19], and TransG [20].
Multiplicative Models: HolE [10] and ComplEx [16] are recent methods which achieve state-of-the-art performance in link prediction on the commonly used datasets FB15K and WN18. HolE models entities and relations as real vectors and can handle asymmetric relations. ComplEx uses complex vectors and can handle symmetric, asymmetric as well as antisymmetric relations. We use these methods as representatives of the state-of-the-art in our experiments.
Neural Models: Several methods that use neural networks for scoring triples have been proposed. Notable among them are NTN [13], CONV [15], ConvE [4], and R-GCN [12]. CONV uses the internal structure of textual relations as input to a Convolutional Neural Network. NTN learns a tensor network [13] for each relation in the knowledge graph. ConvE uses convolutional neural networks (CNNs) over reshaped input vectors for scoring triples. R-GCN takes a different approach and uses Graph Convolutional Networks to obtain embeddings from the graph; DistMult (or some other linear model) is then applied to these embeddings to obtain a score. We focus on simple neural models, such as ER-MLP [5], and find that such simple models are more effective for KG embedding and link prediction than more complicated models such as ConvE or R-GCN.
3 Knowledge Graph Embedding using Simple Neural Networks
Rather than using a predefined function to score triples and then learning embeddings to fit this scoring function, we use a neural network to jointly learn the scoring function and the embeddings to fit the dataset.
3.1 Neural Network as a Score Function
We use a simple feedforward neural network with a single hidden layer as the approximator of the scoring function of a given triple. In particular, we use ER-MLP [5], a previously proposed neural network model for KG embedding, and ER-MLP-2d, a variant of ER-MLP that we propose. The architectures of the two models are shown in Figure 1.
Let e_h and e_t be the d-dimensional embeddings of the entities h and t, respectively.¹ Similarly, e_r is the embedding of relation r, whose dimension is d in ER-MLP and 2d in ER-MLP-2d, as we shall explain below. In ER-MLP, the head, relation and tail embeddings are concatenated and fed as input to the network, so its input layer is of size 3d. In ER-MLP-2d, the concatenated head and tail embeddings are translated using the relation embedding of size 2d, so the input size in ER-MLP-2d is 2d. Both models have a single fully connected hidden layer, leading to a single output node, which is taken as the score of the given triple (h, r, t).
¹ We use boldface to refer to embeddings of the corresponding italicized objects.
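The two architectures described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, toy dimensions, and random parameters are our own assumptions; only the input sizes (3d for ER-MLP, 2d for ER-MLP-2d) follow the text.

```python
import numpy as np

def relu(x):
    # Hidden-layer activation (the paper uses ReLU; see Section 4.1).
    return np.maximum(0.0, x)

def score_er_mlp(e_h, e_r, e_t, W, w, b):
    # ER-MLP: concatenate head, relation and tail embeddings (input size 3d),
    # pass through one hidden layer, and emit a scalar score.
    x = np.concatenate([e_h, e_r, e_t])
    return float(w @ relu(W @ x) + b)

def score_er_mlp_2d(e_h, e_r, e_t, W, w, b):
    # ER-MLP-2d: translate the concatenated head/tail pair (size 2d)
    # by a 2d-dimensional relation embedding before the hidden layer.
    x = np.concatenate([e_h, e_t]) + e_r
    return float(w @ relu(W @ x) + b)

# Toy example: d = 4, hidden layer of 8 units (hypothetical values).
rng = np.random.default_rng(0)
d, hidden = 4, 8
e_h, e_t = rng.normal(size=d), rng.normal(size=d)
e_r, e_r2d = rng.normal(size=d), rng.normal(size=2 * d)

s1 = score_er_mlp(e_h, e_r, e_t,
                  rng.normal(size=(hidden, 3 * d)), rng.normal(size=hidden), 0.0)
s2 = score_er_mlp_2d(e_h, e_r2d, e_t,
                     rng.normal(size=(hidden, 2 * d)), rng.normal(size=hidden), 0.0)
```

Note how ER-MLP-2d needs a smaller hidden weight matrix (hidden × 2d rather than hidden × 3d), which is the source of its parameter savings.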
Let g denote the activation function and [a; b] denote the concatenation of vectors a and b. Let W and w be respectively the hidden and output layer weight matrices in ER-MLP, and let b be a single bias value. Let the equivalent parameters in ER-MLP-2d be W', w', and b'. The triple scoring functions for ER-MLP and ER-MLP-2d respectively are given below.
f(h, r, t) = wᵀ g(W [e_h; e_r; e_t]) + b
f'(h, r, t) = w'ᵀ g(W' ([e_h; e_t] + e_r)) + b'
We consider the sigmoid function, σ(f(h, r, t)) = 1 / (1 + e^(−f(h, r, t))), to be the probability of correctness of a triple. We train the model to assign a probability of 1 to correct triples and 0 to incorrect triples. Let T be the set of positive and (sampled) negative triples x, with label y ∈ {0, 1}. We optimize the cross-entropy loss given below, with f replaced by f and f' for ER-MLP and ER-MLP-2d, respectively.
L = − Σ_{(x, y) ∈ T} [ y log σ(f(x)) + (1 − y) log(1 − σ(f(x))) ]
4 Experimental Setup
4.1 Implementation Details
Embeddings were initialized using uniform initialization; weights were initialized using Xavier initialization.
In the neural-network-based models, we used Dropout [14] with p = 0.5 on the hidden layer to prevent overfitting. The regularization parameter for weight decay was chosen from {0.001, 0.01, 0.1} based on cross-validation.
We chose the hidden layer size from {10d, 20d}. Since the size of this layer determines the expressive power of the model, datasets with simple relations (lower relation-specific in-degree) require a smaller number of hidden units, while more difficult datasets require a larger number to achieve optimal performance.
ReLU [9] activation is used in the hidden layer to achieve fast convergence. To minimize the objective function, we used ADAM [6] with learning rate 0.001. The dimensionalities of entity and relation embeddings were set equal in ER-MLP, i.e., the relation embedding has dimension d. For ER-MLP-2d, the relation embedding has dimension 2d.
The embedding dimension d was cross-validated on {100, 200} for all datasets. All experiments were run using TensorFlow on a single GTX 1080 GPU. To achieve maximum GPU utilization, we set the batch size larger than used previously in the literature, choosing from {10000, 20000, 50000} using cross-validation. For sampling negative triples, we used the Bernoulli method described in [17].
[Table 2: Number of parameters of HolE, ComplEx, ConvE, ER-MLP, and ER-MLP-2d on WN18 and FB15K.]
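The Bernoulli negative sampling strategy mentioned above can be sketched as follows. This is a minimal illustration with our own function names: for each relation it estimates the average number of tails per head (tph) and heads per tail (hpt), then corrupts the head of a triple with probability tph/(tph + hpt) and the tail otherwise, which reduces the chance of generating false negatives.

```python
import random
from collections import defaultdict

def bernoulli_params(triples):
    # tph: average number of tails per head, per relation.
    # hpt: average number of heads per tail, per relation.
    tails_of = defaultdict(set)   # (r, h) -> set of tails
    heads_of = defaultdict(set)   # (r, t) -> set of heads
    for h, r, t in triples:
        tails_of[(r, h)].add(t)
        heads_of[(r, t)].add(h)
    tph, hpt = {}, {}
    for r in {r for _, r, _ in triples}:
        t_sizes = [len(s) for (rr, _), s in tails_of.items() if rr == r]
        h_sizes = [len(s) for (rr, _), s in heads_of.items() if rr == r]
        tph[r] = sum(t_sizes) / len(t_sizes)
        hpt[r] = sum(h_sizes) / len(h_sizes)
    return tph, hpt

def corrupt(triple, entities, tph, hpt, rng=random):
    # Corrupt the head with probability tph/(tph + hpt), else the tail.
    h, r, t = triple
    if rng.random() < tph[r] / (tph[r] + hpt[r]):
        return (rng.choice(entities), r, t)
    return (h, r, rng.choice(entities))
```

For a one-to-many relation (tph high), heads are corrupted more often, since a randomly replaced tail would frequently yield a triple that is actually true.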
4.2 Datasets
We ran experiments on the datasets listed in Table 3. Previous work has noted that the two benchmark datasets – FB15K and WN18 – have a high number of redundant and reversible relations [15]. A simple rule-based model exploiting such deficiencies was shown to achieve state-of-the-art performance on these datasets [4]. This suggests that evaluation restricted to these two datasets alone may not be an accurate indication of a model's capability. In order to address this issue, we evaluate model performance over six datasets, as summarized in Table 3.
4.2.1 Inverse Relation Bias
                                     WN18     FB15K    WN18RR   FB15K-237   WN11   FB13
Number of Relations                  18       1345     11       237         11     13
Percentage of Trivial Test Triples   72.12%   54.42%   0%       0%          0%     0%
In a knowledge graph, a pair of relations r and r' are said to be inverse relations if a correct triple (h, r, t) implies the existence of another correct triple (t, r', h), and vice versa. A trivial triple refers to a triple (h, r, t) in the test dataset for which (t, r', h) is already present in the training dataset, with r and r' being inverse relations. A model that learns inverse relations well at the expense of other types of relations will still achieve very good performance on datasets involving such biased relations. This is undesirable, since our goal is to learn effective embeddings of highly multi-relational graphs.
We quantitatively investigated the bias of various datasets towards such inverse relations by measuring the fraction of trivial triples present in them. The results are summarized in Table 3. Using the training dataset, each pair of relations was tested for inversion. Two relations were identified as inverses if 80% or more of the triples containing one relation appeared as inverse triples involving the other relation. As can be seen from the table, the two standard benchmark datasets – FB15K and WN18 – have a large number of trivial triples. This is in contrast to the four other pre-existing datasets in the literature – FB13, WN11, FB15K-237, and WN18RR – which do not suffer from such bias. As mentioned above, we perform experiments spanning all six datasets.
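The inverse-relation test described above (the 80% threshold, applied in both directions) can be sketched as follows. This is our own illustrative code, not the authors' script; function names and the toy data are assumptions.

```python
from collections import defaultdict

def inverse_pairs(train, threshold=0.8):
    # Identify relation pairs (r, r') where at least `threshold` of the
    # (head, tail) pairs of each relation appear inverted under the other.
    by_rel = defaultdict(set)
    for h, r, t in train:
        by_rel[r].add((h, t))
    pairs, rels = [], list(by_rel)
    for i, r in enumerate(rels):
        for r2 in rels[i + 1:]:
            inv = {(t, h) for h, t in by_rel[r2]}
            common = len(by_rel[r] & inv)
            if (common / len(by_rel[r]) >= threshold
                    and common / len(by_rel[r2]) >= threshold):
                pairs.append((r, r2))
    return pairs

def trivial_fraction(train, test, inv_pairs):
    # Fraction of test triples (h, r, t) whose inverse (t, r', h)
    # is already present in the training set.
    inv = {}
    for r, r2 in inv_pairs:
        inv[r], inv[r2] = r2, r
    train = set(train)
    trivial = sum(1 for h, r, t in test
                  if r in inv and (t, inv[r], h) in train)
    return trivial / len(test)
```

Applied to FB15K and WN18, a procedure like this yields the trivial-triple percentages reported in Table 3.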
5 Experiments
We chose HolE and ComplEx for comparison as they represent the state-of-the-art in the current literature. Both models were reimplemented by us for a fair comparison. We were able to achieve better performance for HolE on FB15K than what was reported in the original paper. Available results for ConvE and R-GCN have been taken from [4] for comparison. We evaluated these models on the link prediction task, and the results are reported in Table 4.
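The link prediction metrics reported in Table 4 can be computed from the rank of the correct entity among all candidate entities for each test triple. A minimal sketch (our own helper, with a hypothetical list of ranks):

```python
def link_prediction_metrics(ranks, k=10):
    # ranks: rank of the correct entity for each test query (1 = best).
    # MR: mean rank; MRR: mean reciprocal rank; Hits@k: fraction with rank <= k.
    mr = sum(ranks) / len(ranks)
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(1 for r in ranks if r <= k) / len(ranks)
    return mr, mrr, hits

# Hypothetical ranks for four test queries.
mr, mrr, hits10 = link_prediction_metrics([1, 2, 5, 100], k=10)
# mr = 27.0, mrr = 0.4275, hits10 = 0.75
```

Note that a few very badly ranked queries dominate MR (here the single rank-100 query), whereas MRR and Hits@k are bounded per query; this is why the two can disagree across models in Table 4.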
             WN18                     FB15K
             Hits@10   MR    MRR      Hits@10   MR    MRR
HolE         94.12     810   0.934    84.35     113   0.64
ComplEx      94.64     826   0.938    87.33     113   0.75
ConvE        95.5      504   0.942    87.3      64    0.745
R-GCN        96.4      –     0.814    84.2      –     0.696
ER-MLP       94.2      299   0.895    80.14     81    0.57
ER-MLP-2d    93.66     372   0.893    80.04     81    0.567

             FB15K-237                WN18RR
             Hits@10   MR    MRR      Hits@10   MR     MRR
HolE         47.0      501   0.298    42.4      6129   0.395
ComplEx      50.7      381   0.326    50.7      5261   0.444
ConvE        45.8      330   0.301    41.1      7323   0.342
R-GCN        41.7      –     0.248    –         –      –
ER-MLP       54.0      219   0.342    41.92     4798   0.366
ER-MLP-2d    54.65     234   0.338    42.1      4233   0.358

             FB13                     WN11
             Hits@10   MR     MRR     Hits@10   MR      MRR
HolE         54.87     1436   0.392   10.48     10182   0.059
ComplEx      51.38     6816   0.382   10.9      11134   0.071
ER-MLP       63.13     705    0.479   14.01     4660    0.071
ER-MLP-2d    62.66     821    0.476   13.26     4290    0.073
5.1 Analysis of results
Based on the results in Table 4, we make the following observations.

Neural-network-based models achieve state-of-the-art performance on WN11, FB13 and FB15K-237, and perform competitively on WN18. This is encouraging, since all these datasets (except WN18) have zero trivial triples (Table 3) and are therefore more challenging than the other datasets.

Surprisingly, linear models such as ComplEx and HolE perform better than neural models on WN18RR, a dataset without trivial triples. This behavior has been related to the PageRank (a measure of in-degree) of central nodes in different datasets by [4]. They found that linear models perform better on simpler datasets with low relation-specific in-degree, such as WordNet, because they are easier to optimize and are able to find better local minima. Neural models show superior performance on complex datasets with higher relation-specific in-degree.

Despite the effectiveness of a simple neural model like ER-MLP, such methods have not received much attention in recent literature. Even though ER-MLP was compared against HolE in [10], a rigorous comparison involving diverse datasets was missing. The results in this paper address this gap and show that such simple models merit further consideration in the future.
6 Conclusions and Future Work
In this work, we showed that the current state-of-the-art models do not achieve uniformly good performance across different datasets, and that the current benchmark datasets can be misleading when evaluating a model's ability to represent multi-relational graphs. We recommend that models henceforth be evaluated on multiple datasets, so as to ensure their adaptability to Knowledge Graphs with different characteristics. We also showed that a neural network with a single hidden layer, which learns the scoring function together with the embeddings, can achieve competitive performance across datasets in spite of its simplicity. In future work, we plan to identify the characteristics of datasets that determine the performance of various models.
References
[1]
[2] Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. ACM, 1247–1250.
[3] Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems. 2787–2795.
[4] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel. 2017. Convolutional 2D knowledge graph embeddings. arXiv preprint arXiv:1707.01476 (2017).
[5] Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. 2014. Knowledge Vault: a web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 601–610.
[6] Diederik Kingma and Jimmy Ba. 2014. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[7] Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. 2015. Learning entity and relation embeddings for knowledge graph completion. In AAAI. 2181–2187.
[8] Tom M. Mitchell, William W. Cohen, Estevam R. Hruschka Jr., Partha Pratim Talukdar, Justin Betteridge, Andrew Carlson, Bhavana Dalvi Mishra, Matthew Gardner, Bryan Kisiel, Jayant Krishnamurthy, et al. 2015. Never-Ending Learning. In AAAI. 2302–2310.
[9] Vinod Nair and Geoffrey E. Hinton. 2010. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). 807–814.
[10] Maximilian Nickel, Lorenzo Rosasco, Tomaso A. Poggio, et al. 2016. Holographic embeddings of knowledge graphs. In AAAI. 1955–1961.
[11] Maximilian Nickel and Volker Tresp. 2013. Logistic tensor factorization for multi-relational data. arXiv preprint arXiv:1306.2084 (2013).
[12] M. Schlichtkrull, T. N. Kipf, P. Bloem, R. van den Berg, I. Titov, and M. Welling. 2017. Modeling relational data with graph convolutional networks. arXiv preprint arXiv:1703.06103 (2017).
[13] Richard Socher, Danqi Chen, Christopher D. Manning, and Andrew Ng. 2013. Reasoning with neural tensor networks for knowledge base completion. In Advances in Neural Information Processing Systems. 926–934.
[14] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
[15] Kristina Toutanova, Danqi Chen, Patrick Pantel, Hoifung Poon, Pallavi Choudhury, and Michael Gamon. 2015. Representing text for joint embedding of text and knowledge bases. In EMNLP, Vol. 15. 1499–1509.
[16] Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. In International Conference on Machine Learning. 2071–2080.
[17] Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. 2014. Knowledge graph embedding by translating on hyperplanes. In AAAI. 1112–1119.
[18] Robert West, Evgeniy Gabrilovich, Kevin Murphy, Shaohua Sun, Rahul Gupta, and Dekang Lin. 2014. Knowledge base completion via search-based question answering. In Proceedings of the 23rd International Conference on World Wide Web. ACM, 515–526.
[19] Han Xiao, Minlie Huang, Yu Hao, and Xiaoyan Zhu. 2015. TransA: an adaptive approach for knowledge graph embedding. arXiv preprint arXiv:1509.05490 (2015).
[20] Han Xiao, Minlie Huang, Yu Hao, and Xiaoyan Zhu. 2015. TransG: a generative mixture model for knowledge graph embedding. arXiv preprint arXiv:1509.05488 (2015).
[21] Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. 2014. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575 (2014).