Revisiting Simple Neural Networks for Learning Representations of Knowledge Graphs

by Srinivas Ravishankar, et al.

We address the problem of learning vector representations for entities and relations in Knowledge Graphs (KGs) for Knowledge Base Completion (KBC). This problem has received significant attention in the past few years and multiple methods have been proposed. Most of the existing methods in the literature use a predefined characteristic scoring function for evaluating the correctness of KG triples. These scoring functions distinguish correct triples (high score) from incorrect ones (low score). However, their performance varies across different datasets. In this work, we demonstrate that a simple neural network based score function can consistently achieve near state-of-the-art performance on multiple datasets. We also quantitatively demonstrate biases in standard benchmark datasets, and highlight the need to perform evaluation spanning various datasets.




1 Introduction

Knowledge Graphs (KGs) such as NELL [8] and Freebase [2] are repositories of information stored as multi-relational graphs. They are used in many applications such as information extraction and question answering. Such KGs contain world knowledge in the form of relational triples (h, r, t), where entity h is connected to entity t using directed relation r. For example, (DonaldTrump, PresidentOf, USA) would indicate the fact that Donald Trump is the president of the USA. Although current KGs are fairly large, containing millions of facts, they tend to be quite sparse [18]. To overcome this sparsity, Knowledge Base Completion (KBC) or Link Prediction is performed to infer missing facts from existing ones. Low dimensional vector representations of entities and relations, also called embeddings, have been extensively used for this problem [16; 10]. Most of these methods use a characteristic score function which distinguishes correct triples (high score) from incorrect triples (low score). Some of these methods and their scoring functions are summarized in Table 1.

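| Model | Score function |
|---|---|
| TransE [3] | $-\lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert$ |
| HolE [10] | $\mathbf{r}^{\top} (\mathbf{h} \star \mathbf{t})$ |
| DistMult [21] | $\langle \mathbf{h}, \mathbf{r}, \mathbf{t} \rangle$ |
| ComplEx [16] | $\mathrm{Re}(\langle \mathbf{h}, \mathbf{r}, \bar{\mathbf{t}} \rangle)$ |

Table 1: Score functions of some well known Knowledge Graph embedding models. Here $\mathbf{h}$, $\mathbf{t}$, $\mathbf{r}$ are vector embeddings for entities h, t and relation r, respectively. $\star$, $\langle \cdot, \cdot, \cdot \rangle$, and $\mathrm{Re}(\cdot)$ represent circular correlation, sum of component-wise product, and real part of complex number, respectively.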
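The four score functions named in Table 1 can be sketched in a few lines of plain Python. This is only an illustrative sketch using the commonly cited forms of these models: the function names are hypothetical, the L2 norm in the TransE variant is an assumption (TransE is also used with the L1 norm), and embeddings are passed as plain lists rather than learned parameters.

```python
import math

def transe_score(h, r, t):
    """Negative L2 distance between h + r and t (higher is more plausible)."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def distmult_score(h, r, t):
    """Sum of the component-wise product of h, r and t."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))

def complex_score(h, r, t):
    """Real part of the trilinear product with the tail embedding conjugated."""
    return sum(hi * ri * ti.conjugate() for hi, ri, ti in zip(h, r, t)).real

def circular_correlation(a, b):
    """[a * b]_k = sum_i a_i * b_{(i + k) mod d}, computed directly in O(d^2)."""
    d = len(a)
    return [sum(a[i] * b[(i + k) % d] for i in range(d)) for k in range(d)]

def hole_score(h, r, t):
    """Dot product of r with the circular correlation of h and t."""
    return sum(ri * ci for ri, ci in zip(r, circular_correlation(h, t)))
```

DistMult is symmetric in h and t by construction, whereas the conjugate in the ComplEx score lets it assign different scores to (h, r, t) and (t, r, h).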

WN18 and FB15k are two standard benchmark datasets for evaluating link prediction over KGs. Previous research has shown that these two datasets suffer from inverse relation bias [4] (more in Section 4.2.1). Performance on these datasets is largely dependent on the model’s ability to predict inverse relations, at the expense of other independent relations. In fact, a simple rule-based model exploiting such bias was shown to have achieved state-of-the-art performance on these two datasets [4]. To overcome this shortcoming, several variants, such as FB15k-237 [15], WN18RR [16], FB13 and WN11 [13] have been proposed in the literature.

ComplEx [16] and HolE [10] are two popular KG embedding techniques which achieve state-of-the-art performance on the WN18 and FB15k datasets. However, we observe that such methods do not perform as well uniformly across all the other datasets mentioned above. This may suggest that using a predefined scoring function, as in ComplEx and HolE, might not be the best option to achieve competitive results on all datasets. Ideally, we would prefer a model that achieves near state-of-the-art performance on any given dataset.

In this paper, we demonstrate that a simple neural network based score function can adapt to different datasets and achieve near state-of-the-art performance on multiple datasets. The main contributions of this paper can be summarized as follows.

  • We quantitatively demonstrate the severity of the inverse relation bias in the standard benchmark datasets.

  • We empirically show that current state-of-the-art methods do not perform consistently well over different datasets.

  • We demonstrate that ER-MLP [5], a simple neural network based scoring function, has the ability to adapt to different datasets, achieving near state-of-the-art performance consistently. We also consider a variant, ER-MLP-2d.

Code is available at

2 Related Work

Several methods have been proposed for learning KG embeddings. They differ in the way entities and relations are modeled, the score function used for scoring triples, and the loss function used for training. For example, TransE [3] uses real vectors for representing both entities and relations, while RESCAL [11] uses real vectors for entities, and real matrices for relations.

Translational Models: One of the initial models for KG embeddings is TransE [3], which models relations as translation vectors from head entity to tail entity for a given triple (h, r, t). A pair-wise ranking loss is then used for learning these embeddings. Following the basic idea of translation vectors in TransE, many methods have been proposed that improve on its performance. Some of these methods are TransH [17], TransR [7], TransA [19], and TransG [20].

Multiplicative Models: HolE [10] and ComplEx [16] are recent methods which achieve state-of-the-art performance in link prediction in commonly used datasets FB15k and WN18. HolE models entities and relations as real vectors and can handle asymmetric relations. ComplEx uses complex vectors and can handle symmetric, asymmetric as well as anti-symmetric relations. We use these methods as representatives of the state-of-the-art in our experiments.

Neural Models: Several methods that use neural networks for scoring triples have been proposed. Notable among them are NTN [13], CONV [15], ConvE [4], and R-GCN [12]. CONV uses the internal structure of textual relations as input to a Convolutional Neural Network. NTN learns a tensor net [13] for each relation in the knowledge graph. ConvE uses convolutional neural networks (CNNs) over reshaped input vectors for scoring triples. R-GCN takes a different approach and uses Graph Convolutional Networks to obtain embeddings from the graph. DistMult (or some other linear model) is then used on these embeddings to obtain a score. We focus on simple neural models, such as ER-MLP [5], and find that such simple models are more effective in KG embedding and link prediction than more complicated models such as ConvE or R-GCN.

3 Knowledge Graph Embedding using Simple Neural Networks

Rather than using a predefined function to score triples and then learning embeddings to fit this scoring function, we use a neural network to jointly learn both the scoring function and the embeddings to fit the dataset.

3.1 Neural Network as a Score Function

We use a simple feed-forward Neural Network with a single hidden layer as the approximator of the scoring function of a given triple. In particular, we use ER-MLP [5], a previously proposed neural network model for KG embedding, and ER-MLP-2d, a variant of ER-MLP we propose. Architectures of the two models are shown in Figure 1.

Figure 1: Architecture of (a) ER-MLP, (b) ER-MLP-2d

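Let $\mathbf{h}$, $\mathbf{t}$ be the $d$-dimensional embeddings of the entities $h$ and $t$ respectively. (We use boldface to refer to embeddings of the corresponding italicized objects.) Similarly, $\mathbf{r}$ is the embedding of relation $r$, whose dimensions are $d$ and $2d$ in ER-MLP and ER-MLP-2d, respectively, as we shall explain below. In ER-MLP, the head, relation and tail embeddings are concatenated and fed as input to the NN, so its input layer is of size $3d$. In ER-MLP-2d, the concatenated head and tail embeddings are translated using the relation embedding of size $2d$, so the input size in ER-MLP-2d is $2d$. Both models have a single fully connected hidden layer. This leads to an output node, which is taken as the score of the given triple $(h, r, t)$.

Let $g(\cdot)$ denote the activation function and $[\mathbf{a}; \mathbf{b}]$ denote concatenation of vectors $\mathbf{a}$ and $\mathbf{b}$. Let $M_1$ and $M_2$ be respectively the hidden and output layer weight matrices in ER-MLP. $b$ is a single bias value. Let the equivalent parameters in ER-MLP-2d be $M'_1$, $M'_2$, and $b'$. The triple scoring functions for ER-MLP and ER-MLP-2d respectively are given below.

$$f_{\mathrm{MLP}}(h, r, t) = M_2^{\top} \, g\left(M_1^{\top} [\mathbf{h}; \mathbf{r}; \mathbf{t}]\right) + b$$

$$f_{\mathrm{MLP\text{-}2d}}(h, r, t) = {M'_2}^{\top} \, g\left({M'_1}^{\top} \left([\mathbf{h}; \mathbf{t}] + \mathbf{r}\right)\right) + b'$$

We consider the sigmoid of the score, $\sigma(f(h, r, t)) = 1 / (1 + e^{-f(h, r, t)})$, to be the probability of correctness of a triple. We train the model to assign probability of 1 to correct triples and 0 to incorrect triples. Let $\mathcal{D} = \{(x_i, y_i)\}$ be the set of positive and (sampled) negative triples $x_i = (h_i, r_i, t_i)$, with label $y_i \in \{0, 1\}$. We optimize the cross-entropy loss given below, with $f$ replaced by $f_{\mathrm{MLP}}$ and $f_{\mathrm{MLP\text{-}2d}}$ for ER-MLP and ER-MLP-2d, respectively.

$$\mathcal{L} = -\sum_{(x_i, y_i) \in \mathcal{D}} \left[ y_i \log \sigma(f(x_i)) + (1 - y_i) \log\left(1 - \sigma(f(x_i))\right) \right]$$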
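As a rough illustration of the two architectures described in this section, the forward pass and the loss can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the parameter names `M1`, `M2` and `b` are illustrative, the hidden layer is assumed to have no bias of its own, ReLU is used as the activation, and dropout and training are omitted.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def matvec_t(M, v):
    """Compute M^T v, where M is a list of len(v) rows with k columns each."""
    k = len(M[0])
    return [sum(M[i][j] * v[i] for i in range(len(v))) for j in range(k)]

def er_mlp_score(h, r, t, M1, M2, b):
    """ER-MLP: concatenate [h; r; t] (size 3d), one hidden layer, scalar output."""
    x = h + r + t  # list concatenation
    hidden = relu(matvec_t(M1, x))
    return sum(w * a for w, a in zip(M2, hidden)) + b

def er_mlp_2d_score(h, r, t, M1, M2, b):
    """ER-MLP-2d: translate [h; t] (size 2d) by a relation embedding of size 2d."""
    x = [c + ri for c, ri in zip(h + t, r)]
    hidden = relu(matvec_t(M1, x))
    return sum(w * a for w, a in zip(M2, hidden)) + b

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(scores, labels):
    """Summed binary cross-entropy over triple scores with labels in {0, 1}."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for s, y in zip(scores, labels):
        p = sigmoid(s)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total
```

The only structural difference between the two scorers is how the input vector is formed: concatenation of three vectors in ER-MLP versus an element-wise shift of the concatenated entity pair in ER-MLP-2d.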

4 Experimental Setup

4.1 Implementation Details

For initialization of embeddings, Uniform initialization with range was used. Xavier initialization was used for weights. In the neural network-based models, we used Dropout [14] with p = 0.5 on the hidden layer to prevent overfitting. The regularization parameter for weight decay was chosen from {0.001, 0.01, 0.1} based on cross-validation. We chose the hidden layer size from {10d, 20d}. Since the size of this layer determines the expressive power of the model, datasets with simple relations (lower relation-specific indegree) require a smaller number of hidden units, and more difficult datasets require a higher number to achieve optimal performance.

ReLU [9] activation is used in the hidden layer to achieve fast convergence. To minimize the objective function, we used ADAM [6] with learning rate 0.001. The dimensionality of entity and relation embeddings was set equal in ER-MLP, i.e., both have size $d$. For ER-MLP-2d, the relation embedding has size $2d$. $d$ was cross-validated on {100, 200} for all datasets. All experiments were run using TensorFlow on a single GTX 1080 GPU. To achieve maximum GPU utilization, we set the batch size larger than used previously in the literature, choosing from {10000, 20000, 50000} using cross-validation. For sampling negative triples, we used the Bernoulli method as described in [3].
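The Bernoulli scheme for negative sampling is commonly described as follows: for each relation, compute the average number of tails per head (tph) and heads per tail (hpt), then corrupt the head with probability tph / (tph + hpt) and the tail otherwise. The sketch below assumes that description. The function names are hypothetical, and it does not filter out corrupted triples that happen to be true, which a full implementation may want to do.

```python
import random
from collections import defaultdict

def bernoulli_head_probs(train_triples):
    """Per relation, the probability of corrupting the head: tph / (tph + hpt)."""
    pairs = defaultdict(set)
    heads = defaultdict(set)
    tails = defaultdict(set)
    for h, r, t in train_triples:
        pairs[r].add((h, t))
        heads[r].add(h)
        tails[r].add(t)
    probs = {}
    for r in pairs:
        tph = len(pairs[r]) / len(heads[r])  # average tails per head
        hpt = len(pairs[r]) / len(tails[r])  # average heads per tail
        probs[r] = tph / (tph + hpt)
    return probs

def corrupt(triple, entities, head_prob, rng=random):
    """Replace either the head or the tail of a triple with a random entity."""
    h, r, t = triple
    if rng.random() < head_prob[r]:
        return (rng.choice(entities), r, t)
    return (h, r, rng.choice(entities))
```

For a one-to-many relation, tph is large, so the head is corrupted more often; this reduces the chance of producing a negative that is actually a true triple.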

Model Number of Parameters WN18 FB15K
Table 2: Number of parameters of various methods over the WN18 and FB15K datasets, expressed in terms of the entity embedding size, the number of entities, and the number of relations.

4.2 Datasets

We ran experiments on the datasets listed in Table 3. Previous work has noted that the two benchmark datasets – FB15K and WN18 – have a high number of redundant and reversible relations [15]. A simple rule-based model, exploiting such deficiencies, was shown to have achieved state-of-the-art performance in these datasets [4]. This suggests that evaluation restricted only to these two datasets may not be an accurate indication of the model’s capability. In order to address this issue, we evaluate model performance over six datasets, as summarized in Table 3.

4.2.1 Inverse Relation Bias

| | WN18 | FB15K | WN18RR | FB15K-237 | WN11 | FB13 |
|---|---|---|---|---|---|---|
| Number of Relations | 18 | 1345 | 11 | 237 | 11 | 13 |
| Percentage of Trivial Test Triples | 72.12% | 54.42% | 0% | 0% | 0% | 0% |

Table 3: Inverse Relation Bias present in various datasets. Please see Section 4.2.1 for more details.

In a knowledge graph, a pair of relations r and r’ are said to be inverse relations if a correct triple (h, r, t) implies the existence of another correct triple (t, r’, h), and vice versa. A trivial triple refers to the existence of a triple (h, r, t) in the test dataset when (t, r’, h) is already present in the training dataset, with r and r’ being inverse relations. A model that can learn inverse relations well at the expense of other types of relations will still achieve very good performance on datasets involving such biased relations. This is undesirable, since our goal is to learn effective embeddings of highly multi-relational graphs.

We quantitatively investigated the bias of various datasets towards such inverse relations by measuring the fraction of trivial triples present in them. The results are summarized in Table 3. Using the training dataset, each pair of relations was tested for inversion. A pair was identified as inverses if 80% or more of the triples that contained one relation appeared as an inverse triple involving the other relation. As can be seen from the table, the two standard benchmark datasets, FB15K and WN18, have a large fraction of trivial test triples (54.42% and 72.12%, respectively). This is in contrast to four other pre-existing datasets in the literature, FB13, WN11, FB15K-237, and WN18RR, which do not suffer from such bias. As mentioned above, we perform experiments spanning all six datasets.
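The inverse-relation test and the trivial-triple count described in this section can be sketched as follows. This is one possible reading of the procedure, not the authors' code: it assumes the 80% threshold must hold in both directions, it lets a relation be paired with itself (which treats symmetric relations as their own inverse), and the function names are hypothetical. The pairwise loop is quadratic in the number of relations, so it is meant for illustration only.

```python
from collections import defaultdict

def find_inverse_pairs(train, threshold=0.8):
    """Return ordered relation pairs (r1, r2) judged to be inverses on train."""
    by_rel = defaultdict(set)
    for h, r, t in train:
        by_rel[r].add((h, t))

    def coverage(r1, r2):
        # Fraction of r1 triples (h, t) whose reversed pair (t, h) occurs under r2.
        pairs = by_rel[r1]
        return sum((t, h) in by_rel[r2] for h, t in pairs) / len(pairs)

    rels = list(by_rel)
    inverse = set()
    for r1 in rels:
        for r2 in rels:
            if coverage(r1, r2) >= threshold and coverage(r2, r1) >= threshold:
                inverse.add((r1, r2))
    return inverse

def trivial_fraction(train, test, inverse_pairs):
    """Fraction of test triples (h, r, t) with (t, r', h) in train for an inverse r'."""
    train_set = set(train)
    inv_of = defaultdict(set)
    for r1, r2 in inverse_pairs:
        inv_of[r1].add(r2)
    trivial = sum(
        any((t, r2, h) in train_set for r2 in inv_of[r])
        for h, r, t in test
    )
    return trivial / len(test)
```

On a dataset with no detected inverse pairs, `trivial_fraction` returns 0, matching the 0% entries reported in Table 3 for the four unbiased datasets.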

5 Experiments

We chose HolE and ComplEx for comparison as these are the state-of-the-art in the current literature. Both models were re-implemented by us for fair comparison. We were able to achieve better performance for HolE on FB15K than what was reported in the original paper. Available results of ConvE and R-GCN have been taken from [4] for comparison. We evaluated these models on the link prediction task and the results are reported in Table 4.

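| Model | WN18 Hits@10 | WN18 MR | WN18 MRR | FB15K Hits@10 | FB15K MR | FB15K MRR |
|---|---|---|---|---|---|---|
| HolE | 94.12 | 810 | 0.934 | 84.35 | 113 | 0.64 |
| ComplEx | 94.64 | 826 | 0.938 | 87.33 | 113 | 0.75 |
| ConvE | 95.5 | 504 | 0.942 | 87.3 | 64 | 0.745 |
| R-GCN | 96.4 | - | 0.814 | 84.2 | - | 0.696 |
| ER-MLP | 94.2 | 299 | 0.895 | 80.14 | 81 | 0.57 |
| ER-MLP-2d | 93.66 | 372 | 0.893 | 80.04 | 81 | 0.567 |

| Model | FB15K-237 Hits@10 | FB15K-237 MR | FB15K-237 MRR | WN18RR Hits@10 | WN18RR MR | WN18RR MRR |
|---|---|---|---|---|---|---|
| HolE | 47.0 | 501 | 0.298 | 42.4 | 6129 | 0.395 |
| ComplEx | 50.7 | 381 | 0.326 | 50.7 | 5261 | 0.444 |
| ConvE | 45.8 | 330 | 0.301 | 41.1 | 7323 | 0.342 |
| R-GCN | 41.7 | - | 0.248 | - | - | - |
| ER-MLP | 54.0 | 219 | 0.342 | 41.92 | 4798 | 0.366 |
| ER-MLP-2d | 54.65 | 234 | 0.338 | 42.1 | 4233 | 0.358 |

| Model | FB13 Hits@10 | FB13 MR | FB13 MRR | WN11 Hits@10 | WN11 MR | WN11 MRR |
|---|---|---|---|---|---|---|
| HolE | 54.87 | 1436 | 0.392 | 10.48 | 10182 | 0.059 |
| ComplEx | 51.38 | 6816 | 0.382 | 10.9 | 11134 | 0.071 |
| ER-MLP | 63.13 | 705 | 0.479 | 14.01 | 4660 | 0.071 |
| ER-MLP-2d | 62.66 | 821 | 0.476 | 13.26 | 4290 | 0.073 |

Table 4: Link prediction performance of various methods on different datasets. Available results for ConvE and R-GCN are taken from [4]. Please see Section 5 for more details.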
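For readers unfamiliar with the three metrics in Table 4, the sketch below shows how Hits@10, mean rank (MR) and mean reciprocal rank (MRR) are usually computed from the rank of the correct entity for each test query. The helper names are hypothetical, and the text here does not say whether raw or filtered ranks were used, so the functions simply take ranks as given. Hits@10 is returned as a percentage and MRR as a fraction, following the table.

```python
def rank_of(scores, target_index):
    """Rank of the target candidate: 1 + number of candidates scoring strictly higher."""
    target = scores[target_index]
    return 1 + sum(s > target for s in scores)

def link_prediction_metrics(ranks, k=10):
    """Aggregate per-query ranks (1 is best) into MR, MRR and Hits@k."""
    n = len(ranks)
    return {
        "MR": sum(ranks) / n,
        "MRR": sum(1.0 / r for r in ranks) / n,
        "Hits@k": 100.0 * sum(r <= k for r in ranks) / n,
    }
```

MR is dominated by a few badly ranked queries, while MRR and Hits@10 mostly reflect how often the correct entity lands near the top, which is why a model can lead on one metric and trail on another in the same column group.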

5.1 Analysis of results

Based on the results in Table 4, we make the following observations.

  1. Neural network based models achieve state-of-the-art performance on WN11, FB13 and FB15K-237, and perform competitively on WN18. This is encouraging since all these datasets (except WN18) have zero trivial triples (Table 3) and are therefore more challenging compared to the other datasets.

  2. Surprisingly, linear models such as ComplEx and HolE perform better than neural models on WN18RR, a dataset without trivial triples. This behavior has been related to the PageRank (a measure of indegree) of central nodes in different datasets by [4]. They found that linear models perform better on simpler datasets with low relation-specific indegree, such as WordNet. This is because they are easier to optimize and are able to find better local minima. Neural models show superior performance on complex datasets with higher relation-specific indegree.

  3. Despite the effectiveness of a simple neural model like ER-MLP, such methods haven’t received much attention in recent literature. Even though ER-MLP was compared against HolE in [10], a rigorous comparison involving diverse datasets was missing. Results in this paper address this gap, and show that such simple models merit further consideration in the future.

6 Conclusions and Future Work

In this work, we showed that the current state-of-the-art models do not achieve uniformly good performance across different datasets, and that the current benchmark datasets can be misleading when evaluating a model’s ability to represent multi-relational graphs. We recommend that models henceforth be evaluated on multiple datasets so as to ensure their adaptability to Knowledge Graphs with different characteristics. We also showed that a neural network with a single hidden layer, which learns the scoring function together with the embeddings, can achieve competitive performance across datasets in spite of its simplicity. In the future, we plan to identify the characteristics of datasets that determine the performance of various models.