Knowledge Graph Embedding for Link Prediction: A Comparative Analysis

02/03/2020 ∙ by Andrea Rossi, et al. ∙ University of Alberta, Università Roma Tre

Knowledge Graphs (KGs) have found many applications in industry and academic settings, which, in turn, have motivated considerable research efforts towards large-scale information extraction from a variety of sources. Despite such efforts, it is well known that even state-of-the-art KGs suffer from incompleteness. Link Prediction (LP), the task of predicting missing facts among entities already in a KG, is a promising and widely studied task aimed at addressing KG incompleteness. Among the recent LP techniques, those based on KG embeddings have achieved very promising performances in some benchmarks. Despite the fast-growing literature on the subject, insufficient attention has been paid to the effect of the various design choices in those methods. Moreover, the standard practice in this area is to report accuracy by aggregating over a large number of test facts in which some entities are over-represented; this allows LP methods to exhibit good performance by just attending to structural properties that include such entities, while ignoring the remaining majority of the KG. This analysis provides a comprehensive comparison of embedding-based LP methods, extending the dimensions of analysis beyond what is commonly available in the literature. We experimentally compare effectiveness and efficiency of 16 state-of-the-art methods, consider a rule-based baseline, and report detailed analysis over the most popular benchmarks in the literature.




1. Introduction

Knowledge Graphs (KGs) are structured representations of real world information. In a KG nodes represent entities, such as people and places; labels are types of relations that can connect them; edges are specific facts connecting two entities with a relation. Due to their capability to model structured, complex data in a machine-readable way, KGs are nowadays widely employed in various domains, ranging from question answering to information retrieval and content-based recommendation systems, and they are vital to any semantic web project (Hovy et al., 2013). Examples of notable KGs are FreeBase (Bollacker et al., 2008), WikiData (Vrandečić and Krötzsch, 2014), DBPedia (Auer et al., 2007), Yago (Suchanek et al., 2007) and – in industry – Google KG (Singhal, 2012), Microsoft Satori (Qian, 2013) and Facebook Graph Search (Stocky and Rasmussen, 2014). These massive KGs can contain millions of entities and billions of facts.

Despite such efforts, it is well known that even state-of-the-art KGs suffer from incompleteness. For instance, it has been observed that over 70% of person entities have no known place of birth, and over 99% have no known ethnicity (West et al., 2014; Dong et al., 2014) in FreeBase, one of the largest and most widely used KGs for research purposes. This has led researchers to propose various techniques for correcting errors as well as adding missing facts to KGs (Paulheim, 2017), commonly known as the task of Knowledge Graph Completion or Knowledge Graph Augmentation. Growing an existing KG can be done by extracting new facts from external sources, such as Web corpora, or by inferring missing facts from those already in the KG. The latter approach, called Link Prediction (LP), is the focus of our analysis.

LP has been an increasingly active area of research, which has more recently benefited from the explosion of machine learning and deep learning techniques. The vast majority of LP models nowadays use original KG elements to learn low-dimensional representations dubbed Knowledge Graph Embeddings, and then employ them to infer new facts. Inspired by a few seminal works such as RESCAL (Nickel et al., 2011) and TransE (Bordes et al., 2013), in the short span of just a few years researchers have developed dozens of novel models based on very different architectures. One aspect that is common to the vast majority of papers in this area, but nevertheless also problematic, is that they report results aggregated over a large number of test facts in which a few entities are over-represented. As a result, LP methods can exhibit good performance on these benchmarks by attending only to such entities while ignoring the others. Moreover, the limitations of the current best practice can make it difficult to understand how the papers in this literature fit together and to picture what research directions are worth pursuing. In addition, the strengths, weaknesses and limitations of the current techniques are still unknown; that is, the circumstances allowing models to perform better have hardly been investigated. Roughly speaking, we still do not really know what makes a fact easy or hard to learn and predict.

In order to mitigate the issues mentioned above, we carry out an extensive comparative analysis of a representative set of LP models based on KG embeddings. We privilege state-of-the-art systems, and consider works belonging to a wide range of architectures. We train and tune such systems from scratch and provide experimental results beyond what is available in the original papers, by proposing new and informative evaluation practices. Specifically:

  • We take into account 16 models belonging to diverse machine learning and deep learning architectures; we also adopt as a baseline an additional state-of-the-art LP model based on rule mining. We provide a detailed description of the approaches considered for experimental comparison and a summary of related literature, together with an educational taxonomy for Knowledge Graph Embedding techniques.

  • We take into account the 5 most commonly employed datasets as well as the most popular metrics currently used for benchmarking; we analyze in detail their features and peculiarities.

  • For each model we provide quantitative results for efficiency and effectiveness on every dataset.

  • We define a set of structural features in the training data, and we measure how they affect the predictive performance of each model on each test fact.

The datasets, the code and all the resources used in our work are publicly available through our GitHub repository. For each model and dataset, we also share CSV files containing, for each test prediction, the rank and the list of all the entities predicted up to the correct one.


The paper is organized as follows. Section 2 provides background on KG embedding and LP. Section 3 introduces the models included in our work, presenting them in a taxonomy to facilitate their description. Section 4 describes the analysis directions and approaches we follow in our work. Section 5 reports our results and observations. Section 6 provides lessons learned and future research directions. Section 7 discusses related works, and Section 8 provides concluding remarks.

2. The Link Prediction Problem

This section provides a detailed outline for the LP task in the context of KGs, introducing key concepts that we are going to refer to in our work.

We define a KG as a labeled, directed multi-graph $KG = (\mathcal{E}, \mathcal{R}, \mathcal{G})$:

  • $\mathcal{E}$: a set of nodes representing entities;

  • $\mathcal{R}$: a set of labels representing relations;

  • $\mathcal{G} \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E}$: a set of edges representing facts connecting pairs of entities. Each fact is a triple ⟨h, r, t⟩, where h is the head, r is the relation, and t is the tail of the fact.

Link Prediction (LP) is the task of exploiting the existing facts in a KG to infer missing ones. This amounts to guessing the correct entity that completes ⟨h, r, ?⟩ (tail prediction) or ⟨?, r, t⟩ (head prediction). For the sake of simplicity, when talking about head and tail prediction globally, we call source entity the known entity in the prediction, and target entity the one to predict.

Over time, numerous approaches have been proposed to tackle the LP task. Some methods are based on observable features and employ techniques such as Rule Mining (Galárraga et al., 2013; Galárraga et al., 2015; Meilicke et al., 2018; Huynh et al., 2019) or the Path Ranking Algorithm (Lao and Cohen, 2010; Lao et al., 2011) to identify missing triples in the graph. Recently, with the rise of novel Machine Learning techniques, researchers have been experimenting with capturing latent features of the graph through vectorized representations, or embeddings, of its components. In general, embeddings are vectors of numerical values that can be used to represent any kind of element (e.g., depending on the domain: words, people, products…). Embeddings are learned automatically, based on how the corresponding elements occur and interact with each other in datasets representative of the real world. For instance, word embeddings have become a standard way to represent words in a vocabulary, and they are usually learned using textual corpora as input data. When it comes to KGs, embeddings are typically used to represent entities and relationships using the graph structure; the resulting vectors, dubbed KG Embeddings, embody the semantics of the original graph, and can be used to identify new links inside it, thus tackling the LP task.

In the following we use italic letters to identify KG elements (entities or relations), and bold letters to identify the corresponding embeddings. Given for instance a generic entity, we may use $e$ when referring to its element in the graph, and $\mathbf{e}$ when referring to its embedding.

Datasets employed in LP research are typically obtained by subsampling real-world KGs; each dataset can therefore be seen as a small KG with its own sets of entities $\mathcal{E}$, relations $\mathcal{R}$ and facts $\mathcal{G}$. In order to facilitate research, $\mathcal{G}$ is further split into three disjoint subsets: a training set $\mathcal{G}_{train}$, a validation set $\mathcal{G}_{valid}$ and a test set $\mathcal{G}_{test}$.

Most LP models based on embeddings define a scoring function $\phi$ to estimate the plausibility of any fact ⟨h, r, t⟩ using the embeddings of its elements:

$$\phi(h, r, t)$$

In this paper, unless differently specified, we are going to assume that the higher the score $\phi(h, r, t)$, the more plausible the fact.

During training, embeddings are usually initialized randomly and subsequently improved with optimization algorithms such as back-propagation with gradient descent. The positive samples in $\mathcal{G}_{train}$ are often randomly corrupted in order to generate negative samples. The optimization process aims at maximizing the plausibility of positive facts as well as minimizing the plausibility of negative facts; this often amounts to employing a triplet loss function. Over time, more effective ways to generate negative triples have been proposed, such as sampling from a Bernoulli distribution (Wang et al., 2014) or generating them with adversarial algorithms (Sun et al., 2019). In addition to the embeddings of KG elements, models may also use the same optimization algorithms to learn additional parameters (e.g. the weights of neural layers). Such parameters, if present, are employed in the scoring function to process the actual embeddings of entities and relations. Since they are not specific to any KG element, they are often dubbed shared parameters.
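As an illustration of the corruption step and of a triplet-style objective, the following minimal Python sketch may help. It assumes entities are integer identifiers, corrupts the head or the tail with equal probability, and uses a margin-based hinge loss as one common instance of a triplet loss; the function names `corrupt` and `margin_ranking_loss` and the default margin are illustrative choices, not taken from any specific model.

```python
import numpy as np

def corrupt(triple, num_entities, rng):
    """Build a negative sample by replacing either the head or the tail
    (chosen with equal probability) with a different random entity."""
    h, r, t = triple
    if rng.random() < 0.5:
        candidates = [e for e in range(num_entities) if e != h]
        return (int(rng.choice(candidates)), r, t)
    candidates = [e for e in range(num_entities) if e != t]
    return (h, r, int(rng.choice(candidates)))

def margin_ranking_loss(pos_score, neg_score, margin=1.0):
    """Hinge loss: it is zero once the positive fact outscores
    the negative one by at least `margin`."""
    return max(0.0, margin - pos_score + neg_score)

rng = np.random.default_rng(0)
negative = corrupt((0, 1, 2), num_entities=5, rng=rng)
```

Note that uniform corruption can accidentally produce a fact that is actually true; the sketch does not check for this.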

In the prediction phase, given an incomplete triple ⟨h, r, ?⟩, the missing tail is inferred as the entity that, completing the triple, results in the highest score:

$$t = \operatorname{argmax}_{e \in \mathcal{E}} \phi(h, r, e)$$

Head prediction is performed analogously.

Evaluation is carried out by performing both head and tail prediction on all test triples in $\mathcal{G}_{test}$, and computing for each prediction how the target entity ranks against all the other ones. Ideally, the target entity should yield the highest plausibility.

Ranks can be computed in two largely different settings, called raw and filtered scenarios. As a matter of fact, a prediction may have multiple valid answers: for instance, when predicting the tail for ⟨Barack Obama, parent, Natasha Obama⟩, a model may associate a higher score to Malia Obama than to Natasha Obama. More generally, if the predicted fact is contained in $\mathcal{G}$ (that is, either in $\mathcal{G}_{train}$, or in $\mathcal{G}_{valid}$ or in $\mathcal{G}_{test}$), the answer is valid. Depending on whether valid answers should be considered acceptable or not, two separate settings have been devised:

  • Raw Scenario: in this scenario, valid entities outscoring the target one are considered as mistakes. Therefore they do contribute to the rank computation. Given a test fact ⟨h, r, t⟩, the raw rank of the target tail $t$ is computed as:

    $$rank_t = |\{e \in \mathcal{E} \setminus \{t\} : \phi(h, r, e) > \phi(h, r, t)\}| + 1$$

    The raw rank in head prediction can be computed analogously.

  • Filtered Scenario: in this scenario, valid entities outscoring the target one are not considered mistakes. Therefore they are skipped when computing the rank. Given a test fact ⟨h, r, t⟩, the filtered rank of the target tail $t$ is computed as:

    $$rank_t = |\{e \in \mathcal{E} \setminus \{t\} : \phi(h, r, e) > \phi(h, r, t) \wedge \langle h, r, e \rangle \notin \mathcal{G}\}| + 1$$

    The filtered rank in head prediction can be computed analogously.
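The difference between the two scenarios can be sketched in a few lines of Python. The example assumes that `scores[e]` already holds the score of the triple completed with entity `e`, and that `known_tails` (a hypothetical name) lists the entities that form a valid fact with the given head and relation; ties are ignored here.

```python
def tail_rank(scores, target, known_tails=frozenset()):
    """1-based rank of `target` among all candidate tails.

    scores[e]   : score of the triple completed with entity e
    known_tails : entities e for which the completed triple is a known
                  valid fact; leave empty to obtain the raw rank
    """
    better = 0
    for e, s in enumerate(scores):
        if e == target:
            continue
        if e in known_tails:
            continue  # filtered scenario: valid answers are skipped
        if s > scores[target]:
            better += 1
    return better + 1

# Entity 0 is another valid answer that outscores the target (entity 2).
scores = [0.9, 0.2, 0.7, 0.4]
raw = tail_rank(scores, target=2)
filtered = tail_rank(scores, target=2, known_tails={0, 2})
```

In this toy case the raw rank is 2 and the filtered rank is 1, because the only entity outscoring the target is itself a valid answer.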

In order to compute the rank it is also necessary to define the policy to apply when the target entity obtains the same score as other ones. This event is called a tie and it can be handled with different policies:

  • min: the target is given the lowest rank among the entities in tie. This is the most permissive policy, and it may result in artificially boosting performances: as an extreme example, a model systematically setting the same score to all entities would obtain perfect results under this policy.

  • average: the target is given the average rank among the entities in tie.

  • random: the target is given a random rank among the entities in tie. On large test sets, this policy should globally amount to the average policy.

  • ordinal: the entities in tie are given ranks based on the order in which they have been passed to the model. This usually depends on the internal identifiers of entities, which are independent from their scores: therefore this policy should globally correspond to the random policy.

  • max: the target is given the highest (worst) rank among the entities in tie. This is the most strict policy.
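The deterministic tie policies can be compared with a short Python sketch. It assumes that the position of an entity in `scores` reflects the order in which entities are passed to the model (used by the ordinal policy); the random policy is omitted for brevity, and the function name `rank_with_ties` is illustrative.

```python
def rank_with_ties(scores, target, policy="average"):
    """Rank of `target` under a given tie-handling policy."""
    target_score = scores[target]
    higher = sum(1 for e, s in enumerate(scores)
                 if e != target and s > target_score)
    tied = sum(1 for e, s in enumerate(scores)
               if e != target and s == target_score)
    if policy == "min":
        return higher + 1
    if policy == "max":
        return higher + tied + 1
    if policy == "average":
        # mean of the ranks higher+1, ..., higher+tied+1
        return higher + 1 + tied / 2
    if policy == "ordinal":
        # tied entities are ordered by their position in `scores`
        earlier = sum(1 for s in scores[:target] if s == target_score)
        return higher + earlier + 1
    raise ValueError(f"unknown policy: {policy}")

# A degenerate model that assigns the same score to every entity.
constant_scores = [0.5, 0.5, 0.5, 0.5]
ranks = {p: rank_with_ties(constant_scores, target=2, policy=p)
         for p in ("min", "average", "ordinal", "max")}
```

With constant scores the min policy yields rank 1 for every prediction, which illustrates how that policy can artificially boost performance.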

The ranks obtained from test predictions are usually employed to compute standard global metrics. The most commonly employed metrics in LP are:

Mean Rank (MR)

It is the average of the obtained ranks:

$$MR = \frac{1}{|Q|} \sum_{q \in Q} q$$

where $Q$ denotes the collection of ranks obtained over the test predictions. It is always between 1 and $|\mathcal{E}|$, and the lower it is, the better the model results. It is very sensitive to outliers, therefore researchers lately have started avoiding it, resorting to Mean Reciprocal Rank instead.

Mean Reciprocal Rank (MRR)

It is the average of the inverse of the obtained ranks:

$$MRR = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{q}$$

It is always between 0 and 1, and the higher it is, the better the model results.

Hits@K (H@K)

It is the ratio of predictions for which the rank is less than or equal to a threshold $K$:

$$H@K = \frac{|\{q \in Q : q \leq K\}|}{|Q|}$$

Common values for K are 1, 3, 5, 10. The higher the H@K, the better the model results. In particular, when $K = 1$, it measures the ratio of the test facts in which the target was predicted correctly on the first try. H@1 and MRR are often closely related, because these predictions also correspond to the most relevant addends to the MRR formula.

These metrics can be computed either separately for subsets of predictions (e.g. considering separately head and tail predictions) or considering all test predictions altogether.
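A minimal Python sketch of the three metrics follows, assuming the ranks of all test predictions have already been collected in a list; the sample ranks are made up for illustration.

```python
import numpy as np

def mean_rank(ranks):
    return float(np.mean(ranks))

def mean_reciprocal_rank(ranks):
    return float(np.mean(1.0 / np.asarray(ranks, dtype=float)))

def hits_at_k(ranks, k):
    return float(np.mean(np.asarray(ranks) <= k))

# Illustrative ranks: three good predictions and one outlier.
ranks = [1, 2, 4, 100]
mr = mean_rank(ranks)              # 26.75, dominated by the outlier
mrr = mean_reciprocal_rank(ranks)  # 0.44, barely affected by it
h1 = hits_at_k(ranks, 1)           # 0.25
h10 = hits_at_k(ranks, 10)         # 0.75
```

The single rank of 100 pushes MR to 26.75 while contributing only 0.01 to the sum behind MRR, which shows why MR is considered sensitive to outliers.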

3. Overview of Link Prediction Techniques

In this section we survey and discuss the main LP approaches for KGs based on latent features. As already described in Section 2, LP models can exploit a large variety of approaches and architectures, depending on how they model the optimization problem and on the techniques they implement to tackle it.

In order to overview their highly diverse characteristics we propose a novel taxonomy illustrated in Figure 1. We define three main families of models, and further divide each of them into smaller groups, identified by unique colours. For each group, we include the most valid representative models, prioritizing the ones reaching state-of-the-art performance and, whenever possible, those with publicly available implementations. The result is a set of 16 models based on extremely diverse architectures; these are the models we subsequently employ in the experimental sections of our comparative analysis. For each model we also report the year of publication as well as the influences it has received from the others. We believe that this taxonomy facilitates the understanding of these models and of the experiments carried out in our work.

Further information on the included models, such as their loss function and their space complexity, is reported in Table 1.


In our analysis we focus on the body of literature for systems that learn from the KG structure. We refer the reader to works discussing how to leverage additional sources of information, such as textual captions (Toutanova et al., 2015; Wang and Li, 2016; An et al., 2018), images (Xie et al., 2017) or pre-computed rules (Guo et al., 2018); see (Gesese et al., 2019) for a survey exclusive to these models.

Figure 1. Taxonomy for the LP models included in our analysis. Dotted arrows indicate that the target method builds on the source method by either generalizing or specializing the definition of its scoring function. The included models are: DistMult (Yang et al., 2015); ComplEx (Trouillon et al., 2016); ANALOGY (Liu et al., 2017); SimplE (Kazemi and Poole, 2018); HolE (Nickel et al., 2016); TuckER (Balazevic et al., 2019); TransE (Bordes et al., 2013); STransE (Nguyen et al., 2016); CrossE (Zhang et al., 2019); TorusE (Ebisu and Ichise, 2018); RotatE (Sun et al., 2019); ConvE (Dettmers et al., 2018); ConvKB (Nguyen et al., 2018); ConvR (Jiang et al., 2019); CapsE (Nguyen et al., 2019); RSN (Guo et al., 2019).

We identify three main families of models: 1) Tensor Decomposition Models; 2) Geometric Models; 3) Deep Learning Models.

3.1. Tensor Decomposition Models

Models in this family interpret LP as a task of tensor decomposition (Kolda and Bader, 2009). These models implicitly consider the KG as a 3D adjacency matrix (that is, a 3-way tensor) that is only partially observable due to the KG incompleteness. The tensor is decomposed into a combination (e.g. a multi-linear product) of low-dimensional vectors: such vectors are used as embeddings for entities and relations. The core idea is that, provided that the model does not overfit on the training set, the learned embeddings should be able to generalize, and associate high values to unseen true facts in the graph adjacency matrix. In practice, the score of each fact is computed by applying that combination to the specific embeddings involved in that fact; the embeddings are learned as usual by optimizing the scoring function for all training facts. These models tend to employ few or no shared parameters at all; this makes them particularly light and easy to train.

3.1.1. Bilinear Models

Given the head embedding $\mathbf{h} \in \mathbb{R}^d$ and the tail embedding $\mathbf{t} \in \mathbb{R}^d$, these models represent the relation embedding as a bidimensional matrix $\mathbf{r} \in \mathbb{R}^{d \times d}$. The scoring function is then computed as a bilinear product:

$$\phi(h, r, t) = \mathbf{h}^{\top} \times \mathbf{r} \times \mathbf{t}$$

where symbol $\times$ denotes matrix product. These models usually differ from one another by introducing specific additional constraints on the embeddings they learn. For this group, in our comparative analysis, we include the following representative models:

DistMult (Yang et al., 2015) forces all relation embeddings to be diagonal matrices, which considerably reduces the space of parameters to be learned, resulting in a much easier model to train. On the other hand, this makes the scoring function commutative, with $\phi(h, r, t) = \phi(t, r, h)$, which amounts to treating all relations as symmetric. Despite this flaw, it has been demonstrated that, when carefully tuned, DistMult can still reach state-of-the-art performance (Kadlec et al., 2017).

ComplEx (Trouillon et al., 2016), similarly to DistMult, forces each relation embedding to be a diagonal matrix, but extends such formulation to the complex space: $\mathbf{h} \in \mathbb{C}^d$, $\mathbf{t} \in \mathbb{C}^d$, $\mathbf{r} \in \mathbb{C}^{d \times d}$. In the complex space, the bilinear product becomes a Hermitian product, where in lieu of the traditional $\mathbf{t}$, its conjugate-transpose $\overline{\mathbf{t}}$ is employed. This removes the above-mentioned commutativity of the scoring function, allowing ComplEx to successfully model asymmetric relations as well.

ANALOGY (Liu et al., 2017) aims at modeling analogical reasoning, which is key for any kind of knowledge induction. It employs the general bilinear scoring function but adds two main constraints inspired by analogical structures: $\mathbf{r}$ must be a normal matrix: $\mathbf{r}\mathbf{r}^{\top} = \mathbf{r}^{\top}\mathbf{r}$; for each pair of relations $r_1$, $r_2$, their composition must be commutative: $\mathbf{r}_1 \circ \mathbf{r}_2 = \mathbf{r}_2 \circ \mathbf{r}_1$. The authors demonstrate that normal matrices can be successfully employed for modelling asymmetric relations.

SimplE (Kazemi and Poole, 2018) forces relation embeddings to be diagonal matrices, similarly to DistMult, but extends it by associating with each entity $e$ two separate embeddings, $\mathbf{e}_h$ and $\mathbf{e}_t$, depending on whether $e$ is used as head or tail, and by associating with each relation $r$ two separate diagonal matrices, $\mathbf{r}$ and $\mathbf{r}^{-1}$, expressing the relation in its regular and inverse direction. The score of a fact is computed by averaging the bilinear scores of the regular fact and its inverse version. It has been demonstrated that SimplE is fully expressive, and therefore, unlike DistMult, it can also model asymmetric relations.
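The effect of the diagonal constraint, and of moving to the complex space, can be checked numerically. The Python sketch below stores diagonal relation matrices as plain vectors and follows the usual convention of taking the real part of the Hermitian product as the ComplEx score; the function names and toy values are illustrative assumptions.

```python
import numpy as np

def bilinear_score(h, R, t):
    """General bilinear product with a full d x d relation matrix."""
    return float(h @ R @ t)

def distmult_score(h, r, t):
    """Diagonal relation matrix, stored as the vector r of its diagonal."""
    return float(np.sum(h * r * t))

def complex_score(h, r, t):
    """Real part of the Hermitian product, with the tail conjugated."""
    return float(np.real(np.sum(h * r * np.conj(t))))

h = np.array([1.0, 2.0])
r = np.array([0.5, -1.0])
t = np.array([3.0, 0.5])
forward, backward = distmult_score(h, r, t), distmult_score(t, r, h)

hc, rc, tc = np.array([1 + 1j]), np.array([1j]), np.array([1 + 0j])
forward_c, backward_c = complex_score(hc, rc, tc), complex_score(tc, rc, hc)
```

Here `forward` and `backward` coincide, reflecting the symmetry of DistMult, whereas `forward_c` is -1 and `backward_c` is 1, showing that swapping head and tail changes the ComplEx score.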

3.1.2. Non-bilinear Models

These models combine the head, relation and tail embeddings using formulations different from the strictly bilinear product.

HolE (Nickel et al., 2016), instead of using bilinear products, computes circular correlation (denoted by $\star$ in Table 1) between the embeddings of head and tail entities; then, it performs matrix multiplication with the relation embedding. Note that in this model the relation embeddings have the same shape as the entity embeddings. The authors point out that circular correlation can be seen as a compression of the full matrix product: this makes HolE less expensive than an unconstrained bilinear model in terms of both time and space complexity.
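Circular correlation can be sketched in Python as follows. The example assumes real-valued embeddings and computes the correlation through the Fast Fourier Transform, checking it against the direct definition $[\mathbf{a} \star \mathbf{b}]_k = \sum_i a_i \, b_{(i+k) \bmod d}$; the function names are illustrative.

```python
import numpy as np

def circular_correlation(a, b):
    """Circular correlation of two real vectors, computed via FFT."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(a)) * np.fft.fft(b)))

def circular_correlation_naive(a, b):
    """Direct definition: [a * b]_k = sum_i a[i] * b[(i + k) mod d]."""
    d = len(a)
    return np.array([sum(a[i] * b[(i + k) % d] for i in range(d))
                     for k in range(d)])

def hole_score(h, r, t):
    """Dot product between the relation embedding and the
    circular correlation of head and tail embeddings."""
    return float(r @ circular_correlation(h, t))

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
corr = circular_correlation(a, b)   # approximately [32, 29, 29]
```

The result has the same length $d$ as the inputs, which is what lets HolE keep relation embeddings as vectors instead of $d \times d$ matrices.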

TuckER (Balazevic et al., 2019) relies on the Tucker decomposition (Hitchcock, 1927), which factorizes a tensor into a set of vectors and a smaller shared core $\mathcal{W}$. The TuckER model learns $\mathcal{W}$ jointly with the KG embeddings. As a matter of fact, learning globally shared parameters is rather uncommon in Matrix Factorization Models; the authors explain that $\mathcal{W}$ can be seen as a shared pool of prototype relation matrices, which get combined in a different way for each relation depending on its embedding. In TuckER the dimensions of entity and relation embeddings are independent from each other, with entity embeddings $\mathbf{e} \in \mathbb{R}^{d_e}$ and relation embeddings $\mathbf{r} \in \mathbb{R}^{d_r}$. The shape of $\mathcal{W}$ depends on the dimensions of entities and relations, with $\mathcal{W} \in \mathbb{R}^{d_e \times d_r \times d_e}$. In Table 1, we denote with $\times_i$ the tensor product along mode $i$ used by TuckER.
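The role of the core tensor as a pool of prototype relation matrices can be illustrated with a small Python sketch. It assumes a core of shape $d_e \times d_r \times d_e$ and random toy values; the function names are illustrative, and regularization or dropout used in practice are left out.

```python
import numpy as np

def tucker_score(W, h, r, t):
    """Contract the core tensor with head, relation and tail embeddings
    along its first, second and third mode respectively."""
    return float(np.einsum("ijk,i,j,k->", W, h, r, t))

def relation_matrix(W, r):
    """Relation-specific d_e x d_e matrix obtained by mixing the d_r
    slices of the core according to the relation embedding."""
    return np.einsum("ijk,j->ik", W, r)

rng = np.random.default_rng(0)
d_e, d_r = 4, 3
W = rng.normal(size=(d_e, d_r, d_e))
h, t = rng.normal(size=d_e), rng.normal(size=d_e)
r = rng.normal(size=d_r)

score = tucker_score(W, h, r, t)
same_score = float(h @ relation_matrix(W, r) @ t)
```

The two values agree: contracting the core with the relation embedding first yields a relation-specific matrix, after which the score is an ordinary bilinear product.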

3.2. Geometric Models

Geometric Models interpret relations as geometric transformations in the latent space. Given a fact, the head embedding undergoes a spatial transformation $\tau$ that uses the values of the relation embedding as parameters. The fact score is the distance between the resulting vector and the tail vector; such an offset is computed using a distance function $\delta$ (e.g. L1 or L2 norm):

$$\phi(h, r, t) = \delta(\tau(\mathbf{h}, \mathbf{r}), \mathbf{t})$$

Depending on the analytical form of $\tau$, Geometric models may share similarities with Tensor Decomposition models, but in these cases geometric models usually need to enforce additional constraints in order to make their $\tau$ implement a valid spatial transformation. For instance, the rotation operated by model RotatE can be formulated as a matrix product, but the rotation matrix would need to be diagonal and to have elements with modulus 1.

Much like with Matrix Factorization Models, these systems usually avoid shared parameters, running back-propagation directly on the embeddings. We identify three groups in this family: (i) Pure Translational Models, (ii) Translational Models with Additional Embeddings, and (iii) Roto-translational models.

3.2.1. Pure Translational Models

These models interpret each relation as a translation in the latent space: the relation embedding is just added to the head embedding, and we expect to land in a position close to the tail embedding. These models thus represent entities and relations as one-dimensional vectors of the same length.

TransE (Bordes et al., 2013) was the first LP model to propose a geometric interpretation of the latent space, largely inspired by the capability observed in Word2vec vectors (Mikolov et al., 2013) to capture relations between words in the form of translations between their embeddings. TransE enforces this explicitly, requiring that the tail embedding lies close to the sum of the head and relation embeddings, according to the chosen distance function. Due to the nature of translation, TransE is not able to correctly handle one-to-many and many-to-one relations, as well as symmetric and transitive relations.
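A minimal Python sketch of the translational score is given below. It assumes the score is the negated distance, so that higher values mean more plausible facts, and uses made-up two-dimensional embeddings; the function name is illustrative.

```python
import numpy as np

def transe_score(h, r, t, p=1):
    """Negated Lp distance between the translated head and the tail."""
    return -float(np.linalg.norm(h + r - t, ord=p))

h = np.array([0.0, 0.0])
r = np.array([1.0, 0.0])
t1 = np.array([1.0, 0.0])   # exactly h + r
t2 = np.array([1.0, 1.0])   # a different tail for the same head and relation

best = transe_score(h, r, t1)    # the maximum possible score
other = transe_score(h, r, t2)   # -1.0
```

Only the point `h + r` reaches the maximum score, so two distinct tails of a one-to-many relation can both be scored perfectly only if their embeddings coincide.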

3.2.2. Translational models with Additional Embeddings

These models may associate more than one embedding to each KG element. This often amounts to using specialized embeddings, such as relation-specific embeddings for each entity or, vice-versa, entity-specific embeddings for each relation. As a consequence, these models overcome the limitations of purely translational models at the cost of learning a larger number of parameters.

STransE (Nguyen et al., 2016), in addition to the $d$-sized embeddings seen in TransE, associates to each relation $r$ two additional independent $d \times d$ matrices $\mathbf{W}_r^h$ and $\mathbf{W}_r^t$. When computing the score of a fact ⟨h, r, t⟩, before operating the usual translation, $\mathbf{h}$ is pre-multiplied by $\mathbf{W}_r^h$ and $\mathbf{t}$ by $\mathbf{W}_r^t$; this amounts to using relation-specific embeddings for the head and tail, alleviating the issues suffered by TransE on one-to-many, many-to-one and many-to-many relations.
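The following Python sketch shows how relation-specific projections relax the translational constraint. As before, the score is assumed to be a negated distance, and the matrices and embeddings are toy values chosen for illustration.

```python
import numpy as np

def stranse_score(h, r, t, W_h, W_t, p=1):
    """Translation applied after projecting head and tail with
    relation-specific matrices."""
    return -float(np.linalg.norm(W_h @ h + r - W_t @ t, ord=p))

h = np.array([0.0, 0.0])
r = np.array([1.0, 0.0])
W_h = np.eye(2)
W_t = np.array([[1.0, 0.0],
                [0.0, 0.0]])   # discards the second coordinate of the tail

t1 = np.array([1.0, 5.0])
t2 = np.array([1.0, -3.0])
score_1 = stranse_score(h, r, t1, W_h, W_t)
score_2 = stranse_score(h, r, t2, W_h, W_t)
```

Both tails reach the maximum score even though their embeddings differ, because the tail projection maps them to the same point; with identity matrices the function falls back to the plain translational score.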

CrossE (Zhang et al., 2019) is one of the most recent and also most effective models in this group. For each relation it learns an additional relation-specific embedding $\mathbf{c}_r$. Given any fact ⟨h, r, t⟩, CrossE uses element-wise products (denoted by $\odot$ in Table 1) to combine $\mathbf{h}$ and $\mathbf{r}$ with $\mathbf{c}_r$. This results in triple-specific embeddings, dubbed interaction embeddings, that are then used in the translation. Interestingly, despite not relying on neural layers, this model adopts the common deep learning practice of interposing operations with non-linear activation functions, such as hyperbolic tangent and sigmoid (denoted respectively by $\tanh$ and $\sigma$ in Table 1).

3.2.3. Roto-Translational Models

These models include operations that are not directly expressible as pure translations: this often amounts to performing rotation-like transformations, either in combination with or as an alternative to translations.

TorusE (Ebisu and Ichise, 2018) was motivated by the observation that the regularization used in TransE forces entity embeddings to lie on a hypersphere, thus limiting their capability to satisfy the translational constraint. To solve this problem, TorusE projects each point of the traditional open manifold $\mathbb{R}^d$ into a point on a torus $\mathbb{T}^d$. The authors define torus distance functions $d_{L_1}$, $d_{L_2}$ and $d_{eL_2}$, corresponding to L1, L2 and squared L2 norm respectively (we report in Table 1 the scoring function in its extended form).

RotatE (Sun et al., 2019) represents relations as rotations in a complex latent space, with $\mathbf{h}$, $\mathbf{r}$ and $\mathbf{t}$ all belonging to $\mathbb{C}^d$. The $\mathbf{r}$ embedding is a rotation vector: each of its elements has modulus 1, so that its phase conveys the rotation along that axis. The rotation is applied to $\mathbf{h}$ by operating an element-wise product (once again denoted by $\odot$ in Table 1). L1 norm is used for measuring the distance from $\mathbf{t}$. The authors demonstrate that rotation allows the model to correctly capture numerous relational patterns, such as symmetry/anti-symmetry, inversion and composition.
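A rotation-based score can be sketched in Python by parameterizing each relation element through its phase, which guarantees unit modulus by construction. The phase parameterization, the negated-distance convention and the toy values are assumptions made for illustration.

```python
import numpy as np

def rotate_score(h, theta, t):
    """Negated L1 distance between the rotated head and the tail.
    `theta` holds one rotation angle per embedding dimension."""
    r = np.exp(1j * theta)            # every element has modulus 1
    return -float(np.sum(np.abs(h * r - t)))

h = np.array([1 + 2j, -0.5 + 0.3j])

# A rotation by pi in every dimension maps h to -h and -h back to h,
# so the relation behaves symmetrically.
half_turn = np.full(2, np.pi)
sym_forward = rotate_score(h, half_turn, -h)
sym_backward = rotate_score(-h, half_turn, h)

# A rotation by pi/2 is not its own inverse, hence it is not symmetric.
quarter_turn = np.full(2, np.pi / 2)
asym_forward = rotate_score(h, quarter_turn, 1j * h)
asym_backward = rotate_score(1j * h, quarter_turn, h)
```

The half-turn relation scores (numerically) zero in both directions, while the quarter-turn relation scores zero forward and strictly negative backward, matching the symmetry/anti-symmetry patterns mentioned above.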

3.3. Deep Learning Models

Deep Learning Models use deep neural networks to perform the LP task. Neural Networks learn parameters such as weights and biases, that they combine with the input data in order to recognize significant patterns. Deep neural networks usually organize parameters into separate layers, generally interspersed with non-linear activation functions.

Over time, numerous types of layers have been developed, applying very different operations to the input data. Dense layers, for instance, will just combine the input data $X$ with weights $W$ and add a bias $B$: $W \times X + B$. For the sake of simplicity, in the following formulas we will not mention the use of bias, keeping it implicit. More advanced layers perform more complex operations, such as convolutional layers, that learn convolution kernels to apply to the input data, or recurrent layers, that handle sequential inputs in a recursive fashion.

In the LP field, KG embeddings are usually learned jointly with the weights and biases of the layers; these shared parameters make these models more expressive, but potentially heavier, harder to train, and more prone to overfitting. We identify three groups in this family, based on the neural architecture they employ: (i) Convolutional Neural Networks, (ii) Capsule Neural Networks, and (iii) Recurrent Neural Networks.

3.3.1. Convolutional Neural Networks

These models use one or multiple convolutional layers (LeCun et al., 1998): each of these layers performs convolution on the input data (e.g. the embeddings of the KG elements in a training fact) applying low-dimensional filters $\omega$. The result is a feature map that is usually then passed to additional dense layers in order to compute the fact score.

ConvE (Dettmers et al., 2018) represents entities and relations as one-dimensional $d$-sized embeddings. When computing the score of a fact, it concatenates and reshapes the head and relation embeddings $\mathbf{h}$ and $\mathbf{r}$ into a unique input $[\mathbf{h}; \mathbf{r}]$; we dub the resulting dimensions $d_m \times d_n$. This input is let through a convolutional layer with a set $\omega$ of $m \times n$ filters, and then through a dense layer with $d$ neurons and a set of weights $\mathbf{W}$. The output is finally combined with the tail embedding $\mathbf{t}$ using dot product, resulting in the fact score. When using the entire matrix of entity embeddings instead of the embedding of just the one target entity $\mathbf{t}$, this architecture can be seen as a classifier with $|\mathcal{E}|$ classes.

ConvKB (Nguyen et al., 2018) models entities and relations as same-sized one-dimensional embeddings. Differently from ConvE, given any fact ⟨h, r, t⟩, it concatenates all their embeddings $\mathbf{h}$, $\mathbf{r}$ and $\mathbf{t}$ into a $d \times 3$ input matrix $[\mathbf{h}; \mathbf{r}; \mathbf{t}]$. This input is passed to a convolutional layer with a set $\omega$ of $m$ filters of shape $1 \times 3$, resulting in a $d \times m$ feature map. The feature map is let through a dense layer with only one neuron and weights $\mathbf{W}$, resulting in the fact score. This architecture can be seen as a binary classifier, yielding the probability that the input fact is valid.
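Because each filter spans a single row of the input matrix, the convolution of this architecture reduces to a matrix product, as the Python sketch below shows. It assumes a ReLU activation after the convolution, omits biases, and concatenates the feature maps filter by filter before the dense layer; these details and the toy values are illustrative assumptions rather than a faithful reimplementation.

```python
import numpy as np

def convkb_score(h, r, t, filters, w):
    """Forward pass on one fact.

    filters : array of shape (m, 3), one 1 x 3 filter per row
    w       : weights of the single output neuron, of length m * d
    """
    X = np.stack([h, r, t], axis=1)                   # input matrix, d x 3
    feature_maps = np.maximum(X @ filters.T, 0.0)     # d x m, after ReLU
    return float(feature_maps.T.reshape(-1) @ w)      # concatenate, then dense

h = np.array([1.0, 2.0])
r = np.array([0.0, 1.0])
filters = np.array([[1.0, 1.0, -1.0]])   # one filter computing h + r - t per row
w = np.ones(2)

score_match = convkb_score(h, r, np.array([1.0, 3.0]), filters, w)
score_other = convkb_score(h, r, np.array([0.0, 0.0]), filters, w)
```

With this hand-picked filter each row of the feature map equals the corresponding element of h + r - t, so the first tail (exactly h + r) yields 0 and the second yields 4; learned filters would of course extract different features.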

ConvR (Jiang et al., 2019) represents entity and relation embeddings as one-dimensional vectors of different dimensions $d_e$ and $d_r$. For any fact ⟨h, r, t⟩, $\mathbf{h}$ is first reshaped into a matrix of shape $d_e^m \times d_e^n$, where $d_e^m \times d_e^n = d_e$. $\mathbf{r}$ is then reshaped and split into a set $\omega_r$ of $m$ convolutional filters, each of which has size $m_x \times m_y$. These filters are then employed to run convolution on $\mathbf{h}$; this amounts to performing an adaptive convolution with relation-specific filters. The resulting feature maps are passed to a dense layer with weights $\mathbf{W}$. As in ConvE, the fact score is obtained by combining the neural output with the tail embedding using dot product.

3.3.2. Capsule Neural Networks

Capsule networks (CapsNets) are composed of groups of neurons, called capsules, that encode specific features of the input, such as the presence of a specific object in an image (Sabour et al., 2017). CapsNets are designed to recognize such features without losing spatial information the way that convolutional networks do. Each capsule sends its output to higher order ones, with connections decided by a dynamic routing process. The probability of a capsule detecting the feature is given by the length of its output vector.

CapsE (Nguyen et al., 2019) embeds entities and relations into $d$-sized one-dimensional vectors, under the basic assumption that different embeddings encode homologous aspects in the same positions. Similarly to ConvKB, it concatenates $\mathbf{h}$, $\mathbf{r}$ and $\mathbf{t}$ into one $d \times 3$ input matrix. This is let through a convolutional layer with $m$ filters of shape $1 \times 3$. The result is an $m \times d$ matrix in which the $i$-th value of any row uniquely depends on $\mathbf{h}[i]$, $\mathbf{r}[i]$ and $\mathbf{t}[i]$. The matrix is let through a capsule layer; a separate capsule handles each column, thus receiving information regarding one aspect of the input fact. A second layer with one capsule is used to yield the triple score. Table 1 uses a dedicated symbol to denote the capsule layers.

3.3.3. Recurrent Neural Networks (RNNs)

These models employ one or multiple recurrent layers (Hopfield, 1982) to analyze entire paths (sequences of facts) extracted from the training set, instead of just processing individual facts separately.

RSN (Guo et al., 2019) is based on the observation that basic RNNs may be unsuitable for LP, because they do not explicitly handle the alternation of entities and relations along a path, and because, when predicting a fact tail, in the current time step they are only passed its relation, and not the head (seen in the previous step). To overcome these issues, the authors propose Recurrent Skipping Networks (RSNs): in any time step, if the input is a relation, the hidden state is updated re-using the fact head too. The fact score is computed as the dot product between the output vector and the target embedding. In training, the model learns relation paths built from the training facts using biased random walk sampling. It employs a specially optimized loss function resorting to a type-based noise contrastive estimation. Table 1 uses dedicated symbols to denote the RSN operation, the number of layers stacked in an RSN cell, the number of weight matrices, and the number of neurons in each RSN layer.

Table 1. Loss Function, constraints and space complexity for the models included in our analysis.

4. Methodology

In this section we describe the implementations and training protocols of the models discussed before, as well as the datasets and procedures we use to study their efficiency and effectiveness.

4.1. Datasets

Datasets for benchmarking LP are usually obtained by sampling real-world KGs, and then splitting the obtained facts into a training, a validation and a test set. We conduct our analysis using the 5 best-established datasets in the LP field; we report some of their most important properties in Table 2.

FB15k is probably the most commonly used benchmark so far. Its creators (Bordes et al., 2013) selected all the FreeBase entities with more than 100 mentions and also featured in the Wikilinks database; they extracted all facts involving them (thus also including their lower-degree neighbors), except the ones with literals, e.g. dates, proper nouns, etc. They also converted n-ary relations represented with reification into cliques of binary edges; this operation has greatly affected the graph structure and semantics, as described in Section 4.3.4.

WN18, also introduced by the authors of TransE (Bordes et al., 2013), was extracted from WordNet, a linguistic KG ontology meant to provide a dictionary/thesaurus to support NLP and automatic text analysis. In WordNet entities correspond to synsets (word senses) and relations represent their lexical connections (e.g. “hypernym”). In order to build WN18, the authors used WordNet as a starting point, and then iteratively filtered out entities and relationships with too few mentions.

FB15k-237 is a subset of FB15k built by Toutanova and Chen (2015), inspired by the observation that FB15k suffers from test leakage, that is, test data being seen by models at training time. In FB15k this issue is due to the presence of relations that are near-identical or the inverse of one another. In order to assess the severity of this problem, Toutanova and Chen have shown that a simple model based on observable features can easily reach state-of-the-art performance on FB15k. FB15k-237 was built to be a more challenging dataset: the authors first selected facts from FB15k involving the 401 largest relations and removed all equivalent or inverse relations. In order to filter away all trivial triples, they also ensured that none of the entities connected in the training set are also directly linked in the validation and test sets.

WN18RR is a subset of WN18 built by Dettmers et al. (2018), also after observing test leakage in WN18. They demonstrate the severity of said leakage by showing that a simple rule-based model based on inverse relation detection, dubbed Inverse Model, achieves state-of-the-art results on both WN18 and FB15k. To resolve that, they build the far more challenging WN18RR dataset by applying a pipeline similar to the one employed for FB15k-237 (Toutanova and Chen, 2015). It has been recently acknowledged by the authors (TimDettmers, ) that the test set includes 212 entities that do not appear in the training set, making it impossible to reasonably predict about 6.7% of the test facts.

YAGO3-10, sampled from the YAGO3 KG (Mahdisoltani et al., 2013), was also proposed by Dettmers et al. (2018). It was obtained by selecting entities with at least 10 different relations and gathering all facts involving them, thus also including their neighbors. Moreover, unlike FB15k and FB15k-237, YAGO3-10 also keeps the facts about textual attributes found in the KG. As a consequence, as stated by the authors, the majority of its triples deal with descriptive properties of people, such as citizenship or gender. The poor performance of the Inverse Model (Dettmers et al., 2018) on YAGO3-10 suggests that this benchmark should not suffer from the same test leakage issues as FB15k and WN18.

Table 2. The 5 LP datasets included in our comparative analysis, and their general properties.

4.2. Efficiency Analysis

For each model, we consider two main formulations for efficiency:

  • Training Time: the time required to learn the optimal embeddings for all entities and relations.

  • Prediction Time: the time required to generate the full rankings for one test fact, including both head and tail predictions.

Training Time and Prediction Time mostly depend on the model architecture (e.g. deep neural networks may require longer computations due to their inherently longer pipeline of operations); on the model hyperparameters, such as embedding size and number of negative samples for each positive one; and on the dataset size, namely the number of entities and relations to learn and, for the Training Time, the number of training triples to process.
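As a rough illustration of the Prediction Time definition above, the following sketch averages the wall-clock time needed to produce both the head and the tail ranking of each test fact. The callable `predict_fn` and its `side` argument are hypothetical placeholders for whatever ranking routine a given model implementation exposes; they are not part of any library discussed here.

```python
import time


def mean_prediction_time(predict_fn, test_facts):
    """Average seconds spent per test fact, counting head and tail prediction.

    predict_fn(fact, side) is a hypothetical callable that computes the full
    ranking for the requested side ("head" or "tail") of the given fact.
    """
    start = time.perf_counter()
    for fact in test_facts:
        predict_fn(fact, "head")   # rank all candidate heads
        predict_fn(fact, "tail")   # rank all candidate tails
    elapsed = time.perf_counter() - start
    return elapsed / len(test_facts)
```

Measuring the whole loop and dividing by the number of facts smooths out per-call timer noise, at the cost of hiding the variance among individual predictions.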

4.3. Effectiveness Analysis

We analyze the effectiveness of LP models based on the structure of the training graph. Therefore, we define measurable structural features and we treat each of them as a separate research direction, investigating how it correlates to the predictive performance of each model in each dataset.

We take into account 4 different structural features for each test fact:

  • Number of Peers, namely the valid alternatives for the source and target entities;

  • Relational Path Support, taking into account paths connecting the head and tail of the test fact;

  • Relation Properties that affect both the semantics and the graph structure;

  • Degree of the original reified relation, for datasets generated from KGs using reification.

We address these features in Sections 4.3.1, 4.3.2, 4.3.3 and 4.3.4, respectively.

4.3.1. Number of Peers

Given a test fact ⟨h, r, t⟩, we define its:

  • head peers: the set of entities h′ such that ⟨h′, r, t⟩ is a training fact;

  • tail peers: the set of entities t′ such that ⟨h, r, t′⟩ is a training fact.

In other words, the head peers are all the alternatives for h seen during training, conditioned to having relation r and tail t. Analogously, tail peers are the alternatives for t when the head is h and the relation is r. Consistently with the notation introduced in Section 2, we identify the peers for the source and the target entity of a prediction as source peers and target peers respectively.

Figure 2. Example of head peers and tail peers in a small portion of a KG.

We illustrate an example in Figure 2: considering the fact ⟨Barack Obama, parent, Malia Obama⟩, the entity Michelle Obama would be a peer for Barack Obama, because entity Michelle Obama is parent to Malia Obama too. Analogously, entity Natasha Obama is a peer for Malia Obama. In head prediction, when Malia Obama is the source entity and Barack Obama is the target entity, Michelle Obama is a target peer and Natasha Obama is a source peer. In tail prediction peers are just reversed: since now Malia Obama is target entity and Barack Obama is source entity, Michelle Obama is a source peer whereas Natasha Obama is a target peer.
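The peer sets can be computed with a single scan of the training facts. The sketch below is a minimal illustration, assuming facts are stored as (head, relation, tail) tuples; the function names and the `side` argument are our own choices for the example.

```python
def head_peers(train_facts, fact):
    """Entities that appear as head with the same relation and tail in training."""
    h, r, t = fact
    return {h2 for (h2, r2, t2) in train_facts if r2 == r and t2 == t and h2 != h}


def tail_peers(train_facts, fact):
    """Entities that appear as tail with the same head and relation in training."""
    h, r, t = fact
    return {t2 for (h2, r2, t2) in train_facts if h2 == h and r2 == r and t2 != t}


def source_and_target_peers(train_facts, fact, side):
    """Map head/tail peers to source/target peers.

    side == "tail": the head is the source and the tail is the target.
    side == "head": the tail is the source and the head is the target.
    """
    heads, tails = head_peers(train_facts, fact), tail_peers(train_facts, fact)
    return (heads, tails) if side == "tail" else (tails, heads)


train = [
    ("Michelle Obama", "parent", "Malia Obama"),
    ("Barack Obama", "parent", "Natasha Obama"),
]
test_fact = ("Barack Obama", "parent", "Malia Obama")
source, target = source_and_target_peers(train, test_fact, side="tail")
```

With the facts of the example, tail prediction yields Michelle Obama as the only source peer and Natasha Obama as the only target peer; head prediction swaps the two sets.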

Our intuition is that the numbers of source and target peers may affect predictions with subtle, possibly unanticipated, effects.

On the one hand, the number of source peers can be seen as the number of training samples from which models can directly learn how to predict the current target entity given the current relation. For instance, when performing tail prediction on fact ⟨Barack Obama, nationality, USA⟩, the source peers are all the other entities with nationality USA that the model gets to see in training: they are the examples from which our models can learn what can make a person have American citizenship.

On the other hand, the number of target peers can be seen as the number of answers correctly satisfying this prediction seen by the model during training. For instance, given the same fact as before ⟨Barack Obama, nationality, USA⟩, but performing head prediction this time, the other USA citizens seen in training are now target peers. Since all of them constitute valid alternatives for the target answers, too many target peers may intuitively lead models to confusion and performance degradation.

Our experimental results on source and target peers, reported in Section 5.3.1, confirm our hypothesis.

4.3.2. Relational Path Support

In any KG a path is a sequence of facts in which the tail of each fact corresponds to the head of the next one. The length of the path is the number of consecutive facts it contains. In what follows, we call the sequence of relation names (ignoring entities) in a path a relational path.

Relational paths allow one to identify patterns corresponding to specific relations. For instance, knowing the facts ⟨Barack Obama, place of birth, Honolulu⟩ and ⟨Honolulu, located in, USA⟩, it should be possible to predict that ⟨Barack Obama, nationality, USA⟩. Paths have been leveraged for a long time by LP techniques based on observable features, such as the Path Ranking Algorithm (Lao and Cohen, 2010; Lao et al., 2011). The same cannot be said about models based on embeddings, the majority of which learn individual facts separately. Just a few models directly rely on paths, e.g. PTransE (Lin et al., 2015) or, in our analysis, RSN (Guo et al., 2019); some models do not employ paths directly in training but use them for additional tasks, such as the explanation approach proposed by CrossE (Zhang et al., 2019).
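To make the notion of relational path concrete, the sketch below enumerates the relational paths between two entities with a bounded depth-first search. It is only an illustration under our own assumptions: facts are (head, relation, tail) tuples, steps traversed against the edge orientation are marked with an "INV_" prefix (the underscore separator is our choice), entities are not revisited within a path, and the optional `skip_fact` argument excludes the direct edge of the fact under analysis.

```python
from collections import defaultdict


def relational_paths(train_facts, head, tail, max_len=3, skip_fact=None):
    """Return the set of relational paths (tuples of relation names) from head to tail."""
    adjacency = defaultdict(list)
    for h, r, t in train_facts:
        if (h, r, t) == skip_fact:
            continue
        adjacency[h].append((r, t))            # step following the edge orientation
        adjacency[t].append(("INV_" + r, h))   # step against the edge orientation

    found = set()

    def walk(node, visited, relations):
        if len(relations) == max_len:
            return  # adding one more step would exceed the length bound
        for relation, neighbor in adjacency.get(node, ()):
            if neighbor == tail:
                found.add(tuple(relations + [relation]))
            elif neighbor not in visited:
                walk(neighbor, visited | {neighbor}, relations + [relation])

    walk(head, {head}, [])
    return found


facts = [
    ("Barack Obama", "place_of_birth", "Honolulu"),
    ("Honolulu", "located_in", "USA"),
]
paths = relational_paths(facts, "Barack Obama", "USA")
```

On the two facts above, the only relational path from Barack Obama to USA is place_of_birth followed by located_in. An exhaustive search like this one grows quickly with the path length, which is why a small bound is used.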

Our intuition is that even models that train on individual facts, as they progressively scan and learn the entire training set, acquire indirect knowledge of its paths as well. As a consequence, in a roundabout way, they may be able to leverage to some extent the patterns observable in paths in order to make better predictions.

Therefore we investigate how the support provided by paths in training can make test predictions easier for embedding-based models. We define a novel measure of Relational Path Support (RPS) that estimates for any fact how the paths connecting the head to the tail facilitate their prediction. In greater detail, the RPS value for a fact ⟨h, r, t⟩ measures how the relation paths connecting h to t match those most usually co-occurring with r. In models that heavily rely on relation patterns, a high RPS value should correspond to good predictions, whereas a low one should correspond to bad ones.

Our RPS metric is a variant of the TF-IDF statistical measure (Schütze et al., 2008) commonly used in Information Retrieval. The TF-IDF value of any word w in a document d of a collection C measures both how relevant and how specific w is to d, based respectively on the frequency of w in d and on the number of other documents in C including w. Any document and any keyword-based query can be modeled as a vector with the TF-IDF values of all words in the vocabulary. Given any query q, a TF-IDF-based search engine will retrieve the documents with vectors most similar to the vector of q.

In our scenario we treat each relation path p as a word and each relation r as a document. When a relation path p co-occurs with a relation r (that is, it connects the head and tail of a fact featuring r) we interpret this as the word p belonging to the document r. We treat each test fact as a query whose keywords are the relation paths connecting its head to the tail. In greater detail, this is the procedure we apply to compute our RPS measure:

  1. For each training fact ⟨h, r, t⟩ we extract from the training set the set of relational paths leading from h to t. Whenever in a path a step does not have the correct orientation, we reverse it and mark its relation with the prefix “INV”. Our vocabulary is the set of resulting relational paths. Due to computational constraints, we limit ourselves to relational paths of length at most 3.

  2. We aggregate the extracted sets by the relation of the training fact. We obtain, for each relation r:

    • the number of training facts featuring r;

    • for each relational path p, the number of times that r is supported by p. Of course, this number cannot exceed the number of training facts featuring r.

  3. We compute Document Frequencies (DFs): for each relational path p, the number of relations it co-occurs with.

  4. We compute Term Frequencies (TFs): for each relation r and relational path p, the frequency with which p co-occurs with r.

  5. We compute Inverse Document Frequencies (IDFs): for each relational path p, a value that decreases as its DF grows.

  6. For each relation r we compute the TF-IDF vector, with one TF-IDF value for each relational path in the vocabulary.

  7. For each test fact we extract the set of relational paths connecting its head to its tail, analogously to point 1.

  8. For each test fact we apply the same computations seen in points 3-6 to obtain DF, TF and IDF and the whole TF-IDF vector; in all computations we treat each test fact as if it were an additional document.

  9. For each test fact we compute its RPS as the cosine similarity between its TF-IDF vector and the TF-IDF vector of its relation.
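The procedure above can be sketched in a few lines once the relational paths have been counted. The code below is a hedged illustration rather than the exact formulation used in our experiments: it assumes the textbook TF (count divided by the total count in the document) and IDF (natural logarithm of the number of documents over the DF), and it takes as input a hypothetical `relation_counts` dictionary mapping each relation to a `Counter` of its co-occurring relational paths.

```python
import math
from collections import Counter


def tfidf_vectors(doc_counts):
    """Map {document: Counter({word: count})} to {document: {word: tf * idf}}."""
    n_docs = len(doc_counts)
    df = Counter()
    for counts in doc_counts.values():
        df.update(word for word, count in counts.items() if count > 0)
    vectors = {}
    for doc, counts in doc_counts.items():
        total = sum(counts.values())
        vectors[doc] = {
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
            if total > 0 and count > 0
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    dot = sum(value * v.get(word, 0.0) for word, value in u.items())
    norm_u = math.sqrt(sum(value * value for value in u.values()))
    norm_v = math.sqrt(sum(value * value for value in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rps(relation_counts, relation, query_paths):
    """RPS of a test fact with the given relation and connecting relational paths."""
    docs = dict(relation_counts)
    docs["__query__"] = Counter(query_paths)  # the test fact is an extra document
    vectors = tfidf_vectors(docs)
    return cosine(vectors["__query__"], vectors[relation])
```

Note that, under these assumptions, a relational path occurring in every document (including the query) gets an IDF of zero and therefore contributes nothing to the similarity.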

Figure 3. Example for Relational Path Support

The RPS of a test fact estimates how similar it is to training facts with the same relation in terms of co-occurring relation paths. This corresponds to measuring how much the relation paths suggest that, given the source and relation in the test fact, the target is indeed the right answer for prediction.

Example 4.1.

Figure 3 shows a graph where black solid edges represent training facts and green dashed edges represent test facts. The collection of documents is the set of relations in the graph, and test facts ⟨Harry, nationality, Canada⟩ and ⟨Harry, works_in, Canada⟩ correspond to two queries. We compute words and frequencies for each document and query. Note that the two test facts in our example connect the same head to the same tail, so the corresponding queries have the same keywords (the relational path born_in + located_in).

We obtain TF-IDF values for each word in each document as described above. For instance, for document and word :

Other values can be computed analogously; for instance, ;

The TF-IDF value for each query can be computed analogously, except that the query must be included among the documents. The two queries in our example share the same keywords, so they will result in identical vectors.

The RPS for ⟨Harry, nationality, Canada⟩ is the cosine-similarity between its vector and the vector of nationality, and it measures 0.712403; analogously, the RPS for ⟨Harry, works_in, Canada⟩ is the cosine-similarity with the vector of works_in, and it measures 0.447214. As expected, the former RPS value is higher than the latter: the relational paths connecting Harry with Canada are more similar to those usually observed with nationality than to those usually observed with works_in. In other words, in our small example the relation path born_in + located_in co-occurs with nationality more than with works_in.

While the number of peers only depends on the local neighborhood of the source and target entity, RPS relies on paths that typically have length greater than one. In other words, the number of peers can be seen as a form of information very close to the test fact, whereas RPS is more prone to take into account longer-range dependencies.

Our experimental results on the analysis of relational path support are reported in Section 5.3.2.

4.3.3. Relation Properties

Depending on their semantics, relations can be characterized by several properties heavily affecting the ways in which they appear in the facts. Such properties have been well known in the LP literature for a long time, because they may lead a relation to form very specific structures and patterns in the graph; this, depending on the model, can make their facts easier or harder to learn and predict.

As a matter of fact, depending on their scoring function, some models may even be incapable of learning certain types of relations correctly. For instance, TransE (Bordes et al., 2013) and some of its successors are inherently unable to learn symmetric and transitive relations due to the nature of translation itself. Analogously, DistMult (Yang et al., 2015) cannot handle anti-symmetric relations, because given any fact ⟨h, r, t⟩, it assigns the same score to ⟨t, r, h⟩ too.
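Both limitations follow directly from the scoring functions, as the short sketch below illustrates. It uses the standard DistMult score (the sum of the element-wise product of the three embeddings) and a TransE score based on the negative L1 distance between h + r and t; the embedding values are arbitrary numbers chosen for the example.

```python
def distmult_score(h, r, t):
    """DistMult: sum over dimensions of h_i * r_i * t_i."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))


def transe_score(h, r, t):
    """TransE: negative L1 distance between the translated head and the tail."""
    return -sum(abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t))


h = [1.0, 2.0]
r = [0.5, -1.0]
t = [3.0, 0.25]

# DistMult is invariant to swapping head and tail, whatever the embeddings are.
same_score = distmult_score(h, r, t) == distmult_score(t, r, h)

# In TransE a perfectly scored fact has t = h + r; the swapped fact then scores
# strictly worse unless r is the zero vector.
t_translated = [hi + ri for hi, ri in zip(h, r)]
forward = transe_score(h, r, t_translated)
backward = transe_score(t_translated, r, h)
```

Since multiplication is commutative, `same_score` holds for any choice of embeddings, so DistMult cannot rank ⟨h, r, t⟩ above ⟨t, r, h⟩. In the TransE case, `forward` is zero while `backward` is negative, so a symmetric relation can only be fitted exactly by a null translation.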

This has led some works to formally introduce the concept of full expressiveness (Kazemi and Poole, 2018): a model is fully expressive if, given any valid graph, there exists at least one combination of embedding values for the model that correctly separates all correct triples from incorrect ones. A fully expressive model has the theoretical potential to learn correctly any valid graph, without being hindered by intrinsic limitations. Examples of models that have been demonstrated to be fully expressive are SimplE (Kazemi and Poole, 2018), TuckER (Balazevic et al., 2019), ComplEx (Trouillon et al., 2016) and HolE (Trouillon and Nickel, 2017).

Being capable of learning certain relations, however, does not necessarily imply reaching good performance on them. Even for fully expressive models, certain properties may be inherently harder to handle than others. For instance, Meilicke et al. (2018) have analyzed how their implementations of HolE (Nickel et al., 2016), RESCAL (Nickel et al., 2011) and TransE (Bordes et al., 2013) perform on symmetric relations in various datasets; they report surprisingly bad results for HolE on symmetric relations in FB15k, despite HolE being fully expressive.

In this regard, we conduct a systematic analysis: we define a comprehensive set of relation properties and verify how they affect performance for all our models.

We take into account the following properties:

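  • Reflexivity: in the original definition, a reflexive relation connects each element with itself. This is not suitable for KGs, where different entities may only be involved with some relations, based on their type. As a consequence, in our analysis we use the following definition: r is reflexive if, for every fact ⟨h, r, t⟩ in the graph, ⟨h, r, h⟩ is in the graph too.

  • Irreflexivity: r is irreflexive if, for every entity e, ⟨e, r, e⟩ is not in the graph.

  • Symmetry: r is symmetric if, for every fact ⟨h, r, t⟩ in the graph, ⟨t, r, h⟩ is in the graph too.

  • Anti-symmetry: r is anti-symmetric if, for every fact ⟨h, r, t⟩ in the graph with h ≠ t, ⟨t, r, h⟩ is not in the graph.

  • Transitivity: r is transitive if, for every pair of facts ⟨h, r, x⟩ and ⟨x, r, t⟩ in the graph, ⟨h, r, t⟩ is in the graph as well.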

We do not consider other properties, such as Equivalence and Order (partial or complete), because we experimentally observed that in all datasets included in our analysis only a negligible number of facts would be included in the resulting buckets.

On each dataset we use the following approach. First, for each relation in the dataset we extract the corresponding training facts and use them to identify its properties. Due to the inherent incompleteness of the training set, we employ a tolerance threshold: a property is verified if the ratio of training facts showing the corresponding behaviour exceeds the threshold. In all our experiments, we set tolerance to . Then, we group the test facts based on the properties of their relations. If a relation possesses multiple properties, its test facts will belong to multiple groups. Finally, we compute predictive performance scored by each model on each group of test facts.
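The detection step can be sketched as follows. This is a minimal illustration under our own reading of "ratio of training facts showing the corresponding behaviour": for each property we count the facts (or, for transitivity, the two-step chains) that satisfy it and compare the resulting ratio with the tolerance. The function name and the way the ratios are normalized are assumptions of the example, and the tolerance is left as a parameter.

```python
from collections import defaultdict


def relation_properties(train_facts, relation, tolerance):
    """Return the set of properties that `relation` shows in the training facts."""
    pairs = {(h, t) for (h, r, t) in train_facts if r == relation}
    if not pairs:
        return set()

    def ratio(hits, total):
        return hits / total if total else 0.0

    n = len(pairs)
    distinct = [(h, t) for (h, t) in pairs if h != t]
    tails_by_head = defaultdict(set)
    for h, t in pairs:
        tails_by_head[h].add(t)
    # Every two-step chain h -> x -> t made of facts with this relation.
    chains = [(h, t) for (h, x) in pairs for t in tails_by_head.get(x, ())]

    ratios = {
        "reflexive": ratio(sum((h, h) in pairs for h, _ in pairs), n),
        "irreflexive": ratio(len(distinct), n),
        "symmetric": ratio(sum((t, h) in pairs for h, t in pairs), n),
        "anti-symmetric": ratio(
            sum((t, h) not in pairs for h, t in distinct), len(distinct)
        ),
        "transitive": ratio(sum(chain in pairs for chain in chains), len(chains)),
    }
    return {name for name, value in ratios.items() if value > tolerance}
```

A relation with no two-step chains in training is never labeled transitive here, since there is no evidence either way; other choices are possible for this corner case.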

We report our results regarding relation properties in Section 5.3.3.

4.3.4. Reified Relations

Some KGs support relations with cardinality greater than 2, connecting more than two entities at a time. In relations, cardinality is closely related to semantics, and some relations inherently make more sense when modeled in this way. For example, an actor winning an award for her performance in a movie can be modeled with a unique relation connecting the actor, the award and the movie. KGs that support relations with cardinality greater than 2 often handle them in one of the following ways:

  • using hyper-graphs: in a hyper-graph, each hyper-edge can link more than two nodes at a time by design. Hyper-graphs can not be directly expressed as a set of triples.

  • using reification: if a relation needs to connect multiple entities, it is modeled with an intermediate node linked to those entities by binary relations. The relation cardinality thus becomes the degree of the reified node. Reification allows relations with cardinality greater than 2 to be indirectly modeled; the graph is thus still representable as a set of triples.

The popular KG FreeBase, that has been used to generate important LP datasets such as FB15k and FB15k-237, employs reified relations, with intermediate nodes of type Compound Value Type (CVT). By extension, we refer to such intermediate nodes as CVTs.

In the process of extracting FB15k from FreeBase (Bordes et al., 2013), CVTs were removed and converted into cliques in which the entities previously connected to the CVT are now connected to one another; the labels of the new edges are obtained concatenating the corresponding old ones. This also applies to FB15k-237, that was obtained by just sampling FB15k further (Toutanova and Chen, 2015). It has been pointed out that this conversion, dubbed “Star-to-Clique” (S2C), is irreversible (Wen et al., 2016). In our study we have observed further consequences to the S2C policy:

  • From a structural standpoint, S2C transforms each CVT into a clique over the entities it connected; the maximum number of edges in the clique is determined by the CVT degree. Therefore, some parts of the graph become locally much denser than before. The generated edges are often redundant, and in the filtering operated to create FB15k-237, many of them are removed.

  • From a semantic standpoint, the original meaning of the relations is vastly altered. After exploding CVTs into cliques, deduplication is performed: if the same two entities were linked multiple times by the same types of relation using multiple CVTs – e.g. an artist winning multiple awards for the same work – this information is lost, as shown in Figure 4. In other words, in the new semantics, each fact has happened at least once.

Figure 4. Example of how the Star2Clique process operates on a small portion of a KG.
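The effect of S2C, including the loss of information caused by deduplication, can be reproduced with a toy sketch. The code below is illustrative only: the input format (a dictionary from CVT identifiers to lists of (relation, entity) links), the dot used to concatenate labels, and the choice of emitting one edge per unordered pair of links are all assumptions of the example, not a description of the original extraction scripts.

```python
from itertools import combinations


def star_to_clique(cvts):
    """Replace each CVT star with binary facts among the entities it connects.

    cvts maps a CVT identifier to a list of (relation, entity) links.
    Returning a set performs the deduplication described above.
    """
    facts = set()
    for links in cvts.values():
        for (rel_a, ent_a), (rel_b, ent_b) in combinations(links, 2):
            facts.add((ent_a, rel_a + "." + rel_b, ent_b))
    return facts


cvts = {
    "cvt1": [("award_winner", "Artist"), ("honored_for", "Work"), ("award", "Award A")],
    "cvt2": [("award_winner", "Artist"), ("honored_for", "Work"), ("award", "Award B")],
}
facts = star_to_clique(cvts)
```

Each CVT of degree 3 yields 3 edges, but the edge between Artist and Work is produced by both CVTs and survives only once, so the resulting graph has 5 facts and no longer records that the artist was awarded twice for the same work.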

We hypothesize that the very dense, redundant and locally-consistent areas generated by S2C may have consequences on predictive behaviour. Therefore, for FB15k and FB15k-237, we have tried to extract for each test fact generated by S2C the degree of the original CVT, in order to correlate it with the predictive performance of our models.

For each test fact generated by S2C we have tried to recover the corresponding CVT from the latest FreeBase dump available. As already pointed out, S2C is not reversible due to its inherent deduplication; therefore, this process often yields multiple CVTs for the same fact. In this case, we have taken into account the CVT with highest degree. We also report that, quite strangely, for a few test facts built with S2C we could not find any CVTs in the FreeBase dump. For these facts, we have set the original reified relation degree to the minimum possible value, that is 2.

In this way, we were able to map each test fact to the degree of the corresponding CVT with highest degree; we have then proceeded to investigate the correlations between such degrees and the predictive performance of our models.

Our results on the analysis of reified relations are reported in Section 5.4.

5. Experimental Results

In this section we provide a detailed report for the experiments and comparisons carried out in our work.

5.1. Experimental set-up

In this section we provide a brief overview of the environment we have used in our work and of the procedures followed to train and evaluate all our LP models. We also provide a description for the baseline model we use in all the experiments of our analysis.

5.1.1. Environment

All of our experiments, as well as the training and evaluation of each model, have been performed on a server environment using 88 CPUs Intel Core(TM) i7-3820 at 3.60GHz, 516GB RAM and 4 GPUs NVIDIA Tesla P100-SXM2, each with 16GB VRAM. The operating system is Ubuntu 17.10 (Artful Aardvark).

5.1.2. Training procedures

We have trained and evaluated from scratch all the models introduced in Section 3. In order to make our results directly reproducible, we have employed, whenever possible, publicly available implementations. As a matter of fact we only include one model for which the implementation is not available online, that is ConvR (Jiang et al., 2019); we thank the authors for kindly sharing their code with us.

When we have found multiple implementations for the same model, we have always chosen the best-performing one, with the goal of analyzing each model at its best. This has resulted in the following choices:

  • For TransE (Bordes et al., 2013), DistMult (Yang et al., 2015) and HolE (Nickel et al., 2016) we have used the implementation provided by project Ampligraph (Costabello et al., 2019) and available in their repository (Accenture, );

  • For ComplEx (Trouillon et al., 2016) we have used Timothée Lacroix’s version with N3 regularization (Lacroix et al., 2018), available in Facebook Research repository (facebookresearch, );

  • For SimplE (Kazemi and Poole, 2018) we have used the fast implementation by Bahare Fatemi (baharefatemi, ), as suggested by the creators of the model themselves.

For all the other models we use the original implementations shared by the authors themselves in their repositories.

As shown by Kadlec et al. (2017), LP models tend to be extremely sensitive to hyperparameter tuning, and the hyperparameters for any model often need to be tuned separately for each dataset. The authors of a model usually define the space of acceptable values of each hyperparameter, and then run grid or random search to find the best performing combination.

In our trainings, we have relied on the hyperparameter settings reported by the authors for all datasets on which they have run experiments. Not all authors have evaluated their models on all of the datasets we include in our analysis, therefore in several cases we have not found official guidance on the hyperparameters to use. In these cases, we have explored ourselves the spaces of hyperparameters defined by the authors in their papers.

Considering the sheer size of such spaces (often containing thousands of combinations), as well as the duration of each training (usually taking several hours), running a grid search or even a random search was generally unfeasible. We have thus resorted to hand tuning to the best of our possibilities, using what is familiarly called a panda approach (Kostadinov, 2018) (in contrast to a caviar approach where large batches of training processes are launched). We report in Appendix A the best hyperparameter combination we have found for each model, in Table 6.

The filtered H@1, H@10, MR and MRR results obtained for each model in each dataset are displayed in Table 3. As mentioned in Section 5.1.3, for models relying on min policy in their original implementation we report their results obtained with average policy instead, as we observed that min policy can lead to results not directly comparable to those of the other models. We investigate this phenomenon in Section 5.4.1. We note that in AnyBURL (Meilicke et al., 2019), for datasets FB15k-237 and YAGO3-10 we used a training time of 1,000 secs, whereas the original papers report slightly better results with a training time of 10,000 secs; this is because, due to our already described necessity for full rankings, when using models trained for 10,000 secs prediction times got prohibitively long. Therefore, following the suggestion of the authors themselves, we resorted to using the second best training time, 1,000 secs, for these datasets. We also note that STransE (Nguyen et al., 2016), ConvKB (Nguyen et al., 2018) and CapsE (Nguyen et al., 2019) use transfer learning and require embeddings pre-trained on TransE (Bordes et al., 2013). For FB15k, FB15k-237, WN18 and WN18RR we have found and used TransE (Bordes et al., 2013) embeddings trained and shared by the authors themselves across their repositories; for YAGO3-10, on which the authors did not work, we used embeddings trained with the TransE implementation that we used in our analysis (Accenture, ).

Table 3. Global H@1, H@10, MR and MRR results for all LP models on each dataset. The best results of each metric for each dataset are marked in bold and underlined.

5.1.3. Evaluation Metrics

When it comes to evaluate the predictive performance of models, we focus on the filtered scenario, and, when reporting global results, we use all the four most popular metrics: H@1, H@10, MR and MRR.

In our finer-grained experiments, we have computed both H@1 and MRR. The use of an H@K measure coupled with a mean-rank-based measure is very popular in LP works. In this regard, we focus on H@1 instead of using larger values of K because, as observed by Kadlec et al. (2017), low values of K allow more marked differences among models to emerge. Similarly, we choose MRR because it is a very stable metric, while simple MR tends to be highly sensitive to outliers. In this paper we mostly report H@1 results for our experiments, as we have usually observed analogous trends using MRR.
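For reference, the three families of metrics can be computed from the list of target ranks as in the sketch below (ranks start from 1; the function names are our own). The small example also shows why MR is sensitive to outliers while MRR is not.

```python
def hits_at_k(ranks, k):
    """Fraction of predictions whose target is ranked within the top k."""
    return sum(rank <= k for rank in ranks) / len(ranks)


def mean_rank(ranks):
    """Average rank of the target entity (lower is better)."""
    return sum(ranks) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of the reciprocal ranks (higher is better, bounded by 1)."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


ranks = [1, 2, 10, 1000]   # one badly ranked prediction out of four
h1 = hits_at_k(ranks, 1)
h10 = hits_at_k(ranks, 10)
mr = mean_rank(ranks)
mrr = mean_reciprocal_rank(ranks)
```

In this example the single prediction with rank 1000 pushes MR above 250, whereas its contribution to MRR is only 0.001 before averaging.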

As described in Section 2, we have observed that the implementations of different models may rely on different policies for handling ties. Therefore, we have modified them in order to extract evaluation results with multiple policies for each model. In most cases we have not found significant variations; nonetheless, for a few models, min policy yields significantly different results from the other policies. In other words, using different tie policies may make results not directly comparable to one another.

Therefore, unless differently specified, for models that employed min policy in their original implementation we report their average results instead, as the latter are directly comparable to the results of the other models, whereas the former are not. We have conducted further experiments on this topic and report interesting findings in Section 5.4.1.
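The impact of the tie policy can be seen in a minimal sketch of rank computation. We assume here that higher scores are better, that min assigns the best rank among tied entities, max the worst, and average their mean; the function name and the dictionary-based input are choices made for the example, and filtering of known facts is omitted.

```python
def rank_with_ties(scores, target, policy="average"):
    """Rank of `target` among the candidates in `scores` under a tie policy.

    scores maps each candidate entity to its score (higher is better).
    """
    target_score = scores[target]
    others = [score for entity, score in scores.items() if entity != target]
    higher = sum(score > target_score for score in others)
    ties = sum(score == target_score for score in others)
    if policy == "min":
        return higher + 1
    if policy == "max":
        return higher + ties + 1
    if policy == "average":
        return higher + 1 + ties / 2
    raise ValueError(f"unknown tie policy: {policy}")


# A degenerate model that gives the same score to three candidates.
scores = {"a": 1.0, "b": 1.0, "c": 1.0, "d": 0.5}
best = rank_with_ties(scores, "a", "min")          # 1
worst = rank_with_ties(scores, "a", "max")         # 3
middle = rank_with_ties(scores, "a", "average")    # 2.0
```

Under the min policy, a model that outputs identical scores for many entities would obtain rank 1 on all of them, which is why results computed with min are not directly comparable to those computed with the other policies.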

5.1.4. Baseline

As a baseline we use AnyBURL (Meilicke et al., 2019), a popular LP model based on observable features. AnyBURL (acronym for Anytime Bottom-Up Rule Learning) treats each training fact as a compact representation of a very specific rule; it then tries to generalize it, with the goal of covering and satisfying as many training facts as possible. In greater detail, AnyBURL samples paths of increasing length from the training dataset. For each path of length n, it computes a rule, and stores it if some quality criteria are matched. AnyBURL keeps analyzing paths of the same length n until a saturation threshold is exceeded; when this happens, it moves on to paths of length n + 1.

As a matter of fact, AnyBURL is a very challenging baseline for LP, as it is shown to outperform most latent-features-based models. It is also computationally fast: depending on the training setting, in order to learn its rules it can take from a couple of minutes (100s setting) to about 3 hours (10000s setting). When it comes to evaluation, AnyBURL is designed to return the top-k scoring entities for each prediction in the test set. When used in this way, it is strikingly fast as well. In order to use it as a baseline in our experiments, however, we needed to extract the full ranking for each prediction, setting k to the number of entities in the dataset: this resulted in much longer computation times.

As a side note, we observe that in AnyBURL predictions, even the full ranking may not contain all entities, as it only includes those supported by at least one rule in the model. This means that in very few facts, the target entity may not be included even in the full ranking. In this very unlikely event, we assume that all the entities that have not been predicted have identical score 0, and we apply the avg policy for ties.
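The fallback described above can be illustrated with a small sketch. It is a simplification, not AnyBURL's code nor the evaluation code used in these experiments: it assumes that every candidate returned by the rules has a strictly positive score, it ignores the filtering of other known answers, and the function name and arguments are hypothetical.

```python
def avg_rank_when_unranked(num_entities, num_scored):
    """Average-policy rank of a target that received no rule-based score.

    Assumption: all unscored entities share score 0, so they tie for
    positions num_scored + 1 .. num_entities.
    """
    if not 0 <= num_scored < num_entities:
        raise ValueError("need 0 <= num_scored < num_entities")
    first_tied = num_scored + 1
    last_tied = num_entities
    # Under the average policy, every tied entity gets the mean position.
    return (first_tied + last_tied) / 2


# Toy setting: 1000 entities, 40 scored by at least one rule,
# target entity not among the 40.
print(avg_rank_when_unranked(1000, 40))  # 520.5
```

With this convention an unpredicted target is penalized, but not as harshly as it would be by always assigning it the last position.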

Code and documentation for AnyBURL are publicly available (web.informatik.uni, ).

5.2. Efficiency

In this section we report our results regarding the efficiency of LP models in terms of time required for training and for prediction.

In Figure 5 we illustrate, for each model, the time in hours spent for training on each dataset. We observe that training times range from around 1 hour to about 200-300 hours. Not surprisingly, the largest dataset, YAGO3-10, usually requires significantly longer training times. In comparison to the embedding-based models, the baseline AnyBURL (Meilicke et al., 2019) is strikingly fast. AnyBURL treats the training time as a configuration parameter, and reaches state-of-the-art performance after just 100s for FB15k and WN18, and 1000s for FB15k-237, WN18RR and YAGO3-10. As already mentioned, STransE (Nguyen et al., 2016), ConvKB (Nguyen et al., 2018) and CapsE (Nguyen et al., 2019) require embeddings pre-trained on TransE (Bordes et al., 2013); we do not include pre-training times in our measurements.

In Figure 6 we illustrate, for each model, the prediction time, defined as the time required to generate the full ranking in both head and tail predictions for one fact. These times are mainly affected by the embedding dimensions and by the evaluation batch size; in this regard we note that ConvKB (Nguyen et al., 2018) and CapsE (Nguyen et al., 2019) are the only models that require multiple batches for running one prediction, and this may have negatively affected their prediction times. In our experiments for these models we have used evaluation batch size 2048 (the maximum allowed in our setting). We observe that ANALOGY behaves particularly well in terms of prediction time; this may be due to this model being implemented in C++. For the baseline AnyBURL (Meilicke et al., 2019), we have obtained prediction times a posteriori by dividing the whole rule application time by the number of facts in each dataset. The obtained prediction times are significantly higher than the ones for embedding-based methods; we stress that this is because AnyBURL is not designed to generate full rankings, and that using its top-k policy with lower values of k would result in much faster computations.

Figure 5. Training times in hours for each LP model on each dataset. Y axis is in logscale.
Figure 6. Prediction times in milliseconds for each LP model on each dataset. Y axis is in logscale.

5.3. Effectiveness

In this section we report our results regarding the effectiveness of LP models in terms of predictive performance.

5.3.1. Peer Analysis

Our goal in this experiment is to analyze how the predictive performance varies when taking into account test facts with different numbers of source peers, or target peers, or both.

We report our results in Figures 7(a), 7(b), 8(a), 8(b) and 9. These plots show how performance changes when increasing the number of source peers or target peers. We use H@1 for measuring effectiveness. The plots are incremental, meaning that, for each number of source (target) peers, we report the H@1 percentage for all predictions with a number of source (target) peers less than or equal to that number. In each graph, for each number of source (target) peers we also report the overall percentage of involved test facts: this provides the distribution of facts by peers.
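The incremental computation behind these plots can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the figures: the input format (one `(num_peers, rank)` pair per prediction) and the function name are hypothetical.

```python
def cumulative_hits_at_1(predictions, thresholds):
    """predictions: list of (num_peers, rank) pairs, one per prediction.

    For each threshold n, returns (n, share, h_at_1), both computed over
    all predictions whose number of peers is <= n.
    """
    total = len(predictions)
    rows = []
    for n in thresholds:
        ranks = [rank for peers, rank in predictions if peers <= n]
        share = len(ranks) / total
        # H@1 is undefined (None) when no prediction falls under the threshold.
        h1 = sum(1 for r in ranks if r == 1) / len(ranks) if ranks else None
        rows.append((n, share, h1))
    return rows


# Toy data: five predictions with their peer counts and obtained ranks.
toy = [(0, 5), (1, 1), (1, 3), (10, 1), (100, 1)]
for row in cumulative_hits_at_1(toy, thresholds=[0, 1, 10, 100]):
    print(row)
```

Because the buckets are cumulative, the last row always coincides with the global H@1 of the model, and the `share` column reproduces the cumulative distribution of test facts shown alongside each plot.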

Our observations are intriguing. First, we point out that almost always, predictions with a greater number of source peers show better H@1 results. A way to explain this phenomenon is to consider that the source peers of a prediction are the examples seen in training in which the target entity displays the same role as in the fact to predict. For instance, when performing tail prediction for ⟨Barack Obama, nationality, USA⟩, having many source peers means that the model has seen in training numerous examples of people with nationality USA. Intuitively, such examples provide the models with meaningful information, allowing them to more easily understand when other entities (such as Barack Obama) have nationality USA as well.

Second, we observe that very often a greater number of target peers leads to worse H@1 results. For instance, when performing head prediction for ⟨Michelle Obama, born in, Chicago⟩, target peers are numerous if we have already seen in training many other entities born in Chicago. These entities seem to confuse models when they are asked to predict that other entities (such as Michelle Obama) are born in Chicago as well.
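The two examples above suggest a simple way to count peers from the training set. The sketch below follows that reading: for a fact ⟨h, r, t⟩, training heads sharing (r, t) are source peers in tail prediction and target peers in head prediction, and vice versa for training tails sharing (h, r). Function names and the toy triples are hypothetical.

```python
from collections import defaultdict


def build_indexes(train):
    """Index training triples (h, r, t) by (relation, tail) and (head, relation)."""
    heads_of = defaultdict(set)  # (r, t) -> heads seen in training
    tails_of = defaultdict(set)  # (h, r) -> tails seen in training
    for h, r, t in train:
        heads_of[(r, t)].add(h)
        tails_of[(h, r)].add(t)
    return heads_of, tails_of


def peers(fact, heads_of, tails_of, mode):
    """Return (num_source_peers, num_target_peers) for one prediction."""
    h, r, t = fact
    same_tail = heads_of.get((r, t), set()) - {h}  # other heads with this (r, t)
    same_head = tails_of.get((h, r), set()) - {t}  # other tails with this (h, r)
    if mode == "tail":  # source entity is h, target entity is t
        return len(same_tail), len(same_head)
    if mode == "head":  # source entity is t, target entity is h
        return len(same_head), len(same_tail)
    raise ValueError("mode must be 'head' or 'tail'")


train = [
    ("person_a", "nationality", "USA"),
    ("person_b", "nationality", "USA"),
    ("Barack Obama", "born_in", "Honolulu"),
]
heads_of, tails_of = build_indexes(train)
fact = ("Barack Obama", "nationality", "USA")
print(peers(fact, heads_of, tails_of, "tail"))  # (2, 0)
print(peers(fact, heads_of, tails_of, "head"))  # (0, 2)
```

The same set of training entities therefore helps in one prediction direction and acts as competing answers in the other.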

We underline that this decrease in performance is not caused by the target peers just outscoring the target entity: we are taking into account filtered scenario results, therefore target peers, being valid answers to our predictions, do not contribute to the rank computation.
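To make the filtered scenario concrete, the following sketch computes the rank of a target entity after discarding all other known valid answers. It is only an illustration: the dictionary-based interface, the entity names and the use of the average tie policy are assumptions made for this example.

```python
def filtered_rank(scores, target, known_answers):
    """Rank of `target` (higher score is better) in the filtered scenario.

    scores: dict mapping each candidate entity to its score.
    known_answers: entities that are valid answers to the same question
    in the training, validation or test set; they are skipped, except
    for the target itself. Ties use the average policy.
    """
    target_score = scores[target]
    candidates = [e for e in scores if e == target or e not in known_answers]
    better = sum(1 for e in candidates if scores[e] > target_score)
    tied = sum(1 for e in candidates if scores[e] == target_score)  # includes target
    return better + (tied + 1) / 2


scores = {"target": 0.7, "peer_a": 0.9, "peer_b": 0.8, "unrelated": 0.1}
print(filtered_rank(scores, "target", known_answers=set()))                 # 3.0
print(filtered_rank(scores, "target", known_answers={"peer_a", "peer_b"}))  # 1.0
```

In the second call both higher-scoring entities are target peers, so they are removed from the ranking and cannot be the direct cause of a missed H@1.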

These correlations between numbers of peers and performance are particularly evident in datasets FB15k and FB15k-237. Albeit to a lesser extent, they are also visible in YAGO3-10, especially regarding the source peers. In WN18RR these trends seem much less evident. This is probably due to the very skewed dataset structure: more than 60% of predictions involve less than 1 source peer or target peer. In WN18, where the distribution is very skewed as well, models show fairly balanced behaviours. Most of them reach almost perfect results, above 90% H@1.

(a) FB15k
(b) FB15k-237
Figure 7. Cumulative H@1 results for each LP model on the Freebase datasets, and corresponding cumulative distribution of test facts, varying the number of source peers (left) and target peers (right). X axis is in logscale.
(a) WN18
(b) WN18RR
Figure 8. Cumulative H@1 results for each LP model on the Wordnet datasets, and corresponding cumulative distribution of test facts, varying the number of source peers (left) and target peers (right). X axis is in logscale.
(a) YAGO3-10
Figure 9. Cumulative H@1 results for each LP model on YAGO3-10, and corresponding cumulative distribution of test facts, varying the number of source peers (left) and target peers (right). X axis is in logscale.

5.3.2. Relational Path Support

Our goal in this experiment is to analyze how the predictive effectiveness of LP models varies when taking into account predictions with different values of Relational Path Support (RPS). RPS is computed using the TF-IDF-based metric introduced in Section 4, using relational paths with length up to 3.

We report our results in Figures 10(a), 10(b), 11(a), 11(b) and 12, using H@1 for measuring performance. Similarly to the experiments with source and target peers reported in Section 5.3.1, we use incremental metrics, showing for each value of RPS the percentage and the H@1 of all the facts with support up to that value.

We observe that, for almost all models, greater RPS values lead to better performance. This suggests that such models, to a certain extent, are capable of benefiting from longer-range dependencies.

This correlation is visible in all datasets. It is particularly evident in WN18, WN18RR and YAGO3-10, and to a slightly lesser extent in FB15k-237. We point out that in FB15k-237 and WN18RR a significant percentage of facts displays a very low path support (less than 0.1). This is likely due to the filtering process employed to generate these datasets: removing facts breaks paths in the graph, thus making relational patterns less frequently observable.

FB15k is the dataset in which the correlation between RPS and performance, albeit clearly observable, seems weakest; we see this as a consequence of the severe test leakage displayed by FB15k. As a matter of fact, we have found evidence suggesting that, in the presence of many relations with the same or inverse meaning, models tend to focus on shorter dependencies for predictions, ignoring longer relational paths. We show this by replicating our experiment using RPS values computed with relational paths of maximum lengths 1 and 2, instead of 3. We report the FB15k chart in Figure 13, and the other charts in Appendix B. In FB15k and WN18, well known for their test leakage, the correlation with performance becomes evidently stronger. In FB15k-237, WN18RR and YAGO3-10, on the contrary, it weakens, meaning that 3-step relational paths are actually associated with correct predictions in these datasets.

Test leakage in FB15k and WN18 is actually so prominent that, on these datasets, we were able to use RPS as the scoring function of a standalone LP model based on observable features, obtaining acceptable results. We report the results of this experiment in Table 4. The evaluation pipeline is the same employed by all LP models and described in Section 2. Due to computational constraints, we use the RPS measure with paths up to 2 steps long, and use on each dataset a sample of 512 test facts instead of the whole test set (a single test fact can take more than 1h to predict). We do not run the experiment on YAGO3-10, on which the very high number of entities would make the ranking loop unfeasible. This experiment can be seen as analogous to the ones run by Toutanova et al. (Toutanova and Chen, 2015) and Dettmers et al. (Dettmers et al., 2018), where simple models based on observable features are run on FB15k and WN18 to assess the consequences of their test leakage.

(a) FB15k
(b) FB15k-237
Figure 10. H@1 results for each LP model on the Freebase datasets varying the RPS of the test facts, and corresponding cumulative distribution of test facts.
(a) WN18
(b) WN18RR
Figure 11. H@1 results for each LP model on the Wordnet datasets varying the RPS of the test facts, and corresponding cumulative distribution of test facts.
(a) YAGO3-10
Figure 12. H@1 results for each LP model on YAGO3-10 varying the RPS of the test facts, and corresponding cumulative distribution.
Figure 13. H@1 results for each LP model on FB15k varying the RPS of the test facts, computing RPS with paths up to length 1 and up to length 2.
Table 4. Performance of an LP model based on observable features, using as a scoring function the RPS measure with relational paths up to 2 steps long.

5.3.3. Relation Properties

Our goal in this experiment is to analyze how models perform when the relation in the fact to predict has specific properties. We take into account Reflexivity, Symmetry, Transitivity, Irreflexivity and Anti-symmetry, as already described in Section 4.3.3. We report our results in Figures 14(a), 14(b), 15(a), 15(b) and 16, using H@1 for measuring effectiveness.

We divide test facts into buckets based on the properties of their relations. When a relation possesses multiple properties, the corresponding test facts are put in all the corresponding buckets. In all charts we include an initial bucket named any containing all the test facts. For each model, this works as a global baseline, as it allows us to compare the H@1 of each bucket to the global H@1 of that model. Analogously to the distribution graphs observed in the previous experiments, for each bucket in each dataset we also report the percentage of test facts it contains.
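The bucketing scheme can be sketched as follows. This is an illustrative sketch only: the input format, the function name and the toy relations and property assignments are hypothetical, and the detection of relation properties (Section 4.3.3) is taken as given.

```python
def h1_by_property(results, relation_properties):
    """Group predictions into overlapping buckets by relation property.

    results: non-empty list of (relation, rank) pairs, one per prediction.
    relation_properties: dict mapping a relation to a set of property names.
    Returns {bucket: (share_of_predictions, h_at_1)}; the 'any' bucket
    contains every prediction and acts as the global baseline.
    """
    buckets = {"any": []}
    for relation, rank in results:
        buckets["any"].append(rank)
        # A relation with several properties contributes to several buckets.
        for prop in relation_properties.get(relation, ()):
            buckets.setdefault(prop, []).append(rank)
    total = len(results)
    return {
        name: (len(ranks) / total, sum(r == 1 for r in ranks) / len(ranks))
        for name, ranks in buckets.items()
    }


toy_props = {
    "spouse": {"symmetric", "irreflexive"},
    "born_in": {"irreflexive", "anti-symmetric"},
}
toy_results = [("spouse", 1), ("spouse", 4), ("born_in", 1), ("born_in", 1)]
for bucket, stats in sorted(h1_by_property(toy_results, toy_props).items()):
    print(bucket, stats)
```

Since buckets overlap, their shares do not need to sum to 1; only the any bucket is guaranteed to cover the whole test set.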

In FB15k and FB15k-237 we observe an overwhelming majority of irreflexive and anti-symmetric relations. Only a few facts involve reflexive, symmetric or transitive relations. In WN18 and WN18RR the percentage of facts with symmetric relations is considerably higher, but no reflexive or transitive relations are found at all. In YAGO3-10 all test facts feature irreflexive relations; there is a high percentage of facts featuring anti-symmetric relations as well, whereas only a few of them involve symmetric or transitive relations.

In FB15k all models based on embeddings seem to perform quite well on reflexive relations; on the other hand, the baseline AnyBURL (Meilicke et al., 2019) obtains quite bad results on them, possibly due to its rule-based approach. We also observe that translational models such as TransE (Bordes et al., 2013), CrossE (Zhang et al., 2019) and STransE (Nguyen et al., 2016) struggle to handle symmetric and transitive relations, with very poor results. This problem seems alleviated by the introduction of rotational operations in TorusE (Ebisu and Ichise, 2018) and RotatE (Sun et al., 2019).

In FB15k-237, all models display globally worse performance; nonetheless, interestingly, most of them manage to keep good performance on reflexive relations, the exceptions being ANALOGY, SimplE (Kazemi and Poole, 2018), ConvE (Dettmers et al., 2018) and RSN (Guo et al., 2019). On the contrary, they all display terrible performance on symmetric relations. This may depend on the sampling policy, which involves removing training facts connecting two entities when they are already linked in the test set: given any test fact ⟨h, r, t⟩, even when r is symmetric, models can never see ⟨t, r, h⟩ in training.

In WN18 and WN18RR we observe a rather different situation. This time, symmetric relations are easily handled by most models, with the notable exceptions of TransE (Bordes et al., 2013) and ConvKB (Nguyen et al., 2018). On WN18RR, the good results on symmetric relations balance, for most models, sub-par performance on irreflexive and anti-symmetric relations.

In YAGO3-10 we observe once again TransE (Bordes et al., 2013) and ConvKB (Nguyen et al., 2018) having a hard time handling symmetric relations; on these relations, most models actually tend to behave a little worse than their global H@1.

(a) FB15k
(b) FB15k-237
Figure 14. H@1 results for each LP model on the Freebase datasets and corresponding percentages of test facts, for various relation properties. The best results for each column are in bold and underlined.
(a) WN18
(b) WN18RR
Figure 15. H@1 results for each LP model on the Wordnet datasets, and corresponding percentages of test facts, for various relation properties. The best results for each column are in bold and underlined.
(a) YAGO3-10
Figure 16. H@1 results for each LP model on YAGO3-10, and corresponding percentages of test facts, for various relation properties. The best results for each column are in bold and underlined.

5.4. Reified Relation Degree

Our goal in this experiment is to analyze how, in FreeBase-derived datasets, the degrees of the original reified relations affect predictive performance. Due to the properties of the S2C operations employed to explode reified relations into cliques, a higher degree of the original reified relation corresponds to a locally richer area in the dataset; therefore we expect such higher degrees to correspond to better performance.
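As a rough illustration of why a higher degree yields a locally richer area, the sketch below explodes one reified node into a clique. It is a simplified reading of the S2C operation, not the procedure used to build FB15k: it assumes one fact per unordered pair of attached entities and builds relation names by concatenating the two roles, and all names are hypothetical.

```python
from itertools import combinations


def star_to_clique(cvt_edges):
    """Explode one reified (star-shaped) node into pairwise binary facts.

    cvt_edges: list of (role, entity) pairs attached to the reified node.
    Returns one (entity, relation, entity) fact per unordered pair.
    """
    return [
        (e1, f"{role1}.{role2}", e2)
        for (role1, e1), (role2, e2) in combinations(cvt_edges, 2)
    ]


# A reified node of degree 4 (toy roles and entities).
star = [("film", "f1"), ("actor", "a1"), ("character", "c1"), ("year", "y1")]
facts = star_to_clique(star)
print(len(facts))  # 6 facts, i.e. 4 * 3 / 2
```

Under this assumption a reified relation of degree d produces d(d-1)/2 facts among the same d entities, so the number of facts grows quadratically with the degree.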

We divide test facts into disjoint buckets based on the degree of the original reified relation, extracted as reported in Section 4.

We compute the predictive performance of these buckets separately; we also include a separate bucket with degree value 1, containing the test facts that were not originated from reified relations in FreeBase. We report predictive performance using H@1 in Figures 17(a) and 17(b). We also show, for each bucket, the percentage of test facts it contains with respect to the whole test set.

In FB15k, in most models we observe that a higher degree generally corresponds to better H@1. The main exceptions are TransE (Bordes et al., 2013), CrossE (Zhang et al., 2019) and STransE (Nguyen et al., 2016), which show a stable or even worsening pattern. We found that, considering more permissive H@K metrics (e.g. H@10), all models, including these three, improve their performance; we explain this by considering that, due to the very nature of the S2C transformation, original reified relations tend to generate a high number of facts containing symmetric relations. TransE, STransE and CrossE are naturally inclined to represent symmetric relations with very small vectors in the embedding space: as a consequence, when learning facts with symmetric relations, these models tend to place the possible answers very close to each other in the embedding space. The result would be a crowded area in which the correct target is often outranked when it comes to H@1, but manages to make it to the top K answers for larger values of K.
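The tendency towards very small relation vectors can be seen from a short derivation for the plain TransE scoring function, which rewards facts ⟨h, r, t⟩ for which the translated head lands on the tail. The derivation below covers only this simplest case; STransE and CrossE use different scoring functions, and the notation (bold lowercase letters for embeddings) is an assumption of this sketch. If a relation r is symmetric, both ⟨h, r, t⟩ and ⟨t, r, h⟩ must score well:

```latex
\begin{align*}
\mathbf{h} + \mathbf{r} \approx \mathbf{t}
\quad\text{and}\quad
\mathbf{t} + \mathbf{r} \approx \mathbf{h}
\;&\Longrightarrow\; (\mathbf{h} + \mathbf{r}) + \mathbf{r} \approx \mathbf{h}
\;\Longrightarrow\; 2\,\mathbf{r} \approx \mathbf{0} \\
&\Longrightarrow\; \mathbf{r} \approx \mathbf{0}
\ \text{and}\ \mathbf{h} \approx \mathbf{t}.
\end{align*}
```

The relation vector is pushed towards zero, which in turn pulls all entities connected by that relation towards the same point: this is the crowded area mentioned above.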

In FB15k-237 most of the redundant facts obtained from reified relations have been filtered away, therefore the large majority of test facts belongs to the first bucket.

(a) FB15k
(b) FB15k-237
Figure 17. H@1 results for each LP model on the Freebase datasets, and corresponding distribution of test facts, varying the degree of the original reified relation in FreeBase. The best results for each column are marked in bold and underlined.

5.4.1. Sensitivity to tie policy

We have observed that a few models, in their evaluation, are strikingly sensitive to the policy used for handling ties. This happens when models give the same score to multiple different entities in the same prediction: in this case results obtained with different policies diverge, and they are not comparable to one another anymore. In the most extreme case, if a model always gives the same score to all entities in the dataset, using min policy it will obtain H@1 = 1.0 (perfect score) whereas using any other policy it would obtain H@1 around 0.0.
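The extreme case described above can be reproduced with a few lines of code. The sketch is illustrative: the function name is hypothetical, and the ordinal policy is assumed to break ties by the original position of the entities in the score list.

```python
def rank_with_policy(scores, target_index, policy):
    """Rank of scores[target_index] among all scores (higher is better)."""
    target = scores[target_index]
    better = sum(1 for s in scores if s > target)
    tied = sum(1 for s in scores if s == target)  # includes the target itself
    if policy == "min":
        # The target gets the best position among the tied entities.
        return better + 1
    if policy == "average":
        # The target gets the mean of the positions occupied by the tie.
        return better + (tied + 1) / 2
    if policy == "ordinal":
        # Assumption: tied entities keep their original order in the list.
        tied_before = sum(1 for s in scores[:target_index] if s == target)
        return better + tied_before + 1
    raise ValueError(f"unknown policy: {policy}")


# A degenerate model that gives the same score to all 1000 entities.
flat_scores = [0.0] * 1000
for policy in ("min", "average", "ordinal"):
    print(policy, rank_with_policy(flat_scores, 499, policy))
# min 1
# average 500.5
# ordinal 500
```

With the min policy the degenerate model is credited with a hit at rank 1 for every prediction, while the other policies place the target around the middle of the ranking.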

In our experiments we have found that CrossE (Zhang et al., 2019) and, to a much greater extent, ConvKB (Nguyen et al., 2018) and CapsE (Nguyen et al., 2019), seem sensitive to this issue. Note that in their original implementations ConvKB and CapsE use min policy by default, whereas CrossE uses ordinal policy by default.

Table 5. Results obtained with average or ordinal tie policy (avg) against results obtained with min tie policy (min). The table features all the models for which these results show discrepancies (ConvKB; CapsE; CrossE), and the corresponding experiments (CapsE “Leaky”; CrossE without sigmoid).

In FB15k and FB15k-237 both ConvKB and CapsE display huge discrepancies on all metrics, whereas on WN18 and WN18RR the results are almost identical. On these datasets, no remarkable differences are observable for CrossE, except for MR, that is inherently sensitive to small variations. On YAGO3-10, quite interestingly, ConvKB does not seem to suffer from this issue, while CrossE shows a noticeable difference. CapsE shows the largest problems, with a behaviour akin to the extreme example described above.

We have run experiments on the architecture of these models in order to investigate which components are most responsible for these behaviours. We have found strong experimental evidence that saturating activation functions may be one of the main causes of this issue.

Saturating activation functions yield the same result for inputs beyond (or below) a certain value. For instance, the ReLU function (Rectified Linear Unit) returns 0 for any input less than or equal to 0. Intuitively, saturating activation functions make it more likely to assign identical scores to different entities, thus causing the observed issue.

ConvKB and CapsE both use ReLUs between their layers. In order to verify our hypothesis, we have trained a version of CapsE substituting its ReLUs with Leaky ReLUs. The Leaky ReLU function is a non-saturating alternative to ReLU: it keeps a linear behaviour even for inputs lesser than 0, with slope between 0 and 1. In our experiment we used . We report in Table 5 also the result of this CapsE variation, that we dubbed CapsE “Leaky”. As a matter of fact, for CapsE “Leaky” the differences between results obtained with min and average are much less prominent. In FB15k and FB15k-237, differences in H@1 and MRR either disappear or decrease of 2 or even 3 orders of magnitude, becoming barely observable. In WN18 and WN18RR, much like in the original CapsE, no differences are observable. In YAGO3-10 we still report a significant difference between min and average results, but it is much smaller than before.
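The effect of the two activations on ties can be seen in a toy example. It is a deliberate simplification: the activation output is treated directly as the entity score, whereas in ConvKB and CapsE the activations sit between layers, and the slope value 0.1 is an arbitrary choice for illustration, not necessarily the one used in the experiments.

```python
def relu(x):
    """Rectified Linear Unit: flat (saturated) for all non-positive inputs."""
    return max(0.0, x)


def leaky_relu(x, slope=0.1):
    """Leaky ReLU; the slope value here is an assumption for illustration."""
    return x if x > 0 else slope * x


# Hypothetical pre-activation values for four candidate entities.
pre_activations = [-3.0, -1.5, -0.2, 0.4]

relu_scores = [relu(x) for x in pre_activations]
leaky_scores = [leaky_relu(x) for x in pre_activations]

print(len(set(relu_scores)))   # 2 distinct scores: three entities are tied at 0
print(len(set(leaky_scores)))  # 4 distinct scores: no ties
```

With ReLU the three entities with negative pre-activations become indistinguishable, so their relative rank depends entirely on the tie policy; with Leaky ReLU their original order is preserved.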

CrossE does not employ explicitly saturating activation functions; nonetheless, after thoroughly investigating the model architecture, we have found that in this case too the issue is rooted in saturation. As shown in the scoring function in Table 1, CrossE normalizes its scores by applying a sigmoid function in a final step. The sigmoid function is not strictly saturating, but for inputs with very large modulus its slope is almost nil: therefore, due to low-level numerical approximations, it behaves just like a saturating function. We therefore tested CrossE again, just removing the sigmoid function in evaluation; since the sigmoid function is monotonically increasing, removing it does not affect entity ranks. In the obtained results all discrepancies between min and average policies disappear. As before, we report the results in Table 5.
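The numerical saturation of the sigmoid can be checked directly. The sketch below uses Python's double-precision floats and two arbitrary raw scores; it is only meant to illustrate the mechanism, not the actual score ranges produced by CrossE.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Two different raw scores with large modulus (arbitrary example values).
raw_a, raw_b = 40.0, 50.0

print(raw_a < raw_b)                     # True: raw scores are distinguishable
print(sigmoid(raw_a) == sigmoid(raw_b))  # True: both round to exactly 1.0

# For moderate inputs the sigmoid preserves the order of the raw scores.
print(sigmoid(1.0) < sigmoid(2.0))       # True
```

In double precision, exp(-40) is far smaller than the spacing between 1.0 and the next representable number, so adding it to 1.0 leaves 1.0 unchanged; with single-precision floats, which are common in deep learning frameworks, the same effect appears for much smaller inputs. Ranking on the raw scores avoids the artificial tie without changing the order of the entities.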

For all the other models in our analysis we have not found significant differences among their results obtained with different policies.

6. Key Takeaways and Research Directions

In this section we summarize the key takeaways from our comparative analysis. We believe these lessons can inform and inspire future research on LP models based on KG embeddings.

6.1. Effect of the design choices

We discuss here comprehensive observations regarding the performance of models as well as their robustness across evaluation datasets and metrics. We report findings regarding trends involving entire families based on specific design choices, as well as unique feats displayed by individual models.

Among those included in our analysis, Tensor Decomposition models show the most solid results across datasets. In the implementations taken into account, most of these systems display uniform performances on all evaluation metrics across the datasets of our analysis (with the potential exceptions of ANALOGY and SimplE, that are seemingly more fluctuating). In particular, ComplEx with its N3 regularization displays amazing results on all metrics across all datasets, being the only embedding-based model consistently comparable to the baseline AnyBURL.

The Geometric family, on the other hand, shows slightly more unstable results. In the past years, research has devoted considerable effort to translational models, ranging from TransE to its many successors with multi-embedding policies for handling many-to-one, one-to-many and many-to-many relations. These models show interesting results, but still suffer from some irregularities across metrics and datasets. For instance, models such as TransE and STransE seem to particularly struggle on the WN18RR dataset, especially when it comes to H@1 and MRR metrics. All in all, models relying solely on translations seem to have been outclassed by recent roto-translational ones. In this regard, RotatE shows remarkably consistent performance across all datasets, and it particularly shines when taking into account H@10.

Deep Learning models, finally, are the most diverse family, with wildly different results depending on the architectural choices of the models and on their implementations. ConvR and RSN display by far the best results in this family, achieving very similar, state-of-the-art performance in FB15k, WN18 and YAGO3-10. In FB15k-237 and WN18RR, whose filtering processes have cut away the most relevant paths, RSN seems to have a harder time, probably due to its formulation that explicitly leverages paths. On the other hand, models such as ConvKB and CapsE often achieve promising results on H@10 and MR metrics, whereas they seem to struggle with H@1 and MRR; furthermore, in some datasets they are clearly hindered by their issues with tie policies described in Section 5.4.1.

We stress that in the LP task the rule-based AnyBURL proves to be a remarkably well-performing model, as it consistently ranks among the best models across almost all datasets and metrics.

6.2. The importance of the graph structure

We have shown consistent experimental evidence that graph structural features have a large influence on what models manage to learn and predict.

We observe that in almost all models and datasets, predictions seem to be facilitated by the presence of source peers and hindered by the presence of target peers. As already mentioned, source peers work as examples that allow models to characterize more effectively the relation and the target to predict, whereas target peers lead models to confusion, as they try to optimize embeddings to fit too many different answers for the same question.

We also observe evidence suggesting that almost all models, even those that only learn individual facts in training, seem able to leverage relational paths and patterns to some extent.

All in all, the toughest scenarios for LP models seem to take place when there are relatively more target peers than source peers, in conjunction with a low support offered by relational paths. In these cases, models usually tend to show quite unsatisfactory performances. We believe that these are the areas where future research has most room for improvement, and thus the next big challenges to address in the LP research field.

We also point out interesting differences in behaviours and correlations depending on the features of the employed dataset. In FB15k and WN18, which display strong test leakage, model performance shows a prominent correlation with the support provided by shorter relational paths, with length 1 or 2. This is likely caused by such short paths including relations with inverse meaning or same meaning as the relation in the facts to predict. On the contrary, in FB15k-237, WN18RR and YAGO3-10, which do not suffer from test leakage, models appear to rely also on longer relational paths (3 steps), as well as on the numbers of source/target peers.

We believe that this leaves room for intriguing observations. In presence of very short patterns providing overwhelming predictive evidence (e.g., the inverse relations that cause test leakage), models seem very prone to just focusing on them, disregarding other forms of reasoning: this can be seen as an unhealthy consequence of the test leakage problem. In more balanced scenarios, on the contrary, models seem to investigate to a certain extent longer dependencies, as well as to focus more on analogical reasoning supported by examples (such as source peers).

We also observe that applying LP models based on embeddings to infer relations with cardinality greater than 2 is still an open problem. As already mentioned in Section 4.3.4, the FreeBase KG represents hyperedges as reified CVT nodes. Hyperedges constitute the large majority of edges in FreeBase: as noted by Fatemi et al. (Fatemi et al., 2019) and Wen et al. (Wen et al., 2016), 61% of the FreeBase relations are beyond-binary, and the corresponding hyperedges involve more than 1/3rd of the FreeBase entities. The FB15k and FB15k-237 datasets have been built by performing S2C explosion on FreeBase subsamples; this has resulted in greatly altering both the graph structure and its semantics, with overall loss of information. We believe that, in order to assess the effects of this process, it would be fruitful to extract novel versions of FB15k and FB15k-237 in their original reified structure without applying S2C. We also note that models such as m-TransH (Wen et al., 2016) and the recent HypE (Fatemi et al., 2019) have tried to circumvent these issues by developing systems that can explicitly learn hyperedges. Despite them being technically usable on datasets with binary relations, of course their unique features emerge best when dealing with relations beyond binary.

6.3. The importance of tie policies

We report that differences in the policies used to handle score ties can lead to huge differences in predictive performance in evaluation. As a matter of fact, such policies are today treated as almost negligible implementation details, and they are hardly ever even reported when presenting novel LP models. Nevertheless, we show that performance figures computed with different policies risk not being directly comparable to one another, and might not even reflect the actual predictive effectiveness of models. Therefore we strongly advise researchers to use the same policy in the future; in our opinion, the “average” policy seems the most reasonable choice. We have also found strong experimental evidence that saturating activation functions, such as ReLU, play a key role in leading models to assign the same scores to multiple entities in the same prediction; approximations may also lead non-saturating functions, such as Sigmoid, to behave as saturating in regions where their slope is particularly close to 0.

7. Related Works

Works related to ours can be roughly divided into two main categories: analyses and surveys. Analyses usually run further experiments trying to convey deeper understandings on LP models, whereas surveys usually attempt to organize them into comprehensive taxonomies based on their features and capabilities.


Chandrahas et al. (Sharma et al., 2018) study geometrical properties of the obtained embeddings in the latent space. They separate models into additive and multiplicative and measure the Alignment To Mean (ATM) and conicity of the learned vectors, showing that additive models tend to learn significantly sparser vectors than multiplicative ones. They then check how this reflects on the model performance. Their observations are intriguing, especially for multiplicative models, where a high conicity (and thus a low vector spread) seems to correlate with better effectiveness.

Wang et al. (Wang et al., 2019) provide a critique of the current benchmarking practices. They observe that current evaluation practices only compute the rankings for test facts; therefore, we are only verifying that, when a question is "meaningful" and has answers, our models prioritize the correct ones over the wrong ones. This amounts to performing question answering rather than KG completion, because we are not making sure that questions with no answers (and therefore not in the dataset) result in low scores. Therefore, they propose a novel evaluation pipeline, called Entity-Pair Ranking (PR), which includes all possible combinations of head entity, relation and tail entity. We wholly agree with their observations; unfortunately, we found that for our experiments, where the full ranking for all predictions is required for all models in all datasets, PR evaluation is far too time-consuming and thus infeasible.

Akrami et al. (Akrami et al., 2018) use the same intuition as Toutanova et al. (Toutanova and Chen, 2015) to carry out a slightly more structured analysis, as they use a wider variety of models to check the performance gap between FB15k and FB15k-237.

Kadlec et al. (Kadlec et al., 2017) demonstrate that a carefully tuned implementation of DistMult (Yang et al., 2015) can achieve state-of-the-art performance, surpassing most of its own successors, raising questions on whether we are developing better LP models or just tuning better hyperparameters.

Tran et al. (Tran and Takasu, 2019) interpret 4 models based on matrix factorization as special cases of the same multi-embedding interaction mechanism. In their formulation, each KG element is expressed as a set of vectors; the scoring functions combine such vectors using trilinear products. The authors also include empirical analyses and comparisons among said models, and introduce a new multi-embedding one based on quaternion algebra.

All the above mentioned analyses have a very different scope from ours. Their goal is generally to address specific issues or investigate vertical hypotheses; on the other hand, our objective is to run an extensive comparison of models belonging to vastly different families, investigating the effects of distinct design choices, discussing the effects of different benchmarking practices and underlining the importance of the graph structure.


Nickel et al. (Nickel et al., 2015) provide an overview of the most popular techniques in the whole field of Statistical Relational Learning, to which LP belongs. The authors include both traditional approaches based on observable graph features and more recent ones based on latent features. Since the paper was published, however, a great deal of further progress has been made in KG Embeddings.

Cai et al. (Cai et al., 2018) provide a survey for the whole Graph Embedding field. Their scope is not limited to KGs: on the contrary, they overview models handling a wide variety of graphs (Homogeneous, Heterogeneous, with Auxiliary Information, Constructed from Non-Relational Data) with an even wider variety of techniques. Some KG embedding models are briefly discussed in a section dedicated to models that minimize margin-based ranking loss.

In this regard, the surveys by Wang et al. (Wang et al., 2017) and by Nguyen (Nguyen, 2017) are the most relevant to our work, as they specifically focus on KG Embedding methods. In the work by Wang et al. (Wang et al., 2017), models are first coarsely grouped based on the input data they rely on (facts only; relation paths; textual contents; etc.); the resulting groups then undergo a finer-grained selection, taking into account, for instance, the nature of their scoring functions (e.g. distance-based or semantic-matching-based). Moreover, the authors offer detailed descriptions of each of the models they cover, explicitly stating its architectural peculiarities as well as its space and time complexities. Finally, they take into account a large variety of applications that Knowledge Graph Embedding models can support. The work by Nguyen (Nguyen, 2017) is similar, albeit more concise, and also includes current state-of-the-art methods such as RotatE (Sun et al., 2019).

Our work is fundamentally different from these surveys: while they only report results available in the original papers, we design experiments to extensively investigate the empirical behaviours of models. As discussed in Section 1, results reported in the original papers are generally obtained in very different settings and they are generally global metrics on the whole test sets; as a consequence, it is difficult to interpret and compare them.
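The difficulty of interpreting global metrics can be seen in a small sketch. The ranks and the two groups ("frequent" and "rare") below are made-up values chosen only for illustration, and the helper names are hypothetical; the point is that an aggregate H@1 or MRR can look good while one group of test facts is predicted poorly.

```python
from collections import defaultdict

def hits_at_k(ranks, k):
    """Fraction of predictions whose rank is at most k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mrr(ranks):
    """Mean reciprocal rank."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Made-up (group, rank) pairs: 8 test facts about over-represented entities,
# all ranked first, and 2 test facts about rare entities, ranked poorly.
predictions = [("frequent", 1)] * 8 + [("rare", 20), ("rare", 50)]

all_ranks = [rank for _, rank in predictions]
by_group = defaultdict(list)
for group, rank in predictions:
    by_group[group].append(rank)

# Global metrics aggregate over every test fact.
global_h1 = hits_at_k(all_ranks, 1)
global_mrr = mrr(all_ranks)

# The per-group breakdown exposes what the aggregate hides.
per_group_h1 = {g: hits_at_k(r, 1) for g, r in by_group.items()}
```

Here the global H@1 is 0.8 even though no fact in the rare group is predicted correctly at rank 1.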

8. Conclusions

In this work we have presented the first extensive comparative analysis of LP models based on KG embedding.

We have surveyed 16 LP models representative of diverse techniques and architectures, and we have analyzed their efficiency and effectiveness on the 5 most popular datasets in the literature.

We have introduced a set of structural properties characterizing the training data, and we have shown strong experimental evidence that they have a paramount effect on prediction performance. In doing so, we have investigated the circumstances that allow models to perform satisfactorily, while identifying the areas where research still has room for improvement.

We have thoroughly discussed the current evaluation practices, verifying that they can rely on different low-level policies producing incomparable and, in some cases, misleading results. We have analyzed the components that make models most sensitive to these policies, providing useful observations for future research.
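One low-level policy of this kind, taken here as an illustrative assumption, is how the rank of the correct answer is computed when several candidates receive the same score. The sketch below uses a hypothetical helper `rank_with_ties` with three tie-handling options; the policy names are chosen for illustration and are not taken from any specific library.

```python
def rank_with_ties(scores, target, policy="average"):
    """Rank of scores[target] among all scores (higher score = better).

    policy="min":     tied candidates are all placed below the target.
    policy="max":     tied candidates are all placed above the target.
    policy="average": the target gets the mean of the tied positions.
    """
    target_score = scores[target]
    better = sum(1 for s in scores if s > target_score)
    tied = sum(1 for s in scores if s == target_score)  # includes the target itself
    if policy == "min":
        return better + 1
    if policy == "max":
        return better + tied
    if policy == "average":
        return better + (tied + 1) / 2
    raise ValueError(f"unknown policy: {policy}")

# A degenerate model that gives every one of 10 candidates the same score.
constant_scores = [0.5] * 10
ranks = {p: rank_with_ties(constant_scores, 3, p) for p in ("min", "average", "max")}
```

For the degenerate model above, the "min" option yields rank 1 and the "max" option yields rank 10, so the same predictions look perfect or very poor depending on the policy alone.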


We thank Simone Scardapane, Matteo Cannaviccio and Alessandro Temperoni for their insightful discussions. We heartily thank the authors of AnyBURL, ComplEx-N3, ConvR, CrossE and RSN for their amazing support and guidance on their models.


  • [1] Accenture. Ampligraph. [Online; accessed 10-October-2019].
  • Akrami et al. [2018] F. Akrami, L. Guo, W. Hu, and C. Li. Re-evaluating embedding-based knowledge graph completion methods. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 1779–1782. ACM, 2018.
  • An et al. [2018] B. An, B. Chen, X. Han, and L. Sun. Accurate text-enhanced knowledge graph representation learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 745–755, 2018.
  • Auer et al. [2007] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. Dbpedia: A nucleus for a web of open data. In The semantic web, pages 722–735. Springer, 2007.
  • [5] baharefatemi. Simple. [Online; accessed 10-October-2019].
  • Balazevic et al. [2019] I. Balazevic, C. Allen, and T. M. Hospedales. Tucker: Tensor factorization for knowledge graph completion. CoRR, abs/1901.09590, 2019.
  • Bollacker et al. [2008] K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1247–1250. ACM, 2008.
  • Bordes et al. [2013] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, pages 2787–2795. NIPS, 2013.
  • Cai et al. [2018] H. Cai, V. W. Zheng, and K. Chang. A comprehensive survey of graph embedding: problems, techniques and applications. IEEE Transactions on Knowledge and Data Engineering, 2018.
  • Costabello et al. [2019] L. Costabello, S. Pai, C. L. Van, R. McGrath, N. McCarthy, and P. Tabacof. AmpliGraph: a Library for Representation Learning on Knowledge Graphs, Mar. 2019.
  • Dettmers et al. [2018] T. Dettmers, P. Minervini, P. Stenetorp, and S. Riedel. Convolutional 2d knowledge graph embeddings. In AAAI Conference on Artificial Intelligence, 2018.
  • Dong et al. [2014] X. L. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, pages 601–610. ACM, 2014. ISBN 978-1-4503-2956-9. doi: 10.1145/2623330.2623623.
  • Ebisu and Ichise [2018] T. Ebisu and R. Ichise. Toruse: Knowledge graph embedding on a lie group. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, pages 1819–1826, 2018.
  • [14] facebookresearch. kbc. [Online; accessed 10-October-2019].
  • Fatemi et al. [2019] B. Fatemi, P. Taslakian, D. Vázquez, and D. Poole. Knowledge hypergraphs: Extending knowledge graphs beyond binary relations. CoRR, abs/1906.00137, 2019.
  • Galárraga et al. [2015] L. Galárraga, C. Teflioudi, K. Hose, and F. M. Suchanek. Fast rule mining in ontological knowledge bases with amie+. The VLDB Journal—The International Journal on Very Large Data Bases, 24(6):707–730, 2015.
  • Galárraga et al. [2013] L. A. Galárraga, C. Teflioudi, K. Hose, and F. Suchanek. Amie: association rule mining under incomplete evidence in ontological knowledge bases. In Proceedings of the 22nd international conference on World Wide Web, pages 413–422. ACM, 2013.
  • Gesese et al. [2019] G. A. Gesese, R. Biswas, and H. Sack. A comprehensive survey of knowledge graph embeddings with literals: Techniques and applications. In Proceedings of the Workshop on Deep Learning for Knowledge Graphs (DL4KG2019) Co-located with the 16th Extended Semantic Web Conference 2019 (ESWC 2019), Portoroz, Slovenia, June 2, 2019, pages 31–40, 2019.
  • Guo et al. [2019] L. Guo, Z. Sun, and W. Hu. Learning to exploit long-term relational dependencies in knowledge graphs. In ICML, volume 97 of Proceedings of Machine Learning Research, pages 2505–2514. PMLR, 2019.
  • Guo et al. [2018] S. Guo, Q. Wang, L. Wang, B. Wang, and L. Guo. Knowledge graph embedding with iterative guidance from soft rules. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • Hitchcock [1927] F. L. Hitchcock. The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6(1-4):164–189, 1927.
  • Hopfield [1982] J. J. Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8):2554–2558, 1982.
  • Hovy et al. [2013] E. Hovy, R. Navigli, and S. P. Ponzetto. Collaboratively built semi-structured content and artificial intelligence: The story so far. Artif. Intell., 194:2–27, Jan. 2013. ISSN 0004-3702. doi: 10.1016/j.artint.2012.10.002.
  • Huynh et al. [2019] V.-P. Huynh, V. Meduri, S. Ortona, P. Papotti, and N. Ahmadi. Mining expressive rules in knowledge graphs. 2019.
  • Jiang et al. [2019] X. Jiang, Q. Wang, and B. Wang. Adaptive convolution for multi-relational learning. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 978–987, 2019.
  • Kadlec et al. [2017] R. Kadlec, O. Bajgar, and J. Kleindienst. Knowledge base completion: Baselines strike back. arXiv preprint arXiv:1705.10744, 2017.
  • Kazemi and Poole [2018] S. M. Kazemi and D. Poole. Simple embedding for link prediction in knowledge graphs. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 4289–4300, 2018.
  • Kolda and Bader [2009] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500, 2009.
  • Kostadinov [2018] S. Kostadinov. Recurrent Neural Networks with Python Quick Start Guide: Sequential learning and language modeling with TensorFlow. Packt Publishing Ltd, 2018.
  • Lacroix et al. [2018] T. Lacroix, N. Usunier, and G. Obozinski. Canonical tensor decomposition for knowledge base completion. In ICML, volume 80 of Proceedings of Machine Learning Research, pages 2869–2878. PMLR, 2018.
  • Lao and Cohen [2010] N. Lao and W. W. Cohen. Relational retrieval using a combination of path-constrained random walks. Machine learning, 81(1):53–67, 2010.
  • Lao et al. [2011] N. Lao, T. Mitchell, and W. W. Cohen. Random walk inference and learning in a large scale knowledge base. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 529–539. Association for Computational Linguistics, 2011.
  • LeCun et al. [1998] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Lin et al. [2015] Y. Lin, Z. Liu, H. Luan, M. Sun, S. Rao, and S. Liu. Modeling relation paths for representation learning of knowledge bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pages 705–714, 2015.
  • Liu et al. [2017] H. Liu, Y. Wu, and Y. Yang. Analogical inference for multi-relational embeddings. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 2168–2178, 2017.
  • Mahdisoltani et al. [2013] F. Mahdisoltani, J. Biega, and F. M. Suchanek. Yago3: A knowledge base from multilingual wikipedias. 2013.
  • Meilicke et al. [2018] C. Meilicke, M. Fink, Y. Wang, D. Ruffinelli, R. Gemulla, and H. Stuckenschmidt. Fine-grained evaluation of rule-and embedding-based systems for knowledge graph completion. In International Semantic Web Conference, pages 3–20. Springer, 2018.
  • Meilicke et al. [2019] C. Meilicke, M. W. Chekol, D. Ruffinelli, and H. Stuckenschmidt. Anytime bottom-up rule learning for knowledge graph completion. 2019.
  • Mikolov et al. [2013] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. In 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, 2013.
  • Nguyen [2017] D. Q. Nguyen. An overview of embedding models of entities and relationships for knowledge base completion. CoRR, abs/1703.08098, 2017.
  • Nguyen et al. [2016] D. Q. Nguyen, K. Sirts, L. Qu, and M. Johnson. Stranse: a novel embedding model of entities and relationships in knowledge bases. arXiv preprint arXiv:1606.08140, 2016.
  • Nguyen et al. [2018] D. Q. Nguyen, T. D. Nguyen, D. Q. Nguyen, and D. Q. Phung. A novel embedding model for knowledge base completion based on convolutional neural network. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), pages 327–333, 2018.
  • Nguyen et al. [2019] D. Q. Nguyen, T. Vu, T. D. Nguyen, D. Q. Nguyen, and D. Q. Phung. A capsule network-based embedding model for knowledge graph completion and search personalization. In NAACL-HLT (1), pages 2180–2189. Association for Computational Linguistics, 2019.
  • Nickel et al. [2011] M. Nickel, V. Tresp, and H.-P. Kriegel. A three-way model for collective learning on multi-relational data. In ICML, volume 11, pages 809–816, 2011.
  • Nickel et al. [2015] M. Nickel, K. Murphy, V. Tresp, and E. Gabrilovich. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33, 2015.
  • Nickel et al. [2016] M. Nickel, L. Rosasco, and T. A. Poggio. Holographic embeddings of knowledge graphs. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pages 1955–1961, 2016.
  • Paulheim [2017] H. Paulheim. Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic web, 8(3):489–508, 2017.
  • Qian [2013] R. Qian. Understand your world with bing, 2013. Accessed: 2019-10-30.
  • Sabour et al. [2017] S. Sabour, N. Frosst, and G. E. Hinton. Dynamic routing between capsules. In Advances in neural information processing systems, pages 3856–3866, 2017.
  • Schütze et al. [2008] H. Schütze, C. D. Manning, and P. Raghavan. Introduction to information retrieval. In Proceedings of the international communication of association for computing machinery conference, page 260, 2008.
  • Sharma et al. [2018] A. Sharma, P. Talukdar, et al. Towards understanding the geometry of knowledge graph embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 122–131, 2018.
  • Singhal [2012] A. Singhal. Introducing the knowledge graph: things, not strings, 2012. Accessed: 2019-10-30.
  • Stocky and Rasmussen [2014] T. Stocky and L. Rasmussen. Introducing graph search beta, 2014. Blogpost in Facebook Newsroom.
  • Suchanek et al. [2007] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowledge. In Proceedings of the 16th international conference on World Wide Web, pages 697–706. ACM, 2007.
  • Sun et al. [2019] Z. Sun, Z. Deng, J. Nie, and J. Tang. Rotate: Knowledge graph embedding by relational rotation in complex space. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019, 2019.
  • [56] TimDettmers. Conve. [Online; accessed 10-October-2019].
  • Toutanova and Chen [2015] K. Toutanova and D. Chen. Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pages 57–66, 2015.
  • Toutanova et al. [2015] K. Toutanova, D. Chen, P. Pantel, H. Poon, P. Choudhury, and M. Gamon. Representing text for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509, 2015.
  • Tran and Takasu [2019] H. N. Tran and A. Takasu. Analyzing knowledge graph embedding methods from a multi-embedding interaction perspective. In Proceedings of the Workshops of the EDBT/ICDT 2019 Joint Conference, EDBT/ICDT 2019, Lisbon, Portugal, March 26, 2019, 2019.
  • Trouillon and Nickel [2017] T. Trouillon and M. Nickel. Complex and holographic embeddings of knowledge graphs: A comparison. CoRR, abs/1707.01475, 2017.
  • Trouillon et al. [2016] T. Trouillon, J. Welbl, S. Riedel, É. Gaussier, and G. Bouchard. Complex embeddings for simple link prediction. In Proceedings of the 33nd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 2071–2080, 2016.
  • Vrandečić and Krötzsch [2014] D. Vrandečić and M. Krötzsch. Wikidata: a free collaborative knowledge base. 2014.
  • Wang et al. [2017] Q. Wang, Z. Mao, B. Wang, and L. Guo. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering (TKDE), 29(12):2724–2743, 2017.
  • Wang et al. [2019] Y. Wang, D. Ruffinelli, R. Gemulla, S. Broscheit, and C. Meilicke. On evaluating embedding models for knowledge base completion. In Proceedings of the 4th Workshop on Representation Learning for NLP, RepL4NLP@ACL 2019, Florence, Italy, August 2, 2019, pages 104–112, 2019.
  • Wang and Li [2016] Z. Wang and J.-Z. Li. Text-enhanced representation learning for knowledge graph. In IJCAI, pages 1293–1299, 2016.
  • Wang et al. [2014] Z. Wang, J. Zhang, J. Feng, and Z. Chen. Knowledge graph embedding by translating on hyperplanes. In Proceedings of the 28th AAAI Conference on Artificial Intelligence, volume 14, pages 1112–1119. AAAI, 2014.
  • [67] web.informatik.uni Anyburl. [Online; accessed 10-October-2019].
  • Wen et al. [2016] J. Wen, J. Li, Y. Mao, S. Chen, and R. Zhang. On the representation and embedding of knowledge bases beyond binary relations. In IJCAI, pages 1300–1307. IJCAI/AAAI Press, 2016.
  • West et al. [2014] R. West, E. Gabrilovich, K. Murphy, S. Sun, R. Gupta, and D. Lin. Knowledge base completion via search-based question answering. In Proceedings of the 23rd international conference on World wide web, pages 515–526. ACM, 2014.
  • Xie et al. [2017] R. Xie, Z. Liu, H. Luan, and M. Sun. Image-embodied knowledge representation learning. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pages 3140–3146, 2017. doi: 10.24963/ijcai.2017/438.
  • Yang et al. [2015] B. Yang, W. Yih, X. He, J. Gao, and L. Deng. Embedding entities and relations for learning and inference in knowledge bases. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
  • Zhang et al. [2019] W. Zhang, B. Paudel, W. Zhang, A. Bernstein, and H. Chen. Interaction embeddings for prediction and explanation in knowledge graphs. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM 2019, Melbourne, VIC, Australia, February 11-15, 2019, pages 96–104, 2019. doi: 10.1145/3289600.3291014.

Appendix A Hyperparameters

We report here the hyperparameter settings used for each model in our experiments. We highlight in yellow the settings we have found manually, and report alongside them the size of the corresponding space of combinations.

Table 6. Hyperparameters used to train all the models in our work. The reported settings are: batch size (alternatively, batch count); training epochs (alternatively, training steps); embedding dimension (alternatively, entity and relation embedding dimensions); learning rate; regularization margin; regularization method and the lambda used for it; label smoothing; optimizer; convolutional filters; convolutional kernel; dropout rate (in input, in a given hidden layer, or in features); negative samples per training fact (default: 1); temperature in adversarial negative sampling; and filter initialization (either the constant values [0.1, 0.1, -0.1] or samples from a truncated normal distribution).

Appendix B RPS with paths of maximum lengths 1 and 2

(a) FB15k-237
Figure 18. H@1 results for each LP model on FB15k-237 varying the RPS of the test facts, computing RPS with paths up to length 1 and up to length 2.
(a) WN18
(b) WN18RR
Figure 19. H@1 results for each LP model on WordNet datasets varying the RPS of the test facts, computing RPS with paths up to length 1 and up to length 2.
(a) YAGO3-10
Figure 20. H@1 results for each LP model on Yago datasets varying the RPS of the test facts, computing RPS with paths up to length 1 and up to length 2.