I Introduction
I am convinced that the crux of the problem of learning is recognizing relationships and being able to use them.
Christopher Strachey in a letter to Alan Turing, 1954
Traditional machine learning algorithms take as input a feature vector, which represents an object in terms of numeric or categorical attributes. The main learning task is to learn a mapping from this feature vector to an output prediction of some form. This could be a class label, a regression score, or an unsupervised cluster id or latent vector (embedding). In Statistical Relational Learning (SRL), the representation of an object can contain its relationships to other objects. Thus the data is in the form of a graph, consisting of nodes (entities) and labelled edges (relationships between entities). The main goals of SRL include prediction of missing edges, prediction of properties of nodes, and clustering nodes based on their connectivity patterns. These tasks arise in many settings such as analysis of social networks and biological pathways. For further information on SRL see [1, 2, 3].
In this article, we review a variety of techniques from the SRL community and explain how they can be applied to large-scale knowledge graphs (KGs), i.e., graph-structured knowledge bases (KBs) that store factual information in the form of relationships between entities. Recently, a large number of knowledge graphs have been created, including YAGO [4], DBpedia [5], NELL [6], Freebase [7], and the Google Knowledge Graph [8]. As we discuss in Section II, these graphs contain millions of nodes and billions of edges. This causes us to focus on scalable SRL techniques, which take time that is (at most) linear in the size of the graph.
We can apply SRL methods to existing KGs to learn a model that can predict new facts (edges) given existing facts. We can then combine this approach with information extraction methods that extract “noisy” facts from the Web (see e.g., [9, 10]). For example, suppose an information extraction method returns a fact claiming that Barack Obama was born in Kenya, and suppose (for illustration purposes) that the true place of birth of Obama was not already stored in the knowledge graph. An SRL model can use related facts about Obama (such as his profession being US President) to infer that this new fact is unlikely to be true and should be discarded. This provides us a way to “grow” a KG automatically, as we explain in more detail in Section IX.
The remainder of this paper is structured as follows. In Section II we introduce knowledge graphs and some of their properties. Section III discusses SRL and how it can be applied to knowledge graphs. There are two main classes of SRL techniques: those that capture the correlation between the nodes/edges using latent variables, and those that capture the correlation directly using statistical models based on the observable properties of the graph. We discuss these two families in Section IV and Section V, respectively. Section VI describes methods for combining these two approaches, in order to get the best of both worlds. Section VII discusses how such models can be trained on KGs. In Section VIII we discuss relational learning using Markov Random Fields. In Section IX we describe how SRL can be used in automated knowledge base construction projects. In Section X we discuss extensions of the presented methods, and Section XI presents our conclusions.
II Knowledge Graphs
In this section, we introduce knowledge graphs, and discuss how they are represented, constructed, and used.
II-A Knowledge representation
Knowledge graphs model information in the form of entities and relationships between them. This kind of relational knowledge representation has a long history in logic and artificial intelligence [11], for example, in semantic networks [12] and frames [13]. More recently, it has been used in the Semantic Web community with the purpose of creating a “web of data” that is readable by machines [14]. While this vision of the Semantic Web remains to be fully realized, parts of it have been achieved. In particular, the concept of linked data [15, 16] has gained traction, as it facilitates publishing and interlinking data on the Web in relational form using the W3C Resource Description Framework (RDF) [17, 18]. (For an introduction to knowledge representation, see e.g. [11, 19, 20].)
In this article, we will loosely follow the RDF standard and represent facts in the form of binary relationships, in particular (subject, predicate, object) (SPO) triples, where subject and object are entities and predicate is the relation between them. (We discuss how to represent higher-arity relations in Section X-A.) The existence of a particular SPO triple indicates an existing fact, i.e., that the respective entities are in a relationship of the given type. For instance, the information
Leonard Nimoy was an actor who played the character Spock in the science-fiction movie Star Trek
can be expressed via the following set of SPO triples:
subject  predicate  object 

(LeonardNimoy,  profession,  Actor) 
(LeonardNimoy,  starredIn,  StarTrek) 
(LeonardNimoy,  played,  Spock) 
(Spock,  characterIn,  StarTrek) 
(StarTrek,  genre,  ScienceFiction) 
We can combine all the SPO triples together to form a multigraph, where nodes represent entities (all subjects and objects), and directed edges represent relationships. The direction of an edge indicates whether entities occur as subjects or objects, i.e., an edge points from the subject to the object. Different relations are represented via different types of edges (also called edge labels). This construction is called a knowledge graph (KG), or sometimes a heterogeneous information network [21]. See Figure 1 for an example.
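The construction of the multigraph from SPO triples can be sketched in a few lines of Python. This is only a minimal illustration: the function name `build_graph` and the adjacency-list layout are our own assumptions, not part of any particular KG system.

```python
from collections import defaultdict

# The SPO triples from the Leonard Nimoy example above.
triples = [
    ("LeonardNimoy", "profession", "Actor"),
    ("LeonardNimoy", "starredIn", "StarTrek"),
    ("LeonardNimoy", "played", "Spock"),
    ("Spock", "characterIn", "StarTrek"),
    ("StarTrek", "genre", "ScienceFiction"),
]

def build_graph(spo_triples):
    """Return (nodes, out_edges) of a directed, edge-labelled multigraph.

    Hypothetical helper: nodes are all subjects and objects; every triple
    becomes one labelled edge pointing from the subject to the object.
    """
    nodes = set()
    out_edges = defaultdict(list)  # subject -> list of (predicate, object)
    for s, p, o in spo_triples:
        nodes.update((s, o))
        out_edges[s].append((p, o))
    return nodes, out_edges

nodes, out_edges = build_graph(triples)
print(sorted(nodes))
print(out_edges["LeonardNimoy"])
```

Storing a list of (predicate, object) pairs per subject keeps parallel edges with different labels, which is what makes the structure a multigraph rather than a simple graph.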
In addition to being a collection of facts, knowledge graphs often provide type hierarchies (Leonard Nimoy is an actor, which is a person, which is a living thing) and type constraints (e.g., a person can only marry another person, not a thing).
II-B Open vs. closed world assumption
While existing triples always encode known true relationships (facts), there are different paradigms for the interpretation of non-existing triples:

Under the closed world assumption (CWA), non-existing triples indicate false relationships. For example, the fact that in Figure 1 there is no starredIn edge from Leonard Nimoy to Star Wars is interpreted to mean that Nimoy definitely did not star in this movie.

Under the open world assumption (OWA), a non-existing triple is interpreted as unknown, i.e., the corresponding relationship can be either true or false. Continuing with the above example, the missing edge is not interpreted to mean that Nimoy did not star in Star Wars. This more cautious approach is justified, since KGs are known to be very incomplete. For example, sometimes just the main actors in a movie are listed, not the complete cast. As another example, note that even the place of birth attribute, which you might think would be typically known, is missing for a large share of all people included in Freebase [22].
RDF and the Semantic Web make the open-world assumption. In Section VII-B we also discuss the local closed world assumption (LCWA), which is often used for training relational models.
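The difference between the two paradigms can be made concrete with a small sketch. The function `triple_label` and its three-valued return convention (True, False, or None for "unknown") are assumptions made for illustration only.

```python
def triple_label(triple, known_triples, assumption="OWA"):
    """Interpret a triple under the closed or open world assumption.

    Hypothetical helper: returns True for an existing triple, False for a
    non-existing triple under CWA, and None (unknown) under OWA.
    """
    if triple in known_triples:
        return True          # existing triples always encode known facts
    if assumption == "CWA":
        return False         # missing means false
    if assumption == "OWA":
        return None          # missing means unknown
    raise ValueError(f"unsupported assumption: {assumption!r}")

known = {("LeonardNimoy", "starredIn", "StarTrek")}
query = ("LeonardNimoy", "starredIn", "StarWars")

print(triple_label(query, known, "CWA"))  # False
print(triple_label(query, known, "OWA"))  # None
```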
II-C Knowledge base construction
Completeness, accuracy, and data quality are important parameters that determine the usefulness of knowledge bases and are influenced by the way knowledge bases are constructed. We can classify KB construction methods into four main groups:

In curated approaches, triples are created manually by a closed group of experts.

In collaborative approaches, triples are created manually by an open group of volunteers.

In automated semi-structured approaches, triples are extracted automatically from semi-structured text (e.g., infoboxes in Wikipedia) via hand-crafted rules, learned rules, or regular expressions.

In automated unstructured approaches, triples are extracted automatically from unstructured text via machine learning and natural language processing techniques (see, e.g., [9] for a review).
Method  Schema  Examples
Curated  Yes  Cyc/OpenCyc [23], WordNet [24], UMLS [25]
Collaborative  Yes  Wikidata [26], Freebase [7]
Auto. Semi-Structured  Yes  YAGO [4, 27], DBpedia [5], Freebase [7]
Auto. Unstructured  Yes  Knowledge Vault [28], NELL [6], PATTY [29], PROSPERA [30], DeepDive/Elementary [31]
Auto. Unstructured  No  ReVerb [32], OLLIE [33], PRISMATIC [34]
Construction of curated knowledge bases typically leads to highly accurate results, but this technique does not scale well due to its dependence on human experts. Collaborative knowledge base construction, which was used to build Wikipedia and Freebase, scales better but still has some limitations. For instance, as mentioned previously, the place of birth attribute is missing for a large share of all people included in Freebase, even though this is a mandatory property of the schema [22]. Also, a recent study [35] found that the growth of Wikipedia has been slowing down. Consequently, automatic knowledge base construction methods have been gaining more attention.
Such methods can be grouped into two main approaches. The first approach exploits semi-structured data, such as Wikipedia infoboxes, which has led to large, highly accurate knowledge graphs such as YAGO [4, 27] and DBpedia [5]. The accuracy (trustworthiness) of facts in such automatically created KGs is often still very high. For instance, the accuracy of YAGO2 has been estimated to be over 95% through manual inspection of sample facts [36] (for detailed statistics see http://www.mpiinf.mpg.de/departments/databasesandinformationsystems/research/yagonaga/yago/statistics/), and the accuracy of Freebase [7] was estimated to be 99% (see http://thenoisychannel.com/2011/11/15/cikm2011industryeventjohngiannandreaonfreebasearosettastoneforentities). However, semi-structured text still covers only a small fraction of the information stored on the Web, and completeness (or coverage) is another important aspect of KGs. Hence the second approach tries to “read the Web”, extracting facts from the natural language text of Web pages. Example projects in this category include NELL [6] and the Knowledge Vault [28]. In Section IX, we show how we can reduce the level of “noise” in such automatically extracted facts by using the knowledge from existing, high-quality repositories.
KGs, and more generally KBs, can also be classified based on whether they employ a fixed or open lexicon of entities and relations. In particular, we distinguish two main types of KBs:

In schema-based approaches, entities and relations are represented via globally unique identifiers and all possible relations are predefined in a fixed vocabulary. For example, Freebase might represent the fact that Barack Obama was born in Hawaii using the triple (/m/02mjmr, /people/person/bornin, /m/03gh4), where /m/02mjmr is the unique machine ID for Barack Obama.

In schema-free approaches, entities and relations are identified using open information extraction (OpenIE) techniques [37], and represented via normalized but not disambiguated strings (also referred to as surface names). For example, an OpenIE system may contain triples such as (“Obama”, “born in”, “Hawaii”), (“Barack Obama”, “place of birth”, “Honolulu”), etc. Note that it is not clear from this representation whether the first triple refers to the same person as the second triple, nor whether “born in” means the same thing as “place of birth”. This is the main disadvantage of OpenIE systems.
Knowledge Graph  Number of Entities  Number of Relation Types  Number of Facts
Freebase  40 M  35,000  637 M
Wikidata  18 M  1,632  66 M
DBpedia (en)  4.6 M  1,367  538 M
YAGO2  9.8 M  114  447 M
Google Knowledge Graph  570 M  35,000  18,000 M
Notes: Freebase: non-redundant triples, see [28, Table 1]. Wikidata: last published numbers, https://tools.wmflabs.org/wikidatatodo/stats.php and https://www.wikidata.org/wiki/Category:All_Properties. DBpedia (en): English content, Version 2014 from http://wiki.dbpedia.org/dataset2014. YAGO2: see [27, Table 5]. Google Knowledge Graph: last published numbers, http://insidesearch.blogspot.de/2012/12/getsmarteranswersfromknowledge_4.html.
II-D Uses of knowledge graphs
Knowledge graphs provide semantically structured information that is interpretable by computers — a property that is regarded as an important ingredient to build more intelligent machines [38]. Consequently, knowledge graphs are already powering multiple “Big Data” applications in a variety of commercial and scientific domains. A prime example is the integration of Google’s Knowledge Graph, which currently stores 18 billion facts about 570 million entities, into the results of Google’s search engine [8]. The Google Knowledge Graph is used to identify and disambiguate entities in text, to enrich search results with semantically structured summaries, and to provide links to related entities in exploratory search. (Microsoft has a similar KB, called Satori, integrated with its Bing search engine [39].)
Enhancing search results with semantic information from knowledge graphs can be seen as an important step to transform text-based search engines into semantically aware question answering services. Another prominent example demonstrating the value of knowledge graphs is IBM’s question answering system Watson, which was able to beat human experts in the game of Jeopardy!. Among other sources, this system used YAGO, DBpedia, and Freebase as its sources of information [40]. Repositories of structured knowledge are also an indispensable component of digital assistants such as Siri, Cortana, or Google Now.
Knowledge graphs are also used in several specialized domains. For instance, Bio2RDF [41], Neurocommons [42], and LinkedLifeData [43] are knowledge graphs that integrate multiple sources of biomedical information. These have been used for question answering and decision support in the life sciences.
II-E Main tasks in knowledge graph construction and curation
In this section, we review a number of typical KG tasks.
Link prediction is concerned with predicting the existence (or probability of correctness) of (typed) edges in the graph (i.e., triples). This is important since existing knowledge graphs are often missing many facts, and some of the edges they contain are incorrect [44]. In the context of knowledge graphs, link prediction is also referred to as knowledge graph completion. For example, in Figure 1, suppose the characterIn edge from Obi-Wan Kenobi to Star Wars were missing; we might be able to predict this missing edge, based on the structural similarity between this part of the graph and the part involving Spock and Star Trek. It has been shown that relational models that take the relationships of entities into account can significantly outperform non-relational machine learning methods for this task (e.g., see [45, 46]).
Entity resolution (also known as record linkage [47], object identification [48], instance matching [49], and deduplication [50]) is the problem of identifying which objects in relational data refer to the same underlying entities. See Figure 2 for a small example. In a relational setting, the decisions about which objects are assumed to be identical can propagate through the graph, so that matching decisions are made collectively for all objects in a domain rather than independently for each object pair (see, for example, [51, 52, 53]). In schema-based automated knowledge base construction, entity resolution can be used to match the extracted surface names to entities stored in the knowledge graph.
Link-based clustering extends feature-based clustering to a relational learning setting and groups entities in relational data based on their similarity. However, in link-based clustering, entities are not only grouped by the similarity of their features but also by the similarity of their links. As in entity resolution, the similarity of entities can propagate through the knowledge graph, such that relational modeling can add important information for this task. In social network analysis, link-based clustering is also known as community detection [54].
III Statistical Relational Learning for Knowledge Graphs
Statistical Relational Learning is concerned with the creation of statistical models for relational data. In the following sections we discuss how statistical relational learning can be applied to knowledge graphs. We will assume that all the entities and (types of) relations in a knowledge graph are known. (We discuss extensions of this assumption in Section X-C.) However, triples are assumed to be incomplete and noisy; entities and relation types may contain duplicates.
Notation
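Before proceeding, let us define our mathematical notation. (Variable names will be introduced later in the appropriate sections.) We denote scalars by lower case letters, such as $a$; column vectors (of size $N$) by bold lower case letters, such as $\mathbf{a}$; matrices (of size $N_1 \times N_2$) by bold upper case letters, such as $\mathbf{A}$; and tensors (of size $N_1 \times N_2 \times N_3$) by bold upper case letters with an underscore, such as $\underline{\mathbf{A}}$. We denote the $k$’th “frontal slice” of a tensor $\underline{\mathbf{A}}$ by $\mathbf{A}_k$ (which is a matrix of size $N_1 \times N_2$), and the $(i,j,k)$’th element by $a_{ijk}$ (which is a scalar). We use $[\mathbf{a}; \mathbf{b}]$ to denote the vertical stacking of vectors $\mathbf{a}$ and $\mathbf{b}$, i.e., $[\mathbf{a}; \mathbf{b}] = \begin{pmatrix} \mathbf{a} \\ \mathbf{b} \end{pmatrix}$. We can convert a matrix $\mathbf{A}$ of size $N_1 \times N_2$ into a vector $\mathbf{a}$ of size $N_1 N_2$ by stacking all columns of $\mathbf{A}$, denoted $\mathbf{a} = \mathrm{vec}(\mathbf{A})$. The inner (scalar) product of two vectors (both of size $N$) is defined by $\mathbf{a}^\top \mathbf{b} = \sum_{i=1}^{N} a_i b_i$. The tensor (Kronecker) product of two vectors (of size $N_1$ and $N_2$) is a vector of size $N_1 N_2$ with entries $\mathbf{a} \otimes \mathbf{b} = [a_1 \mathbf{b}; \ldots; a_{N_1} \mathbf{b}]$. Matrix multiplication is denoted by $\mathbf{A}\mathbf{B}$ as usual. We denote the $L_2$ norm of a vector by $\|\mathbf{a}\|_2 = \sqrt{\sum_i a_i^2}$, and the Frobenius norm of a matrix by $\|\mathbf{A}\|_F = \sqrt{\sum_i \sum_j a_{ij}^2}$. We denote the vector of all ones by $\mathbf{1}$, and the identity matrix by $\mathbf{I}$.
III-A Probabilistic knowledge graphs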
We now introduce some mathematical background so we can more formally define statistical models for knowledge graphs.
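Let $\mathcal{E} = \{e_1, \ldots, e_{N_e}\}$ be the set of all entities and $\mathcal{R} = \{r_1, \ldots, r_{N_r}\}$ be the set of all relation types in a knowledge graph. We model each possible triple $x_{ijk} = (e_i, r_k, e_j)$ over this set of entities and relations as a binary random variable $y_{ijk} \in \{0, 1\}$ that indicates its existence. All possible triples in $\mathcal{E} \times \mathcal{R} \times \mathcal{E}$ can be grouped naturally in a third-order tensor (three-way array) $\underline{\mathbf{Y}} \in \{0, 1\}^{N_e \times N_e \times N_r}$, whose entries are set such that
$$y_{ijk} = \begin{cases} 1, & \text{if the triple } (e_i, r_k, e_j) \text{ exists} \\ 0, & \text{otherwise.} \end{cases}$$
We will refer to this construction as an adjacency tensor (cf. Figure 3). Each possible realization of $\underline{\mathbf{Y}}$ can be interpreted as a possible world. To derive a model for the entire knowledge graph, we are then interested in estimating the joint distribution $\mathrm{P}(\underline{\mathbf{Y}})$ from a subset $\mathcal{D} \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E} \times \{0, 1\}$ of observed triples. In doing so, we are estimating a probability distribution over possible worlds, which allows us to predict the probability of triples based on the state of the entire knowledge graph. While $y_{ijk} = 1$ in adjacency tensors indicates the existence of a triple, the interpretation of $y_{ijk} = 0$ depends on whether the open world, closed world, or local-closed world assumption is made. For details, see Section VII-B.
Note that the size of $\underline{\mathbf{Y}}$ can be enormous for large knowledge graphs. For instance, in the case of Freebase, which currently consists of over 40 million entities and 35,000 relations, the number of possible triples $|\mathcal{E} \times \mathcal{R} \times \mathcal{E}|$ exceeds $10^{19}$ elements. Of course, type constraints reduce this number considerably.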
Even amongst the syntactically valid triples, only a tiny fraction are likely to be true. For example, there are over 450,000 actors and over 250,000 movies stored in Freebase. But each actor stars only in a small number of movies. Therefore, an important issue for SRL on knowledge graphs is how to deal with the large number of possible relationships while efficiently exploiting the sparsity of relationships. Ideally, a relational model for large-scale knowledge graphs should scale at most linearly with the data size, i.e., linearly in the number of entities $N_e$, linearly in the number of relations $N_r$, and linearly in the number of observed triples $|\mathcal{D}| = N_d$.
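As a small illustration of the adjacency tensor and its sparsity, the following NumPy sketch builds the tensor for the Leonard Nimoy example and counts possible versus observed triples. The index convention (subject, object, relation) and the variable names are assumptions chosen to match the notation $y_{ijk}$; a dense array is used only because the example is tiny, whereas a real KG would require a sparse representation.

```python
import numpy as np

entities = ["LeonardNimoy", "Spock", "StarTrek", "Actor", "ScienceFiction"]
relations = ["profession", "starredIn", "played", "characterIn", "genre"]
e_idx = {e: i for i, e in enumerate(entities)}
r_idx = {r: k for k, r in enumerate(relations)}

observed = [
    ("LeonardNimoy", "profession", "Actor"),
    ("LeonardNimoy", "starredIn", "StarTrek"),
    ("LeonardNimoy", "played", "Spock"),
    ("Spock", "characterIn", "StarTrek"),
    ("StarTrek", "genre", "ScienceFiction"),
]

# Y[i, j, k] = 1 if the triple (e_i, r_k, e_j) exists, 0 otherwise.
Y = np.zeros((len(entities), len(entities), len(relations)), dtype=np.int8)
for s, p, o in observed:
    Y[e_idx[s], e_idx[o], r_idx[p]] = 1

num_possible = Y.size                 # N_e * N_e * N_r
num_observed = int(Y.sum())           # number of non-zero entries
print(num_possible, num_observed)     # 125 5

# The same count at Freebase scale, using the entity and relation
# figures quoted in the text (40 million entities, 35,000 relations).
freebase_possible = (40 * 10**6) ** 2 * 35_000
print(freebase_possible > 10**19)     # True
```

Even in this toy graph only 5 of 125 possible triples are observed, which is why methods that iterate over the non-zero entries rather than the full tensor are essential at scale.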
III-B Statistical properties of knowledge graphs
Knowledge graphs typically adhere to some deterministic rules, such as type constraints and transitivity (e.g., if Leonard Nimoy was born in Boston, and Boston is located in the USA, then we can infer that Leonard Nimoy was born in the USA). However, KGs typically also have various “softer” statistical patterns or regularities, which are not universally true but nevertheless have useful predictive power.
One example of such a statistical pattern is known as homophily, that is, the tendency of entities to be related to other entities with similar characteristics. This has been widely observed in various social networks [55, 56]. For example, US-born actors are more likely to star in US-made movies. For multi-relational data (graphs with more than one kind of link), homophily has also been referred to as autocorrelation [57].
Another statistical pattern is known as block structure. This refers to the property where entities can be divided into distinct groups (blocks), such that all the members of a group have similar relationships to members of other groups [58, 59, 60]. For example, we can group some actors, such as Leonard Nimoy and Alec Guinness, into a science fiction actor block, and some movies, such as Star Trek and Star Wars, into a science fiction movie block, since there is a high density of links from the sci-fi actor block to the sci-fi movie block.
Graphs can also exhibit global and long-range statistical dependencies, i.e., dependencies that can span over chains of triples and involve different types of relations. For example, the citizenship of Leonard Nimoy (USA) depends statistically on the city where he was born (Boston), and this dependency involves a path over multiple entities (Leonard Nimoy, Boston, USA) and relations (bornIn, locatedIn, citizenOf). A distinctive feature of relational learning is that it is able to exploit such patterns to create richer and more accurate models of relational domains.
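One simple way to make such a path-based dependency visible is to multiply the adjacency matrices of the relations along the path. The sketch below is an illustration under our own assumptions (three entities, two relation matrices named `born_in` and `located_in`); it counts bornIn-then-locatedIn paths, which could serve as an observable signal for predicting citizenOf, but it is not a method prescribed by the text.

```python
import numpy as np

entities = ["LeonardNimoy", "Boston", "USA"]
idx = {e: i for i, e in enumerate(entities)}
n = len(entities)

# One adjacency matrix per relation: A[i, j] = 1 if (e_i, relation, e_j) exists.
born_in = np.zeros((n, n), dtype=int)
located_in = np.zeros((n, n), dtype=int)
born_in[idx["LeonardNimoy"], idx["Boston"]] = 1
located_in[idx["Boston"], idx["USA"]] = 1

# paths[i, j] counts length-2 paths e_i -bornIn-> e_m -locatedIn-> e_j.
paths = born_in @ located_in
print(paths[idx["LeonardNimoy"], idx["USA"]])  # 1
```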
When applying statistical models to incomplete knowledge graphs, it should be noted that the distribution of facts in such KGs can be skewed. For instance, KGs that are derived from Wikipedia will inherit the skew that exists in the distribution of facts in Wikipedia itself. (As an example, there are currently 10,306 male and 7,586 female American actors listed in Wikipedia, while there are only 1,268 male and 1,354 female Indian, and 77 male and no female Nigerian actors. India and Nigeria, however, are the largest and second largest film industries in the world.) Statistical models as discussed in the following sections can be affected by such biases in the input data and need to be interpreted accordingly.
III-C Types of SRL models
As we discussed, the presence or absence of certain triples in relational data is correlated with (i.e., predictive of) the presence or absence of certain other triples. In other words, the random variables $y_{ijk}$ are correlated with each other. We will discuss three main ways to model these correlations:
M1) Assume all $y_{ijk}$ are conditionally independent given latent features associated with subject, object and relation type and additional parameters (latent feature models).
M2) Assume all $y_{ijk}$ are conditionally independent given observed graph features and additional parameters (graph feature models).
M3) Assume all $y_{ijk}$ have local interactions (Markov Random Fields).
In what follows we will mainly focus on M1 and M2 and their combination; M3 will be the topic of Section VIII.
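The model classes M1 and M2 predict the existence of a triple $x_{ijk}$ via a score function $f(x_{ijk}; \Theta)$ which represents the model’s confidence that a triple exists given the parameters $\Theta$. The conditional independence assumptions of M1 and M2 allow the probability model to be written as follows:
$$\mathrm{P}(\underline{\mathbf{Y}} \mid \mathcal{D}, \Theta) = \prod_{i=1}^{N_e} \prod_{j=1}^{N_e} \prod_{k=1}^{N_r} \mathrm{Ber}\big(y_{ijk} \mid \sigma(f(x_{ijk}; \Theta))\big) \qquad (1)$$
where $\sigma(u) = 1/(1 + e^{-u})$ is the sigmoid (logistic) function, and
$$\mathrm{Ber}(y \mid p) = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{if } y = 0 \end{cases} \qquad (2)$$
is the Bernoulli distribution.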
We will refer to models of the form Equation 1 as probabilistic models. In addition to probabilistic models, we will also discuss models which optimize under other criteria, for instance models which maximize the margin between existing and non-existing triples. We will refer to such models as score-based models. If desired, we can derive probabilities for score-based models via Platt scaling [61].
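The probabilistic model of Equations 1 and 2 can be evaluated numerically as in the following sketch, which computes the log of the product in Equation 1 for given labels and scores. The function names and the toy labels and scores are assumptions for illustration; any score function that yields one real value per triple could supply `F`.

```python
import numpy as np

def sigmoid(u):
    """Logistic function: maps a real-valued score to a probability."""
    return 1.0 / (1.0 + np.exp(-u))

def bernoulli(y, p):
    """Ber(y | p): p if y == 1, and 1 - p if y == 0 (Equation 2)."""
    return np.where(y == 1, p, 1.0 - p)

def log_likelihood(Y, F):
    """Log of Equation 1: sum over triples of log Ber(y | sigmoid(f))."""
    return np.sum(np.log(bernoulli(Y, sigmoid(F))))

Y = np.array([1, 0, 1])          # hypothetical labels y_ijk
F = np.array([2.0, -1.0, 0.0])   # hypothetical scores f(x_ijk; Theta)
print(log_likelihood(Y, F))
```

Working with the log of the product avoids numerical underflow, since Equation 1 multiplies one factor per possible triple.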
There are many different methods for defining $f(\cdot)$. In Sections IV, V, VI, and VIII we will discuss different options for all model classes. In Section VII we will furthermore discuss aspects of how to train these models on knowledge graphs.
IV Latent Feature Models
In this section, we assume that the variables $y_{ijk}$ are conditionally independent given a set of global latent features and parameters, as in Equation 1. We discuss various possible forms for the score function $f(x_{ijk}; \Theta)$ below. What all models have in common is that they explain triples via latent features of entities. (This is justified via various theoretical arguments [62].) For instance, a possible explanation for the fact that Alec Guinness received the Academy Award is that he is a good actor. This explanation uses latent features of entities (being a good actor) to explain observable facts (Guinness receiving the Academy Award). We call these features “latent” because they are not directly observed in the data. One task of all latent feature models is therefore to infer these features automatically from the data.
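In the following, we will denote the latent feature representation of an entity $e_i$ by the vector $\mathbf{e}_i \in \mathbb{R}^{H_e}$ where $H_e$ denotes the number of latent features in the model. For instance, we could model that Alec Guinness is a good actor and that the Academy Award is a prestigious award via two-dimensional vectors $\mathbf{e}_{\text{Guinness}}$ and $\mathbf{e}_{\text{AcademyAward}}$, where the component $e_{i1}$ corresponds to the latent feature Good Actor and $e_{i2}$ corresponds to Prestigious Award, so that $\mathbf{e}_{\text{Guinness}}$ has a large first component and $\mathbf{e}_{\text{AcademyAward}}$ has a large second component. (Note that, unlike this example, the latent features that are inferred by the following models are typically hard to interpret.)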
The key intuition behind relational latent feature models is that the relationships between entities can be derived from interactions of their latent features. However, there are many possible ways to model these interactions, and many ways to derive the existence of a relationship from them. We discuss several possibilities below. See Table III for a summary of the notation.
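Relational data
Symbol  Meaning
$N_e$  Number of entities
$N_r$  Number of relations
$N_d$  Number of training examples
$e_i$  $i$-th entity in the dataset (e.g., LeonardNimoy)
$r_k$  $k$-th relation in the dataset (e.g., bornIn)
$\mathcal{D}^+$  Set of observed positive triples
$\mathcal{D}^-$  Set of observed negative triples
Probabilistic Knowledge Graphs
Symbol  Meaning  Size
$\underline{\mathbf{Y}}$  (Partially observed) labels for all triples  $N_e \times N_e \times N_r$
$\underline{\mathbf{F}}$  Score for all possible triples  $N_e \times N_e \times N_r$
$\mathbf{Y}_k$  Slice of $\underline{\mathbf{Y}}$ for relation $r_k$  $N_e \times N_e$
$\mathbf{F}_k$  Slice of $\underline{\mathbf{F}}$ for relation $r_k$  $N_e \times N_e$
Graph and Latent Feature Models
Symbol  Meaning
$\boldsymbol{\phi}_{ijk}$  Feature vector representation of triple $(e_i, r_k, e_j)$
$\mathbf{w}_k$  Weight vector to derive scores for relation $k$
$\Theta$  Set of all parameters of the model
$\sigma(\cdot)$  Sigmoid (logistic) function
Latent Feature Models
Symbol  Meaning  Size
$H_e$  Number of latent features for entities
$H_r$  Number of latent features for relations
$\mathbf{e}_i$  Latent feature repr. of entity $e_i$  $H_e$
$\mathbf{r}_k$  Latent feature repr. of relation $r_k$  $H_r$
$H_a$  Size of $\mathbf{h}_a$ layer
$H_b$  Size of $\mathbf{h}_b$ layer
$H_c$  Size of $\mathbf{h}_c$ layer
$\mathbf{E}$  Entity embedding matrix  $N_e \times H_e$
$\mathbf{W}_k$  Bilinear weight matrix for relation $k$  $H_e \times H_e$
$\mathbf{A}_k$  Linear feature map for pairs of entities for relation $r_k$
$\mathbf{C}$  Linear feature map for triples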
IV-A RESCAL: A bilinear model
RESCAL [63, 64, 65] is a relational latent feature model which explains triples via pairwise interactions of latent features. In particular, we model the score of a triple $x_{ijk}$ as
$$f_{ijk}^{\mathrm{RESCAL}} := \mathbf{e}_i^\top \mathbf{W}_k \mathbf{e}_j = \sum_{a=1}^{H_e} \sum_{b=1}^{H_e} w_{abk} \, e_{ia} \, e_{jb} \qquad (3)$$
where $\mathbf{W}_k \in \mathbb{R}^{H_e \times H_e}$ is a weight matrix whose entries $w_{abk}$ specify how much the latent features $a$ and $b$ interact in the $k$-th relation. We call this a bilinear model, since it captures the interactions between the two entity vectors using multiplicative terms. For instance, we could model the pattern that good actors are likely to receive prestigious awards via a weight matrix $\mathbf{W}_k$ for the corresponding relation whose entry coupling the Good Actor feature of the subject with the Prestigious Award feature of the object is large.
In general, we can model block structure patterns via the magnitude of entries in $\mathbf{W}_k$, while we can model homophily patterns via the magnitude of its diagonal entries. Anti-correlations in these patterns can be modeled via negative entries in $\mathbf{W}_k$.
Hence, in Equation 3 we compute the score of a triple $x_{ijk}$ via the weighted sum of all pairwise interactions between the latent features of the entities $e_i$ and $e_j$. The parameters of the model are $\Theta = \{\{\mathbf{e}_i\}_{i=1}^{N_e}, \{\mathbf{W}_k\}_{k=1}^{N_r}\}$. During training we jointly learn the latent representations of entities and how the latent features interact for particular relation types.
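The bilinear score can be computed directly, as the following sketch shows for the Alec Guinness example. All numeric values here are hypothetical and chosen purely for illustration (feature order: Good Actor, Prestigious Award); the function names are likewise our own.

```python
import numpy as np

def rescal_score(e_i, W_k, e_j):
    """Bilinear score e_i^T W_k e_j of the triple (e_i, r_k, e_j)."""
    return e_i @ W_k @ e_j

def rescal_score_sum(e_i, W_k, e_j):
    """Same score written as the explicit double sum over feature pairs."""
    H = len(e_i)
    return sum(W_k[a, b] * e_i[a] * e_j[b] for a in range(H) for b in range(H))

# Hypothetical latent features: [Good Actor, Prestigious Award].
e_guinness = np.array([0.9, 0.2])
e_award = np.array([0.2, 0.8])
# Hypothetical weights: a good-actor subject interacts strongly with a
# prestigious-award object (entry [0, 1]); all other interactions are weak.
W_received = np.array([[0.1, 0.9],
                       [0.1, 0.1]])

forward = rescal_score(e_guinness, W_received, e_award)   # approx. 0.686
reverse = rescal_score(e_award, W_received, e_guinness)   # approx. 0.142
print(forward, reverse)
```

Because the weight matrix is not required to be symmetric, swapping subject and object changes the score, so the model can represent directed relations.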
In the following, we will discuss further important properties of the model for learning from knowledge graphs.
Relational learning via shared representations
In Equation 3, entities have the same latent representation regardless of whether they occur as subjects or objects in a relationship. Furthermore, they have the same representation over all different relation types. For instance, the $i$-th entity occurs in the triple $x_{ijk}$ as the subject of a relationship of type $k$, while it occurs in the triple $x_{piq}$ as the object of a relationship of type $q$. However, the predictions $f_{ijk} = \mathbf{e}_i^\top \mathbf{W}_k \mathbf{e}_j$ and $f_{piq} = \mathbf{e}_p^\top \mathbf{W}_q \mathbf{e}_i$ both use the same latent representation $\mathbf{e}_i$ of the $i$-th entity. Since all parameters are learned jointly, these shared representations allow information to propagate between triples via the latent representations of entities and the weights of relations. This allows the model to capture global dependencies in the data.
Semantic embeddings
The shared entity representations in RESCAL also capture the similarity of entities in the relational domain, i.e., that entities are similar if they are connected to similar entities via similar relations [65]. For instance, if the representations of $e_i$ and $e_p$ are similar, the predictions $f_{ijk}$ and $f_{pjk}$ will have similar values. In return, entities with many similar observed relationships will have similar latent representations. This property can be exploited for entity resolution and has also enabled large-scale hierarchical clustering on relational data [63, 64]. Moreover, since relational similarity is expressed via the similarity of vectors, the latent representations can act as proxies to give non-relational machine learning algorithms such as $k$-means or kernel methods access to the relational similarity of entities.
Connection to tensor factorization
RESCAL is similar to methods used in recommendation systems [66], and to traditional tensor factorization methods [67]. In matrix notation, Equation 3 can be written compactly as $\mathbf{F}_k = \mathbf{E} \mathbf{W}_k \mathbf{E}^\top$, where $\mathbf{F}_k \in \mathbb{R}^{N_e \times N_e}$ is the matrix holding all scores for the $k$-th relation and the $i$-th row of $\mathbf{E} \in \mathbb{R}^{N_e \times H_e}$ holds the latent representation of $e_i$. See Figure 4 for an illustration. In the following, we will use this tensor representation to derive a very efficient algorithm for parameter estimation.
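The equivalence between the per-triple score and the matrix form can be checked numerically. The sketch below uses random parameters and small, arbitrarily chosen dimensions; storing the relation weights as one array `W` of shape (number of relations, features, features) is an assumption of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
N_e, N_r, H_e = 4, 3, 2                   # toy sizes, chosen arbitrarily
E = rng.normal(size=(N_e, H_e))           # row i holds the vector e_i
W = rng.normal(size=(N_r, H_e, H_e))      # W[k] is the weight matrix W_k

# One score matrix F_k = E W_k E^T per relation, stacked as frontal slices.
F = np.stack([E @ W[k] @ E.T for k in range(N_r)], axis=2)

# The same tensor computed in a single contraction over the latent features.
F_einsum = np.einsum("ia,kab,jb->ijk", E, W, E)

print(F.shape)                            # (4, 4, 3)
print(np.allclose(F, F_einsum))           # True
```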
Fitting the model
If we want to compute a probabilistic model, the parameters of RESCAL can be estimated by minimizing the log-loss using gradient-based methods such as stochastic gradient descent [68]. RESCAL can also be computed as a score-based model, which has the main advantage that we can estimate the parameters $\Theta$ very efficiently: Due to its tensor structure and due to the sparsity of the data, it has been shown that the RESCAL model can be computed via a sequence of efficient closed-form updates when using the squared-loss [63, 64]. In this setting, it has been shown analytically that a single update of $\mathbf{E}$ and $\mathbf{W}_k$ scales linearly with the number of entities $N_e$, linearly with the number of relations $N_r$, and linearly with the number of observed triples, i.e., the number of non-zero entries in $\underline{\mathbf{Y}}$ [64]. We call this algorithm RESCAL-ALS (ALS stands for Alternating Least-Squares). In practice, a small number (say 30 to 50) of iterated updates are often sufficient for RESCAL-ALS to arrive at stable estimates of the parameters. Given a current estimate of $\mathbf{E}$, the updates for each $\mathbf{W}_k$ can be computed in parallel to improve the scalability on knowledge graphs with a large number of relations. Furthermore, by exploiting the special tensor structure of RESCAL, we can derive improved updates for RESCAL-ALS whose runtime complexity for a single update is lower than that of naive updates [65, 69]. In summary, for relational domains that can be explained via a moderate number of latent features, RESCAL-ALS is highly scalable and very fast to compute. For more detail on RESCAL-ALS see also Equation 26 in Section VII.
Decoupled Prediction
In Equation 3, the probability of a single relationship is computed via simple matrix-vector products in $O(H_e^2)$ time. Hence, once the parameters have been estimated, the computational complexity to predict the score of a triple depends only on the number of latent features and is independent of the size of the graph. However, during parameter estimation, the model can capture global dependencies due to the shared latent representations.
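As a minimal sketch of this prediction step, the bilinear RESCAL score can be computed with one matrix-vector product followed by a dot product. The helper name `rescal_score` and the toy embeddings below are assumptions made for illustration; they are not taken from the RESCAL implementation.

```python
def rescal_score(e_i, W_k, e_j):
    """Bilinear score e_i^T W_k e_j for one (subject, relation, object) triple."""
    # One matrix-vector product W_k e_j, quadratic in the number of latent features.
    Wk_ej = [sum(w * x for w, x in zip(row, e_j)) for row in W_k]
    # One dot product with the subject embedding.
    return sum(a * b for a, b in zip(e_i, Wk_ej))

# Hypothetical toy parameters with two latent features (made-up values).
E = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
W = {"r": [[0.0, 2.0], [0.0, 0.0]]}

score_ab = rescal_score(E["a"], W["r"], E["b"])  # score of (a, r, b)
score_ba = rescal_score(E["b"], W["r"], E["a"])  # score of (b, r, a)
```

The cost of a single prediction depends only on the size of the embeddings, not on the number of entities. Because the relation matrix need not be symmetric, the two directions of a relation can receive different scores.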
Relational learning results
RESCAL has been shown to achieve state-of-the-art results on a number of relational learning tasks. For instance, [63] showed that RESCAL provides comparable or better relationship prediction results on a number of small benchmark datasets compared to Markov Logic Networks (with structure learning) [70], the Infinite (Hidden) Relational model [71, 72], and Bayesian Clustered Tensor Factorization [73]. Moreover, RESCAL has been used for link prediction on entire knowledge graphs such as YAGO and DBpedia [64, 74]. Aside from link prediction, RESCAL has also been applied successfully to SRL tasks such as entity resolution and link-based clustering. For instance, RESCAL has shown state-of-the-art results in predicting which authors, publications, or publication venues are likely to be identical in publication databases [63, 65]. Furthermore, the semantic embedding of entities computed by RESCAL has been exploited to create taxonomies for uncategorized data via hierarchical clusterings of entities in the embedding space [75].
IV-B Other tensor factorization models
Various other tensor factorization methods have been explored for learning from knowledge graphs and multi-relational data. [76, 77] factorized adjacency tensors using the CP tensor decomposition to analyze the link structure of Web pages and Semantic Web data, respectively. [78] applied pairwise interaction tensor factorization [79] to predict triples in knowledge graphs. [80] applied factorization machines to large uni-relational datasets in recommendation settings. [81] proposed a tensor factorization model for knowledge graphs with a very large number of different relations.
It is also possible to use discrete latent factors. [82] proposed Boolean tensor factorization to disambiguate facts extracted with OpenIE methods and applied it to large datasets [83]. In contrast to the previously discussed factorizations, Boolean tensor factorizations are discrete models, where adjacency tensors are decomposed into binary factors based on Boolean algebra.
IV-C Matrix factorization methods
Another approach for learning from knowledge graphs is based on matrix factorization, where, prior to the factorization, the adjacency tensor $\mathbf{Y}$ is reshaped into a matrix $\mathbf{Y} \in \mathbb{R}^{N_e^2 \times N_r}$ by associating rows with subject-object pairs $(e_i, e_j)$ and columns with relations $r_k$ (cf. [84, 85]), or into a matrix $\mathbf{Y} \in \mathbb{R}^{N_e \times N_e N_r}$ by associating rows with subjects $e_i$ and columns with relation/object pairs $(r_k, e_j)$ (cf. [86, 87]). Unfortunately, both of these formulations lose information compared to tensor factorization. For instance, if each subject-object pair is modeled via a different latent representation, the information that the relationships $y_{ijk}$ and $y_{pjq}$ share the same object is lost. It also leads to an increased memory complexity, since a separate latent representation is computed for each pair of entities, requiring $O(N_e^2 H_e + N_r H_e)$ parameters (compared to $O(N_e H_e + N_r H_e^2)$ parameters for RESCAL).
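IV-D Multilayer perceptrons
We can interpret RESCAL as creating composite representations of triples and predicting their existence from this representation. In particular, we can rewrite RESCAL as
$$f_{ijk}^{\text{RESCAL}} := \mathbf{w}_k^\top \boldsymbol{\phi}_{ij}^{\text{RESCAL}} \tag{4}$$
$$\boldsymbol{\phi}_{ij}^{\text{RESCAL}} := \mathbf{e}_j \otimes \mathbf{e}_i \tag{5}$$
where $\mathbf{w}_k = \operatorname{vec}(\mathbf{W}_k)$. Equation 4 follows from Equation 3 via the equality $\operatorname{vec}(\mathbf{A}\mathbf{X}\mathbf{B}) = (\mathbf{B}^\top \otimes \mathbf{A})\operatorname{vec}(\mathbf{X})$. Hence, RESCAL represents pairs of entities $(e_i, e_j)$ via the tensor product of their latent feature representations (Equation 5) and predicts the existence of the triple $x_{ijk}$ from $\boldsymbol{\phi}_{ij}$ via $\mathbf{w}_k$ (Equation 4). See also panel (a) of the accompanying figure. For a further discussion of the tensor product to create composite latent representations please see [88, 89, 90].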
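The equivalence between the bilinear form (Equation 3) and the tensor-product form (Equations 4 and 5) can be checked numerically. The sketch below assumes a column-major vectorization of the relation matrix and uses made-up numbers; the helper names `vec`, `kron`, and `bilinear` are hypothetical.

```python
def vec(M):
    """Column-major vectorization: stack the columns of M into one list."""
    rows, cols = len(M), len(M[0])
    return [M[r][c] for c in range(cols) for r in range(rows)]

def kron(u, v):
    """Kronecker (tensor) product of two vectors: [u_1 * v, u_2 * v, ...]."""
    return [a * b for a in u for b in v]

def bilinear(e_i, W, e_j):
    """Direct evaluation of e_i^T W e_j as a double sum over latent features."""
    return sum(e_i[a] * W[a][b] * e_j[b]
               for a in range(len(e_i)) for b in range(len(e_j)))

# Made-up embeddings and relation matrix with two latent features.
e_i, e_j = [1.0, 2.0], [3.0, -1.0]
W_k = [[0.5, 1.0], [-2.0, 4.0]]

phi_ij = kron(e_j, e_i)   # composite pair representation: e_j (x) e_i
w_k = vec(W_k)            # relation weights as one long vector
tensor_form = sum(w * p for w, p in zip(w_k, phi_ij))
bilinear_form = bilinear(e_i, W_k, e_j)
```

Both forms enumerate every pairwise product of a subject feature with an object feature, which is why the number of relation weights grows quadratically with the number of latent features.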
Since the tensor product explicitly models all pairwise interactions, RESCAL can require a lot of parameters when the number of latent features is large (each matrix $\mathbf{W}_k$ has $H_e^2$ entries). This can, for instance, lead to scalability problems on knowledge graphs with a large number of relations.
In the following we will discuss models based on multi-layer perceptrons (MLPs), also known as feed-forward neural networks. In the context of multidimensional data they can be referred to as multiway neural networks. This approach allows us to consider alternative ways to create composite triple representations and to use nonlinear functions to predict their existence.
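In particular, let us define the following E-MLP model (E for entity):
$$f_{ijk}^{\text{E-MLP}} := \mathbf{w}_k^\top \mathbf{g}(\mathbf{h}_{ijk}^a) \tag{6}$$
$$\mathbf{h}_{ijk}^a := \mathbf{A}_k^\top \boldsymbol{\phi}_{ij}^{\text{E-MLP}} \tag{7}$$
$$\boldsymbol{\phi}_{ij}^{\text{E-MLP}} := [\mathbf{e}_i; \mathbf{e}_j] \tag{8}$$
where $\mathbf{g}(\mathbf{u}) = [g(u_1), g(u_2), \ldots]$ is the function $g$ applied element-wise to vector $\mathbf{u}$; one often uses the nonlinear function $g(u) = \tanh(u)$.
Here $\mathbf{h}^a$ is an additive hidden layer, which is derived by adding together different weighted components of the entity representations. In particular, we create a composite representation $\boldsymbol{\phi}_{ij}^{\text{E-MLP}} = [\mathbf{e}_i; \mathbf{e}_j] \in \mathbb{R}^{2H_e}$ via the concatenation of $\mathbf{e}_i$ and $\mathbf{e}_j$. However, concatenation alone does not consider any interactions between the latent features of $e_i$ and $e_j$. For this reason, we add a (vector-valued) hidden layer $\mathbf{h}^a$ of size $H_a$, from which the final prediction is derived via $\mathbf{w}_k^\top \mathbf{g}(\mathbf{h}^a)$. The important difference to tensor-product models like RESCAL is that we learn the interactions of latent features via the matrix $\mathbf{A}_k$ (Equation 7), while the tensor product always considers all possible interactions between latent features. This adaptive approach can reduce the number of required parameters significantly, especially on datasets with a large number of relations.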
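One disadvantage of the E-MLP is that it has to define a vector $\mathbf{w}_k$ and a matrix $\mathbf{A}_k$ for every possible relation, which requires $H_a + (H_a \times 2H_e)$ parameters per relation. An alternative is to embed the relation itself, using an $H_r$-dimensional vector $\mathbf{r}_k$. We can then define
$$f_{ijk}^{\text{ER-MLP}} := \mathbf{w}^\top \mathbf{g}(\mathbf{h}_{ijk}^c) \tag{9}$$
$$\mathbf{h}_{ijk}^c := \mathbf{C}^\top \boldsymbol{\phi}_{ijk}^{\text{ER-MLP}} \tag{10}$$
$$\boldsymbol{\phi}_{ijk}^{\text{ER-MLP}} := [\mathbf{e}_i; \mathbf{e}_j; \mathbf{r}_k] \tag{11}$$
We call this model the ER-MLP, since it applies an MLP to an embedding of the entities and relations. Please note that ER-MLP uses a global weight vector $\mathbf{w}$ for all relations. This model was used in the KV project (see Section IX), since it has many fewer parameters than the E-MLP (see Table V); the reason is that $\mathbf{C}$ is independent of the relation $k$.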
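The following sketch illustrates how an ER-MLP style score could be computed: the entity and relation embeddings are concatenated, passed through one shared hidden layer with a tanh nonlinearity, and combined with one shared output weight vector. All parameter values and the helper names `matvec_T` and `er_mlp_score` are assumptions chosen for illustration.

```python
import math

def matvec_T(M, x):
    """Compute M^T x, where M is a list of len(x) rows."""
    return [sum(M[r][c] * x[r] for r in range(len(x)))
            for c in range(len(M[0]))]

def er_mlp_score(e_i, e_j, r_k, C, w):
    """Score a triple from the concatenated embeddings of subject, object, relation."""
    phi = e_i + e_j + r_k                  # list concatenation [e_i; e_j; r_k]
    h = matvec_T(C, phi)                   # shared hidden layer, same C for every relation
    return sum(wi * math.tanh(hi) for wi, hi in zip(w, h))

# Made-up parameters: one latent feature per entity and relation, two hidden units.
C = [[1.0, 0.0],
     [0.0, 1.0],
     [0.5, -0.5]]
w = [1.0, 1.0]
score = er_mlp_score([1.0], [2.0], [2.0], C, w)
other_relation = er_mlp_score([1.0], [2.0], [-2.0], C, w)
```

Adding a new relation in this sketch only adds one embedding vector, since the hidden-layer matrix and the output weights are shared across relations.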
It has been shown in [91] that MLPs can learn to put “semantically similar” words close by in the embedding space, even if they are not explicitly trained to do so. [28] shows a similar result for the semantic embedding of relations using ER-MLP. For example, Table IV shows the nearest neighbors of latent representations of selected relations that have been computed with a 60-dimensional model on Freebase. Numbers in parentheses represent squared Euclidean distances. It can be seen that ER-MLP puts semantically related relations near each other. For instance, the closest relations to the children relation are parents, spouse, and birthplace.
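Table IV: Nearest neighbors of the latent representations of selected relations, computed with ER-MLP on Freebase. Squared Euclidean distances are given in parentheses.
Relation | Nearest Neighbors
children | parents (0.4) | spouse (0.5) | birthplace (0.8)
birthdate | children (1.24) | gender (1.25) | parents (1.29)
eduend | jobstart (1.41) | edustart (1.61) | jobend (1.74)
Note: the relations edustart, eduend, jobstart, and jobend represent the start and end dates of attending an educational institution and holding a particular job, respectively.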
IV-E Neural tensor networks
Table V: Summary of the latent feature models discussed in this section.
Method | Num. Parameters
RESCAL [64] |
E-MLP [92] |
ER-MLP [28] |
NTN [92] |
Structured Embeddings [93] |
TransE [94] |
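We can combine traditional MLPs with bilinear models, resulting in what [92] calls a “neural tensor network” (NTN). More precisely, we can define the NTN model as follows:
$$f_{ijk}^{\text{NTN}} := \mathbf{w}_k^\top \mathbf{g}([\mathbf{h}_{ijk}^a; \mathbf{h}_{ijk}^b]) \tag{12}$$
$$\mathbf{h}_{ijk}^a := \mathbf{A}_k^\top [\mathbf{e}_i; \mathbf{e}_j] \tag{13}$$
$$\mathbf{h}_{ijk}^b := [\mathbf{e}_i^\top \mathbf{B}_k^1 \mathbf{e}_j, \ldots, \mathbf{e}_i^\top \mathbf{B}_k^{H_b} \mathbf{e}_j] \tag{14}$$
Here $\underline{\mathbf{B}}_k$ is a tensor, where the $h$-th slice $\mathbf{B}_k^h$ has size $H_e \times H_e$, and there are $H_b$ slices. We call $\mathbf{h}_{ijk}^b$ a bilinear hidden layer, since it is derived from a weighted combination of multiplicative terms.
NTN is a generalization of the RESCAL approach, as we explain in Section XII-A. Also, it uses the additive layer from the E-MLP model. However, it has many more parameters than the E-MLP or RESCAL models. Indeed, the results in [95] and [28] both show that it tends to overfit, at least on the (relatively small) datasets used in those papers.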
IV-F Latent distance models
Another class of models are latent distance models (also known as latent space models in social network analysis), which derive the probability of relationships from the distance between latent representations of entities: entities are likely to be in a relationship if their latent representations are close according to some distance measure. For uni-relational data, [96] first proposed this approach in the context of social networks by modeling the probability of a relationship via the score function $f(e_i, e_j) = -d(\mathbf{e}_i, \mathbf{e}_j)$, where $d(\cdot, \cdot)$ refers to an arbitrary distance measure such as the Euclidean distance.
The structured embedding (SE) model [93] extends this idea to multi-relational data by modeling the score of a triple as:
$$f_{ijk}^{\text{SE}} := -\|\mathbf{A}_k^s \mathbf{e}_i - \mathbf{A}_k^o \mathbf{e}_j\|_1 = -\|\mathbf{h}_{ijk}^a\|_1 \tag{15}$$
where $\mathbf{A}_k^\top = [\mathbf{A}_k^s \;\; {-\mathbf{A}_k^o}]$. In Equation 15 the matrices $\mathbf{A}_k^s$, $\mathbf{A}_k^o$ transform the global latent feature representations of entities to model relationships specifically for the $k$-th relation. The transformations are learned using the ranking loss in such a way that pairs of entities in existing relationships are closer to each other than entities in non-existing relationships.
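To reduce the number of parameters over the SE model, the TransE model [94] translates the latent feature representations via a relation-specific offset instead of transforming them via matrix multiplications. In particular, the score of a triple is defined as:
$$f_{ijk}^{\text{TransE}} := -d(\mathbf{e}_i + \mathbf{r}_k, \mathbf{e}_j) \tag{16}$$
This model is inspired by the results in [91], who showed that some relationships between words could be computed by their vector difference in the embedding space. As noted in [95], under unit-norm constraints on $\mathbf{e}_i$, $\mathbf{e}_j$ and using the squared Euclidean distance, we can rewrite Equation 16, up to an additive constant, as follows:
$$f_{ijk}^{\text{TransE}} = -\left(2\mathbf{r}_k^\top(\mathbf{e}_i - \mathbf{e}_j) - 2\mathbf{e}_i^\top \mathbf{e}_j + \|\mathbf{r}_k\|_2^2\right) \tag{17}$$
Furthermore, if we assume $\mathbf{A}_k = [\mathbf{r}_k; -\mathbf{r}_k]$, so that $h_{ijk}^a = [\mathbf{r}_k; -\mathbf{r}_k]^\top [\mathbf{e}_i; \mathbf{e}_j] = \mathbf{r}_k^\top(\mathbf{e}_i - \mathbf{e}_j)$, and $\mathbf{B}_k = \mathbf{I}$, so that $h_{ijk}^b = \mathbf{e}_i^\top \mathbf{e}_j$, then we can rewrite this model as follows:
$$f_{ijk}^{\text{TransE}} = -\left(2h_{ijk}^a - 2h_{ijk}^b + \|\mathbf{r}_k\|_2^2\right) \tag{18}$$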
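The rewriting of the TransE score into an additive term and a bilinear term can be verified numerically. The sketch below assumes the squared Euclidean distance and unit-norm entity vectors, with made-up values; under these assumptions the direct score and the expanded form differ only by the constant 2, which does not change the ranking of triples.

```python
import math

def unit(v):
    """Rescale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def transe_score(e_i, r_k, e_j):
    """Negative squared Euclidean distance between e_i + r_k and e_j."""
    return -sum((a + r - b) ** 2 for a, r, b in zip(e_i, r_k, e_j))

# Made-up embeddings; entity vectors are normalized, the relation offset is not.
e_i, e_j = unit([1.0, 2.0, 2.0]), unit([0.0, 3.0, 4.0])
r_k = [0.1, -0.2, 0.3]

direct = transe_score(e_i, r_k, e_j)
h_a = dot(r_k, [a - b for a, b in zip(e_i, e_j)])   # additive term r_k^T (e_i - e_j)
h_b = dot(e_i, e_j)                                  # bilinear term e_i^T e_j
expanded = -(2 * h_a - 2 * h_b + dot(r_k, r_k))
```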
IV-G Comparison of models
Table V summarizes the different models we have discussed. A natural question is: which model is best? [28] showed that the ER-MLP model outperformed the NTN model on their particular dataset. [95] performed a more extensive experimental comparison of these models, and found that RESCAL (called the bilinear model) worked best on two link prediction tasks. However, the best model will clearly be dataset dependent.
V Graph Feature Models
In this section, we assume that the existence of an edge can be predicted by extracting features from the observed edges in the graph. For example, due to social conventions, parents of a person are often married, so we could predict the triple (John, marriedTo, Mary) from the existence of the path John $\xrightarrow{\text{parentOf}}$ Anne $\xleftarrow{\text{parentOf}}$ Mary, representing a common child. In contrast to latent feature models, this kind of reasoning explains triples directly from the observed triples in the knowledge graph. We will now discuss some models of this kind.
V-A Similarity measures for uni-relational data
Observable graph feature models are widely used for link prediction in graphs that consist only of a single relation, e.g., social network analysis (friendships between people), biology (interactions of proteins), and Web mining (hyperlinks between Web sites). The intuition behind these methods is that similar entities are likely to be related (homophily) and that the similarity of entities can be derived from the neighborhood of nodes or from the existence of paths between nodes. For this purpose, various indices have been proposed to measure the similarity of entities, which can be classified into local, global, and quasi-local approaches [97].
Local similarity indices such as Common Neighbors, the Adamic-Adar index [98] or Preferential Attachment [99] derive the similarity of entities from their number of common neighbors or their absolute number of neighbors. Local similarity indices are fast to compute for single relationships and scale well to large knowledge graphs as their computation depends only on the direct neighborhood of the involved entities. However, they can be too localized to capture important patterns in relational data and cannot model long-range or global dependencies.
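The three local indices named above can be sketched in a few lines, using their commonly cited definitions: the count of shared neighbors, shared neighbors weighted by the inverse log of their degree, and the product of the two degrees. The toy graph and function names are assumptions for illustration.

```python
import math

def common_neighbors(adj, u, v):
    """Number of neighbors shared by u and v."""
    return len(adj[u] & adj[v])

def adamic_adar(adj, u, v):
    """Shared neighbors, each weighted by 1 / log(degree of the shared neighbor)."""
    return sum(1.0 / math.log(len(adj[z])) for z in adj[u] & adj[v])

def preferential_attachment(adj, u, v):
    """Product of the degrees of u and v."""
    return len(adj[u]) * len(adj[v])

# Made-up undirected graph stored as neighbor sets.
adj = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"b", "d"},
}
cn = common_neighbors(adj, "a", "b")
aa = adamic_adar(adj, "a", "b")
pa = preferential_attachment(adj, "a", "b")
```

Each index only touches the neighbor sets of the two query nodes, which is why these measures remain cheap on large graphs.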
Global similarity indices such as the Katz index [100] and the Leicht-Holme-Newman index [101] derive the similarity of entities from the ensemble of all paths between entities, while indices like Hitting Time, Commute Time, and PageRank [102] derive the similarity of entities from random walks on the graph. Global similarity indices often provide significantly better predictions than local indices, but are also computationally more expensive [97, 56].
Quasi-local similarity indices like the Local Katz index [56] or Local Random Walks [103] try to balance predictive accuracy and computational complexity by deriving the similarity of entities from paths and random walks of bounded length.
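To illustrate the idea of bounding the path length, the sketch below computes a truncated Katz-style score: walks between two nodes are counted up to a maximum length and damped geometrically. This is an assumption-laden illustration in the spirit of the quasi-local indices above, not necessarily the exact definition used in [56]; the damping factor `beta` and the cutoff `max_len` are hypothetical parameters.

```python
def truncated_katz(adj, u, v, beta=0.5, max_len=3):
    """Sum over l = 1..max_len of beta**l times the number of walks of length l from u to v."""
    counts = {u: 1}          # number of walks of the current length ending at each node
    score = 0.0
    for length in range(1, max_len + 1):
        nxt = {}
        for node, c in counts.items():
            for nb in adj[node]:
                nxt[nb] = nxt.get(nb, 0) + c
        counts = nxt
        score += beta ** length * counts.get(v, 0)
    return score

# Made-up path graph a - b - c.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
katz_ac = truncated_katz(adj, "a", "c")
katz_ab = truncated_katz(adj, "a", "b")
```

Raising `max_len` moves the score toward a global index at higher cost, while `max_len=1` reduces it to a scaled adjacency test.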
In Section V-C, we will discuss an approach that extends this idea of quasi-local similarity indices for uni-relational networks to learn from large multi-relational knowledge graphs.
V-B Rule Mining and Inductive Logic Programming
Another class of models that works on the observed variables of a knowledge graph extracts rules via mining methods and uses these extracted rules to infer new links. The extracted rules can also be used as a basis for Markov Logic as discussed in Section VIII. For instance, ALEPH is an Inductive Logic Programming (ILP) system that attempts to learn rules from relational data via inverse entailment [104] (for more information on ILP see e.g., [105, 3, 106]). AMIE is a rule mining system that extracts logical rules (in particular Horn clauses) based on their support in a knowledge graph [107, 108]. In contrast to ALEPH, AMIE can handle the open-world assumption of knowledge graphs and has been shown to be up to three orders of magnitude faster on large knowledge graphs [108]. The basis for the Semantic Web is Description Logic, and [109, 110, 111] describe logic-oriented machine learning approaches in this context. Also worth mentioning are data mining approaches for knowledge graphs as described in [112, 113, 114]. An advantage of rule-based systems is that they are easily interpretable, as the model is given as a set of logical rules. However, rules over observed variables usually cover only a subset of patterns in knowledge graphs (or relational data), and useful rules can be challenging to learn.
V-C Path Ranking Algorithm
The Path Ranking Algorithm (PRA) [115, 116] extends the idea of using random walks of bounded lengths for predicting links in multi-relational knowledge graphs. In particular, let $\pi_L(i, j, k, t)$ denote a path of length $L$ of the form $e_i \xrightarrow{r_1} e_2 \xrightarrow{r_2} e_3 \cdots \xrightarrow{r_L} e_j$, where $t$ represents the sequence of edge types $t = (r_1, r_2, \ldots, r_L)$. We also require there to be a direct arc $e_i \xrightarrow{r_k} e_j$, representing the existence of a relationship of type $k$ from $e_i$ to $e_j$. Let $\Pi_L(i, j, k)$ represent the set of all such paths of length $L$, ranging over path types $t$. (We can discover such paths by enumerating all (type-consistent) paths from entities of type $e_i$ to entities of type $e_j$. If there are too many relations to make this feasible, we can perform random sampling.)
We can compute the probability of following such a path by assuming that at each step, we follow an outgoing link uniformly at random. Let $P(\pi_L(i, j, k, t))$ be the probability of this particular path; this can be computed recursively by a sampling procedure, similar to PageRank (see [116] for details). The key idea in PRA is to use these path probabilities as features for predicting the probability of missing edges. More precisely, define the feature vector
$$\boldsymbol{\phi}_{ijk}^{\text{PRA}} = [P(\pi) : \pi \in \Pi_L(i, j, k)] \tag{19}$$
We can then predict the edge probabilities using logistic regression:
$$f_{ijk}^{\text{PRA}} := \mathbf{w}_k^\top \boldsymbol{\phi}_{ijk}^{\text{PRA}} \tag{20}$$
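A minimal sketch of this pipeline is shown below: path probabilities are computed by propagating a random-walk distribution along a given sequence of relation types, and the resulting feature vector is passed through a logistic function. It follows one reading of the description above, in which each step picks uniformly among all outgoing links of the current node; the toy graph, the weights, and the function names are made up for illustration and are not the implementation of [116].

```python
import math

def path_prob(out_edges, start, path_type, target):
    """Probability that a uniform random walk from `start` follows the relation
    sequence `path_type` and ends at `target`."""
    dist = {start: 1.0}
    for rel in path_type:
        nxt = {}
        for node, p in dist.items():
            edges = out_edges.get(node, [])
            for r, dst in edges:
                if r == rel:
                    # Each outgoing link is followed with probability 1 / out-degree.
                    nxt[dst] = nxt.get(dst, 0.0) + p / len(edges)
        dist = nxt
    return dist.get(target, 0.0)

def pra_probability(features, weights):
    """Logistic regression on the path-probability features."""
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Made-up graph: person p, team t1, sibling q, college c.
out_edges = {
    "p": [("draftedBy", "t1"), ("sibling", "q")],
    "t1": [("school", "c")],
    "q": [("education", "c")],
}
path_types = [("draftedBy", "school"), ("sibling", "education")]
phi = [path_prob(out_edges, "p", t, "c") for t in path_types]
prob = pra_probability(phi, weights=[2.0, 1.0])
```

Each entry of `phi` corresponds to one path type, so each learned weight can be read as the strength of one candidate rule.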
Interpretability
A useful property of PRA is that its model is easily interpretable. In particular, relation paths can be regarded as bodies of weighted rules (more precisely, Horn clauses), where the weight specifies how predictive the body of the rule is for the head. For instance, Table VI shows some relation paths along with their weights that have been learned by PRA in the KV project (see Section IX) to predict which college a person attended, i.e., to predict triples of the form (p, college, c). The first relation path in Table VI can be interpreted as follows: it is likely that a person attended a college if the sports team that drafted the person is from the same college. This can be written in the form of a Horn clause as follows:
(p, college, c) $\leftarrow$ (p, draftedBy, t) $\wedge$ (t, school, c).
By using a sparsity-promoting prior on $\mathbf{w}_k$, we can perform feature selection, which is equivalent to rule learning.
Relational learning results
PRA has been shown to outperform the ILP method FOIL [106] for link prediction in NELL [116]. It has also been shown to have comparable performance to ER-MLP on link prediction in KV: PRA obtained a result of 0.884 for the area under the ROC curve, as compared to 0.882 for ER-MLP [28].
Table VI: Relation paths learned by PRA in the KV project to predict triples of the form (p, college, c), with their F1 score, precision, recall, and learned weight.
Relation Path | F1 | Prec | Rec | Weight
(draftedBy, school) | 0.03 | 1.0 | 0.01 | 2.62
(sibling(s), sibling, education, institution) | 0.05 | 0.55 | 0.02 | 1.88
(spouse(s), spouse, education, institution) | 0.06 | 0.41 | 0.02 | 1.87
(parents, education, institution) | 0.04 | 0.29 | 0.02 | 1.37
(children, education, institution) | 0.05 | 0.21 | 0.02 | 1.85
(placeOfBirth, peopleBornHere, education) | 0.13 | 0.1 | 0.38 | 6.4
(type, instance, education, institution) | 0.05 | 0.04 | 0.34 | 1.74
(profession, peopleWithProf., edu., inst.) | 0.04 | 0.03 | 0.33 | 2.19
VI Combining latent and graph feature models
It has been observed experimentally (see, e.g., [28]) that neither state-of-the-art relational latent feature models (RLFMs) nor state-of-the-art graph feature models are superior for learning from knowledge graphs. Instead, the strengths of latent and graph-based models are often complementary (see e.g., [117]), as the two families focus on different aspects of relational data:

Latent feature models are well-suited for modeling global relational patterns via newly introduced latent variables. They are computationally efficient if triples can be explained with a small number of latent variables.

Graph feature models are well-suited for modeling local and quasi-local graph patterns. They are computationally efficient if triples can be explained from the neighborhood of entities or from short paths in the graph.
There has also been some theoretical work comparing these two approaches [118]. In particular, it has been shown that tensor factorization can be inefficient when relational data consists of a large number of strongly connected components. Fortunately, such “problematic” relations can often be handled efficiently via graph-based models. A good example is the marriedTo relation: one marriage corresponds to a single strongly connected component, so data with a large number of marriages would be difficult to model with RLFMs. However, predicting marriedTo links via graph-based models is easy: the existence of the triple (John, marriedTo, Mary) can be simply predicted from the existence of (Mary, marriedTo, John), by exploiting the symmetry of the relation. If the (Mary, marriedTo, John) edge is unknown, we can use statistical patterns, such as the existence of shared children.
Combining the strengths of latent and graph-based models is therefore a promising approach to increase the predictive performance of graph models. It typically also speeds up training. We now discuss some ways of combining these two kinds of models.
VI-A Additive relational effects model
[118] proposed the additive relational effects (ARE) model, which is a way to combine RLFMs with observable graph models. In particular, if we combine RESCAL with PRA, we get
$$f_{ijk}^{\text{RESCAL+PRA}} = \mathbf{w}_k^{(1)\top} \boldsymbol{\phi}_{ij}^{\text{RESCAL}} + \mathbf{w}_k^{(2)\top} \boldsymbol{\phi}_{ijk}^{\text{PRA}} \tag{21}$$
ARE models can be trained by alternating between optimizing the RESCAL parameters and optimizing the PRA parameters. The key benefit is that RESCAL now only has to model the “residual errors” that cannot be modelled by the observable graph patterns. This allows the method to use a much lower latent dimensionality, which significantly reduces training time. The resulting combined model also has increased accuracy [118].
VI-B Other combined models
In addition to ARE, further models have been explored to learn jointly from latent and observable patterns on relational data. [84, 85] combined a latent feature model with an additive term to learn from latent and neighborhood-based information on multi-relational data, as follows ([85] additionally considered a term based on a (non-composite) latent feature representation of subject-object pairs):
$$f_{ijk}^{\text{ADD}} := \mathbf{w}_{k,j}^{(1)\top} \boldsymbol{\phi}_i^{\text{SUB}} + \mathbf{w}_{k,i}^{(2)\top} \boldsymbol{\phi}_j^{\text{OBJ}} + \mathbf{w}_k^{(3)\top} \boldsymbol{\phi}_{ijk}^{N} \tag{22}$$
$$\boldsymbol{\phi}_{ijk}^{N} := [y_{ijk'} : k' \neq k] \tag{23}$$
Here, $\boldsymbol{\phi}_i^{\text{SUB}}$ is the latent representation of entity $e_i$ as a subject and $\boldsymbol{\phi}_j^{\text{OBJ}}$ is the latent representation of entity $e_j$ as an object. The term $\boldsymbol{\phi}_{ijk}^{N}$ efficiently captures patterns where the existence of a triple $y_{ijk'}$ is predictive of another triple $y_{ijk}$ between the same pair of entities (but of a different relation type). For instance, if Leonard Nimoy was born in Boston, it is also likely that he lived in Boston. This dependency between the relation types bornIn and livedIn can be modeled in Equation 23 by assigning a large weight to $w_{\text{bornIn},\text{livedIn}}$.
ARE and the models of [84] and [85] are similar in spirit to the model of [119], which augments SVD (i.e., matrix factorization) of a rating matrix with additive terms to include local neighborhood information. Similarly, factorization machines [120] make it possible to combine latent and observable patterns, by modeling higher-order interactions between input variables via low-rank factorizations [78].
An alternative way to combine different prediction systems is to fit them separately, and use their outputs as inputs to another “fusion” system. This is called stacking [121]. For instance, [28] used the output of PRA and ER-MLP as scalar features, and learned a final “fusion” layer by training a binary classifier. Stacking has the advantage that it is very flexible in the kinds of models that can be combined. However, it has the disadvantage that the individual models cannot cooperate, and thus any individual model needs to be more complex than in a combined model which is trained jointly. For example, if we fit RESCAL separately from PRA, we will need a larger number of latent features than if we fit them jointly.
VII Training SRL models on knowledge graphs
In this section we discuss aspects of training the previously discussed models that are specific to knowledge graphs, such as how to handle the open-world assumption of knowledge graphs, how to exploit sparsity, and how to perform model selection.
VII-A Penalized maximum likelihood training
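Let us assume we have a set of $N_d$ observed triples and let the $n$-th triple be denoted by $x^n$. Each observed triple is either true (denoted $y^n = 1$) or false (denoted $y^n = 0$). Let this labeled dataset be denoted by $\mathcal{D} = \{(x^n, y^n) \mid n = 1, \ldots, N_d\}$. Given this, a natural way to estimate the parameters $\Theta$ is to compute the maximum a posteriori (MAP) estimate:
$$\max_{\Theta} \sum_{n=1}^{N_d} \log \text{Ber}(y^n \mid \sigma(f(x^n; \Theta))) + \log p(\Theta \mid \lambda) \tag{24}$$
where $\lambda$ controls the strength of the prior. (If the prior is uniform, this is equivalent to maximum likelihood training.) We can equivalently state this as a regularized loss minimization problem:
$$\min_{\Theta} \sum_{n=1}^{N_d} \mathcal{L}(\sigma(f(x^n; \Theta)), y^n) + \lambda\, \text{reg}(\Theta) \tag{25}$$
where $\mathcal{L}(p, y) = -\log \text{Ber}(y \mid p)$ is the log loss function. Another possible loss function is the squared loss, $\mathcal{L}(p, y) = (p - y)^2$. Using the squared loss can be especially efficient in combination with a closed-world assumption (CWA). For instance, using the squared loss and the CWA, the minimization problem for RESCAL becomes
$$\min_{\mathbf{E}, \{\mathbf{W}_k\}} \sum_k \|\mathbf{Y}_k - \mathbf{E}\mathbf{W}_k\mathbf{E}^\top\|_F^2 + \lambda_1 \|\mathbf{E}\|_F^2 + \lambda_2 \sum_k \|\mathbf{W}_k\|_F^2 \tag{26}$$
where $\lambda_1, \lambda_2 \geq 0$ control the degree of regularization. The main advantage of Equation 26 is that it can be optimized via RESCAL-ALS, which consists of a sequence of very efficient, closed-form updates whose computational complexity depends only on the nonzero entries in $\mathbf{Y}$ [63, 64]. We discuss some other loss functions below.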
VII-B Where do the negative examples come from?
One important question is where the labels $y^n$ come from. The problem is that most knowledge graphs only contain positive training examples, since, usually, they do not encode false facts. Hence $y^n = 1$ for all $(x^n, y^n) \in \mathcal{D}$. To emphasize this, we shall use the notation $\mathcal{D}^+$ to represent the observed positive (true) triples: $\mathcal{D}^+ = \{x^n \in \mathcal{D} \mid y^n = 1\}$. Training on all-positive data is tricky, because the model might easily overgeneralize.
One way around this is to make a closed-world assumption and assume that all (type-consistent) triples that are not in $\mathcal{D}^+$ are false. We will denote this negative set as $\mathcal{D}^- = \{x^n \in \mathcal{D} \mid y^n = 0\}$. However, for incomplete knowledge graphs this assumption will be violated. Moreover, $\mathcal{D}^-$ might be very large, since the number of false facts is much larger than the number of true facts. This can lead to scalability issues in training methods that have to consider all negative examples.
An alternative approach to generate negative examples is to exploit known constraints on the structure of a knowledge graph: type constraints for predicates (persons are only married to persons), valid value ranges for attributes (the height of humans is below 3 meters), or functional constraints such as mutual exclusion (a person is born in exactly one city) can all be used for this purpose. Since such examples are based on the violation of hard constraints, it is certain that they are indeed negative examples. Unfortunately, functional constraints are scarce, and negative examples based on type constraints and valid value ranges are usually not sufficient to train useful models: while it is relatively easy to predict that a person is married to another person, it is difficult to predict to which person in particular. For the latter, examples based on type constraints alone are not very informative. A better way to generate negative examples is to “perturb” true triples. In particular, let us define
$$\mathcal{D}^- = \{(e_\ell, r_k, e_j) \mid e_i \neq e_\ell \wedge (e_i, r_k, e_j) \in \mathcal{D}^+\} \cup \{(e_i, r_k, e_\ell) \mid e_j \neq e_\ell \wedge (e_i, r_k, e_j) \in \mathcal{D}^+\}$$
To understand the difference between this approach and the CWA (where we assumed all valid unknown triples were false), let us consider the example in Figure 1. The CWA would generate “good” negative triples such as (LeonardNimoy, starredIn, StarWars), (AlecGuinness, starredIn, StarTrek), etc., but also type-consistent but “irrelevant” negative triples such as (BarackObama, starredIn, StarTrek), etc. (We are assuming, for the sake of this example, that there is a type Person but not a type Actor.) The second approach (based on perturbation) would not generate negative triples such as (BarackObama, starredIn, StarTrek), since BarackObama does not participate in any starredIn events. This reduces the size of $\mathcal{D}^-$, and encourages it to focus on “plausible” negatives. (An even better method, used in Section IX, is to generate the candidate triples from text extraction methods run on the Web. Many of these triples will be false, due to extraction errors, but they define a good set of “plausible” negatives.)
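The perturbation idea can be sketched as follows. To match the remark that entities which never take part in a relation are not used as replacements, this sketch only swaps in subjects (or objects) that already occur in that slot of the same relation. That restriction is an assumption made here, as are the function name and the toy triples.

```python
def perturbed_negatives(positives):
    """Corrupt the subject or the object of each true triple and keep only
    candidates that are not themselves known positives."""
    positives = set(positives)
    subjects, objects = {}, {}
    for s, r, o in positives:
        subjects.setdefault(r, set()).add(s)   # entities seen as subject of r
        objects.setdefault(r, set()).add(o)    # entities seen as object of r
    negatives = set()
    for s, r, o in positives:
        for s2 in subjects[r] - {s}:
            negatives.add((s2, r, o))          # replace the subject
        for o2 in objects[r] - {o}:
            negatives.add((s, r, o2))          # replace the object
    return negatives - positives

# Hypothetical toy knowledge graph.
positives = [
    ("LeonardNimoy", "starredIn", "StarTrek"),
    ("AlecGuinness", "starredIn", "StarWars"),
    ("BarackObama", "profession", "USPresident"),
]
negatives = perturbed_negatives(positives)
```

On this toy input the negatives are the two "plausible" swaps between the actors and films, while no triple pairs BarackObama with starredIn.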
Another option to generate negative examples for training is to make a local-closed world assumption (LCWA) [107, 28], in which we assume that a KG is only locally complete. More precisely, if we have observed any triple for a particular subject-predicate pair $e_i, r_k$, then we will assume that any non-existing triple $(e_i, r_k, \cdot)$ is indeed false and include it in $\mathcal{D}^-$. (The assumption is valid for functional relations, such as bornIn, but not for set-valued relations, such as starredIn.) However, if we have not observed any triple at all for the pair $e_i, r_k$, we will assume that all triples $(e_i, r_k, \cdot)$ are unknown and not include them in $\mathcal{D}^-$.
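A small sketch of how candidate triples could be labeled under the LCWA is given below. The function name, the split into "negative" and "unknown" lists, and the toy triples are assumptions for illustration.

```python
def lcwa_label(positives, candidates):
    """Split candidate triples into assumed negatives and unknowns under the LCWA."""
    positives = set(positives)
    # Subject-predicate pairs for which at least one object is already known.
    known_pairs = {(s, r) for s, r, _ in positives}
    negatives, unknown = [], []
    for triple in candidates:
        if triple in positives:
            continue                      # already known to be true
        s, r, _ = triple
        if (s, r) in known_pairs:
            negatives.append(triple)      # locally complete: treat as false
        else:
            unknown.append(triple)        # nothing known for this pair: leave unlabeled
    return negatives, unknown

positives = [("BarackObama", "bornIn", "Honolulu")]
candidates = [
    ("BarackObama", "bornIn", "Kenya"),
    ("LeonardNimoy", "bornIn", "Boston"),
]
negatives, unknown = lcwa_label(positives, candidates)
```

The second candidate is left unlabeled rather than marked false, because the graph in this toy example contains no birthplace at all for that subject.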
VII-C Pairwise loss training
Given that the negative training examples are not always really negative, an alternative approach to likelihood training is to try to make the probability (or in general, some scoring function) larger for true triples than for assumed-to-be-false triples. That is, we can define the following objective function:
$$\min_{\Theta} \sum_{x^+ \in \mathcal{D}^+} \sum_{x^- \in \mathcal{D}^-} \mathcal{L}(f(x^+; \Theta), f(x^-; \Theta)) + \lambda\, \text{reg}(\Theta) \tag{27}$$
where $\mathcal{L}(f, f')$ is a margin-based ranking loss function such as
$$\mathcal{L}(f, f') = \max(1 + f' - f, 0) \tag{28}$$
This approach has several advantages. First, it does not assume that negative examples are necessarily negative, just that they are “more negative” than the positive ones. Second, it allows the $f(\cdot)$ function to be any function, not just a probability (but we do assume that larger values mean the triple is more likely to be correct).
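The margin-based ranking loss can be written down directly. In the sketch below the margin is exposed as a parameter with default 1, which is an assumption for illustration; the scores are made-up numbers, and the regularization term is omitted.

```python
def margin_ranking_loss(f_pos, f_neg, margin=1.0):
    """Zero once the positive score exceeds the negative score by at least the margin."""
    return max(margin + f_neg - f_pos, 0.0)

def pairwise_objective(pos_scores, neg_scores, margin=1.0):
    """Sum of ranking losses over all (positive, negative) score pairs."""
    return sum(margin_ranking_loss(fp, fn, margin)
               for fp in pos_scores for fn in neg_scores)

well_separated = margin_ranking_loss(2.0, 0.5)   # positive is far enough ahead
violated = margin_ranking_loss(0.2, 0.5)         # negative scores higher than positive
total = pairwise_objective([2.0, 0.2], [0.5])
```

Only pairs that violate the margin contribute to the objective, so the scores themselves need not be calibrated probabilities.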
This kind of objective function is easily optimized by stochastic gradient descent (SGD) [122]: at each iteration, we just sample one positive and one negative example. SGD also scales well to large datasets. However, it can take a long time to converge. On the other hand, as discussed previously, some models, when combined with the squared loss objective, can be optimized using alternating least squares (ALS), which is typically much faster.
VII-D Model selection
Almost all models discussed in previous sections include one or more user-given parameters that are influential for the model’s performance (e.g., dimensionality of latent feature models, length of relation paths for PRA, regularization parameter for penalized maximum likelihood training). Typically, cross-validation over random splits of $\mathcal{D}$ into training, validation, and test sets is used to find good values for such parameters without overfitting (for more information on model selection in machine learning see e.g., [123]). For link prediction and entity resolution, the area under the ROC curve (AUC-ROC) or the area under the precision-recall curve (AUC-PR) are good evaluation criteria. For data with a large number of negative examples (as is typically the case for knowledge graphs), it has been shown that AUC-PR can give a clearer picture of an algorithm’s performance than AUC-ROC [124]. For entity resolution, the mean reciprocal rank (MRR) of the correct entity is an alternative evaluation measure.
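Two of these evaluation measures are easy to sketch from their standard definitions. AUC-ROC is computed here as the fraction of (positive, negative) pairs that are ranked correctly, with ties counted as one half, and MRR as the average of the inverse rank of the correct entity. The function names and the toy scores are assumptions for illustration; the pairwise AUC computation is quadratic and only meant for small inputs.

```python
def auc_roc(scores, labels):
    """Fraction of (positive, negative) pairs in which the positive scores higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_reciprocal_rank(ranked_lists, correct):
    """Average of 1 / rank of the correct candidate; each list is ordered best first."""
    return sum(1.0 / (ranked.index(c) + 1)
               for ranked, c in zip(ranked_lists, correct)) / len(correct)

auc = auc_roc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
mrr = mean_reciprocal_rank([["a", "b", "c"], ["x", "y"]], ["b", "x"])
```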
VIII Markov random fields
In this section we drop the assumption that the random variables $y_{ijk}$ in $\mathbf{Y}$ are conditionally independent. However, in the case of relational data and without the conditional independence assumption, each $y_{ijk}$ can depend on any of the other $N_e \times N_e \times N_r - 1$ random variables in $\mathbf{Y}$. Due to this enormous number of possible dependencies, it quickly becomes intractable to estimate the joint distribution without further constraints, even for very small knowledge graphs. To reduce the number of potential dependencies and arrive at tractable models, in this section we develop template-based graphical models that only consider a small fraction of all possible dependencies. (See [125] for an introduction to graphical models.)
VIII-A Representation
Graphical models use graphs to encode dependencies between random variables. Each random variable (in our case, a possible fact $y_{ijk}$) is represented as a node in the graph, while each dependency between random variables is represented as an edge. To distinguish such graphs from knowledge graphs, we will refer to them as dependency graphs. It is important to be aware of their key difference: while knowledge graphs encode the existence of facts, dependency graphs encode statistical dependencies between random variables.
To avoid problems with cyclical dependencies, it is common to use undirected graphical models, also called Markov Random Fields (MRFs). (Technically, since we are conditioning on some observed features $\mathbf{x}$, this is a Conditional Random Field (CRF), but we will ignore this distinction.) An MRF has the following form:
$$P(\mathbf{Y} \mid \boldsymbol{\theta}) = \frac{1}{Z} \prod_c \psi(\mathbf{y}_c \mid \boldsymbol{\theta}) \tag{29}$$
where $\psi(\mathbf{y}_c \mid \boldsymbol{\theta}) \geq 0$ is a potential function on the $c$-th subset of variables, in particular the $c$-th clique in the dependency graph, and $Z = \sum_{\mathbf{y}} \prod_c \psi(\mathbf{y}_c \mid \boldsymbol{\theta})$ is the partition function, which ensures that the distribution sums to one. The potential functions capture local correlations between variables in each clique in the dependency graph. (Note that in undirected graphical models, the local potentials do not have any probabilistic interpretation, unlike in directed graphical models.) This equation again defines a probability distribution over “possible worlds”, i.e., over joint assignments to the random variables $\mathbf{Y}$.
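For a handful of binary variables the MRF distribution can be computed by brute force, which makes the role of the partition function concrete. The sketch below enumerates all joint assignments, multiplies the clique potentials, and normalizes. The single pairwise potential that favors agreement between two facts is a made-up example, and the enumeration is exponential in the number of variables, so it only illustrates the definition.

```python
from itertools import product

def mrf_distribution(n_vars, cliques):
    """cliques: list of (variable_indices, potential_fn) pairs.
    Returns (probability of each joint assignment, partition function Z)."""
    unnormalized = {}
    for y in product([0, 1], repeat=n_vars):
        value = 1.0
        for idx, psi in cliques:
            value *= psi(tuple(y[i] for i in idx))   # potential on this clique
        unnormalized[y] = value
    Z = sum(unnormalized.values())                   # sum over all possible worlds
    return {y: v / Z for y, v in unnormalized.items()}, Z

def agree(y_c):
    """Hypothetical potential: two facts are more likely to be both true or both false."""
    return 3.0 if y_c[0] == y_c[1] else 1.0

# Two possible facts, e.g. (John, marriedTo, Mary) and (Mary, marriedTo, John).
dist, Z = mrf_distribution(2, [((0, 1), agree)])
```

With the number of possible facts in a real knowledge graph, this enumeration is infeasible, which is why restricting the dependency structure matters.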
The structure of the dependency graph (which defines the cliques in Equation 29) is derived from a template mechanism that can be defined in a number of ways. A common approach is to use Markov logic [126], which is a template language based on logical formulae:
Given a set of formulae $\mathcal{F} = \{F_i\}_{i=1}^{L}$, we create an edge between nodes in the dependency graph if the corresponding facts occur in at least one grounded formula. A grounding of a formula $F_i$ is given by the (type-consistent) assignment of entities to the variables in $F_i$. Furthermore, we define $\psi(\mathbf{y}_c \mid \boldsymbol{\theta})$ such that