1 Introduction
In 2012, Google announced that it had significantly improved the quality of its search engine by utilizing knowledge graphs (Eder, 2012). A knowledge graph is a data set of relational facts represented as triplets (head, relation, tail). The head and tail symbols represent real-world entities, such as people, objects, or places. The relation describes how the two entities are related to each other, e.g., ‘head was founded by tail’ or ‘head graduated from tail’.
While the number of true relational facts among a large set of entities can be enormous, the number of data points in empirical knowledge graphs is often rather small. It is therefore desirable to complete missing facts in a knowledge graph algorithmically, based on patterns detected in the data set of known facts (Nickel et al., 2016a). Such link prediction in relational knowledge graphs has become an important subfield of artificial intelligence (Bordes et al., 2013; Wang et al., 2014; Lin et al., 2015; Nickel et al., 2016b; Trouillon et al., 2016; Wang and Li, 2016; Ji et al., 2016; Shen et al., 2016; Xiao et al., 2017; Shi and Weninger, 2017; Lacroix et al., 2018). A popular approach to link prediction is to fit an embedding model to the observed facts (Kadlec et al., 2017; Nguyen, 2017; Wang et al., 2017). A knowledge graph embedding model represents each entity and each relation by a low-dimensional semantic embedding vector. Over the past six years, these models have made significant progress on link prediction (Bordes et al., 2013; Yang et al., 2015; Nickel et al., 2016b; Trouillon et al., 2016; Lacroix et al., 2018).
However, Kadlec et al. (2017) pointed out that these models are highly sensitive to hyperparameters, specifically the regularization strength. This is not surprising since even large knowledge graphs often contain only few data points per entity (i.e., per embedding vector), and so the regularizer plays an important role. Kadlec et al. (2017) showed that a simple baseline model can outperform more modern models when using carefully tuned hyperparameters.
In addition to being highly sensitive to the regularization strength, knowledge graph embedding models also need vastly different regularization strengths for different embedding vectors. Knowledge graph embedding models are typically trained by minimizing some function $\ell$ of the embedding vectors for each triplet fact $(h, r, t)$ (short for head, relation, and tail) in the training set $\mathcal{D}$. One typically adds a regularizer with some strength $\lambda$ as follows,
$$L(\mathbf{E}, \mathbf{W}) = \sum_{(h,r,t)\in\mathcal{D}} \Big[ \ell(\mathbf{e}_h, \mathbf{w}_r, \mathbf{e}_t) + \lambda \big( \|\mathbf{e}_h\|_p^p + \|\mathbf{w}_r\|_p^p + \|\mathbf{e}_t\|_p^p \big) \Big] \tag{1}$$
Here, $\mathbf{e}_h$, $\mathbf{e}_t$, and $\mathbf{w}_r$ are the embeddings for entity $h$, entity $t$, and relation $r$, respectively. Boldface $\mathbf{E}$ and $\mathbf{W}$ are shorthand for all entity and relation embeddings, respectively, and one typically uses an $L_p$ norm regularizer $\|\cdot\|_p^p$ with, e.g., $p = 2$ or $p = 3$.
It was pointed out by Lacroix et al. (2018) that Eq. 1 implicitly scales the regularization strength proportionally to the frequency of entities and relations in the data set, since the regularizer is inside the sum over training points. This implies vastly different regularization strengths for different embedding vectors, since the frequencies of entities and relations vary over a wide range (Figure 1). As we show in this paper, the general idea of using stronger regularization for more frequent entities and relations can be justified from a Bayesian perspective (for empirical evidence, see (Srebro and Salakhutdinov, 2010)). However, the specific choice to make the regularization strength proportional to the frequency seems more like a historic accident.
Rather than imposing a proportional relationship between frequency and regularization strength, we propose to augment the model family such that each embedding $\mathbf{e}_i$ and $\mathbf{w}_r$ has its individual regularization strength $\lambda_i$ and $\lambda'_r$, respectively. This replaces the loss function from Eq. 1 with

$$L(\mathbf{E}, \mathbf{W}) = \sum_{(h,r,t)\in\mathcal{D}} \ell(\mathbf{e}_h, \mathbf{w}_r, \mathbf{e}_t) + \sum_i \lambda_i \|\mathbf{e}_i\|_p^p + \sum_r \lambda'_r \|\mathbf{w}_r\|_p^p \tag{2}$$

Here, the last two sums run over each entity $i$ and each relation $r$ exactly once (there is only one sum over entities since the same entity embedding vector is used for an entity in either head or tail position).
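As a concrete illustration, the augmented loss of Eq. 2 can be sketched in a few lines of NumPy. This is a minimal sketch under our own naming; the per-triplet data term is passed in precomputed:

```python
import numpy as np

def lp_reg(v, lam, p):
    """One regularizer term: lambda * ||v||_p^p."""
    return lam * np.sum(np.abs(v) ** p)

def augmented_loss(data_loss, E, W, lam_ent, lam_rel, p=3):
    """Eq. 2 (sketch): the data term plus one individually weighted
    L_p regularizer per entity embedding and per relation embedding."""
    reg = sum(lp_reg(E[i], lam_ent[i], p) for i in range(len(E)))
    reg += sum(lp_reg(W[r], lam_rel[r], p) for r in range(len(W)))
    return data_loss + reg
```

Each embedding vector appears in exactly one regularizer term, matching the last two sums of Eq. 2.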
The loss in Eq. 2 contains a macroscopic number of hyperparameters $\lambda_i$ and $\lambda'_r$: many thousands in our largest experiments. It would be impossible to tune such a large number of hyperparameters with traditional grid search, which scales exponentially in the number of hyperparameters. To solve this issue, we propose in this work a probabilistic interpretation of knowledge graph embedding models. The probabilistic interpretation admits efficient hyperparameter tuning with variational expectation-maximization (Dempster et al., 1977; Bernardo et al., 2003). This allows us to optimize over all hyperparameters in parallel, and it leads to models with better predictive performance.
Besides improving performance, our approach also has the potential to accelerate research on new knowledge graph embedding models. Researchers who propose a new model architecture currently have to invest considerable resources into hyperparameter tuning to prove competitiveness with existing, highly tuned models. Our cheap large-scale hyperparameter tuning speeds up iteration on new models.
In detail, our contributions are as follows:

We first augment these models by introducing separate priors for each entity and relation vector. In a non-probabilistic picture, these priors correspond to regularizers. This augmentation makes the models more flexible, but it introduces thousands of new hyperparameters (regularization strengths) that need to be optimized.

We then show how to efficiently tune such augmented models. The large number of hyperparameters rules out both grid search with cross validation and Bayesian optimization, calling for gradient-based hyperparameter optimization. Gradient-based hyperparameter optimization would lead to singular solutions in classical maximum likelihood training. Instead, we propose variational expectation-maximization (EM), which avoids such singularities.

We evaluate our proposed hyperparameter optimization method experimentally for augmented versions of DistMult and ComplEx (source code: https://github.com/mandt-lab/knowledge-graph-tuning). The high tunability of the proposed models, combined with our efficient hyperparameter tuning method, improves the predictive performance over the previous state of the art.
The paper is structured as follows: Section 2 summarizes a large class of knowledge graph embedding models and presents our probabilistic perspective on these models in terms of a generative probabilistic process. Section 3 describes our algorithm for hyperparameter tuning. We present experiments in Section 4, compare our method to related work in Section 5, and conclude in Section 6.
2 Generative Knowledge Graph Embedding Models
In this section, we introduce our notation for a large class of knowledge graph embedding models (KG embeddings) from the literature (Section 2.1), and we then generalize these models in two ways. First, while conventional KG embeddings typically share the same regularization strength across all entity and relation vectors, we lift this constraint and allow each embedding vector to be regularized individually (Section 2.2). Second, we show that the loss functions of conventional KG embeddings, as well as of our augmented model class, can be obtained as point estimates of a probabilistic generative process of the data (Section 2.3). Drawing on this probabilistic perspective, we can optimize all hyperparameters efficiently using variational expectation-maximization (Section 3).

2.1 Conventional KG Embeddings
We introduce our notation for a large class of knowledge graph embedding models (KG embeddings) from the literature, such as DistMult (Yang et al., 2015), ComplEx (Trouillon et al., 2016), and Holographic Embeddings (Nickel et al., 2016b).
Knowledge graphs are sets of triplet facts $(h, r, t)$ where the ‘head’ $h$ and ‘tail’ $t$ both belong to a fixed set of entities, and $r$ describes which one out of a set of relations holds between $h$ and $t$. KG embeddings represent each entity $i$ and each relation $r$ by an embedding vector $\mathbf{e}_i$ and $\mathbf{w}_r$, respectively, that lives in a semantic embedding space with a low dimension $D$. A model is defined by a real-valued score function $x(h, r, t)$. One fits the embedding vectors such that $x$ assigns a high score to observed triplet facts in the training set $\mathcal{D}$ and a low score to triplets that do not appear in $\mathcal{D}$.
Examples.
We give examples of the two models that reach the highest predictive performance to the best of our knowledge. For more models, see (Kadlec et al., 2017).
In the DistMult model (Yang et al., 2015), the embedding space is real valued, $\mathbf{e}_i, \mathbf{w}_r \in \mathbb{R}^D$, and the score is defined as

$$x(h, r, t) = \sum_{j=1}^{D} e_{hj}\, w_{rj}\, e_{tj} \tag{3}$$

where, e.g., $e_{hj}$ is the $j$th entry of the vector $\mathbf{e}_h$.
The ComplEx model (Trouillon et al., 2016) uses a complex embedding space, $\mathbf{e}_i, \mathbf{w}_r \in \mathbb{C}^D$, and defines the score

$$x(h, r, t) = \mathrm{Re}\Big( \sum_{j=1}^{D} e_{hj}\, w_{rj}\, \bar{e}_{tj} \Big) \tag{4}$$

where $\mathrm{Re}(\cdot)$ denotes the real part of a complex number, and $\bar{e}_{tj}$ is the complex conjugate of $e_{tj}$.
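Both score functions are simple trilinear products and can be sketched directly in NumPy (a minimal illustration with our own function names, not the authors' implementation):

```python
import numpy as np

def distmult_score(e_h, w_r, e_t):
    """Eq. 3: trilinear product of real-valued embeddings."""
    return np.sum(e_h * w_r * e_t)

def complex_score(e_h, w_r, e_t):
    """Eq. 4: real part of the trilinear product with the
    complex conjugate of the tail embedding."""
    return np.real(np.sum(e_h * w_r * np.conj(e_t)))
```

Note that `complex_score` is generally asymmetric under swapping head and tail, whereas `distmult_score` is symmetric, which is exactly the restriction of DistMult discussed in Section 5.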
Tail And Head Prediction.
Typical benchmark tasks for KG embeddings are ‘tail prediction’ and ‘head prediction’, i.e., completing queries of the form $(h, r, ?)$ and $(?, r, t)$, respectively, by ranking potential completions by their score $x$. Most proposals for KG embeddings train a single model for both tail and head prediction. Thus, the loss function is given by Eq. 1, where $\ell$ is a sum of two terms that train for tail and head prediction, respectively. While early works (e.g., (Bordes et al., 2013; Wang et al., 2014; Yang et al., 2015)) trained by maximizing a margin over negative samples, the more recent literature (Kadlec et al., 2017; Liang et al., 2018) suggests that the softmax loss leads to better predictive performance,

$$\ell(\mathbf{e}_h, \mathbf{w}_r, \mathbf{e}_t) = -\log \frac{\exp(x(h, r, t))}{\sum_{t'} \exp(x(h, r, t'))} - \log \frac{\exp(x(h, r, t))}{\sum_{h'} \exp(x(h', r, t))} \tag{5}$$

Here, the first term (with the sum over tails $t'$) is the softmax loss for tail prediction, while the second term (with the sum over heads $h'$) is the softmax loss for head prediction.
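A numerically stable sketch of this softmax loss for a single triplet, assuming precomputed score vectors over all candidate tails and heads (function and argument names are ours):

```python
import numpy as np

def softmax_loss(scores_all_tails, scores_all_heads, t_idx, h_idx):
    """Softmax loss of a single triplet (h, r, t) (sketch):
    cross-entropy for tail prediction plus cross-entropy for head
    prediction. scores_all_tails[t'] = x(h, r, t') and
    scores_all_heads[h'] = x(h', r, t)."""
    def nll(scores, target):
        scores = scores - scores.max()  # subtract max for stability
        log_probs = scores - np.log(np.sum(np.exp(scores)))
        return -log_probs[target]
    return nll(scores_all_tails, t_idx) + nll(scores_all_heads, h_idx)
```

In practice the sums over all tails and heads are batched as matrix products of the embedding tables, but the per-triplet loss is exactly this pair of cross-entropies.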
2.2 Regularization in KG Embeddings
Knowledge graph embedding models are highly sensitive to hyperparameters, especially to the strength of the regularizer (Kadlec et al., 2017). This can be understood since even large knowledge graphs typically contain only few data points per entity. For example, the large majority of entities in the FB15K data set appear only a small number of times as head or tail of a training point. Moreover, the amount of training data varies strongly across entities and relations (see Figure 1), suggesting that the regularization strength for the embedding vectors $\mathbf{e}_i$ and $\mathbf{w}_r$ should depend on the entity $i$ and relation $r$.
The loss function for conventional KG embeddings in Eq. 1 regularizes all embedding vectors with the same strength $\lambda$. We propose to replace $\lambda$ by individual regularization strengths $\lambda_i$ and $\lambda'_r$ for each entity $i$ and relation $r$, respectively, and to fit models with the loss function in Eq. 2. It generalizes Eq. 1, which one obtains for

$$\lambda_i = n_i \lambda \quad \text{and} \quad \lambda'_r = n'_r \lambda \tag{6}$$

where $n_i$ and $n'_r$ denote the number of times that entity $i$ or relation $r$ appears in the training data, respectively. The proposed augmented models described by Eq. 2 are more flexible as they do not impose a linear relationship between $n_i$ and the regularization strength $\lambda_i$.
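The frequency-proportional strengths of Eq. 6 are easy to compute in one pass over the training set; a small sketch (names are ours):

```python
from collections import Counter

def conventional_lambdas(triplets, lam):
    """Eq. 6 (sketch): the shared-regularizer loss of Eq. 1 is the
    special case of Eq. 2 with lambda_i = n_i * lambda, where n_i
    counts appearances of entity i as head or tail, and
    lambda'_r = n'_r * lambda for relations."""
    ent_counts, rel_counts = Counter(), Counter()
    for h, r, t in triplets:
        ent_counts[h] += 1
        ent_counts[t] += 1
        rel_counts[r] += 1
    lam_ent = {i: n * lam for i, n in ent_counts.items()}
    lam_rel = {r: n * lam for r, n in rel_counts.items()}
    return lam_ent, lam_rel
```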
The downside of the augmented KG embedding models is that one has to tune a macroscopic number of hyperparameters $\lambda_i$ and $\lambda'_r$: many thousands in the popular FB15K data set. Tuning such a large number of hyperparameters would be far too computationally expensive in a conventional setup that fits point estimates by minimizing the loss function. For point estimated models, it is well known that one cannot fit hyperparameters to the training data, as this would lead to overfitting (see also Supplementary Material). To avoid overfitting, knowledge graph embedding models are conventionally tuned by cross validation on held-out data. This requires training a model from scratch for each new hyperparameter setting. Cross validation does not scale beyond models with a handful of hyperparameters, and it is expensive even there (see, e.g., (Kadlec et al., 2017; Lacroix et al., 2018)).
Probabilistic models, by contrast, allow tuning of many hyperparameters in parallel using the empirical Bayes method (Dempster et al., 1977; Maritz, 2018). We propose a probabilistic formulation of augmented KG embeddings in the next section, and we present a method for efficient hyperparameter tuning in these models in Section 3.

2.3 Probabilistic KG Embeddings
We now present our probabilistic version of KG embeddings. The probabilistic formulation enables efficient optimization over thousands of hyperparameters, see Section 3.
Reciprocal Facts.
The KG embedding models discussed in Sections 2.1 and 2.2 make a direct interpretation as a generative probabilistic process difficult. Training a single model for both head and tail prediction introduces cyclic causal dependencies. As will become clear below, the tail prediction part of Eq. 5 (the first term on the right-hand side) corresponds to a generative process where the head $h$ causes the tail $t$. However, the head prediction part (the second term) corresponds to a generative process where $t$ causes $h$.
To solve this issue, we employ a data augmentation due to Lacroix et al. (2018) that goes as follows. For each relation $r$, one introduces a new symbol $\bar{r}$, which has the interpretation of the inverse of $r$, but whose embedding vector $\mathbf{w}_{\bar{r}}$ is not tied to $\mathbf{w}_r$. One then constructs an augmented training set $\mathcal{D}'$ by adding the reciprocal facts,

$$\mathcal{D}' = \mathcal{D} \cup \{ (t, \bar{r}, h) : (h, r, t) \in \mathcal{D} \} \tag{7}$$

One trains the model by minimizing the loss in Eq. 1 or Eq. 2, where the sum over data points is now over $\mathcal{D}'$ instead of $\mathcal{D}$, and where $\ell$ is given by only the first term of Eq. 5. When evaluating the model performance on a test set, one answers head prediction queries $(?, r, t)$ by answering the corresponding tail prediction query $(t, \bar{r}, ?)$. This data augmentation was introduced in (Lacroix et al., 2018) to improve performance. As we show next, it also has the advantage of enabling a probabilistic interpretation by establishing a causal order where $h$ comes before $t$.
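The data augmentation of Eq. 7 amounts to one pass over the training set; a minimal sketch, where the `_inv` suffix for the new, untied relation symbol is our own naming convention:

```python
def add_reciprocal_facts(triplets):
    """Eq. 7 (sketch): for every fact (h, r, t), add the reciprocal
    fact (t, r_inv, h), where r_inv is a fresh relation symbol whose
    embedding is not tied to that of r."""
    augmented = list(triplets)
    for h, r, t in triplets:
        augmented.append((t, r + "_inv", h))
    return augmented
```

With this augmentation, a head prediction query on the original data becomes a tail prediction query on the augmented data, so only the tail-prediction term of the loss is needed.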
Generative Process.
With the above data augmentation, minimizing the loss function in Eq. 2 is equivalent to point estimating the parameters of the probabilistic graphical model shown in Figure 2 (left). The generative process is:

For each entity $i$ and each relation $r$, draw embeddings from the priors

$$p(\mathbf{e}_i \,|\, \lambda_i) \propto \exp\big(-\lambda_i \|\mathbf{e}_i\|_p^p\big), \qquad p(\mathbf{w}_r \,|\, \lambda'_r) \propto \exp\big(-\lambda'_r \|\mathbf{w}_r\|_p^p\big). \tag{8}$$

Here, $p$ specifies the norm, and the omitted proportionality constant follows from normalization.

Repeat for each data point to be generated:


Draw a head entity $h$ and a relation $r$ from some discrete distribution $p(h, r)$. The choice of this distribution has no influence on inference since $h$ and $r$ are both directly observed.

Draw a tail entity $t$ from the conditional softmax distribution

$$p(t \,|\, h, r, \mathbf{E}, \mathbf{W}) = \frac{\exp(x(h, r, t))}{\sum_{t'} \exp(x(h, r, t'))} \tag{9}$$

Add the triplet fact $(h, r, t)$ to the data set $\mathcal{D}'$.

This process defines a log joint distribution over $\mathbf{E}$, $\mathbf{W}$, and the data $\mathcal{D}'$, conditioned on the hyperparameters, which we denote collectively by the boldface symbol $\boldsymbol{\lambda}$:

$$\log p(\mathcal{D}', \mathbf{E}, \mathbf{W} \,|\, \boldsymbol{\lambda}) = \sum_{(h,r,t)\in\mathcal{D}'} \log p(t \,|\, h, r, \mathbf{E}, \mathbf{W}) + \sum_i \log p(\mathbf{e}_i \,|\, \lambda_i) + \sum_r \log p(\mathbf{w}_r \,|\, \lambda'_r) + \mathrm{const.} \tag{10}$$

Using Eq. 9, it is easy to see that $\log p(t \,|\, h, r, \mathbf{E}, \mathbf{W})$ is the negative of the first term of Eq. 5. Thus, up to an additive term that depends only on $\boldsymbol{\lambda}$, the log joint distribution in Eq. 10 is the negative of the loss function, and minimizing the loss over $\mathbf{E}$ and $\mathbf{W}$ is equivalent to a maximum a posteriori (MAP) approximation of the probabilistic model.
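For illustration, the sampling step of this generative process can be sketched with a DistMult score (a toy sketch under our own naming; one row of `E` per candidate tail):

```python
import numpy as np

def sample_tail(e_h, w_r, E, rng):
    """Ancestral sampling of a tail entity (sketch): draw a tail
    index from the softmax over DistMult scores x(h, r, t') for all
    candidate tails t', i.e., the rows of the entity matrix E."""
    scores = E @ (e_h * w_r)          # x(h, r, t') for every candidate
    scores = scores - scores.max()    # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return rng.choice(len(E), p=probs)
```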
Figure 2 compares the generative process of the augmented KG embeddings proposed in Section 2.2 (left part of the figure) to the generative process for conventional KG embeddings, which one obtains by setting $\lambda_i$ and $\lambda'_r$ as in Eq. 6 (right). The augmented models are more flexible due to the large number of hyperparameters $\boldsymbol{\lambda}$. We discuss next how the probabilistic interpretation allows us to efficiently optimize over this large number of hyperparameters.
3 Hyperparameter Optimization
We now describe the proposed method for hyperparameter tuning in the probabilistic knowledge graph embedding models introduced in Section 2.3. The method is based on variational expectation-maximization (EM). We first derive an approximate coordinate update equation for the hyperparameters (Section 3.1) and then cover details of the parameter initialization (Section 3.2).
Variational EM optimizes a lower bound on the marginal likelihood of the model over the hyperparameters $\boldsymbol{\lambda}$, with the model parameters $\mathbf{E}$ and $\mathbf{W}$ integrated out. As we show in the supplementary material, the naive alternative of simultaneously optimizing the original model’s loss function over model parameters and hyperparameters would lead to divergent solutions. Variational EM avoids such divergent solutions by keeping track of parameter uncertainty. We elaborate on the role of parameter uncertainty in the supplementary material.
3.1 Variational EM for Knowledge Graph Embedding Models
Our proposed algorithm based on variational EM can easily be implemented in an existing model architecture with a few modifications. Algorithm 1 shows the conventional way to train a knowledge graph embedding model using stochastic gradient descent (SGD). The log joint distribution in Eqs. 8–10 defines a loss function of the form of Eq. 2. In SGD, one repeatedly calculates an estimate of this loss function based on a minibatch of training points, and one obtains gradient estimates by backpropagating through it. Algorithm 2 shows the modifications that are necessary to implement hyperparameter optimization. We describe the algorithm in detail below. In summary, one has to: (i) inject noise into the loss estimate; (ii) learn the optimal amount of noise via SGD; and (iii) update the hyperparameters $\boldsymbol{\lambda}$.
Variational Expectation-Maximization.
Our probabilistic interpretation of knowledge graph embedding models allows us to optimize over all hyperparameters $\lambda_i$ and $\lambda'_r$ in parallel via the expectation-maximization (EM) algorithm (Dempster et al., 1977). This algorithm treats the model parameters $\mathbf{E}$ and $\mathbf{W}$ as latent variables that have to be integrated out. The EM algorithm alternates between a step in which the latent variables are integrated out (‘E-step’) and an update step for the hyperparameters (‘M-step’). We use a version of EM based on variational inference, termed variational EM (Bernardo et al., 2003), that avoids the explicit integration step. We further derive an approximate coordinate update equation for the hyperparameters, which leads to a significant speedup over gradient updates in our experiments.
Each choice of hyperparameters $\boldsymbol{\lambda}$ defines a different variant of the model. The marginal likelihood of the data,

$$p(\mathcal{D}' \,|\, \boldsymbol{\lambda}) = \int p(\mathcal{D}', \mathbf{E}, \mathbf{W} \,|\, \boldsymbol{\lambda}) \, \mathrm{d}\mathbf{E} \, \mathrm{d}\mathbf{W}, \tag{11}$$

quantifies how well a given model variant describes the data $\mathcal{D}'$. Maximizing over $\boldsymbol{\lambda}$ thus yields the model variant that fits the data best. However, $p(\mathcal{D}' \,|\, \boldsymbol{\lambda})$ is unavailable in closed form, as the integral in Eq. 11 is intractable.
To circumvent the problem of the intractable marginal likelihood, we use variational inference (VI) (Jordan et al., 1999). Rather than integrating over the entire space of model parameters $\mathbf{E}$ and $\mathbf{W}$, we maximize a lower bound on the marginal likelihood. We introduce a so-called variational family of fully factorized Gaussian probability distributions,

$$q_{\boldsymbol{\mu}, \boldsymbol{\sigma}}(\mathbf{E}, \mathbf{W}) = \prod_i \mathcal{N}\big(\mathbf{e}_i; \boldsymbol{\mu}_i, \mathrm{diag}(\boldsymbol{\sigma}_i^2)\big) \prod_r \mathcal{N}\big(\mathbf{w}_r; \boldsymbol{\mu}'_r, \mathrm{diag}(\boldsymbol{\sigma}'^2_r)\big) \tag{12}$$

with

$$\mathcal{N}\big(\mathbf{e}_i; \boldsymbol{\mu}_i, \mathrm{diag}(\boldsymbol{\sigma}_i^2)\big) = \prod_{j=1}^{D} \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} \exp\!\left( -\frac{(e_{ij} - \mu_{ij})^2}{2\sigma_{ij}^2} \right) \tag{13}$$

and analogously for $\mathbf{w}_r$. Here, the means $\boldsymbol{\mu}$ and the standard deviations $\boldsymbol{\sigma}$ are so-called variational parameters over which we optimize. Invoking Jensen’s inequality, the log marginal likelihood is then lower-bounded by the evidence lower bound (Blei et al., 2017; Zhang et al., 2018), or ELBO:

$$\log p(\mathcal{D}' \,|\, \boldsymbol{\lambda}) \geq \mathrm{E}_q\big[ \log p(\mathcal{D}', \mathbf{E}, \mathbf{W} \,|\, \boldsymbol{\lambda}) \big] + H[q] = -\mathrm{E}_q\big[ L(\mathbf{E}, \mathbf{W}) \big] + H[q] + \mathrm{const.} \tag{14}$$

Here, in the second step, we identified the log joint probability as the negative of the loss $L$ of the corresponding point estimated model, up to an additive term that depends only on $\boldsymbol{\lambda}$, and $H[q]$ is the entropy of $q$.
The bound in Eq. 14 is tight if the variational distribution $q$ is the true posterior of the model for given $\boldsymbol{\lambda}$. Since it is a lower bound, maximizing the ELBO over $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ minimizes the gap and yields the best approximation of the marginal likelihood. We thus take the ELBO as a proxy for the marginal likelihood, and we maximize it also over $\boldsymbol{\lambda}$ to find near-optimal hyperparameters.
Gradient updates for $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$.
We maximize the ELBO concurrently over the variational parameters $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ as well as over the hyperparameters $\boldsymbol{\lambda}$. Updating the variational parameters is called the “E-step”. Here, we use gradient updates based on black box reparameterization gradients (Kingma and Welling, 2014; Rezende et al., 2014). This has the advantage of being agnostic to the model architecture as long as the score $x$ (e.g., Eqs. 3–4) is differentiable, and it requires only few changes compared to the standard SGD training loop in Algorithm 1.
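The resulting noisy loss estimate can be sketched as follows (an illustrative fragment under our own naming, not the paper's implementation; `loss_fn` stands for the minibatch loss of the point-estimated model):

```python
import numpy as np

def noisy_loss_estimate(mu, sigma, loss_fn, rng):
    """Black box reparameterization (sketch): sample the model
    parameters as mu + sigma * eps with eps ~ N(0, I), then evaluate
    the ordinary point-estimate loss on the noisy sample. In an
    autodiff framework, gradients w.r.t. mu and sigma flow through
    the sample."""
    eps = rng.standard_normal(mu.shape)
    sample = mu + sigma * eps
    return loss_fn(sample)
```

Setting `sigma` to zero recovers the conventional point-estimate training loop, which is why only a few lines need to change.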
To make sure that the standard deviations $\boldsymbol{\sigma}$ are always positive, we parameterize them by their logarithms $\boldsymbol{\xi}$,

$$\boldsymbol{\sigma} = \exp(\boldsymbol{\xi}), \tag{15}$$

and we optimize over $\boldsymbol{\mu}$ and $\boldsymbol{\xi}$ using SGD. We obtain an unbiased estimate of the term $\mathrm{E}_q[L(\mathbf{E}, \mathbf{W})]$ in Eq. 14 by drawing a single sample from $q$ (the noise-injection steps of Algorithm 2). The reparameterization gradient trick uses the fact that, for noise $\boldsymbol{\epsilon}$ drawn from a standard normal distribution, $\boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$ is distributed as $\mathcal{N}(\boldsymbol{\mu}, \mathrm{diag}(\boldsymbol{\sigma}^2))$. The entropy part of the ELBO (Eq. 14) can be calculated analytically. Up to an additive constant, it is given by the sum over all log standard deviations $\boldsymbol{\xi}$. Thus, its gradient with respect to $\boldsymbol{\xi}$ has the constant value of one in each coordinate direction, which appears as a constant term in the $\boldsymbol{\xi}$-update of Algorithm 2.

Coordinate updates for $\boldsymbol{\lambda}$.
Optimizing the ELBO over $\boldsymbol{\lambda}$ leads to an improved set of hyperparameters provided that the ELBO is a good approximation of the marginal likelihood $p(\mathcal{D}' \,|\, \boldsymbol{\lambda})$. However, this is typically not the case at the beginning of the optimization, when the variational distribution is still a poor fit of the posterior. We therefore begin the optimization with a fixed number of pure “E-step” updates during which we keep $\boldsymbol{\lambda}$ fixed. After this warm-up, we alternate between “E” and “M” steps, where the latter update the hyperparameters $\boldsymbol{\lambda}$. In our experiments, we found that the optimization converged slowly when we used gradient updates for $\boldsymbol{\lambda}$. To speed up convergence, we therefore derive approximate coordinate updates for $\boldsymbol{\lambda}$.
To simplify the notation, we derive the update equation only for a single hyperparameter $\lambda_i$. Updates for $\lambda'_r$ are analogous. The only term in the ELBO (Eq. 14) that depends on $\lambda_i$ is the expected log prior, $\mathrm{E}_q[\log p(\mathbf{e}_i \,|\, \lambda_i)]$. Since this term is independent of the data, we can write it out explicitly. The omitted proportionality constant in the prior (Eq. 8) is dictated by normalization. We find,

$$\mathrm{E}_q\big[\log p(\mathbf{e}_i \,|\, \lambda_i)\big] = \frac{\nu D}{p} \log \lambda_i - \lambda_i \, \mathrm{E}_q\big[ \|\mathbf{e}_i\|_p^p \big] + \mathrm{const.}, \tag{16}$$

where $\nu = 1$ for a real-valued embedding space (as in DistMult) and $\nu = 2$ if $\mathbf{e}_i \in \mathbb{C}^D$ (as in ComplEx). Setting the derivative with respect to $\lambda_i$ to zero, we find that the regularizer strength that maximizes the ELBO for given $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ satisfies

$$\lambda_i = \frac{\nu D}{p \, \mathrm{E}_q\big[ \|\mathbf{e}_i\|_p^p \big]}. \tag{17}$$
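A sketch of the resulting damped M-step update for a single entity, estimating the expectation in Eq. 17 from one reparameterized sample of the variational distribution (all names are ours; the damping mirrors the conservative weighted average discussed below):

```python
import numpy as np

def update_lambda(lam, mu_i, sigma_i, p, nu, lr, rng):
    """Approximate coordinate update for one lambda_i (sketch of
    Eq. 17): estimate E_q[||e_i||_p^p] from a single sample of the
    Gaussian q, then move lambda_i a fraction lr toward the
    ELBO-optimal value nu * D / (p * E_q[||e_i||_p^p])."""
    D = len(mu_i)
    sample = mu_i + sigma_i * rng.standard_normal(D)
    norm_pp = float(np.sum(np.abs(sample) ** p))   # one-sample estimate
    lam_opt = nu * D / (p * norm_pp)
    return (1.0 - lr) * lam + lr * lam_opt         # damped update
```

Because the estimated quantity is an average over many independent coordinates, a single sample already gives a low-variance estimate in moderately high dimensions.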
In moderately high embedding dimensions $D$, we can approximate the right-hand side of Eq. 17 accurately by sampling from $q$: it is the expectation of the average of a large number of independent random variables, and therefore follows a highly peaked distribution. The hyperparameter update step in Algorithm 2 uses a conservative weighted average between the current and the optimal value of $\lambda_i$, controlled by a learning rate. This effectively averages the estimates over past training steps with a decaying weight. Note that, for certain choices of $p$, the expectation in Eq. 17 has a closed form solution, but we found it unnecessary in our experiments to implement specialized code for these cases.

Absence of overfitting.
While the variational EM algorithm keeps track of the uncertainty of the model parameters, it fits only point estimates for the hyperparameters $\boldsymbol{\lambda}$. This is justified in our setup since there are far fewer hyperparameters than model parameters: each entity $i$ and each relation $r$ has an embedding vector $\mathbf{e}_i$ or $\mathbf{w}_r$ with many scalar components in our experiments, but only a single scalar hyperparameter $\lambda_i$ or $\lambda'_r$. We therefore expect much smaller posterior uncertainty for $\boldsymbol{\lambda}$ than for $\mathbf{E}$ and $\mathbf{W}$, which justifies point estimating $\boldsymbol{\lambda}$. Had we instead chosen a very flexible prior distribution with many hyperparameters per entity and relation, the EM algorithm would have essentially fitted the prior to the variational distribution, leading to an ill-posed problem. Judging from learning curves on the validation set, we did not detect any overfitting in variational EM.
3.2 Pre- and Re-Training
Variational EM (Algorithm 2) converges more slowly than fitting point estimates (Algorithm 1) because the injected noise increases the variance of the gradient estimator. To speed up convergence, we train the model in three consecutive phases: pre-training, variational EM, and re-training.
In the pre-training phase, we keep the hyperparameters $\boldsymbol{\lambda}$ fixed and fit point estimates of $\mathbf{E}$ and $\mathbf{W}$ using standard SGD (Algorithm 1). We found the final predictive performance (after the variational EM and re-training phases) to be insensitive to the initial hyperparameters. We use early stopping based on the mean reciprocal rank (see Section 4 below), evaluated on the validation set.
In the variational EM phase (Algorithm 2 and Section 3.1), we initialize the variational distribution around the pre-trained model parameters. In detail, we initialize the means $\boldsymbol{\mu}$ with the pre-trained point estimates of $\mathbf{E}$ and $\mathbf{W}$, and we initialize the components of $\boldsymbol{\sigma}$ with a value that is small compared to the typical components of $\boldsymbol{\mu}$ (0.2 in our experiments).
In the re-training phase, we fit again point estimates of $\mathbf{E}$ and $\mathbf{W}$ with Algorithm 1, this time using the optimized hyperparameters $\boldsymbol{\lambda}$. We use the resulting models to evaluate the predictive performance; see the results in Section 4.
As an alternative to re-training a point estimated model, one could also perform predictions by averaging predictive probabilities over samples from the variational distribution $q$. If $q$ is a good approximation of the model posterior, then this results in an approximate Bayesian form of link prediction. In our experiments, we found that, in low embedding dimensions $D$, predictions based on samples from $q$ outperformed predictions based on point estimates. In higher embedding dimensions, however, the point estimated models from the re-training phase had better predictive performance. We interpret this somewhat counterintuitive observation as a failure of the fully factorized Gaussian variational approximation to adequately approximate the true posterior.
Table 1: Link prediction results (filtered MRR and Hits@10) on four benchmark data sets.

| model    | variant                         | WN18RR MRR | WN18RR Hits@10 | WN18 MRR | WN18 Hits@10 | FB15K-237 MRR | FB15K-237 Hits@10 | FB15K MRR | FB15K Hits@10 |
|----------|---------------------------------|------------|----------------|----------|--------------|---------------|-------------------|-----------|---------------|
| DistMult | Yang et al. (2015) (orig.)      | –          | –              | 0.83     | 0.942        | –             | –                 | 0.35      | 0.577         |
| DistMult | Kadlec et al. (2017)            | –          | –              | 0.790    | 0.950        | –             | –                 | 0.837     | 0.904         |
| DistMult | Dettmers et al. (2018)          | 0.43       | 0.49           | 0.822    | 0.936        | 0.241         | 0.419             | 0.654     | 0.824         |
| DistMult | Ours (after variational EM)     | 0.455      | 0.544          | 0.911    | 0.961        | 0.357         | 0.548             | 0.841     | 0.914         |
| ComplEx  | Trouillon et al. (2016) (orig.) | –          | –              | 0.941    | 0.947        | –             | –                 | 0.692     | 0.840         |
| ComplEx  | Lacroix et al. (2018)           | 0.478      | 0.569          | 0.952    | 0.963        | 0.364         | 0.555             | 0.857     | 0.909         |
| ComplEx  | Ours (after variational EM)     | 0.486      | 0.579          | 0.953    | 0.964        | 0.365         | 0.560             | 0.854     | 0.915         |
4 Experimental Results
We test the performance of the proposed model augmentation and the scalable hyperparameter tuning algorithm with two models and four different data sets. In this section, we report results using standard benchmark metrics and we compare to the previous state of the art. We also analyze the relationship between the optimized regularizer strengths and the frequency of entities and relations.
Model architectures and baselines.
We report results for the DistMult model (Yang et al., 2015) and the ComplEx model (Trouillon et al., 2016). We follow (Lacroix et al., 2018) for the details of the model architecture: we use reciprocal facts as described at the end of Section 2.3, $L_p$ norm regularizers, and a large embedding dimension. We compare our results to the previous state of the art: (Dettmers et al., 2018; Kadlec et al., 2017) for DistMult and (Lacroix et al., 2018) for ComplEx.
Data sets.
We used four standard data sets. The first two are FB15K from the Freebase project (Bollacker et al., 2008) and WN18 from the WordNet database (Bordes et al., 2014). The other two data sets, FB15K-237 and WN18RR, are modified versions of FB15K and WN18 due to Toutanova and Chen (2015) and Dettmers et al. (2018). The motivation for the modified data sets is that FB15K and WN18 contain near-duplicate relations that leak into the test set, which makes link prediction trivial for some facts, thus encouraging overfitting. In FB15K-237 and WN18RR, these near duplicates were removed.
Metrics.
We report two standard metrics used in the KG embedding literature: mean reciprocal rank (MRR) and Hits@10. We average over head and tail prediction on the test set, which is equivalent to averaging only over tail prediction on the test set augmented with reciprocal facts; see Eq. 7.
All results are obtained in the ‘filtered’ setting introduced in (Bordes et al., 2013), which takes into account that more than one tail may be a correct answer to a query $(h, r, ?)$. When calculating the rank of the target tail $t$, one therefore ignores any competing tails $t'$ if the corresponding fact $(h, r, t')$ exists in either the training, validation, or test set. More formally, the filtered rank of $t$ is defined as one plus the number of ‘incorrect’ tails $t'$ with the given $h$ and $r$ for which $x(h, r, t') > x(h, r, t)$. Here, candidate facts $(h, r, t')$ are considered ‘incorrect’ if they appear neither in the training nor in the validation or test set.
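The filtered rank and the two metrics can be sketched as follows (an illustrative implementation under our own naming; `scores[t]` denotes the score $x(h, r, t)$ for each candidate tail):

```python
def filtered_rank(scores, target, known_tails):
    """Filtered rank (sketch): 1 + number of candidate tails that
    outscore the target, where candidates that form known facts
    (train/valid/test) with the given head and relation are ignored."""
    return 1 + sum(
        1 for t, s in enumerate(scores)
        if t != target and t not in known_tails and s > scores[target]
    )

def mrr_and_hits10(ranks):
    """Mean reciprocal rank and Hits@10 over a list of filtered ranks."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits10 = sum(r <= 10 for r in ranks) / len(ranks)
    return mrr, hits10
```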
Quantitative results.
Table 1 summarizes our quantitative results. The top half of the table shows results for the DistMult model. Our models with individually optimized regularization strengths significantly outperform the previous state of the art across all four data sets.
For the ComplEx model, the performance improvements are less pronounced (lower half of Table 1). This may be explained by the fact that the results in (Lacroix et al., 2018) were already obtained after expensive large-scale hyperparameter tuning using grid search. By contrast, the hyperparameter search with our proposed method required only a single run per data set. Even for the largest data set, FB15K, the variational EM phase took less than three hours on a single GPU. Despite the much cheaper hyperparameter optimization, our models slightly outperform the previous state of the art on three out of the four considered data sets, with only a small degradation on the fourth.
Qualitative results.
Finally, we study the relationship between the optimized hyperparameters and the frequencies of entities in the training data. Figure 3 shows the learned $\lambda_i$ for all entities as a function of the number of times $n_i$ that each entity appears in the training corpus of the FB15K data set. The red line is the best proportional fit to the results, as would be imposed by conventional models (Eq. 6).
Our findings confirm the general idea of using stronger regularization for entities with more training data. The Bayesian interpretation can explain this observation: a small amount of training data typically leads to high posterior uncertainty, which leads to a small $\lambda_i$ in Eq. 17. However, our results indicate that imposing a proportionality between $n_i$ and $\lambda_i$ would be a poor choice that significantly under-regularizes infrequent entities and over-regularizes frequent entities (note the logarithmic scale in Figure 3). Our empirical findings may inspire future theoretical work that derives an optimal frequency dependency of the regularization strength in tensor factorization models.
5 Related Work
Related work to this paper can be grouped into link prediction algorithms and variational inference.
Link Prediction.
Link prediction in knowledge graphs has gained a lot of attention as it may point a way towards automated reasoning with real-world data. For a review, see (Nickel et al., 2016a). Two different approaches to link prediction are predominant in the literature. In statistical relational learning, one infers explicit rules about relations (such as transitivity or commutativity) by detecting statistical patterns in the training set. One then uses these rules for logic reasoning (Friedman et al., 1999; Kersting et al., 2011; Niu et al., 2012; Pujara et al., 2015).

Our work focuses on a complementary approach that builds on knowledge graph embedding models. This line of research started with the proposal of the TransE model (Bordes et al., 2013), which models relational facts as vector additions in a semantic space. More recently, a plethora of different knowledge graph embedding models based on tensor factorizations have been proposed. We summarize here only the path that led to the current state of the art. Different models make different trade-offs between generality and effective use of training data.

Canonical tensor decomposition (Hitchcock, 1927) uses independent embeddings for entities in the head or tail position of a fact. DistMult (Yang et al., 2015; Toutanova and Chen, 2015), by contrast, uses the same embeddings for entities in head and tail position, thus making use of more training data per entity embedding, but restricting the model to symmetric relations. The ComplEx model (Trouillon et al., 2016) lifts this restriction by multiplying the head, relation, and tail embeddings in an asymmetric way. To the best of our knowledge, the current state of the art was presented by Lacroix et al. (2018), who improved upon the ComplEx model by introducing reciprocal relations and using a better regularizer.
The sensitivity of knowledge graph embeddings to the choice of hyperparameters, such as regularizer strengths, was first pointed out by Kadlec et al. (2017). A popular heuristic is to regularize each embedding every time it appears in a minibatch, thus effectively regularizing embeddings proportionally to their frequency (Srebro and Salakhutdinov, 2010; Lacroix et al., 2018). In contrast, we propose to learn entity-dependent regularization strengths without relying on heuristics. Vilnis et al. (2018) proposed a new model that is probabilistic in the sense that it assigns probabilities to the results of queries. However, in contrast to our proposal, it is not a probabilistic generative model of the data set.
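The minibatch heuristic can be sketched in a few lines. Accumulating a fixed per-occurrence penalty weight over one epoch yields an effective regularization strength proportional to each entity's frequency (all names and the toy data below are ours, for illustration only):

```python
import numpy as np

# Sketch of the frequency-proportional heuristic: penalize an embedding
# each time its entity occurs in a minibatch. Over one epoch, entity i
# accumulates a total penalty weight lam * n_i, where n_i is its frequency.
num_entities, lam = 5, 0.01
train_heads = np.array([0, 0, 0, 1, 1, 2, 3, 3, 3, 3])  # toy head entities

accumulated = np.zeros(num_entities)
for batch in np.array_split(train_heads, 3):   # minibatch passes over one epoch
    for e in batch:
        accumulated[e] += lam                  # per-occurrence L2 penalty weight

counts = np.bincount(train_heads, minlength=num_entities)
print(accumulated)            # equals lam * counts: strength grows with frequency
```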
Variational Inference.
Variational inference (VI) is a powerful technique for approximating a Bayesian posterior over latent variables given observations (Jordan et al., 1999; Blei et al., 2017; Zhang et al., 2018). Besides approximating the posterior, VI also estimates the marginal likelihood of the data. This allows for iterative hyperparameter tuning (variational EM) (Bernardo et al., 2003), which is the main benefit of the Bayesian approach used in this paper.
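The variational EM loop can be illustrated on a toy Gaussian model (this is not the paper's model; the variable names and the conjugate setup are our assumptions for illustration). The E-step computes a posterior over the latent variable; the M-step updates a hyperparameter to increase the evidence lower bound:

```python
import numpy as np

# Toy variational EM: observations x_n ~ N(z, 1) with prior z ~ N(0, s),
# where the prior variance s plays the role of a tunable hyperparameter.
rng = np.random.default_rng(2)
x = rng.normal(loc=1.5, scale=1.0, size=200)
N, s = len(x), 10.0                      # s: initial prior variance

for _ in range(50):
    # E-step: q(z) = N(mu, v) is the (here exact) posterior under current s.
    v = 1.0 / (N + 1.0 / s)
    mu = v * x.sum()
    # M-step: maximize E_q[log p(z | s)] over s, which gives s = E_q[z^2].
    s = mu**2 + v

print(f"tuned prior variance s = {s:.3f}")   # settles near the squared sample mean
```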
Our paper builds on recent probabilistic extensions of embedding models to Bayesian models, such as word (Barkan, 2017) or paragraph (Ji et al., 2017) embeddings. In these works, words are embedded into a low-dimensional space. It has been shown that the probabilistic approach leads to better performance on small data sets and allows these models to be combined with powerful priors, e.g., for time series modeling (Bamler and Mandt, 2017, 2018; Jähnichen et al., 2018). Yet, the underlying probabilistic models in these papers are very different from the ones considered in our work.
Bayesian Optimization.
An alternative method for hyperparameter optimization is Bayesian optimization. However, Bayesian optimization does not scale to the large number of hyperparameters that we tune in this work. Most practical applications of Bayesian optimization (e.g., Snoek et al., 2012; Wang et al., 2013) tune only tens of hyperparameters, rather than tens of thousands. This is because Bayesian optimization treats the model as a black box, which it can only train and then evaluate for one choice of hyperparameters at a time. Each such evaluation contributes a single data point to fit an auxiliary model over the hyperparameters. By contrast, variational EM has access to gradient information to train all hyperparameters in parallel, concurrently with the model parameters.
6 Conclusions
We augmented a large class of popular knowledge graph embedding models such that every entity embedding and every relation embedding vector has its own regularizer, and showed that it is possible to tune these potentially thousands of hyperparameters in a scalable way. Our approach is motivated by the observation that sharing a common regularization strength across all embeddings leads to over-regularization.
We treated knowledge graph embeddings as generative probabilistic models, making them amenable to Bayesian model selection. We derived approximate coordinate updates for the hyperparameters in the framework of variational EM. We applied our method to generalizations of the DistMult and ComplEx models and outperformed the state of the art for link prediction. The approach can be applied to a wide range of models with minimal modifications to the training routine. In the future, it would be interesting to investigate whether tighter variational bounds (Burda et al., 2016; Bamler et al., 2017) may further improve model selection.
References

Bamler and Mandt (2017) Robert Bamler and Stephan Mandt. Dynamic word embeddings. In International Conference on Machine Learning, 2017.
 Bamler and Mandt (2018) Robert Bamler and Stephan Mandt. Improving optimization in models with continuous symmetry breaking. In International Conference on Machine Learning, 2018.
 Bamler et al. (2017) Robert Bamler, Cheng Zhang, Manfred Opper, and Stephan Mandt. Perturbative black box variational inference. In Advances in Neural Information Processing Systems, pages 5079–5088, 2017.
 Barkan (2017) Oren Barkan. Bayesian neural word embedding. In Association for the Advancement of Artificial Intelligence, pages 3135–3143, 2017.
 Bernardo et al. (2003) JM Bernardo, MJ Bayarri, JO Berger, AP Dawid, D Heckerman, AFM Smith, M West, et al. The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. Bayesian statistics, 7:453–464, 2003.
 Blei et al. (2017) David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
 Bollacker et al. (2008) Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1247–1250. ACM, 2008.
 Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795, 2013.
 Bordes et al. (2014) Antoine Bordes, Xavier Glorot, Jason Weston, and Yoshua Bengio. A semantic matching energy function for learning with multi-relational data. Machine Learning, 94(2):233–259, 2014.

Burda et al. (2016) Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In International Conference on Learning Representations, 2016.
 Dempster et al. (1977) Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), pages 1–38, 1977.
 Dettmers et al. (2018) Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2d knowledge graph embeddings. In Association for the Advancement of Artificial Intelligence, 2018.
 Eder (2012) Jeffrey Scott Eder. Knowledge graph based search system, June 21 2012. US Patent App. 13/404,109.
 Friedman et al. (1999) Nir Friedman, Lise Getoor, Daphne Koller, and Avi Pfeffer. Learning probabilistic relational models. In International Joint Conference on Artificial Intelligence, pages 1300–1309, 1999.
 Hitchcock (1927) Frank L Hitchcock. The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6(14):164–189, 1927.
 Jähnichen et al. (2018) Patrick Jähnichen, Florian Wenzel, Marius Kloft, and Stephan Mandt. Scalable generalized dynamic topic models. In Artificial Intelligence and Statistics, 2018.

Ji et al. (2017) Geng Ji, Robert Bamler, Erik B Sudderth, and Stephan Mandt. Bayesian paragraph vectors. In Symposium on Advances in Approximate Bayesian Inference, 2017.
 Ji et al. (2016) Guoliang Ji, Kang Liu, Shizhu He, and Jun Zhao. Knowledge graph completion with adaptive sparse transfer matrix. In Association for the Advancement of Artificial Intelligence, pages 985–991, 2016.
 Jordan et al. (1999) Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
 Kadlec et al. (2017) Rudolf Kadlec, Ondrej Bajgar, and Jan Kleindienst. Knowledge base completion: Baselines strike back. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 69–74. Association for Computational Linguistics, 2017.

Kersting et al. (2011) Kristian Kersting, Sriraam Natarajan, and David Poole. Statistical relational AI: logic, probability and computation. In Proceedings of the 11th International Conference on Logic Programming and Nonmonotonic Reasoning (LPNMR’11), pages 1–9, 2011.
 Kingma and Welling (2014) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
 Lacroix et al. (2018) Timothée Lacroix, Nicolas Usunier, and Guillaume Obozinski. Canonical tensor decomposition for knowledge base completion. In International Conference on Machine Learning, 2018.
 Liang et al. (2018) Dawen Liang, Rahul G. Krishnan, Matthew D. Hoffman, and Tony Jebara. Variational autoencoders for collaborative filtering. In Proceedings of the 2018 World Wide Web Conference, WWW ’18, pages 689–698. International World Wide Web Conferences Steering Committee, 2018. ISBN 9781450356398.
 Lin et al. (2015) Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings for knowledge graph completion. In Association for the Advancement of Artificial Intelligence, volume 15, pages 2181–2187, 2015.
 Maritz (2018) Johannes S Maritz. Empirical Bayes Methods with Applications. Chapman and Hall/CRC, 2018.
 Nguyen (2017) Dat Quoc Nguyen. An overview of embedding models of entities and relationships for knowledge base completion. arXiv preprint arXiv:1703.08098, 2017.
 Nickel et al. (2016a) Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33, 2016a.
 Nickel et al. (2016b) Maximilian Nickel, Lorenzo Rosasco, Tomaso A Poggio, et al. Holographic embeddings of knowledge graphs. In Association for the Advancement of Artificial Intelligence, volume 2, pages 3–2, 2016b.
 Niu et al. (2012) Feng Niu, Ce Zhang, Christopher Ré, and Jude W Shavlik. Deepdive: Webscale knowledgebase construction using statistical learning and inference. VLDS, 12:25–28, 2012.
 Pujara et al. (2015) Jay Pujara, Hui Miao, Lise Getoor, and William W Cohen. Using semantics and statistics to turn data into knowledge. AI Magazine, 36(1):65–74, 2015.
 Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.
 Shen et al. (2016) Yelong Shen, PoSen Huang, MingWei Chang, and Jianfeng Gao. Implicit reasonet: Modeling largescale structured relationships with shared memory. In Proceedings of the 2nd Workshop on Representation Learning for NLP, 2016.
 Shi and Weninger (2017) Baoxu Shi and Tim Weninger. Proje: Embedding projection for knowledge graph completion. In Association for the Advancement of Artificial Intelligence, volume 17, pages 1236–1242, 2017.
 Snoek et al. (2012) Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.
 Srebro and Salakhutdinov (2010) Nathan Srebro and Ruslan R Salakhutdinov. Collaborative filtering in a nonuniform world: Learning with the weighted trace norm. In Advances in Neural Information Processing Systems, pages 2056–2064, 2010.
 Toutanova and Chen (2015) Kristina Toutanova and Danqi Chen. Observed versus latent features for knowledge base and text inference. In Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality, pages 57–66, 2015.
 Trouillon et al. (2016) Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In International Conference on Machine Learning, pages 2071–2080, 2016.
 Vilnis et al. (2018) Luke Vilnis, Xiang Li, Shikhar Murty, and Andrew McCallum. Probabilistic embedding of knowledge graphs with box lattice measures. In Annual Meeting of the Association for Computational Linguistics, 2018.
 Wang et al. (2017) Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering, 29(12):2724–2743, 2017.

Wang et al. (2014) Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding by translating on hyperplanes. In Association for the Advancement of Artificial Intelligence, volume 14, pages 1112–1119, 2014.
 Wang and Li (2016) Zhigang Wang and Juan-Zi Li. Text-enhanced representation learning for knowledge graph. In International Joint Conference on Artificial Intelligence, pages 1293–1299, 2016.
 Wang et al. (2013) Ziyu Wang, Masrour Zoghi, Frank Hutter, David Matheson, and Nando De Freitas. Bayesian optimization in high dimensions via random embeddings. In International Joint Conference on Artificial Intelligence, 2013.
 Xiao et al. (2017) Han Xiao, Minlie Huang, Lian Meng, and Xiaoyan Zhu. Ssp: Semantic space projection for knowledge graph embedding with text descriptions. In Association for the Advancement of Artificial Intelligence, volume 17, pages 3104–3110, 2017.
 Yang et al. (2015) Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and relations for learning and inference in knowledge bases. In International Conference on Learning Representations, 2015.
 Zhang et al. (2018) Cheng Zhang, Judith Butepage, Hedvig Kjellstrom, and Stephan Mandt. Advances in variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.