## 1 Introduction

Link prediction in relational data has been a subject of sustained interest, given the widespread availability of such data and the breadth of its use in bioinformatics (Zitnik2018), recommender systems (koren_matrix_2009) or Knowledge Base completion (nickel_review_2016). Relational data is often temporal: for example, the action of buying an item or watching a movie is associated with a timestamp, and some medicines might not have the same adverse side effects depending on the subject’s age. The task of *temporal* link prediction is to find missing links in graphs at precise points in time.

In this work, we study temporal link prediction through the lens of temporal knowledge base completion, which provides benchmarks that vary both in the underlying data they represent and in scale. A knowledge base is a set of facts (subject, predicate, object) about the world that are known to be true. Link prediction in a knowledge base amounts to answering incomplete queries of the form (subject, predicate, ?) by providing an accurate ranking of potential objects. In temporal knowledge bases, these facts have some temporal metadata attached. For example, facts might only hold for a certain time interval, in which case they will be annotated as such. Other facts might be events that happened at a certain point in time. Temporal link prediction amounts to answering queries of the form (subject, predicate, ?, timestamp). For example, we expect the ranking of answers to the query (USA, president, ?, timestamp) to vary with the timestamp.

As tensor factorization methods have proved successful for Knowledge Base Completion (nickel_review_2016; trouillon_complex_2016; lacroix2018canonical), we express our Temporal Knowledge Base Completion problem as an order 4 tensor completion problem. That is, timestamps are discretized and used to index a 4-th mode in the binary tensor holding (subject, predicate, object, timestamp) facts.

First, we introduce a ComplEx (trouillon_complex_2016) decomposition of this order 4 tensor, and link it with previous work on temporal Knowledge Base completion. This decomposition yields an embedding for each timestamp. A natural prior is for these timestamp representations to evolve slowly over time. We are able to introduce this prior as a regularizer for which the optimum is a variation on the nuclear $p$-norm. In order to deal with heterogeneous temporal knowledge bases where a significant number of relations might be non-temporal, we add a non-temporal component to our decomposition.

Experiments on available benchmarks show that our method outperforms the state of the art for a similar number of parameters. We run additional experiments for larger, regularized models and obtain improvements of up to absolute Mean Reciprocal Rank (MRR).

Finally, we propose a dataset of entities, based on Wikidata, with train triples, of which contain temporal validity information. This dataset is larger than usual benchmarks in the Knowledge Base completion community and could help bridge the gap between the method designed and the envisaged web-scale applications.

## 2 Related Work

Matrices and tensors are denoted by upper case letters. The $i$-th row of $U$ is denoted by $u_i$ while its $j$-th column is denoted by $U_{:,j}$. The tensor product of two vectors is written $\otimes$ and the Hadamard (elementwise) product $\odot$.

#### Static link prediction methods

Standard tensor decomposition methods have led to good results (yang_embedding_2014; trouillon_complex_2016; lacroix2018canonical; balavzevic2019tucker) in Knowledge Base completion. The Canonical Polyadic (CP) Decomposition (hitchcock_expression_1927) is the tensor equivalent of the low-rank decomposition of a matrix. A tensor $X$ of canonical rank $R$ can be written as:

$$X = \sum_{r=1}^{R} U_{:,r} \otimes V_{:,r} \otimes W_{:,r} = [\![U, V, W]\!] \iff \forall (i,j,k),\ X_{i,j,k} = \sum_{r=1}^{R} U_{i,r} V_{j,r} W_{k,r} = \langle u_i, v_j, w_k \rangle \tag{1}$$

Setting $U = W$ leads to the Distmult (yang_embedding_2014) model, which has been successful, despite only being able to represent symmetric score functions. In order to keep the parameter sharing scheme but go beyond symmetric relations, trouillon_complex_2016 use complex parameters and set $W$ to the complex conjugate of $U$, $W = \overline{U}$. Regularizing this algorithm with the variational form of the tensor nuclear norm, together with a slight transformation to the learning objective (also proposed in kazemi2018simple), leads to state of the art results in lacroix2018canonical.

Other methods are not directly inspired from classical tensor decompositions. For example, TransE (bordes_translating_2013) models the score as a distance of the translated subject to an object representation. This method has led to many variations (ji2015knowledge; nguyen2016stranse; wang2014knowledge), but is limited in the relation systems it can model (kazemi2018simple) and does not lead to state of the art performances on current benchmarks. Finally, schlichtkrull2018modeling propose to generate the entity embeddings of a CP-like tensor decomposition by running a forward pass of a Graph Neural Network over the training Knowledge Base. The experiments included in this work did not lead to better link prediction performances than the same decomposition (Distmult) directly optimized (kadlec_knowledge_2017).

#### Temporal link prediction methods

sarkar2006dynamic describe a Bayesian model and learning method for representing temporal relations. The temporal smoothness prior used in this work is similar to the gradient penalty we describe in Section 3.3. However, learning one embedding matrix per timestamp is not applicable to the scales considered in this work. bader2007temporal use a tensor decomposition called ASALSAN to express temporal relations. This decomposition is related to RESCAL (nickel_three-way_2011), which underperforms on recent benchmarks due to overfitting (nickel_holographic_2015).

For temporal knowledge base completion, goel2019diachronic learn entity embeddings that change over time, by masking a fraction of the embedding weights with an activation function of learned frequencies. Based on the Tucker decomposition, ConT (ma2018embedding) learns one new core tensor for each timestamp. Finally, viewing the time dimension as a sequence to be predicted, garcia2018learning use recurrent neural nets to transform the embeddings of standard models such as TransE or Distmult to accommodate the temporal data.

This work follows lacroix2018canonical by studying and extending a regularized CP decomposition of the training set seen as an order 4 tensor. We propose and study several regularizers suited to our decompositions.

## 3 Model

Table 1: Number of parameters for each model.

Model | Number of parameters
---|---
DE-SimplE |
TComplEx |
TNTComplEx |

In this section, we are given facts (subject, predicate, object) annotated with timestamps. We discretize the timestamp range (e.g. by reducing timestamps to years) in order to obtain a training set of 4-tuples (subject, predicate, object, time) indexing an order 4 tensor. We will show in Section 5.1 how we reduce each dataset to this setting. Following lacroix2018canonical, we minimize, for each of the train tuples $(i,j,k,l)$, the instantaneous multiclass loss:

$$\ell(\hat{X}; (i,j,k,l)) = -\hat{X}_{i,j,k,l} + \log\Big(\sum_{k'} \exp\big(\hat{X}_{i,j,k',l}\big)\Big) \tag{2}$$
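As a minimal sketch of this loss, assuming it is the usual multiclass log-loss (softmax cross-entropy) over candidate objects and that the scores of all candidates for a fixed (subject, predicate, time) are available as a plain list. The name `instantaneous_loss` is illustrative and not taken from the paper's code:

```python
import math

def instantaneous_loss(scores, true_object):
    """Log-loss of one (subject, predicate, ?, time) query.

    scores[k] is the model score of candidate object k and true_object is
    the index of the observed object (both are hypothetical inputs).
    """
    m = max(scores)  # subtract the max before exponentiating, for stability
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[true_object]

# With identical scores over 4 candidates, the loss equals log(4).
uniform_loss = instantaneous_loss([0.0, 0.0, 0.0, 0.0], true_object=2)
# Raising the score of the observed object lowers the loss.
confident_loss = instantaneous_loss([0.0, 0.0, 8.0, 0.0], true_object=2)
```

The loss only compares candidates within one "object" tube of the tensor, which is why it suits queries of the form (subject, predicate, ?, time).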

Note that this loss is only suited to queries of the type (subject, predicate, ?, time), which are the queries that were considered in related work. We consider another auxiliary loss in Section 6 which we will use on our Wikidata dataset. For a training set $S$ (augmented with reciprocal relations (lacroix2018canonical; kazemi2018simple)) and a parametric tensor estimate $\hat{X}(\theta)$, we minimize the following objective, with a *weighted* regularizer $\Omega$:

$$\mathcal{L}(\hat{X}(\theta)) = \frac{1}{|S|} \sum_{(i,j,k,l) \in S} \Big[ \ell\big(\hat{X}(\theta); (i,j,k,l)\big) + \lambda\, \Omega\big(\theta; (i,j,k,l)\big) \Big] \tag{3}$$

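The ComplEx (trouillon_complex_2016) decomposition can naturally be extended to this setting by adding a new factor $T$. We then have:

$$\hat{X}(U,V,T) = \mathrm{Re}\big([\![U, V, \overline{U}, T]\!]\big) \iff \hat{X}(U,V,T)_{i,j,k,l} = \mathrm{Re}\big(\langle u_i, v_j, \overline{u}_k, t_l \rangle\big) \tag{4}$$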

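We call this decomposition TComplEx. Intuitively, we added timestamp embeddings that modulate the multi-linear dot product. Notice that the timestamp can equivalently be used to modulate the objects, predicates or subjects to obtain time-dependent representations:

$$\langle u_i, v_j, \overline{u}_k, t_l \rangle = \langle u_i \odot t_l, v_j, \overline{u}_k \rangle = \langle u_i, v_j \odot t_l, \overline{u}_k \rangle = \langle u_i, v_j, \overline{u}_k \odot t_l \rangle \tag{5}$$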
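The following sketch illustrates the TComplEx score and the equivalence of the three modulations, assuming the score is the real part of a four-way multi-linear product of complex embeddings with the object embedding conjugated. All function names and the toy rank-2 embeddings are illustrative assumptions, not the paper's implementation:

```python
def multilinear(*vectors):
    """Multi-linear dot product: sum over components of the product of entries."""
    total = 0j
    for components in zip(*vectors):
        product = 1 + 0j
        for c in components:
            product *= c
        total += product
    return total

def hadamard(a, b):
    """Elementwise product of two vectors."""
    return [x * y for x, y in zip(a, b)]

def conj(a):
    """Elementwise complex conjugate."""
    return [x.conjugate() for x in a]

def tcomplex_score(u_i, v_j, u_k, t_l):
    """Real part of <u_i, v_j, conj(u_k), t_l>."""
    return multilinear(u_i, v_j, conj(u_k), t_l).real

# Toy rank-2 embeddings with arbitrary values, for illustration only.
u_i = [1 + 2j, 0.5 - 1j]
v_j = [0.3 + 0.1j, -1 + 0j]
u_k = [2 - 1j, 1 + 1j]
t_l = [0.5 + 0.5j, 1 - 2j]

score = tcomplex_score(u_i, v_j, u_k, t_l)
# The timestamp can be folded into the subject, the predicate or the object.
score_subj = multilinear(hadamard(u_i, t_l), v_j, conj(u_k)).real
score_pred = multilinear(u_i, hadamard(v_j, t_l), conj(u_k)).real
score_obj = multilinear(u_i, v_j, hadamard(conj(u_k), t_l)).real
```

All four scores agree up to floating-point error, since the product inside each component of the sum is commutative.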

Contrary to DE-SimplE (goel2019diachronic), we do not learn temporal embeddings that scale with the number of entities (as frequencies and biases), but rather embeddings that scale with the number of timestamps. The number of parameters for the two models is compared in Table 1.

### 3.1 Non-Temporal predicates

Some predicates might not be affected by timestamps. For example, Malia and Sasha will always be the daughters of Barack and Michelle Obama, whereas the “has occupation” predicate between two entities might very well change over time. In heterogeneous knowledge bases, where some predicates might be temporal and some might not be, we propose to decompose the tensor as the sum of two tensors, one temporal, and the other non-temporal:

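$$\hat{X} = \mathrm{Re}\big([\![U, V^t, \overline{U}, T]\!] + [\![U, V, \overline{U}, \mathbf{1}]\!]\big) \iff \hat{X}_{i,j,k,l} = \mathrm{Re}\big(\langle u_i, v^t_j \odot t_l + v_j, \overline{u}_k \rangle\big) \tag{6}$$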

We call this decomposition TNTComplEx. goel2019diachronic suggest another way of introducing a non-temporal component, by only allowing a fraction $\gamma$ of the components of the embeddings to be modulated in time. By sharing parameters between the temporal and non-temporal parts of the tensor, our model removes one hyperparameter. Moreover, preliminary experiments showed that this model outperforms one without parameter sharing.
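A minimal sketch of a TNTComplEx-style score, assuming the temporal and non-temporal tensors share the entity embeddings and differ only in their predicate embeddings (the names `v_temporal_j` and `v_static_j` are hypothetical):

```python
def tntcomplex_score(u_i, v_temporal_j, v_static_j, u_k, t_l):
    """Real part of <u_i, v_temporal_j * t_l + v_static_j, conj(u_k)>."""
    total = 0j
    for a, v_t, v_s, b, t in zip(u_i, v_temporal_j, v_static_j, u_k, t_l):
        total += a * (v_t * t + v_s) * b.conjugate()
    return total.real

# Toy rank-2 embeddings with arbitrary values, for illustration only.
u_i = [1 + 1j, 2 - 1j]
u_k = [0.5 + 2j, -1 + 1j]
v_static = [1 - 1j, 0.5 + 0j]
t_2009 = [1 + 0j, 0.2 - 0.3j]
t_2017 = [-0.4 + 1j, 2 + 0.5j]

# A predicate whose temporal part is zero gets the same score at any timestamp.
no_temporal = [0j, 0j]
static_2009 = tntcomplex_score(u_i, no_temporal, v_static, u_k, t_2009)
static_2017 = tntcomplex_score(u_i, no_temporal, v_static, u_k, t_2017)

# A predicate with a non-zero temporal part is scored differently over time.
temporal = [1 + 0j, 1 + 0j]
moving_2009 = tntcomplex_score(u_i, temporal, v_static, u_k, t_2009)
moving_2017 = tntcomplex_score(u_i, temporal, v_static, u_k, t_2017)
```

In this reading, the model can fit a non-temporal predicate by driving its temporal embedding towards zero, without a separate hyperparameter deciding which components are time-dependent.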

### 3.2 Regularization

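Any order 4 tensor can be considered as an order 3 tensor by *unfolding* modes together. For a tensor $X \in \mathbb{R}^{N_1 \times N_2 \times N_3 \times N_4}$, unfolding modes 3 and 4 together will lead to a tensor $\tilde{X} \in \mathbb{R}^{N_1 \times N_2 \times N_3 N_4}$ (kolda_tensor_2009).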
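To make the unfolding concrete, here is a small sketch on nested lists, assuming the last two modes are merged in row-major order and indices are 0-based (both are conventions chosen for this illustration):

```python
def unfold_last_two(X):
    """Merge the last two modes of X[i][j][k][l] into a single mode."""
    return [[[x for row in X_ij for x in row] for X_ij in X_i] for X_i in X]

def unfolded_index(k, l, n4):
    """Position of the pair (k, l) in the merged mode, with 0-based indices."""
    return k * n4 + l

# A 2 x 2 x 3 x 2 tensor whose entries encode their own indices.
n1, n2, n3, n4 = 2, 2, 3, 2
X = [[[[1000 * i + 100 * j + 10 * k + l for l in range(n4)]
       for k in range(n3)]
      for j in range(n2)]
     for i in range(n1)]

X_unfolded = unfold_last_two(X)  # shape 2 x 2 x 6
```

The entries are unchanged; only the way they are indexed differs, which is why a decomposition of the order 4 tensor induces one of the unfolded tensor.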

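We can see both decompositions ((4) and (6)) as order 3 tensors by unfolding the temporal and predicate modes together. Considering the decomposition implied by these unfoldings (see Appendix 8.1) leads us to the following weighted regularizers (lacroix2018canonical):

$$\Omega^3(U, V, T; (i,j,k,l)) = \frac{1}{3}\big(\|u_i\|_3^3 + \|u_k\|_3^3 + \|v_j \odot t_l\|_3^3\big) \tag{7}$$

$$\Omega^3(U, V^t, V, T; (i,j,k,l)) = \frac{1}{3}\big(2\|u_i\|_3^3 + 2\|u_k\|_3^3 + \|v^t_j \odot t_l\|_3^3 + \|v_j\|_3^3\big) \tag{8}$$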

The first regularizer weights subjects, objects and pairs (predicate, timestamp) according to their respective marginal probabilities. This regularizer is a variational form of the weighted nuclear 3-norm on an order 4 tensor (see subsection 3.4 and Appendix 8.3 for details and proof). The second regularizer is the sum of the nuclear 3 penalties on tensors $[\![U, V^t, \overline{U}, T]\!]$ and $[\![U, V, \overline{U}]\!]$.

### 3.3 Smoothness of temporal embeddings

We have more a priori structure on the temporal mode than on others. Notably, we expect smoothness of the application $i \mapsto t_i$. In words, we expect neighboring timestamps to have close representations. Thus, we penalize the norm of the discrete derivative of the temporal embeddings:

$$\Lambda_p(T) = \frac{1}{|T| - 1} \sum_{i=1}^{|T|-1} \|t_{i+1} - t_i\|_p^p \tag{9}$$
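A minimal sketch of such a smoothness penalty, assuming it averages the $p$-th power of the $p$-norm of the difference between consecutive timestamp embeddings (the function name and the toy values are illustrative):

```python
def temporal_smoothness(T, p):
    """Average of ||t_{i+1} - t_i||_p^p over consecutive rows of T.

    T is a list of timestamp embeddings ordered in time; entries may be
    real or complex, since abs() is the modulus in both cases.
    """
    total = 0.0
    for t_cur, t_next in zip(T, T[1:]):
        total += sum(abs(b - a) ** p for a, b in zip(t_cur, t_next))
    return total / (len(T) - 1)

# Embeddings that do not move over time incur no penalty.
flat = temporal_smoothness([[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]], p=4)
# Differences (1, 0) and (0, 2) give (1 + 4) / 2 for p = 2.
bumpy = temporal_smoothness([[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]], p=2)
```

Because the penalty is a sum over coefficients, it fits the same stochastic gradient setting as the other regularizers.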

We show in Appendix 8.2 that the sum of $\Lambda_p$ and the variational form of the nuclear $p$-norm (11) leads to a variational form of a new tensor atomic norm.

### 3.4 Nuclear $p$-norms of tensors and their variational forms

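As was done in lacroix2018canonical, we aim to use tensor nuclear $p$-norms as regularizers. The definition of the nuclear $p$-norm of a tensor (friedland_nuclear_2014) of order $D$ is:

$$\|X\|_{p*} = \inf_{\alpha, R, U^{(1)}, \dots, U^{(D)}} \Big\{ \|\alpha\|_1 \;\Big|\; X = \sum_{r=1}^{R} \alpha_r\, U^{(1)}_{:,r} \otimes \cdots \otimes U^{(D)}_{:,r},\ \ \forall r, d\ \ \|U^{(d)}_{:,r}\|_p = 1 \Big\} \tag{10}$$

This formulation of the nuclear $p$-norm writes a tensor as a sum over *atoms* which are the rank-1 tensors of unit $p$-norm factors. The nuclear $p$-norm is NP-hard to compute (friedland_nuclear_2014). Following lacroix2018canonical, a practical solution is to use the equivalent formulation of the nuclear $p$-norm given by its *variational form*, which can be conveniently written for $p = D$:

$$\|X\|_{D*} = \frac{1}{D} \inf_{R, U^{(1)}, \dots, U^{(D)}} \Big\{ \sum_{d=1}^{D} \sum_{r=1}^{R} \|U^{(d)}_{:,r}\|_D^D \;\Big|\; X = \sum_{r=1}^{R} U^{(1)}_{:,r} \otimes \cdots \otimes U^{(D)}_{:,r} \Big\} \tag{11}$$

For the equality above to hold, the infimum should be over all possible $R$. The practical solution is to fix $R$ to the desired rank of the decomposition. Using this variational formulation as a regularizer leads to state of the art results for order-3 tensors (lacroix2018canonical) and is convenient in a stochastic gradient setting because it separates over each model coefficient.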

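In addition, this formulation makes it easy to introduce a weighting as recommended in srebro_collaborative_2010; foygel_learning_2011. In order to learn under non-uniform sampling distributions, one should penalize the weighted norm $\big\|\big(\sqrt{M^{(1)}} \otimes \sqrt{M^{(2)}}\big) \odot X\big\|_{2*}$, where $M^{(1)}$ and $M^{(2)}$ are the empirical row and column marginals of the distribution. The variational form (11) makes this easy, by simply penalizing rows $u^{(1)}_{i_1}, \dots, u^{(D)}_{i_D}$ for each observed tuple $(i_1, \dots, i_D)$ in stochastic gradient descent. More precisely, for $D = 2$ and $N^{(d)}$ the vectors holding the observed count of each index over each mode $d$:

$$\frac{1}{|S|} \sum_{(i,j) \in S} \big(\|u_i\|_2^2 + \|v_j\|_2^2\big) = \sum_i \frac{N^{(1)}_i}{|S|} \|u_i\|_2^2 + \sum_j \frac{N^{(2)}_j}{|S|} \|v_j\|_2^2 = \sum_i M^{(1)}_i \|u_i\|_2^2 + \sum_j M^{(2)}_j \|v_j\|_2^2 \tag{12}$$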

In subsection 3.3, we add another penalty in Equation (9) which changes the norm of our atoms. In subsection 3.2, we introduced another variational form in Equation (7) which allows us to easily penalize the nuclear 3-norm of an order 4 tensor. This regularizer leads to a different weighting. By considering the unfolding of the timestamp and predicate modes, we are able to weight according to the joint marginal of timestamps and predicates, rather than by the product of the marginals. This can be an important distinction if the two are not independent.

### 3.5 Experimental impact of the regularizers

We study the impact of regularization on the ICEWS05-15 dataset, for the TNTComplEx model. For details on the experimental set-up, see Section 5.1. The first effect we want to quantify is the effect of the regularizer $\Lambda_p$. We run a grid search for the strength of both $\Lambda_p$ and $\Omega^3$ and plot the convex hull as a function of the temporal regularization strength. As shown in Figure 1, imposing smoothness along the time mode brings an improvement of over MRR point.

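The second effect we wish to quantify is the effect of the choice of regularizer $\Omega$. A natural regularizer for TNTComplEx would be:

$$\Delta^p(U, V^t, V, T; (i,j,k,l)) = \frac{1}{p}\big(2\|u_i\|_p^p + 2\|u_k\|_p^p + \|t_l\|_p^p + \|v^t_j\|_p^p + \|v_j\|_p^p\big) \tag{13}$$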

We compare $\Delta^4$, $\Delta^3$ and $\Delta^2$ with $\Omega^3$. The comparison is done with the temporal regularizer held at a single fixed strength to reduce the experimental space.

$\Delta^2$ is the common weight-decay frequently used in deep learning. Such regularizers have been used in knowledge base completion (nickel_three-way_2011; nickel_holographic_2015; trouillon_complex_2016); however, lacroix2018canonical showed that the infimum of this penalty is non-convex over tensors.

$\Delta^3$ matches the order used in the $\Omega^3$ regularizer, and in previous work on knowledge base completion (lacroix2018canonical). However, by the same arguments, its minimization does not lead to a convex penalty over tensors.

$\Delta^4$ is the sum of the variational forms of the nuclear 4-norm for the two tensors of order 4 in the TNTComplEx model according to equation (11).

Detailed results of the impact of regularization on the performances of the model are given in Figure 1. The two regularizers $\Delta^4$ and $\Omega^3$ are the only regularizers that can be interpreted as sums of tensor norm variational forms, and they perform better than their lower order counterparts.

There are two differences between $\Delta^4$ and $\Omega^3$. First, whereas the first is a variational form of the nuclear 4-norm, the second is a variational form of the nuclear 3-norm, which is closer to the nuclear 2-norm. Results for exact recovery of tensors have been generalized to the nuclear 2-norm, and to the extent of our knowledge, there has been no formal study of generalization properties or exact recovery under the nuclear $p$-norm for $p$ greater than two.

Second, the weighting in $\Delta^4$ is done separately over timestamps and predicates, whereas it is done jointly for $\Omega^3$. This leads to using the joint empirical marginal as a weighting over timestamps and predicates. The impact of weighting on the guarantees that can be obtained is described more precisely in foygel_learning_2011.

The contributions of all these regularizers over a non-regularized model are summarized in Table 3. Note that careful regularization leads to a MRR increase.

## 4 A new dataset for Temporal and non-Temporal Knowledge Base Completion

A dataset based on Wikidata was proposed by garcia2018learning. However, upon inspection, this dataset contains numerical data as entities, such as ELO rankings of chess players, which are not representative of practically useful link prediction problems. Also, in this dataset, temporal information is specified in the form of “OccursSince” and “OccursUntil” statements appended to triples, which becomes unwieldy when a predicate holds for several intervals in time. Moreover, this dataset contains only entities and which is insufficient to benchmark methods at scale.

The GDelt dataset described in ma2018embedding; goel2019diachronic holds many triples (), but does not describe enough entities (). In order to address these limitations, we created our own dataset from Wikidata, which we make available along with the code for this paper at https://github.com/facebookresearch/tkbc.

Starting from Wikidata, we removed all entities that were instances of scholarly articles, proteins and others. We also removed disambiguation, template, category and project pages from Wikipedia. Then, we removed all facts for which the object was not an entity. We iteratively filtered out entities that had degree at least and predicates that had at least occurrences. With this method, we obtained a dataset of entities, predicates and timestamps (we only kept the years). Each datum is a triple (subject, predicate, object) together with a timestamp range (begin, end) where begin, end or both can be unspecified. Our train set contains such tuples, with about partially specified temporal tuples. We kept a validation and test set of size each.

At train and test time, for a given datum (subject, predicate, object, [begin, end]), we sample a timestamp (appearing in the dataset) uniformly at random, in the range [begin, end]. For data without a temporal range, we sample over the maximum date range. Then, we rank the objects for the partial query (subject, predicate, ?, timestamp).
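The sampling step can be sketched as follows. This is an illustration under assumptions: timestamps are integer years, and a missing bound falls back to the corresponding end of the dataset's date range (the text only specifies the case where the whole range is missing). The function name is hypothetical:

```python
import random

def sample_timestamp(timestamps, begin=None, end=None, rng=random):
    """Pick uniformly among the dataset timestamps lying in [begin, end].

    A bound given as None is replaced by the earliest or latest timestamp
    in the dataset (an assumption made for this sketch).
    """
    low = min(timestamps) if begin is None else begin
    high = max(timestamps) if end is None else end
    candidates = [t for t in timestamps if low <= t <= high]
    return rng.choice(candidates)  # raises IndexError if no timestamp fits

years = [1995, 2000, 2001, 2002, 2003, 2010, 2017]
rng = random.Random(0)  # seeded generator, for reproducibility

inside = [sample_timestamp(years, 2001, 2003, rng) for _ in range(200)]
open_ended = [sample_timestamp(years, begin=2003, rng=rng) for _ in range(200)]
unbounded = [sample_timestamp(years, rng=rng) for _ in range(200)]
```

Since a new timestamp is drawn each time a datum is visited, an interval is covered over the course of training even though each step only sees one point in it.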

## 5 Experimental Results

### 5.1 Experimental Set-Up

We follow the experimental set-up in garcia2018learning; goel2019diachronic. We use models from garcia2018learning and goel2019diachronic as baselines since they are the best performing algorithms on the datasets considered. We report the filtered Mean Reciprocal Rank (MRR) defined in nickel_holographic_2015. In order to obtain comparable results, we use Table 1 and dataset statistics to compute the rank for each (model, dataset) pair that matches the number of parameters used in goel2019diachronic. We also report results at ranks 10 times higher. This higher rank set-up gives an estimation of the best possible performance attainable on these datasets, even though the dimension used might be impractical for applied systems. All our models are optimized with Adagrad (duchi_adaptive_2011), with a learning rate of , a batch-size of . More details on the grid-search, actual ranks used and hyper-parameters are given in Appendix 8.7.
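For readers unfamiliar with the metric, here is a minimal sketch of a filtered rank and of the MRR, assuming that other objects known to be correct answers are ignored when ranking and that ties are broken in favor of the true object (the tie-breaking rule is an assumption of this sketch):

```python
def filtered_rank(scores, true_object, known_objects):
    """Rank of the true object among candidates not known to be correct.

    scores[k] is the score of candidate object k; known_objects holds the
    indices of all objects that are correct answers to the query in the
    train, validation or test sets.
    """
    target = scores[true_object]
    rank = 1
    for obj, score in enumerate(scores):
        if obj == true_object or obj in known_objects:
            continue  # filtered out: not counted against the true object
        if score > target:
            rank += 1
    return rank

def mean_reciprocal_rank(ranks):
    """Average of 1 / rank over all evaluated queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

scores = [0.9, 0.8, 0.7, 0.1]
raw = filtered_rank(scores, true_object=2, known_objects=set())
filtered = filtered_rank(scores, true_object=2, known_objects={0, 2})
mrr = mean_reciprocal_rank([1, 2, 4])
```

Filtering avoids penalizing the model for ranking another true answer above the one being evaluated.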

We give results on datasets previously used in the literature: ICEWS14, ICEWS05-15 and Yago15k. The ICEWS datasets are samplings from the Integrated Conflict Early Warning System (ICEWS) (icewsdataset) (more information can be found at http://www.icews.com). garcia2018learning introduced two subsamplings of this data: ICEWS14, which contains all events occurring in 2014, and ICEWS05-15, which contains events occurring between 2005 and 2015. These datasets immediately fit in our framework, since the timestamps are already discretized.

The Yago15K dataset (garcia2018learning) is a modification of FB15k (bordes_translating_2013) which adds “occursSince” and “occursUntil” timestamps to each triple. Following the evaluation setting of garcia2018learning, during evaluation, the incomplete triples to complete are of the form (subject, predicate, ?, occursSince | occursUntil, timestamp) (with reciprocal predicates). Rather than deal with tensors of order 5, we choose to unfold the (occursSince, occursUntil) and the predicate modes together, multiplying the size of the predicate mode by two.

Some relations in Wikidata are highly unbalanced (e.g. (?, InstanceOf, Human)). For such relations, a ranking evaluation would not make much sense. Instead, we only compute the Mean Reciprocal Rank for missing right hand sides, since the data is such that highly unbalanced relations happen on the left-hand side. However, we follow the same training scheme as for all the other datasets, including reciprocal relations in the training set. The cross-entropy loss evaluated on entities puts a restriction on the dimensionality of embeddings at about for a batch-size of . We leave sampling of this loss, which would allow for higher dimensions, to future work.

### 5.2 Results

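Table 4: Results on Wikidata, with the MRR broken down over non-temporal (NT-MRR) and temporal (T-MRR) triples.

Model | MRR | NT-MRR | T-MRR
---|---|---|---
ComplEx | | |
TComplEx | | |
TNTComplEx | | |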

We compare ComplEx with the temporal versions described in this paper. We report results in Table 3. Note that ComplEx has performances that are stable through a tenfold increase of its number of parameters: a rank of is enough to capture the static information of these datasets. For temporal models, however, the performance increases a lot with the number of parameters. It is always beneficial to allow a separate modeling of non-temporal predicates, as the performances of TNTComplEx show. Finally, our models match or beat the state of the art on all datasets, even at an identical number of parameters. Since these datasets are small, we also report results for higher ranks (10 times the number of parameters used for DE-SimplE).

On Wikidata, of the triples have no temporal data attached. This leads to ComplEx outperforming all temporal models in terms of average MRR, since the Non-Temporal MRR (NT-MRR) far outweighs the Temporal MRR (T-MRR). A breakdown of the performances is available in Table 4. TNTComplEx obtains performances that are comparable to ComplEx on non-temporal triples, but are better on temporal triples. Moreover, TNTComplEx can minimize the temporal cross-entropy (14) and is thus more flexible on the queries it can answer.

Training TNTComplEx on Wikidata with a rank of with the full cross-entropy on a Quadro GP 100, we obtain a speed of triples per second, leading to experiments time of hours. This is to be compared with triples per second when training ComplEx for experiments time of hours. The additional complexity of our model does not lead to any real impact on runtime, which is dominated by the computation of the cross-entropy over entities.

## 6 Qualitative study

The instantaneous loss described in equation (2), along with the timestamp sampling scheme described in the previous section, only enforces correct rankings along the “object” tubes of our order-4 tensor. In order to enforce a stronger temporal consistency, and to be able to answer queries of the type (subject, predicate, object, ?), we propose another cross-entropy loss along the temporal tubes:

$$\tilde{\ell}(\hat{X}; (i,j,k,l)) = -\hat{X}_{i,j,k,l} + \log\Big(\sum_{l'} \exp\big(\hat{X}_{i,j,k,l'}\big)\Big) \tag{14}$$

We optimize the sum of $\ell$ defined in Equation 2 and $\tilde{\ell}$ defined in Equation 14. Doing so, we only lose MRR point overall. However, we make our model better at answering queries along the time axis. The macro area under the precision recall curve is for a TNTComplEx model learned with $\ell$ alone and for a TNTComplEx model trained with $\ell + \tilde{\ell}$.
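A minimal sketch of the summed objective for one training tuple, assuming both terms are softmax cross-entropies, one taken over candidate objects and one over candidate timestamps (function names and inputs are illustrative):

```python
import math

def log_loss(scores, target):
    """Softmax cross-entropy of the target index given a list of scores."""
    m = max(scores)  # subtract the max before exponentiating, for stability
    return m + math.log(sum(math.exp(s - m) for s in scores)) - scores[target]

def combined_loss(object_scores, true_object, time_scores, true_time):
    """Loss along the object tube plus loss along the temporal tube.

    object_scores[k] scores (subject, predicate, k, time) over all objects k;
    time_scores[l] scores (subject, predicate, object, l) over all timestamps l.
    """
    return log_loss(object_scores, true_object) + log_loss(time_scores, true_time)

# Uniform scores over 4 objects and 5 timestamps give log(4) + log(5).
uninformed = combined_loss([0.0] * 4, 1, [0.0] * 5, 3)
# Higher scores for the observed object and timestamp lower the loss.
informed = combined_loss([0.0, 6.0, 0.0, 0.0], 1, [0.0, 0.0, 0.0, 6.0, 0.0], 3)
```

The second term is what lets the model be queried along the time axis, since it forces the observed timestamp to compete with all other timestamps for the same triple.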

We plot in Figure 2 the scores along time for train triples (president of the french republic, office holder, {Jacques Chirac | Nicolas Sarkozy | François Hollande | Emmanuel Macron}, ). The periods where a score is highest match closely the ground truth start and end dates of these presidents' mandates, which are represented as a colored background. This shows that our models are able to learn rankings that are correct along time intervals despite our training method only ever sampling timestamps within these intervals.

## 7 Conclusion

Tensor methods have been successful for Knowledge Base completion. In this work, we suggest an extension of these methods to Temporal Knowledge Bases. Our methodology adapts well to the various forms of these datasets: point-in-time, beginning and endings or intervals. We show that our methods reach higher performances than the state of the art for a similar number of parameters. For several datasets, we also provide performances for higher dimensions. We hope that the gap between low-dimensional and high-dimensional models can motivate further research in models that have increased expressivity at a lower number of parameters per entity. Finally, we propose a large scale temporal dataset which we believe represents the challenges of large scale temporal completion in knowledge bases. We give performances of our methods for low ranks on this dataset. We believe that, given its scale, this dataset could also be an interesting addition to non-temporal knowledge base completion.

## 8 Appendix

### 8.1 Unfolding and the CP decomposition

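Let $X = [\![U, V, W, T]\!]$, that is $X_{i,j,k,l} = \langle u_i, v_j, w_k, t_l \rangle$. Then according to kolda_tensor_2009, unfolding along modes 3 and 4 leads to an order three tensor of decomposition $\tilde{X} = [\![U, V, W \circ T]\!]$, where $\circ$ is the Khatri-Rao product (smilde2005multi), which is the column-wise Kronecker product: $W \circ T = (W_{:,1} \otimes T_{:,1}, \dots, W_{:,R} \otimes T_{:,R})$.

Note that for a fourth mode of size $L$: $(W \circ T)_{L(k-1)+l} = w_k \odot t_l$. This justifies the regularizers used in Section 3.2.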
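The row structure of the Khatri-Rao product can be checked with a short sketch on nested lists, assuming factor matrices stored row by row and 0-based indices (so the row for the pair $(k, l)$ sits at position $kL + l$ rather than $L(k-1) + l$):

```python
def khatri_rao(W, T):
    """Column-wise Kronecker product of W (N3 x R) and T (L x R).

    The result has N3 * L rows; the row for the pair (k, l) is the
    elementwise product of row k of W and row l of T.
    """
    return [[w * t for w, t in zip(w_row, t_row)] for w_row in W for t_row in T]

W = [[1, 2], [3, 4], [5, 6]]   # N3 = 3 rows, rank R = 2
T = [[10, 20], [30, 40]]       # L = 2 rows, rank R = 2

WT = khatri_rao(W, T)
L = len(T)
# Column 0 of the product, to compare with the Kronecker product of columns 0.
column_0 = [row[0] for row in WT]
```

Each column of the result is the Kronecker product of the matching columns of `W` and `T`, and each row is a product of one row from each factor, which is the quantity penalized by the regularizer on (predicate, timestamp) pairs.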

### 8.2 Temporal regularizer and Nuclear norms

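Consider the penalty, where $\alpha \geq 0$ is the strength of the temporal smoothness term:

$$\Omega(U, V, W, T) = \frac{1}{4} \sum_{r=1}^{R} \Big( \|U_{:,r}\|_4^4 + \|V_{:,r}\|_4^4 + \|W_{:,r}\|_4^4 + \|T_{:,r}\|_4^4 + \alpha \|T_{2:,r} - T_{:-1,r}\|_4^4 \Big) \tag{15}$$

Let us define a new norm on vectors:

$$\|t\|_{\tau 4} = \Big( \|t\|_4^4 + \alpha \|t_{2:} - t_{:-1}\|_4^4 \Big)^{1/4} \tag{16}$$

$\|\cdot\|_{\tau 4}$ is a norm and lets us rewrite:

$$\Omega(U, V, W, T) = \frac{1}{4} \sum_{r=1}^{R} \Big( \|U_{:,r}\|_4^4 + \|V_{:,r}\|_4^4 + \|W_{:,r}\|_4^4 + \|T_{:,r}\|_{\tau 4}^4 \Big) \tag{17}$$

Following the proof in lacroix2018canonical, which only uses homogeneity of the norms, we can show that $\Omega$ is a variational form of an atomic norm with atoms:

$$\big\{ u \otimes v \otimes w \otimes t \;\big|\; \|u\|_4 = \|v\|_4 = \|w\|_4 = \|t\|_{\tau 4} = 1 \big\}$$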

### 8.3 Nuclear norms on unfoldings

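We consider the regularizer:

$$\Omega^3(U, V, T; (i,j,k,l)) = \frac{1}{3}\big(\|u_i\|_3^3 + \|u_k\|_3^3 + \|v_j \odot t_l\|_3^3\big) \tag{18}$$

Let $D_{\text{subj}}$ (resp. obj, pred/time) be the diagonal matrix containing the cubic roots of the marginal probabilities of each subject (resp. obj, pred/time) in the dataset. We denote by $\circ$ the Khatri-Rao product between two matrices (the column-wise Kronecker product). Summing over the entire dataset, we obtain the penalty:

$$\frac{1}{|S|} \sum_{(i,j,k,l) \in S} \Omega^3(U, V, T; (i,j,k,l)) = \frac{1}{3}\Big( \|D_{\text{subj}} U\|_3^3 + \|D_{\text{obj}} U\|_3^3 + \|D_{\text{pred/time}} (V \circ T)\|_3^3 \Big) \tag{19}$$

Dropping the weightings to simplify notations, we state the equivalence between this regularizer and a variational form of the nuclear 3-norm of an order 4 tensor:

$$\inf_{[\![U^{(1)}, \dots, U^{(4)}]\!] = X} \frac{1}{3} \sum_{r=1}^{R} \Big( \|U^{(1)}_{:,r}\|_3^3 + \|U^{(2)}_{:,r}\|_3^3 + \|U^{(3)}_{:,r} \otimes U^{(4)}_{:,r}\|_3^3 \Big) = \inf_{[\![U^{(1)}, \dots, U^{(4)}]\!] = X} \sum_{r=1}^{R} \prod_{d=1}^{4} \|U^{(d)}_{:,r}\|_3 = \|X\|_{3*} \tag{20}$$

The proof follows lacroix2018canonical, noting that $\|U^{(3)}_{:,r} \otimes U^{(4)}_{:,r}\|_3^3 = \|U^{(3)}_{:,r}\|_3^3 \, \|U^{(4)}_{:,r}\|_3^3$. Note that for $D_{\text{pred/time}} = D_{\text{pred}} \otimes D_{\text{time}}$, there would also be equality of the weighted norms. However, in the application considered, time and predicate are most likely not independent, leading to different weightings of the norms.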

### 8.4 Dataset statistics

Statistics of all the datasets used in this work are gathered in Table 5.

Statistic | ICEWS14 | ICEWS05-15 | Yago15k | Wikidata
---|---|---|---|---
Entities | | | |
Predicates | | | |
Timestamps | | | |
$\lvert S \rvert$ | | | |

### 8.5 Detailed results

ICEWS14 | ICEWS05-15 | Yago15k | ||||||||||

MRR | H@1 | H@3 | H@10 | MRR | H@1 | H@3 | H@10 | MRR | H@1 | H@3 | H@10 | |

TA | 0.37 | - | 0.69 | 0.35 | - | 0.73 | 0.23 | - | 0.51 | |||

DE-SimplE | 0.42 | 0.59 | 0.73 | 0.39 | 0.58 | 0.75 | - | - | - | - | ||

ComplEx | 0.35 | 0.53 | 0.70 | 0.37 | 0.55 | 0.72 | ||||||

TComplEx | 0.73 | 0.49 | 0.64 | 0.76 | 0.27 | |||||||

TNTComplEx | 0.46 | 0.35 | ||||||||||

ComplEx (x10) | 0.35 | 0.54 | 0.71 | 0.37 | 0.55 | 0.73 | 0.36 | |||||

TComplEx (x10) | 0.80 | 0.28 | 0.38 | |||||||||

TNTComplEx (x10) | 0.52 | 0.76 |

### 8.6 Standard deviations

We give the standard deviations for the MRR computed over 5 runs of TNTComplEx on all datasets:

Model | ICEWS14 | ICEWS05-15 | Yago15k | Wikidata (T) | Wikidata (NT)
---|---|---|---|---|---
TNTComplEx | 0.0016 | 0.0011 | 0.00076 | 0.0035 | 0.0012

### 8.7 Grid Search

For ICEWS14, ICEWS05-15 and Yago15k, we follow the grid-search below :

Using Table 1 to compute the number of parameters and the dataset statistics in Table 5, we use the following ranks to match the number of parameters of DE-SimplE in dimension 100:

Model | ICEWS14 | ICEWS05-15 | Yago15k
---|---|---|---
DE-SimplE | 100 | 100 | 100
ComplEx | | |
TComplEx | | |
TNTComplEx | | |