Diachronic Embedding for Temporal Knowledge Graph Completion

07/06/2019 · by Rishab Goel et al., Borealis AI

Knowledge graphs (KGs) typically contain temporal facts indicating relationships among entities at different times. Due to their incompleteness, several approaches have been proposed to infer new facts for a KG based on the existing ones – a problem known as KG completion. KG embedding approaches have proved effective for KG completion; however, they have been developed mostly for static KGs. Developing temporal KG embedding models is an increasingly important problem. In this paper, we build novel models for temporal KG completion by equipping static models with a diachronic entity embedding function which provides the characteristics of entities at any point in time. This is in contrast to the existing temporal KG embedding approaches, where only static entity features are provided. The proposed embedding function is model-agnostic and can potentially be combined with any static model. We prove that combining it with SimplE, a recent model for static KG embedding, results in a fully expressive model for temporal KG completion. Our experiments indicate the superiority of our proposal compared to existing baselines.







1 Introduction

Knowledge graphs (KGs) are directed graphs where nodes represent entities and (labeled) edges represent the types of relationships among entities. Each edge in a KG corresponds to a fact and can be represented as a tuple (v, r, u), where v and u are called the head and tail entities respectively and r is a relation. An important problem, known as KG completion, is to infer new facts from a KG based on the existing ones. This problem has been extensively studied for static KGs (see Nickel et al. (2016a); Wang et al. (2017); Nguyen (2017) for a summary). KG embedding approaches have offered state-of-the-art results for KG completion on several benchmarks. These approaches map each entity and each relation type to a hidden representation and compute a score for each tuple by applying a score function to these representations. Different approaches differ in how they map the entities and relation types to hidden representations and in their score functions.

To capture the temporal aspect of the facts, KG edges are typically associated with a timestamp or time interval, yielding tuples of the form (v, r, u, t). However, KG embedding approaches have been mostly designed for static KGs, ignoring the temporal aspect. Recent work has shown a substantial boost in performance by extending these approaches to utilize time Jiang et al. (2016); Dasgupta et al. (2018); Ma et al. (2018); García-Durán et al. (2018). The proposed extensions mainly compute a hidden representation for each timestamp and extend the score functions to utilize timestamp representations as well as entity and relation representations.

In this paper, we develop models for temporal KG completion (TKGC) based on an intuitive assumption: to provide a score for a tuple such as (v, Liked, u, t), one needs to know v's and u's features at time t; providing a score based on their current features may be misleading. That is because v's personality and the sentiment towards u may have been quite different at time t as compared to now. Consequently, learning a static representation for each entity – as is done by existing approaches – may be sub-optimal, as such a representation only captures the entity features at the current time, or an aggregation of entity features over time.

To provide entity features at any given time, we define entity embedding as a function which takes an entity and a timestamp as input and provides a hidden representation for the entity at that time. Inspired by diachronic word embeddings, we call our proposed embedding diachronic embedding (DE). DE is model-agnostic: any static KG embedding model can potentially be extended to TKGC by leveraging DE. We prove that combining DE with SimplE Kazemi and Poole (2018c) results in a fully expressive model for TKGC. To the best of our knowledge, this is the first TKGC model with a proof of full expressiveness. We show the merit of our model on subsets of the ICEWS Boschee et al. (2015) and GDELT Leetaru and Schrodt (2013) datasets.

2 Background and Notation


Lower-case letters denote scalars, bold lower-case letters denote vectors, and bold upper-case letters denote matrices. z[n] represents the n-th element of a vector z, ||z|| represents its norm, and z⊤ represents its transpose. For two vectors z1 and z2, [z1; z2] represents the concatenation of the two vectors. z1 ⊗ z2 represents a vector such that (z1 ⊗ z2)[(n1 − 1) d2 + n2] = z1[n1] · z2[n2] (i.e. the flattened vector of the tensor/outer product of the two vectors, where d2 is the length of z2). For k vectors z1, …, zk of the same length d, ⟨z1, …, zk⟩ = Σ_{n=1}^{d} z1[n] · … · zk[n] represents the sum of the element-wise product of the elements of the vectors.
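The notation above can be illustrated with a small NumPy sketch (the vectors are arbitrary toy values):

```python
import numpy as np

# Toy vectors illustrating the notation in this section.
z1 = np.array([1.0, 2.0])
z2 = np.array([3.0, 4.0])
z3 = np.array([0.5, 0.5])

# Concatenation [z1; z2].
concat = np.concatenate([z1, z2])

# Flattened outer (tensor) product: (z1 ⊗ z2)[(n1-1)*d2 + n2] = z1[n1]*z2[n2].
outer_flat = np.outer(z1, z2).flatten()

# Multi-linear product ⟨z1, z2, z3⟩: sum of the element-wise products.
trilinear = float(np.sum(z1 * z2 * z3))

print(concat)      # [1. 2. 3. 4.]
print(outer_flat)  # [3. 4. 6. 8.]
print(trilinear)   # 1*3*0.5 + 2*4*0.5 = 5.5
```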

Temporal Knowledge Graph (Completion): Let V be a finite set of entities, R be a finite set of relation types, and T be a finite set of timestamps. Let W ⊆ V × R × V × T represent the set of all temporal tuples (v, r, u, t) that are facts (i.e. true in a world), where v, u ∈ V, r ∈ R, and t ∈ T. Let W̄ be the complement of W. A temporal knowledge graph (KG) G is a subset of W (i.e. G ⊆ W). Temporal KG completion (TKGC) is the problem of inferring W from G.

Relation Properties: A relation r is symmetric if (v, r, u, t) ∈ W ⟺ (u, r, v, t) ∈ W and anti-symmetric if (v, r, u, t) ∈ W ⟺ (u, r, v, t) ∈ W̄. A relation ri is the inverse of another relation rj if (v, ri, u, t) ∈ W ⟺ (u, rj, v, t) ∈ W. ri entails rj if (v, ri, u, t) ∈ W ⟹ (v, rj, u, t) ∈ W.

KG Embedding: Formally, we define an entity embedding as follows.

Definition 1.

An entity embedding, EEMB, is a function which maps every entity v ∈ V to a hidden representation in Ψ, where Ψ is the class of non-empty tuples of vectors and/or matrices.

A relation embedding (REMB) is defined similarly. We refer to the hidden representation of an entity (relation) as the embedding of the entity (relation). A KG embedding model defines two things: 1- the EEMB and REMB functions, 2- a score function φ which takes the EEMB and REMB outputs as input and provides a score for a given tuple. The parameters of the hidden representations are learned from data.

3 Existing Approaches

In this section, we describe the existing approaches for static and temporal KG completion that will be used in the rest of the paper. For further detail on temporal KG completion approaches, we refer the reader to a recent survey Kazemi et al. (2019). We represent the score for a tuple (v, r, u) by φ(v, r, u) and for a temporal tuple (v, r, u, t) by φ(v, r, u, t).

TransE (static) Bordes et al. (2013): In TransE, EEMB(v) = (z_v) for every v ∈ V where z_v ∈ R^d, REMB(r) = (z_r) for every r ∈ R where z_r ∈ R^d, and φ(v, r, u) = −||z_v + z_r − z_u||.

DistMult (static) Yang et al. (2015): Same EEMB and REMB as TransE but defining φ(v, r, u) = ⟨z_v, z_r, z_u⟩.

Tucker (static) Tucker (1966); Balažević et al. (2019): Same EEMB and REMB as TransE but defining φ(v, r, u) = ⟨w, z_v ⊗ z_r ⊗ z_u⟩, where w is a weight vector shared for all tuples.

RESCAL (static) Nickel et al. (2011): Same EEMB as TransE but defining REMB(r) = (Z_r) for every r ∈ R where Z_r ∈ R^{d×d}, and defining φ(v, r, u) = z_v⊤ Z_r z_u.

Canonical Polyadic (CP) (static) Hitchcock (1927): Same REMB as TransE but defining EEMB(v) = (z_v, z̄_v) for every v ∈ V where z_v, z̄_v ∈ R^d. z_v is used when v is the head and z̄_v is used when v is the tail. In CP, φ(v, r, u) = ⟨z_v, z_r, z̄_u⟩. DistMult is a special case of CP where z_v = z̄_v for every v ∈ V.

SimplE (static) Kazemi and Poole (2018c): Noticing an information flow issue between the two vectors z_v and z̄_v of an entity v in CP, Kazemi and Poole (2018c) take advantage of the inverses of relations to address this issue. They define REMB(r) = (z_r, z_{r⁻¹}) for every r ∈ R, where z_r is used as in CP and z_{r⁻¹} is considered the embedding of r⁻¹, the inverse of r. In SimplE, φ(v, r, u) is defined as the average of two CP scores: 1- ⟨z_v, z_r, z̄_u⟩, corresponding to the score for (v, r, u), and 2- ⟨z_u, z_{r⁻¹}, z̄_v⟩, corresponding to the score for (u, r⁻¹, v). A similar extension of CP has been proposed in Lacroix et al. (2018).
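As a concrete illustration, the CP and SimplE scores can be sketched as follows (the embeddings are hypothetical toy values, not learned ones):

```python
import numpy as np

# Toy embeddings for head v, tail u, relation r and its inverse; each entity
# has a "head" vector and a "tail" vector as in CP/SimplE.
z_v_head, z_v_tail = np.array([1.0, 2.0]), np.array([1.0, 4.0])
z_u_head, z_u_tail = np.array([2.0, 1.0]), np.array([3.0, 1.0])
z_r, z_r_inv = np.array([1.0, 1.0]), np.array([1.0, 0.0])

def cp_score(head, rel, tail):
    # The multi-linear product: sum of the element-wise products.
    return float(np.sum(head * rel * tail))

# CP scores (v, r, u) with v's head vector and u's tail vector.
cp = cp_score(z_v_head, z_r, z_u_tail)                        # 1*1*3 + 2*1*1 = 5.0

# SimplE averages the CP scores of (v, r, u) and (u, r^{-1}, v).
simple = 0.5 * (cp + cp_score(z_u_head, z_r_inv, z_v_tail))   # 0.5*(5 + 2) = 3.5
print(cp, simple)
```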

TTransE (temporal) Jiang et al. (2016): An extension of TransE by adding one more embedding function mapping timestamps to hidden representations: TEMB(t) = (z_t) for every t ∈ T where z_t ∈ R^d. In TTransE, φ(v, r, u, t) = −||z_v + z_r + z_t − z_u||.

HyTE (temporal) Dasgupta et al. (2018): Same EEMB, REMB and TEMB as TTransE but defining φ(v, r, u, t) = −||proj_t(z_v) + proj_t(z_r) − proj_t(z_u)||, where proj_t(z) = z − (z_t⊤ z) z_t for ||z_t|| = 1. Intuitively, HyTE first projects the head, relation, and tail embeddings onto the hyperplane of the timestamp and then applies the TransE score function to the projected embeddings.
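The hyperplane projection at the heart of HyTE can be sketched as follows (the embeddings and the unit normal are hypothetical toy values):

```python
import numpy as np

def project(z, w_t):
    # Remove the component of z along the (unit) normal w_t of the
    # timestamp hyperplane: proj(z) = z - (w_t . z) w_t.
    return z - np.dot(w_t, z) * w_t

w_t = np.array([1.0, 0.0, 0.0])  # unit normal for one timestamp (hypothetical)
z_v = np.array([2.0, 3.0, 4.0])
z_r = np.array([1.0, 1.0, 0.0])
z_u = np.array([0.0, 4.0, 4.0])

# HyTE applies the TransE distance in the projected space.
score = -np.linalg.norm(project(z_v, w_t) + project(z_r, w_t) - project(z_u, w_t))
print(score)  # the projected sum cancels here, so the score is (-)0.0
```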

ConT (temporal) Ma et al. (2018): Ma et al. (2018) extend several static KG embedding models to TKGC. Their best performing model, ConT, is an extension of Tucker defining TEMB(t) = (z_t) for every t ∈ T and changing the score function to φ(v, r, u, t) = ⟨z_t, z_v ⊗ z_r ⊗ z_u⟩. Intuitively, ConT replaces the shared vector w in Tucker with timestamp embeddings z_t.

TA-DistMult (temporal) García-Durán et al. (2018): An extension of DistMult where each character c in the timestamps is mapped to a vector z_c. Then, for a tuple (v, r, u, t), a temporal relation r_t is created by considering r and the characters in t as a sequence, and an embedding z_{r_t} is computed for this temporal relation by feeding the embedding vectors of the elements of the sequence to an LSTM and taking its final output. Finally, the score function of DistMult is employed: φ(v, r, u, t) = ⟨z_v, z_{r_t}, z_u⟩ (TransE was employed as well but DistMult performed better).

4 Diachronic Embedding

According to Definition 1, an entity embedding function takes an entity as input and provides a hidden representation as output. We propose an alternative entity embedding function which, besides entity, takes time as input as well. Inspired by diachronic word embeddings, we call such an embedding function a diachronic entity embedding. Below is a formal definition of a diachronic entity embedding.

Definition 2.

A diachronic entity embedding, DEEMB, is a function which maps every pair (v, t), where v ∈ V and t ∈ T, to a hidden representation in Ψ, where Ψ is the class of non-empty tuples of vectors and/or matrices.

One may take their favorite static KG embedding score function and make it temporal by replacing the entity embeddings with diachronic entity embeddings. The choice of the DEEMB function may differ across temporal KGs depending on their properties. Here, we propose a DEEMB function which performs well on our benchmarks. We give the definition for models where the output of the EEMB function is a tuple of vectors, but it can be generalized to other cases as well. Let z_v^t be a vector in DEEMB(v, t). We define z_v^t as follows:

    z_v^t[n] = a_v[n] σ(w_v[n] t + b_v[n])   if 1 ≤ n ≤ γd,      (1)
    z_v^t[n] = a_v[n]                         if γd < n ≤ d,

where a_v ∈ R^d and w_v, b_v ∈ R^{γd} are (entity-specific) vectors with learnable parameters and σ is an activation function. Intuitively, entities may have some features that change over time and some features that remain fixed. The first γd elements of the vector in Equation (1) capture temporal features and the remaining (1 − γ)d elements capture static features. γ ∈ [0, 1] is a hyper-parameter controlling the percentage of temporal features. While in Equation (1) static features can potentially be obtained from the temporal ones if the optimizer sets some elements of w_v to zero, explicitly modeling static features helps reduce the number of learnable parameters and avoid overfitting to temporal signals (see Section 5.2).

Intuitively, by learning the w_v's and b_v's, the model learns how to turn entity features on and off at different points in time so that accurate temporal predictions can be made at any time. The a_v's control the importance of the features. We mainly use sine as the activation function for Equation (1) because a single sine function can model several on and off states. Our experiments explore other activation functions as well and provide more intuition.
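A minimal sketch of the diachronic embedding of Equation (1), assuming the embedding is a single vector and using hypothetical parameter values:

```python
import numpy as np

def diachronic_embedding(a, w, b, t, gamma_d, activation=np.sin):
    """Sketch of Equation (1): the first gamma_d entries are temporal
    features a[n]*sigma(w[n]*t + b[n]); the rest are static features a[n].
    a has length d; w and b have length gamma_d."""
    z = a.copy()
    z[:gamma_d] = a[:gamma_d] * activation(w * t + b)
    return z

# Hypothetical parameters for one entity (d = 4, gamma_d = 2 temporal features).
a = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, 1.0])
b = np.array([0.0, np.pi / 2])

# At t = 0 the two temporal features are a[0]*sin(0)=0 and a[1]*sin(pi/2)=2,
# while the two static features stay at 3 and 4.
print(diachronic_embedding(a, w, b, t=0.0, gamma_d=2))  # [0. 2. 3. 4.]
```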

Model-Agnosticism: The proposals in existing temporal KG embedding models can only extend one (or a few) static models to temporal KGs. As an example, it is not trivial to extend RESCAL to temporal KGs using the proposal in García-Durán et al. (2018) (except for the naive approach of expecting the LSTM to output large matrices) or those in Jiang et al. (2016); Dasgupta et al. (2018). The same goes for models other than RESCAL whose relation embeddings contain matrices (see, e.g., Nguyen et al. (2016); Socher et al. (2013); Lin et al. (2015)). Using our proposal, one may construct temporal versions of TransE, DistMult, SimplE, Tucker, RESCAL, or other models by replacing their EEMB function with the DEEMB function in Equation (1). We refer to the resulting models as DE-TransE, DE-DistMult, DE-SimplE, and so forth, where DE is short for Diachronic Embedding.

Learning: The facts in a KG are split into train, validation, and test sets. Model parameters are learned using stochastic gradient descent with mini-batches. Let B be a mini-batch. For each fact f = (v, r, u, t) in B, we generate two queries: 1- (v, r, ?, t) and 2- (?, r, u, t). For the first query, we generate a candidate answer set which contains u and a number (hereafter referred to as the negative ratio) of other entities selected randomly from V. For the second query, we generate a similar candidate answer set. Then we minimize the cross entropy loss, which has been used and has shown good results for both static and temporal KG completion (see, e.g., Kadlec and Kleindienst (2017); García-Durán et al. (2018)).
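The negative-sampling objective described above can be sketched as follows; the candidate scores are hypothetical, with the true entity's score placed at index 0 of each row by convention:

```python
import numpy as np

def cross_entropy_loss(scores):
    """Sketch of the training objective: each row of `scores` holds the model
    score of the true entity (column 0) followed by the scores of the randomly
    corrupted candidates; we minimize -log softmax of column 0."""
    # Log-sum-exp with the row maximum subtracted for numerical stability.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[:, 0].mean())

# Two toy queries with negative ratio 2 (hypothetical scores).
scores = np.array([[5.0, 1.0, 0.0],
                   [4.0, 4.0, 4.0]])
print(round(cross_entropy_loss(scores), 4))  # ≈ 0.5617
```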

4.1 Expressivity

Expressivity is an important property and has been the subject of study in several recent works on static (knowledge) graphs Buchman and Poole (2016); Trouillon et al. (2017); Kazemi and Poole (2018c); Xu et al. (2019); Balažević et al. (2019); Fatemi et al. (2019). If a model is not expressive enough, it is doomed to underfitting for some applications. A desired property of a model is full expressiveness:

Definition 3.

A model with parameters θ is fully expressive if, given any world with true tuples W and false tuples W̄, there exists an assignment for θ that correctly classifies the tuples in W and W̄.

For static KG completion, several models have been proved to be fully expressive. For TKGC, however, a proof of full expressiveness does not yet exist for any of the proposed models. The following theorem establishes the full expressiveness of DE-SimplE. The proof can be found in Appendix A.

Theorem 1 (Expressivity).

DE-SimplE is fully expressive for temporal knowledge graph completion.

4.2 Domain Knowledge

For several static KG embedding models, it has been shown how certain types of domain knowledge (if it exists) can be incorporated into the embeddings through parameter sharing (aka tying) and how doing so helps improve model performance (see, e.g., Kazemi and Poole (2018c); Sun et al. (2019); Minervini et al. (2017); Fatemi et al. (2019)). These incorporation techniques can be ported to the temporal versions of these static models when they are extended to temporal KGs through our diachronic embeddings. As a proof of concept, we show how incorporating domain knowledge into SimplE can be ported to DE-SimplE. We chose SimplE for our proof of concept because several types of domain knowledge can be incorporated into it.

Consider φ(v, r, u, t) = (⟨z_v^t, z_r, z̄_u^t⟩ + ⟨z_u^t, z_{r⁻¹}, z̄_v^t⟩) / 2 (according to SimplE). If r is known to be symmetric or anti-symmetric, this knowledge can be incorporated into the embeddings by tying z_{r⁻¹} to z_r or to the negation of z_r respectively Kazemi and Poole (2018c). If ri is known to be the inverse of rj, this knowledge can be incorporated into the embeddings by tying z_{ri⁻¹} to z_{rj} and z_{rj⁻¹} to z_{ri} Kazemi and Poole (2018c).
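A quick numerical check of the symmetry and anti-symmetry tying (with randomly drawn toy embeddings standing in for learned ones):

```python
import numpy as np

def simple_score(head_ent, tail_ent, z_r, z_r_inv):
    # head_ent/tail_ent are (head-vector, tail-vector) pairs of an entity;
    # the score averages the two CP terms as in SimplE.
    return 0.5 * (float(np.sum(head_ent[0] * z_r * tail_ent[1]))
                  + float(np.sum(tail_ent[0] * z_r_inv * head_ent[1])))

rng = np.random.default_rng(0)
v = (rng.normal(size=3), rng.normal(size=3))
u = (rng.normal(size=3), rng.normal(size=3))
z_r = rng.normal(size=3)

# Tying the inverse-relation embedding to z_r makes the score symmetric:
# swapping head and tail leaves the averaged score unchanged.
s_vu = simple_score(v, u, z_r, z_r_inv=z_r)
s_uv = simple_score(u, v, z_r, z_r_inv=z_r)

# Tying it to -z_r makes the score anti-symmetric: swapping flips the sign.
s_vu_anti = simple_score(v, u, z_r, z_r_inv=-z_r)
s_uv_anti = simple_score(u, v, z_r, z_r_inv=-z_r)

print(np.isclose(s_vu, s_uv), np.isclose(s_vu_anti, -s_uv_anti))  # True True
```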

Proposition 1.

Symmetry, anti-symmetry, and inversion can be incorporated into DE-SimplE in the same way as SimplE.

If ri is known to entail rj, Fatemi et al. (2019) prove that if entity embeddings are constrained to be non-negative, then this knowledge can be incorporated by tying z_{rj} to z_{ri} plus a vector with non-negative elements (and similarly for the inverse-relation embeddings). We give a similar result for DE-SimplE.

Proposition 2.

By constraining the a_v's in Equation (1) to be non-negative for all v ∈ V and σ to be an activation function with a non-negative range (such as ReLU, sigmoid, or squared exponential), entailment can be incorporated into DE-SimplE in the same way as in SimplE.

Compared to the result in Fatemi et al. (2019), the only added constraint for DE-SimplE is that the activation function in Equation (1) is also constrained to have a non-negative range. Proofs for Propositions 1 and 2 can be found in Appendix A.

Dataset     |V|     |R|  |T|    #Train     #Validation  #Test    #Total
ICEWS14     7,128   230  365    72,826     8,941        8,963    90,730
ICEWS05-15  10,488  251  4,017  386,962    46,275       46,092   479,329
GDELT       500     20   366    2,735,685  341,961      341,961  3,419,607
Table 1: Statistics on ICEWS14, ICEWS05-15, and GDELT.

5 Experiments & Results

Datasets: Our datasets are subsets of two temporal KGs that have become standard benchmarks for TKGC: ICEWS Boschee et al. (2015) and GDELT Leetaru and Schrodt (2013). For ICEWS, we use the two subsets generated by García-Durán et al. (2018): 1- ICEWS14, corresponding to the facts in 2014, and 2- ICEWS05-15, corresponding to the facts between 2005 and 2015. For GDELT, we use the subset extracted by Trivedi et al. (2017) corresponding to the facts from April 1, 2015 to March 31, 2016. We changed the train/validation/test sets following a procedure similar to that of Bordes et al. (2013) to make the problem a TKGC problem rather than an extrapolation problem. Table 1 provides a summary of the dataset statistics.

Baselines: Our baselines include both static and temporal KG embedding models. From the static KG embedding models, we use TransE, DistMult, and SimplE, where the timing information is ignored. From the temporal KG embedding models, we use the ones introduced in Section 3.

Metrics: For each fact f = (v, r, u, t) in the test set, we create two queries: 1- (v, r, ?, t) and 2- (?, r, u, t). For the first query, the model ranks u against all entities u′ such that (v, r, u′, t) does not appear in the train, validation, or test set. This corresponds to the filtered setting commonly used in the literature Bordes et al. (2013). We follow a similar approach for the second query. Let k_{f,u} and k_{f,v} represent the rankings of u and v for the two queries respectively. We report mean reciprocal rank (MRR), defined as the average of the reciprocals of these rankings over all queries. Compared to its counterpart mean rank, which is largely influenced by a single bad prediction, MRR is more stable Nickel et al. (2016a). We also report Hit@1, Hit@3 and Hit@10 measures, where Hit@k is the fraction of queries for which the correct entity is ranked among the top k.
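Given the filtered ranks, the reported metrics can be computed as in the following sketch (the ranks are hypothetical):

```python
def mrr_and_hits(ranks, ks=(1, 3, 10)):
    """ranks: filtered ranks (1 = best) over all head and tail queries."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = {k: sum(r <= k for r in ranks) / len(ranks) for k in ks}
    return mrr, hits

# Hypothetical ranks for six queries.
ranks = [1, 2, 4, 1, 20, 3]
mrr, hits = mrr_and_hits(ranks)
print(round(mrr, 3), hits[1], hits[10])
```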

Implementation (code and datasets are available at https://github.com/BorealisAI/DE-SimplE): We implemented our model and the baselines in PyTorch Paszke et al. (2017). We ran our experiments on a node with four GPUs. For the two ICEWS datasets, we report the results for some of the baselines from García-Durán et al. (2018). For the other experiments on these datasets, for fairness of comparison, we follow a similar experimental setup as in García-Durán et al. (2018), using the ADAM optimizer Kingma and Ba (2014) with the same learning rate, batch size, negative ratio, and embedding size, validating periodically and selecting the model giving the best validation MRR. Following the best results obtained in Ma et al. (2018) (and considering memory restrictions), for ConT we use a smaller embedding size and batch size. We validated the dropout value and tuned γ for our model on the validation set. For GDELT, we used a similar setting but with a smaller negative ratio due to the large size of the dataset. Unless stated otherwise, we use sine as the activation function for Equation (1). Since the timestamps in our datasets are dates rather than single numbers, we apply the temporal part of Equation (1) to year, month, and day separately (with different parameters), thus obtaining three temporal vectors. Then we take an element-wise sum of the resulting vectors, obtaining a single temporal vector. Intuitively, this can be viewed as converting a date into a timestamp in the embedded space.

5.1 Comparative Study

We compare the baselines with three variants of our model: 1- DE-TransE, 2- DE-DistMult, and 3- DE-SimplE. The obtained results in Table 2 indicate that the large number of parameters per timestamp makes ConT perform poorly on ICEWS14 and ICEWS05-15. On GDELT, it shows a somewhat better performance as GDELT has many training facts in each timestamp. Besides affecting the predictive performance, the large number of parameters makes training ConT extremely slow. According to the results, the temporal versions of different models outperform the static counterparts in most cases, thus providing evidence for the merit of capturing temporal information.

DE-TransE outperforms the other TransE-based baselines (TTransE and HyTE) on ICEWS14 and GDELT and gives on-par results with HyTE on ICEWS05-15. This result shows the superiority of our diachronic embeddings compared to TTransE and HyTE. DE-DistMult outperforms TA-DistMult, the only DistMult-based baseline, showing the superiority of our diachronic embedding compared to TA-DistMult. Moreover, DE-DistMult outperforms all TransE-based baselines. Finally, just as SimplE beats TransE and DistMult due to its higher expressivity, our results show that DE-SimplE beats DE-TransE, DE-DistMult, and the other baselines due to its higher expressivity.

Previously, each of the existing models was tested on different subsets of ICEWS and GDELT and a comprehensive comparison of them did not exist. As a side contribution, Table 2 provides a comparison of these approaches on the same benchmarks and under the same experimental setting. The results reported in Table 2 may be directly used for comparison in future works.

                   ICEWS14                     ICEWS05-15                  GDELT
Model              MRR  Hit@1 Hit@3 Hit@10    MRR  Hit@1 Hit@3 Hit@10    MRR  Hit@1 Hit@3 Hit@10
TransE 0.280 9.4 - 63.7 0.294 9.0 - 66.3 0.113 0.0 15.8 31.2
DistMult 0.439 32.3 - 67.2 0.456 33.7 - 69.1 0.196 11.7 20.8 34.8
SimplE 0.458 34.1 51.6 68.7 0.478 35.9 53.9 70.8 0.206 12.4 22.0 36.6
ConT 0.185 11.7 20.5 31.5 0.163 10.5 18.9 27.2 0.144 8.0 15.6 26.5
TTransE 0.255 7.4 - 60.1 0.271 8.4 - 61.6 0.115 0.0 16.0 31.8
HyTE 0.297 10.8 41.6 65.5 0.316 11.6 44.5 68.1 0.118 0.0 16.5 32.6
TA-DistMult 0.477 36.3 - 68.6 0.474 34.6 - 72.8 0.206 12.4 21.9 36.5
DE-TransE 0.326 12.4 46.7 68.6 0.314 10.8 45.3 68.5 0.126 0.0 18.1 35.0
DE-DistMult 0.501 39.2 56.9 70.8 0.484 36.6 54.6 71.8 0.213 13.0 22.8 37.6
DE-SimplE 0.526 41.8 59.2 72.5 0.513 39.2 57.8 74.8 0.230 14.1 24.8 40.3
Table 2: Results on ICEWS14, ICEWS05-15, and GDELT. Best results are in bold.

5.2 Model Variants & Ablation Study

We run experiments on ICEWS14 with several variants of the proposed models to provide a better understanding of them. The results can be found in Table 3 and Figure 1. Table 3 also includes the unmodified DE-TransE and DE-DistMult so that the variants can be easily compared against them.

Activation Function: So far, we used sine as the activation function in Equation (1). The performance for other activation functions, including Tanh, sigmoid, Leaky ReLU, and squared exponential, is presented in Table 3. From the table, it can be seen that other activation functions also perform well. Specifically, squared exponential performs almost on-par with sine. We believe one reason why sine and squared exponential give better performance is that a combination of sine or squared exponential features can generate more sophisticated features than a combination of Tanh, sigmoid, or ReLU features. While a temporal feature with Tanh or sigmoid as the activation corresponds to a smooth off-on (or on-off) temporal switch, a temporal feature with sine or squared exponential activation corresponds to two (or more) switches (e.g., off-on-off), which can potentially model relations that start at some time and end after a while. These results also provide evidence for the effectiveness of diachronic embedding across several activation functions.

Adding Diachronic Embedding for Relations: Compared to entities, we hypothesize that relations may evolve at a much lower rate or, for some relations, only negligibly. Therefore, modeling relations with a static (rather than a diachronic) representation may suffice. To test this hypothesis, we ran DE-TransE and DE-DistMult on ICEWS14 where relation embeddings are also a function of time. From the obtained results in Table 3, one can see that the model with diachronic embeddings for both entities and relations performs on-par with the model with diachronic embeddings only for entities. We conducted the same experiment on ICEWS05-15 (which has a longer time horizon) and GDELT and observed similar results. These results show that, at least on our benchmarks, modeling the evolution of relations may not be helpful. Future work can test this hypothesis on datasets with other types of relations and longer horizons.

Generalizing to Unseen Timestamps: To measure how well our models generalize to timestamps not observed in the train set, we created a variant of the ICEWS14 dataset by including in the train set every fact except those on three fixed days of each month. We split the excluded facts randomly into validation and test sets (removing the ones including entities not observed in the train set). This ensures that none of the timestamps in the validation or test sets has been observed by the model in the train set. Then we ran DistMult and DE-DistMult on the resulting dataset. The obtained results in Table 3 indicate that DE-DistMult gains more than 0.04 MRR improvement over DistMult (0.452 vs. 0.410), thus showing the effectiveness of our diachronic embedding in generalizing to unseen timestamps.

Importance of Model Parameters Used in Equation (1): In Equation (1), the temporal part of the embedding contains three components: a_v, w_v, and b_v. To measure the importance of each component, we ran DE-DistMult on ICEWS14 under three settings, in each of which one of the three components is removed (fixed to zero, or to one in the case of a_v). From the obtained results presented in Table 3, it can be seen that all three components contribute to the temporal features, though not equally. Therefore, if one needs to reduce the number of parameters, removing the component whose ablation hurts the results the least may be a good option, as long as a slight reduction in accuracy can be tolerated.

Model Variation MRR Hit@1 Hit@3 Hit@10
DE-TransE No variation (Activation function: Sine) 0.326 12.4 46.7 68.6
DE-DistMult No variation (Activation function: Sine) 0.501 39.2 56.9 70.8
DE-DistMult Activation function: Tanh 0.486 37.5 54.7 70.1
DE-DistMult Activation function: Sigmoid 0.484 37.0 54.6 70.6
DE-DistMult Activation function: Leaky ReLU 0.478 36.3 54.2 70.1
DE-DistMult Activation function: Squared Exponential 0.501 39.0 56.8 70.9
DE-TransE Diachronic embedding for both entities and relations 0.324 12.7 46.1 68.0
DE-DistMult Diachronic embedding for both entities and relations 0.502 39.4 56.6 70.4
DistMult Generalizing to unseen timestamps 0.410 30.2 46.2 62.0
DE-DistMult Generalizing to unseen timestamps 0.452 34.5 51.3 65.4
DE-DistMult Equation (1) ablation (one component removed) 0.458 34.4 51.8 68.3
DE-DistMult Equation (1) ablation (one component removed) 0.470 36.4 53.1 67.1
DE-DistMult Equation (1) ablation (one component removed) 0.498 38.9 56.2 70.4
Table 3: Results for different variations of our model on ICEWS14.

Static Features: Figure 1(a) shows the test MRR of DE-SimplE on ICEWS14 as a function of γ, the percentage of temporal features. According to Figure 1(a), as soon as some features become temporal (i.e. γ changes from zero to a non-zero value), a substantial boost in performance can be observed. This observation sheds more light on the importance of learning temporal features and having diachronic embeddings. As γ becomes larger, MRR reaches a peak and then slightly drops. This slight drop in performance can be due to overfitting to temporal cues. This result demonstrates that modeling static features explicitly can help reduce the number of learnable parameters and avoid overfitting. Such a design choice may be even more important when the embedding dimensions are larger. However, it comes at the cost of adding one hyper-parameter to the model. If one prefers a slightly less accurate model with fewer hyper-parameters, they can make all vector elements temporal.

Training Curve: Figure 1(b) shows the training curve for DistMult and DE-DistMult on ICEWS14. While it has been argued that using sine activation functions may complicate training in some neural network architectures (see, e.g., Lapedes and Farber (1987); Parascandolo et al. (2017)), it can be seen that when using sine activations, the training curve for our model is quite stable.

6 Related Work

StaRAI: Statistical relational AI (StaRAI) Raedt et al. (2016); Koller et al. (2007) approaches are mainly based on soft (hand-crafted or learned) rules Richardson and Domingos (2006); De Raedt et al. (2007); Kimmig et al. (2012); Kazemi et al. (2014), where the probability of a world is typically proportional to the number of rules that are satisfied/violated in that world and the confidence of each rule. A line of work in this area combines a stack of soft rules with embeddings for property prediction Sourek et al. (2015); Kazemi and Poole (2018b). Another line of work extends the soft rules to temporal KGs Sadilek and Kautz (2010); Papai et al. (2012); Dylla et al. (2013); Huber et al. (2014); Chekol et al. (2017); Chekol and Stuckenschmidt (2018). The approaches based on soft rules have generally been shown to perform subpar to KG embedding models Nickel et al. (2016a).

Graph Walk: These approaches define weighted template walks on a KG and then answer queries by template matching Lao and Cohen (2010); Lao et al. (2011). They have been shown to be quite similar to, and in some cases subsumed by, the models based on soft rules Kazemi and Poole (2018a).

Static KG Embedding: A large number of models have been developed for static KG embedding. One class of these models comprises the translational approaches, corresponding to variations of TransE (see, e.g., Lin et al. (2015); Wang et al. (2014); Nguyen et al. (2016)). Another class of approaches is based on a bilinear score function, with each model imposing a different sparsity constraint on the relation matrices (see, e.g., Nickel et al. (2011); Trouillon et al. (2016); Nickel et al. (2016b); Kazemi and Poole (2018c); Liu et al. (2017)). A third class of models is based on deep learning approaches using feed-forward or convolutional layers on top of the embeddings (see, e.g., Socher et al. (2013); Dong et al. (2014); Dettmers et al. (2018); Balazevic et al. (2018)). These models can potentially be extended to TKGC through our diachronic embedding.

Temporal KG Embedding: Several works have extended static KG embedding models to temporal KGs. Jiang et al. (2016) extend TransE by adding a timestamp embedding into the score function. Dasgupta et al. (2018) extend TransE by projecting the embeddings onto the timestamp hyperplane and then using the TransE score in the projected space. Ma et al. (2018) extend several models by adding a timestamp embedding to their score functions. These models may not work well when the number of timestamps is large. Furthermore, since they only learn embeddings for observed timestamps, they cannot generalize to unseen timestamps. García-Durán et al. (2018) extend TransE and DistMult by combining the relation and the timestamp through a character LSTM. These models have been described in detail in Section 3 and their performance has been reported in Table 2.

KG Embedding for Extrapolation: TKGC is an interpolation problem: given a set of temporal facts in a time frame, the goal is to predict the missing facts within that time frame. A related problem is the extrapolation problem, where future interactions are to be predicted (see, e.g., Trivedi et al. (2017); Kumar et al. (2018); Trivedi et al. (2019)). Despite some similarities in the employed approaches, KG extrapolation is fundamentally different from TKGC in that a score for an interaction at time t is to be computed given only the past (i.e. facts before t), whereas in TKGC the score is to be computed given past, present, and future facts. A comprehensive analysis of the existing models for interpolation and extrapolation can be found in Kazemi et al. (2019).

Diachronic Word Embeddings: The idea behind our proposed embeddings is similar to diachronic word embeddings, where a corpus is typically broken temporally into slices (e.g., 20-year chunks of a 200-year corpus) and embeddings are learned for the words in each slice, thus providing word embeddings that are a function of time (see, e.g., Kim et al. (2014); Kulkarni et al. (2015); Hamilton et al. (2016); Bamler and Mandt (2017)). The main goal of diachronic word embeddings is to reveal how the meanings of words have evolved over time. Our work can be viewed as an extension of diachronic word embeddings to continuous-time KG completion.

Figure 1: (a) Test MRR of DE-SimplE on ICEWS14 as a function of . (b) The training curve for DistMult and DE-DistMult.

7 Conclusion

Temporal knowledge graph (KG) completion is an important problem and has been the focus of several recent studies. We developed a diachronic embedding function for temporal KG completion which provides a hidden representation for the entities of a temporal KG at any point in time. Our embedding is generic and can be combined with any score function. We proved that combining our diachronic embedding with SimplE results in a fully expressive model – the first temporal KG embedding model for which such a result exists. We showed the superior performance of our model compared to existing work on several benchmarks. Future work includes designing functions other than the one proposed in Equation 1, a comprehensive study of which functions are favored by different types of KGs, and using our proposed embedding for diachronic word embedding.


  • Balazevic et al. (2018) Ivana Balazevic, Carl Allen, and Timothy M Hospedales. Hypernetwork knowledge graph embeddings. arXiv preprint arXiv:1808.07018, 2018.
  • Balažević et al. (2019) Ivana Balažević, Carl Allen, and Timothy M Hospedales. TuckER: Tensor factorization for knowledge graph completion. arXiv preprint arXiv:1901.09590, 2019.
  • Bamler and Mandt (2017) Robert Bamler and Stephan Mandt. Dynamic word embeddings. In ICML, pages 380–389, 2017.
  • Bordes et al. (2013) Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. Translating embeddings for modeling multi-relational data. In NeurIPS, pages 2787–2795, 2013.
  • Boschee et al. (2015) Elizabeth Boschee, Jennifer Lautenschlager, Sean O’Brien, Steve Shellman, James Starz, and Michael Ward. ICEWS coded event data. Harvard Dataverse, 12, 2015.
  • Buchman and Poole (2016) David Buchman and David Poole. Negation without negation in probabilistic logic programming. In KR, 2016.
  • Carslaw (1921) Horatio Scott Carslaw. Introduction to the Theory of Fourier’s Series and Integrals. Macmillan, 1921.
  • Chekol and Stuckenschmidt (2018) Melisachew Wudage Chekol and Heiner Stuckenschmidt. Rule based temporal inference. In ICLP. Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2018.
  • Chekol et al. (2017) Melisachew Wudage Chekol, Giuseppe Pirrò, Joerg Schoenfisch, and Heiner Stuckenschmidt. Marrying uncertainty and time in knowledge graphs. In AAAI, 2017.
  • Dasgupta et al. (2018) Shib Sankar Dasgupta, Swayambhu Nath Ray, and Partha Talukdar. HyTE: Hyperplane-based temporally aware knowledge graph embedding. In EMNLP, pages 2001–2011, 2018.
  • De Raedt et al. (2007) Luc De Raedt, Angelika Kimmig, and Hannu Toivonen. ProbLog: A probabilistic Prolog and its application in link discovery. In IJCAI, volume 7, pages 2462–2467. Hyderabad, 2007.
  • Dettmers et al. (2018) Tim Dettmers, Pasquale Minervini, Pontus Stenetorp, and Sebastian Riedel. Convolutional 2d knowledge graph embeddings. In AAAI, 2018.
  • Dong et al. (2014) Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, and Wei Zhang. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In ACM SIGKDD, pages 601–610. ACM, 2014.
  • Dylla et al. (2013) Maximilian Dylla, Iris Miliaraki, and Martin Theobald. A temporal-probabilistic database model for information extraction. Proceedings of the VLDB Endowment, 6(14):1810–1821, 2013.
  • Fatemi et al. (2019) Bahare Fatemi, Siamak Ravanbakhsh, and David Poole. Improved knowledge graph embedding using background taxonomic information. In AAAI, 2019.
  • García-Durán et al. (2018) Alberto García-Durán, Sebastijan Dumančić, and Mathias Niepert. Learning sequence encoders for temporal knowledge graph completion. arXiv preprint arXiv:1809.03202, 2018.
  • Parascandolo et al. (2017) Giambattista Parascandolo, Heikki Huttunen, and Tuomas Virtanen. Taming the waves: Sine as activation function in deep neural networks. 2017.
  • Hamilton et al. (2016) William L Hamilton, Jure Leskovec, and Dan Jurafsky. Diachronic word embeddings reveal statistical laws of semantic change. arXiv preprint arXiv:1605.09096, 2016.
  • Hitchcock (1927) Frank L Hitchcock. The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6(1-4):164–189, 1927.
  • Huber et al. (2014) Jakob Huber, Christian Meilicke, and Heiner Stuckenschmidt. Applying Markov logic for debugging probabilistic temporal knowledge bases. In AKBC, 2014.
  • Jiang et al. (2016) Tingsong Jiang, Tianyu Liu, Tao Ge, Lei Sha, Baobao Chang, Sujian Li, and Zhifang Sui. Towards time-aware knowledge graph completion. In COLING, pages 1715–1724, 2016.
  • Kadlec et al. (2017) Rudolf Kadlec, Ondrej Bajgar, and Jan Kleindienst. Knowledge base completion: Baselines strike back. arXiv preprint arXiv:1705.10744, 2017.
  • Kazemi and Poole (2018a) Seyed Mehran Kazemi and David Poole. Bridging weighted rules and graph random walks for statistical relational models. Frontiers in Robotics and AI, 5:8, 2018.
  • Kazemi and Poole (2018b) Seyed Mehran Kazemi and David Poole. RelNN: A deep neural model for relational learning. In AAAI, 2018.
  • Kazemi and Poole (2018c) Seyed Mehran Kazemi and David Poole. SimplE embedding for link prediction in knowledge graphs. In NeurIPS, pages 4289–4300, 2018.
  • Kazemi et al. (2014) Seyed Mehran Kazemi, David Buchman, Kristian Kersting, Sriraam Natarajan, and David Poole. Relational logistic regression. In KR, 2014.
  • Kazemi et al. (2019) Seyed Mehran Kazemi, Rishab Goel, Kshitij Jain, Ivan Kobyzev, Akshay Sethi, Peter Forsyth, and Pascal Poupart. Relational representation learning for dynamic (knowledge) graphs: A survey. arXiv preprint arXiv:1905.11485, 2019.
  • Kim et al. (2014) Yoon Kim, Yi-I Chiu, Kentaro Hanaki, Darshan Hegde, and Slav Petrov. Temporal analysis of language through neural language models. arXiv preprint arXiv:1405.3515, 2014.
  • Kimmig et al. (2012) Angelika Kimmig, Stephen H Bach, Matthias Broecheler, Bert Huang, and Lise Getoor. A short introduction to probabilistic soft logic. In NIPS Workshop on probabilistic programming: Foundations and applications, volume 1, page 3, 2012.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Koller et al. (2007) Daphne Koller, Nir Friedman, Sašo Džeroski, Charles Sutton, Andrew McCallum, Avi Pfeffer, Pieter Abbeel, Ming-Fai Wong, David Heckerman, Chris Meek, et al. Introduction to statistical relational learning. MIT press, 2007.
  • Kulkarni et al. (2015) Vivek Kulkarni, Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. Statistically significant detection of linguistic change. In WWW, pages 625–635, 2015.
  • Kumar et al. (2018) Srijan Kumar, Xikun Zhang, and Jure Leskovec. Learning dynamic embedding from temporal interaction networks. arXiv preprint arXiv:1812.02289, 2018.
  • Lacroix et al. (2018) Timothée Lacroix, Nicolas Usunier, and Guillaume Obozinski. Canonical tensor decomposition for knowledge base completion. In ICML, 2018.
  • Lao and Cohen (2010) Ni Lao and William W Cohen. Relational retrieval using a combination of path-constrained random walks. Machine learning, 81(1):53–67, 2010.
  • Lao et al. (2011) Ni Lao, Tom Mitchell, and William W Cohen. Random walk inference and learning in a large scale knowledge base. In EMNLP, pages 529–539, 2011.
  • Lapedes and Farber (1987) Alan Lapedes and Robert Farber. Nonlinear signal processing using neural networks: Prediction and system modelling. Technical report, 1987.
  • Leetaru and Schrodt (2013) Kalev Leetaru and Philip A Schrodt. GDELT: Global data on events, location, and tone, 1979–2012. In ISA annual convention, volume 2, pages 1–49. Citeseer, 2013.
  • Lin et al. (2015) Yankai Lin, Zhiyuan Liu, Maosong Sun, Yang Liu, and Xuan Zhu. Learning entity and relation embeddings for knowledge graph completion. In AAAI, pages 2181–2187, 2015.
  • Liu et al. (2017) Hanxiao Liu, Yuexin Wu, and Yiming Yang. Analogical inference for multi-relational embeddings. In ICML, pages 2168–2178, 2017.
  • Ma et al. (2018) Yunpu Ma, Volker Tresp, and Erik A Daxberger. Embedding models for episodic knowledge graphs. Journal of Web Semantics, 2018.
  • Minervini et al. (2017) Pasquale Minervini, Luca Costabello, Emir Muñoz, Vít Nováček, and Pierre-Yves Vandenbussche. Regularizing knowledge graph embeddings via equivalence and inversion axioms. In ECML PKDD, pages 668–683. Springer, 2017.
  • Nguyen et al. (2016) Dat Quoc Nguyen, Kairit Sirts, Lizhen Qu, and Mark Johnson. Stranse: a novel embedding model of entities and relationships in knowledge bases. In NAACL-HLT, 2016.
  • Nguyen (2017) Dat Quoc Nguyen. An overview of embedding models of entities and relationships for knowledge base completion. arXiv preprint arXiv:1703.08098, 2017.
  • Nickel et al. (2011) Maximilian Nickel, Volker Tresp, and Hans-Peter Kriegel. A three-way model for collective learning on multi-relational data. In ICML, volume 11, pages 809–816, 2011.
  • Nickel et al. (2016a) Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. A review of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1):11–33, 2016.
  • Nickel et al. (2016b) Maximilian Nickel, Lorenzo Rosasco, and Tomaso Poggio. Holographic embeddings of knowledge graphs. In AAAI, 2016.
  • Papai et al. (2012) Tivadar Papai, Henry Kautz, and Daniel Stefankovic. Slice normalized dynamic markov logic networks. In NeurIPS, pages 1907–1915, 2012.
  • Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.
  • Raedt et al. (2016) Luc De Raedt, Kristian Kersting, Sriraam Natarajan, and David Poole. Statistical relational artificial intelligence: Logic, probability, and computation. Synthesis Lectures on Artificial Intelligence and Machine Learning, 10(2):1–189, 2016.
  • Richardson and Domingos (2006) Matthew Richardson and Pedro Domingos. Markov logic networks. Machine learning, 62(1-2):107–136, 2006.
  • Sadilek and Kautz (2010) Adam Sadilek and Henry Kautz. Recognizing multi-agent activities from gps data. In AAAI, 2010.
  • Socher et al. (2013) Richard Socher, Danqi Chen, Christopher D Manning, and Andrew Ng. Reasoning with neural tensor networks for knowledge base completion. In AAAI, pages 926–934, 2013.
  • Sourek et al. (2015) Gustav Sourek, Vojtech Aschenbrenner, Filip Zelezny, and Ondrej Kuzelka. Lifted relational neural networks. arXiv preprint arXiv:1508.05128, 2015.
  • Sun et al. (2019) Zhiqing Sun, Zhi-Hong Deng, Jian-Yun Nie, and Jian Tang. RotatE: Knowledge graph embedding by relational rotation in complex space. In ICLR, 2019.
  • Trivedi et al. (2017) Rakshit Trivedi, Hanjun Dai, Yichen Wang, and Le Song. Know-evolve: Deep temporal reasoning for dynamic knowledge graphs. In ICML, pages 3462–3471, 2017.
  • Trivedi et al. (2019) Rakshit Trivedi, Mehrdad Farajtabar, Prasenjeet Biswal, and Hongyuan Zha. DyRep: Learning representations over dynamic graphs. In ICLR, 2019.
  • Trouillon et al. (2016) Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. Complex embeddings for simple link prediction. In ICML, pages 2071–2080, 2016.
  • Trouillon et al. (2017) Théo Trouillon, Christopher R Dance, Éric Gaussier, Johannes Welbl, Sebastian Riedel, and Guillaume Bouchard. Knowledge graph completion via complex tensor factorization. JMLR, 18(1):4735–4772, 2017.
  • Tucker (1966) Ledyard R Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966.
  • Wang et al. (2014) Zhen Wang, Jianwen Zhang, Jianlin Feng, and Zheng Chen. Knowledge graph embedding by translating on hyperplanes. In AAAI, 2014.
  • Wang et al. (2017) Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. Knowledge graph embedding: A survey of approaches and applications. IEEE TKDE, 29(12):2724–2743, 2017.
  • Xu et al. (2019) Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In ICLR, 2019.
  • Yang et al. (2015) Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding entities and relations for learning and inference in knowledge bases. ICLR, 2015.

Appendix A Proof of Theorems and Propositions

Theorem 1.

DE-SimplE is fully expressive for temporal knowledge graph completion.


For every entity , let where, according to Equation 1 with sine activations, and are defined as follows:




We provide the proof for a specific case of DE-SimplE where the elements of s are all temporal and the elements of s are all non-temporal. This specific case can be achieved by setting , and and for all and for all . If this specific case of DE-SimplE is fully expressive, so is DE-SimplE. In this specific case, and for every can be re-written as follows:


For every relation , let . To further simplify the proof, following Kazemi and Poole [2018c], we only show how the embedding values can be set such that becomes a positive number if and a negative number if . Extending the proof to the case where the score contains both components ( and ) can be done by doubling the size of the embedding vectors and following, for the second half of the vectors, a procedure similar to the one explained below.

Assume where is a natural number. These vectors can be viewed as blocks of size . For the relation , let be zero everywhere except on the block where it is everywhere. With such a value assignment to s, to find the score for a fact , only the block of each embedding vector is important. Let us now focus on the block.

The size of the block (similar to all other blocks) is and it can be viewed as sub-blocks of size . For the entity , let the values of be zero in all sub-blocks except the sub-block. With such a value assignment, to find the score for a fact , only the sub-block of the block is important. Note that this sub-block is unique for each tuple . Let us now focus on the sub-block of the block.

The size of the sub-block of the block is and it can be viewed as sub-sub-blocks of size . According to the Fourier sine series Carslaw [1921], with a large enough , we can set the values for , , and in a way that the sum of the elements of for the sub-sub-block becomes when (where is the timestamp in ) and when is a timestamp other than . Note that this sub-sub-block is unique for each tuple .

Having the above value assignments, if , we set all the values in the sub-sub-block of the sub-block of the block of to . With this assignment, at . If , we set all the values for the sub-sub-block of the sub-block of the block of to . With this assignment, at . ∎
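The Fourier step of this proof can be illustrated numerically: over a finite set of timestamps, a sum of sine terms can be made to match any target values exactly once there are at least as many terms as timestamps. The frequencies below are arbitrary illustrative choices, not the ones the proof prescribes:

```python
import numpy as np

# Target: a positive value at exactly one timestamp, negative elsewhere,
# mirroring the proof's construction of per-timestamp indicator-like signals.
timestamps = np.array([1.0, 2.0, 3.0, 5.0, 8.0])
target = np.array([1.0, -1.0, -1.0, -1.0, -1.0])  # positive only at t = 1

# Generic distinct frequencies; with as many sine terms as timestamps, the
# linear system in the amplitudes is (generically) solvable.
freqs = 0.37 * np.arange(1, len(timestamps) + 1)
basis = np.sin(np.outer(timestamps, freqs))  # shape (num_timestamps, num_terms)
amps, *_ = np.linalg.lstsq(basis, target, rcond=None)
recovered = basis @ amps
```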

Proposition 1.

Symmetry, anti-symmetry, and inversion can be incorporated into DE-SimplE in the same way as SimplE.


Let with be symmetric. According to DE-SimplE, for a fact we have:


where gives the DE-SimplE score for a fact, and are two vectors assigned to (according to SimplE) both defined according to Equation 1, and and are two vectors assigned to both defined according to Equation 1. Moreover, for a fact we have:


By tying to , the two scores become identical. Therefore, tying to ensures that the score for is the same as the score for thus ensuring the symmetry of . With the same argument, if is tied to , then one score becomes the negation of the other score so only one of them can be true.

Assume with is known to be the inverse of . Then the score for a fact is as in Equation (6) and for is as follows:


By tying to and to , the score in Equation (8) can be re-written as:


This score is identical to the score in Equation (6). Therefore, tying to and to ensures and are the inverse of each other. ∎
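The tying argument in this proof can be checked numerically. The sketch below (with hypothetical toy vectors) shows that in a SimplE-style score, tying the inverse-relation vector to the relation vector makes the score symmetric in its two entities, while tying it to the negation makes the score anti-symmetric; the time-dependence of the entity embeddings in DE-SimplE does not affect this argument, since the tying is on the relation vectors:

```python
import numpy as np

def simple_score(h_head, t_head, h_tail, t_tail, r, r_inv):
    """SimplE-style score: the average of two trilinear products, one using
    the relation vector and one using the inverse-relation vector."""
    return 0.5 * (np.sum(h_head * r * t_tail) + np.sum(h_tail * r_inv * t_head))

rng = np.random.default_rng(0)
h_v, t_v, h_u, t_u, r = (rng.normal(size=4) for _ in range(5))
```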

Proposition 2.

By constraining s in Equation (1) to be non-negative for all and to be an activation function with a non-negative range (such as ReLU, sigmoid, or squared exponential), entailment can be incorporated into DE-SimplE in the same way as SimplE.


Let with and with be two distinct relations such that entails . For a fact , the score according to DE-SimplE is as in Equation (6), and for , the score is as follows: