TLogic: Temporal Logical Rules for Explainable Link Forecasting on Temporal Knowledge Graphs

12/15/2021
by   Yushan Liu, et al.
Siemens AG

Conventional static knowledge graphs model entities in relational data as nodes, connected by edges of specific relation types. However, information and knowledge evolve continuously, and temporal dynamics emerge, which are expected to influence future situations. In temporal knowledge graphs, time information is integrated into the graph by equipping each edge with a timestamp or a time range. Embedding-based methods have been introduced for link prediction on temporal knowledge graphs, but they mostly lack explainability and comprehensible reasoning chains. Particularly, they are usually not designed to deal with link forecasting – event prediction involving future timestamps. We address the task of link forecasting on temporal knowledge graphs and introduce TLogic, an explainable framework that is based on temporal logical rules extracted via temporal random walks. We compare TLogic with state-of-the-art baselines on three benchmark datasets and show better overall performance while our method also provides explanations that preserve time consistency. Furthermore, in contrast to most state-of-the-art embedding-based methods, TLogic works well in the inductive setting where already learned rules are transferred to related datasets with a common vocabulary.



Introduction

Knowledge graphs (KGs) structure factual information in the form of triples $(e_s, r, e_o)$, where $e_s$ and $e_o$ correspond to entities in the real world and $r$ to a binary relation, e. g., (Anna, born in, Paris). This knowledge representation leads to an interpretation as a directed multigraph, where entities are identified with nodes and relations with edge types. Each edge in the KG encodes an observed fact, where the source node corresponds to the subject entity, the target node to the object entity, and the edge type to the predicate of the factual statement.

Some real-world information also includes a temporal dimension, e. g., the event (Anna, born in, Paris) happened on a specific date. To model the large amount of available event data that induces complex interactions between entities over time, temporal knowledge graphs (tKGs) have been introduced. Temporal KGs extend the triples $(e_s, r, e_o)$ to quadruples $(e_s, r, e_o, t)$ to integrate a timestamp or time range $t$, where $t$ indicates the time validity of the static event $(e_s, r, e_o)$, e. g., (Angela Merkel, visit, China, 2014/07/04). Figure 1 visualizes a subgraph from the dataset ICEWS14 as an example of a tKG. In this work, we focus on tKGs where each edge is equipped with a single timestamp.

Figure 1: A subgraph from the dataset ICEWS14 with the entities Angela Merkel, Barack Obama, France, and China. The timestamps are displayed in the format yy/mm/dd. The dotted blue line represents the correct answer to the query (Angela Merkel, consult, ?, 2014/08/09). Previous interactions between Angela Merkel and Barack Obama can be interpreted as an explanation for the prediction.

One of the common tasks on KGs is link prediction, which finds application in areas such as recommender systems nectr.hildebrandt.2019, knowledge base completion convkb.nguyen.2018, and drug repurposing polo.liu.2021. Taking the additional temporal dimension into account, it is of special interest to forecast events for future timestamps based on past information. Notable real-world applications that rely on accurate event forecasting are, e. g., clinical decision support, supply chain management, and extreme events modeling. In this work, we address link forecasting on tKGs, where we consider queries for a timestamp that has not been seen during training.

Several embedding-based methods have been introduced for tKGs to solve link prediction and forecasting (link prediction with future timestamps), e. g., TTransE ttranse.leblay.2018, TNTComplEx tntcomplex.lacroix.2020, and RE-Net renet.jin.2019. The underlying principle is to project the entities and relations into a low-dimensional vector space while preserving the topology and temporal dynamics of the tKG. These methods can learn the complex patterns that lead to an event but often lack transparency and interpretability.

To increase the transparency and trustworthiness of the solutions, human-understandable explanations are necessary, which can be provided by logical rules. However, the manual creation of rules is often difficult due to the complex nature of events. Domain experts cannot articulate the conditions for the occurrence of an event sufficiently formally to express this knowledge as rules, a problem termed the knowledge acquisition bottleneck. Moreover, symbolic methods that make use of logical rules tend to suffer from scalability issues, which makes them impractical for application to large real-world datasets.

We propose TLogic that automatically mines cyclic temporal logical rules by extracting temporal random walks from the graph. We achieve both high predictive performance and time-consistent explanations in the form of temporal rules, which conform to the observation that the occurrence of an event is usually triggered by previous events. The main contributions of this work are summarized as follows:

  • We introduce TLogic, a novel symbolic framework based on temporal random walks in temporal knowledge graphs. It is the first approach that directly learns temporal logical rules from tKGs and applies these rules to the link forecasting task.

  • Our approach provides explicit and human-readable explanations in the form of temporal logical rules and is scalable to large datasets.

  • We conduct experiments on three benchmark datasets (ICEWS14, ICEWS18, and ICEWS0515) and show better overall performance compared with state-of-the-art baselines.

  • We demonstrate the effectiveness of our method in the inductive setting where our learned rules are transferred to a related dataset with a common vocabulary.

Related Work

Subsymbolic machine learning methods, e. g., embedding-based algorithms, have achieved success for the link prediction task on static KGs. Well-known methods include RESCAL rescal.nickel.2011, TransE transe.bordes.2013, DistMult distmult.yang.2015, and ComplEx complex.trouillon.2016 as well as the graph convolutional approaches R-GCN rgcn.schlichtkrull.2018 and CompGCN compgcn.vashishth.2020. Several approaches have been recently proposed to handle tKGs, such as TTransE ttranse.leblay.2018, TA-DistMult ta-distmult-transe.garcia-duran.2018, DE-SimplE de-simple.goel.2020, TNTComplEx tntcomplex.lacroix.2020, CyGNet cygnet.zhu.2021, RE-Net renet.jin.2019, and xERTE xerte.han.2021. The main idea of these methods is to explicitly learn embeddings for timestamps or to integrate temporal information into the entity or relation embeddings. However, the black-box property of embeddings makes it difficult for humans to understand the predictions. Moreover, approaches with shallow embeddings are not suitable for an inductive setting with previously unseen entities, relations, or timestamps. Of the above methods, only CyGNet, RE-Net, and xERTE are designed for the forecasting task. xERTE is also able to provide explanations by extracting relevant subgraphs around the query subject.

Symbolic approaches for link prediction on KGs like AMIE+ amie+.galarraga.2015 and AnyBURL anyburl.meilicke.2019 mine logical rules from the dataset, which are then applied to predict new links. StreamLearner streamlearner.omran.2019 is one of the first methods for learning temporal rules. It employs a static rule learner to generate rules, which are then generalized to the temporal domain. However, it only considers a rather restricted set of temporal rules, where all body atoms have the same timestamp.

Another class of approaches is based on random walks in the graph, where the walks can support an interpretable explanation for the predictions. For example, AnyBURL samples random walks for generating rules. The methods dynnode2vec dynnode2vec.mahdavi.2018 and change2vec change2vec.bian.2019 alternately extract random walks on tKG snapshots and learn parameters for node embeddings, but they do not capture temporal patterns within the random walks. Nguyen et al. ctdne.nguyen.2018 extend the concept of random walks to temporal random walks on continuous-time dynamic networks for learning node embeddings, where the sequence of edges in the walk only moves forward in time.

Preliminaries

Define $[n] := \{1, 2, \dots, n\}$ for $n \in \mathbb{N}$.

Temporal knowledge graph

Let $\mathcal{E}$ denote a set of entities, $\mathcal{R}$ a set of relations, and $\mathcal{T}$ a set of timestamps.

A temporal knowledge graph (tKG) is a collection of facts $\mathcal{G} \subseteq \mathcal{E} \times \mathcal{R} \times \mathcal{E} \times \mathcal{T}$, where each fact is represented by a quadruple $(e_s, r, e_o, t)$. The quadruple is also called link or edge, and it indicates a connection between the subject entity $e_s \in \mathcal{E}$ and the object entity $e_o \in \mathcal{E}$ via the relation $r \in \mathcal{R}$. The timestamp $t \in \mathcal{T}$ implies the occurrence of the event $(e_s, r, e_o)$ at time $t$, where $t$ can be measured in units such as hour, day, and year.

For two timestamps $t_1$ and $t_2$, we denote the fact that $t_1$ occurs earlier than $t_2$ by $t_1 < t_2$. If, additionally, $t_1$ could represent the same time as $t_2$, we write $t_1 \leq t_2$.

We define for each edge $(e_s, r, e_o, t)$ an inverse edge $(e_o, r^{-1}, e_s, t)$ that interchanges the positions of the subject and object entity to allow the random walker to move along the edge in both directions. The relation $r^{-1}$ is called the inverse relation of $r$.
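As a concrete illustration, the following minimal Python sketch shows one possible way to represent such a graph as a list of quadruples augmented with inverse edges. The type alias Quad, the function name add_inverse_edges, the "^-1" suffix convention, and the toy facts are our own assumptions, not part of the TLogic implementation.

    from typing import List, Tuple

    # A quadruple (e_s, r, e_o, t): subject, relation, object, timestamp.
    Quad = Tuple[str, str, str, int]

    def add_inverse_edges(edges: List[Quad]) -> List[Quad]:
        # For every edge (e_s, r, e_o, t), add the inverse edge (e_o, r^-1, e_s, t)
        # so that a walker can traverse the edge in both directions.
        inverse = [(o, r + "^-1", s, t) for (s, r, o, t) in edges]
        return edges + inverse

    # Toy facts with hypothetical day-level integer timestamps.
    graph = add_inverse_edges([
        ("Angela Merkel", "discuss by telephone", "Barack Obama", 203),  # 2014/07/22
        ("Angela Merkel", "consult", "Barack Obama", 221),               # 2014/08/09
    ])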

Link forecasting

The goal of the link forecasting task is to predict new links for future timestamps. Given a query $(e_s, r, ?, t)$ with a previously unseen timestamp $t$, we want to identify a ranked list of object candidates that are most likely to complete the query. For subject prediction, we formulate the query as $(e_o, r^{-1}, ?, t)$.

Temporal random walk

A non-increasing temporal random walk $W$ of length $l \in \mathbb{N}$ from entity $e_{l+1} \in \mathcal{E}$ to entity $e_1 \in \mathcal{E}$ in the tKG $\mathcal{G}$ is defined as a sequence of edges

$W = ((e_{l+1}, r_l, e_l, t_l), (e_l, r_{l-1}, e_{l-1}, t_{l-1}), \dots, (e_2, r_1, e_1, t_1))$ (1)

where $t_{i+1} \geq t_i$ for $i \in [l-1]$.

A non-increasing temporal random walk complies with time constraints so that the edges are traversed only backward in time, where it is also possible to walk along edges with the same timestamp.

Temporal logical rule

We formulate temporal logical rules as first-order Horn clauses. Let $E_i$ and $T_i$ for $i \in [l+1]$ be variables that represent entities and timestamps, respectively. Further, let $r_h, r_1, \dots, r_l \in \mathcal{R}$ be fixed.

A cyclic temporal logical rule $R$ of length $l$ is defined as

$R: \; (E_1, r_h, E_{l+1}, T_{l+1}) \leftarrow \bigwedge_{i=1}^{l} (E_i, r_i, E_{i+1}, T_i)$

with the temporal constraints

$T_1 \leq T_2 \leq \dots \leq T_l < T_{l+1}.$ (2)

The left-hand side of $R$ is called the rule head, with $r_h$ being the head relation, while the right-hand side is called the rule body, which is presented as the conjunction of body atoms $(E_i, r_i, E_{i+1}, T_i)$. The rule is called cyclic because the rule head and the rule body constitute two different walks connecting the same two variables $E_1$ and $E_{l+1}$. A temporal rule implies that if the rule body holds with the temporal constraints given by (2), then the rule head is true as well for a future timestamp $T_{l+1}$.

The replacement of the variables $E_i$ and $T_i$ by constant terms is called grounding or instantiation. For example, a grounding of the temporal rule

$(E_1, \text{consult}, E_2, T_2) \leftarrow (E_1, \text{discuss by telephone}, E_2, T_1)$

is given by the edges (Angela Merkel, discuss by telephone, Barack Obama, 2014/07/22) and (Angela Merkel, consult, Barack Obama, 2014/08/09) in Figure 1. Let rule grounding refer to the replacement of the variables in the entire rule and body grounding refer to the replacement of the variables only in the body, where all groundings must comply with the temporal constraints in (2).

In many domains, logical rules are frequently violated, so confidence values are determined to estimate the probability of a rule's correctness. We adapt the standard confidence to take timestamp values into account. Let $r_h, r_1, \dots, r_l$ be the relations in a rule $R$. The body support is defined as the number of body groundings, i. e., the number of tuples $(e_1, \dots, e_{l+1}, t_1, \dots, t_l)$ such that $(e_i, r_i, e_{i+1}, t_i) \in \mathcal{G}$ for all $i \in [l]$ and $t_i \leq t_{i+1}$ for $i \in [l-1]$. The rule support is defined as the number of rule groundings, i. e., the number of tuples $(e_1, \dots, e_{l+1}, t_1, \dots, t_{l+1})$ with $(e_1, r_h, e_{l+1}, t_{l+1}) \in \mathcal{G}$ and $t_{l+1} > t_l$ such that $(e_i, r_i, e_{i+1}, t_i) \in \mathcal{G}$ for all $i \in [l]$ and $t_i \leq t_{i+1}$ for $i \in [l-1]$. The confidence of the rule $R$, denoted by conf($R$), can then be obtained by dividing the rule support by the body support.

Our Framework

We introduce TLogic, a rule-based link forecasting framework for tKGs. TLogic first extracts temporal walks from the graph and then lifts these walks to a more abstract, semantic level to obtain temporal rules that generalize to new data. The application of these rules generates answer candidates, for which the body groundings in the graph serve as explicit and human-readable explanations. Our framework consists of the components rule learning, rule application, and evaluation. The pseudocode for rule learning is shown in Algorithm 1 and for rule application in Algorithm 2.

Rule Learning

Input: Temporal knowledge graph $\mathcal{G}$.
Parameters: Rule lengths $L \subseteq \mathbb{N}$, number of temporal random walks $n$, transition distribution $d$.
Output: Temporal logical rules $\mathcal{TR}$.

1:  for relation $r_h \in \mathcal{R}$ do
2:     for rule length $l \in L$ do
3:        $\mathcal{TR}_l(r_h) := \emptyset$
4:        for $i \in [n]$ do
5:           According to transition distribution $d$, sample a temporal random walk $W$ of length $l+1$ with head relation $r_h$.  See (4).
6:           Transform walk $W$ to the corresponding temporal logical rule $R$.  See (5).
7:           Estimate the confidence of rule $R$.
8:           $\mathcal{TR}_l(r_h) := \mathcal{TR}_l(r_h) \cup \{R\}$
9:     $\mathcal{TR}(r_h) := \bigcup_{l \in L} \mathcal{TR}_l(r_h)$
10:  $\mathcal{TR} := \bigcup_{r_h \in \mathcal{R}} \mathcal{TR}(r_h)$
11:  return $\mathcal{TR}$
Algorithm 1 Rule learning

As the first step of rule learning, temporal walks are extracted from the tKG $\mathcal{G}$. For a rule of length $l$, a walk of length $l+1$ is sampled, where the additional step corresponds to the rule head.

Let $r_h \in \mathcal{R}$ be a fixed relation for which we want to learn rules. For the first sampling step $s = 1$, we sample an edge $(e_1, r_h, e_{l+1}, t_{l+1})$, which will serve as the rule head, uniformly from all edges with relation type $r_h$. A temporal random walker then iteratively samples edges adjacent to the current object until a walk of length $l+1$ is obtained.

For sampling step $s \in \{2, \dots, l+1\}$, let $\tilde{e}_s$ denote the current object, i. e., the object of the previously sampled edge, $\hat{t}_s$ the timestamp of the previously sampled edge, and $\mathbb{A}(s)$ the set of feasible edges for the next transition. To fulfill the temporal constraints in (1) and (2), we define

$\mathbb{A}(s) := \{(\tilde{e}_s, r, e, t) \in \mathcal{G} \mid t \leq \hat{t}_s\}$,

where the inequality is strict ($t < \hat{t}_2$) for $s = 2$ so that all body timestamps lie before the rule head, and where $\mathbb{A}(s)$ excludes the inverse of the previously sampled edge to avoid redundant rules. For obtaining cyclic walks, we sample in the last step $s = l+1$ an edge that connects the walk to the first entity $e_1$ if such edges exist. Otherwise, we sample the next walk.

The transition distribution $d$ for sampling the next edge can either be uniform or exponentially weighted. The exponentially weighted probability for choosing edge $e \in \mathbb{A}(s)$ in step $s$ is given by

$\mathbb{P}(e) = \frac{\exp(t_e - \hat{t}_s)}{\sum_{\hat{e} \in \mathbb{A}(s)} \exp(t_{\hat{e}} - \hat{t}_s)},$ (3)

where $t_e$ denotes the timestamp of edge $e$ and $\hat{t}_s$ the timestamp of the previously sampled edge. The exponential weighting favors edges with timestamps that are closer to the timestamp of the previous edge and that are probably more relevant for prediction.
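The following Python sketch illustrates, under our own naming assumptions (feasible_edges and sample_next_edge are hypothetical names, not the authors' code), how the feasible set and the exponentially weighted choice in (3) could be implemented; filtering out the inverse of the previous edge is omitted for brevity.

    import math
    import random
    from typing import List, Tuple

    Quad = Tuple[str, str, str, int]  # (subject, relation, object, timestamp)

    def feasible_edges(graph: List[Quad], cur_obj: str, prev_t: int,
                       first_body_step: bool) -> List[Quad]:
        # Edges leaving the current object that respect the temporal constraints:
        # strictly earlier than the rule head in step s = 2, non-increasing afterwards.
        if first_body_step:
            return [e for e in graph if e[0] == cur_obj and e[3] < prev_t]
        return [e for e in graph if e[0] == cur_obj and e[3] <= prev_t]

    def sample_next_edge(feasible: List[Quad], prev_t: int,
                         exponential: bool = True) -> Quad:
        # Uniform choice, or weights proportional to exp(t_e - prev_t) as in (3);
        # since t_e <= prev_t, edges closer in time to the previous edge are favored.
        if not exponential:
            return random.choice(feasible)
        weights = [math.exp(t - prev_t) for (_, _, _, t) in feasible]
        return random.choices(feasible, weights=weights, k=1)[0]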

The resulting temporal walk $W$ is given by

$W = ((e_1, r_h, e_{l+1}, t_{l+1}), (e_{l+1}, r_l, e_l, t_l), \dots, (e_2, r_1, e_1, t_1)).$ (4)

$W$ can then be transformed to a temporal rule $R$ by replacing the entities $e_i$ and timestamps $t_i$ with variables $E_i$ and $T_i$. While the first edge in $W$ becomes the rule head $(E_1, r_h, E_{l+1}, T_{l+1})$, the other edges are mapped to body atoms, where each edge $(e_{i+1}, r_i, e_i, t_i)$ is converted to the body atom $(E_i, r_i^{-1}, E_{i+1}, T_i)$. The final rule $R$ is denoted by

$R: \; (E_1, r_h, E_{l+1}, T_{l+1}) \leftarrow \bigwedge_{i=1}^{l} (E_i, r_i^{-1}, E_{i+1}, T_i).$ (5)

In addition, we impose the temporal consistency constraints $T_1 \leq T_2 \leq \dots \leq T_l < T_{l+1}$.

The entities in $W$ do not need to be distinct since a pair of entities can have many interactions at different points in time. For example, Angela Merkel made several visits to China in 2014, which could constitute important information for prediction. Repetitive occurrences of the same entity in $W$ are replaced with the same variable in $R$ to maintain this knowledge.

For the confidence estimation of $R$, we sample from the graph a fixed number of body groundings, which have to match the body relations and the variable constraints mentioned in the last paragraph while satisfying the temporal constraints from (2). The number of unique bodies serves as the body support. The rule support is determined by counting the number of bodies for which an edge with relation type $r_h$ exists that connects $e_1$ and $e_{l+1}$ from the body. Moreover, the timestamp of this edge has to be greater than all body timestamps to fulfill (2).

For every relation $r_h \in \mathcal{R}$, we sample $n$ temporal walks for a set of prespecified lengths $L$. The set $\mathcal{TR}_l(r_h)$ stands for all rules of length $l$ with head relation $r_h$ together with their corresponding confidences. All rules for relation $r_h$ are included in $\mathcal{TR}(r_h) = \bigcup_{l \in L} \mathcal{TR}_l(r_h)$, and the complete set of learned temporal rules is given by $\mathcal{TR} = \bigcup_{r_h \in \mathcal{R}} \mathcal{TR}(r_h)$.

It is possible to learn rules only for a single relation or a set of specific relations of interest. Explicitly learning rules for all relations is especially effective for rare relations that would otherwise only be sampled with a small probability. The learned rules are not specific to the graph from which they have been extracted, but they could be employed in an inductive setting where the rules are transferred to related datasets that share a common vocabulary for straightforward application.

Rule Application

Input: Test query $q = (e_q, r_q, ?, t_q)$, temporal logical rules $\mathcal{TR}$, temporal knowledge graph $\mathcal{G}$.
Parameters: Time window $w$, minimum number of candidates $k$, score function $f$.
Output: Answer candidates $\mathcal{C}$.

1:  $\mathcal{C} := \emptyset$.  Apply the rules in $\mathcal{TR}(r_q)$ by decreasing confidence.
2:  if $\mathcal{TR}(r_q) \neq \emptyset$ then
3:     for rule $R \in \mathcal{TR}(r_q)$ do
4:        Find all body groundings of $R$ in $\mathcal{SG}(w)$, where $\mathcal{SG}(w)$ consists of the edges within the time window $w$.
5:        Retrieve candidates $\mathcal{C}(R)$ from the target entities of the body groundings.
6:        for $c \in \mathcal{C}(R)$ do
7:           Calculate score $f(R, c)$.  See (6).
8:           $\mathcal{C} := \mathcal{C} \cup \{(c, f(R, c))\}$
9:        if the number of distinct candidates in $\mathcal{C}$ is at least $k$ then
10:          break
11:  return $\mathcal{C}$
Algorithm 2 Rule application

The learned temporal rules are applied to answer queries of the form $(e_q, r_q, ?, t_q)$. The answer candidates are retrieved from the target entities of body groundings in the tKG $\mathcal{G}$. If there exist no rules for the query relation $r_q$, or if there are no matching body groundings in the graph, then no answers are predicted for the given query.

To apply the rules on relevant data, a subgraph $\mathcal{SG}(w)$ dependent on a time window $w$ is retrieved. For $w > 0$, the subgraph contains all edges from $\mathcal{G}$ that have timestamps $t \in [t_q - w, t_q)$. If $w = \infty$, then all edges with timestamps prior to the query timestamp are used for rule application, i. e., $\mathcal{SG}(\infty)$ consists of all facts with $t \in [t_{\min}, t_q)$, where $t_{\min}$ is the minimum timestamp in the graph $\mathcal{G}$.
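A minimal sketch of this retrieval, assuming the quadruple-list representation from above and with window=None standing in for $w = \infty$ (both our own conventions):

    from typing import List, Optional, Tuple

    Quad = Tuple[str, str, str, int]

    def time_window_subgraph(graph: List[Quad], t_query: int,
                             window: Optional[int] = None) -> List[Quad]:
        # Keep edges with timestamps in [t_query - window, t_query);
        # window=None keeps all edges strictly before the query timestamp.
        lo = t_query - window if window is not None else min(t for (_, _, _, t) in graph)
        return [e for e in graph if lo <= e[3] < t_query]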

We apply the rules by decreasing confidence, where each rule $R$ generates a set of answer candidates $\mathcal{C}(R)$. Each candidate $c$ is then scored by a function $f$ that reflects the probability of the candidate being the correct answer to the query.

Let $\mathcal{B}(R, c)$ be the set of body groundings of rule $R$ that start at entity $e_q$ and end at entity $c$. We choose as score function a convex combination of the rule's confidence and a function that takes the time difference $t_q - t_1(\mathcal{B}(R, c))$ as input, where $t_1(\mathcal{B}(R, c))$ denotes the earliest timestamp in the body. If several body groundings exist, we take from all possible values of $t_1(\mathcal{B}(R, c))$ the one that is closest to $t_q$. For candidate $c$, the score function $f$ is defined as

$f(R, c) = a \cdot \mathrm{conf}(R) + (1 - a) \cdot \exp\!\left(-\lambda \left(t_q - t_1(\mathcal{B}(R, c))\right)\right)$ (6)

with $a \in [0, 1]$ and $\lambda > 0$.

The intuition for this choice of $f$ is that candidates generated by high-confidence rules should receive a higher score. Adding a dependency on the timeframe of the rule grounding is based on the observation that the existence of edges in a rule becomes increasingly probable with decreasing time difference between the edges. We choose the exponential distribution since it is commonly used to model interarrival times of events. The time difference $t_q - t_1(\mathcal{B}(R, c))$ is always non-negative for a future timestamp value $t_q$, and with the assumption that there exists a fixed mean, the exponential distribution is also the maximum entropy distribution for such a time difference variable. The exponential distribution is rescaled so that both summands are in the range $[0, 1]$.
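In code, (6) reduces to a one-line combination. The defaults below use $a = 0.5$ and $\lambda = 0.1$, the best values listed in Table 5; the function name is our own assumption.

    import math

    def score(conf_r: float, t_query: int, t_body_earliest: int,
              a: float = 0.5, lam: float = 0.1) -> float:
        # Convex combination of rule confidence and an exponentially decaying
        # function of the time difference t_query - t_body_earliest, cf. (6).
        return a * conf_r + (1.0 - a) * math.exp(-lam * (t_query - t_body_earliest))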

All candidates $c$ are saved together with their scores $f(R, c)$ in $\mathcal{C}$. We stop the rule application when the number of different answer candidates is at least $k$ so that there is no need to go through all rules.

Evaluation

For evaluation of the results, all scores of each candidate $c$ are aggregated through a noisy-OR calculation, which produces the final score

$1 - \prod_{s \in \mathcal{S}(c)} (1 - s),$ (7)

where $\mathcal{S}(c)$ denotes the set of scores $f(R, c)$ of candidate $c$. The idea is to aggregate the scores to produce a probability, where candidates implied by more rules should have a higher score.
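A compact Python sketch of this aggregation (the function name is ours):

    import math
    from collections import defaultdict
    from typing import Dict, List, Tuple

    def aggregate_noisy_or(scored: List[Tuple[str, float]]) -> Dict[str, float]:
        # scored holds (candidate, score) pairs collected during rule application;
        # the final score of a candidate is 1 - prod(1 - s) over all its scores, cf. (7).
        per_candidate: Dict[str, List[float]] = defaultdict(list)
        for cand, s in scored:
            per_candidate[cand].append(s)
        return {cand: 1.0 - math.prod(1.0 - s for s in scores)
                for cand, scores in per_candidate.items()}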

In case there are no rules for the query relation $r_q$, or if there are no matching body groundings in the graph, it might still be interesting to retrieve possible answer candidates. In the experiments, we apply a simple baseline where the scores for the candidates are obtained from the overall object distribution in the training data if $r_q$ is a new relation. If $r_q$ already exists in the training set, we take the object distribution of the edges with relation type $r_q$.
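This fallback could look as follows; a sketch under our naming assumptions, not the authors' implementation:

    from collections import Counter
    from typing import Dict, List, Optional, Tuple

    Quad = Tuple[str, str, str, int]

    def object_distribution_scores(train_edges: List[Quad],
                                   r_query: Optional[str] = None) -> Dict[str, float]:
        # Use the objects of training edges with the query relation if it was seen
        # during training; otherwise fall back to the overall object distribution.
        pool = [o for (_, r, o, _) in train_edges if r == r_query]
        if not pool:
            pool = [o for (_, _, o, _) in train_edges]
        counts = Counter(pool)
        total = sum(counts.values())
        return {obj: c / total for obj, c in counts.items()}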

Experiments

Datasets

We conduct experiments on the dataset Integrated Crisis Early Warning System (ICEWS, https://dataverse.harvard.edu/dataverse/icews), which contains information about international events and is a commonly used benchmark dataset for link prediction on tKGs. We choose the subsets ICEWS14, ICEWS18, and ICEWS0515, which include data from the years 2014, 2018, and 2005 to 2015, respectively. Since we consider link forecasting, each dataset is split into training, validation, and test set so that the timestamps in the training set occur earlier than the timestamps in the validation set, which again occur earlier than the timestamps in the test set. To ensure a fair comparison, we use the split provided by xerte.han.2021 (https://github.com/TemporalKGTeam/xERTE). The statistics of the datasets are summarized in the appendix.

Experimental Setup

For each test instance $(e_s, r, e_o, t)$, we generate a list of candidates for both object prediction $(e_s, r, ?, t)$ and subject prediction $(e_o, r^{-1}, ?, t)$. The candidates are ranked by decreasing scores, which are calculated according to (7).

The confidence for each rule is estimated by sampling a fixed number of body groundings and counting the number of times the rule head holds. We learn rules of the lengths 1, 2, and 3, and for application, we only consider rules that exceed a minimum confidence and a minimum body support threshold.

We compute the mean reciprocal rank (MRR) and hits@$k$ for $k \in \{1, 3, 10\}$, which are standard metrics for link prediction on KGs. For a rank $x \in \mathbb{N}$, the reciprocal rank is defined as $\frac{1}{x}$, and the MRR is the average of all reciprocal ranks of the correct query answers across all queries. The metric hits@$k$ (h@$k$) indicates the proportion of queries for which the correct entity appears among the top $k$ candidates.
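For reference, both metrics can be computed from the ranks of the correct answers in a few lines (the function name is ours):

    def mrr_and_hits(ranks, ks=(1, 3, 10)):
        # ranks: one rank per query, the position of the correct answer (1 = best).
        n = len(ranks)
        mrr = sum(1.0 / r for r in ranks) / n
        hits = {k: sum(r <= k for r in ranks) / n for k in ks}
        return mrr, hits

    # Example: three queries whose correct answers were ranked 1, 4, and 2
    # yield MRR = (1 + 1/4 + 1/2) / 3 ≈ 0.583 and h@3 = 2/3.
    mrr, hits = mrr_and_hits([1, 4, 2])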

Similar to xerte.han.2021, we perform time-aware filtering where all correct entities at the query timestamp except for the true query object are filtered out from the answers. In comparison to the alternative setting that filters out all other objects that appear together with the query subject and relation at any timestamp, time-aware filtering yields a more realistic performance estimate.

Baseline methods

We compare TLogic (code available at https://github.com/liu-yushan/TLogic) with the state-of-the-art baselines for static link prediction DistMult distmult.yang.2015, ComplEx complex.trouillon.2016, and AnyBURL anyburl.meilicke.2019; anyburl.meilicke.2020 as well as for temporal link prediction TTransE ttranse.leblay.2018, TA-DistMult ta-distmult-transe.garcia-duran.2018, DE-SimplE de-simple.goel.2020, TNTComplEx tntcomplex.lacroix.2020, CyGNet cygnet.zhu.2021, RE-Net renet.jin.2019, and xERTE xerte.han.2021. All baseline results except for the results on AnyBURL come from xerte.han.2021. AnyBURL samples paths based on reinforcement learning and generalizes them to rules, where the rule space also includes, e. g., acyclic rules and rules with constants. A non-temporal variant of TLogic would sample paths randomly and only learn cyclic rules, which would presumably yield worse performance than AnyBURL. Therefore, we choose AnyBURL as a baseline to assess the effectiveness of adding temporal constraints.

Results

Dataset ICEWS14 ICEWS18 ICEWS0515
Model MRR h@1 h@3 h@10 MRR h@1 h@3 h@10 MRR h@1 h@3 h@10
DistMult 0.2767 0.1816 0.3115 0.4696 0.1017 0.0452 0.1033 0.2125 0.2873 0.1933 0.3219 0.4754
ComplEx 0.3084 0.2151 0.3448 0.4958 0.2101 0.1187 0.2347 0.3987 0.3169 0.2144 0.3574 0.5204
AnyBURL 0.2967 0.2126 0.3333 0.4673 0.2277 0.1510 0.2544 0.3891 0.3205 0.2372 0.3545 0.5046
TTransE 0.1343 0.0311 0.1732 0.3455 0.0831 0.0192 0.0856 0.2189 0.1571 0.0500 0.1972 0.3802
TA-DistMult 0.2647 0.1709 0.3022 0.4541 0.1675 0.0861 0.1841 0.3359 0.2431 0.1458 0.2792 0.4421
DE-SimplE 0.3267 0.2443 0.3569 0.4911 0.1930 0.1153 0.2186 0.3480 0.3502 0.2591 0.3899 0.5275
TNTComplEx 0.3212 0.2335 0.3603 0.4913 0.2123 0.1328 0.2402 0.3691 0.2754 0.1952 0.3080 0.4286
CyGNet 0.3273 0.2369 0.3631 0.5067 0.2493 0.1590 0.2828 0.4261 0.3497 0.2567 0.3909 0.5294
RE-Net 0.3828 0.2868 0.4134 0.5452 0.2881 0.1905 0.3244 0.4751 0.4297 0.3126 0.4685 0.6347
xERTE 0.4079 0.3270 0.4567 0.5730 0.2931 0.2103 0.3351 0.4648 0.4662 0.3784 0.5231 0.6392
TLogic 0.4304 0.3356 0.4827 0.6123 0.2982 0.2054 0.3395 0.4853 0.4697 0.3621 0.5313 0.6743
Table 1: Results of link forecasting on the datasets ICEWS14, ICEWS18, and ICEWS0515. All metrics are time-aware filtered. The best results among all models are displayed in bold.

The results of the experiments are displayed in Table 1. TLogic outperforms all baseline methods with respect to the metrics MRR, hits@3, and hits@10. Only xERTE performs better than TLogic for hits@1 on the datasets ICEWS18 and ICEWS0515.

Besides a list of possible answer candidates with corresponding scores, TLogic can also provide temporal rules and body groundings in the form of walks from the graph that support the predictions. Table 2 presents three exemplary rules with high confidences that were learned from ICEWS14. For the query (Angela Merkel, consult, ?, 2014/08/09), two walks are shown in Table 2, which serve as time-consistent explanations for the correct answer Barack Obama.

Confidence Head Body
0.963 … …
0.818 … …
0.750 … …
0.570 (Merkel, consult, Obama, 14/08/09) (Merkel, discuss by telephone, Obama, 14/07/22)
0.500 (Merkel, consult, Obama, 14/08/09) (Merkel, express intent to meet, Obama, 14/05/02), (…, Merkel, 14/07/18), (…, Obama, 14/07/29)
Table 2: Three exemplary rules from the dataset ICEWS14 and two walks for the query (Angela Merkel, consult, ?, 2014/08/09) that lead to the correct answer Barack Obama. The timestamps are displayed in the format yy/mm/dd.
Figure 2: MRR performance on the validation set of ICEWS14. The transition distribution is either uniform or exponentially weighted.

Inductive setting

One advantage of our learned logical rules is that they are applicable to any new dataset as long as the new dataset covers common relations. This might be relevant for cases where new entities appear. For example, Donald Trump, who served as president of the United States from 2017 to 2021, is included in the dataset ICEWS18 but not in ICEWS14. The logical rules are not tied to particular entities and would still be applicable, while embedding-based methods have difficulties operating in this challenging setting. The models would need to be retrained to obtain embeddings for the new entities, where existing embeddings might also need to be adapted to the different time range.

For the two rule-based methods AnyBURL and TLogic, we apply the rules learned on the training set of ICEWS0515 (with timestamps from 2005/01/01 to 2012/08/06) to the test set of ICEWS14 as well as the rules learned on the training set of ICEWS14 to the test set of ICEWS18 (see Table 3). The performance of TLogic in the inductive setting is close to the results in Table 1 for all metrics, while for AnyBURL, especially the results on ICEWS18 drop significantly. It seems that the temporal information encoded in TLogic is essential for achieving correct predictions in the inductive setting. ICEWS14 has only 7,128 entities, while ICEWS18 contains 23,033 entities. The results confirm that temporal rules from TLogic can even be transferred to a dataset with a large number of new entities and timestamps and still lead to strong performance.

Analysis

The results in this section are obtained on the dataset ICEWS14, but the findings are similar for the other two datasets. More detailed results can be found in the appendix.

Train → Test Model MRR h@1 h@3 h@10
ICEWS0515 → ICEWS14 AnyBURL 0.2664 0.1800 0.3024 0.4477
ICEWS0515 → ICEWS14 TLogic 0.4253 0.3291 0.4780 0.6122
ICEWS14 → ICEWS18 AnyBURL 0.1546 0.0907 0.1685 0.2958
ICEWS14 → ICEWS18 TLogic 0.2915 0.1987 0.3330 0.4795
Table 3: Inductive setting where the rules learned on the source dataset are transferred and applied to the target dataset.

Number of walks

Figure 2 shows the MRR performance on the validation set of ICEWS14 for different numbers of walks extracted during rule learning. We observe a performance increase with a growing number of walks. However, the gains saturate between 100 and 200 walks, where only rather small improvements are still attainable.

Transition distribution

We test two transition distributions for the extraction of temporal walks: uniform and exponentially weighted according to (3). The rationale behind using an exponentially weighted distribution is the observation that related events tend to happen within a short timeframe. The distribution of the first edge is always uniform to not restrict the variety of obtained walks. Overall, the performance of the exponential distribution consistently exceeds the uniform setting with respect to the MRR (see Figure 2).

We observe that the exponential distribution leads to more rules of length 3 than the uniform setting (11,718 compared to 8,550 rules for 200 walks), while it is the opposite for rules of length 1 (7,858 compared to 11,019 rules). The exponential setting leads to more successful longer walks because the timestamp differences between subsequent edges tend to be smaller. It is less likely that there are no feasible transitions anymore because of temporal constraints. The uniform setting, however, leads to a better exploration of the neighborhood around the start node for shorter walks.

Rule length

We learn rules of lengths 1, 2, and 3. Using all rules for application results in the best performance (MRR on the validation set: 0.4373), followed by rules of only length 1 (0.4116), 3 (0.4097), and 2 (0.1563). The reason why rules of length 3 perform better than length 2 is that the temporal walks are allowed to transition back and forth between the same entities. Since we only learn cyclic rules, a rule body of length 2 must constitute a path with no recurring entities, resulting in fewer rules and rule groundings in the graph. Interestingly, simple rules of length 1 already yield very good performance.

Time window

For rule application, we define a time window for retrieving the relevant data. The performance increases with the size of the time window, even though relevant events tend to be close to the query timestamp. The second summand of the score function in (6) takes the time difference between the query timestamp and the earliest body timestamp into account. In this case, earlier events with a large timestamp difference receive a lesser weight, while generally, as much information as possible is beneficial for prediction.

Score function

We define the score function $f$ in (6) as a convex combination of the rule's confidence and a function that depends on the time difference $t_q - t_1(\mathcal{B}(R, c))$. The performance of only using the confidence (MRR: 0.3869) or only using the exponential function (0.4077) is worse than the combination (0.4373), which means that both the information from the rules' confidences and the time differences are important for prediction.

Variance

The variance in the performance due to different rules obtained from the rule learning component is quite small. Running the same model with the best hyperparameter settings for five different seeds results in a standard deviation of 0.0012 for the MRR. The rule application component is deterministic and always leads to the same candidates with corresponding scores for the same hyperparameter setting.

Training and inference time

The worst-case time complexity for learning rules of length $l$ grows with the number of walks $n$, the maximum node degree in the training set, and the number of body samples used for estimating the confidence. The worst-case time complexity for inference grows with the maximum rule length in $\mathcal{TR}$ and the minimum number of candidates $k$. For large graphs with high node degrees, it is possible to reduce the complexity by only keeping a maximum number of candidate walks during rule application.

Both training and application can be parallelized since the rule learning for each relation and the rule application for each test query are independent. Rule learning with 200 walks and the exponentially weighted transition distribution for rule lengths 1, 2, and 3 on a machine with 8 CPUs takes 180 sec for ICEWS14, while the application on the validation set with the best hyperparameter values from Table 5 takes 2000 sec. For comparison, the best-performing baseline xERTE already needs 5000 sec for training one epoch on the same machine, where an MRR of 0.3953 can be obtained, while testing on the validation set takes 700 sec.

Conclusion

We have proposed TLogic, the first symbolic framework that directly learns temporal logical rules from temporal knowledge graphs and applies these rules for link forecasting. The framework generates answers by applying rules to observed events prior to the query timestamp and scores the answer candidates depending on the rules’ confidences and time differences. Experiments on three datasets indicate that TLogic achieves superior overall performance compared to state-of-the-art baselines. In addition, our approach also provides time-consistent, explicit, and human-readable explanations for the predictions in the form of temporal logical rules.

As future work, it would be interesting to integrate acyclic rules, which could also contain relevant information and might boost the performance for rules of length 2. Furthermore, the simple sampling mechanism for temporal walks could be replaced by a more sophisticated approach, which is able to effectively identify the most promising walks.

Acknowledgement

This work has been supported by the German Federal Ministry for Economic Affairs and Energy (BMWi) as part of the project RAKI under Grant No. 01MD19012C and by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A. The authors of this work take full responsibility for its content.

References

Appendix A Appendix

Dataset statistics

Table 4 shows the statistics of the three datasets ICEWS14, ICEWS18, and ICEWS0515, where $|\cdot|$ denotes the cardinality of a set.

Dataset $|\mathcal{G}_{\text{train}}|$ $|\mathcal{G}_{\text{valid}}|$ $|\mathcal{G}_{\text{test}}|$ $|\mathcal{E}|$ $|\mathcal{R}|$ $|\mathcal{T}|$
ICEWS14 63,685 13,823 13,222 7,128 230 365
ICEWS18 373,018 45,995 49,545 23,033 256 304
ICEWS0515 322,958 69,224 69,147 10,488 251 4,017
Table 4: Dataset statistics with daily time resolution for all three ICEWS datasets.

Experimental details

All experiments were conducted on a Linux machine with 16 CPU cores and 32 GB RAM. The set of tested hyperparameter ranges and best parameter values for TLogic are displayed in Table 5. Due to memory constraints, the time window for ICEWS18 is set to 200 and for ICEWS0515 to 1000. The best hyperparameter values are chosen based on the MRR on the validation set. Due to the small variance of our approach, the shown results are based on one algorithm run. A random seed of 12 is fixed for the rule learning component to obtain reproducible results.

Hyperparameter Range Best
Number of walks $n$ {10, 25, 50, 100, 200} 200
Transition distribution $d$ {uniform, exponentially weighted} exponentially weighted
Rule lengths $L$ {1}, {2}, {3}, {1, 2, 3} {1, 2, 3}
Time window $w$ {30, 90, 150, 210, 270, ∞} ∞
Minimum candidates $k$ … 20
$a$ (score function $f$) … 0.5
$\lambda$ (score function $f$) … 0.1
Table 5: Hyperparameter ranges and best parameter values.

All results in the appendix refer to the validation set of ICEWS14. However, the observations are similar for the test set and the other two datasets. All experiments use the best set of hyperparameters, where only the analyzed parameters are modified.

Object distribution baseline

We apply a simple object distribution baseline when there are no rules for the query relation or no matching body groundings in the graph. This baseline is only added for completeness and does not improve the results in a significant way.

The proportion of cases where there are no rules for the test query relation is 15/26,444 = 0.00056 for ICEWS14, 21/99,090 = 0.00021 for ICEWS18, and 9/138,294 = 0.00007 for ICEWS0515. The proportion of cases where there are no matching body groundings is 880/26,444 = 0.0333 for ICEWS14, 2,535/99,090 = 0.0256 for ICEWS18, and 2,375/138,294 = 0.0172 for ICEWS0515.

Number of walks and transition distribution

Table 6 shows the results for different choices of numbers of walks and transition distributions. The performance for all metrics increases with the number of walks. Exponentially weighted transition always outperforms uniform sampling.

Walks Transition MRR h@1 h@3 h@10
10 Unif 0.3818 0.2983 0.4307 0.5404
10 Exp 0.3906 0.3054 0.4408 0.5530
25 Unif 0.4098 0.3196 0.4614 0.5803
25 Exp 0.4175 0.3270 0.4710 0.5875
50 Unif 0.4219 0.3307 0.4754 0.5947
50 Exp 0.4294 0.3375 0.4837 0.6024
100 Unif 0.4266 0.3315 0.4817 0.6057
100 Exp 0.4324 0.3397 0.4861 0.6092
200 Unif 0.4312 0.3366 0.4851 0.6114
200 Exp 0.4373 0.3434 0.4916 0.6161
Table 6: Results for different choices of numbers of walks and transition distributions.

Rule length

Table 7 indicates that using rules of all lengths for application results in the best performance. Learning only cyclic rules probably makes it more difficult to find rules of length 2, where the rule body must constitute a path with no recurring entities, leading to fewer rules and body groundings in the graph.

Rule length MRR h@1 h@3 h@10
1 0.4116 0.3168 0.4708 0.5909
2 0.1563 0.0648 0.1776 0.3597
3 0.4097 0.3213 0.4594 0.5778
1,2,3 0.4373 0.3434 0.4916 0.6161
Table 7: Results for different choices of rule lengths.

Time window

Generally, the larger the time window, the better the performance (see Table 8). If using all previous timestamps leads to excessive memory usage, the time window should be decreased.

Time window MRR h@1 h@3 h@10
30 0.3842 0.3080 0.4294 0.5281
90 0.4137 0.3287 0.4627 0.5750
150 0.4254 0.3368 0.4766 0.5950
210 0.4311 0.3403 0.4835 0.6035
270 0.4356 0.3426 0.4892 0.6131
∞ 0.4373 0.3434 0.4916 0.6161
Table 8: Results for different choices of time windows.

Score function

Using the best hyperparameter values for $a$ and $\lambda$, Table 9 shows in the first row the results if only the rules' confidences are used for scoring, in the second row if only the exponential component is used, and in the last row the results for the combined score function. The combination yields the best overall performance. The optimal balance between the two terms, however, depends on the application and metric prioritization.

$a$ $\lambda$ MRR h@1 h@3 h@10
1 arbitrary 0.3869 0.2806 0.4444 0.5918
0 0.1 0.4077 0.3515 0.4820 0.6051
0.5 0.1 0.4373 0.3434 0.4916 0.6161
Table 9: Results for different parameter values in the score function $f$.

Rule learning

Figures 3 and 4 show the number of rules learned under the two transition distributions. The total number of learned rules is similar for the uniform and the exponential distribution, but there is a large difference for rules of lengths 1 and 3. The exponential distribution leads to more successful longer walks and thus more longer rules, while the uniform distribution leads to a better exploration of the neighborhood around the start node for shorter walks.

Figure 3: Total number of learned rules and number of rules for length 1.
Figure 4: Number of rules for lengths 2 and 3.

Training and inference time

The rule learning and rule application times are shown in Figures 5 and 6, dependent on the number of temporal walks extracted during learning.

Figure 5: Rule learning time.
Figure 6: Rule application time.

The worst-case time complexity for learning rules of length $l$ grows with the number of walks $n$, the maximum node degree in the training set, and the number of body samples used for estimating the confidence. The worst-case time complexity for inference grows with the maximum rule length in $\mathcal{TR}$ and the minimum number of candidates $k$. More detailed steps of the algorithms for understanding these complexity estimations are given in Algorithm 3 and Algorithm 4.

Figure 7: Overall framework.

Input: Temporal knowledge graph $\mathcal{G}$.
Parameters: Rule lengths $L \subseteq \mathbb{N}$, number of temporal random walks $n$, transition distribution $d$.
Output: Temporal logical rules $\mathcal{TR}$.

1:  for relation $r_h \in \mathcal{R}$ do
2:     for rule length $l \in L$ do
3:        $\mathcal{TR}_l(r_h) := \emptyset$
4:        for $i \in [n]$ do
5:           Sample uniformly a start edge with edge type $r_h$.  This edge serves as the rule head; see (4).
6:           for step $s \in \{2, \dots, l+1\}$ do
7:              Retrieve the adjacent edges of the current object node.
8:              if $s = 2$ then
9:                 Filter out all edges with timestamps greater than or equal to the current timestamp.
10:              else
11:                 Filter out all edges with timestamps greater than the current timestamp. Filter out the inverse edge of the previously sampled edge.
12:              if $s = l + 1$ then
13:                 Retrieve all filtered edges that connect the current object to the source of the walk.
14:              Sample the next edge from the filtered edges according to distribution $d$. Break if there are no feasible edges because of temporal or cyclic constraints.
15:           Transform walk $W$ to the corresponding temporal logical rule $R$.  See (5). Save information about the head relation and body relations. Define variable constraints for recurring entities.
16:           Estimate the confidence of rule $R$. Sample body groundings, where for each step $s$, the edges are filtered for the correct body relation as well as for the timestamps required to fulfill the temporal constraints. For successful body groundings, check the variable constraints. For each unique body, check if the rule head exists in the graph. Calculate rule support / body support.
17:           $\mathcal{TR}_l(r_h) := \mathcal{TR}_l(r_h) \cup \{R\}$
18:     $\mathcal{TR}(r_h) := \bigcup_{l \in L} \mathcal{TR}_l(r_h)$
19:  $\mathcal{TR} := \bigcup_{r_h \in \mathcal{R}} \mathcal{TR}(r_h)$
20:  return $\mathcal{TR}$
Algorithm 3 Rule learning (detailed)

Input: Test query $q = (e_q, r_q, ?, t_q)$, temporal logical rules $\mathcal{TR}$, temporal knowledge graph $\mathcal{G}$.
Parameters: Time window $w$, minimum number of candidates $k$, score function $f$.
Output: Answer candidates $\mathcal{C}$.

1:  $\mathcal{C} := \emptyset$.  Apply the rules in $\mathcal{TR}(r_q)$ by decreasing confidence.
2:  Retrieve the subgraph $\mathcal{SG}(w)$ with timestamps $t \in [t_q - w, t_q)$ and store the edges for each relation in a dictionary.  Only done if the timestamp changes; the queries in the test set are sorted by timestamp.
3:  if $\mathcal{TR}(r_q) \neq \emptyset$ then
4:     for rule $R \in \mathcal{TR}(r_q)$ do
5:        Find all body groundings of $R$ in $\mathcal{SG}(w)$. Retrieve the edges that could constitute walks that match the rule's body: first, retrieve the edges whose subject matches $e_q$ and whose relation matches the first relation in the rule body; then, retrieve the edges whose subject matches one of the current targets and whose relation matches the next relation in the rule body. Generate complete walks by merging the edges on the same target-source entity. Delete all walks that do not comply with the time constraints. Check the variable constraints, and delete the walks that do not comply with them.
6:        Retrieve candidates $\mathcal{C}(R)$ from the target entities of the walks.
7:        for $c \in \mathcal{C}(R)$ do
8:           Calculate score $f(R, c)$.  See (6).
9:           $\mathcal{C} := \mathcal{C} \cup \{(c, f(R, c))\}$
10:        if the number of distinct candidates in $\mathcal{C}$ is at least $k$ then
11:           break
12:  return $\mathcal{C}$
Algorithm 4 Rule application (detailed)