Knowledge graphs (KGs) structure factual information in the form of triples $(e_s, r, e_o)$, where $e_s$ and $e_o$ correspond to entities in the real world and $r$ to a binary relation, e.g., (Anna, born in, Paris). This knowledge representation leads to an interpretation as a directed multigraph, where entities are identified with nodes and relations with edge types. Each edge in the KG encodes an observed fact, where the source node corresponds to the subject entity, the target node to the object entity, and the edge type to the predicate of the factual statement.
Some real-world information also includes a temporal dimension, e.g., the event (Anna, born in, Paris) happened on a specific date. To model the large amount of available event data that induces complex interactions between entities over time, temporal knowledge graphs (tKGs) have been introduced. Temporal KGs extend the triples to quadruples $(e_s, r, e_o, t)$ to integrate a timestamp or time range $t$, which indicates the time validity of the static event $(e_s, r, e_o)$, e.g., (Angela Merkel, visit, China, 2014/07/04). Figure 1 visualizes a subgraph from the dataset ICEWS14 as an example of a tKG. In this work, we focus on tKGs where each edge is equipped with a single timestamp.
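For concreteness, a tKG of this form can be held in memory simply as a set of quadruples. The following Python sketch is illustrative only (the date attached to the Anna fact is fabricated for the example and not taken from any dataset):

```python
from collections import namedtuple

# A quadruple (subject, relation, object, timestamp); names and dates are illustrative.
Quad = namedtuple("Quad", ["subj", "rel", "obj", "ts"])

tkg = {
    Quad("Angela Merkel", "visit", "China", "2014/07/04"),
    Quad("Angela Merkel", "consult", "Barack Obama", "2014/08/09"),
    Quad("Anna", "born in", "Paris", "1990/01/01"),  # fabricated example date
}

# All events with a given subject entity, sorted by timestamp:
merkel_events = sorted((q for q in tkg if q.subj == "Angela Merkel"),
                       key=lambda q: q.ts)
```

The zero-padded `YYYY/MM/DD` format sorts correctly as a string, which keeps the sketch simple; a real implementation would typically map timestamps to integers.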
One of the common tasks on KGs is link prediction, which finds application in areas such as recommender systems nectr.hildebrandt.2019, knowledge base completion convkb.nguyen.2018, and drug repurposing polo.liu.2021. Taking the additional temporal dimension into account, it is of special interest to forecast events for future timestamps based on past information. Notable real-world applications that rely on accurate event forecasting include clinical decision support, supply chain management, and extreme events modeling. In this work, we address link forecasting on tKGs, where we consider queries $(e_s, r, ?, t)$ for a timestamp $t$ that has not been seen during training.
Several embedding-based methods have been introduced for tKGs to solve link prediction and forecasting (link prediction for future timestamps), e.g., TTransE ttranse.leblay.2018, TNTComplEx tntcomplex.lacroix.2020, and RE-Net renet.jin.2019. The underlying principle is to project the entities and relations into a low-dimensional vector space while preserving the topology and temporal dynamics of the tKG. These methods can learn the complex patterns that lead to an event but often lack transparency and interpretability.
To increase the transparency and trustworthiness of the solutions, human-understandable explanations are necessary, which can be provided by logical rules. However, the manual creation of rules is often difficult due to the complex nature of events. Domain experts cannot articulate the conditions for the occurrence of an event sufficiently formally to express this knowledge as rules, a problem known as the knowledge acquisition bottleneck. Generally, symbolic methods that make use of logical rules tend to suffer from scalability issues, which makes them impractical for application to large real-world datasets.
We propose TLogic that automatically mines cyclic temporal logical rules by extracting temporal random walks from the graph. We achieve both high predictive performance and time-consistent explanations in the form of temporal rules, which conform to the observation that the occurrence of an event is usually triggered by previous events. The main contributions of this work are summarized as follows:
We introduce TLogic, a novel symbolic framework based on temporal random walks in temporal knowledge graphs. It is the first approach that directly learns temporal logical rules from tKGs and applies these rules to the link forecasting task.
Our approach provides explicit and human-readable explanations in the form of temporal logical rules and is scalable to large datasets.
We conduct experiments on three benchmark datasets (ICEWS14, ICEWS18, and ICEWS0515) and show better overall performance compared with state-of-the-art baselines.
We demonstrate the effectiveness of our method in the inductive setting where our learned rules are transferred to a related dataset with a common vocabulary.
Subsymbolic machine learning methods, e.g., embedding-based algorithms, have achieved success for the link prediction task on static KGs. Well-known methods include RESCAL rescal.nickel.2011, TransE transe.bordes.2013, DistMult distmult.yang.2015, and ComplEx complex.trouillon.2016 as well as the graph convolutional approaches R-GCN rgcn.schlichtkrull.2018 and CompGCN compgcn.vashishth.2020. Several approaches have been recently proposed to handle tKGs, such as TTransE ttranse.leblay.2018, TA-DistMult ta-distmult-transe.garcia-duran.2018, DE-SimplE de-simple.goel.2020, TNTComplEx tntcomplex.lacroix.2020, CyGNet cygnet.zhu.2021, RE-Net renet.jin.2019, and xERTE xerte.han.2021. The main idea of these methods is to explicitly learn embeddings for timestamps or to integrate temporal information into the entity or relation embeddings. However, the black-box property of embeddings makes it difficult for humans to understand the predictions. Moreover, approaches with shallow embeddings are not suitable for an inductive setting with previously unseen entities, relations, or timestamps. From the above methods, only CyGNet, RE-Net, and xERTE are designed for the forecasting task. xERTE is also able to provide explanations by extracting relevant subgraphs around the query subject.
Symbolic approaches for link prediction on KGs like AMIE+ amie+.galarraga.2015 and AnyBURL anyburl.meilicke.2019 mine logical rules from the dataset, which are then applied to predict new links. StreamLearner streamlearner.omran.2019 is one of the first methods for learning temporal rules. It employs a static rule learner to generate rules, which are then generalized to the temporal domain. However, StreamLearner only considers a rather restricted set of temporal rules, in which all body atoms have the same timestamp.
Another class of approaches is based on random walks in the graph, where the walks can support an interpretable explanation for the predictions. For example, AnyBURL samples random walks for generating rules. The methods dynnode2vec dynnode2vec.mahdavi.2018 and change2vec change2vec.bian.2019 alternately extract random walks on tKG snapshots and learn parameters for node embeddings, but they do not capture temporal patterns within the random walks. The approach of ctdne.nguyen.2018 extends the concept of random walks to temporal random walks on continuous-time dynamic networks for learning node embeddings, where the sequence of edges in the walk only moves forward in time.
Temporal knowledge graph
Let $\mathcal{E}$ denote a set of entities, $\mathcal{R}$ a set of relations, and $\mathcal{T}$ a set of timestamps.
A temporal knowledge graph (tKG) $\mathcal{G}$ is a collection of facts $\mathcal{G} \subset \mathcal{E} \times \mathcal{R} \times \mathcal{E} \times \mathcal{T}$, where each fact is represented by a quadruple $(e_s, r, e_o, t)$. The quadruple is also called link or edge, and it indicates a connection between the subject entity $e_s \in \mathcal{E}$ and the object entity $e_o \in \mathcal{E}$ via the relation $r \in \mathcal{R}$. The timestamp $t \in \mathcal{T}$ implies the occurrence of the event $(e_s, r, e_o)$ at time $t$, where $t$ can be measured in units such as hour, day, and year.
For two timestamps $t_1$ and $t_2$, we denote the fact that $t_1$ occurs earlier than $t_2$ by $t_1 < t_2$. If additionally $t_1$ could represent the same time as $t_2$, we write $t_1 \leq t_2$.
We define for each edge $(e_s, r, e_o, t)$ an inverse edge $(e_o, r^{-1}, e_s, t)$ that interchanges the positions of the subject and object entity to allow the random walker to move along the edge in both directions. The relation $r^{-1}$ is called the inverse relation of $r$.
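A minimal sketch of this augmentation, assuming edges are plain (subject, relation, object, timestamp) tuples; the `_inv` naming scheme for inverse relations is an assumption for illustration:

```python
def add_inverse_edges(edges):
    """Return the edge set augmented with an inverse edge (e_o, r^-1, e_s, t)
    for every edge (e_s, r, e_o, t), so a walker can traverse both directions."""
    augmented = set(edges)
    for subj, rel, obj, ts in edges:
        augmented.add((obj, rel + "_inv", subj, ts))
    return augmented

edges = {("Anna", "born in", "Paris", 1990)}
augmented = add_inverse_edges(edges)
# augmented now also contains ("Paris", "born in_inv", "Anna", 1990)
```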
The goal of the link forecasting task is to predict new links for future timestamps. Given a query $(e_s, r, ?, t)$ with a previously unseen timestamp $t$, we want to identify a ranked list of object candidates that are most likely to complete the query. For subject prediction, we formulate the query as $(e_o, r^{-1}, ?, t)$.
Temporal random walk
A non-increasing temporal random walk $W$ of length $l \in \mathbb{N}$ from entity $e_{l+1}$ to entity $e_1$ in the tKG $\mathcal{G}$ is defined as a sequence of edges
$$W := ((e_{l+1}, r_l, e_l, t_l), (e_l, r_{l-1}, e_{l-1}, t_{l-1}), \dots, (e_2, r_1, e_1, t_1)),$$
where $t_{i+1} \geq t_i$ for $i \in \{1, \dots, l-1\}$.
A non-increasing temporal random walk complies with time constraints so that the edges are traversed only backward in time, where it is also possible to walk along edges with the same timestamp.
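A backward-in-time walk of this kind can be sketched as follows, assuming the same tuple-based edge representation; the adjacency index and dead-end handling are illustrative choices, not the paper's implementation:

```python
import random

def temporal_walk(edges, start_entity, start_ts, length, rng=None):
    """Sample a walk of the given length whose edges only move backward
    (or stay) in time, starting from start_entity at time start_ts."""
    rng = rng or random.Random(0)
    by_subj = {}
    for e in edges:
        by_subj.setdefault(e[0], []).append(e)
    walk, current, t = [], start_entity, start_ts
    for _ in range(length):
        feasible = [e for e in by_subj.get(current, []) if e[3] <= t]
        if not feasible:
            return None  # dead end: no adjacent edge respects the time constraint
        edge = rng.choice(feasible)
        walk.append(edge)
        current, t = edge[2], edge[3]  # move to the object, tighten the time bound
    return walk
```

Because the time bound only tightens, every returned walk has non-increasing timestamps by construction.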
Temporal logical rule
We formulate temporal logical rules as first-order Horn clauses. Let $E_i$ and $T_i$ for $i \in \{1, \dots, l+1\}$ be variables that represent entities and timestamps, respectively. Further, let $r_h, r_1, \dots, r_l \in \mathcal{R}$ be fixed.
A cyclic temporal logical rule $R$ of length $l$ is defined as
$$\bigl((E_1, r_h, E_{l+1}, T_{l+1}) \leftarrow \textstyle\bigwedge_{i=1}^{l} (E_i, r_i, E_{i+1}, T_i)\bigr) \quad (1)$$
with the temporal constraints
$$T_1 \leq T_2 \leq \dots \leq T_l < T_{l+1}. \quad (2)$$
The left-hand side of $R$ is called the rule head, with $r_h$ being the head relation, while the right-hand side is called the rule body, which is represented as a conjunction of body atoms $(E_i, r_i, E_{i+1}, T_i)$. The rule is called cyclic because the rule head and the rule body constitute two different walks connecting the same two variables $E_1$ and $E_{l+1}$. A temporal rule implies that if the rule body holds with the temporal constraints given by (2), then the rule head is true as well for a future timestamp $T_{l+1}$.
The replacement of the variables $E_i$ and $T_i$ by constant terms is called grounding or instantiation. For example, a grounding of the temporal rule
$$(E_1, \text{consult}, E_2, T_2) \leftarrow (E_1, \text{discuss by telephone}, E_2, T_1)$$
is given by the edges (Angela Merkel, discuss by telephone, Barack Obama, 2014/07/22) and (Angela Merkel, consult, Barack Obama, 2014/08/09) in Figure 1. Let rule grounding refer to the replacement of the variables in the entire rule and body grounding refer to the replacement of the variables only in the body, where all groundings must comply with the temporal constraints in (2).
In many domains, logical rules are frequently violated, so confidence values are determined to estimate the probability of a rule's correctness. We adapt the standard confidence to take timestamp values into account. Let $r_h, r_1, \dots, r_l$ be the relations in a rule $R$. The body support is defined as the number of body groundings, i.e., the number of tuples $(e_1, \dots, e_{l+1}, t_1, \dots, t_l)$ such that $(e_i, r_i, e_{i+1}, t_i) \in \mathcal{G}$ for all $i \in \{1, \dots, l\}$ and $t_i \leq t_{i+1}$ for $i \in \{1, \dots, l-1\}$. The rule support is defined as the number of rule groundings, i.e., the number of tuples $(e_1, \dots, e_{l+1}, t_1, \dots, t_{l+1})$ with $(e_1, r_h, e_{l+1}, t_{l+1}) \in \mathcal{G}$ and $t_{l+1} > t_l$ such that $(e_i, r_i, e_{i+1}, t_i) \in \mathcal{G}$ for all $i \in \{1, \dots, l\}$ and $t_i \leq t_{i+1}$ for $i \in \{1, \dots, l-1\}$. The confidence of the rule $R$, denoted by $\mathrm{conf}(R)$, can then be obtained by dividing the rule support by the body support.
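For a rule of length 1, these counts can be sketched by exhaustive enumeration (an illustration only; the framework described later estimates the counts by sampling):

```python
def confidence_length1(edges, body_rel, head_rel):
    """Confidence of the rule (X, head_rel, Y, T2) <- (X, body_rel, Y, T1), T1 < T2:
    rule support divided by body support."""
    body_groundings = [(s, o, t) for s, r, o, t in edges if r == body_rel]
    body_support = len(body_groundings)
    rule_support = sum(
        1
        for s, o, t in body_groundings
        if any(r2 == head_rel and s2 == s and o2 == o and t2 > t
               for s2, r2, o2, t2 in edges)
    )
    return rule_support / body_support if body_support else 0.0
```

For example, with one "phone" edge that is later followed by a "consult" edge between the same entities and one that is not, the rule (X, consult, Y, T2) ← (X, phone, Y, T1) has confidence 1/2.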
We introduce TLogic, a rule-based link forecasting framework for tKGs. TLogic first extracts temporal walks from the graph and then lifts these walks to a more abstract, semantic level to obtain temporal rules that generalize to new data. The application of these rules generates answer candidates, for which the body groundings in the graph serve as explicit and human-readable explanations. Our framework consists of the components rule learning, rule application, and evaluation. The pseudocode for rule learning is shown in Algorithm 1 and for rule application in Algorithm 2.
As the first step of rule learning, temporal walks are extracted from the tKG $\mathcal{G}$. For a rule of length $l$, a walk of length $l+1$ is sampled, where the additional step corresponds to the rule head.
Let $r_h \in \mathcal{R}$ be a fixed relation for which we want to learn rules. For the first sampling step $m = 1$, we sample an edge $(e_1, r_h, e_{l+1}, t_{l+1})$, which will serve as the rule head, uniformly from all edges with relation type $r_h$. A temporal random walker then samples iteratively edges adjacent to the current object until a walk of length $l+1$ is obtained. For sampling step $m \in \{2, \dots, l+1\}$, let $(e_s, r, e_o, t)$ denote the previously sampled edge. The next edge is drawn from the set of feasible transitions
$$\mathcal{A}(m) = \begin{cases} \{(e_o, \tilde{r}, \tilde{e}, \tilde{t}) \in \mathcal{G} \mid \tilde{t} < t\} & \text{if } m = 2, \\ \{(e_o, \tilde{r}, \tilde{e}, \tilde{t}) \in \mathcal{G} \mid \tilde{t} \leq t\} \setminus \{(e_o, r^{-1}, e_s, t)\} & \text{if } m \in \{3, \dots, l+1\}, \end{cases}$$
where excluding the inverse edge $(e_o, r^{-1}, e_s, t)$ avoids redundant rules. For obtaining cyclic walks, we sample in the last step $m = l+1$ an edge that connects the walk to the first entity $e_1$ if such edges exist. Otherwise, we sample the next walk.
The transition distribution for sampling the next edge can either be uniform or exponentially weighted. We define an index mapping $\hat{m} := l + 2 - m$ to be consistent with the indices in (1). Then, the exponentially weighted probability for choosing edge $u \in \mathcal{A}(m)$ for $m \in \{2, \dots, l+1\}$ is given by
$$\mathbb{P}(u) = \frac{\exp(t_u - t)}{\sum_{\hat{u} \in \mathcal{A}(m)} \exp(t_{\hat{u}} - t)}, \quad (3)$$
where $t_u$ denotes the timestamp of edge $u$ and $t$ the timestamp of the previously sampled edge. The exponential weighting favors edges with timestamps that are closer to the timestamp of the previous edge and probably more relevant for prediction.
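The weighted transition sampling can be sketched as follows, assuming numeric timestamps so that differences are well-defined:

```python
import math
import random

def sample_transition(feasible, t_prev, rng=None):
    """Sample the next edge from the feasible set, weighting each edge u by
    exp(t_u - t_prev); since t_u <= t_prev, edges closer in time dominate."""
    rng = rng or random.Random(0)
    weights = [math.exp(e[3] - t_prev) for e in feasible]  # unnormalized
    return rng.choices(feasible, weights=weights, k=1)[0]
```

With a previous timestamp of 10, an edge at time 9 is roughly $e^8 \approx 3000$ times more likely to be chosen than one at time 1.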
The resulting temporal walk $W$ is given by
$$W = ((e_1, r_h, e_{l+1}, t_{l+1}), (e_{l+1}, r_l, e_l, t_l), \dots, (e_2, r_1, e_1, t_1)). \quad (4)$$
$W$ can then be transformed to a temporal rule by replacing the entities and timestamps with variables. While the first edge in $W$ becomes the rule head $(E_1, r_h, E_{l+1}, T_{l+1})$, the other edges are mapped to body atoms, where each edge $(e_{i+1}, r_i, e_i, t_i)$ is converted to the body atom $(E_i, r_i^{-1}, E_{i+1}, T_i)$. The final rule is denoted by
$$\bigl((E_1, r_h, E_{l+1}, T_{l+1}) \leftarrow \textstyle\bigwedge_{i=1}^{l} (E_i, r_i^{-1}, E_{i+1}, T_i)\bigr). \quad (5)$$
In addition, we impose the temporal consistency constraints $T_1 \leq T_2 \leq \dots \leq T_l < T_{l+1}$ from (2).
The entities in $W$ do not need to be distinct since a pair of entities can have many interactions at different points in time. For example, Angela Merkel made several visits to China in 2014, which could constitute important information for prediction. Repetitive occurrences of the same entity in $W$ are replaced with the same variable to maintain this knowledge.
For the confidence estimation of $R$, we sample from the graph a fixed number of body groundings, which have to match the body relations and the variable constraints mentioned in the last paragraph while satisfying the temporal constraints from (2). The number of unique bodies serves as the body support. The rule support is determined by counting the number of bodies for which an edge with relation type $r_h$ exists that connects $e_1$ and $e_{l+1}$ from the body. Moreover, the timestamp of this edge has to be greater than all body timestamps to fulfill (2).
For every relation $r \in \mathcal{R}$, we sample temporal walks for a set of prespecified lengths. The set $\mathcal{TR}_r^l$ stands for all rules of length $l$ with head relation $r$ together with their corresponding confidences. All rules for relation $r$ are included in $\mathcal{TR}_r := \bigcup_l \mathcal{TR}_r^l$, and the complete set of learned temporal rules is given by $\mathcal{TR} := \bigcup_{r \in \mathcal{R}} \mathcal{TR}_r$.
It is possible to learn rules only for a single relation or a set of specific relations of interest. Explicitly learning rules for all relations is especially effective for rare relations that would otherwise only be sampled with a small probability. The learned rules are not specific to the graph from which they have been extracted but can also be employed in an inductive setting, where the rules are transferred for straightforward application to related datasets that share a common vocabulary.
The learned temporal rules are applied to answer queries of the form $(e_s, r, ?, t_q)$, where $t_q$ denotes the query timestamp. The answer candidates are retrieved from the target entities of body groundings in the tKG $\mathcal{G}$. If there exist no rules for the query relation $r$, or if there are no matching body groundings in the graph, then no answers are predicted for the given query.
To apply the rules on relevant data, a subgraph dependent on a time window $w$ is retrieved. For $w > 0$, the subgraph contains all edges from $\mathcal{G}$ that have timestamps $t \in [t_q - w, t_q)$. If $w = \infty$, then all edges with timestamps prior to the query timestamp are used for rule application, i.e., the subgraph consists of all facts with $t \in [t_{\min}, t_q)$, where $t_{\min}$ is the minimum timestamp in the graph $\mathcal{G}$.
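A sketch of this time-window filtering, using `None` to represent the unbounded window:

```python
def window_subgraph(edges, t_query, window=None):
    """Keep edges with timestamps in [t_query - window, t_query);
    window=None keeps all edges strictly before the query timestamp."""
    lower = t_query - window if window is not None else float("-inf")
    return [e for e in edges if lower <= e[3] < t_query]
```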
We apply the rules in order of decreasing confidence, where each rule generates a set of answer candidates. Each candidate $c$ is then scored by a function $f$ that reflects the probability of the candidate being the correct answer to the query.
Let $\mathcal{B}(R, c)$ be the set of body groundings of rule $R$ that start at entity $e_s$ and end at entity $c$. We choose as score function a convex combination of the rule's confidence and a function that takes the time difference $t_q - t_1(\mathcal{B}(R, c))$ as input, where $t_1(\mathcal{B}(R, c))$ denotes the earliest timestamp in the body and $t_q$ the query timestamp. If several body groundings exist, we take from all possible values of $t_1(\mathcal{B}(R, c))$ the one that is closest to $t_q$. For candidate $c$, the score function is defined as
$$f(R, c) = a \cdot \mathrm{conf}(R) + (1 - a) \cdot \exp\bigl(-\lambda (t_q - t_1(\mathcal{B}(R, c)))\bigr) \quad (6)$$
with $a \in [0, 1]$ and $\lambda > 0$.
The intuition for this choice of score function is that candidates generated by high-confidence rules should receive a higher score. Adding a dependency on the timeframe of the rule grounding is based on the observation that the existence of edges in a rule grounding becomes increasingly probable with decreasing time difference between the edges. We choose the exponential distribution since it is commonly used to model the interarrival times of events. The time difference $t_q - t_1(\mathcal{B}(R, c))$ is always non-negative for a future timestamp $t_q$, and with the assumption that there exists a fixed mean, the exponential distribution is also the maximum entropy distribution for such a time difference variable. The exponential distribution is rescaled so that both summands are in the range $[0, 1]$.
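A direct transcription of the score function (6); the default values of `a` and `lam` are placeholders for illustration rather than tuned settings:

```python
import math

def candidate_score(rule_conf, t_query, t_body_min, a=0.5, lam=0.1):
    """Score (6): convex combination of the rule's confidence and an
    exponentially decaying function of the time difference t_query - t_body_min."""
    time_diff = t_query - t_body_min  # non-negative for future query timestamps
    return a * rule_conf + (1 - a) * math.exp(-lam * time_diff)
```

Both summands lie in $[0, 1]$, so the score does as well, and it decreases as the body grounding moves further into the past.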
All candidates are saved together with their scores in the candidate set $\mathcal{C}$. We stop the rule application when the number of different answer candidates is at least a prespecified number $k$ so that there is no need to go through all rules.
For evaluation of the results, all scores of each candidate $c$ are aggregated through a noisy-OR calculation, which produces the final score
$$\mathrm{score}(c) = 1 - \prod_{s \in \mathbb{S}(c)} (1 - s), \quad (7)$$
where $\mathbb{S}(c)$ denotes the set of scores of the candidate $c$.
The idea is to aggregate the scores to produce a probability, where candidates implied by more rules should have a higher score.
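The noisy-OR aggregation (7) can be sketched as:

```python
def noisy_or(scores):
    """Combine all scores of a candidate into one final score;
    additional supporting rules can only increase the result."""
    product = 1.0
    for s in scores:
        product *= 1.0 - s
    return 1.0 - product
```

For instance, two rules that each assign a score of 0.5 yield a combined score of 0.75, higher than either rule alone.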
In case there are no rules for the query relation $r$, or if there are no matching body groundings in the graph, it might still be interesting to retrieve possible answer candidates. In the experiments, we apply a simple baseline where the scores for the candidates are obtained from the overall object distribution in the training data if $r$ is a new relation. If $r$ already exists in the training set, we take the object distribution of the edges with relation type $r$.
We conduct experiments on the dataset Integrated Crisis Early Warning System (ICEWS; https://dataverse.harvard.edu/dataverse/icews), which contains information about international events and is a commonly used benchmark for link prediction on tKGs. We choose the subsets ICEWS14, ICEWS18, and ICEWS0515, which include data from the years 2014, 2018, and 2005 to 2015, respectively. Since we consider link forecasting, each dataset is split into training, validation, and test set so that the timestamps in the training set occur earlier than the timestamps in the validation set, which again occur earlier than the timestamps in the test set. To ensure a fair comparison, we use the split provided by xerte.han.2021 (https://github.com/TemporalKGTeam/xERTE). The statistics of the datasets are summarized in the appendix.
For each test instance $(e_s, r, e_o, t)$, we generate a list of candidates for both object prediction $(e_s, r, ?, t)$ and subject prediction $(e_o, r^{-1}, ?, t)$. The candidates are ranked by decreasing scores, which are calculated according to (7).
The confidence for each rule is estimated by sampling a fixed number of body groundings and counting the number of times the rule head holds. We learn rules of the lengths 1, 2, and 3, and for application, we only consider rules whose confidence and body support exceed prespecified minimum thresholds.
We compute the mean reciprocal rank (MRR) and hits@$k$ for $k \in \{1, 3, 10\}$, which are standard metrics for link prediction on KGs. For a rank $x \in \mathbb{N}$, the reciprocal rank is defined as $1/x$, and the MRR is the average of the reciprocal ranks of the correct query answers across all queries. The metric hits@$k$ (h@$k$) indicates the proportion of queries for which the correct entity appears among the top $k$ candidates.
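Both metrics can be computed from the ranks of the correct answers alone, as in this sketch (assuming one correct answer per query):

```python
def mrr(ranks):
    """Mean reciprocal rank over the ranks of the correct answers (rank 1 = best)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Proportion of queries whose correct answer appears among the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)
```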
Similar to xerte.han.2021, we perform time-aware filtering where all correct entities at the query timestamp except for the true query object are filtered out from the answers. In comparison to the alternative setting that filters out all other objects that appear together with the query subject and relation at any timestamp, time-aware filtering yields a more realistic performance estimate.
We compare TLogic (code available at https://github.com/liu-yushan/TLogic) with the state-of-the-art baselines for static link prediction DistMult distmult.yang.2015, ComplEx complex.trouillon.2016, and AnyBURL anyburl.meilicke.2019; anyburl.meilicke.2020 as well as the baselines for temporal link prediction TTransE ttranse.leblay.2018, TA-DistMult ta-distmult-transe.garcia-duran.2018, DE-SimplE de-simple.goel.2020, TNTComplEx tntcomplex.lacroix.2020, CyGNet cygnet.zhu.2021, RE-Net renet.jin.2019, and xERTE xerte.han.2021. All baseline results except for AnyBURL are taken from xerte.han.2021. AnyBURL samples paths based on reinforcement learning and generalizes them to rules, where the rule space also includes, e.g., acyclic rules and rules with constants. A non-temporal variant of TLogic would sample paths randomly and only learn cyclic rules, which would presumably yield worse performance than AnyBURL. Therefore, we choose AnyBURL as a baseline to assess the effectiveness of adding temporal constraints.
The results of the experiments are displayed in Table 1. TLogic outperforms all baseline methods with respect to the metrics MRR, hits@3, and hits@10. Only xERTE performs better than TLogic for hits@1 on the datasets ICEWS18 and ICEWS0515.
Besides a list of possible answer candidates with corresponding scores, TLogic can also provide temporal rules and body groundings in the form of walks from the graph that support the predictions. Table 2 presents three exemplary rules with high confidences that were learned from ICEWS14. For the query (Angela Merkel, consult, ?, 2014/08/09), two walks are shown in Table 2, which serve as time-consistent explanations for the correct answer Barack Obama.
0.570: (Merkel, consult, Obama, 14/08/09) ← (Merkel, discuss by telephone, Obama, 14/07/22)
0.500: (Merkel, consult, Obama, 14/08/09) ← (Merkel, express intent to meet, Obama, 14/05/02)
One advantage of our learned logical rules is that they are applicable to any new dataset as long as the new dataset covers common relations. This might be relevant for cases where new entities appear. For example, Donald Trump, who served as president of the United States from 2017 to 2021, is included in the dataset ICEWS18 but not in ICEWS14. The logical rules are not tied to particular entities and would still be applicable, while embedding-based methods have difficulties operating in this challenging setting. The models would need to be retrained to obtain embeddings for the new entities, where existing embeddings might also need to be adapted to the different time range.
For the two rule-based methods AnyBURL and TLogic, we apply the rules learned on the training set of ICEWS0515 (with timestamps from 2005/01/01 to 2012/08/06) to the test set of ICEWS14 as well as the rules learned on the training set of ICEWS14 to the test set of ICEWS18 (see Table 3). The performance of TLogic in the inductive setting is close to the results in Table 1 for all metrics, while for AnyBURL, especially the results on ICEWS18 drop significantly. It seems that the temporal information encoded in TLogic is essential for achieving correct predictions in the inductive setting. ICEWS14 has only 7,128 entities, while ICEWS18 contains 23,033 entities. The results confirm that temporal rules from TLogic can even be transferred to a dataset with a large number of new entities and timestamps and still lead to strong performance.
The results in this section are obtained on the dataset ICEWS14, but the findings are similar for the other two datasets. More detailed results can be found in the appendix.
Number of walks
Figure 2 shows the MRR performance on the validation set of ICEWS14 for different numbers of walks that were extracted during rule learning. We observe a performance increase with a growing number of walks. However, the performance gains saturate between 100 and 200 walks, where only small additional improvements are attainable.
We test two transition distributions for the extraction of temporal walks: uniform and exponentially weighted according to (3). The rationale behind using an exponentially weighted distribution is the observation that related events tend to happen within a short timeframe. The distribution of the first edge is always uniform to not restrict the variety of obtained walks. Overall, the performance of the exponential distribution consistently exceeds the uniform setting with respect to the MRR (see Figure 2).
We observe that the exponential distribution leads to more rules of length 3 than the uniform setting (11,718 compared to 8,550 rules for 200 walks), while it is the opposite for rules of length 1 (7,858 compared to 11,019 rules). The exponential setting leads to more successful longer walks because the timestamp differences between subsequent edges tend to be smaller. It is less likely that there are no feasible transitions anymore because of temporal constraints. The uniform setting, however, leads to a better exploration of the neighborhood around the start node for shorter walks.
We learn rules of lengths 1, 2, and 3. Using all rules for application results in the best performance (MRR on the validation set: 0.4373), followed by rules of only length 1 (0.4116), 3 (0.4097), and 2 (0.1563). The reason why rules of length 3 perform better than length 2 is that the temporal walks are allowed to transition back and forth between the same entities. Since we only learn cyclic rules, a rule body of length 2 must constitute a path with no recurring entities, resulting in fewer rules and rule groundings in the graph. Interestingly, simple rules of length 1 already yield very good performance.
For rule application, we define a time window $w$ for retrieving the relevant data. The performance increases with the size of the time window, even though relevant events tend to be close to the query timestamp. The second summand of the score function in (6) takes the time difference between the query timestamp and the earliest body timestamp into account, so earlier events with a large timestamp difference receive a smaller weight, while generally, as much information as possible is beneficial for prediction.
We define the score function in (6) as a convex combination of the rule's confidence and a function that depends on the time difference between the query timestamp and the earliest body timestamp. The performance of only using the confidence (MRR: 0.3869) or only using the exponential function (0.4077) is worse than the combination (0.4373), which means that both the information from the rules' confidences and the time differences are important for prediction.
The variance in the performance due to different rules obtained from the rule learning component is quite small. Running the same model with the best hyperparameter settings for five different seeds results in a standard deviation of 0.0012 for the MRR. The rule application component is deterministic and always leads to the same candidates with corresponding scores for the same hyperparameter setting.
Training and inference time
The worst-case time complexity for learning rules of length $l$ is determined by the number of walks, the maximum node degree in the training set, and the number of body samples for estimating the confidence. The worst-case time complexity for inference is determined by the maximum rule length in $\mathcal{TR}$ and the minimum number of candidates $k$. For large graphs with high node degrees, it is possible to reduce the complexity by keeping only a fixed maximum number of candidate walks during rule application.
Both training and application can be parallelized since the rule learning for each relation and the rule application for each test query are independent. Rule learning with 200 walks and the exponentially weighted transition distribution for rule lengths 1, 2, and 3 on a machine with 8 CPUs takes 180 sec for ICEWS14, while the application on the validation set takes 2,000 sec with the best hyperparameter settings. For comparison, the best-performing baseline xERTE already needs 5,000 sec for training one epoch on the same machine, where an MRR of 0.3953 can be obtained, while testing on the validation set takes 700 sec.
We have proposed TLogic, the first symbolic framework that directly learns temporal logical rules from temporal knowledge graphs and applies these rules for link forecasting. The framework generates answers by applying rules to observed events prior to the query timestamp and scores the answer candidates depending on the rules’ confidences and time differences. Experiments on three datasets indicate that TLogic achieves superior overall performance compared to state-of-the-art baselines. In addition, our approach also provides time-consistent, explicit, and human-readable explanations for the predictions in the form of temporal logical rules.
As future work, it would be interesting to integrate acyclic rules, which could also contain relevant information and might boost the performance for rules of length 2. Furthermore, the simple sampling mechanism for temporal walks could be replaced by a more sophisticated approach, which is able to effectively identify the most promising walks.
This work has been supported by the German Federal Ministry for Economic Affairs and Energy (BMWi) as part of the project RAKI under Grant No. 01MD19012C and by the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A. The authors of this work take full responsibility for its content.
Appendix A Appendix
Table 4 shows the statistics of the three datasets ICEWS14, ICEWS18, and ICEWS0515, where $|S|$ denotes the cardinality of a set $S$.
All experiments were conducted on a Linux machine with 16 CPU cores and 32 GB RAM. The set of tested hyperparameter ranges and best parameter values for TLogic are displayed in Table 5. Due to memory constraints, the time window for ICEWS18 is set to 200 and for ICEWS0515 to 1000. The best hyperparameter values are chosen based on the MRR on the validation set. Due to the small variance of our approach, the shown results are based on one algorithm run. A random seed of 12 is fixed for the rule learning component to obtain reproducible results.
Number of walks: 200
$a$ (score function): 0.5
$\lambda$ (score function): 0.1
All results in the appendix refer to the validation set of ICEWS14. However, the observations are similar for the test set and the other two datasets. All experiments use the best set of hyperparameters, where only the analyzed parameters are modified.
Object distribution baseline
We apply a simple object distribution baseline when there are no rules for the query relation or no matching body groundings in the graph. This baseline is only added for completeness and does not improve the results in a significant way.
The proportion of cases where there are no rules for the test query relation is 15/26,444 = 0.00056 for ICEWS14, 21/99,090 = 0.00021 for ICEWS18, and 9/138,294 = 0.00007 for ICEWS0515. The proportion of cases where there are no matching body groundings is 880/26,444 = 0.0333 for ICEWS14, 2,535/99,090 = 0.0256 for ICEWS18, and 2,375/138,294 = 0.0172 for ICEWS0515.
Number of walks and transition distribution
Table 6 shows the results for different choices of numbers of walks and transition distributions. The performance for all metrics increases with the number of walks. Exponentially weighted transition always outperforms uniform sampling.
Table 7 indicates that using rules of all lengths for application results in the best performance. Learning only cyclic rules probably makes it more difficult to find rules of length 2, where the rule body must constitute a path with no recurring entities, leading to fewer rules and body groundings in the graph.
Generally, the larger the time window, the better the performance (see Table 8). If taking all previous timestamps leads to a too high memory usage, the time window should be decreased.
Table 9 shows in the first row the results if only the rules' confidences are used for scoring ($a = 1$), in the second row the results if only the exponential component is used ($a = 0$), and in the last row the results for the combined score function with the best hyperparameter values for $a$ and $\lambda$. The combination yields the best overall performance. The optimal balance between the two terms, however, depends on the application and metric prioritization.
Comparing the number of rules learned under the two transition distributions, the total number of learned rules is similar for the uniform and the exponential distribution, but there is a large difference for rules of lengths 1 and 3. The exponential distribution leads to more successful longer walks and thus more long rules, while the uniform distribution leads to a better exploration of the neighborhood around the start node for shorter walks.
Training and inference time
The worst-case time complexity for learning rules of length $l$ is determined by the number of walks, the maximum node degree in the training set, and the number of body samples for estimating the confidence. The worst-case time complexity for inference is determined by the maximum rule length in $\mathcal{TR}$ and the minimum number of candidates $k$. More detailed steps of the algorithms for understanding these complexity estimations are given by Algorithm 3 and Algorithm 4.