There are many real-world scenarios in which a time series can be complemented effectively with external knowledge. One such scenario is trade between countries, represented as a temporally evolving knowledge graph. Trade between countries affects the corresponding currency exchange rates. Using information about trade, we want to predict the currency exchange rate more accurately.
Time series forecasting deals with predicting the data points of a sequence at a future timestamp based on the data available up to the current timestamp. Methods such as the Auto-Regressive Integrated Moving Average (ARIMA) model and Kalman filtering [16, 15] are popular for predicting time series. Representation learning on graph-structured data [9, 25, 12, 27] is a widely researched field with considerable focus on temporally evolving graphs [24, 18, 13, 7]. The increasing amount of complex data that can be effectively represented using dynamic multi-relational graphs [17] has led to this increased focus on dynamic graph modeling. Several methods such as ConvE [5], RGCN [21], and DistMult [26] have shown strong results on modeling static, multi-relational graph data for link prediction. Other approaches attempt to model dynamic knowledge graphs by incorporating temporal information; these include Know-Evolve [23], HyTE [4], and TA-DistMult [6], among others.
The two aforementioned fields of time series prediction and representation learning on graphs have mostly remained separate in the machine learning community. Recent work [14] integrates the two fields by incorporating a static, uni-relational graph for traffic flow prediction. However, this method is limited to static graphs with a single relation. To date, no method has been proposed for integrating temporally evolving graphs and time series prediction. In this paper, we propose a new method for exploiting the information from dynamic graphs for time series prediction. We propose the use of a static learnable embedding to capture the spatial information from knowledge graphs and a dynamic embedding to capture the dynamics of the time series and the evolving graph.
We present the first solution to the problem of time-series prediction with temporal knowledge graphs (TKG). Since, to the best of our knowledge, no datasets currently exist which align with the problem statement, we prepared five suitable datasets through web scraping (we will release the datasets upon acceptance for future work) and evaluate our model on them. We show that our approach outperforms the current state-of-the-art methods for time series forecasting on all five datasets. Our approach can also forecast the time series any number of time steps ahead and does not require the graph structure at test time. We release the code of our model, DArtNet, for future research (https://github.com/INK-USC/DArtNet).
2 Related Work
We review work using static graphs for time series prediction and work on temporal knowledge graphs.
In addition to the general time-series prediction task, there have been some recent studies on the spatial-temporal forecasting problem. Diffusion Convolutional Recurrent Neural Network (DCRNN) [14] is a method which incorporates a static, uni-relational graph for time series (traffic flow) forecasting. Traffic flow is modeled as a diffusion process on a directed graph. The method makes use of bidirectional random walks on the graph to capture the spatial dependency and uses an encoder-decoder framework with scheduled sampling to incorporate the temporal dependence. However, this method cannot be extended to temporally evolving or multi-relational graphs. Relational Time Series Forecasting [19] also formulates the problem of using dynamic graphs for time series prediction, though it is not formulated for multi-relational data. Neural relational inference [11] looks at the inverse problem of predicting the dynamics of a graph from attribute information.
Temporal Knowledge Graph Reasoning and Link Prediction. There have been several attempts at reasoning on dynamically evolving graphs. HyTE [4] is a method for embedding knowledge graphs which views each timestamp in the graph data as a hyperplane. Each (head, relation, tail) triple at a particular timestamp is projected onto the corresponding hyperplane. The translational distance, as defined by the TransE model [1], of the projected embedding vectors is minimized. TA-DistMult [6] is a temporal-aware version of DistMult. For a quadruple, a predicate sequence is constructed from the relation and the timestamp and passed into a GRU. The last hidden state of the GRU is taken as the representation of the predicate sequence. Know-Evolve [23] models a relationship between two nodes as a multivariate point process. Learned entity embeddings are used to calculate the score for that relation, which is used to modulate the intensity function of the point process. ReNet [10] uses neighborhood aggregation and an RNN to capture the spatial and temporal information in the graph.
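To make the HyTE description above concrete, the following is a minimal sketch of projecting embeddings onto a timestamp hyperplane and measuring a TransE-style translational distance. The function names and the choice of the L2 norm are illustrative assumptions, and the hyperplane normal `w` is assumed to have unit length; this is not the reference implementation of either method.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def project_onto_hyperplane(e, w):
    """Project e onto the hyperplane with unit normal w: e - (w . e) w."""
    scale = dot(w, e)
    return [ei - scale * wi for ei, wi in zip(e, w)]

def translational_distance(h, r, t):
    """L2 norm of h + r - t (assumed norm; L1 is an equally common choice)."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

def hyte_style_score(h, r, t, w):
    """Distance between the projected head, relation and tail vectors."""
    return translational_distance(
        project_onto_hyperplane(h, w),
        project_onto_hyperplane(r, w),
        project_onto_hyperplane(t, w),
    )
```

A lower score means the triple fits the timestamp's hyperplane better; training would minimize it for observed triples.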
3 Problem Formulation
A Knowledge Graph is a multi-relational graph that can be represented in the form of triples $(s, r, o)$, where $s$ denotes the head, $o$ denotes the tail, and $r$ is the relation between the nodes $s$ and $o$. A TKG has a time dimension as well, and the graph can be represented as quadruples of the form $(s, r, o, t)$, where $t$ denotes the timestamp at which the relation $r$ exists between the nodes $s$ and $o$. We now introduce Dynamic Attributed Graphs, formalize our problem statement, and in later sections, present our model for making predictions on Dynamic Attributed Graphs.
Problem Definition. A Dynamic Attributed Graph (DAG) is a directed graph whose edges are multi-relational, with a timestamp associated with each edge (known as an event) and attributes associated with the nodes at that particular time. An event in a dynamic graph is represented as a quadruple $(s, r, o, t)$, where $s$ is the head entity, $r$ is the relation, $o$ is the tail entity and $t$ is the timestamp of the event. In a dynamic attributed graph, an event is represented as a hextuple $(s, r, o, t, a_t^s, a_t^o)$, where $a_t^s$ and $a_t^o$ are the attributes associated with the head and tail at time $t$. The collection of all the events at time $t$ constitutes a dynamic graph $G_t$, where $G_t = \{(s, r, o, t, a_t^s, a_t^o)\}$. The goal of the DAG Prediction problem is to learn a representation of the dynamic graph events and predict the attributes at each node at future timestamps by learning a set of functions to predict the events for the next time step. Links are predicted jointly to aid the attribute prediction task. Formally, we want to learn a set of functions that map the graphs observed up to time $t$ to the events at time $t+1$.
We divide the dynamic graph at any time $t$ into two sets, one consisting of only the interaction quadruples $(s, r, o, t)$ and the other consisting of only the attribute values. Formally, $G_t = G_t^I \cup G_t^A$, where $G_t^I$ is the set of quadruples and $G_t^A$ is the set of node attribute values at time $t$. We propose to predict $G_{t+1}^I$ and $G_{t+1}^A$ from the graph history using this set of functions.
We jointly predict the graph structure and the attribute values and show that the attribute values are being predicted with greater accuracy than can be done using any existing method.
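As an illustration of the data layout described in the problem definition, the sketch below stores events as hextuples and splits one time step into an interaction set and an attribute set. The class name `Event`, the function `split_graph`, and the field names are hypothetical and chosen only for this example.

```python
from typing import Dict, List, NamedTuple, Set, Tuple

class Event(NamedTuple):
    """One hextuple: head, relation, tail, time, head attribute, tail attribute."""
    head: str
    relation: str
    tail: str
    time: int
    head_attr: float
    tail_attr: float

def split_graph(events: List[Event]) -> Tuple[Set[Tuple[str, str, str, int]], Dict[str, float]]:
    """Split the events of one time step into interactions and node attributes."""
    interactions = {(e.head, e.relation, e.tail, e.time) for e in events}
    attributes: Dict[str, float] = {}
    for e in events:
        # Each node carries a single attribute value at a given time step.
        attributes[e.head] = e.head_attr
        attributes[e.tail] = e.tail_attr
    return interactions, attributes
```

Keeping the two sets separate mirrors the two prediction targets: the interaction set is the link prediction target and the attribute set is the time series target.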
4 Proposed Framework: DArtNet
We now present our framework for learning the set of functions for predicting the events for the next timestamp, given the history. We name our framework DArtNet which stands for Dynamic Attributed Network.
We model the changing graph structure and attribute values by learning an entity-specific representation. We define the representation of an entity in the graph as a combination of a static learnable embedding, which does not change with time and represents the static characteristics of each node, and a dynamic embedding, which depends on the attribute value at that time and represents the dynamically evolving properties of each node. We then aggregate the information using the mean over neighborhood entities. For every entity in the graph, we encode its history using a Recurrent Neural Network. Finally, we use a fully connected network for the attribute prediction and link prediction tasks.
4.1 Representation Learning on Events
The main component of our framework learns a representation over the events, which is used for predicting future events. We learn a head-specific representation and then model its history using a Recurrent Neural Network (RNN). Let $E_t^s$ represent the events associated with head $s$ at time $t$, i.e., $E_t^s = \{(s, r, o, t, a_t^s, a_t^o) \in G_t\}$. For each entity in the graph, we decompose the information into two parts, static information and dynamic information. Static information does not change over time and represents the inherent information for the entity. Dynamic information changes over time and represents the information that is affected by all the external variables for the entity. For every entity $s$ in the graph at time $t$, we construct an embedding which consists of two components:
Static (constant) learnable embedding $c_s$, which does not change over time.
Dynamic embedding $d_{s,t}$, which changes over time.
Here $d_{s,t} = a_t^s W_d$, where $a_t^s \in \mathbb{R}^k$ is the attribute of entity $s$ at time $t$ and $W_d$ is a learnable parameter. The attribute value can be a multi-dimensional vector, with the $k$ dimensions representing multiple time series associated with the same head. Hence the embedding of entity $s$ becomes $e_{s,t} = (c_s ; d_{s,t})$, where $;$ is the concatenation operator. For every relation (link) $r$, we construct a learnable static embedding $c_r$.
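The construction of the entity embedding can be sketched as follows, under the assumption that the dynamic part is a linear map of the attribute vector. The names `dynamic_embedding`, `entity_embedding`, and `W_d` are illustrative, and plain Python lists stand in for tensors.

```python
def dynamic_embedding(attr, W_d):
    """Map a k-dimensional attribute vector to d dimensions via a k x d matrix."""
    d = len(W_d[0])
    return [sum(attr[i] * W_d[i][j] for i in range(len(attr))) for j in range(d)]

def entity_embedding(static_emb, attr, W_d):
    """Concatenate the static embedding with the attribute-dependent part."""
    return list(static_emb) + dynamic_embedding(attr, W_d)
```

In a trained model both `static_emb` and `W_d` would be learned; only the attribute vector changes from one time step to the next.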
To capture the neighbourhood information of entity $s$ at time $t$ from the dynamic graph, we propose two spatial embeddings: an attribute embedding $A_t^s$ and an interaction embedding $I_t^s$. $A_t^s$ captures the spatio-attribute information from the neighbourhood of the entity $s$, and $I_t^s$ captures the spatio-interaction information from the same neighbourhood. Both are defined by aggregating, with the mean, over the set $E_t^s$ of events associated with $s$ at time $t$, where $|E_t^s|$ is the cardinality of the set $E_t^s$ and the transformation weights are learnable parameters.
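A minimal sketch of mean aggregation over a head's neighbourhood is given below. It concatenates each neighbour's embedding with the relation embedding and averages the results; the learnable transformations are omitted for brevity, and the names `mean_vectors` and `aggregate_neighbourhood` are assumptions made for this example.

```python
def mean_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[j] for v in vectors) / n for j in range(dim)]

def aggregate_neighbourhood(events, entity_emb, relation_emb):
    """Average one message per event of a head at a single time step.

    events: list of (relation, tail) pairs.
    entity_emb / relation_emb: dicts mapping names to embedding lists.
    """
    # Each message is the tail embedding concatenated with the relation embedding.
    messages = [entity_emb[o] + relation_emb[r] for r, o in events]
    return mean_vectors(messages)
```

Dividing by the number of events keeps the scale of the aggregate independent of how many neighbours a head has.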
4.2 Learning Temporal Dependency
The attribute embedding $A_t^s$ and the interaction embedding $I_t^s$ capture the spatial information for entity $s$ at time $t$. For predicting the information at a future time, we need to capture the temporal dependence of the information. To keep track of the interactions and the attribute evolution over time, we model the history using a Gated Recurrent Unit (GRU) [3], a type of RNN. For the head $s$, we define the encoded attribute history at time $t$ as the sequence of attribute embeddings of $s$ up to time $t$, and the encoded interaction history at time $t$ as the corresponding sequence of interaction embeddings. These sequences provide the full information about the evolution of the head $s$ until time $t$. To represent the sequences, we encode each history with its own GRU, yielding a vector $H_t^A(s)$ for the attribute history and a vector $H_t^I(s)$ for the interaction history. The vector $H_t^A(s)$ captures the spatio-temporal information for the attribute evolution, i.e., how the attribute value of the entity evolves over time with respect to the evolving graph structure, while the vector $H_t^I(s)$ captures the spatio-temporal information of how the relations are associated with the entity over time. We show DArtNet, its input and its output in Figure 2.
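The history encoding can be illustrated with a small, bias-free GRU cell written in plain Python. The parameter names (`W_r`, `U_r`, and so on) and the dictionary layout are hypothetical; a practical implementation would use a library GRU instead.

```python
import math

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, p):
    """One GRU update without bias terms; p holds the weight matrices."""
    r = [sigmoid(a + b) for a, b in zip(matvec(p["W_r"], x), matvec(p["U_r"], h))]
    z = [sigmoid(a + b) for a, b in zip(matvec(p["W_z"], x), matvec(p["U_z"], h))]
    # Candidate state: the reset gate scales the contribution of the old state.
    n = [math.tanh(a + ri * b)
         for a, ri, b in zip(matvec(p["W_n"], x), r, matvec(p["U_n"], h))]
    # The update gate interpolates between the candidate and the old state.
    return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

def encode_history(sequence, hidden_size, p):
    """Fold a sequence of spatial embeddings into a single history vector."""
    h = [0.0] * hidden_size
    for x in sequence:
        h = gru_step(x, h, p)
    return h
```

Running `encode_history` once on the attribute embeddings and once on the interaction embeddings, with separate parameters, gives the two history vectors used for prediction.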
Table 1: Statistics of the datasets.
| Dataset | # Train | # Valid | # Test | # Nodes | # Rel | Granularity |
4.3 Prediction Functions
The main aim of the model is to predict future attribute values as well as interaction events. To get the complete information of the events at the next time step, we perform the prediction in two steps: (1) prediction of the attribute values for the whole graph and (2) prediction of the interaction events for the graph. Recall that the graph $G_t$ at time $t$ is the union of an interaction set $G_t^I$ and an attribute set $G_t^A$. To predict $G_{t+1}$, we divide it into the two sets $G_{t+1}^I$ and $G_{t+1}^A$. The attribute values in $G_{t+1}^A$ are predicted directly: the attribute value for an entity $s$ is modelled as a function of the spatio-attribute history of $s$ and the static information about $s$.
Attribute prediction requires graph structures, so we also predict graph structures. The probability of $G_{t+1}^I$ given the past graphs is modeled as a product over its events. At the event level, the probability of $(s, r, o)$ can be written as the probability of the tail $o$ given $(s, r)$ and the past graphs, multiplied by the probability of $(s, r)$. In this work, we consider the case that the probability of $(s, r)$ is independent of the past graphs and model it using a uniform distribution, so the probability of the graph is proportional to the product of the tail probabilities. For predicting the interaction at a future timestamp, we model the probability of the tail $o$ as a function of the head $s$, the relation $r$, and the encoded interaction history of $s$.
The attribute prediction function and the tail prediction function can be any functions. In our experiments, we use a single-layer feed-forward network for each.
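As a sketch of a single-layer tail predictor, the code below applies one linear layer to a feature vector and normalizes the scores with a softmax, giving one probability per candidate tail. The function names and the composition of `features` (for example, static embeddings concatenated with a history vector) are assumptions for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def tail_probabilities(features, W, b):
    """One linear layer (one row of W per candidate tail) followed by softmax."""
    logits = [sum(w * x for w, x in zip(row, features)) + bi
              for row, bi in zip(W, b)]
    return softmax(logits)
```

The output has one entry per entity in the graph, so the number of rows in `W` equals the number of candidate tails.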
4.4 Parameter Learning
We use a multi-task learning [20, 2] loss for optimizing the parameters. We minimize the attribute prediction loss and graph prediction loss jointly. The total loss is $L = L_I + \lambda L_A$, where $L_I$ is the interaction loss, $L_A$ is the attribute loss and $\lambda$ is a hyperparameter deciding the relative weight of the two tasks. For the attribute loss, we use the mean-squared error $L_A = \frac{1}{N} \sum_{i=1}^{N} (\hat{a}_i - a_i)^2$, where $\hat{a}_i$ is the predicted attribute and $a_i$ is the ground truth attribute. For the interaction loss, we use the standard multi-class cross entropy loss $L_I = -\sum_{c=1}^{C} y_c \log p_c$, where $C$ is the number of classes, i.e., the number of entities in our case.
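The joint objective can be sketched as a weighted sum of a mean-squared error and a cross-entropy term. In this example the weight `lam` multiplies the attribute loss, which is an assumption consistent with the later observation that a larger weight favors attribute prediction; the function names are illustrative.

```python
import math

def mse(pred, target):
    """Mean squared error between predicted and true attribute values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def cross_entropy(probs, true_index):
    """Negative log-probability assigned to the true tail."""
    return -math.log(probs[true_index])

def total_loss(pred_attrs, true_attrs, tail_probs, true_tails, lam):
    """Interaction loss plus lam times attribute loss."""
    attribute_loss = mse(pred_attrs, true_attrs)
    interaction_loss = sum(
        cross_entropy(p, y) for p, y in zip(tail_probs, true_tails)
    ) / len(true_tails)
    return interaction_loss + lam * attribute_loss
```

Setting `lam` to zero reduces the objective to pure link prediction, while large values make the attribute term dominate.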
4.5 Forecasting over Time
At inference time, DArtNet predicts future interactions and attributes based on the previous observations, i.e., the graphs observed up to the current time $t$. To predict interactions and attributes at time $t + \Delta t$ with $\Delta t > 0$, DArtNet adopts multi-step inference and predicts in a sequential manner. At each time step, we compute the probability of each candidate tail to predict the interactions at the next time step. We rank the predicted tails and choose the top-$k$ tails as the predicted values. We use the predicted tails as the interaction set $\hat{G}_{t+1}^I$ for further inference. We also predict attributes, which yields the attribute set $\hat{G}_{t+1}^A$. Now we have the graph structure and attributes at time $t+1$. We repeat this process until we get the predicted graph at time $t + \Delta t - 1$. Then we can predict interactions and attributes at time $t + \Delta t$ based on the observed and predicted graphs.
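The sequential inference procedure can be sketched as a loop that feeds each prediction back into the history. The function `forecast`, the callbacks `predict_attributes` and `predict_tail_probs`, and the dictionary layout of a time step are hypothetical placeholders for a trained model.

```python
def top_k(probs, k):
    """Indices of the k most probable tails, most probable first."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

def forecast(history, steps, predict_attributes, predict_tail_probs, k):
    """Roll the model forward by `steps` time steps.

    history: list of per-step records; each prediction is appended to it
    so that later steps condition on earlier predictions.
    """
    history = list(history)  # do not mutate the caller's list
    for _ in range(steps):
        tails = top_k(predict_tail_probs(history), k)
        attrs = predict_attributes(history)
        history.append({"attrs": attrs, "tails": tails})
    return history[-1]
```

Because predicted graphs replace the unobserved ones, no ground-truth graph structure is needed for the future time steps.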
| Seq2Seq model | 1.323 | 4.554 | 8.080 | 2.975 | 28.000 |
| ConvE+LSTM (1 layer) | 0.763 | 3.899 | 8.220 | 7.240 | 202.580 |
| ConvE+LSTM (2 layers) | 0.728 | 4.321 | 8.440 | 9.460 | 206.640 |
| HyTE+LSTM (1 layer) | 4.041 | 40.234 | 8.089 | 37.170 | 7.430 |
| HyTE+LSTM (2 layers) | 1.531 | 40.885 | 8.230 | 17.410 | 2.070 |
| TA-Distmult+LSTM (1 layer) | 0.847 | 3.584 | 9.456 | 16.880 | 3.250 |
| TA-Distmult+LSTM (2 layers) | 0.796 | 3.432 | 9.034 | 9.770 | 7.030 |
| RENet (mean)+LSTM (1 layer) | 0.793 | 4.073 | 9.022 | 5.020 | 203.320 |
| RENet (mean)+LSTM (2 layers) | 0.857 | 3.865 | 8.856 | 4.348 | 200.220 |
| RENet (RGCN)+LSTM (1 layer) | 0.620 | 3.718 | 8.998 | 5.170 | 203.120 |
| RENet (RGCN)+LSTM (2 layers) | 0.550 | 3.984 | 8.201 | 12.700 | 201.560 |
5 Experiments
In this section, we evaluate our proposed method DArtNet on two tasks: (1) predicting future attributes associated with each node on five datasets; (2) studying variations and parameter sensitivity of our proposed method. We summarize the datasets, evaluation metrics, and baseline methods in the following sections.
5.1 Datasets
Due to the unavailability of datasets satisfying our problem statement, we curated appropriate datasets by scraping the web. We created and tested our approach on the datasets described below. Statistics of the datasets are given in Table 1.
Attributed Trade graph (ATG). This dynamic graph represents the net exports from one country (node) to another, where each edge type corresponds to a segment of the net export value (in millions of dollars). The month-averaged currency exchange rate of the corresponding country, in SDRs per currency unit, is the time series attribute value.
Co-authorship-Citation dataset (CAC). Each edge in the graph represents a collaboration between the authors (nodes) of a research paper. The number of citations per year for an author is the corresponding time series attribute for the node.
Multi-attributed Trade graph (MTG). This is a subset of ATG with a multi-attributed time series representing the monthly Net Export Price Index and the value of International Reserves assets in millions of US dollars.
Attributed GDELT graph (AGG). The Global Database of Events, Language, and Tone (GDELT) records different types of events in a month between entities such as political leaders, organizations, and countries. Here only country nodes are associated with a time-series attribute, which is taken as the currency exchange rate.
5.2 Evaluation Metrics
The aim is to predict attribute values at each node at future timestamps. We use the mean squared error (MSE) between predicted and true attribute values as the metric; a lower MSE indicates better performance.
5.3 Baseline methods
We show that the results produced by our model outperform those of the existing time series forecasting models. We compare our approach against two kinds of methods for attribute prediction.
Time series prediction without TKG
These methods do not take the graph data into account and make predictions using only the available time-series history. We compare our model to Historic Average (HA), the Vector AutoRegressive (VAR) model [8], the Autoregressive Integrated Moving Average (ARIMA) model, and a GRU-based Seq2Seq model [22]. HA makes predictions based on the weighted average of the previous time series values, ARIMA uses lagged observations for prediction, and VAR predicts multiple time series simultaneously by capturing linear interdependencies among them.
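The simplest of these baselines, Historic Average, can be sketched in a few lines. The function name and the default of uniform weights are assumptions for this example; the weighting scheme used in the experiments is not specified here.

```python
def historic_average(series, weights=None):
    """Predict the next value as a weighted average of past values.

    With weights=None every past value counts equally.
    """
    if not series:
        raise ValueError("series must contain at least one value")
    if weights is None:
        weights = [1.0] * len(series)
    return sum(w * x for w, x in zip(weights, series)) / sum(weights)
```

Putting more weight on recent values makes the baseline react faster to trends, at the cost of more variance.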
Time series prediction with TKG
Node embeddings are learned using graph representation learning methods (both static and temporal) such as RE-Net, HyTE, TA-DistMult, and ConvE. For each node, the attribute value at a particular timestamp is concatenated with the corresponding node embedding, and this data is passed into a GRU network for making predictions.
All models are implemented in PyTorch and trained with the Adam optimizer. The best hyperparameters are chosen using the validation dataset. Typically, increasing the value of $\lambda$ gives better results, and the best results on each dataset are reported.
5.4 Main Results
The results for attribute prediction on the different datasets are reported in Table 2. DArtNet outperforms every baseline on attribute prediction by a large margin. The neural-network-based models outperform the other baselines on these complicated datasets, which indicates their long-term modeling capacity. The relational methods, which use graph information, generally outperform the non-relational methods on attribute prediction, with a large increase in performance on more complicated datasets such as MTG and AGG. This suggests that using relational methods for attribute prediction is a promising research direction. DArtNet also outperforms the other relational methods, which do not jointly train embeddings for attribute prediction and link prediction. This suggests that jointly training the embeddings for both tasks improves attribute prediction compared with training the embeddings separately and then using them for attribute prediction.
5.5 Performance Analysis
To study the effects of parameter sharing and hyperparameter sensitivity on prediction, we perform several ablation studies for DArtNet on four datasets, as AGG does not have attribute values for all nodes.
Decoupling of attribute prediction and interaction prediction tasks. We decouple the parameters shared between the two tasks and observe the performance. More formally, we use a separate set of embeddings for each task, one as the parameters for link prediction and the other as the parameters for attribute prediction.
Sharing history. We study the effect of using the same history embedding for both link prediction and attribute prediction. This will help us study if similar history information is required for both the tasks. Here the does not explicitly get the related information so that we can share the weights. Hence the new equations become:
where the parameters of both the RNNs are shared.
Study of Time-Dependent Information. We evaluate the performance of our model in the absence of any temporal information. Hence we do not encode any history for any task and directly predict the tails and the attribute values at a future timestamp. Hence the equations become:
Analysis on Variants of DArtNet. Figure 3 shows the variation of Attribute Loss with different variants of DArtNet proposed in Section 5.5. From Figure 3, we observe that our model outperforms the decoupled variant by a large margin for attribute prediction. This confirms the hypothesis that joint training of attribute prediction and link prediction performs better than training separately. We also see that sharing history for attribute prediction and link prediction deteriorates the results, which indicates that the history encoding information required for link prediction and attribute prediction is quite different from each other. Lastly, the time-independent variant of our framework performs poorly. This clearly indicates that the temporal evolution information is essential for proper inference.
Sensitivity analysis of hyperparameter $\lambda$. We perform a sensitivity analysis of the parameter $\lambda$, which specifies the relative weight given to the two tasks. Figure 4 shows the variation of the MSE loss for the attribute prediction task with increasing $\lambda$. We observe that the attribute loss decreases with increasing $\lambda$. This is expected, as increasing $\lambda$ favors the optimization of the attribute loss while decreasing $\lambda$ favors link prediction.
6 Conclusion and Future Work
In this paper, we propose to jointly model attribute prediction and link prediction on a temporally evolving graph. We propose a novel framework, DArtNet, which uses two recurrent neural networks to encode the history of the graph. The framework shares parameters between the two tasks and trains them jointly using multi-task learning. Through various experiments, we show that our framework achieves better performance on attribute prediction than previous methods, indicating that external knowledge is useful for time series prediction. Interesting future work includes link prediction at the graph level, rather than at the subject and relation level, in a memory-optimized way.
References
- [1] (2013) Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Eds.), pp. 2787–2795.
- [2] (1997) Multitask learning. Machine Learning 28 (1), pp. 41–75.
- [3] On the properties of neural machine translation: encoder-decoder approaches. CoRR abs/1409.1259.
- [4] HyTE: hyperplane-based temporally aware knowledge graph embedding. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2001–2011.
- [5] (2017) Convolutional 2d knowledge graph embeddings. CoRR abs/1707.01476.
- [6] (2018) Learning sequence encoders for temporal knowledge graph completion. CoRR abs/1809.03202.
- [7] (2018) DynGEM: deep embedding method for dynamic graphs. CoRR abs/1805.11273.
- [8] (1994) Time series analysis. Princeton Univ. Press, Princeton, NJ.
- [9] (2017) Representation learning on graphs: methods and applications. CoRR abs/1709.05584.
- [10] (2019) Recurrent event network for reasoning over temporal knowledge graphs. arXiv preprint arXiv:1904.05530.
- [11] (2018) Neural relational inference for interacting systems. In ICML.
- [12] (2016) Semi-supervised classification with graph convolutional networks. CoRR abs/1609.02907.
- [13] (2018) Learning dynamic embeddings from temporal interactions. CoRR abs/1812.02289.
- [14] (2017) Graph convolutional recurrent neural network: data-driven traffic forecasting. CoRR abs/1707.01926.
- [15] Short-term traffic flow forecasting: an experimental comparison of time-series analysis and supervised learning. IEEE Transactions on Intelligent Transportation Systems 14, pp. 871–882.
- [16] (2011) Discovering spatio-temporal causal interactions in traffic data streams. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '11, New York, NY, USA, pp. 1010–1018.
- [17] Interpretable graph convolutional neural networks for inference on noisy knowledge graphs. CoRR abs/1812.00279.
- [18] (2019) EvolveGCN: evolving graph convolutional networks for dynamic graphs. CoRR abs/1902.10191.
- [19] Relational time series forecasting. The Knowledge Engineering Review 33, pp. e1.
- [20] (2017) An overview of multi-task learning in deep neural networks. CoRR abs/1706.05098.
- [21] (2018) Modeling relational data with graph convolutional networks. Lecture Notes in Computer Science, pp. 593–607.
- [22] (2014) Sequence to sequence learning with neural networks. CoRR abs/1409.3215.
- [23] (2017) Know-evolve: deep reasoning in temporal knowledge graphs. CoRR abs/1705.05742.
- [24] (2018) Representation learning over dynamic graphs. CoRR abs/1803.04051.
- [25] (2019) A comprehensive survey on graph neural networks. CoRR abs/1901.00596.
- [26] (2014) Embedding entities and relations for learning and inference in knowledge bases. CoRR abs/1412.6575.
- [27] (2018) GraphRNN: A deep generative model for graphs. CoRR abs/1802.08773.
Appendix A Datasets
Due to the unavailability of datasets satisfying our problem statement we curated appropriate datasets by scraping the web. We created and tested our approach on the datasets described below.
Attributed Trade graph (ATG). This dataset consists of a directed, multi-relational, unweighted, dynamic knowledge graph with nodes representing different countries. A timestamped edge between two nodes represents the net exports between the respective countries in millions of dollars. To discretize the edges, the range of values of net exports is split into 200 equal-sized segments, resulting in 178 different types of edges. The attribute value associated with each node is the month-averaged currency exchange rate of the corresponding country in SDRs per currency unit. The data is present in the form of a tuple $(s, r, o, t, a_t^s, a_t^o)$, where $s$, $o$ and $t$ denote the head, tail and timestamp respectively. Relation $r$ exists between $s$ and $o$ at timestamp $t$, and $a_t^s$ and $a_t^o$ are the attribute values of the head and tail respectively at $t$. The graph evolves at a monthly rate.
The graph is obtained by using a script to scrape data from www.trademap.org. The exchange-rate data is scraped from www.imf.org.
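The edge discretization described for ATG can be sketched as equal-width binning. Interpreting "equal-sized" as equal-width is an assumption, and the function names and the clamping of out-of-range values are illustrative choices.

```python
def discretize(value, lo, hi, num_segments=200):
    """Map a value in [lo, hi] to the index of its equal-width segment."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    width = (hi - lo) / num_segments
    index = int((value - lo) / width)
    # Clamp so that value == hi (or out-of-range input) stays a valid index.
    return min(max(index, 0), num_segments - 1)

def edge_types(values, lo, hi, num_segments=200):
    """Set of segment indices that actually occur in the data."""
    return {discretize(v, lo, hi, num_segments) for v in values}
```

Only segments that contain at least one observed value become edge types, which is one way the number of edge types can end up smaller than the number of segments.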
Co-authorship-Citation dataset (CAC). Here the knowledge graph is dynamic, uni-relational, unweighted and undirected. The nodes denote authors, and an edge between two nodes at a particular timestamp denotes that the corresponding authors contributed to a research paper at that time. The time granularity is a year. The attribute value associated with each node is the number of citations per year received by the associated author on any paper written by them. Again the data is present in the form of the tuple $(s, r, o, t, a_t^s, a_t^o)$, where the meanings of the symbols are as explained above. We used two versions of this dataset: small, having 44 nodes, and large, having 20k nodes.
The citation dataset is curated from www.aminer.cn.
Multi-attributed Trade graph (MTG). The graph in this dataset is a subset of the trade graph described above. Each node has multiple attribute values associated with it. One of the two attributes is the Net Export Price Index, with individual commodities weighted by the ratio of net exports to the total commodity trade. The other is the value of International Reserves and other foreign currency assets in millions of US dollars. Both form monthly time series.
Both the time series attributes are scraped from www.imf.org.
Attributed GDELT graph (AGG). The knowledge graph, in this case, is derived from the Global Database of Events, Language, and Tone (GDELT). It is dynamic, directed, multi-relational, unweighted and also has multiple types of nodes. The nodes represent entities like political leaders, organisations and several others. Each of these entities can be associated with a country. We modified this graph by adding nodes representing countries and connecting them to their respective entities through a self-defined edge type. 245 other edge types also exist recording events. We have used this graph at the granularity level of a month. Only the country nodes are associated with a time-series attribute which is taken as the Currency Exchange Rate (as described above) in this case.
Appendix B Experimental Settings
All our models are written in PyTorch (https://pytorch.org/). We use the Adam optimizer for training our models. Gated recurrent units are used as the RNN in all the experiments. We use a single GRU layer for the experiments involving knowledge graphs, while we use both one and two layers for the baselines. We experiment with various values of $\lambda$ and report the results of this study in Section 5.5. We first train DArtNet on the training dataset. We then use the saved checkpoints from various stages of the training to obtain the results of attribute prediction on the validation data. From the validation results, we choose the best checkpoint and evaluate the test set on that checkpoint. We report the results of attribute prediction on the test data. All our models are trained on an Nvidia GeForce GTX 1080 Ti.
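The checkpoint selection procedure amounts to picking the checkpoint with the lowest validation error. The sketch below assumes validation MSE has already been computed per checkpoint; the function name and the checkpoint labels in the usage example are hypothetical.

```python
def select_best_checkpoint(val_losses):
    """Return the name of the checkpoint with the lowest validation loss.

    val_losses: dict mapping checkpoint name -> validation MSE.
    """
    if not val_losses:
        raise ValueError("no checkpoints to choose from")
    return min(val_losses, key=val_losses.get)

# Illustrative usage with made-up values:
best = select_best_checkpoint({"epoch_1": 0.9, "epoch_2": 0.4, "epoch_3": 0.6})
```

The test set is then evaluated only once, on the selected checkpoint, so that test results do not influence model selection.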