Multivariate Time Series Forecasting Based on Causal Inference with Transfer Entropy and Graph Neural Network

05/03/2020 ∙ by Haoyan Xu, et al. ∙ Zhejiang University

Multivariate time series (MTS) forecasting is an important problem in many fields. Accurate forecasting results can effectively help decision-making and reduce subjectivity. To date, many MTS forecasting methods have been proposed and widely applied. However, these methods assume that the value to be predicted of a single variable is related to all other variables, which makes it difficult to select the true key variable in high-dimensional situations. To address the above issue, a novel end-to-end deep learning model, termed transfer entropy graph neural network (TEGNN) is proposed in this paper. For accurate variable selection, the transfer entropy (TE) graph is introduced to characterize the causal information among variables, in which each variable is regarded as a graph node. In addition, convolutional neural network (CNN) filters with different perception scales are used for time series feature extraction. What is more, graph neural network (GNN) is adopted to tackle the embedding and forecasting problem of graph structure composed of MTS. MTS data collected from the real world are used to evaluate the prediction performance of TEGNN. Our comprehensive experiments demonstrate that the proposed TEGNN consistently outperforms state-of-the-art MTS forecasting baselines.




1 Introduction

In the real world, multivariate time series (MTS) data are common in various fields, such as sensor data in the Internet of Things, traffic flows on highways, and prices collected from stock markets. Prediction models can be built from existing MTS data to estimate future trends. MTS forecasting is therefore an important problem in many fields: for example, stock prices are predicted to guide investment strategies, and traffic flows are predicted to plan travel routes.

In recent years, many time series forecasting methods have been widely studied and applied. For univariate situations, ARIMA box2015time is one of the most classic forecasting methods. It covers a variety of time series models, including autoregression (AR), moving average (MA), and autoregressive moving average (ARMA), and is therefore flexible and adaptable to various types of time series. However, due to its high computational complexity, ARIMA is not suitable for multivariate situations. The VAR method hamilton1994time ; lutkepohl2005new ; box2015time is a multivariate extension of the AR model. Although VAR is widely used in MTS forecasting tasks due to its simplicity, it cannot handle nonlinear relationships among variables, which reduces its forecasting accuracy.

In addition to traditional statistical methods, deep learning methods are also applied to the MTS forecasting problem tokgoz2018rnn . Due to the flexibility of neural network structures, deep learning methods can capture the dynamics and changing trends of time series well by taking the temporal sequence into account. The recurrent neural network (RNN) and its two improved versions, namely the long short term memory (LSTM) and the gated recurrent unit (GRU) chung2014empirical , extract dynamic information from time series through a memory mechanism. The convolutional neural network (CNN) lecun1995convolutional slides multiple convolution kernels along the time series, thereby extracting features in time order. Besides, the multi-head attention mechanism (MHA) NIPS2017_7181 of the well-known Transformer model, which concatenates and projects the input into query, key and value spaces, can also be used to encode MTS sequences. Specific combinations of the above neural network structures can achieve reasonable MTS forecasting results.

Nevertheless, the existing deep learning methods assume that the value to be predicted of a single variable is related to all other variables. In fact, for a time series to be predicted, its future value may be related to only a few other variables in the data set. For example, the future traffic flow of a certain street is easier to predict from the traffic information of the neighboring area, while the information from areas farther away is relatively useless. If such a priori causal information can be considered, it becomes easier to select key variables in the model training phase. Conversely, learning the key variables automatically through optimization alone increases the difficulty of model training, makes the model prone to overfitting, and reduces accuracy. There have been studies on the quantitative characterization of time series causality. Among them, the most famous is Granger causality analysis (G-causality) granger1969investigating ; kirchgassner2013granger . This method represents causality by establishing an AR model and comparing the prediction residuals obtained when different independent variables are selected. However, as a linear model, G-causality cannot handle nonlinear relationships well. Transfer entropy (TE) bossomaier2016transfer ; faes2013compensated has also been proposed for causal analysis, and it is able to deal with nonlinear relationships. Since TE was proposed, it has been widely used for data analysis in the economic dimpfl2013using , biological tung2007inferring and industrial bauer2006finding fields.

In this work, a novel framework, termed transfer entropy graph neural network (TEGNN), is proposed and applied to MTS forecasting tasks; it considers the causal relationships among variables. To introduce causality, the pairwise TE between variables is calculated to obtain the TE matrix, which is regarded as the adjacency matrix of a graph in which each variable is a node. In addition, convolutional neural network (CNN) filters with different perception scales are used for time series feature extraction. Furthermore, a graph neural network is adopted to tackle the embedding and forecasting problem of the graph structure composed of MTS. Our major contributions are:

  • We are the first to propose a framework that treats a multivariate time series as a graph structure with causality, so that the causality among time series is used as a priori information to guide the forecasting task, and a graph neural network is utilized to process this graph structure.

  • We adopt a CNN structure with multiple receptive fields to comprehensively extract the features of time series, which effectively improves the prediction accuracy.

  • We conduct extensive experiments on MTS benchmark datasets, and the results show that TEGNN outperforms the state-of-the-art models.

The rest of this paper is organized as follows. Section 2 outlines the related preliminary information in detail, including TE and GNN methods. Section 3 describes the proposed TEGNN model. Section 4 reports the evaluation results of the proposed model in comparison with baselines on real-world datasets. Finally, Section 5 concludes the paper and discusses future research.

2 Preliminaries

2.1 Transfer Entropy

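Transfer entropy (TE) is a measure of causality based on information theory, which was proposed by Schreiber in 2000. Before introducing TE, two concepts in information theory should be presented in advance. Given a variable $X$, its information entropy is defined as:

$$H(X) = -\sum_{x} p(x) \log p(x) \tag{1}$$

where $x$ denotes all possible values of variable $X$. Information entropy is used to measure the amount of information. A larger $H(X)$ indicates that the variable $X$ contains more information. Conditional entropy is another information theory concept. Given two variables $X$ and $Y$, it is defined as:

$$H(X \mid Y) = -\sum_{x} \sum_{y} p(x, y) \log p(x \mid y) \tag{2}$$

Conditional entropy represents the information amount of $X$ under the condition that the variable $Y$ is known.

The TE of variables $Y$ to $X$ is defined as:

$$T_{Y \to X} = \sum p\big(x_{t+1}, x_t^{(k)}, y_t^{(l)}\big) \log \frac{p\big(x_{t+1} \mid x_t^{(k)}, y_t^{(l)}\big)}{p\big(x_{t+1} \mid x_t^{(k)}\big)} = H\big(x_{t+1} \mid x_t^{(k)}\big) - H\big(x_{t+1} \mid x_t^{(k)}, y_t^{(l)}\big) \tag{3}$$

where $x_t$ and $y_t$ represent their values at time $t$, $x_t^{(k)} = [x_t, x_{t-1}, \ldots, x_{t-k+1}]$ and $y_t^{(l)} = [y_t, y_{t-1}, \ldots, y_{t-l+1}]$. It can be found that TE is actually an increase in the information amount of the variable $X$ when $Y$ changes from unknown to known. TE indicates the direction of information flow, thus characterizing causality. It is worth noting that TE is asymmetric, so the causal relationship between $X$ and $Y$ is usually further indicated in the following way:

$$T_{X, Y} = T_{X \to Y} - T_{Y \to X} \tag{4}$$

When $T_{X, Y}$ is greater than $0$, it means that $X$ is the cause of $Y$, otherwise $X$ is the consequence of $Y$.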
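As a rough illustration of the definition above, the following sketch estimates TE as a difference of two conditional entropies using plug-in (frequency count) probabilities. It assumes discrete-valued series, history lengths k = l = 1, and base-2 logarithms; the function names and the toy data are hypothetical, and real-valued series would first need to be discretized or handled with a different estimator.

```python
import math
from collections import Counter


def conditional_entropy(pairs):
    """Plug-in estimate of H(A | B) in bits from a list of (a, b) samples."""
    n = len(pairs)
    joint = Counter(pairs)                # counts of (a, b)
    cond = Counter(b for _, b in pairs)   # counts of b alone
    return -sum(c / n * math.log2(c / cond[b]) for (_, b), c in joint.items())


def transfer_entropy(source, target):
    """Estimate T_{source -> target} with history lengths k = l = 1."""
    nxt, past_t, past_s = target[1:], target[:-1], source[:-1]
    # Uncertainty about the next target value given only its own past.
    h_without = conditional_entropy(list(zip(nxt, past_t)))
    # Uncertainty once the past of the source is also known.
    h_with = conditional_entropy(list(zip(nxt, zip(past_t, past_s))))
    return h_without - h_with


# Toy data (assumption): y copies x with a one-step delay, so x drives y.
x = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1]
y = [0] + x[:-1]

te_xy = transfer_entropy(x, y)   # information flowing from x to y
te_yx = transfer_entropy(y, x)   # information flowing from y to x
net = te_xy - te_yx              # positive values suggest x is the cause of y
print(te_xy, te_yx, net)
```

On short series the plug-in estimate is biased, so the reverse direction is generally not exactly zero even when no information flows that way; this is one reason a significance threshold is useful in practice.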

2.2 Graph Neural Network

The concept of graph neural network (GNN) was first proposed in scarselli2008graph , which extended existing neural networks to process data represented in graph domains. A wide variety of GNN models have been proposed in recent years. Most of these approaches fit within the framework of “neural message passing” proposed by Gilmer et al. gilmer2017neural . In the message passing framework, a GNN is viewed as a message passing algorithm in which node representations are iteratively computed from the features of their neighbor nodes using a differentiable aggregation function ying2018hierarchical .

A separate line of work focuses on generalizing convolutions to graphs. The Graph Convolutional Network (GCN) DBLP:journals/corr/KipfW16 can be regarded as an approximation of spectral-domain convolution of graph signals. The GCN convolutional operation can also be viewed as sampling and aggregating neighborhood information, as in GraphSAGE DBLP:journals/corr/HamiltonYL17 and FastGCN chen2018fastgcn , which enables training in batches while sacrificing some time-efficiency. Following GCN, the Graph Isomorphism Network (GIN) xu2018powerful and k-GNNs morris2019weisfeiler were developed, enabling more complex forms of aggregation. Graph Attention Networks (GAT) velivckovic2017graph are another nontrivial direction in graph neural networks: they incorporate attention into propagation, attending over the neighbors via self-attention.

3 Methodology

Figure 1: The schematic of TEGNN. A multivariate time series consists of multiple univariate time series. TEGNN maps a multivariate time series to a graph, and each univariate time series (variable) is mapped to a node. The transfer entropy matrix is calculated to model the adjacency information of nodes, while a convolutional layer is used to capture node features. The node feature matrix and adjacency matrix are then fed into a graph neural network to get forecasts.

This section introduces the proposed TEGNN in detail, which is a graph neural network based approach that attempts to take the causal relationships among variables into account for MTS forecasting. A schematic of TEGNN is illustrated in Figure 1. The details of TEGNN are presented below.

3.1 Problem Formulation

This paper focuses on the task of MTS forecasting. Given a matrix $X = [x_1, x_2, \ldots, x_n]^{\top} \in \mathbb{R}^{n \times T}$ consisting of multiple observed time series, where $x_i \in \mathbb{R}^{T}$ and $n$ is the number of variables, the purpose of MTS forecasting is to predict the values of all variables at time $T + h$ as accurately as possible, where $h$ is the horizon ahead of the current time stamp, which is usually determined according to the actual application scenario.
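To make the role of the horizon concrete, the sketch below slices a multivariate series into (input window, target) pairs, where each target lies a fixed number of steps after the last observed step. The function name, the list-of-rows layout (one row per time step), and the window length are assumptions made for illustration only.

```python
def make_samples(series, window, horizon):
    """Build (inputs, target) pairs from a series given as a list of time steps.

    Each time step is a list holding one value per variable. The target is the
    time step that lies `horizon` steps after the last step of the input window.
    """
    samples = []
    for end in range(window, len(series) - horizon + 1):
        inputs = series[end - window:end]     # the last `window` observed steps
        target = series[end + horizon - 1]    # the step `horizon` ahead
        samples.append((inputs, target))
    return samples


# Toy series (assumption): 6 time steps, 2 variables.
series = [[0.0, 10.0], [1.0, 11.0], [2.0, 12.0],
          [3.0, 13.0], [4.0, 14.0], [5.0, 15.0]]
samples = make_samples(series, window=3, horizon=2)
for inputs, target in samples:
    print(inputs, "->", target)
```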

3.2 Causality Graph Structure Based on Transfer Entropy

When predicting the future value of a variable $x_i$, if we can directly determine which other variables have an effect on predicting $x_i$, it will help to reduce the difficulty of model training and prevent incorrect timing relationships from being learned. As mentioned above, transfer entropy can characterize the causal relationships among variables. If the pairwise transfer entropy between variables is calculated before the prediction model is trained and is input into the model as a priori information, the selection of key variables can be achieved.

According to Equations 3-4 in Section 2.1, the transfer entropy matrix $T$ of the multivariate time series $X$ can be obtained, where the element of the $i$-th row and $j$-th column of $T$, denoted $T_{ij}$, is calculated as:

$$T_{ij} = \begin{cases} T_{x_i, x_j}, & T_{x_i, x_j} > c \\ 0, & \text{otherwise} \end{cases} \tag{5}$$

where $x_i$ is the $i$-th variable of $X$, and $c$ is the threshold to determine whether the causality is significant. $T$ can be regarded as the adjacency matrix of a graph structure, which is used for subsequent variable selection.
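One possible way to turn pairwise TE values into a thresholded adjacency matrix is sketched below. It assumes that an entry is kept when the net transfer entropy (forward minus backward) exceeds the threshold and is set to zero otherwise; the function name, the input layout (`te[i][j]` holds the TE from variable i to variable j), and the numbers are hypothetical.

```python
def causality_adjacency(te, threshold):
    """Threshold net pairwise transfer entropy into an adjacency matrix.

    te[i][j] is assumed to hold the transfer entropy from variable i to
    variable j. Entry (i, j) of the result keeps the net value
    te[i][j] - te[j][i] when it exceeds `threshold`, and is 0.0 otherwise.
    """
    n = len(te)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # no self-causality edge
            net = te[i][j] - te[j][i]
            if net > threshold:
                adj[i][j] = net
    return adj


# Hypothetical pairwise TE values for three variables.
te = [[0.00, 0.30, 0.05],
      [0.10, 0.00, 0.02],
      [0.04, 0.25, 0.00]]
adj = causality_adjacency(te, threshold=0.1)
for row in adj:
    print(row)
```

With a positive threshold, at most one of the two directions between any pair of variables survives, so the resulting graph is directed.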

3.3 Time Series Feature Extraction of Multiple Receptive Fields

Time series are a special kind of data. When analyzing a time series, it is necessary to consider not only its numerical values but also its trend over time. Time series from the real world often have multiple meaningful periods. For example, the traffic flow of a certain street not only shows a similar trend every day, but meaningful patterns can also be observed at the scale of a week. Therefore, it is reasonable to extract the features of a time series at multiple periods. However, before the network structure of the model is determined, the effective period is often unknown. In this paper, we therefore use multiple CNN filters with different receptive fields, namely kernel sizes, to extract individual features for each input time series at multiple time scales. Given an input time series $x$, $p$ CNN filters with different convolution kernel sizes are separately generated, and the features are extracted as follows:

$$h_i = \sigma(W_i * x + b_i), \quad i = 1, \ldots, p, \qquad h = h_1 \oplus h_2 \oplus \cdots \oplus h_p \tag{6}$$

where $W_i$ and $b_i$ are the kernel and bias of the $i$-th filter, $*$ denotes the convolution operation, $\oplus$ represents the concatenate operation, and $\sigma$ is a nonlinear activation function. In this way, features under different periods are extracted, which provides effective information for time series prediction. It is worth noting that the feature extraction of each time series is separated from each other here, because the subsequent steps need to merge the information of different time series according to the transfer entropy matrix $T$.
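The multi-scale extraction described above can be sketched in a few lines: each filter slides over the same series with its own kernel size, an activation is applied, and the outputs are concatenated. The sketch assumes ReLU as the activation, "valid" sliding without padding, and hand-picked weights; none of these choices are taken from the paper.

```python
def conv1d_relu(x, weights, bias):
    """Slide one kernel over x (no padding) and apply ReLU.

    As in most deep learning libraries, the kernel is not flipped, so this is
    technically a cross-correlation.
    """
    k = len(weights)
    return [max(0.0, sum(w * x[t + i] for i, w in enumerate(weights)) + bias)
            for t in range(len(x) - k + 1)]


def multi_scale_features(x, filters):
    """Concatenate the outputs of filters with different kernel sizes."""
    features = []
    for weights, bias in filters:
        features.extend(conv1d_relu(x, weights, bias))
    return features


# Toy input series and hypothetical filters with kernel sizes 2, 3 and 4.
x = [1.0, 2.0, 4.0, 3.0, 5.0, 6.0]
filters = [
    ([0.5, 0.5], 0.0),                 # short-range average
    ([1.0, 0.0, -1.0], 0.0),           # difference across two steps
    ([0.25, 0.25, 0.25, 0.25], 0.0),   # longer-range average
]
h = multi_scale_features(x, filters)
print(len(h), h)
```

Each series is processed on its own here, which matches the point made above that features of different variables are only merged later through the graph.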

3.4 Graph Node Embedding Based on Transfer Entropy Matrix

After feature extraction, the input MTS is converted into a feature matrix $H \in \mathbb{R}^{n \times d}$, where $d$ is the number of features after the calculation introduced in Section 3.3. $H$ can be regarded as a feature matrix of a graph with $n$ nodes. The adjacency of nodes in the graph structure is determined by the transfer entropy matrix $T$. For such a graph structure, graph neural networks can be directly applied for the embedding of nodes. Inspired by the k-GNNs morris2019weisfeiler model, we propose the TEGNN model and use the following propagation model for calculating the forward-pass update of a node denoted by $i$:

$$h_i^{(l+1)} = \sigma\Big(h_i^{(l)} W_1^{(l)} + \sum_{j \in N(i)} h_j^{(l)} W_2^{(l)}\Big) \tag{7}$$

where $h_i^{(l)}$ is the hidden state of node $i$ in the $l$-th layer, $W_1^{(l)}$ and $W_2^{(l)}$ are trainable weight matrices, $\sigma$ is a nonlinear activation function, and $N(i)$ denotes the neighbors of node $i$. k-GNNs only perform information fusion between a certain node and its neighbors, ignoring the information of other non-neighbor nodes. In this way, for the prediction of a time series, only other series with significant causality are considered. This design plays a role in the selection of key variables, which can effectively avoid the information redundancy brought by high dimensions. By adding the a priori causal information obtained by TE, the model does not need to find out the key variables for forecasting by itself. In this paper, the number of hidden features of each node in the last graph neural network layer is set to $1$, so that the output of this layer is used as the prediction result of the input MTS. We also conduct experiments using the GIN xu2018powerful model, and our corresponding model is called TEGIN. GIN can efficiently gather information of neighboring nodes, and learn accurate structural information through summation aggregation:

$$h_v^{(k)} = \mathrm{MLP}^{(k)}\Big(\big(1 + \epsilon^{(k)}\big) \cdot h_v^{(k-1)} + \sum_{u \in N(v)} h_u^{(k-1)}\Big) \tag{8}$$

where $h_v^{(k)}$ is the $k$-th layer node embedding for the node $v$, $\epsilon^{(k)}$ is a trainable parameter, $\mathrm{MLP}$ represents the nonlinear mapping composed of multi-layer fully connected neural networks, and $N(v)$ represents the neighbor nodes of node $v$.
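A minimal sketch of one neighbor-sum message passing step in the spirit of the k-GNN style update is given below. It makes several assumptions that are not stated in the text: node j counts as a neighbor of node i when `adj[j][i]` is positive (an edge from cause to effect), neighbors are summed without weighting by the TE value, ReLU is the activation, and the weight matrices are hand-picked rather than learned.

```python
def matvec(h, w):
    """Multiply a row vector h (length d_in) by a matrix w (d_in x d_out)."""
    return [sum(h[a] * w[a][b] for a in range(len(h))) for b in range(len(w[0]))]


def gnn_layer(h, adj, w_self, w_neigh):
    """One message passing step: relu(h_i W_self + sum_{j in N(i)} h_j W_neigh)."""
    n = len(h)
    out = []
    for i in range(n):
        # Sum the hidden states of all nodes that have an edge into node i.
        agg = [0.0] * len(h[0])
        for j in range(n):
            if j != i and adj[j][i] > 0:
                agg = [a + b for a, b in zip(agg, h[j])]
        own = matvec(h[i], w_self)
        msg = matvec(agg, w_neigh)
        out.append([max(0.0, a + b) for a, b in zip(own, msg)])
    return out


# Toy graph (assumption): nodes 0 and 2 both point to node 1.
h = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
adj = [[0, 1, 0],
       [0, 0, 0],
       [0, 1, 0]]
w_self = [[1.0, 0.0], [0.0, 1.0]]    # hypothetical weights
w_neigh = [[0.5, 0.0], [0.0, 0.5]]   # hypothetical weights
out = gnn_layer(h, adj, w_self, w_neigh)
print(out)
```

Nodes without incoming edges keep only their own transformed state, which illustrates how a sparse causality graph restricts which variables can influence each forecast.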

3.5 Objective Function

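In the task of MTS forecasting, the following absolute loss (L1-loss) function is often used:

$$\min_{\Theta} \sum_{t \in \Omega_{Train}} \sum_{i=1}^{n} \big| X_{i, t+h} - \hat{X}_{i, t+h} \big| \tag{9}$$

where $X_{i, t+h}$ is the value of the $i$-th variable at time $t + h$, $\hat{X}_{i, t+h}$ is the prediction result of $X_{i, t+h}$ output by the model, $n$ is the number of variables, $\Omega_{Train}$ is the set of time stamps used for training, and $\Theta$ denotes all trainable parameters in the model. This objective function is also used in this paper, and the optimization problem can be solved by stochastic gradient descent (SGD) or its improved versions such as Adam kingma2014adam .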
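As a small worked example of the absolute loss, the sketch below sums the absolute errors over all training time stamps and variables. The function name and the toy numbers are assumptions; in practice a deep learning framework would compute this quantity on tensors.

```python
def l1_loss(targets, predictions):
    """Sum of absolute errors over all time stamps (rows) and variables (columns)."""
    return sum(abs(a - p)
               for row_a, row_p in zip(targets, predictions)
               for a, p in zip(row_a, row_p))


# Toy values (assumption): 2 time stamps, 2 variables.
targets = [[1.0, 2.0], [3.0, 4.0]]
predictions = [[1.5, 2.0], [2.0, 5.0]]
loss = l1_loss(targets, predictions)
print(loss)  # 0.5 + 0.0 + 1.0 + 1.0 = 2.5
```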

4 Experiments

In this section, we conduct extensive experiments on four benchmark datasets for multivariate time series forecasting tasks, and compare the results of the proposed TEGNN model with other baselines. All the data and experiment codes are available online (footnote 1: our codes will be released as well upon the acceptance of this paper).

4.1 Data

We use four benchmark datasets which are publicly available.

Exchange_rate: the exchange rates of eight foreign countries collected from to , collected per day.

Energy contains measurements of different quantities related to appliances energy consumption in a single house for months, collected per minutes.

Nasdaq: the stock prices are selected as the multivariable time series for corporations, collected per minutes.

4.2 Methods for Comparison

The methods in our comparative evaluation are as follows.

  • VAR stands for the well-known vector autoregression model, which has proven to be a useful method for multivariate time series forecasting.

  • CNN-AR stands for classical convolution neural network. We use multi-layer CNN with AR components to perform MTS forecasting tasks.

  • RNN-GRU is the Recurrent Neural Network using GRU cell with AR components.

  • MultiAttention stands for multihead attention components in the famous Transformer model, where multi-head mechanism runs through the scaled dot-product attention multiple times in parallel.

  • LSTNet is a famous MTS forecasting framework which shows great performance by modeling long- and short-term temporal patterns of MTS data.

  • TEGNN stands for our proposed Transfer Entropy Graph Neural Network. We apply multi-layer CNN and k-GNNs to perform MTS forecasting tasks.

  • TEGIN stands for our proposed Transfer Entropy Graph Isomorphism Network where k-GNNs layers are replaced by GIN layers.

  • nTEGNN stands for TEGNN using all-one adjacency matrix instead of Transfer Entropy matrix.

4.3 Metrics

We apply three conventional evaluation metrics to evaluate the performance of different models for multivariate time series prediction: Root Squared Error (RSE), Relative Absolute Error (RAE), and Empirical Correlation Coefficient (CORR). In the definitions of these metrics, a denotes the actual target and p denotes the predicted target.

For the error metrics (RSE or MAE, and RAE), a lower value is better; for the CORR metric, a higher value is better.
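For readers who want to experiment with metrics of this kind, the sketch below implements one common set of definitions: RSE and RAE normalize the squared and absolute errors by the spread of the actual values around their mean, and CORR averages the Pearson correlation over variables. These formulas follow the convention widely used in LSTNet-style evaluations and are an assumption here; they are not confirmed to be the exact definitions behind the reported numbers.

```python
import math


def _flatten(matrix):
    return [v for row in matrix for v in row]


def rse(actual, predicted):
    """Root squared error, relative to the spread of the actual values."""
    a, p = _flatten(actual), _flatten(predicted)
    mean_a = sum(a) / len(a)
    num = math.sqrt(sum((ai - pi) ** 2 for ai, pi in zip(a, p)))
    den = math.sqrt(sum((ai - mean_a) ** 2 for ai in a))
    return num / den


def rae(actual, predicted):
    """Absolute error, relative to the absolute spread of the actual values."""
    a, p = _flatten(actual), _flatten(predicted)
    mean_a = sum(a) / len(a)
    return (sum(abs(ai - pi) for ai, pi in zip(a, p))
            / sum(abs(ai - mean_a) for ai in a))


def corr(actual, predicted):
    """Pearson correlation along time, averaged over variables (columns)."""
    n_vars = len(actual[0])
    total = 0.0
    for i in range(n_vars):
        a = [row[i] for row in actual]
        p = [row[i] for row in predicted]
        ma, mp = sum(a) / len(a), sum(p) / len(p)
        cov = sum((x - ma) * (y - mp) for x, y in zip(a, p))
        norm = math.sqrt(sum((x - ma) ** 2 for x in a)
                         * sum((y - mp) ** 2 for y in p))
        total += cov / norm
    return total / n_vars


# Toy data (assumption): rows are time stamps, columns are variables.
actual = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
predicted = [[1.5, 12.0], [2.5, 22.0], [3.5, 32.0]]
rse_value = rse(actual, predicted)
rae_value = rae(actual, predicted)
corr_value = corr(actual, predicted)
print(rse_value, rae_value, corr_value)
```

In this toy case the predictions are shifted copies of the targets, so the correlation is essentially perfect while the error metrics remain nonzero, which shows why the two kinds of metric are reported together.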

4.4 Experiment Details

We conduct grid search on tunable hyper-parameters on each method over all datasets. Specifically, we set the same grid search range of input window size for each method from {,,…,} if applied. We vary hyper-parameters for each baseline method to achieve their best performance on this task. For RNN-GRU and LSTNet, the hidden dimension of Recurrent and Convolutional layer is chosen from . For LSTNet, the skip-length is chosen from . We adopt dropout layer after each layer, and the dropout rate is set from . We calculate transfer entropy matrix based on train and validation data. For TEGNN, TEGIN, nTEGNN, we set the size of the three convolutional kernels to be respectively and the number of channels of each kernel is in all our models. The hidden dimension of k-GNN layer is chosen from {,,…,}. For TEGIN , the hidden size is chosen from {,,…,}. The Adam algorithm is used to optimize the parameters of our model.

4.5 Main Results

Table 1 summarizes the evaluation results of all the methods on the benchmark datasets with the metrics above. Following the test settings of DBLP:journals/corr/LaiCYL17 , we use each model to predict the time series at several future moments, and thus set horizon = {5, 10, 15} (the horizons listed in Table 1), which means the horizon ranges from 5 to 15 days for forecasting over the Exchange-Rate data, from to hours over the Electricity data, from to minutes over the Energy data, and from to minutes over the Nasdaq data. The best results for each metric on each dataset are set in bold in Table 1.

Methods        Metrics  Exchange rate (horizon)   Energy (horizon)          Nasdaq (horizon)
                        5       10      15        5       10      15        5       10      15
VAR            RSE      0.0065  0.0093  0.0116    3.1628  4.2154  5.1539    0.1706  0.2667  0.3909
               RAE      0.0188  0.0270  0.0339    0.0545  0.0727  0.0889    0.0011  0.0018  0.0026
               CORR     0.9619  0.9470  0.9318    0.9106  0.8482  0.7919    0.9911  0.9273  0.5528
CNN-AR         RSE      0.0063  0.0085  0.0104    2.4286  2.9499  3.5719    0.2110  0.2650  0.2663
               RAE      0.0182  0.0249  0.0303    0.0419  0.0509  0.0616    0.0014  0.0017  0.0017
               CORR     0.9638  0.9490  0.9372    0.9159  0.8618  0.8150    0.9920  0.9919  0.9860
RNN-GRU        MAE      0.0066  0.0092  0.0122    2.7306  3.0590  3.7150    0.2245  0.2313  0.2700
               RAE      0.0192  0.0268  0.0355    0.0471  0.0528  0.0641    0.0015  0.0015  0.0018
               CORR     0.9630  0.9491  0.9323    0.9167  0.8624  0.8106    0.9930  0.9901  0.9877
MultiHead Att  MAE      0.0078  0.0101  0.0119    2.6155  3.2763  3.8457    0.2618  0.2946  0.6177
               RAE      0.0227  0.0294  0.0347    0.0451  0.0565  0.0663    0.0017  0.0019  0.0041
               CORR     0.9630  0.9500  0.9376    0.9178  0.8574  0.8106    0.9899  0.9869  0.9835
LSTNet         MAE      0.0063  0.0085  0.0107    2.2813  3.0951  3.4979    0.1708  0.2511  0.2603
               RAE      0.0184  0.0247  0.0311    0.0393  0.0534  0.0603    0.0011  0.0016  0.0017
               CORR     0.9639  0.9490  0.9373    0.9190  0.8640  0.8216    0.9940  0.9902  0.9872
nTEGNN         MAE      0.0076  0.0100  0.0113    2.8954  3.0549  3.4599    0.1601  0.2174  0.2490
               RAE      0.0221  0.0290  0.0315    0.0499  0.0527  0.0597    0.0010  0.0014  0.0016
               CORR     0.9660  0.9531  0.9425    0.8979  0.8624  0.8155    0.9942  0.9907  0.9879
TEGNN          MAE      0.0060  0.0083  0.0104    2.0773  2.7242  3.3232    0.1549  0.1897  0.2358
               RAE      0.0173  0.0243  0.0302    0.0358  0.0470  0.0573    0.0010  0.0012  0.0015
               CORR     0.9691  0.9548  0.9438    0.9244  0.8673  0.8221    0.9951  0.9922  0.9887
TEGIN          MAE      0.0065  0.0089  0.0108    2.1768  2.8097  3.3572    0.1469  0.1961  0.2361
               RAE      0.0188  0.0259  0.0315    0.0375  0.0485  0.0579    0.0010  0.0013  0.0015
               CORR     0.9690  0.9551  0.9441    0.9204  0.8615  0.8131    0.9955  0.9919  0.9885

Table 1: MTS forecasting results measured by MAE/RAE/CORR score over three datasets.

For each method, we train for 1000 epochs and record the performance of the model that performs best on the validation set according to the RSE or MAE metric. The results show that the proposed TEGNN model performs better than the other baseline models on most of the datasets under these horizon settings. Specifically, at horizons 5, 10 and 15 respectively, TEGNN outperforms the state-of-the-art baseline LSTNet by 9.310%, 24.452%, 9.412% on MAE on the Nasdaq dataset, by 5.978%, 1.619%, 2.894% on RAE on the Exchange_rate dataset, and by 0.586%, 0.382%, 0.061% on CORR on the Energy dataset. This indicates the effectiveness of combining the transfer entropy matrix with a graph neural network for multivariate time series prediction tasks. The LSTNet model shows impressive results when modeling periodic dependency patterns in the data, but is weaker otherwise. Our proposed TEGNN uses the transfer entropy matrix to capture the internal relationships between variables and analyzes the topology formed by variables and relationships through a graph network; it can thus overcome these restrictions and perform well under different horizons on all the datasets.
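The relative improvements quoted above can be checked directly against Table 1. The short sketch below recomputes the MAE gains over LSTNet on the Nasdaq dataset from the tabulated values; the helper name is hypothetical, and small differences in the last digit come from rounding in the table.

```python
def relative_improvement(baseline, ours, higher_is_better=False):
    """Percentage improvement of `ours` over `baseline`."""
    change = (ours - baseline) if higher_is_better else (baseline - ours)
    return 100.0 * change / baseline


# MAE on Nasdaq at horizons 5, 10 and 15, copied from Table 1.
lstnet_mae_nasdaq = [0.1708, 0.2511, 0.2603]
tegnn_mae_nasdaq = [0.1549, 0.1897, 0.2358]

gains = [relative_improvement(b, o)
         for b, o in zip(lstnet_mae_nasdaq, tegnn_mae_nasdaq)]
print([round(g, 3) for g in gains])
```

For a metric where higher is better, such as CORR, the same helper can be called with `higher_is_better=True`.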

Other deep learning baseline models show similar performance. This results from the fine-tuning work on general deep learning methods that followed the appearance of the LSTNet model, and from the suitable hyper-parameters found by grid search over different datasets, both of which enhance these models significantly. We use the following sets of hyperparameters for RNN-GRU, MultiHeadAttention and LSTNet: (hidCNN), (hidRNN), (hidSkip), (windowsize); RNN-GRU: (hidRNN), (highway window) on Exchange_rate dataset, and some fine tuned adjustment over other datasets. TEGNN model sets (hidCNN), (hidGNN1), (hidGNN2), (window size) applying to all datasets and horizons. Compared with these baseline models, our proposed TEGNN model can share the same hyper-parameters among various datasets and situations with robust performance, as the results show.

4.6 Variant Comparison

We replace the k-GNNs module with GIN. The results in Table 1 show that TEGIN performs similarly to TEGNN, which shows that our proposed framework has strong universality and compatibility.
For the ablation study, we also replace the transfer entropy matrix with an all-one matrix in nTEGNN, which assumes that the value to be predicted of a single variable is related to all other variables, so that a complete graph is fed into the k-GNNs layers. Computing the TE matrix took us a week, and the results show that TEGNN outperforms nTEGNN, which indicates the significant role the TE matrix plays in the TEGNN model.

Figure 2: Parameter sensitivity test results. TEGNN shows steady performance under different settings of hidden sizes in GNN layer.

When testing the parameter sensitivity of our model, we evaluate how the hidden size of the GNN component can affect the results. We report the empirical correlation coefficient on Exchange_rate dataset. As can be seen in figure 2, while ranging the hidden size of GNN layers from , the model performance is steady, being relatively insensitive to the hidden dimension parameter.

5 Conclusion

In this paper, we propose a novel deep learning framework (TEGNN) for the task of multivariate time series forecasting. By using CNN with multiple receptive fields, introducing causal prior information characterized by transfer entropy, and adopting graph neural network for feature extraction, the proposed method effectively improved the state-of-the-art results in MTS forecasting on multiple datasets. With in-depth theoretical analysis and experimental verification, we confirm that TEGNN successfully captures the causal relationship among variables and uses graph neural network to select key variables for accurate forecasting.

In the future, there are several promising research directions that deserve more attention and efforts. Firstly, we use transfer entropy to represent causality. In fact, other causal calculation methods can also be tried to make more accurate selection of key variables. Secondly, other time series forecasting methods can be incorporated into the graph neural network to further improve prediction performance.