1 Introduction
Forecasting multivariate time series is a core machine learning task, both in science and in industry (Petropoulos et al., 2020). Rich dependencies and interactions (edges) between the individual time series (nodes) govern how the time series evolve. In the simplest case these could be (linear) correlations; other examples include the road network underlying traffic flows (Wu et al., 2020; Shang et al., 2021), or physical relations such as attraction or repulsion affecting the trajectories of objects in space (Kipf et al., 2018).
Knowledge of the ‘true’ relations can be used to make more accurate predictions of how the time series evolve in the future, e.g., by using graph neural networks (GNNs) (e.g., Kipf & Welling, 2017; Veličković et al., 2018; Hamilton et al., 2017; Gilmer et al., 2017; Scarselli et al., 2009; Klicpera et al., 2019; Bojchevski et al., 2020, 2018). Even more, the graph can reveal fundamental insights into the system described by the time series, and may thus be of value in itself, independent of any improvement in forecasting quality. Therefore, recent works aim at jointly inferring relations between the time series and learning to forecast in an end-to-end manner, sometimes without any prior information about the graph (Wu et al., 2020; Deng & Hooi, 2021). Besides its potential benefits for forecasting quality, inferring a graph among $N$ time series comes at an inherent computational complexity of $\mathcal{O}(N^2)$, which needs to be taken into account when deciding whether to leverage joint graph inference and forecasting. Hence, we consider the following research questions in this paper.
(R1) In which scenarios do joint graph inference and forecasting improve forecasting accuracy? Given the diverse domains and settings of multivariate time series forecasting (e.g., underlying spatial relations of sensors in traffic forecasting, sets of sensors measuring different properties of the same system, etc.) it is possible that graph inference helps the forecasting task more in some use cases.
(R2) How do the existing architectures compare in forecasting performance? Are there certain architectural choices that appear beneficial for forecasting?
(R3) What are the properties of the graphs inferred by the models? Specifically, how consistent are the inferred graphs across different training runs? How (dis)similar are the inferred graphs to the “ground-truth” graphs (when known)?
2 Background
Forecasting with Multivariate Time Series
In time series forecasting we are interested in estimating a future series $x_{t+1:T}$
given its past $x_{1:t}$ and some context information $z_{1:t}$ about the past, where the subscripts index over time. For the multivariate case, we can consider $N$ time series $x^i_t$ at a time $t$. We model the following conditional distribution:

$$p\left(x^i_{t+1:T} \mid x^1_{1:t}, \dots, x^N_{1:t}\right), \qquad (1)$$

where $i = 1, \dots, N$ indexes over time series. Notice that we are conditioning on all $N$ series in order to estimate the series $x^i$.
Time Series Forecasting for Graph-Structured Data
When conditioning on all time series as in Eq. (1), we may benefit from modelling the relations between the different time series. Graphs are an expressive structure to capture such relations. We can define a graph $G = (V, E)$ as a set of nodes $V$ and edges $E$ that relate the nodes. In our case, each time series $x^i$ is associated with a graph node $v_i$. Edges may be given or unknown depending on the dataset; in cases where the underlying graph is latent/unknown, we may jointly infer the graph while estimating a forecasting model. In this work we study the performance of a variety of algorithms under different assumptions on the graph structure (known, unknown, partially known). Note that even in cases where we have “ground-truth” knowledge (e.g., of spatial relations), there may still be additional latent relations which could be discovered by the models.
3 Literature Review
Recent models perform joint graph learning and forecasting on multivariate time series: GTS (“graph for time series”) (Shang et al., 2021), Graph Deviation Network (GDN) (Deng & Hooi, 2021), MTS forecasting with GNNs (MTGNN) (Wu et al., 2020), and Neural Relational Inference (NRI) (Kipf et al., 2018). Here, we briefly introduce these four methods and their differences and commonalities; for a more detailed overview, see Appendix B.
All models can be decomposed into two main components: the graph learning and the forecasting modules. The former outputs an adjacency matrix describing a graph between the nodes (i.e., time series). The latter takes this graph as well as the input time-series window to forecast the next time step(s). Once the adjacency matrix has been obtained from the graph learning module, there are many ways to leverage it for forecasting the time series. The core idea of the models in this study is that the adjacency matrix construction step is differentiable and jointly learned with the forecasting module. Thus, the intuition is that the model will learn graphs which help the forecasting task.
3.1 Graph learning
The goal of the graph learning module is to output an adjacency matrix $A \in \mathbb{R}^{N \times N}$, where each entry $A_{uv}$ denotes the edge weight between nodes $u$ and $v$. Typically, we aim for $A$ to be sparse, which reflects the intuition that there are only relatively few useful relations in the latent graph. Each model first represents each node $u$ by a fixed-size vector $z_u$, followed by a pairwise similarity computation for any pair $(z_u, z_v)$, e.g., using a fully connected neural network or simply the dot product.

Next, the models obtain the adjacency matrix from the pairwise scores. MTGNN and GDN do so by keeping only the top-$k$ highest scores per node. An advantage of this is that, by choosing $k$ appropriately, $A$ is guaranteed to be sparse. On the other hand, the top-$k$ operation is not continuously differentiable, which may pose challenges to end-to-end learning.
NRI and GTS first map the pairwise scores into the range $[0, 1]$ (e.g., via softmax or sigmoid), interpreting them as edge probabilities. The models then use the Gumbel softmax trick (Maddison et al., 2017; Jang et al., 2017) to sample a discrete adjacency matrix from the edge probabilities in a differentiable way (though the gradients are biased); a downside is that extra steps are needed to obtain a sparse graph, e.g., via regularization.
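The relaxed sampling step can be sketched as follows, assuming per-pair two-class logits (class 0 = “no edge”, class 1 = “edge”); the straight-through discretization used in practice is omitted, and the function name is hypothetical:

```python
import numpy as np

def gumbel_softmax_edges(logits: np.ndarray, tau: float = 0.5, seed: int = 0) -> np.ndarray:
    """Relaxed ~binary edge samples from per-edge (no-edge, edge) logits.

    `logits` has shape (N, N, 2). Returns the relaxed probability of the
    "edge" class per node pair; as tau -> 0 the samples approach 0/1.
    """
    rng = np.random.default_rng(seed)
    u = rng.random(logits.shape)
    gumbel = -np.log(-np.log(u + 1e-10) + 1e-10)   # Gumbel(0, 1) noise
    y = (logits + gumbel) / tau
    y = np.exp(y - y.max(axis=-1, keepdims=True))  # numerically stable softmax
    y = y / y.sum(axis=-1, keepdims=True)
    return y[..., 1]                               # relaxed edge indicator

logits = np.zeros((4, 4, 2))          # uninformative scores -> ~50% edges
soft_adj = gumbel_softmax_edges(logits)
assert soft_adj.shape == (4, 4)
assert ((soft_adj >= 0) & (soft_adj <= 1)).all()
```

Because every node pair receives a (relaxed) sample, the resulting matrix is dense unless sparsity is encouraged elsewhere, e.g., by the regularization mentioned above.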
Moreover, the models can be broadly split into two groups according to how they compute the fixed-size representations per node: MTGNN and GDN simply learn these representations as node embeddings; NRI and GTS, on the other hand, compute the vectors based on the time series itself. That is, they apply a (shared) function to each time series, mapping it into a fixed-size vector. While NRI dynamically produces the representations per individual window, GTS uses the whole training time series of each node. The former has the advantage of being more flexible, though it is more expensive, since we need a $B \times N \times N$ tensor to store the individual adjacency matrices, where $B$ is the batch size. The graph learned by GTS, in contrast, is global, i.e., shared for all windows. It is thus more efficient yet less flexible, as the model cannot adjust the graph for changing inputs at inference time. Moreover, in its current implementation, GTS's number of parameters grows linearly with the length of the training time series (though this could in principle be resolved via dilated convolutions or pooling).
3.2 Graph-based forecasting
There are many existing models for incorporating graph structure in the forecasting task (e.g., Li et al., 2018; Seo et al., 2017; Chen et al., 2018; Zhao et al., 2020; Zhu et al., 2020; Guo et al., 2019; Panagopoulos et al., 2020; Wang et al., 2018). Each of the models in this study has its own way of forecasting the time series given the input window and the adjacency matrix constructed by the graph learning module. For instance, MTGNN interleaves temporal convolution layers with graph convolution layers, and GTS uses a Diffusion Convolutional Recurrent Neural Network (DCRNN) (Li et al., 2018), in which the hidden states of each node are diffused via graph convolutions at each time step. Again, the core idea is that the adjacency matrix used in the graph-based forecasting is itself constructed in a differentiable way and can thus be adjusted by the model to improve forecasting results.

4 Experiments
Table 1: Relative change (%) in MAE@12 when providing the ground-truth graph to the forecasting modules of GTS, MTGNN, NRI, and GDN (rows: METR-LA, PEMS-BAY, Diffusion, DAG). [Table values not recoverable.]

Table 2: Relative change (%) in MAE@12 when providing a random graph (GTS, MTGNN, NRI, GDN) or no graph (GTS, MTGNN) to the forecasting modules (rows: METR-LA, PEMS-BAY, WADI, SWaT, Electricity, Solar Energy, Traffic, Exchange Rate, DAG, Diffusion). [Table values not recoverable.]
To address our research questions (R1)–(R3), we perform experiments on real-world and synthetic datasets. We repeat all runs five times and report the average; error bars are provided in Table 4 (Appendix).
4.1 Datasets
We briefly describe here the datasets that we use; more details can be found in Appendix C. We scale each time series to have zero mean and unit variance, or to have range $[0, 1]$ (only SWaT and WADI, as in (Deng & Hooi, 2021)). For training and evaluation we compute MAE on the original scale.

PEMS-BAY and METR-LA (Li et al., 2018) are widely used traffic datasets for which we do have knowledge of the underlying graph. To construct the sensor graph, we computed the pairwise road network distances between sensors and built the adjacency matrix using a thresholded Gaussian kernel.
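A minimal NumPy sketch of such a thresholded Gaussian kernel construction (the exact bandwidth and threshold are dataset-specific choices; here we assume $\sigma$ equal to the standard deviation of the distances and a hypothetical threshold $\kappa = 0.1$):

```python
import numpy as np

def gaussian_kernel_graph(dist: np.ndarray, kappa: float = 0.1) -> np.ndarray:
    """Adjacency from pairwise distances via a thresholded Gaussian kernel.

    A_ij = exp(-d_ij^2 / sigma^2) if above the threshold `kappa`, else 0.
    """
    sigma = dist.std()                       # a common bandwidth choice
    adj = np.exp(-(dist ** 2) / (sigma ** 2))
    adj[adj < kappa] = 0.0                   # sparsify small weights
    np.fill_diagonal(adj, 0.0)               # no self-loops
    return adj

rng = np.random.default_rng(0)
pts = rng.random((10, 2))                    # hypothetical sensor coordinates
dist = np.linalg.norm(pts[:, None] - pts[None], axis=-1)
adj = gaussian_kernel_graph(dist)
assert np.allclose(adj, adj.T)               # symmetric input -> symmetric graph
assert ((adj == 0) | (adj >= 0.1)).all()     # all surviving weights pass the threshold
```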
We use a range of other multivariate datasets for which no graph structure is known: Electricity (archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014), Solar-energy (www.nrel.gov/grid/solar-power-data.html), Exchange-rate, and Traffic (pems.dot.ca.gov); all four are available at github.com/laiguokun/multivariate-time-series-data. Further, SWaT (Mathur & Tippenhauer, 2016) and WADI (Ahmed et al., 2017) are datasets of sensors measuring water-treatment plants. Their test splits contain annotated anomalies where the creators tampered with the water-treatment systems. SWaT and WADI were therefore originally proposed as anomaly detection datasets (and, e.g., used in the GDN paper); however, since the respective training sets are free of anomalies, we use them for our forecasting experiments.
Synthetic datasets. To complement the real-world datasets, we create two synthetic datasets that start from a known graph and are constructed such that the graph governs the interactions between the time series. This way, we know that the graph is relevant for forecasting. We create the Diffusion dataset by using Personalized PageRank (PPR) (Page et al., 1999) to diffuse the multivariate time series. We create the DAG dataset from a directed acyclic graph (DAG), making each child dimension a weighted combination of its parent dimensions.
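The DAG construction could be sketched as follows (all parameters — number of nodes, edge probability, noise level — are hypothetical; root nodes are seeded with random sinusoids):

```python
import numpy as np

def make_dag_dataset(n_nodes=8, n_steps=500, parent_prob=0.4, noise=0.05, seed=0):
    """Toy DAG dataset: roots are random sinusoids, every other series is a
    weighted combination of its parents' series plus Gaussian noise.

    Node ordering 0..n-1 is a topological order; edges only go low -> high.
    """
    rng = np.random.default_rng(seed)
    series = np.zeros((n_steps, n_nodes))
    t = np.arange(n_steps)
    # upper-triangular boolean adjacency => directed acyclic graph
    adj = np.triu(rng.random((n_nodes, n_nodes)) < parent_prob, k=1)
    for v in range(n_nodes):
        parents = np.flatnonzero(adj[:, v])
        if len(parents) == 0:
            # root node: sinusoid with a random frequency
            freq = rng.uniform(0.5, 2.0)
            series[:, v] = np.sin(2 * np.pi * freq * t / n_steps)
        else:
            w = rng.random(len(parents))
            w /= w.sum()                       # convex combination of parents
            series[:, v] = series[:, parents] @ w
        series[:, v] += noise * rng.standard_normal(n_steps)
    return series, adj

series, adj = make_dag_dataset()
assert series.shape == (500, 8)
assert not adj[np.tril_indices(8)].any()       # acyclic by construction
```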
4.2 Results
(R1). Here we analyze the forecasting results on the different datasets at horizons 3, 6, and 12. For reference, we also add a vanilla LSTM (Hochreiter & Schmidhuber, 1997) baseline that jointly forecasts all time series, as well as LSTM-U, which consists of independent univariate LSTMs. Essentially, the LSTM uses information from all time series, though it lacks typical GNN properties such as permutation equivariance and does not leverage sparsity. LSTM-U is on the other end of the spectrum and simply treats all time series as completely independent. In Table 4 (appendix) we present the results.
On the popular traffic datasets METR-LA and PEMS-BAY, the GNN models generally dominate the LSTMs. These datasets have known spatial relations among the time series, so this comes as no surprise. NRI's results on METR-LA are quite poor, which we attribute to the relatively large number of nodes and to the fact that the underlying relations are static, while NRI predicts a graph per window.
On WADI, interestingly, LSTM-U performs on par with MTGNN. The remaining gap to GTS is relatively small and can potentially be explained by GTS's more sophisticated forecasting procedure. This indicates that on WADI, where there is no “straightforward” spatial graph between the nodes, the GNN-based models struggle to find useful relations in the data – or that there are no useful relations in the data to begin with. Similarly, on SWaT, LSTM outperforms all GNN-based models except GTS.
On the synthetic diffusion-based dataset, GTS achieves roughly 50% lower mean absolute error than LSTM. We attribute this to the fact that the data-generating process (graph diffusion) matches well with the DCRNN architecture used by GTS in the forecasting module. Further, note that on Traffic, GTS ran out of memory on a 16GB-VRAM GPU for batch sizes larger than 1, and therefore did not finish within 24 hours. NRI, which is even more computationally expensive, has additional missing values.
In summary, the GNN-based models' edge over the non-graph baselines tends to be largest for datasets with an underlying spatial graph (the traffic datasets, Electricity, Solar Energy), and smaller for datasets where the relations are expected to be more subtle (WADI, SWaT). Future work could compare the GNN-based models to state-of-the-art non-graph forecasting methods in a benchmark study.
(R2). Next, we perform ablation experiments on the GNN-based models to study their behavior when their graph-learning module is removed. For the forecasting modules, we either provide the ground-truth graph (where known), provide a sparse random graph, or provide no graph. We compare results to the “vanilla” settings of the models, computing the relative change in MAE at horizon 12 in percent.
In Table 1 we show the results of providing the ground-truth graph to the forecasting modules. Strikingly, MTGNN's performance improves substantially, with almost 10% lower MAE on METR-LA. On PEMS-BAY and METR-LA, MTGNN's results are then on par with GTS's. This suggests that MTGNN's forecasting module performs well, and that GTS's graph-learning module may be advantageous. GDN also benefits from the ground truth, though the effect is less pronounced. Interestingly, providing the “true” graph to GTS leads to a slight performance drop on all but one dataset, indicating that the model's graph-learning module is effective at improving forecasting.
In Table 2, we see the results of providing a (sparse) random Erdős–Rényi graph to the models (left), or completely disabling the graph processing in the forecasting modules (right). For the random graphs we set the edge probability such that the expected degree is 30, 10, or 3. An interesting insight is that for GTS, using a random graph has little to moderate effect on most datasets, while using no graph at all leads to a strong performance drop, indicating that GTS's forecasting module greatly benefits from the sparsity of graphs.
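Sampling such a random graph with a given expected degree amounts to choosing the edge probability $p = d / (N - 1)$; a sketch:

```python
import numpy as np

def erdos_renyi(n_nodes: int, expected_degree: float, seed: int = 0) -> np.ndarray:
    """Directed Erdős–Rényi graph whose expected out-degree is `expected_degree`.

    Setting p = expected_degree / (n_nodes - 1) makes each node's expected
    number of neighbours (excluding itself) equal to `expected_degree`.
    """
    rng = np.random.default_rng(seed)
    p = expected_degree / (n_nodes - 1)
    adj = (rng.random((n_nodes, n_nodes)) < p).astype(float)
    np.fill_diagonal(adj, 0.0)               # no self-loops
    return adj

adj = erdos_renyi(200, expected_degree=10)
# empirical mean degree should be close to the requested expected degree
assert abs(adj.sum(axis=1).mean() - 10) < 2
```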
Remarkably, for MTGNN we see relatively little effect when using a random graph or even no graph at all. We hypothesize that this is due to MTGNN's way of constructing the adjacency matrix. It uses a kNN-style approach, which has sparse gradients. Further, the edge weights are the result of applying $\tanh$ to the pairwise scores, which may lead to vanishing gradients. Thus, the node embeddings may receive only very little training signal. In contrast, GDN, which also uses node embeddings in the graph learning module, additionally utilizes the node embeddings in the forecasting task. This may be a way to address the issue in MTGNN. Another approach would be to replace the kNN graph construction with differentiable sampling via the Gumbel softmax trick (as in GTS and NRI). This would be an interesting experiment to further investigate whether the strategy of parameterizing the graph based on the time series, employed by NRI and GTS, is generally advantageous over node-embedding-based approaches.

Table 3: Correlation of learned edge scores across training runs (Avg. corr.) and with the ground-truth graph (Avg. corr. GT).

Dataset  Model  Avg. corr.  Avg. corr. GT
METR-LA  GDN  0.356  0.212
METR-LA  MTGNN  0.001  0.015
METR-LA  GTS  0.264  0.046
METR-LA  GTS w/ reg.  0.493  0.523
PEMS-BAY  GDN  0.287  0.185
PEMS-BAY  MTGNN  0.000  0.008
PEMS-BAY  GTS  0.164  0.010
PEMS-BAY  GTS w/ reg.  0.704  0.684
(R3). Finally, we measure how consistent the learned edge scores are across training runs, as well as how similar the learned adjacency matrices are to the ground-truth adjacency matrices. For this we measure the correlation of edge scores among reruns and with the ground-truth graph. Intuitively, a high correlation means that the model consistently assigns large/small scores to the same node pairs. A subset of the results is shown in Table 3; see Table 5 (appendix) for more details. We can see that for GDN and GTS, the learned adjacency matrices tend to be moderately similar across training runs. Interestingly, only GDN's learned graphs have a nontrivial correlation with the ground truth. This indicates that the models learn a graph which is useful for forecasting but which need not have much in common with the “true” (e.g., spatial) graph. Note that for these experiments we disabled GTS's regularization towards the ground-truth graph. When enabling this loss (GTS w/ reg.) we find that, as expected, the learned graphs strongly correlate with the input graph.
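The consistency metric described above can be sketched as the average pairwise Pearson correlation between flattened edge-score matrices (ignoring self-loops); the function name and details are illustrative:

```python
import numpy as np

def avg_pairwise_correlation(score_matrices):
    """Average Pearson correlation between the flattened (N, N) edge-score
    matrices of different training runs, using off-diagonal entries only."""
    n = score_matrices[0].shape[0]
    mask = ~np.eye(n, dtype=bool)            # ignore self-loop scores
    flat = np.stack([m[mask] for m in score_matrices])
    corr = np.corrcoef(flat)                 # (runs x runs) correlation matrix
    iu = np.triu_indices(len(score_matrices), k=1)
    return float(corr[iu].mean())            # mean over distinct run pairs

rng = np.random.default_rng(0)
m = rng.random((10, 10))
assert np.isclose(avg_pairwise_correlation([m, m.copy()]), 1.0)  # identical runs
```

Comparing against the ground-truth graph works the same way, correlating each run's scores with the flattened ground-truth adjacency matrix.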
5 Conclusion
We present a study of recent models performing joint graph inference and forecasting. We highlight key commonalities and differences among the architectures. In our experiments, we compare the forecasting results of the models and study properties of the different graph-learning modules. For instance, we find MTGNN to be insensitive to whether its graph-learning module is active, though it greatly benefits from access to a ground-truth graph. In general, learning a latent graph is a challenging problem; improvements in terms of expressiveness and computational efficiency could lead to broader applicability. We highlight potential ways of combining the existing architectures.
References
 Ahmed et al. (2017) Ahmed, C. M., Palleti, V. R., and Mathur, A. P. Wadi: A water distribution testbed for research in the design of secure cyber physical systems. In Proceedings of the 3rd International Workshop on CyberPhysical Systems for Smart Water Networks, CySWATER ’17, pp. 25–28, New York, NY, USA, 2017. Association for Computing Machinery. ISBN 9781450349758. doi: 10.1145/3055366.3055375.
 Bojchevski et al. (2018) Bojchevski, A., Shchur, O., Zügner, D., and Günnemann, S. Netgan: Generating graphs via random walks. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 1015, 2018, pp. 609–618, 2018.
 Bojchevski et al. (2020) Bojchevski, A., Klicpera, J., Perozzi, B., Kapoor, A., Blais, M., Rózemberczki, B., Lukasik, M., and Günnemann, S. Scaling graph neural networks with approximate pagerank. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2020. ACM.
 Chen et al. (2018) Chen, J., Xu, X., Wu, Y., and Zheng, H. Gclstm: Graph convolution embedded lstm for dynamic link prediction. arXiv preprint arXiv:1812.04206, 2018.

 Deng & Hooi (2021) Deng, A. and Hooi, B. Graph neural network-based anomaly detection in multivariate time series. In Proceedings of the 35th AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, pp. 2–9, 2021.
 Gilmer et al. (2017) Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. Neural message passing for quantum chemistry. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1263–1272. PMLR, 06–11 Aug 2017.
 Guo et al. (2019) Guo, S., Lin, Y., Feng, N., Song, C., and Wan, H. Attention based spatialtemporal graph convolutional networks for traffic flow forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 922–929, 2019.
 Hamilton et al. (2017) Hamilton, W., Ying, Z., and Leskovec, J. Inductive representation learning on large graphs. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
 Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long shortterm memory. Neural computation, 9(8):1735–1780, 1997.
 Jang et al. (2017) Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-Softmax. In International Conference on Learning Representations, 2017.
 Kipf et al. (2018) Kipf, T., Fetaya, E., Wang, K.C., Welling, M., and Zemel, R. Neural relational inference for interacting systems. In International Conference on Machine Learning, pp. 2688–2697. PMLR, 2018.
 Kipf & Welling (2017) Kipf, T. N. and Welling, M. Semisupervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.
 Klicpera et al. (2019) Klicpera, J., Bojchevski, A., and Günnemann, S. Predict then propagate: Graph neural networks meet personalized pagerank. In International Conference on Learning Representations (ICLR), 2019.
 Li et al. (2018) Li, Y., Yu, R., Shahabi, C., and Liu, Y. Diffusion convolutional recurrent neural network: Datadriven traffic forecasting. In International Conference on Learning Representations, 2018.

 Maddison et al. (2017) Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations, 2017.
 Mathur & Tippenhauer (2016) Mathur, A. P. and Tippenhauer, N. O. SWaT: a water treatment testbed for research and training on ICS security. In 2016 International Workshop on Cyber-physical Systems for Smart Water Networks (CySWater), pp. 31–36, 2016. doi: 10.1109/CySWater.2016.7469060.
 Page et al. (1999) Page, L., Brin, S., Motwani, R., and Winograd, T. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
 Panagopoulos et al. (2020) Panagopoulos, G., Nikolentzos, G., and Vazirgiannis, M. Transfer graph neural networks for pandemic forecasting. 2020.
 Petropoulos et al. (2020) Petropoulos, F., Apiletti, D., Assimakopoulos, V., Babai, M. Z., Barrow, D. K., Bergmeir, C., Bessa, R. J., Boylan, J. E., Browell, J., Carnevale, C., Castle, J. L., Cirillo, P., Clements, M. P., Cordeiro, C., Oliveira, F. L. C., Baets, S. D., Dokumentov, A., Fiszeder, P., Franses, P. H., Gilliland, M., Gönül, M. S., Goodwin, P., Grossi, L., GrushkaCockayne, Y., Guidolin, M., Guidolin, M., Gunter, U., Guo, X., Guseo, R., Harvey, N., Hendry, D. F., Hollyman, R., Januschowski, T., Jeon, J., Jose, V. R. R., Kang, Y., Koehler, A. B., Kolassa, S., Kourentzes, N., Leva, S., Li, F., Litsiou, K., Makridakis, S., Martinez, A. B., Meeran, S., Modis, T., Nikolopoulos, K., Önkal, D., Paccagnini, A., Panapakidis, I., Pavía, J. M., Pedio, M., Pedregal, D. J., Pinson, P., Ramos, P., Rapach, D. E., Reade, J. J., RostamiTabar, B., Rubaszek, M., Sermpinis, G., Shang, H. L., Spiliotis, E., Syntetos, A. A., Talagala, P. D., Talagala, T. S., Tashman, L., Thomakos, D., Thorarinsdottir, T., Todini, E., Arenas, J. R. T., Wang, X., Winkler, R. L., Yusupova, A., and Ziel, F. Forecasting: theory and practice. arXiv preprint arXiv:2012.03854, 2020.
 Scarselli et al. (2009) Scarselli, F., Gori, M., Tsoi, A. C., Hagenbuchner, M., and Monfardini, G. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2009. doi: 10.1109/TNN.2008.2005605.
 Seo et al. (2017) Seo, Y., Defferrard, M., Vandergheynst, P., and Bresson, X. Structured sequence modeling with graph convolutional recurrent networks. In International Conference on Neural Information Processing (ICONIP), 2017.
 Shang et al. (2021) Shang, C., Chen, J., and Bi, J. Discrete graph structure learning for forecasting multiple time series. In International Conference on Learning Representations, 2021.
 Veličković et al. (2018) Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. Graph attention networks. In International Conference on Learning Representations, 2018.
 Wang et al. (2018) Wang, B., Luo, X., Zhang, F., Yuan, B., Bertozzi, A. L., and Brantingham, P. J. GraphBased Deep Modeling and Real Time Forecasting of Sparse SpatioTemporal Data. arXiv eprints, April 2018.
 Wu et al. (2020) Wu, Z., Pan, S., Long, G., Jiang, J., Chang, X., and Zhang, C. Connecting the dots: Multivariate time series forecasting with graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 753–763, 2020.
 Zhao et al. (2020) Zhao, L., Song, Y., Zhang, C., Liu, Y., Wang, P., Lin, T., Deng, M., and Li, H. Tgcn: A temporal graph convolutional network for traffic prediction. IEEE Transactions on Intelligent Transportation Systems, 21(9):3848–3858, 2020. doi: 10.1109/TITS.2019.2935152.
 Zhu et al. (2020) Zhu, J., Song, Y., Zhao, L., and Li, H. A3tgcn: attention temporal graph convolutional network for traffic forecasting. arXiv preprint arXiv:2006.11583, 2020.
Appendix A Additional results
In Table 4, we provide the results on the forecasting task. In Table 5, we provide additional correlation results of the learned graphs.
Model  Dataset  Avg. corr.  Avg. corr. GT 

NRI  METRLA  0.44  0.24 
WADI  0.11  n/a  
SWaT  0.10  n/a  
Exchange Rate  0.68  n/a  
DAG  0.64  0.00  
Diffusion  0.85  0.02  
GDN  METRLA  0.36  0.21 
PEMSBAY  0.29  0.19  
WADI  0.13  n/a  
SWaT  0.25  n/a  
Electricity  0.17  n/a  
Solar Energy  0.15  n/a  
Traffic  0.10  n/a  
Exchange Rate  0.44  n/a  
DAG  0.22  0.04  
Diffusion  0.68  0.00  
MTGNN  METRLA  0.00  0.01 
PEMSBAY  0.00  0.01  
WADI  0.00  n/a  
SWaT  0.01  n/a  
Electricity  0.00  n/a  
Solar Energy  0.00  n/a  
Traffic  0.00  n/a  
Exchange Rate  0.15  n/a  
DAG  0.00  0.00  
Diffusion  0.00  0.00  
GTS  METRLA  0.26  0.05 
PEMSBAY  0.16  0.01  
WADI  0.46  n/a  
SWaT  0.51  n/a  
Solar Energy  0.02  n/a  
Exchange Rate  0.38  n/a  
DAG  0.46  0.02  
Diffusion  0.01  0.00 
Appendix B Model details
B.1 Graph for Time Series (GTS)
GTS (“graph for time series”) (Shang et al., 2021) is a recent model aiming to jointly learn a latent graph in the time series and use it for MTS forecasting. The model consists of two main components: graph learning, and graphbased forecasting.
Graph learning. The graph learning module first maps the training partition of each time series $i$ to a fixed-size vector representation $z_i$ via a 1D-convolutional neural network. Then, each pair $(z_i, z_j)$ is processed by an MLP to output the probability $\theta_{ij}$ of an edge between $i$ and $j$:

$$\theta_{ij} = \sigma\left(\mathrm{MLP}\left(z_i \,\|\, z_j\right)\right),$$

where $\sigma$ denotes the logistic sigmoid function and $\|$ denotes vector concatenation. Finally, a discrete adjacency matrix is obtained via element-wise, differentiable sampling, i.e., $A_{ij} \sim \mathrm{Bernoulli}(\theta_{ij})$, using the Gumbel softmax trick (Jang et al., 2017; Maddison et al., 2017). Note that GTS uses the complete training partition to parameterize the adjacency matrix at each batch. This means that (i) the model can use information from the whole (training) time series at training time; (ii) the model cannot adjust the learned graph at test time; (iii) the number of parameters grows linearly with the length of the input time series, which could be improved by adding dilation or pooling to the convolutional encoder.

Forecasting. The forecasting module, which uses the graph produced by the graph learning module, is a Diffusion Convolutional RNN (DCRNN) (Li et al., 2018). A DCRNN updates hidden states collectively for all series via a graph convolution, which replaces the usual multiplication with a weight matrix. Thus, at each time step
$t$, the RNN uses the learnt graph to locally average the hidden representations of time series in their graph neighborhood. The diffusion convolution takes the form

$$\mathbf{H}' = \sum_{k=0}^{K} \left( \left(\mathbf{D}_O^{-1}\mathbf{A}\right)^{k} \mathbf{H}\, \mathbf{W}_{k,1} + \left(\mathbf{D}_I^{-1}\mathbf{A}^{\top}\right)^{k} \mathbf{H}\, \mathbf{W}_{k,2} \right),$$

where $\mathbf{D}_O$ and $\mathbf{D}_I$ are diagonal matrices whose entries are the out- and in-degrees of the nodes in $\mathbf{A}$, respectively, and $\mathbf{W}_{k,1}$, $\mathbf{W}_{k,2}$ are learnable weight matrices. GTS uses a small maximum diffusion step $K$.
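A minimal NumPy sketch of this bidirectional diffusion step; for brevity we use scalar weights per hop instead of the full weight matrices (all names are illustrative):

```python
import numpy as np

def diffusion_conv(x, adj, weights_out, weights_in):
    """One diffusion convolution over K hops (scalar per-hop weights for
    simplicity; DCRNN uses weight matrices on the feature dimension).

    x: (N, F) node features; adj: (N, N) adjacency matrix.
    """
    d_out = adj.sum(axis=1, keepdims=True).clip(min=1e-8)  # out-degrees
    d_in = adj.sum(axis=0, keepdims=True).clip(min=1e-8)   # in-degrees
    p_fwd = adj / d_out          # out-degree-normalised transition matrix
    p_bwd = adj.T / d_in.T       # in-degree-normalised reverse transitions
    out = np.zeros_like(x)
    h_fwd, h_bwd = x.copy(), x.copy()
    for w_o, w_i in zip(weights_out, weights_in):  # hops k = 0, 1, ..., K
        out = out + w_o * h_fwd + w_i * h_bwd
        h_fwd = p_fwd @ h_fwd    # diffuse one hop forward
        h_bwd = p_bwd @ h_bwd    # diffuse one hop backward
    return out

rng = np.random.default_rng(0)
adj = (rng.random((6, 6)) < 0.5).astype(float)
x = rng.standard_normal((6, 4))
out = diffusion_conv(x, adj, weights_out=[0.5, 0.3], weights_in=[0.5, 0.3])
assert out.shape == (6, 4)
```

In the DCRNN, this operation replaces every matrix multiplication inside the GRU-style recurrent update.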
Regularization. The authors propose to incorporate potential a-priori knowledge about the ground-truth graph via a regularization loss in the form of an element-wise binary cross-entropy loss between the learned and the prior graph.
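A sketch of such a regularizer, assuming a 0/1 prior adjacency matrix and learned edge probabilities $\theta$ (function name hypothetical):

```python
import numpy as np

def graph_prior_bce(edge_probs: np.ndarray, prior_adj: np.ndarray,
                    eps: float = 1e-8) -> float:
    """Element-wise binary cross-entropy between learned edge probabilities
    and a 0/1 prior adjacency, used as a regulariser on the learned graph."""
    p = np.clip(edge_probs, eps, 1 - eps)    # avoid log(0)
    return float(-(prior_adj * np.log(p)
                   + (1 - prior_adj) * np.log(1 - p)).mean())

prior = np.array([[0.0, 1.0], [1.0, 0.0]])
assert graph_prior_bce(prior, prior) < 1e-6  # perfect agreement -> ~0 loss
```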
B.2 MTS Forecasting with GNNs (MTGNN)
Like GTS, MTGNN (Wu et al., 2020) also consists of a graph learning and forecasting module, though the implementations of these modules are different.
Graph learning. In contrast to GTS, which parameterizes the representation of a time series with a neural network applied to the input time series, MTGNN learns two embedding vectors per node (i.e., time series), i.e., two embedding matrices $M_1, M_2 \in \mathbb{R}^{N \times d}$. Pairwise scores are computed as

$$A = \mathrm{ReLU}\left(\tanh\left(\alpha \left(M_1 M_2^\top - M_2 M_1^\top\right)\right)\right),$$

where the formulation ensures that $A$ is asymmetric: if $A_{uv}$ is positive, $A_{vu}$ is zero. Finally, only the top-$k$ scores per row are kept to ensure sparsity of the adjacency matrix. The node embeddings are trained in an end-to-end fashion on the forecasting task.
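This construction can be sketched as follows; the antisymmetry of the pre-activation scores guarantees that at most one direction of each pair survives the ReLU:

```python
import numpy as np

def mtgnn_adjacency(m1, m2, alpha=3.0, k=3):
    """MTGNN-style directed adjacency from two node-embedding matrices.

    The score matrix is antisymmetric, so after ReLU at most one of the
    entries (u, v) / (v, u) is nonzero; top-k per row then sparsifies.
    """
    scores = np.tanh(alpha * (m1 @ m2.T - m2 @ m1.T))
    adj = np.maximum(scores, 0.0)               # ReLU keeps one direction only
    kth = np.sort(adj, axis=1)[:, -k][:, None]  # k-th largest value per row
    adj = np.where(adj >= kth, adj, 0.0)        # zero everything below it
    return adj

rng = np.random.default_rng(0)
m1, m2 = rng.standard_normal((10, 4)), rng.standard_normal((10, 4))
adj = mtgnn_adjacency(m1, m2)
assert np.all(adj * adj.T == 0)                 # at most one direction per pair
assert ((adj != 0).sum(axis=1) <= 3).all()      # at most k edges per row
```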
Forecasting. The main difference to the RNNbased forecasting module in GTS is that MTGNN uses temporal convolutions combined with graph convolution layers. MTGNN stacks three blocks of interchanging inceptionstyle temporal convolution layers and graph convolution layers to predict the next timestep(s) of the timeseries.
B.3 Graph Deviation Network (GDN)
Graph Deviation Network (GDN) (Deng & Hooi, 2021) is a recent model aimed at anomaly detection in multivariate timeseries. The model is trained on MTS forecasting, and anomalies are flagged when the predicted value deviates strongly from the observed value. Since the model is essentially a forecasting method by construction, we chose to include it as a baseline in this work.
Graph learning. Similar to MTGNN, GDN infers the graph by learning an embedding per node. Specifically, the model builds a kNN graph where the similarity metric is the cosine between pairs of embeddings.
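A sketch of this kNN construction on cosine similarities (function name and parameters illustrative):

```python
import numpy as np

def cosine_knn_graph(emb: np.ndarray, k: int) -> np.ndarray:
    """Directed kNN graph on node embeddings under cosine similarity."""
    normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = normed @ normed.T                  # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)           # exclude self-loops
    adj = np.zeros_like(sim)
    neighbours = np.argsort(-sim, axis=1)[:, :k]
    rows = np.repeat(np.arange(emb.shape[0]), k)
    adj[rows, neighbours.ravel()] = 1.0
    return adj

rng = np.random.default_rng(0)
adj = cosine_knn_graph(rng.standard_normal((12, 5)), k=4)
assert (adj.sum(axis=1) == 4).all()          # exactly k neighbours per node
assert np.trace(adj) == 0                    # no self-loops
```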
Forecasting. The forecasting module is based on the Graph Attention Network (GAT) (Veličković et al., 2018) architecture:

$$z_i = \mathrm{ReLU}\Big(\alpha_{i,i}\,\mathbf{W} x_i + \sum_{j \in \mathcal{N}(i)} \alpha_{i,j}\,\mathbf{W} x_j\Big),$$

where $x_i$ is the input time-series window of node $i$, $\mathbf{W}$ is a learned weight matrix, and the $\alpha_{i,j}$ are attention scores, computed as follows:

$$g_i = v_i \oplus \mathbf{W} x_i, \qquad \alpha_{i,j} = \operatorname{softmax}_j\big(\mathrm{LeakyReLU}\big(a^\top (g_i \oplus g_j)\big)\big),$$

where $v_i$ is the embedding of node $i$, $\oplus$ denotes concatenation, and $a$ is a learned vector. The next predicted value(s) for all nodes are predicted jointly by a stack of fully connected layers $f_\theta$:

$$\hat{x} = f_\theta\big(\left[v_1 \circ z_1, \dots, v_N \circ z_N\right]\big) \in \mathbb{R}^{N \times w},$$

where $w$ is the number of predicted time steps.
B.4 Neural Relational Inference (NRI)
The Neural Relational Inference (NRI) (Kipf et al., 2018) model assumes a slightly different setting than GTS, MTGNN, and GDN. Instead of learning a global, static graph over the whole time series, NRI infers a graph per input window. While this setting is more flexible, it comes at the drawback of an inherent $\mathcal{O}(B \cdot N^2)$ memory complexity, where $B$ is the batch size. Thus, NRI can only realistically scale to graphs with a moderate number of nodes. NRI is a VAE-based architecture consisting of an encoder module predicting edge probabilities and a decoder module performing the forecasting.
Graph learning. The encoder module alternates neural networks operating on node and edge representations, respectively:

$$h_j = f_{\mathrm{emb}}\left(x_j\right), \qquad h_{(i,j)} = f_e^1\left(\left[h_i, h_j\right]\right), \qquad \tilde{h}_j = f_v\Big(\textstyle\sum_{i \neq j} h_{(i,j)}\Big), \qquad \tilde{h}_{(i,j)} = f_e^2\left(\left[\tilde{h}_i, \tilde{h}_j\right]\right).$$

Finally, the edge-type posterior is $q_\phi(z_{ij} \mid x) = \mathrm{softmax}\big(\tilde{h}_{(i,j)}\big)$, where one edge type can be hard-coded to denote ‘no edge’. Similar to GTS, NRI samples a discrete graph using the Gumbel softmax trick.
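A shape-level sketch of this encoder, with random linear maps standing in for the trained MLPs (all details beyond the message-passing pattern are illustrative; self-message exclusion is omitted for brevity):

```python
import numpy as np

def nri_encoder_logits(x, d=16, n_edge_types=2, seed=0):
    """Node -> edge -> node -> edge message passing over the fully connected
    graph, producing edge-type logits per node pair (shapes only)."""
    rng = np.random.default_rng(seed)

    def mlp(d_in, d_out):
        # stand-in for a trained MLP: random linear map + ReLU
        w = 0.1 * rng.standard_normal((d_in, d_out))
        return lambda h: np.maximum(h @ w, 0.0)

    n = x.shape[0]
    f_emb, f_e1 = mlp(x.shape[1], d), mlp(2 * d, d)
    f_v, f_e2 = mlp(d, d), mlp(2 * d, n_edge_types)

    def all_pairs(h):  # concatenate h_i and h_j for every pair (i, j)
        return np.concatenate(
            [np.repeat(h[:, None], n, axis=1),
             np.repeat(h[None, :], n, axis=0)], axis=-1)

    h = f_emb(x)                       # per-node embeddings of the windows
    h_edge = f_e1(all_pairs(h))        # messages on every edge (i, j)
    h_node = f_v(h_edge.sum(axis=0))   # aggregate incoming messages per node
    return f_e2(all_pairs(h_node))     # (N, N, n_edge_types) edge-type logits

logits = nri_encoder_logits(np.random.default_rng(1).standard_normal((5, 20)))
assert logits.shape == (5, 5, 2)
```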
Forecasting. The decoder module is similar to the encoder. It has a fully connected neural network per edge type, which processes the respective input pairs of nodes connected by that edge type. Finally, for each node, the representations of incoming edges are aggregated, and a final neural network predicts the value of the next time step. For the decoder, the authors propose an MLP-based and an RNN-based variant.
Appendix C Datasets
We describe here more in details the characteristics of the datasets used.
In Table 6 we summarize the realworld datasets used in this work. SWaT(Mathur & Tippenhauer, 2016) and WADI (Ahmed et al., 2017) are datasets of sensors measuring watertreatment plants. In the test split there are annotated anoalies where the creators tampered with the water treatment systems. Therefore, SWaT and WADI were originally proposed as anomaly detection datasets (and e.g., used in the GDN paper); however, since the respective training sets are free of anomalies, we use them for our forecasting experiments.
PEMS-BAY and METR-LA (Li et al., 2018) are widely used traffic datasets for which we do have knowledge about the underlying graph. To construct the sensor graph, the authors computed the pairwise road-network distances between sensors and built the adjacency matrix using a thresholded Gaussian kernel.
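The thresholded Gaussian kernel construction can be sketched as follows; the choice of the threshold value and of the standard deviation of the distances as the bandwidth follows Li et al. (2018), but the exact numbers here are illustrative.

```python
import numpy as np

def gaussian_kernel_adjacency(dist, threshold=0.1):
    """Thresholded Gaussian kernel over pairwise road-network distances.

    dist: [N, N] matrix of pairwise distances (np.inf where unreachable).
    The bandwidth sigma is the standard deviation of the finite distances;
    kernel values below `threshold` are set to zero to sparsify the graph.
    """
    finite = dist[np.isfinite(dist)]
    sigma = finite.std()
    w = np.exp(-np.square(dist / sigma))
    w[w < threshold] = 0.0
    return w

# toy example with two sensors one unit apart
dist = np.array([[0.0, 1.0],
                 [1.0, 0.0]])
adj = gaussian_kernel_adjacency(dist)
```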
Dataset  # Samples  # Nodes  Ground truth?
PEMS-BAY (Li et al., 2018)  52,116  325  Yes
METR-LA (Li et al., 2018)  34,272  207  Yes
WADI (Ahmed et al., 2017)^5  1,187,951  122  No
SWaT (Mathur & Tippenhauer, 2016)^5  475,200  51  No
Electricity^6,7  26,304  321  No
Solar-energy^7,8  52,560  137  No
Exchange-rate^7  7,588  8  No
Traffic^7,9  17,544  862  No

^5 As proposed by Deng & Hooi (2021), we subsample SWaT and WADI by a factor of 10 in the time dimension using the median operation.
^6 archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014
^7 github.com/laiguokun/multivariate-time-series-data
^8 www.nrel.gov/grid/solar-power-data.html
^9 pems.dot.ca.gov
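The median-based temporal subsampling applied to SWaT and WADI (following Deng & Hooi, 2021) can be sketched as below; dropping the ragged tail of the series is an assumption for simplicity.

```python
import numpy as np

def subsample_median(x, factor=10):
    """Subsample a [T, N] multivariate time series along the time axis.

    Each block of `factor` consecutive timesteps is collapsed into its
    per-node median; any ragged tail shorter than `factor` is dropped.
    """
    T = (x.shape[0] // factor) * factor
    return np.median(x[:T].reshape(-1, factor, x.shape[1]), axis=1)

x = np.arange(40, dtype=float).reshape(20, 2)   # T=20 timesteps, N=2 nodes
y = subsample_median(x, factor=10)              # -> shape (2, 2)
```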
c.1 Synthetic datasets
One drawback of real-world datasets – even the traffic datasets for which we have some knowledge about the relations – is that we do not know the true data-generating process and how the graph interacts with it. To address this, we generate two synthetic datasets where the relations between nodes are hand-crafted, so that we know the true dependencies between nodes in advance.
Diffusion-based dataset. For each time series, we first randomly sample the parameters of a sinusoidal function, i.e., its frequency, amplitude, and horizontal and vertical shift. Next, we partition the nodes into clusters and generate an undirected graph from a Stochastic Blockmodel (SBM), such that nodes within a cluster are more densely connected than nodes in different clusters. We use Personalized PageRank (PPR) (Page et al., 1999) to diffuse the multivariate time series, i.e., we compute for each time series the weighted combination of itself and the other nodes in its vicinity, adding independent Gaussian noise to the other nodes before averaging. Thus far, we have induced correlation between nodes in the same cluster. Finally, we perform a weighted combination of the time series before and after diffusion, where the diffused time series is lagged by a fixed number of timesteps. This way, knowing the values of a node's neighbors a few steps in the past is useful for predicting its current value, rewarding models which have correctly identified the relations.
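The generation process above can be sketched as follows. All concrete hyperparameter values (restart probability, lag, mixing weight, noise scale, cluster probabilities) are illustrative placeholders, not the values used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def sbm_adjacency(n, n_clusters, p_in, p_out):
    """Undirected SBM: denser within clusters than between them."""
    labels = rng.integers(0, n_clusters, size=n)
    probs = np.where(labels[:, None] == labels[None, :], p_in, p_out)
    a = (rng.random((n, n)) < probs).astype(float)
    a = np.triu(a, 1)
    return a + a.T

def diffusion_series(n=20, t=1000, alpha=0.2, lag=5, mix=0.5):
    # random sinusoid per node: frequency, amplitude, phase
    freq = rng.uniform(0.01, 0.1, n)
    amp = rng.uniform(0.5, 2.0, n)
    phase = rng.uniform(0, 2 * np.pi, n)
    ts = amp[:, None] * np.sin(2 * np.pi * freq[:, None] * np.arange(t) + phase[:, None])
    # closed-form PPR diffusion matrix: alpha * (I - (1 - alpha) * D^-1 A)^-1
    a = sbm_adjacency(n, n_clusters=4, p_in=0.5, p_out=0.05)
    deg = np.maximum(a.sum(1, keepdims=True), 1.0)
    ppr = alpha * np.linalg.inv(np.eye(n) - (1 - alpha) * a / deg)
    # Gaussian noise added before averaging over the neighborhood
    diffused = ppr @ (ts + rng.normal(0, 0.1, ts.shape))
    # weighted combination of the original and the lagged, diffused series
    out = ts.copy()
    out[:, lag:] = (1 - mix) * ts[:, lag:] + mix * diffused[:, :-lag]
    return out
```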
DAG-based dataset. As an alternative, we generate a dataset based on a directed acyclic graph (DAG). We impose an arbitrary topological order defined by the node IDs and iterate over the nodes in increasing topological order. For each node, we randomly sample incoming edges from all preceding nodes with uniform probability. If no incoming edges were sampled for a node, we generate its time series as a random sinusoidal function as described above and add Gaussian noise. Otherwise, its time series is a randomly weighted combination of modified time series of its parent nodes. We modify each input time series by applying a random horizontal and vertical shift and stretch, and again add some noise. Thus, a node's values are directly determined by its incoming edges (up to noise), and we expect models which correctly identify the relations to perform well in the forecasting task.
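This process can be sketched as follows; the edge probability, noise scale, and the ranges of the random shifts and stretches are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def dag_series(n=10, t=500, p_edge=0.2, noise=0.05):
    """Generate time series over a random DAG.

    Node i may only receive edges from nodes j < i, so the node IDs
    form a topological order by construction.
    """
    x = np.zeros((n, t))
    steps = np.arange(t)
    for i in range(n):
        parents = [j for j in range(i) if rng.random() < p_edge]
        if not parents:
            # root node: random sinusoid plus Gaussian noise
            f, a, ph = rng.uniform(0.01, 0.1), rng.uniform(0.5, 2.0), rng.uniform(0, 2 * np.pi)
            x[i] = a * np.sin(2 * np.pi * f * steps + ph) + rng.normal(0, noise, t)
        else:
            # randomly weighted combination of shifted/stretched parent series
            w = rng.random(len(parents))
            w /= w.sum()
            acc = np.zeros(t)
            for wj, j in zip(w, parents):
                shift = rng.integers(1, 10)          # horizontal shift
                stretch = rng.uniform(0.5, 1.5)      # vertical stretch
                acc += wj * stretch * np.roll(x[j], shift)
            x[i] = acc + rng.normal(0, noise, t)
    return x
```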
For the diffusion dataset we set , , and the restart probability in the PPR computation to . We choose clusters; the edge probability within clusters is , and between clusters we have .
For DAG, the edge probability . For both synthetic datasets, we set .
Appendix D Training details
Generally, we use the hyperparameters provided by the authors of the respective papers. We train all models on a horizon of 12 timesteps, where the training loss is the mean absolute error over all 12 timesteps. For all datasets except SWaT and WADI, we ignore targets with value zero (as in Shang et al. (2021) and Wu et al. (2020)), as these correspond to missing values. We train models for a maximum of 200 epochs and use early stopping on the validation MAE with a patience of 20 epochs. For SWaT and WADI, as well as DAG and Diffusion, we fix the window length to 20. For METR-LA and PEMS-BAY, we set the window length to 12, as proposed by Wu et al. (2020) and Shang et al. (2021). For Electricity, Solar Energy, Traffic, and Exchange Rate, the window size is 168, as in Wu et al. (2020).
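The zero-masked training loss can be sketched as below; this is a minimal NumPy illustration of the idea, not the exact implementation used by the cited works.

```python
import numpy as np

def masked_mae(pred, target):
    """Mean absolute error that ignores zero-valued targets,
    which correspond to missing values in these datasets."""
    mask = target != 0
    if not mask.any():
        return 0.0
    return float(np.abs(pred[mask] - target[mask]).mean())

# the zero target at index 0 does not contribute to the loss:
pred = np.array([1.0, 2.0, 3.0])
target = np.array([0.0, 2.5, 3.5])
masked_mae(pred, target)  # -> 0.5
```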