A Study of Joint Graph Inference and Forecasting

09/10/2021 ∙ by Daniel Zügner, et al. ∙ 0

We study a recent class of models which uses graph neural networks (GNNs) to improve forecasting in multivariate time series. The core assumption behind these models is that there is a latent graph between the time series (nodes) that governs the evolution of the multivariate time series. By parameterizing a graph in a differentiable way, the models aim to improve forecasting quality. We compare four recent models of this class on the forecasting task. Further, we perform ablations to study their behavior under changing conditions, e.g., when disabling the graph-learning modules and providing the ground-truth relations instead. Based on our findings, we propose novel ways of combining the existing architectures.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Forecasting multivariate time series is a core machine learning task both in science and in industry (Petropoulos et al., 2020). Between the individual time series (nodes), rich dependencies and interactions (edges) govern how the time series evolves. In the simplest case these could be (linear) correlations; other examples include the road network underlying traffic flows (Wu et al., 2020; Shang et al., 2021), or physical relations such as attraction or repulsion affecting trajectories of objects in space (Kipf et al., 2018).

Knowledge of the ‘true’ relations can be used to make more accurate predictions of how the time series evolves in the future, e.g., by using graph neural networks (GNNs) (e.g., (Kipf & Welling, 2017; Veličković et al., 2018; Hamilton et al., 2017; Gilmer et al., 2017; Scarselli et al., 2009; Klicpera et al., 2019; Bojchevski et al., 2020, 2018)) . Even more, the graph can reveal fundamental insights into the system described by the time series, and may thus be of value in itself, independent of an improvement in the forecasting quality. Therefore, recent works aim at jointly inferring relations between the time series and learn to forecast in an end-to-end manner, sometimes without any prior information about the graph (Wu et al., 2020; Deng & Hooi, 2021). Besides potential benefits in forecasting quality, inferring a graph among time series comes at an inherent computational complexity of , which needs to be taken into account when deciding whether to leverage joint graph inference and forecasting. Hence, we consider the following research questions in this paper.

(R1) In which scenarios do joint graph inference and forecasting improve forecasting accuracy? Given the diverse domains and settings of multivariate time series forecasting (e.g., underlying spatial relations of sensors in traffic forecasting, sets of sensors measuring different properties of the same system, etc.) it is possible that graph inference helps the forecasting task more in some use cases.

(R2) How do the existing architectures compare in forecasting performance? Are there certain architectural choices that appear beneficial for forecasting?

(R3) What are properties of the inferred graphs by the model? Specifically, how consistent are the inferred graphs across different training runs? How (dis-)similar are the inferred graphs to the “ground-truth” graphs (when known)?

2 Background

Forecasting with Multivariate Time Series

In time series forecasting we are interested in estimating a future series

given its past and some context information about the past where variables index over time. For the multivariate case, we can consider time series at a time . We model the following conditional distribution:


where indexes over time series. Notice that we are conditioning on all series in order to estimate the series .

Time Series Forecasting for graph structured data

When conditioning over multivariate time series as in Eq. (1), we may benefit from modelling the relations between different multivariate time series. An expressive structure to capture such relations are graphs. We can define a graph as a set of nodes and edges that relate the nodes. In our case each is associated to a graph node . Edges may be given or unkown depending on the dataset, in cases where the underlying graph is latent/unkown we may jointly infer the graph while estimating a forecasting model. In this work we study the performance of a variety of algorithms under different assumptions of the graph structure (known, unkown, partially known). Note that even in the cases where we have “ground-truth” knowledge (e.g., of spatial relations), there may still be additional latent relations which could be discovered by the models.

3 Literature Review

Recent models perform joint graph learning and forecasting in multivariate timeseries. These models are GTS (“graph for timeseries”) (Shang et al., 2021), Graph Deviation Network (GDN) (Deng & Hooi, 2021), MTS forecasting with GNNs (MTGNN) (Wu et al., 2020), and Neural Relational Inference (NRI) (Kipf et al., 2018). Here, we briefly introduce these four methods and their differences and commonalities; for a more detailed overview, see Appendix B.

All models can be decomposed into two main components: the graph learning and the forecasting modules. The former outputs an adjacency matrix describing a graph between the nodes (i.e., timeseries). The latter takes this graph as well as the input timeseries window to forecast the next timestep(s). Once the adjacency matrix has been obtained from the graph learning module, there are many ways of how to leverage it for forecasting the timeseries. The core idea of the models of this study is that the adjacency matrix construction step is differentiable and jointly learned with the forecasting module. Thus, the intuition is that the model will learn graphs which help the forecasting task.

3.1 Graph learning

The goal of the graph learning module is to output an adjacency matrix , where each entry denotes the edge weight between nodes . Typically, we aim for to be sparse, which reflects the intuition that there are only relatively few useful relations in the latent graph. Each model first represents each node

by a fixed-size vector

, followed by a pairwise similarity computation of any pair and , e.g., by using a fully connected neural network or simply by taking the dot product.

Next, the models obtain the adjacency matrix from the pairwise scores. MTGNN and GDN do so by taking the highest scores per node. An advantage of this is that by choosing appropriately is guaranteed to be sparse. On the other hand, the top- operation is not continuously differentiable, which may pose challenges to end-to-end learning.

NRI and GTS first map the pairwise scores into range (e.g., via softmax or sigmoid). The models use the Gumbel softmax trick (Maddison et al., 2017; Jang et al., 2017)

to sample a discrete adjacency matrix from the edge probabilities in a differentiable way (though gradients are biased); a downside is that we have to take extra steps to obtain a sparse graph, e.g., by regularization.

Moreover, the models can can be broadly split into two groups according to how they compute the fixed-size representations per node: MTGNN and GDN simply learn these representations as node embeddings; on the other hand, NRI and GTS compute the vectors based on the time series itself. That is, they apply some (shared) function to each timeseries to map it into a fixed-size vector. While NRI dynamically produces the representations per individual window, GTS uses the whole training timeseries for each node. The former has the advantage of being more flexible, though more expensive, since we need to compute a tensor to store the individual adjacency matrices, where is the batch size. On the other hand, the graph learned by GTS is global, i.e., shared for all time series. It is thus more efficient yet less flexible, as the model cannot adjust the graph for changing inputs during inference time. Moreover, in its current implementation, this leads to GTS’s number of parameters growing linearly with the length of the training time series (though this could in principle be resolved via dilated convolutions or pooling).

3.2 Graph-based forecasting

There are many existing models to incorporate graph structure in the forecasting task (e.g., (Li et al., 2018; Seo et al., 2017; Chen et al., 2018; Zhao et al., 2020; Zhu et al., 2020; Guo et al., 2019; Panagopoulos et al., 2020; Wang et al., 2018)). Each of the models in this study has its own way of forecasting the time series given the input timeseries window and the adjacency matrix constructed by the graph learning module. For instance, MTGNN interchanges temporal convolution layers with graph convolution layers, and GTS

uses a Diffusion-Convolutional Recurrent Neural Network (DCRNN)

(Li et al., 2018), where the hidden states of each node are diffused via graph convolutions at each timestep. Again, the core idea is that the adjacency matrix used in the graph-based forecasting is itself constructed in a differentiable way and can thus be adjusted by the model to improve forecasting results.

4 Experiments

MAE@12 MAE@12 MAE@12 MAE@12
Table 1: Average forecasting MAE (over five runs) when disabling the graph-learning and forcing the model to use the ground-truth graph. We also show the percentage change of the MAE (); e.g., means error is reduced by over the base scenario.
Random graph No graph
MAE@12 MAE@12 MAE@12 MAE@12 MAE@12 MAE@12
PEMS-BAY 2.12 -
Electricity - -
Solar Energy - -
Traffic - - - - - - - -
Exchange Rate
DAG - -
Diffusion - -
Table 2: Average forecasting MAE (averaged over five runs) when forcing the model to use a (sparse) random graph (left) or when not using a graph at all in the forecasting (right). Relative performance () as explained in Fig. 1. ‘-’ indicates OOM/timeout after 24 hours.

To address our research questions (R1)-(R3), we perform experiments on real-world and synthetic datasets. We repeat all runs five times and report the average; error bars are provided in Table 4 (Appendix).

4.1 Datasets

We briefly describe here the datasets that we use; more details can be found in appendix section C

. We scale each timeseries to have zero mean and unit variance or to have range

(only SWaT and WADI, as in (Deng & Hooi, 2021)). For training and evaluation we compute MAE on the original scale.

PEMS-BAY and METR-LA (Li et al., 2018) are widely used traffic datasets where we do have knowledge about the underlying graph. To construct the sensor graph, we computed the pairwise road network distances between sensors and build the adjacency matrix using a thresholded Gaussian kernel.

We use a range of other multi-variate datasets for which no graph structure is known: Electricity,111archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014222github.com/laiguokun/multivariate-time-series-data Solar-energy,333www.nrel.gov/grid/solar-power-data.html222github.com/laiguokun/multivariate-time-series-data Exchange-rate222github.com/laiguokun/multivariate-time-series-data and Traffic.444https://pems.dot.ca.gov222github.com/laiguokun/multivariate-time-series-data Further, SWaT(Mathur & Tippenhauer, 2016) and WADI (Ahmed et al., 2017)

are datasets of sensors measuring water-treatment plants. In the test split there are annotated anomalies where the creators tampered with the water treatment systems. Therefore, SWaT and WADI were originally proposed as anomaly detection datasets (and e.g., used in the GDN paper); however, since the respective training sets are free of anomalies, we use them for our forecasting experiments.

Synthetic datasets. To enhance the real world datasets, we create two synthetic datasets starting with a graph and making sure that the graph has an impact on the connection between the time series. This allows us to speculate that the graph will be of importance for the forecasting of the time series. We create the Diffusion dataset by using Personalized PageRank (PPR) (Page et al., 1999) to diffuse the multivariate timeseries. We create the DAG dataset using a directed acyclic graph (DAG) and making all the children dimensions be a weighted combination of its parents dimensions.

4.2 Results

(R1). Here we analyze the forecasting results on the different datasets at horizons 3, 6, and 12, respectively. For reference, we also add a vanilla LSTM (Hochreiter & Schmidhuber, 1997) baseline that jointly forecasts all timeseries, as well LSTM-U, which consists of univariate LSTMs. Essentially, the LSTM uses information from all timeseries, though lacks typical GNN properties such as permutation equivariance and does not leverage sparsity. LSTM-U is on the other end of the spectrum and simply views all timeseries as completely independent. In Table 4 (appendix) we present the results.

On the popular traffic datasets METR-LA and PEMS-BAY, the GNN models generally dominate the LSTMs. These datasets have known spatial spatial relations among the timeseries, thus this comes as no surprise. NRI’s results on METR-LA is quite poor, which we attribute to the relatively large number of nodes and to the fact that the underlying relations are static, while NRI predicts a graph per window.

On WADI, interestingly, LSTM-U performs on par with MTGNN. The remaining gap to GTS is relatively small and can potentially be explained by GTS’s more sophisticated forecasting procedure. This indicates that on WADI, where we do not have a “straightforward” spatial graph between the nodes, the GNN-based models struggle to find useful relations in the data – or that there are no useful relations in the data to begin with. Similarly, on SWaT, LSTM outperforms all GNN-based models except GTS.

In the synthetic diffusion-based dataset, GTS achieves roughly 50% lower mean absolute error than LSTM. We attribute this to the fact that the data-generating process (graph diffusion) matches well with the DC-RNN architecture used by GTS in the forecasting module. Further, note that on Traffic, GTS ran OOM on a 16GB VRAM GPU for batch size larger than 1, and therefore did not finish within 24 hours. NRI, which is even more computationally expensive, has additional missing values.

In summary, the GNN-based models’ edge over the non-graph baselines tends to be largest for datasets with an underlying spatial graph (traffic datasets, Electricity, Solar Energy), and smaller for the datasets where the relations are expected to be more subtle (WADI, SWaT). Future work could compare the GNN-based models to state-of-the-art non-graph forecasting methods in a benchmark study.

(R2). Next we perform ablation experiments on the GNN-based models to study their behavior when removing their graph-learning module. For the forecasting modules, we either provide the ground-truth graph (where known); provide a sparse random graph; or provide no graph. We compare results to the “vanilla” settings of the models, computing the relative change in MAE at horizon 12 in percent.

In Table 1 we show the results for providing the ground-truth graph to the forecasting modules. Strikingly, MTGNN’s performance substantially increases, leading to almost 10% less MAE on METR-LA. On PEMS-BAY and METR-LA, MTGNN’s results are on par with GTS’s. This suggests that MTGNN’s forecasting module performs well, and that GTS’s graph-learning module may be advantageous. GDN also benefits from ground truth, though the effect is not as pronounced. Interestingly, providing the “true” graph to GTS leads to a slight performance drop on all but one datasets, indicating that the model’s graph-learning module is effective at improving forecasting.

In Table 2, we see the results for providing a (sparse) random Erdős Renyi graph to the models (left), or completely disabling the graph processing in the forecasting modules (right). For the random graphs we set the edge probability such that the expected degree is 30 (), 10 (), or 3 (). An interesting insight is that for GTS, using a random graph has little or moderate effect on most datasets; and that using no graph at all leads to strong performance drop, indicating that GTS’s forecasting module greatly benefits from the sparsity of graphs.

Remarkably, for MTGNN we see relatively little effect when using a random graph or even no graph at all. We hypothesize that this is due to MTGNN’s way of constructing the adjacency matrix. It uses kNN-style approach, which has sparse gradients. Further, the edge weights are the result of applying

to the pairwise scores, which may lead to vanishing gradients. Thus, the node embeddings may receive only very little training signal. In contrast, GDN, which also uses node embeddings in the graph learning module, utilizes the node embeddings also in the forecasting task. This may be a way to address the issue of MTGNN. Another approach may be to replace the kNN graph construction with differentiable sampling via the Gumbel softmax trick (as in GTS and NRI). This is an interesting experiment to further investigate whether the strategy of parameterizing the graph based on the time series, employed by NRI and GTS, is generally advantageous over node-embedding-based approaches.

Avg. corr. Avg. corr. GT
METR-LA GDN 0.356 0.212
MTGNN -0.001 -0.015
GTS 0.264 -0.046
GTS w/ reg. 0.493 0.523
PEMS-BAY GDN 0.287 0.185
MTGNN 0.000 -0.008
GTS 0.164 -0.010
GTS w/ reg. 0.704 0.684
Table 3: Average correlation of edge scores across different training runs (left), and with the ground-truth graph (right).

(R3). Finally, we measure how consistent the learned edge scores are across training runs as well as how similar the learned adjacency matrices are to the ground truth adjacency matrices. For this we measure the correlation of edge scores among re-runs and with the ground-truth graph. Intuitively, high correlation means that the model assigns large/small scores to the same node pairs. A subset of the results is shown in Table 3; see Table 5 (app.) for more details. We can see that (i) for GDN and GTS, the learned adjacency matrices tend to be moderately similar across training runs. Interestingly, only GDN’s learned graphs have a nontrivial correlation with the ground truth. This indicates that the models learn a graph which is useful for forecasting, which need not have much in common with the “true” (e.g., spatial) graph. Note that for these experiments we have disabled GTS’s regularization on the ground-truth graph. When enabling the loss (GTS w/ reg.) we find that, as expected, the learned graphs strongly correlate with the input graph.

5 Conclusion

We present a study of recent models performing joint graph inference and forecasting. We highlight key commonalities and differences among the architectures. In our experiments, we compare the forecasting results of the models and study properties of the different graph-learning modules. For instance, we find MTGNN to be insensitive as to whether the graph-learning module is active or not; though it greatly benefits from access to a ground-truth graph. In general, learning a latent graph is a challenging problem; improvements in terms of expressiveness and computational efficiency could lead to broader applicability. We highlight potential ways of combining the existing architectures.


Appendix A Additional results

In Table 4, we provide the results on the forecasting task. In Table 5, we provide additional correlation results of the learned graphs.

Model Dataset Avg. corr. Avg. corr. GT
NRI METR-LA 0.44 -0.24
WADI -0.11 n/a
SWaT 0.10 n/a
Exchange Rate 0.68 n/a
DAG 0.64 -0.00
Diffusion 0.85 0.02
GDN METR-LA 0.36 0.21
PEMS-BAY 0.29 0.19
WADI 0.13 n/a
SWaT 0.25 n/a
Electricity 0.17 n/a
Solar Energy 0.15 n/a
Traffic 0.10 n/a
Exchange Rate 0.44 n/a
DAG 0.22 0.04
Diffusion 0.68 0.00
MTGNN METR-LA -0.00 -0.01
PEMS-BAY -0.00 -0.01
WADI 0.00 n/a
SWaT 0.01 n/a
Electricity 0.00 n/a
Solar Energy 0.00 n/a
Traffic -0.00 n/a
Exchange Rate 0.15 n/a
DAG 0.00 0.00
Diffusion -0.00 0.00
GTS METR-LA 0.26 -0.05
PEMS-BAY 0.16 -0.01
WADI 0.46 n/a
SWaT 0.51 n/a
Solar Energy 0.02 n/a
Exchange Rate 0.38 n/a
DAG 0.46 0.02
Diffusion 0.01 -0.00
Table 5: Average correlation of edge scores among different training runs (left), and average correlation of the resulting edge scores with the ground-truth graph (where available).

Appendix B Model details

b.1 Graph for Time Series (GTS)

GTS (“graph for time series”) (Shang et al., 2021) is a recent model aiming to jointly learn a latent graph in the time series and use it for MTS forecasting. The model consists of two main components: graph learning, and graph-based forecasting.

Graph learning. The graph learning module first maps the training partition of each time series to a fixed-size vector representation

via a 1D-convolutional neural network. Then, all pairs

are processed by an MLP to output the probability of an edge between and .


denotes the logistic sigmoid function and

denotes vector concatenation. Finally, a discrete adjacency matrix is obtained via element-wise, differentiable sampling, i.e., , using the Gumbel softmax trick (Jang et al., 2017; Maddison et al., 2017). Note that GTS uses the complete training partition to parameterize the adjacency matrix at each batch. This means that (i) the model can use information from the whole (training) time series at training time; (ii) the model cannot adjust the learned graph at test time; (iii) the number of parameters grows linearly with the length of the input timeseries, which could be improved by adding dilation or pooling to the convolutional encoder.

Forecasting. The forecasting module uses the graph produced by the graph learning module, a Diffusion-Convolutional RNN (DCRNN) (Li et al., 2018). A DCRNN essentially updates hidden states collectively for all series via a graph convolution, which replaces the usual multiplication with a weight matrix. Thus, at each time step

, the RNN uses the learnt graph to locally average the hidden representation of timeseries in their graph neighborhood:

where and are diagonal matrices whose entries are the out- and in-degrees of the nodes in , respectively; and , are learnable weight matrices. GTS uses .

Regularization. The authors propose to incorporate potential a-priori knowledge about the ground-truth graph via a regularization loss in the form of element-wise binary cross entropy loss between the learned and prior graph.

b.2 MTS Forecasting with GNNs (MTGNN)

Like GTS, MTGNN (Wu et al., 2020) also consists of a graph learning and forecasting module, though the implementations of these modules are different.

Graph learning. In contrast to GTS, which parameterizes the representation of a timeseries using a neural network based on the input timeseries, MTGNN learns two embedding vectors per node (i.e., timeseries), i.e., two embedding matrices , . Pairwise scores are computed as

where the formulation of ensures that it is asymmetric, i.e., if is positive, is zero. Finally, only the top scores per row are kept to ensure sparsity of the adjacency matrix. The node embeddings are trained in an end-to-end fashion on the forecasting task.

Forecasting. The main difference to the RNN-based forecasting module in GTS is that MTGNN uses temporal convolutions combined with graph convolution layers. MTGNN stacks three blocks of interchanging inception-style temporal convolution layers and graph convolution layers to predict the next timestep(s) of the timeseries.

b.3 Graph Deviation Network (GDN)

Graph Deviation Network (GDN) (Deng & Hooi, 2021) is a recent model aimed at anomaly detection in multivariate timeseries. The model is trained on MTS forecasting, and anomalies are flagged when the predicted value deviates strongly from the observed value. Since the model is essentially a forecasting method by construction, we chose to include it as a baseline in this work.

Graph learning. Similar to MTGNN, GDN infers the graph by learning a node embedding per node. Specifically, the model builds a k-NN graph where the similarity metric is the cosine of a pair of embeddings.

Forecasting. The forecasting module is based on the Graph Attention Network (GAT) (Veličković et al., 2018) architecture.

where is the input timeseries window of node , is a learned weight matrix, and are attention scores, computed as follows.

where , and is a learned vector. The next predicted value(s) for all nodes are predicted jointly by a stack of fully connected layers :

where is the number of predicted timesteps.

b.4 Neural Relational Inference (NRI)

The Neural Relational Inference (NRI) (Kipf et al., 2018) model assumes a slightly different setting than GTS, MTGNN, and GDN. Instead of learning a global, static graph over the whole time series, NRI infers a graph per input window. While this setting is more flexible, it comes at the drawback of having inherent memory complexity, where is the batch size. Thus, NRI can only realistically scale to small graphs, i.e. at most . NRI is a VAE-based architecture consisting of an encoder module predicting edge probabilities and a decoder module performing the forecasting.

Graph learning. The encoder module interchanges neural networks on the node and edge representations, respectively:

Finally, the edge type posterior is , where one edge type can be hard-coded to denote ‘no edge’. Similar to GTS, NRI samples a discrete graph using the Gumbel softmax trick.

Forecasting. The decoder module is similar to the encoder. It has a fully connected neural network per edge type, which processes the respective input pairs of nodes connected by the specific edge type. Finally, for each node, the representations of incoming edges are aggregated, and a final neural network predicts the value of the next timestep. For the decoder, the authors propose an MLP-based and a RNN-based variant.

Appendix C Datasets

We describe here more in details the characteristics of the datasets used.

In Table 6 we summarize the real-world datasets used in this work. SWaT(Mathur & Tippenhauer, 2016) and WADI (Ahmed et al., 2017) are datasets of sensors measuring water-treatment plants. In the test split there are annotated anoalies where the creators tampered with the water treatment systems. Therefore, SWaT and WADI were originally proposed as anomaly detection datasets (and e.g., used in the GDN paper); however, since the respective training sets are free of anomalies, we use them for our forecasting experiments.

PEMS-BAY and METR-LA (Li et al., 2018) are widely used traffic datasets where we do have knowledge about the underlying graph. To construct the sensor graph, the computed the pairwise road network distances between sensors and build the adjacency matrix using a thresholded Gaussian kernel.

Dataset # Samples Ground truth?
PEMS-BAY (Li et al., 2018) 52,116 325 Yes
METR-LA (Li et al., 2018) 34,272 207 Yes
WADI555As proposed by (Deng & Hooi, 2021), we subsample SWaT and WADI by a factor of 10 in the time dimension using the median operation.] (Ahmed et al., 2017) 1,187,951 122 No
SWAT555As proposed by (Deng & Hooi, 2021), we subsample SWaT and WADI by a factor of 10 in the time dimension using the median operation. (Mathur & Tippenhauer, 2016) 475,200 51 No
Electricity666archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014777github.com/laiguokun/multivariate-time-series-data 26,304 321 No
Solar-energy888www.nrel.gov/grid/solar-power-data.html777github.com/laiguokun/multivariate-time-series-data 52,560 137 No
Exchange-rate777github.com/laiguokun/multivariate-time-series-data 7,588 8 No
Traffic999https://pems.dot.ca.gov777github.com/laiguokun/multivariate-time-series-data 17,544 862 No
Table 6: Real-world dataset summary.

c.1 Synthetic datasets

One drawback about real-world datasets – even the traffic datasets for which we have some knowledge about the relations – is that we do not know the true data-generating process and how the graph interacts with it. To address this, we generated two synthetic datasets where relations between nodes are handcrafted. An advantage of this is that we know in advance the true dependencies between nodes.

Diffusion-based dataset. For each of the timeseries, we first randomly sample parameters of a sinusoidal function, i.e., its frequency, amplitude, horizontal, and vertical shift. Next, we partition the nodes into clusters and generate an undirected graph from a Stochastic Blockmodel (SBM), such that nodes within a cluster are more densely connected than between clusters. We use Personalized PageRank (PPR) (Page et al., 1999) to diffuse the multivariate timeseries, i.e., compute for each timeseries the weighted combination of itself and the other nodes in its vicinity. For each node, we add independent Gaussian noise to the other nodes before averaging. Thus far we have induced correlation between nodes in the same cluster. Finally, we perform a weighted combination of the timeseries before and after diffusion, where the diffused timeseries is lagged by timesteps. This means, each timestep . This way, knowing the value of ’s neighbors steps is useful to predict its current value, rewarding models which have correctly identified the relations.

DAG-based dataset. As an alternative, we generate a dataset based on a directed acyclic graph (DAG). We induce an arbitrary topological order defined by the node IDs, i.e., . We iterate over nodes in increasing topological order. For node , we randomly sample incoming edges from all nodes with uniform probability . If no incoming edges were sampled for , we generate its timeseries as a random sinusoidal function as described above and add Gaussian noise. Otherwise, ’s timeseries is a randomly weighted combination of modified timeseries for which is an edge and . We modify the input timeseries by applying random horizontal and vertical shift and stretch, and add some noise again. Thus, ’s values are directly determined by its incoming edges (except for some noise), and we expect models which correctly identify the relations to perform well in the forecasting task.

For the diffusion dataset we set , , and the restart probability in the PPR computation to . We choose clusters; the edge probability within clusters is , and between clusters we have .

For DAG, the edge probability . For both synthetic datasets, we set .

Appendix D Training details

Generally, we use the hyperparameters provided by the authors of the respective papers. We train all models on a horizon of 12 timesteps, where the training loss is the mean absolute error of all 12 time steps. For all datasets except SWaT and WADI, we ignore targets with value zero (as in

(Shang et al., 2021; Wu et al., 2020)

), as these correspond to missing values. We train models for a maximum of 200 epochs and use early stopping with patience of 20 epochs. We use the validation MAE for early stopping. For SWaT and WADI as well as DAG and Diffusion, we fix the window length to 20. For METR-LA and PEMS-BAY, we set the window length to 12, as proposed by

(Wu et al., 2020; Shang et al., 2021). For Electricity, Solar Energy, Traffic, and Exchange Rate, window size is 168 as in (Wu et al., 2020).