Log In Sign Up

Scalable Spatiotemporal Graph Neural Networks

Neural forecasting of spatiotemporal time series drives both research and industrial innovation in several relevant application domains. Graph neural networks (GNNs) are often the core component of the forecasting architecture. However, in most spatiotemporal GNNs, the computational complexity scales up to a quadratic factor with the length of the sequence times the number of links in the graph, hence hindering the application of these models to large graphs and long temporal sequences. While methods to improve scalability have been proposed in the context of static graphs, few research efforts have been devoted to the spatiotemporal case. To fill this gap, we propose a scalable architecture that exploits an efficient encoding of both temporal and spatial dynamics. In particular, we use a randomized recurrent neural network to embed the history of the input time series into high-dimensional state representations encompassing multi-scale temporal dynamics. Such representations are then propagated along the spatial dimension using different powers of the graph adjacency matrix to generate node embeddings characterized by a rich pool of spatiotemporal features. The resulting node embeddings can be efficiently pre-computed in an unsupervised manner, before being fed to a feed-forward decoder that learns to map the multi-scale spatiotemporal representations to predictions. The training procedure can then be parallelized node-wise by sampling the node embeddings without breaking any dependency, thus enabling scalability to large networks. Empirical results on relevant datasets show that our approach achieves results competitive with the state of the art, while dramatically reducing the computational burden.


page 1

page 2

page 3

page 4


Inductive Graph Neural Networks for Spatiotemporal Kriging

Time series forecasting and spatiotemporal kriging are the two most impo...

Temporal Weights

In artificial neural networks, weights are a static representation of sy...

Learning the Evolutionary and Multi-scale Graph Structure for Multivariate Time Series Forecasting

Recent studies have shown great promise in applying graph neural network...

On the Equivalence Between Temporal and Static Graph Representations for Observational Predictions

In this work we formalize the (pure observational) task of predicting no...

GTrans: Spatiotemporal Autoregressive Transformer with Graph Embeddings for Nowcasting Extreme Events

Spatiotemporal time series nowcasting should preserve temporal and spati...

Spatially Focused Attack against Spatiotemporal Graph Neural Networks

Spatiotemporal forecasting plays an essential role in various applicatio...

Demand Forecasting from Spatiotemporal Data with Graph Networks and Temporal-Guided Embedding

Short-term demand forecasting models commonly combine convolutional and ...

1 Introduction

As graph neural networks (GNNs; Scarselli et al. 2008; Bacciu et al. 2020) are gaining more traction in many application fields, the need for architectures scalable to large graphs – such as those associated with large sensor networks – is becoming a pressing issue. While research to improve the scalability of models for static graph signals has been very prolific Hamilton et al. (2017); Chiang et al. (2019); Zeng et al. (2019); Frasca et al. (2020), little attention has been paid to the additional challenges encountered when dealing with discrete-time dynamical graphs, i.e., spatiotemporal time series. Several of the existing scalable training techniques rely on subsampling graphs to reduce the computational requirements of the training procedure, e.g., Hamilton et al. (2017); Zeng et al. (2019). However, sampling the node-level observations as if they were i.i.d. can break relational (spatial) dependencies in static graphs and it is even more problematic in the dynamic case, as dependencies occur also across the temporal dimension. Indeed, complex temporal and spatial dynamics that emerge from the interactions across the whole graph over a long time horizon, can be easily disrupted by perturbing such spatiotemporal structure with subsampling. As an alternative, precomputing aggregated features over the graph allows for factoring out spatial propagation from the training phase in certain architetures Frasca et al. (2020). However, similarly to the subsampling approach, extending this method to the spatiotemporal case is not trivial as the preprocessing step must account also for the temporal dependencies besides the graph topology.

Figure 1: Overview of the forecasting framework. Light-grey boxes denote training-free components. At first, an Echo-State Network (ESN) – with shared parameters across nodes – encodes multi-scale temporal dynamics. Then, graph shift operators are used to propagate spatial information. The resulting representations are concatenated and fed to an MLP to predict the next node observations.

In this paper, we propose a novel scalable encoder-decoder architecture for processing spatiotemporal data; Fig. 1 shows high-level schematics of the proposed approach. The spatiotemporal encoding scheme is training-free: first, it exploits a deep randomized recurrent neural network Jaeger (2001); Gallicchio et al. (2017)

to encode the history of each sequence in a high-dimensional vector embedding; then, it uses powers of the graph adjacency matrix to build informative node representations of the spatiotemporal dynamics at different scales. According to the downstream task at hand, the

decoder maps the node representations into the desired output, e.g., the future values of the time series associated with each node. To improve efficiency, we exploit the structure of the extracted embedding to design the decoder to act as a collection of filters localized at different spatiotemporal scales.

Since the spatiotemporal encoder requires neither training nor supervision, the representation of each node and time step can be computed in a preprocessing stage, without the constraints that come from online training on GPUs with limited memory. The decoder is the only component of the architecture with trainable parameters. However, since spatiotemporal relationships are already embedded in the representations, the embeddings can be processed independently from their spatiotemporal context with two consequent advantages. First, training can be done node-wise, allowing for sampling node representations in mini-batches of a size proportional to the hardware capacity. Second, the decoder can be implemented similarly to a standard multilayer perceptron (MLP) readout, which is fast and easy to train.

Let and be the number of steps and the number of edges in the input graph, respectively. The cost of training a standard spatiotemporal GNN on a mini-batch of data has a computational and memory cost that scales as , or in attention-based architectures Wu et al. (2022). Conversely, in our approach mini-batches can be sampled disregarding the length of the sequence and size of the graph, thus making scalability in training constant, i.e., , w.r.t. the spatiotemporal dimension of the problem.

Our contributions can be summarized as follows.

  • We propose a general scalable deep learning framework for spatiotemporal time series, which exploits a novel encoding method based on randomized recurrent components and scalable GNNs architectures.

  • We apply the proposed model to forecast multivariate time series, whose channels are subject to spatial relationships described by a graph.

  • We carry out a rigorous and extensive empirical evaluation of the proposed architecture and variations thereof. Notably, we introduce two benchmarks for scalable spatiotemporal forecasting architectures.

Empirical results show that our approach performs on par with the state of the art while being easy to implement, computationally efficient, and extremely scalable. Given these considerations, we refer to our architecture as model.

2 Preliminaries and Problem Definition

We consider discrete-time spatiotemporal graphs. In particular, given interlinked sensors, we indicate with the -dimensional multivariate observation associated with the -th sensor at time-step , with the node attribute matrix encompassing measurements graph-wise, and with the sequence of measurements collected in the time interval at each sensor. Similarly, we indicate with the matrix containing exogenous variables (e.g., weather information related to a monitored area) associated with each sensor at the -th time-step. Then, we indicate additional, optional, static node attributes as . The relational information is encoded in a, potentially dynamic, weighted adjacency matrix . We indicate with the tuple the graph signal at the -th time-step. Note that the number of sensors in a network is here considered fixed only to ease the presentation; we only request nodes to be distinguishable across time steps. The objective of spatiotemporal forecasting is to predict the next observations given a window of past measurements. In particular, we consider the family of forecasting models s.t.


where indicates the learnable parameters of the model and the -step ahead point forecast.

Echo-State Networks

Echo state networks Jaeger (2001); Lukoševičius and Jaeger (2009) are a class of randomized architectures that consist of recurrent neural networks with random connections that encode the history of input signals into a high-dimensional state representation to be used as input to a (trainable) readout layer. The main idea is to feed an input signal into a high-dimensional, randomized, and non-linear reservoir, whose internal state can be used as an embedding of the input dynamics. An echo state network is governed by the following state update equation:


where indicates a generic input to the system, and are the random matrices defining the connectivity pattern in the reservoir, is a randomly initialized bias, indicates the reservoir state, and

is a nonlinear activation function (usually 

tanh). If the random matrices are defined properly, the reservoir will extract a rich pool of dynamics characterizing the system underlying the input time series and, thus, the reservoir states become informative embeddings of  Lukoševičius and Jaeger (2009). Thanks to the non-linearity of the reservoir, the embeddings are commonly processed with a linear readout that is optimized with a least squares procedure to perform classification, clustering, or time series forecasting Bianchi et al. (2020).

Figure 2: Overview of the model encoder. Input time series are fed into a randomized network with recurrent connections and embedded into a hierarchical vector representation. A graph shift operator is used to propagate and aggregate spatial information of different order which is then concatenated to obtain a final embedding.

3 Scalable Spatiotemporal GNNs

This section presents our approach to building scalable GNN architectures for time series forecasting. Our method is based on a hybrid encoder-decoder architecture. The encoder first constructs representations of the time series observed at each node by using a reservoir that accounts for dynamics at different time scales. Representations are further processed to account for spatial dynamics described by the graph structure. In particular, as shown on the right-hand side of Fig. 2, we use incremental powers of the graph adjacency matrix to propagate and aggregate information along the spatial dimension. Each power of the propagation matrix accounts for different scales of spatial dynamics. The final embedding is then built by concatenating representations obtained w.r.t. each propagation step, thus resulting in a rich encoding of both spatial and temporal features.

The encoder does not need any training and, once computed, the embeddings can be uniformly sampled over time and space when training a nonlinear readout to perform

-step-ahead predictions. The straightforward choice for the decoder (i.e., readout) is to map the encodings to the outputs (i.e., predictions) by using a linear transformation or a standard MLP. However, to further enhance scalability, our decoder exploits the structure of the embedding to reduce the number of parameters and learn filters that are localized in time and space. As we will discuss in Sec. 

3.2, this is done by learning separate weight matrices for each spatiotemporal scale.

The following subsections describe in detail each component of the architecture.

3.1 Spatiotemporal Encoder

We consider as temporal encoders deep echo state networks (DeepESN; Gallicchio et al. 2017

) with leaky integrator neurons 

Jaeger et al. (2007). In particular, we consider networks where the signal associated with each node is encoded by a stack of randomized recurrent layers s.t.


where is a discount factor associated with -th layer, , , are random weight matrices, indicates the hidden state of the system w.r.t. the -th node at the -th layer, and indicates node-wise concatenation. As Eq. 3 shows, DeepESNs are a hierarchical stack of reservoir layers that, e.g., by changing the discount factor at each layer, extract a rich pool of multi-scale temporal dynamics Gallicchio et al. (2017)111We refer to Gallicchio et al. (2018) for more details on the properties and stability of DeepESNs.. Given a DeepESN encoder, the input is represented by the concatenation of the states from each layer, i.e., we obtain node-level temporal encodings for each node and time-step as


We indicate as the encoding for the whole graph at time . The extraction of the node-level temporal embeddings is depicted on the left-end side of Fig. 2, where, to simplify the drawing, we depict an ESN with a single layer.

The next step is to propagate information along the spatial dimension. As discussed at the beginning of the section, we use powers of a graph shift operator to propagate and aggregate node representations at different scales. By using a notation similar to Eq. 4, we obtain spatiotemporal encodings as


where indicates a generic graph shift operator matching the sparsity pattern of the graph adjacency matrix. In practice, by indicating with the graph degree matrix, we use in the case of a directed graph and the symmetrically normalized adjacency in the undirected case. Furthermore, for directed graphs we optionally increase the number of representations to to account for bidirectional dynamics, i.e., we repeat the encoding process w.r.t. the transpose adjacency matrix similarly to Li et al. (2018). Intuitively, each propagation step propagates and aggregates – properly weighted – features between nodes connected by paths of length in the graph. As shown in Eq. 5, features corresponding to each order can be computed recursively with sparse matrix-matrix multiplications (Fig. 2). Alternatively, each matrix can be precomputed and the computation of the different blocks of matrix can be distributed in a parallel fashion as suggested in Fig. 1. Even in the case of extremely large graphs, features can be computed offline by exploiting distributed computing as they do not need to be loaded on the GPU memory.

3.2 Multi-Scale Decoder

The role of the decoder is that of selecting and weighing from the pool of the (possibly redundant) features extracted by the spatiotemporal encoder and mapping them to the desired output. Representations

can be fed into an MLP that performs node-wise predictions. Since the representations are large vectors, a naïve implementation of the MLP results in many parameters that hinder scalability. Therefore, we replace the first MLP layer with a more efficient implementation that exploits the structure of the embeddings.

As we described in Sec. 3.1, is the concatenation of the representations corresponding to different spatial propagation steps which, in turn, are obtained from the concatenation of multi-scale temporal features. To exploit this structure, we design the first layer of the decoder with a sparse connectivity pattern to learn representations s.t.


where are the learnable parameters and is an activation function. In practice, representations can be efficiently computed by exploiting grouped -d convolutions (e.g., see Krizhevsky et al. 2012) to parallelize computation on GPUs. In particular, if we indicate the -d grouped convolution operator with groups and kernel size as , and the collection of the decoder parameters as we can compute as


with in the case of undirected graphs and for the directed case. Besides reducing the number of parameters by a factor of , this architecture localizes filters w.r.t. the dynamics of spatial order and temporal scale . In fact, as highlighted in Eq. 68, representation can be seen as a concatenation of the results of graph convolutions of different order. Finally, the obtained representations are fed into an MLP that predicts the -step-ahead observations as


where the static node-level attributes can also be augmented by concatenating a set of learnable parameters (i.e., a learnable positional encoding).

Training and sampling

The main improvement introduced by the proposed approach in terms of scalability concerns the training procedure. Representations embed both the temporal and spatial relationships among observations over the sensor network. Consequently, each sample can be processed independently since no further spatiotemporal information needs to be collected. This allows for training the decoder with SGD by uniformly and independently sampling mini-batches of data points . This is the key property that makes the training procedure extremely scalable and drastically reduces the lower bound on the computational complexity required for the training w.r.t. standard spatiotemporal GNN architectures. We provide an efficient implementation of the full encoding and training procedure as supplementary material to the paper (see Sec. 5).

4 Related works

15 min 30 min 60 min Average 15 min 30 min 60 min Average
LSTM 2.99 0.00 3.58 0.00 4.43 0.01 3.58 0.00 53.01 0.13 10.19 0.05 1.39 0.00 1.83 0.01 2.35 0.01 1.79 0.00 17.72 0.08 4.16 0.05
FC-LSTM 3.33 0.01 3.43 0.01 3.67 0.01 3.46 0.01 44.85 0.12 10.15 0.09 2.22 0.01 2.25 0.01 2.34 0.02 2.26 0.01 22.31 0.27 5.33 0.04
DynGESN 3.27 0.00 3.99 0.00 5.00 0.00 3.98 0.00 50.30 0.07 11.11 0.01 1.57 0.00 2.13 0.01 2.81 0.02 2.09 0.01 18.43 0.07 4.74 0.01
DCRNN 2.82 0.00 3.23 0.01 3.74 0.01 3.20 0.00 41.57 0.22 8.88 0.05 1.36 0.00 1.71 0.00 2.08 0.01 1.66 0.00 14.40 0.15 3.76 0.01
Graph WaveNet 2.72 0.01 3.10 0.02 3.54 0.03 3.06 0.02 38.22 0.32 8.40 0.03 1.31 0.00 1.64 0.01 1.94 0.01 1.58 0.00 13.12 0.14 3.58 0.02
FC-Gated-GN 2.72 0.01 3.05 0.01 3.44 0.01 3.01 0.00 37.48 0.32 8.27 0.00 1.32 0.00 1.63 0.01 1.89 0.01 1.56 0.01 12.96 0.11 3.51 0.03
UG-Gated-GN 2.72 0.00 3.10 0.00 3.54 0.01 3.06 0.00 38.82 0.08 8.40 0.04 1.33 0.00 1.67 0.01 1.99 0.01 1.61 0.01 13.76 0.14 3.59 0.03
model 2.69 0.00 3.05 0.00 3.45 0.00 3.00 0.00 36.70 0.10 8.27 0.02 1.30 0.00 1.60 0.00 1.88 0.00 1.54 0.00 12.43 0.03 3.44 0.01
–No-Space-Enc. 2.84 0.00 3.26 0.00 3.74 0.00 3.22 0.00 44.55 0.05 9.20 0.01 1.34 0.00 1.68 0.00 2.02 0.00 1.62 0.00 14.14 0.06 3.67 0.01
–FC-Dec. 2.76 0.01 3.13 0.01 3.52 0.02 3.08 0.01 37.92 0.35 8.63 0.11 1.35 0.01 1.67 0.01 1.96 0.01 1.61 0.01 13.04 0.23 3.61 0.04
–GC-Dec. 2.77 0.00 3.17 0.00 3.63 0.00 3.12 0.00 40.67 0.06 8.74 0.01 1.32 0.00 1.65 0.00 1.97 0.00 1.59 0.00 13.47 0.08 3.60 0.01
Table 1: Results on benchmark traffic datasets (averaged over 3 independent runs). We report MAE, MSE, and MAPE averaged over a one-hour (12 steps) forecasting horizon. We also show MAE for

minutes time horizons. Bold numbers are within a standard deviation from the best reported average result.

Spatiotemporal GNNs are essentially based on the idea of integrating message-passing modules in architectures to process sequential data. Notably, Seo et al. (2018) and Li et al. (2018) use message-passing to implement gates of recurrent neural networks. Yu et al. (2018) and Wu et al. (2019, 2020) proposed architectures alternating temporal and spatial convolutions. Wu et al. (2022) and Marisca et al. (2022), instead, exploit the attention mechanism to propagate information along both time and space. Modern architectures often combine some type of relational inductive bias, with full Transformer-like attention Vaswani et al. (2017) along the spatial dimension Zheng et al. (2020); Oreshkin et al. (2021); Satorras et al. (2022), which, however, makes the computation scale quadratically with the number of nodes. model falls within the category of time-then-graph models, i.e., models where the temporal information is encoded before being propagated along the spatial dimension. Gao and Ribeiro (2022) showed that such models can be more expressive than architectures that alternate temporal and spatial processing steps.

Research on scalable models for discrete-time dynamic graphs has been relatively limited. Practitioners have mostly relied on methods developed in the context of static graphs which include node-centric, GraphSAGE-like, approaches Hamilton et al. (2017) or subgraph sampling methods, such as ClusterGCN Chiang et al. (2019) or GraphSAINT Zeng et al. (2019). Wu et al. (2020); Gandhi et al. (2021); Wu et al. (2021) are examples of such approaches. Among scalable GNNs for static graphs, SIGN Frasca et al. (2020) is the approach most related to our method. Like in our approach, SIGN performs spatial propagation as a preprocessing step by using different shift operators to aggregate across different graph neighborhoods, which are then fed to an MLP. However, SIGN is limited to static graphs and propagates raw node-level attributes. Finally, similar to our work, DynGESN Micheli and Tortorella (2022) processes dynamical graphs with a recurrent randomized architecture. However, the architecture in DynGESN is completely randomized, while ours is hybrid as it combines randomized components in the encoder with trainable parameters in the decoder.

5 Empirical evaluation

Prediction error (MAE) Resource utilization Prediction error (MAE) Resource utilization
30 mins 7 hours 30 mins 11 hours Batch/s Memory Batch size 30 mins 7 hours 30 mins 11 hours Batch/s Memory Batch size
100-NN DCRNN 1.39 0.09 3.34 0.22 3.54 0.48 2.04 0.01 9.63 GB 2 0.22 0.00 0.28 0.00 0.29 0.00 1.43 0.02 11.10 GB 2
Graph WaveNet 1.45 0.13 5.09 0.63 5.26 1.34 2.01 0.02 11.64 GB 2 0.23 0.00 0.36 0.01 0.36 0.01 2.41 0.03 8.39 GB 1
UG-Gated-GN 1.33 0.08 2.94 0.05 3.12 0.14 8.41 0.09 11.46 GB 5 0.22 0.00 0.28 0.00 0.28 0.00 8.21 0.08 11.70 GB 4
model 1.09 0.01 3.14 0.21 3.16 0.19 116.58 8.74 2.21 GB 4096 0.21 0.00 0.30 0.00 0.31 0.01 117.32 8.36 2.21 GB 4096

DCRNN 1.59 0.17 4.10 0.27 4.93 0.60 1.37 0.00 11.59 GB 1 0.23 0.00 0.29 0.00 0.29 0.00 1.13 0.01 11.10 GB 1
Graph WaveNet 1.65 0.23 6.93 0.58 7.93 0.17 0.77 0.00 11.35 GB 2 0.25 0.01 0.38 0.03 0.37 0.01 1.26 0.01 8.58 GB 1
UG-Gated-GN 1.61 0.06 3.25 0.04 3.04 0.05 8.83 0.10 11.14 GB 1 0.22 0.00 0.28 0.00 0.29 0.00 8.77 0.10 11.14 GB 1
model 1.09 0.00 3.06 0.11 3.13 0.13 118.64 8.35 2.21 GB 4096 0.21 0.00 0.30 0.00 0.31 0.01 115.85 10.60 2.21 GB 4096
Table 2: Results on large-scale datasets (averaged over at least independent runs). We report MAE over -step-ahead predictions, {30m, 7h30m, 11h}, together with timings and memory consumption. indicates that subsampling was needed to comply with the memory constraints. Bold numbers are within a standard deviation from the best reported average result.

We empirically evaluate our approach in different scenarios. In the first, we compare the performance of our forecasting architecture against state-of-the-art methods on popular, medium-scale, traffic forecasting benchmarks. In the second, we evaluate the scalability of the proposed method on large-scale spatiotemporal time series datasets by considering two novel benchmarks for load forecasting and PV production prediction.

Dataset # steps # nodes # edges sparsity
METR-LA 34272 207 1515 3.54%
PEMS-BAY 52116 325 2369 2.24%
PV-US (100nn) 8868 5016 417,199 1.66%
CER-En (100nn) 8868 6435 639,369 1.54%
PV-US 8868 5016 3,710,008 14.75%
CER-En 8868 6435 3,186,369 7.69%
Table 3: Additional information on the considered datasets.

In the first experiment we consider the METR-LA and PEMS-BAY datasets Li et al. (2018), which are popular medium-sized benchmarks used in the spatiotemporal forecasting literature. In particular, METR-LA consists of traffic speed measurements taken every minutes by detectors in the Los Angeles County Highway, while PEMS-BAY includes analogous observations recorded by sensors in the San Francisco Bay Area. We use the same preprocessing steps of previous works to extract a graph and obtain train, validation and test data splits Wu et al. (2019). For the second experiment, we introduce two larger-scale datasets derived from energy analytics data. The first dataset contains data coming from the Irish Commission for Energy Regulation Smart Metering Project (CER-ECommission for Energy Regulation 2016

), which has been previously used for benchmarking spatiotemporal imputation methods 

Cini et al. (2022); however, differently from previous works, we consider the full sensor network consisting of smart meters measuring energy consumption every minutes at both residential and commercial/industrial premises. The second large-scale dataset is obtained from the synthetic PV-US222 dataset Hummon et al. (2012), consisting of simulated energy production by 5016 PV farms scattered over the United States given historic weather data for the year , aggregated in half an hour intervals. Since the model does not have access to weather information, PV production at neighboring farms is instrumental for obtaining good predictions. Notably, CER-E and PV-US datasets are at least an order of magnitude larger than the datasets typically used for benchmarking spatiotemporal time series forecasting models. Note that for both PV-US and CER-En the (weighted) adjacency is obtained by applying a thresholded Gaussian kernel to the similarity matrix obtained by considering the geographic distance among the sensors and the correntropy Liu et al. (2007) among the time series, respectively. We provide further details on the datasets in the supplemental material.


We consider the following baselines:

  1. LSTM: a single standard gated recurrent neural network Hochreiter and Schmidhuber (1997) trained by sampling window of observations from each node-level time series by disregarding the spatial information;

  2. FC-LSTM: an LSTM processing input sequences as if they were a single high-dimensional multivariate time series;

  3. DCRNN: a recurrent graph network presented in Li et al. (2018) – differently from the original model we use a recurrent encoder followed by a linear readout (more details in the appendix);

  4. Graph WaveNet: a residual network that alternates temporal and graph convolutions over the graph that is given as input and an adjacency matrix that is learned by the model Wu et al. (2019);

  5. Gated-GN: a state-of-the-art time-than-graph Gao and Ribeiro (2022) model introduced in Satorras et al. (2022) for which we consider two different configurations. The first one – indicated as FC – uses attention over the full node set to perform spatial propagation, while the second one – indicated as UG – constrains the attention to edges of the underlying graph.

  6. DynGESN: the echo state network for dynamical graphs proposed in Micheli and Tortorella (2022).

For all the baselines, we use, whenever possible, the configuration found in the original papers or in their open-source implementation; in all the other cases we tune hyperparameters on the holdout validation set.

Experimental setup

For the traffic datasets, we replicate the setup used in previous works. In particular, each model is trained to predict the -step-ahead observations. In model, the input time series are first encoded by the spatiotemporal encoder, and then the decoder is trained by sampling mini-batches along the temporal dimension, i.e., by sampling sequences of observations.

For the large-scale datasets, we focus on assessing the scalability of the different architectures rather than maximizing forecasting accuracy. In particular, for both datasets, we consider the first months of data ( for months for training, month for validation, and month for testing). The models are trained to predict the next hours. We repeat the experiment in two different settings to test the scalability of the different architectures w.r.t. the number of edges. In the first setting, we extract the graph by sparsifying the graph adjacency matrix imposing a maximum of neighbors for each node, while in the second case we do not constrain the density of the adjacency matrix. Tab. 3 reports some details for the considered benchmarks.

To assess the performance in terms of scalability, we fix a maximum GPU memory budget of GB and select the batch size accordingly; if a batch size of does not fit in GB, we uniformly subsample edges of the graph to reduce the memory consumption. Differently from the other baselines, in model we first preprocess the data to obtain spatiotemporal embeddings and then train the decoder by uniformly sampling the node representations. We train each model for

hour, then restore the weights corresponding to the minimum training error and evaluate the forecasts on the test set. The choice of not running validation at each epoch was dictated by the fact that for some of the baselines running a validation epoch would take a large portion of the

hour budget.

The time required to encode the datasets with model’s encoder ranges from tens of seconds to minutes on an AMD EPYC 7513 processor with parallel processes. To ensure reproducibility, the time constraint is not imposed as a hard time out; conversely, we measure the time required for the update step of each model on an NVIDIA RTX A5000 GPU and fix the maximum number of updates accordingly. For model, the time required to compute node embeddings was considered as part of the training time and the number of updates was appropriately reduced to make the comparison fair. For all the baselines, we keep the same architecture used in the traffic experiment. For model we use the same hyperparameters for the decoder, but we reduce the dimension of the embedding (the value of ) so that a preprocessed dataset can fit in a maximum of GB of storage. To account for the different temporal scales, we increase the window size for all baselines and increase the number of layers in the ESN (while keeping the final size of similar). Additional details and the exact values of the hyperparameters are provided in the supplementary material.

Figure 3: Training curves on PV-US. The plot shows the average the standard deviation of independent runs. The plotted curves are smoothed with a running average of steps.

5.1 Results

Results for the traffic benchmarks are reported in Tab. 1; while the outcomes of the scalability experiments are shown in Tab. 2. We consider mean absolute error (MAE), mean squared error (MSE), and mean absolute percentage error

(MAPE) as evaluation metrics.

Traffic experiment

The purpose of the first experiment is to demonstrate that the proposed method achieves performance comparable to that of the state of the art. In this regard, results in Tab. 1 show that in all the considered scenarios model is always among the best performing forecasting architectures. The full-attention baseline is the strongest competitor which, however, has time and memory complexities that scale quadratically with the number of nodes. Regarding the other baselines, DCRNN underperforms compared to the other spatiotemporal GNN architectures. DynGESN, the fully randomized architecture, despite being very fast to train, obtains reasonable performance in short-range predictions but falls short over longer forecasting horizons in the considered scenarios. In light of these results, it is worth commenting on the efficiency of model compared to the baselines. Approaches like DCRNN and Graph Wavenet, perform graph convolutions whose time and space of complexity is , being the number of edges, the number of layers ( in Graph Wavenet), and the time steps. Such complexity is completely amortized by the preprocessing step in our architecture. Similarly, Gated-GN, while being architecturally much simpler, propagates spatial information by relying on the attention mechanism that is known to scale poorly with the dimensionality of the problem. The next experimental setting shines a light on these shortcomings.

The bottom of Tab. 1 reports results for the ablation of key elements of the proposed architecture: No-Space-Enc. indicates that the embeddings are built without the spatial propagation step; FC-Dec. consider the case where the structure of the embedding is ignored in the readout and the sparse weight matrix in Eq. 7 is replaced by a fully-connected one; GC-Dec. indicates that the spatial propagation is limited to the neighbors of order and, thus, the decoder behaves similarly to a single-layer graph convolutional network. Results clearly show the optimality of the proposed architectural design.

Large-scale experiment

Tab. 2 reports the results of the scalability experiment where we considered only the spatiotemporal GNNs trained by gradient descent. We excluded the full-attention baseline (FC-Gated-GN) as its complexity prevented scaling to the larger datasets; however, we considered the UG version where attention is restrained to each node’s neighborhood. There are several comments that need to be made here. First of all, batch size has a different meaning for our model and the other baselines. In our case, each sample corresponds to a single spatiotemporal (preprocessed) observation; for the other methods, a sample corresponds to a window of observations where edges of the graph are eventually subsampled if the memory constraints could not be met otherwise. In both cases, the loss is computed w.r.t. all the observations in the batch. The results clearly show that model can be trained efficiently also in resource-constrained settings, with contained GPU memory usage. In particular, the update frequency (batch/s) is up to order of magnitude higher. Notably, resource utilization at training time remains constant (by construction) in the two considered scenarios, while almost all the baselines require edge subsampling in order to meet the resource constraints. Fig. 3 shows learning curves for the PV-US dataset, further highlighting the vastly superior efficiency, scalability, and learning stability of model. Finally, results concerning the forecasting accuracy show that performance is competitive with the state of the art in all the considered scenarios.

6 Remarks and conclusion

We proposed model, a scalable architecture for graph-based spatiotemporal time series forecasting. Our approach can compete with the state of the art in popular medium-sized benchmarks datasets, while greatly improving the scalability in large sensor networks. While sampling in model largely reduces GPU memory usage compared to the other methods, the entire processed sequence can take up a large portion of system memory, depending on the size of the reservoir. Nevertheless, the preprocessing can be distributed and the preprocessed data stored on disk. Then, each data batch can be loaded incrementally during training, as usually done on large datasets, such as those from computer vision applications. We believe that model constitutes an important stepping stone for future research on scalable spatiotemporal forecasting and has the potential of being widely adopted by practitioners in both academia and industry. Future work can explore a tighter integration of the spatial and temporal encoding components, assess performance on even larger benchmarks, and transfer across sensor networks, i.e., in an inductive learning setting.


  • D. Bacciu, F. Errica, A. Micheli, and M. Podda (2020) A gentle introduction to deep learning for graphs. Neural Networks 129, pp. 203–221. Cited by: §1.
  • P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al. (2018) Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261. Cited by: §A.4.
  • F. M. Bianchi, S. Scardapane, S. Løkse, and R. Jenssen (2020) Reservoir computing approaches for representation and classification of multivariate time series. IEEE Transactions on Neural Networks and Learning Systems 32 (5), pp. 2169–2179. Cited by: §2.
  • W. Chiang, X. Liu, S. Si, Y. Li, S. Bengio, and C. Hsieh (2019) Cluster-gcn: an efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 257–266. Cited by: §1, §4.
  • A. Cini, I. Marisca, and C. Alippi (2022) Filling the g_ap_s: multivariate time series imputation by graph neural networks. In International Conference on Learning Representations, External Links: Link Cited by: §A.3, §5.
  • A. Cini and I. Marisca (2022) Torch Spatiotemporal External Links: Link Cited by: 3rd item.
  • Commission for Energy Regulation (2016) CER Smart Metering Project - Electricity Customer Behaviour Trial, 2009-2010 [dataset]. Irish Social Science Data Archive. SN: 0012-00. External Links: Link Cited by: §A.3, §5.
  • W. Falcon and The PyTorch Lightning team (2019)

    PyTorch Lightning External Links: Document, Link Cited by: 4th item.
  • M. Fey and J. E. Lenssen (2019) Fast graph representation learning with pytorch geometric. arXiv preprint arXiv:1903.02428. Cited by: 2nd item.
  • F. Frasca, E. Rossi, D. Eynard, B. Chamberlain, M. Bronstein, and F. Monti (2020) SIGN: scalable inception graph neural networks. In ICML 2020 Workshop on Graph Representation Learning and Beyond, Cited by: §1, §4.
  • C. Gallicchio, A. Micheli, and L. Pedrelli (2017) Deep reservoir computing: a critical experimental analysis. Neurocomputing 268, pp. 87–99. Cited by: §1, §3.1.
  • C. Gallicchio, A. Micheli, and L. Pedrelli (2018) Design of deep echo state networks. Neural Networks 108, pp. 33–47. Cited by: §A.4, footnote 1.
  • A. Gandhi, S. Kaveri, V. Chaoji, et al. (2021) Spatio-temporal multi-graph networks for demand forecasting in online marketplaces. In

    Joint European Conference on Machine Learning and Knowledge Discovery in Databases

    pp. 187–203. Cited by: §4.
  • J. Gao and B. Ribeiro (2022) On the equivalence between temporal and static equivariant graph representations. In International Conference on Machine Learning, pp. 7052–7076. Cited by: §4, item 5.
  • W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. Advances in Neural Information Processing Systems 30. Cited by: §1, §4.
  • C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, et al. (2020) Array programming with numpy. Nature 585 (7825), pp. 357–362. Cited by: 5th item.
  • D. Hendrycks and K. Gimpel (2016) Gaussian error linear units. arXiv preprint arXiv:1606.08415. Cited by: §A.4.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: item 1.
  • M. Hummon, E. Ibanez, G. Brinkman, and D. Lew (2012) Sub-hour solar data for power system modeling from static spatial variability analysis. Technical report National Renewable Energy Lab.(NREL), Golden, CO (United States). Cited by: §A.3, §5.
  • H. Jaeger, M. Lukoševičius, D. Popovici, and U. Siewert (2007) Optimization and applications of echo state networks with leaky-integrator neurons. Neural Networks 20 (3), pp. 335–352. Cited by: §3.1.
  • H. Jaeger (2001) The “echo state” approach to analysing and training recurrent neural networks-with an erratum note. Bonn, Germany: German National Research Center for Information Technology GMD Technical Report 148 (34), pp. 13. Cited by: §A.4, §1, §2.
  • H. V. Jagadish, J. Gehrke, A. Labrinidis, Y. Papakonstantinou, J. M. Patel, R. Ramakrishnan, and C. Shahabi (2014) Big data and its technical challenges. Communications of the ACM 57 (7), pp. 86–94. Cited by: §A.3.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25. Cited by: §3.2.
  • G. Lai, W. Chang, Y. Yang, and H. Liu (2018) Modeling long-and short-term temporal patterns with deep neural networks. In The 41st international ACM SIGIR conference on research & development in information retrieval, pp. 95–104. Cited by: §A.3.
  • Y. Li, R. Yu, C. Shahabi, and Y. Liu (2018) Diffusion convolutional recurrent neural network: data-driven traffic forecasting. In International Conference on Learning Representations, Cited by: §A.5, §3.1, §4, item 3, §5.
  • W. Liu, P. P. Pokharel, and J. C. Principe (2007) Correntropy: properties and applications in non-gaussian signal processing. IEEE Transactions on signal processing 55 (11), pp. 5286–5298. Cited by: §A.3, §5.
  • M. Lukoševičius and H. Jaeger (2009) Reservoir computing approaches to recurrent neural network training. Computer Science Review 3 (3), pp. 127–149. Cited by: §2.
  • M. Lukoševičius (2012) A practical guide to applying echo state networks. In Neural networks: Tricks of the trade, pp. 659–686. Cited by: §A.4.
  • I. Marisca, A. Cini, and C. Alippi (2022) Learning to reconstruct missing data from spatiotemporal graphs with sparse observations. arXiv preprint arXiv:2205.13479. Cited by: §4.
  • A. Micheli and D. Tortorella (2022) Discrete-time dynamic graph echo state networks. Neurocomputing 496, pp. 85–95. Cited by: §4, item 6.
  • (2021) Neptune: metadata store for mlops, built for research and production teams that run a lot of experiments External Links: Link Cited by: §A.1.
  • B. N. Oreshkin, A. Amini, L. Coyle, and M. Coates (2021) FC-gaga: fully connected gated graph architecture for spatio-temporal traffic forecasting. In

    Proceedings of the AAAI Conference on Artificial Intelligence

    Vol. 35, pp. 9233–9241. Cited by: §4.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. Advances in neural information processing systems 32, pp. 8026–8037. Cited by: 1st item.
  • V. G. Satorras, S. S. Rangapuram, and T. Januschowski (2022) Multivariate time series forecasting with latent graph inference. arXiv preprint arXiv:2203.03423. Cited by: §4, item 5.
  • F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini (2008) The graph neural network model. IEEE Transactions on Neural Networks and Learning Systems 20 (1), pp. 61–80. Cited by: §1.
  • Y. Seo, M. Defferrard, P. Vandergheynst, and X. Bresson (2018) Structured sequence modeling with graph convolutional recurrent networks. In International Conference on Neural Information Processing, pp. 362–373. Cited by: §4.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §A.4.
  • R. K. Srivastava, K. Greff, and J. Schmidhuber (2015) Highway networks. arXiv preprint arXiv:1505.00387. Cited by: §A.4.
  • G. Van Rossum and F. L. Drake (2009) Python 3 reference manual. CreateSpace, Scotts Valley, CA. External Links: ISBN 1441412697 Cited by: §A.1.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §4.
  • Y. Wu, D. Zhuang, A. Labbe, and L. Sun (2021) Inductive graph neural networks for spatiotemporal kriging. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 4478–4485. Cited by: §4.
  • Z. Wu, S. Pan, G. Long, J. Jiang, X. Chang, and C. Zhang (2020) Connecting the dots: multivariate time series forecasting with graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, pp. 753–763. Cited by: §4, §4.
  • Z. Wu, S. Pan, G. Long, J. Jiang, and C. Zhang (2019) Graph wavenet for deep spatial-temporal graph modeling. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 1907–1913. Cited by: §A.3, §4, item 4, §5.
  • Z. Wu, D. Zheng, S. Pan, Q. Gan, G. Long, and G. Karypis (2022) TraverseNet: unifying space and time in message passing for traffic forecasting. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §1, §4.
  • B. Yu, H. Yin, and Z. Zhu (2018) Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 3634–3640. Cited by: §4.
  • H. Zeng, H. Zhou, A. Srivastava, R. Kannan, and V. Prasanna (2019) GraphSAINT: graph sampling based inductive learning method. In International Conference on Learning Representations, Cited by: §1, §4.
  • C. Zheng, X. Fan, C. Wang, and J. Qi (2020) Gman: a graph multi-attention network for traffic prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 1234–1241. Cited by: §4.


Appendix A Detailed experimental settings

In this appendix, we provide additional details on the experimental settings for the results presented in the paper.

a.1 Software platform

The Python Van Rossum and Drake (2009) code used to run all the computational experiments will be released upon publication. We relied on the following open-source libraries:

  • PyTorch (Paszke et al., 2019);

  • PyTorch Geometric (Fey and Lenssen, 2019);

  • Torch Spatiotemporal Cini and Marisca (2022);

  • PyTorch Lightning Falcon and The PyTorch Lightning team (2019);

  • numpy (Harris et al., 2020).

We relied on the Neptune333 (, 2021) DevOps infrastructure for the logging of the experiments. For all the baselines, we run all the experiments by relying on their open-source implementations.

a.2 Hardware platform

Experiments were run on a server equipped with two AMD EPYC 7513 processors and four NVIDIA RTX A5000. Reproducibility of the scalability experiments was ensured by taking timings for the update step of each model and setting the number of updates performed by each model accordingly (more details in Sec. A.5).

a.3 Datasets

All datasets used in our study are open-source or freely available for research purposes. The input graphs are extracted by at first computing a weighted, dense adjacency matrix from (side) spatial information, e.g., the geographic position of the sensors, or by computing a (dis)similarity metric among the time series. The adjacency is then sparsified to obtain by zeroing out values under a certain threshold and, optionally, capping the maximum number of neighbors for each node. For all datasets, the only exogenous variable we consider is the encoding of the time of the day with two sinusoidal functions.

Traffic datasets

Both METR-LA and PEMS-BAY are widely popular benchmarks. We use the same setup of previous works Wu et al. (2019) for all the preprocessing steps. As mentioned in Sec. 5, PEMS-BAY contains months of data from traffic sensors in the San Francisco Bay Area, while METR-LA contains months of analogous readings acquired from detectors in the Los Angeles County Highway (Jagadish et al., 2014). In both datasets, observations are aggregated at a minutes time scale.


The data from the Irish Commission for Energy Regulation Smart Metering Project Commission for Energy Regulation (2016) contains measurements of the energy consumption aggregated at a minutes scale in households and small/medium enterprises. The full dataset consists of observations from smart meters measuring energy consumption every minutes. As mentioned in the paper, we use the same preprocessing of Cini et al. (2022), and, in particular, an analogous strategy to extract a graph from the correntropy Liu et al. (2007) among time series. Note that, differently from Cini et al. (2022), we consider the full sensor network. For all the spatiotemporal GNN baselines, we set the window size to steps. Access to the dataset can be obtained free of charge by following the information provided at; we provide the preprocessing scripts in the supplementary material.


The PV-US444 dataset Hummon et al. (2012) instead consists in a collection of simulated energy production by 5016 PV farms for the year . In the raw datasets, samples are generated every minute, we aggregate observations at minutes intervals by taking their mean. A (small) subset of this dataset (often referred to as “Solar Energy”555 with only the PV plants in Alabama state has been used as a multivariate time series forecasting benchmark Lai et al. (2018). To obtain an adjacency matrix, we consider the virtual position of the farms in terms of geographic coordinates, and we apply a Gaussian kernel over the pairwise Haversine distances, as described at the beginning of this section. Similarly to the CER-En dataset, we set the window size of the baselines to steps. In the supplementary material, we provide the code to download and preprocess the data.

a.4 Additional details on SGP architecture

We implemented the DeepESN encoder following the design principles assessed in previous works Gallicchio et al. (2018); Lukoševičius (2012). In particular, we decrease the discount factor progressively at each layer by subtracting from its initial value. We also randomly set of the weights of the networks to to obtain a sparse reservoir. We use as nonlinear activation function. The recurrent weights are normalized so that the spectral radius of the corresponding matrix is lower than one Jaeger (2001).

For the spatial encoding, we compute the embeddings at the different spatial scales iteratively. Additionally, we also concatenate to the spatiotemporal embedding the graph-wise average of the temporal embedding to act as a sort of global attribute Battaglia et al. (2018).

The MLP decoder is implemented as standard feed-forward network with parametrized residual connections between layers 

Srivastava et al. (2015), SiLU activation function Hendrycks and Gimpel (2016) and optional Dropout Srivastava et al. (2014) regularization.

a.5 Training and evaluation procedure


As previously mentioned, for the traffic datasets we used the same training settings of previous works. For all the baselines we kept the same parameters of previous works whenever possible. For SGP we selected the hyperparameters by performing an initial random search and then manually adjusting the hyperparameters of the reservoir and selecting the best performing configuration on the validation set. In particular, for METR-LA we used a DeepESN with layers of units each, an initial decay factor of , and a spectral radius of . For PEMS-BAY, instead, we used an encoder with a single layer of units, a decay rate of , and a spectral radius of . For both datasets, we set and used the bidirectional encoding scheme. In the decoder, for the first layer we used units for each group in METR-LA and PEMS-BAY, followed by fully connected layers of units each with a dropout rate of . The model is trained with early stopping for a maximum of epochs of batch each with the Adam optimizer and a multi-step learning rate scheduler.


In Tab. 2 of the paper, we report the time required for a single model update (in terms of batches per second) and GPU memory usage for every considered method. To ensure a fair assessment, we record the time interval between the beginning of the inference step and the end weights’ update for batches and exclude the first and last measurements (that may have overheads). We exclude from the computation the overhead introduced – for every batch – by the edge subsampling strategy adopted for the scalability of the baselines.

To measure the GPU memory required, we exploit NVIDIA System Management Interface666, which provides near real-time GPU usage monitoring.

All the experiments designed to measure time and memory requirements have been run on the same machine on a dedicated reserved GPU. We kept the models mostly unchanged w.r.t. the traffic experiment. However, we increased the window size to for the baselines and updated the configuration of the reservoir for SGP to account for the different time scales. In particular, we increased the number of reservoir layers to and in PV-US and to in CER-En, respectively, and reduced the number of units accordingly. The difference in the number of layers between the two datasets is motivated by the choice of keeping the size of the preprocessed sequences similar. For this reason, we also set and use the unidirectional encoding to limit the amount of required storage to a maximum GB for each dataset.


For LSTM and FC-LSTM we consider a single-layer LSTM with units for the temporal embedding and an MLP with one hidden layer with units and dropout rate of . For DCRNN, as reported in Li et al. (2018), we set the number of units in the hidden state to and the order of the diffusion convolution to ; compared to the original mode, we use a feed-forward readout instead of a recurrent one to enable scalability on the larger benchmarks. For Graph WaveNet and Gated-GN we use the same hyperparameters and learning rate schedulers reported in the relative papers. We implemented all the baselines in PyTorch and PyTorch Geometric (for graph-based methods) following the open-source implementations provided by the authors. To improve memory and computation efficiency in message-passing layers, we use sparse matrix-matrix multiplications instead of scatter-gather operations whenever possible. We fix the maximum number of training epochs to to allow all the models to reach convergence, and stop the training if the MAE computed on the validation set does not decrease for epochs. We evaluate the models using the weights corresponding to the minimum validation MAE.


we set the hyperparameters of the reservoir to the same ones used for SGP and increase the number of units to approximately match the dimensions of the final embeddings extracted by our method. We trained the readout with Ridge regression by selecting the weight of the L2-regularization term on the validation set.