1 Introduction
As graph neural networks (GNNs; Scarselli et al. 2008; Bacciu et al. 2020) are gaining more traction in many application fields, the need for architectures scalable to large graphs – such as those associated with large sensor networks – is becoming a pressing issue. While research to improve the scalability of models for static graph signals has been very prolific Hamilton et al. (2017); Chiang et al. (2019); Zeng et al. (2019); Frasca et al. (2020), little attention has been paid to the additional challenges encountered when dealing with discretetime dynamical graphs, i.e., spatiotemporal time series. Several of the existing scalable training techniques rely on subsampling graphs to reduce the computational requirements of the training procedure, e.g., Hamilton et al. (2017); Zeng et al. (2019). However, sampling the nodelevel observations as if they were i.i.d. can break relational (spatial) dependencies in static graphs and it is even more problematic in the dynamic case, as dependencies occur also across the temporal dimension. Indeed, complex temporal and spatial dynamics that emerge from the interactions across the whole graph over a long time horizon, can be easily disrupted by perturbing such spatiotemporal structure with subsampling. As an alternative, precomputing aggregated features over the graph allows for factoring out spatial propagation from the training phase in certain architetures Frasca et al. (2020). However, similarly to the subsampling approach, extending this method to the spatiotemporal case is not trivial as the preprocessing step must account also for the temporal dependencies besides the graph topology.
In this paper, we propose a novel scalable encoderdecoder architecture for processing spatiotemporal data; Fig. 1 shows highlevel schematics of the proposed approach. The spatiotemporal encoding scheme is trainingfree: first, it exploits a deep randomized recurrent neural network Jaeger (2001); Gallicchio et al. (2017)
to encode the history of each sequence in a highdimensional vector embedding; then, it uses powers of the graph adjacency matrix to build informative node representations of the spatiotemporal dynamics at different scales. According to the downstream task at hand, the
decoder maps the node representations into the desired output, e.g., the future values of the time series associated with each node. To improve efficiency, we exploit the structure of the extracted embedding to design the decoder to act as a collection of filters localized at different spatiotemporal scales.Since the spatiotemporal encoder requires neither training nor supervision, the representation of each node and time step can be computed in a preprocessing stage, without the constraints that come from online training on GPUs with limited memory. The decoder is the only component of the architecture with trainable parameters. However, since spatiotemporal relationships are already embedded in the representations, the embeddings can be processed independently from their spatiotemporal context with two consequent advantages. First, training can be done nodewise, allowing for sampling node representations in minibatches of a size proportional to the hardware capacity. Second, the decoder can be implemented similarly to a standard multilayer perceptron (MLP) readout, which is fast and easy to train.
Let and be the number of steps and the number of edges in the input graph, respectively. The cost of training a standard spatiotemporal GNN on a minibatch of data has a computational and memory cost that scales as , or in attentionbased architectures Wu et al. (2022). Conversely, in our approach minibatches can be sampled disregarding the length of the sequence and size of the graph, thus making scalability in training constant, i.e., , w.r.t. the spatiotemporal dimension of the problem.
Our contributions can be summarized as follows.

We propose a general scalable deep learning framework for spatiotemporal time series, which exploits a novel encoding method based on randomized recurrent components and scalable GNNs architectures.

We apply the proposed model to forecast multivariate time series, whose channels are subject to spatial relationships described by a graph.

We carry out a rigorous and extensive empirical evaluation of the proposed architecture and variations thereof. Notably, we introduce two benchmarks for scalable spatiotemporal forecasting architectures.
Empirical results show that our approach performs on par with the state of the art while being easy to implement, computationally efficient, and extremely scalable. Given these considerations, we refer to our architecture as model.
2 Preliminaries and Problem Definition
We consider discretetime spatiotemporal graphs. In particular, given interlinked sensors, we indicate with the dimensional multivariate observation associated with the th sensor at timestep , with the node attribute matrix encompassing measurements graphwise, and with the sequence of measurements collected in the time interval at each sensor. Similarly, we indicate with the matrix containing exogenous variables (e.g., weather information related to a monitored area) associated with each sensor at the th timestep. Then, we indicate additional, optional, static node attributes as . The relational information is encoded in a, potentially dynamic, weighted adjacency matrix . We indicate with the tuple the graph signal at the th timestep. Note that the number of sensors in a network is here considered fixed only to ease the presentation; we only request nodes to be distinguishable across time steps. The objective of spatiotemporal forecasting is to predict the next observations given a window of past measurements. In particular, we consider the family of forecasting models s.t.
(1) 
where indicates the learnable parameters of the model and the step ahead point forecast.
EchoState Networks
Echo state networks Jaeger (2001); Lukoševičius and Jaeger (2009) are a class of randomized architectures that consist of recurrent neural networks with random connections that encode the history of input signals into a highdimensional state representation to be used as input to a (trainable) readout layer. The main idea is to feed an input signal into a highdimensional, randomized, and nonlinear reservoir, whose internal state can be used as an embedding of the input dynamics. An echo state network is governed by the following state update equation:
(2) 
where indicates a generic input to the system, and are the random matrices defining the connectivity pattern in the reservoir, is a randomly initialized bias, indicates the reservoir state, and
is a nonlinear activation function (usually
tanh). If the random matrices are defined properly, the reservoir will extract a rich pool of dynamics characterizing the system underlying the input time series and, thus, the reservoir states become informative embeddings of Lukoševičius and Jaeger (2009). Thanks to the nonlinearity of the reservoir, the embeddings are commonly processed with a linear readout that is optimized with a least squares procedure to perform classification, clustering, or time series forecasting Bianchi et al. (2020).3 Scalable Spatiotemporal GNNs
This section presents our approach to building scalable GNN architectures for time series forecasting. Our method is based on a hybrid encoderdecoder architecture. The encoder first constructs representations of the time series observed at each node by using a reservoir that accounts for dynamics at different time scales. Representations are further processed to account for spatial dynamics described by the graph structure. In particular, as shown on the righthand side of Fig. 2, we use incremental powers of the graph adjacency matrix to propagate and aggregate information along the spatial dimension. Each power of the propagation matrix accounts for different scales of spatial dynamics. The final embedding is then built by concatenating representations obtained w.r.t. each propagation step, thus resulting in a rich encoding of both spatial and temporal features.
The encoder does not need any training and, once computed, the embeddings can be uniformly sampled over time and space when training a nonlinear readout to perform
stepahead predictions. The straightforward choice for the decoder (i.e., readout) is to map the encodings to the outputs (i.e., predictions) by using a linear transformation or a standard MLP. However, to further enhance scalability, our decoder exploits the structure of the embedding to reduce the number of parameters and learn filters that are localized in time and space. As we will discuss in Sec.
3.2, this is done by learning separate weight matrices for each spatiotemporal scale.The following subsections describe in detail each component of the architecture.
3.1 Spatiotemporal Encoder
We consider as temporal encoders deep echo state networks (DeepESN; Gallicchio et al. 2017
) with leaky integrator neurons
Jaeger et al. (2007). In particular, we consider networks where the signal associated with each node is encoded by a stack of randomized recurrent layers s.t.(3) 
where is a discount factor associated with th layer, , , are random weight matrices, indicates the hidden state of the system w.r.t. the th node at the th layer, and indicates nodewise concatenation. As Eq. 3 shows, DeepESNs are a hierarchical stack of reservoir layers that, e.g., by changing the discount factor at each layer, extract a rich pool of multiscale temporal dynamics Gallicchio et al. (2017)^{1}^{1}1We refer to Gallicchio et al. (2018) for more details on the properties and stability of DeepESNs.. Given a DeepESN encoder, the input is represented by the concatenation of the states from each layer, i.e., we obtain nodelevel temporal encodings for each node and timestep as
(4) 
We indicate as the encoding for the whole graph at time . The extraction of the nodelevel temporal embeddings is depicted on the leftend side of Fig. 2, where, to simplify the drawing, we depict an ESN with a single layer.
The next step is to propagate information along the spatial dimension. As discussed at the beginning of the section, we use powers of a graph shift operator to propagate and aggregate node representations at different scales. By using a notation similar to Eq. 4, we obtain spatiotemporal encodings as
(5) 
where indicates a generic graph shift operator matching the sparsity pattern of the graph adjacency matrix. In practice, by indicating with the graph degree matrix, we use in the case of a directed graph and the symmetrically normalized adjacency in the undirected case. Furthermore, for directed graphs we optionally increase the number of representations to to account for bidirectional dynamics, i.e., we repeat the encoding process w.r.t. the transpose adjacency matrix similarly to Li et al. (2018). Intuitively, each propagation step propagates and aggregates – properly weighted – features between nodes connected by paths of length in the graph. As shown in Eq. 5, features corresponding to each order can be computed recursively with sparse matrixmatrix multiplications (Fig. 2). Alternatively, each matrix can be precomputed and the computation of the different blocks of matrix can be distributed in a parallel fashion as suggested in Fig. 1. Even in the case of extremely large graphs, features can be computed offline by exploiting distributed computing as they do not need to be loaded on the GPU memory.
3.2 MultiScale Decoder
The role of the decoder is that of selecting and weighing from the pool of the (possibly redundant) features extracted by the spatiotemporal encoder and mapping them to the desired output. Representations
can be fed into an MLP that performs nodewise predictions. Since the representations are large vectors, a naïve implementation of the MLP results in many parameters that hinder scalability. Therefore, we replace the first MLP layer with a more efficient implementation that exploits the structure of the embeddings.As we described in Sec. 3.1, is the concatenation of the representations corresponding to different spatial propagation steps which, in turn, are obtained from the concatenation of multiscale temporal features. To exploit this structure, we design the first layer of the decoder with a sparse connectivity pattern to learn representations s.t.
(6)  
(7)  
(8) 
where are the learnable parameters and is an activation function. In practice, representations can be efficiently computed by exploiting grouped d convolutions (e.g., see Krizhevsky et al. 2012) to parallelize computation on GPUs. In particular, if we indicate the d grouped convolution operator with groups and kernel size as , and the collection of the decoder parameters as we can compute as
(9) 
with in the case of undirected graphs and for the directed case. Besides reducing the number of parameters by a factor of , this architecture localizes filters w.r.t. the dynamics of spatial order and temporal scale . In fact, as highlighted in Eq. 6–8, representation can be seen as a concatenation of the results of graph convolutions of different order. Finally, the obtained representations are fed into an MLP that predicts the stepahead observations as
(10) 
where the static nodelevel attributes can also be augmented by concatenating a set of learnable parameters (i.e., a learnable positional encoding).
Training and sampling
The main improvement introduced by the proposed approach in terms of scalability concerns the training procedure. Representations embed both the temporal and spatial relationships among observations over the sensor network. Consequently, each sample can be processed independently since no further spatiotemporal information needs to be collected. This allows for training the decoder with SGD by uniformly and independently sampling minibatches of data points . This is the key property that makes the training procedure extremely scalable and drastically reduces the lower bound on the computational complexity required for the training w.r.t. standard spatiotemporal GNN architectures. We provide an efficient implementation of the full encoding and training procedure as supplementary material to the paper (see Sec. 5).
4 Related works
METRLA  PEMSBAY  
15 min  30 min  60 min  Average  15 min  30 min  60 min  Average  
MAE  MAE  MAE  MAE  MSE  MAPE (%)  MAE  MAE  MAE  MAE  MSE  MAPE (%)  
LSTM  2.99 0.00  3.58 0.00  4.43 0.01  3.58 0.00  53.01 0.13  10.19 0.05  1.39 0.00  1.83 0.01  2.35 0.01  1.79 0.00  17.72 0.08  4.16 0.05 
FCLSTM  3.33 0.01  3.43 0.01  3.67 0.01  3.46 0.01  44.85 0.12  10.15 0.09  2.22 0.01  2.25 0.01  2.34 0.02  2.26 0.01  22.31 0.27  5.33 0.04 
DynGESN  3.27 0.00  3.99 0.00  5.00 0.00  3.98 0.00  50.30 0.07  11.11 0.01  1.57 0.00  2.13 0.01  2.81 0.02  2.09 0.01  18.43 0.07  4.74 0.01 
DCRNN  2.82 0.00  3.23 0.01  3.74 0.01  3.20 0.00  41.57 0.22  8.88 0.05  1.36 0.00  1.71 0.00  2.08 0.01  1.66 0.00  14.40 0.15  3.76 0.01 
Graph WaveNet  2.72 0.01  3.10 0.02  3.54 0.03  3.06 0.02  38.22 0.32  8.40 0.03  1.31 0.00  1.64 0.01  1.94 0.01  1.58 0.00  13.12 0.14  3.58 0.02 
FCGatedGN  2.72 0.01  3.05 0.01  3.44 0.01  3.01 0.00  37.48 0.32  8.27 0.00  1.32 0.00  1.63 0.01  1.89 0.01  1.56 0.01  12.96 0.11  3.51 0.03 
UGGatedGN  2.72 0.00  3.10 0.00  3.54 0.01  3.06 0.00  38.82 0.08  8.40 0.04  1.33 0.00  1.67 0.01  1.99 0.01  1.61 0.01  13.76 0.14  3.59 0.03 
model  2.69 0.00  3.05 0.00  3.45 0.00  3.00 0.00  36.70 0.10  8.27 0.02  1.30 0.00  1.60 0.00  1.88 0.00  1.54 0.00  12.43 0.03  3.44 0.01 
Ablations  
–NoSpaceEnc.  2.84 0.00  3.26 0.00  3.74 0.00  3.22 0.00  44.55 0.05  9.20 0.01  1.34 0.00  1.68 0.00  2.02 0.00  1.62 0.00  14.14 0.06  3.67 0.01 
–FCDec.  2.76 0.01  3.13 0.01  3.52 0.02  3.08 0.01  37.92 0.35  8.63 0.11  1.35 0.01  1.67 0.01  1.96 0.01  1.61 0.01  13.04 0.23  3.61 0.04 
–GCDec.  2.77 0.00  3.17 0.00  3.63 0.00  3.12 0.00  40.67 0.06  8.74 0.01  1.32 0.00  1.65 0.00  1.97 0.00  1.59 0.00  13.47 0.08  3.60 0.01 
minutes time horizons. Bold numbers are within a standard deviation from the best reported average result.
Spatiotemporal GNNs are essentially based on the idea of integrating messagepassing modules in architectures to process sequential data. Notably, Seo et al. (2018) and Li et al. (2018) use messagepassing to implement gates of recurrent neural networks. Yu et al. (2018) and Wu et al. (2019, 2020) proposed architectures alternating temporal and spatial convolutions. Wu et al. (2022) and Marisca et al. (2022), instead, exploit the attention mechanism to propagate information along both time and space. Modern architectures often combine some type of relational inductive bias, with full Transformerlike attention Vaswani et al. (2017) along the spatial dimension Zheng et al. (2020); Oreshkin et al. (2021); Satorras et al. (2022), which, however, makes the computation scale quadratically with the number of nodes. model falls within the category of timethengraph models, i.e., models where the temporal information is encoded before being propagated along the spatial dimension. Gao and Ribeiro (2022) showed that such models can be more expressive than architectures that alternate temporal and spatial processing steps.
Research on scalable models for discretetime dynamic graphs has been relatively limited. Practitioners have mostly relied on methods developed in the context of static graphs which include nodecentric, GraphSAGElike, approaches Hamilton et al. (2017) or subgraph sampling methods, such as ClusterGCN Chiang et al. (2019) or GraphSAINT Zeng et al. (2019). Wu et al. (2020); Gandhi et al. (2021); Wu et al. (2021) are examples of such approaches. Among scalable GNNs for static graphs, SIGN Frasca et al. (2020) is the approach most related to our method. Like in our approach, SIGN performs spatial propagation as a preprocessing step by using different shift operators to aggregate across different graph neighborhoods, which are then fed to an MLP. However, SIGN is limited to static graphs and propagates raw nodelevel attributes. Finally, similar to our work, DynGESN Micheli and Tortorella (2022) processes dynamical graphs with a recurrent randomized architecture. However, the architecture in DynGESN is completely randomized, while ours is hybrid as it combines randomized components in the encoder with trainable parameters in the decoder.
5 Empirical evaluation
PVUS  CEREn  

Prediction error (MAE)  Resource utilization  Prediction error (MAE)  Resource utilization  
30 mins  7 hours 30 mins  11 hours  Batch/s  Memory  Batch size  30 mins  7 hours 30 mins  11 hours  Batch/s  Memory  Batch size  
100NN  DCRNN  1.39 0.09  3.34 0.22  3.54 0.48  2.04 0.01  9.63 GB  2  0.22 0.00  0.28 0.00  0.29 0.00  1.43 0.02  11.10 GB  2 
Graph WaveNet  1.45 0.13  5.09 0.63  5.26 1.34  2.01 0.02  11.64 GB  2  0.23 0.00  0.36 0.01  0.36 0.01  2.41 0.03  8.39 GB  1  
UGGatedGN  1.33 0.08  2.94 0.05  3.12 0.14  8.41 0.09  11.46 GB  5  0.22 0.00  0.28 0.00  0.28 0.00  8.21 0.08  11.70 GB  4  
model  1.09 0.01  3.14 0.21  3.16 0.19  116.58 8.74  2.21 GB  4096  0.21 0.00  0.30 0.00  0.31 0.01  117.32 8.36  2.21 GB  4096  
Full 
DCRNN  1.59 0.17  4.10 0.27  4.93 0.60  1.37 0.00  11.59 GB  1  0.23 0.00  0.29 0.00  0.29 0.00  1.13 0.01  11.10 GB  1 
Graph WaveNet  1.65 0.23  6.93 0.58  7.93 0.17  0.77 0.00  11.35 GB  2  0.25 0.01  0.38 0.03  0.37 0.01  1.26 0.01  8.58 GB  1  
UGGatedGN  1.61 0.06  3.25 0.04  3.04 0.05  8.83 0.10  11.14 GB  1  0.22 0.00  0.28 0.00  0.29 0.00  8.77 0.10  11.14 GB  1  
model  1.09 0.00  3.06 0.11  3.13 0.13  118.64 8.35  2.21 GB  4096  0.21 0.00  0.30 0.00  0.31 0.01  115.85 10.60  2.21 GB  4096 
We empirically evaluate our approach in different scenarios. In the first, we compare the performance of our forecasting architecture against stateoftheart methods on popular, mediumscale, traffic forecasting benchmarks. In the second, we evaluate the scalability of the proposed method on largescale spatiotemporal time series datasets by considering two novel benchmarks for load forecasting and PV production prediction.
Dataset  # steps  # nodes  # edges  sparsity 

METRLA  34272  207  1515  3.54% 
PEMSBAY  52116  325  2369  2.24% 
PVUS (100nn)  8868  5016  417,199  1.66% 
CEREn (100nn)  8868  6435  639,369  1.54% 
PVUS  8868  5016  3,710,008  14.75% 
CEREn  8868  6435  3,186,369  7.69% 
Datasets
In the first experiment we consider the METRLA and PEMSBAY datasets Li et al. (2018), which are popular mediumsized benchmarks used in the spatiotemporal forecasting literature. In particular, METRLA consists of traffic speed measurements taken every minutes by detectors in the Los Angeles County Highway, while PEMSBAY includes analogous observations recorded by sensors in the San Francisco Bay Area. We use the same preprocessing steps of previous works to extract a graph and obtain train, validation and test data splits Wu et al. (2019). For the second experiment, we introduce two largerscale datasets derived from energy analytics data. The first dataset contains data coming from the Irish Commission for Energy Regulation Smart Metering Project (CERE; Commission for Energy Regulation 2016
), which has been previously used for benchmarking spatiotemporal imputation methods
Cini et al. (2022); however, differently from previous works, we consider the full sensor network consisting of smart meters measuring energy consumption every minutes at both residential and commercial/industrial premises. The second largescale dataset is obtained from the synthetic PVUS^{2}^{2}2https://www.nrel.gov/grid/solarpowerdata.html dataset Hummon et al. (2012), consisting of simulated energy production by 5016 PV farms scattered over the United States given historic weather data for the year , aggregated in half an hour intervals. Since the model does not have access to weather information, PV production at neighboring farms is instrumental for obtaining good predictions. Notably, CERE and PVUS datasets are at least an order of magnitude larger than the datasets typically used for benchmarking spatiotemporal time series forecasting models. Note that for both PVUS and CEREn the (weighted) adjacency is obtained by applying a thresholded Gaussian kernel to the similarity matrix obtained by considering the geographic distance among the sensors and the correntropy Liu et al. (2007) among the time series, respectively. We provide further details on the datasets in the supplemental material.Baselines
We consider the following baselines:

LSTM: a single standard gated recurrent neural network Hochreiter and Schmidhuber (1997) trained by sampling window of observations from each nodelevel time series by disregarding the spatial information;

FCLSTM: an LSTM processing input sequences as if they were a single highdimensional multivariate time series;

DCRNN: a recurrent graph network presented in Li et al. (2018) – differently from the original model we use a recurrent encoder followed by a linear readout (more details in the appendix);

Graph WaveNet: a residual network that alternates temporal and graph convolutions over the graph that is given as input and an adjacency matrix that is learned by the model Wu et al. (2019);

GatedGN: a stateoftheart timethangraph Gao and Ribeiro (2022) model introduced in Satorras et al. (2022) for which we consider two different configurations. The first one – indicated as FC – uses attention over the full node set to perform spatial propagation, while the second one – indicated as UG – constrains the attention to edges of the underlying graph.

DynGESN: the echo state network for dynamical graphs proposed in Micheli and Tortorella (2022).
For all the baselines, we use, whenever possible, the configuration found in the original papers or in their opensource implementation; in all the other cases we tune hyperparameters on the holdout validation set.
Experimental setup
For the traffic datasets, we replicate the setup used in previous works. In particular, each model is trained to predict the stepahead observations. In model, the input time series are first encoded by the spatiotemporal encoder, and then the decoder is trained by sampling minibatches along the temporal dimension, i.e., by sampling sequences of observations.
For the largescale datasets, we focus on assessing the scalability of the different architectures rather than maximizing forecasting accuracy. In particular, for both datasets, we consider the first months of data ( for months for training, month for validation, and month for testing). The models are trained to predict the next hours. We repeat the experiment in two different settings to test the scalability of the different architectures w.r.t. the number of edges. In the first setting, we extract the graph by sparsifying the graph adjacency matrix imposing a maximum of neighbors for each node, while in the second case we do not constrain the density of the adjacency matrix. Tab. 3 reports some details for the considered benchmarks.
To assess the performance in terms of scalability, we fix a maximum GPU memory budget of GB and select the batch size accordingly; if a batch size of does not fit in GB, we uniformly subsample edges of the graph to reduce the memory consumption. Differently from the other baselines, in model we first preprocess the data to obtain spatiotemporal embeddings and then train the decoder by uniformly sampling the node representations. We train each model for
hour, then restore the weights corresponding to the minimum training error and evaluate the forecasts on the test set. The choice of not running validation at each epoch was dictated by the fact that for some of the baselines running a validation epoch would take a large portion of the
hour budget.The time required to encode the datasets with model’s encoder ranges from tens of seconds to minutes on an AMD EPYC 7513 processor with parallel processes. To ensure reproducibility, the time constraint is not imposed as a hard time out; conversely, we measure the time required for the update step of each model on an NVIDIA RTX A5000 GPU and fix the maximum number of updates accordingly. For model, the time required to compute node embeddings was considered as part of the training time and the number of updates was appropriately reduced to make the comparison fair. For all the baselines, we keep the same architecture used in the traffic experiment. For model we use the same hyperparameters for the decoder, but we reduce the dimension of the embedding (the value of ) so that a preprocessed dataset can fit in a maximum of GB of storage. To account for the different temporal scales, we increase the window size for all baselines and increase the number of layers in the ESN (while keeping the final size of similar). Additional details and the exact values of the hyperparameters are provided in the supplementary material.
5.1 Results
Results for the traffic benchmarks are reported in Tab. 1; while the outcomes of the scalability experiments are shown in Tab. 2. We consider mean absolute error (MAE), mean squared error (MSE), and mean absolute percentage error
(MAPE) as evaluation metrics.
Traffic experiment
The purpose of the first experiment is to demonstrate that the proposed method achieves performance comparable to that of the state of the art. In this regard, results in Tab. 1 show that in all the considered scenarios model is always among the best performing forecasting architectures. The fullattention baseline is the strongest competitor which, however, has time and memory complexities that scale quadratically with the number of nodes. Regarding the other baselines, DCRNN underperforms compared to the other spatiotemporal GNN architectures. DynGESN, the fully randomized architecture, despite being very fast to train, obtains reasonable performance in shortrange predictions but falls short over longer forecasting horizons in the considered scenarios. In light of these results, it is worth commenting on the efficiency of model compared to the baselines. Approaches like DCRNN and Graph Wavenet, perform graph convolutions whose time and space of complexity is , being the number of edges, the number of layers ( in Graph Wavenet), and the time steps. Such complexity is completely amortized by the preprocessing step in our architecture. Similarly, GatedGN, while being architecturally much simpler, propagates spatial information by relying on the attention mechanism that is known to scale poorly with the dimensionality of the problem. The next experimental setting shines a light on these shortcomings.
The bottom of Tab. 1 reports results for the ablation of key elements of the proposed architecture: NoSpaceEnc. indicates that the embeddings are built without the spatial propagation step; FCDec. consider the case where the structure of the embedding is ignored in the readout and the sparse weight matrix in Eq. 7 is replaced by a fullyconnected one; GCDec. indicates that the spatial propagation is limited to the neighbors of order and, thus, the decoder behaves similarly to a singlelayer graph convolutional network. Results clearly show the optimality of the proposed architectural design.
Largescale experiment
Tab. 2 reports the results of the scalability experiment where we considered only the spatiotemporal GNNs trained by gradient descent. We excluded the fullattention baseline (FCGatedGN) as its complexity prevented scaling to the larger datasets; however, we considered the UG version where attention is restrained to each node’s neighborhood. There are several comments that need to be made here. First of all, batch size has a different meaning for our model and the other baselines. In our case, each sample corresponds to a single spatiotemporal (preprocessed) observation; for the other methods, a sample corresponds to a window of observations where edges of the graph are eventually subsampled if the memory constraints could not be met otherwise. In both cases, the loss is computed w.r.t. all the observations in the batch. The results clearly show that model can be trained efficiently also in resourceconstrained settings, with contained GPU memory usage. In particular, the update frequency (batch/s) is up to order of magnitude higher. Notably, resource utilization at training time remains constant (by construction) in the two considered scenarios, while almost all the baselines require edge subsampling in order to meet the resource constraints. Fig. 3 shows learning curves for the PVUS dataset, further highlighting the vastly superior efficiency, scalability, and learning stability of model. Finally, results concerning the forecasting accuracy show that performance is competitive with the state of the art in all the considered scenarios.
6 Remarks and conclusion
We proposed model, a scalable architecture for graphbased spatiotemporal time series forecasting. Our approach can compete with the state of the art in popular mediumsized benchmarks datasets, while greatly improving the scalability in large sensor networks. While sampling in model largely reduces GPU memory usage compared to the other methods, the entire processed sequence can take up a large portion of system memory, depending on the size of the reservoir. Nevertheless, the preprocessing can be distributed and the preprocessed data stored on disk. Then, each data batch can be loaded incrementally during training, as usually done on large datasets, such as those from computer vision applications. We believe that model constitutes an important stepping stone for future research on scalable spatiotemporal forecasting and has the potential of being widely adopted by practitioners in both academia and industry. Future work can explore a tighter integration of the spatial and temporal encoding components, assess performance on even larger benchmarks, and transfer across sensor networks, i.e., in an inductive learning setting.
References
 A gentle introduction to deep learning for graphs. Neural Networks 129, pp. 203–221. Cited by: §1.
 Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261. Cited by: §A.4.
 Reservoir computing approaches for representation and classification of multivariate time series. IEEE Transactions on Neural Networks and Learning Systems 32 (5), pp. 2169–2179. Cited by: §2.
 Clustergcn: an efficient algorithm for training deep and large graph convolutional networks. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 257–266. Cited by: §1, §4.
 Filling the g_ap_s: multivariate time series imputation by graph neural networks. In International Conference on Learning Representations, External Links: Link Cited by: §A.3, §5.
 Torch Spatiotemporal External Links: Link Cited by: 3rd item.
 CER Smart Metering Project  Electricity Customer Behaviour Trial, 20092010 [dataset]. Irish Social Science Data Archive. SN: 001200. External Links: Link Cited by: §A.3, §5.
 PyTorch Lightning External Links: Document, Link Cited by: 4th item.
 Fast graph representation learning with pytorch geometric. arXiv preprint arXiv:1903.02428. Cited by: 2nd item.
 SIGN: scalable inception graph neural networks. In ICML 2020 Workshop on Graph Representation Learning and Beyond, Cited by: §1, §4.
 Deep reservoir computing: a critical experimental analysis. Neurocomputing 268, pp. 87–99. Cited by: §1, §3.1.
 Design of deep echo state networks. Neural Networks 108, pp. 33–47. Cited by: §A.4, footnote 1.

Spatiotemporal multigraph networks for demand forecasting in online marketplaces.
In
Joint European Conference on Machine Learning and Knowledge Discovery in Databases
, pp. 187–203. Cited by: §4.  On the equivalence between temporal and static equivariant graph representations. In International Conference on Machine Learning, pp. 7052–7076. Cited by: §4, item 5.
 Inductive representation learning on large graphs. Advances in Neural Information Processing Systems 30. Cited by: §1, §4.
 Array programming with numpy. Nature 585 (7825), pp. 357–362. Cited by: 5th item.
 Gaussian error linear units. arXiv preprint arXiv:1606.08415. Cited by: §A.4.
 Long shortterm memory. Neural computation 9 (8), pp. 1735–1780. Cited by: item 1.
 Subhour solar data for power system modeling from static spatial variability analysis. Technical report National Renewable Energy Lab.(NREL), Golden, CO (United States). Cited by: §A.3, §5.
 Optimization and applications of echo state networks with leakyintegrator neurons. Neural Networks 20 (3), pp. 335–352. Cited by: §3.1.
 The “echo state” approach to analysing and training recurrent neural networkswith an erratum note. Bonn, Germany: German National Research Center for Information Technology GMD Technical Report 148 (34), pp. 13. Cited by: §A.4, §1, §2.
 Big data and its technical challenges. Communications of the ACM 57 (7), pp. 86–94. Cited by: §A.3.
 Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25. Cited by: §3.2.
 Modeling longand shortterm temporal patterns with deep neural networks. In The 41st international ACM SIGIR conference on research & development in information retrieval, pp. 95–104. Cited by: §A.3.
 Diffusion convolutional recurrent neural network: datadriven traffic forecasting. In International Conference on Learning Representations, Cited by: §A.5, §3.1, §4, item 3, §5.
 Correntropy: properties and applications in nongaussian signal processing. IEEE Transactions on signal processing 55 (11), pp. 5286–5298. Cited by: §A.3, §5.
 Reservoir computing approaches to recurrent neural network training. Computer Science Review 3 (3), pp. 127–149. Cited by: §2.
 A practical guide to applying echo state networks. In Neural networks: Tricks of the trade, pp. 659–686. Cited by: §A.4.
 Learning to reconstruct missing data from spatiotemporal graphs with sparse observations. arXiv preprint arXiv:2205.13479. Cited by: §4.
 Discretetime dynamic graph echo state networks. Neurocomputing 496, pp. 85–95. Cited by: §4, item 6.
 Neptune: metadata store for mlops, built for research and production teams that run a lot of experiments External Links: Link Cited by: §A.1.

FCgaga: fully connected gated graph architecture for spatiotemporal traffic forecasting.
In
Proceedings of the AAAI Conference on Artificial Intelligence
, Vol. 35, pp. 9233–9241. Cited by: §4.  Pytorch: an imperative style, highperformance deep learning library. Advances in neural information processing systems 32, pp. 8026–8037. Cited by: 1st item.
 Multivariate time series forecasting with latent graph inference. arXiv preprint arXiv:2203.03423. Cited by: §4, item 5.
 The graph neural network model. IEEE Transactions on Neural Networks and Learning Systems 20 (1), pp. 61–80. Cited by: §1.
 Structured sequence modeling with graph convolutional recurrent networks. In International Conference on Neural Information Processing, pp. 362–373. Cited by: §4.
 Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §A.4.
 Highway networks. arXiv preprint arXiv:1505.00387. Cited by: §A.4.
 Python 3 reference manual. CreateSpace, Scotts Valley, CA. External Links: ISBN 1441412697 Cited by: §A.1.
 Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §4.
 Inductive graph neural networks for spatiotemporal kriging. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 4478–4485. Cited by: §4.
 Connecting the dots: multivariate time series forecasting with graph neural networks. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, pp. 753–763. Cited by: §4, §4.
 Graph wavenet for deep spatialtemporal graph modeling. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 1907–1913. Cited by: §A.3, §4, item 4, §5.
 TraverseNet: unifying space and time in message passing for traffic forecasting. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §1, §4.
 Spatiotemporal graph convolutional networks: a deep learning framework for traffic forecasting. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pp. 3634–3640. Cited by: §4.
 GraphSAINT: graph sampling based inductive learning method. In International Conference on Learning Representations, Cited by: §1, §4.
 Gman: a graph multiattention network for traffic prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 1234–1241. Cited by: §4.
Appendix
Appendix A Detailed experimental settings
In this appendix, we provide additional details on the experimental settings for the results presented in the paper.
a.1 Software platform
The Python Van Rossum and Drake (2009) code used to run all the computational experiments will be released upon publication. We relied on the following opensource libraries:
We relied on the Neptune^{3}^{3}3https://neptune.ai/ (neptune.ai, 2021) DevOps infrastructure for the logging of the experiments. For all the baselines, we run all the experiments by relying on their opensource implementations.
a.2 Hardware platform
Experiments were run on a server equipped with two AMD EPYC 7513 processors and four NVIDIA RTX A5000. Reproducibility of the scalability experiments was ensured by taking timings for the update step of each model and setting the number of updates performed by each model accordingly (more details in Sec. A.5).
a.3 Datasets
All datasets used in our study are opensource or freely available for research purposes. The input graphs are extracted by at first computing a weighted, dense adjacency matrix from (side) spatial information, e.g., the geographic position of the sensors, or by computing a (dis)similarity metric among the time series. The adjacency is then sparsified to obtain by zeroing out values under a certain threshold and, optionally, capping the maximum number of neighbors for each node. For all datasets, the only exogenous variable we consider is the encoding of the time of the day with two sinusoidal functions.
Traffic datasets
Both METRLA and PEMSBAY are widely popular benchmarks. We use the same setup of previous works Wu et al. (2019) for all the preprocessing steps. As mentioned in Sec. 5, PEMSBAY contains months of data from traffic sensors in the San Francisco Bay Area, while METRLA contains months of analogous readings acquired from detectors in the Los Angeles County Highway (Jagadish et al., 2014). In both datasets, observations are aggregated at a minutes time scale.
CEREn
The data from the Irish Commission for Energy Regulation Smart Metering Project Commission for Energy Regulation (2016) contains measurements of the energy consumption aggregated at a minutes scale in households and small/medium enterprises. The full dataset consists of observations from smart meters measuring energy consumption every minutes. As mentioned in the paper, we use the same preprocessing of Cini et al. (2022), and, in particular, an analogous strategy to extract a graph from the correntropy Liu et al. (2007) among time series. Note that, differently from Cini et al. (2022), we consider the full sensor network. For all the spatiotemporal GNN baselines, we set the window size to steps. Access to the dataset can be obtained free of charge by following the information provided at https://www.ucd.ie/issda/data/commissionforenergyregulationcer; we provide the preprocessing scripts in the supplementary material.
PvUs
The PVUS^{4}^{4}4https://www.nrel.gov/grid/solarpowerdata.html dataset Hummon et al. (2012) instead consists in a collection of simulated energy production by 5016 PV farms for the year . In the raw datasets, samples are generated every minute, we aggregate observations at minutes intervals by taking their mean. A (small) subset of this dataset (often referred to as “Solar Energy”^{5}^{5}5https://github.com/laiguokun/multivariatetimeseriesdata) with only the PV plants in Alabama state has been used as a multivariate time series forecasting benchmark Lai et al. (2018). To obtain an adjacency matrix, we consider the virtual position of the farms in terms of geographic coordinates, and we apply a Gaussian kernel over the pairwise Haversine distances, as described at the beginning of this section. Similarly to the CEREn dataset, we set the window size of the baselines to steps. In the supplementary material, we provide the code to download and preprocess the data.
a.4 Additional details on SGP architecture
We implemented the DeepESN encoder following the design principles assessed in previous works Gallicchio et al. (2018); Lukoševičius (2012). In particular, we decrease the discount factor progressively at each layer by subtracting from its initial value. We also randomly set of the weights of the networks to to obtain a sparse reservoir. We use as nonlinear activation function. The recurrent weights are normalized so that the spectral radius of the corresponding matrix is lower than one Jaeger (2001).
For the spatial encoding, we compute the embeddings at the different spatial scales iteratively. Additionally, we also concatenate to the spatiotemporal embedding the graphwise average of the temporal embedding to act as a sort of global attribute Battaglia et al. (2018).
The MLP decoder is implemented as standard feedforward network with parametrized residual connections between layers
Srivastava et al. (2015), SiLU activation function Hendrycks and Gimpel (2016) and optional Dropout Srivastava et al. (2014) regularization.a.5 Training and evaluation procedure
Traffic
As previously mentioned, for the traffic datasets we used the same training settings of previous works. For all the baselines we kept the same parameters of previous works whenever possible. For SGP we selected the hyperparameters by performing an initial random search and then manually adjusting the hyperparameters of the reservoir and selecting the best performing configuration on the validation set. In particular, for METRLA we used a DeepESN with layers of units each, an initial decay factor of , and a spectral radius of . For PEMSBAY, instead, we used an encoder with a single layer of units, a decay rate of , and a spectral radius of . For both datasets, we set and used the bidirectional encoding scheme. In the decoder, for the first layer we used units for each group in METRLA and PEMSBAY, followed by fully connected layers of units each with a dropout rate of . The model is trained with early stopping for a maximum of epochs of batch each with the Adam optimizer and a multistep learning rate scheduler.
Largescale
In Tab. 2 of the paper, we report the time required for a single model update (in terms of batches per second) and GPU memory usage for every considered method. To ensure a fair assessment, we record the time interval between the beginning of the inference step and the end weights’ update for batches and exclude the first and last measurements (that may have overheads). We exclude from the computation the overhead introduced – for every batch – by the edge subsampling strategy adopted for the scalability of the baselines.
To measure the GPU memory required, we exploit NVIDIA System Management Interface^{6}^{6}6https://developer.nvidia.com/nvidiasystemmanagementinterface, which provides near realtime GPU usage monitoring.
All the experiments designed to measure time and memory requirements have been run on the same machine on a dedicated reserved GPU. We kept the models mostly unchanged w.r.t. the traffic experiment. However, we increased the window size to for the baselines and updated the configuration of the reservoir for SGP to account for the different time scales. In particular, we increased the number of reservoir layers to and in PVUS and to in CEREn, respectively, and reduced the number of units accordingly. The difference in the number of layers between the two datasets is motivated by the choice of keeping the size of the preprocessed sequences similar. For this reason, we also set and use the unidirectional encoding to limit the amount of required storage to a maximum GB for each dataset.
Baselines
For LSTM and FCLSTM we consider a singlelayer LSTM with units for the temporal embedding and an MLP with one hidden layer with units and dropout rate of . For DCRNN, as reported in Li et al. (2018), we set the number of units in the hidden state to and the order of the diffusion convolution to ; compared to the original mode, we use a feedforward readout instead of a recurrent one to enable scalability on the larger benchmarks. For Graph WaveNet and GatedGN we use the same hyperparameters and learning rate schedulers reported in the relative papers. We implemented all the baselines in PyTorch and PyTorch Geometric (for graphbased methods) following the opensource implementations provided by the authors. To improve memory and computation efficiency in messagepassing layers, we use sparse matrixmatrix multiplications instead of scattergather operations whenever possible. We fix the maximum number of training epochs to to allow all the models to reach convergence, and stop the training if the MAE computed on the validation set does not decrease for epochs. We evaluate the models using the weights corresponding to the minimum validation MAE.
For DynGESN
we set the hyperparameters of the reservoir to the same ones used for SGP and increase the number of units to approximately match the dimensions of the final embeddings extracted by our method. We trained the readout with Ridge regression by selecting the weight of the L2regularization term on the validation set.