I Introduction
Complex cyber-physical systems (CPSs) often consist of multiple entities that interact with each other across time. With continued digitization, various sensor technologies are deployed to record time-varying attributes of such entities, thus producing correlated time series [7, 34]. One representative example of such CPSs is the road transportation system [33, 9], where the speeds on different roads are captured by different sensors, such as loop detectors, speed cameras, and Bluetooth speedometers, as multiple speed time series [17, 10, 18].
Accurate forecasting of correlated time series has the potential to reveal the holistic system dynamics of the underlying CPSs, including predicting future behavior [7] and detecting anomalies [21, 20], which are important for enabling effective operation of the CPSs. For example, in an intelligent transportation system, analyzing speed time series enables travel time forecasting, early warning of congestion, and predicting the effect of incidents, which help drivers make routing decisions [23, 12].
Traffic on a road often has an impact on its neighboring roads; an accident on one road may cause congestion on nearby roads. Most related work captures such interactions as a graph that is precomputed based on different metrics, such as the road network or Euclidean distance between pairwise entities. Although such methods are easy to implement, they only capture static interactions between entities. In contrast, the interactions among entities are often dynamic and evolve across time. Figure 1 shows a graph of 6 vertices representing 6 entities, e.g., speed sensors. At 7 AM, a vertex may have strong connections with some of its neighboring vertices, while at 9 AM those connections may become weaker and stronger connections with other vertices may emerge instead. However, most recent studies fall short in capturing such dynamics [22, 36, 5, 34].
To enable accurate and robust correlated time series forecasting, it is essential to model such spatiotemporal correlations among multiple time series. To this end, we propose graph attention recurrent neural networks (GARNNs).
We first build a graph among the different entities by taking into account spatial proximity. In the graph, vertices represent entities, and two vertices are connected by an edge if the two corresponding entities are nearby. After building the graph, we apply multi-head attention to learn an attention matrix. For each vertex, the attention matrix indicates which of the vertex's neighboring vertices' speeds are more relevant when predicting the speed of the vertex.
Next, since recurrent neural networks (RNNs) capture temporal dependency well, we modify classic RNNs to capture spatio-temporal dependency. Specifically, we replace the weight multiplications in classic RNNs with convolutions that take into account graph topology (e.g., graph convolution or diffusion convolution [22]). However, instead of using a static adjacency matrix, we employ the attention matrix learned in the first step to obtain adaptive adjacency matrices. Here, with the learned attention weight matrices and the inputs at different timestamps, we obtain different adjacency matrices at different timestamps, which are able to capture the dynamic correlations among different time series.
The main contribution is to utilize attention to derive adaptive adjacency matrices, which are then utilized in RNNs to capture spatio-temporal correlations among time series and thus enable accurate forecasting of correlated time series. The proposed graph attention is a generic approach that can be applied to any graph-based convolution operation that utilizes adjacency matrices, such as graph convolution and diffusion convolution [22]. Experiments on a large real-world traffic time series dataset offer evidence that the proposed method is accurate and robust.
II Problem Definition
We first consider the temporal aspects by introducing multiple time series and then the spatial aspects by defining graph signals. Finally, we define correlated time series forecasting.
II-A Time Series
Consider a cyber-physical system (CPS) with N entities. The status of the i-th entity across time, e.g., from timestamps 1 to T, is represented by a time series X^i = ⟨x_1^i, x_2^i, …, x_T^i⟩, where an F-dimensional vector x_t^i ∈ R^F records F features (e.g., speed and traffic flow) of the i-th entity at timestamp t. Since we have a time series for each entity, we have in total N time series: X^1, X^2, …, X^N.
Given the historical statuses of all entities, we aim at predicting the future statuses of all entities. More specifically, from the N time series, we have historical statuses covering a window that contains P timestamps, and we aim at predicting the future statuses in a future window that contains Q timestamps. We call this problem Q-step-ahead forecasting.
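As an illustration, the Q-step-ahead setup above amounts to a simple windowing procedure over a multivariate series. The function name and the toy array below are hypothetical and only serve to make the (P historical, Q future) split concrete.

```python
import numpy as np

def make_windows(series, p, q):
    """Slice a series of shape (T, N, F) into (history, future) pairs
    for q-step-ahead forecasting: each pair consists of p historical
    graph signals and the q graph signals that immediately follow."""
    t_total = series.shape[0]
    xs, ys = [], []
    for start in range(t_total - p - q + 1):
        xs.append(series[start : start + p])          # shape (p, N, F)
        ys.append(series[start + p : start + p + q])  # shape (q, N, F)
    return np.stack(xs), np.stack(ys)

# Toy example: 20 timestamps, 3 entities, 1 feature.
data = np.arange(20 * 3 * 1, dtype=float).reshape(20, 3, 1)
hist, fut = make_windows(data, p=12, q=4)
# hist.shape == (5, 12, 3, 1), fut.shape == (5, 4, 3, 1)
```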
II-B Graph Signals
We still consider the CPS with N entities. Now, we focus on modeling the interactions among these entities.
We build a directed graph G = (V, E), where each vertex in V represents an entity in the CPS and is often associated with spatial information such as longitude and latitude. Since we have N entities in total, we have |V| = N. Edges are represented by an adjacency matrix A ∈ R^{N×N}, where A_{ij} represents an edge from the i-th to the j-th vertex.
If the entities are already embedded into a spatial network, e.g., camera sensors deployed at road intersections in a road network, we connect two entities by an edge if they are connected in the spatial network. Otherwise, we connect two entities by an edge if the distance between them is smaller than a threshold, which has been shown to be effective [22].
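A minimal sketch of this distance-thresholded construction is shown below. The Gaussian-kernel edge weights follow common practice for such graphs (e.g., in [22]); the distance values and the threshold are hypothetical.

```python
import numpy as np

def build_adjacency(dist, threshold):
    """Connect entities i and j by an edge iff their distance is below a
    threshold. Edge weights use a Gaussian kernel of the distances, with
    the std of all distances as the kernel bandwidth (an assumption).
    `dist` is an (N, N) matrix of pairwise distances, possibly
    asymmetric, e.g., road network distances."""
    sigma = dist.std()
    weights = np.exp(-np.square(dist / sigma))
    return np.where(dist < threshold, weights, 0.0)

# Hypothetical distances between 3 sensors (asymmetric, as on a road network).
d = np.array([[0.0, 1.0, 5.0],
              [1.2, 0.0, 2.0],
              [5.0, 2.0, 0.0]])
A = build_adjacency(d, threshold=3.0)
# A[0, 2] == 0 because sensors 0 and 2 are farther apart than the threshold.
```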
At each timestamp t, each entity is associated with F features (e.g., speed and traffic flow). We introduce a graph signal X_t ∈ R^{N×F} at timestamp t, as shown in Figure 1(a), which represents all features from all entities at timestamp t. Based on the concept of graph signals, the problem becomes learning a function h that takes as input P past graph signals and outputs Q future graph signals:

(X̂_{t+1}, …, X̂_{t+Q}) = h(X_{t−P+1}, …, X_t),

where each graph signal is in R^{N×F} and t is the current timestamp.
III Graph Attention RNNs
We proceed to describe the proposed graph attention recurrent neural network (GARNN) for solving the Q-step-ahead forecasting problem.
III-A GARNN Framework
The proposed GARNN consists of two parts, an attention part and an RNN part, which follow an encoder-decoder architecture as shown in Fig. 3. At each timestamp t, we first model the spatial correlations among the different entities using multi-head attention as two adjacency matrices A_t^{(out)} and A_t^{(in)}, which consider the outgoing and incoming traffic, respectively. Next, the two attention matrices are fed into an RNN unit together with the input graph signal at the timestamp, which enables the RNN unit to capture not only the temporal dependency but also the spatial correlations, making it possible to capture the spatio-temporal correlations among different time series.
III-B Spatial Modeling
To capture the spatial correlations among different entities at a specific timestamp, we employ an attention mechanism [3]. The idea is to determine how much we should consider the features of a vertex's neighboring vertices when predicting the features of that vertex, where different neighboring vertices may be weighted differently at different timestamps. Specifically, for each vertex v_i, we compute an attention score w.r.t. each vertex in N_i ∪ {v_i}, i.e., vertex v_i's neighboring vertices and itself.
We proceed to show how to compute an attention score e_{i,j}^t for vertex v_i. Recall that for any vertex v_i in the graph, its features at timestamp t are represented by an F-dimensional vector x_t^i. Vertex v_j is a vertex from N_i ∪ {v_i}. Attention score e_{i,j}^t indicates how much attention should be paid to vertex v_j's features in order to estimate vertex v_i's features at timestamp t, and it is computed based on Equation 1, where ∥ represents the concatenation operator.

e_{i,j}^t = w^T [M x_t^i ∥ M x_t^j]   (1)

First, we embed the features of each vertex with an embedding matrix M ∈ R^{d×F}, where d is the size of the embedding. Then, we concatenate the embeddings M x_t^i and M x_t^j of the two vertices' features, and the result is fed into an attention network. The attention network is constructed as a single-layer feed-forward neural network, parameterized by a weight matrix w. In addition, we apply a LeakyReLU function with a small negative input slope as the activation function (see Equation 2). Finally, we apply softmax to normalize the final output to obtain α_{i,j}^t so that the results are easier to interpret.

α_{i,j}^t = softmax_j(LeakyReLU(e_{i,j}^t)) = exp(LeakyReLU(e_{i,j}^t)) / Σ_{k ∈ N_i ∪ {v_i}} exp(LeakyReLU(e_{i,k}^t))   (2)
Motivated by [30], we observe that stacking multiple attention networks, a.k.a. attention heads, is beneficial since each attention network can specialize in capturing different interactions. Assume that we use a total of K attention heads; each head k has its own embedding matrix M^{(k)} and weight matrix w^{(k)} in the feed-forward neural network. There are multiple ways of combining the attention heads; in our experiments, we average them according to Equation 3.

α̂_{i,j}^t = (1/K) Σ_{k=1}^{K} α_{i,j}^{t,(k)}   (3)
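The per-vertex computation of Equations 1–3 can be sketched as follows. The LeakyReLU slope, all matrix sizes, and the random inputs are hypothetical placeholders, and the loop-based implementation only mirrors the per-vertex description above rather than an optimized implementation.

```python
import numpy as np

def leaky_relu(x, slope=0.2):  # slope value is an assumption
    return np.where(x > 0, x, slope * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_matrix(x, adj, Ms, ws):
    """Sketch of Eqs. 1-3: for each vertex i, attend over its neighbors
    (rows of `adj`) and itself, then average over the K heads.
    x: (N, F) graph signal at one timestamp; adj: (N, N) binary adjacency;
    Ms: K embedding matrices, each (d, F); ws: K weight vectors, each (2d,).
    Returns an (N, N) matrix whose rows sum to 1 over each neighborhood."""
    n = x.shape[0]
    out = np.zeros((n, n))
    for M, w in zip(Ms, ws):
        emb = x @ M.T                      # (N, d) embedded features
        alpha = np.zeros((n, n))
        for i in range(n):
            nbrs = [j for j in range(n) if adj[i, j] > 0 or j == i]
            # Eq. 1: score from concatenated embeddings; Eq. 2: LeakyReLU + softmax.
            scores = np.array([w @ np.concatenate([emb[i], emb[j]]) for j in nbrs])
            alpha[i, nbrs] = softmax(leaky_relu(scores))
        out += alpha
    return out / len(Ms)                   # Eq. 3: average the heads

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 2))                # 4 vertices, 2 features
adj = np.array([[0, 1, 1, 0], [1, 0, 0, 0], [1, 0, 0, 1], [0, 0, 1, 0]])
A_t = attention_matrix(x, adj, Ms=[rng.normal(size=(3, 2))] * 2,
                       ws=[rng.normal(size=6)] * 2)
# Each row of A_t sums to 1 over that vertex's neighborhood.
```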
So far, for each vertex, we have computed the attention weights to all its neighbors, which captures the influence of the outgoing traffic. It is also of interest to capture the influence of the incoming traffic. To this end, we use the transpose of the adjacency matrix to define the neighboring vertices and apply the same attention mechanism to obtain another attention matrix that represents the influence of incoming traffic. Finally, we obtain two attention matrices A_t^{(out)} and A_t^{(in)}, which capture the influence of outgoing traffic and incoming traffic, respectively.
III-C Spatio-Temporal Modeling
We integrate the learned attention matrices into classical recurrent neural networks to capture spatio-temporal correlations. We follow an encoder-decoder architecture, as shown in Figure 3, which captures the temporal dependencies across time well. Next, we replace the matrix multiplications in the RNN units by convolutions that take into account graph topology, such as diffusion convolution or graph convolution. Here, instead of using a static adjacency matrix that only captures connectivity, the convolution employs the learned adjacency matrices A_t^{(out)} and A_t^{(in)} at timestamp t. Note that although the attention embedding and weight matrices are static across time, the input graph signals change across time. Thus, we are able to derive different adjacency matrices A_t^{(out)} and A_t^{(in)} at different timestamps.
More specifically, we define a diffusion convolution operator on a graph signal X_t using the learned attention matrices A_t^{(out)} and A_t^{(in)} at timestamp t in Equation 4.

X_t[:, f] ⋆ θ = σ( Σ_{k=0}^{K'−1} ( θ_{k,1} (A_t^{(out)})^k + θ_{k,2} (A_t^{(in)})^k ) X_t[:, f] )   (4)

Here, σ is an activation function, matrix θ ∈ R^{K'×2} is a filter to be learned, and X_t[:, f] represents the f-th column of graph signal X_t, i.e., the f-th feature of all entities. We often apply multiple different filters to perform multiple diffusion convolutions and then concatenate the results into a matrix H_t. Finally, graph signal X_t is convoluted into matrix H_t.
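A sketch of the diffusion convolution in Equation 4 with a single filter is given below. The choice of tanh as the activation, the array sizes, and the randomly generated attention matrices are all hypothetical.

```python
import numpy as np

def diffusion_conv(x, a_out, a_in, theta):
    """Sketch of Equation 4 for one filter: diffuse each feature column
    of the graph signal over powers of the two attention matrices.
    x: (N, F) graph signal; a_out, a_in: (N, N) attention matrices;
    theta: (K, 2) filter, where theta[k, 0] weighs (a_out)^k and
    theta[k, 1] weighs (a_in)^k. Returns an (N, F) matrix."""
    out = np.zeros_like(x)
    p_out = np.eye(x.shape[0])   # (a_out)^0
    p_in = np.eye(x.shape[0])    # (a_in)^0
    for k in range(theta.shape[0]):
        out += theta[k, 0] * (p_out @ x) + theta[k, 1] * (p_in @ x)
        p_out = p_out @ a_out    # next power of a_out
        p_in = p_in @ a_in       # next power of a_in
    return np.tanh(out)          # tanh as an example activation

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 2))
a = rng.random(size=(4, 4)); a /= a.sum(axis=1, keepdims=True)
h = diffusion_conv(x, a_out=a, a_in=a.T, theta=rng.normal(size=(3, 2)))
# h has shape (4, 2): one diffused value per vertex and feature.
```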
Next, we integrate the proposed graph attention based convolution into an RNN. Here, we use a Gated Recurrent Unit (GRU) [6] as the RNN unit to illustrate the integration, where the matrix multiplications in the classic GRU are replaced by the graph attention based diffusion convolution defined in Equation 4:

r_t = sigmoid( Θ_r ⋆ [X_t, H_{t−1}] )
u_t = sigmoid( Θ_u ⋆ [X_t, H_{t−1}] )
C_t = tanh( Θ_C ⋆ [X_t, (r_t ⊙ H_{t−1})] )
H_t = u_t ⊙ H_{t−1} + (1 − u_t) ⊙ C_t

Here, X_t is the graph signal at timestamp t, and H_t is the output at timestamp t. r_t, u_t, and C_t represent the reset gate, update gate, and candidate context at timestamp t. ⋆ indicates the proposed graph attention based diffusion convolution, and Θ_r, Θ_u, and Θ_C are the filters used in the three convolutions. ⊙ is the Hadamard product.
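One step of the modified GRU can be sketched as follows. To keep the example short, the graph attention based diffusion convolution is approximated by a single diffusion step over the two attention matrices; this simplification, the weight shapes, and the random inputs are assumptions rather than the exact implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def garnn_gru_step(x_t, h_prev, a_out, a_in, thetas):
    """Sketch of one GRU step where every matrix multiplication is
    replaced by a graph convolution. `conv` stands in for the diffusion
    convolution of Equation 4, approximated by one diffusion step.
    x_t, h_prev: (N, F); a_out, a_in: (N, N); thetas = (W_r, W_u, W_c),
    each (2F, F) because gates see the concatenation [X_t, H_{t-1}]."""
    def conv(inp, w):
        diffused = (a_out @ inp + a_in @ inp) / 2.0   # one-step diffusion
        return diffused @ w
    W_r, W_u, W_c = thetas
    xh = np.concatenate([x_t, h_prev], axis=1)        # [X_t, H_{t-1}]
    r = sigmoid(conv(xh, W_r))                        # reset gate
    u = sigmoid(conv(xh, W_u))                        # update gate
    xrh = np.concatenate([x_t, r * h_prev], axis=1)   # [X_t, r ⊙ H_{t-1}]
    c = np.tanh(conv(xrh, W_c))                       # candidate context
    return u * h_prev + (1.0 - u) * c                 # new hidden state

rng = np.random.default_rng(2)
n, f = 4, 2
a = rng.random(size=(n, n)); a /= a.sum(axis=1, keepdims=True)
thetas = tuple(rng.normal(scale=0.1, size=(2 * f, f)) for _ in range(3))
h = garnn_gru_step(rng.normal(size=(n, f)), np.zeros((n, f)), a, a.T, thetas)
# h has shape (4, 2): the hidden state for each vertex.
```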
The proposed graph attention is generic in the sense that it provides a data-driven manner of producing dynamic adjacency matrices and thus can be integrated with different kinds of convolutions that utilize graph adjacency matrices. Such convolutions often use a static adjacency matrix, while the proposed graph attention allows us to employ dynamic adjacency matrices at different timestamps, which are expected to better capture the spatio-temporal correlations among different time series.
III-D Loss Function
The loss function measures the discrepancy between the estimated travel speed and the ground truth at each timestamp for each entity. Suppose that we have S training instances in total; we use Equation 5 to measure the discrepancy.

L = (1 / (S · Q · N)) Σ_{s=1}^{S} Σ_{t=1}^{Q} Σ_{i=1}^{N} | y_t^{i,s} − ŷ_t^{i,s} |   (5)

where y_t^{i,s} and ŷ_t^{i,s} are the ground truth and the estimated speed for the i-th entity at timestamp t in the s-th training instance.
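Assuming the discrepancy is the mean absolute error over all training instances, timestamps, and entities (the exact form is reconstructed from the surrounding description), the loss can be sketched as:

```python
import numpy as np

def mae_loss(y_true, y_pred):
    """Mean absolute discrepancy between estimated speeds and the ground
    truth, averaged over all instances, timestamps, and entities.
    y_true, y_pred: arrays of shape (S, Q, N)."""
    return np.abs(y_true - y_pred).mean()

y = np.array([[[60.0, 55.0], [58.0, 54.0]]])       # S=1, Q=2, N=2
y_hat = np.array([[[62.0, 54.0], [57.0, 52.0]]])
mae_loss(y, y_hat)  # → 1.5
```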
IV Empirical Study
Table I. Forecasting accuracy on METR-LA.

Models    |      15 min        |      30 min        |      60 min
          | MAE   RMSE   MAPE  | MAE   RMSE   MAPE  | MAE   RMSE   MAPE
HA        | 4.16  7.80   13.0% | 4.16  7.80   13.0% | 4.16  7.80   13.0%
ARIMA     | 3.99  8.21    9.6% | 5.15  10.45  12.7% | 6.90  13.23  17.4%
VAR       | 4.42  7.89   10.2% | 5.41  9.13   12.7% | 6.52  10.11  15.8%
SVR       | 3.99  8.45    9.3% | 5.05  10.87  12.1% | 6.72  13.76  16.7%
FC-LSTM   | 3.44  6.30    9.6% | 3.77  7.23   10.9% | 4.37  8.69   13.2%
WaveNet   | 2.99  5.89    8.0% | 3.59  7.28   10.2% | 4.45  8.93   13.6%
STGCN     | 2.88  5.74    7.6% | 3.47  7.24    9.5% | 4.59  9.40   12.7%
GCRNN     | 2.80  5.51    7.5% | 3.24  6.74    9.0% | 3.70  8.16   10.9%
GA-GCRNN  | 2.76  5.33    7.1% | 3.21  6.45    8.8% | 3.70  7.68   10.9%
DCRNN     | 2.77  5.38    7.3% | 3.15  6.45    8.8% | 3.60  7.60   10.5%
GA-DCRNN  | 2.75  5.28    7.0% | 3.14  6.39    8.0% | 3.72  7.64   11.0%
IV-A Experimental Setup
We conducted experiments on METR-LA, a large real-world traffic dataset from [22]. The dataset consists of speed measurements from 207 loop detectors spread across Los Angeles highways. The data was collected between March 1st, 2012 and June 30th, 2012 at a frequency of every 5 minutes.
We follow the same experimental setup as [22]. We build a graph by adding an edge from one sensor to another if the road network distance between them is small [22]. Since road network distance is used, the distance from the first sensor to the second may differ from the distance in the opposite direction, making the adjacency matrix asymmetric. We use 70% of the data for training, 10% for validation, and 20% for testing. We consider three metrics to evaluate the prediction accuracy: Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE). We consider the same forecasting setting as [22], where we use past observations to predict the next 12 steps ahead and report the errors at three different horizons (15, 30, and 60 min) as well as the average errors over all 12 steps.
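For reference, the three metrics can be computed as in the following sketch; the input values are hypothetical, and the MAPE variant shown assumes no zero ground-truth speeds.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """MAE, RMSE, and MAPE on flattened prediction/ground-truth arrays."""
    err = y_pred - y_true
    mae = np.abs(err).mean()
    rmse = np.sqrt(np.square(err).mean())
    mape = np.abs(err / y_true).mean() * 100.0   # assumes y_true has no zeros
    return mae, rmse, mape

y = np.array([50.0, 60.0, 40.0])
y_hat = np.array([48.0, 63.0, 40.0])
mae, rmse, mape = evaluate(y, y_hat)
# mae ≈ 1.67, rmse ≈ 2.08, mape ≈ 3.0%
```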
IV-B Implementation Details
The method is implemented in Python 3.6 using TensorFlow 1.7. A server with an Intel Xeon Platinum 8168 CPU and 2 Tesla V100 GPUs is used to conduct all experiments. We trained the model using the Adam optimizer with a learning rate of 0.01, which decreases every 10 epochs after the 40th epoch. We used a total of K attention heads with an embedding size of d. In addition, we used a 2-layer GRU with 64 units, a batch size of 64, and the scheduled sampling technique described in [22].

IV-C Experimental Results
Baselines: We compared our proposed model with the following methods:

- HA: historical average [22]. HA models the data as a seasonal process and computes the predictions as a weighted average of the previous seasons.
- ARIMA: auto-regressive integrated moving average model with a Kalman filter, which is widely used for time series forecasting.
- VAR [13]: vector auto-regression.
- SVR: support vector regression.
- FC-LSTM [29]: recurrent neural network with fully connected LSTM units.
- WaveNet [25]: a dilated causal convolution network.
- STGCN [36]: spatio-temporal GCN that combines 1D convolution with graph convolution.
- GCRNN [22]: graph convolutional recurrent neural network.
- DCRNN [22]: diffusion convolutional recurrent neural network.

A more in-depth description of the baselines, along with their hyperparameters, can be found in [22]. Since GCRNN and DCRNN employ graph-based convolutions, where GCRNN uses graph convolution and DCRNN uses diffusion convolution [22], we incorporate the proposed graph attention based adaptive adjacency matrices into the two methods to obtain GA-GCRNN and GA-DCRNN.
Accuracy: We compare the proposed GA-GCRNN and GA-DCRNN with the baselines in Table I. First, we observe that non-deep-learning methods such as ARIMA, VAR, and SVR perform poorly compared to the deep learning methods, especially when the forecasting horizon is large. This is because the temporal dependencies become increasingly nonlinear as the horizon increases, and these baselines are unable to capture such dynamics.
We also observe that considering the underlying correlations among entities using a graph is very effective in improving the accuracy. Here, FC-LSTM and WaveNet do not consider such correlations and are outperformed by the other deep learning methods that do.
Next, we evaluate the proposed graph attention method for generating the dynamic adjacency matrices. We first compare GA-GCRNN with GCRNN, where GA-GCRNN outperforms GCRNN in all settings (see the underlined values in Table I). We then consider GA-DCRNN, which outperforms DCRNN in most settings (see the bold values in Table I), especially in short-term predictions at 15 and 30 minutes.
We also consider the average accuracy over all 12 steps of the prediction, as shown in Table II. On average, GA-DCRNN is better in terms of RMSE and MAPE. This suggests that GA-DCRNN avoids large prediction errors but may make more small prediction errors than DCRNN does, indicating that GA-DCRNN better captures the general trend of the underlying traffic data.
Table II. Average accuracy over all 12 steps.

Models     | MAE   RMSE  MAPE
GCRNN      | 3.28  6.80  9.13%
GA-GCRNN   | 3.22  6.48  8.93%
DCRNN      | 3.17  6.47  8.86%
GA-DCRNN   | 3.22  6.43  8.66%
The proposed graph attention based method for generating dynamic adjacency matrices is thus effective and helps improve prediction accuracy, especially for short-term prediction. In addition, the proposed method is generic, as it can be used to boost the accuracy of both DCRNN and GCRNN.
Efficiency: While DCRNN takes on average 271 seconds for one epoch, GA-DCRNN requires 401 seconds. The discrepancy between the two models is due to the attention mechanism, which requires additional computation. Note that when implementing the attention, we followed [31]; a more efficient approach is available [24]. For the same reason, GA-GCRNN also takes longer than GCRNN does.
Summary: The results suggest that the proposed graph attention based adaptive adjacency matrices can be easily integrated with convolutions that consider graph topology and have great potential to enable more accurate predictions, especially relatively short-term predictions. It is of interest to further investigate how to improve long-term predictions.
V Related Work
For time series forecasting, autoregressive models, e.g., ARIMA, are widely used as baseline methods. Hidden Markov models (HMMs) are also often used for time series forecasting; a so-called spatio-temporal HMM (STHMM) is able to consider spatio-temporal correlations among traffic time series from adjacent edges [34].

Neural networks are able to capture nonlinear dependencies within the data, which enables nonlinear forecasting models, and this category of models has received increasing attention in recent years. In particular, Recurrent Neural Networks (RNNs) have been used with success in multiple domains, such as traffic time series prediction [22] and wind forecasting [32], due to their recurrent nature, which is able to capture temporal relationships within the data. However, they suffer from the well-known vanishing gradient problem [14]. Models such as Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU) extend traditional RNNs. Their core idea is to integrate different gates that control what the network learns and forgets, leading to better results and an elegant solution to the vanishing gradient problem.
Another well-known type of deep learning model is the Convolutional Neural Network (CNN). CNNs are considered state of the art for speech and image recognition. CNN-based models are well known for their ability to capture spatial relationships: nearby elements, e.g., pixels in images, share local features, which are captured by traversing the input multiple times and applying different filters. CNNs often work on grid-based inputs, limiting their applicability to graph-based inputs such as road networks. Recently, Bruna et al. [4] introduced Graph Convolutional Networks (GCNs), which combine spectral graph theory with CNNs. Two recent studies employ GCNs to fill in missing values for uncertain traffic time series [16, 15]. [2] proposes Diffusion Convolutional Neural Networks (DCNNs), which fall under the non-spectral methods. DCNNs work on the assumption that the further apart two nodes are in terms of graph topology, the lower impact they should have on each other. DCRNN [22] extends DCNN with an RNN so that it also captures temporal dependencies within the data. Our work is closely related to and builds on top of DCRNN. The main difference is that DCRNN assumes that the adjacency matrix used in the random walks is static. However, this assumption might not always hold, and we propose to learn an adaptive adjacency matrix at each timestamp using an attention mechanism that considers graph topology. Multi-task learning [19] has also been applied to correlated time series forecasting, where a CNN and an RNN are combined to both forecast future values and reconstruct historical values [7]. Our work resembles [31] and [38]. The main difference is that our model learns a dynamic adjacency matrix, which can afterwards be used with any type of graph-RNN-like structure to update each node embedding by taking into account its neighbors, offering more flexibility.
VI Conclusion and Outlook
We propose a generic method to obtain dynamic adjacency matrices using graph attention, which can be integrated seamlessly with existing graph-based convolutions such as graph convolution and diffusion convolution. More specifically, we show how the integration of adaptive adjacency matrices and recurrent neural networks improves predictions for correlated time series whose relationships can be captured as a graph. Experimental results show great potential for adaptive adjacency matrices, especially for short-term predictions.
As future work, it is of interest to explore how to speed up the attention computation using, e.g., parallel computation [37]. It is also of interest to explore how to incorporate additional knowledge about the entities, e.g., points of interest around the speed sensors.
References
[1] (2016) Finding non-dominated paths in uncertain road networks. In SIGSPATIAL, pp. 15:1–15:10.
[2] (2015) Search-convolutional neural networks. CoRR abs/1511.02136.
[3] (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
[4] (2013) Spectral networks and locally connected networks on graphs. CoRR abs/1312.6203.
[5] (2019) Gated residual recurrent graph neural networks for traffic prediction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 485–492.
[6] (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR abs/1412.3555.
[7] (2018) Correlated time series forecasting using multi-task deep neural networks. In CIKM, pp. 1527–1530.
[8] (2019) Graph attention recurrent neural networks for correlated time series forecasting. In MileTS19@KDD.
[9] (2016) Enabling smart transportation systems: a parallel spatio-temporal database approach. IEEE Trans. Computers 65 (5), pp. 1377–1391.
[10] (2014) Towards total traffic awareness. SIGMOD Record 43 (3), pp. 18–23.
[11] (online first, 2020) Context-aware, preference-based vehicle routing. VLDB Journal.
[12] (2018) Learning to route with sparse trajectory sets. In ICDE, pp. 1073–1084.
[13] (1994) Time series analysis. Vol. 2, Princeton New Jersey.
[14] (2001) Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.
[15] (2020) Stochastic origin-destination matrix forecasting using dual-stage graph convolutional, recurrent neural networks. In ICDE, pp. 1417–1428.
[16] (2019) Stochastic weight completion for road networks using graph convolutional networks. In ICDE, pp. 1274–1285.
[17] (2018) Risk-aware path selection with time-varying, uncertain travel costs: a time series approach. VLDB J. 27 (2), pp. 179–200.
[18] (2017) Enabling time-dependent uncertain eco-weights for road networks. GeoInformatica 21 (1), pp. 57–88.
[19] (2018) Distinguishing trajectories from different drivers using incompletely labeled trajectories. In CIKM, pp. 863–872.
[20] (2019) Outlier detection for time series with recurrent autoencoder ensembles. In IJCAI.
[21] (2018) Outlier detection for multidimensional time series using deep neural networks. In MDM, pp. 125–134.
[22] (2018) Diffusion convolutional recurrent neural network: data-driven traffic forecasting. In ICLR.
[23] (2018) Finding top-k shortest paths with diversity. IEEE Trans. Knowl. Data Eng. 30 (3), pp. 488–502.
[24] (2018) GeniePath: graph neural networks with adaptive receptive paths. arXiv preprint arXiv:1802.00910.
[25] (2016) WaveNet: a generative model for raw audio. arXiv preprint arXiv:1609.03499.
[26] (2020) A hybrid learning approach to stochastic routing. In ICDE, pp. 2010–2013.
[27] (2020) Anytime stochastic routing with hybrid learning. Proc. VLDB Endow. 13 (9), pp. 1555–1567.
[28] (2020) Fast stochastic routing under time-varying uncertainty. VLDB J. 29 (4), pp. 819–839.
[29] (2014) Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112.
[30] (2017) Attention is all you need. CoRR abs/1706.03762.
[31] (2017) Graph attention networks. arXiv preprint arXiv:1710.10903.
[32] (2017) Wind forecasting using artificial neural networks: a survey and taxonomy. International Journal of Research In Science & Engineering 3.
[33] (2018) PACE: a path-centric paradigm for stochastic path finding. VLDB J. 27 (2), pp. 153–178.
[34] (2013) Travel cost inference from sparse, spatio-temporally correlated time series using Markov models. PVLDB 6 (9), pp. 769–780.
[35] (2020) Learning to rank paths in spatial networks. In ICDE, pp. 2006–2009.
[36] (2017) Spatio-temporal graph convolutional networks: a deep learning framework for traffic forecasting. arXiv preprint arXiv:1709.04875.
[37] (2010) XML structural similarity search using MapReduce. In WAIM, pp. 169–181.
[38] (2018) GaAN: gated attention networks for learning on large and spatiotemporal graphs. arXiv preprint arXiv:1803.07294.