Introduction
Spatial-temporal graph modeling has received increasing attention with the advance of graph neural networks. It aims to model dynamic node-level inputs by assuming interdependency between connected nodes, as demonstrated by Figure 1. Spatial-temporal graph modeling has wide applications in solving complex system problems such as traffic speed forecasting [li2018diffusion], taxi demand prediction [yao2018deep], human action recognition [yan2018spatial], and driver maneuver anticipation [jain2016structural]. For a concrete example, in traffic speed forecasting, the speed sensors on the roads of a city form a graph whose edge weights are determined by the Euclidean distance between pairs of nodes. As traffic congestion on one road can lower the traffic speed on its incoming roads, it is natural to take the underlying graph structure of the traffic system as prior knowledge of the interdependency relationships among nodes when modeling the time series of traffic speed on each road.
A basic assumption behind spatial-temporal graph modeling is that a node's future information is conditioned on its own historical information as well as its neighbors' historical information. How to capture spatial and temporal dependencies simultaneously therefore becomes a primary challenge. Recent studies on spatial-temporal graph modeling mainly follow two directions: they integrate graph convolution networks (GCN) either into recurrent neural networks (RNN) [seo2018structured, li2018diffusion] or into convolution neural networks (CNN) [yu2018spatio, yan2018spatial]. While these approaches have shown the effectiveness of introducing the graph structure of data into a model, they face two major shortcomings. First, they assume that the graph structure of the data reflects the genuine dependency relationships among nodes. However, there are circumstances in which a connection does not entail an interdependency relationship between two nodes, and in which an interdependency relationship exists between two nodes but the connection is missing. To give an example of each circumstance, consider a recommendation system. In the first case, two users are connected, but they may have distinct preferences over products. In the second case, two users may share similar preferences, but they are not linked together. Zhang et al. [zhang2018gaan] used attention mechanisms to address the first circumstance by adjusting the dependency weight between two connected nodes, but they failed to consider the second circumstance. Second, current studies of spatial-temporal graph modeling are ineffective at learning temporal dependencies. RNN-based approaches suffer from time-consuming iterative propagation and gradient explosion/vanishing when capturing long-range sequences [seo2018structured, li2018diffusion, zhang2018gaan]. In contrast, CNN-based approaches enjoy the advantages of parallel computing, stable gradients, and low memory requirements [yu2018spatio, yan2018spatial]. However, these works need to stack many layers in order to capture very long sequences, because they adopt standard 1D convolution whose receptive field size grows only linearly with the number of hidden layers. In this work, we present a CNN-based method named Graph WaveNet, which addresses the two shortcomings mentioned above. We propose a graph convolution layer in which a self-adaptive adjacency matrix can be learned from the data through end-to-end supervised training.
In this way, the self-adaptive adjacency matrix preserves hidden spatial dependencies. Motivated by WaveNet [oord2016wavenet], we adopt stacked dilated causal convolutions to capture temporal dependencies. The receptive field size of a stacked dilated causal convolution network grows exponentially with the number of hidden layers. With the support of stacked dilated causal convolutions, Graph WaveNet is able to handle spatial-temporal graph data with long-range temporal sequences efficiently and effectively. The main contributions of this work are as follows:
We construct a self-adaptive adjacency matrix which preserves hidden spatial dependencies. Our proposed self-adaptive adjacency matrix is able to uncover unseen graph structures automatically from the data without the guidance of any prior knowledge. Experiments validate that our method improves results when spatial dependencies are known to exist but are not provided.

We present an effective and efficient framework to capture spatial-temporal dependencies simultaneously. The core idea is to assemble our proposed graph convolution with dilated causal convolution in such a way that each graph convolution layer tackles the spatial dependencies of node information extracted by dilated causal convolution layers at different granular levels.

We evaluate our proposed model on traffic datasets and achieve state-of-the-art results with low computation costs. The source code of Graph WaveNet is publicly available at https://github.com/nnzhan/GraphWaveNet.
Related Work
Graph Convolution Networks
Graph convolution networks are building blocks for learning graph-structured data [wu2019comprehensive]. They are widely applied in domains such as node embedding [pan2019learning], node classification [kipf2016semi], graph classification [ying2018hierarchical], link prediction [zhang2018link], and node clustering [wang2017mgae]. There are two mainstreams of graph convolution networks, spectral-based approaches and spatial-based approaches. Spectral-based approaches smooth a node's input signals using graph spectral filters [bruna2013spectral, defferrard2016convolutional, kipf2016semi]. Spatial-based approaches extract a node's high-level representation by aggregating feature information from its neighborhood [atwood2016diffusion, gilmer2017neural, hamilton2017inductive]. In these approaches, the adjacency matrix is considered prior knowledge and is fixed throughout training. Monti et al. [monti2017geometric] learned the weight of a node's neighbor through Gaussian kernels. Velickovic et al. [velickovic2017graph] updated the weight of a node's neighbor via attention mechanisms. Liu et al. [liu2018geniepath] proposed an adaptive path layer to explore the breadth and depth of a node's neighborhood. Although these methods assume that the contribution of each neighbor to the central node is different and needs to be learned, they still rely on a predefined graph structure. Li et al. [li2018adaptive] adopted distance metrics to adaptively learn a graph's adjacency matrix for graph classification problems. The generated adjacency matrix is conditioned on node inputs. As the inputs of a spatial-temporal graph are dynamic, their method is unstable for spatial-temporal graph modeling.
Spatial-temporal Graph Networks
The majority of spatial-temporal graph networks follow two directions, namely RNN-based and CNN-based approaches. One of the early RNN-based methods captured spatial-temporal dependencies by filtering the inputs and hidden states passed to a recurrent unit using graph convolution [seo2018structured]. Later works adopted different strategies such as diffusion convolution [li2018diffusion] and attention mechanisms [zhang2018gaan] to improve model performance. Another parallel work used node-level RNNs and edge-level RNNs to handle different aspects of temporal information [jain2016structural]. The main drawbacks of RNN-based approaches are that they become inefficient for long sequences and that their gradients are more likely to explode when combined with graph convolution networks. CNN-based approaches combine graph convolution with standard 1D convolution [yu2018spatio, yan2018spatial]. While computationally efficient, these approaches have to stack many layers or use global pooling to expand the receptive field of the model.
Methodology
In this section, we first give the mathematical definition of the problem we address in this paper. Next, we describe the two building blocks of our framework, the graph convolution layer (GCN) and the temporal convolution layer (TCN); they work together to capture spatial-temporal dependencies. Finally, we outline the architecture of our framework.
Problem Definition
A graph is represented by $G = (V, E)$ where $V$ is the set of nodes and $E$ is the set of edges. The adjacency matrix derived from a graph is denoted by $\mathbf{A} \in \mathbb{R}^{N \times N}$. If $v_i, v_j \in V$ and $(v_i, v_j) \in E$, then $\mathbf{A}_{ij}$ is one; otherwise it is zero. At each time step $t$, the graph $G$ has a dynamic feature matrix $\mathbf{X}^{(t)} \in \mathbb{R}^{N \times D}$. In this paper, the feature matrix is used interchangeably with graph signals. Given a graph $G$ and its $S$ historical step graph signals, our problem is to learn a function $f$ which is able to forecast its next $T$ step graph signals. The mapping relation is represented as follows:
$[\mathbf{X}^{(t-S+1):t}, G] \xrightarrow{f} \mathbf{X}^{(t+1):(t+T)}$,  (1)
where $\mathbf{X}^{(t-S+1):t} \in \mathbb{R}^{N \times D \times S}$ and $\mathbf{X}^{(t+1):(t+T)} \in \mathbb{R}^{N \times D \times T}$.
Graph Convolution Layer
Graph convolution is an essential operation to extract a node's features given its structural information. Kipf et al. [kipf2016semi] proposed a first-order approximation of Chebyshev spectral filters [defferrard2016convolutional]. From a spatial-based perspective, it smooths a node's signal by aggregating and transforming its neighborhood information. The advantages of their method are that it is a compositional layer, its filter is localized in space, and it supports multi-dimensional inputs. Let $\tilde{\mathbf{A}} \in \mathbb{R}^{N \times N}$ denote the normalized adjacency matrix with self-loops, $\mathbf{X} \in \mathbb{R}^{N \times D}$ denote the input signals, $\mathbf{Z} \in \mathbb{R}^{N \times M}$ denote the output, and $\mathbf{W} \in \mathbb{R}^{D \times M}$ denote the model parameter matrix; in [kipf2016semi] the graph convolution layer is defined as
$\mathbf{Z} = \tilde{\mathbf{A}} \mathbf{X} \mathbf{W}$.  (2)
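For concreteness, the first-order graph convolution above with the usual symmetric normalization of [kipf2016semi] can be sketched in NumPy; the function and variable names are ours, for illustration only.

```python
import numpy as np

def gcn_layer(A, X, W):
    """First-order graph convolution: Z = A_tilde X W, where
    A_tilde = D^{-1/2} (A + I) D^{-1/2} is the adjacency matrix with
    self-loops, symmetrically normalized as in Kipf & Welling."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))  # D^{-1/2} diagonal
    A_tilde = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return A_tilde @ X @ W
```

The layer is compositional: stacking it simply alternates neighborhood smoothing with a learned linear transform.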
Li et al. [li2018diffusion] proposed a diffusion convolution layer which has proven effective in spatial-temporal modeling. They modeled the diffusion process of graph signals with $K$ finite steps. We generalize their diffusion convolution layer into the form of Equation 2, which results in
$\mathbf{Z} = \sum_{k=0}^{K} \mathbf{P}^{k} \mathbf{X} \mathbf{W}_{k}$,  (3)
where $\mathbf{P}^{k}$ represents the power series of the transition matrix. In the case of an undirected graph, $\mathbf{P} = \mathbf{A} / \mathrm{rowsum}(\mathbf{A})$. In the case of a directed graph, the diffusion process has two directions, the forward and backward directions, with forward transition matrix $\mathbf{P}_f = \mathbf{A} / \mathrm{rowsum}(\mathbf{A})$ and backward transition matrix $\mathbf{P}_b = \mathbf{A}^{T} / \mathrm{rowsum}(\mathbf{A}^{T})$. With the forward and backward transition matrices, the diffusion graph convolution layer is written as
$\mathbf{Z} = \sum_{k=0}^{K} \mathbf{P}_f^{k} \mathbf{X} \mathbf{W}_{k1} + \mathbf{P}_b^{k} \mathbf{X} \mathbf{W}_{k2}$.  (4)
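The bidirectional diffusion convolution above can be sketched in a few lines of NumPy; the names and shapes here are illustrative and not taken from the authors' released code.

```python
import numpy as np

def diffusion_gcn(A, X, Wf, Wb, K=2):
    """Diffusion graph convolution sketch: sums the powers of the forward
    and backward transition matrices applied to the node signals X.
    A: (N, N) weighted adjacency; X: (N, D) node signals;
    Wf, Wb: lists of K+1 weight matrices of shape (D, M) per direction."""
    Pf = A / A.sum(axis=1, keepdims=True)      # forward transition matrix
    Pb = A.T / A.T.sum(axis=1, keepdims=True)  # backward transition matrix
    Z = np.zeros((X.shape[0], Wf[0].shape[1]))
    Hf, Hb = X.copy(), X.copy()
    for k in range(K + 1):
        Z += Hf @ Wf[k] + Hb @ Wb[k]           # P^k X W_k for both directions
        Hf, Hb = Pf @ Hf, Pb @ Hb              # advance to the next power
    return Z
```

Note that the $k=0$ term reduces to a plain linear transform of $\mathbf{X}$, so the layer always retains each node's own signal.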
Self-adaptive Adjacency Matrix: In our work, we propose a self-adaptive adjacency matrix $\tilde{\mathbf{A}}_{adp}$. This self-adaptive adjacency matrix does not require any prior knowledge and is learned end-to-end through stochastic gradient descent. In doing so, we let the model discover hidden spatial dependencies by itself. We achieve this by randomly initializing two node embedding dictionaries with learnable parameters $\mathbf{E}_1, \mathbf{E}_2 \in \mathbb{R}^{N \times c}$. We propose the self-adaptive adjacency matrix as
$\tilde{\mathbf{A}}_{adp} = \mathrm{SoftMax}(\mathrm{ReLU}(\mathbf{E}_1 \mathbf{E}_2^{T}))$.  (5)
We name $\mathbf{E}_1$ the source node embedding and $\mathbf{E}_2$ the target node embedding. By multiplying $\mathbf{E}_1$ and $\mathbf{E}_2$, we derive the spatial dependency weights between the source nodes and the target nodes. We use the ReLU activation function to eliminate weak connections. The SoftMax function is applied to normalize the self-adaptive adjacency matrix. The normalized self-adaptive adjacency matrix can therefore be considered the transition matrix of a hidden diffusion process. By combining predefined spatial dependencies and self-learned hidden graph dependencies, we propose the following graph convolution layer
$\mathbf{Z} = \sum_{k=0}^{K} \mathbf{P}_f^{k} \mathbf{X} \mathbf{W}_{k1} + \mathbf{P}_b^{k} \mathbf{X} \mathbf{W}_{k2} + \tilde{\mathbf{A}}_{adp}^{k} \mathbf{X} \mathbf{W}_{k3}$.  (6)
When the graph structure is unavailable, we propose to use the self-adaptive adjacency matrix alone to capture hidden spatial dependencies, i.e.,
$\mathbf{Z} = \sum_{k=0}^{K} \tilde{\mathbf{A}}_{adp}^{k} \mathbf{X} \mathbf{W}_{k}$.  (7)
It is worth noting that our graph convolution falls into the class of spatial-based approaches. Although we use graph signals interchangeably with the node feature matrix for consistency, our graph convolution in Equation 6 is indeed interpreted as aggregating transformed feature information from different orders of neighborhoods.
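The self-adaptive adjacency construction above is simply a row-wise softmax over the rectified product of the two embedding tables. A minimal NumPy sketch, with names of our own choosing:

```python
import numpy as np

def self_adaptive_adjacency(E1, E2):
    """Self-adaptive adjacency sketch: SoftMax(ReLU(E1 @ E2.T)),
    row-normalized so the result can act as the transition matrix of a
    hidden diffusion process. E1, E2: (N, c) source/target embeddings."""
    S = np.maximum(E1 @ E2.T, 0.0)           # ReLU prunes weak/negative links
    S = S - S.max(axis=1, keepdims=True)     # numerically stable softmax
    expS = np.exp(S)
    return expS / expS.sum(axis=1, keepdims=True)
```

Because every row sums to one and all entries are non-negative, the matrix can be plugged into the diffusion convolution in place of a predefined transition matrix.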
Temporal Convolution Layer
We adopt dilated causal convolution [yu2016multi] as our temporal convolution layer (TCN) to capture a node's temporal trends. Dilated causal convolution networks allow an exponentially large receptive field by increasing the layer depth. As opposed to RNN-based approaches, dilated causal convolution networks are able to handle long-range sequences properly in a non-recursive manner, which facilitates parallel computation and alleviates the gradient explosion problem. Dilated causal convolution preserves the temporal causal order by zero-padding the inputs so that predictions made at the current time step involve only historical information. As a special case of standard 1D convolution, the dilated causal convolution operation slides over the input by skipping values with a certain step, as illustrated by Figure 2. Mathematically, given a 1D sequence input $\mathbf{x} \in \mathbb{R}^{T}$ and a filter $\mathbf{f} \in \mathbb{R}^{K}$, the dilated causal convolution operation of $\mathbf{x}$ with $\mathbf{f}$ at step $t$ is represented as
$\mathbf{x} \star \mathbf{f}(t) = \sum_{s=0}^{K-1} \mathbf{f}(s)\, \mathbf{x}(t - d \times s)$,  (8)
where $d$ is the dilation factor which controls the skipping distance. By stacking dilated causal convolution layers with dilation factors in increasing order, the receptive field of a model grows exponentially. This enables dilated causal convolution networks to capture longer sequences with fewer layers, which saves computation resources.
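The dilated causal convolution defined above amounts to a standard convolution that skips $d-1$ values between filter taps, with left zero-padding to keep the output causal. A minimal NumPy sketch of our own:

```python
import numpy as np

def dilated_causal_conv(x, f, d):
    """Dilated causal convolution sketch: y[t] = sum_s f[s] * x[t - d*s].
    Left zero-padding keeps the output causal and the same length as x."""
    K = len(f)
    pad = np.concatenate([np.zeros(d * (K - 1)), x])  # left-pad with zeros
    return np.array([sum(f[s] * pad[t + d * (K - 1) - d * s]
                         for s in range(K))
                     for t in range(len(x))])
```

Each output position only ever reads inputs at or before its own time index, so no future information leaks into the prediction.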
Gated TCN: Gating mechanisms are critical in recurrent neural networks, and they have been shown to be powerful for controlling information flow through the layers of temporal convolution networks as well [dauphin2017language]. A simple Gated TCN contains only an output gate. Given the input $\mathcal{X} \in \mathbb{R}^{N \times D \times S}$, it takes the form
$\mathbf{h} = g(\mathbf{\Theta}_1 \star \mathcal{X} + \mathbf{b}) \odot \sigma(\mathbf{\Theta}_2 \star \mathcal{X} + \mathbf{c})$,  (9)
where $\mathbf{\Theta}_1$, $\mathbf{\Theta}_2$, $\mathbf{b}$, and $\mathbf{c}$ are model parameters, $\odot$ is the element-wise product, $g(\cdot)$ is an activation function of the outputs, and $\sigma(\cdot)$ is the sigmoid function which determines the ratio of information passed to the next layer. We adopt the Gated TCN in our model to learn complex temporal dependencies. Although we empirically set the hyperbolic tangent function as the activation function
, other forms of Gated TCN can easily be fitted into our framework, such as an LSTM-like Gated TCN [kalchbrenner2016neural].

Framework of Graph WaveNet
We present the framework of Graph WaveNet in Figure 3. It consists of stacked spatial-temporal layers and an output layer. A spatial-temporal layer is constructed by a graph convolution layer (GCN) and a gated temporal convolution layer (Gated TCN) which consists of two parallel temporal convolution layers (TCN-a and TCN-b). By stacking multiple spatial-temporal layers, Graph WaveNet is able to handle spatial dependencies at different temporal levels. For example, at the bottom layer the GCN receives short-term temporal information, while at the top layer the GCN tackles long-term temporal information. In practice, the input to a graph convolution layer is a three-dimensional tensor of size $[N, C, L]$, where $N$ is the number of nodes, $C$ is the hidden dimension, and $L$ is the sequence length. We apply the graph convolution layer to each time slice $[:, :, i]$ of this tensor. We choose mean absolute error (MAE) as the training objective of Graph WaveNet, defined by
$L(\hat{\mathbf{X}}^{(t+1):(t+T)}; \Theta) = \frac{1}{TND} \sum_{i=1}^{T} \sum_{j=1}^{N} \sum_{k=1}^{D} \left| \hat{\mathbf{X}}^{(t+i)}_{jk} - \mathbf{X}^{(t+i)}_{jk} \right|$.  (10)
Unlike previous works such as [li2018diffusion, yu2018spatio], our Graph WaveNet outputs $\hat{\mathbf{X}}^{(t+1):(t+T)}$ as a whole rather than generating $\hat{\mathbf{X}}^{(t+i)}$ recursively through $T$ steps. This addresses the inconsistency between training and testing that arises when a model learns to make one-step predictions during training but is expected to produce multi-step predictions during inference. To achieve this, we design the receptive field size of Graph WaveNet to equal the sequence length of the inputs, so that in the last spatial-temporal layer the temporal dimension of the outputs exactly equals one. We then set the number of output channels of the last layer to a factor of the step length $T$ to obtain the desired output dimension.
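Sizing the receptive field to the input length is a simple arithmetic check: for stacked dilated causal convolutions with kernel size $K$ and dilation factors $d_l$, the receptive field is $1 + (K-1)\sum_l d_l$. A tiny helper of our own for this computation:

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked dilated causal convolutions:
    1 + (kernel_size - 1) * sum(dilations). Graph WaveNet sizes this to
    match the input sequence length so the last layer's temporal
    dimension collapses to one and all T steps are emitted at once."""
    return 1 + (kernel_size - 1) * sum(dilations)
```

For example, with kernel size 2, three layers with dilations 1, 2, 4 already cover 8 time steps, whereas three undilated layers cover only 4.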
Experiments
We verify Graph WaveNet on two public traffic network datasets, METR-LA and PEMS-BAY, released by Li et al. [li2018diffusion]. METR-LA records four months of traffic speed statistics from 207 sensors on the highways of Los Angeles County. PEMS-BAY contains six months of traffic speed information from 325 sensors in the Bay Area. We adopt the same data preprocessing procedures as in [li2018diffusion]. The sensor readings are aggregated into 5-minute windows. The adjacency matrix of the nodes is constructed from road network distance with a thresholded Gaussian kernel [shuman2012emerging]. Z-score normalization is applied to the inputs. The datasets are split in chronological order with 70% for training, 10% for validation, and 20% for testing. Detailed dataset statistics are provided in Table 1.

Table 1: Dataset statistics.

Data      #Nodes  #Edges  #Time Steps
METR-LA   207     1515    34272
PEMS-BAY  325     2369    52116
Table 2: Performance comparison of Graph WaveNet and baseline models on METR-LA and PEMS-BAY for 15, 30, and 60 minutes ahead prediction.

METR-LA
Models                      15 min: MAE / RMSE / MAPE    30 min: MAE / RMSE / MAPE    60 min: MAE / RMSE / MAPE
ARIMA [li2018diffusion]     3.99 / 8.21 / 9.60%          5.15 / 10.45 / 12.70%        6.90 / 13.23 / 17.40%
FC-LSTM [li2018diffusion]   3.44 / 6.30 / 9.60%          3.77 / 7.23 / 10.90%         4.37 / 8.69 / 13.20%
WaveNet [oord2016wavenet]   2.99 / 5.89 / 8.04%          3.59 / 7.28 / 10.25%         4.45 / 8.93 / 13.62%
DCRNN [li2018diffusion]     2.77 / 5.38 / 7.30%          3.15 / 6.45 / 8.80%          3.60 / 7.60 / 10.50%
GGRU [zhang2018gaan]        2.71 / 5.24 / 6.99%          3.12 / 6.36 / 8.56%          3.64 / 7.65 / 10.62%
STGCN [yu2018spatio]        2.88 / 5.74 / 7.62%          3.47 / 7.24 / 9.57%          4.59 / 9.40 / 12.70%
Graph WaveNet               2.69 / 5.15 / 6.90%          3.07 / 6.22 / 8.37%          3.53 / 7.37 / 10.01%

PEMS-BAY
Models                      15 min: MAE / RMSE / MAPE    30 min: MAE / RMSE / MAPE    60 min: MAE / RMSE / MAPE
ARIMA [li2018diffusion]     1.62 / 3.30 / 3.50%          2.33 / 4.76 / 5.40%          3.38 / 6.50 / 8.30%
FC-LSTM [li2018diffusion]   2.05 / 4.19 / 4.80%          2.20 / 4.55 / 5.20%          2.37 / 4.96 / 5.70%
WaveNet [oord2016wavenet]   1.39 / 3.01 / 2.91%          1.83 / 4.21 / 4.16%          2.35 / 5.43 / 5.87%
DCRNN [li2018diffusion]     1.38 / 2.95 / 2.90%          1.74 / 3.97 / 3.90%          2.07 / 4.74 / 4.90%
GGRU [zhang2018gaan]        - / - / -                    - / - / -                    - / - / -
STGCN [yu2018spatio]        1.36 / 2.96 / 2.90%          1.81 / 4.27 / 4.17%          2.49 / 5.69 / 5.79%
Graph WaveNet               1.30 / 2.74 / 2.73%          1.63 / 3.70 / 3.67%          1.95 / 4.52 / 4.63%
Baselines
We compare Graph WaveNet with the following models.

ARIMA. Auto-Regressive Integrated Moving Average model with a Kalman filter [li2018diffusion].

FC-LSTM. Recurrent neural network with fully connected LSTM hidden units [li2018diffusion].

WaveNet. A convolution network architecture for sequence data [oord2016wavenet].

DCRNN. Diffusion convolution recurrent neural network [li2018diffusion], which combines graph convolution networks with recurrent neural networks in an encoder-decoder manner.

GGRU. Graph gated recurrent unit network [zhang2018gaan], a recurrent-based approach which uses attention mechanisms in graph convolution.

STGCN. Spatial-temporal graph convolution network [yu2018spatio], which combines graph convolution with 1D convolution.
Experimental Setups
Our experiments are conducted on a machine with one Intel(R) Core(TM) i9-7900X CPU @ 3.30GHz and one NVIDIA Titan Xp GPU. To cover the input sequence length, we use eight layers of Graph WaveNet with a sequence of dilation factors 1, 2, 1, 2, 1, 2, 1, 2. We use Equation 6 as our graph convolution layer with diffusion step $K=2$. We randomly initialize the node embeddings with a uniform distribution and an embedding size of 10. We train our model using the Adam optimizer with an initial learning rate of 0.001. Dropout with p=0.3 is applied to the outputs of the graph convolution layer. The evaluation metrics we choose are mean absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE). Missing values are excluded from both training and testing.
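The masked evaluation described above (missing values excluded) can be sketched as follows. The convention of marking missing entries with a null value is our assumption, mirroring common practice for these traffic datasets; the function name is ours.

```python
import numpy as np

def masked_metrics(pred, true, null_val=0.0):
    """MAE, RMSE, and MAPE computed only over entries where the ground
    truth is observed (entries equal to null_val are treated as missing)."""
    mask = true != null_val
    err = pred[mask] - true[mask]
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    mape = np.abs(err / true[mask]).mean() * 100.0
    return mae, rmse, mape
```

Masking both the error and the denominator keeps MAPE finite even when the raw data contains zero placeholders for missing readings.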
Table 3: Experimental results with different adjacency matrix configurations. Scores are averaged over 12 prediction horizons.

METR-LA
Model Name                 Adjacency Matrix Configuration   Mean MAE  Mean RMSE  Mean MAPE
Identity                   [I]                              3.58      7.18       10.21%
Forward-only               [P_f]                            3.13      6.26       8.65%
Adaptive-only              [A_adp]                          3.10      6.21       8.68%
Forward-backward           [P_f, P_b]                       3.08      6.13       8.25%
Forward-backward-adaptive  [P_f, P_b, A_adp]                3.04      6.09       8.23%

PEMS-BAY
Model Name                 Adjacency Matrix Configuration   Mean MAE  Mean RMSE  Mean MAPE
Identity                   [I]                              1.80      4.05       4.18%
Forward-only               [P_f]                            1.62      3.61       3.72%
Adaptive-only              [A_adp]                          1.61      3.63       3.59%
Forward-backward           [P_f, P_b]                       1.59      3.55       3.57%
Forward-backward-adaptive  [P_f, P_b, A_adp]                1.58      3.52       3.55%
Experimental Results
Table 2 compares the performance of Graph WaveNet and baseline models for 15-, 30-, and 60-minute-ahead prediction on the METR-LA and PEMS-BAY datasets. Graph WaveNet obtains superior results on both datasets. It outperforms temporal models including ARIMA, FC-LSTM, and WaveNet by a large margin. Compared to other spatial-temporal models, Graph WaveNet surpasses the previous convolution-based approach STGCN significantly and outperforms the recurrent-based approaches DCRNN and GGRU at the same time. With respect to the second-best model GGRU, as suggested in Table 2, Graph WaveNet achieves a small improvement on the 15-minute horizon but a larger one on the 60-minute horizon. We think this is because our architecture is more capable of detecting spatial dependencies at each temporal stage. GGRU uses a recurrent architecture in which the parameters of the GCN layer are shared across all recurrent units. In contrast, Graph WaveNet employs stacked spatial-temporal layers which contain separate GCN layers with different parameters. Each GCN layer in Graph WaveNet is therefore able to focus on its own range of temporal inputs. We plot the 60-minute-ahead predicted values vs. real values of Graph WaveNet and WaveNet on a snapshot of the test data in Figure 4. It shows that Graph WaveNet generates more stable predictions than WaveNet. In particular, WaveNet produces a sharp spike that deviates far from the real values, whereas the curve of Graph WaveNet stays in the middle of the real values throughout.
Effect of the Self-Adaptive Adjacency Matrix
To verify the effectiveness of our proposed self-adaptive adjacency matrix, we conduct experiments with Graph WaveNet using five different adjacency matrix configurations. Table 3 shows the average scores of MAE, RMSE, and MAPE over 12 prediction horizons. We find that the adaptive-only model works even better than the forward-only model in terms of mean MAE, so Graph WaveNet is still able to achieve good performance when the graph structure is unavailable. The forward-backward-adaptive model achieves the lowest scores on all three evaluation metrics, indicating that when graph structural information is given, adding the self-adaptive adjacency matrix can introduce new and useful information to the model. In Figure 5, we further investigate the self-adaptive adjacency matrix learned under the forward-backward-adaptive configuration trained on the METR-LA dataset. According to Figure 5, some columns have more high-value points than others, such as column 9 in the left box compared to column 47 in the right box. This suggests that some nodes are influential for most nodes in the graph while others have weaker impacts. Figure 6 confirms this observation: node 9 is located near the intersection of several main roads while node 47 lies on a single road.
Table 4: Computation time on the METR-LA dataset.

Model          Training (s/epoch)  Inference (s)
DCRNN          249.31              18.73
STGCN          19.10               11.37
Graph WaveNet  53.68               2.27
Computation Time
We compare the computation cost of Graph WaveNet with that of DCRNN and STGCN on the METR-LA dataset in Table 4. In training, Graph WaveNet runs five times faster than DCRNN but two times slower than STGCN. For inference, we measure the total time cost of each model on the validation data; Graph WaveNet is the most efficient of all at this stage. This is because Graph WaveNet generates all 12 predictions in one run, while DCRNN and STGCN have to produce the results conditioned on previous predictions.
Conclusion
In this paper, we present a novel model for spatial-temporal graph modeling. Our model captures spatial-temporal dependencies efficiently and effectively by combining graph convolution with dilated causal convolution. We propose an effective method to learn hidden spatial dependencies automatically from the data. This opens up a new direction in spatial-temporal graph modeling where the dependency structure of a system is unknown yet needs to be discovered. On two public traffic network datasets, Graph WaveNet achieves state-of-the-art results. In future work, we will study scalable methods to apply Graph WaveNet to large-scale datasets and explore approaches to learn dynamic spatial dependencies.
Acknowledgments
This research was funded by the Australian Government through the Australian Research Council (ARC) under grants 1) LP160100630 partnership with Australia Government Department of Health and 2) LP150100671 partnership with Australia Research Alliance for Children and Youth (ARACY) and Global Business College Australia (GBCA).