1 Introduction and Related Work
Using deep learning techniques to analyze spatiotemporal data has shown promising results on classification and regression tasks in contexts such as road and traffic networks (Shi & Yeung, 2018), gene expression data (Dutil et al., 2018), and Internet networking (Boutaba et al., 2018). Previous work has relied on supervised learning, largely by combining autoregressive models with graph convolutional layers of different types (Yu et al., 2018; Li et al., 2018; Zhang et al., 2018). In this work, we address the task in the unsupervised setting: we propose an approach for learning node representations for a spatiotemporal graph in a fully unsupervised fashion, encoding information that improves the performance of a downstream forecasting model.
While neural network based learning methods for graph-structured data have received substantial attention, previous work has largely focused on supervised classification tasks.
Veličković et al. (2019) have recently proposed Deep Graph Infomax (DGI), an unsupervised representation learning approach for nodes in non-temporal graphs, and achieved state-of-the-art performance on classification benchmarks. Unlike previous methods (Mutlu & Oghaz, 2018), DGI does not rely on random-walk or adjacency-based methods and instead uses graph convolutions (Kipf & Welling, 2016) to build on the deep mutual information maximization principle described by Hjelm et al. (2019). So far, DGI has only been applied to non-temporal graphs in the node classification setting. In this work, we adapt the mutual information maximization principle to spatiotemporal graphs and show that the learned embeddings can encode valuable information for node regression tasks. We compare our model to a baseline autoregressive model that only exploits temporal information and thus show that we can encode relevant spatial information in a fully unsupervised manner.
2 Background
2.1 Problem Setting
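The node regression task takes the form of a prediction on a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, A)$ with node set $\mathcal{V}$, edge set $\mathcal{E}$, and weighted adjacency matrix $A \in \mathbb{R}^{N \times N}$, where $N = |\mathcal{V}|$. Node features change over time and hence features at time step $t$ are given by $X^{(t)} \in \mathbb{R}^{N \times P}$. Given the features from the most recent $T'$ time steps, the task is to predict the features of the next $T$ time steps using a function $f$, potentially parameterized by a neural network:
$$[X^{(t-T'+1)}, \dots, X^{(t)}; \mathcal{G}] \xrightarrow{\;f\;} [X^{(t+1)}, \dots, X^{(t+T)}] \qquad (1)$$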
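As a rough illustration of this problem setting, the sketch below pairs each run of past feature matrices with the run of future feature matrices that the predictor has to forecast. The function name `make_windows`, the nested-list data layout, and the toy data are assumptions made for this example and are not taken from the paper.

```python
from typing import List, Tuple

# One "frame" is the feature matrix at a single time step,
# stored as [num_nodes][num_features] (an assumed layout).
Frame = List[List[float]]


def make_windows(
    series: List[Frame], n_in: int, n_out: int
) -> List[Tuple[List[Frame], List[Frame]]]:
    """Pair every run of n_in past frames with the n_out frames that follow."""
    pairs = []
    for end in range(n_in, len(series) - n_out + 1):
        history = series[end - n_in:end]      # input to the predictor
        target = series[end:end + n_out]      # frames to be forecast
        pairs.append((history, target))
    return pairs


# Toy data: 30 time steps, 3 nodes, 1 feature per node.
toy_series = [[[float(t + node)] for node in range(3)] for t in range(30)]
windows = make_windows(toy_series, n_in=12, n_out=12)
```

With 12 input and 12 output steps, as in the experiments reported later, each pair covers 24 consecutive time steps of the series.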
2.2 Learning Representations via Mutual Information Maximization
Deep InfoMax (DIM; Hjelm et al., 2019) is a recent approach for unsupervised representation learning that derives embeddings by maximizing the mutual information between the output of an encoder and local patches of the input. DIM builds on Mutual Information Neural Estimation (MINE; Belghazi et al., 2018), which uses neural networks to estimate the mutual information between two random variables. These estimates are obtained by training a classifier (also known as the discriminator or statistics network) to distinguish between samples from the joint distribution and samples from the product of marginals. DIM applies this approach to representation learning by training both the encoder and the discriminator to maximize the mutual information between the random variables corresponding to local input patches and the embeddings. Deep Graph Infomax (DGI) extends this representation learning technique to non-temporal graphs, finding node embeddings that maximize the mutual information between local patches of the graph and summaries of the entire graph. Here, we build on these methods and propose a representation learning technique for spatiotemporal graphs. Furthermore, unlike previous work, we evaluate our embeddings in the regression rather than the classification setting.
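For orientation, the kind of estimate MINE builds on can be written as the Donsker-Varadhan lower bound on mutual information. The symbols $X$, $Y$, and $T_\theta$ (the statistics network with parameters $\theta$) are notation chosen for this illustration rather than notation taken from the text above.

```latex
I(X; Y) \;\geq\; \mathbb{E}_{p(x, y)}\big[ T_\theta(x, y) \big]
        \;-\; \log \mathbb{E}_{p(x)\,p(y)}\big[ e^{T_\theta(x, y)} \big]
```

The first expectation is taken over samples from the joint distribution and the second over samples from the product of marginals, which is why the statistics network can be viewed as telling the two apart. Maximizing the right-hand side over $\theta$ tightens the bound.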
3 Spatio-Temporal Deep Graph Infomax
We extend the DGI approach by adapting it to spatiotemporal graphs and refer to our method as Spatio-Temporal Deep Graph Infomax (STDGI). At each time step, representations are trained for each node in the graph in a fully unsupervised fashion. Similarly to DIM, we train the encoder to maximize the mutual information between patches of the graph at a particular time step $t$ and the raw features of the same node at a future time step $t + k$. The goal is to aggregate, for each node, the information from its neighbourhood that is most relevant for predicting its features in the future.
3.1 Architecture Components
The unsupervised training setup is equivalent to the one in DIM or DGI. An encoder computes an embedding for each node at each time step, using the node features at that time step and the graph structure. The discriminator then receives pairs containing the embedding and the raw features of the same node at the current and next time step, respectively. We refer to such a pair as a positive sample if both elements were drawn from the same graph. A negative sample consists of an embedding and raw features, where the latter are obtained from a corrupted version of the graph, derived by randomly permuting the node features of the graph at each time step. Positive samples can be understood as being drawn from the joint distribution of embeddings and raw features, whereas negative samples are drawn from the product of the marginal distributions. The discriminator outputs a score indicating whether a given pair is a positive or a negative sample; the encoder and discriminator are trained jointly to distinguish between positive and negative samples by minimizing the binary cross-entropy loss. This maximizes the mutual information between the embeddings and the raw features of the next time step (Poole et al., 2018; Belghazi et al., 2018; Hjelm et al., 2019).
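The sampling and loss described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `corrupt`, `bce_loss`, and `toy_score` are hypothetical, and the placeholder discriminator (a sigmoid of a dot product) stands in for the trained network.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]


def corrupt(features: Sequence[Vector], rng: random.Random) -> List[Vector]:
    """Corruption step: randomly permute which node each feature row belongs to."""
    shuffled = list(features)
    rng.shuffle(shuffled)
    return shuffled


def bce_loss(
    embeddings: Sequence[Vector],
    future: Sequence[Vector],
    future_corrupted: Sequence[Vector],
    score: Callable[[Vector, Vector], float],
) -> float:
    """Binary cross entropy averaged over all positive and negative pairs."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for h, x_pos, x_neg in zip(embeddings, future, future_corrupted):
        total -= math.log(score(h, x_pos) + eps)        # positive pair, label 1
        total -= math.log(1.0 - score(h, x_neg) + eps)  # negative pair, label 0
    return total / (2 * len(embeddings))


def toy_score(h: Vector, x: Vector) -> float:
    """Placeholder discriminator (an assumption): sigmoid of a dot product."""
    logit = sum(a * b for a, b in zip(h, x))
    return 1.0 / (1.0 + math.exp(-logit))


rng = random.Random(0)
embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # one row per node
future = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]       # raw features at a later step
corrupted = corrupt(future, rng)
loss = bce_loss(embeddings, future, corrupted, toy_score)
```

In a real training loop the loss would be minimized with respect to the parameters of both the encoder and the discriminator; here the embeddings and the scoring function are fixed so that the example stays self-contained.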
To use the learned representations for the forecasting task, the embeddings output by the encoder are concatenated with the raw features during supervised training, and the resulting features serve as input to a downstream supervised regressor.
4 Experiments
We devise an evaluation setup for the traffic forecasting task to determine whether embeddings successfully encode spatial information of the graph that is relevant for making more accurate predictions. From a high-level point of view, the graph structure corresponds to a network of traffic sensors, where nodes are individual traffic sensors. An edge between two nodes is added when the distance between the two corresponding sensors is below a certain threshold. The time series of node features are given by the traffic measurements of each sensor over time.
4.1 Experimental Setup
We use the METR-LA dataset (Jagadish et al., 2014), which contains data recorded by 207 traffic sensors. The traffic measurements were aggregated into five-minute intervals and consist of traffic speed and the time of day. The graph is given by a directed, weighted adjacency matrix whose edge weights decay exponentially with the distance between sensors along the roads. For more details on the dataset, we refer to Jagadish et al. (2014, E.1). Given the past 12 time steps (corresponding to measurements over one hour), the predictor has to forecast the traffic speeds at the next 12 time steps.
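One common way to obtain such a graph is a thresholded Gaussian kernel over road distances; the sketch below uses that form as an assumption, since the exact kernel is not specified here. The function name `build_adjacency`, the parameters `sigma` and `max_dist`, and the toy distance matrix are all hypothetical.

```python
import math
from typing import List


def build_adjacency(
    dist: List[List[float]], sigma: float, max_dist: float
) -> List[List[float]]:
    """Weight edge (i, j) by exp(-d_ij^2 / sigma^2); drop pairs beyond max_dist.

    dist[i][j] is the road distance from sensor i to sensor j. It need not be
    symmetric, so the resulting adjacency matrix is directed.
    """
    n = len(dist)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Self-loops are left out here (an assumption of this sketch).
            if i != j and dist[i][j] <= max_dist:
                adj[i][j] = math.exp(-(dist[i][j] ** 2) / (sigma ** 2))
    return adj


# Toy road distances between three sensors (arbitrary units).
road_dist = [
    [0.0, 1.0, 5.0],
    [2.0, 0.0, 1.0],
    [5.0, 1.5, 0.0],
]
adjacency = build_adjacency(road_dist, sigma=2.0, max_dist=3.0)
```

Because road distances can differ by direction of travel, the weights from sensor 0 to sensor 1 and from sensor 1 to sensor 0 differ in this toy example.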
The encoder consists of a linear layer applied to the features of each node, followed by two graph convolutional layers applied to the graph at each time step. The discriminator is a two-layer fully connected neural network, which concatenates the embedding and raw features of each pair and outputs whether the pair is a positive or negative sample. We chose to train three separate discriminators of the same architecture. Each discriminator compares the embedding to the raw features of the same node a fixed number of steps in the future, with a different offset for each discriminator. For the downstream regressor, we employ an LSTM seq2seq model (Sutskever et al., 2014), which operates on the time series of each node in isolation. As a baseline, we compare our regressor to one with an identical configuration that receives as input only the raw features (rather than the concatenation of the raw features and embeddings). More details on the experimental setup can be found in Appendix A.
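The encoder structure (one per-node linear layer followed by two graph convolutions) can be sketched as a forward pass in plain Python. This is an illustrative sketch only: the row normalization with self-loops, the ReLU activations, the omission of biases, the random weights, and the toy layer sizes are all assumptions, and every function name is hypothetical.

```python
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m: Matrix) -> Matrix:
    return [[max(0.0, v) for v in row] for row in m]


def row_normalize(adj: Matrix) -> Matrix:
    """Add self-loops, then scale each row to sum to one."""
    n = len(adj)
    with_loops = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    return [[v / sum(row) for v in row] for row in with_loops]


def random_weights(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def encode(x: Matrix, adj: Matrix, w_in: Matrix, w_gc1: Matrix, w_gc2: Matrix) -> Matrix:
    """Linear layer per node, then two graph convolutions over the neighbourhood."""
    a_hat = row_normalize(adj)
    h = relu(matmul(x, w_in))                   # linear layer, applied to each node
    h = relu(matmul(a_hat, matmul(h, w_gc1)))   # first graph convolution
    return matmul(a_hat, matmul(h, w_gc2))      # second graph convolution


rng = random.Random(0)
x_t = [[0.6, 0.1], [0.5, 0.1], [0.2, 0.1], [0.9, 0.1]]   # 4 nodes, 2 features each
adj = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
emb = encode(
    x_t, adj,
    random_weights(2, 8, rng), random_weights(8, 8, rng), random_weights(8, 16, rng),
)
```

Each graph convolution mixes a node's hidden state with those of its neighbours, so after two layers an embedding can depend on nodes up to two hops away.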
4.2 Results
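Table 1: Prediction errors of the LSTM baseline and the STDGI regressor at three forecasting horizons.
Metric | Method | 15 min | 30 min | 60 min
MAE | LSTM Baseline | | |
MAE | STDGI | | |
RMSE | LSTM Baseline | | |
RMSE | STDGI | | |
MAPE | LSTM Baseline | | |
MAPE | STDGI | | |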
Results for the LSTM regressor that only uses raw features (baseline) and for the one that uses raw features concatenated with STDGI embeddings are shown in Table 1. We find that regressors using STDGI embeddings achieve lower prediction errors for all time horizons considered, and that the improvements become more pronounced for larger time horizons. This suggests that STDGI extracts embeddings with useful long-term features that improve upon the predictive ability of raw data alone. All performance increases documented here are statistically significant.
These results are further supported by the qualitative study in Figure 2, which illustrates t-SNE-processed embeddings colored by future time point speeds. As indicated by their color, closely clustered embeddings generally share similar speeds. This suggests that embedding similarity is a proxy for speed similarity, and thus that the embeddings capture useful long-term information.
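For reference, the three error measures reported in Table 1 can be computed as follows. The function names and the toy speeds are made up for illustration, and skipping zero targets in MAPE is an assumption made here to avoid division by zero.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent; zero targets are skipped."""
    ratios = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred) if t != 0]
    return 100.0 * sum(ratios) / len(ratios)


# Toy speeds, not taken from the dataset.
speeds_true = [60.0, 50.0, 40.0]
speeds_pred = [57.0, 52.0, 44.0]
errors = (
    mae(speeds_true, speeds_pred),
    rmse(speeds_true, speeds_pred),
    mape(speeds_true, speeds_pred),
)
```

RMSE penalizes large individual errors more heavily than MAE, while MAPE expresses the error relative to the true speed, so the three measures can rank models differently.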
5 Conclusion
We have presented STDGI, an approach for learning embeddings of nodes in a graph that evolves over time, which leverages mutual information maximization and supports node regression in a spatiotemporal prediction context. We demonstrate that an autoregressive seq2seq model operating on the time axis achieves higher predictive performance when making use of STDGI embeddings, indicating that the method encodes valuable information beyond that available to the baseline, even in the more challenging setting of regression. Moreover, the results show that the advantage of the STDGI representations over the baseline grows as the prediction horizon becomes larger, which suggests that they capture intrinsic and helpful properties of traffic flow. Our model represents both a generalization of DGI to spatiotemporal settings and an extension of the method from classification to regression. Future work will aim to find embeddings that further increase the accuracy of the downstream regressor and provide high-quality representations for multiple predictive tasks.
References
Belghazi et al. (2018) Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual Information Neural Estimation. arXiv preprint arXiv:1801.04062, 2018.
Boutaba et al. (2018) Raouf Boutaba, Mohammad A. Salahuddin, Noura Limam, Sara Ayoubi, Nashid Shahriar, Felipe Estrada-Solano, and Oscar M. Caicedo. A comprehensive survey on machine learning for networking: evolution, applications and research opportunities. Journal of Internet Services and Applications, 9(1):16, 2018.
Dutil et al. (2018) Francis Dutil, Joseph Paul Cohen, Martin Weiss, Georgy Derevyanko, and Yoshua Bengio. Towards Gene Expression Convolutions using Gene Interaction Graphs. Technical report, 2018.
Hjelm et al. (2019) R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio. Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations, 2019.
Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, 1997.
Jagadish et al. (2014) H. V. Jagadish, Johannes Gehrke, Alexandros Labrinidis, Yannis Papakonstantinou, Jignesh M. Patel, Raghu Ramakrishnan, and Cyrus Shahabi. Big Data and Its Technical Challenges. Commun. ACM, 57(7):86–94, July 2014.
Kingma & Ba (2014) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Kipf & Welling (2016) Thomas N Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. arXiv preprint arXiv:1609.02907, 2016.
Li et al. (2018) Yaguang Li, Rose Yu, Cyrus Shahabi, and Yan Liu. Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. In International Conference on Learning Representations (ICLR '18), 2018.
Mutlu & Oghaz (2018) Ece C Mutlu and Toktam A Oghaz. Review on Graph Feature Learning and Feature Extraction Techniques for Link Prediction. 2018.
Poole et al. (2018) Ben Poole, Sherjil Ozair, Aäron van den Oord, Alexander A Alemi, and George Tucker. On variational lower bounds of mutual information. NeurIPS Workshop on Bayesian Deep Learning, 2018.
Shi & Yeung (2018) Xingjian Shi and Dit-Yan Yeung. Machine Learning for Spatiotemporal Sequence Forecasting: A Survey. arXiv preprint arXiv:1808.06865, 2018.
Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pp. 3104–3112, 2014.
Veličković et al. (2019) Petar Veličković, William Fedus, William L. Hamilton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep Graph Infomax. In International Conference on Learning Representations, 2019.
Yu et al. (2018) Bing Yu, Haoteng Yin, and Zhanxing Zhu. Spatio-Temporal Graph Convolutional Networks: A Deep Learning Framework for Traffic Forecasting. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pp. 3634–3640. International Joint Conferences on Artificial Intelligence Organization, 2018.
Zhang et al. (2018) Jiani Zhang, Xingjian Shi, Junyuan Xie, Hao Ma, Irwin King, and Dit-Yan Yeung. GaAN: Gated Attention Networks for Learning on Large and Spatiotemporal Graphs. In Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, UAI 2018, Monterey, California, USA, August 6-10, 2018, pp. 339–349, 2018.
Appendix A Details on Experimental Setup
The METR-LA dataset (Jagadish et al., 2014) contains data recorded by 207 traffic sensors throughout Los Angeles County, from March 1st, 2012 to June 20th, 2012. In our experiments, we use the canonical split of the dataset into training, validation, and test sets.
All layers in the encoder contain 64 hidden units and the embedding size is 128. The two fully connected layers of the discriminator contain 6 and 1 hidden units, respectively. The seq2seq downstream regressor consists of a single LSTM layer (Hochreiter & Schmidhuber, 1997) with 64 hidden units.
All models are trained with a batch size of 64 for 120 epochs, using the Adam optimizer (Kingma & Ba, 2014). The unsupervised training of embeddings is carried out for 100 epochs, with a learning rate that is reduced by a fixed factor every 30 epochs after the first 20. The supervised models use the mean absolute error (MAE) over the entire horizon of 12 steps as the loss function. Their learning rate likewise decreases by a fixed factor every 30 epochs after the first 20.
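A step schedule of this kind can be sketched as below. The initial rate and the decay factor in the example call are hypothetical placeholders, not the values used in the experiments, and the reading that the first reduction happens at epoch 20 (counting from zero) is an assumption.

```python
def learning_rate(
    epoch: int,
    base_lr: float,
    decay: float,
    hold_epochs: int = 20,
    step: int = 30,
) -> float:
    """Keep base_lr for hold_epochs epochs, then multiply by decay every `step` epochs."""
    if epoch < hold_epochs:
        return base_lr
    # First reduction at epoch `hold_epochs`, further ones every `step` epochs.
    n_decays = 1 + (epoch - hold_epochs) // step
    return base_lr * decay ** n_decays


# Hypothetical values chosen only to show the shape of the schedule.
schedule = [learning_rate(e, base_lr=0.01, decay=0.1) for e in range(120)]
```

Under this reading, the rate stays constant for epochs 0 to 19, drops once at epoch 20, and drops again at epochs 50, 80, and 110.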