Spatial and time-dependent data describe the evolution of signals (i.e., the values of attributes) at multiple spatial locations across time [31, 11]. Such data occur in many domains, including economics, global trade, environmental studies, public health, and traffic networks, to name a few. For example, the gross domestic product (GDP) of different countries over the past century, the daily temperature measurements of different cities over the last decade, and the hourly taxi ride-hailing demand at various urban locations over the past year are all spatial and time-dependent data. Forecasting such data makes it possible to allocate resources and take actions proactively, improving the efficiency of society and the quality of life.
However, forecasting spatial and time-dependent data is challenging: they exhibit complex spatial dependency, long-range temporal dependency, heterogeneity, and non-stationarity. Take the spatial and time-dependent data in a traffic network as an example. The data at a location (e.g., taxi ride-hailing demand) may correlate more with the data at a geometrically remote location than with the data at a nearby location. Also, the data at a time instant may depend not only on the data at a recent time instant, say an hour ago, but may also be highly correlated with the data from a day or even a week ago, showing strong long-range temporal dependency. Additionally, spatial and time-dependent data may be influenced by many other relevant factors (e.g., weather influences taxi demand), and these factors should be taken into account. In other words, in this paper we propose to perform forecasting with heterogeneous sources of data at different spatial and time scales, including auxiliary information of a different nature or modality. Further, the data may be non-stationary due to unexpected incidents or traffic accidents. This non-stationarity makes conventional time series forecasting methods such as Auto-Regressive Integrated Moving Average (ARIMA) and Kalman filtering, which usually rely on stationarity, inappropriate for accurate forecasting with spatial and time-dependent data [13, 32].
Recently, deep learning models have been proposed for forecasting spatial and time-dependent data [13, 28, 10, 29, 6, 30, 27, 32]. To deal with spatial dependency, most of these models use either pre-defined distance/similarity metrics or other prior knowledge, such as the adjacency matrices of traffic networks, to determine dependencies among locations. Then, they often use a (standard or graph) convolutional neural network (CNN) to better characterize the spatial dependency between these locations. These ad-hoc methods may lead to errors. For example, locations that are considered dependent may actually be independent in practice. Regarding temporal dependency, most of these models use recurrent neural networks (RNNs), CNNs, or their variants to capture the long-range temporal dependency and non-stationarity of the data. But it is well documented that these networks may fail to capture temporal dependencies between distant time epochs [8, 23].
To tackle these challenges, we propose Forecaster, a new deep learning architecture for forecasting spatial and time-dependent data. Our architecture consists of two parts. First, we use the theory of Gaussian Markov random fields to learn the structure of the graph that parsimoniously represents the spatial dependency between the locations (we call such a graph a dependency graph); the structure follows from the estimated precision matrix of the Gaussian Markov random field [5]. (Footnote 1: The approach to estimate the precision matrix of a Gaussian Markov random field, i.e., graphical lasso, can also be used with non-Gaussian distributions.) The precision matrix provides the graph structure, with each node representing a location and each edge representing the dependency between two locations. This contrasts with prior work on forecasting: we learn the spatial dependencies from the data. Second, we integrate the dependency graph in the architecture of the Transformer for forecasting spatial and time-dependent data. The Transformer and its extensions [23, 2, 26, 4, 18] have been shown to significantly outperform RNNs and CNNs in NLP tasks, as they capture relations among data at distant positions, significantly improving the learning of long-range temporal dependency. In Forecaster, in order to better capture the spatial dependencies, we associate each neuron in different layers with a spatial location. Then, we sparsify the Transformer based on the dependency graph: if two locations are not connected, we prune the connection between their associated neurons. In this way, the state encoding for each location is only impacted by its own state encoding and the encodings of other dependent locations. Moreover, pruning the unnecessary connections in the Transformer helps avoid overfitting.
To evaluate the effectiveness of our proposed architecture, we apply it to the task of forecasting taxi ride-hailing demand in New York City. We pick 996 high-demand locations in New York City and forecast the hourly taxi ride-hailing demand around each location, using data from January 1st, 2009 to June 30th, 2016. Our architecture accounts for crucial auxiliary information such as weather, day of the week, hour of the day, and holidays, which significantly improves forecasting. Evaluation results show that our architecture reduces the root mean square error (RMSE) and mean absolute percentage error (MAPE) of the Transformer by 8.8210% and 9.6192%, respectively, and that it significantly outperforms other state-of-the-art baselines.
In this paper, we present the following critical innovations:
- Forecaster combines the theory of Gaussian Markov random fields with deep learning. It uses the former to find the dependency graph among locations, and this graph becomes the basis for the deep learner to forecast spatial and time-dependent data.
- Forecaster sparsifies the architecture of the Transformer based on the dependency graph, allowing the Transformer to better capture the spatiotemporal dependencies within the data.
- We apply Forecaster to forecasting taxi ride-hailing demand and demonstrate the advantage of its architecture over state-of-the-art baselines.
In this section, we introduce the proposed architecture of Forecaster. We start by formalizing the problem of forecasting spatial and time-dependent data (Section 2.1). Then, we use Gaussian Markov random fields to determine the dependency graph among data at different locations (Section 2.2). Based on this dependency graph, we design a sparse linear layer, which is a fundamental building block of Forecaster (Section 2.3). Finally, we present the entire architecture of Forecaster (Section 2.4).
2.1 Problem Statement
We define spatial and time-dependent data as a series of spatial signals, each collecting the data at all spatial locations at a certain time. For example, the hourly taxi demand at a thousand locations in 2019 is spatial and time-dependent data, while the hourly taxi demand at these locations between 8 a.m. and 9 a.m. on January 1st, 2019 is a spatial signal. The goal of our forecasting task is to predict the future spatial signals given the historical spatial signals and historical/future auxiliary information (e.g., weather history and forecast). We formalize forecasting as learning a function $f(\cdot)$ that maps $T$ historical spatial signals and the corresponding historical/future auxiliary information to $T'$ future spatial signals, as in Equation (1):

$$[x_{t-T+1}, \cdots, x_t;\ a_{t-T+1}, \cdots, a_{t+T'}] \xrightarrow{f(\cdot)} [x_{t+1}, \cdots, x_{t+T'}] \tag{1}$$

where $x_t$ is the spatial signal at time $t$, $x_t = [x_t^1, \cdots, x_t^N]^\top \in \mathbb{R}^N$, with $x_t^i$ the data at location $i$ at time $t$; $N$ the number of locations; $a_t$ the auxiliary information at time $t$, $a_t \in \mathbb{R}^P$, with $P$ the dimension of the auxiliary information; and $\mathbb{R}$ is the set of the reals. (Footnote 2: For simplicity, we assume in this work that different locations share the same auxiliary information, i.e., the same $a_t$ can impact the data at any location $i$. However, it is easy to generalize our approach to the case where locations do not share the same auxiliary information.)
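As a rough illustration of the mapping in Equation (1), the following sketch slices a toy series into one input/target pair. The function name `make_sample`, the argument names, and the toy data are hypothetical and are not taken from the paper; `T` stands for the history length and `T_prime` for the forecasting horizon.

```python
def make_sample(signals, aux, t, T, T_prime):
    """Build one (input, target) pair for a forecast issued at time index t.

    signals[k] is the spatial signal (one value per location) at time k;
    aux[k] is the auxiliary vector at time k. All names are illustrative.
    """
    if t - T + 1 < 0 or t + T_prime >= min(len(signals), len(aux)):
        raise ValueError("not enough history or future data around t")
    history = signals[t - T + 1 : t + 1]            # T historical spatial signals
    aux_window = aux[t - T + 1 : t + T_prime + 1]   # historical and future auxiliary vectors
    target = signals[t + 1 : t + T_prime + 1]       # T' future spatial signals
    return (history, aux_window), target


# Toy data: 10 time steps, 2 locations, 1 auxiliary feature.
signals = [[k, 10 * k] for k in range(10)]
aux = [[k % 2] for k in range(10)]
(history, aux_window), target = make_sample(signals, aux, t=5, T=4, T_prime=3)
```

Note that the auxiliary window is longer than the history because it also covers the forecast horizon (e.g., a weather forecast), which is the asymmetry the problem statement describes.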
2.2 Gaussian Markov Random Field
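We use Gaussian Markov random fields to find the dependency graph of the data over the different spatial locations. Gaussian Markov random fields model the spatial and time-dependent data as a multivariate Gaussian distribution over the $N$ spatial locations, i.e., the probability density function of the spatial signal $x_t$ is

$$p(x_t) = \frac{|Q|^{1/2}}{(2\pi)^{N/2}} \exp\left(-\frac{1}{2}(x_t - \mu)^\top Q\,(x_t - \mu)\right)$$

where $\mu$ and $Q$ are the expected value (mean) and precision matrix (inverse of the covariance matrix) of the distribution.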
The precision matrix $Q$ characterizes the conditional dependency between different locations: whether the data $x^i$ and $x^j$ at the $i$-th and $j$-th locations depend on each other or not, given the data at all the other locations ($x^{-ij}$). We can measure the conditional dependency between locations $i$ and $j$ through their conditional correlation coefficient $\mathrm{Corr}(x^i, x^j \mid x^{-ij})$:

$$\mathrm{Corr}(x^i, x^j \mid x^{-ij}) = -\frac{Q_{ij}}{\sqrt{Q_{ii}\,Q_{jj}}}$$

where $Q_{ij}$ is the $(i, j)$-th entry of $Q$. In practice, we set a threshold on $\mathrm{Corr}(x^i, x^j \mid x^{-ij})$, and treat locations $i$ and $j$ as conditionally dependent if its absolute value is above the threshold.
The non-zero entries define the structure of the dependency graph between locations. Figure 1 shows an example of a dependency graph. Locations 1 and 2 and locations 2 and 3 are conditionally dependent, while locations 1 and 3 are conditionally independent. This example illustrates the advantage of Gaussian Markov random fields over ad-hoc pairwise similarity metrics: the former lead to parsimonious (sparse) graph representations.
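The three-location example can be reproduced numerically. The sketch below computes conditional correlation coefficients from a precision matrix and thresholds them into dependency edges. The precision matrix values and the function names are hypothetical, chosen only so that locations 1 and 3 come out conditionally independent; indices in the code are zero-based.

```python
import math


def conditional_correlation(Q):
    """Conditional (partial) correlations implied by a precision matrix Q."""
    n = len(Q)
    corr = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                corr[i][j] = -Q[i][j] / math.sqrt(Q[i][i] * Q[j][j])
    return corr


def dependency_edges(Q, threshold):
    """Pairs (i, j), i < j, whose absolute conditional correlation exceeds the threshold."""
    corr = conditional_correlation(Q)
    n = len(Q)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(corr[i][j]) > threshold}


# Hypothetical precision matrix for a chain of three locations: 1 - 2 - 3.
Q = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]
edges = dependency_edges(Q, threshold=0.1)
```

With this matrix the zero entry between the first and third locations yields no edge, even though the two locations are still marginally correlated through the second one; this is the sense in which the conditional view gives a sparser graph than a pairwise similarity metric.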
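We estimate the precision matrix $Q$ by graphical lasso, an L1-penalized maximum likelihood estimator:

$$\hat{Q} = \arg\min_{Q \succeq 0}\ \Big(\mathrm{tr}(SQ) - \log\det(Q) + \lambda \|Q\|_1\Big)$$

where $\lambda$ is the weight of the L1 penalty and $S$ is the empirical covariance matrix computed from the data:

$$S = \frac{1}{M-1}\sum_{t=1}^{M}(x_t - \mu)(x_t - \mu)^\top$$

where $M$ is the number of time samples used to compute $S$.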
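To make the estimator concrete, the sketch below computes an empirical covariance matrix and evaluates an L1-penalized negative log-likelihood for a candidate precision matrix. It does not solve the minimization; in practice an off-the-shelf solver such as scikit-learn's `GraphicalLasso` would do that. Several points are assumptions rather than details from the paper: the covariance is normalized by M - 1 around the sample mean, the penalty is applied to all entries of the matrix, and the function names are illustrative.

```python
import math


def empirical_covariance(samples):
    """Sample covariance of M spatial signals (rows), normalized by M - 1 (assumption)."""
    M, N = len(samples), len(samples[0])
    mu = [sum(x[i] for x in samples) / M for i in range(N)]
    return [[sum((x[i] - mu[i]) * (x[j] - mu[j]) for x in samples) / (M - 1)
             for j in range(N)] for i in range(N)]


def log_det_spd(Q):
    """log(det(Q)) of a symmetric positive-definite matrix via Cholesky factorization."""
    n = len(Q)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = Q[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0:
                    raise ValueError("Q is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return 2.0 * sum(math.log(L[i][i]) for i in range(n))


def graphical_lasso_objective(Q, S, lam):
    """tr(SQ) - log det(Q) + lam * ||Q||_1, with the penalty on all entries (assumption)."""
    n = len(Q)
    trace_SQ = sum(S[i][j] * Q[j][i] for i in range(n) for j in range(n))
    l1_norm = sum(abs(Q[i][j]) for i in range(n) for j in range(n))
    return trace_SQ - log_det_spd(Q) + lam * l1_norm


# Toy usage: three time samples at two locations.
S = empirical_covariance([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])
identity = [[1.0, 0.0], [0.0, 1.0]]
value = graphical_lasso_objective(identity, identity, lam=0.5)
```

A larger penalty weight makes non-zero entries of the candidate matrix more expensive, which is what pushes the estimate toward a sparse dependency graph.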
2.3 Building Block: Sparse Linear Layer
We use the dependency graph to sparsify the architecture of the Transformer. This leads to the Transformer better capturing the spatial dependencies within the data. There are multiple linear layers in the Transformer. Our sparsification on the Transformer replaces all these linear layers by the sparse linear layers described in this section.
We use the dependency graph to build a sparse linear layer. Figure 2 shows an example (based on the dependency graph in Figure 1). Suppose that initially the $(l+1)$-th layer (of nine neurons) is fully connected to the $l$-th layer (of five neurons). We assign neurons to the data at different locations and to the auxiliary information as follows. In this example, we assign one neuron to each location and two neurons to the auxiliary information at the $l$-th layer, and two neurons to each location and three neurons to the auxiliary information at the $(l+1)$-th layer. (Footnote 3: How to assign neurons is a design choice for users.) After assigning neurons, we prune connections based on the structure of the dependency graph. As locations 1 and 3 are conditionally independent, we prune the connections between them. We also prune the connections between the neurons associated with locations and those associated with the auxiliary information. (Footnote 4: This is another design choice for users.) This way, the encoding for the data at a location is only impacted by the encodings of itself and of its dependent locations, better capturing the spatial dependency between locations. Moreover, pruning the unnecessary connections between conditionally independent locations helps avoid overfitting.
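The pruning rule in this example can be written as a 0/1 mask over the weight matrix. The following is a minimal sketch, not the paper's implementation: the function names and the group labels (location ids and the string "aux") are hypothetical, and the neuron assignment follows the five-neuron and nine-neuron layers described in the example.

```python
def build_mask(in_groups, out_groups, edges):
    """Return a 0/1 mask with one row per output neuron and one column per input neuron.

    in_groups[k] / out_groups[k] name the group (a location id or "aux") that
    neuron k is assigned to. A connection is kept only when both neurons belong
    to the same group or to two locations joined by a dependency edge.
    """
    linked = set(edges) | {(b, a) for a, b in edges}
    return [[1 if (g_out == g_in or (g_out, g_in) in linked) else 0
             for g_in in in_groups]
            for g_out in out_groups]


def sparse_linear(x, weights, bias, mask):
    """Compute y = (W * mask) x + b, so pruned weights never contribute."""
    return [sum(w * m * xi for w, m, xi in zip(w_row, m_row, x)) + b
            for w_row, m_row, b in zip(weights, mask, bias)]


# Dependency graph of the example: 1 - 2 - 3 (locations 1 and 3 are not linked).
edges = [(1, 2), (2, 3)]
in_groups = [1, 2, 3, "aux", "aux"]                        # five input neurons
out_groups = [1, 1, 2, 2, 3, 3, "aux", "aux", "aux"]       # nine output neurons
mask = build_mask(in_groups, out_groups, edges)
kept = sum(map(sum, mask))                                 # connections that survive pruning
```

In a deep learning framework the same mask would typically be multiplied elementwise with the weight matrix before the matrix product, so that pruned weights also receive zero gradient during training.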
2.4 Entire Architecture: Graph Transformer
Forecaster adopts an architecture similar to that of the Transformer, except that it substitutes all the linear layers in the Transformer with our sparse linear layer designed based on the dependency graph. Figure 3 shows its architecture. Forecaster employs an encoder-decoder architecture. The encoder encodes the historical spatial signals and historical auxiliary information; the decoder predicts the future spatial signals based on the output of the encoder and the future auxiliary information. Due to the page limit, we omit what Forecaster shares with the Transformer (e.g., (masked) multi-head attention, positional encoding) and focus only on their differences.
At each time step in the history, we concatenate the spatial signal with its auxiliary information. This way, we obtain a sequence where each element is a vector consisting of the spatial signal and the auxiliary information at a specific time step. The encoder takes this sequence as input. Then, a sparse embedding layer (consisting of a sparse linear layer with ReLU activation) maps each element of this sequence to the state space of the model and outputs a new sequence. In Forecaster, except for the sparse linear layer at the end of the decoder, all the layers have the same output dimension. We refer to the space with this dimension as the state space of the model. After that, we add positional encoding to the new sequence and let it pass through stacked encoder layers to generate the encoding of the input sequence. Each encoder layer consists of a sparse multi-head attention layer and a sparse feedforward layer. These layers are the same as the multi-head attention layer and feedforward layer in the Transformer, except that sparse linear layers replace the linear layers within them.
For each time step in the future, we concatenate its auxiliary information with the (predicted) spatial signal one step before. Then, we input this sequence to the decoder. The decoder first uses a sparse embedding layer to map each element of the sequence to the state space of the model, and then passes it through stacked decoder layers to obtain the new encoding of each element. Finally, the decoder uses a sparse linear layer to project this encoding back and predict the next spatial signal. Like the Transformer, the decoder layer contains two sparse multi-head attention layers. The first one compares the elements in the input sequence of the decoder, obtaining a new encoding for each element of the sequence; the second one enables the comparison between the input sequences of the decoder and the encoder so that we can learn from the past history.
In this section, we apply Forecaster to the problem of forecasting taxi ride-hailing demand in Manhattan, New York City. We demonstrate that Forecaster outperforms the state-of-the-art baselines (Transformer and DCRNN).
3.1 Evaluation Settings
Our evaluation uses the NYC Taxi dataset from 01/01/2009 to 06/30/2016 (7.5 years in total). This dataset records detailed information for each taxi trip in New York City, including its pickup and dropoff locations. Based on this dataset, we select 996 locations with high taxi ride-hailing demand in Manhattan, New York City, shown in Figure 4. Specifically, we compute the taxi ride-hailing demand at each location by accumulating the taxi rides closest to that location. Note that these selected locations are not uniformly distributed, as different regions of Manhattan have distinct taxi demand. We split the dataset into three parts: training set, validation set, and test set. The training set uses the data in the time intervals 01/01/2009 – 12/31/2011 and 07/01/2012 – 06/30/2015; the validation set uses the data in 01/01/2012 – 06/30/2012; and the test set uses the data in 07/01/2015 – 06/30/2016.
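The date-based split above can be expressed as a small lookup. This is only a sketch: the date ranges are the ones stated in the text and are treated as inclusive, while the names `SPLITS` and `assign_split` are hypothetical.

```python
from datetime import date

# Inclusive date ranges for each part of the dataset, as stated in the text.
SPLITS = {
    "train": [(date(2009, 1, 1), date(2011, 12, 31)),
              (date(2012, 7, 1), date(2015, 6, 30))],
    "validation": [(date(2012, 1, 1), date(2012, 6, 30))],
    "test": [(date(2015, 7, 1), date(2016, 6, 30))],
}


def assign_split(day):
    """Return the name of the split that contains `day`, or None if it is out of range."""
    for name, ranges in SPLITS.items():
        if any(start <= day <= end for start, end in ranges):
            return name
    return None
```

For instance, `assign_split(date(2012, 3, 15))` falls in the validation interval, which sits between the two training intervals.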
Our evaluation uses hourly weather data from Weather Underground to construct (part of) the auxiliary information. Each record in this weather data contains seven entries: temperature, wind speed, precipitation, visibility, and Booleans for rain, snow, and fog.
3.1.2 Details of the Forecasting Task
In our evaluation, we forecast taxi demand for the next three hours based on the previous 674 hours and the corresponding auxiliary information (i.e., we use four weeks and two hours as history, so in Equation 1 the history length is 674 and the forecasting horizon is 3). Instead of directly inputting this history sequence into the model, we first filter it. This filtering is based on the following observation: future taxi demand correlates more with the taxi demand at recent hours, at similar hours of the past week, and at similar hours on the same weekday in the past several weeks. In other words, we shrink the history sequence and only input the elements relevant to forecasting. Specifically, our filtered history sequence contains the data for the following taxi demand (and the corresponding auxiliary information):
- The recent past hours;
- Similar hours of the past week;
- Similar hours on the same weekday of the past several weeks.
3.1.3 Evaluation Metrics
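Similar to prior work [13, 6], we use root mean square error (RMSE) and mean absolute percentage error (MAPE) to evaluate the quality of the forecasting results. Suppose that for the $k$-th forecasting job ($k = 1, \cdots, K$), the ground truth is $\{x_{k,t}^{i}\}$ and the prediction is $\{\hat{x}_{k,t}^{i}\}$, with $i = 1, \cdots, N$ and $t = 1, \cdots, T'$, where $N$ is the number of locations and $T'$ is the length of the forecasted sequence. Then RMSE and MAPE are:

$$\mathrm{RMSE} = \sqrt{\frac{1}{K N T'} \sum_{k=1}^{K} \sum_{t=1}^{T'} \sum_{i=1}^{N} \left(\hat{x}_{k,t}^{i} - x_{k,t}^{i}\right)^2}, \qquad \mathrm{MAPE} = \frac{1}{K N T'} \sum_{k=1}^{K} \sum_{t=1}^{T'} \sum_{i=1}^{N} \left|\frac{\hat{x}_{k,t}^{i} - x_{k,t}^{i}}{x_{k,t}^{i}}\right|$$

Following practice in prior work, we set a threshold on the ground truth $x_{k,t}^{i}$ when computing MAPE: if $x_{k,t}^{i}$ is below the threshold, we disregard the term associated with it. This practice prevents small ground-truth values from dominating MAPE.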
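A minimal sketch of the two metrics follows, operating on flattened lists of ground-truth and predicted values. The function names and toy numbers are illustrative. One detail is an assumption: after discarding terms whose ground truth is below the threshold, MAPE is averaged over the remaining terms only, which the text does not spell out.

```python
import math


def rmse(truth, pred):
    """Root mean square error over two flat, equally long sequences."""
    return math.sqrt(sum((p - x) ** 2 for x, p in zip(truth, pred)) / len(truth))


def mape(truth, pred, threshold):
    """Mean absolute percentage error, ignoring terms whose ground truth is below `threshold`.

    Assumption: the average is taken over the kept terms only.
    The threshold is expected to be positive so that no division by zero occurs.
    """
    kept = [(x, p) for x, p in zip(truth, pred) if x >= threshold]
    if not kept:
        raise ValueError("no ground-truth value reaches the threshold")
    return sum(abs(p - x) / x for x, p in kept) / len(kept)


# Toy values: the third ground-truth value is small and would dominate MAPE.
truth = [10.0, 20.0, 1.0]
pred = [12.0, 18.0, 3.0]
error_rmse = rmse(truth, pred)
error_mape_thresholded = mape(truth, pred, threshold=5.0)
error_mape_all = mape(truth, pred, threshold=0.5)
```

In the toy data the same absolute error of 2 produces a 200% relative error on the small ground-truth value, which is why such terms are excluded.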
3.2 Models Details
We evaluate Forecaster and compare it against baseline models including the Transformer and DCRNN.
3.2.1 Our model: Forecaster
Forecaster uses weather (7 dimensions), weekday (one-hot encoding, 7 dimensions), hour (one-hot encoding, 24 dimensions), and a Boolean for holidays (1 dimension) as auxiliary information (39 dimensions in total). Concatenated with a spatial signal (996 dimensions), each element of the input sequence for Forecaster has 1035 dimensions. Forecaster uses one encoder layer and one decoder layer. Except for the sparse linear layer at the end, all the layers of Forecaster use four neurons for encoding the data at each location and 64 neurons for encoding the auxiliary information, and thus have 4048 neurons in total (i.e., 4 × 996 + 64 = 4048). The sparse linear layer at the end has 996 neurons. Forecaster's loss function combines RMSE and MAPE, using a constant to balance the impact of RMSE against that of MAPE.
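The input and layer sizes quoted above follow from simple arithmetic, which the sketch below reproduces while assembling one auxiliary vector. The ordering of the features inside the vector, the function names, and the sample weather values are assumptions made for illustration; only the sizes come from the text.

```python
def one_hot(index, size):
    """One-hot vector of length `size` with a 1.0 at position `index`."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    return [1.0 if k == index else 0.0 for k in range(size)]


def build_aux(weather, weekday, hour, is_holiday):
    """Concatenate 7 weather entries, weekday (7), hour (24), and a holiday flag (1)."""
    if len(weather) != 7:
        raise ValueError("expected seven weather entries")
    return (list(weather) + one_hot(weekday, 7) + one_hot(hour, 24)
            + [1.0 if is_holiday else 0.0])


NUM_LOCATIONS = 996

# Hypothetical weather record: temperature, wind speed, precipitation,
# visibility, and flags for rain, snow, and fog.
aux = build_aux([21.5, 3.0, 0.0, 10.0, 0.0, 0.0, 0.0],
                weekday=2, hour=8, is_holiday=False)
input_width = NUM_LOCATIONS + len(aux)      # spatial signal plus auxiliary information
hidden_width = 4 * NUM_LOCATIONS + 64       # four neurons per location plus 64 for auxiliary
```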
3.2.2 Baseline model: Transformer 
The Transformer uses the same input and loss function as Forecaster. It also adopts a similar architecture except that all the layers are fully-connected. For a comprehensive comparison, we evaluate two versions of the Transformer:
- Transformer-4048: All the layers in this implementation (except for the linear layer at the end) have the same width as Forecaster, i.e., 4048;
- Transformer-Best: We vary the width of all the layers (except for the linear layer at the end) from 64 to 4096, and implement the model with the best-performing width.
3.2.3 Baseline model: DCRNN 
DCRNN models the dependency relations between locations as a diffusion process guided by a pre-defined distance metric. Then, it leverages graph CNN to capture spatial dependency and RNN to capture the temporal dependency within the data.
Our evaluation of Forecaster starts by using Gaussian Markov random fields to determine the spatial dependency between the data at different locations. Based on the method described in Section 2.2, we obtain a conditional correlation matrix, where each entry represents the conditional correlation coefficient between the two corresponding locations. If the absolute value of an entry is less than a threshold, we treat the corresponding two locations as conditionally independent and round the value of the entry to zero. Figure 5 shows the structure of the conditional correlation matrix under a threshold of 0.1. We can see that the matrix is sparse, which means that a location generally depends on just a few other locations rather than on all the locations. We found that a location depends on only 2.5 other locations on average. There are some locations on which many other locations depend. For example, there is a location in Lower Manhattan on which 16 other locations depend. This may be because there are many locations with significant taxi demand in Lower Manhattan, with these locations sharing a strong dependency. Figure 6 shows the top 400 spatial dependencies. We see some long-range spatial dependency between remote locations. For example, there is a strong dependency between Grand Central Terminal and New York Penn Station, two important stations in Manhattan with heavy passenger traffic.
After determining the spatial dependency between locations, we use the graph Transformer architecture of Forecaster to forecast the taxi demand. Table 1 contrasts the performance of Forecaster with that of the baseline models. Here we run all the evaluated models six times (using different seeds) and report the mean and the standard deviation of the results. Forecaster outperforms all the baseline methods at every future step of forecasting. On average (over these future steps), Forecaster achieves an RMSE of 5.1879 and a MAPE of 20.1362%, which are 8.8210% and 9.6192% better than the Transformer (Transformer-Best), and 3.4809% and 19.4078% better than DCRNN. This demonstrates the advantage of Forecaster in capturing spatiotemporal dependencies.
| Model | Metric | Average | Next step | Second next step | Third next step |
| --- | --- | --- | --- | --- | --- |
| DCRNN | RMSE | 5.3750 ± 0.0691 | 5.1627 ± 0.0644 | 5.4018 ± 0.0673 | 5.5532 ± 0.0758 |
| DCRNN | MAPE (%) | 24.9853 ± 0.1275 | 24.4747 ± 0.1342 | 25.0366 ± 0.1625 | 25.4424 ± 0.1238 |
| Transformer-4048 | RMSE | 5.6802 ± 0.0206 | 5.4055 ± 0.0109 | 5.6632 ± 0.0173 | 5.9584 ± 0.0478 |
| Transformer-4048 | MAPE (%) | 22.5787 ± 0.2153 | 21.8932 ± 0.2006 | 22.3830 ± 0.1943 | 23.4583 ± 0.2541 |
| Transformer-Best | RMSE | 5.6898 ± 0.0219 | 5.4066 ± 0.0302 | 5.6546 ± 0.0581 | 5.9926 ± 0.0472 |
| Transformer-Best | MAPE (%) | 22.2793 ± 0.1810 | 21.4545 ± 0.0448 | 22.1954 ± 0.1792 | 23.1868 ± 0.3334 |
| Forecaster | RMSE | 5.1879 ± 0.0082 | 4.9629 ± 0.0102 | 5.2275 ± 0.0083 | 5.3651 ± 0.0065 |
| Forecaster | MAPE (%) | 20.1362 ± 0.0316 | 19.8889 ± 0.0269 | 20.0954 ± 0.0299 | 20.4232 ± 0.0604 |
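The relative improvements quoted in the text can be recomputed from the average column of the table. The short sketch below does so; the helper name `relative_improvement` and the dictionary layout are illustrative.

```python
def relative_improvement(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# Average (RMSE, MAPE) values copied from the table.
average = {
    "DCRNN": (5.3750, 24.9853),
    "Transformer-Best": (5.6898, 22.2793),
    "Forecaster": (5.1879, 20.1362),
}

ours_rmse, ours_mape = average["Forecaster"]
improvements = {
    name: (relative_improvement(rmse_value, ours_rmse),
           relative_improvement(mape_value, ours_mape))
    for name, (rmse_value, mape_value) in average.items()
    if name != "Forecaster"
}
```

Rounded to four decimals, the entries of `improvements` match the percentages reported for Transformer-Best and DCRNN.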
4 Related Work
To our knowledge, this work is the first (1) to integrate Gaussian Markov random fields with deep learning to forecast spatial and time-dependent data, using the former to derive a dependency graph; and (2) to sparsify the architecture of the Transformer based on the dependency graph, significantly improving the forecasting quality of the resulting architecture. The most closely related work is a set of proposals on forecasting spatial and time-dependent data and on the Transformer, which we briefly review in this section.
4.1 Spatial and Time-Dependent Data Forecasting
Conventional methods for forecasting spatial and time-dependent data, such as ARIMA and Kalman filtering-based methods [15, 14], usually impose strong stationarity assumptions on the data, which are often violated. Recently, deep learning-based methods have been proposed to tackle the non-stationary and highly nonlinear nature of the data [28, 30, 29, 6, 27, 13]. Most of these works consist of two parts: modules to capture spatial dependency and modules to capture temporal dependency. Regarding spatial dependency, the literature mostly uses prior knowledge such as physical closeness between regions to derive an adjacency matrix and/or pre-defined distance/similarity metrics to decide whether two locations are dependent or not. Then, based on this information, they usually use a (standard or graph) CNN to characterize the spatial dependency between dependent locations. However, these methods are not good predictors of dependency relations between the data at different locations. Regarding temporal dependency, available works [28, 29, 6, 27, 13] usually use RNNs and CNNs to extract the long-range temporal dependency. However, neither RNNs nor CNNs learn the long-range temporal dependency well, as the number of operations used to relate signals at two distant time positions in a sequence grows at least logarithmically with the distance between them.
We evaluate our architecture on the problem of forecasting taxi ride-hailing demand around a large number of spatial locations. The problem has two essential features: (1) these locations are not uniformly distributed like pixels in an image, making standard CNN-based methods [28, 27, 30] ill-suited to this problem; (2) it is desirable to perform multi-step forecasting, i.e., forecasting at several time instants in the future, which implies that work designed mainly for single-step forecasting [29, 6] is less applicable. DCRNN is the state-of-the-art baseline satisfying both features. Hence, we compare our architecture with DCRNN and show that our work outperforms DCRNN.
The Transformer avoids recurrence and instead relies purely on the self-attention mechanism to let the data at distant positions in a sequence relate to each other directly. This benefits learning long-range temporal dependency. The Transformer and its extensions have been shown to significantly outperform RNN-based methods in NLP and image generation tasks [23, 18, 2, 26, 4, 17]. However, it is still unknown how to apply the architecture of the Transformer to spatial and time-dependent data, especially to deal with the spatial dependency between locations. Later work extends the architecture of the Transformer to video generation. Even though this also needs to address spatial dependency between pixels, the nature of the problem is different from our task. In video generation, pixels exhibit spatial dependency only over a short time interval, lasting for at most tens of frames: two pixels may be dependent for only a few frames and become independent in later frames. In contrast, in spatial and time-dependent data, locations exhibit long-term spatial dependency lasting for months or even years. This fundamental difference in the applications that we consider enables us to use Gaussian Markov random fields to determine the graph of dependent locations as the basis for sparsifying the Transformer. Child et al. propose another sparse Transformer architecture with a different goal, namely accelerating the self-attention operations in the Transformer. This architecture is very different from ours.
Forecasting spatial and time-dependent data is challenging due to complex spatial dependency, long-range temporal dependency, non-stationarity, and heterogeneity within the data. This paper proposes Forecaster, a graph Transformer architecture to tackle these challenges. Forecaster uses Gaussian Markov random fields to determine the dependency graph between the data at different locations. Then, Forecaster sparsifies the architecture of the Transformer based on the structure of the graph and lets the sparsified Transformer (i.e., graph Transformer) capture the spatiotemporal dependencies, non-stationarity, and heterogeneity in one shot. We apply Forecaster to the problem of forecasting taxi ride-hailing demand at a large number of spatial locations. Evaluation results demonstrate that Forecaster significantly outperforms state-of-the-art baselines (Transformer and DCRNN).
References
- [1] (2019-Apr.) Generating Long Sequences with Sparse Transformers. arXiv preprint arXiv:1904.10509.
- [2] (2019-Jul.) Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. In Annual Meeting of the Association for Computational Linguistics (ACL), pp. 2978–2988.
- [3] (2006) 25 Years of Time Series Forecasting. International Journal of Forecasting 22 (3), pp. 443–473.
- [4] (2019-Jun.) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pp. 4171–4186.
- [5] (2008) Sparse Inverse Covariance Estimation with the Graphical Lasso. Biostatistics 9 (3), pp. 432–441.
- [6] Spatiotemporal Multi-Graph Convolution Network for Ride-Hailing Demand Forecasting. AAAI Conference on Artificial Intelligence, pp. 3656–3663.
- [7] (2016) The Forces of Economic Growth: A Time Series Perspective. Princeton University Press.
- [8] (2001) Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies.
- [9] (2017) Visual Exploration of Global Trade Networks with Time-Dependent and Weighted Hierarchical Edge Bundles on GPU. Computer Graphics Forum 36 (3), pp. 273–282.
- [10] (2016-Jun.) Structural-RNN: Deep Learning on Spatio-Temporal Graphs. pp. 5308–5317.
- [11] (2019) Cope: Interactive Exploration of Co-Occurrence Patterns in Spatial Time Series. IEEE Transactions on Visualization and Computer Graphics 25 (8), pp. 2554–2567.
- [12] (2014-Nov.) Vismate: Interactive Visual Analysis of Station-Based Observation Data on Climate Changes. In IEEE Conference on Visual Analytics Science and Technology (VAST), pp. 133–142.
- [13] (2018-Apr.) Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. In International Conference on Learning Representations (ICLR), pp. 1–16.
- [14] Short-Term Traffic Flow Forecasting: An Experimental Comparison of Time-Series Analysis and Supervised Learning. IEEE Transactions on Intelligent Transportation Systems 14 (2), pp. 871–882.
- [15] (2011-Aug.) Discovering Spatio-Temporal Causal Interactions in Traffic Data Streams. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 1010–1018.
- [16] (2009) Expectation-Based Scan Statistics for Monitoring Spatial Time Series Data. International Journal of Forecasting 25 (3), pp. 498–517.
- [17] International Conference on Machine Learning (ICML), pp. 4055–4064.
- [18] (2018-Jun.) Improving Language Understanding by Generative Pre-training. OpenAI.
- [19] (2011) High-Dimensional Covariance Estimation by Minimizing L1-Penalized Log-Determinant Divergence. Electronic Journal of Statistics 5, pp. 935–980.
- [20] (2005) Gaussian Markov Random Fields: Theory And Applications (Monographs on Statistics and Applied Probability). Chapman & Hall/CRC.
- [21] (2014-Dec.) Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems (NIPS), pp. 3104–3112.
- [22] (2018) TLC Trip Record Data.
- [23] (2017-Dec.) Attention is All You Need. In Advances in Neural Information Processing Systems (NIPS), pp. 5998–6008.
- [24] (2018-Jun.) Non-Local Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7794–7803.
- [25] (2018) Historical Weather.
- [26] (2019-Jun.) XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237.
- [27] (2019-Jan.) Revisiting Spatial-Temporal Similarity: A Deep Learning Framework for Traffic Prediction. In AAAI Conference on Artificial Intelligence, pp. 5668–5675.
- [28] (2018-Feb.) Deep Multi-View Spatial-Temporal Network for Taxi Demand Prediction. In AAAI Conference on Artificial Intelligence, pp. 2588–2595.
- [29] (2018-Jul.) Spatio-Temporal Graph Convolutional Networks: A Deep Learning Framework for Traffic Forecasting. In International Joint Conference on Artificial Intelligence (IJCAI), pp. 3634–3640.
- [30] (2017-Feb.) Deep Spatio-Temporal Residual Networks for Citywide Crowd Flows Prediction. In AAAI Conference on Artificial Intelligence, pp. 1655–1661.
- [31] (2003-Apr.) Correlation Analysis of Spatial Time Series Datasets: A Filter-and-Refine Approach. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pp. 532–544.
- [32] (2017-Nov.) Spatio-Temporal Neural Networks for Space-Time Series Forecasting and Relations Discovery. In IEEE International Conference on Data Mining (ICDM), pp. 705–714.