1 Introduction
Spatial and time-dependent data describe the evolution of signals (i.e., the values of attributes) at multiple spatial locations across time [31, 11]. Such data occur in many domains, including economics [7], global trade [9], environmental studies [12], public health [16], and traffic networks [13], to name a few. For example, the gross domestic product (GDP) of different countries over the past century, the daily temperature measurements of different cities over the last decade, and the hourly taxi ride-hailing demand at various urban locations over the past year are all spatial and time-dependent data. Forecasting such data makes it possible to allocate resources proactively and to take actions that improve the efficiency of society and the quality of life.
However, forecasting spatial and time-dependent data is challenging: they exhibit complex spatial dependency, long-range temporal dependency, heterogeneity, and nonstationarity. Take the spatial and time-dependent data in a traffic network as an example. The data at a location (e.g., taxi ride-hailing demand) may correlate more with the data at a geometrically remote location than with the data at a nearby location [13]. Also, the data at a time instant may depend not only on the data at a recent time instant, say an hour ago, but may also be highly correlated with the data a day or even a week ago, showing strong long-range temporal dependency. Additionally, spatial and time-dependent data may be influenced by many other relevant factors (e.g., weather influences taxi demand), and this information should be taken into account. In other words, in this paper we propose to perform forecasting with heterogeneous sources of data at different spatial and time scales, including auxiliary information of a different nature or modality. Further, the data may be nonstationary due to unexpected incidents or traffic accidents [13]. This nonstationarity makes conventional time series forecasting methods such as AutoRegressive Integrated Moving Average (ARIMA) and Kalman filtering, which usually rely on stationarity [3], inappropriate for accurate forecasting with spatial and time-dependent data [13, 32].
Recently, deep learning models have been proposed for forecasting spatial and time-dependent data [13, 28, 10, 29, 6, 30, 27, 32]. To deal with spatial dependency, most of these models use either predefined distance/similarity metrics or other prior knowledge, such as adjacency matrices of traffic networks, to determine dependencies among locations. Then, they often use a (standard or graph) convolutional neural network (CNN) to characterize the spatial dependency between these locations. These ad-hoc methods may lead to errors. For example, locations that are considered dependent may actually be independent in practice. Regarding temporal dependency, most of these models use recurrent neural networks (RNN), CNN, or their variants to capture the long-range temporal dependency and nonstationarity of the data. But it is well documented that these networks may fail to capture temporal dependencies between distant time epochs [8, 23].
To tackle these challenges, we propose Forecaster, a new deep learning architecture for forecasting spatial and time-dependent data. Our architecture consists of two parts. First, we use the theory of Gaussian Markov random fields [20] to learn the structure of the graph that parsimoniously represents the spatial dependency between the locations (we call such a graph a dependency graph). Gaussian Markov random fields model spatial and time-dependent data as a multivariate Gaussian distribution over the spatial locations. We then estimate the precision matrix of the distribution [5]. (Footnote 1: The approach to estimate the precision matrix of a Gaussian Markov random field, i.e., graphical lasso, can also be used with non-Gaussian distributions [19].) The precision matrix provides the graph structure, with each node representing a location and each edge representing the dependency between two locations. This contrasts with prior work on forecasting: we learn the spatial dependencies from the data. Second, we integrate the dependency graph into the architecture of the Transformer [23] for forecasting spatial and time-dependent data. The Transformer and its extensions [23, 2, 26, 4, 18] have been shown to significantly outperform RNN and CNN in NLP tasks, as they capture relations among data at distant positions, significantly improving the learning of long-range temporal dependency [23]. In Forecaster, to better capture the spatial dependencies, we associate each neuron in different layers with a spatial location. Then, we sparsify the Transformer based on the dependency graph: if two locations are not connected, we prune the connection between their associated neurons. In this way, the state encoding for each location is only impacted by its own state encoding and the encodings for other dependent locations. Moreover, pruning the unnecessary connections in the Transformer helps avoid overfitting.
To evaluate the effectiveness of our proposed architecture, we apply it to the task of forecasting taxi ride-hailing demand in New York City [22]. We pick 996 hot locations in New York City and forecast the hourly taxi ride-hailing demand around each location from January 1st, 2009 to June 30th, 2016. Our architecture accounts for crucial auxiliary information such as weather, day of the week, hour of the day, and holidays, which significantly improves forecasting quality. Evaluation results show that our architecture reduces the root mean square error (RMSE) and mean absolute percentage error (MAPE) of the Transformer by 8.8210% and 9.6192%, respectively, and that it significantly outperforms other state-of-the-art baselines.
In this paper, we present critical innovations:

Forecaster combines the theory of Gaussian Markov random fields with deep learning. It uses the former to find the dependency graph among locations, and this graph becomes the basis for the deep learner to forecast spatial and time-dependent data.

Forecaster sparsifies the architecture of the Transformer based on the dependency graph, allowing the Transformer to capture better the spatiotemporal dependencies within the data.

We apply Forecaster to forecasting taxi ride-hailing demand and demonstrate the advantage of its proposed architecture over state-of-the-art baselines.
2 Methodology
In this section, we introduce the proposed architecture of Forecaster. We start by formalizing the problem of forecasting spatial and time-dependent data (Section 2.1). Then, we use Gaussian Markov random fields to determine the dependency graph among data at different locations (Section 2.2). Based on this dependency graph, we design a sparse linear layer, which is a fundamental building block of Forecaster (Section 2.3). Finally, we present the entire architecture of Forecaster (Section 2.4).
2.1 Problem Statement
We define spatial and time-dependent data as a series of spatial signals, each collecting the data at all spatial locations at a certain time. For example, the hourly taxi demand at a thousand locations in 2019 is a spatial and time-dependent dataset, while the hourly taxi demand at these locations between 8 a.m. and 9 a.m. on January 1st, 2019 is a spatial signal. The goal of our forecasting task is to predict the future spatial signals given the historical spatial signals and historical/future auxiliary information (e.g., weather history and forecast). We formalize forecasting as learning a function $f(\cdot)$ that maps historical spatial signals and historical/future auxiliary information to future spatial signals, as in Equation (1):
$$\left[\mathbf{x}_{t-T+1}, \mathbf{a}_{t-T+1};\ \cdots;\ \mathbf{x}_{t}, \mathbf{a}_{t};\ \mathbf{a}_{t+1};\ \cdots;\ \mathbf{a}_{t+T'}\right] \xrightarrow{\ f(\cdot)\ } \left[\mathbf{x}_{t+1}, \cdots, \mathbf{x}_{t+T'}\right] \tag{1}$$
where $\mathbf{x}_t$ is the spatial signal at time $t$, $\mathbf{x}_t \in \mathbb{R}^N$, with $x_t^i$ the data at location $i$ at time $t$; $N$ the number of locations; $\mathbf{a}_t$ the auxiliary information at time $t$, $\mathbf{a}_t \in \mathbb{R}^P$, $P$ the dimension of the auxiliary information; and $\mathbb{R}$ is the set of the reals. (Footnote 2: For simplicity, we assume in this work that different locations share the same auxiliary information, i.e., $\mathbf{a}_t$ can impact the data at any location $i$. However, it is easy to generalize our approach to the case where locations do not share the same auxiliary information.)
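As a rough illustration of the mapping in Equation (1), the following Python sketch slices one forecasting example out of two arrays. The array layout, the helper name `make_example`, and the use of NumPy are assumptions made for illustration only; the paper does not prescribe a data layout.

```python
import numpy as np

def make_example(x, a, t, T, T_prime):
    """Slice one forecasting example ending at (0-based) time index t.

    x: array of shape (num_steps, N) holding the spatial signals.
    a: array of shape (num_steps, P) holding the auxiliary information.
    Returns (history, future_aux, target).
    """
    if t - T + 1 < 0 or t + T_prime >= len(x):
        raise ValueError("window falls outside the available data")
    # Historical spatial signals concatenated with their auxiliary information.
    history = np.concatenate([x[t - T + 1 : t + 1], a[t - T + 1 : t + 1]], axis=1)
    # Auxiliary information for the future steps (e.g., a weather forecast).
    future_aux = a[t + 1 : t + T_prime + 1]
    # Future spatial signals that the model should predict.
    target = x[t + 1 : t + T_prime + 1]
    return history, future_aux, target

# Toy data: 10 time steps, N = 3 locations, P = 2 auxiliary features.
x = np.arange(30, dtype=float).reshape(10, 3)
a = np.ones((10, 2))
history, future_aux, target = make_example(x, a, t=5, T=4, T_prime=2)
print(history.shape, future_aux.shape, target.shape)
```

With these toy sizes, each history element has N + P = 5 entries, and the target covers the two steps that follow the history window.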
2.2 Gaussian Markov Random Field
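We use Gaussian Markov random fields to find the dependency graph of the data over the different spatial locations. Gaussian Markov random fields model the spatial and time-dependent data as a multivariate Gaussian distribution over $N$ locations, i.e., the probability density function of the vector given by $\mathbf{x}_t$ is
$$f(\mathbf{x}_t) = \frac{|Q|^{1/2}}{(2\pi)^{N/2}} \exp\left(-\frac{1}{2}\left(\mathbf{x}_t-\boldsymbol{\mu}\right)^{\top} Q \left(\mathbf{x}_t-\boldsymbol{\mu}\right)\right) \tag{2}$$
where $\boldsymbol{\mu}$ and $Q$ are the expected value (mean) and precision matrix (inverse of the covariance matrix) of the distribution.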
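The precision matrix characterizes the conditional dependency between different locations: whether the data $x^i$ and $x^j$ at the $i$-th and $j$-th locations depend on each other or not, given the data at all the other locations ($\mathbf{x}^{-ij}$). We can measure the conditional dependency between locations $i$ and $j$ through their conditional correlation coefficient $\mathrm{Corr}_{x^i x^j \mid \mathbf{x}^{-ij}}$:
$$\mathrm{Corr}_{x^i x^j \mid \mathbf{x}^{-ij}} = -\frac{Q_{ij}}{\sqrt{Q_{ii}\, Q_{jj}}} \tag{3}$$
where $Q_{ij}$ is the $(i, j)$ entry of $Q$. In practice, we set a threshold on $\mathrm{Corr}_{x^i x^j \mid \mathbf{x}^{-ij}}$, and treat locations $i$ and $j$ as conditionally dependent if the absolute value of $\mathrm{Corr}_{x^i x^j \mid \mathbf{x}^{-ij}}$ is above the threshold.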
The nonzero entries that remain after thresholding define the structure of the dependency graph between locations. Figure 1 shows an example of a dependency graph. Locations 1 and 2 and locations 2 and 3 are conditionally dependent, while locations 1 and 3 are conditionally independent. This simple example illustrates the advantage of Gaussian Markov random fields over ad-hoc pairwise similarity metrics: the former lead to parsimonious (sparse) graph representations.
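The three-location example can be reproduced numerically. The sketch below is a minimal illustration that assumes NumPy and a hypothetical precision matrix for a chain 1 - 2 - 3; the function names and the threshold value are illustrative and do not come from the paper.

```python
import numpy as np

def conditional_correlation(Q):
    """Conditional correlation between every pair of locations, from precision matrix Q."""
    d = np.sqrt(np.diag(Q))
    corr = -Q / np.outer(d, d)
    np.fill_diagonal(corr, 1.0)
    return corr

def dependency_graph(Q, threshold):
    """Boolean adjacency matrix: True where |conditional correlation| exceeds the threshold."""
    adj = np.abs(conditional_correlation(Q)) > threshold
    np.fill_diagonal(adj, False)
    return adj

# Hypothetical precision matrix for three locations arranged as a chain 1 - 2 - 3.
Q = np.array([[2.0, -1.0, 0.0],
              [-1.0, 2.0, -1.0],
              [0.0, -1.0, 2.0]])
adj = dependency_graph(Q, threshold=0.1)
cov = np.linalg.inv(Q)  # covariance matrix implied by Q
print(adj.astype(int))
print(cov[0, 2])
```

In this toy case, locations 1 and 3 have a nonzero covariance, so a pairwise similarity metric would link them, yet their conditional correlation is zero and the dependency graph has no edge between them.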
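We estimate the precision matrix by graphical lasso [5], an L1-penalized maximum likelihood estimator:
$$\min_{Q \succ 0}\ \operatorname{tr}(SQ) - \log\det(Q) + \lambda \left\|Q\right\|_1 \tag{4}$$
where $\lambda$ is the weight of the L1 penalty and $S$ is the empirical covariance matrix computed from the data:
$$S = \frac{1}{M-1}\sum_{t=1}^{M}\left(\mathbf{x}_t-\boldsymbol{\mu}\right)\left(\mathbf{x}_t-\boldsymbol{\mu}\right)^{\top} \tag{5}$$
where $M$ is the number of time samples used to compute $S$.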
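To make the estimator concrete, the sketch below computes an empirical covariance matrix and evaluates the penalized objective of graphical lasso for a candidate precision matrix. It is not a graphical lasso solver. The 1/(M-1) normalization, the choice to penalize all entries of the matrix, the function names, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def empirical_covariance(X):
    """Empirical covariance of X with shape (M, N), using the 1/(M-1) normalization."""
    centered = X - X.mean(axis=0)
    return centered.T @ centered / (X.shape[0] - 1)

def graphical_lasso_objective(Q, S, lam):
    """tr(SQ) - log det(Q) + lam * sum_ij |Q_ij|; Q is assumed positive definite."""
    sign, logdet = np.linalg.slogdet(Q)
    if sign <= 0:
        raise ValueError("Q must have a positive determinant")
    return np.trace(S @ Q) - logdet + lam * np.abs(Q).sum()

# Synthetic samples from a Gaussian whose precision matrix is a three-location chain.
rng = np.random.default_rng(0)
Q_true = np.array([[2.0, -1.0, 0.0],
                   [-1.0, 2.0, -1.0],
                   [0.0, -1.0, 2.0]])
X = rng.multivariate_normal(np.zeros(3), np.linalg.inv(Q_true), size=5000)
S = empirical_covariance(X)

# Without the penalty (lam = 0), the objective is minimized by the inverse of S.
obj_inverse = graphical_lasso_objective(np.linalg.inv(S), S, lam=0.0)
obj_identity = graphical_lasso_objective(np.eye(3), S, lam=0.0)
print(obj_inverse < obj_identity)
```

A positive penalty weight trades some likelihood for sparsity, which is what yields a sparse dependency graph when the number of locations is large.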
2.3 Building Block: Sparse Linear Layer
We use the dependency graph to sparsify the architecture of the Transformer. This leads to the Transformer better capturing the spatial dependencies within the data. There are multiple linear layers in the Transformer. Our sparsification on the Transformer replaces all these linear layers by the sparse linear layers described in this section.
We use the dependency graph to build a sparse linear layer. Figure 2 shows an example (based on the dependency graph in Figure 1). Suppose that initially the $(l+1)$-th layer (of nine neurons) is fully connected to the $l$-th layer (of five neurons). We assign neurons to the data at different locations and to the auxiliary information as follows. In this example, we assign one neuron to each location and two neurons to the auxiliary information at the $l$-th layer, and two neurons to each location and three neurons to the auxiliary information at the $(l+1)$-th layer. (Footnote 3: How to assign neurons is a design choice for users.) After assigning neurons, we prune connections based on the structure of the dependency graph. As locations 1 and 3 are conditionally independent, we prune the connections between their neurons. We also prune the connections between the neurons associated with locations and those associated with the auxiliary information. (Footnote 4: This is another design choice for users.) This way, the encoding for the data at a location is only impacted by the encodings of itself and of its dependent locations, better capturing the spatial dependency between locations. Moreover, pruning the unnecessary connections between conditionally independent locations helps avoid overfitting.
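One way to realize such a layer is to multiply a dense weight matrix by a fixed 0/1 mask. The sketch below assumes NumPy and the five-to-nine neuron assignment of the example above; `build_mask` and `sparse_linear` are hypothetical helper names, and masking a dense matrix is an implementation choice made here for illustration, not something the paper specifies.

```python
import numpy as np

def build_mask(adj, in_per_loc, in_aux, out_per_loc, out_aux):
    """Return a 0/1 mask of shape (out_width, in_width) for a sparse linear layer."""
    n = adj.shape[0]
    # A location always keeps the connections to its own neurons.
    allowed = np.logical_or(adj, np.eye(n, dtype=bool)).astype(float)
    # Expand each location-to-location entry into a block of neuron connections.
    loc_block = np.kron(allowed, np.ones((out_per_loc, in_per_loc)))
    mask = np.zeros((n * out_per_loc + out_aux, n * in_per_loc + in_aux))
    mask[: n * out_per_loc, : n * in_per_loc] = loc_block
    # Auxiliary neurons connect only to auxiliary neurons.
    mask[n * out_per_loc :, n * in_per_loc :] = 1.0
    return mask

def sparse_linear(x, weight, bias, mask):
    """Linear layer whose pruned weights are zeroed by the mask."""
    return (weight * mask) @ x + bias

# Dependency graph of the three-location example: 1 - 2 and 2 - 3 are connected.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=bool)
mask = build_mask(adj, in_per_loc=1, in_aux=2, out_per_loc=2, out_aux=3)

rng = np.random.default_rng(0)
weight = rng.normal(size=mask.shape)
bias = np.zeros(mask.shape[0])
x = rng.normal(size=mask.shape[1])
y = sparse_linear(x, weight, bias, mask)

# Changing only the input of location 3 must not affect the encoding of location 1.
x_perturbed = x.copy()
x_perturbed[2] += 1.0
y_perturbed = sparse_linear(x_perturbed, weight, bias, mask)
print(int(mask.sum()), np.allclose(y[:2], y_perturbed[:2]))
```

Of the 45 possible connections, 20 survive pruning in this example: 14 between location neurons and 6 between auxiliary neurons.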
2.4 Entire Architecture: Graph Transformer
Forecaster adopts an architecture similar to that of the Transformer, except that it substitutes all the linear layers in the Transformer with our sparse linear layer designed based on the dependency graph. Figure 3 shows its architecture. Forecaster employs an encoder-decoder architecture [21]. The encoder is used to encode the historical spatial signals and historical auxiliary information; the decoder is used to predict the future spatial signals based on the output of the encoder and the future auxiliary information. Due to the page limit, we omit what Forecaster shares with the Transformer (e.g., (masked) multi-head attention, positional encoding) and focus only on their differences.
2.4.1 Encoder
At each time step in the history, we concatenate the spatial signal with its auxiliary information. This way, we obtain a sequence where each element is a vector consisting of the spatial signal and the auxiliary information at a specific time step. The encoder takes this sequence as input. Then, a sparse embedding layer (consisting of a sparse linear layer with ReLU activation) maps each element of this sequence to the state space of the model and outputs a new sequence. In Forecaster, except for the sparse linear layer at the end of the decoder, all the layers have the same output dimension. We call the space with this dimension the state space of the model. After that, we add positional encoding to the new sequence and let it pass through stacked encoder layers to generate the encoding of the input sequence. Each encoder layer consists of a sparse multi-head attention layer and a sparse feedforward layer. These layers are the same multi-head attention layer and feedforward layer as in the Transformer, except that sparse linear layers replace the linear layers within them.
2.4.2 Decoder
For each time step in the future, we concatenate its auxiliary information with the (predicted) spatial signal one step before. Then, we input this sequence to the decoder. The decoder first uses a sparse embedding layer to map each element of the sequence to the state space of the model, and then passes it through stacked decoder layers to obtain the new encoding of each element. Finally, the decoder uses a sparse linear layer to project this encoding back and predict the next spatial signal. Like the Transformer, the decoder layer contains two sparse multi-head attention layers. The first one compares the elements in the input sequence of the decoder, obtaining a new encoding for each element of the sequence; the second one enables the comparison between the input sequences of the decoder and the encoder so that we can learn from the past history.
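The step-by-step construction of the decoder input can be sketched as a simple loop. The code below assumes NumPy and inference-time decoding in which each prediction is fed back as the next input; `rollout` is a hypothetical helper, and `persistence` is only a stand-in for the actual decoder, which is not implemented here.

```python
import numpy as np

def rollout(predict_next, last_signal, future_aux):
    """Predict len(future_aux) spatial signals one step at a time."""
    decoder_inputs, predictions = [], []
    previous = last_signal
    for aux in future_aux:
        # Decoder input for this step: the previous (predicted) signal
        # concatenated with this step's auxiliary information.
        decoder_inputs.append(np.concatenate([previous, aux]))
        previous = predict_next(np.stack(decoder_inputs))
        predictions.append(previous)
    return np.stack(predictions)

def persistence(decoder_inputs, num_locations=3):
    # Placeholder for the decoder: repeat the most recent spatial signal.
    return decoder_inputs[-1, :num_locations]

last_signal = np.array([1.0, 2.0, 3.0])   # last observed spatial signal
future_aux = np.zeros((3, 2))             # auxiliary information for three future steps
predictions = rollout(persistence, last_signal, future_aux)
print(predictions.shape)
```

Replacing `persistence` with a trained decoder that also attends to the encoder output would give the behaviour described above; the loop structure stays the same.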
3 Evaluation
In this section, we apply Forecaster to the problem of forecasting taxi ride-hailing demand in Manhattan, New York City. We demonstrate that Forecaster outperforms the state-of-the-art baselines (Transformer [23] and DCRNN [13]).
3.1 Evaluation Settings
3.1.1 Dataset
Our evaluation uses the NYC Taxi dataset [22] from 01/01/2009 to 06/30/2016 (7.5 years in total). This dataset records detailed information for each taxi trip in New York City, including its pickup and dropoff locations. Based on this dataset, we select 996 locations with high taxi ride-hailing demand in Manhattan, New York City, shown in Figure 4. Specifically, we compute the taxi ride-hailing demand at each location by accumulating the taxi rides closest to that location. Note that these selected locations are not uniformly distributed, as different regions of Manhattan have distinct taxi demand. We split the dataset into three parts: training set, validation set, and test set. The training set uses the data in the time intervals 01/01/2009 – 12/31/2011 and 07/01/2012 – 06/30/2015; the validation set uses the data in 01/01/2012 – 06/30/2012; and the test set uses the data in 07/01/2015 – 06/30/2016.
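The accumulation of rides into per-location hourly demand can be sketched as a nearest-location assignment followed by counting. The code below assumes NumPy, planar coordinates with Euclidean distance, and pickup times already converted to integer hour indices; these choices and the name `hourly_demand` are illustrative assumptions, since the paper does not detail this step.

```python
import numpy as np

def hourly_demand(pickup_xy, pickup_hour, location_xy, num_hours):
    """Count pickups per (hour, location), assigning each pickup to its nearest location."""
    # Squared Euclidean distance from every pickup to every location.
    diff = pickup_xy[:, None, :] - location_xy[None, :, :]
    nearest = np.argmin((diff ** 2).sum(axis=2), axis=1)
    demand = np.zeros((num_hours, len(location_xy)), dtype=int)
    # np.add.at accumulates correctly when (hour, location) pairs repeat.
    np.add.at(demand, (pickup_hour, nearest), 1)
    return demand

location_xy = np.array([[0.0, 0.0], [10.0, 0.0]])
pickup_xy = np.array([[1.0, 0.0], [2.0, 1.0], [9.0, 1.0], [0.5, 0.5]])
pickup_hour = np.array([0, 0, 0, 1])
demand = hourly_demand(pickup_xy, pickup_hour, location_xy, num_hours=2)
print(demand)
```

The pairwise distance array grows with the product of trips and locations, so for a dataset of this size a spatial index such as a k-d tree, applied in batches, would likely be a better fit.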
Our evaluation uses hourly weather data from Weather Underground [25] to construct (part of) the auxiliary information. Each record in this weather data contains seven entries — temperature, wind speed, precipitation, visibility, and the Booleans for rain, snow, and fog.
3.1.2 Details of the Forecasting Task
In our evaluation, we forecast taxi demand for the next three hours based on the previous 674 hours and the corresponding auxiliary information (i.e., four weeks and two hours of history and a three-hour horizon in Equation 1). Instead of directly inputting this history sequence into the model, we first filter it. This filtering is based on the following observation: future taxi demand correlates most with the taxi demand in the most recent hours, at similar hours of the past week, and at similar hours on the same weekday in the past several weeks. In other words, we shrink the history sequence and input only the elements relevant to forecasting. Specifically, our filtered history sequence contains the data for the following taxi demand (and the corresponding auxiliary information):

The recent past hours: ;

Similar hours of the past week: ;

Similar hours on the same weekday of the past several weeks: .
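The three groups above can be turned into a set of hour offsets relative to the current hour. The exact index sets of the paper are not recoverable from this text, so every default value in the following Python sketch (six recent hours, six past days, four past weeks, a two-hour margin around the forecast hours) is a hypothetical choice for illustration, as is the function name.

```python
def filtered_history_offsets(horizon=3, recent=6, days=6, weeks=4, margin=2):
    """Return sorted hour offsets to keep, relative to the current hour t (offset 0).

    All default values are illustrative assumptions, not the paper's settings.
    """
    # The most recent hours, ending at the current hour t.
    offsets = set(range(-recent + 1, 1))
    # Hours around the forecast targets t+1, ..., t+horizon.
    window = range(1 - margin, horizon + margin + 1)
    # Similar hours on each day of the past week.
    for d in range(1, days + 1):
        offsets.update(h - 24 * d for h in window)
    # Similar hours on the same weekday in the past several weeks.
    for w in range(1, weeks + 1):
        offsets.update(h - 168 * w for h in window)
    # Keep only hours at or before t, since later hours are not yet observed.
    return sorted(o for o in offsets if o <= 0)

offsets = filtered_history_offsets()
print(len(offsets), offsets[0], offsets[-1])
```

With these illustrative defaults the earliest retained hour lies 673 hours before the current one, so the selection fits inside a 674-hour history while keeping only a small fraction of its elements.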
3.1.3 Evaluation Metrics
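Similar to prior work [13, 6], we use root mean square error (RMSE) and mean absolute percentage error (MAPE) to evaluate the quality of the forecasting results. Suppose that for the $k$-th forecasting job ($k = 1, \dots, K$), the ground truth is $\{x^{(k)}_{i,\tau}\}$ and the prediction is $\{\hat{x}^{(k)}_{i,\tau}\}$, with $i = 1, \dots, N$ and $\tau = 1, \dots, T'$, where $N$ is the number of locations and $T'$ is the length of the forecasted sequence. Then RMSE and MAPE are:
$$\mathrm{RMSE} = \sqrt{\frac{1}{K N T'}\sum_{k=1}^{K}\sum_{i=1}^{N}\sum_{\tau=1}^{T'}\left(\hat{x}^{(k)}_{i,\tau} - x^{(k)}_{i,\tau}\right)^{2}}, \qquad \mathrm{MAPE} = \frac{1}{K N T'}\sum_{k=1}^{K}\sum_{i=1}^{N}\sum_{\tau=1}^{T'}\left|\frac{\hat{x}^{(k)}_{i,\tau} - x^{(k)}_{i,\tau}}{x^{(k)}_{i,\tau}}\right| \tag{6}$$
Following the practice in prior work [6], we set a threshold on the ground-truth value when computing MAPE: if the ground-truth value is below the threshold, we disregard the term associated with it. This practice prevents small ground-truth values from dominating MAPE.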
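A minimal NumPy sketch of the two metrics follows. It assumes that terms whose ground truth falls below the threshold are dropped from both the sum and the count in MAPE; the threshold value in the usage lines is hypothetical and is not the value used in the evaluation.

```python
import numpy as np

def rmse(truth, pred):
    """Root mean square error over all entries."""
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

def mape(truth, pred, threshold):
    """MAPE in percent over entries whose ground truth is at least `threshold`."""
    keep = truth >= threshold
    return float(100.0 * np.mean(np.abs(pred[keep] - truth[keep]) / truth[keep]))

truth = np.array([[10.0, 20.0], [1.0, 40.0]])
pred = np.array([[12.0, 18.0], [3.0, 44.0]])
print(rmse(truth, pred))
print(mape(truth, pred, threshold=5.0))  # hypothetical threshold
```

Without the threshold, the entry with ground truth 1 and prediction 3 would contribute a 200% error on its own, which illustrates why small ground-truth values are excluded.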
3.2 Model Details
We evaluate Forecaster and compare it against baseline models including the Transformer and DCRNN.
3.2.1 Our model: Forecaster
Forecaster uses weather (7 bits), weekday (one-hot encoding, 7 bits), hour (one-hot encoding, 24 bits), and a Boolean for holidays (1 bit) as auxiliary information (39 bits). Concatenated with a spatial signal (996 bits), each element of the input sequence for Forecaster has 1035 bits. Forecaster uses one encoder layer and one decoder layer. Except for the sparse linear layer at the end, all the layers of Forecaster use four neurons for encoding the data at each location and 64 neurons for encoding the auxiliary information, and thus have 4048 neurons in total. The sparse linear layer at the end has 996 neurons. Forecaster uses the following loss function:
(7) 
where is a constant used to balance the impact of RMSE with MAPE.
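The input layout and layer widths given at the start of this subsection can be checked with a few lines of Python. The sketch assumes NumPy; the ordering of the fields inside the auxiliary vector, the weekday indexing, and the name `auxiliary_vector` are assumptions for illustration.

```python
import numpy as np

def auxiliary_vector(weather, weekday, hour, is_holiday):
    """Concatenate 7 weather entries, one-hot weekday, one-hot hour, and a holiday flag."""
    weekday_onehot = np.eye(7)[weekday]   # weekday in 0..6 (indexing is an assumption)
    hour_onehot = np.eye(24)[hour]        # hour in 0..23
    return np.concatenate([weather, weekday_onehot, hour_onehot, [float(is_holiday)]])

NUM_LOCATIONS = 996
aux = auxiliary_vector(np.zeros(7), weekday=2, hour=13, is_holiday=False)
input_width = NUM_LOCATIONS + aux.size   # spatial signal plus auxiliary information
hidden_width = 4 * NUM_LOCATIONS + 64    # four neurons per location plus 64 auxiliary neurons
print(aux.size, input_width, hidden_width)
```

The three printed numbers correspond to the 39, 1035, and 4048 quoted in the text.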
3.2.2 Baseline model: Transformer [23]
The Transformer uses the same input and loss function as Forecaster. It also adopts a similar architecture except that all the layers are fully connected. For a comprehensive comparison, we evaluate two versions of the Transformer:

Transformer4048: All the layers in this implementation (except for the linear layer at the end) have the same width as Forecaster, i.e., 4048 neurons;

TransformerBest: We vary the width of all the layers (except for the linear layer at the end) from 64 to 4096 and implement the best-performing width.
3.2.3 Baseline model: DCRNN [13]
DCRNN models the dependency relations between locations as a diffusion process guided by a predefined distance metric. Then, it leverages graph CNN to capture spatial dependency and RNN to capture the temporal dependency within the data.
3.3 Results
Our evaluation of Forecaster starts by using Gaussian Markov random fields to determine the spatial dependency between the data at different locations. Based on the method described in Section 2.2, we obtain a conditional correlation matrix where each entry represents the conditional correlation coefficient between the two corresponding locations. If the absolute value of an entry is less than a threshold, we treat the corresponding two locations as conditionally independent and round the value of the entry to zero. Figure 5 shows the structure of the conditional correlation matrix under a threshold of 0.1. We can see that the matrix is sparse, which means a location generally depends on just a few other locations rather than on all the locations. We found that a location depends on only 2.5 other locations on average. There are some locations on which many other locations depend. For example, there is a location in Lower Manhattan on which 16 other locations depend. This may be because there are many locations with significant taxi demand in Lower Manhattan, with these locations sharing a strong dependency. Figure 6 shows the top 400 spatial dependencies. We see some long-range spatial dependency between remote locations. For example, there is a strong dependency between Grand Central Terminal and New York Penn Station, which are two important stations in Manhattan with heavy passenger traffic.
After determining the spatial dependency between locations, we use the graph Transformer architecture of Forecaster to forecast the taxi demand. Table 1 contrasts the performance of Forecaster with that of the baseline models. Here we run all the evaluated models six times (using different seeds) and report the mean and the standard deviation of the results. Forecaster outperforms all the baseline methods at every future step of forecasting. On average (over these future steps), Forecaster achieves an RMSE of 5.1879 and a MAPE of 20.1362, which is 8.8210% and 9.6192% better than the Transformer (TransformerBest), and 3.4809% and 19.4078% better than DCRNN. This demonstrates the advantage of Forecaster in capturing spatiotemporal dependencies.
Table 1: Forecasting performance of Forecaster and the baseline models (mean ± standard deviation over six runs).
| Model | Metric | Average | Next step | Second next step | Third next step |
| --- | --- | --- | --- | --- | --- |
| DCRNN | RMSE | 5.3750 ± 0.0691 | 5.1627 ± 0.0644 | 5.4018 ± 0.0673 | 5.5532 ± 0.0758 |
| DCRNN | MAPE (%) | 24.9853 ± 0.1275 | 24.4747 ± 0.1342 | 25.0366 ± 0.1625 | 25.4424 ± 0.1238 |
| Transformer4048 | RMSE | 5.6802 ± 0.0206 | 5.4055 ± 0.0109 | 5.6632 ± 0.0173 | 5.9584 ± 0.0478 |
| Transformer4048 | MAPE (%) | 22.5787 ± 0.2153 | 21.8932 ± 0.2006 | 22.3830 ± 0.1943 | 23.4583 ± 0.2541 |
| TransformerBest | RMSE | 5.6898 ± 0.0219 | 5.4066 ± 0.0302 | 5.6546 ± 0.0581 | 5.9926 ± 0.0472 |
| TransformerBest | MAPE (%) | 22.2793 ± 0.1810 | 21.4545 ± 0.0448 | 22.1954 ± 0.1792 | 23.1868 ± 0.3334 |
| Forecaster | RMSE | 5.1879 ± 0.0082 | 4.9629 ± 0.0102 | 5.2275 ± 0.0083 | 5.3651 ± 0.0065 |
| Forecaster | MAPE (%) | 20.1362 ± 0.0316 | 19.8889 ± 0.0269 | 20.0954 ± 0.0299 | 20.4232 ± 0.0604 |
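The relative improvements quoted in the text follow from the Average column of Table 1. The short Python check below recomputes them; the helper name `relative_improvement` is illustrative.

```python
def relative_improvement(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# Average RMSE and MAPE taken from Table 1.
rmse_vs_transformer = relative_improvement(5.6898, 5.1879)
mape_vs_transformer = relative_improvement(22.2793, 20.1362)
rmse_vs_dcrnn = relative_improvement(5.3750, 5.1879)
mape_vs_dcrnn = relative_improvement(24.9853, 20.1362)

for name, value in [("RMSE vs TransformerBest", rmse_vs_transformer),
                    ("MAPE vs TransformerBest", mape_vs_transformer),
                    ("RMSE vs DCRNN", rmse_vs_dcrnn),
                    ("MAPE vs DCRNN", mape_vs_dcrnn)]:
    print(f"{name}: {value:.4f}%")
```

Each improvement is measured relative to the baseline's error, which is why the same absolute gap yields different percentages against different baselines.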
4 Related Work
To our knowledge, this work is the first (1) to integrate Gaussian Markov random fields with deep learning to forecast spatial and time-dependent data, using the former to derive a dependency graph; and (2) to sparsify the architecture of the Transformer based on the dependency graph, significantly improving the forecasting quality of the resulting architecture. The most closely related work is a set of proposals on forecasting spatial and time-dependent data and on the Transformer, which we briefly review in this section.
4.1 Spatial and Time-Dependent Data Forecasting
Conventional methods for forecasting spatial and time-dependent data, such as ARIMA and Kalman filtering-based methods [15, 14], usually impose strong stationarity assumptions on the data, which are often violated [13]. Recently, deep learning-based methods have been proposed to tackle the nonstationary and highly nonlinear nature of the data [28, 30, 29, 6, 27, 13]. Most of these works consist of two parts: modules to capture spatial dependency and modules to capture temporal dependency. Regarding spatial dependency, the literature mostly uses prior knowledge such as physical closeness between regions to derive an adjacency matrix and/or predefined distance/similarity metrics to decide whether two locations are dependent or not. Then, based on this information, these works usually use a (standard or graph) CNN to characterize the spatial dependency between dependent locations. However, these methods are not good predictors of dependency relations between the data at different locations. Regarding temporal dependency, available works [28, 29, 6, 27, 13] usually use RNNs and CNNs to extract the long-range temporal dependency. However, neither RNN nor CNN learns the long-range temporal dependency well, as the number of operations used to relate signals at two distant time positions in a sequence grows at least logarithmically with the distance between them [23].
We evaluate our architecture on the problem of forecasting taxi ride-hailing demand around a large number of spatial locations. The problem has two essential features: (1) these locations are not uniformly distributed like pixels in an image, making standard CNN-based methods [28, 27, 30] ill-suited to this problem; (2) it is desirable to perform multi-step forecasting, i.e., forecasting at several time instants in the future, which implies that work mainly designed for single-step forecasting [29, 6] is less applicable. DCRNN [13] is the state-of-the-art baseline satisfying both features. Hence, we compare our architecture with DCRNN and show that our work outperforms DCRNN.
4.2 Transformer
The Transformer [23] avoids recurrence and instead relies purely on the self-attention mechanism to let the data at distant positions in a sequence relate to each other directly. This benefits learning long-range temporal dependency. The Transformer and its extensions have been shown to significantly outperform RNN-based methods in NLP and image generation tasks [23, 18, 2, 26, 4, 17]. However, it is still unknown how to apply the architecture of the Transformer to spatial and time-dependent data, especially how to deal with spatial dependency between locations. Later work [24] extends the architecture of the Transformer to video generation. Even though this also needs to address spatial dependency between pixels, the nature of the problem is different from our task. In video generation, pixels exhibit spatial dependency only over a short time interval, lasting for at most tens of frames: two pixels may be dependent only for a few frames and become independent in later frames. On the contrary, in spatial and time-dependent data, locations exhibit long-term spatial dependency lasting for months or even years. This fundamental difference in the applications that we consider enables us to use Gaussian Markov random fields to determine the graph of dependent locations as the basis for sparsifying the Transformer. Child et al. [1] propose another sparse Transformer architecture with a different goal: accelerating the self-attention operations in the Transformer. This architecture is very different from our architecture.
5 Conclusion
Forecasting spatial and time-dependent data is challenging due to complex spatial dependency, long-range temporal dependency, nonstationarity, and heterogeneity within the data. This paper proposes Forecaster, a graph Transformer architecture to tackle these challenges. Forecaster uses Gaussian Markov random fields to determine the dependency graph between the data at different locations. Then, Forecaster sparsifies the architecture of the Transformer based on the structure of the graph and lets the sparsified Transformer (i.e., graph Transformer) capture the spatiotemporal dependencies, nonstationarity, and heterogeneity in one shot. We apply Forecaster to the problem of forecasting taxi ride-hailing demand at a large number of spatial locations. Evaluation results demonstrate that Forecaster significantly outperforms state-of-the-art baselines (Transformer and DCRNN).
References
[1] (2019, Apr.) Generating Long Sequences with Sparse Transformers. arXiv preprint arXiv:1904.10509.
[2] (2019, Jul.) Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. In Annual Meeting of the Association for Computational Linguistics (ACL), pp. 2978–2988.
[3] (2006) 25 Years of Time Series Forecasting. International Journal of Forecasting 22 (3), pp. 443–473.
[4] (2019, Jun.) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), pp. 4171–4186.
[5] (2008) Sparse Inverse Covariance Estimation with the Graphical Lasso. Biostatistics 9 (3), pp. 432–441.
[6] (2019, Jan.) Spatiotemporal Multi-Graph Convolution Network for Ride-Hailing Demand Forecasting. In AAAI Conference on Artificial Intelligence, pp. 3656–3663.
[7] (2016) The Forces of Economic Growth: A Time Series Perspective. Princeton University Press.
[8] (2001) Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies.
[9] (2017) Visual Exploration of Global Trade Networks with Time-Dependent and Weighted Hierarchical Edge Bundles on GPU. Computer Graphics Forum 36 (3), pp. 273–282.
[10] (2016, Jun.) Structural-RNN: Deep Learning on Spatio-Temporal Graphs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5308–5317.
[11] (2019) Cope: Interactive Exploration of Co-Occurrence Patterns in Spatial Time Series. IEEE Transactions on Visualization and Computer Graphics 25 (8), pp. 2554–2567.
[12] (2014, Nov.) Vismate: Interactive Visual Analysis of Station-Based Observation Data on Climate Changes. In IEEE Conference on Visual Analytics Science and Technology (VAST), pp. 133–142.
[13] (2018, Apr.) Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. In International Conference on Learning Representations (ICLR), pp. 1–16.
[14] (2013) Short-Term Traffic Flow Forecasting: An Experimental Comparison of Time-Series Analysis and Supervised Learning. IEEE Transactions on Intelligent Transportation Systems 14 (2), pp. 871–882.
[15] (2011, Aug.) Discovering Spatio-Temporal Causal Interactions in Traffic Data Streams. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pp. 1010–1018.
[16] (2009) Expectation-Based Scan Statistics for Monitoring Spatial Time Series Data. International Journal of Forecasting 25 (3), pp. 498–517.
[17] (2018) Image Transformer. In International Conference on Machine Learning (ICML), pp. 4055–4064.
[18] (2018, Jun.) Improving Language Understanding by Generative Pre-training. OpenAI.
[19] (2011) High-Dimensional Covariance Estimation by Minimizing L1-Penalized Log-Determinant Divergence. Electronic Journal of Statistics 5, pp. 935–980.
[20] (2005) Gaussian Markov Random Fields: Theory and Applications (Monographs on Statistics and Applied Probability). Chapman & Hall/CRC.
[21] (2014, Dec.) Sequence to Sequence Learning with Neural Networks. In Advances in Neural Information Processing Systems (NIPS), pp. 3104–3112.
[22] (2018) TLC Trip Record Data.
[23] (2017, Dec.) Attention is All You Need. In Advances in Neural Information Processing Systems (NIPS), pp. 5998–6008.
[24] (2018, Jun.) Non-Local Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7794–7803.
[25] (2018) Historical Weather.
[26] (2019, Jun.) XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237.
[27] (2019, Jan.) Revisiting Spatial-Temporal Similarity: A Deep Learning Framework for Traffic Prediction. In AAAI Conference on Artificial Intelligence, pp. 5668–5675.
[28] (2018, Feb.) Deep Multi-View Spatial-Temporal Network for Taxi Demand Prediction. In AAAI Conference on Artificial Intelligence, pp. 2588–2595.
[29] (2018, Jul.) Spatio-Temporal Graph Convolutional Networks: A Deep Learning Framework for Traffic Forecasting. In International Joint Conference on Artificial Intelligence (IJCAI), pp. 3634–3640.
[30] (2017, Feb.) Deep Spatio-Temporal Residual Networks for Citywide Crowd Flows Prediction. In AAAI Conference on Artificial Intelligence, pp. 1655–1661.
[31] (2003, Apr.) Correlation Analysis of Spatial Time Series Datasets: A Filter-and-Refine Approach. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), pp. 532–544.
[32] (2017, Nov.) Spatio-Temporal Neural Networks for Space-Time Series Forecasting and Relations Discovery. In IEEE International Conference on Data Mining (ICDM), pp. 705–714.