Many real-world phenomena are spatiotemporal, such as traffic flow, the diffusion of air pollutants, and regional rainfall. Correctly predicting the future of these spatiotemporal systems based on past observations is essential for a wide range of scientific studies and real-life applications like traffic management, precipitation nowcasting, and typhoon alert systems.
If there exists an accurate numerical model of the dynamical system, which is usually the case when we have fully understood its rules, forecasting can be achieved by first finding the initial condition of the model and then simulating it to get the future predictions. However, for some complex spatiotemporal dynamical systems like the atmosphere, crowds, and natural videos, we are still not entirely clear about their inner mechanisms. In these situations, machine learning based methods, which try to learn the systems' inner "laws" from historical data, have proven helpful for making accurate predictions [2, 3, 4, 5, 6]. Moreover, recent studies [7, 8] have shown that the "laws" inferred by machine learning algorithms can further be used to guide other tasks like classification and control. For example, controlling a robotic arm is much easier once we can anticipate how it will move after performing specific actions.
Table I: Categories of STSF problems.

| Category | Coordinates | Measurements |
| --- | --- | --- |
| Trajectory Forecasting of Moving Point Cloud | Changing | Fixed/Changing |
| Spatiotemporal Forecasting on Regular Grid | Fixed regular grid | Changing |
| Spatiotemporal Forecasting on Irregular Grid | Fixed irregular grid | Changing |
In this paper, we review these machine learning based methods for spatiotemporal sequence forecasting. We define a length-T spatiotemporal sequence as a series of matrices X_1, X_2, ..., X_T. Each X_t contains a set of coordinates and their corresponding measurements: if N is the number of coordinates, D is the number of measurements, and C is the dimension of the coordinate, X_t can be stored as an N x (C + D) matrix. We further denote the measurement part and the coordinate part of X_t as M_t and C_t respectively. For many forecasting problems, we can use auxiliary information to enhance the prediction accuracy. For example, regarding wind speed forecasting, the latitudes and past temperatures are also helpful for predicting the future wind speeds. We denote the auxiliary information available at timestamp t as A_t. The Spatiotemporal Sequence Forecasting (STSF) problem is to predict the length-L (L > 1) sequence in the future given the previous observations plus the auxiliary information, which is allowed to be empty. The mathematical form is given in (1).
Here, we limit STSF to problems where the input and the output are both sequences with multiple elements. According to this definition, problems like video generation from a static image and dense optical flow prediction, in which the input or the output contains a single element, are not counted.
Based on the characteristics of the coordinate sequence and the measurement sequence, we classify STSF into three categories as shown in Table I. If the coordinates are changing, the STSF problem is called Trajectory Forecasting of Moving Point Cloud (TF-MPC). Problems like human trajectory prediction [11, 4] and human dynamics prediction [12, 13, 14] fall into this category. We need to emphasize here that the entities in a moving point cloud need not be humans; they can be any moving objects. If the coordinates are fixed, we only need to predict the future measurements. Based on whether the coordinates lie on a regular grid, we get the other two categories, called STSF on Regular Grid (STSF-RG) and STSF on Irregular Grid (STSF-IG). Problems like precipitation nowcasting, crowd density prediction and video prediction, in which we have dense observations, are STSF-RG problems. Problems like air quality forecasting, influenza prediction and traffic speed forecasting, in which the monitoring stations (or sensors) spread sparsely across the city, are STSF-IG problems. We need to emphasize that although the measurements in TF-MPC can also be time-variant due to factors like appearance change, most models for TF-MPC deal with the fixed-measurement case and only predict the coordinates. We thus focus on the fixed-measurement case in this survey.
Having both the characteristics of spatial data like images and temporal data like sentences and audio, spatiotemporal sequences contain information about "what", "when", and "where" and provide a comprehensive view of the underlying dynamical system. However, due to the complex spatial and temporal relationships within the data and the potentially long forecasting horizon, spatiotemporal sequence forecasting imposes new challenges on the machine learning community. The first challenge is how to learn a model for multi-step forecasting. Early research in time-series forecasting [18, 19] investigated two approaches: Direct Multi-step (DMS) estimation and Iterated Multi-step (IMS) estimation. The DMS approach directly optimizes the multi-step forecasting objective while the IMS approach learns a one-step-ahead forecaster and iteratively applies it to generate multi-step predictions. However, choosing between DMS and IMS involves a trade-off among forecasting bias, estimation variance, the length of the prediction horizon and the model's nonlinearity. Recent studies have tried to find the mid-ground between these two approaches [20, 21]. The second challenge is how to adequately model the spatial and temporal structures within the data. In fact, the number of free parameters of a length-L spatiotemporal sequence can be up to L x N x D, where the number of coordinates N is usually above a thousand for STSF-RG problems like video prediction. Forecasting in such a high dimensional space is impossible if the model cannot capture the data's underlying structure well. Therefore, machine learning based methods for STSF, including both classical methods and deep learning based methods, have special designs in their model architectures for capturing these spatiotemporal correlations.
In this paper, we organize the literature about STSF following these two challenges. Section 2 covers the general learning strategies for multi-step forecasting including IMS, DMS, the boosting strategy and scheduled sampling. Section 3 reviews the classical methods for STSF including the feature-based methods, state-space models, and Gaussian process models. Section 4 reviews the deep learning based methods for STSF including deep temporal generative models and feed-forward and recurrent neural network based methods. Section 5 summarizes the survey and discusses some future research directions.
2 Learning Strategies for Multi-step Forecasting
Compared with single-step forecasting, learning a model for multi-step forecasting is more challenging because the forecasting output is a sequence with non-i.i.d. elements. In this section, we first introduce and compare the two basic strategies, IMS and DMS, and then review two extensions that bridge the gap between them. To simplify the notation, we keep the following notation rules in this and the following sections. We denote the information available at timestamp t as I_t. Learning an STSF model is thus to train a model, which can be either probabilistic or deterministic, to make the L-step-ahead prediction based on I_t. We also denote the ground-truth at timestamp t as X_t and recursively applying a basic function f for i times as f^i. Setting the ith subscript to be ":" means to select all elements along the ith dimension. Setting the ith subscript to a letter means to select that element along the ith dimension. For example, for a matrix X, X_{i,:} means the ith row, X_{:,j} means the jth column, and X_{i,j} means the (i, j)th element. The flattened vector version of a matrix X is denoted by the corresponding lowercase letter x.
2.1 Iterated Multi-step Estimation
The IMS strategy trains a one-step-ahead forecasting model and iteratively feeds the generated samples back to the one-step-ahead forecaster to get the multi-step-ahead prediction. There are two types of IMS models: the deterministic one-step-ahead forecasting model and the probabilistic one-step-ahead forecasting model. For the deterministic case, the forecasting process and the training objective are shown in (2) and (3) respectively, where d(., .) measures the distance between the prediction and the ground-truth.
When the one-step-ahead forecaster is probabilistic and nonlinear, the MLE of (4) is generally intractable. In this case, we can draw samples from the one-step-ahead distribution by techniques like Sequential Monte-Carlo (SMC) and predict based on these samples.
There are two advantages of the IMS approach: 1) the one-step-ahead forecaster is easy to train because it only needs to consider the one-step-ahead forecasting error, and 2) we can generate predictions of an arbitrary length by recursively applying the basic forecaster. However, there is a discrepancy between training and testing in IMS. In the training process, we use the ground-truths of the previous steps to generate the next-step prediction, while in the testing process, we feed back the model predictions instead of the ground-truths to the forecaster. This makes the model prone to accumulated errors in the forecasting process. For the deterministic forecaster example above, the optimal multi-step-ahead forecaster, which can be obtained by minimizing the multi-step-ahead forecasting error, is not the same as recursively applying the optimal one-step-ahead forecaster defined in (3) when the forecaster is nonlinear. This is because the forecasting errors at earlier timestamps will propagate to later timestamps.
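The mechanics of IMS can be seen in a few lines of code. The sketch below is illustrative and not from any surveyed paper: it fits a one-step-ahead linear forecaster on a synthetic AR(1) series and then rolls it forward by feeding its own predictions back in.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic univariate series: a noisy AR(1) process, x_t = 0.9 x_{t-1} + noise.
T = 500
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.9 * x[t - 1] + 0.1 * rng.standard_normal()

# IMS training: fit a one-step-ahead forecaster f(x_t) = a * x_t by least squares.
X, y = x[:-1], x[1:]
a = (X @ y) / (X @ X)            # closed-form 1-D least squares

def forecast_ims(x0, horizon):
    """Recursively apply the one-step forecaster to its own output."""
    preds = []
    cur = x0
    for _ in range(horizon):
        cur = a * cur            # feed the prediction back in
        preds.append(cur)
    return np.array(preds)

preds = forecast_ims(x[-1], horizon=5)
print(round(a, 3), preds.shape)
```

At test time the errors of earlier steps enter the input of later steps, which is exactly the train/test discrepancy described above.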
2.2 Direct Multi-step Estimation
The main motivation behind DMS is to avoid the error-drifting problem of IMS by directly minimizing the long-term prediction error. Instead of training a single model, DMS trains a separate model F_h for each forecasting horizon h. There can thus be L models in the DMS approach. When the forecasters are deterministic, the forecasting process and training objective are shown in (5) and (6) respectively, where d(., .) is the distance measure.
To disentangle the model size from the number of forecasting steps L, we can also construct F_h by recursively applying a shared one-step-ahead forecaster f. In this case, the model parameters are shared and the optimization problem turns into (7). We need to emphasize here that (7), which optimizes the multi-step-ahead forecasting error, is different from (3), which only minimizes the one-step-ahead forecasting error. This construction method is widely adopted in deep learning based methods [7, 24].
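A minimal sketch of this shared-parameter DMS construction on the same kind of synthetic AR(1) data (the grid-search training and all names are our illustrative choices, not the surveyed papers' methods): the shared one-step model is applied recursively and the summed multi-step error of (7) is minimized directly.

```python
import numpy as np

rng = np.random.default_rng(0)
T, H = 500, 5                       # series length, maximum forecasting horizon

# Synthetic AR(1) data.
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.9 * x[t - 1] + 0.1 * rng.standard_normal()

def multistep_loss(a):
    """DMS objective in the spirit of (7): recursively apply the shared
    one-step model f(x) = a * x and sum the squared errors over horizons."""
    loss = 0.0
    for h in range(1, H + 1):
        pred = (a ** h) * x[:T - h]     # h-fold application of f
        loss += np.mean((pred - x[h:]) ** 2)
    return loss

# A simple grid search stands in for gradient-based training.
grid = np.linspace(0.0, 1.2, 1201)
a_dms = grid[np.argmin([multistep_loss(g) for g in grid])]
print(round(a_dms, 3))
```

For this well-specified linear toy problem the DMS and IMS optima coincide; the two objectives diverge once the model is misspecified or nonlinear.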
2.3 Comparison of IMS and DMS
Chevillon compared the IMS strategy and the DMS strategy. According to the paper, DMS can lead to more accurate forecasts when 1) the model is misspecified, 2) the sequences are non-stationary, or 3) the training set is too small. However, DMS is more computationally expensive than IMS. For DMS, if the forecasters are not shared, we need to store and train L models. If they are shared, we need to recursively apply the basic forecaster for up to L steps [19, 21, 25]. Both cases require larger memory storage or longer running time than solving the IMS objective. On the other hand, the training process of DMS with unshared forecasters is easy to parallelize since each forecasting horizon can be trained in isolation from the others. Moreover, directly optimizing an L-step-ahead forecasting loss may fail when the forecasting model is highly nonlinear or the parameters are not well-initialized.
Due to these trade-offs, models for STSF problems choose the strategy that best matches the problem's characteristics. Generally speaking, DMS is preferred for short-term prediction while IMS is preferred for long-term forecasting. For example, the IMS strategy is widely adopted in solving TF-MPC problems, where the forecasting horizons are long, generally above 12 [13, 4] and sometimes as long as 100. The DMS strategy is used in STSF-RG problems where the dimensionality of the data is usually very large and only short-term predictions are required [5, 6]. For STSF-IG problems, some works adopt DMS when the overall dimensionality of the forecasting sequence is acceptable for the direct approach [27, 3].
Overall, IMS is easier to train but less accurate for multi-step forecasting while DMS is more difficult to train but more accurate. Since IMS and DMS are somewhat complementary, later studies have tried to bridge the gap between these two approaches. We will introduce two representative strategies in the following two sections.
2.4 Boosting Strategy
The boosting strategy combines IMS and DMS for univariate time-series forecasting. The strategy boosts a linear forecaster trained by IMS with several small, nonlinear adjustments trained by DMS. The model assumption is shown in (8).
Here, f is the basic one-step-ahead forecasting model, g_m is the weak learner chosen at boosting iteration m, M is the number of boosting iterations, v is the shrinkage factor, and e is the error term. The authors set the base linear learner to be an auto-regressive model and the weak nonlinear learners to be penalized regression splines. Also, the weak learners were designed to only account for the interaction between two elements in the observation history. The training method is similar to the gradient boosting algorithm. The base learner is first trained by minimizing the IMS objective (3). After that, the weak learners are fit iteratively to the gradient residuals.
The authors also performed a theoretical analysis of the bias and variance of models trained with DMS, IMS and the boosting strategy for 2-step-ahead prediction. The result shows that the estimation bias of DMS can be arbitrarily small, while that of IMS is amplified as the forecasting horizon grows. The model estimated by the boosting strategy also has a bias, but it is much smaller than that of the IMS approach due to the compensation of the weak learners trained by DMS. Also, the variance of the DMS approach depends on the variability of the input history and the model size, which can be very large for complex, nonlinear models. The variance of the boosting strategy is smaller because the basic model is linear and the weak learners only consider the interaction between two elements.
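A toy sketch of the boosting strategy for 2-step-ahead forecasting, with simplifications of ours: the base learner is a linear AR model fit with the IMS objective, and in place of penalized regression splines the weak learner is a single pairwise-interaction feature, fit iteratively to the residuals of the 2-step (DMS) objective.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 600

# Synthetic nonlinear series so the linear base model is misspecified.
x = np.zeros(T)
for t in range(2, T):
    x[t] = 0.5 * x[t-1] - 0.3 * x[t-1] * x[t-2] + 0.1 * rng.standard_normal()

# Lagged design matrix: row i holds (x_{t-1}, x_{t-2}) for some time t.
Z = np.column_stack([x[1:-2], x[:-3]])
y1 = x[2:-1]                       # 1-step-ahead targets (IMS)
y2 = x[3:]                         # 2-step-ahead targets (DMS)

# 1) Base learner: linear AR coefficients fit on the ONE-step objective.
w = np.linalg.lstsq(Z, y1, rcond=None)[0]

def base_2step(Z):
    """Iterate the linear one-step model twice (IMS rollout)."""
    x1 = Z @ w                              # predict t+1
    Z2 = np.column_stack([x1, Z[:, 0]])     # shift the lag window
    return Z2 @ w                           # predict t+2

# 2) Boosting: shrunken weak learners on the pairwise interaction
#    x_{t-1} * x_{t-2}, fit to the residuals of the 2-step objective.
pred = base_2step(Z)
nu, M = 0.5, 20                    # shrinkage factor, boosting iterations
phi = Z[:, 0] * Z[:, 1]            # interaction feature
for _ in range(M):
    r = y2 - pred                  # current residuals
    c = (phi @ r) / (phi @ phi)    # 1-D least-squares fit of the weak learner
    pred += nu * c * phi           # shrunken update

mse_base = np.mean((y2 - base_2step(Z)) ** 2)
mse_boost = np.mean((y2 - pred) ** 2)
print(round(mse_base, 5), round(mse_boost, 5))
```

Each shrunken step can only decrease the training error of the 2-step objective, so the boosted model compensates part of the bias of the misspecified linear base learner.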
Although the boosting strategy was only proposed for the univariate case, the method, which lies in the mid-ground between IMS and DMS, should apply to STSF problems with multiple measurements; this leads to potential future work as discussed in Section 5.
2.5 Scheduled Sampling
The idea of Scheduled Sampling (SS) is to first train the model with IMS and then gradually replace the ground-truths in the objective function with samples generated by the model. When all ground-truth samples are replaced with model-generated samples, the training objective becomes the DMS objective. The generation process of SS is described in (9):
Here, x_hat_{t+h} and x_{t+h} are the generated sample and the ground-truth at forecasting horizon h, respectively, f is the basic one-step-ahead forecasting model, and e_k is the probability of choosing the ground-truth at the kth training iteration. In the training phase, SS minimizes the distance between the generated samples and the ground-truths, which is shown in (10). In the testing phase, e_k is fixed at 0, meaning that the model-generated samples are always used.
If e_k equals 1, the ground-truth will always be chosen and the objective function will be the same as in the IMS strategy. If e_k is 0, the generated samples will always be chosen and the optimization objective will be the same as in the DMS strategy. Based on this observation, the authors proposed to gradually decay e_k during training to make the optimization objective shift smoothly from IMS to DMS, which is a type of curriculum learning. The paper also provided several schedules for changing the sampling ratio.
One issue of SS is that the sampling steps in (10) make the loss function non-differentiable. The original paper obviates the problem by treating the sampled values as constants, which does not optimize the true objective [30, 25].
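The SS training loop can be sketched as follows, on a scalar toy model and synthetic AR(1) data (the linear decay schedule, learning rate, and batch size are illustrative choices of ours). Note how the sampled input is treated as a constant when computing the gradient, matching the simplification discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic AR(1) data; the forecaster is the scalar model f(x) = a * x.
T, H = 300, 4
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.9 * x[t - 1] + 0.1 * rng.standard_normal()

a, lr = 0.0, 0.5
for k in range(300):
    eps = max(0.0, 1.0 - k / 150)          # decay: pure IMS -> pure DMS
    grad = 0.0
    for _ in range(16):                     # a mini-batch of random windows
        t0 = int(rng.integers(0, T - H - 1))
        inp = x[t0]
        for h in range(1, H + 1):
            pred = a * inp
            target = x[t0 + h]
            # Pseudo-gradient of (pred - target)^2 w.r.t. a: the sampled
            # input is treated as a constant, exactly the simplification
            # that keeps SS from optimizing the true objective.
            grad += 2 * (pred - target) * inp
            # Scheduled sampling step (9): feed the ground truth with
            # probability eps, otherwise feed back the prediction.
            inp = target if rng.random() < eps else pred
    a -= lr * grad / (16 * H)
print(round(a, 3))
```

Early iterations are teacher-forced (IMS); by the end every input is a model sample, so the loop is optimizing a DMS-style rollout.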
The SS strategy is widely adopted in DL models for STSF. One line of work applied SS to video prediction and showed that the model trained via SS outperformed the model trained via IMS. Another proposed a training strategy similar to SS, except that the curriculum is adapted from IMS to DMS by controlling the number of prediction steps instead of the frequency of sampling the ground-truth. In [32, 33], the authors used SS to train models for traffic speed forecasting and showed that it outperformed IMS.
3 Classical Methods for STSF
In this section, we review the classical methods for STSF, including feature-based methods, state space models, and Gaussian process models. These methods are generally based on shallow models or spatiotemporal kernels. Since a spatiotemporal sequence can be viewed as a multivariate sequence if we flatten the observations at each timestamp into vectors, or treat the observations at every location independently, methods designed for general multivariate time-series forecasting are naturally applicable to STSF. However, directly doing so ignores the spatiotemporal structure within the data. Therefore, most algorithms introduced in this section extend this basic approach by utilizing the domain knowledge of the specific task or by considering the interactions among different measurements and locations.
3.1 Feature-based Methods
Feature-based methods solve STSF by training a regression model on some human-engineered spatiotemporal features. To reduce the problem size, most feature-based methods treat the measurements at different locations as independent given the extracted features. The high-level IMS and DMS models of the feature-based methods are given in (11) and (12) respectively.
Here, the feature vector represents the extracted features at a given location. Once the feature extraction method is defined, any regression model can be applied. Feature-based methods are usually adopted for STSF-IG problems: because the observations are sparsely and irregularly distributed in STSF-IG, it is sometimes more straightforward to extract human-engineered features than to develop a purely machine learning based model. Despite their straightforwardness, feature-based methods have a major drawback: the model's performance relies heavily on the quality of the human-engineered features. We will briefly introduce two representative feature-based methods for two STSF-IG problems to illustrate the approach.
Ohashi and Torgo proposed a feature-based method for wind-speed forecasting. The authors designed a set of features called spatiotemporal indicators and applied several regression algorithms, including regression trees, support vector regression, and random forests. There are two kinds of spatiotemporal indicators defined in the paper: the average (or weighted average) of the observed historical values within a space-time neighborhood, and the ratio of two average values calculated with different distance thresholds. The regression model takes the spatiotemporal indicators as input to generate the one-step-ahead prediction. Experimental results show that the spatiotemporal indicators help enhance the forecasting accuracy.
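A sketch of such spatiotemporal indicators on toy data (the station layout, radii, lag, and the plain least-squares regressor are our illustrative assumptions, not the paper's exact setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy STSF-IG setup: K stations with fixed 2-D coordinates and one
# wind-speed measurement per station per timestamp.
K, T = 20, 200
coords = rng.uniform(0, 10, size=(K, 2))
speeds = rng.gamma(2.0, 2.0, size=(T, K))
dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)

def spatiotemporal_indicators(t, k, lag=3, r_small=3.0, r_large=6.0):
    """Two kinds of indicators from the description above: averages over a
    space-time neighborhood, and the ratio of averages computed with two
    different distance thresholds."""
    window = speeds[t - lag:t]                   # temporal neighborhood
    near = window[:, dist[k] <= r_small].mean()  # small-radius average
    far = window[:, dist[k] <= r_large].mean()   # large-radius average
    return np.array([near, far, near / (far + 1e-8)])

# Build a regression dataset: indicators -> next-step speed at each station.
feats = np.array([spatiotemporal_indicators(t, k)
                  for t in range(3, T - 1) for k in range(K)])
target = np.array([speeds[t + 1, k]
                   for t in range(3, T - 1) for k in range(K)])

# Any regressor can be plugged in; plain least squares keeps the sketch small.
A = np.column_stack([feats, np.ones(len(feats))])
w = np.linalg.lstsq(A, target, rcond=None)[0]
print(w.shape, feats.shape)
```

Note how the same feature extractor is shared across locations, which is the independence-given-features assumption mentioned above.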
Zheng et al. proposed a feature-based method for air quality forecasting. The authors adopted the DMS approach and combined multiple predictors, including a temporal predictor, a spatial predictor, and an inflection predictor, to generate the features at each location and timestamp. The temporal predictor focuses on the temporal side of the data and uses the measurements within a temporal sliding window to generate the features. The spatial predictor focuses on the spatial side of the data and uses the previous measurements within a nearby region to generate the features. Also, to deal with the sparsity and irregularity of the coordinates, the authors partitioned the neighborhood region into several groups and aggregated the measurements within each group. The inflection predictor extracts the sudden changes within the observations. For the regression models, the temporal predictor uses linear regression, the spatial predictor uses a 2-hidden-layer neural network, and the inflection predictor uses a regression tree. A regression tree model further combines these predictors.
3.2 State Space Models
The State Space Model (SSM) adopts a generative viewpoint of the sequence. For the observed sequence, SSM assumes that each element x_t is generated by a hidden state h_t and that these hidden states, which are allowed to be discrete in our definition, form a Markov process. The general form of the state space model is given in (13), where f is the transition model, g is the observation model, h_t is the hidden state, and the two noise terms are the noise of the transition model and the noise of the observation model. The SSM has a long history in forecasting, and many classical time-series models like the Markov Model (MM), Hidden Markov Model (HMM), Vector Autoregression (VAR), and the Autoregressive Integrated Moving Average (ARIMA) model can be written as SSMs.
Given the underlying hidden state, the observation is independent of the historical information. Thus, the posterior distribution of the sequence for the next L steps can be written as follows:
The advantage of SSM for solving STSF is that it naturally models the uncertainty of the dynamical system due to its Bayesian formulation. The posterior distribution (14) encodes the probability of different future outcomes. We can not only get the most likely future but also know the confidence of our prediction by Bayesian inference. Common types of SSMs in the STSF literature have either discrete hidden states or linear transition models and thus have limited representational power. In this section, we focus on these common types of SSMs and leave the review of more advanced SSMs, which are usually coupled with deep architectures, to Section 4.2. We first briefly review the basics of three common classes of SSMs and then introduce their representative spatiotemporal extensions, including the group-based method, Space-Time Autoregressive Integrated Moving Average (STARIMA), and low-rank tensor auto-regression.
3.2.1 Common Types of SSMs
In the STSF literature, the following three classes of SSMs are the most common.
First-order Markov Model: When we set the hidden state h_t to be the observation x_t in (13) and constrain it to be discrete, the SSM reduces to the first-order MM. Assuming the states have K possible values, the first-order MM is determined by the transition matrix A, where A_{ij} = P(x_{t+1} = j | x_t = i), and the prior distribution pi, where pi_i = P(x_1 = i). The maximum posterior estimate can be obtained via dynamic programming and the optimal model parameters can be solved in closed form.
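As a concrete illustration (toy data of ours, not from the surveyed works): the closed-form MLE of a first-order MM is just the matrix of normalized transition counts, and multi-step forecasts follow from powers of the transition matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3                                     # number of discrete states

# Ground-truth transition matrix used to simulate a state sequence.
A_true = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.7, 0.1],
                   [0.1, 0.2, 0.7]])
s = [0]
for _ in range(5000):
    s.append(int(rng.choice(K, p=A_true[s[-1]])))

# Closed-form MLE of the transition matrix: normalized transition counts.
C = np.zeros((K, K))
for a, b in zip(s[:-1], s[1:]):
    C[a, b] += 1
A_hat = C / C.sum(axis=1, keepdims=True)

# The h-step-ahead state distribution is a matrix power applied to the
# one-hot encoding of the current state.
p0 = np.eye(K)[s[-1]]
p3 = p0 @ np.linalg.matrix_power(A_hat, 3)
print(A_hat.round(2), p3.round(3))
```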
Hidden Markov Model: When the hidden states are discrete, the SSM turns into the HMM. Assuming the hidden states have K possible values, the HMM is determined by the transition matrix A, the state prior pi, and the class-conditional density p(x_t | h_t). For HMM, the maximum posterior estimate and the optimal model parameters can be calculated via the forward algorithm and the Expectation Maximization (EM) algorithm, respectively. It is worth emphasizing that the posterior of HMM is tractable even for non-Gaussian observation models. This is because the states of HMM are discrete and it is possible to enumerate all the possibilities.
Linear-Gaussian SSM: When the transition model and the observation model are both linear with Gaussian noise, i.e., h_t = A h_{t-1} + e_t with e_t ~ N(0, Q) and x_t = C h_t + s_t with s_t ~ N(0, R), the SSM becomes the Linear-Gaussian SSM (LG-SSM), where Q and R are the covariance matrices of the Gaussian noises. A large number of classical time-series models like VAR and ARIMA can be written in the form of LG-SSM [35, 38]. As both the transition model and the observation model are linear-Gaussian, the posterior is also linear-Gaussian and we can use the Kalman Filtering (KF) algorithm to calculate its mean and variance. Similar to HMM, the model parameters can also be learned by EM.
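A minimal Kalman filter sketch for an assumed constant-velocity LG-SSM (the model matrices below are our illustrative choices): the filter maintains the exact Gaussian posterior of the hidden state, and multi-step forecasts propagate that posterior through the transition model.

```python
import numpy as np

rng = np.random.default_rng(0)

# LG-SSM: h_t = A h_{t-1} + eps_t,  x_t = C h_t + sigma_t.
A = np.array([[1.0, 1.0], [0.0, 1.0]])    # constant-velocity transition
C = np.array([[1.0, 0.0]])                # we only observe the position
Q = 0.01 * np.eye(2)                      # transition noise covariance
R = np.array([[0.25]])                    # observation noise covariance

# Simulate a trajectory.
T = 100
h = np.zeros(2)
xs = []
for _ in range(T):
    h = A @ h + rng.multivariate_normal(np.zeros(2), Q)
    xs.append(C @ h + rng.multivariate_normal(np.zeros(1), R))

# Kalman filter: exact posterior mean m and covariance P of the hidden state.
m, P = np.zeros(2), np.eye(2)
for x in xs:
    # Predict step.
    m, P = A @ m, A @ P @ A.T + Q
    # Update step with the Kalman gain.
    S = C @ P @ C.T + R
    Kg = P @ C.T @ np.linalg.inv(S)
    m = m + Kg @ (x - C @ m)
    P = (np.eye(2) - Kg @ C) @ P

# 5-step-ahead forecast: propagate the filtered mean through the model.
m_fc = np.linalg.matrix_power(A, 5) @ m
print(m.round(2), m_fc.round(2))
```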
3.2.2 Group-based Methods
Group-based methods divide the locations into non-intersecting groups and train an independent SSM for each group. To be more specific, for the location set S, the group-based methods assume that it can be decomposed as S = S_1 ∪ S_2 ∪ ... ∪ S_G, where the S_i are disjoint. The measurement sequences that belong to the same group are used to fit a single SSM. If G is equal to 1, this reduces to the simplest case where the measurement sequences at all locations share the same model.
This technique has mainly been applied to TF-MPC problems. Asahara et al. proposed a group-based extension of MM called the Mixed-Markov-chain Model (MMM) for pedestrian-movement prediction. MMM assumes that the people belonging to a specific group follow the same first-order MM. To use the first-order MM, the authors manually discretized the coordinates of each person. The group assignments, along with the model parameters, are learned by EM. Experiments show that MMM outperforms MM and HMM in this task. Mathew et al. extended the work by using HMM as the base model; in their work, the groups are determined by off-line clustering algorithms instead of EM.
Group-based methods capture the intra-group relationships by model sharing. However, they cannot capture the inter-group correlations. Also, only a single group assignment is considered, while there may be multiple grouping relationships in the spatiotemporal data. Moreover, the latent state transitions of SSMs already have some internal grouping effect, so it is often unnecessary to explicitly divide the locations into different groups. Due to these limitations, group-based methods are less mentioned in later research.
3.2.3 Space-Time ARIMA
STARIMA [42, 43, 44] is a classical spatiotemporal extension of the univariate ARIMA model. STARIMA emphasizes the impact of a location's spatial neighborhoods on its future values. The fundamental assumption is that the future measurements at a specific location are related not only to the past measurements observed at this location but also to the past measurements observed at its local neighborhoods. STARIMA is designed for univariate STSF-RG and STSF-IG problems where there is a single measurement at each location. If we denote the measurements at timestamp t as z_t, the model of STARIMA is shown in the following:

∇^d z_t = Σ_{k=1}^{p} Σ_{l=0}^{λ_k} φ_{kl} W^{(l)} ∇^d z_{t−k} − Σ_{k=1}^{q} Σ_{l=0}^{m_k} θ_{kl} W^{(l)} ε_{t−k} + ε_t
Here, ∇^d is the temporal difference operator of order d, where ∇ z_t = z_t − z_{t−1} and ∇^d z_t = ∇(∇^{d−1} z_t). φ_{kl} and θ_{kl} are respectively the autoregressive parameter and the moving-average parameter at temporal lag k and spatial lag l. λ_k and m_k are the maximum spatial orders. W^{(l)} is a predefined matrix of spatial order l. ε_t is the normally distributed error vector at time t, which is uncorrelated across time and uncorrelated with the past observations.
The spatial order matrices W^{(l)} are the key to how STARIMA models the spatiotemporal correlations. The entry W^{(l)}_{ij} is non-zero only if site j is an lth-order neighbor of site i, and W^{(0)} is defined to be the identity matrix. The spatial order relationship in a two-dimensional coordinate system is illustrated in Figure 1. However, the W^{(l)}s need to be predefined by the model builder and cannot vary over time. Also, STARIMA still belongs to the class of LG-SSMs and is thus not suitable for modeling nonlinear spatiotemporal systems. Another problem is that STARIMA is designed for problems with only one measurement at each location and does not consider the case of multiple correlated measurements.
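As a toy illustration of how the W^(l) matrices enter the model (the setup is entirely ours), the sketch below simulates a pure space-time AR(1) special case of STARIMA on a ring of four sites and recovers the two parameters by least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4                                      # number of sites on a ring

# W0 is the identity; W1 is the row-normalized 1st-order neighbor matrix.
W0 = np.eye(N)
W1 = np.zeros((N, N))
for i in range(N):
    W1[i, (i - 1) % N] = W1[i, (i + 1) % N] = 0.5

# Space-time AR(1) special case (no differencing, no moving-average part):
#   z_t = phi10 * W0 z_{t-1} + phi11 * W1 z_{t-1} + eps_t
phi10, phi11 = 0.5, 0.3
T = 2000
z = np.zeros((T, N))
for t in range(1, T):
    z[t] = phi10 * W0 @ z[t-1] + phi11 * W1 @ z[t-1] \
        + 0.1 * rng.standard_normal(N)

# Least-squares recovery of (phi10, phi11) from the simulated data.
F = np.column_stack([(W0 @ z[:-1].T).T.ravel(),
                     (W1 @ z[:-1].T).T.ravel()])
coef = np.linalg.lstsq(F, z[1:].ravel(), rcond=None)[0]

# One-step-ahead forecast with the estimated parameters.
z_next = coef[0] * W0 @ z[-1] + coef[1] * W1 @ z[-1]
print(coef.round(2), z_next.shape)
```

The two scalar parameters are shared across all sites; the spatial structure lives entirely in the predefined W matrices, which is both the strength and the rigidity of STARIMA noted above.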
3.2.4 Low-rank Tensor Auto-regression
Bahadori et al. proposed to train an auto-regression model with a low-rank constraint to better capture the latent structure of the spatiotemporal data. This method is designed for STSF-IG and STSF-RG problems where the coordinates are fixed. Assuming the observed measurements satisfy a VAR(p) model, the optimization problem of the low-rank tensor auto-regression is given in (18):
Here, W is the three-dimensional weight tensor, L is the spatial Laplacian matrix calculated from the fixed coordinate matrix, rank(W) is the rank of the weight tensor, and ρ is the maximum rank. The regularization term in the objective function constrains the predicted results to be spatially smooth, and μ controls the strength of this spatial similarity constraint.
Constraining the rank of the weight tensor imposes an implicit spatiotemporal structure in the data. When the weight tensor is low-rank, the observed sequences can be described with a few latent factors.
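This intuition can be illustrated with a toy experiment (our own simplifications: a single measurement so the weight tensor reduces to a matrix, no Laplacian regularizer, and truncated SVD as a crude surrogate for the constrained optimization):

```python
import numpy as np

rng = np.random.default_rng(0)
K, r_true, T = 30, 2, 1000          # locations, true rank, series length

# Simulate a VAR(1) process whose weight matrix is exactly low-rank:
# the dynamics are driven by only r_true latent factors.
U = 0.5 * rng.standard_normal((K, r_true)) / np.sqrt(K)
V = rng.standard_normal((r_true, K)) / np.sqrt(K)
W_true = U @ V
z = np.zeros((T, K))
for t in range(1, T):
    z[t] = W_true @ z[t-1] + 0.1 * rng.standard_normal(K)

# Unconstrained OLS estimate of the weight matrix.
W_ols = np.linalg.lstsq(z[:-1], z[1:], rcond=None)[0].T

# Project onto the rank constraint by truncated SVD.
U_, s, Vt = np.linalg.svd(W_ols)
r = 2
W_lr = U_[:, :r] @ np.diag(s[:r]) @ Vt[:r]

err_ols = np.linalg.norm(W_ols - W_true)
err_lr = np.linalg.norm(W_lr - W_true)
print(round(err_ols, 3), round(err_lr, 3))
```

Discarding the small singular values removes estimation noise that the unconstrained OLS fit spreads across all directions, which is the statistical benefit of the low-rank constraint when the true dynamics are driven by a few latent factors.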
The optimization problem (18) is non-convex and NP-hard in general  due to the low-rank constraint. One standard approach is to relax the rank constraint to a nuclear norm constraint and optimize the relaxed problem. However, this approach is not computationally feasible for solving (18) because of the high dimensionality of the spatiotemporal data. Thus, researchers [46, 48, 49] proposed optimization algorithms to accelerate the learning process.
The primary limitation of the low-rank tensor auto-regression is that the regression model is still linear even if we have perfectly learned the weights. We cannot directly apply it to solve forecasting problems that have strong nonlinearities.
3.3 Gaussian Process
Gaussian Process (GP) is a non-parametric Bayesian model. A GP defines a prior over functions and infers the posterior given the observed data. It assumes that the joint distribution of the function values at any finite set of points is a Gaussian distribution. A typical form of the GP prior is shown in (19):

f(s) ~ GP(m(s), k(s, s')),

where m(.) is the mean function and k(., .) is the kernel function. The GP posterior can also be used to interpolate the missing values of the data. When applied to STSF, GP is usually used in STSF-IG and STSF-RG problems where the coordinates are fixed. The general formula of GP based models is shown in (20):
Here, s contains the space-time position of a sample, which is the concatenation of the spatial coordinates and the temporal index; for example, s = (t, x, y), where t is the timestamp and (x, y) is the spatial coordinate. k_θ is the kernel function parameterized by θ and p(x | f) is the observation model. Like SSM, GP is also a generative model. The difference is that GP directly generates the function describing the relationship between the observations and the future, while SSM generates the future measurements one by one through the Markov assumption. We need to emphasize here that GP can also be applied to solve TF-MPC problems. A well-known example is the Interacting Gaussian Processes (IGP) model, in which a GP models the movement of each person and a potential function defined over the positions of different people models their interactions. Nevertheless, the method for calculating the posterior of IGP is similar to that of (20) and we will stick to this model formulation.
To predict the measurements given a GP model, we are required to calculate the predictive posterior, where the conditioning set consists of the observed space-time coordinates and measurements, and the prediction set consists of the candidate space-time coordinates at which to infer the measurements. If the observation model is Gaussian, the posterior distribution is also Gaussian, as shown in (21).
If the observation model is non-Gaussian, which is common for STSF problems, the Laplace approximation can be utilized to approximate the posterior. The inference process of GP is computationally expensive due to the inversion of the kernel matrix: for n training points, the naive approach requires O(n^3) computations and O(n^2) storage. There are many acceleration techniques to speed up the inference. We will not investigate them in detail in this survey; readers can refer to [51, 53].
The key ingredient of GP is the kernel function. For STSF problems, the kernel function should take both the spatial and the temporal correlations into account. The common strategy for defining complex kernels is to combine multiple basic kernels by operations like summation and multiplication. Thus, GP models for STSF generally use different kernels to model different aspects of the spatiotemporal sequence and ensemble them to form the final kernel function. One approach is to decompose the spatiotemporal kernel as the product of a spatial kernel k_s and a temporal kernel k_t. If we denote the kernel matrices corresponding to these kernels as K_s and K_t, this decomposition results in the full kernel matrix K = K_s ⊗ K_t, where ⊗ is the Kronecker product. Based on this observation, the authors utilized the computational benefits of Kronecker algebra to scale up the inference of the GP models. Another work combined two spatial kernels and one temporal kernel to construct a GP model for seasonal influenza prediction, defining the basic spatial kernels as the average of multiple RBF kernels and the temporal kernel as the sum of a periodic kernel, Paciorek's kernel, and an exponential kernel; the final kernel function combines these spatial and temporal components.
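A sketch of GP forecasting with a separable spatiotemporal kernel on toy data (the field, length-scales, and noise level are illustrative assumptions of ours; the posterior formulas are the standard Gaussian-noise case of (21)):

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf(a, b, ls):
    """Squared-exponential kernel between two sets of points."""
    d2 = ((a[:, None] - b[None, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

# Space-time inputs s = (t, x, y): 40 noisy observations of a smooth field.
S = np.column_stack([rng.uniform(0, 10, 40),        # timestamps
                     rng.uniform(0, 5, (40, 2))])   # 2-D coordinates
f_true = lambda S: np.sin(S[:, 0]) + 0.5 * S[:, 1]
y = f_true(S) + 0.1 * rng.standard_normal(len(S))

# Separable spatiotemporal kernel: k = k_t * k_s. The product of kernel
# matrices is taken elementwise here; on a regular grid this structure
# becomes the Kronecker product K_s (x) K_t mentioned above.
def k_st(A, B):
    return rbf(A[:, :1], B[:, :1], ls=2.0) * rbf(A[:, 1:], B[:, 1:], ls=2.0)

# Forecast beyond the last observed timestamp at a fixed spatial location.
S_star = np.column_stack([np.linspace(10.2, 11.0, 5),
                          np.tile([2.5, 2.5], (5, 1))])

# GP posterior under Gaussian noise: predictive mean and covariance.
Knn = k_st(S, S) + 0.1 ** 2 * np.eye(len(S))
Ksn = k_st(S_star, S)
mean = Ksn @ np.linalg.solve(Knn, y)
cov = k_st(S_star, S_star) - Ksn @ np.linalg.solve(Knn, Ksn.T)
print(mean.round(2), np.sqrt(np.diag(cov)).round(2))
```

The predictive variance grows as the query times move away from the observed data, which is the uncertainty quantification that motivates GP models for STSF.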
In this section, we reviewed three types of classical methods for solving STSF problems: feature-based methods, SSMs, and GP models. The feature-based methods are simple to implement and can provide useful predictions in some practical situations. However, these methods rely heavily on human-engineered features and may not be applicable without the help of domain experts. The SSMs assume that the observations are generated by Markovian hidden states. We introduced three common types of SSMs: first-order MM, HMM, and LG-SSM. Group-based methods extend SSMs by creating groups among the states or observations. STARIMA extends ARIMA by explicitly modeling the impact of nearby locations on the future. Low-rank tensor autoregression extends the basic autoregression model by adding low-rank constraints on the model parameters. The advantage of these common types of SSMs for STSF is that we can calculate the exact posterior distribution and conduct Bayesian inference. However, the overall nonlinearity of these models is limited, so they may not be suitable for complex STSF problems like video prediction. GP models capture the inner characteristics of spatiotemporal sequences through spatiotemporal kernels. Although very powerful and flexible, GP has high computational and storage complexity: without any optimization, its inference requires O(n³) computations and O(n²) storage, which makes it unsuitable when a huge number of training instances are available.
4 Deep Learning Methods for STSF
Deep Learning (DL) is an emerging area of machine learning that studies how to learn a hierarchically-structured model that directly maps the raw inputs to the desired outputs. Usually, a DL model stacks basic learnable blocks, or layers, to form a deep architecture, and the overall network is trained end-to-end. In recent years, a growing number of researchers have proposed DL models for STSF. To better understand these models, we will start with an introduction to the relevant background of DL and then review the models for STSF.
4.1.1 Basic Building Blocks
In this section, we will review the basic blocks that have been used to build DL models for STSF. Since there is a vast literature on DL, this part serves only as a brief overview. Readers can refer to  for more details.
Restricted Boltzmann Machine: The Restricted Boltzmann Machine (RBM)  is the basic building block of Deep Generative Models (DGMs). RBMs are undirected graphical models that contain a layer of observable variables v and a layer of hidden variables h. These two layers are interconnected and form a bipartite graph. In its simplest form, v and h are binary and contain N_v and N_h nodes, respectively. The joint distribution and the conditional distributions are defined in (22):
Here, b is the bias of the visible state, c is the bias of the hidden state, W is the weight matrix between the visible and hidden states, σ is the sigmoid function defined in Table II, and Z is the normalization constant for energy-based models, which is also known as the partition function. RBM can be learned by Contrastive Divergence (CD), which approximates the gradient of the log-likelihood by drawing samples from a few cycles of MCMC.
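A minimal sketch of one CD-1 update for a binary RBM, using the notation above (W, b, c). The function name `cd1_gradients` is our own, and this is an illustrative single-sample version rather than a production trainer:

```python
# One CD-1 gradient estimate for a binary RBM: a positive phase on the
# data, one Gibbs step for the negative phase, then the difference of
# the expected sufficient statistics.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_gradients(v0, W, b, c, rng):
    """Approximate gradients of log p(v0) for a single visible vector v0."""
    # Positive phase: p(h = 1 | v0)
    ph0 = sigmoid(W @ v0 + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)   # sample hidden states
    # Negative phase: one Gibbs step down to the visible layer and back up
    pv1 = sigmoid(W.T @ h0 + b)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(W @ v1 + c)
    # CD-1 approximation to the log-likelihood gradients
    dW = np.outer(ph0, v0) - np.outer(ph1, v1)
    db = v0 - v1
    dc = ph0 - ph1
    return dW, db, dc
```

The parameters would then be updated by gradient ascent with these estimates.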
The basic form of RBM has been extended in various ways. The Gaussian-Bernoulli RBM (GRBM)  assumes the visible variables are continuous and replaces the Bernoulli distribution with a Gaussian distribution. The Spike and Slab RBM (ssRBM)  maintains two types of hidden variables to model the covariance.
Activations: Activations refer to element-wise nonlinear mappings. The activation functions used in the reviewed models are listed in Table II.
(Table II includes activation functions such as the sigmoid, tanh, ReLU, and Leaky ReLU.)
Sigmoid Belief Network: The Sigmoid Belief Network (SBN)  is another building block of DGMs. Unlike the RBM, which is undirected, SBN is a directed graphical model. SBN contains a hidden binary state h and a visible binary state v. The basic generation process of SBN is given in the following:
Here, b and c are biases and W is the weight matrix. Because the posterior distribution is intractable, SBN is learned by methods like Neural Variational Inference and Learning (NVIL)  and the Wake-Sleep (WS)  algorithm, which train an inference network along with the generation network.
Fully-connected Layer: The Fully-Connected (FC) layer is the most basic building block of DL models. It refers to the linear mapping from the input x to the hidden state h, i.e., h = Wx + b, where W is the weight matrix and b is the bias.
Convolution and Pooling: The convolution layer and the pooling layer were first proposed to take advantage of the translation invariance of image data. The convolution layer computes the output by scanning over the input and applying the same set of linear filters. Although the input can have arbitrary dimensionality, we will only introduce 2D convolution and 2D pooling in this section. For an input X, the output H of the convolution layer, which is also known as a feature map, is calculated as follows:
Here, W is the weight, K_h and K_w are the kernel sizes, "∗" denotes convolution, and b is the bias. If we denote the feature vector at position (i, j) as h_{i,j}, the flattened weight as w̃, and the flattened version of the local region of inputs as x̃_{i,j}, then (24) can be rewritten as (25):
This shows that the convolution layer can also be viewed as applying multiple FC layers with shared weights and biases on local regions of the input. Here, the ordered neighborhood set is determined by hyper-parameters such as the kernel size, stride, and padding [26, 63]. For example, if the kernel size of the convolution filter is K_h × K_w, the neighborhood of each position contains K_h · K_w input locations. Readers can refer to [26, 63] for details on how to calculate the neighborhood set from these hyper-parameters. Also, the out-of-boundary regions of the input are usually set to zero, which is known as zero-padding.
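The FC-layer view of convolution can be made concrete with a naive implementation. The sketch below (the helper name `conv2d_valid` is ours; it uses the cross-correlation form without zero-padding) applies the same flattened weight vector to every local patch:

```python
# Naive single-filter 2D convolution ("valid" mode, no zero-padding):
# each output element is the shared flattened weight applied to a
# flattened local input patch, plus a shared bias.
import numpy as np

def conv2d_valid(x, w, bias=0.0):
    """x: (H, W) input, w: (Kh, Kw) filter; returns (H-Kh+1, W-Kw+1)."""
    H, W_ = x.shape
    Kh, Kw = w.shape
    out = np.empty((H - Kh + 1, W_ - Kw + 1))
    w_flat = w.ravel()                      # shared weights, as in the FC view
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i:i + Kh, j:j + Kw].ravel()
            out[i, j] = w_flat @ patch + bias
    return out
```

A real convolution layer simply runs many such filters in parallel over multi-channel inputs.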
Like the convolution layer, the pooling layer also scans over the input and applies the same mapping function. The pooling layer generally has no parameters and uses a reduction operation like max, sum, or average for the mapping. The formula of the pooling layer is given in the following:
where g is the element-wise reduction operation.
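A minimal non-overlapping max-pooling sketch in the same spirit (the helper name `max_pool2d` and the square-window assumption are ours):

```python
# Non-overlapping k x k max pooling: the same scanning pattern as
# convolution, but with a parameter-free max reduction over each window.
import numpy as np

def max_pool2d(x, k=2):
    """x: (H, W) input with H, W divisible by k; returns (H//k, W//k)."""
    H, W = x.shape
    out = np.empty((H // k, W // k))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = x[i*k:(i+1)*k, j*k:(j+1)*k].max()
    return out
```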
Deconvolution and Unpooling: The deconvolution layer  and the unpooling layer  were first proposed to serve as the "inverse" operations of the convolution and pooling layers. Computing the forward pass of these two layers is equivalent to computing the backward pass of the convolution and pooling layers.
Graph Convolution: Graph convolution generalizes the standard convolution, which operates over a regular grid topology, to convolution over graph structures [66, 67, 68, 33]. There are two interpretations of graph convolution: the spectral interpretation and the spatial interpretation. In the spectral interpretation [66, 69], graph convolution is defined by the convolution theorem, i.e., the Fourier transform of a convolution is the point-wise product of the Fourier transforms:
Here, F(x) = Uᵀx is the graph Fourier transform. The transformation matrix U is defined based on the eigendecomposition of the graph Laplacian. Defferrard et al.  defined U as the eigenvectors of L = I − D^{−1/2} A D^{−1/2}, in which D is the diagonal degree matrix, A is the adjacency matrix, and I is the identity matrix.
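This spectral definition can be sketched directly from the eigendecomposition of the normalized Laplacian. The helper below (`spectral_graph_conv`, our own name) filters a node signal by a point-wise product in the spectral domain:

```python
# Spectral graph convolution sketch: build the normalized Laplacian
# L = I - D^{-1/2} A D^{-1/2}, take its eigenvectors U as the graph
# Fourier basis, and filter by a point-wise product in that basis.
import numpy as np

def spectral_graph_conv(A, x, theta):
    """A: adjacency matrix, x: node signal, theta: one filter
    coefficient per Laplacian eigenvalue."""
    d = A.sum(1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized Laplacian
    _, U = np.linalg.eigh(L)                           # graph Fourier basis
    return U @ (theta * (U.T @ x))                     # filter, then invert
```

With all filter coefficients equal to one, the operation reduces to the identity, since U is orthogonal.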
In the spatial interpretation, graph convolution is defined by a localized parameter-sharing operator that aggregates the features of the neighboring nodes; this operator is also termed a graph aggregator [70, 33]:
where h_i is the output feature vector of node i, Γ_θ is the mapping function parameterized by θ, x_i is the input feature of node i, and x_{N(i)} contains the features of node i's neighbors. The mapping function needs to be permutation invariant and able to handle a varying number of neighbors. One example is the pooling aggregator defined in (29):
where the two mapping functions transform the node and neighbor features and "⊕" denotes concatenation. Compared with the spectral interpretation, the spatial interpretation does not require the expensive eigendecomposition and is more suitable for large graphs. Moreover, Defferrard et al.  and Kipf & Welling  proposed accelerated versions of the spectral graph convolution that can be interpreted from the spatial perspective.
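A minimal version of such a pooling aggregator (the helper name `pooling_aggregator` and the use of plain linear maps for the two mapping functions are our own simplifications) shows the permutation-invariant max-pooling plus concatenation:

```python
# Pooling-aggregator sketch: transform each neighbor feature, max-pool
# over neighbors (order-independent), and concatenate with the node's
# own transformed feature.
import numpy as np

def pooling_aggregator(x_i, neighbor_feats, W_self, W_pool):
    """x_i: (d,) node feature; neighbor_feats: (num_neighbors, d);
    W_self, W_pool: weights of the two (here linear) mapping functions."""
    pooled = np.max(neighbor_feats @ W_pool.T, axis=0)  # permutation invariant
    return np.concatenate([W_self @ x_i, pooled])       # concatenation
```

Because the max is taken over the neighbor axis, shuffling the neighbors leaves the output unchanged, and any number of neighbors is handled.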
4.1.2 Types of DL Models for STSF
Based on the characteristic of the model formulation, we can categorize DL models for STSF into two major types: Deep Temporal Generative Models (DTGMs) and Feedforward Neural Network (FNN) and Recurrent Neural Network (RNN) based methods. In the following, we will briefly introduce the basics about DTGM, FNN, and RNN.
DTGMs are DGMs for temporal sequences. A DTGM adopts a generative viewpoint of the sequence and tries to define its probability distribution. The aforementioned RBM and SBN are building blocks of DTGMs.
FNN: FNNs, also called deep feedforward networks , refer to deep learning models that construct the mapping by stacking the aforementioned basic blocks, such as the FC layer, convolution layer, deconvolution layer, graph convolution layer, and activation layer. Common types of FNNs include the Multilayer Perceptron (MLP) [71, 26], which stacks multiple FC layers and nonlinear activations; the Convolutional Neural Network (CNN) , which stacks multiple convolution and pooling layers; and the graph CNN , which stacks multiple graph convolution layers. The parameters of an FNN are usually estimated by minimizing a loss function plus some regularization terms. Usually, the optimization problem is solved via stochastic gradient-based methods , in which the gradient is computed by backpropagation.
RNN: RNN generalizes the structure of FNN by allowing cyclic connections in the network. These recurrent connections make RNNs especially suitable for modeling sequential data. Here, we will only introduce the discrete version of RNN, which updates the hidden state via h_t = f(h_{t−1}, x_t). The hidden states can further be used to compute the output and define the loss function. One example of an RNN model is shown in (30). The model uses the tanh activation and a fully-connected layer for the recurrent connections, and another fully-connected layer to compute the output:
Similar to FNNs, RNNs are also trained by stochastic gradient-based methods. To calculate the gradient, we can first unfold the RNN into an FNN and then perform backpropagation on the unfolded computational graph. This algorithm is called Backpropagation Through Time (BPTT).
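The tanh RNN of (30), and the unrolled graph that BPTT differentiates, can be sketched as follows (the helper name `rnn_forward` is our own):

```python
# Minimal tanh RNN: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h), with an
# FC output layer y_t = W_hy h_t + b_y. Unrolling this loop over the
# sequence is exactly the computational graph that BPTT backpropagates.
import numpy as np

def rnn_forward(xs, W_hh, W_xh, b_h, W_hy, b_y, h0):
    """xs: (T, d_x) input sequence; returns (T, d_y) outputs and final state."""
    h = h0
    outputs = []
    for x in xs:                                   # unfold over timestamps
        h = np.tanh(W_hh @ h + W_xh @ x + b_h)     # recurrent connection
        outputs.append(W_hy @ h + b_y)             # output layer
    return np.stack(outputs), h
```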
Directly training the RNN shown in (30) with BPTT can cause the vanishing and exploding gradient problems. One way to alleviate the vanishing gradient problem is to use the Long Short-Term Memory (LSTM) , which has a cell unit designed to capture long-term dependencies. The formula of LSTM is given in the following:
Here, c_t is the memory cell and i_t, f_t, o_t are the input gate, the forget gate, and the output gate, respectively. The memory cell is accessed by the next layer when the output gate o_t is turned on, and the cell can be updated or cleared when the input gate i_t or the forget gate f_t is activated. By using the memory cell and gates to control the information flow, the gradient is trapped in the cell and does not vanish over time, which is also known as the Constant Error Carousel (CEC) . LSTM belongs to a broader category of RNNs called gated RNNs . We will not introduce other architectures like the Gated Recurrent Unit (GRU) in this survey because LSTM is the most representative example. To simplify the notation, we will denote the transition rule of LSTM as h_t, c_t = LSTM(x_t, h_{t−1}, c_{t−1}).
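One LSTM step can be sketched directly from these equations (packing the parameters into a dict `p` is our own convention). Pushing the forget gate toward one and the input gate toward zero illustrates the CEC effect, where the cell state is carried over unchanged:

```python
# One LSTM step with input (i), forget (f), and output (o) gates and
# memory cell c, following the gate notation in the text.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """p maps names like 'W_xi', 'W_hi', 'b_i' (etc. for f, o, c) to arrays."""
    i = sigmoid(p['W_xi'] @ x + p['W_hi'] @ h_prev + p['b_i'])   # input gate
    f = sigmoid(p['W_xf'] @ x + p['W_hf'] @ h_prev + p['b_f'])   # forget gate
    o = sigmoid(p['W_xo'] @ x + p['W_ho'] @ h_prev + p['b_o'])   # output gate
    c = f * c_prev + i * np.tanh(p['W_xc'] @ x + p['W_hc'] @ h_prev + p['b_c'])
    h = o * np.tanh(c)
    return h, c
```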
4.2 Deep Temporal Generative Models
In this section, we will review DTGMs for STSF. Similar to SSMs and GP models, DTGMs adopt a generative viewpoint of the sequences. A characteristic of DTGMs is that models of this class can be stacked to form a deep architecture, which enables them to model complex system dynamics. We will survey four DTGMs in the following. The general structures of these models are shown in Figure 2.
Temporal Restricted Boltzmann Machine: Sutskever & Hinton  proposed the Temporal Restricted Boltzmann Machine (TRBM), which is a temporal extension of RBM. In TRBM, the bias of the RBM at the current timestamp is determined by the previous RBMs. The conditional probability of TRBM is given in the following:
where the dynamic biases are determined by the hidden and visible states of the previous timestamps. The joint distribution is modeled as the product of these conditional probabilities.
The inference of TRBM is difficult and involves evaluating the ratio of two partition functions. The original paper therefore adopted a heuristic inference procedure. Results on a synthetic bouncing-ball dataset show that a two-layer TRBM outperforms a single-layer TRBM on this task.
Recurrent Temporal Restricted Boltzmann Machine: Later, Sutskever et al.  proposed the Recurrent Temporal Restricted Boltzmann Machine (RTRBM), which extends the idea of TRBM by conditioning the parameters of the RBM at each timestamp on the output of an RNN. The model of RTRBM is described in the following:
Here, the RNN hidden state encodes the previous visible states. Inference is simpler in RTRBM than in TRBM because the model reduces to a standard RBM once the RNN state is given. Similar to RBM, RTRBM can also be learned by CD. The author also extended RTRBM with GRBM to model continuous time-series. Qualitative experiments on the synthetic bouncing-ball dataset show that RTRBM can generate more realistic sequences than TRBM.
Structured Recurrent Temporal Restricted Boltzmann Machine: Mittelman et al.  proposed the Structured Recurrent Temporal Restricted Boltzmann Machine (SRTRBM), which improves upon RTRBM by learning the connection structure based on the spatiotemporal patterns of the input. SRTRBM uses two block-masking matrices to mask the weights between the hidden and visible nodes. The formula of SRTRBM is similar to that of RTRBM except for the usage of the two masking matrices:
Here, the block-masking matrices are generated from a graph. To construct the graph, the visible units and the hidden units are divided into nonintersecting subsets, and each node in the graph is assigned a subset of hidden units and a subset of visible units. The units are separated and assigned based on the spatiotemporal coordinates of the data. If there is an edge between two nodes in the graph, the corresponding hidden and visible subsets are connected in SRTRBM.
If the masks are fixed, the learning process of SRTRBM is the same as that of RTRBM. To learn the masks, the paper proposed to use the sigmoid function as a soft mask and learn its parameters. Also, a sparsity regularizer is added to encourage sparse connections. The training process first fixes the masks while updating the other parameters and then updates the masks. The author also gave the ssRBM versions of SRTRBM and RTRBM by replacing RBM with ssRBM while keeping the mask-related part of the energy function unchanged. Experiments on human motion prediction, which is a TF-MPC task, and climate data prediction, which is an STSF-IG task, show that SRTRBM achieves smaller one-step-ahead prediction error than RTRBM and RNN. This shows that using a masked weight matrix to model the spatial and temporal correlations is useful for STSF problems.
Temporal Sigmoid Belief Network: Gan et al.  proposed a temporal extension of SBN called the Temporal Sigmoid Belief Network (TSBN). TSBN connects a sequence of SBNs such that the bias of the SBN at each timestamp depends on the states of the previous SBNs. For a length-T binary sequence, TSBN describes the following joint probability, in which we have omitted the biases for simplicity:
Also, the author extended TSBN to the continuous domain by using a conditional Gaussian for the visible nodes:
Multiple single-layer TSBNs can be stacked to form a deep architecture. For a multi-layer TSBN, the hidden states of the lower layers are conditioned on the states of the higher layers. The connection structure of a multi-layer TSBN is given in Figure 2(e). The author investigated both stochastic and deterministic conditioning for multi-layer models and showed that the deep model with stochastic conditioning works better. Experimental results on the human motion prediction task show that a multi-layer TSBN with stochastic conditioning outperforms variants of SRTRBM and RTRBM.
4.3 FNN and RNN based Models
One shortcoming of DTGMs for STSF is that their learning and inference processes are often complicated and time-consuming. Recently, there has been a growing trend of using FNN and RNN based methods for STSF. Compared with DTGMs, FNN and RNN based methods have simpler learning and inference processes. Moreover, these methods have achieved good performance on many STSF problems like video prediction [31, 80, 81], precipitation nowcasting [5, 82], traffic speed forecasting [17, 32, 33], human trajectory prediction , and human motion prediction . In the following, we will review these methods.
4.3.1 Encoder-Forecaster Structure
Recall that the goal of STSF is to predict a sequence of future values given the past observations. The Encoder-Forecaster (EF) structure [7, 5] first encodes the observations into a finite-dimensional vector or a probability distribution using the encoder, and then generates the one-step-ahead prediction or multi-step predictions using the forecaster:
Here, the first mapping is the encoder and the second is the forecaster. In our definition, we also allow a probabilistic encoder and forecaster, in which case the encoded state and the predictions become random variables:
Using a probabilistic encoder and forecaster is one way to handle uncertainty, and we will review the details in Section 4.3.5. All of the surveyed FNN and RNN based methods fit into the EF framework. In the following, we will review how previous works design the encoder and the forecaster for the different types of STSF problems, i.e., STSF-RG, STSF-IG, and TF-MPC.
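The EF interface can be sketched with placeholder linear maps (all names, shapes, and the specific maps here are our own illustrative choices, not a model from the literature):

```python
# Schematic encoder-forecaster structure: the encoder compresses the past
# observations into a finite-dimensional state; the forecaster rolls the
# state forward and decodes a prediction at each future timestamp.
import numpy as np

def encode(observations, W_enc):
    """observations: (T_in, d_obs) past sequence -> (d_state,) state."""
    return np.tanh(W_enc @ observations.mean(axis=0))

def forecast(state, W_step, W_out, horizon):
    """Roll the encoded state forward and emit multi-step predictions."""
    preds = []
    for _ in range(horizon):
        state = np.tanh(W_step @ state)    # state transition
        preds.append(W_out @ state)        # decode a prediction
    return np.stack(preds)
```

Real models replace these placeholder maps with CNN or RNN encoders and forecasters, as reviewed below.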
4.3.2 Models for STSF-RG
For STSF-RG problems, the spatiotemporal sequences lie on a regular grid and can naturally be modeled by CNNs, which have proved effective for extracting features from images and videos [72, 62]. Therefore, researchers have used CNNs to build the encoder and the forecaster. In , two input frames are encoded independently into two state vectors by a 2D CNN. The future frame is generated by linearly extrapolating these two state vectors and then mapping the extrapolated vector back to the image space with a series of 2D deconvolution and unpooling layers. Also, to better capture the spatial features, the author proposed the phase pooling layer, which keeps both the pooled values and the pooled indices in the feature maps. In , multi-scale 2D CNNs with ReLU activations were used as the encoder and the forecaster. Rather than encoding the frames independently, the author concatenated the input frames along the channel dimension to generate a single input tensor, which is rescaled to multiple resolutions and encoded by a multi-scale 2D CNN. To better capture the spatiotemporal correlations, Vondrick et al.  proposed to use 3D CNNs as the encoder and forecaster; the author also proposed to use a 2D CNN to predict the background image and only use the 3D CNN to generate the foreground. Later, Shi et al.  compared the performance of 2D CNNs and 3D CNNs for precipitation nowcasting and showed that the 3D CNN outperformed the 2D CNN in their experiments. Kalchbrenner et al.  proposed the Video Pixel Network (VPN), which uses multiple causal convolution layers  in the forecaster. Unlike the previous 2D CNN and 3D CNN models, whose forecasters directly generate the multi-step predictions, VPN generates the elements of the spatiotemporal sequence one by one with the help of causal convolution. Experiments show that VPN outperforms a baseline that uses a 2D CNN to generate the predictions directly. Recently, Xu et al.  proposed the PredCNN network, which stacks multiple dilated causal convolution layers to encode the input frames and predict the future. Experiments show that PredCNN achieves state-of-the-art performance in video prediction.
Besides using CNNs, researchers have also used RNNs to build models for STSF-RG. The advantage of using RNNs for STSF problems is that they naturally model the temporal correlations within the data. Srivastava et al.  proposed to use multi-layer LSTM networks as the encoder and the forecaster; to use the LSTM, the author directly flattened the images into vectors. Oh et al.  proposed to use 2D CNNs to encode the image frames before feeding them into LSTM layers. However, the state-state connection structure of the Fully-Connected LSTM (FC-LSTM) is redundant for STSF and is not suitable for capturing the spatiotemporal correlations in the data . To solve this problem, Shi et al.  proposed the Convolutional LSTM (ConvLSTM), which uses convolutions instead of full connections in both the input-to-state and state-to-state transitions of LSTM. The formula of ConvLSTM is given in the following: