1 Introduction
Natural spatiotemporal processes exhibit complex nonstationarity in both space and time: neighboring pixels exhibit local dependencies, and their joint distributions change over time. Learning the higher-order properties underlying this spatiotemporal nonstationarity is particularly significant for many video prediction tasks. Examples include modeling highly complicated real-world systems such as traffic flows [36, 34] and weather conditions [23, 31]. A well-performing predictive model is expected to learn the intrinsic variations in consecutive spatiotemporal context, which can be seen as a combination of the stationary component and the deterministic nonstationary component.
A great challenge in nonstationary spatiotemporal prediction is how to effectively capture higher-order trends regarding each pixel and its local area. For example, in precipitation forecasting, one should carefully consider the complicated and diverse local trends on the evolving radar maps, as shown in Figure 1. This problem is extremely difficult due to the complicated nonstationarity in both space and time. Most previous work handles trend-like nonstationarity with recursions of CNNs [36, 34] or relatively simple state transitions in RNNs [23, 31]. The lack of nonstationary modeling capability prevents reasoning about uncertainties in spatiotemporal dynamics and partially leads to the blurry effect of the predicted frames.
We attempt to resolve this problem by proposing a generic RNN architecture that is more effective in nonstationarity modeling. We find that although the forget gates in recurrent predictive models can deliver, select, and discard information in the process of memory state transitions, they are too simple to capture higher-order nonstationary trends in high-dimensional time series. In particular, the forget gates in the recent PredRNN model [31] do not work appropriately on precipitation forecasting: the majority of them are saturated over all time stamps, implying almost time-invariant memory state transitions. In other words, future frames are predicted by approximately linear extrapolations.
In this paper, we focus on improving the memory transition functions of RNNs. Most statistical forecasting methods in classic time series analysis assume that nonstationary trends can be rendered approximately stationary by performing suitable transformations such as differencing. We introduce this idea to RNNs and propose a new RNN building block named Memory In Memory (MIM), which leverages the differential information between neighboring hidden states in the recurrent paths. MIM can be viewed as an improved version of LSTM [12], whose forget gate is replaced by two embedded long short-term memories.
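The differencing idea borrowed from classic time series analysis can be illustrated on a toy one-dimensional series. The following is only a minimal sketch on made-up data: the quadratic trend, the noise level, and the helper `half_mean_gap` are illustrative assumptions and not part of the proposed model.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(200, dtype=float)

# Illustrative nonstationary series: quadratic trend plus white noise.
series = 0.01 * t**2 + rng.normal(scale=1.0, size=t.size)

first_diff = np.diff(series, n=1)   # first-order differencing
second_diff = np.diff(series, n=2)  # second-order differencing


def half_mean_gap(x):
    """Absolute gap between the means of the two halves of x.

    A small gap is a crude hint (not a formal test) that the mean is stable.
    """
    half = x.size // 2
    return abs(x[:half].mean() - x[half:].mean())


for name, x in [("raw", series), ("1st diff", first_diff), ("2nd diff", second_diff)]:
    print(f"{name:>8}: half-mean gap = {half_mean_gap(x):.3f}")
```

In this toy setting the mean of the raw series drifts strongly, first-order differencing leaves a residual linear drift, and second-order differencing removes the trend, at the price of amplifying the noise. This is the trade-off behind the caution about over-differencing.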
MIM has the following characteristics: (1) It creates a unified modeling for the spatiotemporal nonstationarity by differencing neighboring hidden states rather than raw images. (2) By stacking multiple MIM blocks, our model has a chance to gradually stationalize the spatiotemporal process and make it more predictable. (3) Over-differencing is harmful for time series prediction, as it may lead to a loss of information. This is another reason that we apply differencing in memory transitions rather than in all recurrent signals, e.g., the input gate and the input modulation gate. (4) MIM has one memory cell adopted from LSTMs as well as two additional recurrent modules with their own memories embedded in the transition path of the first memory. We use these modules to respectively model the higher-order nonstationary and approximately stationary components of the spatiotemporal dynamics. The proposed MIM networks achieve state-of-the-art results on multiple prediction tasks, including a synthetic dataset, a real traffic flow dataset, and a precipitation forecasting dataset.
2 Related Work
2.1 ARIMA Models for Time Series Forecasting
Our model is inspired by the Autoregressive Integrated Moving Average (ARIMA) models. A time-series random variable whose power spectrum remains constant over time can be viewed as a combination of signal and noise. An ARIMA model, like a “filter”, aims to separate the signal from the noise. The obtained signal is then extrapolated into the future. In theory, it tackles the time series forecasting problem by transforming the nonstationary process to be stationary through differencing [4].
2.2 Deterministic Spatiotemporal Prediction
Spatiotemporal nonstationary processes are more complicated, as the joint distribution of neighboring pixel values varies in both space and time. Like low-dimensional time series, they can also be decomposed into deterministic and stochastic components. Recent work in neural networks has explored spatiotemporal prediction from these two aspects.
CNNs [17] and RNNs [26] have been widely used for learning the deterministic spatial correlations and temporal dependencies from videos. Srivastava et al. [25] introduced the sequence-to-sequence LSTM network from language modeling to video prediction, but this model can only capture temporal variations. To learn spatial and temporal variations in a unified network structure, Shi et al. [23] integrated the convolution operator into recurrent state transition functions and proposed the Convolutional LSTM. Finn et al. [10] developed an action-conditioned video prediction model that can be further used in robotics planning when combined with model predictive control methods. Villegas et al. [28] and Patraucean et al. [21] presented recurrent models based on the Convolutional LSTM that leverage optical-flow-guided features. Kalchbrenner et al. [14] proposed the Video Pixel Network (VPN), which encodes the time, space, and color structures of videos as a four-dimensional dependency chain. It achieves sharp prediction results but suffers from high computational complexity. Wang et al. [31, 30] extended the Convolutional LSTM with zigzag memory flows, which provide a great modeling capability for short-term video dynamics. Adversarial learning [11, 8] has been increasingly used in video generation or prediction [19, 29, 9, 27, 33], as it aims to solve the multi-modal training difficulty of future prediction and helps generate less blurry frames.
However, the high-order nonstationarity of video dynamics has not been thoroughly considered by the above work, whose temporal transition methods are relatively simple, either controlled by the recurrent gate structures or implemented by the recursion of the feed-forward network. By contrast, our model is characterized by exploiting high-order differencing to mitigate the nonstationary learning difficulty.
2.3 Stochastic Spatiotemporal Prediction
Some recent methods [35, 7, 18] attempted to model the stochastic component of video dynamics using the Variational Autoencoder [16]. These methods increase the prediction diversity, but they are difficult to evaluate and must be run many times to obtain a satisfactory result. In this paper, we focus on the deterministic part of spatiotemporal nonstationarity. More specifically, this work attempts to stationalize the complicated spatiotemporal processes and make their deterministic components in the future more predictable by proposing a new RNN architecture for nonstationarity.
3 Preliminaries
In theory, the proposed Memory In Memory state transition mechanism for nonstationary modeling can be integrated into all LSTM-like units. In spatiotemporal prediction, we choose the Spatiotemporal LSTM (ST-LSTM) [31] as our base network for a trade-off between prediction accuracy and computational simplicity. ST-LSTM is characterized by a dual-memory structure: the temporal memory is adopted from the Convolutional LSTM [23], and the so-called spatiotemporal memory is updated along a zigzag direction. This network topology is illustrated by black arrows in Figure 4.
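The structure of ST-LSTM is shown in Figure 2 (left). Accordingly, we can see four inputs in Equation (1): $\mathcal{X}_t$, which is either the input frame for $l = 1$ or the hidden states output by the previous layer for $l > 1$; $\mathcal{H}_{t-1}^l$ and $\mathcal{C}_{t-1}^l$, the hidden states and memory states from the previous time stamp; as well as $\mathcal{M}_t^{l-1}$, the spatiotemporal memory states either from the top layer at the previous time stamp or from the last layer at the current time stamp. All states are represented by three-dimensional tensors, where the first dimension is the number of channels and the following two dimensions denote the width and height of the feature maps. The output $\mathcal{H}_t^l$ of a unit at time stamp $t$ and layer $l$ is determined by the spatiotemporal memory $\mathcal{M}_t^{l-1}$ from the previous layer, as well as the temporal memory $\mathcal{C}_{t-1}^l$ from the previous time stamp:
$$
\begin{aligned}
g_t &= \tanh(W_{xg} * \mathcal{X}_t + W_{hg} * \mathcal{H}_{t-1}^l + b_g) \\
i_t &= \sigma(W_{xi} * \mathcal{X}_t + W_{hi} * \mathcal{H}_{t-1}^l + b_i) \\
f_t &= \sigma(W_{xf} * \mathcal{X}_t + W_{hf} * \mathcal{H}_{t-1}^l + b_f) \\
\mathcal{C}_t^l &= f_t \odot \mathcal{C}_{t-1}^l + i_t \odot g_t \\
g'_t &= \tanh(W'_{xg} * \mathcal{X}_t + W_{mg} * \mathcal{M}_t^{l-1} + b'_g) \\
i'_t &= \sigma(W'_{xi} * \mathcal{X}_t + W_{mi} * \mathcal{M}_t^{l-1} + b'_i) \\
f'_t &= \sigma(W'_{xf} * \mathcal{X}_t + W_{mf} * \mathcal{M}_t^{l-1} + b'_f) \\
\mathcal{M}_t^l &= f'_t \odot \mathcal{M}_t^{l-1} + i'_t \odot g'_t \\
o_t &= \sigma(W_{xo} * \mathcal{X}_t + W_{ho} * \mathcal{H}_{t-1}^l + W_{co} * \mathcal{C}_t^l + W_{mo} * \mathcal{M}_t^l + b_o) \\
\mathcal{H}_t^l &= o_t \odot \tanh(W_{1 \times 1} * [\mathcal{C}_t^l, \mathcal{M}_t^l])
\end{aligned}
\tag{1}
$$
where $\sigma$ is the sigmoid function, $*$ is the convolution, and $\odot$ is the Hadamard product. The input gate $i_t$, input modulation gate $g_t$, forget gate $f_t$ and output gate $o_t$ control the spatiotemporal information flow. The biggest highlight of ST-LSTM is its zigzag memory flow $\mathcal{M}$. It provides a great modeling capability for short-term trends through longer pathways across the vertical layers. However, it also suffers from blurry predictions, as it still uses the simple forget gate inherited from previous methods. The extremely complex nonstationarity cannot be fully captured by such simple temporal transitions.
4 Methods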
As mentioned above, the spatiotemporal nonstationarity remains under-explored, and its differential features have not been fully exploited by previous neural network methods. In this section, we first present the Memory In Memory (MIM) blocks for learning the higher-order nonstationarity from RNN memory transitions. We then discuss a new RNN architecture, which interlinks multiple MIM blocks with diagonal state connections, for modeling the differential information in spatiotemporal prediction.
4.1 Memory In Memory Blocks
We observe that the complex dynamics in spatiotemporal sequences can be handled more effectively as a combination of stationary variations and nonstationary variations. Suppose we have a video sequence showing a person walking at a constant speed. The velocity can be seen as a stationary variable, and the swing of the legs should be considered a nonstationary process, which is more difficult to predict. Unfortunately, the forget gate in previous LSTM-like models is a simple gating structure that struggles to capture the nonstationary variations in space-time. In preliminary experiments, we find that the majority of forget gates in the recent PredRNN model [31] are saturated, implying that the units always remember stationary variations.
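The Memory In Memory (MIM) block is inspired by the idea of modeling the nonstationary variations using a series of cascaded memory transitions instead of the simple, saturation-prone forget gate in ST-LSTM. As compared in Figure 2 (the smaller dashed boxes), two cascaded temporal memory recurrent modules are designed to replace the temporal forget gate in ST-LSTM. The first module, which additionally takes $\mathcal{H}_{t-1}^{l-1}$ as input, is used to capture the nonstationary variations based on the differencing $(\mathcal{H}_t^{l-1} - \mathcal{H}_{t-1}^{l-1})$ between two consecutive hidden representations. So we name it the nonstationary module (shown as MIM-N in Figure 3). It generates differential features $\mathcal{D}_t^l$ based on the difference-stationary assumption [22]. The other recurrent module takes as inputs the output $\mathcal{D}_t^l$ of the MIM-N module and the outer temporal memory $\mathcal{C}_{t-1}^l$ to capture the approximately stationary variations in spatiotemporal sequences. So we call it the stationary module (shown as MIM-S in Figure 3). By replacing the forget gate with the final output of the cascaded nonstationary and stationary modules (as shown in Figure 2), the nonstationary dynamics can be captured more effectively. Key calculations inside a MIM block can be shown as follows:
$$
\begin{aligned}
g_t &= \tanh(W_{xg} * \mathcal{H}_t^{l-1} + W_{hg} * \mathcal{H}_{t-1}^l + b_g) \\
i_t &= \sigma(W_{xi} * \mathcal{H}_t^{l-1} + W_{hi} * \mathcal{H}_{t-1}^l + b_i) \\
\mathcal{D}_t^l &= \text{MIM-N}(\mathcal{H}_t^{l-1}, \mathcal{H}_{t-1}^{l-1}, \mathcal{N}_{t-1}^l) \\
\mathcal{T}_t^l &= \text{MIM-S}(\mathcal{D}_t^l, \mathcal{C}_{t-1}^l, \mathcal{S}_{t-1}^l) \\
\mathcal{C}_t^l &= \mathcal{T}_t^l + i_t \odot g_t \\
g'_t &= \tanh(W'_{xg} * \mathcal{H}_t^{l-1} + W_{mg} * \mathcal{M}_t^{l-1} + b'_g) \\
i'_t &= \sigma(W'_{xi} * \mathcal{H}_t^{l-1} + W_{mi} * \mathcal{M}_t^{l-1} + b'_i) \\
f'_t &= \sigma(W'_{xf} * \mathcal{H}_t^{l-1} + W_{mf} * \mathcal{M}_t^{l-1} + b'_f) \\
\mathcal{M}_t^l &= f'_t \odot \mathcal{M}_t^{l-1} + i'_t \odot g'_t \\
o_t &= \sigma(W_{xo} * \mathcal{H}_t^{l-1} + W_{ho} * \mathcal{H}_{t-1}^l + W_{co} * \mathcal{C}_t^l + W_{mo} * \mathcal{M}_t^l + b_o) \\
\mathcal{H}_t^l &= o_t \odot \tanh(W_{1 \times 1} * [\mathcal{C}_t^l, \mathcal{M}_t^l])
\end{aligned}
\tag{2}
$$
where MIM-N and MIM-S denote the nonstationary module and the stationary module respectively; $\mathcal{N}$ and $\mathcal{S}$ denote the horizontally-transited memory cells in the two corresponding recurrent modules; $\mathcal{D}$ denotes the differential features, which are learned by the MIM-N module and fed into MIM-S; and $\mathcal{T}$ denotes the output of MIM-S, which takes the place of the forget-gated memory $f_t \odot \mathcal{C}_{t-1}^l$ of ST-LSTM.
The cascaded structure enables end-to-end modeling of different orders of nonstationary dynamics. It is based on the difference-stationary assumption that differencing a nonstationary process multiple times will likely lead to a stationary one [22]. A schematic of MIM-N and MIM-S is presented in Figure 3. We present the detailed calculations of MIM-N as follows:
$$
\begin{aligned}
g_t &= \tanh(W_{xg} * (\mathcal{H}_t^{l-1} - \mathcal{H}_{t-1}^{l-1}) + W_{ng} * \mathcal{N}_{t-1}^l + b_g) \\
i_t &= \sigma(W_{xi} * (\mathcal{H}_t^{l-1} - \mathcal{H}_{t-1}^{l-1}) + W_{ni} * \mathcal{N}_{t-1}^l + b_i) \\
f_t &= \sigma(W_{xf} * (\mathcal{H}_t^{l-1} - \mathcal{H}_{t-1}^{l-1}) + W_{nf} * \mathcal{N}_{t-1}^l + b_f) \\
\mathcal{N}_t^l &= f_t \odot \mathcal{N}_{t-1}^l + i_t \odot g_t \\
o_t &= \sigma(W_{xo} * (\mathcal{H}_t^{l-1} - \mathcal{H}_{t-1}^{l-1}) + W_{no} * \mathcal{N}_t^l + b_o) \\
\mathcal{D}_t^l &= \text{MIM-N}(\mathcal{H}_t^{l-1}, \mathcal{H}_{t-1}^{l-1}, \mathcal{N}_{t-1}^l) = o_t \odot \tanh(\mathcal{N}_t^l)
\end{aligned}
\tag{3}
$$
where all gates $g_t$, $i_t$, $f_t$ and $o_t$ are updated by incorporating the frame difference $(\mathcal{H}_t^{l-1} - \mathcal{H}_{t-1}^{l-1})$, which highlights the nonstationary variations in the spatiotemporal sequence. The detailed calculations of MIM-S are shown as follows:
$$
\begin{aligned}
g_t &= \tanh(W_{dg} * \mathcal{D}_t^l + W_{cg} * \mathcal{C}_{t-1}^l + b_g) \\
i_t &= \sigma(W_{di} * \mathcal{D}_t^l + W_{ci} * \mathcal{C}_{t-1}^l + b_i) \\
f_t &= \sigma(W_{df} * \mathcal{D}_t^l + W_{cf} * \mathcal{C}_{t-1}^l + b_f) \\
\mathcal{S}_t^l &= f_t \odot \mathcal{S}_{t-1}^l + i_t \odot g_t \\
o_t &= \sigma(W_{do} * \mathcal{D}_t^l + W_{co} * \mathcal{C}_{t-1}^l + W_{so} * \mathcal{S}_t^l + b_o) \\
\mathcal{T}_t^l &= \text{MIM-S}(\mathcal{D}_t^l, \mathcal{C}_{t-1}^l, \mathcal{S}_{t-1}^l) = o_t \odot \tanh(\mathcal{S}_t^l)
\end{aligned}
\tag{4}
$$
which takes the memory states $\mathcal{C}_{t-1}^l$ and the differential features $\mathcal{D}_t^l$ generated by MIM-N as input. The stationary module thus provides a gating mechanism that adaptively decides whether to trust the original memory $\mathcal{C}_{t-1}^l$ or the differential features $\mathcal{D}_t^l$. If the differential features vanish, indicating that the nonstationary dynamics are not prominent, MIM-S will mainly reuse the original memory. Otherwise, if the differential features are prominent, MIM-S will overwrite the original memory to focus more on the nonstationary dynamics.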
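The two cascaded modules described above can be sketched in a few lines of NumPy. This is an illustrative sketch under simplifying assumptions, not the reference implementation: the convolutions are replaced by 1x1 channel mixing (the hypothetical helper `mix`), biases are omitted, and the weight names and tensor sizes are made up for the example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def mix(w, x):
    """1x1 channel mixing, a stand-in (assumption) for the convolutions in the text."""
    return np.einsum("oc,chw->ohw", w, x)


def init_params(rng, channels, names):
    """Random (channels x channels) weights for each hypothetical weight name."""
    return {n: rng.normal(scale=0.1, size=(channels, channels)) for n in names}


def mim_n(p, h_t, h_prev, n_prev):
    """Nonstationary module: every gate is driven by the hidden-state difference."""
    diff = h_t - h_prev
    g = np.tanh(mix(p["xg"], diff) + mix(p["ng"], n_prev))
    i = sigmoid(mix(p["xi"], diff) + mix(p["ni"], n_prev))
    f = sigmoid(mix(p["xf"], diff) + mix(p["nf"], n_prev))
    n_t = f * n_prev + i * g
    o = sigmoid(mix(p["xo"], diff) + mix(p["no"], n_t))
    return o * np.tanh(n_t), n_t  # differential features, new memory


def mim_s(p, d_t, c_prev, s_prev):
    """Stationary module: gates weigh the differential features against the outer memory."""
    g = np.tanh(mix(p["dg"], d_t) + mix(p["cg"], c_prev))
    i = sigmoid(mix(p["di"], d_t) + mix(p["ci"], c_prev))
    f = sigmoid(mix(p["df"], d_t) + mix(p["cf"], c_prev))
    s_t = f * s_prev + i * g
    o = sigmoid(mix(p["do"], d_t) + mix(p["co"], c_prev) + mix(p["so"], s_t))
    return o * np.tanh(s_t), s_t  # replacement for the forget-gated memory, new memory


rng = np.random.default_rng(0)
C, H, W = 4, 5, 5  # illustrative sizes
p_n = init_params(rng, C, ["xg", "ng", "xi", "ni", "xf", "nf", "xo", "no"])
p_s = init_params(rng, C, ["dg", "cg", "di", "ci", "df", "cf", "do", "co", "so"])
h_prev = rng.normal(size=(C, H, W))
c_prev = rng.normal(size=(C, H, W))
zeros = np.zeros((C, H, W))

d_same, _ = mim_n(p_n, h_prev, h_prev, zeros)         # identical hidden states
d_moved, n_t = mim_n(p_n, h_prev + 1.0, h_prev, zeros)  # hidden state has changed
t_t, s_t = mim_s(p_s, d_moved, c_prev, zeros)
```

With zero-initialized module memory and no biases, identical consecutive hidden states give differential features that are exactly zero, while a changed hidden state produces nonzero features that are then passed to the stationary module.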
4.2 Memory In Memory Networks
By stacking multiple MIM blocks, our model has a chance to capture higher orders of nonstationarity, gradually stationalizing the spatiotemporal process and making the future sequence more predictable. The key idea of this architecture is to deliver the hidden states necessary for generating differential features and thus best facilitate nonstationarity modeling.
A schematic of our proposed diagonal recurrent architecture is shown in Figure 4. We deliver the hidden states $\mathcal{H}_t^{l-1}$ and $\mathcal{H}_{t-1}^{l-1}$ to the Memory In Memory (MIM) block at time stamp $t$ and layer $l$ to generate the differential features for further use. These connections are shown as diagonal arrows in Figure 4. As the first layer does not have any previous layer, we simply use the Spatiotemporal LSTM (ST-LSTM) [31] to generate its hidden representations. Note that the temporal differencing is performed by subtracting the hidden state $\mathcal{H}_{t-1}^{l-1}$ from the hidden state $\mathcal{H}_t^{l-1}$ in MIM. Compared with differencing neighboring raw images directly, differencing temporally adjacent hidden states can reveal the nonstationarity more evidently, as the spatiotemporal variations in local areas have been encoded into the hidden representations through the bottom ST-LSTM layer.
Another distinctive feature of the MIM networks resides in the horizontal state transition paths. As the MIM blocks have two cascaded temporal memory modules to capture the nonstationary and stationary dynamics respectively, we further deliver the two temporal memories (denoted by $\mathcal{N}$ for the nonstationary memory and by $\mathcal{S}$ for the stationary memory) along the blue arrows in Figure 4.
The MIM networks generate one frame at one time stamp. Calculations of the entire model with one ST-LSTM and $(L-1)$ MIMs can be presented as follows (for $2 \le l \le L$). Note that there is no MIM block that is marked as $\text{MIM}_1$.
$$
\begin{aligned}
\mathcal{H}_t^1, \mathcal{C}_t^1, \mathcal{M}_t^1 &= \text{ST-LSTM}_1(\mathcal{X}_t, \mathcal{H}_{t-1}^1, \mathcal{C}_{t-1}^1, \mathcal{M}_{t-1}^L) \\
\mathcal{H}_t^l, \mathcal{C}_t^l, \mathcal{M}_t^l, \mathcal{N}_t^l, \mathcal{S}_t^l &= \text{MIM}_l(\mathcal{H}_t^{l-1}, \mathcal{H}_{t-1}^{l-1}, \mathcal{H}_{t-1}^l, \mathcal{C}_{t-1}^l, \mathcal{M}_t^{l-1}, \mathcal{N}_{t-1}^l, \mathcal{S}_{t-1}^l)
\end{aligned}
\tag{5}
$$
By stacking multiple MIM blocks, we can potentially learn higher-order nonstationarity from spatiotemporal dynamics.
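The state routing of the stacked architecture, including the diagonal connections, can be sketched as a plain loop over time stamps and layers. The sketch below only illustrates the wiring; the cell signatures and the toy cells `toy_st_lstm` and `toy_mim` are hypothetical stand-ins chosen so that the effect of stacking is easy to trace by hand.

```python
import numpy as np


def run_network(frames, num_layers, st_lstm_cell, mim_cell, state_shape):
    """Route states through one ST-LSTM layer followed by (num_layers - 1) MIM layers."""
    h = [np.zeros(state_shape) for _ in range(num_layers)]  # hidden states
    c = [np.zeros(state_shape) for _ in range(num_layers)]  # temporal memories
    n = [np.zeros(state_shape) for _ in range(num_layers)]  # nonstationary memories
    s = [np.zeros(state_shape) for _ in range(num_layers)]  # stationary memories
    m = np.zeros(state_shape)                               # zigzag memory
    outputs = []
    for x in frames:
        h_prev = list(h)  # hidden states of all layers at the previous time stamp
        # First layer: ST-LSTM, fed by the top-layer zigzag memory of the previous step.
        h[0], c[0], m = st_lstm_cell(x, h_prev[0], c[0], m)
        for l in range(1, num_layers):
            h[l], c[l], m, n[l], s[l] = mim_cell(
                h[l - 1],       # hidden state of the layer below, current time stamp
                h_prev[l - 1],  # diagonal connection: layer below, previous time stamp
                h_prev[l], c[l], m, n[l], s[l],
            )
        outputs.append(h[-1])
    return outputs


def toy_st_lstm(x, h_prev, c_prev, m_prev):
    """Toy stand-in: passes the input through as the hidden state."""
    return x, c_prev, m_prev


def toy_mim(h_below, h_below_prev, h_prev, c_prev, m_below, n_prev, s_prev):
    """Toy stand-in: the hidden state is the difference delivered by the diagonal link."""
    return h_below - h_below_prev, c_prev, m_below, n_prev, s_prev


demo_frames = [np.array(1.0), np.array(2.0), np.array(4.0)]
demo_outputs = run_network(demo_frames, 3, toy_st_lstm, toy_mim, state_shape=())
```

With these toy cells, each additional layer differences the output of the layer below once more, so the three-layer stack maps the inputs 1, 2, 4 to 1, 0, 1. The zero states at the first time stamp mirror the zero initialization of the nonstationary modules.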
5 Experiments
In this section, we perform an extensive evaluation of the proposed Memory In Memory (MIM) approach. For each dataset, we introduce the dataset and the implementation details. Finally, we report the performance of our proposed MIM models and analyze the experimental results both qualitatively and quantitatively.
We use three spatiotemporal prediction datasets: a synthetic dataset with moving digits, a real traffic flow dataset, and a real radar echo dataset. Some settings are common to all of these datasets. Our model has four layers in all experiments, including one ST-LSTM layer as the first layer and three MIMs. The number of feature channels in each MIM block is chosen as a trade-off between prediction accuracy and memory efficiency. We train all of the models with the L2 loss function, using the Adam optimizer [15] with the same starting learning rate and batch size in all experiments. Note that we extend layer normalization [2] to 3D tensors and apply it to the MIM and ConvLSTM models, based on the idea that the training process of deep convolutional networks can be stabilized by reducing the covariate shift problem [13]. Also note that the MIM blocks at the first time stamp do not have any previous hidden representations as input, so the nonstationary modules take zero-filled tensors as initialization. All experiments are implemented in TensorFlow [1].
5.1 Moving MNIST Dataset
Implementation
The standard Moving MNIST dataset consists of grayscale sequences of length 20 (10 frames for the inputs and 10 for the predictions), displaying pairs of digits moving around the image. The sequences are generated by the method described in the work of Srivastava et al. [25], following the experimental settings in PredRNN [31]. Before training the model, we apply min-max normalization to rescale the pixel intensities. In particular, to reduce the training time and memory usage on the Moving MNIST dataset, we reshape each input image into a tensor. By doing this, we significantly reduce the parameters and training time of MIM and the compared methods, while the performance is only marginally affected. Besides, the scheduled sampling strategy [3] is applied to all of the models to bridge the discrepancy between training and inference.
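The two preprocessing steps can be sketched as follows. This is a sketch under stated assumptions: the target range $[0, 1]$, the reading of the reshaping as a patch-to-channel (space-to-depth) rearrangement, and the frame and patch sizes are all illustrative choices rather than the settings used in the experiments.

```python
import numpy as np


def min_max_normalize(frames):
    """Rescale intensities to [0, 1] (assumed target range)."""
    lo, hi = frames.min(), frames.max()
    if hi == lo:
        return np.zeros_like(frames, dtype=float)
    return (frames - lo) / (hi - lo)


def to_patches(frames, patch):
    """(T, H, W, C) -> (T, H/patch, W/patch, patch*patch*C): space moved into channels."""
    t, h, w, c = frames.shape
    x = frames.reshape(t, h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(t, h // patch, w // patch, patch * patch * c)


def from_patches(patches, patch):
    """Inverse of to_patches, used to recover full-resolution predictions."""
    t, hp, wp, cp = patches.shape
    c = cp // (patch * patch)
    x = patches.reshape(t, hp, wp, patch, patch, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(t, hp * patch, wp * patch, c)


rng = np.random.default_rng(0)
raw = rng.integers(0, 256, size=(20, 8, 8, 1)).astype(float)  # toy 8x8 grayscale frames
normalized = min_max_normalize(raw)
patched = to_patches(normalized, patch=4)      # shape (20, 2, 2, 16)
restored = from_patches(patched, patch=4)      # back to (20, 8, 8, 1)
```

The rearrangement is lossless, so predictions made on the smaller, deeper tensors can be mapped back to image space exactly.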
Table 1: Results on the standard Moving MNIST dataset.
Model              SSIM   MSE    MAE
FC-LSTM [25]       0.690  118.3  209.4
ConvLSTM [23]      0.707  103.3  182.9
TrajGRU [24]       0.713  106.9  190.1
CDNA [10]          0.721  97.4   175.3
DFN [6]            0.726  89.0   172.8
FRNN [20]          0.813  69.7   150.3
VPN baseline [14]  0.870  64.1   131.0
PredRNN [31]       0.867  56.8   126.1
Causal LSTM [30]   0.898  46.5   106.8
MIM                0.874  52.0   116.5
MIM*               0.910  44.2   101.1

Table 2: Ablation study on the standard Moving MNIST dataset.
Model                SSIM   MSE   MAE
MIM (without MIM-N)  0.858  54.4  124.8
MIM (without MIM-S)  0.853  55.7  125.5
MIM                  0.874  52.0  116.5
Results
We use the per-frame structural similarity index measure (SSIM) [32], the mean square error (MSE) and the mean absolute error (MAE) to evaluate our models. A lower MSE or MAE, or a higher SSIM, indicates a better prediction.
As shown in Table 1, our proposed MIM model approaches the state-of-the-art results on the standard Moving MNIST dataset. In particular, we construct another model named MIM* by using Causal LSTM [30] as the first layer and integrating the cascaded MIM-N and MIM-S modules into the Causal LSTM memory cells, where they replace the temporal forget gate. This result shows that the memory-in-memory mechanism is not specifically designed for ST-LSTM; instead, it is a generic mechanism for improving RNN memory transitions. Although in other parts of this paper we use ST-LSTM as our base structure for a trade-off between prediction accuracy and computational complexity, we can see that MIM performs better than its ST-LSTM (PredRNN) baseline, while MIM* also performs better than its Causal LSTM baseline.
We also design two ablation experiments, removing either the stationary modules or the nonstationary modules, to verify the necessity of cascading the inner recurrent modules. As illustrated in Table 2, the MIM network without MIM-N works slightly better than that without MIM-S. Both ablated networks still improve over the PredRNN model in MSE and MAE, although not in SSIM. When MIM-N and MIM-S are interlinked, the entire MIM model achieves the best performance, showing the necessity of cascading them in a unified network.
We visualize a sequence of predicted frames on the standard Moving MNIST test set in Figure 5. This example is challenging, as severe occlusions exist near the junction of the input sequence and the output sequence. The occlusions can be viewed as an information bottleneck, in which the mean and variance of the spatiotemporal process change drastically, indicating the presence of high-order nonstationarity. Still, the images generated by MIM are more satisfactory and less blurry than those of the other models. In fact, the digits in the last frames generated by the other models are hardly recognizable. We may conclude that MIM shows more capability in capturing complicated nonstationary variations.
5.2 TaxiBJ Traffic Flow Dataset
Traffic flows are collected from a chaotic real-world environment. Traffic conditions do not vary uniformly over time, and there are strong temporal dependencies between the traffic conditions at neighboring time stamps.
Implementation
TaxiBJ contains traffic flow images collected consecutively from the GPS monitors of taxicabs in Beijing. Each frame in TaxiBJ is a two-channel grid image, whose channels represent the traffic flow entering and leaving the same district at this time. We generate sequences from the TaxiBJ dataset and split the whole dataset into a training set and a test set as described in the work of Zhang et al. [37]. Each sequence contains 8 consecutive frames, 4 for the inputs and 4 for the predictions. The frames are also normalized and reshaped as described above.
Results
The experimental settings on the TaxiBJ dataset are adopted from ST-ResNet [36], which yields the previous state-of-the-art results on this dataset. Because of its non-recurrent structure, ST-ResNet predicts only one frame in one pass and thus generates sequence outputs in a recursive manner. We show the quantitative results in Table 3 and the qualitative results in Figure 6. To make the comparisons conspicuous, we also visualize the difference between the predictions and the ground truth images. MIM shows the best performance in all predicted frames among all compared models, with the lowest difference intensities.
Table 3: Quantitative results on the TaxiBJ dataset for each of the four predicted frames.
Model             Frame 1  Frame 2  Frame 3  Frame 4
ST-ResNet [36]    0.688    0.939    1.130    1.288
VPN [14]          0.744    1.031    1.251    1.444
FRNN [20]         0.682    0.823    0.989    1.183
PredRNN [31]      0.634    0.934    1.047    1.263
Causal LSTM [30]  0.641    0.855    0.979    1.158
MIM               0.554    0.737    0.887    0.999
5.3 Radar Echo Dataset
Implementation
The radar echo dataset contains evolving radar maps that were collected at regular minute-scale intervals from May 1st, 2014 to June 30th, 2014. Each frame is a grid image covering a fixed geographic area. We process the data in the same way as for the TaxiBJ dataset.
Results
We first use the image-level MSE averaged over the generated future frames (sampled at a fixed interval and covering the next hour) to evaluate the compared models, as shown in Table 4. We then convert pixel intensities to radar echo values in dBZ, and choose 30 dBZ, 40 dBZ and 50 dBZ as the thresholds to calculate the hits (prediction = 1, truth = 1), misses (prediction = 0, truth = 1) and false alarms (prediction = 1, truth = 0). The critical success index (CSI) is a skill score that is defined as $\text{CSI} = \frac{\text{hits}}{\text{hits} + \text{misses} + \text{false alarms}}$ [23]. A higher CSI denotes a better prediction. MIM consistently outperforms the other models in both MSE and CSIs. Figure 7 shows the frame-wise comparisons over future time stamps. As the number of predicted frames grows, the results of MIM compare increasingly favorably with those of the other models, indicating that our model can improve the forecasting outcomes by better capturing the underlying, deterministic nonstationarity. Although all of these metrics are widely used in real precipitation forecasting applications, the CSIs at 40 dBZ and 50 dBZ are more important than the others, as they indicate the likelihood of severe weather. Due to the long-tail effect, however, predicting high-intensity radar echoes is nontrivial. Still, we can see that the proposed MIM model performs the best even for these two challenging metrics.
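The CSI computation can be sketched as follows, assuming that both the prediction and the ground truth have already been converted to dBZ. The function name `csi`, the use of a greater-or-equal comparison at the threshold, and the toy radar values are assumptions made for the example.

```python
import numpy as np


def csi(pred_dbz, true_dbz, threshold):
    """Critical success index: hits / (hits + misses + false alarms)."""
    pred = pred_dbz >= threshold   # prediction = 1 where the echo reaches the threshold
    truth = true_dbz >= threshold  # truth = 1 where the observed echo reaches it
    hits = np.sum(pred & truth)
    misses = np.sum(~pred & truth)
    false_alarms = np.sum(pred & ~truth)
    denom = hits + misses + false_alarms
    # The score is undefined when no pixel reaches the threshold in either map.
    return float(hits) / float(denom) if denom > 0 else float("nan")


# Toy 2x2 radar maps in dBZ (illustrative values only).
pred_map = np.array([[35.0, 45.0], [10.0, 55.0]])
true_map = np.array([[32.0, 20.0], [42.0, 60.0]])

scores = {thr: csi(pred_map, true_map, thr) for thr in (30.0, 40.0, 50.0)}
```

On these toy maps the 30 dBZ threshold gives two hits, one miss and one false alarm, so the CSI is 0.5. Correct rejections (prediction = 0, truth = 0) do not enter the score, which is why it remains informative for rare, high-intensity echoes.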
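Table 4: Results on the radar echo dataset: image-level MSE and CSI at the 30 dBZ, 40 dBZ and 50 dBZ thresholds.
Model             MSE   CSI30  CSI40  CSI50
FRNN [20]         52.5  0.254  0.203  0.163
PredRNN [31]      31.8  0.401  0.378  0.306
Causal LSTM [30]  29.8  0.362  0.331  0.251
MIM               27.8  0.429  0.399  0.317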
We also visualize the generated radar maps to show our model’s performance, as illustrated in Figure 8. We can see that the evolution of radar echoes is a highly nonstationary process. The accumulation, deformation and dissipation of the radar echoes are happening at every moment. In this sequence, the echoes in the bottom left corner become larger while the echoes in the upper right corner become smaller. The PredRNN model incorrectly predicts that all the echoes grow, and Causal LSTM predicts that the echoes stay still. Only MIM captures the correct trends.
As shown in Figure 9, to show that the original forget gates do not work in an appropriate way, we calculate the percentage of forget gate values in the PredRNN model that exceed a saturation threshold. To further verify the efficacy of our proposed MIM-N and MIM-S memory recurrent modules, we divide the outputs of MIM-S by the previous cell state $\mathcal{C}_{t-1}^l$ to get a “pseudo forget gate”. Most of the forget gates in PredRNN are saturated, while only a small fraction of the pseudo forget gates in the MIM model are.
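The saturation analysis can be sketched as below. The threshold value of 0.9, the guard `eps`, and the toy arrays are assumptions made for illustration; they are not the values used to produce Figure 9.

```python
import numpy as np


def saturated_fraction(gates, threshold):
    """Fraction of gate values that exceed the saturation threshold."""
    return float(np.mean(gates > threshold))


def pseudo_forget_gate(mims_output, prev_cell, eps=1e-8):
    """Element-wise ratio of the MIM-S output to the previous cell state.

    Entries of prev_cell with magnitude below eps are replaced by eps
    to avoid division by zero (an assumption of this sketch).
    """
    safe = np.where(np.abs(prev_cell) < eps, eps, prev_cell)
    return mims_output / safe


# Toy values (illustrative only).
forget_gates = np.array([0.95, 0.50, 0.99, 0.20])
share = saturated_fraction(forget_gates, threshold=0.9)  # hypothetical threshold

mims_out = np.array([0.5, 1.0])
prev_cell = np.array([1.0, 2.0])
pseudo = pseudo_forget_gate(mims_out, prev_cell)
pseudo_share = saturated_fraction(pseudo, threshold=0.9)
```

Applying the same threshold to the true forget gates of one model and to the pseudo forget gates of the other makes the two saturation percentages directly comparable.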
6 Conclusions
We investigate the underlying nonstationarity that forms the main obstacle in spatiotemporal prediction. RNNs are powerful for modeling difference-stationary sequences, while the ARIMA models are good at modeling low-dimensional time series nonstationarity. This paper enables nonstationary modeling in space-time by proposing a new recurrent neural network to exploit the differential information between adjacent recurrent states. A Memory In Memory (MIM) block is derived to model the complicated variations, which uses two cascaded recurrent modules to handle the nonstationary and approximately stationary components in the spatiotemporal dynamics. MIM achieves state-of-the-art prediction performance on three datasets: a synthetic dataset with moving digits, a real dataset with traffic flows and another real dataset with quickly evolving radar echoes.
References

 [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
 [2] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
 [3] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS, pages 1171–1179, 2015.
 [4] G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung. Time series analysis: forecasting and control. John Wiley & Sons, 2015.
 [5] H. Cramér. On some classes of nonstationary stochastic processes. In Proceedings of the Fourth Berkeley symposium on mathematical statistics and probability, volume 2, pages 57–78. University of Los Angeles Press Berkeley and Los Angeles, 1961.
 [6] B. De Brabandere, X. Jia, T. Tuytelaars, and L. Van Gool. Dynamic filter networks. In NIPS, 2016.
 [7] E. Denton and R. Fergus. Stochastic video generation with a learned prior. In ICML, pages 1174–1183, 2018.
 [8] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a laplacian pyramid of adversarial networks. In NIPS, pages 1486–1494, 2015.
 [9] E. L. Denton et al. Unsupervised learning of disentangled representations from video. In NIPS, pages 4414–4423, 2017.
 [10] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In NIPS, 2016.
 [11] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. NIPS, 3:2672–2680, 2014.
 [12] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
 [13] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
 [14] N. Kalchbrenner, A. v. d. Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu. Video pixel networks. In ICML, 2017.
 [15] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 [16] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 [17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
 [18] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018.
 [19] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.
 [20] M. Oliu, J. Selva, and S. Escalera. Folded recurrent neural networks for future video prediction. In ECCV, September 2018.
 [21] V. Patraucean, A. Handa, and R. Cipolla. Spatiotemporal video autoencoder with differentiable memory. In ICLR Workshop, 2016.
 [22] D. B. Percival and A. T. Walden. Spectral Analysis for Physical Applications. Cambridge University Press, 1993.
 [23] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS, pages 802–810, 2015.
 [24] X. Shi, Z. Gao, L. Lausen, H. Wang, D.-Y. Yeung, W.-k. Wong, and W.-c. Woo. Deep learning for precipitation nowcasting: A benchmark and a new model. In NIPS, 2017.
 [25] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using lstms. In ICML, 2015.
 [26] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. NIPS, 4:3104–3112, 2014.
 [27] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. MoCoGAN: Decomposing motion and content for video generation. In CVPR, 2018.
 [28] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and content for natural video sequence prediction. In ICLR, 2017.
 [29] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In NIPS, 2016.
 [30] Y. Wang, Z. Gao, M. Long, J. Wang, and P. S. Yu. PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. In ICML, pages 5123–5132, 2018.
 [31] Y. Wang, M. Long, J. Wang, Z. Gao, and P. S. Yu. PredRNN: Recurrent neural networks for predictive learning using spatiotemporal LSTMs. In NIPS, pages 879–888, 2017.
 [32] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 13(4):600, 2004.
 [33] N. Wichers, R. Villegas, D. Erhan, and H. Lee. Hierarchical long-term video prediction without supervision. In ICML, 2018.
 [34] Z. Xu, Y. Wang, M. Long, J. Wang, and M. KLiss. Predcnn: Predictive learning with cascade convolutions. In IJCAI, pages 2940–2947, 2018.
 [35] T. Xue, J. Wu, K. Bouman, and B. Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In NIPS, pages 91–99, 2016.
 [36] J. Zhang, Y. Zheng, and D. Qi. Deep spatio-temporal residual networks for citywide crowd flows prediction. In AAAI, pages 1655–1661, 2017.
 [37] J. Zhang, Y. Zheng, D. Qi, R. Li, and X. Yi. DNN-based prediction model for spatio-temporal data. In ACM SIGSPATIAL, page 92. ACM, 2016.