Natural spatiotemporal processes exhibit complex non-stationarity in both space and time: neighboring pixels exhibit local dependencies, and their joint distributions change over time. Learning the higher-order properties underlying this spatiotemporal non-stationarity is particularly significant for many video prediction tasks. Examples include modeling highly complicated real-world systems such as traffic flows [36, 34] and weather conditions [23, 31]. A well-performing predictive model is expected to learn the intrinsic variations in consecutive spatiotemporal context, which can be seen as a combination of the stationary component and the deterministic non-stationary component.
A great challenge in non-stationary spatiotemporal prediction is how to effectively capture higher-order trends regarding each pixel and its local area. For example, when forecasting precipitation, one should carefully consider the complicated and diverse local trends on the evolving radar maps, as shown in Figure 1. This problem is extremely difficult due to the complicated non-stationarity in both space and time. Most previous work handles trend-like non-stationarity with recursions of CNNs [36, 34] or relatively simple state transitions in RNNs [23, 31]. The lack of non-stationary modeling capability prevents reasoning about uncertainties in spatiotemporal dynamics and partially leads to the blurry effect of the predicted frames.
We attempt to resolve this problem by proposing a generic RNN architecture that is more effective in non-stationarity modeling. We find that although the forget gates in recurrent predictive models can deliver, select, and discard information during memory state transitions, they are too simple to capture higher-order non-stationary trends in high-dimensional time series. In particular, the forget gates in the recent PredRNN model [31] do not work appropriately on precipitation forecasting: the majority of them are saturated over all time stamps, implying almost time-invariant memory state transitions. In other words, future frames are predicted by approximately linear extrapolations.
In this paper, we focus on improving the memory transition functions of RNNs. Most statistical forecasting methods in classic time series analysis assume that non-stationary trends can be rendered approximately stationary by performing suitable transformations such as differencing. We introduce this idea to RNNs and propose a new RNN building block named Memory In Memory (MIM), which leverages the differential information between neighboring hidden states in the recurrent paths. MIM can be viewed as an improved version of LSTM [12], whose forget gate is replaced by another two embedded long short-term memories.
MIM has the following characteristics: (1) It provides a unified model of the spatiotemporal non-stationarity by differencing neighboring hidden states rather than raw images. (2) By stacking multiple MIM blocks, our model has a chance to gradually stationalize the spatiotemporal process and make it more predictable. (3) Over-differencing is harmful for time series prediction, as it may inevitably lead to a loss of information. This is another reason why we apply differencing in memory transitions rather than in all recurrent signals, e.g., the input gate and the input modulation gate. (4) MIM has one memory cell adopted from LSTMs as well as two additional recurrent modules, with their own memories, embedded in the transition path of the first memory. We use these modules to model the higher-order non-stationary and the approximately stationary components of the spatiotemporal dynamics, respectively. The proposed MIM networks achieve state-of-the-art results on multiple prediction tasks, including a synthetic dataset, a real traffic flow dataset, and a precipitation forecasting dataset.
2 Related Work
2.1 ARIMA Models for Time Series Forecasting
Our model is inspired by the Autoregressive Integrated Moving Average (ARIMA) models. A time-series random variable whose power spectrum remains constant over time can be viewed as a combination of signal and noise. An ARIMA model, like a “filter”, aims to separate the signal from the noise. The obtained signal is then extrapolated into the future. In theory, it tackles the time series forecasting problem by transforming the non-stationary process to be stationary through differencing.
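As a minimal illustration of how differencing can render a trend stationary, the following sketch applies first differencing repeatedly to a toy series with a quadratic trend. The helper name `difference` and the toy series are assumptions made for this example only and are not part of the proposed model.

```python
def difference(series, order=1):
    """Apply first differencing `order` times: y[t] = x[t] - x[t-1]."""
    out = list(series)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


# Toy series with a quadratic trend, x[t] = t^2, whose mean drifts over time.
x = [t * t for t in range(8)]

d1 = difference(x, order=1)  # a linear trend remains after one differencing
d2 = difference(x, order=2)  # the second difference is constant
```

Here the first difference still grows linearly, while the second difference is constant, which is the sense in which differencing a non-stationary process multiple times can lead to a stationary one.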
2.2 Deterministic Spatiotemporal Prediction
Spatiotemporal non-stationary processes are more complicated, as the joint distribution of neighboring pixel values is varying in both space and time. Like low-dimensional time series, they can also be decomposed into deterministic and stochastic components. Recent work in neural networks explored spatiotemporal prediction from these two aspects.
CNNs and RNNs have been widely used for learning the deterministic spatial correlations and temporal dependencies from videos. Srivastava et al. [25] introduced the sequence-to-sequence LSTM network from language modeling to video prediction, but this model can only capture temporal variations. To learn spatial and temporal variations in a unified network structure, Shi et al. [23] integrated the convolution operator into recurrent state transition functions and proposed the Convolutional LSTM. Finn et al. [10] developed an action-conditioned video prediction model that can be further used in robotics planning when combined with model predictive control methods. Villegas et al. [28] and Patraucean et al. [21] presented recurrent models based on the Convolutional LSTM that leverage optical-flow-guided features. Kalchbrenner et al. [14] proposed the Video Pixel Network (VPN), which encodes the time, space, and color structures of videos as a four-dimensional dependency chain. It achieves sharp prediction results but suffers from high computational complexity. Wang et al. [31, 30] extended the Convolutional LSTM with zigzag memory flows, which provide a great modeling capability for short-term video dynamics. Adversarial learning [11, 8] has been increasingly used in video generation or prediction [19, 29, 9, 27, 33], as it aims to solve the multi-modal training difficulty of future prediction and helps generate less blurry frames.
However, the high-order non-stationarity of video dynamics has not been thoroughly considered by the above work, whose temporal transition methods are relatively simple, either controlled by the recurrent gate structures, or implemented by the recursion of the feed-forward network. By contrast, our model is characterized by exploiting high-order differencing to mitigate the non-stationary learning difficulty.
2.3 Stochastic Spatiotemporal Prediction
Some recent methods attempted to model the stochastic component of video dynamics using the Variational Autoencoder [16]. These methods increase the prediction diversity, but they are difficult to evaluate and must be run a great number of times to obtain a satisfactory result. In this paper, we focus on the deterministic part of spatiotemporal non-stationarity. More specifically, this work attempts to stationalize the complicated spatiotemporal processes and make their deterministic components in the future more predictable by proposing a new RNN architecture for non-stationarity.
In theory, the proposed Memory In Memory state transition mechanism for non-stationary modeling can be integrated into all LSTM-like units. In spatiotemporal prediction, we choose the Spatiotemporal LSTM (ST-LSTM) [31] as our base network for a trade-off between prediction accuracy and computational simplicity. ST-LSTM is characterized by a dual-memory structure: the temporal memory is adopted from the Convolutional LSTM [23], and the so-called spatiotemporal memory is updated along a zigzag direction. This network topology is illustrated by black arrows in Figure 4.
The structure of ST-LSTM is shown in Figure 2 (left). Accordingly, there are four inputs in Equation (1): $\mathcal{X}_t$, which is either the input frame for $l = 1$ or the output hidden states of the previous layer for $l > 1$; $\mathcal{H}_{t-1}^{l}$ and $\mathcal{C}_{t-1}^{l}$, the hidden states and memory states from the previous time stamp; and $\mathcal{M}_{t}^{l-1}$, the spatiotemporal memory states either from the top layer at the previous time stamp or from the previous layer at the current time stamp. All states are represented by tensors, where the first dimension is the number of channels and the following two dimensions denote the width and height of the feature maps. The output of a unit at time stamp $t$ and layer $l$ is determined by the spatiotemporal memory from the previous layer, as well as the temporal memory from the previous time stamp:
Here $\sigma$ is the sigmoid function, $*$ is the convolution, and $\odot$ is the Hadamard product. The input gate $i_t$, input modulation gate $g_t$, forget gate $f_t$ and output gate $o_t$ control the spatiotemporal information flow. The biggest highlight of ST-LSTM is its zigzag memory flow. It provides a great modeling capability of the short-term trends in longer pathways through the vertical layers. However, it also suffers from blurry predictions, as it still uses the simple forget gate inherited from previous methods. The extremely complex non-stationarity cannot be fully captured by such simple temporal transitions.
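To make the dual-memory update concrete, the following is a scalar sketch of one ST-LSTM-style step. It assumes standard LSTM gating for both memories, replaces every convolution with a hypothetical scalar weight, and omits biases; the weight names in `ST_LSTM_KEYS` and the function `st_lstm_step` are introduced for illustration only and are not the authors' implementation.

```python
import math

# Hypothetical scalar weights standing in for the convolution kernels.
ST_LSTM_KEYS = [
    "xg", "hg", "xi", "hi", "xf", "hf",      # temporal-memory gates
    "xg2", "mg", "xi2", "mi", "xf2", "mf",   # spatiotemporal-memory gates
    "xo", "ho", "co", "mo",                  # output gate
    "fc", "fm",                              # fusion of the two memories
]


def uniform_weights(value):
    """Build a weight dictionary with the same value for every key."""
    return {k: value for k in ST_LSTM_KEYS}


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def st_lstm_step(x, h_prev, c_prev, m_in, w):
    """One scalar ST-LSTM-style step (biases omitted)."""
    # Temporal memory: gates driven by the input and the previous hidden state.
    g = math.tanh(w["xg"] * x + w["hg"] * h_prev)
    i = sigmoid(w["xi"] * x + w["hi"] * h_prev)
    f = sigmoid(w["xf"] * x + w["hf"] * h_prev)  # the simple forget gate
    c = f * c_prev + i * g
    # Spatiotemporal memory: a second gate set driven by the zigzag memory.
    g2 = math.tanh(w["xg2"] * x + w["mg"] * m_in)
    i2 = sigmoid(w["xi2"] * x + w["mi"] * m_in)
    f2 = sigmoid(w["xf2"] * x + w["mf"] * m_in)
    m = f2 * m_in + i2 * g2
    # The output gate sees both memories; the hidden state fuses them.
    o = sigmoid(w["xo"] * x + w["ho"] * h_prev + w["co"] * c + w["mo"] * m)
    h = o * math.tanh(w["fc"] * c + w["fm"] * m)
    return h, c, m
```

In this sketch the temporal memory is carried over only through the single multiplicative term `f * c_prev`. If `f` stays close to a constant, the transition becomes nearly time-invariant, which is the limitation discussed above.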
As mentioned above, the spatiotemporal non-stationarity remains under-explored and its differential features have not been fully exploited by previous methods using neural networks. In this section, we first present the Memory In Memory (MIM) blocks for learning about the higher-order non-stationarity from RNNs memory transitions. We then discuss a new RNN architecture, which interlinks multiple MIM blocks with diagonal state connections, for modeling the differential information in the spatiotemporal prediction.
4.1 Memory In Memory Blocks
We observe that the complex dynamics in spatiotemporal sequences can be handled more effectively as a combination of stationary variations and non-stationary variations. Suppose we have a video sequence showing a person walking at a constant speed. The velocity can be seen as a stationary variable, whereas the swing of the legs should be considered a non-stationary process, which is apparently more difficult to predict. Unfortunately, the forget gate in previous LSTM-like models is a simple gating structure that struggles to capture the non-stationary variations in spacetime. In preliminary experiments, we find that the majority of forget gates in the recent PredRNN model [31] are saturated, implying that the units always remember stationary variations.
The Memory In Memory (MIM) block is inspired by the idea of modeling the non-stationary variations using a series of cascaded memory transitions instead of the simple, saturation-prone forget gate in ST-LSTM. As compared in Figure 2 (the smaller dashed boxes), two cascaded temporal memory recurrent modules are designed to replace the temporal forget gate in ST-LSTM. The first module, which additionally takes $\mathcal{H}_{t-1}^{l-1}$ as input, captures the non-stationary variations based on the difference $(\mathcal{H}_{t}^{l-1} - \mathcal{H}_{t-1}^{l-1})$ between two consecutive hidden representations. We therefore name it the non-stationary module (shown as MIM-N in Figure 3). It generates differential features based on the difference-stationary assumption. The other recurrent module takes as inputs the output of the MIM-N module and the outer temporal memory to capture the approximately stationary variations in spatiotemporal sequences. We therefore call it the stationary module (shown as MIM-S in Figure 3). By replacing the forget gate with the final output of the cascaded non-stationary and stationary modules (as shown in Figure 2), the non-stationary dynamics can be captured more effectively. Key calculations inside a MIM block can be shown as follows:
where MIM-N and MIM-S denote the non-stationary module and the stationary module, respectively; $\mathcal{N}$ and $\mathcal{S}$ denote the horizontally-transited memory cells in the two corresponding recurrent modules; and $\mathcal{D}$ denotes the differential features, which are learned by the MIM-N module and fed into MIM-S.
The cascaded structure enables end-to-end modeling of different orders of non-stationary dynamics. It is based on the difference-stationary assumption that differencing a non-stationary process multiple times will likely lead to a stationary one. A schematic of MIM-N and MIM-S is presented in Figure 3. We present the detailed calculations of MIM-N as follows:
where all gates $g_t$, $i_t$, $f_t$ and $o_t$ are updated by incorporating the frame difference $(\mathcal{H}_{t}^{l-1} - \mathcal{H}_{t-1}^{l-1})$, which highlights the non-stationary variations in the spatiotemporal sequence. The detailed calculations of MIM-S are shown as follows:
which takes the memory states $\mathcal{C}_{t-1}^{l}$ and the differential features $\mathcal{D}_{t}^{l}$ generated by MIM-N as input. As can be verified, the stationary module provides a gating mechanism to adaptively decide whether to trust the original memory $\mathcal{C}_{t-1}^{l}$ or the differential features $\mathcal{D}_{t}^{l}$. If the differential features vanish, indicating that the non-stationary dynamics is not prominent, then MIM-S will mainly reuse the original memory. Otherwise, if the differential features are prominent, then MIM-S will overwrite the original memory to focus more on the non-stationary dynamics.
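The cascade of the two modules can be sketched in scalar form as follows. This is an illustration under stated assumptions rather than the exact formulation: both modules are assumed to use standard LSTM-style gates, convolutions are replaced by hypothetical scalar weights (`MIM_N_KEYS`, `MIM_S_KEYS`), biases are omitted, and the function names are introduced only for this example.

```python
import math

# Hypothetical scalar weights standing in for the convolution kernels.
MIM_N_KEYS = ["g_x", "g_n", "i_x", "i_n", "f_x", "f_n", "o_x", "o_n"]
MIM_S_KEYS = ["g_d", "g_c", "i_d", "i_c", "f_d", "f_c", "o_d", "o_c", "o_s"]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mim_n(h_cur, h_prev, n_prev, w):
    """Non-stationary module: an LSTM-style cell fed with a hidden-state difference."""
    diff = h_cur - h_prev  # differencing of temporally adjacent hidden states
    g = math.tanh(w["g_x"] * diff + w["g_n"] * n_prev)
    i = sigmoid(w["i_x"] * diff + w["i_n"] * n_prev)
    f = sigmoid(w["f_x"] * diff + w["f_n"] * n_prev)
    n = f * n_prev + i * g  # inner memory of MIM-N
    o = sigmoid(w["o_x"] * diff + w["o_n"] * n)
    return o * math.tanh(n), n  # differential feature, updated memory


def mim_s(d, c_prev, s_prev, w):
    """Stationary module: gates weigh the outer memory against the differential feature."""
    g = math.tanh(w["g_d"] * d + w["g_c"] * c_prev)
    i = sigmoid(w["i_d"] * d + w["i_c"] * c_prev)
    f = sigmoid(w["f_d"] * d + w["f_c"] * c_prev)
    s = f * s_prev + i * g  # inner memory of MIM-S
    o = sigmoid(w["o_d"] * d + w["o_c"] * c_prev + w["o_s"] * s)
    return o * math.tanh(s), s  # output that takes the place of f * c_prev


def mim_memory_update(h_cur, h_prev, c_prev, n_prev, s_prev, i_gate, g_gate, wn, ws):
    """Temporal memory update with the forget-gate term replaced by the cascade."""
    d, n = mim_n(h_cur, h_prev, n_prev, wn)
    t_out, s = mim_s(d, c_prev, s_prev, ws)
    c = t_out + i_gate * g_gate
    return c, n, s
```

With identical consecutive hidden states and an empty MIM-N memory, `mim_n` returns a differential feature of exactly zero in this sketch, which corresponds to the case where the non-stationary dynamics is not prominent.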
4.2 Memory In Memory Networks
Stacking multiple MIM blocks, our model has a chance to capture higher orders of non-stationarity, gradually stationalizes the spatiotemporal process and makes the future sequence more predictable. The key idea of this architecture is to deliver necessary hidden states for generating differential features and best facilitating non-stationarity modeling.
A schematic of our proposed diagonal recurrent architecture is shown in Figure 4. We deliver the hidden states $\mathcal{H}_{t}^{l-1}$ and $\mathcal{H}_{t-1}^{l-1}$ to the Memory In Memory (MIM) block at time stamp $t$ and layer $l$ to generate the differenced features for further use. These connections are shown as diagonal arrows in Figure 4. As the first layer does not have any previous layer, we simply use the Spatiotemporal LSTM (ST-LSTM) [31] to generate its hidden representations. Note that the temporal differencing is performed by subtracting the hidden state $\mathcal{H}_{t-1}^{l-1}$ from the hidden state $\mathcal{H}_{t}^{l-1}$ in MIM. Compared with differencing neighboring raw images directly, differencing temporally adjacent hidden states can reveal the non-stationarity more evidently, as the spatiotemporal variations in local areas have been encoded into the hidden representations through the bottom ST-LSTM layer.
Another distinctive feature of the MIM networks resides in the horizontal state transition paths. As the MIM blocks have two cascaded temporal memory modules to capture the non-stationary and stationary dynamics respectively, we further deliver the two temporal memories (denoted by $\mathcal{N}$ for the non-stationary memory and by $\mathcal{S}$ for the stationary memory) along the blue arrows in Figure 4.
The MIM networks generate one frame at each time stamp. Calculations of the entire model with one ST-LSTM and $L-1$ MIMs, where $L$ is the number of layers, can be presented as follows (for $2 \le l \le L$). Note that there is no MIM block at the first layer.
By stacking multiple MIM blocks, we could potentially learn higher-order non-stationarity from spatiotemporal dynamics.
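The state connections described in this section can be summarized by listing, for each time stamp and layer, which earlier states a block reads. The sketch below is bookkeeping only and performs no computation; the letter labels (X, H, C, M, N, S) and the function name `mim_network_wiring` are assumptions made for this illustration, and states with time index 0 stand for the zero-filled initial states.

```python
def mim_network_wiring(num_layers, num_steps):
    """Map every (t, l) to the list of states read by the block at (t, l)."""
    wiring = {}
    for t in range(1, num_steps + 1):
        for l in range(1, num_layers + 1):
            if l == 1:
                # First layer is an ST-LSTM: it reads the frame and the zigzag
                # memory from the top layer at the previous time stamp.
                reads = [
                    ("X", t),
                    ("H", t - 1, 1),
                    ("C", t - 1, 1),
                    ("M", t - 1, num_layers),
                ]
            else:
                reads = [
                    ("H", t, l - 1),      # vertical input from the layer below
                    ("H", t - 1, l - 1),  # diagonal input used for differencing
                    ("H", t - 1, l),      # own hidden state, previous time stamp
                    ("C", t - 1, l),      # outer temporal memory
                    ("M", t, l - 1),      # spatiotemporal memory from below
                    ("N", t - 1, l),      # non-stationary memory (horizontal)
                    ("S", t - 1, l),      # stationary memory (horizontal)
                ]
            wiring[(t, l)] = reads
    return wiring


wiring = mim_network_wiring(num_layers=4, num_steps=3)
```

For example, `wiring[(2, 2)]` contains both `("H", 2, 1)` and `("H", 1, 1)`, the pair of hidden states whose difference is fed to the non-stationary module.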
In this section, we perform extensive evaluation of the proposed Memory In Memory (MIM) approach. For each evaluation dataset, we will introduce the dataset details and the implementation details on it. At last, we report the performance of our proposed MIM models and analyze experimental results both qualitatively and quantitatively.
We use three spatiotemporal prediction datasets: a synthetic dataset with moving digits, a real traffic flow dataset, and a real radar echo dataset. Some settings are common to all of these datasets. Our model has four layers in all experiments, including one ST-LSTM layer as the first layer and three MIMs. The number of feature channels in each MIM block is , as a trade-off between prediction accuracy and memory efficiency. We train all of the models with the L2 loss function, using the ADAM optimizer [15] with a starting learning rate. The batch size of each iteration is set to . Note that we extend layer normalization [2] to 3D tensors and apply it to the MIM and ConvLSTM models, based on the idea that the training process of deep convolutional networks can be stabilized by reducing the covariate shift problem [13]. Note also that the MIM blocks at the first time stamp do not have any previous hidden representations as input, so the non-stationary modules take zero-filled tensors as initialization. All experiments are implemented in TensorFlow [1].
5.1 Moving MNIST Dataset
The standard Moving MNIST dataset consists of grayscale sequences of length 20 (10 frames for the inputs and 10 for the predictions), each displaying pairs of digits moving around the image. The sequences are generated by the method described in the work of Srivastava et al. [25], following the experimental settings in PredRNN [31]. Before training the model, we apply max-min normalization to scale the data from its original intensities to . In particular, to reduce the training time and memory usage on the Moving MNIST dataset, we reshape each input image into a tensor. By doing this, we significantly reduce the parameters and training time of MIM and the comparison methods, while the performance is affected only marginally. In addition, the scheduled sampling strategy [3] is applied to all of the models to bridge the discrepancy between training and inference.
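The two preprocessing steps can be sketched as follows. The target range `[0, 1]`, the patch size, and the use of a space-to-depth style fold for the reshape are assumptions made for this illustration; the function names are hypothetical and the toy frames are far smaller than the real data.

```python
def min_max_normalize(frame, lo=0.0, hi=1.0):
    """Linearly rescale pixel values so that the frame spans [lo, hi]."""
    flat = [v for row in frame for v in row]
    vmin, vmax = min(flat), max(flat)
    scale = (hi - lo) / (vmax - vmin) if vmax > vmin else 0.0
    return [[lo + (v - vmin) * scale for v in row] for row in frame]


def space_to_depth(frame, patch):
    """Fold each patch x patch block of an H x W frame into patch * patch channels."""
    h, w = len(frame), len(frame[0])
    assert h % patch == 0 and w % patch == 0, "frame must tile evenly"
    return [
        [
            [frame[r + dr][c + dc] for dr in range(patch) for dc in range(patch)]
            for c in range(0, w, patch)
        ]
        for r in range(0, h, patch)
    ]


# Toy 4 x 4 frame; with patch=2 the result has shape 2 x 2 x 4.
toy = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
folded = space_to_depth(toy, patch=2)
normalized = min_max_normalize([[0, 51], [102, 255]])
```

Folding spatial patches into channels shrinks the spatial size seen by every recurrent layer, which is one way a reshape of this kind can reduce training time and memory usage.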
|Model||SSIM||MSE||MAE|
|VPN baseline [14]||0.870||64.1||131.0|
|Causal LSTM [30]||0.898||46.5||106.8|
|MIM (without MIM-N)||0.858||54.4||124.8|
|MIM (without MIM-S)||0.853||55.7||125.5|
We use the per-frame structural similarity index measure (SSIM) [32], the mean square error (MSE) and the mean absolute error (MAE) to evaluate our models. A lower MSE or MAE, or a higher SSIM, indicates a better prediction.
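For reference, the two error metrics can be computed per frame as in the sketch below. The function name `per_frame_errors` is hypothetical, and averaging over pixels is an assumption of this example; reported values may use a different reduction, such as a per-frame sum.

```python
def per_frame_errors(pred, truth):
    """Return (MSE, MAE) for one frame given as equally sized 2D lists."""
    diffs = [p - t for p_row, t_row in zip(pred, truth) for p, t in zip(p_row, t_row)]
    n = len(diffs)
    mse = sum(d * d for d in diffs) / n   # mean of squared pixel errors
    mae = sum(abs(d) for d in diffs) / n  # mean of absolute pixel errors
    return mse, mae


mse, mae = per_frame_errors([[0.0, 1.0], [2.0, 3.0]], [[0.0, 0.0], [0.0, 0.0]])
```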
As shown in Table 1, our proposed MIM model approaches the state-of-the-art results on the standard Moving MNIST dataset. In particular, we construct another model named MIM* by using Causal LSTM [30] as the first layer and integrating the cascaded MIM-S and MIM-N modules into the Causal LSTM memory cells, where they replace the temporal forget gate. This result shows that the memory in memory mechanism is not specifically designed for ST-LSTM; instead, it is a generic mechanism for improving RNN memory transitions. Although in other parts of this paper we use ST-LSTM as our base structure for a trade-off between prediction accuracy and computational complexity, we can see that MIM performs better than its ST-LSTM (PredRNN) baseline, while MIM* also performs better than its Causal LSTM baseline.
We also design two ablation experiments, removing either the stationary modules or the non-stationary modules, to verify the necessity of cascading the inner recurrent modules. As illustrated in Table 2, the MIM network without MIM-N works slightly better than that without MIM-S. Both variants also improve significantly over the PredRNN model in MSE/MAE, and when MIM-N and MIM-S are interlinked, the entire MIM model achieves the best performance, showing the necessity of cascading them in a unified network.
We visualize a sequence of predicted frames on the standard Moving MNIST test set in Figure 5. This example is challenging, as severe occlusions exist near the junction of the input sequence and the output sequence. The occlusions can be viewed as an information bottleneck, in which the mean and variance of the spatiotemporal process undergo drastic changes, indicating the presence of high-order non-stationarity. Still, the images generated by MIM are more satisfactory and less blurry than those of other models. In fact, we cannot even tell the digits apart in the last frames generated by other models. We may conclude that MIM shows more capability in capturing complicated non-stationary variations.
5.2 TaxiBJ Traffic Flow Dataset
Traffic flows are collected from the chaotic real-world environment. Apparently, traffic conditions will not vary uniformly over time, and there are strong temporal dependencies between the traffic conditions at neighboring time stamps.
TaxiBJ contains traffic flow images collected consecutively from the GPS monitors of taxicabs in Beijing. Each frame in TaxiBJ is a grid image with two channels, which represent the traffic flow entering and leaving the same district at that time. We generate sequences from the TaxiBJ dataset and split the whole dataset into a training set and a test set as described in the work of Zhang et al. Each sequence contains 8 consecutive frames, 4 for the inputs and 4 for the predictions. The frames are also scaled and reshaped as described above.
The experimental settings on the TaxiBJ dataset are adopted from ST-ResNet [36], which yields the previous state-of-the-art results on this dataset. Because of its non-recurrent structure, ST-ResNet predicts only one frame per pass and therefore generates sequence outputs in a recursive manner. We show the quantitative results in Table 3 and the qualitative results in Figure 6. To make the comparisons conspicuous, we also visualize the difference between the predictions and the ground truth images. MIM shows the best performance in all predicted frames among all compared models, with the lowest difference intensities.
|Model||Frame 1||Frame 2||Frame 3||Frame 4|
|Causal LSTM ||0.641||0.855||0.979||1.158|
5.3 Radar Echo Dataset
The radar echo dataset contains evolving radar maps that were collected every minutes, from May 1st, 2014 to June 30th, 2014. Each frame is a grid of image, covering square kilometers. We process the data in the same approach as for the TaxiBJ dataset.
We first use image-level MSE averaged over the next generated images (at a time interval of minutes and covering the next hour) to evaluate the compared models, as shown in Table 4. We then convert pixel intensities to radar echo values in dBZ and choose 30 dBZ, 40 dBZ and 50 dBZ as the thresholds to count the hits (prediction = 1, truth = 1), misses (prediction = 0, truth = 1) and false alarms (prediction = 1, truth = 0). The critical success index (CSI) is a skill score defined as $\text{hits} / (\text{hits} + \text{misses} + \text{false alarms})$. A higher CSI denotes a better prediction. MIM consistently outperforms other models in both MSE and CSIs. Figure 7 shows the frame-wise comparisons over future time stamps. As the number of predicted frames grows, the results of MIM get better, indicating that our model could improve the forecasting outcomes by better capturing the underlying, deterministic non-stationarity. Although all of these metrics are widely used in real precipitation forecasting applications, the prediction accuracy at 40 dBZ and 50 dBZ is more important than the other metrics, as it indicates the likelihood of severe weather. Due to the long-tail effect, however, predicting high-intensity radar echoes is non-trivial. Still, we can see that the proposed MIM model performs the best even on these two challenging metrics.
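The CSI computation can be sketched as follows for a single pair of radar maps already expressed in dBZ. The function name `critical_success_index`, the use of `>=` for thresholding, and the choice to return NaN when no event occurs in either map are assumptions of this example.

```python
def critical_success_index(pred, truth, threshold):
    """CSI = hits / (hits + misses + false alarms) at one dBZ threshold."""
    hits = misses = false_alarms = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            p_on, t_on = p >= threshold, t >= threshold
            if p_on and t_on:
                hits += 1          # prediction = 1, truth = 1
            elif t_on:
                misses += 1        # prediction = 0, truth = 1
            elif p_on:
                false_alarms += 1  # prediction = 1, truth = 0
    denom = hits + misses + false_alarms
    return hits / denom if denom else float("nan")


# Toy 2 x 2 maps: one hit, one miss and one false alarm at 30 dBZ.
csi_30 = critical_success_index([[35, 10], [45, 20]], [[32, 31], [10, 5]], 30)
```

Correct rejections (prediction = 0, truth = 0) do not enter the score, so CSI is not inflated by the large echo-free regions of a radar map.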
|Model||MSE||CSI (30 dBZ)||CSI (40 dBZ)||CSI (50 dBZ)|
|Causal LSTM [30]||29.8||0.362||0.331||0.251|
We also visualize the generated radar maps to show our model’s performance, as illustrated in Figure 8. We can see that the evolution of radar echoes is a highly non-stationary process: the accumulation, deformation and dissipation of the radar echoes happen at every moment. In this sequence, the echoes in the bottom left corner become larger while the echoes in the upper right corner become smaller. PredRNN incorrectly predicts that all the echoes grow, and Causal LSTM predicts that the echoes stay still. Only MIM captures the correct trends.
To verify that the original forget gates do not work appropriately, we calculate the percentage of forget gate values in the PredRNN model that exceed a saturation threshold (Figure 9). To further verify the efficacy of our proposed MIM-N and MIM-S memory recurrent modules, we divide the output of MIM-S by the previous cell state to obtain a “pseudo forget gate”. Most of the forget gates in PredRNN are saturated, whereas a much smaller share in the MIM model are saturated.
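The two quantities used in this analysis can be sketched as follows. The threshold value `0.9`, the small-denominator guard `eps`, and both function names are hypothetical choices made for this example, not values taken from the experiments.

```python
def saturation_ratio(gate_values, threshold=0.9):
    """Fraction of gate activations strictly above the (hypothetical) threshold."""
    values = list(gate_values)
    return sum(v > threshold for v in values) / len(values)


def pseudo_forget_gate(mim_s_out, c_prev, eps=1e-8):
    """Element-wise ratio of the MIM-S output to the previous cell state."""
    return [t / c if abs(c) > eps else 0.0 for t, c in zip(mim_s_out, c_prev)]


ratio = saturation_ratio([0.95, 0.99, 0.50, 0.91])
pseudo = pseudo_forget_gate([0.5, 1.0], [1.0, 4.0])
```

Because the MIM-S output takes the place of the product of the forget gate and the previous cell state, dividing it by the previous cell state gives a quantity that can be compared with an ordinary forget gate.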
We investigate the underlying non-stationarity that forms the main obstacle in spatiotemporal prediction. RNNs are powerful for modeling difference-stationary sequences, while the ARIMA models are good at modeling low-dimensional time series non-stationarity. This paper enables non-stationary modeling in spacetime by proposing a new recurrent neural network to exploit the differential information between adjacent recurrent states. A Memory In Memory (MIM) block is derived to model the complicated variations, which uses two cascaded recurrent modules to handle the non-stationary and approximately stationary components in the spatiotemporal dynamics. MIM achieves the state-of-the-art prediction performance on three datasets: a synthetic dataset with moving digits, a real dataset with traffic flows and another real dataset with quickly evolving radar echoes.
- [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
- [2] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- [3] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS, pages 1171–1179, 2015.
- [4] G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung. Time series analysis: forecasting and control. John Wiley & Sons, 2015.
- [5] H. Cramér. On some classes of nonstationary stochastic processes. In Proceedings of the Fourth Berkeley symposium on mathematical statistics and probability, volume 2, pages 57–78. University of California Press, Berkeley and Los Angeles, 1961.
- [6] B. De Brabandere, X. Jia, T. Tuytelaars, and L. Van Gool. Dynamic filter networks. In NIPS, 2016.
- [7] E. Denton and R. Fergus. Stochastic video generation with a learned prior. In ICML, pages 1174–1183, 2018.
- [8] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a laplacian pyramid of adversarial networks. In NIPS, pages 1486–1494, 2015.
- [9] E. L. Denton et al. Unsupervised learning of disentangled representations from video. In NIPS, pages 4414–4423, 2017.
- [10] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In NIPS, 2016.
- [11] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. NIPS, 3:2672–2680, 2014.
- [12] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- [13] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
- [14] N. Kalchbrenner, A. v. d. Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu. Video pixel networks. In ICML, 2017.
- [15] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
- [16] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- [17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
- [18] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018.
- [19] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.
- [20] M. Oliu, J. Selva, and S. Escalera. Folded recurrent neural networks for future video prediction. In ECCV, September 2018.
- [21] V. Patraucean, A. Handa, and R. Cipolla. Spatio-temporal video autoencoder with differentiable memory. In ICLR Workshop, 2016.
- [22] D. B. Percival and A. T. Walden. Spectral Analysis for Physical Applications. Cambridge University Press, 1993.
- [23] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In NIPS, pages 802–810, 2015.
- [24] X. Shi, Z. Gao, L. Lausen, H. Wang, D.-Y. Yeung, W.-k. Wong, and W.-c. Woo. Deep learning for precipitation nowcasting: A benchmark and a new model. In NIPS, 2017.
- [25] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using lstms. In ICML, 2015.
- [26] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. NIPS, 4:3104–3112, 2014.
- [27] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. Mocogan: Decomposing motion and content for video generation. In CVPR, 2018.
- [28] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and content for natural video sequence prediction. In ICLR, 2017.
- [29] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In NIPS, 2016.
- [30] Y. Wang, Z. Gao, M. Long, J. Wang, and P. S. Yu. PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. In ICML, pages 5123–5132, 2018.
- [31] Y. Wang, M. Long, J. Wang, Z. Gao, and P. S. Yu. PredRNN: Recurrent neural networks for predictive learning using spatiotemporal lstms. In NIPS, pages 879–888, 2017.
- [32] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 13(4):600, 2004.
- [33] N. Wichers, R. Villegas, D. Erhan, and H. Lee. Hierarchical long-term video prediction without supervision. In ICML, 2018.
- [34] Z. Xu, Y. Wang, M. Long, and J. Wang. PredCNN: Predictive learning with cascade convolutions. In IJCAI, pages 2940–2947, 2018.
- [35] T. Xue, J. Wu, K. Bouman, and B. Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In NIPS, pages 91–99, 2016.
- [36] J. Zhang, Y. Zheng, and D. Qi. Deep spatio-temporal residual networks for citywide crowd flows prediction. In AAAI, pages 1655–1661, 2017.
- [37] J. Zhang, Y. Zheng, D. Qi, R. Li, and X. Yi. Dnn-based prediction model for spatio-temporal data. In ACM SIGSPATIAL, page 92. ACM, 2016.