Memory In Memory: A Predictive Neural Network for Learning Higher-Order Non-Stationarity from Spatiotemporal Dynamics

Natural spatiotemporal processes can be highly non-stationary in many ways, e.g. the low-level non-stationarity such as spatial correlations or temporal dependencies of local pixel values, and the high-level variations such as the accumulation, deformation or dissipation of radar echoes in precipitation forecasting. According to Cramér's decomposition, any non-stationary process can be decomposed into deterministic, time-variant polynomials, plus a zero-mean stochastic term. By applying differencing operations appropriately, we may turn time-variant polynomials into a constant, making the deterministic component predictable. However, most previous recurrent neural networks for spatiotemporal prediction do not use the differential signals effectively, and their relatively simple state transition functions prevent them from learning highly complicated variations in spacetime. We propose the Memory In Memory (MIM) networks and corresponding recurrent blocks for this purpose. The MIM blocks exploit the differential signals between adjacent recurrent states to model the non-stationary and approximately stationary properties in spatiotemporal dynamics with two cascaded, self-renewed memory modules. By stacking multiple MIM blocks, we could potentially handle higher-order non-stationarity. The MIM networks achieve state-of-the-art results on three spatiotemporal prediction tasks across both synthetic and real-world datasets. We believe that the general idea of this work can potentially be applied to other time-series forecasting tasks.



1 Introduction

Figure 1: First row: consecutive radar maps where the whiter pixels show higher precipitation probability. Second, third, last row: pixel value distributions, means and standard deviations for corresponding local regions identified by bounding boxes of different colors. Note that different regions have different variation trends, making the prediction problem extremely challenging.

Natural spatiotemporal processes exhibit complex non-stationarity in both space and time, where neighboring pixels exhibit local dependencies, and their joint distributions are changing over time. Learning higher-order properties underlying the spatiotemporal non-stationarity is particularly significant for many video prediction tasks. Examples include modeling highly complicated real-world systems such as traffic flows [36, 34] and weather conditions [23, 31]. A well-performing predictive model is expected to learn the intrinsic variations in consecutive spatiotemporal context, which can be seen as a combination of the stationary component and the deterministic non-stationary component.

A great challenge in non-stationary spatiotemporal prediction is how to effectively capture higher-order trends regarding each pixel and its local area. For example, when forecasting precipitation, one should carefully consider the complicated and diverse local trends on the evolving radar maps, as shown in Figure 1. But this problem is extremely difficult due to the complicated non-stationarity in both space and time. Most previous work handles trend-like non-stationarity with recursions of CNNs [36, 34] or relatively simple state transitions in RNNs [23, 31]. The lack of non-stationary modeling capability prevents reasoning about uncertainties in spatiotemporal dynamics and partially leads to the blurry effect of the predicted frames.

We attempt to resolve this problem by proposing a generic RNN architecture that is more effective in non-stationarity modeling. We find that though the forget gates in the recurrent predictive models could deliver, select, and discard information in the process of memory state transitions, they are too simple to capture higher-order non-stationary trends in high-dimensional time series. In particular, the forget gates in the recent PredRNN model [31] do not work appropriately on precipitation forecasting: most of them are saturated over all time stamps, implying almost time-invariant memory state transitions. In other words, future frames are predicted by approximately linear extrapolations.

In this paper, we focus on improving the memory transition functions of RNNs. Most statistical forecasting methods in classic time series analysis assume that the non-stationary trends can be rendered approximately stationary by performing suitable transformations such as differencing. We introduce this idea to RNNs and propose a new RNN building block named Memory In Memory (MIM), which leverages the differential information between neighboring hidden states in the recurrent paths. MIM can be viewed as an improved version of LSTM [12], whose forget gate is replaced by another two embedded long short-term memories.

MIM has the following characteristics: (1) It creates a unified modeling for the spatiotemporal non-stationarity by differencing neighboring hidden states rather than raw images. (2) By stacking multiple MIM blocks, our model has a chance to gradually stationalize the spatiotemporal process and make it more predictable. (3) Over-differencing is harmful to time series prediction, as it may lead to a loss of information. This is another reason that we apply differencing in memory transitions rather than in all recurrent signals, e.g. the input gate and the input modulation gate. (4) MIM has one memory cell adopted from LSTMs as well as two additional recurrent modules with their own memories embedded in the transition path of the first memory. We use these modules to respectively model the higher-order non-stationary and approximately stationary components of the spatiotemporal dynamics. The proposed MIM networks achieve state-of-the-art results on multiple prediction tasks, including a synthetic dataset, a real traffic flow dataset, and a precipitation forecasting dataset.

2 Related Work

2.1 ARIMA Models for Time Series Forecasting

Our model is inspired by the Autoregressive Integrated Moving Average (ARIMA) models. A time-series random variable whose power spectrum remains constant over time can be viewed as a combination of signal and noise. An ARIMA model, like a “filter”, aims to separate the signal from the noise. The obtained signal is then extrapolated into the future. In theory, it tackles the time series forecasting problem by transforming the non-stationary process into a stationary one through differencing.
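The effect of differencing on a polynomial trend can be illustrated with a small NumPy sketch. The series below is hypothetical (a quadratic trend plus zero-mean Gaussian noise, chosen only for illustration): differencing the quadratic trend twice leaves a constant, and the noisy series loses its trend in the same way.

```python
import numpy as np

# Hypothetical series: a quadratic deterministic trend plus zero-mean noise.
rng = np.random.default_rng(0)
t = np.arange(200, dtype=float)
trend = 0.5 * t**2 - 3.0 * t + 10.0
series = trend + rng.normal(scale=1.0, size=t.shape)


def difference(x, order):
    """Apply first-order differencing `order` times."""
    return np.diff(x, n=order)


# Differencing the noise-free quadratic twice leaves the constant 2 * 0.5 = 1.0.
trend_d2 = difference(trend, 2)

# After two differences the noisy series fluctuates around that same constant.
series_d2 = difference(series, 2)
print(trend_d2[:3], series_d2.mean())
```

Each differencing step lowers the polynomial degree of the trend by one, which is why a degree-d trend needs d differences; differencing further only amplifies the noise term.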


2.2 Deterministic Spatiotemporal Prediction

Spatiotemporal non-stationary processes are more complicated, as the joint distribution of neighboring pixel values is varying in both space and time. Like low-dimensional time series, they can also be decomposed into deterministic and stochastic components. Recent work in neural networks explored spatiotemporal prediction from these two aspects.

CNNs [17] and RNNs [26] have been widely used for learning the deterministic spatial correlations and temporal dependencies from videos. Srivastava et al. [25] introduced the sequence-to-sequence LSTM network from language modeling to video prediction. But this model can only capture temporal variations. To learn spatial and temporal variations in a unified network structure, Shi et al. [23] integrated the convolution operator into recurrent state transition functions, and proposed the Convolutional LSTM. Finn et al. [10] developed an action-conditioned video prediction model that can be further used in robotics planning when combined with the model predictive control methods. Villegas et al. [28] and Patraucean et al. [21] presented recurrent models based on the convolutional LSTM that leverage optical flow guided features. Kalchbrenner et al. [14] proposed the Video Pixel Network (VPN) that encodes the time, space, and color structures of videos as a four-dimensional dependency chain. It achieves sharp prediction results but suffers from high computational complexity. Wang et al. [31, 30] extended the convolutional LSTM with zigzag memory flows, which provide a great modeling capability for short-term video dynamics. Adversarial learning [11, 8] has been increasingly used in video generation or prediction [19, 29, 9, 27, 33], as it aims to solve the multi-modal training difficulty of future prediction and helps generate less blurry frames.

However, the high-order non-stationarity of video dynamics has not been thoroughly considered by the above work, whose temporal transition methods are relatively simple, either controlled by the recurrent gate structures, or implemented by the recursion of the feed-forward network. By contrast, our model is characterized by exploiting high-order differencing to mitigate the non-stationary learning difficulty.

2.3 Stochastic Spatiotemporal Prediction

Some recent methods [35, 7, 18] attempted to model the stochastic component of video dynamics using the Variational Autoencoder [16]. These methods increase the prediction diversity, but are difficult to evaluate and must be run many times to obtain a satisfactory result. In this paper, we focus on the deterministic part of spatiotemporal non-stationarity. More specifically, this work attempts to stationalize the complicated spatiotemporal processes and make their deterministic components in the future more predictable by proposing a new RNN architecture for non-stationarity.

3 Preliminaries

In theory, the proposed Memory In Memory state transition mechanism for non-stationary modeling can be integrated into all LSTM-like units. In spatiotemporal prediction, we choose the Spatiotemporal LSTM (ST-LSTM) [31] as our base network for a trade-off between prediction accuracy and computation simplicity. ST-LSTM is characterized by a dual-memory structure: the temporal memory is adopted from the Convolutional LSTM [23], and the so-called spatiotemporal memory is updated along a zigzag direction. This network topology is illustrated by black arrows in Figure 4.

The structure of ST-LSTM is shown in Figure 2 (left). Accordingly, each unit has four inputs in Equation (1): the layer input, which is either the input frame for the first layer or the hidden states output by the previous layer for the higher layers; the hidden states and the memory states from the previous time stamp; and the spatiotemporal memory states, which come either from the top layer at the previous time stamp or from the previous layer at the current time stamp. All states are represented by tensors, where the first dimension is the number of their channels, and the following two dimensions denote the width and height of feature maps. The output of a unit at a given time stamp and layer is determined by the spatiotemporal memory from the previous layer, as well as the temporal memory from the previous time stamp.

Equation (1) is built from the sigmoid function, the convolution operator, and the Hadamard product. The input gate, input modulation gate, forget gate and output gate control the spatiotemporal information flow. The biggest highlight of ST-LSTM is its zigzag memory flow. It provides a great modeling capability of the short-term trends in longer pathways through the vertical layers. However, it also suffers from the problem of blurry predictions as it still uses the simple forget gate inherited from previous methods. The extremely complex non-stationarity cannot be fully captured by such simple temporal transitions.
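A minimal NumPy sketch of one dual-memory update of this kind follows. It is an illustration under stated assumptions, not the reference implementation: the gate wiring follows the usual ST-LSTM pattern described in PredRNN [31], biases are omitted, 1x1 convolutions stand in for the spatial convolutions, the input is assumed to have the same number of channels as the states, and all weight names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def conv1x1(w, x):
    """1x1 convolution: mix the channels of x (C_in, H, W) with w (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)


def init_params(channels, rng):
    """Random weights; the names are illustrative, not taken from the paper."""
    names = ["xg", "hg", "xi", "hi", "xf", "hf",        # temporal memory gates
             "xg_m", "mg", "xi_m", "mi", "xf_m", "mf",  # spatiotemporal memory gates
             "xo", "ho", "co", "mo"]                    # output gate
    p = {k: rng.normal(scale=0.1, size=(channels, channels)) for k in names}
    p["fuse"] = rng.normal(scale=0.1, size=(channels, 2 * channels))
    return p


def st_lstm_step(p, x, h_prev, c_prev, m_in):
    """One update; every tensor has shape (channels, height, width)."""
    # Temporal memory: the simple forget gate f scales the previous memory.
    g = np.tanh(conv1x1(p["xg"], x) + conv1x1(p["hg"], h_prev))
    i = sigmoid(conv1x1(p["xi"], x) + conv1x1(p["hi"], h_prev))
    f = sigmoid(conv1x1(p["xf"], x) + conv1x1(p["hf"], h_prev))
    c = f * c_prev + i * g
    # Spatiotemporal memory: same gating pattern, driven by the zigzag memory.
    g_m = np.tanh(conv1x1(p["xg_m"], x) + conv1x1(p["mg"], m_in))
    i_m = sigmoid(conv1x1(p["xi_m"], x) + conv1x1(p["mi"], m_in))
    f_m = sigmoid(conv1x1(p["xf_m"], x) + conv1x1(p["mf"], m_in))
    m = f_m * m_in + i_m * g_m
    # The output gate sees both memories; the hidden state fuses them.
    o = sigmoid(conv1x1(p["xo"], x) + conv1x1(p["ho"], h_prev)
                + conv1x1(p["co"], c) + conv1x1(p["mo"], m))
    h = o * np.tanh(conv1x1(p["fuse"], np.concatenate([c, m], axis=0)))
    return h, c, m


rng = np.random.default_rng(0)
channels, height, width = 4, 8, 8
params = init_params(channels, rng)
x = rng.normal(size=(channels, height, width))
zeros = np.zeros((channels, height, width))
h, c, m = st_lstm_step(params, x, zeros, zeros, zeros)
print(h.shape, c.shape, m.shape)
```

The line `c = f * c_prev + i * g` is the temporal transition discussed above: when `f` saturates, the previous memory is carried forward almost unchanged.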

4 Methods

As mentioned above, the spatiotemporal non-stationarity remains under-explored and its differential features have not been fully exploited by previous methods using neural networks. In this section, we first present the Memory In Memory (MIM) blocks for learning about the higher-order non-stationarity from RNN memory transitions. We then discuss a new RNN architecture, which interlinks multiple MIM blocks with diagonal state connections, for modeling the differential information in the spatiotemporal prediction.

4.1 Memory In Memory Blocks

Figure 2: The ST-LSTM block [31] in the left plot and the proposed Memory In Memory (MIM) block in the right plot. MIM is designed to introduce two recurrent modules (yellow squares) to replace the forget gate (dashed box) in ST-LSTM. MIM-N is the non-stationary module and MIM-S is the stationary module. Note that the MIM block cannot be used in the first layer, so its input is the hidden state of the previous layer rather than the input frame.

We observe that the complex dynamics in spatiotemporal sequences can be handled more effectively as a combination of stationary variations and non-stationary variations. Suppose we have a video sequence showing a person walking at a constant speed. The velocity can be seen as a stationary variable and the swing of the legs should be considered as a non-stationary process, which is apparently more difficult to predict. Unfortunately, the forget gate in previous LSTM-like models is a simple gating structure that struggles to capture the non-stationary variations in spacetime. In preliminary experiments, we find that the majority of forget gates in the recent PredRNN model [31] are saturated, implying that the units always remember stationary variations.

Figure 3: The non-stationary module (MIM-N) and the stationary module (MIM-S), which are interlinked in a cascaded structure in the MIM block. Non-stationarity is modeled by differencing.

The Memory In Memory (MIM) block is inspired by the idea of modeling the non-stationary variations using a series of cascaded memory transitions instead of the simple, saturation-prone forget gate in ST-LSTM. As compared in Figure 2 (the smaller dashed boxes), two cascaded temporal memory recurrent modules are designed to replace the temporal forget gate in ST-LSTM. The first module, which additionally takes the hidden state of the previous time stamp as input, is used to capture the non-stationary variations based on the differencing between two consecutive hidden representations. So we name it the non-stationary module (shown as MIM-N in Figure 3). It generates differential features based on the difference-stationary assumption [22]. The other recurrent module takes as inputs the output of the MIM-N module and the outer temporal memory to capture the approximately stationary variations in spatiotemporal sequences. So we call it the stationary module (shown as MIM-S in Figure 3). By replacing the forget gate with the final output of the cascaded non-stationary and stationary modules (as shown in Figure 2), the non-stationary dynamics can be captured more effectively.

In the key calculations inside a MIM block, MIM-N and MIM-S denote the non-stationary module and the stationary module respectively; each of the two recurrent modules has its own horizontally-transited memory cell; and the differential features are learned by the MIM-N module and fed into MIM-S.

The cascaded structure enables an end-to-end modeling of different orders of non-stationary dynamics. It is based on the difference-stationary assumption that differencing a non-stationary process multiple times will likely lead to a stationary one [22]. A schematic of MIM-N and MIM-S is presented in Figure 3. In MIM-N, all gates are updated by incorporating the difference between the hidden states at adjacent time stamps, which highlights the non-stationary variations in the spatiotemporal sequence.

MIM-S takes the memory states and the differential features generated by MIM-N as input. The stationary module thus provides a gating mechanism to adaptively decide whether to trust the original memory or the differential features. If the differential features vanish, indicating that the non-stationary dynamics is not prominent, then MIM-S will mainly reuse the original memory. Otherwise, if the differential features are prominent, then MIM-S will overwrite the original memory to focus more on the non-stationary dynamics.
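The cascade of the two modules can be sketched in NumPy as below. This is a simplified illustration, not the paper's exact equations: both modules are written as the same LSTM-style gated memory, biases are omitted, 1x1 convolutions replace the spatial convolutions, and every function and weight name is hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def conv1x1(w, x):
    """1x1 convolution: mix the channels of x (C_in, H, W) with w (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)


def init_module(channels, rng):
    """Weights for one LSTM-style module with two inputs a and b (names illustrative)."""
    names = ["ag", "bg", "ai", "bi", "af", "bf", "ao", "bo", "mo"]
    return {k: rng.normal(scale=0.1, size=(channels, channels)) for k in names}


def gated_memory(p, a, b, mem_prev):
    """LSTM-style update of an inner memory driven by two input tensors."""
    g = np.tanh(conv1x1(p["ag"], a) + conv1x1(p["bg"], b))
    i = sigmoid(conv1x1(p["ai"], a) + conv1x1(p["bi"], b))
    f = sigmoid(conv1x1(p["af"], a) + conv1x1(p["bf"], b))
    mem = f * mem_prev + i * g
    o = sigmoid(conv1x1(p["ao"], a) + conv1x1(p["bo"], b) + conv1x1(p["mo"], mem))
    return o * np.tanh(mem), mem


def mim_transition(p_n, p_s, h_below, h_below_prev, c_prev, n_prev, s_prev):
    """Cascade MIM-N and MIM-S; returns (t_out, d, n, s)."""
    diff = h_below - h_below_prev                     # difference of adjacent hidden states
    d, n = gated_memory(p_n, diff, n_prev, n_prev)    # MIM-N: differential features d
    t_out, s = gated_memory(p_s, d, c_prev, s_prev)   # MIM-S: weighs d against c_prev
    return t_out, d, n, s


rng = np.random.default_rng(0)
shape = (4, 8, 8)
p_n, p_s = init_module(4, rng), init_module(4, rng)
h_before = rng.normal(size=shape)   # hidden state of the layer below, previous time stamp
h_now = rng.normal(size=shape)      # hidden state of the layer below, current time stamp
c_prev = rng.normal(size=shape)
zeros = np.zeros(shape)

t_out, d, n, s = mim_transition(p_n, p_s, h_now, h_before, c_prev, zeros, zeros)
# In the full block, t_out stands where "forget gate * previous memory" stood in ST-LSTM.
print(t_out.shape, d.shape)
```

In this bias-free sketch, identical consecutive hidden states and a zero-initialized MIM-N memory give differential features that are exactly zero, which is the "vanishing" case described above.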

4.2 Memory In Memory Networks

By stacking multiple MIM blocks, our model has a chance to capture higher orders of non-stationarity, gradually stationalize the spatiotemporal process, and make the future sequence more predictable. The key idea of this architecture is to deliver the hidden states needed for generating differential features, so as to best facilitate non-stationarity modeling.

Figure 4: A MIM network with three MIMs and one ST-LSTM. Red arrows: the diagonal state transition paths of the hidden states for differential modeling. Blue arrows: the horizontal transition paths of the memory cells. Black arrows: the zigzag state transition paths of the spatiotemporal memory. Input: the input can be either the ground truth frame for the input sequence, or the generated frame at the previous time stamp. Output: one frame is generated at each time stamp.

A schematic of our proposed diagonal recurrent architecture is shown in Figure 4. We deliver the hidden states of the previous layer at the current and the previous time stamps to each Memory In Memory (MIM) block to generate the differential features for further use. These connections are shown as diagonal arrows in Figure 4. As the first layer doesn’t have any previous layer, we simply use the Spatiotemporal LSTM (ST-LSTM) [31] to generate its hidden representations. Note that the temporal differencing is performed inside MIM by subtracting the previous hidden state from the current one. Compared with differencing neighboring raw images directly, differencing temporally adjacent hidden states can reveal the non-stationarity more evidently, as the spatiotemporal variations in local areas have been encoded into the hidden representations through the bottom ST-LSTM layer.

Another distinctive feature of the MIM networks resides in the horizontal state transition paths. As the MIM blocks have two cascaded temporal memory modules to capture the non-stationary and stationary dynamics respectively, we further deliver the two temporal memories (the non-stationary memory and the stationary memory) along the blue arrows in Figure 4.

The MIM networks generate one frame at each time stamp. The entire model consists of one ST-LSTM in the first layer followed by the stacked MIM blocks. Note that, because the first layer is an ST-LSTM, there is no MIM block indexed as the first layer.


By stacking multiple MIM blocks, we could potentially learn higher-order non-stationarity from spatiotemporal dynamics.
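The state routing of such a network can be sketched as follows. The two cell functions are toy placeholders (hypothetical, chosen only so the example runs); what the sketch is meant to show is the wiring: an ST-LSTM in the first layer, MIM blocks above it that receive the hidden state of the layer below at both the current and the previous time stamp, horizontally carried memories, and a zigzag spatiotemporal memory passed from the top layer to the first layer of the next time stamp. The top hidden state is returned as the output, whereas a real model would map it back to a frame.

```python
import numpy as np


def toy_st_lstm(x, h_prev, c_prev, m_in):
    """Placeholder for an ST-LSTM cell; returns (h, c, m)."""
    c = 0.5 * c_prev + 0.5 * np.tanh(x + h_prev)
    m = 0.5 * m_in + 0.5 * np.tanh(x)
    return np.tanh(c + m), c, m


def toy_mim(h_below, h_below_prev, h_prev, c_prev, m_in, n_prev, s_prev):
    """Placeholder for a MIM block; returns (h, c, m, n, s)."""
    n = 0.5 * n_prev + 0.5 * np.tanh(h_below - h_below_prev)  # non-stationary memory
    s = 0.5 * s_prev + 0.5 * np.tanh(n + c_prev)              # stationary memory
    c = np.tanh(s) + 0.5 * np.tanh(h_below + h_prev)
    m = 0.5 * m_in + 0.5 * np.tanh(h_below)
    return np.tanh(c + m), c, m, n, s


def run_mim_network(frames, num_layers=4):
    """Route states through one ST-LSTM layer and num_layers - 1 MIM layers."""
    shape = frames[0].shape
    h = [np.zeros(shape) for _ in range(num_layers)]
    c = [np.zeros(shape) for _ in range(num_layers)]
    n = [np.zeros(shape) for _ in range(num_layers)]
    s = [np.zeros(shape) for _ in range(num_layers)]
    m_top = np.zeros(shape)  # spatiotemporal memory from the top layer, previous step
    outputs = []
    for x in frames:
        h_prev = list(h)  # hidden states of all layers at the previous time stamp
        h[0], c[0], m = toy_st_lstm(x, h_prev[0], c[0], m_top)
        for l in range(1, num_layers):
            # Diagonal connection: layer l sees h[l-1] now and h_prev[l-1] from before.
            h[l], c[l], m, n[l], s[l] = toy_mim(
                h[l - 1], h_prev[l - 1], h_prev[l], c[l], m, n[l], s[l])
        m_top = m  # zigzag: top-layer memory feeds the first layer at the next step
        outputs.append(h[-1])
    return outputs


frames = [np.full((2, 4, 4), 0.1 * k) for k in range(5)]
outputs = run_mim_network(frames)
print(len(outputs), outputs[-1].shape)
```

At the first time stamp all previous states are zero tensors, so the differencing input of every MIM layer is simply its current lower-layer hidden state.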

5 Experiments

In this section, we perform an extensive evaluation of the proposed Memory In Memory (MIM) approach. For each evaluation dataset, we introduce the dataset and the implementation details on it. Finally, we report the performance of our proposed MIM models and analyze experimental results both qualitatively and quantitatively.

We use three spatiotemporal prediction datasets: a synthetic dataset with moving digits, a real traffic flow dataset and another real radar echo dataset. Some settings are shared across these datasets. Our model has four layers in all experiments, including one ST-LSTM layer as the first layer and three MIMs. The number of feature channels in each MIM block is chosen as a trade-off between prediction accuracy and memory efficiency. We train all of the models with the L2 loss function, using the ADAM optimizer [15] with the same starting learning rate and the same batch size. Note that we extend the layer normalization [2] to 3D tensors and apply it to the MIM and ConvLSTM models, based on the idea that the training process of deep convolutional networks can be stabilized by reducing the covariate shift problem [13]. Note that the MIM blocks in the first time stamp do not have any previous hidden representations as input, so the non-stationary modules take tensors filled with zero as initialization. All experiments are implemented in TensorFlow [1].


5.1 Moving MNIST Dataset


The standard Moving MNIST dataset consists of grayscale sequences of length 20 (10 frames for the inputs and 10 for the predictions), each displaying a pair of digits moving around the image. The sequences are generated by the method described in the work of Srivastava et al. [25], following the experimental settings in PredRNN [31]. Before training the model, we apply max-min normalization to rescale the pixel intensities. In particular, to reduce the training time and memory usage on the Moving MNIST dataset, we reshape each input image into a tensor. By doing this, we significantly reduce the parameters and training time of MIM and the comparison methods, while the performance is affected only marginally. Besides, the scheduled sampling strategy [3] is applied to all of the models to bridge the discrepancy between training and inference.
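The two preprocessing steps can be sketched as below. Both functions are illustrative assumptions: the target range [0, 1] is the usual convention for max-min normalization, and the reshape is written as folding each small spatial patch into the channel axis, which is one common way to trade resolution for channels. The frame size and patch size are hypothetical values, not the paper's.

```python
import numpy as np


def min_max_normalize(frames):
    """Rescale intensities linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = frames.min(), frames.max()
    if hi == lo:
        return np.zeros_like(frames, dtype=float)
    return (frames - lo) / (hi - lo)


def reshape_to_patches(frame, patch):
    """Fold every patch x patch block of a (H, W, C) frame into the channel axis."""
    h, w, c = frame.shape
    assert h % patch == 0 and w % patch == 0, "frame size must be divisible by patch"
    x = frame.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch_row, patch_col, channel)
    return x.reshape(h // patch, w // patch, patch * patch * c)


# Hypothetical 8x8 single-channel frame and patch size 2.
frame = np.arange(64, dtype=float).reshape(8, 8, 1)
normalized = min_max_normalize(frame)
patches = reshape_to_patches(normalized, patch=2)
print(patches.shape)
```

The reshape keeps every pixel value; it only moves spatial neighbors into channels, so the recurrent layers operate on smaller feature maps.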

Model SSIM MSE MAE
FC-LSTM [25] 0.690 118.3 209.4
ConvLSTM [23] 0.707 103.3 182.9
TrajGRU [24] 0.713 106.9 190.1
CDNA [10] 0.721 97.4 175.3
DFN [6] 0.726 89.0 172.8
FRNN [20] 0.813 69.7 150.3
VPN baseline [14] 0.870 64.1 131.0
PredRNN [31] 0.867 56.8 126.1
Causal LSTM [30] 0.898 46.5 106.8
MIM 0.874 52.0 116.5
MIM* 0.910 44.2 101.1
Table 1: A comparison for predicting 10 frames on the standard Moving MNIST dataset. All models have comparable numbers of parameters and are trained with target frames. MIM*: a network using Causal LSTM as the first layer, and integrating the cascaded MIM-S and MIM-N modules into the Causal LSTM memory cells. This result shows that MIM is a generic mechanism for improving recurrent memory transitions.

Model SSIM MSE MAE
MIM (without MIM-N) 0.858 54.4 124.8
MIM (without MIM-S) 0.853 55.7 125.5
MIM 0.874 52.0 116.5
Table 2: Ablation study of the MIM block with either the non-stationary module or the stationary module removed. These experiments also predict 10 frames on the Moving MNIST dataset.


We use the per-frame structural similarity index measure (SSIM) [32], the mean square error (MSE) and the mean absolute error (MAE) to evaluate our models. A lower MSE or MAE, or a higher SSIM indicates a better prediction.
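The two error metrics can be sketched per frame as below. Averaging over pixels is an assumption made here for illustration; the text does not state whether the reported values are averaged or summed over pixels. SSIM is left out of the sketch because it relies on windowed local statistics.

```python
import numpy as np


def per_frame_mse(pred, truth):
    """Squared error averaged over pixels; inputs have shape (T, H, W)."""
    return ((pred - truth) ** 2).mean(axis=(1, 2))


def per_frame_mae(pred, truth):
    """Absolute error averaged over pixels; inputs have shape (T, H, W)."""
    return np.abs(pred - truth).mean(axis=(1, 2))


# Hypothetical two-frame example: the errors are 1 everywhere, then 2 everywhere.
pred = np.zeros((2, 2, 2))
truth = np.stack([np.ones((2, 2)), 2.0 * np.ones((2, 2))])
mse = per_frame_mse(pred, truth)
mae = per_frame_mae(pred, truth)
print(mse, mae)
```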

As shown in Table 1, our proposed MIM model approaches the state-of-the-art results on the standard Moving MNIST dataset. In particular, we construct another model named MIM* by using Causal LSTM [30] as the first layer, and integrating the cascaded MIM-S and MIM-N modules into the Causal LSTM memory cells, using them to replace the temporal forget gate in Causal LSTMs. This result shows that the memory in memory mechanism is not specifically designed for the ST-LSTM; instead, it is a generic mechanism for improving RNN memory transitions. In other parts of this paper, we use ST-LSTM as our base structure for a trade-off between prediction accuracy and computational complexity. Still, we can see that MIM performs better than its ST-LSTM (PredRNN) baseline, while MIM* also performs better than its Causal LSTM baseline.

We also design two ablation experiments, by respectively removing the stationary modules or non-stationary modules, to verify the necessity of cascading inner recurrent modules. As illustrated in Table 2, the MIM network without MIM-N works slightly better than that without MIM-S. Also, each of them improves over the PredRNN model in MSE and MAE (Table 1). When MIM-N and MIM-S are interlinked, the entire MIM model achieves the best performance, showing the necessity of cascading them in a unified network.

Figure 5: Prediction examples on the standard Moving MNIST. All models predict 10 frames into the future by observing 10 previous frames. The output frames are shown at two-frame intervals.

We visualize a sequence of predicted frames on the standard Moving MNIST test set in Figure 5. This example is challenging, as severe occlusions exist near the junction of the input sequence and the output sequence. The occlusions can be viewed as an information bottleneck, in which the mean and variance of the spatiotemporal process change drastically, indicating the presence of a high-order non-stationarity. Still, the generated images of MIM are more satisfactory and less blurry than those of other models. In fact, we cannot even recognize the digits in the last frames generated by other models. We may conclude that MIM shows more capability in capturing complicated non-stationary variations.

5.2 TaxiBJ Traffic Flow Dataset

Traffic flows are collected from the chaotic real-world environment. Apparently, traffic conditions will not vary uniformly over time, and there are strong temporal dependencies between the traffic conditions at neighboring time stamps.


TaxiBJ contains traffic flow images collected consecutively from the GPS monitors of taxicabs in Beijing. Each frame in TaxiBJ is a grid image with two channels, which represent the traffic flow entering and leaving the same district at this time. We generate sequences from the TaxiBJ dataset and split the whole dataset into a training set and a test set as described in the work of Zhang et al. [37]. Each sequence contains 8 consecutive frames, 4 for the inputs and 4 for the predictions. The frames are also rescaled and reshaped as described above.


The experiment settings on the TaxiBJ dataset are adopted from ST-ResNet [36], which yields the previous state-of-the-art results on this dataset. ST-ResNet only predicts one frame in one pass due to its non-recurrent structure; thus, it generates sequence outputs in a recursive manner. We show the quantitative results in Table 3 and the qualitative results in Figure 6. To make the comparisons conspicuous, we also visualize the difference between the predictions and the ground truth images. MIM shows the best performance in all predicted frames among all compared models, with the lowest difference intensities.

Model Frame 1 Frame 2 Frame 3 Frame 4
ST-ResNet [36] 0.688 0.939 1.130 1.288
VPN [14] 0.744 1.031 1.251 1.444
FRNN [20] 0.682 0.823 0.989 1.183
PredRNN [31] 0.634 0.934 1.047 1.263
Causal LSTM [30] 0.641 0.855 0.979 1.158
MIM 0.554 0.737 0.887 0.999
Table 3: Per-frame MSE on the TaxiBJ dataset. All compared models take 4 historical traffic flow images as inputs, and predict the next 4 images (traffic conditions for the next two hours).
Figure 6: Prediction examples on TaxiBJ. We also visualize the difference for easy observations (GT: Ground Truth, P: Prediction).

5.3 Radar Echo Dataset


The radar echo dataset contains evolving radar maps that were collected at regular intervals from May 1st, 2014 to June 30th, 2014. Each frame is a single grid image covering a fixed area. We process the data in the same way as for the TaxiBJ dataset.


We first use the image-level MSE, averaged over the generated images covering the next hour, to evaluate the compared models, as shown in Table 4. We then convert pixel intensities to radar echo values in dBZ, and choose 30 dBZ, 40 dBZ and 50 dBZ as the thresholds to calculate the hits (prediction = 1, truth = 1), misses (prediction = 0, truth = 1) and false alarms (prediction = 1, truth = 0). The critical success index (CSI) is a skill score defined as CSI = hits / (hits + misses + false alarms) [23]. A higher CSI denotes a better prediction. MIM consistently outperforms other models in both MSE and CSIs. Figure 7 shows the frame-wise comparisons over future time stamps. As the number of predicted frames grows, the results of MIM get better, indicating that our model could improve the forecasting outcomes by better capturing underlying, deterministic non-stationarity. Although all of these metrics are widely used in real precipitation forecasting applications, the CSIs at 40 dBZ and 50 dBZ are more important than the others, as they indicate the likelihood of severe weather. But due to the long tail effect, predicting high-intensity radar echoes is non-trivial. Still, we can see that the proposed MIM model performs the best even for these two challenging metrics.
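The CSI computation can be sketched as follows, using the standard definition hits / (hits + misses + false alarms). The inputs are assumed to be already converted to dBZ, and treating a value equal to the threshold as a positive is an assumption of this sketch; the function name and the toy values are hypothetical.

```python
import numpy as np


def critical_success_index(pred_dbz, truth_dbz, threshold):
    """CSI at one dBZ threshold; returns NaN when no event is predicted or observed."""
    pred = pred_dbz >= threshold
    truth = truth_dbz >= threshold
    hits = np.sum(pred & truth)            # prediction = 1, truth = 1
    misses = np.sum(~pred & truth)         # prediction = 0, truth = 1
    false_alarms = np.sum(pred & ~truth)   # prediction = 1, truth = 0
    denom = hits + misses + false_alarms
    return float(hits / denom) if denom > 0 else float("nan")


# Toy 2x2 maps: one hit, one miss and one false alarm at the 30 dBZ threshold.
pred_dbz = np.array([[35.0, 10.0], [45.0, 20.0]])
truth_dbz = np.array([[32.0, 31.0], [10.0, 5.0]])
csi_30 = critical_success_index(pred_dbz, truth_dbz, threshold=30.0)
print(csi_30)
```

Correct negatives do not enter the score, which is why CSI stays informative for rare, high-intensity echoes where almost every pixel is below the threshold.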

Model MSE CSI-30 CSI-40 CSI-50
FRNN [20] 52.5 0.254 0.203 0.163
PredRNN [31] 31.8 0.401 0.378 0.306
Causal LSTM [30] 29.8 0.362 0.331 0.251
MIM 27.8 0.429 0.399 0.317
Table 4: A comparison for predicting frames on the subsets of the radar dataset. All of the models are also trained with target frames and made to predict future frames at test time.
Figure 7: Frame-wise comparisons for the forecasting results regarding the next radar maps. Panels: (a) MSE, (b) CSI-30, (c) CSI-40, (d) CSI-50. Lower curves of MSE or higher curves of CSI indicate better results.
Figure 8: An example of radar echo forecasting for the next hour.
Figure 9: Saturation rates of the forget gates in PredRNN and the “pseudo forget gates” in MIM on the radar dataset.

We also visualize the generated radar maps to show our model’s performance, as illustrated in Figure 8. We can see that the evolution of radar echoes is a highly non-stationary process. The accumulation, deformation and dissipation of the radar echoes are happening at every moment. In this sequence, the echoes in the bottom left corner become larger while the echoes in the upper right corner become smaller. Incorrectly, the PredRNN model thinks all the echoes are getting larger, and Causal LSTM thinks the echoes will stay still. Only MIM captures the correct trends.

To show that the original forget gates do not work in an appropriate way, we calculate the percentage of forget gate values in the PredRNN model that exceed a saturation threshold (Figure 9). To further verify the efficacy of our proposed MIM-N and MIM-S memory recurrent modules, we divide the outputs of MIM-S by the previous cell state to get a “pseudo forget gate”. Most of the forget gates in PredRNN are saturated, while a much smaller share of the pseudo forget gates in the MIM model are saturated.
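The saturation measurement can be sketched as below. The threshold is passed in as a parameter because its value is not given in the text above; the 0.9 used in the example is a hypothetical choice, and the small-denominator guard in the division is likewise an assumption added for numerical safety.

```python
import numpy as np


def saturation_rate(gate_values, threshold):
    """Fraction of gate values that exceed the saturation threshold."""
    return float(np.mean(gate_values > threshold))


def pseudo_forget_gate(mim_s_output, c_prev, eps=1e-8):
    """Element-wise ratio of the MIM-S output to the previous cell state."""
    safe = np.where(np.abs(c_prev) < eps, eps, c_prev)  # avoid division by zero
    return mim_s_output / safe


# Hypothetical gate values and states, for illustration only.
gates = np.array([0.95, 0.50, 0.99, 0.10])
rate = saturation_rate(gates, threshold=0.9)

pseudo = pseudo_forget_gate(np.array([0.5, 1.0]), np.array([1.0, 2.0]))
print(rate, pseudo)
```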

6 Conclusions

We investigate the underlying non-stationarity that forms the main obstacle in spatiotemporal prediction. RNNs are powerful for modeling difference-stationary sequences, while the ARIMA models are good at modeling low-dimensional time series non-stationarity. This paper enables non-stationary modeling in spacetime by proposing a new recurrent neural network to exploit the differential information between adjacent recurrent states. A Memory In Memory (MIM) block is derived to model the complicated variations, which uses two cascaded recurrent modules to handle the non-stationary and approximately stationary components in the spatiotemporal dynamics. MIM achieves the state-of-the-art prediction performance on three datasets: a synthetic dataset with moving digits, a real dataset with traffic flows and another real dataset with quickly evolving radar echoes.


  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.

    Software available from
  • [2] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • [3] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS, pages 1171–1179, 2015.
  • [4] G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung. Time series analysis: forecasting and control. John Wiley & Sons, 2015.
  • [5] H. Cramér. On some classes of nonstationary stochastic processes. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 2, pages 57–78. University of California Press, Berkeley and Los Angeles, 1961.
  • [6] B. De Brabandere, X. Jia, T. Tuytelaars, and L. Van Gool. Dynamic filter networks. In NIPS, 2016.
  • [7] E. Denton and R. Fergus. Stochastic video generation with a learned prior. In ICML, pages 1174–1183, 2018.
  • [8] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, pages 1486–1494, 2015.
  • [9] E. L. Denton and V. Birodkar. Unsupervised learning of disentangled representations from video. In NIPS, pages 4414–4423, 2017.
  • [10] C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In NIPS, 2016.
  • [11] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. In NIPS, pages 2672–2680, 2014.
  • [12] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [13] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
  • [14] N. Kalchbrenner, A. v. d. Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu. Video pixel networks. In ICML, 2017.
  • [15] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [16] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
  • [18] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn, and S. Levine. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018.
  • [19] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. In ICLR, 2016.
  • [20] M. Oliu, J. Selva, and S. Escalera. Folded recurrent neural networks for future video prediction. In ECCV, September 2018.
  • [21] V. Patraucean, A. Handa, and R. Cipolla. Spatio-temporal video autoencoder with differentiable memory. In ICLR Workshop, 2016.
  • [22] D. B. Percival and A. T. Walden. Spectral Analysis for Physical Applications. Cambridge University Press, 1993.
  • [23] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-C. Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS, pages 802–810, 2015.
  • [24] X. Shi, Z. Gao, L. Lausen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-C. Woo. Deep learning for precipitation nowcasting: A benchmark and a new model. In NIPS, 2017.
  • [25] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. In ICML, 2015.
  • [26] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112, 2014.
  • [27] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz. MoCoGAN: Decomposing motion and content for video generation. In CVPR, 2018.
  • [28] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee. Decomposing motion and content for natural video sequence prediction. In ICLR, 2017.
  • [29] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In NIPS, 2016.
  • [30] Y. Wang, Z. Gao, M. Long, J. Wang, and P. S. Yu. PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. In ICML, pages 5123–5132, 2018.
  • [31] Y. Wang, M. Long, J. Wang, Z. Gao, and P. S. Yu. PredRNN: Recurrent neural networks for predictive learning using spatiotemporal LSTMs. In NIPS, pages 879–888, 2017.
  • [32] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 13(4):600–612, 2004.
  • [33] N. Wichers, R. Villegas, D. Erhan, and H. Lee. Hierarchical long-term video prediction without supervision. In ICML, 2018.
  • [34] Z. Xu, Y. Wang, M. Long, and J. Wang. PredCNN: Predictive learning with cascade convolutions. In IJCAI, pages 2940–2947, 2018.
  • [35] T. Xue, J. Wu, K. Bouman, and B. Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In NIPS, pages 91–99, 2016.
  • [36] J. Zhang, Y. Zheng, and D. Qi. Deep spatio-temporal residual networks for citywide crowd flows prediction. In AAAI, pages 1655–1661, 2017.
  • [37] J. Zhang, Y. Zheng, D. Qi, R. Li, and X. Yi. DNN-based prediction model for spatio-temporal data. In ACM SIGSPATIAL, page 92. ACM, 2016.