Video prediction is a crucial problem in Computer Vision that aims to anticipate future frames conditioned on an observed video clip. It is a challenging task, especially for long-term prediction, due to the drastic evolution of the pixel space and the inherent uncertainty of the future. Tremendous progress has been made through the introduction of Long Short-Term Memory (LSTM) models and their variants, and the current trend is to focus mainly on designing sophisticated architectures to exploit spatio-temporal clues. However, none of the existing approaches has paid particular attention to the prediction space used in the video prediction task.
In this paper, we first carry out an in-depth investigation of existing work from the neglected prediction-space perspective and reveal that a simple but effective strategy of utilizing multiple prediction spaces can bring surprisingly large gains over recent leading work. For instance, we show that our model based on four prediction spaces achieves an MSE of 17.2 and an MAE of 47.7 on the existing benchmark Moving MNIST-2, improving upon PhyDNet. We demonstrate an example of the comparison results in Figure 1, from which we can observe that our approach correctly predicts the movement of the two digits.
The feature space used as the input and output of the recurrent prediction process is referred to as the prediction space. According to the prediction space used, existing research can be broadly categorized into two types. The first type of approaches conducts prediction directly in the pixel space [41, 36, 34, 37, 35, 31, 39], where the recurrent models directly predict the evolution of pixels and recursively generate future frames. Despite their advantages in modelling sequence data, these methods present two main limitations: first, they are limited to pixel-to-pixel prediction and consequently have difficulty retaining the semantic and structural information in videos, e.g. shape and motion; second, errors in recurrent models tend to accumulate, so small discrepancies at the beginning are amplified into serious compound errors and distorted structures in the long term.
Another group of trending works tackles these limitations by first predicting the video in a high-level feature space and then translating it back into pixel-level frames [7, 33, 32, 42, 4]. Researchers have proposed different frameworks to first predict the instance segmentation, motion, human pose, object trajectory, or key points in the video, and then translate them back into the pixel space. In this way, the video prediction task is simplified and the content in the video is better preserved, leading to lower pixel-level error thanks to more accurate content. However, the reduction and recovery processes between the pixel-level space and the high-level feature space usually introduce intractable errors.
Although numerous techniques have been introduced so far, most existing work restricts the prediction task to a single feature space. To the best of our knowledge, performing video prediction simultaneously in multiple prediction spaces has never been investigated in the literature. We believe that the different levels of feature spaces can be complementary to each other.
In light of that, our key insight is to conduct the prediction task in multiple feature spaces and recover the final frame by assembling all the prediction results. In this paper, we systematically investigate the prediction performance in different feature spaces and propose an architecture with Recurrent Connections. Specifically, the input video is downsampled through convolutional layers to obtain different feature spaces. After that, an LSTM-based module is employed to predict the feature sequences at different scales. The final frames are synthesised by fusing all the results together.
Our main contributions can be summarized as follows:
To our knowledge, we are the first to identify the previously overlooked multi-prediction-space perspective and propose a novel Recurrent Connection to address the video prediction problem.
The experimental results show that our model introduces considerable improvements over state-of-the-art works on various challenging video datasets. Furthermore, the generalization of our idea is validated by introducing the same strategy on different architectures.
We thoroughly analyze various ways to build the recurrent connection and fuse the prediction results together from multi-spaces. A series of comprehensive experiments are conducted to provide a variety of insights for the utilization of multi-prediction spaces.
2 Related Works
We review the current approaches to the video prediction task from the perspective of the prediction space being used. They generally adopt one of two types: the pixel space or a high-level feature space.
Pixel Space. Early work constructed a Recurrent Neural Network (RNN) model to predict the next frames. Inspired by architectures used for language processing, Srivastava et al. introduced the sequence-to-sequence model to video prediction. Since this kind of model can only capture temporal variations, Shi et al. integrated the convolutional operator into recurrent state transition functions and proposed the Convolutional LSTM, jointly modelling spatial and temporal variations in a unified network structure. However, in their approach, the LSTM cells in different layers are independent of each other, so information from the top layer cannot flow into the bottom layer at the next time step. To address this issue, [17, 6, 13, 28, 36, 34, 37] further extended the convolutional LSTM model and investigated spatio-temporal prediction. All of these models focus on modifying the LSTM cells to capture more spatial and temporal clues.
Furthermore, a wider variety of strategies has been explored. Under the assumption that video sequences are symmetric, Kwon et al. implemented a retrospective prediction network by training a generator for both forward and backward predictions. Hu et al. presented a novel cycle-consistency loss function incorporated within their framework for more accurate predictions. In addition, several other studies [21, 8, 22] further explored the benefits of forward and backward predictions. Another typical way of predicting video frames is to use 3D convolutions. Wang et al. presented a gate-controlled self-attention module that effectively manages historical memory records across multiple time steps.
Overall, the performance of these models still suffers from significant degradation in the long-term prediction task. The high dimensionality of the pixel space causes the prediction error to grow exponentially as the prediction horizon extends.
High-Level Feature Space.
To deal with the curse of dimensionality, researchers have proposed various approaches that first learn high-level representations in the latent space of videos via an encoder network, and then perform prediction in the learned high-level space.
The Parsing with predictive feature Learning (PEARL) framework is the first systematic predictive learning model for video scene parsing. Concurrently, Luc et al. extended the msCNN model of  to the novel task of predicting the semantic segmentation of future frames. However, both of these models are not end-to-end and do not explicitly capture the temporal continuity across frames. To address this limitation, Jin et al. first proposed a model for jointly predicting motion flow and scene parsing. An advantage of flow-based representations is that they implicitly draw temporal correlations from the input data, thus producing temporally coherent per-pixel segmentation. Aside from segmentation, other high-level spaces such as human pose and keypoints represent promising ways of approaching this problem. Minderer et al. modelled dynamics in the keypoint coordinate space, achieving stable learning and avoiding the compounding of errors in pixel space. Tang et al. proposed a pose-guided approach for appearance-preserving video prediction by combining global and local information using Generative Adversarial Networks (GANs). However, a major drawback of these semantic feature spaces is that they are not generic for all video data and usually require laborious ground-truth labelling. Yu et al. proposed the Conditionally Reversible Network (CrevNet), which theoretically ensures no information loss during the feature extraction process, thereby also guaranteeing a higher level of efficiency.
Some two-stream-based works factorize the video data into content and motion and predict the motion separately. For instance, [33, 3] proposed MCnet and the Disentangled-representation Net (DRNET) to explicitly separate scene dynamics from visual appearance. These methods usually assume the visual appearance is static and only predict the dynamics in motion spaces; as a consequence, they are also limited to one prediction space according to our definition. Lastly, high-level feature spaces can be learned by encoder models, although these are usually no longer semantically explainable [32, 4]. PhyDNet has been proposed to disentangle PDE dynamics and obtain a latent feature space by utilizing physical knowledge. However, finding a proper high-level feature space is very challenging, and the dimensionality reduction of the feature space usually comes down to a trade-off between simplicity and quality.
To date, all the aforementioned approaches have attempted to improve performance through sophisticated architectures while still limiting their models to one prediction space. In contrast, in this paper we are the first to thoroughly investigate different strategies for constructing models that incorporate multiple prediction spaces.
3 Method
In this section, we formulate the mathematical description of the video prediction problem and the corresponding architectures. In order to model video data through multiple prediction spaces, there are three important questions to answer: (1) How to obtain the multiple prediction spaces? (2) How to conduct the prediction process in the multiple prediction spaces? (3) How to connect the different prediction spaces? Hereafter, we address these three questions through extensive studies and finally propose a full architecture for this problem.
3.1 The Prediction Workflow
Given an input image sequence $X_1, X_2, \dots, X_t$, our goal is to predict the future frame sequence $X_{t+1}, X_{t+2}, \dots, X_{t+K}$. The most common approach to the video prediction problem is to generate prediction frames recursively, i.e., training a model that takes the current frame $X_t$ as input and outputs the next frame $\hat{X}_{t+1}$. As previously mentioned, prediction methods can be divided into the following three main categories; we illustrate their workflows in Figure 2 and present the mathematical formulations:
Pixel Space Prediction. The typical prediction method in pixel space can be written as:

$\hat{X}_{t+1} = \mathcal{F}(X_t),$

where $X_t$ is the input image at time $t$, $\hat{X}_{t+1}$ is the output image, and $\mathcal{F}$ is usually an RNN-based network.
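The recursive pixel-space rollout just described can be sketched in a few lines. The `model` argument here is a hypothetical stand-in for the learned RNN-based predictor, not the actual network:

```python
def rollout(model, frames, horizon):
    """Recursively predict `horizon` future frames, feeding each output back in."""
    preds = []
    x = frames[-1]            # start from the last observed frame
    for _ in range(horizon):
        x = model(x)          # one recurrent prediction step
        preds.append(x)
    return preds

# Toy stand-in "model" on scalar frames, just to show the feedback loop.
print(rollout(lambda x: x + 1, [0], 3))  # [1, 2, 3]
```

This feedback loop is exactly why small early errors compound: every prediction becomes the next input.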
High-Level Feature Space Prediction. Encoders and decoders are employed to translate the input images to the high-level feature space and back. The corresponding prediction method can be written as:

$H_t = \mathrm{Enc}(X_t), \quad \hat{H}_{t+1} = \mathcal{F}(H_t), \quad \hat{X}_{t+1} = \mathrm{Dec}(\hat{H}_{t+1}),$

where $H_t$ corresponds to the high-level feature map obtained by the encoder, $\hat{H}_{t+1}$ is the high-level feature prediction, and the output image $\hat{X}_{t+1}$ is obtained through the decoder network.
Multi Feature Space Prediction. In our approach, multiple feature spaces are extracted for prediction, since we believe that their incorporation captures complementary clues in natural videos. For simplicity, the equations for a two-prediction-space scenario can be written as:

$H_t = \mathrm{Enc}(X_t), \quad \hat{H}_{t+1} = \mathcal{F}_2(H_t), \quad \tilde{X}_{t+1} = \mathcal{F}_1(X_t), \quad \hat{X}_{t+1} = \mathrm{Fuse}(\tilde{X}_{t+1}, \mathrm{Dec}(\hat{H}_{t+1})),$

where $\mathcal{F}_1$ and $\mathcal{F}_2$ denote the recurrent predictors in the pixel space and the high-level feature space, respectively, and $\mathrm{Fuse}$ assembles the two predictions into the final frame.
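As a rough illustration of one two-space prediction step (not the actual network: parameter-free average pooling and nearest-neighbour upsampling stand in for the learned encoder and decoder, and the predictors and fusion function are passed in as placeholders):

```python
import numpy as np

def downsample(x):
    """Stand-in encoder: 2x2 average pooling into the lower-resolution space."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(x):
    """Stand-in decoder: nearest-neighbour upsampling back to pixel resolution."""
    return np.kron(x, np.ones((2, 2)))

def predict_two_spaces(frame, f_pixel, f_feat, fuse):
    """One recursive step conducted in two prediction spaces and fused."""
    h = downsample(frame)                  # encode into the high-level space
    h_next = f_feat(h)                     # predict in the high-level space
    x_next = f_pixel(frame)                # predict in the pixel space
    return fuse(x_next, upsample(h_next))  # assemble the final frame

# Identity "predictors" and mean fusion, purely to check the plumbing.
frame = np.random.rand(8, 8)
out = predict_two_spaces(frame, lambda x: x, lambda x: x,
                         lambda a, b: 0.5 * (a + b))
print(out.shape)  # (8, 8)
```

The point of the sketch is only the data flow: both spaces predict in parallel, and the low-level result is brought back to pixel resolution before fusion.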
3.2 The Prediction Spaces
The most common way to obtain different feature spaces is to apply convolutions at different scales. Usually, high-level feature-space-based methods spend a lot of effort designing a special prediction space to enhance performance. However, we discovered through our experiments that even simple feature spaces extracted by typical convolutions show much greater improvements when used together.
2D convolutional layers are employed to generate multi-scale features; more details on the layers can be found in the implementation details section. In order to study the characteristics of each feature space, we trained a model that adopts ConvLSTM to predict the future sequences in each single feature space, and report the performance and accuracy results here.
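A minimal sketch of how the multi-scale feature spaces arise, assuming parameter-free 2x2 average pooling in place of the learned strided convolutions (the real model's channel dimensions and learned weights are omitted):

```python
import numpy as np

def avg_pool2(x):
    """Halve spatial resolution by 2x2 average pooling (stand-in for a strided conv)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def feature_pyramid(frame, levels=4):
    """Return the frame plus progressively deeper (coarser) feature spaces."""
    spaces = [frame]
    for _ in range(levels - 1):
        spaces.append(avg_pool2(spaces[-1]))
    return spaces

spaces = feature_pyramid(np.random.rand(64, 64), levels=4)
print([s.shape for s in spaces])  # [(64, 64), (32, 32), (16, 16), (8, 8)]
```

Each level of this pyramid corresponds to one prediction space on which a recurrent predictor can be run.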
Discussion: The comparison results presented in Figure 3 and Table 1 show that the deeper feature spaces preserve the content better, without distortions and blurry artefacts. For example, the digits four and eight are still readable in the final predicted frame, whereas the numbers are already unreadable from the first frame when prediction is conducted directly in pixel space. These observations validate the superiority of the higher-level representations and show that the deeper the feature space, the higher the accuracy. Most importantly, the model that incorporates the prediction results of all feature spaces achieves the best performance. This verifies that the complementary information of video frames in different prediction spaces can boost the video forecasting task.
3.3 The Recurrent Connection
The key issue in realizing our idea is how to connect the different prediction processes so as to fuse the generations from multiple feature spaces. Therefore, a novel Recurrent Connection strategy is designed in this paper to address this problem. Assume the input feature at time $t$ and layer $l$ is denoted as $H^{l}_{t}$, and the predicted feature at the next timestep and layer is denoted as $\hat{H}^{l+1}_{t+1}$. There is consequently a spatial-temporal gap between these two features, and the core technique we propose models the transition from $H^{l}_{t}$ to $\hat{H}^{l+1}_{t+1}$. As shown in Figure 2, pixel-space prediction methods model this transition with a recurrent cell such as ConvLSTM. In our model, the recurrent connection strategy connects the prediction features every two layers and constructs a new way to model the transition from $H^{l}_{t}$ to $\hat{H}^{l+1}_{t+1}$.
The detailed propagation of sequential data in our Recurrent Connection can be described as:

$\tilde{H}^{l}_{t} = \mathrm{LSTM}^{l}_{1}(H^{l}_{t}), \quad H^{l+1}_{t} = \mathrm{DownSample}(\tilde{H}^{l}_{t}),$
$\hat{H}^{l+1}_{t+1} = \mathcal{F}^{l+1}(H^{l+1}_{t}), \quad \hat{H}^{l}_{t+1} = \mathrm{LSTM}^{l}_{2}(\mathrm{UpSample}(\hat{H}^{l+1}_{t+1})).$

In the equations, $\tilde{H}^{l}_{t}$ and $H^{l+1}_{t}$ correspond to the propagation in layer $l$, while $\hat{H}^{l+1}_{t+1}$ represents the propagation in layer $l+1$. The feature $\hat{H}^{l}_{t+1}$ is obtained from the Recurrent Connection between layer $l$ and layer $l+1$. The last layer of the architecture performs the transition of frames with typical recurrent cells. Two LSTM cells are employed before and after the DownSample and UpSample processes to strengthen the ability to capture spatio-temporal clues. Three types of RNN blocks for sequence prediction are employed in our experiments, namely ConvLSTM, ConvGRU and PredRNN. After weighing efficiency and performance, we use ConvLSTM in our final model.
Fusion Strategy. For every input frame, one prediction is generated in the current prediction space and another is obtained by upsampling the lower-level prediction space, so it is important to investigate how the low-level prediction feature map should be fused with the high-level one. Among the various fusion functions proposed in the literature, we chose the four most widely used types, namely Sum Fusion, Concatenation Fusion, Max Fusion and Attention Fusion. Comparison results are reported in Section 4.
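The four fusion types can be sketched as follows on channel-first feature maps. This is a simplified illustration: the 1x1 projection weight `w` and the crude global attention scores stand in for what a learned implementation would use:

```python
import numpy as np

def sum_fusion(a, b):
    """Element-wise sum of the two prediction maps."""
    return a + b

def max_fusion(a, b):
    """Element-wise maximum of the two prediction maps."""
    return np.maximum(a, b)

def concat_fusion(a, b, w):
    """Concatenate along channels, then project back with a 1x1 weight w: (C, 2C)."""
    stacked = np.concatenate([a, b], axis=0)          # (2C, H, W)
    return np.tensordot(w, stacked, axes=([1], [0]))  # (C, H, W)

def attention_fusion(a, b):
    """Softmax over crude per-map scores decides the mixing weights."""
    scores = np.array([a.mean(), b.mean()])
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights[0] * a + weights[1] * b
```

All four take two maps of the same shape and return one map of that shape, so they are interchangeable at the fusion point of the architecture.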
4 Experiments
In this section, we first introduce the datasets used and the implementation details of our model. We then verify the performance of our model by comparing it with five competitive models: ConvLSTM, PredRNN, MIM, CrevNet and PhyDNet.
Implementation details. Our model is trained using the  loss and the ADAM optimizer with an initial learning rate of . The kernel size of all convolutional layers and ConvLSTMs is set to . The RNN path is composed of a 1-layer ConvLSTM with the same number of filters as its input channels. All experiments are implemented in PyTorch and conducted on NVIDIA V100 Tensor Core GPUs.
We use three evaluation metrics common in previous video prediction works: Mean Squared Error (MSE), Mean Absolute Error (MAE) and Structural Similarity (SSIM). These metrics are averaged over the frames of the output sequence. Lower values of MSE and MAE and higher values of SSIM indicate better performance.
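The per-frame aggregation can be sketched as below for MSE and MAE (SSIM requires a windowed computation and is omitted). Note the reduction choice is an assumption on our part: following the convention common in Moving MNIST evaluations, the error is summed over each frame's pixels and then averaged over the output sequence, but conventions vary across papers:

```python
import numpy as np

def frame_mse(preds, targets):
    """Squared error summed over each frame's pixels, averaged over frames."""
    return float(np.mean([np.sum((p - t) ** 2) for p, t in zip(preds, targets)]))

def frame_mae(preds, targets):
    """Absolute error summed over each frame's pixels, averaged over frames."""
    return float(np.mean([np.sum(np.abs(p - t)) for p, t in zip(preds, targets)]))

preds = [np.zeros((2, 2)), np.ones((2, 2))]
targets = [np.ones((2, 2)), np.ones((2, 2))]
print(frame_mse(preds, targets), frame_mae(preds, targets))  # 2.0 2.0
```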
Moving MNIST is a standard benchmark in video prediction containing 2 digits moving randomly and bouncing inside a 64×64 grid. Each sequence contains 20 frames: 10 for the input sequence and 10 for the prediction outcome. We generate Moving MNIST sequences following the method in  for training and use the fixed test set of 10,000 sequences provided by  for evaluation.
KTH includes 6 kinds of human actions (walking, jogging, running, boxing, hand waving and hand clapping) performed in different scenarios. The video frames used in our experiment are subsampled from the original videos at 1 Hz and resized to 128×128. All video frames are divided into a training set (subjects 1-16) and a test set (subjects 17-25). Following the common setup for this benchmark, we predict 10 frames from the last 10 observations. In our experiments, we use 172,372 sequences for training and 99,375 sequences for testing.
Human3.6M contains 17 kinds of human actions, comprising 3.6 million poses and corresponding images. The original images in the Human3.6M dataset are 1000×1000×3. All images are centered, cropped to 800×800×3, and resized to 128×128×3 in our experiments. Following , subjects S1, S5, S6, S7 and S8 are used for training, and S9 and S11 for testing. We predict 4 future frames given the previous 4 input frames.
Radar Echo dataset is generated from historical radar maps of local weather with an interval of 6 minutes. Each frame is a 700×900×1 image, covering 700×900 square kilometers. To effectively train all the models for comparison, the images are first converted to 900×900×1 with zero padding and then resized to 128×128×1 for training and testing. We predict 10 radar maps at a time interval of 6 minutes, covering the next hour. In our experiments, we use 9,600 radar sequences for training and 2,400 for testing.
4.2 Comparison with the State-of-the-Art work
We report here the evaluation of our proposed strategies against leading research works in video prediction. In our final model, four layers of feature spaces and ConvLSTM are used, and the best-performing fusion function is selected.
Quantitative Results. As shown in Tables 2 and 3, our proposed model achieves remarkable improvements over all five baselines across all datasets and metrics, which validates the effectiveness of using multiple prediction spaces. On Moving MNIST with 3 handwritten digits, the MSE and MAE are almost halved compared to the competitive PhyDNet model; it is worth noting that such a gap had not been achieved by previously proposed approaches. Moreover, our model significantly outperforms the state-of-the-art methods on the other three real-world video benchmarks. On the challenging Radar Echo Dataset, our model yields the best results to date (76.7 MSE, 319.3 MAE and 0.942 SSIM). A large and persistent performance boost is also observed in terms of SSIM, indicating that our model has a superior ability to preserve image quality during prediction. In addition, we report the CSI scores specific to the weather forecasting task in Figure 5. Our model consistently outperforms the compared methods over the future 10 frames (1 hour), which indicates its superior performance on the precipitation nowcasting task.
Qualitative Results. We can observe from Figure 4 that the proposed model clearly predicts the human poses and positions, whereas the other two models produce blurry and incorrect results.
For the Radar Echo Dataset, different colors represent different radar dBZ values, with high dBZ values being the most important for weather nowcasting. As shown in Figure 6, our model successfully preserves the high dBZ values (the areas in the red blocks) in the predicted frames, and our predictions are much sharper than those of the other models, which is rather useful for more accurate weather forecasting.
Furthermore, the prediction results on human action videos presented in Figure 7 show that, compared with the state-of-the-art method, our multi-prediction-space method is clearly more advantageous: the human legs remain distinct and accurate in the predicted future.
All the visualization results validate the remarkable advantage of leveraging both high-level semantic information and low-level pixel information through the Recurrent Connection mechanism. More visual comparisons on all datasets are provided in our demo and supplementary materials.
5 Ablation Study
Extensive ablation studies are carried out in this section to thoroughly investigate the best way to design the whole architecture, including the analysis of each component of our model. In the following, we address and deeply explore the three questions discussed in Section 3.
5.1 The Number of Feature Layers
As shown in Table 4 (a), we evaluated four different baselines on the Moving MNIST dataset to study the relationship between the number of prediction spaces and the accuracy. The - baseline denotes the model with  layers. It can be clearly observed that utilizing more prediction spaces leads to more favourable performance gains. While - performs prediction only in the single pixel space, the involvement of a second feature space brings significant improvements, with drops of  and  in terms of MSE and MAE, respectively. From the results of - and -, we can see that the errors decrease more slowly, indicating that further improvements become harder at that level. When the number of prediction spaces is increased from one to four, we see a remarkable drop of 42.6 in MSE and 104.5 in MAE. Along with the continuous improvements in MSE and MAE, the quality of the predicted frames also increases significantly according to the SSIM metric, which can also be seen in Figure 8. Based on all these observations, we can confidently conclude from the experimental results that increasing the number of prediction spaces introduces further performance gains for the video forecasting task.
5.2 The Fusion Strategy
The other question we want to investigate is how the different fusion functions contribute to the final results. Here we fix the number of prediction spaces and compare the four most commonly used fusion strategies, namely sum fusion, concatenation fusion, max fusion and attention fusion.
As reported in Table 4 (b), the max fusion function achieves the top performance in terms of MSE, surpassing the concatenation fusion function by a gap of 1.9. However, the concatenation fusion function achieves the best performance in terms of MAE, with a slightly lower error of 47.7. From the last column of Table 4 (b), we can see that the SSIM values of the different fusion functions are almost the same. This evaluation therefore suggests that the different fusion strategies achieve very similar results, and that a more complex fusion function such as attention does not lead to significantly better results.
5.3 The Recurrent Connection Module
Additionally, we carefully studied how the design of recurrent connections will affect the performance of the final model. To accomplish this, we selected the competitive recurrent models, i.e. ConvLSTM , ConvGRU  and ST-LSTM  for spatial-temporal data prediction.
The following observations can be made from Table 4 (c). Firstly, the ConvLSTM module outperforms the other two modules by a small margin: 1.0 MSE less than ST-LSTM and 4.4 MSE less than ConvGRU. The same conclusion can be drawn from the comparison of MAE in the second column. Secondly, the differences among the three baselines are quite small compared with the much sharper difference of 31.9 MSE obtained by adding new prediction spaces; these results therefore clearly underline the dominant importance of the proposed single-to-multiple prediction-space idea.
(c)
| Model | MSE | MAE | SSIM |
| Ours with ConvGRU | 21.6 | 63.7 | 0.953 |
| Ours with ST-LSTM | 18.2 | 48.8 | 0.966 |
| Ours with ConvLSTM | 17.2 | 47.7 | 0.967 |
5.4 Generalization of our idea
Finally, a natural question arises: is this idea also applicable to other types of high-level feature spaces? Motivated by this, we evaluated our idea on another base module, PhyDNet, which is a typical encoder-predictor-decoder framework. The feature maps from the layers of the encoder are propagated directly to the layers of the decoder, in a similar way to the U-Net model; however, there is no prediction process along these links at the different scales of feature maps. Therefore, we integrated our idea by revising the propagation so that multiple prediction processes are introduced into the model. Comparison results in terms of MSE, MAE and SSIM presented in Table 5 show that considerable gains can also be achieved by applying this idea to the feature spaces of PhyDNet. We can therefore conclude that the key idea of this paper generalizes well to diverse types of feature spaces and achieves significant improvements.
In this paper, to the best of our knowledge, we are the first to systematically investigate the largely neglected yet significant strategy of introducing multiple prediction spaces for the task of video forecasting, unlike the more common approach of using a single prediction space. The results obtained through our extensive experiments on four datasets indicate that our proposed idea is capable of drastically boosting performance over state-of-the-art methods. We conducted a series of ablation studies on the various ways to factorize the feature spaces, fuse the feature maps from different prediction spaces, and build up the Recurrent Connection. Furthermore, evaluations on other prediction spaces demonstrate the generalization of our idea. Finally, we also found that the prediction results from the pixel space and the high-level feature spaces are complementary to each other and can be fused together to further improve the performance. This study is a first step towards enhancing our understanding of this topic, and we believe that the in-depth component analysis provides important insights that will hopefully encourage future research in the video forecasting domain.
-  Nicolas Ballas, Li Yao, Chris Pal, and Aaron Courville. Delving deeper into convolutional networks for learning video representations. arXiv preprint arXiv:1511.06432, 2015.
-  Hsu-kuang Chiu, Ehsan Adeli, and Juan Carlos Niebles. Segmenting the future. IEEE Robotics and Automation Letters, 5(3):4202–4209, 2020.
-  Emily Denton and Vighnesh Birodkar. Unsupervised learning of disentangled representations from video. arXiv preprint arXiv:1705.10915, 2017.
-  Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. In International Conference on Machine Learning, pages 1174–1183. PMLR, 2018.
-  Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional two-stream network fusion for video action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1933–1941, 2016.
-  Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In Advances in neural information processing systems, pages 64–72, 2016.
-  Vincent Le Guen and Nicolas Thome. Disentangling physical dynamics from unknown factors for unsupervised video prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11474–11484, 2020.
-  Ruibing Hou, Hong Chang, Bingpeng Ma, and Xilin Chen. Video prediction with bidirectional constraint network. In 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), pages 1–8. IEEE, 2019.
-  Zhihang Hu and Jason Wang. A novel adversarial inference framework for video prediction with action control. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 0–0, 2019.
-  Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE transactions on pattern analysis and machine intelligence, 36(7):1325–1339, 2013.
-  Xiaojie Jin, Xin Li, Huaxin Xiao, Xiaohui Shen, Zhe Lin, Jimei Yang, Yunpeng Chen, Jian Dong, Luoqi Liu, Zequn Jie, et al. Video scene parsing with predictive feature learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 5580–5588, 2017.
-  Xiaojie Jin, Huaxin Xiao, Xiaohui Shen, Jimei Yang, Zhe Lin, Yunpeng Chen, Zequn Jie, Jiashi Feng, and Shuicheng Yan. Predicting scene parsing and motion dynamics in the future. In Advances in Neural Information Processing Systems, pages 6915–6924, 2017.
-  Nal Kalchbrenner, Aäron Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. In International Conference on Machine Learning, pages 1771–1779. PMLR, 2017.
-  Yunji Kim, Seonghyeon Nam, In Cho, and Seon Joo Kim. Unsupervised keypoint learning for guiding class-conditional video prediction. In Advances in Neural Information Processing Systems, pages 3814–3824, 2019.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Yong-Hoon Kwon and Min-Gyu Park. Predicting future frames using retrospective cycle gan. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1811–1820, 2019.
-  William Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104, 2016.
-  Pauline Luc, Natalia Neverova, Camille Couprie, Jakob Verbeek, and Yann LeCun. Predicting deeper into the future of semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 648–657, 2017.
-  Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
-  Matthias Minderer, Chen Sun, Ruben Villegas, Forrester Cole, Kevin P Murphy, and Honglak Lee. Unsupervised learning of object structure and dynamics from videos. In Advances in Neural Information Processing Systems, pages 92–102, 2019.
-  Ishan Misra, C Lawrence Zitnick, and Martial Hebert. Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision, pages 527–544. Springer, 2016.
-  Marc Oliu, Javier Selva, and Sergio Escalera. Folded recurrent neural networks for future video prediction. In Proceedings of the European Conference on Computer Vision (ECCV), pages 716–731, 2018.
-  Sergiu Oprea, Pablo Martinez-Gonzalez, Alberto Garcia-Garcia, John Alejandro Castro-Vargas, Sergio Orts-Escolano, Jose Garcia-Rodriguez, and Antonis Argyros. A review on deep learning techniques for video prediction. arXiv preprint arXiv:2004.05214, 2020.
-  Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019.
-  Marc'Aurelio Ranzato, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, and Sumit Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.
-  Christian Schuldt, Ivan Laptev, and Barbara Caputo. Recognizing human actions: a local svm approach. In Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004., volume 3, pages 32–36. IEEE, 2004.
-  Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. arXiv preprint arXiv:1506.04214, 2015.
-  Xingjian Shi, Zhihan Gao, Leonard Lausen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo. Deep learning for precipitation nowcasting: A benchmark and a new model. In Advances in neural information processing systems, pages 5617–5627, 2017.
-  Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using lstms. In International conference on machine learning, pages 843–852, 2015.
-  Jilin Tang, Haoji Hu, Qiang Zhou, Hangguan Shan, Chuan Tian, and Tony QS Quek. Pose guided global and local gan for appearance preserving human video prediction. In 2019 IEEE International Conference on Image Processing (ICIP), pages 614–618. IEEE, 2019.
-  Adam Terwilliger, Garrick Brazil, and Xiaoming Liu. Recurrent flow-guided semantic forecasting. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1703–1712. IEEE, 2019.
-  Ruben Villegas, Dumitru Erhan, Honglak Lee, et al. Hierarchical long-term video prediction without supervision. In International Conference on Machine Learning, pages 6038–6046. PMLR, 2018.
-  Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. Decomposing motion and content for natural video sequence prediction. arXiv preprint arXiv:1706.08033, 2017.
-  Yunbo Wang, Zhifeng Gao, Mingsheng Long, Jianmin Wang, and Philip S Yu. Predrnn++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. arXiv preprint arXiv:1804.06300, 2018.
-  Yunbo Wang, Lu Jiang, Ming-Hsuan Yang, Li-Jia Li, Mingsheng Long, and Li Fei-Fei. Eidetic 3d lstm: A model for video prediction and beyond. In International Conference on Learning Representations, 2018.
-  Yunbo Wang, Mingsheng Long, Jianmin Wang, Zhifeng Gao, and S Yu Philip. Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms. In Advances in Neural Information Processing Systems, pages 879–888, 2017.
-  Yunbo Wang, Jianjin Zhang, Hongyu Zhu, Mingsheng Long, Jianmin Wang, and Philip S Yu. Memory in memory: A predictive neural network for learning higher-order non-stationarity from spatiotemporal dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9154–9162, 2019.
-  Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
-  Haixu Wu, Zhiyu Yao, Mingsheng Long, and Jianmin Wang. Motionrnn: A flexible model for video prediction with spacetime-varying motions. arXiv preprint arXiv:2103.02243, 2021.
-  Yue Wu, Rongrong Gao, Jaesik Park, and Qifeng Chen. Future video synthesis with object motion prediction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5539–5548, 2020.
-  Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. In Advances in neural information processing systems, pages 802–810, 2015.
-  Ceyuan Yang, Zhe Wang, Xinge Zhu, Chen Huang, Jianping Shi, and Dahua Lin. Pose guided human video generation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 201–216, 2018.
-  Wei Yu, Yichao Lu, Steve Easterbrook, and Sanja Fidler. Crevnet: Conditionally reversible video prediction. arXiv preprint arXiv:1910.11577, 2019.
-  Wei Yu, Yichao Lu, Steve Easterbrook, and Sanja Fidler. Efficient and information-preserving future frame prediction and beyond. In International Conference on Learning Representations, 2019.