Unsupervised video prediction, a fundamental vision problem, has attracted growing attention from the research community and AI companies. It aims to predict upcoming frames from the observation of previous frames. This look-ahead ability has broad application prospects in video surveillance, robotic systems and autonomous vehicles. However, building an accurate predictive model remains very challenging, because it requires mastering not only the visual abstraction of different objects but also the evolution of various motions over time.
A variety of recent deep learning methods[22, 46, 36, 3, 39, 38, 43, 21] have brought about great development on the video prediction task since the pioneering work of . However, there still exists a clear gap between their predictions and the ground-truth (GT), as shown in Figure 1. The predictions of the compared methods suffer from deficient retention of high-frequency details and insufficient use of motion information, which results in distortion and temporal inconsistency. We detail the reasons mainly in the following two aspects:
Loss of details. Down-sampling is commonly adopted to enlarge the receptive field and extract global information, which inevitably discards high-frequency details. However, video prediction is a pixel-wise dense prediction problem, and sharp predictions cannot be made without fine details. Although dilated convolution can be employed to avoid down-sampling, it suffers from the gridding effect and handles small objects poorly, which hinders its application to video prediction.
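The trade-off described above can be illustrated on a one-dimensional signal. The sketch below is a toy example, not part of the proposed model: it compares average pooling followed by nearest-neighbour up-sampling with a one-level Haar decomposition. The function names and the sample signal are hypothetical and chosen only for illustration.

```python
from math import sqrt


def average_pool(signal):
    """Down-sample by 2 with average pooling (keeps only a coarse view)."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]


def upsample_nearest(signal):
    """Up-sample by 2 by repeating each value."""
    return [v for v in signal for _ in range(2)]


def haar_analysis(signal):
    """Split a signal into low- and high-frequency halves (orthonormal Haar)."""
    s = 1 / sqrt(2)
    low = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return low, high


def haar_synthesis(low, high):
    """Invert haar_analysis: rebuild the signal from both halves."""
    s = 1 / sqrt(2)
    out = []
    for l, h in zip(low, high):
        out.extend([(l + h) * s, (l - h) * s])
    return out


# A row of pixels with a fine alternating texture (hypothetical data).
row = [1.0, 0.0, 1.0, 0.0, 3.0, 2.0, 3.0, 2.0]

pooled_back = upsample_nearest(average_pool(row))  # texture is flattened
low, high = haar_analysis(row)                     # texture lives in `high`
restored = haar_synthesis(low, high)               # texture is recovered
```

Pooling collapses the alternating texture to a constant within each region, whereas the wavelet decomposition also halves the resolution of the low-frequency part but keeps the discarded detail in a separate high-frequency sub-band.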
Insufficient exploitation of temporal motions. Dynamic scenes are composed of motions occurring at more than one temporal frequency. In Figure 2, we can observe the slower motion of the smaller car on the left and the faster motion of the bigger truck on the right; the two move at different temporal frequencies. However, previous methods usually process the frames one by one at a fixed frame rate. Although Recurrent Neural Networks (RNNs) are used to memorize dynamic dependencies, they cannot distinguish motions at different frequencies or analyze the time-frequency characteristics of temporal information.
Therefore, it is necessary to introduce multi-frequency analysis into the video prediction task. Biological studies [16, 4] have shown that the Human Visual System (HVS) exhibits multi-channel characteristics for spatial and temporal frequency information. Retinal images are decomposed into frequency bands of approximately equal bandwidth on a logarithmic scale for processing, comprising one low-frequency band and multiple high-frequency bands. Besides the spatial dimension, a similar frequency-band decomposition also exists in the temporal dimension. These characteristics enable the HVS to process visual content with better discrimination of detail and motion information. Wavelet analysis [6, 1] is a spatial-scale (temporal-frequency) analysis method that offers multi-resolution (multi-frequency) analysis and represents the local characteristics of spatial (temporal) frequency signals well, which makes it very similar to the HVS.
The Discrete Wavelet Transform (DWT) is a common wavelet analysis method for image processing. As shown in Figure 3(B), the Discrete Wavelet Transform in the spatial dimension (DWT-S, Figure 3(A)) decomposes an image into one low-frequency sub-band and three anisotropic high-frequency sub-bands of different directions (horizontal, vertical, diagonal). Figure 3(D) shows how the Discrete Wavelet Transform in the temporal dimension (DWT-T, Figure 3(C)) decomposes a video sequence of length four into two high-frequency sub-bands and two low-frequency sub-bands on the time axis. The frequency on the time axis can be viewed as how fast the pixels change with time, which is related to temporal motion. Inspired by these characteristics of the HVS and the wavelet transform, we propose to explore multi-frequency analysis for high-fidelity and temporally consistent video prediction. The main contributions are summarized as follows:
To the best of our knowledge, we are the first to propose a video prediction framework based on multi-frequency analysis that is trainable in an end-to-end manner.
To strengthen the spatial details, we develop a multi-level Spatial Wavelet Analysis Module (S-WAM) that decomposes each frame into one low-frequency approximation sub-band and three high-frequency anisotropic detail sub-bands. The high-frequency sub-bands represent boundary details well and help sharpen the predicted details. In addition, multi-level decomposition forms a spatial frequency pyramid, which helps to extract object features at multiple scales.
To fully exploit the multi-frequency temporal motions of objects in dynamic scenes, we employ a multi-level Temporal Wavelet Analysis Module (T-WAM) that decomposes the buffered video sequence into sub-bands of different frequencies on the time axis, improving the description of multi-frequency motions and helping to capture dynamic representations comprehensively.
Both quantitative and qualitative experiments on diverse datasets demonstrate a significant performance boost over the state of the art. Ablation studies are conducted to show the generalization ability of our model and to evaluate its sub-modules.
2 Related Work
2.1 Video Generation and Video Prediction
Video generation synthesizes photo-realistic image sequences without needing to guarantee the fidelity of the results. It focuses on modeling the uncertainty in how a video develops, producing results that are plausible but may be inconsistent with the ground truth. In contrast, video prediction performs deterministic image generation. It must not only attend to per-frame visual quality, but also master the internal temporal features in order to determine the most reliable development trend, the one closest to the ground truth.
Stochastic Video Generation. Stochastic video generation models focus on handling the inherent uncertainty in predicting the future. They seek to generate multiple possible futures by incorporating stochastic models. Probabilistic latent variable models such as Variational Auto-Encoders (VAEs) [20, 33] and Variational Recurrent Neural Networks (VRNNs) [7] are the most commonly used structures. Babaeizadeh et al. [2] developed a stochastic variational video prediction (SV2P) method that predicts a different possible future for each sample of its latent variables, and was the first to provide effective stochastic multi-frame generation for real-world videos. SVG [8] is a generation model that combines deterministic prediction of the next frame with stochastic latent variables, introducing a per-step latent variable model (SVG-FP) and a variant with a learned prior (SVG-LP). SAVP [22] is a stochastic generation model combining VAEs and GANs. Castrejon et al. [5] extended the VRNN formulation by proposing a hierarchical variant that uses multiple levels of latents per timestep.
High-fidelity Video Prediction. High-fidelity video prediction models aim to produce naturalistic image sequences as close to the ground truth as possible. The main consideration is to minimize the reconstruction error between the true future frame and the generated future frame. Such models can be classified as direct prediction models [35, 46, 43, 21, 3, 39, 30, 38, 18, 25] and transformation-based prediction models [49, 40, 37, 32]. Direct prediction models predict the pixel values of future frames directly. In general, they use a combination of feed-forward and recurrent neural networks to encode spatial and temporal features, and then decode these features with a corresponding decoding network to obtain the prediction. Generative adversarial networks (GANs) are often employed to make the predicted frames more realistic. For example, Liang et al. [24] developed a dual motion GAN architecture that explicitly enforces future-frame predictions to be consistent with the pixel-wise flows. Kwon and Park [21] trained a single generator that predicts both future and past frames, enforcing the consistency of bi-directional prediction with a retrospective prediction scheme. They also employed two discriminators, one to identify fake frames and one to distinguish image sequences containing fake frames from real sequences. Transformation-based prediction models aim at modeling the source of variability and operate in the space of transformations between frames. They focus on learning transformation kernels between frames, which are applied to the previous frames to synthesize the future frames indirectly.
The latent variables used in stochastic video generation models are not considered in our model. Such models learn and sample from a space of possible futures to generate the subsequent frames. Although reasonable results can be generated by sampling different latent variables, there is no guarantee of consistency with the ground truth. Moreover, the quality of the generated results varies from sample to sample and cannot be controlled. This limits the application of such models in practical tasks requiring a high degree of certainty, such as autonomous driving. We focus on high-fidelity video prediction, aiming to construct a model that predicts realistic future frame sequences as close to the ground truth as possible. To overcome the challenges of missing details and motion blur, we propose to explore multi-frequency analysis based video prediction by combining the wavelet transform with a generative adversarial network.
2.2 Wavelet Transform
The Wavelet Transform (WT) has been widely applied in image compression, image reconstruction and many other fields. In the Wavelet Transform, a scalable modulation window is moved along the signal and the spectrum is calculated at each position; the procedure is then repeated multiple times with a slightly shorter (or longer) window. The result is a collection of time-frequency representations of the signal at different resolutions (frequencies). In image processing, the Discrete Wavelet Transform (DWT) is often used. A fast implementation based on filter banks was proposed by Mallat [28]. The filter bank implementation of wavelets can be interpreted as computing the wavelet coefficients of a discrete set of child wavelets for a given mother wavelet. According to , we illustrate the process of DWT on the spatial axes of an image and DWT on the time axis of a video sequence in Figure 3. Multi-level DWT can be performed by repeating a similar process on the sub-band images. The multi-resolution (frequency) analysis of DWT is consistent with the Human Visual System (HVS), which provides a biological basis for our approach. We refer readers to  for more details on the Discrete Wavelet Transform (DWT).
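To make the filter-bank view concrete, the following is a minimal sketch of a one-level two-dimensional DWT, assuming the Haar wavelet and plain Python lists (the text does not state which wavelet is used). The function names and the sub-band labels `ll`, `lh`, `hl`, `hh` are illustrative choices; labelling conventions for the detail sub-bands differ between libraries.

```python
from math import sqrt


def haar_step(values):
    """One filter-bank step: low-pass and high-pass outputs, down-sampled by 2."""
    s = 1.0 / sqrt(2.0)
    pairs = range(0, len(values), 2)
    low = [(values[i] + values[i + 1]) * s for i in pairs]
    high = [(values[i] - values[i + 1]) * s for i in pairs]
    return low, high


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def filter_rows(matrix):
    """Apply haar_step to every row; return the low and high halves."""
    lows, highs = [], []
    for row in matrix:
        low, high = haar_step(row)
        lows.append(low)
        highs.append(high)
    return lows, highs


def haar_dwt_2d(image):
    """One-level separable 2D Haar DWT of an image with even height and width.

    Returns four sub-bands of size (H/2, W/2):
      ll: low-pass along both axes (approximation)
      lh: low-pass along the width, high-pass along the height
      hl: high-pass along the width, low-pass along the height
      hh: high-pass along both axes
    """
    row_low, row_high = filter_rows(image)  # filter along the width
    # Filter along the height by transposing, filtering rows, transposing back.
    ll, lh = (transpose(m) for m in filter_rows(transpose(row_low)))
    hl, hh = (transpose(m) for m in filter_rows(transpose(row_high)))
    return ll, lh, hl, hh


# A 4x4 image whose intensity changes only along the width (hypothetical data).
image = [[0.0, 1.0, 1.0, 1.0] for _ in range(4)]
ll, lh, hl, hh = haar_dwt_2d(image)
```

In this toy image the only non-zero detail coefficients appear in `hl`, the sub-band that is high-pass along the width, which shows how each detail sub-band responds to one orientation of intensity change. Because the Haar filters used here are orthonormal, the sum of squared coefficients over the four sub-bands equals that of the input.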
3.1 Problem Statement
We aim to synthesize future frames of high fidelity and temporal consistency by observing several beginning frames. Let be the input of length . represents the th frame, where . H, W and C are the height, width and channel number. Let represents the ground truth of future frame sequence of length and represents the prediction of . The goal is to minimize the reconstruction error between and . For the sake of clarity, we will introduce the network in detail by taking the next frame prediction as an example.
3.2 Network Architecture
We adopt a generative adversarial network as the model structure. The generator G and the discriminator D are trained with competing goals: G aims to predict frames that can fool D, while D aims to distinguish whether the input samples are real (from the training dataset) or fake (from G).
Figure 4 shows the overall block diagram of the generator G for next-frame prediction. It follows an encoder-decoder architecture. The encoder transforms the input sequence into a hidden feature tensor, while the decoder decodes the feature tensor to generate the prediction of the next frame. The encoder consists of three parts: the stem CNN-LSTM, cascaded Spatial Wavelet Analysis Modules (S-WAMs) and the Temporal Wavelet Analysis Module (T-WAM). The decoder is composed of deconvolution and up-sampling layers.
The stem encoder is a CNN-LSTM structure. At each time step, the current frame is passed through the stem network to extract multi-scale spatial information under different receptive fields. As video prediction is a pixel-wise visual task, to better express appearance features we base the design of our stem CNN on the Residual-in-Residual Dense Block (RRDB) proposed by Wang et al. [41]. The RRDB combines a multi-level residual network with dense connections. We make one modification: we add a down-sampling layer to each RRDB unit to reduce the size of the feature maps, and refer to the modified unit as m_RRDB.
To preserve more high-frequency spatial details in the prediction, and building on the multi-resolution analysis of the wavelet transform, we propose a Spatial Wavelet Analysis Module (S-WAM) to enhance the representation of high-frequency information. As illustrated in Figure 4, S-WAM consists of two stages. First, the input is decomposed into one low-frequency sub-band and three high-frequency detail sub-bands by the Discrete Wavelet Transform in the spatial dimension (DWT-S). Second, the sub-bands are fed into a shallow CNN for further feature extraction and to obtain the same number of channels as the corresponding m_RRDB unit. We cascade three S-WAMs to perform multi-level wavelet analysis. The output of each S-WAM level is added to the corresponding feature tensors of the m_RRDB unit. The cascaded S-WAMs compensate the stem network with details at multiple frequencies, which promotes the prediction of fine details.
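The cascade of three decomposition levels can be sketched as follows. This is only an illustration of the wavelet part, under two assumptions that the text does not confirm: the Haar wavelet is used, and each level re-decomposes the low-frequency sub-band of the previous level. The shallow CNN and the addition to the stem features are omitted, and `haar_dwt_2d` and `spatial_wavelet_pyramid` are hypothetical names.

```python
def haar_dwt_2d(frame):
    """One-level 2D Haar DWT computed directly on 2x2 pixel blocks.

    Returns (approximation, (detail_1, detail_2, detail_3)), each of
    size (H/2, W/2). H and W must be even.
    """
    h, w = len(frame), len(frame[0])
    bands = [[[0.0] * (w // 2) for _ in range(h // 2)] for _ in range(4)]
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = frame[i][j], frame[i][j + 1]
            c, d = frame[i + 1][j], frame[i + 1][j + 1]
            coeffs = (
                (a + b + c + d) / 2,  # low-pass along both axes
                (a + b - c - d) / 2,  # change between the two rows
                (a - b + c - d) / 2,  # change between the two columns
                (a - b - c + d) / 2,  # diagonal change
            )
            for band, value in zip(bands, coeffs):
                band[i // 2][j // 2] = value
    return bands[0], tuple(bands[1:])


def spatial_wavelet_pyramid(frame, levels=3):
    """Cascade `levels` decompositions, recursing on the approximation."""
    pyramid = []
    approx = frame
    for _ in range(levels):
        approx, details = haar_dwt_2d(approx)
        pyramid.append((approx, details))
    return pyramid


# An 8x8 frame yields sub-bands of size 4x4, 2x2 and 1x1 (hypothetical data).
frame = [[float(8 * i + j) for j in range(8)] for i in range(8)]
pyramid = spatial_wavelet_pyramid(frame, levels=3)
sizes = [(len(approx), len(approx[0])) for approx, _ in pyramid]
```

Each level halves the spatial resolution, so the sub-band sizes follow the same schedule as a stem network that down-samples by two at each unit, which is what makes a level-by-level addition of features possible.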
On the other hand, to model the temporal multi-frequency motions in video sequences, we design a multi-level Temporal Wavelet Analysis Module (T-WAM) that decomposes the sequence into sub-bands of different frequencies on the time axis. As shown above in Figure 3(B), one level DWT of a sequence with length on temporal axis will decompose it into low-frequency sub-bands and high-frequency sub-bands. In our experiments, we apply multi-level DWT in the temporal dimension (DWT-T) to the input sequence until the number of low-frequency or high-frequency sub-bands equals two. We take three levels of DWT-T as an example in Figure 4. We then concatenate those sub-bands as the input of a CNN to extract features and adjust the size of the feature maps. The output is fused with the historical information from the LSTM cell to strengthen the model's ability to distinguish multi-frequency motions. The fused feature tensors from the encoder network are fed to the decoder network to generate the prediction of the next frame. We construct a discriminator network as  and train the discriminator to classify the input into class and the input into class .
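The temporal decomposition can be sketched in the same spirit. The code below assumes the Haar wavelet, frames stored as flat lists of pixel values, and recursion on the low-frequency sub-bands until two of them remain; these are illustrative assumptions, and the feature-extraction CNN and the fusion with the LSTM state are not shown. The names `temporal_haar_level` and `temporal_wavelet_analysis` are hypothetical.

```python
from math import sqrt


def temporal_haar_level(frames):
    """One level of Haar DWT along the time axis.

    `frames` is a list of equally sized flat pixel lists. Consecutive frame
    pairs are combined pixel by pixel, so a sequence of length T yields
    T/2 low-frequency and T/2 high-frequency sub-bands.
    """
    s = 1.0 / sqrt(2.0)
    lows, highs = [], []
    for k in range(0, len(frames), 2):
        first, second = frames[k], frames[k + 1]
        lows.append([(x + y) * s for x, y in zip(first, second)])
        highs.append([(x - y) * s for x, y in zip(first, second)])
    return lows, highs


def temporal_wavelet_analysis(frames):
    """Repeat the decomposition until two low-frequency sub-bands remain.

    Returns (lows, highs_per_level), where highs_per_level[0] holds the
    highest temporal frequencies.
    """
    if len(frames) < 4:
        raise ValueError("need at least four frames")
    lows, highs_per_level = frames, []
    while len(lows) > 2:
        if len(lows) % 2:
            raise ValueError("sequence length must halve evenly down to two")
        lows, highs = temporal_haar_level(lows)
        highs_per_level.append(highs)
    return lows, highs_per_level


# Eight one-pixel frames whose intensity grows linearly (hypothetical data).
frames = [[float(t)] for t in range(8)]
lows, highs_per_level = temporal_wavelet_analysis(frames)
band_counts = [len(h) for h in highs_per_level] + [len(lows)]
```

For a sequence of length eight this produces four and then two high-frequency sub-bands plus two low-frequency sub-bands. A static sequence gives all-zero high-frequency sub-bands, while faster pixel changes give larger coefficients at the finer levels.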
3.3 Loss Function
We adopt a multi-module loss that consists of the image domain loss and the adversarial loss.
Image Domain Loss. We combine loss with the Gradient Difference Loss (GDL)  as the image domain loss:
We define the loss as:
And the loss is given by:
where α is an integer greater than or equal to 1, and |·| denotes the absolute value function.
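As a hedged illustration of the gradient difference term, the sketch below follows the standard GDL definition from the cited work: it compares absolute intensity differences between neighbouring pixels in the target and in the prediction. It operates on a single-channel frame stored as nested lists; the function name and the sample frames are hypothetical, and no claim is made about normalization or weighting in the actual model.

```python
def gradient_difference_loss(target, pred, alpha=1):
    """Gradient Difference Loss between two equally sized 2D frames.

    Sums | |grad(target)| - |grad(pred)| | ** alpha over vertical and
    horizontal neighbour differences; alpha is an integer >= 1.
    """
    height, width = len(target), len(target[0])
    total = 0.0
    for i in range(height):
        for j in range(width):
            if i > 0:  # difference to the pixel above
                g_t = abs(target[i][j] - target[i - 1][j])
                g_p = abs(pred[i][j] - pred[i - 1][j])
                total += abs(g_t - g_p) ** alpha
            if j > 0:  # difference to the pixel on the left
                g_t = abs(target[i][j] - target[i][j - 1])
                g_p = abs(pred[i][j] - pred[i][j - 1])
                total += abs(g_t - g_p) ** alpha
    return total


# A sharp edge and a blurred prediction of it (hypothetical data).
sharp = [[0.0, 1.0], [0.0, 1.0]]
blurred = [[0.25, 0.75], [0.25, 0.75]]
loss_alpha_1 = gradient_difference_loss(sharp, blurred, alpha=1)
loss_alpha_2 = gradient_difference_loss(sharp, blurred, alpha=2)
```

The loss is zero when the prediction reproduces every edge of the target and grows as edges are smoothed, which is why it is commonly paired with a pixel-wise term to discourage blurry predictions.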
Adversarial Loss. Adversarial training involves a generator G and a discriminator D, where D learns to distinguish whether the frame sequence is from the real dataset or produced by G. The two networks are trained alternately, thus improving until D can no longer discriminate the frame sequence generated by G. In our model, the prediction model is regarded as a generator. We formulate the adversarial loss on the discriminator D as:
and the adversarial loss for the generator G as:
Hence, we combine the losses previously defined for our generator model with different weights:
where and are hyper-parameters to trade off between these distinct losses.
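For orientation, losses of this kind are commonly written in the binary cross-entropy form sketched below. The notation is assumed here rather than taken from the text: $S$ denotes a real frame sequence, $\hat{S}$ a sequence containing the frame produced by $G$, $\mathcal{L}_{img}$ the image domain loss, and $\lambda_1$, $\lambda_2$ the two weighting hyper-parameters.

```latex
\begin{aligned}
\mathcal{L}_{A}^{D} &= -\log D(S) - \log\bigl(1 - D(\hat{S})\bigr), \\
\mathcal{L}_{A}^{G} &= -\log D(\hat{S}), \\
\mathcal{L}_{G} &= \lambda_1 \, \mathcal{L}_{img} + \lambda_2 \, \mathcal{L}_{A}^{G}.
\end{aligned}
```

Under this form, D is rewarded for assigning high scores to real sequences and low scores to generated ones, while G is rewarded when D scores its output as real; the weights trade pixel-level fidelity against adversarial realism.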
4 Experiments
In this section, we first introduce the experiment setup, and then present the quantitative and qualitative evaluation on diverse datasets. We also conduct ablation studies to show the generalization capability of our model and to evaluate its sub-modules.
4.1 Experiment Setup
Datasets. We perform experiments on diverse datasets widely used to evaluate video prediction models. The KTH dataset [34] contains 6 types of actions performed by 25 persons. We use persons 1-16 for training and 17-25 for testing. Models are trained to predict the next 10 frames based on the observation of the previous 10 frames. The prediction range at test time is extended to 20 or 40 frames. The hyper-parameters in the loss function (6) on the KTH dataset are: and . The BAIR dataset [10] consists of a randomly moving robotic arm that pushes objects on a table. This dataset is particularly challenging due to the high stochasticity of the arm movements and the diversity of the background. We follow the setup in  and train the models to predict the future 28 frames. The hyper-parameters in the loss function (6) on the BAIR dataset are: and . In addition, following the experiment settings in , we validate the generalization ability of our models on car-mounted camera datasets (train: KITTI dataset [14], test: Caltech Pedestrian dataset [9]). The hyper-parameters are: and .
Quantitative evaluation of accuracy on the testing datasets is based on the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) metrics. Higher values indicate better results. To measure the realism of the predicted results, we employ the Learned Perceptual Image Patch Similarity (LPIPS) metric [48]. Lower LPIPS values indicate better results.
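As a reference for how the first of these metrics is computed, the following sketch implements PSNR for a single-channel frame stored as nested lists. It assumes 8-bit intensities with a peak value of 255; the function name and sample values are hypothetical, and the evaluation code behind the reported numbers may differ in details such as colour handling or averaging over frames.

```python
from math import log10


def psnr(target, pred, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two equally sized 2D frames."""
    squared_errors = [
        (t - p) ** 2
        for row_t, row_p in zip(target, pred)
        for t, p in zip(row_t, row_p)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * log10(max_value ** 2 / mse)


# Every pixel is off by 10 intensity levels, so the MSE is 100 (hypothetical data).
ground_truth = [[100.0, 100.0], [100.0, 100.0]]
prediction = [[110.0, 90.0], [90.0, 110.0]]
score = psnr(ground_truth, prediction)
```

Because PSNR depends only on the mean squared error, it rewards pixel-level fidelity to the ground truth, whereas LPIPS compares deep features and is closer to perceived realism.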
Baselines. We consider representative baselines from two categories: stochastic video generation models [22, 2, 8, 5] and deterministic video prediction models [38, 31, 44, 42, 19, 23, 43, 3, 13, 21, 15]. All baseline results are taken from the cited papers or reproduced with the pre-trained models released online by the authors.
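Table 1: Quantitative comparison on the KTH dataset.

| Method | PSNR (20 frames) | SSIM (20 frames) | LPIPS (20 frames) | PSNR (40 frames) | SSIM (40 frames) | LPIPS (40 frames) |
|---|---|---|---|---|---|---|
| SV2P time-invariant | 27.56 | 0.826 | 17.92 | 25.92 | 0.778 | 25.21 |
| SV2P time-variant | 27.79 | 0.838 | 15.04 | 26.12 | 0.789 | 22.48 |
| Ours (no S-WAM) | 29.13 | 0.872 | 12.33 | 26.42 | 0.805 | 16.06 |
| Ours (no T-WAM) | 28.57 | 0.839 | 15.16 | 26.08 | 0.782 | 17.45 |

Table 2: Average results on the BAIR dataset.

| Method | PSNR | SSIM | LPIPS |
|---|---|---|---|
| SV2P time-invariant | 20.36 | 0.817 | 9.14 |
| Improved VRNN | - | 0.822 | 5.50 |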
4.2 Quantitative Evaluation
The results of methods [38, 31, 44, 42, 19, 43, 23, 5] are those reported in the reference papers [43, 19, 23, 5]. For the models [22, 2, 8], we generate the results by running the pre-trained models released online by the authors.
Table 1 reports the quantitative comparison on the KTH dataset. Our model achieves the best PSNR and SSIM when predicting both 20 and 40 future frames, which indicates that our results are more consistent with the ground truth. On LPIPS, however, SAVP and its variant SAVP-VAE perform better than our model. We attribute this to the latent variables in stochastic generation methods, which shift the focus towards the visual quality of the generated results and away from consistency with the ground truth. Our model, by contrast, focuses on fidelity and temporal consistency with the original sequences, which is in line with our original intention.
Figure 5 illustrates the per-frame quantitative comparison over future time steps on the BAIR dataset. We also report the average results in Table 2. Consistent with the results on the KTH dataset, we obtain the best PSNR and SSIM among the reported methods, while the Improved VRNN [5] achieves the best (lowest) LPIPS. Because of the high stochasticity of the BAIR dataset, it is challenging to maintain fidelity and temporal consistency while also producing visually appealing results.
4.3 Qualitative Evaluation
The first row is the ground truth, in which the initial frames represent the input sequence. In the handclapping example in the first group of Figure 6, our model makes more accurate predictions while preserving more details of the arms. In the second group of Figure 6, we predict a walking sequence that is more consistent with the ground truth, whereas with other methods the person walks out of the scene too quickly (VarNet) or too slowly (SAVP and SV2P time-invariant).
For the two groups of predictions on the BAIR dataset, our results are also the most consistent with the ground truth. Although the stochastic generation methods seem to generate clearer results, their moving trajectories differ considerably from those of the real sequence. This again supports our belief that introducing more stochasticity into a model sacrifices fidelity. From the experimental results above, we can see that the multi-frequency analysis of the discrete wavelet transform does help models retain more detail information as well as temporal motion information.
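Table 3: Comparison on the Caltech Pedestrian dataset after training on the KITTI dataset.

| Method | PSNR | SSIM | LPIPS | #Parameters |
|---|---|---|---|---|
| Dual Motion GAN | - | 0.899 | - | - |
| Cycle GAN | 29.2 | 0.830 | - | - |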
4.4 Ablation Study
To verify the validity of our model more fully, we conduct the following ablation studies on driving scene datasets: the KITTI dataset and the Caltech Pedestrian dataset.
Evaluation of generalization ability. Consistent with the way previous works evaluate the generalization ability of predictive models, we test our model on the Caltech Pedestrian dataset after training on the KITTI dataset. Table 3 shows the comparison with other models. It can be seen that we achieve state-of-the-art performance. In addition, Figure 8 shows visualization examples on the KITTI dataset (the first group) and the Caltech Pedestrian dataset (the second group). From Figure 8, we can see that our model clearly predicts the evolution of the lane lines and the cars. Meanwhile, the results remain consistent with the ground truth, which verifies the good generalization ability of the model. We also report the number of model parameters in Table 3. Compared to ContextVP [3] and DVF [26], our model achieves better results with fewer parameters.
Evaluation of sub-modules. To assess the impact of each sub-module, we conduct ablation studies on the KTH dataset, removing S-WAM or T-WAM in turn. As shown in Table 1, both S-WAM and T-WAM contribute to improving the prediction. Specifically, the model without S-WAM performs better than the model without T-WAM, meaning that removing T-WAM causes the larger drop. We attribute this to the vital importance of temporal motion information, especially for long-term prediction. Improving the expression of multi-frequency motion information in the model is the basis for making predictions with high fidelity and temporal consistency.
5 Conclusion
This paper discusses the issues of missing details and ignored temporal multi-scale motions in current video prediction models, which tend to produce blurry results. To address these issues, inspired by the mechanisms of the Human Visual System (HVS), we explore a video prediction network based on multi-frequency analysis, which integrates the spatial-temporal wavelet transform with a generative adversarial network. Specifically, the Spatial Wavelet Analysis Module (S-WAM) is proposed to preserve more detail information through multi-level decomposition of each frame. The Temporal Wavelet Analysis Module (T-WAM) is proposed to exploit temporal motions through multi-level decomposition of video sequences on the time axis. Extensive experiments on diverse datasets demonstrate the superiority of the proposed method over state-of-the-art methods. In the future, we will validate our method on more video datasets.
References
- [1] Milad Alemohammad, Jasper R Stroud, Bryan T Bosworth, and Mark A Foster. High-speed all-optical haar wavelet transform for real-time image compression. Optics Express, 2017.
- [2] Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H Campbell, and Sergey Levine. Stochastic variational video prediction. arXiv preprint arXiv:1710.11252, 2017.
- [3] Wonmin Byeon, Qin Wang, Rupesh Kumar Srivastava, and Petros Koumoutsakos. Contextvp: Fully context-aware video prediction. In ECCV, 2018.
- [4] Fergus W Campbell and Janus J Kulikowski. Orientational selectivity of the human visual system. The Journal of Physiology, 1966.
- [5] Lluis Castrejon, Nicolas Ballas, and Aaron Courville. Improved conditional vrnns for video prediction. arXiv preprint arXiv:1904.12165, 2019.
- [6] Honggang Chen, Xiaohai He, Linbo Qing, Shuhua Xiong, and Truong Q Nguyen. Dpw-sdnet: Dual pixel-wavelet domain deep cnns for soft decoding of jpeg-compressed images. In , 2018.
- [7] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In NeurIPS, 2015.
- [8] Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. arXiv preprint arXiv:1802.07687, 2018.
- [9] Piotr Dollar, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: An evaluation of the state of the art. PAMI, 2012.
- [10] Frederik Ebert, Chelsea Finn, Alex X Lee, and Sergey Levine. Self-supervised visual planning with temporal skip connections. arXiv preprint arXiv:1710.05268, 2017.
- [11] Issam Elafi, Mohamed Jedra, and Noureddine Zahid. Unsupervised detection and tracking of moving objects for video surveillance applications. Pattern Recognition Letters, 2016.
- [12] Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In ICRA, 2017.
- [13] Hang Gao, Huazhe Xu, Qi-Zhi Cai, Ruth Wang, Fisher Yu, and Trevor Darrell. Disentangling propagation and generation for video prediction. In ICCV, 2019.
- [14] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. IJRR, 2013.
- [15] Zekun Hao, Xun Huang, and Serge Belongie. Controllable video generation with sparse trajectories. In CVPR, 2018.
- [16] RF Hess and RJ Snowden. Temporal properties of human visual filters: Number, shapes and spatial covariation. Vision Research, pages 47–59, 1992.
- [17] Huaibo Huang, Ran He, Zhenan Sun, and Tieniu Tan. Wavelet-srnet: A wavelet-based cnn for multi-scale face super resolution. In ICCV, 2017.
- [18] Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik Learned-Miller, and Jan Kautz. CVPR, 2018.
- [19] Beibei Jin, Yu Hu, Yiming Zeng, Qiankun Tang, Shice Liu, and Jing Ye. Varnet: Exploring variations for unsupervised video prediction. IROS, 2018.
- [20] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- [21] Yong-Hoon Kwon and Min-Gyu Park. Predicting future frames using retrospective cycle gan. In CVPR, 2019.
- [22] Alex X Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, and Sergey Levine. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018.
- [23] Jungbeom Lee, Jangho Lee, Sungmin Lee, and Sungroh Yoon. Mutual suppression network for video prediction using disentangled features. In BMVC, 2019.
- [24] Xiaodan Liang, Lisa Lee, Wei Dai, and Eric P Xing. Dual motion gan for future-flow embedded video prediction. In ICCV, 2017.
- [25] Wenqian Liu, Abhishek Sharma, Octavia Camps, and Mario Sznaier. Dyan: A dynamical atoms-based network for video prediction. In ECCV, 2018.
- [26] Ziwei Liu, Raymond A Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agarwala. Video frame synthesis using deep voxel flow. In ICCV, 2017.
- [27] William Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding networks for video prediction and unsupervised learning. ICLR, 2017.
- [28] Stephane G Mallat. A theory for multiresolution signal decomposition: the wavelet representation. TPAMI, 1989.
- [29] James Mannos and David Sakrison. The effects of a visual fidelity criterion on the encoding of images. IEEE Transactions On Information Theory, 2003.
- [30] Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
- [31] Marc Oliu, Javier Selva, and Sergio Escalera. Folded recurrent neural networks for future video prediction. In ECCV, 2018.
- [32] Fitsum A Reda, Guilin Liu, Kevin J Shih, Robert Kirby, Jon Barker, David Tarjan, Andrew Tao, and Bryan Catanzaro. Sdc-net: Video prediction using spatially-displaced convolution. In ECCV, 2018.
- [33] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
- [34] Christian Schuldt, Ivan Laptev, and Barbara Caputo. Recognizing human actions: a local svm approach. ICPR, 2004.
- [35] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using lstms. ICML, 2015.
- [36] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. Mocogan: Decomposing motion and content for video generation. CVPR, 2018.
- [37] Joost Van Amersfoort, Anitha Kannan, Marc’Aurelio Ranzato, Arthur Szlam, Du Tran, and Soumith Chintala. Transformation-based models of video sequences. arXiv preprint arXiv:1701.08435, 2017.
- [38] Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. Decomposing motion and content for natural video sequence prediction. arXiv preprint arXiv:1706.08033, 2017.
- [39] Ruben Villegas, Jimei Yang, Yuliang Zou, Sungryull Sohn, Xunyu Lin, and Honglak Lee. Learning to generate long-term future via hierarchical prediction. arXiv preprint arXiv:1704.05831, 2017.
- [40] C. Vondrick and A. Torralba. Generating the future with adversarial transformers. In CVPR, 2017.
- [41] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: Enhanced super-resolution generative adversarial networks. In ECCV, 2018.
- [42] Yunbo Wang, Zhifeng Gao, Mingsheng Long, Jianmin Wang, and Philip S Yu. Predrnn++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. PMLR, 2018.
- [43] Yunbo Wang, Lu Jiang, Ming-Hsuan Yang, Li-Jia Li, Mingsheng Long, and Li Fei-Fei. Eidetic 3d lstm: A model for video prediction and beyond. In ICLR, 2019.
- [44] Yunbo Wang, Mingsheng Long, Jianmin Wang, Zhifeng Gao, and Philip S Yu. Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms. In NeurIPS, 2017.
- [45] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 2004.
- [46] Henglai Wei, Xiaochuan Yin, and Penghong Lin. Novel video prediction for large-scale scene using optical flow. arXiv preprint arXiv:1805.12243, 2018.
- [47] Junqing Wei, John M Dolan, and Bakhtiar Litkouhi. A prediction- and cost function-based algorithm for robust autonomous freeway driving. IV, 2010.
- [48] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
- [49] Yipin Zhou and Tamara L Berg. Learning temporal transformations from time-lapse videos. In ECCV, 2016.