1 Introduction
Unsupervised video prediction, as a fundamental vision problem, has attracted increasing attention from the research community and industry. It aims at predicting upcoming future frames based on the observation of previous frames. This looking-ahead ability has broad application prospects in video surveillance [11], robotic systems [12] and autonomous vehicles [47]. However, building an accurate predictive model remains very challenging because it requires mastering not only the visual abstraction model of different objects but also the evolution of various motions over time.
A variety of recent deep learning methods [22, 46, 36, 3, 39, 38, 43, 21] have brought about great progress on the video prediction task since the pioneering work of [35]. However, there still exists a clear gap between their predictions and the ground truth (GT), as shown in Figure 1. The predictions of the compared methods suffer from deficient retention of high-frequency details and insufficient use of motion information, which results in distortion and temporal inconsistency. We detail the reasons mainly in the following two aspects:

Loss of details. Downsampling is commonly adopted to enlarge the receptive field and extract global information, resulting in inevitable loss of high-frequency details. However, video prediction is a pixel-wise dense prediction problem, and sharp predictions cannot be made without the assistance of fine details. Although dilated convolution can be employed to avoid downsampling, it suffers from the gridding effect and is not friendly to small objects, which hinders its application to video prediction.
Insufficient exploitation of temporal motions. Dynamic scenes are composed of motions occurring at more than one temporal frequency. In Figure 2, we can observe the slower temporal motion of the smaller car on the left and the faster temporal motion of the bigger truck on the right; they move at different frequencies. However, previous methods usually process frames one by one at a fixed frame rate. Although Recurrent Neural Networks (RNNs) are used to memorize dynamic dependencies, they have no ability to distinguish motions at different frequencies and cannot analyze the time-frequency characteristics of temporal information.
Therefore, it is necessary to introduce multi-frequency analysis into the video prediction task. Biological studies [16, 4] have shown that the Human Visual System (HVS) exhibits multi-channel characteristics for spatial and temporal frequency information. Retinal images are decomposed into frequency bands of approximately equal bandwidth on a logarithmic scale for processing [29], comprising one low-frequency band and multiple high-frequency bands. Besides the spatial dimension, there is a similar frequency-band decomposition in the temporal dimension. These characteristics enable the HVS to process visual content with better discrimination of detail and motion information. Wavelet analysis [6, 1] is a spatial-scale (temporal-frequency) analysis method with the characteristic of multi-resolution (multi-frequency) analysis; it can well represent the local characteristics of spatial (temporal) frequency signals, which is very similar to the HVS.
Discrete Wavelet Transform (DWT) is a common wavelet analysis method for image processing. As shown in Figure 3(B), the Discrete Wavelet Transform in the Spatial dimension (DWTS, Figure 3(A)) can decompose an image into one low-frequency subband and three anisotropic high-frequency subbands in different directions (horizontal, vertical, diagonal). Figure 3(D) shows that the Discrete Wavelet Transform in the Temporal dimension (DWTT, Figure 3(C)) decomposes a video sequence of length four into two high-frequency subbands and two low-frequency subbands on the time axis. The frequency on the time axis here can be viewed as how fast the pixels change over time, which relates to temporal motions. Inspired by the characteristics of the HVS and the wavelet transform, we propose to explore multi-frequency analysis for high-fidelity and temporally consistent video prediction. The main contributions are summarized as follows:

To the best of our knowledge, we are the first to propose a video prediction framework based on multi-frequency analysis that is trainable in an end-to-end manner.

To strengthen the spatial details, we develop a multi-level Spatial Wavelet Analysis Module (SWAM) to decompose each frame into one low-frequency approximation subband and three high-frequency anisotropic detail subbands. The high-frequency subbands represent boundary details well and help sharpen the prediction details. Besides, the multi-level decomposition forms a spatial frequency pyramid, helping to extract objects' features at multiple scales.

To fully exploit the multi-frequency temporal motions of objects in dynamic scenes, we employ a multi-level Temporal Wavelet Analysis Module (TWAM) to decompose the buffered video sequence into subbands of different frequencies on the time axis, promoting the description of multi-frequency motions and helping to comprehensively capture dynamic representations.

Both quantitative and qualitative experiments on diverse datasets demonstrate a significant performance boost over the state of the art. Ablation studies show the generalization ability of our model and evaluate its submodules.
2 Related Work
2.1 Video Generation and Video Prediction
Video generation is to synthesize photorealistic image sequences without the need to guarantee the fidelity of the results. It focuses on modeling the uncertainty of the dynamic development of video, producing results that may be inconsistent with the ground truth but still plausible. In contrast, video prediction performs deterministic image generation. It needs not only to focus on per-frame visual quality but also to master the internal temporal features so as to determine the most reliable development trend, i.e., the one closest to the ground truth.
Stochastic Video Generation. Stochastic video generation models focus on handling the inherent uncertainty in predicting the future. They seek to generate multiple possible futures by incorporating stochastic models. Probabilistic latent variable models such as Variational Auto-Encoders (VAEs) [20, 33] and Variational Recurrent Neural Networks (VRNNs) [7] are the most commonly used structures. [2] developed a stochastic variational video prediction (SV2P) method that predicted a different possible future for each sample of its latent variables, and was the first to provide effective stochastic multi-frame generation for real-world videos. SVG [8] proposed a generation model that combined deterministic prediction of the next frame with stochastic latent variables, introducing a per-step latent variable model (SVG-FP) and a variant with a learned prior (SVG-LP). SAVP [22] proposed a stochastic generation model combining VAEs and GANs. [5] extended the VRNN formulation by proposing a hierarchical variant that used multiple levels of latents per timestep.
High-fidelity Video Prediction. High-fidelity video prediction models aim to produce naturalistic image sequences as close to the ground truth as possible. The main consideration is to minimize the reconstruction error between the true future frame and the generated future frame. Such models can be classified into direct prediction models [35, 46, 43, 21, 3, 39, 30, 38, 18, 25] and transformation-based prediction models [49, 40, 37, 32]. Direct prediction models predict pixel values of future frames directly. In general, they use a combination of a feed-forward neural network and a recurrent neural network to encode spatial and temporal features, and then decode the features with a corresponding decoding network to obtain the prediction. Generative adversarial networks (GANs) are often employed to make the predicted frames more realistic. For example, [24] developed a dual motion GAN architecture to explicitly enforce future-frame predictions to be consistent with the pixel-wise flows. [21] trained a single generator that predicts both future and past frames by enforcing the consistency of bidirectional prediction using a retrospective prediction scheme; meanwhile, they employed two discriminators not only to identify fake frames but also to distinguish sequences containing fake frames from real sequences. Transformation-based prediction models aim at modeling the source of variability and operate in the space of transformations between frames. They focus on learning the transformation kernels between frames, which are applied to the previous frames to synthesize the future frames indirectly.

Note that the latent variables used in stochastic video generation models are not adopted in our model. Such models learn and sample from a space of possible futures to generate the subsequent frames. Although reasonable results can be generated by sampling different latent variables, there is no guarantee of consistency with the ground truth. Moreover, the quality of the generated results varies from sample to sample, which is uncontrollable. This limits the application of such models in practical tasks requiring a high degree of certainty, such as autonomous driving.
We focus on high-fidelity video prediction, aiming to construct a prediction model that predicts realistic future frame sequences as close to the ground truth as possible. To overcome the challenges of missing details and motion blur, we explore multi-frequency analysis based video prediction by incorporating the wavelet transform into a generative adversarial network.
2.2 Wavelet Transform
Wavelet Transform (WT) has been widely applied in image compression [6], image reconstruction [17] and many other fields. In the Wavelet Transform, a scalable modulation window is moved along the signal, the spectrum is calculated at each position, and the process is repeated multiple times with a slightly shorter (or longer) window. The result is a collection of time-frequency representations of the signal at different resolutions (frequencies). In image processing, the Discrete Wavelet Transform (DWT) is often used. A fast implementation using filter banks was proposed in [28]: the filter bank implementation of wavelets can be interpreted as computing the wavelet coefficients of a discrete set of child wavelets for a given mother wavelet. Following [28], we illustrate the process of DWT on the space axes of an image and DWT on the time axis of a video sequence in Figure 3. Multi-level DWT can be done by repeating a similar process on the subband images. The multi-resolution (multi-frequency) analysis of DWT is consistent with the Human Visual System (HVS), which provides a biological basis for our approach. We refer the reader to [28] for more details on the Discrete Wavelet Transform.
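As a concrete illustration of the spatial DWT, a one-level 2D Haar transform (the simplest wavelet) can be written in a few lines of NumPy. This is only a sketch of the decomposition illustrated in Figure 3, not the implementation used in our network, and the function name is ours:

```python
import numpy as np

def haar_dwt2(img):
    """One-level 2D Haar DWT of a (H, W) image.

    Returns four half-resolution subbands: one low-frequency
    approximation (LL) and three anisotropic detail subbands
    (LH horizontal, HL vertical, HH diagonal)."""
    a = img[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = img[0::2, 1::2]  # top-right
    c = img[1::2, 0::2]  # bottom-left
    d = img[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # low-frequency approximation
    lh = (a - b + c - d) / 2.0  # horizontal detail
    hl = (a + b - c - d) / 2.0  # vertical detail
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return ll, lh, hl, hh
```

Repeating the transform on the LL subband yields the multi-level spatial frequency pyramid; on a smooth region the three detail subbands are near zero, while edges produce large coefficients, which is why they help sharpen predicted boundaries.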
3 Method
3.1 Problem Statement
We aim to synthesize future frames of high fidelity and temporal consistency by observing several beginning frames. Let X = {x_1, x_2, ..., x_m} be the input of length m, where x_i represents the i-th frame, 1 ≤ i ≤ m, and H, W and C are the height, width and channel number. Let Y = {y_1, ..., y_n} represent the ground truth of the future frame sequence of length n and Ŷ = {ŷ_1, ..., ŷ_n} represent the prediction of Y. The goal is to minimize the reconstruction error between Y and Ŷ. For the sake of clarity, we introduce the network in detail by taking next-frame prediction as an example.
3.2 Network Architecture
We adopt a generative adversarial network as the model structure. The generator G and discriminator D are trained with competing goals: G aims to predict frames that can fool D, while D aims to distinguish whether the input samples are real (from the training dataset) or fake (from G).
Figure 4 demonstrates the overall block diagram of the generator G for predicting the frame at the next time step. It follows an encoder-decoder architecture. The encoder transforms the input sequence into a hidden feature tensor, while the decoder is in charge of decoding the feature tensor to generate the prediction of the next frame. The encoder consists of three parts: the stem CNN-LSTM, cascaded Spatial Wavelet Analysis Modules (SWAMs) and the Temporal Wavelet Analysis Module (TWAM). The decoder is composed of deconvolution and upsampling layers.
The stem encoder is a 'CNN-LSTM' structure. At each time step, the current frame is passed through the stem network to extract multi-scale spatial information under different receptive fields. As video prediction is a pixel-wise visual task, to pursue a better expression of appearance features, we refer to the Residual-in-Residual Dense Block (RRDB) proposed by [41] in the design of our stem CNN structure. It is a combination of a multi-level residual network and dense connections. We make one modification: we add a downsampling layer in each RRDB unit to reduce the size of the feature maps.
To preserve more high-frequency spatial details in the prediction, motivated by the multi-resolution analysis of the wavelet transform, we propose a Spatial Wavelet Analysis Module (SWAM) to enhance the representation of high-frequency information. As illustrated in Figure 4, SWAM consists of two stages: first, the input is decomposed into one low-frequency subband and three high-frequency detail subbands by the Discrete Wavelet Transform in the Spatial dimension (DWTS); second, the subbands are fed into a shallow CNN for further feature extraction and to obtain a channel number consistent with the corresponding m_RRDB unit. We cascade three SWAMs to perform multi-level wavelet analysis. The output of each level of SWAM is added to the corresponding feature tensors of the m_RRDB unit. The cascaded SWAMs provide detail compensation to the stem network at multiple frequencies, which promotes the prediction of fine details.
On the other side, to model the temporal multi-frequency motions in video sequences, we design a multi-level Temporal Wavelet Analysis Module (TWAM) that decomposes the sequence into subbands of different frequencies on the time axis. As shown in Figure 3(D), one level of DWT on the temporal axis decomposes a sequence into low-frequency subbands and high-frequency subbands, each half the original length. In our experiments, we conduct multi-level DWT in the temporal dimension (DWTT) on the input sequence until the number of low-frequency (or high-frequency) subbands equals two. We take three levels of DWTT as an example in Figure 4. We then concatenate those subbands as the input of a CNN to extract features and adjust the size of the feature maps. The output is fused with the historical information from the LSTM cell to strengthen the model's ability to distinguish multi-frequency motions. The fused feature tensors from the encoder network are fed to the decoder network to generate the prediction of the next frame. We construct a discriminator network as in [30] and train it to classify real input sequences and generated input sequences into their respective classes.
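The temporal decomposition performed by TWAM can likewise be sketched with the Haar wavelet applied along the time axis; the helper names below are ours, and the stopping criterion follows the description above (repeat on the low band until two low-frequency subbands remain):

```python
import numpy as np

def haar_dwtt(seq):
    """One-level Haar DWT along the time axis of a (T, H, W) video array.

    Returns a low-frequency band (slow pixel changes) and a
    high-frequency band (fast pixel changes), each of length T/2."""
    lo = (seq[0::2] + seq[1::2]) / np.sqrt(2.0)  # temporal averages
    hi = (seq[0::2] - seq[1::2]) / np.sqrt(2.0)  # temporal differences
    return lo, hi

def multilevel_dwtt(seq):
    """Repeat DWTT on the low band until only two low-frequency frames remain.

    Returns the high-frequency bands of each level followed by the
    final low-frequency band; in the model these would be concatenated
    and fed to a CNN."""
    bands = []
    lo = seq
    while lo.shape[0] > 2:
        lo, hi = haar_dwtt(lo)
        bands.append(hi)
    bands.append(lo)
    return bands
```

On a static scene every high-frequency band is zero, while fast-moving content concentrates energy in the high-frequency bands, which is what lets the module separate motions at different temporal frequencies.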
3.3 Loss Function
We adopt a multi-term loss which consists of the image domain loss and the adversarial loss.
Image Domain Loss. We combine the L_2 loss with the Gradient Difference Loss (GDL) [30] as the image domain loss:

L_img(Y, Ŷ) = L_2(Y, Ŷ) + L_gdl(Y, Ŷ)   (1)
We define the L_2 loss as:

L_2(Y, Ŷ) = ||Y − Ŷ||_2^2   (2)
And the GDL loss is given by:

L_gdl(Y, Ŷ) = Σ_{i,j} ( ||Y_{i,j} − Y_{i−1,j}| − |Ŷ_{i,j} − Ŷ_{i−1,j}||^α + ||Y_{i,j−1} − Y_{i,j}| − |Ŷ_{i,j−1} − Ŷ_{i,j}||^α )   (3)
where α is an integer greater than or equal to 1, and |·| denotes the absolute value.
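A minimal NumPy sketch of the image domain terms, assuming single-channel frames and α = 1; the function names are ours and this is illustration, not the training code:

```python
import numpy as np

def l2_loss(pred, gt):
    """Squared L2 reconstruction error between prediction and ground truth."""
    return np.sum((pred - gt) ** 2)

def gdl_loss(pred, gt, alpha=1):
    """Gradient Difference Loss [30]: penalize mismatch between the
    image gradients of the prediction and of the ground truth,
    which encourages sharp rather than blurry frames."""
    dy_p = np.abs(pred[1:, :] - pred[:-1, :])  # vertical gradients, prediction
    dy_g = np.abs(gt[1:, :] - gt[:-1, :])      # vertical gradients, ground truth
    dx_p = np.abs(pred[:, 1:] - pred[:, :-1])  # horizontal gradients, prediction
    dx_g = np.abs(gt[:, 1:] - gt[:, :-1])      # horizontal gradients, ground truth
    return (np.abs(dy_g - dy_p) ** alpha).sum() + (np.abs(dx_g - dx_p) ** alpha).sum()

def image_domain_loss(pred, gt):
    """Eq. (1): L2 reconstruction term plus gradient difference term."""
    return l2_loss(pred, gt) + gdl_loss(pred, gt)
```

Note that a uniformly blurred prediction can keep the L2 term moderate while its weakened edges inflate the GDL term, which is why the combination favors sharper outputs.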
Adversarial Loss. Adversarial training involves a generator G and a discriminator D, where D learns to distinguish whether a frame sequence comes from the real dataset or is produced by G. The two networks are trained alternately, improving until D can no longer discriminate the frame sequences generated by G. In our model, the prediction model is regarded as the generator. We formulate the adversarial loss on the discriminator D as:

L_adv^D(Y, Ŷ) = −log D(Y) − log(1 − D(Ŷ))   (4)
and the adversarial loss for the generator G as:

L_adv^G(Ŷ) = −log D(Ŷ)   (5)
Hence, we combine the previously defined losses for our generator model with different weights:

L_G = λ_img L_img + λ_adv L_adv^G   (6)

where λ_img and λ_adv are hyperparameters to trade off between these distinct losses.
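The adversarial terms and the weighted combination can be sketched in NumPy on discriminator probabilities. All function names and the weight values are placeholders of ours, not the paper's settings:

```python
import numpy as np

def bce(p, y):
    """Binary cross-entropy of discriminator probabilities p against targets y."""
    p = np.clip(p, 1e-7, 1 - 1e-7)  # avoid log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

def d_loss(d_real, d_fake):
    """Discriminator loss (Eq. 4): push real sequences toward 1, generated ones toward 0."""
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

def g_loss(d_fake):
    """Adversarial generator loss (Eq. 5): reward fooling D into outputting 1 on fakes."""
    return bce(d_fake, np.ones_like(d_fake))

def generator_objective(img_loss, d_fake, lam_img=1.0, lam_adv=0.05):
    """Weighted combination (Eq. 6); the weights here are illustrative placeholders."""
    return lam_img * img_loss + lam_adv * g_loss(d_fake)
```

During training, `d_loss` and `generator_objective` would be minimized alternately, one step each, which is the standard GAN schedule the text describes.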
4 Experiments
In this section, we first introduce the experiment setup, and then present the quantitative and qualitative evaluation on diverse datasets. Besides, we conduct ablation studies to show our model's generalization capability and to evaluate its submodules.
4.1 Experiment Setup

Datasets. We perform experiments on diverse datasets widely used to evaluate video prediction models. The KTH dataset [34] contains 6 types of actions performed by 25 persons. We use persons 1-16 for training and 17-25 for testing. Models are trained to predict the next 10 frames based on the observation of the previous 10 frames; the prediction range at test time is extended to 20 or 40 frames. The hyperparameters of loss function (6) on the KTH dataset are: and . The BAIR dataset [10] consists of a randomly moving robotic arm that pushes objects on a table. This dataset is particularly challenging due to the high stochasticity of the arm movements and the diversity of the background. We follow the setup in [22] and train the models to predict the future 28 frames. The hyperparameters of loss function (6) on the BAIR dataset are: and . In addition, following the experiment settings in [24], we validate the generalization ability of our models on the car-mounted camera datasets (train: KITTI dataset [14]; test: Caltech Pedestrian dataset [9]). The hyperparameters are: and .
Metrics. Quantitative evaluation of the accuracy on the testing datasets is performed with the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) metrics [45]; higher values indicate better results. To measure the realism of the predicted results, we employ the Learned Perceptual Image Patch Similarity (LPIPS) metric [48]; lower LPIPS values indicate better results.
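For reference, PSNR can be computed directly from the mean squared error; a small sketch assuming 8-bit frames (SSIM and LPIPS are typically computed with library implementations such as scikit-image and the official LPIPS package, so they are omitted here):

```python
import numpy as np

def psnr(pred, gt, data_range=255.0):
    """Peak Signal-to-Noise Ratio in dB; higher means closer to the ground truth."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')  # identical frames
    return 10.0 * np.log10(data_range ** 2 / mse)
```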
Baselines. We consider representative baselines from two categories: stochastic video generation models [22, 2, 8, 5] and deterministic video prediction models [38, 31, 44, 42, 19, 23, 43, 3, 13, 21, 15]. All experiment results of the baselines are taken from the cited papers or reproduced with the pretrained models the authors released online.
Table 1: Results on the KTH dataset, predicting 20 and 40 future frames from 10 observed frames.

| Method | PSNR (→20) | SSIM (→20) | LPIPS (→20) | PSNR (→40) | SSIM (→40) | LPIPS (→40) |
| --- | --- | --- | --- | --- | --- | --- |
| MCNET [38] | 25.95 | 0.804 | – | 23.89 | 0.73 | – |
| fRNN [31] | 26.12 | 0.771 | – | 23.77 | 0.678 | – |
| PredRNN [44] | 27.55 | 0.839 | – | 24.16 | 0.703 | – |
| PredRNN++ [42] | 28.47 | 0.865 | – | 25.21 | 0.741 | – |
| VarNet [19] | 28.48 | 0.843 | – | 25.37 | 0.739 | – |
| E3D-LSTM [43] | 29.31 | 0.879 | – | 27.24 | 0.810 | – |
| MSNET [23] | 27.08 | 0.876 | – | – | – | – |
| SAVP [22] | 25.38 | 0.746 | 9.37 | 23.97 | 0.701 | 13.26 |
| SAVP-VAE [22] | 27.77 | 0.852 | 8.36 | 26.18 | 0.811 | 11.33 |
| SV2P time-invariant [2] | 27.56 | 0.826 | 17.92 | 25.92 | 0.778 | 25.21 |
| SV2P time-variant [2] | 27.79 | 0.838 | 15.04 | 26.12 | 0.789 | 22.48 |
| Ours | 29.85 | 0.893 | 11.81 | 27.56 | 0.851 | 14.13 |
| Ours (no SWAM) | 29.13 | 0.872 | 12.33 | 26.42 | 0.805 | 16.06 |
| Ours (no TWAM) | 28.57 | 0.839 | 15.16 | 26.08 | 0.782 | 17.45 |
Table 2: Average results on the BAIR dataset.

| Method | PSNR | SSIM | LPIPS |
| --- | --- | --- | --- |
| SAVP [22] | 18.42 | 0.789 | 6.34 |
| SAVP-VAE [22] | 19.09 | 0.815 | 6.22 |
| SV2P time-invariant [2] | 20.36 | 0.817 | 9.14 |
| SVG-LP [8] | 17.72 | 0.815 | 6.03 |
| Improved VRNN [5] | – | 0.822 | 5.50 |
| Ours | 21.02 | 0.844 | 9.36 |
4.2 Quantitative Evaluation
The results of the methods [38, 31, 44, 42, 19, 43, 23, 5] are taken from the reference papers [43, 19, 23, 5]. For the models [22, 2, 8], we generate the results by running the pretrained models the authors released online.
Table 1 reports the quantitative comparison on the KTH dataset. Our model achieves the best results on PSNR and SSIM when predicting both 20 and 40 future frames, which indicates that our results are more consistent with the ground truth. However, on LPIPS, SAVP and its variant SAVP-VAE perform better than ours. We attribute this to the latent variables introduced in the stochastic generation methods, which focus more on the visual quality of the generated results and less on consistency with the ground truth. Our model, by contrast, focuses on fidelity and temporal consistency with the original sequences, which is in line with our original intention.
Figure 5 illustrates the per-frame quantitative comparison over future time steps on the BAIR dataset. We also report the average results in Table 2. Consistent with the results on the KTH dataset, we obtain the best PSNR and SSIM among the reported methods, while the Improved VRNN [5] achieves the best LPIPS. Because of the high stochasticity of the BAIR dataset, it is challenging to maintain fidelity and temporal consistency while producing good visual effects.
4.3 Qualitative Evaluation
To show the prediction results more intuitively, we present visualization examples on the KTH and BAIR datasets in Figure 6 and Figure 7. The first row is the ground truth, in which the initial frames represent the input sequence. Our model makes more accurate predictions while maintaining more details of the arms in the handclapping example in the first group of Figure 6. Meanwhile, we predict a walking sequence that is more consistent with the ground truth in the second group of Figure 6, while for the other methods, the person in the image walks out of the scene too quickly (VarNet) or too slowly (SAVP and SV2P time-invariant).
For the two groups of predictions on the BAIR dataset, ours are also the most consistent. Though the stochastic generation methods seem to generate clearer results, their outputs deviate strongly from the moving trajectories of the real sequence. This again confirms our belief that introducing more stochasticity into models sacrifices fidelity. From the experiment results above, we can see that the multi-frequency analysis of the discrete wavelet transform does help models retain more detail information as well as temporal motion information.
Table 3: Results on the Caltech Pedestrian dataset after training on the KITTI dataset.

| Method | PSNR | SSIM | LPIPS | #param |
| --- | --- | --- | --- | --- |
| PredNet [27] | 27.6 | 0.905 | 7.47 | 6.9M |
| ContextVP [3] | 28.7 | 0.921 | 6.03 | 8.6M |
| DVF [26] | 26.2 | 0.897 | 5.57 | 8.9M |
| Dual Motion GAN [24] | – | 0.899 | – | – |
| CtrlGen [15] | 26.5 | 0.900 | 6.38 | – |
| DPG [13] | 28.2 | 0.923 | 5.04 | – |
| Cycle GAN [21] | 29.2 | 0.830 | – | – |
| Ours | 29.1 | 0.927 | 5.89 | 7.6M |
4.4 Ablation Study
To verify the validity of our model more fully, we conduct the following ablation studies on driving scene datasets: the KITTI dataset and the Caltech Pedestrian dataset.
Evaluation of generalization ability. Consistent with the way previous works evaluate the generalization ability of predictive models, we test our model on the Caltech Pedestrian dataset after training on the KITTI dataset. Table 3 shows the comparison with other models; we achieve state-of-the-art performance. In addition, Figure 8 shows visualization examples on the KITTI dataset (the first group) and the Caltech Pedestrian dataset (the second group). From Figure 8, we can see that our model clearly predicts the evolution of the driving lanes and the cars. Meanwhile, the results remain consistent with the ground truth, which verifies the good generalization ability of the model. Besides, we also report the number of model parameters in Table 3. Compared to ContextVP [3] and DVF [26], our model achieves better results with fewer parameters.
Evaluation of submodules. To assess the impact of each submodule, we conduct ablation studies on the KTH dataset in the absence of SWAM or TWAM respectively. As shown in Table 1, both submodules, SWAM and TWAM, contribute to improving the prediction quality. Specifically, removing TWAM degrades performance more than removing SWAM, indicating that temporal motion information is of vital importance, especially for long-term prediction. Improving the expression of multi-frequency motion information in the model is the basis for making predictions with high fidelity and temporal consistency.
5 Conclusion
This paper discusses the issues of missing details and ignored temporal multi-scale motions in current video prediction models, which often lead to blurry results. To address these issues, inspired by the mechanisms of the Human Visual System (HVS), we explore a video prediction network based on multi-frequency analysis, which integrates the spatial-temporal wavelet transform with a generative adversarial network. Specifically, the Spatial Wavelet Analysis Module (SWAM) is proposed to preserve more detail information through multi-level decomposition of each frame, and the Temporal Wavelet Analysis Module (TWAM) is proposed to exploit temporal motions through multi-level decomposition of video sequences on the time axis. Extensive experiments on diverse datasets demonstrate the superiority of the proposed method compared to state-of-the-art methods. In the future, we will validate our method on more video datasets.
References
 [1] Milad Alemohammad, Jasper R Stroud, Bryan T Bosworth, and Mark A Foster. High-speed all-optical Haar wavelet transform for real-time image compression. Optics Express, 2017.
 [2] Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H Campbell, and Sergey Levine. Stochastic variational video prediction. arXiv preprint arXiv:1710.11252, 2017.
 [3] Wonmin Byeon, Qin Wang, Rupesh Kumar Srivastava, and Petros Koumoutsakos. ContextVP: Fully context-aware video prediction. In ECCV, 2018.
 [4] Fergus W Campbell and Janus J Kulikowski. Orientational selectivity of the human visual system. The Journal of physiology, 1966.
 [5] Lluis Castrejon, Nicolas Ballas, and Aaron Courville. Improved conditional vrnns for video prediction. arXiv preprint arXiv:1904.12165, 2019.

 [6] Honggang Chen, Xiaohai He, Linbo Qing, Shuhua Xiong, and Truong Q Nguyen. DPW-SDNet: Dual pixel-wavelet domain deep CNNs for soft decoding of JPEG-compressed images. In CVPR Workshops, 2018.
 [7] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In NeurIPS, 2015.
 [8] Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. arXiv preprint arXiv:1802.07687, 2018.
 [9] Piotr Dollar, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: An evaluation of the state of the art. PAMI, 2012.
 [10] Frederik Ebert, Chelsea Finn, Alex X Lee, and Sergey Levine. Selfsupervised visual planning with temporal skip connections. arXiv preprint arXiv:1710.05268, 2017.
 [11] Issam Elafi, Mohamed Jedra, and Noureddine Zahid. Unsupervised detection and tracking of moving objects for video surveillance applications. Pattern Recognition Letters, 2016.
 [12] Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In ICRA, 2017.
 [13] Hang Gao, Huazhe Xu, QiZhi Cai, Ruth Wang, Fisher Yu, and Trevor Darrell. Disentangling propagation and generation for video prediction. In ICCV, 2019.
 [14] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. IJRR, 2013.
 [15] Zekun Hao, Xun Huang, and Serge Belongie. Controllable video generation with sparse trajectories. In CVPR, 2018.
 [16] RF Hess and RJ Snowden. Temporal properties of human visual filters: Number, shapes and spatial covariation. Vision research, pages 47–59, 1992.

 [17] Huaibo Huang, Ran He, Zhenan Sun, and Tieniu Tan. Wavelet-SRNet: A wavelet-based CNN for multi-scale face super resolution. In ICCV, 2017.
 [18] Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik Learned-Miller, and Jan Kautz. Super SloMo: High quality estimation of multiple intermediate frames for video interpolation. CVPR, 2018.
 [19] Beibei Jin, Yu Hu, Yiming Zeng, Qiankun Tang, Shice Liu, and Jing Ye. Varnet: Exploring variations for unsupervised video prediction. IROS, 2018.
 [20] Diederik P Kingma and Max Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 [21] YongHoon Kwon and MinGyu Park. Predicting future frames using retrospective cycle gan. In CVPR, 2019.
 [22] Alex X Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, and Sergey Levine. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018.
 [23] Jungbeom Lee, Jangho Lee, Sungmin Lee, and Sungroh Yoon. Mutual suppression network for video prediction using disentangled features. In BMVC, 2019.
 [24] Xiaodan Liang, Lisa Lee, Wei Dai, and Eric P Xing. Dual motion GAN for future-flow embedded video prediction. In ICCV, 2017.
 [25] Wenqian Liu, Abhishek Sharma, Octavia Camps, and Mario Sznaier. Dyan: A dynamical atomsbased network for video prediction. In ECCV, 2018.
 [26] Ziwei Liu, Raymond A Yeh, Xiaoou Tang, Yiming Liu, and Aseem Agarwala. Video frame synthesis using deep voxel flow. In ICCV, 2017.

 [27] William Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding networks for video prediction and unsupervised learning. ICLR, 2017.
 [28] Stephane G Mallat. A theory for multiresolution signal decomposition: the wavelet representation. TPAMI, 1989.
 [29] James Mannos and David Sakrison. The effects of a visual fidelity criterion on the encoding of images. IEEE Transactions On Information Theory, 2003.
 [30] Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multiscale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
 [31] Marc Oliu, Javier Selva, and Sergio Escalera. Folded recurrent neural networks for future video prediction. In ECCV, 2018.
 [32] Fitsum A Reda, Guilin Liu, Kevin J Shih, Robert Kirby, Jon Barker, David Tarjan, Andrew Tao, and Bryan Catanzaro. SDC-Net: Video prediction using spatially-displaced convolution. In ECCV, 2018.
 [33] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 [34] Christian Schuldt, Ivan Laptev, and Barbara Caputo. Recognizing human actions: a local svm approach. ICPR, 2004.
 [35] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using lstms. ICML, 2015.
 [36] Sergey Tulyakov, MingYu Liu, Xiaodong Yang, and Jan Kautz. Mocogan: Decomposing motion and content for video generation. CVPR, 2018.
 [37] Joost Van Amersfoort, Anitha Kannan, Marc’Aurelio Ranzato, Arthur Szlam, Du Tran, and Soumith Chintala. Transformationbased models of video sequences. arXiv preprint arXiv:1701.08435, 2017.
 [38] Ruben Villegas, Jimei Yang, Seunghoon Hong, Xunyu Lin, and Honglak Lee. Decomposing motion and content for natural video sequence prediction. arXiv preprint arXiv:1706.08033, 2017.
 [39] Ruben Villegas, Jimei Yang, Yuliang Zou, Sungryull Sohn, Xunyu Lin, and Honglak Lee. Learning to generate long-term future via hierarchical prediction. arXiv preprint arXiv:1704.05831, 2017.
 [40] C. Vondrick and A. Torralba. Generating the future with adversarial transformers. In CVPR, 2017.
 [41] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: Enhanced superresolution generative adversarial networks. In ECCV, 2018.
 [42] Yunbo Wang, Zhifeng Gao, Mingsheng Long, Jianmin Wang, and Philip S Yu. PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. PMLR, 2018.
 [43] Yunbo Wang, Lu Jiang, MingHsuan Yang, LiJia Li, Mingsheng Long, and Li FeiFei. Eidetic 3d lstm: A model for video prediction and beyond. In ICLR, 2019.
 [44] Yunbo Wang, Mingsheng Long, Jianmin Wang, Zhifeng Gao, and S Yu Philip. Predrnn: Recurrent neural networks for predictive learning using spatiotemporal lstms. In NeurIPS, 2017.
 [45] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 2004.
 [46] Henglai Wei, Xiaochuan Yin, and Penghong Lin. Novel video prediction for largescale scene using optical flow. arXiv preprint arXiv:1805.12243, 2018.
 [47] Junqing Wei, John M Dolan, and Bakhtiar Litkouhi. A prediction and cost functionbased algorithm for robust autonomous freeway driving. IV, 2010.

 [48] Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
 [49] Yipin Zhou and Tamara L Berg. Learning temporal transformations from time-lapse videos. In ECCV, 2016.