1 Introduction
Video prediction is the task of inferring future frames from a sequence of past frames. The ability to predict future frames could find applications in various domains, ranging from future state estimation for self-driving vehicles to video analysis. For a video prediction model to perform well, it must accurately capture not only how objects move, but also how their displacement affects the visibility and appearance of surrounding structures. Our work focuses on predicting one or more immediate next frames that are sharp, realistic and at high resolution.
Another attribute of the video prediction task is that models can be trained on raw unlabeled video frames. We train our models on large amounts of high-resolution footage from video game sequences, which we find improves accuracy because video game sequences contain a large range of motion. We demonstrate that the resulting models perform well not only on video game footage, but also on real-life footage.
Video prediction is an active research area, and our work builds on the literature [19, 37, 33, 13, 20, 35, 4, 2, 3, 26, 18]. Previous approaches to video prediction often focus on direct synthesis of pixels using generative models. For instance, convolutional neural networks were used to predict pixel RGB values, while recurrent mechanisms were used to model temporal changes. Ranzato et al.
[28] proposed to partition input sequences into a dictionary of image patch centroids and trained recurrent neural networks (RNNs) to generate target images by indexing the dictionaries. Srivastava et al.
[31] and Villegas et al. [34] used a convolutional Long Short-Term Memory (LSTM) encoder-decoder architecture conditioned on previous frame data. Similarly, Lotter et al.
[17] presented a predictive-coding RNN architecture that models the motion dynamics of objects in the image for frame prediction. Mathieu et al. [21] proposed a multi-scale conditional generative adversarial network (GAN) architecture to alleviate the short-range dependency of single-scale architectures. These approaches, however, suffer from blurriness and do not model large object motions well. This is likely due to the difficulty of directly regressing to pixel values, as well as the low resolution and lack of large motion in their training data.

Another popular approach to frame synthesis is learning to transform input frames. Liang et al. [14] proposed a generative adversarial network (GAN) approach with a joint future optical-flow and future-frame discriminator. However, ground-truth optical flows are not trivial to collect at large scale, and training with estimated optical flows could lead to erroneous supervision signals. Jiang et al. [10]
presented a model that learns offset vectors for sampling and performs frame synthesis using bilinear interpolation guided by the learned sampling vectors. These approaches are desirable for modeling large motion. However, in our experiments, we found that sampling-vector-based synthesis results are often affected by speckled noise.
One particular approach for frame synthesis, proposed by Niklaus et al. [24, 23] and Vondrick et al. [36], is to learn to predict sampling kernels that adapt to each output pixel. A pixel is then synthesized as the weighted sampling of a source patch centered at that pixel's location. Niklaus et al. [24, 23] employed this for the related task of video frame interpolation, applying predicted sampling kernels to consecutive frames to synthesize the intermediate frame. In our experiments, we found the kernel-based approaches effective at keeping objects intact as they are transformed. However, this approach cannot model large motion, since the displacement it can express is limited by the kernel size, and increasing the kernel size can be prohibitively expensive.
Inspired by these approaches, we present a spatially-displaced convolutional (SDC) module for video frame prediction. We learn a motion vector and a kernel for each pixel, and synthesize a pixel by applying the kernel at a displaced location in a source image, defined by the predicted motion vector. Our approach inherits the merits of both the sampling-vector-based and kernel-based approaches while ameliorating their respective disadvantages: we retain the large-motion advantage of the sampling-vector-based approach while reducing its speckled noise patterns, and we retain the clean object-boundary predictions of the kernel-based approaches while significantly reducing kernel sizes, and hence memory demand.
The contributions of our work are:

We propose a deep model for high-resolution frame prediction from a sequence of past frames.

We propose a spatially-displaced convolutional (SDC) module for effective frame synthesis via transformation learning.

We compare our SDC module with kernel-based, vector-based, and state-of-the-art approaches.
2 Methods
Given a sequence of frames $I_{1:t}$ (the immediate past frames), our work aims to predict the next future frame $I_{t+1}$. We formulate the problem as a transformation learning problem:

$$I_{t+1} = \mathcal{T}\big(\mathcal{G}(I_{1:t}),\, I_{1:t}\big) \qquad (1)$$
where $\mathcal{G}$ is a learned function that predicts transformation parameters, and $\mathcal{T}$ is a transformation function. In prior work, $\mathcal{T}$ can be a bilinear sampling operation guided by a motion vector $(u, v)$ [10, 15]:

$$I_{t+1}(x, y) = f\big(I_t(x + u,\, y + v)\big) \qquad (2)$$
where $f$ is a bilinear interpolator [15], $(u, v)$ is a motion vector predicted by $\mathcal{G}$, and $I_t(x, y)$ is the pixel value at $(x, y)$ in the immediate past frame $I_t$. We refer to this approach as vector-based resampling. Fig. 3a illustrates this approach.
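To make the vector-based resampling of equation (2) concrete, the following is a minimal NumPy sketch (our actual models run this as a GPU operation; the function names here are illustrative, not from our implementation):

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Bilinearly interpolate a 2D image at a fractional coordinate (x, y),
    clamping samples that fall outside the image."""
    h, w = img.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    wx, wy = x - x0, y - y0
    x0, x1 = np.clip(x0, 0, w - 1), np.clip(x0 + 1, 0, w - 1)
    y0, y1 = np.clip(y0, 0, h - 1), np.clip(y0 + 1, 0, h - 1)
    top = (1 - wx) * img[y0, x0] + wx * img[y0, x1]
    bot = (1 - wx) * img[y1, x0] + wx * img[y1, x1]
    return (1 - wy) * top + wy * bot

def vector_resample(prev, u, v):
    """Equation (2): synthesize each output pixel by bilinearly sampling the
    previous frame at the displaced location (x + u, y + v)."""
    h, w = prev.shape
    out = np.empty((h, w))
    for y in range(h):
        for x in range(w):
            out[y, x] = bilinear_sample(prev, x + u[y, x], y + v[y, x])
    return out
```

Note that each output pixel is interpolated from at most four source pixels, which is why this scheme handles arbitrarily large displacements cheaply but is prone to speckled noise.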
An alternative approach is to define $\mathcal{T}$ as a convolution module that combines motion or displacement learning and resampling into a single operation [24, 23, 36]:

$$I_{t+1}(x, y) = K(x, y) * P_t(x, y) \qquad (3)$$

where $K(x, y)$ is an $N \times N$ 2D kernel predicted by $\mathcal{G}$ at $(x, y)$, and $P_t(x, y)$ is the $N \times N$ patch centered at $(x, y)$ in $I_t$. We refer to this approach as adaptive kernel-based resampling [24, 23]. Fig. 3b illustrates this approach.
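A minimal NumPy sketch of the adaptive kernel-based resampling of equation (3); this is illustrative only, as real implementations predict a different kernel per pixel with a network and run the operation on the GPU:

```python
import numpy as np

def kernel_resample(prev, kernels):
    """Equation (3): each output pixel is the predicted N x N kernel convolved
    with the N x N patch of the previous frame centered at that pixel.
    `kernels` has shape (H, W, N, N)."""
    h, w = prev.shape
    n = kernels.shape[2]
    r = n // 2
    padded = np.pad(prev, r, mode='edge')  # replicate borders for edge patches
    out = np.empty((h, w))
    for y in range(h):
        for x in range(w):
            patch = padded[y:y + n, x:x + n]    # patch centered at (x, y)
            out[y, x] = np.sum(kernels[y, x] * patch)
    return out
```

Because every output pixel blends an entire patch, results are smooth, but a displacement can only be expressed if it fits inside the N x N support.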
Since equation (2) considers only a few pixels in synthesis, its results often appear degraded by speckled noise patterns. It can, however, model large displacements without a significant increase in parameter count. On the other hand, equation (3) produces visually pleasing results for small displacements, but requires large kernels to be predicted at each location to capture large motions. As such, the kernel-based approach can easily become not only costly at inference, but also difficult to train.
2.1 Spatially Displaced Convolution
To achieve the best of both worlds, we propose a hybrid solution: the Spatially Displaced Convolution (SDC). The SDC uses predictions of both a motion vector $(u, v)$ and an adaptive kernel $K(x, y)$, but convolves the predicted kernel with a patch at the displaced location $(x + u, y + v)$ in $I_t$. Pixel synthesis using the SDC is computed as:
$$I_{t+1}(x, y) = K(x, y) * P_t(x + u,\, y + v) \qquad (4)$$
The predicted pixel is thus the weighted sampling of pixels in an $N \times N$ region centered at $(x + u, y + v)$ in $I_t$. The patch $P_t$ is bilinearly sampled at non-integral coordinates. Fig. 3c illustrates our SDC-based approach.
Setting $K(x, y)$ to a kernel of all zeros, except for a one at the center, reduces the SDC to equation (2), whereas setting $u$ and $v$ to zero reduces it to equation (3). However, it is important to note that the SDC is not the same as applying equation (2) and equation (3) in succession. If applied in succession, the $N \times N$ patch sampled by equation (3) would be subject to the resampling effect of equation (2), as opposed to being the original patch from $I_t$.
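These reductions can be checked numerically. Below is a small NumPy sketch of the SDC, simplified to integer motion vectors (the full operation samples the displaced patch bilinearly at fractional coordinates); the names are illustrative, not from our implementation:

```python
import numpy as np

def sdc(prev, u, v, kernels):
    """Equation (4), integer-motion sketch: convolve the predicted N x N kernel
    with the patch of `prev` centered at the displaced location (x + u, y + v)."""
    h, w = prev.shape
    n = kernels.shape[2]
    r = n // 2
    padded = np.pad(prev, r, mode='edge')
    out = np.empty((h, w))
    for y in range(h):
        for x in range(w):
            cy = int(np.clip(y + v[y, x], 0, h - 1))   # displaced patch center,
            cx = int(np.clip(x + u[y, x], 0, w - 1))   # clamped to the frame
            patch = padded[cy:cy + n, cx:cx + n]
            out[y, x] = np.sum(kernels[y, x] * patch)
    return out
```

With a center-one-hot kernel this behaves like vector-based resampling; with zero motion it behaves like kernel-based resampling, matching the two special cases above.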
Our SDC effectively decouples displacement and kernel learning, allowing us to achieve the visually pleasing results of kernel-based approaches while keeping the kernel sizes small. We also adopt separable kernels [24] for $K$ to further reduce computational cost: at each location, we predict a pair of 1D kernels, $k_u$ and $k_v$, and take their outer product to form a 2D kernel. This reduces the kernel parameter count per pixel from $N^2$ to $2N$. In total, our model predicts $2N + 2$ parameters for each pixel, including the motion vector. We empirically set $N = 11$. Inference at 1080p resolution uses 174MB of VRAM, which easily fits in GPU memory.
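As a quick check of the parameter accounting, assuming the 2D kernel is formed as the outer product of the two predicted 1D kernels (the example kernel values below are arbitrary):

```python
import numpy as np

n = 11                                          # kernel size used by our model
k_v = np.linspace(1, n, n); k_v /= k_v.sum()    # example 1D vertical kernel
k_u = np.ones(n) / n                            # example 1D horizontal kernel
k2d = np.outer(k_v, k_u)                        # N x N kernel via outer product

per_pixel_separable = 2 * n + 2   # two 1D kernels plus the motion vector (u, v)
per_pixel_full = n * n + 2        # a full 2D kernel plus the motion vector
```

For N = 11 this is 24 parameters per pixel instead of 123; the outer product of two normalized 1D kernels also yields a normalized 2D kernel.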
We develop deep neural networks to learn motion vectors and kernels adapted to each output pixel. The SDC is fully differentiable and thus allows our model to train end-to-end. Losses for training are applied to the SDC-predicted frame. We also condition our model on both past frames and past optical flows, which allows our network to easily capture the motion dynamics and evolution of pixels needed to learn the transformation parameters. We formulate our model as:
$$I_{t+1} = \mathcal{T}\big(\mathcal{G}(I_{1:t},\, F_{2:t}),\, I_t\big) \qquad (5)$$

where the transformation $\mathcal{T}$ is realized with the SDC and operates on the most recent input $I_t$, and $F_t$ is the backwards optical flow (see Section 2.3) between $I_t$ and $I_{t-1}$. We calculate $F$ using state-of-the-art neural-network-based optical flow models [7, 9, 32].
Our approach naturally extends to multiple-frame prediction by recursively recirculating SDC-predicted frames back as inputs. For instance, to predict the frame two steps ahead, $I_{t+2}$, we recirculate the SDC-predicted frame $\hat{I}_{t+1}$ as an input to our model.
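The recursive rollout can be sketched as follows, with `model` and `flow` as hypothetical stand-ins for our network and the optical flow estimator:

```python
def predict_rollout(model, flow, frames, steps):
    """Predict `steps` frames ahead by feeding each predicted frame back in
    as the newest conditioning input."""
    frames = list(frames)
    predictions = []
    for _ in range(steps):
        # backwards flow between each frame and its predecessor
        flows = [flow(frames[i], frames[i - 1]) for i in range(1, len(frames))]
        nxt = model(frames, flows)
        predictions.append(nxt)
        frames = frames[1:] + [nxt]   # slide the conditioning window forward
    return predictions
```

Since later predictions are conditioned on earlier predictions, errors compound with the rollout length, which is why multi-step prediction is a harder setting than next-frame prediction.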
2.2 Network Architecture
We realize $\mathcal{G}$ using a fully convolutional network. Our model takes in a sequence of past frames and past optical flows and outputs pixel-wise separable kernels $k_u$ and $k_v$, and a motion vector $(u, v)$. We use 3D convolutions to convolve across width, height, and time. We concatenate the RGB channels from our input images with the two optical flow channels to create 5 channels per frame. The topology of our architecture draws inspiration from various V-net type topologies [7, 22, 29], with an encoder and a decoder. Each layer of the encoder applies 3D convolutions followed by a Leaky Rectified Linear Unit (LeakyReLU) [8] and a convolution with stride (1,2,2) to downsample features and capture long-range spatial dependencies. Following [7], we use 3x3x3 convolution kernels, except for the first and second layers, where we use 3x7x7 and 3x5x5 kernels to capture large displacements. Each decoder subpart applies deconvolutions [16] followed by LeakyReLU, and a convolution after corresponding features from the contracting part have been concatenated. The decoding part also has several heads: one for the motion vector and one each for $k_u$ and $k_v$. The last two decoding layers of the kernel heads use trilinear upsampling instead of normal deconvolution to minimize the checkerboard effect [25]. Finally, we apply repeated convolutions in each decoding head to reduce the time dimension to 1.

2.3 Optical Flow
We calculate the inter-frame optical flow we input to our model using FlowNet2 [9], a state-of-the-art optical flow model. This allows us to extrapolate motion conditioned on past flow information. We calculate backwards optical flows because we model our transformation learning problem with backwards resampling, i.e. we predict a sampling location in $I_t$ for each location in $I_{t+1}$.
It is important to note that the motion vectors predicted by our model at each pixel are not equivalent to optical flow vectors, as pure backwards optical flow is undefined (or zero-valued) for disoccluded pixels (pixels not visible in the previous frame due to occlusion). A schematic explanation of the disocclusion problem is shown in Fig. 4, where a 2×2 square is moving horizontally to the right at a speed of 1 pixel per step. The ground-truth backward optical flow is shown in Fig. 4b. As shown in Fig. 4c, resampling the square using the perfect optical flow duplicates the left border of the square, because the optical flow is zero at the second column. To achieve a perfect synthesis via resampling, as shown in Fig. 4e, adaptive sampling vectors must be used; Fig. 4d shows an example of such sampling vectors used to fill in the disoccluded region. A learned approach is necessary here, as it not only allows the disocclusion sampling to adapt to various degrees of motion, but also allows a learned choice of which background pixels from the previous frame would look best in the filled gap.
2.4 Loss Functions
Our primary loss function is the L1 loss over the predicted image:

$$\mathcal{L}_1 = \big\lVert \hat{I}_{t+1} - I_{t+1} \big\rVert_1$$

where $I_{t+1}$ is the target and $\hat{I}_{t+1}$ is the predicted frame. We found the L1 loss to be better at capturing small changes than L2, and it generally produces sharper images.

We also incorporate the L1 norm on high-level VGG-16 feature representations [30]. Specifically, we use the perceptual and style losses [11], defined as:
$$\mathcal{L}_p = \sum_{l=1}^{L} \frac{1}{N_l} \big\lVert \Psi_l(\hat{I}_{t+1}) - \Psi_l(I_{t+1}) \big\rVert_1 \qquad (6)$$

and

$$\mathcal{L}_s = \sum_{l=1}^{L} \frac{1}{N_l} \big\lVert \Psi_l(\hat{I}_{t+1})^{\top} \Psi_l(\hat{I}_{t+1}) - \Psi_l(I_{t+1})^{\top} \Psi_l(I_{t+1}) \big\rVert_1 \qquad (7)$$

Here, $\Psi_l$ is the feature map from the $l$th selected layer of an ImageNet-pretrained VGG-16, $L$ is the number of layers considered, and $N_l$ is a normalization factor (channel $\times$ height $\times$ width) for the $l$th selected layer. We use the perceptual and style loss terms in conjunction with the L1 loss over RGB as follows:

$$\mathcal{L}_f = \lambda \mathcal{L}_1 + \lambda_p \mathcal{L}_p + \lambda_s \mathcal{L}_s \qquad (8)$$
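A minimal NumPy sketch of the perceptual and style terms, with plain arrays standing in for the VGG-16 activations (the Gram-matrix form of the style loss follows [11]; the exact normalization here is illustrative):

```python
import numpy as np

def perceptual_and_style(feats_pred, feats_true):
    """feats_*: lists of (C, H, W) feature maps from the selected layers.
    Returns (perceptual, style): L1 distances on the raw features and on
    their C x C Gram matrices, each normalized by the layer size."""
    lp = ls = 0.0
    for fp, ft in zip(feats_pred, feats_true):
        c, h, w = fp.shape
        norm = c * h * w
        lp += np.abs(fp - ft).sum() / norm
        fp2, ft2 = fp.reshape(c, -1), ft.reshape(c, -1)
        gram_p = fp2 @ fp2.T / norm            # C x C Gram matrix
        gram_t = ft2 @ ft2.T / norm
        ls += np.abs(gram_p - gram_t).sum()
    return lp, ls
```

The perceptual term penalizes differences in feature content, while the Gram-matrix term penalizes differences in feature correlations, i.e. texture and style.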
We found the finetune loss to be robust in eliminating checkerboard artifacts, and it generates much sharper predictions than L1 alone.
Finally, we introduce a loss to initialize the adaptive kernels, which we found to significantly speed up training. We use the L2 norm to initialize the kernels $k_u$ and $k_v$ as middle-one-hot vectors: all elements in each kernel are set very close to zero, except for the middle element, which is initialized close to one. When $k_u$ and $k_v$ are initialized as middle-one-hot vectors, the output of our spatially displaced convolution described in equation (4) is the same as that of the vector-based approach described in equation (2). The kernel loss is expressed as:
$$\mathcal{L}_k = \frac{1}{w \cdot h} \sum_{x} \sum_{y} \Big( \big\lVert k_u(x, y) - \tilde{k} \big\rVert_2 + \big\lVert k_v(x, y) - \tilde{k} \big\rVert_2 \Big) \qquad (9)$$

where $\tilde{k}$ is a middle-one-hot vector, and $w$ and $h$ are the width and height of the images.
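A sketch of this initialization loss in NumPy (the reduction over pixel locations is our reading of equation (9); the function name is illustrative):

```python
import numpy as np

def kernel_init_loss(k_u, k_v):
    """Pull every predicted 1D kernel toward a middle-one-hot vector.
    k_u, k_v: (H, W, N) per-pixel horizontal and vertical kernels."""
    h, w, n = k_u.shape
    target = np.zeros(n)
    target[n // 2] = 1.0                          # the middle-one-hot vector
    d_u = np.linalg.norm(k_u - target, axis=-1)   # L2 distance per pixel
    d_v = np.linalg.norm(k_v - target, axis=-1)
    return (d_u + d_v).sum() / (w * h)
```

The loss is zero exactly when every kernel is middle-one-hot, i.e. when the SDC degenerates to pure vector-based resampling.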
Other loss functions considered include the L1 or L2 distance between predicted motion vectors and target optical flows. We found this loss to lead to inferior results. As discussed in Section 2.3, optimizing for optical flow will not properly handle the disocclusion problem. Further, use of estimated optical flow as a training target introduces additional noise.
2.5 Training
We trained our SDC model using frames extracted from many short video sequences. To allow our model to learn robust invariances, we selected frames from high-definition video game play with realistic, diverse content and a wide range of motion. We collected 428K 1080p frames from GTA-V and Battlefield-1 game play. Each training example consists of five consecutive 256×256 frames randomly cropped from the full-HD sequence. We use a batch size of 128 over 8 V100 GPUs.
We optimize with Adam [12] with no weight decay. First, we optimize our model to learn the motion vectors alone, using the L1 loss, for 400 epochs; optimizing for the motion vectors alone allows our network to capture large and coarse motions faster. Next, we fix all weights of the network except for the two kernel decoding heads and train them using our kernel loss defined in equation (9), initializing the kernels at each output pixel as middle-one-hot vectors. Then, we optimize all weights in our deep model using the L1 loss for 300 epochs to jointly fine-tune the motion vector and the kernels at each pixel. Since we optimize for both kernels and motion vectors in this step, our network learns to pick up small and subtle motions and corrects disocclusion-related artifacts. Finally, we further fine-tune all weights in our model using the combined loss of equation (8). The weights we use to combine the losses are 0.2, 0.06, and 36.0 for the L1, perceptual, and style terms, respectively. We use the activations from VGG-16 layers relu1_2, relu2_2, and relu3_3 for the perceptual and style loss terms. This last fine-tuning step makes predictions sharper and produces visually appealing frames in our video prediction task. We initialize the FlowNet2 model with pretrained weights (https://github.com/lmb-freiburg/flownet2) and fix them during training.

3 Experiments
We implemented all our Vector-, Kernel-, and SDC-based models using PyTorch [27]. To train our model efficiently, we wrote a custom CUDA layer for our SDC module. We set the kernel size to 51×51 for the Kernel-based model, as suggested in [24]. The kernel size for our SDC-based model is 11×11. Inference using our SDC-based model at 1080p takes 1.66 sec, of which 1.03 sec is spent on FlowNet2.

3.1 Datasets and Metrics
3.2 Comparison on low-quality videos
Table 1 presents next-frame prediction comparisons with BeyondMSE [21], PredNet [17], MCNet [34], and DualGAN [14] on the CaltechPed test partition. We also compare with CopyLast, the trivial baseline that uses the most recent past frame as the prediction. For PredNet and DualGAN, we directly report the results presented in [17] and [15], respectively.
For BeyondMSE (https://github.com/coupriec/VideoPredictionICLR2016) and MCNet (https://github.com/rubenvillegas/iclr2017mcnet), we evaluated using the released pretrained models.
Our SDC-based model outperforms all other models in both L2 and SSIM, including the state-of-the-art DualGAN model. MCNet, which was trained on a dataset equally as large as ours, shows inferior results, and the CopyLast baseline has significantly worse L2 and SSIM, making it a far less viable approach for next-frame prediction. Our Vector-based approach has higher accuracy than our Kernel-based approach; since the CaltechPed videos contain slightly larger motion, the Vector-based approach, which is advantageous in large-motion sequences, is expected to perform better.
In Fig. 5, we present qualitative comparisons on CaltechPed. SDC-Net predicted frames are crisp and sharp, and show minimal unnatural deformation of the highlighted car (framed in red). All approaches picked up the right motion, as shown by their good alignment with the ground-truth frame. However, both BeyondMSE and MCNet produce generally blurrier predictions and unnatural deformations of the highlighted car.
3.3 Comparison on high-definition videos
Table 2 presents next-frame prediction comparisons with BeyondMSE, MCNet, and CopyLast on 26 full-HD YouTube videos. Our SDC-Net model outperforms all other models in L2 and SSIM, including the state-of-the-art MCNet model.
YouTube-8M      L1      L2       PSNR   SSIM
BeyondMSE [21]  0.0271  0.00328  33.33  0.858
MCNet [34]      0.0216  0.00255  35.64  0.895
CopyLast        0.0260  0.00506  36.63  0.854
Our Vector      0.0177  0.00270  37.24  0.905
Our Kernel      0.0186  0.00303  37.33  0.904
Our SDC         0.0174  0.00240  37.15  0.911
In Fig. 6, SDC-Net is shown to provide crisp and sharp frames, with motion mostly in good alignment with the ground-truth frame. Since our models do not hallucinate pixels, they produce visually good results by exploiting the image content of the last input frame. For instance, instead of duplicating the borders of foreground objects, our models displace to appropriate locations in the previous frame and synthesize pixels by convolving the learned kernel for each pixel with an image patch centered at the displaced location.
Since our approach takes FlowNet2 [9] predicted flows as part of its input, the transformation parameters predicted by our deep model are affected by inaccurate optical flows. For instance, optical flow for the ski in Fig. 6 (bottom right) is challenging, so our model predicted the ski's movement less accurately than the skier's.
In Fig. 7, we qualitatively compare MCNet and our Kernel-, Vector-, and SDC-based methods on a large camera motion. MCNet shows significantly blurry results and is ineffective at capturing large motions; it also significantly alters the color distribution in the predicted frames. Our Kernel-based method has difficulty predicting large motion, but preserves the color distribution; however, it often moves components disjointly, leading to visually inferior results. Our Vector-based approach better captures large displacement, such as the motion present in this sequence, but its predictions suffer from pixel noise. Our SDC-based method combines the ability of the Vector-based method to predict large motions with the visually pleasing results of the Kernel-based approach.
3.4 Comparison in multi-step prediction
Previous experiments showed SDC-Net's performance in next-frame prediction. In practice, models are used to predict multiple future frames. Here, we condition each approach on five original frames and predict five future frames on CaltechPed. Fig. 8 shows that SDC-Net's multi-step predictions compare consistently favourably with previous approaches, as quantified by L1, L2, SSIM, and PSNR over 120,725 unique Caltech Pedestrian frames. Fig. 9 presents an example five-step prediction showing SDC-Net predicted frames preserving the color distribution, object shapes, and their fine details.
3.5 Ablation results
We compare our Vector-based with our SDC-based approach in Fig. 10. Our Vector-based approach struggles with disocclusions (orange box), as described in Section 2.3: in Fig. 10, the Vector-based model avoids completely stretching the glove borders, but still leaves some residual glove pixels behind. The Vector-based approach may also produce speckled noise patterns due to large motion (red box). Disocclusion artifacts and speckled noise are significantly reduced in the SDC-Net results shown in Fig. 10.
In Fig. 11, we present qualitative results for our SDC-based model trained using the L1 loss alone vs. the L1 loss followed by our finetune loss given by equation (8). We note that using the L1 loss alone leads to slightly blurry results, e.g. the glove (red box) and the fence (orange box) in Fig. 11. Fig. 11 (center column) shows the same result after fine-tuning, with finer details preserved, demonstrating that the perceptual and style losses reduce blurriness. We also observed that the finetune loss helps capture large motions that are otherwise challenging to capture.
Fig. 11 represents a challenging example due to fast motion. Since our model depends on optical flow, situations that are challenging for optical flow are also difficult for our model. The prediction errors can be seen in the relatively larger misalignment of the fence compared to the ground truth (orange box). Our approach also fails during scene transitions, where past frames are not relevant to future frames. Currently, we automatically detect scene transitions by analyzing optical flow statistics, and skip frame prediction until enough (five) valid frames are available to condition our models.
4 Conclusions
We present a 3D CNN with a novel spatially-displaced convolution (SDC) module that achieves state-of-the-art video frame prediction. Our SDC module effectively handles large motion and allows our model to predict crisp future frames with motion closely matching that of ground-truth sequences. We trained our model on 428K high-resolution video frames collected from gameplay footage. To the best of our knowledge, this is the first attempt at transfer learning from synthetic to real-life footage for video frame prediction. Our model's accuracy depends on the accuracy of the input estimated flows, leading to failures in fast-motion sequences. Future work will include a study of the effect of multi-scale architectures for fast motion.
Acknowledgements: We would like to thank Jonah Alben, Paulius Micikevicius, Nikolai Yakovenko, Ming-Yu Liu, Xiaodong Yang, Atila Orhon, Haque Ishfaq and NVIDIA Applied Research staff for suggestions and discussions, and Robert Pottorff for capturing the game datasets used for training.
References
 [1] Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., Vijayanarasimhan, S.: YouTube-8M: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675 (2016)
 [2] Babaeizadeh, M., Finn, C., Erhan, D., Campbell, R.H., Levine, S.: Stochastic variational video prediction. arXiv preprint arXiv:1710.11252 (2017)
 [3] Byeon, W., Wang, Q., Srivastava, R.K., Koumoutsakos, P.: Fully context-aware video prediction. arXiv preprint arXiv:1710.08518 (2017)
 [4] Denton, E., Fergus, R.: Stochastic video generation with a learned prior. arXiv preprint arXiv:1802.07687 (2018)
 [5] Dollár, P., Wojek, C., Schiele, B., Perona, P.: Pedestrian detection: A benchmark. In: CVPR (June 2009)
 [6] Dollár, P., Wojek, C., Schiele, B., Perona, P.: Pedestrian detection: An evaluation of the state of the art. PAMI 34 (2012)

 [7] Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D., Brox, T.: FlowNet: Learning optical flow with convolutional networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2758–2766 (2015)
 [8] He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1026–1034 (2015)

 [9] Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: FlowNet 2.0: Evolution of optical flow estimation with deep networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). vol. 2 (2017)
 [10] Jiang, H., Sun, D., Jampani, V., Yang, M.H., Learned-Miller, E., Kautz, J.: Super SloMo: High quality estimation of multiple intermediate frames for video interpolation. arXiv preprint arXiv:1712.00080 (2017)

 [11] Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. In: European Conference on Computer Vision. pp. 694–711. Springer (2016)
 [12] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
 [13] Leibfried, F., Kushman, N., Hofmann, K.: A deep learning approach for joint video frame and reward prediction in atari games. arXiv preprint arXiv:1611.07078 (2016)
 [14] Liang, X., Lee, L., Dai, W., Xing, E.P.: Dual motion GAN for future-flow embedded video prediction. In: ICCV (2017)
 [15] Liu, Z., Yeh, R., Tang, X., Liu, Y., Agarwala, A.: Video frame synthesis using deep voxel flow. In: International Conference on Computer Vision (ICCV). vol. 2 (2017)
 [16] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3431–3440 (2015)

 [17] Lotter, W., Kreiman, G., Cox, D.: Deep predictive coding networks for video prediction and unsupervised learning. In: ICLR (2017)
 [18] Lu, C., Hirsch, M., Schölkopf, B.: Flexible spatiotemporal networks for video prediction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6523–6531 (2017)
 [19] Luc, P., Neverova, N., Couprie, C., Verbeek, J., LeCun, Y.: Predicting deeper into the future of semantic segmentation. In: ICCV (2017)
 [20] Mahjourian, R., Wicke, M., Angelova, A.: Geometrybased next frame prediction from monocular video. In: Intelligent Vehicles Symposium (IV), 2017 IEEE. pp. 1700–1707. IEEE (2017)
 [21] Mathieu, M., Couprie, C., LeCun, Y.: Deep multi-scale video prediction beyond mean square error. In: International Conference on Learning Representations (2016)
 [22] Milletari, F., Navab, N., Ahmadi, S.A.: V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In: 3D Vision (3DV), 2016 Fourth International Conference on. pp. 565–571. IEEE (2016)
 [23] Niklaus, S., Mai, L., Liu, F.: Video frame interpolation via adaptive convolution. In: IEEE Conference on Computer Vision and Pattern Recognition (2017)
 [24] Niklaus, S., Mai, L., Liu, F.: Video frame interpolation via adaptive separable convolution. In: IEEE International Conference on Computer Vision (2017)
 [25] Odena, A., Dumoulin, V., Olah, C.: Deconvolution and checkerboard artifacts. Distill (2016). https://doi.org/10.23915/distill.00003, http://distill.pub/2016/deconvcheckerboard
 [26] Oliu, M., Selva, J., Escalera, S.: Folded recurrent neural networks for future video prediction. arXiv preprint arXiv:1712.00311 (2017)
 [27] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in pytorch (2017)
 [28] Ranzato, M., Szlam, A., Bruna, J., Mathieu, M., Collobert, R., Chopra, S.: Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604 (2014)
 [29] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 234–241. Springer (2015)
 [30] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014)
 [31] Srivastava, N., Mansimov, E., Salakhutdinov, R.: Unsupervised learning of video representations using LSTMs. In: ICML (2015)
 [32] Sun, D., Yang, X., Liu, M.Y., Kautz, J.: PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. arXiv preprint arXiv:1709.02371 (2017)
 [33] Van Amersfoort, J., Kannan, A., Ranzato, M., Szlam, A., Tran, D., Chintala, S.: Transformationbased models of video sequences. arXiv preprint arXiv:1701.08435 (2017)
 [34] Villegas, R., Yang, J., Hong, S., Lin, X., Lee, H.: Decomposing motion and content for natural video sequence prediction. In: ICLR (2017)
 [35] Vondrick, C., Pirsiavash, H., Torralba, A.: Generating videos with scene dynamics. In: Advances In Neural Information Processing Systems. pp. 613–621 (2016)
 [36] Vondrick, C., Torralba, A.: Generating the future with adversarial transformers. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) pp. 2992–3000 (2017)
 [37] Vukotić, V., Pintea, S.L., Raymond, C., Gravier, G., van Gemert, J.C.: One-step time-dependent future video frame prediction with a convolutional encoder-decoder neural network. In: International Conference on Image Analysis and Processing. pp. 140–151. Springer (2017)
 [38] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13(4), 600–612 (2004)