Extrapolative-Interpolative Cycle-Consistency Learning for Video Frame Extrapolation

05/27/2020 ∙ by Sangjin Lee, et al. ∙ Yonsei University

Video frame extrapolation is the task of predicting future frames given past frames. Unlike previous studies, which have usually focused on module design or network construction, we propose a novel Extrapolative-Interpolative Cycle (EIC) loss that uses a pre-trained frame interpolation module to improve extrapolation performance. Cycle-consistency loss has been used for stable prediction between two function spaces in many visual tasks. We formulate this cycle-consistency using two mapping functions: frame extrapolation and frame interpolation. Since predicting intermediate frames is easier than predicting future frames in terms of object occlusion and motion uncertainty, the interpolation module can provide an effective guidance signal for training the extrapolation function. EIC loss can be applied to any existing extrapolation algorithm and leads to consistent prediction for both short-future and long-future frames. Experimental results show that simply adding EIC loss to an existing baseline increases extrapolation performance on both the UCF101 and KITTI datasets.




1 Introduction

Video frame extrapolation is the task of predicting future frames. It is very challenging because it requires a comprehensive understanding of objects and their motion. It can serve as a key component of video applications such as future forecasting, action recognition, and video compression.

Two of the most important problems in frame extrapolation are blurry output images and vulnerability to occlusion, and recent extrapolation studies have tried to address them in various ways. Extrapolation methods can be classified into three categories: pixel-based, flow-based, and hybrid. Pixel-based methods [1, 13, 16, 12] use a 3D convolutional network or ConvLSTM [20] / GRU [2] to predict future pixels directly, and are expected to learn the flow information between frames implicitly during training. However, even with a Generative Adversarial Network (GAN) [5] or a Variational Autoencoder (VAE) [7], pixel-based methods produce blurry images because of the recurrent modeling. Flow-based methods [11, 9, 3] obtain the target frame by warping the input frame with a predicted target flow map. In contrast to pixel-based methods, they produce sharper images by moving pixels along the predicted flows, but they are vulnerable to objects with large motion or occlusion. To solve these problems, some works [18, 10] combine both approaches in a single network so that they complement each other, which gives better performance. We use this hybrid method as our baseline.

Figure 1: $I_{t+1}$ is the target frame and $I_{t-n+1}, \dots, I_t$ are the input frames of the extrapolation network. The output of the extrapolation network ($\hat{I}_{t+1}$) and the frame $I_{t-1}$ are passed into the interpolation network. By equation (5), the EIC loss (red dotted line) is obtained from the output of the interpolation network ($\tilde{I}_t$).

In our research, we look for ways to increase performance by designing the loss rather than the network architecture. Cycle-consistency loss has been used in existing research [22, 14, 8], where it has improved learning stability and performance. Unlike in those works, however, it is difficult to apply cycle-consistency loss to video frame extrapolation in the traditional way, because of the uncertainty of extrapolation. In [14], the authors show that optical flow performance can be increased by training with forward-backward flow learning. Because optical flow estimation usually requires two frames, this forward-backward consistency works effectively. In video extrapolation, [8] trains prediction models with a forward-backward consistency loss so that they also predict the past frames. However, training a single prediction function from scratch with a cycle loss can make training unstable. In [22], the authors propose a cycle-consistency loss between two mapping functions, both trained from scratch, for unsupervised image translation. Different from these models, we build a new cycle-consistency from extrapolation and interpolation. Because frame interpolation is easier than extrapolation in terms of object occlusion and motion uncertainty, it can give a stable guidance signal for training the extrapolation function. Specifically, in frame interpolation, all the information needed to estimate the intermediate frame can be obtained from at least one adjacent frame. For this reason, our goal is to increase extrapolation performance by adding a cycle-consistency loss built on the interpolation module.

In this paper, we propose a loss, called the Extrapolative-Interpolative Cycle (EIC) loss, that can be applied easily to frame extrapolation. We then add it to an existing extrapolation loss to measure the performance difference. To check its effectiveness on hard predictions, we verify the results not only for short-future frames but also for long-future frames. Results show that our loss gives consistent performance gains in various settings.

2 Proposed Algorithm

2.1 Overview

The overall concept of the Extrapolative-Interpolative Cycle (EIC) loss is described in Fig. 1. The system for EIC loss consists of two functions: an extrapolation network and an interpolation network. Note that any algorithm can be used for the two functions as long as equations (1) and (2) are satisfied. For the frame extrapolation network $F_{ext}$, we define the function that predicts the next future frame as

$$\hat{I}_{t+1} = F_{ext}(I_{t-n+1}, \dots, I_t), \tag{1}$$

where $I_{t-n+1}, \dots, I_t$ are the given frames, $n$ is the number of given past frames and $I_{t+1}$ is the target frame. For the frame interpolation network $F_{int}$, we define the function that predicts the intermediate frame as

$$\hat{I}_t = F_{int}(I_{t-1}, I_{t+1}). \tag{2}$$

After predicting the next frame $\hat{I}_{t+1}$, since the interpolation network generates an intermediate frame when the two frames around it are given, we can re-synthesize the $t$-th frame from the $(t-1)$-th and $(t+1)$-th frames.
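The extrapolate-then-interpolate cycle described above can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: the two functions below are toy stand-ins (linear extension and per-pixel averaging), not the networks used in the paper, and the names `extrapolate`, `interpolate` and `resynthesize_current` are hypothetical.

```python
from typing import List, Sequence

Frame = List[float]  # toy stand-in for an image: a flat list of pixel intensities


def extrapolate(past: Sequence[Frame]) -> Frame:
    """Toy extrapolator (assumption): linear extension of the last two frames."""
    prev, last = past[-2], past[-1]
    return [2.0 * b - a for a, b in zip(prev, last)]


def interpolate(before: Frame, after: Frame) -> Frame:
    """Toy interpolator (assumption): per-pixel average of the two neighbours."""
    return [(a + b) / 2.0 for a, b in zip(before, after)]


def resynthesize_current(past: Sequence[Frame]) -> Frame:
    """Predict frame t+1, then rebuild frame t from frame t-1 and the prediction."""
    predicted_next = extrapolate(past)            # plays the role of equation (1)
    return interpolate(past[-2], predicted_next)  # equation (2) with the prediction as input
```

With these toy functions and perfectly linear motion, the re-synthesized frame coincides with the true frame at time t, which is the consistency that the cycle is meant to encourage.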

        DVF [11]    SuperSlomo [6]    SepConv [15]
PSNR    34.31       33.92             34.38
SSIM    0.949       0.949             0.951

Table 1: Performance (PSNR and SSIM [19]) of frame interpolation modules on UCF101.
Method            UCF101             KITTI
                  PSNR     SSIM      PSNR     SSIM
Baseline          27.10    0.861     22.61    0.730
First λ value
  + DVF           28.24    0.862     23.33    0.760
  + SuperSlomo    28.01    0.868     22.86    0.759
  + SepConv       28.20    0.876     22.88    0.761
Second λ value
  + DVF           28.34    0.877     23.17    0.771
  + SuperSlomo    28.10    0.883     22.98    0.722
  + SepConv       28.29    0.889     23.01    0.773
Third λ value
  + DVF           28.12    0.863     23.05    0.761
  + SuperSlomo    27.96    0.865     22.90    0.731
  + SepConv       28.09    0.873     22.95    0.765

Table 2: Performance (PSNR and SSIM) of video frame extrapolation on UCF101 and KITTI with and without our EIC loss. The three row groups correspond to the three tested values of the hyper-parameter λ, in the order reported.
UCF101
Method          t+1                 t+2                 t+3                 t+4
Baseline        27.10 / 0.861       22.45 / 0.770       19.44 / 0.688       17.26 / 0.613
+ DVF           28.24 / 0.862       24.28 / 0.771       21.85 / 0.691       20.13 / 0.635
                (1.14) / (0.001)    (1.83) / (0.001)    (2.41) / (0.003)    (2.87) / (0.022)
+ SuperSlomo    28.01 / 0.868       24.20 / 0.785       21.87 / 0.720       20.22 / 0.670
                (0.91) / (0.007)    (1.75) / (0.015)    (2.43) / (0.032)    (2.96) / (0.057)
+ SepConv       28.20 / 0.876       24.17 / 0.789       21.51 / 0.722       19.50 / 0.666
                (1.10) / (0.015)    (1.72) / (0.019)    (2.07) / (0.034)    (2.24) / (0.053)

KITTI
Method          t+1                 t+2                 t+3                 t+4
Baseline        22.61 / 0.730       19.58 / 0.638       17.32 / 0.563       15.61 / 0.510
+ DVF           23.33 / 0.760       20.86 / 0.665       18.88 / 0.593       17.36 / 0.537
                (0.72) / (0.030)    (1.28) / (0.027)    (1.56) / (0.030)    (1.75) / (0.027)
+ SuperSlomo    22.86 / 0.759       20.25 / 0.666       18.32 / 0.599       16.94 / 0.550
                (0.25) / (0.029)    (0.67) / (0.028)    (1.00) / (0.036)    (1.33) / (0.040)
+ SepConv       22.88 / 0.761       20.30 / 0.673       18.26 / 0.600       16.70 / 0.541
                (0.27) / (0.031)    (0.72) / (0.035)    (0.94) / (0.037)    (1.09) / (0.031)

Table 3: Performance (PSNR / SSIM) of long future frame extrapolation on UCF101 and KITTI. The values in parentheses are the improvements over the baseline obtained with the additional loss. Four frames ($I_{t+1}$, $I_{t+2}$, $I_{t+3}$ and $I_{t+4}$) are predicted.

2.2 Learning with EIC Loss

To train the extrapolation network, various combinations of losses have been used in previous works (yellow dashed line in Fig. 1). We can formulate these losses as

$$L_{ext} = L_{err} + L_{gen} + L_{reg},$$

where $L_{err}$, $L_{gen}$ and $L_{reg}$ indicate the error-based, generative model-based and regularization losses, respectively. The choice of each term, and of any weighting between the terms, is algorithm-specific. The error-based loss ($L_{err}$) can be either an $\ell_1$ loss or an $\ell_2$ loss. Examples of the generative model-based loss ($L_{gen}$) are the GAN loss and the KL-divergence term of the VAE loss. A smoothing loss such as total variation (TV) in the pixel or flow domain can be used as the regularization loss ($L_{reg}$).
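As a rough illustration of the error-based and regularization terms named above, the following pure-Python sketch computes an ℓ1 loss, an ℓ2 loss and an anisotropic total variation on toy images. It is a sketch under assumptions, not the baseline's implementation: the generative term is omitted, and `reg_weight` and the function names are hypothetical.

```python
from typing import List

Image = List[List[float]]  # toy grayscale image: rows of pixel intensities


def l1_loss(pred: Image, target: Image) -> float:
    """Mean absolute error over all pixels."""
    diffs = [abs(p - t) for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)


def l2_loss(pred: Image, target: Image) -> float:
    """Mean squared error over all pixels."""
    diffs = [(p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)


def total_variation(img: Image) -> float:
    """Anisotropic TV: sum of absolute differences between neighbouring pixels."""
    horizontal = sum(abs(row[x + 1] - row[x]) for row in img for x in range(len(row) - 1))
    vertical = sum(
        abs(img[y + 1][x] - img[y][x])
        for y in range(len(img) - 1)
        for x in range(len(img[0]))
    )
    return horizontal + vertical


def extrapolation_loss(pred: Image, target: Image, reg_weight: float) -> float:
    """Error-based term plus a weighted TV regularizer (generative term omitted)."""
    return l1_loss(pred, target) + reg_weight * total_variation(pred)
```

Which terms are used, and how they are weighted, depends on the extrapolation algorithm being trained.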

Our proposed Extrapolative-Interpolative Cycle (EIC) loss can simply be added to the existing loss. It is defined as

$$L_{EIC} = \lVert I_t - \tilde{I}_t \rVert, \tag{5}$$

where $I_t$ is the ground-truth $t$-th frame and $\tilde{I}_t = F_{int}(I_{t-1}, \hat{I}_{t+1})$ is obtained by entering the extrapolated frame $\hat{I}_{t+1}$ as the input instead of $I_{t+1}$ in equation (2) (red dashed line in Fig. 1). We use pre-trained interpolation networks for $F_{int}$ without fine-tuning. Finally, our total loss to train the extrapolation network can be summarized as

$$L_{total} = L_{ext} + \lambda L_{EIC}, \tag{6}$$

where $L_{ext}$ is the existing extrapolation loss and $\lambda$ is a hyper-parameter that balances the extrapolation loss and the cycle guidance loss. We report the performance differences depending on the choice of $\lambda$ in Section 3. The network can be trained end-to-end with our loss, and the training time is almost the same. In the test phase, only the extrapolation network is used, because the interpolation network is needed only to compute our loss during training. Therefore, compared to existing methods, the proposed method requires no additional computation or parameters at test time.
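A minimal sketch of how the cycle term could be combined with an extrapolation loss for one training sample is given below. Several things here are assumptions made for illustration: the distance is taken to be a mean absolute error, the extrapolation loss is reduced to its error-based term, frames are flat lists of floats, and `eic_training_loss` and its arguments are hypothetical names. In a real training setup the interpolation function would be a frozen pre-trained network and only the extrapolator would receive gradients.

```python
from typing import Callable, List, Sequence, Tuple

Frame = List[float]  # toy stand-in for an image


def mean_abs_error(a: Frame, b: Frame) -> float:
    """Mean absolute difference between two frames (assumed distance)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def eic_training_loss(
    past: Sequence[Frame],
    target: Frame,
    extrapolate: Callable[[Sequence[Frame]], Frame],
    interpolate: Callable[[Frame, Frame], Frame],
    lam: float,
) -> Tuple[float, float, float]:
    """Return (extrapolation loss, EIC loss, total loss) for one sample."""
    predicted_next = extrapolate(past)                     # frame t+1 from past frames
    ext_loss = mean_abs_error(predicted_next, target)      # error-based term only
    resynthesized = interpolate(past[-2], predicted_next)  # frame t from t-1 and predicted t+1
    eic_loss = mean_abs_error(resynthesized, past[-1])     # compare with the true frame t
    return ext_loss, eic_loss, ext_loss + lam * eic_loss
```

Because the interpolation step appears only inside the loss, dropping it at test time leaves the extrapolator's inference cost unchanged.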

Figure 2: Qualitative results of long future frame extrapolation on the UCF101 dataset. Ours-1 and Ours-2 denote our models trained with DVF and SepConv, respectively. Error maps show the errors between the GT images and the predicted images.

3 Experiments

3.1 Settings

We select Dual Motion GAN [10] as our baseline, since it is a hybrid method and a representative algorithm for frame extrapolation. Different from Dual Motion GAN, six previous frames ($n = 6$) are given as input for video extrapolation. The Vimeo90K [21] dataset is used for training, and the UCF101 [17] and KITTI [4] datasets are used for evaluation. Other settings, such as the baseline loss function and the learning rate, are identical to those of the prior work, and our EIC loss is added using equation (6).

For EIC loss, three representative pre-trained video frame interpolation models are used in our research: Deep Voxel Flow (DVF) [11], SuperSlomo [6] and SepConv [15]. We report the interpolation performance of these models on the UCF101 dataset in Table 1 when they are trained on the Vimeo90K dataset.

3.2 Extrapolation Results

We evaluate the performance by measuring PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity) [19] on all test datasets. Instead of using the provided models, we train the three interpolation models [11, 6, 15] from scratch to identify differences in extrapolation performance. Table 2 reports the extrapolation results, where higher PSNR and SSIM values indicate better extrapolation. In general, the results with EIC loss roughly follow the performance of the interpolation modules. For example, DVF and SepConv outperform SuperSlomo and the baseline in PSNR on all datasets and settings. If we choose the interpolation module and the hyper-parameter $\lambda$ properly (e.g. DVF with its best-performing $\lambda$), the extrapolation performance (PSNR) on the UCF101 dataset increases by 1.24 dB, from 27.10 dB to 28.34 dB, without changing the network structure.
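For reference, PSNR can be computed from the mean squared error between a predicted frame and its ground truth. The sketch below uses the standard definition on flat lists of pixel values. The function name `psnr` and the default peak value of 255 are assumptions, since the evaluation code of the paper is not given.

```python
import math
from typing import List


def psnr(pred: List[float], target: List[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized frames."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)
    if mse == 0.0:
        return math.inf  # identical frames: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Since PSNR is logarithmic in the mean squared error, a gain of about 1 dB corresponds to roughly a 20% reduction in that error.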

Additionally, we train models with three values of the hyper-parameter $\lambda$. In general, one of these values outperforms the others. However, on the KITTI dataset, some models with a larger $\lambda$ perform better, because when the videos contain more complex motion, the interpolation module guides the uncertain motion more effectively than it does in models with small $\lambda$ values.

3.3 Extrapolation Results on Long Future Frames

Another purpose of designing the EIC loss is to improve long future frame prediction. To verify this, we test the trained model by predicting four future frames ($I_{t+1}$, $I_{t+2}$, $I_{t+3}$, $I_{t+4}$). When the model receives $n$ input frames, we can obtain $\hat{I}_{t+1}$ by equation (1). In contrast, to obtain the following frames ($\hat{I}_{t+2}$, $\hat{I}_{t+3}$, $\hat{I}_{t+4}$), we do not have the ideal inputs ($I_{t+1}$, $I_{t+2}$, $I_{t+3}$). Therefore, in long future prediction, we use the predicted frames ($\hat{I}_{t+1}$, $\hat{I}_{t+2}$, $\hat{I}_{t+3}$) instead of the ground-truth frames ($I_{t+1}$, $I_{t+2}$, $I_{t+3}$) as input frames. As a result, performance degrades as the number of predicted frames increases, because prediction errors propagate.
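The recursive procedure above, in which each prediction is fed back as an input, can be sketched as follows. The sliding window of fixed length `n` is an assumption about how the inputs are updated, and `rollout` and its arguments are hypothetical names. Any callable that maps a list of past frames to one frame can be passed in as the extrapolator.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # toy stand-in for an image


def rollout(
    past: Sequence[Frame],
    extrapolate: Callable[[Sequence[Frame]], Frame],
    steps: int,
    n: int,
) -> List[Frame]:
    """Predict `steps` future frames, reusing earlier predictions as inputs."""
    window = list(past[-n:])  # the n most recent frames, all ground truth at first
    predictions: List[Frame] = []
    for _ in range(steps):
        next_frame = extrapolate(window)
        predictions.append(next_frame)
        # Slide the window: drop the oldest frame and append the prediction,
        # so later steps depend on earlier (possibly wrong) outputs.
        window = window[1:] + [next_frame]
    return predictions
```

Any error in an early prediction stays in the window for the next `n` steps, which is why quality drops with the prediction horizon.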

Table 3 shows how the prediction performance changes with the index of the future frame. In general, due to error propagation, PSNR and SSIM decrease as the predicted frame moves farther from the input frames. The PSNR gap to the baseline model (values in parentheses) widens as the prediction horizon grows, which indicates that our model gives more consistent prediction in long future frames. Fig. 2 shows visual results and error maps for long future frames, with better visual quality and lower error propagation (e.g. fewer artifacts) as the prediction goes further into the future.
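The widening gap can be checked directly from the UCF101 PSNR values of Table 3. The short sketch below recomputes the gains of the DVF variant over the baseline for the four predicted frames. The numbers are copied from the table, and the variable and function names are hypothetical.

```python
from typing import List

# UCF101 PSNR values (dB) for predicted frames t+1 .. t+4, copied from Table 3.
BASELINE_PSNR = [27.10, 22.45, 19.44, 17.26]
DVF_PSNR = [28.24, 24.28, 21.85, 20.13]


def gains(ours: List[float], baseline: List[float]) -> List[float]:
    """Per-frame PSNR improvement over the baseline, rounded to two decimals."""
    return [round(o - b, 2) for o, b in zip(ours, baseline)]


def is_strictly_increasing(values: List[float]) -> bool:
    """True if every value is larger than the one before it."""
    return all(later > earlier for earlier, later in zip(values, values[1:]))


dvf_gains = gains(DVF_PSNR, BASELINE_PSNR)
gap_widens = is_strictly_increasing(dvf_gains)
```

The same check can be repeated for the other rows of Table 3 by swapping in their values.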

4 Conclusion

In this work, we propose a novel Extrapolative-Interpolative Cycle (EIC) loss for video frame extrapolation. With our EIC loss added, the model can be trained at almost the same speed as the baseline and achieves improved performance without increasing memory usage. EIC loss can be applied to any combination of extrapolation and interpolation modules without modifying the network structures. As shown in Section 3, adding our EIC loss improves performance both qualitatively and quantitatively. Since the interpolation guidance makes uncertain predictions more stable, and better short future prediction quality mitigates error propagation, long future frame prediction also improves, with a PSNR gain over the baseline that grows with the prediction horizon.


This work was supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. 2016-0-00197, Development of the high-precision natural 3D view generation technology using smart-car multi sensors and deep learning).


  • [1] W. Byeon, Q. Wang, R. Kumar Srivastava, and P. Koumoutsakos (2018) ContextVP: fully context-aware video prediction. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 753–769.
  • [2] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
  • [3] H. Gao, H. Xu, Q. Cai, R. Wang, F. Yu, and T. Darrell (2019) Disentangling propagation and generation for video prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9006–9015.
  • [4] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the KITTI dataset. International Journal of Robotics Research (IJRR).
  • [5] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680.
  • [6] H. Jiang, D. Sun, V. Jampani, M. Yang, E. Learned-Miller, and J. Kautz (2018) Super SloMo: high quality estimation of multiple intermediate frames for video interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9000–9008.
  • [7] D. P. Kingma and M. Welling (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
  • [8] Y. Kwon and M. Park (2019) Predicting future frames using retrospective cycle GAN. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [9] Y. Li, C. Fang, J. Yang, Z. Wang, X. Lu, and M. Yang (2018) Flow-grounded spatial-temporal video prediction from still images. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 600–615.
  • [10] X. Liang, L. Lee, W. Dai, and E. P. Xing (2017) Dual motion GAN for future-flow embedded video prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1744–1752.
  • [11] Z. Liu, R. A. Yeh, X. Tang, Y. Liu, and A. Agarwala (2017) Video frame synthesis using deep voxel flow. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4463–4471.
  • [12] W. Lotter, G. Kreiman, and D. Cox (2016) Deep predictive coding networks for video prediction and unsupervised learning. arXiv preprint arXiv:1605.08104.
  • [13] C. Lu, M. Hirsch, and B. Scholkopf (2017) Flexible spatio-temporal networks for video prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6523–6531.
  • [14] S. Meister, J. Hur, and S. Roth (2018) UnFlow: unsupervised learning of optical flow with a bidirectional census loss. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • [15] S. Niklaus, L. Mai, and F. Liu (2017) Video frame interpolation via adaptive separable convolution. In Proceedings of the IEEE International Conference on Computer Vision, pp. 261–270.
  • [16] M. Oliu, J. Selva, and S. Escalera (2018) Folded recurrent neural networks for future video prediction. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 716–731.
  • [17] K. Soomro, A. R. Zamir, and M. Shah (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. CoRR abs/1212.0402.
  • [18] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee (2017) Decomposing motion and content for natural video sequence prediction. arXiv preprint arXiv:1706.08033.
  • [19] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612.
  • [20] S. Xingjian, Z. Chen, H. Wang, D. Yeung, W. Wong, and W. Woo (2015) Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pp. 802–810.
  • [21] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman (2019) Video enhancement with task-oriented flow. International Journal of Computer Vision (IJCV) 127 (8), pp. 1106–1125.
  • [22] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In 2017 IEEE International Conference on Computer Vision (ICCV).