Frame Interpolation with Multi-Scale Deep Loss Functions and Generative Adversarial Networks

11/16/2017 · Joost van Amersfoort et al. · Twitter

Frame interpolation attempts to synthesise intermediate frames given one or more consecutive video frames. In recent years, deep learning approaches, and in particular convolutional neural networks, have succeeded at tackling low- and high-level computer vision problems including frame interpolation. There are two main pursuits in this line of research, namely algorithm efficiency and reconstruction quality. In this paper, we present a multi-scale generative adversarial network for frame interpolation (FIGAN). To maximise the efficiency of our network, we propose a novel multi-scale residual estimation module where the predicted flow and synthesised frame are constructed in a coarse-to-fine fashion. To improve the quality of synthesised intermediate video frames, our network is jointly supervised at different levels with a perceptual loss function that consists of an adversarial and two content losses. We evaluate the proposed approach using a collection of 60 fps videos from YouTube-8m. Our results improve over the state of the art in accuracy and efficiency, and achieve a subjective visual quality comparable to the best performing interpolation methods.


1 Introduction

Frame interpolation attempts to synthetically generate one or more plausible intermediate video frames from existing ones, the simple case being the interpolation of one frame given two adjacent video frames. This is a challenging problem requiring a solution that can model natural motion within a video and generate frames that respect this modelling. Artificially increasing the frame-rate of videos enables a range of new applications. For example, data compression can be achieved by actively dropping video frames at the emitting end and recovering them via interpolation on the receiving end [25]. Increasing video frame-rate also directly improves visual quality or enables an artificial slow-motion effect [1, 18, 17].

Frame interpolation commonly relies on optical flow [29, 22, 4, 17]. Optical flow relates consecutive frames in a sequence, describing the displacement that each pixel undergoes from one frame to the next. One solution for frame interpolation is therefore to assume constant velocity in motion between existing frames and interpolate frames via warping. However, optical flow estimation is difficult and time-consuming; a good illustration of this is that the average run-time per frame of the top five performing methods of 2017 in the Middlebury benchmark dataset [4] is 1.47 minutes (runtime reported by the authors and not normalised by processor speed). Furthermore, there is in general no consensus on a single model that can accurately describe optical flow. Different models have been proposed based on inter-frame colour or gradient constancy, but these are susceptible to failure in challenging conditions such as occlusion, illumination changes or nonlinear structural changes. As a result, methods that obtain frame interpolation as a derivative of flow suffer from inaccuracies in flow estimation.

In recent years, deep learning approaches, and in particular CNNs, have set new state-of-the-art results across many computer vision problems, and have also produced new optical flow estimation methods. In [7, 12], optical flow features are trained in a supervised setup mapping two frames to their ground truth optical flow field. The introduction of spatial transformer networks [13] allows an image to be spatially transformed as part of a differentiable network. Hence, a transformation can be learnt implicitly by a network in an unsupervised fashion, enabling frame interpolation with an end-to-end differentiable network [17]. Choices in network design and training strategy directly affect interpolation quality as well as efficiency. Multi-scale residual estimations have been repeatedly proposed in the literature [23, 28, 22], but only simple models based on colour constancy have been explored. More recently, training strategies have been proposed for low-level vision tasks that go beyond pixel-wise error metrics, making use of more abstract data representations and adversarial training, leading to more visually pleasing results [14, 16]. An example of this notion applied to frame interpolation networks has been explored very recently in [20].

In this paper we propose a real-time frame interpolation method that can generate realistic intermediate frames with high PSNR. It is the first model that combines the pyramidal structure of classical optical flow modelling with recent advances in spatial transformer networks for frame interpolation. Compared to naive CNN processing, this yields both a speedup and an increase in PSNR. Furthermore, to work around the natural limitations of flow-based warping under intensity variations and nonlinear deformations, we investigate deep loss functions and adversarial training. These contributions result in an interpolation model that is more expressive and informative than models based solely on pixel intensity losses, as illustrated in Table 3 and Fig. 6.

2 Related Work

2.1 Frame interpolation with optical flow

The main challenge in frame interpolation lies in respecting object motion and occlusions such as to recreate a plausible frame that preserves structure and consistency of data across frames. Although there has been work in frame interpolation without explicit motion estimation [18], the vast majority of frame interpolation methods relies on flow estimation to obtain a description of how a scene moves from one frame to the next [4, 22, 17].

Let us define two consecutive frames as $I_t$ and $I_{t+1}$; their optical flow relationship can be formulated as

$$I_{t+1}(x, y) = I_t\big(x + \Delta_x(x, y),\, y + \Delta_y(x, y)\big), \qquad (1)$$

where $\Delta = (\Delta_x, \Delta_y)$, and $\Delta_x$ and $\Delta_y$ are pixel-wise displacement fields for dimensions $x$ and $y$. For convenience, we will use the shorter notation $I^{\Delta}$ to refer to an image $I$ with coordinate displacement $\Delta$, and write $I_{t+1} = I_t^{\Delta}$. Multiple strategies can be adopted for the estimation of $\Delta$, ranging from a classic minimisation of an energy functional given flow smoothness constraints [10], to recent proposals employing neural networks [7]. Flow amplitude can vary greatly, from slow moving details to large displacements caused by fast motion or camera panning, and in order to efficiently account for this variability flow can be approximated at multiple scales. Finer flow scales take advantage of estimations at coarser scales to progressively estimate the final flow in a coarse-to-fine fashion [5, 6, 11]. Given an optical flow between two frames, an interpolated intermediate frame can be estimated by projecting the flow to time $t + 1/2$ and pulling intensity values bidirectionally from frames $I_t$ and $I_{t+1}$. A description of this interpolation mechanism can be found in [4].
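For concreteness, the backward warping and bidirectional pulling described above can be sketched as follows. This is a minimal illustration in PyTorch (the framework and all function names are our own assumptions, not the paper's code), and it omits the flow projection step of [4] by assuming the flow is already expressed on the intermediate frame's pixel grid.

```python
# Minimal sketch: backward warping with a flow field and a simple bidirectional
# pull to the midpoint t + 1/2. Assumes NCHW tensors and flow in pixel units
# with channel 0 = horizontal (x) and channel 1 = vertical (y) displacement.
import torch
import torch.nn.functional as F

def backward_warp(frame, flow):
    """Sample `frame` at positions displaced by `flow` (N x 2 x H x W)."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(frame.device)   # 2 x H x W pixel grid
    coords = base.unsqueeze(0) + flow                              # displaced coordinates
    # Normalise coordinates to [-1, 1] as required by grid_sample.
    grid_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    grid_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                   # N x H x W x 2
    return F.grid_sample(frame, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)

def interpolate_midpoint(frame_prev, frame_next, flow):
    """Assume constant velocity: pull half the flow from each neighbouring frame."""
    from_prev = backward_warp(frame_prev, -0.5 * flow)   # sample I_t towards t + 1/2
    from_next = backward_warp(frame_next, 0.5 * flow)    # sample I_{t+1} towards t + 1/2
    return 0.5 * (from_prev + from_next)                 # simple average of both pulls
```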

2.2 Neural networks for frame interpolation

Neural network solutions have been proposed for the supervised learning of optical flow fields from labelled data [7, 12, 19, 20]. Although these have been successful and could be used for frame interpolation in the paradigm of flow estimation followed by independent interpolation, there is an inherent limitation in that flow-labelled data is scarce and expensive to produce. It is possible to work around this limitation by training on rendered videos where the ground truth flow is known [7, 12], although this solution is susceptible to overfitting to synthetically generated data. An approach to directly interpolate images has recently been suggested in [19, 20], where large convolution kernels are estimated for each output pixel value. Although results are visually appealing, the complexity of these approaches has not been constrained to meet real-time runtime requirements.

Spatial transformers [13] have recently been used for the unsupervised learning of optical flow by learning how to warp a video frame onto its consecutive frame [24, 32, 3]. In [17] a spatial transformer is used to directly synthesise an interpolated frame, using a CNN to estimate flow features and spatial weights that handle occlusions. Although flow is estimated at different scales, fine flow estimations do not reuse coarse flow estimations as in the traditional pyramidal flow estimation paradigm, potentially indicating design inefficiencies.

2.3 Deep loss functions

Optimisation for low-level vision problems often minimises a pixel-wise metric such as squared or absolute errors, as these are objective definitions of the distance between true data and its estimation. However, it has recently been shown that departing from pixel space and evaluating modelling error in a different, more abstract dimensional space can be beneficial. In [14] it is shown how high dimensional features from the VGG network help construct a training loss function that correlates better with human perception than MSE for the task of image super-resolution. In [16] this is further enhanced with the use of GANs. Neural network solutions for frame interpolation have been limited to the choice of classical objective metrics such as colour constancy [17, 19], but recently [20] has shown that perceptual losses can also be beneficial for this problem. Training for frame interpolation with adversarial losses has nevertheless not yet been explored.

Figure 2: Overview of the frame interpolation method. Flow is estimated from the two input frames at three successively coarser scales. The finest flow scale is used to synthesise the final frame. Optionally, intermediate flow scales can be used to synthesise coarse interpolated frames in a multi-scale supervision module contributing to the training cost function, and the synthesised frame can be further processed through a synthesis refinement module.

2.4 Contribution

We propose a neural network solution for frame interpolation that benefits from a multi-scale refinement of an implicitly learnt flow. The structural change to progressively apply and estimate flow fields has runtime implications, as it presents an efficient use of computational network resources compared to the baseline, as illustrated in Table 2. Additionally, we introduce a synthesis refinement module for the interpolation result inspired by [9], which proves helpful in correcting reconstruction results. Finally, we propose a higher level, more expressive interpolation error model combining classical colour constancy with perceptual and adversarial loss functions. Our main contributions are:

  • A real-time neural network for frame interpolation.

  • A multi-scale network architecture inspired by multi-scale optical flow estimation that progressively applies and refines flow fields.

  • A reconstruction network module that refines frame synthesis results.

  • A training loss function that combines colour constancy with perceptual and adversarial losses.

3 Proposed Approach

The proposed method is based on a trainable CNN architecture that directly estimates an interpolated frame $\hat{I}_t$ from two input frames $I_{t-1}$ and $I_{t+1}$. This approach is similar to the one in [17]: given many examples of triplets of consecutive frames, we solve an optimisation task minimising a loss between the estimated frame $\hat{I}_t$ and the ground truth intermediate frame $I_t$. A high-level overview of the method is illustrated in Fig. 2, and its design and training are detailed in the following sections.

3.1 Network design

3.1.1 Multi-scale frame synthesis

Let us assume $\Delta$ represents the flow from time point $t$ to $t+1$ (and, assuming locally linear motion, $-\Delta$ the flow from $t$ to $t-1$), and for convenience, let us refer to the synthesis features as $\Gamma = \{\Delta, m\}$, where spatial weights $m \in [0, 1]$ can be used to handle occlusions and disocclusions. The synthesised interpolated frame is then given by

$$\hat{I}_t = m \odot I_{t-1}^{-\Delta} + (1 - m) \odot I_{t+1}^{\Delta}, \qquad (2)$$

with $\odot$ denoting the Hadamard product. This formulation is used in [17] to synthesise the final interpolated frame and is referred to as voxel flow. Although a multi-scale estimation of synthesis features is presented there to process input data at different scales, coarser flow levels are not leveraged for the estimation of finer flow results. In contrast, we propose to reuse a coarse flow estimation for further processing with residual modules, in the same spirit as in [9].

To estimate the synthesis features we build a pyramidal structure that progressively applies and estimates optical flow between the two frames at scales $s = 1, \ldots, S$, with $s = S$ the coarsest level. We refer to the synthesis features at scale $s$ as $\Gamma_s = \{\Delta_s, m_s\}$. If $U$ and $D$ denote bilinear up- and down-sampling operators, the coarsest flow features are obtained as

$$\Gamma_S = \tanh\big(f_S(D^S I_{t-1}, D^S I_{t+1})\big). \qquad (3)$$

The processing for flow refinement is shown in Fig. 3, and is formally given by

$$\tilde{\Gamma}_s = U \Gamma_{s+1}, \qquad (4)$$
$$\Gamma_s = \tanh\big(\tilde{\Gamma}_s + f_s(D^s I_{t-1}^{-\tilde{\Delta}_s}, D^s I_{t+1}^{\tilde{\Delta}_s}, \tilde{\Gamma}_s)\big), \qquad (5)$$

with the tanh non-linearity keeping the synthesis flow features within the range $[-1, 1]$. The coarse flow estimation and flow residual modules, $f_S$ and $f_s$, visualised in Fig. 2 and Fig. 3 respectively, are both based on the CNN architecture described in Table 1. Both modules produce output synthesis features within the range $[-1, 1]$, corresponding to flow features $\Delta_s$ and spatial weights $m_s$. For colour images, the coarse flow estimation module takes the two downsampled input frames as input, while the residual flow module additionally takes the upsampled coarse synthesis features.

Fixing the number of scales $S$, which we found to be a good compromise between efficiency and performance, the final features $\Gamma$ are obtained by bilinearly upsampling the features estimated at the finest scale, $\Gamma = U \Gamma_1$. Note that the intermediate synthesis features $\Gamma_s$ can be used to obtain intermediate synthesis results as

$$\hat{I}_t^s = m_s \odot (D^s I_{t-1})^{-\Delta_s} + (1 - m_s) \odot (D^s I_{t+1})^{\Delta_s}. \qquad (6)$$

In Section 3.2.1 we describe how intermediate synthesis results can be used in a multi-scale supervision module to facilitate network training.
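A compact sketch of this coarse-to-fine synthesis is given below. It is an illustration of Eqs. 3 to 6 under assumptions of our own: the layer widths, depths, number of scales, maximum displacement mapped to the tanh range and channel layout (two flow channels plus one occlusion weight) are placeholders, and the finest scale is refined at full resolution rather than upsampled as in Eq. 6. The `backward_warp` helper is the one from the sketch in Section 2.1.

```python
# Minimal sketch of the multi-scale synthesis of Section 3.1.1 (illustrative
# hyperparameters; not the paper's exact architecture). Requires backward_warp
# from the earlier sketch, and input sizes divisible by 2^(num_scales - 1).
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(in_ch, out_ch=3, width=32, depth=4):
    """Plain convolutional block with ReLU hidden layers and a linear output."""
    layers = [nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU(inplace=True)]
    for _ in range(depth - 2):
        layers += [nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True)]
    layers += [nn.Conv2d(width, out_ch, 3, padding=1)]  # tanh is applied by the caller
    return nn.Sequential(*layers)

class MultiScaleSynthesis(nn.Module):
    def __init__(self, num_scales=3, max_disp=64.0):
        super().__init__()
        self.num_scales = num_scales
        self.max_disp = max_disp            # pixels represented by the tanh range [-1, 1]
        self.coarse = conv_block(in_ch=6)   # two RGB frames -> coarse synthesis features
        self.residual = nn.ModuleList(
            [conv_block(in_ch=6 + 3) for _ in range(num_scales - 1)])  # warped frames + upsampled features

    def forward(self, prev, nxt):
        # Image pyramids (index 0 = finest scale).
        pyr_prev, pyr_next = [prev], [nxt]
        for _ in range(self.num_scales - 1):
            pyr_prev.append(F.avg_pool2d(pyr_prev[-1], 2))
            pyr_next.append(F.avg_pool2d(pyr_next[-1], 2))

        # Coarsest-scale synthesis features (cf. Eq. 3).
        feats = torch.tanh(self.coarse(torch.cat([pyr_prev[-1], pyr_next[-1]], dim=1)))
        for s in reversed(range(self.num_scales - 1)):
            up = F.interpolate(feats, scale_factor=2, mode="bilinear", align_corners=False)
            flow = up[:, :2] * self.max_disp
            warped = torch.cat([backward_warp(pyr_prev[s], -flow),
                                backward_warp(pyr_next[s], flow)], dim=1)
            feats = torch.tanh(up + self.residual[s](torch.cat([warped, up], dim=1)))  # cf. Eqs. 4-5

        flow = feats[:, :2] * self.max_disp
        weight = 0.5 * (feats[:, 2:] + 1.0)  # map occlusion weights from [-1, 1] to [0, 1]
        # Occlusion-weighted blend of the two warped input frames (cf. Eq. 2).
        return weight * backward_warp(prev, -flow) + (1.0 - weight) * backward_warp(nxt, flow)
```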

3.1.2 Synthesis refinement module

Frame synthesis can be challenging in cases of large and complex motion or occlusions, where flow estimation may be inaccurate. In these situations, artifacts usually produce an unnatural look for moving regions of the image that would benefit from further correction. We therefore introduce a synthesis refinement module, consisting of a CNN that allows further joint processing of the synthesised image with the original input frames that produced it:

$$\hat{I}_t^r = f_r(\hat{I}_t, I_{t-1}, I_{t+1}). \qquad (7)$$

This was shown in [9] to be beneficial in refining the brightness of a reconstruction result and in handling difficult occlusions. This module also uses the convolutional block in Table 1, with the identity function as the final non-linearity.
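A minimal sketch of such a refinement module is shown below, assuming a small convolutional stack over the concatenated synthesised frame and the two input frames; the width and depth are illustrative, not the paper's exact values.

```python
# Minimal sketch of the synthesis refinement module of Eq. 7 (illustrative
# layer widths; final non-linearity is the identity, as stated above).
import torch
import torch.nn as nn

class SynthesisRefinement(nn.Module):
    def __init__(self, width=32, depth=4):
        super().__init__()
        layers = [nn.Conv2d(9, width, 3, padding=1), nn.ReLU(inplace=True)]  # [synthesised, prev, next] RGB
        for _ in range(depth - 2):
            layers += [nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(width, 3, 3, padding=1)]  # linear (identity) output
        self.net = nn.Sequential(*layers)

    def forward(self, synthesised, prev, nxt):
        # Jointly process the synthesised frame with the original input frames.
        return self.net(torch.cat([synthesised, prev, nxt], dim=1))
```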

3.2 Network training

Given loss functions $\ell_i$ between the network output and the ground-truth frame, defined for an arbitrary number of loss components $i$, the optimisation problem is

$$\hat{\theta} = \arg\min_{\theta} \sum_i \ell_i\big(g_{\theta}(I_{t-1}, I_{t+1}), I_t\big). \qquad (8)$$

The output $g_{\theta}(I_{t-1}, I_{t+1})$ can either be $\hat{I}_t$ or $\hat{I}_t^r$, depending on whether the refinement module is used, and $\theta$ represents all trainable parameters in the interpolation network $g_{\theta}$.

3.2.1 Multi-scale synthesis supervision

The multi-scale frame synthesis described in Section 3.1.1 can be used to define a loss function at the finest synthesis scale with a synthesis loss

$$\ell_{\mathrm{syn}} = d\big(\hat{I}_t, I_t\big), \qquad (9)$$

with $d(\cdot, \cdot)$ a distance metric. However, an optimisation task based solely on this cost function suffers from the fact that it leads to an ill-posed solution; that is, for one particular final flow map there are multiple possible decompositions across scales. It is therefore likely that the solution space contains many local minima, making it challenging to solve via gradient descent. The network can for instance easily get stuck in degenerate solutions where a case with no motion, $\Delta = 0$, is represented by a coarse flow that is exactly cancelled by the residual, in which case there are infinitely many solutions for the flow fields at each scale.

In order to prevent this, we supervise the solution at all scales, such that flow estimation is required to be accurate at every scale. In practice, we define the following multi-scale synthesis loss function

$$\ell_{\mathrm{ms}} = \sum_{s} \lambda_s\, d\big(\hat{I}_t^s, D^s I_t\big). \qquad (10)$$

We heuristically choose the weights $\lambda_s$ of this multi-scale loss to prioritise the synthesis at the finest scale.
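The multi-scale supervision of Eq. 10 can be sketched as below; the per-scale weights and the default $\ell_1$ distance are placeholders standing in for the values chosen in the paper.

```python
# Minimal sketch of the multi-scale synthesis loss (Eq. 10): each scale's
# synthesised frame is compared against a correspondingly downsampled ground
# truth. Weights are illustrative, decaying towards coarser scales.
import torch.nn.functional as F

def multiscale_loss(synth_per_scale, target, weights=(1.0, 0.5, 0.25), dist=None):
    """synth_per_scale: list of synthesised frames, finest scale first."""
    dist = dist or (lambda a, b: (a - b).abs().mean())   # default: l1 colour constancy
    total = 0.0
    for w, synth in zip(weights, synth_per_scale):
        gt = F.interpolate(target, size=synth.shape[-2:], mode="bilinear", align_corners=False)
        total = total + w * dist(synth, gt)
    return total
```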

Figure 3: Flow refinement module. A coarse flow estimation warps the input frames, which are then passed together with the coarse synthesis flow features to a flow residual module. The fine flow estimation is the sum of the residual flow and the coarse flow features. A tanh non-linearity is used to clip the result to within $[-1, 1]$.

Additionally, the network using the synthesis refinement module adds a term to the cost function, expressed as

$$\ell_{r} = d\big(\hat{I}_t^r, I_t\big), \qquad (11)$$

and the total loss function over all scales is

$$\ell = \ell_{\mathrm{ms}} + \ell_{r}. \qquad (12)$$

We propose to combine traditional pixel-wise distance metrics with higher order metrics given by a deep network, which have been shown to correlate better with human perception. As a pixel-wise metric we choose the $\ell_1$-norm, which has been shown to produce sharper interpolation results than MSE [27], and we employ the 5_4 features from the VGG network [26] as a perceptual loss, as proposed in [14, 16]. Denoting by $\phi(\cdot)$ the transformation of an image into VGG feature space, the distance metric is therefore given by

$$d(x, y) = \lVert x - y \rVert_1 + \lambda \lVert \phi(x) - \phi(y) \rVert_2^2. \qquad (13)$$

Throughout this work we fix $\lambda$ to a constant value. We will analyse the impact of this term by also looking at results when it is not included in training (i.e. $\lambda = 0$).
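A sketch of this combined distance metric is shown below, using torchvision's pretrained VGG19 as the feature extractor $\phi$. The layer index taken to correspond to the 5_4 features, the squared-error feature term and the weighting are assumptions; ImageNet input normalisation for VGG is also omitted for brevity.

```python
# Minimal sketch of the distance metric in Eq. 13: l1 colour constancy plus a
# VGG feature term with weight lambda (here `vgg_weight`, an assumed value).
import torch
import torch.nn as nn
from torchvision.models import vgg19

class PerceptualDistance(nn.Module):
    def __init__(self, vgg_weight=1e-3):
        super().__init__()
        # Slice up to index 34 (conv5_4 in torchvision's VGG19 feature stack).
        features = vgg19(pretrained=True).features[:35].eval()
        for p in features.parameters():
            p.requires_grad = False                    # the loss network stays fixed
        self.vgg = features
        self.vgg_weight = vgg_weight

    def forward(self, pred, target):
        pixel = (pred - target).abs().mean()                             # l1 term
        perceptual = ((self.vgg(pred) - self.vgg(target)) ** 2).mean()   # VGG feature term
        return pixel + self.vgg_weight * perceptual
```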

3.2.2 Generative adversarial training

In the loss functions described above, there is no mechanism to avoid solutions that may not be visually pleasing. A successful approach to force the solution manifold to correspond to images with a realistic appearance has been GAN training. We can incorporate such a loss term into the objective function of Eq. 12 as follows:

$$\ell = \ell_{\mathrm{ms}} + \ell_{r} + \lambda_{\mathrm{adv}}\, \ell_{\mathrm{adv}}. \qquad (14)$$

Treating the interpolation network as the generator network $g_{\theta}$, the GAN term optimises the loss functions

$$\ell_{\mathrm{adv}} = -\log D\big(g_{\theta}(I_{t-1}, I_{t+1})\big), \qquad (15)$$
$$\ell_{D} = -\log D(I_t) - \log\big(1 - D(g_{\theta}(I_{t-1}, I_{t+1}))\big), \qquad (16)$$

with $D$ representing a discriminator network that tries to discriminate original frames from interpolated results. The weighting parameter $\lambda_{\mathrm{adv}}$ was chosen heuristically in order to avoid the GAN loss overwhelming the total loss.

Adding this objective to the loss forces the generator to attempt to fool the discriminator. It has been shown that in practice this leads to image reconstructions that incorporate visual properties of photo-realistic images, such as improved sharpness and textures [16, 21]. The discriminator architecture is based on the one described in figure 4 of [16], with minor modifications. We start with 32 filters and follow with 8 blocks of convolution, batch normalisation and leaky ReLU with alternating strides of 2 and 1. At each block of stride 2 the number of features is doubled, which we found to improve the performance of the discriminator.
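The adversarial part of the objective (Eqs. 14 to 16) and the discriminator layout described above can be sketched as follows. The head of the discriminator (pooling followed by a 1×1 convolution) and the exact weighting of the adversarial term are our own simplifications of figure 4 of [16].

```python
# Minimal sketch of the discriminator (32 initial filters, 8 conv + batch norm +
# leaky ReLU blocks with alternating strides, features doubled at each stride-2
# block) and of the adversarial losses of Eqs. 15-16.
import torch
import torch.nn as nn

def make_discriminator(in_ch=3, base=32, num_blocks=8):
    layers, ch = [nn.Conv2d(in_ch, base, 3, stride=1, padding=1), nn.LeakyReLU(0.2)], base
    for i in range(num_blocks):
        stride = 2 if i % 2 == 0 else 1               # alternating strides of 2 and 1
        out_ch = ch * 2 if stride == 2 else ch        # double features at stride-2 blocks
        layers += [nn.Conv2d(ch, out_ch, 3, stride=stride, padding=1),
                   nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.2)]
        ch = out_ch
    layers += [nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, 1, 1), nn.Flatten(), nn.Sigmoid()]
    return nn.Sequential(*layers)

def gan_losses(discriminator, real, fake, eps=1e-8):
    """Discriminator loss (Eq. 16) and non-saturating generator loss (Eq. 15)."""
    d_real, d_fake = discriminator(real), discriminator(fake.detach())
    d_loss = -(torch.log(d_real + eps) + torch.log(1.0 - d_fake + eps)).mean()
    g_loss = -torch.log(discriminator(fake) + eps).mean()
    return d_loss, g_loss  # g_loss is added to the total loss with a small weight (Eq. 14)
```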

Layer | Convolution kernel | Non-linearity
… | … | ReLU
Table 1: Convolutional network block used for the coarse flow, flow residual and reconstruction refinement modules. Convolution kernels correspond to the number of input and output features, followed by the kernel size dimensions.

4 Experiments

We first evaluate the performance of a baseline version of the proposed model that performs single-scale frame synthesis without synthesis refinement and is trained using a simple $\ell_1$-norm colour constancy loss. We then gradually incorporate into this baseline network the design and training choices proposed in Section 3.1 and Section 3.2 respectively, and evaluate their benefits visually and quantitatively. As a performance metric for reconstruction accuracy we use PSNR; however, we note that this metric is known not to correlate well with human perception [16].

4.1 Data and implementation details

The dataset used is a collection of 60 fps videos from YouTube-8m [2], resized to a common resolution. Training samples are obtained by extracting one triplet of consecutive frames every second, discarding samples for which two consecutive frames were found to be almost identical under a small squared-error threshold. Unless otherwise stated, all models used fixed training, validation and testing sets of frame triplets.

All network layers from the convolutional building blocks based on Table 1 are orthogonally initialised, except for the final layer, which is initialised from a normal distribution with a small standard deviation. This forces the network to initialise flow estimation close to zero, which leads to more stable training. Training was performed on batches of frame crops to diversify the content of batches and increase convergence speed. Furthermore, we used Adam optimisation [15] and applied early stopping with a patience of 10 epochs. All models converge after roughly the same number of epochs with this setup.
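The initialisation and optimisation described above can be sketched as follows; since the exact gains, learning rate, batch and crop sizes are not reproduced here, the values in the sketch are placeholders.

```python
# Minimal sketch: orthogonal initialisation with a near-zero final layer (so the
# initial flow estimate is close to zero), Adam optimisation and early stopping
# with a patience of 10 epochs. Hyperparameter values are illustrative.
import torch
import torch.nn as nn

def init_block(block: nn.Sequential):
    convs = [m for m in block if isinstance(m, nn.Conv2d)]
    for conv in convs[:-1]:
        nn.init.orthogonal_(conv.weight)            # orthogonal init for hidden layers
        nn.init.zeros_(conv.bias)
    nn.init.normal_(convs[-1].weight, std=1e-3)     # final layer close to zero
    nn.init.zeros_(convs[-1].bias)

def train(model, loss_fn, train_loader, val_loader, patience=10, max_epochs=200):
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    best, wait = float("inf"), 0
    for epoch in range(max_epochs):
        model.train()
        for prev, target, nxt in train_loader:      # triplets of consecutive frame crops
            opt.zero_grad()
            loss_fn(model(prev, nxt), target).backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(p, n), t).item() for p, t, n in val_loader) / len(val_loader)
        if val < best:
            best, wait = val, 0                     # validation loss improved
        else:
            wait += 1
            if wait >= patience:                    # early stopping
                break
```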

Figure 4: Impact of network design on two visual examples. Top: two full size original images with highlighted crops. Bottom: (a, c) ground truth, (b, d) baseline CNN, (e, g) multi-scale synthesis, (f, h) multi-scale synthesis with refinement.

4.2 Complexity analysis

To remain framework and hardware agnostic, we report the computational complexity of CNNs in floating point operations (FLOPs) necessary for the processing of one frame, and in the number of trainable parameters. The bottleneck of the computation lies in the convolutional operations used to estimate a flow field and refine the interpolation; we therefore ignore the operations necessary for intermediate warping stages. Using $H$ and $W$ to denote height and width, $n_l$ the number of features in layer $l$, and $k$ the kernel size, the number of FLOPs per convolution is approximated as

$$\mathrm{FLOPs} \approx 2\, H\, W\, k^2\, n_{l-1}\, n_l. \qquad (17)$$
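For reference, Eq. 17 can be evaluated with a few lines of code; the helper below is an illustration that counts a multiply and an add as two operations and ignores warping and element-wise stages, as stated above.

```python
# Minimal sketch of the per-convolution FLOP estimate of Eq. 17.
def conv_flops(height, width, in_features, out_features, kernel_size):
    """Approximate FLOPs of one convolution layer (stride 1, 'same' padding)."""
    return 2 * height * width * in_features * out_features * kernel_size ** 2

def network_flops(height, width, feature_sizes, kernel_size=3):
    """Sum over consecutive layer pairs, e.g. feature_sizes = [6, 32, 32, 3]."""
    return sum(conv_flops(height, width, n_in, n_out, kernel_size)
               for n_in, n_out in zip(feature_sizes[:-1], feature_sizes[1:]))
```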

4.3 Network design experiments

In the first set of experiments we evaluate the benefits of exploiting an implicit estimation of optical flow as well as a synthesis refinement module.

4.3.1 Implicit optical flow estimation

CNNs can spatially transform information through convolutions without the need for spatial transformer pixel regridding. However, computing an internal representation of optical flow that is used by a transformer is a more efficient alternative to achieve this goal. In Table 2 we compare results for our baseline architecture using multi-scale synthesis (MS), relative to a simple CNN that attempts to directly estimate frame $I_t$ from inputs $I_{t-1}$ and $I_{t+1}$. Both models are trained with the $\ell_1$-norm colour constancy loss (i.e. $\lambda = 0$ in Eq. 13). In order to replicate hyperparameters, all layers in the baseline CNN model are convolutional layers followed by ReLU activations, except for the last layer, which uses a linear activation.

The depth of the baseline model is chosen so that it has approximately the same number of trainable parameters as the proposed spatial transformer method. Note that the multi-scale design obtains an estimation with fewer FLOPs. The baseline CNN also produces a lower PSNR than multi-scale synthesis on the test set. The visualisations in Fig. 4 show that the baseline CNN struggles to produce a satisfactory interpolation (b, d), and tends to produce an average of the previous and next frames. The proposed multi-scale synthesis method results in more accurate approximations (e, g).

Method | PSNR (dB) | Parameters | FLOPs
Baseline CNN | … | … k | … G
MS (multi-scale synthesis) | … | … k | … G
MS + synthesis refinement | … | … k | … G
Table 2: Impact of network design on performance.
Figure 5: Impact of network training on two visual examples. Top: two full size original images with highlighted crops. Bottom: (a, c) ground truth, (b, d) MS, (e, g) MS+VGG, (f, h) MS+VGG+GAN (FIGAN).

4.3.2 Synthesis refinement

Frames directly synthesised from flow estimation can exhibit spatial distortions leading to a visually unsatisfying result. This limitation can be substantially alleviated with the refinement module described in Section 3.1.2. In Table 2 we also include results for a multi-scale synthesis model that additionally uses a synthesis refinement module. This addition increases the number of trainable parameters and FLOPs, but achieves a higher PSNR and is able to correct inaccuracies in the estimation from the simpler multi-scale synthesis model, as shown in Fig. 4 (f, h).

4.4 Network training experiments

In this section we analyse the impact on interpolation results brought by multi-scale synthesis supervision and by the use of a perceptual loss term and gan training for an improved visual quality.

4.4.1 Multi-scale synthesis supervision

As described previously, the performance of the synthesis models presented in Table 2 is limited by the fact that flow estimation at multiple scales is ill-posed. We retrained the model with synthesis refinement, which showed the best performance among the design choices proposed, but modified the objective function to supervise frame interpolation at all scales as proposed in Section 3.2.1. This model, which we refer to as MS for brevity, increases PSNR on the test set compared to its single-scale-supervised counterpart, as shown in Table 3, when trained on the same training set.

4.4.2 Impact of training data

Unsupervised motion learning is challenging due to the large space of possible video motion. In order to learn a generalisable representation of motion, it is important to have a diverse training set with enough representative examples of expected flow directions and amplitudes. We evaluated the same model MS when trained on different training set sizes, both reducing and increasing the number of training triplets. Although increasing the training set size inevitably increases training time, it also has a considerable impact on PSNR, as shown in Table 3. The remaining experiments use the same training set size as before as a compromise between performance and ease of experimentation.

4.4.3 Perceptual loss and GAN training

Extending the objective loss function with more abstract components, such as the VGG term in Eq. 13 and the GAN training strategy of Eq. 14, also has an impact on results. In Table 3 we include results for a network MS+VGG, trained with the combination of $\ell_1$-norm and VGG terms suggested in Eq. 13. We also show results for MS+VGG+GAN, a network that additionally uses adversarial training. The PSNR results on the full test set show that both of these modifications lower performance relative to the simpler colour constancy training loss. However, a visual inspection of the results in Fig. 5 demonstrates how these changes help obtain sharper, more pleasing interpolations. This is in line with the findings of [16, 20].

4.5 State-of-the-art comparison

Method | Training set size | PSNR (dB) | FLOPs (G)
Farneback [8] | – | … | –
Deep Flow 2 [28] | – | … | –
PCA-layers [30] | – | … | –
Phase-based [18] | – | … | –
SepConv (colour constancy loss) [20] | – | … | …
SepConv (perceptual loss) [20] | – | … | …
MS | … k | … | …
MS | … k | … | …
MS | … k | … | …
MS+VGG | … k | … | …
MS+VGG+GAN (FIGAN) | … k | … | …
Table 3: State-of-the-art interpolation comparison.
Figure 6: Visualisation of the state-of-the-art comparison. From left to right: full size original image with highlighted crop, Farneback's method [8], PCA-layers [30], SepConv [20], proposed FIGAN, and ground truth.
Figure 7: SepConv and FIGAN interpolation for conflicting overlaid motions (left to right: original, SepConv, FIGAN). FIGAN favours an accurate reconstruction of the foreground, while SepConv approximates the reconstruction of the background at the expense of distorting the foreground structure.

In this section, several frame interpolation methods are compared to the proposed algorithm. These methods are listed in Table 3, where we also include PSNR results on the full test set. The interpolation for the flow-based methods [8, 28, 30] was done as described in [4], using the optical flow features generated by the respective authors' implementations (KITTI-tuned parameters were used for PCA-layers [30]). The phase-based approach in [18] and SepConv [20] are both able to directly generate an interpolated frame. We include results from SepConv using both a colour constancy loss and a perceptual loss.

As shown in Table 3, the best performing method in terms of PSNR is MS when trained on the largest training set; however, we found the best visual quality to be produced by FIGAN and the perceptually trained SepConv variant, both trained using perceptual losses. Visual examples from selected methods are provided in Fig. 6. Notice that some optical flow based methods, such as Farneback and PCA-layers, are unable to merge information from consecutive frames correctly, which can be attributed to inaccurate flow estimation. In contrast, FIGAN shows more precise reconstructions and, most importantly, preserves the sharpness and features that make interpolation results perceptually more pleasing.

We found SepConv and FIGAN to have visually comparable results in situations with easily resolvable motion like those in Fig. 6. Their largest discrepancies in behaviour were found in challenging situations, such as static objects overlaid on top of a fast moving scene, as shown in Fig. 7. Whereas SepConv favours resolving large displacements in the background, FIGAN produces a better reconstruction of the foreground object at the expense of reducing accuracy in the background. This could be due to the fact that SepConv looks for motion at a much more heavily undersampled scale than FIGAN. Coarse-to-fine flow estimation approaches can fail when coarse scales dominate the motion of finer scales [31], and this is likely to be more pronounced the larger the gap between the coarse and fine scales.

A relevant difference between SepConv and FIGAN lies in complexity. SepConv contains many times more trainable parameters than FIGAN, and each frame interpolation requires correspondingly more FLOPs. Given comparable visual quality and PSNR figures for a small fraction of the trainable parameters, this highlights the efficiency advantages of FIGAN, which was designed under real-time constraints.

5 Conclusion

In this paper, we have described a multi-scale network based on recent advances in spatial transformers and composite perceptual losses. Our proposed architecture sets a new state of the art in terms of PSNR, and produces visual quality comparable to the best performing neural network solution with fewer computations. Our experiments confirm that a network design drawing from traditional pyramidal flow refinement reduces complexity while maintaining competitive performance. Furthermore, training losses that go beyond classical pixel-wise metrics, together with adversarial training, provide an abstract representation that translates into sharper and visually more pleasing interpolation results.

References

  • [1] RE: Vision Effects, Twixtor, accessed in Feb 2016 at http://revisionfx.com/products/twixtor/.
  • [2] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan. Youtube-8m: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.
  • [3] A. Ahmadi and I. Patras. Unsupervised convolutional neural networks for motion estimation. In IEEE International Conference on Image Processing (ICIP), pages 1629–1633, 2016.
  • [4] S. Baker, D. Scharstein, J. P. Lewis, S. Roth, M. J. Black, and R. Szeliski. A database and evaluation methodology for optical flow. International Journal of Computer Vision, 92(1):1–31, 2011.
  • [5] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estimation based on a theory for warping. In European Conference on Computer Vision (ECCV), pages 25–36. Springer, 2004.
  • [6] T. Brox and J. Malik. Large displacement optical flow: descriptor matching in variational motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(3):500–513, 2011.
  • [7] A. Dosovitskiy, P. Fischery, E. Ilg, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. FlowNet: Learning optical flow with convolutional networks. In IEEE International Conference on Computer Vision (ICCV), pages 2758–2766, 2015.
  • [8] G. Farnebäck. Two-frame motion estimation based on polynomial expansion. Image analysis, pages 363–370, 2003.
  • [9] Y. Ganin, D. Kononenko, D. Sungatullina, and V. Lempitsky. Deepwarp: Photorealistic image resynthesis for gaze manipulation. In European Conference on Computer Vision (ECCV), pages 311–326. Springer, 2016.
  • [10] B. K. Horn and B. G. Schunck. Determining optical flow. Artificial intelligence, 17(1-3):185–203, 1980.
  • [11] Y. Hu, R. Song, and Y. Li. Efficient Coarse-to-Fine PatchMatch for Large Displacement Optical Flow. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5704–5712, 2016.
  • [12] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [13] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems (NIPS), pages 2017–2025, 2015.
  • [14] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. European Conference on Computer Vision (ECCV), pages 694–711, 2016.
  • [15] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  • [16] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [17] Z. Liu, R. Yeh, X. Tang, Y. Liu, and A. Agarwala. Video frame synthesis using deep voxel flow. In IEEE International Conference on Computer Vision (ICCV), 2017.
  • [18] S. Meyer, O. Wang, H. Zimmer, M. Grosse, and A. Sorkine-Hornung. Phase-based frame interpolation for video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1410–1418, 2015.
  • [19] S. Niklaus, L. Mai, and F. Liu. Video frame interpolation via adaptive convolution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [20] S. Niklaus, L. Mai, and F. Liu. Video frame interpolation via adaptive separable convolution. In IEEE International Conference on Computer Vision (ICCV), 2017.
  • [21] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In International Conference On Learning Representations (ICLR), 2016.
  • [22] L. L. Rakêt, L. Roholm, A. Bruhn, and J. Weickert. Motion compensated frame interpolation with a symmetric optical flow constraint. In International Symposium on Visual Computing, pages 447–457. Springer, 2012.
  • [23] A. Ranjan and M. J. Black. Optical flow estimation using a spatial pyramid network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [24] Z. Ren, J. Yan, B. Ni, B. Liu, X. Yang, and H. Zha. Unsupervised Deep Learning for Optical Flow Estimation. In AAAI Conference on Artificial Intelligence, pages 1495–1501, 2016.
  • [25] S. Sekiguchi, Y. Idehara, K. Sugimoto, and K. Asai. A low-cost video frame-rate up conversion using compressed-domain information. In IEEE International Conference on Image Processing (ICIP), volume 2, pages II–974, 2005.
  • [26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR), 2015.
  • [27] D. Sun, S. Roth, and M. J. Black. Secrets of optical flow estimation and their principles. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2432–2439. IEEE, 2010.
  • [28] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid. DeepFlow: Large displacement optical flow with deep matching. In IEEE International Conference on Computer Vision (ICCV), 2013.
  • [29] M. Werlberger, T. Pock, M. Unger, and H. Bischof. Optical flow guided TV-L1 video interpolation and restoration. In EMMCVPR, pages 273–286. Springer, 2011.
  • [30] J. Wulff and M. J. Black. Efficient sparse-to-dense optical flow estimation using a learned basis and layers. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 120–130, 2015.
  • [31] L. Xu, J. Jia, and Y. Matsushita. Motion detail preserving optical flow estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9):1744–1757, 2012.
  • [32] J. J. Yu, A. W. Harley, and K. G. Derpanis. Back to Basics : Unsupervised Learning of Optical Flow via Brightness Constancy and Motion Smoothness. European Conference on Computer Vision (ECCV), pages 3–10, 2016.