Deep Frame Interpolation

by Vladislav Samsonov, et al.

This work presents a supervised-learning approach to the computer vision problem of frame interpolation. The technique could also be used in cartoon animation, since drawing each individual frame takes a considerable amount of time. Most existing solutions to this problem use unsupervised methods and focus only on real-life videos that already have a high frame rate. However, experiments show that such methods do not work as well when the frame rate is low and object displacements between frames become large. This is because interpolating large-displacement motion requires knowledge of the motion structure, so simple techniques such as frame averaging start to fail. In this work, a deep convolutional neural network is used to solve the frame interpolation problem. In addition, it is shown that incorporating prior information such as optical flow significantly improves the interpolation quality.



Frame interpolation is one of the most challenging tasks in computer vision. Its goal is to increase the number of frames in a video sequence to make it more visually appealing. Numerous approaches have been proposed to solve this problem [1] [2]. However, these approaches are unsupervised and do not exploit the particular structure of the video. This work presents a deep convolutional neural network for frame interpolation. Deep convolutional neural networks are known to be among the best machine learning methods for extracting semantic information. Frames with small displacements typically do not require high-level semantic information. But when the motion becomes complex and object displacements between consecutive frames become large, this high-level information becomes crucial for restoring the middle frame effectively.

Deep convolutional networks have been applied successfully to similar tasks before. G. Long et al. [3] proposed a deep convolutional method that performs frame interpolation in order to obtain the optical flow. P. Fischer et al. [4] trained a deep convolutional network to compute optical flow directly. In contrast to these methods, the method proposed in this paper solves the problem of frame interpolation rather than optical flow estimation, so the visual quality of the middle frame is of high importance. B. Yahia [5] used a convolutional network to generate the middle frame, but the result was blurry and visually unpleasant. This paper shows how to overcome this problem by moving from the MSE loss objective to adversarial training. It also shows how to improve quality further by incorporating an optical flow prior.

Network architecture

Figure 1: Network architecture used for frame interpolation

A Y-style neural network with separate inputs for the first and the second frame is used. The weights of the first and second branches are shared to enforce a symmetric output. As a side effect, this sharing also effectively reduces dimensionality and helps prevent overfitting. The two branches are then merged by an elementwise sum and upsampled using transposed convolutions (sometimes also called deconvolutions). Following [6], residual connections are added between corresponding downsampling and upsampling layers, although they are not shown in the illustration.
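The symmetry argument can be illustrated with a toy sketch. It is not the network used in this work: the convolutional branches are replaced by a single shared dense layer, and the names W_enc, W_dec, encode and interpolate are hypothetical. It only shows that shared weights followed by an elementwise sum make the output independent of the order of the two input frames.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-ins for the real layers (assumption: dense instead of convolutional).
W_enc = rng.standard_normal((8, 16))   # shared by both input branches
W_dec = rng.standard_normal((16, 8))   # stands in for the upsampling path

def encode(frame_vec):
    """Apply the shared branch (linear map followed by ReLU) to one flattened frame."""
    return np.maximum(W_enc @ frame_vec, 0.0)

def interpolate(first, second):
    """Merge both branches by elementwise sum, then decode."""
    merged = encode(first) + encode(second)
    return W_dec @ merged
```

Because the sum is commutative and both branches use the same weights, interpolate(a, b) and interpolate(b, a) return the same result.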


Two network models are trained: one with a simple mean squared error (MSE) loss and one with the adversarial approach suggested by I. Goodfellow et al. [7]. It is shown that the adversarial approach helps to overcome the unnatural blurriness of the output image.
For the dataset, the open-source movie Sintel was used. It is also one of the standard datasets for optical flow evaluation. Since the existing MPI Sintel dataset is tailored mainly to optical flow and is too small to train a deep convolutional network, we split the full movie into a sequence of 21312 frames. We then took every triplet of consecutive frames as a training sample. Each triplet consists of the first frame, the ground-truth middle frame, and the second frame.
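The triplet construction can be sketched as follows. The sketch assumes a sliding window with a stride of one frame, which the text does not state explicitly, and the function name consecutive_triplets is hypothetical.

```python
def consecutive_triplets(num_frames):
    """Return (first, middle, second) frame indices for every consecutive triplet.

    Assumption: the window slides by one frame, so neighbouring triplets overlap.
    """
    return [(i, i + 1, i + 2) for i in range(num_frames - 2)]

# Under this assumption, the 21312 frames of the movie yield 21310 triplets.
triplets = consecutive_triplets(21312)
```

The network receives frames i and i + 2 as inputs and frame i + 1 as the training target.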

MSE loss objective

First, the network was trained with a simple MSE loss, as in [3] [5].

Figure 2: From left to right, from top to bottom: first frame, ground truth frame, second frame, average of the first and second, output of the network with the MSE loss, output of the adversarial network

As can be seen from the example output, the MSE loss leads to unnatural blurriness in the output image.

Adversarial approach

Several methods have been proposed for constructing a loss function that is close to human perception, such as [8]. We use the adversarial training approach proposed by I. Goodfellow et al. [7]. The adversarial approach jointly trains two networks: a generator network and a discriminator network. The generator produces the middle frame, while the discriminator outputs the probability that this frame was generated by the first network rather than drawn from the original distribution. The discriminator thus serves as the loss function for the generator: the generator tries to minimize this probability, while the discriminator tries to distinguish generated frames from original frames. A LeNet-like architecture with 16 layers was used for the discriminator classifier network.

To increase convergence speed, the MSE loss was also used during the initial training steps. In the combined objective, the adversarial term is given by the discriminator network, and the weight of the MSE term decreases with an exponential decay.
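A combined generator objective of this kind could look like the sketch below. The exact formula, the initial weight and the decay rate are not given in the text, so the functional form and all constants here are assumptions, and the names mse_weight and generator_loss are hypothetical.

```python
import math

def mse_weight(step, initial_weight=1.0, decay_rate=1e-3):
    """Hypothetical schedule: the MSE weight decays exponentially with the training step."""
    return initial_weight * math.exp(-decay_rate * step)

def generator_loss(p_generated, mse, step):
    """Adversarial term plus a decaying MSE term (assumed form).

    p_generated is the discriminator's probability that the frame is generated;
    the generator is rewarded for driving it towards zero.
    """
    eps = 1e-12  # avoids log(0) when the discriminator is fully confident
    adversarial = -math.log(1.0 - p_generated + eps)
    return adversarial + mse_weight(step) * mse
```

Early in training the MSE term dominates and pulls the output towards the ground-truth frame; as the weight decays, the discriminator increasingly determines the loss.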

The result of the adversarial approach is more visually appealing than that of the previous method, yet it scores slightly worse in terms of MSE and PSNR.

Incorporating optical flow prior

Computation of optical flow has been studied extensively in computer vision. Various methods have been proposed for this task; we chose DeepFlow for its state-of-the-art performance and its available open-source implementation.

Formally, optical flow is a vector field in which each vector gives the relative displacement of a point.

To incorporate the optical flow prior into the existing architecture, we introduce a new layer type called the Displacement Convolutional Layer (DCL). It is a generalization of the regular convolutional layer that additionally takes the x and y components of the optical flow vector at each position. If both components are set to zero, the regular convolutional layer is obtained as a special case of the DCL.

Figure 3: Illustration of the Displacement Layer
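One plausible reading of the DCL is sketched below. It assumes that each output position samples the input at kernel offsets shifted by the rounded flow vector at that position, with zero padding outside the image. This sampling rule, the (dx, dy) flow layout and the name displacement_conv2d are assumptions and are not taken from the layer's definition.

```python
import numpy as np

def displacement_conv2d(image, kernel, flow):
    """Single-channel sketch of a displacement convolution (assumed form).

    image: (H, W) array; kernel: (kh, kw) array with odd sizes;
    flow: (H, W, 2) array holding (dx, dy) per output position.
    """
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h, w), dtype=np.float64)
    for y in range(h):
        for x in range(w):
            # Round the flow to whole pixels to keep the sketch simple.
            dx, dy = np.rint(flow[y, x]).astype(int)
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    yy = y + i - kh // 2 + dy
                    xx = x + j - kw // 2 + dx
                    if 0 <= yy < h and 0 <= xx < w:  # zero padding elsewhere
                        acc += kernel[i, j] * image[yy, xx]
            out[y, x] = acc
    return out
```

With an all-zero flow field, the loops reduce to an ordinary zero-padded convolution (in the cross-correlation convention common in deep learning), which matches the special case described above.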

Finally, the only thing needed to incorporate the optical flow prior is to replace the first convolutional layer with the Displacement Layer. Our final network with the optical flow prior is trained using the adversarial approach, as before.

Figure 4: From left to right: simple warping from the optical flow, output of the network with the optical flow prior

Given a perfect optical flow, the middle frame could be reconstructed by simple warping. The presented approach is therefore also compared with simple warping by the optical flow field, both to validate the necessity of the neural network and to demonstrate that the network learns to compensate for inaccuracies in the optical flow.
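A minimal sketch of such a warping baseline is given below. The warping scheme actually used for the comparison is not specified, so this version makes several assumptions: the flow maps the first frame to the second and is stored as (dx, dy) per pixel, the flow is locally smooth enough that a backward lookup with half the flow approximates the middle frame, and nearest-neighbour sampling with border clamping is acceptable. The name warp_to_middle is hypothetical.

```python
import numpy as np

def warp_to_middle(frame, flow):
    """Approximate the middle frame by moving every pixel half-way along its flow.

    frame: (H, W) or (H, W, C) array; flow: (H, W, 2) array of (dx, dy).
    """
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Backward lookup: the middle-frame pixel at q is read from q - flow(q) / 2.
    src_x = np.clip(np.rint(xs - 0.5 * flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys - 0.5 * flow[..., 1]).astype(int), 0, h - 1)
    return frame[src_y, src_x]
```

With a uniform flow of two pixels to the right, a point at column 1 of the first frame appears at column 2 of the warped frame, half-way to its position in the second frame.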

Note that in this case the optical flow is needed for training along with the video frames. Calculating an accurate optical flow for the frame sequence might be a problem, but it is possible to remove this requirement. To achieve this, we introduce another neural network that is trained to predict the vector field of the DCL. Both networks are trained jointly as a whole, with the same loss function as before. This network learns to predict the optical flow implicitly from the frame sequence, which means that the frame sequence alone is enough for training: there is no need to calculate the optical flow beforehand.

Method                       MSE      PSNR (dB)
Average frame                0.0079   21.0        0.836
NN with MSE loss             0.0050   23.0        0.614
Adversarial NN               0.0053   22.8        0.721
Simple warping               0.0052   22.8        0.907
NN with optical flow prior   0.0023   26.4        0.945
Table 1: Comparison of all methods
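The first two numeric columns of Table 1 are consistent with the standard relation PSNR = 10 log10(MAX^2 / MSE) for pixel intensities scaled to [0, 1]. That scaling is an assumption, since it is not stated in the text. The helper names mse and psnr_from_mse in this sketch are hypothetical.

```python
import math
import numpy as np

def mse(pred, target):
    """Mean squared error between two images of equal shape."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return float(np.mean((pred - target) ** 2))

def psnr_from_mse(mse_value, max_value=1.0):
    """Peak signal-to-noise ratio in dB (assumes intensities in [0, max_value])."""
    return 10.0 * math.log10(max_value ** 2 / mse_value)

# Reported MSE values in the order of the table rows.
reported_mse = [0.0079, 0.0050, 0.0053, 0.0052, 0.0023]
recomputed_psnr = [round(psnr_from_mse(m), 1) for m in reported_mse]
```

Rounded to one decimal place, the recomputed values are 21.0, 23.0, 22.8, 22.8 and 26.4, which agree with the second numeric column of the table.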