Video frame interpolation aims to synthesize a new frame between two consecutive frames and is widely used in video processing tasks such as video encoding, frame rate conversion, and slow-motion video generation [Paliwal and Kalantari2020]. However, the complexity of video content, including irregular object shapes, diverse motion patterns, and occlusion, poses a major challenge to the authenticity of the interpolated frames. With the extensive development of deep learning in computer vision, a growing number of studies use deep learning methods to interpolate video frames.
The flow-based method is a common approach in this area. It warps the input reference frames according to the optical flow between the interpolated frame and the left and right reference frames, so as to predict the intermediate frame. Studies such as [Ranjan and Black2017] and [Ilg et al.2017] have developed several methods based on end-to-end flow estimation models for frame interpolation. However, since the interpolated frame does not actually exist, the model cannot accurately predict the optical flow between the interpolated frame and the reference frames. Therefore, low-quality images are generated when the estimated optical flow is inaccurate.
The kernel-based method was proposed by [Niklaus et al.2017a], which regards pixel interpolation as a convolution over corresponding image patches in the two reference frames. Compared with the flow-based method, the kernel-based method is more flexible. It uses a deep convolutional neural network to estimate spatially adaptive convolution kernels, unifying motion estimation and pixel synthesis into a single process that can better deal with challenging frame interpolation scenes. However, the kernel size limits the accuracy of the predicted frame. Specifically, if the kernel size is too small, the model cannot handle large motions, but if the kernel size is large, the model requires a great amount of memory. Although SepConv [Niklaus et al.2017b] reduces memory consumption, it still cannot handle motion larger than the pre-defined kernel size.
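To make the kernel-size limitation concrete, the following minimal sketch synthesizes one output pixel in the kernel-based style: a per-pixel kernel is applied to co-located patches of the two reference frames. The function names and the toy patches are illustrative assumptions, not the implementation of [Niklaus et al.2017a]; in a real model the kernels would be predicted by a network.

```python
def adaptive_conv_pixel(patch_prev, patch_next, kernel_prev, kernel_next):
    """Synthesize one pixel as a weighted sum over two co-located n x n patches."""
    n = len(patch_prev)
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += kernel_prev[i][j] * patch_prev[i][j]
            total += kernel_next[i][j] * patch_next[i][j]
    return total


def max_displacement(kernel_size):
    """Largest motion (in pixels) that a centered kernel of this size can reach."""
    return (kernel_size - 1) // 2


# Toy case: the content sits one pixel to the right of the center in the
# previous frame, and the kernel puts all of its weight on that position.
patch_prev = [[0, 0, 0], [0, 0, 9], [0, 0, 0]]
patch_next = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
kernel_prev = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
kernel_next = [[0.0] * 3 for _ in range(3)]
pixel = adaptive_conv_pixel(patch_prev, patch_next, kernel_prev, kernel_next)
```

A 3 x 3 kernel can only reach content displaced by one pixel; anything that moved farther lies outside the patch, which is the limitation described above.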
At present, many studies have developed algorithms that simultaneously estimate the flow and compensation kernels with respect to the original reference frames, tightly coupling the motion estimation and kernel estimation networks so that the intermediate frame is optimized through a single overall network model. On the one hand, compared with the flow-based method, which relies on simple bilinear coefficients, this approach can improve interpolation accuracy by using data-driven kernel coefficients. On the other hand, the optical flow predicted by the flow estimation model first locates the approximate reference kernel region, which can greatly reduce the kernel size and thus yields higher computational efficiency than the purely kernel-based method.
Although these methods have improved video frame interpolation, they all ignore that objects have irregular shapes and that motion trajectories are not fixed. Therefore, a regular convolution kernel shape cannot adapt to various objects and motions. In this paper, inspired by Deformable Convolution (DCN) [Dai et al.2017], we propose to offset the reference pixels adaptively using the latent features extracted by the network, so that the positions of the reference points become more appropriate. In addition, following Deformable Convolution v2 (DCNv2) [Zhu et al.2019], not all pixels within the reference field contribute equally when synthesizing a target pixel, and therefore we utilize both the kernel coefficient and the bilinear coefficient to learn the differences in these contributions. Finally, we propose a pixel synthesis module to adaptively combine the optical flow, the interpolation kernel, and the offset field.
2 Related Work
2.1 Video Interpolation
Recently, an increasing number of studies have demonstrated the success of deep learning in computer vision, which has also inspired various deep-learning-based frame interpolation methods. Super SloMo [Jiang et al.2018] built its network from two U-Nets, where one estimated the optical flow between the two input reference frames and the other corrected the linearly interpolated flow vectors, so as to synthesize the intermediate frames of the video. [Xu et al.2019] proposed a quadratic interpolation algorithm for synthesizing accurate intermediate video frames, which can better model the nonlinear motion of the real world by using the acceleration information of the video. Niklaus et al. successively proposed AdaConv [Niklaus et al.2017a] and SepConv [Niklaus et al.2017b], which combined motion estimation and pixel synthesis into a single convolution process by convolving the input frames with spatially adaptive kernels. In addition, SepConv replaced the regular 2D kernel with pairs of 1D kernels, which improves computational efficiency. It also developed a convolutional neural network that takes in two input frames and estimates pairs of 1D kernels for all pixels simultaneously, which enables the network to produce visually pleasing frames.
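The saving from separable kernels can be illustrated with a small sketch: a 2D kernel is approximated by the outer product of a vertical and a horizontal 1D kernel, so only 2n values need to be predicted per pixel instead of n x n. The helper names and the kernel length used in the example are illustrative assumptions.

```python
def outer_product(vertical, horizontal):
    """Build the 2D kernel implied by a pair of 1D kernels."""
    return [[v * h for h in horizontal] for v in vertical]


def kernel_values_per_pixel(n, separable):
    """Number of coefficients a network must predict for each output pixel."""
    return 2 * n if separable else n * n


kernel_2d = outer_product([1, 2], [3, 4])
full_cost = kernel_values_per_pixel(51, separable=False)
separable_cost = kernel_values_per_pixel(51, separable=True)
```

For a kernel of length 51, the separable form needs 102 coefficients per pixel rather than 2601, which is why it lowers memory use without extending the reachable motion range.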
MEMC-Net [Bao et al.2019b] proposed to exploit motion estimation and motion compensation in a neural network for video frame interpolation. It further proposed an adaptive warping layer for synthesizing new pixels, which uses the estimated optical flow and compensation kernel simultaneously. This warping layer was designed to closely couple the motion estimation and kernel estimation networks. In addition, DAIN [Bao et al.2019a] utilized depth information to explicitly detect occlusion for video frame interpolation. The bi-directional optical flow and depth maps were first estimated from the two input frames. Then, since multiple flow vectors may be encountered at the same position, the model weighted the contribution of each flow vector according to its depth value instead of simply averaging the flows, which results in flows with clearer motion boundaries.
2.2 Deformable Convolution
In recent years, Convolutional Neural Networks (CNNs) have made rapid progress in the field of computer vision. However, due to the fixed geometric structure of their building modules, CNNs are inherently limited in modeling geometric transformations. How to effectively handle geometric variations in the real world has been a challenge.
To handle this problem, [Dai et al.2017] proposed Deformable Convolutional Networks (DCN) to improve the modeling ability for geometric changes. Specifically, the deformable convolution module first learns offsets through a parallel network branch and then adds these offsets to the position of each sampling point in the convolution kernel, so that sampling near the current position is no longer limited to the regular grid. This makes the network concentrate more on the region or target of interest.
The problem with DCN is that the offset module can introduce irrelevant context information, which is harmful to the network model. The motivation of Deformable ConvNets v2 (DCNv2) [Zhu et al.2019] is to reduce this interfering information, which improves the adaptive capacity of the model to different geometric changes. It adds a modulation mechanism to the deformable convolution module, where each sample not only undergoes a learned offset but is also modulated by a learned feature amplitude. Therefore, the network module can change both the spatial distribution and the relative influence of its samples.
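The combination of learned offsets and modulation can be sketched on a 1D signal, which keeps the idea visible without the bookkeeping of a 2D layer. This is a simplified analogue under stated assumptions, not the DCNv2 implementation: the offsets and modulation scalars are passed in by hand, whereas in the network they are predicted from features.

```python
import math


def linear_sample(signal, x):
    """Linearly interpolate a 1D list at fractional x; positions outside read as 0."""
    x0 = math.floor(x)
    value = 0.0
    for xx in (x0, x0 + 1):
        if 0 <= xx < len(signal):
            value += (1 - abs(x - xx)) * signal[xx]
    return value


def modulated_deformable_1d(signal, center, weights, offsets, masks):
    """Response of a 3-tap modulated deformable convolution at one position.

    Each tap samples at its regular grid position plus a fractional offset,
    and its contribution is scaled by a modulation scalar in [0, 1].
    """
    taps = (-1, 0, 1)
    return sum(
        w * m * linear_sample(signal, center + t + o)
        for t, w, o, m in zip(taps, weights, offsets, masks)
    )


signal = [0, 10, 20, 30, 40]
regular = modulated_deformable_1d(signal, 2, [1, 1, 1], [0, 0, 0], [1, 1, 1])
deformed = modulated_deformable_1d(signal, 2, [1, 1, 1], [0.5, 0, 0], [1, 0.5, 0])
```

With zero offsets and unit modulation the layer reduces to an ordinary convolution; a modulation of 0 removes a sample entirely, which is how irrelevant context can be suppressed.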
3 Proposed Approach
In this section, we introduce the overall structure of our proposed model and the process of pixel synthesis.
3.1 Network Architecture
In this paper, we design a model with a deformable kernel region for video frame interpolation. Its overall structure is shown in Figure 1. The input of the model consists of two reference frames, and the purpose is to obtain the intermediate interpolated frame between them. Specifically, our model includes four submodules: the coefficient generation part, the offset generation part, the occlusion processing part and the frame enhancement part.
Inspired by MEMC-Net, the coefficient generation part is designed as a modulation mechanism for pixel synthesis, which consists of the bilinear coefficient and the kernel coefficient. In particular, we use PWC-Net [Sun et al.2018] to build the optical flow prediction module, which first predicts the optical flow between the two reference frames and then obtains the motion vector fields from the target frame to each of the two reference frames via the flow projection method in [Bao et al.2019b]. Finally, we locate the approximate reference region of each target pixel using these two motion vector fields. The kernel prediction module is used to estimate spatially-adaptive convolutional kernels for each target pixel, one for each reference frame. We adopt the U-Net [Ronneberger et al.2015] structure as the basic framework for predicting interpolation kernel coefficients. The network structure is shown in Figure 2. We first downsample the input frames several times through average pooling to extract features, and then upsample the feature maps the same number of times using bilinear interpolation to reconstruct them. For the kernel prediction module, an interpolation kernel with $R$ channels is finally obtained, where $R$ represents the number of pixels in the kernel region. This fully convolutional neural network is also used to predict the offset field and occlusion maps later.
Considering the irregularity of objects and the uncertainty of motion, which motivated DCN and DCNv2, we add an offset generation part to our model. Similarly, we use the network structure shown in Figure 2 as the offset prediction module, and an offset field with $2R$ channels is obtained, where 2 represents the offsets along the $x$ and $y$ directions, respectively. Then, according to the initial positions and offsets of the reference pixels, the kernel region of the target pixel is released from the constraints of the regular grid, and we finally find more accurate reference positions.
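How the offset channels relax the regular grid can be sketched for a single target pixel. The channel layout (all horizontal offsets first, then all vertical offsets), the use of the top-left corner of the kernel region as the anchor, and the function name are assumptions made for illustration; the text does not specify them.

```python
def reference_positions(anchor_x, anchor_y, offset_vector, side=4):
    """Return the R = side * side adaptive reference positions of one target pixel.

    offset_vector holds 2R values. Assumed layout: the first R entries are
    horizontal offsets and the last R entries are vertical offsets.
    """
    r = side * side
    if len(offset_vector) != 2 * r:
        raise ValueError("expected 2R offset values")
    positions = []
    for k in range(r):
        grid_x, grid_y = k % side, k // side  # position on the regular grid
        positions.append(
            (anchor_x + grid_x + offset_vector[k],
             anchor_y + grid_y + offset_vector[r + k])
        )
    return positions


offsets = [0.0] * 32
offsets[5] = 0.25        # horizontal offset of reference pixel 5
offsets[16 + 5] = -0.5   # vertical offset of reference pixel 5
positions = reference_positions(10, 20, offsets)
```

With all offsets at zero the positions form the regular grid; each nonzero pair moves one reference pixel independently of the others.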
Due to the relative motion of objects in natural video, there may exist occluded regions between the two reference frames. When this happens, the target pixel becomes invisible in one of the input frames. Therefore, in order to select effective reference pixels, we use an occlusion prediction module to learn occlusion maps. Since the occlusion maps can be understood as weights of the reference frames, their values are in the range [0,1]. Therefore, we add a sigmoid activation after the basic network structure shown in Figure 2. The blended frame is generated by weighting the contribution of each reference frame with its occlusion map, where the weighting is an element-wise multiplication. One occlusion map is the output of the network, and the other is obtained by subtracting it from a matrix of ones. Thus, the larger the value of the first map, the better the visibility in its reference frame; conversely, the larger the value of the second map, the better the visibility in the other reference frame.
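The blending step can be sketched as follows. The sketch assumes that the two inputs are the frames synthesized from each reference frame and that the network emits one raw score per pixel before the sigmoid; the function names are illustrative.

```python
import math


def sigmoid(z):
    """Map a raw score to an occlusion weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def blend(frame_a, frame_b, scores):
    """Blend two frames with an occlusion map V and its complement (1 - V)."""
    blended = []
    for row_a, row_b, row_s in zip(frame_a, frame_b, scores):
        out_row = []
        for a, b, s in zip(row_a, row_b, row_s):
            v = sigmoid(s)  # visibility weight for frame_a
            out_row.append(v * a + (1.0 - v) * b)
        blended.append(out_row)
    return blended


even = blend([[10.0]], [[20.0]], [[0.0]])
mostly_a = blend([[10.0]], [[20.0]], [[50.0]])
```

A score of zero gives both frames equal weight, while a large positive score makes the pixel come almost entirely from the first frame, matching the visibility interpretation above.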
We also add a post-processing module to enhance the quality of the interpolated frame. This module takes the concatenation of the warped frames, occlusion maps, interpolation kernels and projected flows as input. Since the input and output frames of our model are quite similar, we make the post-processing module output the residual between the blended frame and the ground truth. Note that the post-processing module is composed of 4 stacked standard convolutions with the filter size of
3.2 Pixel Synthesis
The task of the pixel synthesis module is to calculate the value of the target pixel according to the input reference frames, optical flows, interpolation kernels and offsets. First, the module locates the position of the target pixel on the reference frame according to the predicted optical flow, as depicted in Figure 3. Then, we set the size of the kernel region to $4 \times 4$, a total of 16 reference pixels, i.e., $R=16$, and is regarded as the relative coordinate of each reference pixel
Note that is the nearest integer pixel position on the top-left of . Next, we add the learned offsets to the positions of the initial reference points to get the final adaptive reference points. Since these offsets are fractional, we obtain the pixel values through bilinear interpolation.
In addition, as shown in Figure 4, we flatten the learned interpolation kernel along the channel dimension to obtain the kernel coefficients of the target pixel. Besides, according to the principle of bilinear interpolation, we calculate the coefficients of . Then, to account for the offsets, we expand the range to four parts as the kernel regions, and each part enlarges the original kernel region by one pixel width along the $x$ and $y$ directions. The bilinear coefficient of each part is set to the same as the corresponding . The detailed calculation process is as follows:
where and , represents the final relative position in the X-axis direction, which is obtained by adding the offset to the integer position of the reference point, and we determine which bilinear coefficient to use according to the final adaptive position. The synthesis process of target pixels can be expressed as:
where represents bilinear interpolation, and is the optical flow predicted at point A. The calculation process for a target pixel is shown in Algorithm 1.
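The synthesis of one target pixel from one reference frame can be sketched as a weighted sum of bilinearly sampled reference pixels. This is a minimal sketch under stated assumptions, not Algorithm 1 itself: the rule that selects a bilinear coefficient for each part of the expanded region is not reproduced, so the coefficients are passed in; the placement of the 4 x 4 region around the flow-located integer position and all names are illustrative.

```python
import math


def bilinear(img, x, y):
    """Bilinearly sample a 2D list at fractional (x, y); outside pixels read as 0."""
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < len(img) and 0 <= xx < len(img[0]):
                value += (1 - abs(x - xx)) * (1 - abs(y - yy)) * img[yy][xx]
    return value


def synthesize_from_reference(ref, x, y, flow, kernel, bilinear_coef, offsets, side=4):
    """Contribution of one reference frame to the target pixel at (x, y).

    flow is (dx, dy) from the target pixel to the reference frame; kernel and
    bilinear_coef hold one coefficient per reference pixel; offsets holds one
    (dx, dy) pair per reference pixel.
    """
    # Integer position located by the flow, shifted so that the region
    # surrounds it (assumed convention: one pixel up and left for 4 x 4).
    base_x = math.floor(x + flow[0]) - (side // 2 - 1)
    base_y = math.floor(y + flow[1]) - (side // 2 - 1)
    total = 0.0
    for k in range(side * side):
        grid_x, grid_y = k % side, k // side
        off_x, off_y = offsets[k]
        sample = bilinear(ref, base_x + grid_x + off_x, base_y + grid_y + off_y)
        total += kernel[k] * bilinear_coef[k] * sample
    return total


# Linear ramp image: bilinear sampling is exact on it, so results are easy to check.
ref = [[col + 10 * row for col in range(8)] for row in range(8)]
kernel = [1 / 16] * 16
coef = [1.0] * 16
on_grid = synthesize_from_reference(ref, 3, 3, (0.0, 0.0), kernel, coef, [(0.0, 0.0)] * 16)
shifted = synthesize_from_reference(ref, 3, 3, (0.0, 0.0), kernel, coef, [(0.5, 0.0)] * 16)
```

Shifting every reference pixel half a pixel to the right raises the result by 0.5 on the ramp image, which shows that fractional offsets move the sampled content continuously rather than in whole-pixel steps.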
It has been shown in [Bao et al.2019b] that the bilinear coefficients and interpolation coefficients can be learned by back-propagating their gradients to the flow estimation network and the kernel estimation network, respectively. Therefore, we only need to prove that the offset estimation is differentiable. For convenience, we set and represents the coordinate of point . We take the offset on the horizontal component () of the current reference point as an example, and the derivative with respect to the offset field is calculated as follows:
where represent the values of four integer pixels around the current fractional pixel, and further
Note that here . We only calculate the derivative in the horizontal component, and the calculation method in the vertical component is similar.
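As a sanity check on the differentiability argument, the sketch below writes bilinear interpolation in terms of the four integer pixels around a fractional position and compares the analytic derivative with respect to the horizontal fractional position against a finite difference. The names and the corner values are illustrative assumptions, and the check holds as long as the offset does not move the sample into a different integer cell.

```python
def bilinear_from_corners(tl, tr, bl, br, ax, ay):
    """Interpolate from top-left, top-right, bottom-left and bottom-right values.

    ax and ay are the fractional parts of the horizontal and vertical position.
    """
    top = (1 - ax) * tl + ax * tr
    bottom = (1 - ax) * bl + ax * br
    return (1 - ay) * top + ay * bottom


def d_bilinear_d_ax(tl, tr, bl, br, ay):
    """Analytic derivative of the interpolated value with respect to ax."""
    return (1 - ay) * (tr - tl) + ay * (br - bl)


tl, tr, bl, br = 1.0, 4.0, 2.0, 8.0
ax, ay, step = 0.3, 0.6, 1e-5
analytic = d_bilinear_d_ax(tl, tr, bl, br, ay)
numeric = (
    bilinear_from_corners(tl, tr, bl, br, ax + step, ay)
    - bilinear_from_corners(tl, tr, bl, br, ax - step, ay)
) / (2 * step)
```

Because the interpolated value is linear in the horizontal fractional position, the gradient depends only on the pixel differences and the vertical fraction, so it can be passed back to the offset prediction network.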
4 Experimental details
The total loss function of our model consists of two parts: the warped loss, which measures the difference between the average warped frame and the ground truth, and the enhancement loss, which measures the difference between the output image after frame quality enhancement and the ground truth. The total loss function can be formulated as:
where the weights of the two losses are set to 1 and 0.5, respectively. It is well known that optimization based on the $\ell_2$ norm leads to blurry results in most image synthesis tasks, so we use the $\ell_1$ norm for the loss. Inspired by [Lee et al.2020], [Shi et al.2021], [Bao et al.2019a], we utilize the Charbonnier Penalty Function to smoothly approximate the $\ell_1$ norm and set .
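The loss can be sketched with the Charbonnier penalty, which replaces the absolute difference by the square root of the squared difference plus a small constant. The value of that constant, the assignment of the weights 1 and 0.5 to the warped and enhancement terms in that order, and the function names are assumptions made for illustration.

```python
import math


def charbonnier(pred, target, eps=1e-6):
    """Mean Charbonnier penalty sqrt(d^2 + eps^2) over two flat lists of pixels."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = p - t
        total += math.sqrt(diff * diff + eps * eps)
    return total / len(pred)


def total_loss(warped, enhanced, ground_truth, w_warp=1.0, w_enh=0.5, eps=1e-6):
    """Weighted sum of the warped loss and the enhancement loss (assumed order)."""
    return (w_warp * charbonnier(warped, ground_truth, eps)
            + w_enh * charbonnier(enhanced, ground_truth, eps))


penalty = charbonnier([3.0], [0.0], eps=4.0)
loss = total_loss([3.0], [0.0], [0.0], eps=4.0)
```

For differences much larger than the constant the penalty behaves like the absolute difference, while near zero it stays smooth, which is what makes it a differentiable stand-in for the $\ell_1$ norm.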
We adopt the AdaMax optimizer with its two hyperparameters set to their default values. We set the initial learning rate to 0.002, and during training, if the validation loss does not decrease for 3 consecutive epochs, we reduce the learning rate by a factor of. We select Vimeo90k [Xue et al.2019] as the dataset and divide it into two parts, which are used for training and validating our proposed model, respectively. One part contains 64,600 triplets as the training set, and the other part has 7,824 triplets as the validation set, with a resolution of per frame. We regard the middle frame of each triplet as the ground truth and the remaining two frames as the input data. In addition, we also reverse the temporal order of the input sequence for data augmentation. During training, we set the batch size to 3 and run our experiments on an RTX 2080Ti GPU. After training for around 100 epochs, the training loss has converged.
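The learning-rate rule described above can be sketched as a small plateau scheduler. The class name is hypothetical, and the reduction factor used in the example is a placeholder rather than the setting used for training; only the initial rate of 0.002 and the patience of 3 epochs come from the text.

```python
class PlateauScheduler:
    """Reduce the learning rate when the validation loss stops decreasing."""

    def __init__(self, lr=0.002, patience=3, factor=0.5):
        self.lr = lr
        self.patience = patience
        self.factor = factor  # placeholder value, assumed for illustration
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the current rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr


scheduler = PlateauScheduler()
history = [scheduler.step(loss) for loss in [1.0, 0.9, 0.95, 0.93, 0.92, 0.8]]
```

In this run the loss fails to improve on 0.9 for three epochs in a row, so the rate is reduced once at the fifth epoch and stays there when the loss improves again.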
In order to verify the effectiveness of our model, we also evaluate the trained model on the following four datasets.
Vimeo90K Test Set
This dataset [Xue et al.2019] consists of 3758 video sequences, each of which has three frames. As with the Vimeo90k training set, the frame resolution is , and we utilize the first and third frames of each sequence to synthesize the second one.
The UCF101 [Soomro et al.2012] dataset is a large-scale human action dataset. It consists of video sequences containing camera motion and cluttered backgrounds. We selected 333 triplets from it for testing, of which the resolution is .
The DAVIS (Densely Annotated VIdeo Segmentation) dataset [Perazzi et al.2016] is composed of 50 high-quality, full HD video sequences, covering many common video object segmentation challenges, such as occlusion, motion blur and appearance change. We select 50 groups of three consecutive frames as the test data, and the resolution of each frame is .
MPI-Sintel [Butler et al.2012] introduces a new optical flow dataset derived from the open source 3D animated short film Sintel, which has important features such as long sequences, large motions, specular reflections, motion blur, defocus blur and atmospheric effects. We randomly constructed triplets from seven sequences for testing to verify the performance of our model under the above features.
Table 1 shows the quantitative comparison between our method and various state-of-the-art methods, including CyclicGen [Liu et al.2019], SepConv-$L_1$, SepConv-$L_F$ [Niklaus et al.2017c], AdaCoF [Lee et al.2020], MEMC [Bao et al.2019b], DAIN [Bao et al.2019a] and R-SepConv [Niklaus et al.2021]. We measured the PSNR, SSIM and number of model parameters of each method on three test datasets. Although our method does not have the fewest parameters, it performs best overall on the three datasets with only a small parameter overhead.
To make our results more convincing, we selected several effective methods and visualized their outputs on the drift-turn sequence in Davis480p. We took the first and third frames as the input of each model to obtain the second frame, as shown in the first and second rows of Figure 5. Since the motion of the object across these three frames is very large, other methods cannot generate high-quality interpolated frames; for example, the car wheels disappear or the brand lettering becomes blurred. In contrast, our method synthesizes large moving objects well.
We also apply several state-of-the-art interpolation methods to the MPI-Sintel dataset and compare them with our method. The quantitative comparison results are shown in Table 2. According to the table, our proposed method achieves higher PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity) with a relatively short runtime. We also visualized the interpolated frame derived from the sequence named bandage1 in the dataset, as shown in the third and fourth rows of Figure 5. According to the results, our method also performs well in synthesizing the detailed texture of objects. More qualitative results and ablation experiments are provided in the supplementary material.
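For reference, PSNR follows directly from the mean squared error between an interpolated frame and its ground truth. The sketch below assumes 8-bit images with a peak value of 255 stored as nested lists; the function name is illustrative, and SSIM is omitted because it requires local window statistics.

```python
import math


def psnr(pred, target, max_value=255.0):
    """Peak Signal-to-Noise Ratio in dB between two images given as 2D lists."""
    squared_error = 0.0
    count = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            squared_error += (p - t) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


score = psnr([[100, 100]], [[110, 90]])
```

Because the scale is logarithmic, a gain of a few tenths of a dB already corresponds to a noticeable reduction in mean squared error.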
This paper presents a novel method for video frame interpolation. Due to the diversity of object shapes and the uncertainty of motion, it may not be possible to obtain meaningful features with a fixed kernel region. Inspired by deformable convolution, we remove the limitation of using fixed-grid reference pixels when synthesizing the target pixel and make the model learn offsets for all reference points of each target pixel. We then locate the final reference points according to the offsets and synthesize the target pixel. Our model was tested on four datasets, and the experimental results show that our method is superior to most competitive algorithms.
- [Bao et al.2019a] Wenbo Bao, Wei-Sheng Lai, Chao Ma, Xiaoyun Zhang, Zhiyong Gao, and Ming-Hsuan Yang. Depth-aware video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3703–3712, 2019.
- [Bao et al.2019b] Wenbo Bao, Wei-Sheng Lai, Xiaoyun Zhang, Zhiyong Gao, and Ming-Hsuan Yang. Memc-net: Motion estimation and motion compensation driven neural network for video interpolation and enhancement. IEEE transactions on pattern analysis and machine intelligence, 2019.
- [Butler et al.2012] Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. A naturalistic open source movie for optical flow evaluation. In European conference on computer vision, pages 611–625. Springer, 2012.
- [Dai et al.2017] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 764–773, 2017.
- [Ilg et al.2017] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2462–2470, 2017.
- [Jiang et al.2018] Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik Learned-Miller, and Jan Kautz. Super slomo: High quality estimation of multiple intermediate frames for video interpolation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9000–9008, 2018.
- [Lee et al.2020] Hyeongmin Lee, Taeoh Kim, Tae-young Chung, Daehyun Pak, Yuseok Ban, and Sangyoun Lee. Adacof: Adaptive collaboration of flows for video frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5316–5325, 2020.
- [Liu et al.2019] Yu-Lun Liu, Yi-Tung Liao, Yen-Yu Lin, and Yung-Yu Chuang. Deep video frame interpolation using cyclic frame generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8794–8802, 2019.
- [Niklaus et al.2017a] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive convolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 670–679, 2017.
- [Niklaus et al.2017b] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive separable convolution. In Proceedings of the IEEE International Conference on Computer Vision, pages 261–270, 2017.
- [Niklaus et al.2017c] Simon Niklaus, Long Mai, and Feng Liu. Video frame interpolation via adaptive separable convolution. In Proceedings of the IEEE International Conference on Computer Vision, pages 261–270, 2017.
- [Niklaus et al.2021] Simon Niklaus, Long Mai, and Oliver Wang. Revisiting adaptive convolutions for video frame interpolation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1099–1109, 2021.
- [Paliwal and Kalantari2020] Avinash Paliwal and Nima Khademi Kalantari. Deep slow motion video reconstruction with hybrid imaging system. IEEE transactions on pattern analysis and machine intelligence, 42(7):1557–1569, 2020.
- [Perazzi et al.2016] Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 724–732, 2016.
- [Ranjan and Black2017] Anurag Ranjan and Michael J Black. Optical flow estimation using a spatial pyramid network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4161–4170, 2017.
- [Ronneberger et al.2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
- [Shi et al.2021] Zhihao Shi, Xiaohong Liu, Kangdi Shi, Linhui Dai, and Jun Chen. Video frame interpolation via generalized deformable convolution. IEEE Transactions on Multimedia, 2021.
- [Soomro et al.2012] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
- [Sun et al.2018] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8934–8943, 2018.
- [Xu et al.2019] Xiangyu Xu, Li Siyao, Wenxiu Sun, Qian Yin, and Ming-Hsuan Yang. Quadratic video interpolation. arXiv preprint arXiv:1911.00627, 2019.
- [Xue et al.2019] Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T Freeman. Video enhancement with task-oriented flow. International Journal of Computer Vision, 127(8):1106–1125, 2019.
- [Zhu et al.2019] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9308–9316, 2019.
Supplementary Material for the paper “Video Frame Interpolation Based on Deformable Kernel Region”
To verify the effectiveness of our proposed model, we performed comprehensive ablation experiments to analyze our network structure. We successively removed the kernel prediction module, offset prediction module, occlusion prediction module and post-processing module from the overall network model presented in the main paper, each time combining the remaining modules. We trained the resulting module combinations using the same method and then tested them on different datasets. The quantitative results are shown in Table 1. Note that the flow prediction module was not removed in our ablation experiments, as the optical flow information has a direct impact on locating the kernel region.
For the F+OF+OC+P combination without K, shown in the first row of Table 1, we found that although the model can reasonably transform the positions of the reference pixels using the learned offsets, the PSNR and SSIM are still low. This is because each reference pixel has a different influence on the target point. Therefore, when the model lacks the kernel coefficients, irrelevant reference pixels contribute to the result, which produces low-quality interpolated frames. By comparing the second and fifth rows of Table 1, we observe that when the offset prediction module is added, PSNR increases by 0.58 dB, 1.25 dB and 0.29 dB on the Vimeo90k, UCF101 and Davis480p datasets, respectively. When we remove the occlusion module or the post-processing module, as shown in the third and fourth rows of Table 1, the image quality also decreases. The improvement brought by the occlusion module arises because the relative motion of objects in natural video creates occluded areas between the two reference frames. Without this module, the model may select occluded, invalid reference pixels for interpolation. As for the post-processing module, since the blended image usually contains artifacts caused by inaccurate flow estimation or offset fields, this module eliminates these negative effects, which is also very important for improving the overall visual quality.
Visualization of Experimental Results
To better demonstrate the effectiveness of the proposed video frame interpolation model, we provide more qualitative results in this section. We selected several sequences from the Vimeo90k and Davis480p datasets for testing and visualized the interpolated frames to compare the visual quality of our proposed method with that of recent state-of-the-art methods, as shown in Figures 1 to 5 below. We observe that the interpolated frames generated by our method have better subjective image quality, especially for video sequences with complex large motion or occlusion.