1 Introduction
Synthesizing an intermediate frame from consecutive frames, called video frame interpolation, is one of the main problems in video processing. With a frame interpolation algorithm, we can obtain slow-motion videos from ordinary videos without professional high-speed cameras. It can also be applied to video compression by restoring down-sampled videos. However, interpolating video frames requires far more complicated and delicate approaches than spatial image pixel interpolation.
Most approaches define video frame interpolation as the problem of finding the reference locations in the input frames that contain the information needed to estimate each output pixel value. The output pixel values are then calculated from the reference pixels. Recent approaches train deep neural networks to solve this problem. There are two outstanding paradigms among state-of-the-art deep learning based approaches. The first is the kernel estimation based approach (Figure 2 (a)) [33, 34]. It estimates a kernel adaptively for each pixel and synthesizes the intermediate frame by convolving the kernels with the input. This approach finds the proper reference location by assigning a large weight to the pixel of interest. However, it cannot deal with motion larger than the kernel size, and it has to keep the estimated kernel unnecessarily large even for small motions. The second is the flow map estimation based approach (Figure 2 (b)) [19, 24]. It estimates a flow vector pointing directly to the reference location for each output pixel, but only one location per input frame is referred to. Therefore it is hard to deal with complex motions, and the result may suffer from a lack of information when the input frame is of low quality. Although the two paradigms have their own limitations, they are complementary to each other. The kernel estimation based approach can deal with complex motions or low-quality frames because it refers to multiple pixels, while the flow map estimation based approach can deal with any magnitude of motion because it points directly to the reference location. This complementarity suggests that the advantages of the two paradigms could be combined.
Deformable convolutional networks (DCNs) are extended versions of convolutional neural networks (CNNs). While a convolutional layer samples the pixels to be multiplied by the weights from a square grid, a deformable convolutional layer adds 2D offsets to the grid and samples pixels at arbitrary locations. This makes it possible to convolve images with kernels of various shapes and sizes for each pixel. Therefore, a DCN can handle data with arbitrary scale or rotation by learning the spatial transforms of the input data.
There are strong correlations between adjacent video frames. However, they are not exactly the same. In this paper, we interpret this relation as a spatial transform between the frames and define the frame interpolation task as the problem of finding this transform from the input frames to the intermediate frame. To solve this problem, we propose a new module called Adaptive Deformable Convolution (Figure 2 (c)). This operation is an extension of Deformable Convolution, which learns spatial transforms. It has the advantages of both the kernel estimation and the flow map estimation paradigms. First, for each output pixel, it estimates both weights and offset vectors for the convolution operation. Therefore we do not have to estimate a large kernel, because the sizes and shapes are not fixed. Second, since we refer to at least 9 input pixels per input frame, the resulting output pixel value can be more stable and reliable. As shown in Figure (b), these benefits lead to more realistic and stable results.
2 Related Work
Video interpolation. Most classic video frame interpolation methods estimate dense flow maps using optical flow algorithms and warp the input frames [2, 45, 47]. Therefore, the performance of these approaches largely depends on the optical flow algorithm. In fact, frame interpolation has often been used to evaluate optical flow algorithms [1, 3]. There are many classic approaches to estimating dense optical flow [17, 29], and deep neural networks are also actively applied to this problem [7, 9, 15, 42, 44, 46]. However, optical flow based approaches have limitations in many cases such as occlusion, large motion, and brightness change. Some approaches address the occlusion problem [14, 37, 48], but the other problems remain. Mahajan et al. traced out the reference paths in the source images and synthesized the output image by solving a Poisson equation. This method can partially overcome the limitations of classic approaches, but its heavy optimization process makes it computationally expensive. Meyer et al. regarded the video as a linear combination of wavelets with different directions and frequencies, and interpolated each wavelet's phase and magnitude. This method made notable progress in both performance and running time. Their recent work also applied deep learning to this approach. However, it still has limitations for large motions of high-frequency components.
Many recent successes of deep learning in computer vision [6, 10, 11, 13, 20, 22, 39] have inspired various deep learning based frame interpolation methods. Since all that is needed to train the neural networks is three consecutive video frames, learning-based approaches are well suited to this task. Long et al. proposed a CNN architecture that takes two input frames and directly estimates the intermediate frame. However, this kind of approach often leads to blurry results. Later methods, instead of directly estimating the image, mainly focused on where to find the output pixel in the input frames. This paradigm is motivated by the fact that at least one input frame contains the output pixel, even in case of occlusion. Niklaus et al. estimate a kernel for each location and obtain the output pixel by convolving it over the input patches. Each kernel samples the proper input pixels by selectively combining them. However, estimating large kernels for every pixel requires a large amount of memory and is computationally expensive. Niklaus et al. solved this problem by estimating each kernel as the outer product of two vectors. However, this approach cannot handle motions larger than the kernel size, and it is still wasteful to estimate a large kernel even for small motions. Liu et al. estimated a flow map consisting of vectors that point directly to the reference locations, and sampled the proper pixels according to the flow map. However, since they assume that the forward and backward flows are the same, it is hard to handle complex motions. Jiang et al. proposed a similar algorithm, but they estimate the forward and backward flows separately. They also improved the flow computation stage by defining a warping loss. However, it can be risky to take only one pixel value from each frame, especially when the input patches are poor in quality. Niklaus et al. warp the input frames with the bidirectional optical flow estimated by PWC-Net and refine them with a neural network. In particular, they exploited context information extracted by ResNet-18 to enable informative interpolation and obtained high-quality results. However, this approach requires a pre-trained optical flow estimator, and the GridNet architecture they use is computationally expensive.
Learning Spatial Transforms. Departing from classic convolutions based on square-shaped kernels, some approaches learn spatial transforms that deform the shape of the receptive fields. Jaderberg et al. added affine transforms with trainable parameters to neural networks. This method applies the proper transform to the input and therefore enables scale- and rotation-invariant feature extraction. However, because the transformers have few parameters, it is hard to learn varied and complex transforms. Jeon et al. proposed Active Convolution, which deforms the shape of convolution kernels through offset learning. It learns the offset vectors that indicate where to multiply the kernel weights. However, since these offset vectors are shared across all locations of the input image, it is not possible to apply spatial transforms adaptively to each pixel. Dai et al. proposed the Deformable Convolutional Network (DCN), which is also an offset learning method, but it estimates the offsets as dynamic model outputs. Therefore, it is possible to apply different transforms to each pixel, but there is still the limitation that the kernel weights are shared across all locations.
3 Proposed Approach
In this section, we first redefine the frame interpolation task in Section 3.1. Then, in Section 3.2, we explain the adaptive deformable convolution, which is the main contribution of this paper. In Section 3.3, we propose some prior-based constraints that improve the performance of our method. Finally, the details of our network architecture and implementation issues are explained in Sections 3.4 and 3.5.
3.1 Video Frame Interpolation
Given two consecutive video frames, our goal is to find the intermediate frame between them. Even in case of occlusion, all the information needed to obtain the intermediate frame can be found in the two input frames. However, this does not mean that the intermediate frame matches either input frame exactly. There must be a spatial transform from each input frame to the intermediate frame because of the motion over time. Therefore, given the forward and backward spatial transforms, we can consider the intermediate frame as a combination of the two transformed input frames as follows.
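One way to write this combination is sketched below. The symbol names are assumptions introduced here for illustration: $I_n$ and $I_{n+1}$ denote the two input frames, $I_{out}$ the intermediate frame, and $\mathcal{T}_f$, $\mathcal{T}_b$ the forward and backward spatial transforms.

```latex
I_{out} = \mathcal{T}_f(I_n) + \mathcal{T}_b(I_{n+1})
```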
The frame interpolation task thus reduces to the problem of finding these spatial transforms. We model them with a new operation called Adaptive Deformable Convolution, which convolves the input image with adaptive kernel weights and offset vectors for each output pixel.
Occlusion reasoning. The input and output images are assumed to have the same size. In case of occlusion, the target pixel will not be visible in one of the input images. Therefore we define an occlusion map with one value per pixel and modify Equation (1) as follows.
where the product is a pixel-wise multiplication and the second frame is weighted by a matrix of ones minus the occlusion map. For each target pixel, the occlusion map indicates whether the pixel is visible only in the first input frame or only in the second.
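The occlusion-aware combination can be illustrated per pixel with the minimal sketch below. It is not the authors' implementation: the function name `blend_with_occlusion`, the list-of-lists image representation, and the convention that an occlusion value of 1 selects the first frame and 0 the second are all assumptions made for the example.

```python
def blend_with_occlusion(warped_first, warped_second, v):
    """Blend two already-transformed frames pixel by pixel.

    v is an occlusion map with values in [0, 1]; the first frame is
    weighted by v and the second by (1 - v), so the weights sum to 1.
    """
    height, width = len(v), len(v[0])
    return [
        [
            v[i][j] * warped_first[i][j] + (1.0 - v[i][j]) * warped_second[i][j]
            for j in range(width)
        ]
        for i in range(height)
    ]


# Toy 2x2 example: constant frames and a hand-written occlusion map.
first = [[10.0, 10.0], [10.0, 10.0]]
second = [[20.0, 20.0], [20.0, 20.0]]
occlusion = [[1.0, 0.0], [0.5, 0.25]]
blended = blend_with_occlusion(first, second, occlusion)
```

Pixels where the map is 1 or 0 copy a single frame, while intermediate values mix both.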
3.2 Adaptive Deformable Convolution
Consider the result of applying the spatial transform to an input image. If the transform is defined as a classic convolution, each output pixel can be written as follows.
where the sum runs over a square kernel of a given size with its kernel weights. The input image is assumed to be padded so that the original input and the output have the same size. Deformable convolution adds offset vectors to the classic convolution as follows.
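As an illustration, the two operations can be written as below. The notation is an assumption made for this sketch: $F$ is the kernel size, $W_{k,l}$ are the shared kernel weights, $I$ is the padded input, $\hat{I}$ is the output, and $(\alpha_{k,l}, \beta_{k,l})$ are the offset vectors, which deformable convolution predicts for each output location $(i,j)$.

```latex
\begin{aligned}
\text{classic:}\quad \hat{I}(i,j) &= \sum_{k=0}^{F-1}\sum_{l=0}^{F-1} W_{k,l}\, I(i+k,\; j+l) \\
\text{deformable:}\quad \hat{I}(i,j) &= \sum_{k=0}^{F-1}\sum_{l=0}^{F-1} W_{k,l}\, I\big(i+k+\alpha_{k,l}(i,j),\; j+l+\beta_{k,l}(i,j)\big)
\end{aligned}
```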
Adaptive deformable convolution, unlike classic deformable convolution, does not share the kernel weights across pixels. Therefore the kernel weights must also be indexed by the output pixel location, as follows.
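A sketch of this per-pixel form is given below, with assumed notation: $F$ is the kernel size, $I$ the padded input, $\hat{I}$ the output, and $W_{k,l}(i,j)$, $\alpha_{k,l}(i,j)$, $\beta_{k,l}(i,j)$ the kernel weights and offset components estimated for output location $(i,j)$.

```latex
\hat{I}(i,j) = \sum_{k=0}^{F-1}\sum_{l=0}^{F-1} W_{k,l}(i,j)\, I\big(i+k+\alpha_{k,l}(i,j),\; j+l+\beta_{k,l}(i,j)\big)
```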
The offset values may not be integers. In other words, a sampling location may point to an arbitrary position, not only to a grid point. Therefore, the pixel value of the input image at any real-valued location has to be defined. We use bilinear interpolation to obtain the values at non-grid locations as follows.
for an arbitrary location, using the floor operation as described in Figure 4. Bilinear interpolation also makes the module differentiable, so it can be included in a neural network as a layer and trained end-to-end.
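The computation for a single output pixel can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the CUDA layer described later: the function names, the list-of-lists image, the top-left kernel indexing `(i + k, j + l)`, and the clamping of out-of-range locations to the border are all choices made here for the example.

```python
import math


def bilinear_sample(image, y, x):
    """Sample a 2D image (list of rows) at a real-valued location (y, x).

    Out-of-range locations are clamped to the border; this is an
    assumption made only to keep the sketch self-contained.
    """
    height, width = len(image), len(image[0])
    y = min(max(y, 0.0), height - 1.0)
    x = min(max(x, 0.0), width - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    dy, dx = y - y0, x - x0
    return (
        (1 - dy) * (1 - dx) * image[y0][x0]
        + (1 - dy) * dx * image[y0][x1]
        + dy * (1 - dx) * image[y1][x0]
        + dy * dx * image[y1][x1]
    )


def adaptive_deformable_pixel(image, i, j, weights, offsets_y, offsets_x):
    """Compute one output pixel from its own F x F weights and offsets."""
    size = len(weights)
    total = 0.0
    for k in range(size):
        for l in range(size):
            y = i + k + offsets_y[k][l]
            x = j + l + offsets_x[k][l]
            total += weights[k][l] * bilinear_sample(image, y, x)
    return total


image = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
uniform = [[0.25, 0.25], [0.25, 0.25]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
halves = [[0.5, 0.5], [0.5, 0.5]]
no_shift = adaptive_deformable_pixel(image, 0, 0, uniform, zeros, zeros)
half_shift = adaptive_deformable_pixel(image, 0, 0, uniform, halves, halves)
```

With zero offsets the operation reduces to an ordinary 2x2 average; shifting every tap by half a pixel moves the whole receptive field, which is the behavior the offsets are meant to provide.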
Adaptive deformable convolution has a very high degree of freedom. However, high expressive power does not always result in high performance. In the experiments of Section 4.1, we found that adding some constraints to the model leads to performance improvements. They help the model parameters to be trained in the right direction.
Weight constraint. In adaptive deformable convolution, each offset vector samples a reference location, and the weights make the final decision from the aggregated information. In other words, the weights act as an attention mechanism. Therefore we use a softmax activation to make the weights non-negative and sum to 1. Since the occlusion map takes values between 0 and 1 and weights the two frames complementarily, the two sets of weights from the two frames also sum to 1 when multiplied by the occlusion map.
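The effect of this constraint can be checked numerically with a short sketch. The helper `softmax` and the example values are assumptions for illustration; the occlusion value `v` is taken to weight the first frame by `v` and the second by `1 - v`.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list of raw kernel weights."""
    peak = max(values)
    exps = [math.exp(value - peak) for value in values]
    total = sum(exps)
    return [e / total for e in exps]


# Raw (unconstrained) weights for one output pixel, one list per input frame.
raw_first = [0.2, -1.0, 3.0, 0.5]
raw_second = [1.5, 1.5, -0.3, 0.0]
w_first = softmax(raw_first)
w_second = softmax(raw_second)

v = 0.3  # hypothetical occlusion value for this pixel
combined_sum = v * sum(w_first) + (1.0 - v) * sum(w_second)
```

Because each softmax output sums to 1, the combined weights over both frames also sum to 1 for any occlusion value in [0, 1].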
Smoothness constraint. It is based on the prior that adjacent flow vectors have similar values. Therefore we regularize the total variation of the flow maps and the occlusion maps by modifying the loss function. This part is described in more detail in Section 3.4.
3.4 Network Architecture
We design a fully convolutional neural network which estimates the kernel weights, the offset vectors, and the occlusion map. Therefore, videos of arbitrary size can be used as the input. Also, since each module of the neural network is differentiable, it is end-to-end trainable. Our neural network starts with the U-Net architecture, which consists of an encoder, a decoder, and skip connections. Each processing unit basically contains a 3×3 convolution and a ReLU activation. For the encoder part, we use average pooling to extract the features, and for the decoder part, we use bilinear interpolation for the upsampling. After the U-Net architecture, seven sub-networks estimate the final outputs (the kernel weights and the two offset components for each frame, and the occlusion map). We use a sigmoid activation for the occlusion map so that its values lie between 0 and 1. Also, since the weights for each pixel have to be non-negative and sum to 1, softmax layers are used to enforce this constraint. More specific details of the network architecture are described in Figure 3.
Loss Function. First, we have to reduce a distance measure between the model output and the ground truth. We use the L1 norm for this measure as follows.
The L2 norm can also be used as the measure, but it is known that L2-norm-based optimization leads to blurry results in most image synthesis tasks [12, 25, 28, 41]. Following Liu et al., we use the Charbonnier function for optimizing the L1 norm, where .
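The Charbonnier penalty is a smooth approximation of the absolute value, sqrt(d^2 + eps^2). A minimal sketch is shown below; the function name, the averaging over pixels, and the default `eps` are assumptions for illustration and are not taken from the paper.

```python
import math


def charbonnier_loss(output, target, eps=1e-3):
    """Mean Charbonnier penalty sqrt(d^2 + eps^2) over flat pixel lists.

    eps is a small smoothing constant; its default here is an assumed
    value, not the one used by the authors.
    """
    return sum(
        math.sqrt((o - t) ** 2 + eps ** 2) for o, t in zip(output, target)
    ) / len(output)
```

For differences much larger than `eps` the penalty behaves like the absolute error, while near zero it stays differentiable.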
We also consider smoothness constraints for the flow and occlusion maps. First, the total variation of the occlusion map is used as its regularizer. For the flow map, since there is more than one offset vector for each image pixel, we take the weighted sum of the offset vectors as follows.
Then we use its total variation as the regularizer of the flow map. Finally, the total loss is obtained as follows.
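The pieces of this loss can be sketched as follows. This is an illustration under assumptions: the function names, the anisotropic (absolute-difference) form of the total variation, and the two hypothetical balancing coefficients `lambda_flow` and `lambda_occ` are not taken from the paper.

```python
def total_variation(field):
    """Sum of absolute differences between vertically and horizontally
    adjacent entries of a 2D field (anisotropic total variation)."""
    height, width = len(field), len(field[0])
    tv = 0.0
    for i in range(height):
        for j in range(width):
            if i + 1 < height:
                tv += abs(field[i + 1][j] - field[i][j])
            if j + 1 < width:
                tv += abs(field[i][j + 1] - field[i][j])
    return tv


def weighted_offset(weights, offsets):
    """Collapse the offsets of one pixel into a single value using the
    pixel's own kernel weights (which are assumed to sum to 1)."""
    return sum(w * o for w, o in zip(weights, offsets))


def total_loss(reconstruction, flow_tv, occlusion_tv, lambda_flow, lambda_occ):
    """Reconstruction term plus the two weighted smoothness regularizers."""
    return reconstruction + lambda_flow * flow_tv + lambda_occ * occlusion_tv
```

Applying `weighted_offset` at every pixel yields one flow-like map per offset component, to which `total_variation` can then be applied.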
where and .
We train our neural network with training images of size using AdaMax optimizer , where . The learning rate is initially 0.001 and is halved every 20 epochs. The batch size is 8 and the network is trained for 50 epochs. Our code will be uploaded online.
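The step schedule described here (initial rate 0.001, halved every 20 epochs, 50 epochs in total) can be written as a small helper. The function name and the zero-based epoch index are assumptions for illustration.

```python
def learning_rate(epoch, initial=0.001, decay_every=20, factor=0.5):
    """Learning rate for a zero-based epoch index under step decay."""
    return initial * factor ** (epoch // decay_every)


# Rates seen over a 50-epoch run: epochs 0-19, 20-39, and 40-49.
schedule = sorted({learning_rate(epoch) for epoch in range(50)}, reverse=True)
```

Over 50 epochs the rate therefore takes three distinct values.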
Dataset preparation. We compose 250,000 triplets of three consecutive frames with size from high-quality YouTube videos. Each triplet is extracted at a randomly selected video index, time, and location. Following Niklaus et al., we calculate the mean flow magnitude of each triplet to balance the large and small motions in the dataset. 25% of the triplets have a mean flow magnitude of more than 20 pixels. We also standardize the dataset to have balanced negative and positive values by subtracting each color channel's mean pixel value. In addition, to prevent the triplets from including scene changes, we compute the color histogram of each frame and exclude triplets with a large change in color distribution. To augment the dataset, we randomly crop patches from the original images. We also eliminate biases due to priors by flipping the frames horizontally, flipping them vertically, and swapping the order of the frames, each with probability 0.5.
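The augmentation steps can be sketched as below. This is only an illustration: the function name, the list-of-lists frames, the use of one shared crop window for all three frames, and the independent application of each flip are assumptions.

```python
import random


def augment_triplet(frames, crop_size, rng):
    """Randomly crop, flip, and temporally reverse a triplet of frames.

    frames is a list of three equally sized 2D images (lists of rows);
    the same crop window and the same flips are applied to all three.
    """
    height, width = len(frames[0]), len(frames[0][0])
    top = rng.randint(0, height - crop_size)
    left = rng.randint(0, width - crop_size)
    frames = [
        [row[left:left + crop_size] for row in frame[top:top + crop_size]]
        for frame in frames
    ]
    if rng.random() < 0.5:  # horizontal flip
        frames = [[row[::-1] for row in frame] for frame in frames]
    if rng.random() < 0.5:  # vertical flip
        frames = [frame[::-1] for frame in frames]
    if rng.random() < 0.5:  # swap the temporal order of the frames
        frames = frames[::-1]
    return frames
```

Reversing the temporal order keeps the middle frame as the ground truth, so a reversed triplet is still a valid training sample.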
Our approach is implemented using PyTorch. To implement the adaptive deformable convolution layer, we used CUDA and cuDNN for parallel processing. We set the kernel size and all the weights, offsets and occlusion map require 0.94 GB of memory for a 1080p video frame. This is about 70% of the memory required by Niklaus et al. Using an RTX 2080 Ti GPU, it takes 0.21 seconds to synthesize a frame.
Boundary handling. Since adaptive deformable convolution is based on the classic convolution operation, the input must be larger than the output. Therefore, the input image needs to be padded. We found that reflection padding leads to higher performance than the other padding methods.
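Reflection padding mirrors the image content across each border. The sketch below shows one common variant, in which the border pixel itself is not repeated; which variant the authors used is not stated, so this choice, like the function names, is an assumption. The padding width must be smaller than the image size.

```python
def reflect_pad_1d(row, pad):
    """Mirror a sequence around its end points without repeating them,
    e.g. [1, 2, 3] with pad=1 becomes [2, 1, 2, 3, 2]."""
    left = [row[k] for k in range(pad, 0, -1)]
    right = [row[-k - 1] for k in range(1, pad + 1)]
    return left + list(row) + right


def reflect_pad_2d(image, pad):
    """Apply reflection padding along both axes of a 2D image."""
    rows = [reflect_pad_1d(row, pad) for row in image]
    top = [rows[k] for k in range(pad, 0, -1)]
    bottom = [rows[-k - 1] for k in range(1, pad + 1)]
    return top + rows + bottom
```

For example, `reflect_pad_2d([[1, 2], [3, 4]], 1)` produces a 4x4 image whose center is the original 2x2 block.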
4 Experiments
In Section 4.1, we first check the contribution of each constraint explained in Section 3.3 through ablation experiments. Then we quantitatively and qualitatively compare our algorithm with state-of-the-art methods in Sections 4.2 and 4.3. The test datasets used for the experiments are the Middlebury dataset, some randomly sampled sequences from UCF101, and the DAVIS dataset. Finally, in Section 4.4, we visualize the kernel weights, the offset vectors, and the occlusion maps estimated from some sample images to check whether they behave as intended.
4.1 Ablation Study
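| Method | Middlebury PSNR | Middlebury SSIM | UCF101 PSNR | UCF101 SSIM | DAVIS PSNR | DAVIS SSIM |
|---|---|---|---|---|---|---|
| Ours + s.c | 33.799 | 0.962 | 33.687 | 0.970 | 26.380 | 0.869 |
| Ours + w.c | 33.987 | 0.964 | 34.135 | 0.970 | 26.466 | 0.872 |
| Ours + s.c/w.c | 34.148 | 0.966 | 34.033 | 0.969 | 26.504 | 0.874 |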
In Section 3.3, we introduced two types of constraint, called the smoothness and weight constraints. We perform ablation experiments to check the contribution of each constraint. We compare the performance of four versions according to whether they include the smoothness constraint (s.c) and the weight constraint (w.c). We evaluate each version by measuring PSNR (Peak Signal-to-Noise Ratio) and SSIM (Structural Similarity) on all test datasets. According to Table 1, the version with both smoothness and weight constraints performs best overall, although the version with only the weight constraint is marginally better on one dataset. Figure (b) shows that the results without the constraints suffer from blurring and ghosting artifacts.
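For reference, PSNR is computed from the mean squared error between the output and the ground truth. The sketch below assumes flat lists of pixel values and a peak value of 255 (8-bit images); both the function name and these conventions are assumptions, since the evaluation code is not given.

```python
import math


def psnr(output, target, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB between two flat pixel lists."""
    mse = sum((o - t) ** 2 for o, t in zip(output, target)) / len(output)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)
```

Higher values indicate a smaller reconstruction error; because the scale is logarithmic, a gain of 10 dB corresponds to a tenfold reduction in mean squared error.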
4.2 Quantitative Evaluation
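| Method | Middlebury PSNR | Middlebury SSIM | UCF101 PSNR | UCF101 SSIM | DAVIS PSNR | DAVIS SSIM |
|---|---|---|---|---|---|---|
| Phase Based | 31.117 | 0.933 | 32.454 | 0.953 | 23.465 | 0.800 |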
We compare our method with simply overlapped results and several state-of-the-art algorithms including Phase Based, MIND, SepConv, DVF and SuperSlomo. Despite the many recent successes of state-of-the-art methods in frame interpolation, there are few comparisons between them when trained on a common training dataset. Therefore, we implement the competing algorithms and train each of them on the training dataset introduced in Section 3.5 for 50 epochs. We measure the PSNR and SSIM of each algorithm on the three test datasets. The results are shown in Table 3. According to the table, Phase Based is limited on the challenging dataset (DAVIS), while MIND is limited on the easier ones (Middlebury, UCF101). Also, the flow map estimation based approaches (DVF, SuperSlomo) generally perform better than the kernel estimation based one (SepConv). Overall, our method outperforms the other algorithms on all test datasets.
4.3 Visual Comparison
Since the video frame interpolation task does not have a single fixed answer, evaluations based on PSNR and SSIM are not sufficient by themselves. Therefore we also qualitatively evaluate the methods by visually comparing their results. In particular, we check how our method and other state-of-the-art algorithms handle the two main obstacles in this area: large motion and occlusion.
Large motion. When the point of interest is located far away, the search area has to be expanded accordingly. Therefore the large motion problem is one of the most challenging obstacles in video frame interpolation. Figure (g) shows the estimated results of various approaches including our method. Compared to the competing algorithms, our approach better synthesizes fast-moving objects.
Occlusion. Most of the objects in the intermediate frame appear in both adjacent frames. However, in case of occlusion, the object does not appear in one of the frames. Therefore, the appropriate frame has to be selected for each case, which makes the problem more difficult. In the sample image in Figure (g), a car causes occlusion in front of and behind itself. Comparing the estimated images in the occluded areas, the results of MIND and SepConv tend to be blurry, and those of DVF and SuperSlomo suffer from some artifacts. Our method handles the occlusion problem better than the other approaches.
4.4 Offset Visualization
Our method estimates several parameters from the input images: the kernel weights, the offset vectors, and the occlusion map. In order to check whether the parameters behave as intended, we visualize them in various ways. We examine the occlusion map, the mean flow maps, and the flow variance maps for a sample image.
Occlusion map. Figure 8 (b) shows the occlusion map. In order to handle occlusion, the proper frame has to be selected in each case. For example, the pixels in the red area cannot be found in the second frame. Therefore the network decides to consider only the first frame, not the second one. The blue area can be explained in the same way for the second frame, and the green area means that there is no occlusion.
Flow maps. Figure 8 (c),(d) show the weighted sum of the offset vectors for each pixel, calculated as described for the flow-map regularizer in Section 3.4. This represents the overall tendency of the offset vectors. They are therefore expected to behave like forward and backward optical flow, and the figures support this. On the other hand, Figure 8 (e),(f) show the weighted variance of the offset vectors. A large value in this map means that the offset vectors for the pixel are more spread out, so that the pixel can refer to a wider variety of locations. According to the figure, more challenging locations, such as regions with large motion and edges, have larger variance values.
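The two per-pixel statistics can be sketched as follows. The function name and the exact definition of the weighted variance (the weighted mean of squared distances from the weighted mean offset) are assumptions for illustration; the weights are assumed to be non-negative and to sum to 1.

```python
def offset_mean_and_variance(weights, offsets_y, offsets_x):
    """Weighted mean offset and weighted variance for one output pixel.

    All arguments are flat lists of equal length, one entry per kernel tap.
    """
    mean_y = sum(w * y for w, y in zip(weights, offsets_y))
    mean_x = sum(w * x for w, x in zip(weights, offsets_x))
    variance = sum(
        w * ((y - mean_y) ** 2 + (x - mean_x) ** 2)
        for w, y, x in zip(weights, offsets_y, offsets_x)
    )
    return (mean_y, mean_x), variance
```

A pixel whose taps all point to the same location has zero variance, while taps spread over a wide area give a large value.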
5 Conclusion
In this paper, we redefine the video frame interpolation task as the problem of finding the spatial transform between adjacent frames. To model the spatial transformation, we propose a new operation called Adaptive Deformable Convolution (ADC). This method has the advantages of both representative approaches in this area: kernel estimation and flow map estimation. The parameters needed for the ADC operation are obtained from a fully convolutional network which is end-to-end trainable. Our experiments show that our method outperforms many of the competing algorithms in several challenging cases such as large motion and occlusion. Finally, we visualize the network outputs to check whether they behave as intended. In the future, our new definition of the task and the ADC-based approach may be used in various frame synthesis tasks such as video frame prediction.
References
[1] (2011) A database and evaluation methodology for optical flow. International Journal of Computer Vision 92 (1), pp. 1–31.
[2] (1994) Performance of optical flow techniques. International Journal of Computer Vision 12 (1), pp. 43–77.
[3] (2012) A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision, pp. 611–625.
[4] (2014) cuDNN: efficient primitives for deep learning. arXiv preprint arXiv:1410.0759.
[5] (2017) Deformable convolutional networks. In The IEEE International Conference on Computer Vision (ICCV).
[6] Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2), pp. 295–307.
[7] (2015) FlowNet: learning optical flow with convolutional networks. In The IEEE International Conference on Computer Vision (ICCV).
[8] (2017) Residual conv-deconv grid network for semantic segmentation. In British Machine Vision Conference.
[9] PatchBatch: a batch augmented loss for optical flow. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[10] (2016) Image style transfer using convolutional neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[11] (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[12] (2015) Learning to linearize under uncertainty. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 1234–1242.
[13] (2016) Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[14] (2009) Occlusion reasoning for temporal interpolation using optical flow. Technical report, Department of Computer Science and Engineering, University of Washington.
[15] (2017) FlowNet 2.0: evolution of optical flow estimation with deep networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2, pp. 6.
[16] (2015) Spatial transformer networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 2017–2025.
[17] (2017) Slow flow: exploiting high-speed cameras for accurate and diverse optical flow reference data. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[18] (2017) Active convolution: learning the shape of convolution for image classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[19] (2018) Super slomo: high quality estimation of multiple intermediate frames for video interpolation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[20] (2016) Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pp. 694–711.
[21] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[22] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
[23] (1995) Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 3361 (10), pp. 1995.
[24] (2017) Video frame synthesis using deep voxel flow. In The IEEE International Conference on Computer Vision (ICCV).
[25] (2016) Learning image matching by simply watching video. In European Conference on Computer Vision, pp. 434–450.
[26] (2015) Fully convolutional networks for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[27] (2009) Moving gradients: a path-based method for plausible image interpolation. In ACM Transactions on Graphics (TOG), Vol. 28, pp. 42.
[28] (2016) Deep multi-scale video prediction beyond mean square error. In International Conference on Learning Representations (ICLR).
[29] (2015) Object scene flow for autonomous vehicles. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[30] (2018) PhaseNet for video frame interpolation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[31] (2015) Phase-based frame interpolation for video. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[32] (2018) Context-aware synthesis for video frame interpolation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[33] (2017) Video frame interpolation via adaptive convolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[34] (2017) Video frame interpolation via adaptive separable convolution. In The IEEE International Conference on Computer Vision (ICCV).
[35] Automatic differentiation in PyTorch. NIPS 2017 Autodiff Workshop: The Future of Gradient-based Machine Learning Software and Techniques.
[36] (2016) A benchmark dataset and evaluation methodology for video object segmentation. In Computer Vision and Pattern Recognition.
[37] (2012) Motion compensated frame interpolation with a symmetric optical flow constraint. In International Symposium on Visual Computing, pp. 447–457.
[38] (2015) U-Net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi (Eds.), Cham, pp. 234–241.
[39] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[40] (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
[41] (2015) Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, pp. 843–852.
[42] (2018) PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[43] (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612.
[44] (2013) DeepFlow: large displacement optical flow with deep matching. In The IEEE International Conference on Computer Vision (ICCV).
[45] (2011) Optical flow guided TV-L1 video interpolation and restoration. In International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, pp. 273–286.
[46] (2017) Accurate optical flow via direct cost volume processing. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[47] (2013) Multi-level video frame interpolation: exploiting the interaction among different levels. IEEE Transactions on Circuits and Systems for Video Technology 23 (7), pp. 1235–1248.
[48] (2016) View synthesis by appearance flow. In European Conference on Computer Vision, pp. 286–301.