
IFRNet: Intermediate Feature Refine Network for Efficient Frame Interpolation

Prevailing video frame interpolation algorithms, which generate intermediate frames from consecutive inputs, typically rely on complex model architectures with heavy parameters or large inference delay, hindering them from diverse real-time applications. In this work, we devise an efficient encoder-decoder based network, termed IFRNet, for fast intermediate frame synthesis. It first extracts pyramid features from the given inputs, and then refines the bilateral intermediate flow fields together with a powerful intermediate feature until the desired output is generated. The gradually refined intermediate feature not only facilitates intermediate flow estimation, but also compensates for contextual details, so that IFRNet needs no additional synthesis or refinement module. To fully release its potential, we further propose a novel task-oriented optical flow distillation loss that focuses on learning the teacher knowledge useful for frame synthesis. Meanwhile, a new geometry consistency regularization term is imposed on the gradually refined intermediate features to preserve structure layout. Experiments on various benchmarks demonstrate the excellent performance and fast inference speed of the proposed approaches. Code is available at https://github.com/ltkong218/IFRNet.



Code Repositories

IFRNet: Intermediate Feature Refine Network for Efficient Frame Interpolation (CVPR 2022), official implementation: https://github.com/ltkong218/IFRNet

1 Introduction

Video frame interpolation (VFI), which converts low frame rate (LFR) image sequences into high frame rate (HFR) videos, is an important low-level computer vision task. Related techniques are widely applied in practical applications such as slow-motion generation [8579036], novel view synthesis [Zhou_2016] and cartoon creation [Siyao_2021_CVPR]. Although it has been studied by a large body of research, great challenges remain when dealing with complicated dynamic scenes involving large displacement, severe occlusion, motion blur and abrupt brightness change.

Figure 1: Speed, accuracy and parameter comparison. The proposed IFRNet achieves state-of-the-art frame interpolation accuracy with fast inference speed and a lightweight model size.
Figure 2: Different flow-based VFI paradigms.

We roughly classify existing flow-based VFI methods according to the function of their encoder-decoders. In (a) [8954114, 8578281, Niklaus_2020_CVPR, 8579036, qvi_nips19, BMBC, park2021asymmetric, Sim_2021_ICCV], FlowNet estimates conventional optical flow $F_{0\to1}$, $F_{1\to0}$, and the middle part approximates or further refines the intermediate flow fields $F_{t\to0}$, $F_{t\to1}$. In (b) [xue2019video, Zhang_2020, huang2021rife], the Intermediate FlowNet directly predicts the intermediate flow of $I_t$. Both (a) and (b) contain a separate synthesis network for target frame generation. In (c), the proposed IFRNet jointly refines the intermediate flow together with a powerful intermediate feature to generate the target frame in a single encoder-decoder.

Recently, with the development of optical flow networks [7410673, 8579029, Teed_2020, 9560800], significant progress has been made by flow-based VFI approaches [8579036, qvi_nips19, Niklaus_2020_CVPR, park2021asymmetric], since optical flow provides an explicit correspondence to register frames in a video sequence. Successful flow-based approaches usually follow a three-step pipeline: 1) estimate optical flow between the target frame and the input frames; 2) warp the input frames or context features by the predicted flow fields for spatial alignment; 3) refine the warped frames or features and generate the target frame with a synthesis network. Denoting the input frames by $I_0$, $I_1$ and the target frame by $I_t$, existing methods either first estimate optical flow $F_{0\to1}$, $F_{1\to0}$ [8579036, 8954114, 8578281, Niklaus_2020_CVPR, BMBC] and then approximate or refine the bilateral intermediate flow $F_{t\to0}$, $F_{t\to1}$ [8579036, qvi_nips19, chiall, Sim_2021_ICCV] as shown in Figure 2 (a), or delegate the intractable intermediate flow estimation sub-task to a learnable flow network trained end-to-end [xue2019video, Zhang_2020, huang2021rife] as depicted in Figure 2 (b). Their common final step is to employ an image synthesis network that encodes spatially aligned context features [8578281] for target frame generation or refinement.

Although the above pipeline, which first estimates intermediate flow and then context features, has become the most popular paradigm for flow-based VFI approaches [8578281, Niklaus_2020_CVPR, chiall, park2021asymmetric, Sim_2021_ICCV], it suffers from several defects. First, it divides intermediate flow estimation and context feature refinement into separate encoder-decoders, ignoring the mutual promotion of these two crucial elements for frame interpolation. Second, the resulting cascaded architecture substantially increases inference delay and model parameters, blocking these methods from mobile and real-time applications.

In this paper, we propose a novel Intermediate Feature Refine Network (IFRNet) for VFI to overcome the above limitations. For the first time, we merge the separated flow estimation and feature refinement into a single encoder-decoder based model for compactness and fast inference, as abstracted in Figure 2 (c). It first extracts pyramid features from the given inputs with the encoder, and then jointly refines the bilateral intermediate flow fields together with a powerful intermediate feature through coarse-to-fine decoders. The improved architecture lets intermediate flow and intermediate feature benefit each other, endowing our model with the ability to generate sharper moving objects and capture better texture details.

For better supervision, we propose a task-oriented flow distillation loss and a feature space geometry consistency loss to effectively guide the multi-scale motion estimation and intermediate feature refinement. Specifically, our flow distillation approach adaptively adjusts the robustness of the distillation loss in space, focusing on the teacher knowledge that is useful for frame synthesis. Besides, the proposed geometry consistency loss employs intermediate features extracted from the ground truth to constrain the reconstructed intermediate features to keep a better structure layout. Figure 1 gives a speed, accuracy and parameter comparison among advanced VFI methods, demonstrating the state-of-the-art performance of our approaches. In summary, our main contributions are as follows:

  • We devise a novel IFRNet to jointly perform intermediate flow estimation and intermediate feature refinement for efficient video frame interpolation.

  • Task-oriented flow distillation loss and feature space geometry consistency loss are newly proposed to promote intermediate motion estimation and intermediate feature reconstruction of IFRNet, respectively.

  • Benchmark results demonstrate that our IFRNet not only achieves state-of-the-art VFI accuracy, but also enjoys fast inference speed and lightweight model size.

2 Related Work

Video Frame Interpolation.

The mainstream VFI methods can be classified into flow-based [8954114, 8578281, Niklaus_2020_CVPR, 8579036, xue2019video, qvi_nips19, 8237740, Yuan_2019_CVPR, BMBC, Zhang_2020, park2021asymmetric, Sim_2021_ICCV], kernel-based [8099727, 8237299, 8953614, Lee_2020_CVPR, Cheng_2020, 9501506, ding2021cdfi] and hallucination-based approaches [Gui_2020_CVPR, choi2020cain, Kim_2020]. Different VFI paradigms have their own merits and flaws due to their distinct frame synthesis manner. For example, kernel-based methods are good at handling motion blur by convolving over local patches [8099727, 8237299]; successive works mainly extend them to high resolution videos [8953614], increase the degrees of freedom of the convolution kernel [Lee_2020_CVPR, Cheng_2020, 9501506], or combine them with other paradigms for compensation [8840983, ding2021cdfi]. However, they are typically computationally expensive and struggle with occlusion. In contrast, hallucination-based methods directly synthesize frames in the feature domain by blending field-of-view features generated by deformable convolution [8237351] or PixelShuffle operations [choi2020cain]. They can naturally generate complex contextual details, but the predicted frames tend to be blurry when fast-moving objects exist.

Figure 3: Architecture overview and loss functions of IFRNet.

Our model is an efficient encoder-decoder based network, which first extracts pyramid context features from the input frames with a shared encoder, and then gradually refines the bilateral intermediate flow fields together with the reconstructed intermediate feature through coarse-to-fine decoders, until yielding the final output. Besides the common image reconstruction loss $\mathcal{L}_r$, the task-oriented flow distillation loss $\mathcal{L}_d$ and the feature space geometry consistency loss $\mathcal{L}_g$ are newly devised to guide the feature alignment procedure more effectively towards intermediate frame synthesis.

Recently, significant progress has been made by flow-based VFI approaches, since optical flow can provide an explicit correspondence for frame registration. These solutions either employ an off-the-shelf flow model [8578281, qvi_nips19] or estimate task-specific flow [xue2019video, 8237740, 8579036, park2021asymmetric, Sim_2021_ICCV] as a guidance for pixel-level motion. The common subsequent step is to forward [4056711] or backward [George_2000] warp the input images to the target frame, and finally refine the warped frames with an image synthesis network [8578281, Niklaus_2020_CVPR, ding2021cdfi, park2021asymmetric], often instantiated as a GridNet [fourure2017gridnet]. To achieve better interpolation quality, increasingly complicated deep models are devised to estimate intermediate flow fields [qvi_nips19, chiall] and refine the generated target frame [8579036, Niklaus_2020_CVPR, BMBC, park2021asymmetric]. However, their heavy computation cost and large inference delay make them unsuitable for resource-limited devices. To step back from this module cascading competition and reconsider improving the prior efficient flow-based VFI paradigm, e.g., DVF [8237740], we propose a novel single encoder-decoder based IFRNet that performs real-time inference with excellent accuracy.

Optical Flow Estimation.

Finding dense correspondence between adjacent frames, namely optical flow estimation [HORN1981185], has been studied for decades for its fundamental role in many downstream video processing tasks [Yuan_2020_CVPR, Chan_2021_CVPR]. FlowNet [7410673] is the first attempt to apply deep learning to optical flow estimation, based on an encoder-decoder U-shape network. Inspired by the traditional coarse-to-fine paradigm, SPyNet [Ranjan_2017_CVPR], PWC-Net [8579029] and FastFlowNet [9560800] integrate pyramid features and backward warping, achieving impressive real-time performance. Knowledge distillation [HinVin15Distilling] also plays an important role in optical flow prediction, usually embodied as generating pseudo labels for unsupervised optical flow learning [DDFlow, SelFlow] or related tasks [NIPS2014_00ec53c4, aleotti2020learning]. A recent VFI method [huang2021rife] also uses a distillation strategy to promote motion prediction. Beyond the difference in architecture design, our distillation approach focuses on the knowledge useful for intermediate frame synthesis in a task-adaptive manner.

3 Proposed Approach

In this section, we first introduce the IFRNet architecture, built on the principle of joint refinement of intermediate flow and intermediate feature, which yields an efficient encoder-decoder based framework for VFI. Then two novel objective functions, i.e., the task-oriented flow distillation loss and the feature space geometry consistency loss, are introduced to help our model achieve excellent performance.

3.1 IFRNet

Given two input frames $I_0$ and $I_1$ at adjacent time instances, video frame interpolation aims to synthesize an intermediate frame $I_t$, where $t \in (0, 1)$. To achieve this goal, the proposed model first extracts a pyramid of features from each frame, and then progressively refines the bilateral intermediate flow fields together with the reconstructed intermediate feature in a coarse-to-fine manner, until reaching the highest resolution level of the pyramid to obtain the final output. Figure 3 sketches the overall architecture of the proposed IFRNet.

Pyramid Encoder. To obtain contextual representations from each input frame, we design a compact encoder to extract a pyramid of features. The parameter-shared encoder is built of a block of two 3×3 convolutions in each pyramid level, with strides 2 and 1, respectively. As shown in Figure 3, IFRNet extracts 4 levels of pyramid features, counting 8 convolution layers, each followed by a PReLU activation [7410480]. By gradually decimating the spatial size, it increases the feature channels to 32, 48, 72 and 96, generating pyramid features $\phi_0^k$ and $\phi_1^k$ at level $k$ ($k = 1, 2, 3, 4$) for frames $I_0$ and $I_1$, respectively.
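As a quick sanity check on this bookkeeping, the shape arithmetic of the four encoder levels can be sketched as follows (a toy helper, not part of the released code):

```python
def pyramid_shapes(h, w, channels=(32, 48, 72, 96)):
    """Each encoder level applies a stride-2 then a stride-1 3x3 convolution,
    so the spatial size halves while the channel count widens per level."""
    shapes = []
    for c in channels:
        h, w = h // 2, w // 2  # the stride-2 convolution halves the resolution
        shapes.append((c, h, w))
    return shapes

# A 224x224 training crop yields four (channels, height, width) levels,
# with the coarsest 96-channel level fed to the top decoder first.
print(pyramid_shapes(224, 224))
```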

Coarse-to-Fine Decoders. After extracting meaningful hierarchical representations, we gradually refine the intermediate flow fields through multiple decoders, backward warping the pyramid features $\phi_0^k$, $\phi_1^k$ according to $F_{t\to0}^k$ and $F_{t\to1}^k$, respectively. The main advantage of the coarse-to-fine warping strategy is that only an easier residual flow has to be computed at each scale. Different from previous VFI approaches containing a post-refinement stage [Niklaus_2020_CVPR, ding2021cdfi, park2021asymmetric, huang2021rife], we improve the bilateral flow prediction during its coarse-to-fine procedure for higher efficiency. Specifically, we make each decoder output a higher level reconstructed intermediate feature $\hat{\phi}_t^k$ besides the bilateral flow fields, which can fill in the missing reference information to facilitate motion estimation. On the other hand, better predicted flow fields align the source pyramid features to the target position more precisely, generating better warped features, which in turn improve higher level intermediate feature reconstruction. Therefore, the decoders of IFRNet jointly refine the bilateral intermediate flow fields together with the reconstructed intermediate feature, each benefiting the other until the desired output is reached. Moreover, the gradually refined intermediate feature, containing bilateral occlusion and global context information, can finally generate the fusion mask and compensate for motion details that are often missed by flow-based methods, making IFRNet a powerful encoder-decoder VFI architecture without additional refinement [Niklaus_2020_CVPR, park2021asymmetric].

Figure 4: Details of the decoder in each pyramid level.

Concretely, in each pyramid level, we stack the corresponding input features into a holistic volume that is forwarded through a compact decoder network $\mathcal{D}^k$, consisting of a block of six 3×3 convolutions and one 4×4 deconvolution, with strides 1 and 2, respectively. A PReLU [7410480] follows each convolution layer. Details of each decoder are shown in Figure 4. To keep a relatively large receptive field and channel number for motion estimation and feature encoding while maintaining efficiency, we modify the third and the fifth convolutions to update only part of the channels of the previous output tensor. Furthermore, residual connections and their interlaced placement promote information propagation and joint refinement. More details are given in the supplementary. Note that the inputs of $\mathcal{D}^4$ and the outputs of $\mathcal{D}^1$ differ from the other decoders due to their task-related characteristics. In summary, the features among decoders can be computed by

(1) $[F^{3}_{t\to0},\, F^{3}_{t\to1},\, \hat{\phi}^{3}_{t}] = \mathcal{D}^{4}\big(\mathcal{C}(\phi^{4}_{0},\, \phi^{4}_{1},\, T)\big)$
(2) $[F^{k-1}_{t\to0},\, F^{k-1}_{t\to1},\, \hat{\phi}^{k-1}_{t}] = \mathcal{D}^{k}\big(\mathcal{C}(\hat{\phi}^{k}_{t},\, \overleftarrow{w}(\phi^{k}_{0}, F^{k}_{t\to0}),\, \overleftarrow{w}(\phi^{k}_{1}, F^{k}_{t\to1}))\big), \quad k \in \{3, 2\}$
(3) $[F^{0}_{t\to0},\, F^{0}_{t\to1},\, M,\, R] = \mathcal{D}^{1}\big(\mathcal{C}(\hat{\phi}^{1}_{t},\, \overleftarrow{w}(\phi^{1}_{0}, F^{1}_{t\to0}),\, \overleftarrow{w}(\phi^{1}_{1}, F^{1}_{t\to1}))\big)$

where $\mathcal{D}^{3}$, $\mathcal{D}^{2}$ stand for the decoders at the middle pyramid levels and $\mathcal{C}$ denotes the concatenation operation. $T$ is a one-channel conditional input for arbitrary-time interpolation, whose values are all set to $t$. $M$ is a one-channel merge mask exported by a sigmoid layer whose elements range from 0 to 1, and $R$ is a three-channel image residual that can compensate for details. Finally, we can synthesize the desired frame by the following formulation

(4) $\hat{I}_{t} = M \odot \hat{I}_{t\leftarrow0} + (1 - M) \odot \hat{I}_{t\leftarrow1} + R$
(5) $\hat{I}_{t\leftarrow0} = \overleftarrow{w}(I_{0}, F^{0}_{t\to0}), \quad \hat{I}_{t\leftarrow1} = \overleftarrow{w}(I_{1}, F^{0}_{t\to1})$

where $\overleftarrow{w}(\cdot,\cdot)$ denotes backward warping and $\odot$ is element-wise multiplication. $M$ adjusts the mixing ratio according to the bidirectional occlusion information, while $R$ compensates for details where flow-based generation is unreliable, e.g., regions of the target frame that are occluded in both views.
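The fusion arithmetic above can be sketched in NumPy. Note that the real model backward-warps with differentiable bilinear sampling, whereas this toy uses nearest-neighbor sampling for brevity, so it only illustrates the masked blending plus residual, not the exact operator:

```python
import numpy as np

def backward_warp(img, flow):
    """Fetch img at x + flow(x); nearest-neighbor sampling for brevity
    (the network uses bilinear sampling), with border clamping."""
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    sx = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    sy = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return img[sy, sx]

def synthesize(i0, i1, f_t0, f_t1, mask, residual):
    """Fuse the two warped frames with merge mask M and add residual R:
    I_t = M * warp(I_0) + (1 - M) * warp(I_1) + R."""
    return (mask * backward_warp(i0, f_t0)
            + (1.0 - mask) * backward_warp(i1, f_t1)
            + residual)
```

With zero flows and a uniform mask of 0.5, the output is simply the average of the two inputs plus the residual, which makes the role of each term easy to verify.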

Discussion with Optical Flow Networks. Different from the coarse-to-fine pipeline in real-time optical flow [8579029, 9560800], which mainly addresses the large displacement matching challenge, in video interpolation the target frame is missing, so its motion estimation becomes a "chicken-and-egg" problem. Therefore, the decoders of IFRNet reconstruct the intermediate feature besides the intermediate flow fields, performing spatio-temporal feature aggregation and intermediate motion refinement jointly so that each benefits the other.

Image Reconstruction Loss. According to the above analysis, an efficient, end-to-end trainable IFRNet has been designed for VFI. To generate the intermediate frame, we employ the same image reconstruction loss as [park2021asymmetric] between the network output $\hat{I}_t$ and the ground truth frame $I_t$, which is the sum of two terms:

(6) $\mathcal{L}_{r} = \mathcal{L}_{chb}(\hat{I}_{t}, I_{t}) + \mathcal{L}_{cen}(\hat{I}_{t}, I_{t})$

where $\mathcal{L}_{chb}(x) = (x^{2} + \epsilon^{2})^{\alpha}$ with $\alpha = 0.5$ is the Charbonnier loss [413553], serving as a surrogate for the $L_1$ loss, while $\mathcal{L}_{cen}$ is the census loss, which calculates the soft Hamming distance between census-transformed [Meister_2018] image patches of size 7×7.
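The Charbonnier term can be sketched as a smooth surrogate of the mean absolute error (the ε below follows common convention, not necessarily the paper's exact setting):

```python
import numpy as np

def charbonnier(pred, gt, eps=1e-6, alpha=0.5):
    """(d^2 + eps^2)^alpha averaged over all elements; for alpha = 0.5 and
    small eps this behaves like mean |pred - gt| but stays differentiable
    at zero, which is why it serves as an L1 surrogate."""
    d = pred - gt
    return float(((d ** 2 + eps ** 2) ** alpha).mean())
```

For errors of 3 and 4 the loss is essentially their mean absolute value, 3.5, confirming the L1-like behavior away from zero.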

3.2 Task-Oriented Flow Distillation Loss

Training IFRNet with the above reconstruction loss can already perform intermediate frame synthesis. However, this simple optimization target often falls into local minima, since challenging illumination conditions, e.g., extreme brightness and repetitive texture regions, are common. To deal with this problem, we adopt the knowledge distillation [HinVin15Distilling] strategy and guide the multi-scale intermediate flow estimation of IFRNet with an off-the-shelf teacher flow network, which helps align the multi-scale pyramid features explicitly. In practice, the pre-trained teacher is only used during training, and we compute its flow prediction as a pseudo label in advance for efficiency. Note that RIFE [huang2021rife] also uses flow distillation. However, its indiscriminate distillation manner tends to learn the undesired noise in the pseudo label. Even if ground truth were available, optical flow itself is often a sub-optimal representation for a specific video task [xue2019video]. To overcome these limitations, we propose a task-oriented flow distillation loss that decreases the adverse impacts while focusing on the knowledge useful for better VFI.

We observe that the final flow fields $F^{0}_{t\to0}$, $F^{0}_{t\to1}$, which directly control frame synthesis, are sensitive to harmful information in the pseudo label. Therefore, we impose multi-scale flow distillation on all decoders except $\mathcal{D}^{1}$, and leave its flow prediction entirely constrained by the reconstruction loss in a task-oriented manner [xue2019video]. Furthermore, we compare this relaxed flow prediction with the pseudo label to calculate robustness masks $P_{t\to0}$, $P_{t\to1}$, and use them to spatially adjust the robustness of the distillation loss at the lower scales for better task-oriented flow distillation, as depicted in Figure 3. Specifically, we obtain the robustness masks by the following formulation

(7) $P_{t\to0} = \exp\big(-\lambda\, E(F^{0}_{t\to0}, F^{pse}_{t\to0})\big), \quad P_{t\to1} = \exp\big(-\lambda\, E(F^{0}_{t\to1}, F^{pse}_{t\to1})\big)$

where $E(\cdot,\cdot)$ calculates the per-pixel end-point error, and the coefficient $\lambda$ controlling the sensitivity of the robustness is set according to a grid search. These operations rest on the assumption that task-oriented flow generally agrees with true optical flow but differs in some details.
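A minimal sketch of such a robustness mask follows, assuming an exponential squashing of the end-point error; the exact mapping and the grid-searched coefficient are assumptions here, chosen only so that agreement with the pseudo label yields a value near 1 and disagreement a value near 0:

```python
import numpy as np

def robustness_mask(flow_pred, flow_pse, lam=0.3):
    """Per-pixel end-point error between the relaxed flow prediction and
    the pseudo label, mapped into (0, 1] via exp(-lam * EPE).
    lam is a placeholder for the grid-searched sensitivity coefficient."""
    epe = np.sqrt(((flow_pred - flow_pse) ** 2).sum(axis=-1))
    return np.exp(-lam * epe)
```

Pixels where the two flows agree keep a mask value of 1 (full trust in the pseudo label), while large disagreements push the mask toward 0, which later makes the distillation loss more robust there.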

Following previous experience [Sun_2014, 8579034], our task-oriented flow distillation employs the generalized Charbonnier loss for robust learning of the intermediate flow, where the parameters $\alpha$ and $\beta$ control the robustness of this loss. Formally, it can be written as

(8) $\mathcal{L}_{d} = \sum_{k=1}^{3} \big( \big| F^{pse}_{t\to0} - u_{2^{k}}(F^{k}_{t\to0}) \big|^{\beta} + \epsilon \big)^{\alpha} + \big( \big| F^{pse}_{t\to1} - u_{2^{k}}(F^{k}_{t\to1}) \big|^{\beta} + \epsilon \big)^{\alpha}$

where $u_{2^{k}}$ is the bilinear upsampling operation with scale factor $2^{k}$. However, different from the fixed forms of previous methods [Sun_2014, 8579034], we make the loss adjustable for the VFI task by letting $\alpha$ and $\beta$ be functions of the robustness parameter $r$, where $r$ denotes the robustness value at a given position of the aforementioned robustness masks $P_{t\to0}$, $P_{t\to1}$. Specifically, we employ a linear function and an exponential linear function to generate $\alpha$ and $\beta$, respectively, as follows

(9) $\alpha(r) = a_{1}\, r + b_{1}, \qquad \beta(r) = a_{2}\, e^{\,a_{3} r} + b_{2}$

The coefficients are selected based on two typical cases. When $r$ reaches its upper bound, the loss becomes the $L_1$ surrogate loss in Eq. 6, and at its lower bound it turns into the robust loss used in LiteFlowNet [8579034]. Figure 5 gives some intuitive examples of this adaptive robust loss. Put comprehensively, at each spatial location, if the task-oriented flow prediction of decoder $\mathcal{D}^{1}$ is consistent with the pseudo label, the gradient of the adaptive distillation loss is relatively steep, which tends to distill this helpful information into the bottom three decoders through common gradient descent optimization. Otherwise, the loss becomes more robust, downgrading the relatively harmful flow knowledge.
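The adaptive behavior can be illustrated with toy exponent schedules; the mappings below are illustrative stand-ins for the paper's fitted linear and exponential-linear functions, chosen only so that r = 1 recovers a Charbonnier/L1-like surrogate and r = 0 flattens the loss:

```python
import numpy as np

def adaptive_loss(diff, r, eps=1e-3):
    """Generalized Charbonnier (|d|^beta(r) + eps)^alpha(r) per position.
    alpha(r) and beta(r) below are illustrative, not the paper's fitted
    coefficients: at r = 1, alpha = 0.5 and beta = 2 give the familiar
    L1 surrogate; as r -> 0, alpha -> 0 flattens the loss toward a
    constant, muting gradients from distrusted pseudo-label pixels."""
    alpha = 0.5 * r        # linear schedule (assumed form)
    beta = 2.0 ** r        # exponential schedule (assumed form)
    return (np.abs(diff) ** beta + eps) ** alpha
```

At fully trusted positions (r = 1) an error of 3 costs about 3, while at fully distrusted positions (r = 0) any error costs the same constant, so no gradient flows from there.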

Figure 5: Task-oriented flow distillation loss. It takes the form of the generalized Charbonnier loss, while the concrete form at each location is controlled by the corresponding robustness parameter $r$, which is determined by Eq. 7 to acquire task-adaptive ability.

3.3 Feature Space Geometry Consistency Loss

Besides the above task-oriented flow distillation loss for facilitating multi-scale intermediate flow estimation, better supervision of the intermediate feature is desirable for further improvement. Observing that the pyramid features extracted by the encoder play, in a sense, an equivalent role to the reconstructed intermediate features from the decoders, we employ the same parameter-shared encoder to extract a pyramid of features $\phi_t^k$ from the ground truth frame $I_t$, and use them to regularize the reconstructed intermediate features $\hat{\phi}_t^k$ in the multi-scale feature domain.

Method Vimeo90K UCF101 SNU-FILM (Easy Medium Hard Extreme) Time (s) Params (M) FLOPs (T)
SepConv [8237299] 33.79/0.9702 34.78/0.9669 39.41/0.9900 34.97/0.9762 29.36/0.9253 24.31/0.8448 0.065 21.7 0.36
CAIN [choi2020cain] 34.65/0.9730 34.91/0.9690 39.89/0.9900 35.61/0.9776 29.90/0.9292 24.78/0.8507 0.069 42.8 1.29
AdaCoF [Lee_2020_CVPR] 34.47/0.9730 34.90/0.9680 39.80/0.9900 35.05/0.9754 29.46/0.9244 24.31/0.8439 0.054 21.8 0.36
RIFE [huang2021rife] 35.62/0.9780 35.28/0.9690 40.06/0.9907 35.75/0.9789 30.10/0.9330 24.84/0.8534 0.026 9.8 0.20
IFRNet 35.80/0.9794 35.29/0.9693 40.03/0.9905 35.94/0.9793 30.41/0.9358 25.05/0.8587 0.025 5.0 0.21
IFRNet small 35.59/0.9786 35.28/0.9691 39.96/0.9905 35.92/0.9792 30.36/0.9357 25.05/0.8582 0.019 2.8 0.12
ToFlow [xue2019video] 33.73/0.9682 34.58/0.9667 39.08/0.9890 34.39/0.9740 28.44/0.9180 23.39/0.8310 0.152 1.4 0.62
CyclicGen [liu2019cyclicgen] 32.09/0.9490 35.11/0.9684 37.72/0.9840 32.47/0.9554 26.95/0.8871 22.70/0.8083 0.161 19.8 1.77
DAIN [8954114] 34.71/0.9756 34.99/0.9683 39.73/0.9902 35.46/0.9780 30.17/0.9335 25.09/0.8584 1.033 24.0 5.51
SoftSplat [Niklaus_2020_CVPR] 36.10/0.9700 35.39/0.9520 - - - - 0.195 12.2 0.90
BMBC [BMBC] 35.01/0.9764 35.15/0.9689 39.90/0.9902 35.31/0.9774 29.33/0.9270 23.92/0.8432 3.845 11.0 2.50
CDFI full [ding2021cdfi] 35.17/0.9640 35.21/0.9500 40.12/0.9906 35.51/0.9778 29.73/0.9277 24.53/0.8476 0.380 5.0 0.82
ABME [park2021asymmetric] 36.18/0.9805 35.38/0.9698 39.59/0.9901 35.77/0.9789 30.58/0.9364 25.42/0.8639 0.905 18.1 1.30
IFRNet large 36.20/0.9808 35.42/0.9698 40.10/0.9906 36.12/0.9797 30.63/0.9368 25.27/0.8609 0.079 19.7 0.79
Table 1: Quantitative comparison (PSNR/SSIM) of VFI results on the Vimeo90K, UCF101 and SNU-FILM datasets. For each item, the best result is boldfaced, and the second best is underlined. Top and bottom parts are divided by running time.

Intuitively, we could adopt a commonly used regression loss to force $\hat{\phi}_t^k$ to be close to $\phi_t^k$. However, such an overly tight constraint would harm the global context and occlusion information contained in the reconstructed intermediate feature. To relax it, inspired by the local geometry alignment property of the census transform [Zabih_1994], we extend the census loss [Meister_2018] into the multi-scale feature space for progressive supervision, where the soft Hamming distance is calculated between census-transformed corresponding feature maps with 3×3 patches in a channel-by-channel manner. Formally, this loss can be written as

(10) $\mathcal{L}_{g} = \sum_{k=1}^{3} \mathcal{L}_{cen}\big(\hat{\phi}^{k}_{t},\, \phi^{k}_{t}\big)$

Our motivation is that the extracted pyramid features, containing useful low-level structure information for frame synthesis, can regularize the reconstructed intermediate features to keep a better geometry layout. For each spatial location, the loss only constrains the geometry of its local neighbor patch in every feature map. Consequently, there is no restriction on the channel-wise representation, leaving $\hat{\phi}_t^k$ free to encode bilateral occlusion and residual information.
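A hard-census sketch illustrates why this constraint preserves local structure while ignoring per-channel magnitude; the paper uses a soft Hamming distance over census signatures, so this binary version is a simplification:

```python
import numpy as np

def census3(x):
    """3x3 census transform of a 2-D map: for each interior pixel, a binary
    signature recording which of its 8 neighbors exceed the center value."""
    c = x[1:-1, 1:-1]
    sigs = []
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            n = x[1 + dy:x.shape[0] - 1 + dy, 1 + dx:x.shape[1] - 1 + dx]
            sigs.append(n > c)
    return np.stack(sigs, axis=-1)

def geometry_distance(f_rec, f_gt):
    """Mean (hard) Hamming distance between census signatures. Because the
    signature depends only on local ordering, any monotonic rescaling of a
    feature map leaves the distance at zero: structure is compared,
    channel-wise magnitude is not."""
    return float((census3(f_rec) != census3(f_gt)).mean())
```

Scaling or shifting a feature map leaves the distance at zero, while inverting its local ordering maximizes it, which is exactly the "geometry only" behavior the loss is after.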

Based on the above analysis, our final loss function, containing three parts for joint optimization, is formulated as

(11) $\mathcal{L} = \mathcal{L}_{r} + \lambda_{d}\, \mathcal{L}_{d} + \lambda_{g}\, \mathcal{L}_{g}$

where $\lambda_{d}$ and $\lambda_{g}$ are the weighting parameters balancing the three terms.

4 Experiments

In this section, we first introduce implementation details and the datasets used in this paper. Then, we quantitatively and qualitatively compare IFRNet with recent state-of-the-art methods on various benchmarks. Finally, ablation studies analyze the contribution of the proposed approaches. Experiments in the main paper follow the common practice of $t = 0.5$, i.e., synthesizing the single middle frame. IFRNet also supports multi-frame interpolation via the temporal encoding $T$, whose results are presented in the supplementary.

4.1 Implementation Details

We implement the proposed algorithm in PyTorch, and use the Vimeo90K [xue2019video] training set to train IFRNet from scratch. Our model is optimized by the AdamW [Loshchilov_2019] algorithm for 300 epochs with a total batch size of 24 on four NVIDIA Tesla V100 GPUs. The learning rate is gradually decayed from its initial value following a cosine attenuation schedule. During training, we augment the samples by random flipping, rotating, reversing sequence order and randomly cropping patches of size 224×224. For optical flow distillation, we extract pseudo labels of the bilateral intermediate flow fields with the pre-trained LiteFlowNet [8579034] in advance, and apply augmentation operations consistent with the frame triplets during the whole training process.
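The cosine attenuation schedule can be sketched as follows; the start and end learning rates below are placeholders, not the paper's values:

```python
import math

def cosine_lr(step, total_steps, lr_max, lr_min):
    """Decay smoothly from lr_max at step 0 to lr_min at the last step
    along a half cosine, the usual cosine annealing schedule."""
    t = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))

# Placeholder endpoints: 1e-4 down to 1e-5 over 100 steps.
schedule = [cosine_lr(s, 100, 1e-4, 1e-5) for s in range(101)]
```

The schedule is flat near both endpoints and steepest in the middle of training, which is why it is often preferred over a step decay.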

4.2 Evaluation Metrics and Datasets

We evaluate our method on various datasets covering diverse motion scenes for comprehensive comparison. Common metrics such as PSNR and SSIM [1284395] are adopted for quantitative evaluation. For Middlebury, we use the official IE and NIE indices. We briefly introduce the test datasets used to assess our approaches.

Vimeo90K [xue2019video]: It contains frame triplets of 448×256 resolution, 3,782 of which constitute the test split.

UCF101 [Soomro_2012]: We adopt the test set selected in DVF [8237740], which includes 379 triplets of 256×256 frame size.

SNU-FILM [choi2020cain]: SNU-FILM contains 1,240 frame triplets of approximately 1280×720 resolution. According to motion magnitude, it is divided into four parts, namely Easy, Medium, Hard and Extreme, for detailed comparison.

Middlebury [Baker_2011]: The Middlebury benchmark is a widely used dataset for evaluating optical flow and VFI methods. Image resolution in this dataset is around 640×480. In this paper, we test on the Evaluation set without using the Other set.

4.3 Comparison with the State-of-the-Arts

We compare IFRNet with state-of-the-art VFI methods, including the kernel-based SepConv [8237299], AdaCoF [Lee_2020_CVPR] and CDFI [ding2021cdfi], the flow-based ToFlow [xue2019video], DAIN [8954114], SoftSplat [Niklaus_2020_CVPR], BMBC [BMBC], RIFE [huang2021rife] and ABME [park2021asymmetric], and the hallucination-based CAIN [choi2020cain] and FeFlow [Gui_2020_CVPR]. For results on SNU-FILM, we execute the released code of CDFI and RIFE and refer to the other results reported in ABME. For Middlebury, we test directly on the Evaluation part and submit the interpolation results to the online benchmark. To measure inference speed and computational complexity, we run all methods on one Tesla V100 GPU at 1280×720 resolution and average the running time over 100 iterations. For fair comparison, we further build a large and a small version of IFRNet by scaling the feature channels by 2.0× and 0.75×, respectively, and separate the above methods into two classes, i.e., fast and slow, according to their inference time.

Figure 6: Qualitative comparison of different VFI methods on SNU-FILM (Hard) dataset. Proposed IFRNet algorithm can synthesize fast moving objects with sharp boundary while maintaining distinct contextual details. Zoom in for best view.
Method Average Mequon Schefflera Urban Teddy Backyard Basketball Dumptruck Evergreen (each column reported as IE NIE)
SuperSlomo [8579036] 5.310 0.778 2.51 0.59 3.66 0.72 2.91 0.74 5.05 0.98 9.56 0.94 5.37 0.96 6.69 0.60 6.73 0.69
ToFlow [xue2019video] 5.490 0.840 2.54 0.55 3.70 0.72 3.43 0.92 5.05 0.96 9.84 0.97 5.34 0.98 6.88 0.72 7.14 0.90
DAIN [8954114] 4.856 0.713 2.38 0.58 3.28 0.60 3.32 0.69 4.65 0.86 7.88 0.87 4.73 0.85 6.36 0.59 6.25 0.66
FeFlow [Gui_2020_CVPR] 4.820 0.719 2.28 0.51 3.50 0.66 2.82 0.70 4.75 0.87 7.62 0.84 4.74 0.86 6.07 0.64 6.78 0.67
AdaCoF [Lee_2020_CVPR] 4.751 0.730 2.41 0.60 3.10 0.59 3.48 0.84 4.84 0.92 8.68 0.90 4.13 0.84 5.77 0.58 5.60 0.57
BMBC [BMBC] 4.479 0.696 2.30 0.57 3.07 0.58 3.17 0.77 4.24 0.84 7.79 0.85 4.08 0.82 5.63 0.58 5.55 0.56
SoftSplat [Niklaus_2020_CVPR] 4.223 0.645 2.06 0.53 2.80 0.52 1.99 0.52 3.84 0.80 8.10 0.85 4.10 0.81 5.49 0.56 5.40 0.57
IFRNet large 4.216 0.644 2.08 0.53 2.78 0.51 1.74 0.43 3.96 0.83 7.55 0.87 4.42 0.84 5.56 0.56 5.64 0.58
Table 2: Evaluation results on the Middlebury benchmark. For each item, the best result is boldfaced, and the second best is underlined.

Quantitative Evaluation. Table 1 and Table 2 summarize quantitative results on diverse benchmarks. On the Vimeo90K and UCF101 test datasets, IFRNet large achieves the best results on both PSNR and SSIM. The recent ABME [park2021asymmetric] reaches similar accuracy; however, our model runs 11.5× faster with a similar number of parameters, owing to the efficiency of the single encoder-decoder architecture. Our large model also obtains leading results on the Easy, Medium and Hard parts of SNU-FILM, and only falls behind ABME on the Extreme part. We attribute this to the bilateral cost volume constructed by ABME, which is good at estimating large displacement motion. In Table 2, IFRNet large achieves top-performing VFI accuracy on most of the eight Middlebury test sequences, and outperforms the previous state-of-the-art SoftSplat [Niklaus_2020_CVPR] on both the average IE and NIE metrics. Although the improvement is limited, our approach runs 2.5× faster than SoftSplat, which uses a cascaded VFI architecture. Regarding FLOPs in convolution layers, IFRNet large also consumes significantly less computation than the other VFI architectures.

As for real-time and lightweight VFI approaches, IFRNet yields about 0.2 dB better results than RIFE [huang2021rife] on Vimeo90K, and the margin is more distinct on the large-motion cases of the SNU-FILM dataset. It is worth noting that IFRNet needs only half the parameters of RIFE to achieve these better results, thanks to the joint refinement of intermediate flow and context feature. Compared with CDFI full [ding2021cdfi], IFRNet has the same 5M parameters while achieving 0.63 dB higher PSNR on Vimeo90K with 15.2× faster inference. Moreover, IFRNet small further improves speed by 31% and reduces parameters and computational complexity by 44% relative to IFRNet, with only a slight decrease in frame interpolation accuracy.

Qualitative Evaluation. Figure 6 visually compares well-performing VFI methods on the SNU-FILM (Hard) dataset, which contains large and complex motion. It can be seen that kernel-based [8237299, Lee_2020_CVPR, ding2021cdfi] and hallucination-based [choi2020cain] methods fail to synthesize sharp motion boundaries, producing ghosting and blur artifacts. Compared with flow-based algorithms [8954114, park2021asymmetric], our approach generates texture details faithfully thanks to the power of the gradually refined intermediate feature. In short, IFRNet synthesizes a pleasing target frame with a more comfortable visual experience. More qualitative results can be found in our supplementary.

4.4 Ablation Study

To verify the effectiveness of proposed approaches, we carry out ablation study in terms of network architecture and loss function on Vimeo90K and SNU-FILM Hard datasets.

Architecture        Vimeo90K   Hard
  IF      R         PSNR       PSNR
  ✗       ✗         34.83      29.96
  ✓       ✗         35.22      30.22
  ✗       ✓         35.11      30.06
  ✓       ✓         35.51      30.27
Table 3: Ablation study on different architecture variants. ‘IF’ means intermediate feature and ‘R’ stands for residual.
Figure 7: Visual comparison of intermediate flow and predicted frame of IFRNet w/o and w/ intermediate feature.

Intermediate Feature. To ablate the effectiveness of the intermediate feature in IFRNet, we build a model that removes it from the input and output of the multiple decoders, while keeping the feature channels of the middle parts of the decoders unchanged. We also selectively remove the residual R in Eq. 4 to isolate the improvements from intermediate flow and residual. We train these variants with only the reconstruction loss under the same learning schedule as before. As listed in Table 3, the first two rows show that the intermediate feature provides reference anchor information that promotes intermediate flow estimation. Figure 7 presents visual examples confirming this conclusion. Comparing the last and the second rows of Table 3 demonstrates that the gradually refined intermediate feature, containing global context information, compensates for scene details. Residual compensation from the intermediate context feature is thus necessary for IFRNet to achieve advanced VFI performance, since intermediate flow prediction is inherently unreliable. Overall, the two-fold benefits of the intermediate feature greatly improve the VFI accuracy of IFRNet at relatively small additional cost.
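The joint refinement ablated above can be pictured as a single decoder stage that takes the two pyramid features, the current intermediate feature and the bilateral flows, and outputs residual flow updates together with a refined intermediate feature. The PyTorch class below is an illustrative sketch only: the name `RefineDecoder`, the channel sizes and the layer counts are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefineDecoder(nn.Module):
    """One pyramid-level decoder jointly refining bilateral intermediate
    flows and the intermediate feature (channels/names are illustrative)."""

    def __init__(self, in_ch, mid_ch, feat_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, 1, 1), nn.PReLU(mid_ch),
            nn.Conv2d(mid_ch, mid_ch, 3, 1, 1), nn.PReLU(mid_ch),
        )
        # 4 flow channels (two bilateral 2-channel fields) + refined feature,
        # upsampled 2x by a transposed convolution
        self.head = nn.ConvTranspose2d(mid_ch, 4 + feat_ch, 4, 2, 1)

    def forward(self, feat0, feat1, inter_feat, flow_t0, flow_t1):
        x = torch.cat([feat0, feat1, inter_feat, flow_t0, flow_t1], dim=1)
        out = self.head(self.body(x))

        def up2(f):  # upsample a flow field and rescale its magnitude
            return 2.0 * F.interpolate(f, scale_factor=2, mode="bilinear",
                                       align_corners=False)

        flow_t0 = up2(flow_t0) + out[:, 0:2]   # residual flow update
        flow_t1 = up2(flow_t1) + out[:, 2:4]
        inter_feat = out[:, 4:]                # refined intermediate feature
        return flow_t0, flow_t1, inter_feat
```

Removing `inter_feat` from the input concatenation and the output slice corresponds to the ‘IF ✗’ ablation rows in Table 3.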

Task-Oriented Flow Distillation. Table 4 quantitatively compares VFI accuracy under different combinations of the proposed loss functions. Adding the task-oriented flow distillation loss consistently improves PSNR by 0.2 dB on Vimeo90K. To verify the superiority of its task-adaptive ability, we also perform flow distillation with the generalized Charbonnier loss under the different fixed robustness settings shown in Figure 5; the results are summarized in Figure 8. A moderate robustness parameter achieves the best VFI accuracy among the fixed settings. On the other hand, flow distillation can damage frame quality when the robustness parameter approaches 1.0, due to harmful knowledge in the pseudo label. In short, the proposed task-oriented approach achieves the best accuracy thanks to its spatially adaptive adjustment of the robustness during flow distillation.
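The spatially adaptive distillation described above can be expressed compactly: a generalized Charbonnier penalty whose exponent varies per pixel with a robustness mask. The snippet below is a hedged sketch; the mask-to-exponent mapping (`0.25 + 0.75 * robust_mask`) and the function names are illustrative assumptions, and the paper's exact formulation may differ.

```python
import torch

def gen_charbonnier(diff2, alpha, eps=1e-3):
    # generalized Charbonnier penalty on a squared-error map;
    # alpha may be a scalar (fixed robustness) or a per-pixel tensor
    return (diff2 + eps * eps) ** alpha

def task_oriented_distill(flow_pred, flow_teacher, robust_mask):
    """Distill teacher flow with a spatially varying robustness exponent.
    `robust_mask` in [0, 1] (e.g. derived from the teacher's warping error):
    values near 1 enforce the pseudo label strongly, values near 0 relax it.
    The mapping from mask to exponent below is a hypothetical choice."""
    diff2 = ((flow_pred - flow_teacher) ** 2).sum(dim=1, keepdim=True)
    alpha = 0.25 + 0.75 * robust_mask
    return gen_charbonnier(diff2, alpha).mean()
```

With a constant mask this degenerates to the fixed-robustness baselines of Figure 8; the adaptive mask relaxes supervision exactly where the pseudo label is unreliable.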

Loss Function                          Vimeo90K   Hard
                                       PSNR       PSNR
  Reconstruction only                  35.51      30.27
  + task-oriented flow distillation    35.72      30.38
  + geometry consistency               35.61      30.30
  + both                               35.80      30.41
Table 4: Ablation study on different loss functions.
Figure 8: Ablation study on different flow distillation losses.
Figure 9: Visual comparison of the mean feature map of the intermediate feature w/o and w/ the geometry consistency loss. Leftmost is the ground truth.

Feature Space Geometry Consistency. As shown in Table 4, adding the proposed feature space geometry consistency loss on top of the above contributions yields a further improvement, confirming its complementary effect with respect to the task-oriented flow distillation loss. Figure 9 visually compares mean feature maps of the intermediate feature with and without this loss. The regularization keeps the reconstructed intermediate feature in a better geometric layout across the multi-scale feature space, resulting in better VFI performance.
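One way to penalize geometric disagreement between feature maps, in the spirit of the loss described above, is a soft census descriptor that compares each position to its neighborhood, so that local structure matters more than absolute magnitude. This is our illustrative sketch under that assumption, not the paper's exact formulation; the function names and the reference feature choice are hypothetical.

```python
import torch
import torch.nn.functional as F

def soft_census(x, patch=3):
    """Soft census descriptor: signed, saturated differences between each
    position and its neighborhood, emphasizing structure over magnitude."""
    b, c, h, w = x.shape
    pad = patch // 2
    neigh = F.unfold(x, patch, padding=pad).view(b, c, patch * patch, h, w)
    diff = neigh - x.unsqueeze(2)
    return diff / torch.sqrt(0.81 + diff * diff)   # soft sign

def geometry_consistency(inter_feat, ref_feat):
    # penalize structural disagreement between the refined intermediate
    # feature and a reference feature (e.g. from the ground-truth frame)
    return (soft_census(inter_feat) - soft_census(ref_feat)).abs().mean()
```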

5 Conclusion

In this paper, we have devised an efficient deep architecture, termed IFRNet, for video frame interpolation, without any cascaded synthesis or refinement module. It gradually refines intermediate flow together with a powerful intermediate feature, which not only boosts intermediate flow estimation to synthesize sharp motion boundaries but also provides a global context representation to generate vivid motion details. Moreover, we have presented a task-oriented flow distillation loss and a feature space geometry consistency loss to fully release its potential. Experiments on various benchmarks demonstrate the state-of-the-art performance and fast inference speed of the proposed approaches. We expect the proposed single encoder-decoder joint refinement design of IFRNet to be a useful component for many frame rate up-conversion and intermediate view synthesis systems.

IFRNet: Intermediate Feature Refine Network for Efficient Frame Interpolation Supplementary Material

Lingtong Kong1, Boyuan Jiang2, Donghao Luo2, Wenqing Chu2, Xiaoming Huang2,

Ying Tai2, Chengjie Wang2, Jie Yang1

1

Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University,

2Youtu Lab, Tencent

{ltkong, jieyang}@sjtu.edu.cn

{byronjiang, michaelluo, wenqingchu, skyhuang, yingtai, jasoncjwang}@tencent.com

Figure 10: Qualitative results of IFRNet for 8× interpolation on GoPro [Nah_2017_CVPR] and Adobe240 [8099516] test datasets. Please watch the video with Adobe Reader. Each video has 9 frames, where the first and the last frames are inputs and the middle 7 frames are predicted by IFRNet.
Equal contribution. This work was done when Lingtong Kong was an intern at Tencent Youtu Lab. Code is available at https://github.com/ltkong218/IFRNet.
Corresponding author: Jie Yang (jieyang@sjtu.edu.cn). This research is partly supported by NSFC, China (No: 61876107, U1803261).

In the supplementary, we first present multi-frame interpolation experiments of IFRNet. Second, qualitative video comparisons with other advanced VFI approaches are displayed. Third, we depict structural details of IFRNet and its variants. Fourth, we provide more visual examples and analysis of intermediate components for better understanding of the workflow of IFRNet. Finally, we show screenshots of VFI results on the Middlebury benchmark. Please note that the numbering within this supplementary has been manually adjusted to continue that of our main paper.

6 Multi-Frame Interpolation

Different from other multi-frame interpolation methods, which scale optical flow [8579036, 8954114] or interpolate middle frames recursively [choi2020cain, Lee_2020_CVPR], IFRNet can predict multiple intermediate frames via the proposed one-channel temporal encoding mask, which is one of the inputs of the coarsest decoder. The temporal encoding is a conditional input signal whose values are all identical and set to the target time instance, i.e., 1/8, 2/8, ..., 7/8 in the 8× interpolation setting. The proposed task-oriented flow distillation loss and feature space geometry consistency loss still work for any intermediate time instance. To evaluate IFRNet for 8× interpolation, we use the train/test split of FLAVR [kalluri2021flavr], training IFRNet on the GoPro [Nah_2017_CVPR] training set with the same learning schedule and loss functions as in our main paper. We then test the pre-trained model on the GoPro test set and the Adobe240 [8099516] dataset; the results are listed in Table 5.
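The temporal encoding described above amounts to a constant one-channel map, with the encoder shared across all target time instances. A minimal sketch follows, assuming a hypothetical `interp_net` with separate `encode`/`decode` entry points; the released model's interface may differ.

```python
import torch

def temporal_encoding(t, batch, height, width, device="cpu"):
    """One-channel conditional map whose values all equal the target time t."""
    return torch.full((batch, 1, height, width), float(t), device=device)

def interpolate_8x(interp_net, frame0, frame1):
    """8x interpolation: the encoder runs once, the decoders run once per
    target frame. `interp_net` and its encode/decode interface are
    illustrative, not the released API."""
    feats = interp_net.encode(frame0, frame1)     # shared pyramid features
    outputs = []
    for n in range(1, 8):                         # t = 1/8, ..., 7/8
        t_map = temporal_encoding(n / 8, frame0.size(0), frame0.size(2),
                                  frame0.size(3), device=frame0.device)
        outputs.append(interp_net.decode(feats, t_map))
    return outputs
```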

Method GoPro [Nah_2017_CVPR] Adobe240 [8099516] Time
PSNR SSIM PSNR SSIM (s)
DVF [8237740] 21.94 0.776 28.23 0.896 0.87
SuperSloMo [8579036] 28.52 0.891 30.66 0.931 0.44
DAIN [8954114] 29.00 0.910 29.50 0.910 4.10
IFRNet (Ours) 29.84 0.920 31.93 0.943 0.16
Table 5: Quantitative comparison for 8× interpolation.
[Video panels for three SNU-FILM examples, each comparing: Ground Truth, DAIN [8954114], CAIN [choi2020cain], AdaCoF [Lee_2020_CVPR], ABME [park2021asymmetric], and IFRNet (Ours).]
Figure 11: Video comparison on SNU-FILM [choi2020cain] dataset. Please watch the video with Adobe Reader and zoom in for best view.

IFRNet outperforms all of the other SOTA methods on both the GoPro and Adobe240 datasets in both PSNR and SSIM metrics. For example, IFRNet achieves 0.84 dB better results than DAIN [8954114] on GoPro and exceeds SuperSloMo [8579036] by 1.27 dB on Adobe240. Thanks to the modular character of IFRNet, the encoder needs only a single forward pass, while the decoders infer seven times with different temporal embeddings to up-convert the frame rate by 8×. The speed advantage of IFRNet over other approaches is therefore preserved, or even enlarged. Figure 10 gives qualitative results of IFRNet for 8× interpolation, demonstrating its superior ability for frame rate up-conversion and slow motion generation.

7 Video Comparison

In this part, we qualitatively compare videos interpolated by the proposed IFRNet against other open-source VFI methods on the SNU-FILM [choi2020cain] dataset; the results are shown in Figure 11. As can be seen, our approach generates motion boundaries and texture details faithfully, thanks to the power of the gradually refined intermediate feature.

8 Network Architecture

In this section, we present the structural details of the five sub-networks of IFRNet, i.e., the pyramid encoder and the coarse-to-fine decoders. In each following figure, the arguments of ‘Conv’ and ‘Deconv’ are, from left to right, input channels, output channels, kernel size, stride and padding. The dimensions of input and output tensors are, from left to right, feature channels, height and width. A PReLU [7410480] follows each ‘Conv’ layer, while there is no activation after each ‘Deconv’ layer. In practice, the intermediate flow fields are estimated in a residual manner, which is not reflected in the figures in order to emphasize the primary network structure. We take input frames with a spatial size of 640×480 as an example.
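Following the figure conventions above (each ‘Conv’ followed by PReLU, each ‘Deconv’ without activation), the basic building blocks could be written as below; the default kernel/stride/padding values are illustrative, since each figure lists them per layer.

```python
import torch
import torch.nn as nn

def conv(in_ch, out_ch, kernel=3, stride=1, pad=1):
    # 'Conv' block as in the figures: convolution followed by PReLU
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel, stride, pad),
                         nn.PReLU(out_ch))

def deconv(in_ch, out_ch, kernel=4, stride=2, pad=1):
    # 'Deconv' block: transposed convolution with no activation
    return nn.ConvTranspose2d(in_ch, out_ch, kernel, stride, pad)
```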

Figure 12: Details of the pyramid encoder. The two input frames are encoded by the same Siamese network.
Figure 13: Details of the bottom decoder.
Figure 14: Details of the middle decoder.
Figure 15: Details of the middle decoder.
Figure 16: Details of the top decoder.

As for IFRNet large and IFRNet small, the feature channels from the first to the fourth pyramid levels are set to 64, 96, 144, 192 and 24, 36, 54, 72, respectively. The channel numbers in the multiple decoders are adjusted accordingly. Also, the feature channels of the third and the fifth convolution layers in the coarse-to-fine decoders of IFRNet large and IFRNet small are set to 64 and 24, respectively.

9 Visualization and Discussion

Figure 17: Illustration of task-oriented flow distillation. From top to bottom, the rows show the ground truth frame, the pseudo label of intermediate flow fields, the predicted intermediate flow fields, and the task-oriented robustness masks. In the masks, darker colors approach 1 while brighter colors tend to 0. Each column represents a separate example from the Vimeo90K [xue2019video] dataset. Zoom in for best view.

Figure 17 presents visual examples of the robustness masks in the proposed task-oriented flow distillation loss, which decrease the adverse impacts of the pseudo label while focusing on the knowledge useful for frame interpolation. Notably, the intermediate flow predicted by IFRNet is smoother and contains fewer artifacts than the pseudo-label flow, which helps achieve better VFI accuracy.

Figure 18: Illustration of the mean feature map of the intermediate feature w/o and w/ the geometry consistency loss. From top to bottom, the rows show the ground truth frame, the mean feature map without the loss, and the mean feature map with the loss. Each column represents a separate example from the Vimeo90K [xue2019video] dataset. Zoom in for best view.

Figure 18 depicts more visual results of mean feature maps of the intermediate feature with and without the proposed geometry consistency loss, demonstrating its effect in regularizing the refined intermediate feature to keep a better structural layout.

Figure 19: Illustration of the intermediate components of IFRNet. From top to bottom, the rows show the input frames, the predicted intermediate flow fields, the warped input frames, the merge mask, the merged frame, the residual, the final prediction and the ground truth, where the merged frame is calculated by blending the two warped frames with the merge mask. For better visualization of the residual, we multiply it by 10 and add a bias of 0.5. Each column represents a separate example from the Vimeo90K [xue2019video] dataset. Zoom in for best view.

Figure 19 gives a visual understanding of the frame interpolation process of IFRNet. Thanks to the reference anchor information offered by the intermediate feature, together with the effective supervision provided by the geometry consistency loss and the task-oriented flow distillation loss, IFRNet can estimate relatively good intermediate flow with clear motion boundaries. Further, the merge mask identifies occluded regions of the warped frames by adjusting the mixing weight, and it tends to average the candidate regions when both views are visible. Finally, the residual compensates for some contextual details, which usually respond around motion boundaries and image edges. Different from other flow-based VFI methods that adopt a cascaded structural design, the merge mask and residual in IFRNet share the same encoder-decoder with the intermediate optical flow, making the proposed architecture achieve better VFI accuracy while being more lightweight and fast.
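The synthesis step just described (mask-weighted blend of the two backward-warped inputs, plus the learned residual) can be sketched as follows. The backward warp here is a standard bilinear `grid_sample` formulation and is our assumption about the implementation detail, not a copy of the released code.

```python
import torch
import torch.nn.functional as F

def backward_warp(img, flow):
    """Bilinearly sample img at positions shifted by flow (pixels; channel 0
    is the x displacement, channel 1 the y displacement)."""
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(img.device)  # (2, h, w)
    coords = grid.unsqueeze(0) + flow
    # normalize sampling coordinates to [-1, 1] for grid_sample
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    return F.grid_sample(img, torch.stack((gx, gy), dim=3),
                         align_corners=True)

def merge(i0, i1, flow_t0, flow_t1, mask, residual):
    # final synthesis: mask-blended warped inputs plus learned residual
    w0 = backward_warp(i0, flow_t0)
    w1 = backward_warp(i1, flow_t1)
    return mask * w0 + (1.0 - mask) * w1 + residual
```

With zero flow, a mask of ones and a zero residual, the merged frame reduces to the first input, which is a quick sanity check for the warp.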

Figure 20: Screenshot of our IE ranking on the Middlebury benchmark (taken on November 16th, 2021).
Figure 21: Screenshot of our NIE ranking on the Middlebury benchmark (taken on November 16th, 2021).

Readers may think our IFRNet is similar to PWC-Net [8579029], which is designed for optical flow. However, it is non-trivial to adapt PWC-Net for frame interpolation, since previous related works employ it as only one of many components. We summarize the differences in several aspects: 1) The anchor feature in PWC-Net is extracted by the encoder, while in IFRNet it is reconstructed by the decoder. 2) Besides motion information, the intermediate feature also contains occlusion, texture and temporal information. 3) PWC-Net, designed for motion estimation, is optimized only by a flow regression loss with strong augmentation, whereas IFRNet, designed for frame synthesis, is optimized in a multi-target manner with weak data augmentation.

10 Screenshots of the Middlebury Benchmark

We take screenshots of the online Middlebury benchmark for VFI on November 16th, 2021; the results are shown in Figure 20 and Figure 21. Since the average rank is a relative indicator, previous methods [8954114, Niklaus_2020_CVPR, Gui_2020_CVPR, BMBC] usually report average IE (interpolation error) and average NIE (normalized interpolation error) for comparison. As summarized in Table 2 of our main paper, the proposed IFRNet large model achieves the best results on both the IE and NIE metrics among all published VFI methods trained on the Vimeo90K [xue2019video] dataset. Moreover, IFRNet large runs several times faster than previous state-of-the-art algorithms [Niklaus_2020_CVPR, park2021asymmetric], demonstrating the superior VFI accuracy and fast inference speed of the proposed approaches.

References