
Spatio-Temporal Multi-Flow Network for Video Frame Interpolation

by   Duolikun Danier, et al.
University of Bristol

Video frame interpolation (VFI) is currently a very active research topic, with applications spanning computer vision, post production and video encoding. VFI can be extremely challenging, particularly in sequences containing large motions, occlusions or dynamic textures, where existing approaches fail to offer perceptually robust interpolation performance. In this context, we present a novel deep learning based VFI method, ST-MFNet, based on a Spatio-Temporal Multi-Flow architecture. ST-MFNet employs a new multi-scale multi-flow predictor to estimate many-to-one intermediate flows, which are combined with conventional one-to-one optical flows to capture both large and complex motions. In order to enhance interpolation performance for various textures, a 3D CNN is also employed to model the content dynamics over an extended temporal window. Moreover, ST-MFNet has been trained within an ST-GAN framework, which was originally developed for texture synthesis, with the aim of further improving perceptual interpolation quality. Our approach has been comprehensively evaluated – compared with fourteen state-of-the-art VFI algorithms – clearly demonstrating that ST-MFNet consistently outperforms these benchmarks on varied and representative test datasets, with significant gains up to 1.09dB in PSNR for cases including large motions and dynamic textures. Project page:



1 Introduction

Video frame interpolation (VFI) has been extensively employed to deliver an improved user experience across a wide range of important applications. VFI increases the temporal resolution (frame rate) of a video through synthesizing intermediate frames between every two consecutive original frames. It can mitigate the need for costly high frame rate acquisition processes [kalluri2020flavr], enhance the rendering of slow-motion content [jiang2018super], support view synthesis [flynn2016deepstereo] and improve rate-quality trade-offs in video coding [wu2018video].

In recent years, deep learning has empowered a variety of VFI algorithms. These methods can be categorized as flow-based [jiang2018super, xu2019quadratic] or kernel-based [niklaus2017video, lee2020adacof]. While flow-based methods use the estimated optical flow maps to warp input frames, kernel-based methods learn local or shared convolution kernels for synthesizing the output. To handle challenging scenarios encountered in VFI applications, various techniques have been employed to enhance these methods, including non-linear motion models [xu2019quadratic, sim2021xvfi, park2021asymmetric], coarse-to-fine architectures [park2020bmbc, sim2021xvfi, chen2021pdwn, zhang2020flexible], attention mechanisms [choi2020channel, kalluri2020flavr], and deformable convolutions [lee2020adacof, gui2020featureflow].

Figure 1: High-level architecture of ST-MFNet, which employs a two-stage workflow to interpolate an intermediate frame.

Although these methods have significantly improved performance compared with conventional VFI approaches [baker2011database], their performance can still be inconsistent, especially for content exhibiting large motions, occlusions and dynamic textures. Large motion typically means large pixel displacements, which are difficult to capture using Convolutional Neural Networks (CNNs) with limited receptive fields [adaconv, niklaus2017video]. In the case of occlusion, pixels relating to occluded objects will not appear in all input frames, thus preventing interpolation algorithms from accurately estimating the intermediate locations of those pixels [choi2020channel, kalluri2020flavr]. Finally, dynamic textures (e.g. water, fire, foliage, etc.) exhibit more complex motion characteristics compared to the movements of rigid objects [zhang2011parametric, tafi1]. Typically, they are spatially irregular and temporally stochastic, causing most existing VFI methods to fail, especially those based on optical flow [liu2017video, jiang2018super].

To solve these problems, we propose a novel video frame interpolation model, the Spatio-Temporal Multi-Flow Network (ST-MFNet), which consistently offers improved interpolation performance across a wide range of content types. ST-MFNet employs a two-stage architecture, as shown in Figure 1. In Stage I, the Multi-InterFlow Network (MIFNet) first predicts multi-interflows [dai2017deformable, lee2020adacof] at multiple scales (including an up-sampling scale simulating sub-pixel motion estimation), using a customized CNN architecture, UMSResNext, with variable kernel sizes. The multi-flows here correspond to a many-to-one mapping which enables more flexible transformation, facilitating the modeling of complex motions. To further improve the performance for large motions, a Bi-directional Linear Flow Network (BLFNet) is employed to linearly approximate the intermediate flows based on the bi-directional flows between input frames, which are estimated using a coarse-to-fine architecture [sun2018pwc]. In the second stage, inspired by recent work on texture synthesis [xie2019learning, yang2021spatiotemporal], we integrate a 3D CNN, Texture Enhancement Network (TENet) that performs spatial and temporal filtering to capture longer-range dynamics and to predict textural residuals. Finally, we trained our model based on the ST-GAN [yang2021spatiotemporal] methodology, which was originally proposed for texture synthesis. This ensures both spatial consistency and temporal coherence of interpolated content. Extensive quantitative and qualitative studies have been performed which demonstrate the superior performance of ST-MFNet over current state-of-the-art VFI methods on a wide range of test data including large and complex motions and dynamic textures.

The primary contributions of this work are:

  • A novel VFI method where multi-flow based (MIFNet) and single-flow based warping (BLFNet) are combined to enhance the capturing of complex and large motions.

  • A new CNN architecture (UMSResNext) for the MIFNet, which predicts multiple intermediate flows at various scales, including an up-sampling scale for high precision sub-pixel motion estimation.

  • The use of a spatio-temporal CNN (TENet) and ST-GAN, which were originally designed for texture synthesis, to enhance the interpolation of complex textures.

  • Validation, through comprehensive experiments, that our model consistently outperforms state-of-the-art VFI methods on various scenarios, including large and complex motions and various texture types.

2 Related Work

In this section, we summarize recent advances in video frame interpolation (VFI) and then briefly introduce examples of dynamic texture synthesis, which have inspired the development of our method.

2.1 Video Frame Interpolation

Most existing VFI methods can be classified as flow-based or kernel-based:

Flow-based VFI. This class typically involves two steps: optical flow estimation and image warping. The two input frames are warped to a target temporal location based on intermediate optical flows, defined either from the target to the inputs (backward warping [jaderberg2015spatial]) or from the inputs to the target (forward warping [niklaus2018context]). These flows can be approximated from the bi-directional optical flows between the input frames [jiang2018super, reda2019unsupervised, bao2019memc, bao2019depth, xu2019quadratic, niklaus2018context, niklaus2020softmax, liu2020enhanced, sim2021xvfi]. Such approximations often assume motion linearity, and hence are prone to errors in non-linear motion scenarios. Various efforts have been made to alleviate this issue, including the use of depth information [bao2019depth], higher order motion models [xu2019quadratic, liu2020enhanced], and adaptive forward warping [niklaus2020softmax]. A second group of methods [liu2017video, xue2019video, park2020bmbc, zhang2020flexible, huang2020rife, park2021asymmetric] has been developed to improve approximation by directly predicting intermediate flows. These approaches typically employ a coarse-to-fine architecture, which supports a larger receptive field for capturing large motions. In all of the above methods, the predicted flows correspond to a one-to-one pixel mapping, which inherently limits the ability to capture complex motions.

Kernel-based VFI. In these methods, various convolution kernels [adaconv, niklaus2017video, lee2020adacof, shi2020video, ding2021cdfi, cheng2021multiple, chen2021pdwn, gui2020featureflow, long2016learning, choi2020channel, kalluri2020flavr] are learned as a basis for synthesizing interpolated pixels. Earlier approaches [adaconv, niklaus2017video] predict a fixed-size kernel for each output location, which is then convolved with co-located input pixels. This limits the magnitude of captured motions to the kernel size used, while more memory and computational capacity are required when larger kernel sizes are adopted. To overcome this problem, deformable convolution (DefConv) [dai2017deformable] was adapted to VFI in AdaCoF [lee2020adacof], which allows kernels to be convolved with any input pixels pointed to by local offset vectors. This can be considered as multi-interflows, representing a many-to-one mapping. Further improvements to AdaCoF have been achieved by allowing space-time sampling [shi2020video], feature pyramid warping [ding2021cdfi], and using a coarse-to-fine architecture [chen2021pdwn].

2.2 Dynamic Texture Synthesis

Dynamic textures (e.g. water, fire, leaves blowing in the wind etc.) generally exhibit high spatial frequency energy alongside temporal stochasticity, with inter-frame motions irregular in both the spatial and temporal domains. Classic synthesis methods rely on mathematical models such as Markov random fields [wei2000fast] and auto-regressive moving average models [doretto2003dynamic] to capture underlying motion characteristics. More recently, deep learning techniques, in particular 3D CNNs and GAN-based training [gatys2015texture, yang2016stationary, xie2019learning, wang2021conditional, yang2021spatiotemporal], have been adopted to achieve more realistic synthesis results. It should be noted that both dynamic texture synthesis and VFI require accurate modeling of spatio-temporal characteristics. However, the techniques developed specifically for texture synthesis have not yet been fully exploited in VFI methods. This is a focus of our work.

Figure 2: Illustration of the MIFNet. (a) The overall architecture of MIFNet, with a U-Net style backbone and multi-flow estimation heads at three scales. (b) The convolutional layers inside the multi-flow head at each scale.
Figure 3: Illustration of the MSResNext block, which consists of two ResNext branches with different kernel sizes, followed by a channel attention module.

3 Proposed Method: ST-MFNet

The architecture of ST-MFNet is shown in Figure 1. While conventional VFI methods are formulated as generating an intermediate frame $I_t$ ($t = 1.5$) between two given consecutive frames $I_1, I_2$, we instead employ two more frames, $I_0$ and $I_3$, to improve the modeling of motion dynamics. Given the consecutive frames $\{I_0, I_1, I_2, I_3\}$, our model first processes $I_1, I_2$ in two branches. The Multi-InterFlow Network (MIFNet) branch estimates the multi-scale many-to-one multi-flows from $I_t$ to $I_1, I_2$; the Bi-directional Linear Flow Network (BLFNet) branch approximates one-to-one optical flows from $I_1, I_2$ to $I_t$. The input frames are warped based on the flows generated by MIFNet and BLFNet, and then fused by the Multi-Scale Fusion module to obtain an intermediate result $\tilde{I}_t$. This multi-branch structure combines both single-flow and multi-flow based methods and was found to offer enhanced interpolation performance. In the second stage, this frame is combined with all the inputs in temporal order and fed into the Texture Enhancement Network (TENet), which captures longer-range dynamics and generates residual signals for the final output.

3.1 Multi-InterFlow Network

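Multi-InterFlow warping. For self-completeness, we first briefly describe the multi-interflow warping operation [lee2020adacof]. Given two images $I_a, I_b$ with size $H \times W$, conventional optical flow $F_{a \to b} = (\alpha, \beta)$ from $I_a$ to $I_b$ specifies the x- and y-components of pixel-wise offset vectors, where $\alpha, \beta \in \mathbb{R}^{H \times W}$. The pixel value at each location $(x, y)$ of the corresponding backwarped [jaderberg2015spatial] image $\hat{I}_a$ is defined as

$$\hat{I}_a(x, y) = I_b\big(x + \alpha(x, y),\; y + \beta(x, y)\big), \qquad (1)$$

where the values at non-integer grid locations are obtained via bilinear interpolation. The multi-interflow proposed in [lee2020adacof] can be defined as $F_{a \to b} = (\alpha, \beta, w)$, but now $\alpha, \beta \in \mathbb{R}^{H \times W \times N}$ represent a collection of the x- and y-components of $N$ flow vectors respectively and $w \in [0, 1]^{H \times W \times N}$ contains their weighting kernels ($\sum_{i=1}^{N} w(x, y, i) = 1$). That is, for each location $(x, y)$, $F_{a \to b}$ contains $N$ flow vectors and $N$ weights. The corresponding warping is defined as follows.

$$\hat{I}_a(x, y) = \sum_{i=1}^{N} w(x, y, i) \cdot I_b\big(x + \alpha(x, y, i),\; y + \beta(x, y, i)\big) \qquad (2)$$

Such multi-flow warping corresponds to a many-to-one mapping, which allows flexible sampling of source pixels. This enables the capture of larger and more complex motions.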
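The multi-flow warping described above can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the function names, the list-based grayscale image layout and the border clamping are assumptions made for illustration only. Each output pixel is a weighted sum of several bilinearly sampled source pixels, which is the many-to-one mapping.

```python
import math

def bilinear_sample(img, x, y):
    """Sample a 2-D grayscale image (list of rows) at a real-valued (x, y).

    Coordinates are clamped to the image border (an assumption of this sketch).
    """
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * img[y0][x0] + ax * img[y0][x1]
    bottom = (1 - ax) * img[y1][x0] + ax * img[y1][x1]
    return (1 - ay) * top + ay * bottom

def multi_flow_warp(src, alpha, beta, weight):
    """Many-to-one backward warping.

    alpha[y][x], beta[y][x] and weight[y][x] are lists holding the N
    x-offsets, y-offsets and mixing weights for output pixel (x, y).
    """
    h, w = len(src), len(src[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = sum(
                wi * bilinear_sample(src, x + a, y + b)
                for a, b, wi in zip(alpha[y][x], beta[y][x], weight[y][x])
            )
    return out

# Toy example: a 1x4 ramp image and N = 2 flows per pixel with equal weights.
src = [[0.0, 10.0, 20.0, 30.0]]
alpha = [[[1.0, 0.5]] * 4]   # x-offsets of the two flows
beta = [[[0.0, 0.0]] * 4]    # y-offsets of the two flows
weight = [[[0.5, 0.5]] * 4]  # weights sum to 1 at every pixel
warped = multi_flow_warp(src, alpha, beta, weight)
```

With a single flow per pixel and a weight of 1, the same function reduces to conventional one-to-one backward warping.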

Given input frames $I_1, I_2$, the MIFNet predicts the multi-interflows from the intermediate frame $I_t$ to the inputs at three scale levels, $\{F^l_{t \to 1}, F^l_{t \to 2}\}$ for $l \in \{-1, 0, 1\}$, where $l$ means spatial down-sampling by $2^l$ (i.e. $l = -1$ denotes up-sampling), so that the re-sampled inputs $I^l_1, I^l_2$ can be warped to time $t$ using Equation (2) to produce $\hat{I}^l_{t1}, \hat{I}^l_{t2}$ respectively. Here the incorporation of the finer scale ($l = -1$) further increases the precision of multi-flow warping (through 8-tap filter up-sampling, see below).

Architecture. The architecture of the MIFNet is shown in Figure 2 (a). In order to capture pixel movements at multiple scales, we devise a U-Net style feature extractor, U-MultiScaleResNext (UMSResNext), consisting of eight MSResNext blocks (illustrated in Figure 3). Each MSResNext block employs two ResNext blocks [xie2017aggregated] in parallel with different kernel sizes in the middle layer, 3×3 and 7×7, which further increases the network cardinality [xie2017aggregated, ma2020cvegan]. The outputs of these two ResNext blocks are then concatenated and connected to a channel attention module [Hu_2018_CVPR], which learns adaptive weighting of the feature maps extracted by the two ResNext blocks. Such a feature selection mechanism has also been found to enhance motion modeling [choi2020channel, kalluri2020flavr]. In UMSResNext, the up-sampling operation is performed by replacing the k×k grouped convolutions in the middle layer with (k+1)×(k+1) grouped transposed convolutions.

The features extracted by UMSResNext are then passed to the multi-flow heads for multi-interflow prediction. In contrast to [lee2020adacof], multi-flows here are predicted at various scales $l$, and occlusion maps are not generated (occlusion is handled by the BLFNet). As shown in Figure 2 (b), each multi-flow head contains 6 sub-branches, predicting the x-, y-components ($\alpha, \beta$) and the kernel weights ($w$) of the two multi-interflows $F^l_{t \to 1}$ and $F^l_{t \to 2}$. The predicted flows are then used to backwarp the inputs at corresponding scales using Equation (2). Here a bilinear filter is used for down-sampling input frames, and an 8-tap filter originally designed for sub-pixel motion estimation [sullivan2012overview] is employed for up-sampling.

3.2 Bi-directional Linear Flow Network

To improve large motion interpolation, bi-directional flows $F_{1 \to 2}, F_{2 \to 1}$ between the inputs $I_1, I_2$ are also predicted using a pre-trained flow estimator [sun2018pwc], which is based on a coarse-to-fine architecture. The intermediate flows are then linearly approximated as follows.

$$F_{1 \to t} = 0.5 \cdot F_{1 \to 2}, \qquad F_{2 \to t} = 0.5 \cdot F_{2 \to 1} \qquad (3)$$

According to the intermediate flows, the frames $I_1, I_2$ are forward warped using the efficient softsplat operator [niklaus2020softmax], which learns occlusion-related softmax-alike weighting of reference pixels in the forward warping process. Another advantage of softsplat is that it is differentiable, allowing the flow estimator to be end-to-end optimized. Finally, the BLFNet branch outputs the warped frames $\bar{I}_{t1}, \bar{I}_{t2}$. The employment of the BLFNet branch was found to be essential for handling large motion and occlusion and improving the overall capacity of the proposed model.
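The linear approximation of the intermediate flows can be sketched as a simple scaling of the bi-directional flows. The snippet below is an illustrative sketch only: the function name and the tuple-based flow layout are hypothetical, and the `tau` parameter generalizes the midpoint case (`tau = 0.5`) as an assumption that is not taken from the text. The softsplat forward warping itself is not reproduced here.

```python
def linear_intermediate_flows(flow_12, flow_21, tau=0.5):
    """Scale bi-directional flows to reach an intermediate time step.

    flow_12 and flow_21 are lists of rows of (dx, dy) tuples. `tau` is the
    fraction of the way from the first frame to the second; 0.5 is the midpoint.
    """
    def scale(flow, s):
        return [[(s * dx, s * dy) for dx, dy in row] for row in flow]

    # Frame 1 moves tau of the way forward, frame 2 moves (1 - tau) backward.
    return scale(flow_12, tau), scale(flow_21, 1.0 - tau)

# A single pixel moving 4 px right and 2 px up between the two input frames.
flow_1t, flow_2t = linear_intermediate_flows([[(4.0, -2.0)]], [[(-4.0, 2.0)]])
```

Under the linear motion assumption both flows are simply halved at the midpoint, which is why this approximation degrades for accelerating or curved motion.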

3.3 Multi-Scale Fusion Module

The Multi-Scale Fusion Module is employed to produce an intermediate interpolation result using the frames warped at multiple scales in the previous steps. Here we adopt the GridNet [fourure2017residual] architecture due to its superior performance on fusing multi-scale information [niklaus2018context, niklaus2020softmax]. The GridNet is configured here to have 4 columns and 3 rows, with the first, second and third rows corresponding to scales of $l = -1, 0, 1$ respectively. The first and third rows take the MIFNet outputs $\{\hat{I}^{-1}_{t1}, \hat{I}^{-1}_{t2}\}$ and $\{\hat{I}^{1}_{t1}, \hat{I}^{1}_{t2}\}$ as inputs, while the second row takes $\{\hat{I}^{0}_{t1}, \hat{I}^{0}_{t2}\} \oplus \{\bar{I}_{t1}, \bar{I}_{t2}\}$, where $\oplus$ denotes channel-wise concatenation and $\bar{I}_{t1}, \bar{I}_{t2}$ are the frames warped by the BLFNet. Finally, this module outputs the intermediate result $\tilde{I}_t$ at the original spatial resolution ($l = 0$).

Figure 4: The architecture of the Texture Enhancement Network.

3.4 Texture Enhancement Network

At the end of the first stage, the output of the Multi-Scale Fusion module, $\tilde{I}_t$, is concatenated with the four original inputs to form $\{I_0, I_1, \tilde{I}_t, I_2, I_3\}$, which are then fed into the Texture Enhancement Network (TENet). Including additional frames here allows better modeling of higher-order motions and also provides more information on longer-term spatio-temporal characteristics. Motivated by recent work in dynamic texture synthesis [xie2019learning, yang2021spatiotemporal], where spatio-temporal filtering was found to be effective for generating coherent video textures, we integrate a 3D CNN for texture enhancement. This CNN architecture (shown in Figure 4) is a modified version of the network developed in [kalluri2020flavr], but with reduced layer widths. The reduction reflects the fact that the intermediate warped frame $\tilde{I}_t$ has already been produced and is relatively close to the target. This differs from the original scenario in [kalluri2020flavr], where the network is expected to directly synthesize the interpolated output using the four original input frames. Finally, the TENet is expected to output a residual signal containing the textural difference between $\tilde{I}_t$ and the target frame, which contributes to the final output of ST-MFNet.

3.5 Loss Functions

We trained two versions of ST-MFNet in this work. For the distortion oriented model, a Laplacian pyramid loss [bojanowski2017optimizing] ($\mathcal{L}_{lap}$) was used as the objective function. This model was further fine-tuned using an ST-GAN based perceptual loss ($\mathcal{L}_{p}$) to obtain the perceptually optimized version.

Laplacian pyramid loss. ST-MFNet was trained end-to-end by matching its output $I^{out}_t$ with the ground-truth intermediate frame $I^{gt}_t$ using the Laplacian pyramid loss [bojanowski2017optimizing], which has been previously used for VFI in [niklaus2018context, niklaus2020softmax, liu2020enhanced]. The loss function is defined below.

$$\mathcal{L}_{lap} = \sum_{s=1}^{s_{max}} 2^{s-1} \left\| L^s(I^{out}_t) - L^s(I^{gt}_t) \right\|_1 \qquad (4)$$

Here $L^s(I)$ denotes the $s$th level of the Laplacian pyramid of an image $I$, and $s_{max}$ is the maximum level.
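To make the structure of a Laplacian pyramid loss concrete, the following pure-Python sketch builds a small pyramid and sums per-level L1 distances. It is a simplified illustration rather than the training code: the pyramid uses 2×2 average pooling and nearest-neighbour up-sampling instead of Gaussian filtering, the per-level weight 2^(s-1) is treated as an assumption, and all function names are hypothetical. Image sides must be divisible by 2^(levels-1).

```python
def downsample(img):
    """Halve resolution with 2x2 average pooling (stand-in for blur + decimation)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * y][2 * x] + img[2 * y][2 * x + 1]
              + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(w)] for y in range(h)]

def upsample(img):
    """Double resolution with nearest-neighbour replication."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def laplacian_pyramid(img, levels):
    """Return `levels` images: band-pass details, then the low-pass residual."""
    pyramid, current = [], img
    for _ in range(levels - 1):
        smaller = downsample(current)
        up = upsample(smaller)
        pyramid.append([[c - u for c, u in zip(rc, ru)]
                        for rc, ru in zip(current, up)])
        current = smaller
    pyramid.append(current)
    return pyramid

def laplacian_loss(pred, target, levels=3):
    """Sum over levels s = 1..levels of 2^(s-1) times the L1 distance."""
    loss = 0.0
    pyr_p = laplacian_pyramid(pred, levels)
    pyr_t = laplacian_pyramid(target, levels)
    for s, (lp, lt) in enumerate(zip(pyr_p, pyr_t), start=1):
        l1 = sum(abs(a - b) for rp, rt in zip(lp, lt) for a, b in zip(rp, rt))
        loss += 2 ** (s - 1) * l1
    return loss

# A constant brightness offset only affects the coarsest (low-pass) level.
target = [[float(4 * y + x) for x in range(4)] for y in range(4)]
pred = [[v + 1.0 for v in row] for row in target]
loss = laplacian_loss(pred, target, levels=3)
```

Because the band-pass levels cancel any constant offset, this toy case shows how the pyramid separates low-frequency errors from errors in fine detail.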

Spatio-temporal adversarial loss. To further improve the perceptual quality of the ST-MFNet output, we also trained our model using the Spatio-Temporal Generative Adversarial Networks (ST-GAN) training methodology [yang2021spatiotemporal]. Unlike the conventional GAN [goodfellow2020generative], which focuses on a single image, the discriminator of the ST-GAN also processes temporally adjacent video frames, which improves temporal consistency. This is key for video frame interpolation. The architecture of the ST-GAN discriminator used in this work is provided in Appendix A. This discriminator was trained with the following loss.


The corresponding adversarial loss for the generator (ST-MFNet) is given below.


This is then combined with the Laplacian pyramid loss to form the perceptual loss for ST-MFNet fine-tuning,


where is a weighting hyper-parameter that controls the perception-distortion trade-off [blau2018perception].

4 Experimental Setup

Implementation details. In our implementation, the number of flows for the MIFNet branch was set to the default value of the original multi-flows in [lee2020adacof]. The maximum level of the Laplacian pyramid loss was set to 5, and the weighting hyper-parameter . We used the AdaMax optimizer [kingma2014adam] in the training with . The learning rate was set to 0.001 and reduced by a factor of 0.5 whenever the validation performance stopped improving for 5 epochs. The pre-trained flow estimator [sun2018pwc] in the BLFNet branch was frozen for the first 60 epochs and then fine-tuned for 10 more epochs to further improve VFI performance. The network was trained for a total of 70 epochs using a batch size of 4. All training and evaluation processes were executed with an NVIDIA P100 GPU.
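The learning-rate schedule described above can be sketched as a small plateau-based scheduler. This is an illustrative stand-in, not the training script: the class name is hypothetical, the validation score is assumed to be "higher is better" (e.g. PSNR), and the exact epoch at which the reduction fires (here, the fifth consecutive epoch without improvement) is an assumption about the counting convention.

```python
class PlateauHalver:
    """Halve the learning rate after `patience` epochs without improvement."""

    def __init__(self, lr=0.001, factor=0.5, patience=5):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best = float("-inf")
        self.stale_epochs = 0

    def step(self, val_score):
        """Register one epoch's validation score and return the new rate."""
        if val_score > self.best:
            self.best, self.stale_epochs = val_score, 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                self.lr *= self.factor
                self.stale_epochs = 0
        return self.lr

# One improving epoch, five stagnant epochs, then a new best score.
sched = PlateauHalver()
history = [sched.step(s) for s in [30.0, 29.0, 29.0, 29.0, 29.0, 29.0, 31.0]]
```

In this toy run the rate stays at 0.001 until the fifth stagnant epoch and is then halved once.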

Training datasets. We used the training split of the Vimeo-90k (septuplet) dataset [xue2019video], which contains 91,701 frame septuplets at a spatial resolution of 448×256. It is noted that Vimeo-90k was produced with constrained motion magnitude and complexity. To further enhance the VFI performance on large motion and dynamic textures, we used an additional dataset, BVI-DVC [ma2020bvi], which contains 800 videos of 64 frames at four spatial resolutions (200 videos each): 2160p, 1080p, 540p and 270p. This dataset covers a wide range of texture/motion types and frame rates (from 24 to 120 FPS). For each training epoch, we randomly sampled 12800, 6400, 800, 800 septuplets from these four resolution groups respectively, leaving out a subset of video frames for validation. We augmented all septuplets from both Vimeo-90k and BVI-DVC by randomly cropping 256×256 patches and performing flipping and temporal order reversing. This resulted in more than 100,000 septuplets of 256×256 patches. In each septuplet, the 1st, 3rd, 5th and 7th frames were used as inputs and the 4th as the ground-truth target. The test split of Vimeo-90k together with the unused subset of BVI-DVC was utilized as the validation set for hyper-parameter tuning and training monitoring.
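The way each septuplet is turned into a training sample can be sketched as follows. This is a minimal illustration under stated assumptions: frames are represented by arbitrary Python objects, the function names are hypothetical, the 50% reversal probability is an assumption, and spatial flipping is omitted for brevity.

```python
import random

def make_training_sample(septuplet, rng):
    """Split a 7-frame clip into four inputs and one target.

    The clip is reversed in time with probability 0.5 (assumed value).
    """
    assert len(septuplet) == 7
    frames = list(septuplet)
    if rng.random() < 0.5:  # temporal order reversing
        frames.reverse()
    inputs = [frames[0], frames[2], frames[4], frames[6]]  # 1st, 3rd, 5th, 7th
    target = frames[3]                                      # 4th frame
    return inputs, target

def random_crop_box(width, height, rng, size=256):
    """Pick a size x size crop window (left, top, right, bottom) inside the frame."""
    x0 = rng.randint(0, width - size)   # randint is inclusive at both ends
    y0 = rng.randint(0, height - size)
    return x0, y0, x0 + size, y0 + size

rng = random.Random(0)
clip = ["f1", "f2", "f3", "f4", "f5", "f6", "f7"]
inputs, target = make_training_sample(clip, rng)
box = random_crop_box(448, 256, rng)
```

Because the target is the centre frame, temporal reversal changes only the order of the inputs and never the ground truth.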

Figure 5: Qualitative results interpolated by different variants of our method. Here “Overlay” means the overlaid adjacent frames and “GT” the ground truth. Figures (a)-(d): w/ MIFNet vs w/o MIFNet; figures (e)-(h): w/ BLFNet vs w/o BLFNet; figures (i)-(l): UMSResNext vs U-Net; figures (m)-(p): w/ TENet vs w/o TENet; figures (q)-(v): comparison of different GANs (Ours-$\mathcal{L}_{lap}$, TGAN, FIGAN and Ours-$\mathcal{L}_{p}$).

Evaluation dataset. Since our model takes four frames as input, the evaluation dataset should be able to provide frame quintuplets ($I_0, I_1, I_t, I_2, I_3$). In this work, we used the test quintuplets in [xu2019quadratic], which were extracted from the UCF-101 [soomro2012ucf101] (100 quintuplets) and DAVIS [perazzi2016benchmark] (2847 quintuplets) datasets. The evaluation was also based on the SNU-FILM dataset [choi2020channel], which specifies a list of 310 triplets at four motion magnitude levels. As original sequences are provided in the SNU-FILM dataset, we extended its pre-defined test triplets into quintuplets for the evaluation here. Other commonly used test datasets, e.g. Middlebury [baker2011database] and UCF-DVF [liu2017video], have not been employed here. This is because these databases only contain frame triplets, which cannot provide sufficient input frames for our model.

To further test interpolation performance on various texture types, we developed a new test set, VFITex, which contains twenty 100-frame videos at UHD or HD resolution and with a frame rate of 24, 30 or 50 FPS, collected from the Xiph [montgomery3xiph], Mitch Martinez Free 4K Stock Footage [mitch], UVG database [mercat2020uvg] and the Pexels website [pexels]. This dataset covers diverse textured scenes, including crowds, flags, foliage, animals, water, leaves, fire and smoke. Based on the computational capacity available, we center-cropped HD patches from the UHD sequences, preserving the original UHD characteristics. All frames in each sequence were used for evaluation, totaling 940 quintuplets. More details of the training and evaluation datasets and their license information are provided in Appendix D.
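The quintuplet count can be checked with a short sketch. The extraction scheme below is an assumption, not something stated in the text: even-indexed frames are treated as the low-frame-rate input and the odd-indexed frame between the two middle inputs is the target. It is shown because it reproduces the reported total for twenty 100-frame videos; the function name is hypothetical.

```python
def extract_quintuplets(num_frames):
    """Return index tuples (I0, I1, target, I2, I3) for one sequence.

    Windows slide by two frames so that inputs always fall on even indices.
    """
    quintuplets = []
    for start in range(0, num_frames - 6, 2):
        quintuplets.append((start, start + 2, start + 3, start + 4, start + 6))
    return quintuplets

per_video = extract_quintuplets(100)
total = 20 * len(per_video)  # twenty videos in the test set
```

Under this assumed scheme each 100-frame video yields 47 quintuplets, giving 940 in total.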

Method | UCF101 | DAVIS | VFITex
Ours-w/o BLFNet | 33.218/0.970 | 27.767/0.881 | 28.498/0.915
Ours-w/o MIFNet | 33.202/0.969 | 27.886/0.889 | 28.357/0.911
Ours-w/o TENet | 32.895/0.970 | 27.484/0.880 | 28.241/0.910
Ours-unet | 33.378/0.970 | 28.096/0.892 | 28.898/0.925
Ours | 33.384/0.970 | 28.287/0.895 | 29.175/0.929
Table 1: Ablation study results (PSNR/SSIM) for ST-MFNet.

Evaluation Methods. The two most commonly used quality metrics, PSNR and SSIM [wang2004image], were employed here for objective assessment of the interpolated content. We note that these metrics do not always correlate well with video quality as perceived by a human observer [hore2010image, kalluri2020flavr]. Hence, in order to further compare the video frames interpolated by our method and the benchmark references, a user study was conducted based on a psychophysical experiment. The details of the user study are described in Section 5.3.
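For reference, PSNR can be computed as in the following sketch. It assumes 8-bit grayscale images stored as lists of rows and a peak value of 255; the function name is illustrative, and the evaluation code used for the reported results may differ in colour handling and averaging.

```python
import math

def psnr(pred, target, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    squared_error, count = 0.0, 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            squared_error += (p - t) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

reference = [[100.0, 120.0], [140.0, 160.0]]
off_by_one = [[v + 1.0 for v in row] for row in reference]
score = psnr(off_by_one, reference)
```

Since PSNR is logarithmic in the mean squared error, a gain of 1 dB corresponds to roughly a 21% reduction in MSE.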

Method | UCF101 | DAVIS | SNU-FILM Easy | SNU-FILM Medium | SNU-FILM Hard | SNU-FILM Extreme | VFITex | RT (sec) | #P (M)
DVF [liu2017video] | 32.251/0.965 | 20.403/0.673 | 27.528/0.876 | 24.091/0.817 | 21.556/0.760 | 19.709/0.705 | 19.946/0.709 | 0.157 | 3.82
SuperSloMo [jiang2018super] | 32.547/0.968 | 26.523/0.866 | 36.255/0.984 | 33.802/0.973 | 29.519/0.930 | 24.770/0.855 | 27.914/0.911 | 0.107 | 39.61
SepConv [niklaus2017video] | 32.524/0.968 | 26.441/0.853 | 39.894/0.990 | 35.264/0.976 | 29.620/0.926 | 24.653/0.851 | 27.635/0.907 | 0.062 | 21.68
DAIN [bao2019depth] | 32.524/0.968 | 27.086/0.873 | OOM | OOM | OOM | OOM | OOM | 0.896 | 24.03
BMBC [park2020bmbc] | 32.729/0.969 | 26.835/0.869 | OOM | OOM | OOM | OOM | OOM | 1.425 | 11.01
AdaCoF [lee2020adacof] | 32.610/0.968 | 26.445/0.854 | 39.912/0.990 | 35.269/0.977 | 29.723/0.928 | 24.656/0.851 | 27.639/0.904 | 0.051 | 21.84
FeFlow [gui2020featureflow] | 32.520/0.967 | 26.555/0.856 | OOM | OOM | OOM | OOM | OOM | 1.385 | 133.63
CDFI [ding2021cdfi] | 32.653/0.968 | 26.471/0.857 | 39.881/0.990 | 35.224/0.977 | 29.660/0.929 | 24.645/0.854 | 27.576/0.906 | 0.321 | 4.98
CAIN [choi2020channel] | 32.537/0.968 | 26.477/0.857 | 39.890/0.990 | 35.630/0.978 | 29.998/0.931 | 25.060/0.857 | 28.184/0.911 | 0.071 | 42.78
SoftSplat [niklaus2020softmax] | 32.835/0.969 | 27.582/0.881 | 40.165/0.991 | 36.017/0.979 | 30.604/0.937 | 25.436/0.864 | 28.813/0.924 | 0.206 | 12.46
EDSC [cheng2021multiple] | 32.677/0.969 | 26.689/0.860 | 39.792/0.990 | 35.283/0.977 | 29.815/0.929 | 24.872/0.854 | 27.641/0.904 | 0.067 | 8.95
XVFI [sim2021xvfi] | 32.224/0.966 | 26.565/0.863 | 38.849/0.989 | 34.497/0.975 | 29.381/0.929 | 24.677/0.855 | 27.759/0.909 | 0.108 | 5.61
QVI [xu2019quadratic] | 32.668/0.967 | 27.483/0.883 | 36.648/0.985 | 34.637/0.978 | 30.614/0.947 | 25.426/0.866 | 28.819/0.926 | 0.257 | 29.23
FLAVR [kalluri2020flavr] | 33.389/0.971 | 27.450/0.873 | 40.135/0.990 | 35.988/0.979 | 30.541/0.937 | 25.188/0.860 | 28.487/0.915 | 0.695 | 42.06
ST-MFNet (Ours) | 33.384/0.970 | 28.287/0.895 | 40.775/0.992 | 37.111/0.985 | 31.698/0.951 | 25.810/0.874 | 29.175/0.929 | 0.901 | 21.03
Table 2: Quantitative comparison results (PSNR/SSIM) for ST-MFNet and 14 tested methods. In some cases, underlined scores based on the pre-trained models are provided in the table, when they outperform their re-trained counterparts. OOM denotes cases where our GPU runs out of memory for the evaluation. For each column, the best result is colored in red and the second best is colored in blue. The average runtime (RT) for interpolating a 480p frame as well as the number of model parameters (#P) for each method are also reported.

5 Results and Analysis

In this section, we analyze our proposed model through ablation studies, and compare it with 14 state-of-the-art methods both quantitatively and qualitatively.

5.1 Ablation Study

The key ablation study results are summarized in Table 1, where five versions of ST-MFNet have been evaluated. Figure 5 provides a visual comparison between the frames generated by each test variant and the full ST-MFNet model. Additional ablation study results are available in Appendix B.

MIFNet and BLFNet branches. To verify that the MIFNet and BLFNet branches are both effective, two variants of ST-MFNet, Ours-w/o BLFNet and Ours-w/o MIFNet, were created by removing the BLFNet and MIFNet branches respectively. Both variants were trained and evaluated using the same configurations described above. While both Ours-w/o MIFNet and Ours-w/o BLFNet achieve lower overall performance compared to the full ST-MFNet (Ours), Ours-w/o MIFNet exhibits a larger drop in its performance on the VFITex test set compared to Ours-w/o BLFNet. This indicates the more important role of the MIFNet branch for capturing complex texture dynamics. On the other hand, for the DAVIS dataset, which contains many large motions, Ours-w/o MIFNet performs better than Ours-w/o BLFNet, which further confirms the larger contribution of the coarse-to-fine architecture in the BLFNet on this type of content. It can also be observed in Figure 5, for the case without MIFNet (sub-figures (a-d)), that the model fails to capture the complex motion of the wave. When BLFNet was removed from the original ST-MFNet (sub-figures (e-h)), the occluded region, which is also undergoing a large movement, has not been interpolated properly.

UMSResNext for multi-flow estimation. To measure the efficacy of the new UMSResNext, we replaced the UMSResNext-based feature extractor described in Section 3.1 with the U-Net used in [lee2020adacof] to predict similar multi-flows. This is denoted as Ours-unet. As shown in Table 1, ST-MFNet with UMSResNext achieves enhanced performance on all test sets, and this is also demonstrated by the visual comparison example in Figure 5 (i-l). Another advantage of using UMSResNext is that it has far fewer parameters (4M) than U-Net (21M).

Texture Enhancement. The importance of the TENet was also analyzed by training another variant, Ours-w/o TENet, where the TENet is removed. Table 1 shows that there is a significant performance decrease compared to the full version, especially on DAVIS and VFITex. This demonstrates the contribution of the spatio-temporal filtering on frames over a wider temporal window for content with large and complex motions. Figure 5 (m-p) also shows an example, where the full ST-MFNet with the TENet produces richer textural detail compared to the version without TENet.

ST-GAN. To investigate the effectiveness of the ST-GAN training, we compared the visual quality of the interpolated content generated by the fine-tuned network trained with the perceptual loss (Ours-$\mathcal{L}_{p}$) and the distortion-oriented model trained with the Laplacian pyramid loss (Ours-$\mathcal{L}_{lap}$). We also replaced the ST-GAN with two existing GANs, FIGAN [lee2020adacof] and TGAN [saito2017temporal]. Example frames produced by these variants are shown in Figure 5 (q-v), where the result generated by Ours-$\mathcal{L}_{p}$ exhibits sharper edges and clearer structures compared to those produced by Ours-$\mathcal{L}_{lap}$ and the other GANs.

Figure 6: Qualitative interpolation examples by different methods. The first column (a) shows the overlaid adjacent frames. Columns (b-f) correspond to some of the best-performing benchmark methods. The results of our distortion-oriented model (g) and perception-oriented model (h) are also included, along with the ground truth frames (i).
Figure 7: Results of the user study showing preference ratios for the tested interpolation methods. The error bars denote standard deviation over test videos.

5.2 Quantitative Evaluation

We compared the proposed ST-MFNet with 14 state-of-the-art VFI models including DVF [liu2017video], SuperSloMo [jiang2018super], SepConv [niklaus2017video], DAIN [bao2019depth], BMBC [park2020bmbc], AdaCoF [lee2020adacof], FeFlow [gui2020featureflow], CDFI [ding2021cdfi], CAIN [choi2020channel], Softsplat [niklaus2020softmax], EDSC [cheng2021multiple], XVFI [sim2021xvfi], QVI [xu2019quadratic] and FLAVR [kalluri2020flavr]. For fair comparison, we re-trained all benchmark models with the same training and validation datasets used for ST-MFNet under identical training configurations. The comprehensive evaluation results are summarized in Table 2, where the best and second best results in each column are highlighted in red and blue respectively. For all benchmark networks, we additionally evaluated their pre-trained versions provided in the original literature (where applicable). For each test set, if the pre-trained results are better than the re-trained counterparts, the former are presented and underlined.

Two key observations can be made from Table 2. Firstly, by using our training set (Vimeo-90k+BVI-DVC), the re-trained versions of all compared models improve on their pre-trained versions for content with large and complex motions, i.e. on DAVIS, SNU-FILM (medium, hard, extreme) and VFITex. For seven models, however, the pre-trained versions achieved higher PSNR and SSIM values on the UCF101 dataset. This may be due to the similar characteristics of their pre-training dataset, Vimeo-90k, and UCF101. Secondly, ST-MFNet offers the best results on DAVIS, SNU-FILM (all subsets) and VFITex, with a significant improvement of 0.36-1.09dB (PSNR) over the runner-up for each test set. It is outperformed only by the pre-trained FLAVR on UCF101, by a marginal difference of 0.005dB (PSNR) and 0.001 (SSIM). This demonstrates the excellent generalization ability of the proposed ST-MFNet.
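The gains quoted above are differences in PSNR, so it may help to recall how that metric is computed. The following is a minimal sketch for 8-bit frames; the function name `psnr` and the flat-list frame representation are illustrative assumptions, and this is not the evaluation code used to produce Table 2.

```python
import math

def psnr(reference, interpolated, peak=255.0):
    """PSNR in dB between two equally sized frames given as flat pixel sequences."""
    if len(reference) != len(interpolated) or not reference:
        raise ValueError("frames must be non-empty and of equal size")
    # Mean squared error over all pixels.
    mse = sum((r - i) ** 2 for r, i in zip(reference, interpolated)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak ** 2 / mse)

# Synthetic toy data (not from the paper): every pixel is off by 1, so MSE = 1.
gt = [10, 20, 30, 40]
pred = [11, 19, 31, 39]
print(round(psnr(gt, pred), 2))
```

Because PSNR is logarithmic in the mean squared error, a gain of 1.09dB on a single frame corresponds to an MSE that is lower by a factor of about 10^(-0.109) ≈ 0.78. Averages over a test set do not translate quite so directly.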

Complexity. Because some models cannot be tested on high-resolution content, we measured model complexity solely on the 480p sequences from the DAVIS test set. The average runtime (RT, in seconds) for interpolating one frame is reported in Table 2 for each tested network, alongside its total number of parameters. ST-MFNet has a relatively high computational complexity compared with the other tested models. Reducing model complexity remains future work.
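An average per-frame runtime of this kind can be measured along the following lines. This is only a sketch: `average_runtime` and `dummy_interpolate` are hypothetical names, the dummy function stands in for a real network, and the paper does not describe its timing procedure in this section.

```python
import time

def average_runtime(interpolate, inputs, warmup=1):
    """Mean wall-clock seconds per call of `interpolate` over `inputs`."""
    for sample in inputs[:warmup]:
        interpolate(sample)  # warm-up calls are not timed
    start = time.perf_counter()
    for sample in inputs:
        interpolate(sample)
    return (time.perf_counter() - start) / len(inputs)

def dummy_interpolate(frames):
    # Stand-in for a real model: average the two input "frames" pixel by pixel.
    first, second = frames
    return [(a + b) / 2 for a, b in zip(first, second)]

# Synthetic inputs: five pairs of 1000-pixel frames.
samples = [([0.0] * 1000, [1.0] * 1000) for _ in range(5)]
rt = average_runtime(dummy_interpolate, samples)
print(f"average runtime: {rt:.6f} s per frame")
```

When timing a network on a GPU, the device would typically also need to be synchronized before each clock reading, since kernel launches are asynchronous.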

5.3 Qualitative Evaluation

Visual comparisons. Example frames interpolated by our model and several of the best-performing state-of-the-art methods are shown in Figure 6. It can be observed that the results generated by the perceptually trained ST-MFNet (column h) are closer to the ground truth, containing fewer visual artifacts and exhibiting better perceptual quality.

User Study. As single frames cannot fully reflect the perceptual quality of interpolated content, we conducted a user study where our method was compared against three competitive benchmark approaches, QVI, FLAVR and Softsplat (re-trained using its original perceptual loss [niklaus2020softmax]). For this study, 20 videos randomly selected from DAVIS, SNU-FILM and VFITex were used as the test content for the three tested models. In each trial of a test session, participants were shown a pair of videos including one interpolated by perceptually optimized ST-MFNet and the other one by QVI, FLAVR or Softsplat. This results in a total of 60 trials in each test session. The order of video presentation was randomized in each trial (the order of trials was also random), and the subject in each case was asked to choose the sequence with higher perceived quality. Twenty subjects were paid to participate in this study. More details of the user study can be found in Appendix C.
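The session structure described above (20 videos, each compared against three benchmark methods, with both the presentation order within a pair and the order of trials randomized) can be sketched as follows. The function `build_session`, the placeholder video names and the fixed seed are assumptions made for illustration; the actual study was run with Matlab Psychtoolbox.

```python
import random

def build_session(videos, competitors, ours="ST-MFNet", seed=None):
    """Return a randomized list of pairwise-comparison trials."""
    rng = random.Random(seed)
    trials = []
    for video in videos:
        for competitor in competitors:
            pair = [ours, competitor]
            rng.shuffle(pair)  # randomize which method is shown first
            trials.append({"video": video, "first": pair[0], "second": pair[1]})
    rng.shuffle(trials)  # randomize the order of trials within the session
    return trials

videos = [f"video_{i:02d}" for i in range(20)]  # placeholder names
session = build_session(videos, ["QVI", "FLAVR", "Softsplat"], seed=0)
print(len(session))  # 60
```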

The collected user study results are summarized in Figure 7. It can be observed that close to 70% of users on average preferred ST-MFNet over QVI, and this result is statistically significant at the 95% confidence level based on a t-test. The average preference difference between our method and FLAVR is smaller, with 56% of users in favor of the ST-MFNet results. This was also significant at the 95% confidence level. Finally, when comparing against Softsplat, around 60% of subjects favored our method, and the significance again holds at the 95% level.
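One way such a significance check could be carried out is a one-sample t-test of the per-video preference ratios against the chance level of 0.5. The sketch below assumes that design (the text only states that a t-test was used), uses synthetic ratios rather than the study data, and hard-codes the two-sided 5% critical value of Student's t for 19 degrees of freedom.

```python
import math
import statistics

def one_sample_t(values, null_mean=0.5):
    """t statistic for testing whether the mean of `values` differs from `null_mean`."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return (mean - null_mean) / (sd / math.sqrt(n))

# Synthetic per-video preference ratios for 20 videos (NOT the paper's data).
ratios = [0.70, 0.65, 0.75, 0.60, 0.80, 0.55, 0.70, 0.65, 0.75, 0.85,
          0.60, 0.70, 0.65, 0.75, 0.70, 0.55, 0.80, 0.65, 0.70, 0.75]

T_CRIT_DF19 = 2.093  # two-sided 5% critical value, Student's t, 19 degrees of freedom
t = one_sample_t(ratios)
print("significant at the 95% level:", abs(t) > T_CRIT_DF19)
```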

6 Limitations and Potential Negative Impacts

Although the proposed method delivers superior interpolation performance, we are aware of its relatively low inference speed, which is mainly due to its large network capacity. Training such large models can also have a negative impact on the environment due to the significant power consumption of computational hardware [lacoste2019quantifying]. This can be mitigated by using more efficient hardware, and through model complexity reduction based on network compression [modelcompression] and knowledge distillation [hinton2015distilling].

7 Conclusion

In this paper, we propose a novel video frame interpolation algorithm, ST-MFNet, which consistently achieves improved interpolation performance (up to a 1.09dB PSNR gain) over state-of-the-art methods on various challenging video content. The proposed method features three main innovative design elements. Firstly, flexible many-to-one multi-flows are combined with conventional one-to-one optical flows in a multi-branch fashion, which enhances the ability to capture large and complex motions. Secondly, a novel architecture is designed to predict multi-interflows at multiple scales, leading to reduced complexity but enhanced performance. Thirdly, we employ a 3D CNN architecture and the ST-GAN, originally proposed for texture synthesis, to enhance the visual quality of textures in the interpolated content. Our quantitative and qualitative experiments showed that all of these contribute to the final performance of our model, which consistently outperforms many state-of-the-art methods with significant gains.

8 Acknowledgment

The author Duolikun Danier is funded jointly by the University of Bristol and China Scholarship Council.


Appendix A Discriminator for ST-MFNet

The architecture of the discriminator employed in this work is illustrated in Figure 8; it was originally designed to train ST-GAN [yang2021spatiotemporal] for texture synthesis. It contains a temporal and a spatial branch. The former takes as input the differences between the interpolated output of ST-MFNet and its two adjacent original frames. These differences represent the high-frequency temporal information within the three frames. The spatial branch processes the ST-MFNet output to generate spatial features. Finally, the temporal and spatial features generated in the two branches are concatenated before being fed into the final fully connected layers.
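The inputs to the two branches described above can be illustrated with a small sketch. The function names, the flat-list frame representation and the sign convention of the differences (later frame minus earlier frame) are assumptions; the layers of the discriminator itself are given only in Figure 8 and are not reproduced here.

```python
def temporal_branch_input(prev_frame, interpolated, next_frame):
    """Difference frames between the interpolated output and its two neighbours.

    Frames are flat sequences of pixel values; the sign convention
    (later frame minus earlier frame) is an assumption of this sketch.
    """
    if not (len(prev_frame) == len(interpolated) == len(next_frame)):
        raise ValueError("all three frames must have the same size")
    diff_prev = [cur - prev for cur, prev in zip(interpolated, prev_frame)]
    diff_next = [nxt - cur for nxt, cur in zip(next_frame, interpolated)]
    return diff_prev, diff_next

def discriminator_inputs(prev_frame, interpolated, next_frame):
    """Pair the temporal-branch input with the spatial-branch input."""
    return {
        "temporal": temporal_branch_input(prev_frame, interpolated, next_frame),
        "spatial": list(interpolated),  # the spatial branch sees the output frame itself
    }

# Toy 3-pixel frames (synthetic): a perfect linear blend yields equal differences.
inputs = discriminator_inputs([0, 10, 20], [5, 15, 25], [10, 20, 30])
print(inputs["temporal"])
```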

Figure 8: Architecture of the discriminator used for training ST-MFNet.

Appendix B Additional Ablation Study

In the main paper, we presented key ablation study results in which the primary contributions of the proposed ST-MFNet were evaluated. Here we further investigate the effectiveness of the up-sampling scale employed during multi-flow prediction in the MIFNet branch (see Section 3.1 of the main paper).

Up-sampling. To evaluate the contribution of the up-sampling scale during multi-flow prediction, a version of ST-MFNet (Ours-w/o US) with only two multi-flow estimation heads was implemented. It was trained and evaluated using the same configurations described in the main paper. Its interpolation results are summarized in Table 3, alongside more comprehensive ablation study results for the other four variants of ST-MFNet (described in the main paper). It can be observed that Ours-w/o US was outperformed by the full version of ST-MFNet (Ours) on all test datasets. The performance difference is also visible in the visual comparison shown in Figure 9. Together, these results confirm the effectiveness of the up-sampling scale in multi-flow estimation.

Variant | UCF101 | DAVIS | SNU-FILM Easy | SNU-FILM Medium | SNU-FILM Hard | SNU-FILM Extreme | VFITex
Ours-w/o BLFNet | 33.218/0.970 | 27.767/0.881 | 40.655/0.990 | 36.890/0.984 | 31.205/0.947 | 25.492/0.869 | 28.498/0.915
Ours-w/o MIFNet | 33.202/0.969 | 27.886/0.889 | 40.331/0.991 | 36.530/0.982 | 31.321/0.949 | 25.620/0.871 | 28.357/0.911
Ours-w/o TENet | 32.895/0.970 | 27.484/0.880 | 40.275/0.991 | 35.983/0.980 | 30.527/0.937 | 25.374/0.864 | 28.241/0.910
Ours-unet | 33.378/0.970 | 28.096/0.892 | 40.616/0.991 | 36.797/0.984 | 31.383/0.950 | 25.680/0.872 | 28.898/0.925
Ours-w/o US | 33.371/0.970 | 28.155/0.893 | 40.248/0.990 | 36.689/0.983 | 31.384/0.949 | 25.636/0.873 | 28.977/0.925
Ours | 33.384/0.970 | 28.287/0.895 | 40.775/0.992 | 37.111/0.985 | 31.698/0.951 | 25.810/0.874 | 29.175/0.929
Table 3: Comprehensive ablation study results on ST-MFNet (PSNR/SSIM).
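To make the size of each component's contribution easier to read off, the PSNR drop of every variant relative to the full model can be computed directly from the values in Table 3. In the sketch below, the PSNR numbers are copied from the table, while the column labels are inferred from the test sets named in the main text (UCF101, DAVIS, the four SNU-FILM subsets, VFITex) and should be treated as an assumption.

```python
# Column labels are inferred from the test sets named in the text (assumption).
COLUMNS = ["UCF101", "DAVIS", "Easy", "Medium", "Hard", "Extreme", "VFITex"]

# PSNR values (dB) copied from Table 3.
PSNR = {
    "Ours-w/o BLFNet": [33.218, 27.767, 40.655, 36.890, 31.205, 25.492, 28.498],
    "Ours-w/o MIFNet": [33.202, 27.886, 40.331, 36.530, 31.321, 25.620, 28.357],
    "Ours-w/o TENet":  [32.895, 27.484, 40.275, 35.983, 30.527, 25.374, 28.241],
    "Ours-unet":       [33.378, 28.096, 40.616, 36.797, 31.383, 25.680, 28.898],
    "Ours-w/o US":     [33.371, 28.155, 40.248, 36.689, 31.384, 25.636, 28.977],
    "Ours":            [33.384, 28.287, 40.775, 37.111, 31.698, 25.810, 29.175],
}

def psnr_drop(variant):
    """PSNR loss (dB) of `variant` relative to the full model, per column."""
    return [round(full - v, 3) for full, v in zip(PSNR["Ours"], PSNR[variant])]

for name in PSNR:
    if name != "Ours":
        drops = psnr_drop(name)
        worst = max(range(len(drops)), key=drops.__getitem__)
        print(f"{name}: largest drop {drops[worst]:.3f} dB on {COLUMNS[worst]}")
```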

[Overlay]   [GT]   [Ours-w/o US]   [Ours-w/ US]

Figure 9: Qualitative results interpolated by the ST-MFNet with the up-sampled scale removed (Ours-w/o US) and the full version of ST-MFNet (Ours-w/ US). Here “Overlay” means the overlaid adjacent frames.

Appendix C User Study

The user study was conducted in a darkened, lab-based environment. The test sequences were played on a SONY PVM-X550 display with a screen size of 124.2×71.8cm. The display resolution was configured to 1920×1080 (spatial) and 60Hz (temporal), and the viewing distance was 2.15 meters (three times the screen height) [itu2002500]. The presentation of video sequences was controlled by a Windows PC running Matlab Psychtoolbox [pychotoolbox]. In each trial, the pair of videos to be compared was played twice, then the participant was asked to select the video with better perceived quality through an interface developed using the Psychtoolbox. This user study and the use of human data have undergone an internal ethics review and have been approved by the Institutional Review Board.

Appendix D Attribution of Assets

The data and code assets employed in this work and their corresponding license information are summarized in Tables 4 and 5 respectively.

Dataset Dataset URL License / Terms of Use
Vimeo-90k [xue2019video] MIT license.
BVI-DVC [ma2020bvi] All sequences are allowed for academic research.
UCF101 [soomro2012ucf101] No explicit license terms, but compiled and made available for research use by the University of Central Florida.
DAVIS [perazzi2016benchmark] BSD license.
SNU-FILM [choi2020channel] MIT license.
Xiph [montgomery3xiph] Sequences used are available for research use.
Mitch Martinez Free 4K Stock Footage [mitch] Sequences used are available for research use.
UVG database [mercat2020uvg] Non-commercial Creative Commons BY-NC license.
Pexels [pexels] All sequences are available for research use.
Table 4: License information for the datasets used in this work.
Method Source code URL License / Terms of Use
DVF [liu2017video] Non-commercial research and education only.
SuperSloMo [jiang2018super] MIT license.
SepConv [niklaus2017video] Academic purposes only.
DAIN [bao2019depth] MIT license.
BMBC [park2020bmbc] MIT license.
AdaCoF [lee2020adacof] MIT license.
FeFlow [gui2020featureflow] MIT license.
CAIN [choi2020channel] MIT license.
SoftSplat [niklaus2020softmax] Academic purposes only.
XVFI [sim2021xvfi] Research and education only.
FLAVR [kalluri2020flavr] Apache-2.0 License.
Table 5: License information for the code assets used in this work.