Optical-Flow-Reuse-Based Bidirectional Recurrent Network for Space-Time Video Super-Resolution

10/13/2021
by   Yuantong Zhang, et al.
Wuhan University

In this paper, we consider the task of space-time video super-resolution (ST-VSR), which simultaneously increases the spatial resolution and frame rate of a given video. Existing methods typically struggle either to efficiently leverage information from a large range of neighboring frames or to avoid speed degradation at inference when deformable ConvLSTM strategies are used for alignment. To address these problems, we propose a coarse-to-fine bidirectional recurrent neural network that leverages knowledge between adjacent frames without relying on ConvLSTM. Specifically, we first use bidirectional optical flow to update the hidden state and then employ a Feature Refinement Module (FRM) to refine the result. Since we can fully utilize a large range of neighboring frames, our method leverages local and global information more effectively. In addition, we propose an optical-flow-reuse strategy that reuses the intermediate flow of adjacent frames, which considerably reduces the computational burden of frame alignment compared with existing LSTM-based designs. Extensive experiments demonstrate that our optical-flow-reuse-based bidirectional recurrent network (OFR-BRN) is superior to state-of-the-art methods in terms of both accuracy and efficiency.


I Introduction

As the public's pursuit of audiovisual quality keeps rising, high-resolution (4K, 8K) and high-frame-rate (120 FPS, 240 FPS) televisions are becoming increasingly popular. However, many video sources still have relatively low frame rates and resolutions, so improving the resolution and frame rate of ordinary video is of great practical value. Space-time video super-resolution aims to transform a video with low spatial resolution and low frame rate into one with higher spatial and temporal resolution. [33] first proposed the concept of space-time super-resolution. Some early works are not based on deep learning: to mine the temporal and spatial information of video, these methods [29] [36] used hand-designed feature extractors (such as SVM, the Hough pyramid, etc.) to extract video features and then carried out pixel regularization. Many of these earlier methods rely on a series of strict mathematical assumptions, such as small scene changes, small motion between two frames, and consistent brightness of adjacent pixels. Real scenes almost never satisfy these conditions, and the reconstruction results of these traditional methods are often inferior when modeling diverse scenes, particularly those with large motion and inconsistent brightness. Moreover, most of these methods take a long time to process a video. With the development of deep learning, many CNN-based video super-resolution methods [40][13][20] and video frame interpolation methods [6][32][45] have made great progress. In principle, space-time super-resolution can be completed in two steps, performing video frame interpolation and video super-resolution independently, or by combining the two tasks and training them jointly. However, the two-step scheme undoubtedly ignores the correlation between temporal and spatial information. Furthermore, because the key to both VFI and VSR is the alignment of frames or features, both the two-step scheme and the joint training scheme inevitably perform alignment twice (once for VFI and once for VSR). This not only slows down processing but also introduces substantial parameter redundancy.

Fig. 1: Comparison of accuracy (PSNR) and speed (FPS) of different methods on the Vid4 [24] dataset. Our method is faster and more accurate than other state-of-the-art methods, such as Zooming [42], STARnet [14], and TMNet [43], while maintaining a relatively small number of parameters.

In the past few years, video frame interpolation and video super-resolution have made considerable progress, and spatio-temporal super-resolution has also received more attention recently. Several recent ST-VSR methods have been proposed to address the problems of two-stage methods and to better exploit the correlation between time and space. These methods perform VFI and VSR reconstruction on low-frame-rate, low-resolution videos simultaneously. STARnet [14] first estimates the optical flow between two adjacent frames, performs feature warping to interpolate the intermediate frame, and finally performs reconstruction to obtain a high-resolution frame. However, this method can only use information from two adjacent frames and fails to leverage information from distant frames, and its iteration-based optimization inevitably incurs high computation and memory costs. Xiang et al. [42] proposed a deformable ConvLSTM backbone and performed ST-VSR in the feature space. Compared with STARnet, this method supports relatively longer input sequences while costing less computation and memory. Based on Zooming Slow-Mo, Xu et al. further proposed TMNet [43], which can perform controllable frame interpolation at any intermediate moment. After careful study of the existing work, we design a bidirectional recurrent network for ST-VSR that makes better use of local and global information with high efficiency. Our contributions are summarized as follows:

  • An optical-flow-reuse-based bidirectional recurrent network is proposed, which effectively leverages long-range temporal knowledge and refined fusion of warped features. The proposed flow-reuse strategy reduces the temporal alignment computation cost by half compared with previous LSTM-based alignment methods.

  • A Feature Refinement Module (FRM) is used to enhance the hidden-state features in the bidirectional recurrence, bringing a performance improvement by refining the warped features with additional temporal information.

  • Benefiting from the above designs, our framework achieves the best restoration performance with the highest computing speed on public test sequences compared with other ST-VSR algorithms.

The remainder of the paper is organized as follows: Section II reviews the background and related works. Details of our proposed flow-reuse-based bidirectional network are given in Section III. Experiments and analysis are provided in Section IV. Finally, the paper is concluded in Section V.

II Related Work

We introduce the related work in three parts: video frame interpolation (VFI), video super-resolution (VSR), and space-time video super-resolution (ST-VSR).

II-A Video Frame Interpolation

The goal of video frame interpolation (VFI) is to synthesize nonexistent intermediate frames between consecutive frames. Many previous methods use two reference frames to interpolate the intermediate frame. Among flow-based methods, [27] proposed a fully convolutional network to estimate voxel flow and generate intermediate frames. Jiang et al. introduced a similar method [19] and used two U-Net architectures to compute bidirectional optical flow and visibility maps. Bao et al. [41] first linearly weight the optical flow and then use a depth-aware projection layer to adaptively blend the warped frames. Xu et al. [44] proposed QVI, which exploits four consecutive frames and estimates flow fields from the unknown target frame to the source frames. Different from these backward-warping methods, Niklaus et al. proposed SoftSplat [31], which forward-warps frames and their corresponding feature maps using softmax splatting.

Apart from using optical flow, another major trend in VFI is to replace the two-step interpolation operation with a convolution process. Niklaus et al. [32] use a pair of 1D kernels to perform spatially adaptive convolution to estimate the motion. Cheng et al. [6] proposed DSepConv, which uses deformable separable convolution to enlarge the receptive field of kernel-based methods, and further proposed EDSC [7] to perform multi-frame interpolation. Choi et al. [8] proposed CAIN, which replaces the optical flow computation module with a channel attention module to capture the motion information implicitly.

II-B Video Super-Resolution

Video super-resolution (VSR) aims at reconstructing high-resolution video from the corresponding low-resolution video. The key to VSR is how to fully use complementary information across frames. Most recent methods can be divided into two categories according to their generator networks: iterative [20] [40] [47] and recurrent [13], [17]. Specifically, TDAN [39] adopts deformable convolution modules (DCNs) [9] [52] to align different frames at the feature level. EDVR [40], as a representative method, further uses DCNs in a multi-scale pyramid and utilizes multiple attention layers to perform alignment and then integrate the features. PFNL [48] is a progressive fusion network with an improved non-local operation that avoids complex motion estimation and motion compensation and obtains favorable results in terms of both performance and complexity. Isobe et al. [18] proposed TGA, which divides temporal information into groups and utilizes both 2D and 3D residual blocks for inter-group fusion over a temporally sliding window.

For iteration-based methods, Tao et al. [38] proposed a sub-pixel motion compensation layer in a CNN framework and utilized a ConvLSTM [35] module to capture long-range temporal information. RBPN [13] extends image super-resolution [12] to VSR, feeding LR frames step by step into a projection module with a recurrent back-projection network. MTUDB [49] embeds convolutional long short-term memory (ConvLSTM) into ultra-dense residual blocks to construct a multi-temporal ultra-dense memory (MTUDM) network for video super-resolution. RSDN designs a structure-detail recurrent network to learn the low-frequency and high-frequency information of the image separately.

Still, most previous iteration-based or sliding-window methods forgo the assistance of subsequent LR frames. Some recent works [50], [5] show that it is essential to utilize both neighboring and long-distance LR frames (previous and subsequent) to reconstruct HR frames.

II-C Video Space-Time Super-Resolution

Space-time video super-resolution aims to transform a video with low spatial resolution and low frame rate into one with higher spatial and temporal resolution. Some of the earlier methods [33] [29], which are not based on deep learning, are slow and often fail to generate promising results on complex scenes. Recent deep-learning-based work has made great progress in both speed and quality. STARnet [14] leverages the mutually informative relationship between time and space with an optical flow estimation module [10] and performs feature warping of two consecutive frames to interpolate the intermediate LR frame. Zooming Slow-Mo [42] developed a unified framework with deformable ConvLSTM to align and aggregate temporal information, synthesizing the intermediate features with a bidirectional recurrent network before performing feature fusion for ST-VSR. Based on Zooming Slow-Mo, Xu et al. [43] proposed a temporal modulation network with a locally-temporal feature comparison module and deformable convolution kernels for controllable feature interpolation, which can interpolate arbitrary intermediate frames.

Fig. 2: A schematic representation of our proposed method. We first calculate the bidirectional optical flow from the intermediate frame and then reuse the optical flow between LRs according to the vector-addition rule. After that, we send the LRs into a bidirectional recurrent structure, perform forward and backward recurrence, send the result of each recurrence to the FRM module for further optimization, and then merge the feature map with the features extracted by the PCD module within the sliding window. Lastly, we apply PixelShuffle to reconstruct the HR frames.

III Proposed Method

Our network consists of three parts: 1) an optical flow estimation module, 2) a bidirectional recurrent neural network, and 3) a frame reconstruction module. We first compare two ways of estimating bidirectional optical flow and introduce the bidirectional recurrence structure. Then we discuss how to effectively fuse local and global information to obtain better performance. Finally, we describe the structure of the reconstruction module. The symbol table (Table I) lists the main symbols used in this paper. The structure of the entire recurrent network is shown in Fig. 2.

III-A Optical Flow Estimation Module

Stage 1: estimating the flow between two frames.
We employ optical flow to perform motion estimation and motion compensation. The whole reconstruction process of a high-resolution frame is an RNN-like structure. When reconstructing a certain frame, we use optical flow to align the hidden state, which contains forward and backward historical information, to the current frame. Note that our task is spatio-temporal super-resolution, which differs from ordinary video super-resolution: during the recurrence, we not only need to align the hidden state to the current low-resolution frame (LR), but also to the synthetic intermediate frame (SILR). Therefore, an appropriate intermediate flow estimation strategy is the key to aligning the intermediate hidden state. In theory, we can use any off-the-shelf optical flow estimation module to estimate the flow between two adjacent LR frames, but estimating the flow with respect to the intermediate frame is more difficult. Some previous works [44] [26] first compute bidirectional flows and refine them to get intermediate results by multiplying the optical flow by a time factor, while another way is to directly estimate the intermediate flows. To better illustrate the difference, we compare the two flow estimation methods in Figure 3, where we need the optical flow between the two input frames as well as the optical flow from the intermediate frame to each of them.
Method 1: we simply compute the flow between the two frames and multiply it by a time factor to obtain the intermediate result.

(1)
(2)

Method 2: we directly estimate the intermediate flows using a revised IFNet [16]. Specifically, because the resolution of the input LR is relatively low, we only downsample the LR twice and set the number of IFBlock channels to 64. As for the flow between two LR frames, we reuse the intermediate flows according to the law of vector addition.

(3)
(4)
(5)

In our experiments, we found that directly estimating the optical flow of the intermediate frames is better than deriving it from the optical flow between LRs. We verify this in the ablation study.
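To make the two strategies concrete, the following is a minimal PyTorch sketch. The function names, flow-direction conventions, and the simple linear approximation in method 1 are illustrative assumptions, not the paper's exact formulation (which uses a revised IFNet for method 2); the flow composition shows the vector-addition rule used for reuse.

```python
import torch
import torch.nn.functional as F

def backward_warp(x, flow):
    """Sample x (N,C,H,W) at positions displaced by flow (N,2,H,W), i.e. x(p + flow(p))."""
    n, _, h, w = x.shape
    gy, gx = torch.meshgrid(torch.arange(h, device=x.device),
                            torch.arange(w, device=x.device), indexing="ij")
    base = torch.stack((gx, gy), dim=0).float()              # (2,H,W), (x, y) order
    coords = base.unsqueeze(0) + flow                        # absolute sampling positions
    grid = torch.stack((2.0 * coords[:, 0] / (w - 1) - 1.0,  # normalise to [-1, 1]
                        2.0 * coords[:, 1] / (h - 1) - 1.0), dim=-1)
    return F.grid_sample(x, grid, align_corners=True)

def intermediate_flow_linear(flow_0to1, flow_1to0, t=0.5):
    """Method 1: approximate the intermediate flows by linearly scaling the
    frame-to-frame flows with the time factor t (a linear-motion assumption)."""
    return t * flow_1to0, (1.0 - t) * flow_0to1              # approx. F_{t->0}, F_{t->1}

def compose_flows(flow_a2b, flow_b2c):
    """Vector addition of flow fields: F_{a->c}(p) = F_{a->b}(p) + F_{b->c}(p + F_{a->b}(p)).
    Method 2 estimates the intermediate flows directly (IFNet-style) and reuses them by
    composing flows head-to-tail instead of running the estimator again."""
    return flow_a2b + backward_warp(flow_b2c, flow_a2b)
```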

Fig. 3: Comparison of two flow estimation methods. The figure on the left (a) shows estimating the optical flow between LRs directly, and the figure on the right (b) shows directly estimating the optical flow of the intermediate frame.

Symbol Definition
Low-resolution frame of the input sequence
Synthetic intermediate frame of low resolution
Hidden state of a low-resolution frame for backward or forward recurrence
The above, refined through stacked residual blocks
Hidden state of a synthetic intermediate frame for backward or forward recurrence
The above, refined through stacked residual blocks
Network that processes forward or backward propagation
Alignment from coarse (flow warping) to fine (feature refinement module)
Estimated forward or backward flow
Flow warping
TABLE I: A list of the notation mainly used in this paper

Stage 2: estimating bidirectional flow across frames.
In stage 1, we described how to estimate the flow between two consecutive frames; now we describe how to estimate bidirectional flow across frames. Given a sequence of consecutive low-resolution frames, we extract two subsequences, one obtained by dropping the last frame and one by dropping the first, as shown in Fig. 2. For each low-resolution frame (LR), we estimate the optical flow between the corresponding positions in the two subsequences with a flow estimation method as mentioned above, and refer to the results as the forward flow and the backward flow.

(6)
(7)

So far, we have finished the estimation of optical flow and prepared the inputs for the bidirectional recurrent network.
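As a small illustration, the per-sequence flows can be pre-computed by pairing the two shifted subsequences; `estimate_flow` below is a placeholder for the flow module of this section, not a specific API, and is assumed to return the flow from its first argument to its second.

```python
def precompute_bidirectional_flows(frames, estimate_flow):
    """frames: list of N tensors (1,C,H,W). Returns two lists of length N-1:
    flows used by the forward recurrence and by the backward recurrence."""
    prev_seq, next_seq = frames[:-1], frames[1:]   # the two shifted subsequences
    # forward recurrence propagates the hidden state forward in time, so it backward-warps
    # with the flow from the current frame to the previous one (and vice versa).
    forward_flows = [estimate_flow(nxt, prev) for prev, nxt in zip(prev_seq, next_seq)]
    backward_flows = [estimate_flow(prev, nxt) for prev, nxt in zip(prev_seq, next_seq)]
    return forward_flows, backward_flows
```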

III-B Bidirectional Recurrent Neural Network


In this section, we first introduce the basic structure of the bidirectional recurrent network, then describe how to apply coarse-to-fine alignment (flow warping followed by the feature refinement module) across the frames of the whole input sequence. Finally, we explore how to efficiently combine local and global information to obtain better performance.

III-B1 Basic Structure of the Bidirectional Recurrent Network

As shown in Fig. 2, the overall structure is a bidirectional recurrent network. It is an RNN-like structure, but unlike a conventional RNN, the input video frames are propagated both forward and backward. In this process, the information gained from other frames is transferred through a hidden state. In our bidirectional setting, both frames and features are propagated in two opposite directions, so that any LR in the input sequence can leverage knowledge from any other frame, and any synthetic intermediate frame (SILR) can likewise leverage information from neighboring frames and the hidden state. Specifically, we perform two recurrent processes: forward recurrence and backward recurrence. For clarity, we describe the propagation of LR and SILR separately. The backward recurrence of the LR can be described as:

(8)

In the process of recurrence, the hidden state passes through the "pipeline", and we must apply motion estimation and motion compensation (MEMC) to align it to the current frame. Specifically, we first conduct an alignment with flow warping to get a coarse result. Note that we conduct backward warping; therefore, the direction of the optical flow is opposite to the direction of recurrence. After warping, we apply a feature refinement module (FRM) to further optimize the hidden state, which will be introduced in detail later. After this alignment, the aligned feature is first concatenated with the corresponding LR and then fed into a stacked fusion residual block to get the refined result.

The forward recurrence of LR can be described as:

(9)

The forward recurrence proceeds in essentially the same way as the backward recurrence: MEMC is performed first, followed by refinement. The only difference is that the refined results of the backward recurrence are concatenated with the LR and the corresponding forward hidden state before being sent to the refining network.
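Below is a minimal sketch of one recurrence step, assuming that `warp_fn` is a backward-warping routine such as the one sketched in Section III-A, `frm` is a module with the interface of the Feature Refinement Module in Section III-B2, and the channel widths and block counts are illustrative rather than the paper's settings.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, 1, 1), nn.ReLU(inplace=True),
                                  nn.Conv2d(ch, ch, 3, 1, 1))

    def forward(self, x):
        return x + self.body(x)

class RecurrenceCell(nn.Module):
    """One recurrence step: coarse flow warping, FRM refinement, then residual fusion."""
    def __init__(self, warp_fn, frm, in_ch, feat_ch=64, num_blocks=5):
        super().__init__()
        self.warp, self.frm = warp_fn, frm
        self.fuse = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, 1, 1), nn.LeakyReLU(0.1, inplace=True),
            *[ResidualBlock(feat_ch) for _ in range(num_blocks)])

    def forward(self, lr, hidden, flow, backward_refined=None):
        aligned = self.warp(hidden, flow)      # coarse: align the hidden state to this frame
        refined = self.frm(lr, aligned)        # fine: feature refinement module
        feats = [lr, refined]
        if backward_refined is not None:       # the forward pass also sees the refined
            feats.append(backward_refined)     # result of the backward recurrence
        return self.fuse(torch.cat(feats, dim=1))
```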

Next, we describe the recurrence of the intermediate state in the same manner. Nevertheless, it should be pointed out that we cannot access the ground truth of the SILR, so in the MEMC procedure we align the hidden state not only in feature space but also in frame space, in order to stay consistent with the recurrence of the LR frames. The backward recurrence of the intermediate state (HSI) can be described as:

(10)

The forward recurrence of the intermediate state (HSI) can be described as:

(11)

The recurrence process of the intermediate state roughly follows the same idea as that of the LR frames. However, due to possible occlusion of objects and camera panning in boundary regions of the image, simply mixing the features from the backward and forward recurrences may introduce errors. To reduce such errors, we introduce two masks that reveal the occlusion areas between the two adjacent frames. Points with a value of 0 in a mask correspond to pixels that exist in the estimated intermediate frame but disappear in a neighboring frame due to occlusion or motion. The masks are applied with the Hadamard product:

(12)
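One plausible form of this mask-weighted blending is sketched below; the normalization and the exact use of the two masks are assumptions, since only the role of the masks is described above.

```python
def blend_intermediate_states(h_forward, h_backward, mask_fwd, mask_bwd, eps=1e-6):
    """Blend the forward- and backward-recurrence estimates of the intermediate hidden
    state, down-weighting pixels that a mask flags (value 0) as occluded or out of view."""
    weighted = mask_fwd * h_forward + mask_bwd * h_backward   # Hadamard products
    return weighted / (mask_fwd + mask_bwd + eps)             # renormalise the valid regions
```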

III-B2 Feature Refinement Module


Thanks to the bidirectional recurrent structure, any frame in the input sequence can obtain an information gain from any other frame. However, the feature map of the hidden state brings not only an information gain but also some noise caused by inaccurate alignment and occlusion. One of the most direct ways to solve this problem is to replace the RNN with an LSTM. Indeed, since an LSTM can capture longer dependencies and suppress noise, most work in recent years has used LSTM variants such as ConvLSTM [21] and deformable ConvLSTM [42]. We chose not to use an LSTM because, compared with an RNN, an LSTM needs to store more intermediate states during recurrence (forget gate, input gate, output gate, and cell state), whereas an RNN only needs to store the hidden state; an LSTM therefore occupies more memory. In other words, under the same hardware conditions, an LSTM-based structure can only handle shorter input sequences than an RNN-based one. Some recent VSR works [5] [50] pointed out that longer input sequences benefit from more long-term information and therefore achieve better performance. With this in mind, we carefully consider how to keep memory usage low while still suppressing noise.

That is to say, we need a structure that can adaptively measure the relevance between the current candidate frame and the hidden state, and decide which parts of the hidden state should be highlighted and which should be suppressed. Starting from [3], many works have studied how to dynamically generate convolution kernels and compute the similarity of two input tensors. [17] proposed RSDN and designed a hidden-state adaptation module that allows the current frame to selectively use useful information from the hidden state. [23] proposed MuCAN, which uses a temporal multi-correspondence aggregation strategy and a cross-scale non-local correspondence aggregation scheme to exploit the self-similarity of images across scales.

Similar to [11], which proposed a 'correlation layer' that performs multiplicative patch comparisons between two feature maps, we compute the correlation between the feature map of the LR and the aligned hidden state. Specifically, we first feed the LR into a Conv-LeakyReLU layer; then we compute the local correlation between the LR feature and the hidden state for each channel. The correlation between the LR feature centered at x and the hidden state centered at y can be described as:

(13)

where the offset is limited to a square patch. After that, we apply a sigmoid activation to the correlation matrix, transforming it into a matrix whose values lie in the range [0, 1], and then perform element-wise multiplication between the hidden state and this matrix to get the optimized hidden feature. Finally, we concatenate the optimized result with the hidden state.
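A hedged sketch of such a refinement module is given below. The exact correlation (per-channel versus channel-summed) and the reduction over patch offsets are not fully specified above, so this version gates the hidden state with the strongest channel-summed local response; it also assumes the LR feature and the hidden state share the same channel width.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureRefinementModule(nn.Module):
    """Gate the aligned hidden state by its local correlation with shallow LR features,
    then concatenate the gated state with the original one (a sketch of the FRM)."""
    def __init__(self, img_ch=3, feat_ch=64, radius=2):
        super().__init__()
        self.extract = nn.Sequential(nn.Conv2d(img_ch, feat_ch, 3, 1, 1),
                                     nn.LeakyReLU(0.1, inplace=True))
        self.radius = radius

    def forward(self, lr, hidden):
        feat = self.extract(lr)                                     # (N,C,H,W), C = feat_ch
        n, c, h, w = hidden.shape
        k = 2 * self.radius + 1
        # gather the (2r+1)^2 shifted copies of the hidden state around every position
        patches = F.unfold(hidden, k, padding=self.radius).view(n, c, k * k, h, w)
        # channel-summed dot product between the LR feature and each shifted hidden state
        corr = (feat.unsqueeze(2) * patches).sum(dim=1)             # (N, k*k, H, W)
        gate = torch.sigmoid(corr.max(dim=1, keepdim=True).values)  # relevance map in [0, 1]
        return torch.cat((hidden * gate, hidden), dim=1)            # gated state plus raw state
```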

III-B3 Hybrid of Iteration and Sliding Window


As the two main frameworks in the VSR field, the sliding-window and iteration methods reflect different emphases on local and global information. When a frame is restored, the neighboring frames can provide more information, while distant frames contain relatively less useful information. To make comprehensive use of local and global information at the same time, we adopt a hybrid of iteration and sliding window. The iteration part has been discussed in detail above, namely applying optical flow warping and the Feature Refinement Module (FRM) to update the hidden state. For the sliding-window features, inspired by the design of EDVR [40], we use a PCD (Pyramid, Cascading and Deformable) alignment block to extract features from the LRs within the sliding window. Specifically, we first use a multi-layer stacked residual network to extract the features and then align the neighboring frames to the current reference frame with pyramid, cascading, and deformable convolution. We set the window size to five (five consecutive LRs). After that, we fuse the feature maps with a simple convolutional layer. Finally, we concatenate and fuse the obtained features in the channel dimension with the features obtained by the bidirectional network. Through experiments, we found that this hybridization makes better use of local and global information while exploiting the advantages of both kernel-based and flow-based methods.
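The following sketch shows how such a hybrid fusion could be wired up; `pcd_align` stands in for an EDVR-style PCD alignment block (not implemented here), the window is clamped at sequence boundaries, and the fusion convolutions are illustrative.

```python
import torch
import torch.nn as nn

class HybridFusion(nn.Module):
    """Fuse sliding-window (local) features with the recurrent (global) feature."""
    def __init__(self, pcd_align, feat_ch=64, window=5):
        super().__init__()
        self.pcd_align = pcd_align
        self.window = window
        self.fuse_window = nn.Conv2d(window * feat_ch, feat_ch, 1)
        self.fuse_all = nn.Conv2d(2 * feat_ch, feat_ch, 1)

    def forward(self, frame_feats, recurrent_feat, idx):
        r, n = self.window // 2, len(frame_feats)
        neighbours = [frame_feats[min(max(idx + d, 0), n - 1)] for d in range(-r, r + 1)]
        aligned = [self.pcd_align(nb, frame_feats[idx]) for nb in neighbours]  # align to reference
        local = self.fuse_window(torch.cat(aligned, dim=1))        # sliding-window feature
        return self.fuse_all(torch.cat((local, recurrent_feat), dim=1))
```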

III-C Reconstruction Module


By now, we have obtained hidden states that contain the temporal and spatial features of the LR and SILR frames, and we perform spatial reconstruction on these features. Specifically, we feed the reconstructed feature maps into two sub-pixel upscaling modules with PixelShuffle [34] and finally output the reconstructed HR video frames.
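A minimal sketch of such a 4x reconstruction head is shown below; the channel widths and activation are illustrative assumptions.

```python
import torch.nn as nn

def make_reconstruction_head(feat_ch=64, out_ch=3):
    """Two sub-pixel (PixelShuffle) x2 upscaling stages followed by an RGB projection."""
    return nn.Sequential(
        nn.Conv2d(feat_ch, feat_ch * 4, 3, 1, 1), nn.PixelShuffle(2), nn.LeakyReLU(0.1, True),
        nn.Conv2d(feat_ch, feat_ch * 4, 3, 1, 1), nn.PixelShuffle(2), nn.LeakyReLU(0.1, True),
        nn.Conv2d(feat_ch, out_ch, 3, 1, 1))
```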

III-D Implementation Details

We use an IFNet [16] pre-trained on Vimeo [46] as our flow estimation module. We randomly crop sequences of down-sampled image patches of size 64 × 64, take the odd-indexed LR frames as inputs, and use the corresponding consecutive HR sequence of size 256 × 256 as supervision. During training, we adopt the Adam optimizer with β1 = 0.9 and β2 = 0.999 and apply standard augmentation such as rotation, flipping, and random cropping. The initial learning rates of the flow estimator and of the other parts are set separately and decay with cosine annealing [28]. The batch size is set to 24, and we train the model on 8 Nvidia 1080Ti GPUs. We initialize the parameters of our network with Kaiming initialization [15], except for the pre-trained weights (IFNet).
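A hedged sketch of this training setup follows; the concrete learning-rate values are placeholders rather than the paper's numbers, and the parameter grouping is an assumption.

```python
import torch
import torch.nn as nn

def init_weights(model):
    """Kaiming initialization for the non-pretrained convolution layers."""
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="leaky_relu")
            if m.bias is not None:
                nn.init.zeros_(m.bias)

def build_optimizer(flow_params, other_params, total_iters):
    lr_flow, lr_other = 1e-5, 2e-4        # placeholder values, not the paper's settings
    optimizer = torch.optim.Adam(
        [{"params": flow_params, "lr": lr_flow},
         {"params": other_params, "lr": lr_other}],
        betas=(0.9, 0.999))
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
        optimizer, T_max=total_iters, eta_min=1e-7)
    return optimizer, scheduler
```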

IV Experiments

In this section, we conduct experiments on three widely used datasets for VSR and VFI: Vid4 [24], Vimeo [46], and REDS [30].

IV-A Experimental Setup

Method | Vid4 (PSNR/SSIM) | Vimeo-Fast (PSNR/SSIM) | Vimeo-Medium (PSNR/SSIM) | Vimeo-Slow (PSNR/SSIM) | Speed (fps) | Parameters (millions)
SuperSloMo+Bicubic | 22.84/0.5772 | 31.88/0.8793 | 29.94/0.8477 | 28.73/0.8102 | - | 19.8
SuperSloMo+RCAN | 23.80/0.6397 | 34.52/0.9076 | 32.50/0.8844 | 30.69/0.8624 | 1.91 | 19.8+16.0
SuperSloMo+RBPN | 23.76/0.6362 | 34.73/0.9108 | 32.79/0.8930 | 30.48/0.8354 | 1.55 | 19.8+12.7
SuperSloMo+EDVR | 24.40/0.6706 | 35.05/0.9136 | 33.85/0.8967 | 30.99/0.8673 | 4.94 | 19.8+20.7
SepConv+Bicubic | 23.51/0.6273 | 32.27/0.8890 | 30.61/0.8633 | 29.04/0.8290 | - | 21.7
SepConv+RCAN | 24.92/0.7236 | 34.97/0.9195 | 33.59/0.9125 | 32.13/0.8967 | 1.86 | 21.7+16.0
SepConv+RBPN | 26.08/0.7751 | 35.07/0.9238 | 34.09/0.9229 | 32.77/0.9090 | 1.51 | 21.7+12.7
SepConv+EDVR | 25.93/0.7792 | 35.23/0.9252 | 34.22/0.9240 | 32.96/0.9112 | 4.96 | 21.7+20.7
DAIN+Bicubic | 23.55/0.6268 | 32.41/0.8910 | 30.67/0.8636 | 29.06/0.8289 | - | 24.0
DAIN+RCAN | 25.03/0.7261 | 35.27/0.9242 | 33.82/0.9146 | 32.26/0.8974 | 1.84 | 24.0+16.0
DAIN+RBPN | 25.96/0.7784 | 35.55/0.9300 | 34.45/0.9262 | 32.92/0.9097 | 1.43 | 24.0+12.7
DAIN+EDVR | 26.12/0.7836 | 35.81/0.9323 | 34.66/0.9281 | 33.11/0.9119 | 4.00 | 24.0+20.7
STARnet | 26.06/0.8046 | 36.19/0.9368 | 34.86/0.9356 | 33.10/0.9164 | 10.54 | 111.61
Zooming Slow-Mo | 26.31/0.7976 | 36.81/0.9415 | 35.41/0.9361 | 33.36/0.9138 | 12.40 | 11.10
TMNet | 26.43/0.8016 | 37.04/0.9435 | 35.60/0.9380 | 33.51/0.9159 | 11.67 | 12.26
OFR-BRN (ours) | 27.09/0.8242 | 37.14/0.9476 | 35.69/0.9399 | 33.46/0.9188 | 22.55 | 11.77
OFR-BRN+ (ours) | 26.87/0.8294 | 37.30/0.9487 | 35.83/0.9431 | 33.60/0.9202 | 13.80 | 13.84
TABLE II: Comparison of PSNR, SSIM, speed (fps), and number of parameters (millions) of two-stage (VFI+VSR) and one-stage ST-VSR methods on Vimeo and Vid4

Datasets We adopt the Vimeo-90K-T septuplet trainset [46] for training. Vimeo-90K contains 91,701 video sequences, each consisting of 7 frames, with HR frames at a resolution of 448 × 256. We follow the setting of [42] and divide the Vimeo-90K-T testing dataset into 3 categories according to the average motion magnitude: fast motion, medium motion, and slow motion, which include 1225, 4977, and 1613 video clips, respectively. For a fair comparison, we removed 5 video clips from the original medium-motion set and 3 clips from the slow-motion set because these clips contain only all-black frames, which would lead to infinite PSNR values. We also test on Vid4, which contains four scenes and is widely used in VSR. Finally, to verify the robustness of our method across datasets, we also test the model on REDS, which is very challenging due to its diverse scenes and large motions. We follow the experimental setup of [42], generating LR frames with a downsampling factor of 4 and using odd-indexed LR frames as input to predict the corresponding consecutive HR frames and synthesize the intermediate HR frames.
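A small sketch of this data protocol is given below; bicubic downsampling and 0-based indexing (so the paper's odd-indexed frames become every other frame starting from index 0) are assumptions.

```python
import torch.nn.functional as F

def make_stvsr_pair(hr_frames, scale=4):
    """hr_frames: tensor (T, C, H, W). Returns (odd-indexed LR inputs, full HR targets)."""
    lr = F.interpolate(hr_frames, scale_factor=1.0 / scale,
                       mode="bicubic", align_corners=False)
    lr_inputs = lr[::2]    # keep every other frame as the low-frame-rate LR input
    return lr_inputs, hr_frames
```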
Evaluation

We adopt the Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index (SSIM) for evaluation, as they are widely used for VFI and VSR. We also report the model size and inference FPS of the different networks to compare efficiency. Note that we compute inference FPS over the entire Vid4 dataset, measured on one Nvidia 1080Ti GPU.
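For reference, the PSNR used here can be computed as below (frames assumed to be float arrays in [0, 1]); SSIM can be obtained with, e.g., skimage.metrics.structural_similarity.

```python
import numpy as np

def psnr(pred, target, eps=1e-12):
    """PSNR in dB between two images or frame stacks with values in [0, 1]."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    return 10.0 * np.log10(1.0 / (mse + eps))
```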

IV-B Comparison with State-of-the-Art Methods

Comparison methods We provide two models for comparison with other methods: the feature-fusion model (OFR-BRN+) and the light model (OFR-BRN). The light model does not concatenate the sliding-window feature of Section III-B3 before frame reconstruction. We compare our models with state-of-the-art two-stage methods (VFI followed by VSR) and one-stage ST-VSR methods. For the two-stage methods, we perform video frame interpolation (VFI) with SuperSloMo [19],[2] or SepConv [9], and video super-resolution (VSR) with Bicubic Interpolation (BI), RCAN [51], RBPN [13], or EDVR [40]. For one-stage ST-VSR models, we compare our network with the recent state-of-the-art methods Zooming Slow-Mo [42], STARnet [14], and TMNet [43]. For training, we use the Vimeo-90K trainset [46], feed odd-indexed LR frames into the model, and reconstruct HR frames corresponding to the entire sequence. All these methods are trained on Vimeo-90K and evaluated on the Vimeo-90K test set and the Vid4 [24] dataset.

Objective metrics We list the quantitative results in Table II. Red and blue colors indicate the best and second-best performance. Following the suggestion of [42],[43], we omit baseline models with Bicubic Interpolation when comparing speed. The light model outperforms the second-best method by 0.68 dB on Vid4 and runs about 2× faster than other SOTA ST-VSR methods. On the Vimeo dataset, the light model also outperforms the other methods on the quantitative indicators, except for PSNR on Vimeo-Slow. Through further experiments, we found that fusing sliding-window features improves performance, especially for short input sequences. Most importantly, our fusion model still keeps a relatively fast speed while maintaining high performance. In addition, the parameter counts of both models remain close to those of other SOTA methods. These results show that our model is quite competitive in performance, running speed, and parameter size.

Fig. 4: Qualitative comparison on Vid4 (ours, STARnet [14], Zooming [42], TMNet [43], and HR).

Further experiments on REDS To further verify the robustness of our method across datasets, we trained the model on Vimeo and tested it on REDS [30]. Compared with Vimeo and Vid4, REDS has a larger resolution (720×1280), more complex and diverse scenes, and more significant motion, so it is closer to the super-resolution requirements of real scenes. We set up the test following [40] and compare the PSNR, SSIM, and running speed of different ST-VSR schemes.

ST-VSR method | PSNR | SSIM | Speed (fps)
STARnet | 26.39 | 0.7444 | 4.98
Zooming Slow-Mo | 26.68 | 0.7427 | 5.56
TMNet | 26.76 | 0.7354 | 5.18
OFR-BRN (ours) | 27.36 | 0.7634 | 8.44
TABLE III: Performance comparison of different ST-VSR methods on REDS. All models are trained on Vimeo and tested on the REDS test set.

As can be seen in Table III, our model outperforms the other ST-VSR methods in both accuracy and speed. Moreover, our method often achieves better subjective visual quality when synthesizing intermediate frames. We provide more subjective comparisons in the appendix.

IV-C Ablation Study

Having demonstrated our model's superiority over existing one-stage and two-stage frameworks, we now examine the different modules of our network.

Specifically, we mainly focus on 1) different recurrence strategies, 2) different flow estimation methods, 3) the effectiveness of the Feature Refinement Module (FRM), and 4) the influence of the feature fusion space. The results of the ablation study on the different modules are listed in Table V.
1. One-way recurrence vs. bidirectional recurrence To test the usefulness of the bidirectional mechanism, we removed the backward recurrent branch and call the result one-way recurrence. The ablation results are shown in columns (c) and (d) of Table V. The accuracy of one-way recurrence is much lower than that of bidirectional recurrence. In a one-way recurrence, every frame can only leverage knowledge from previous frames and cannot make use of subsequent frames, so one-way recurrence causes quite severe performance degradation.
2. Flow-reuse strategy vs. naive flow estimation To illustrate the effectiveness of the flow-reuse strategy (intermediate flow estimation), we experimented with the two optical flow estimation schemes described in Section III-A. When applying naive flow estimation between LRs (method 1 in III-A), we employ a pretrained PWC-Net [37] as the flow estimator.

Frame type | PWC-Net (PSNR/SSIM) | IFNet (PSNR/SSIM)
Odd frames | 36.27/0.9401 | 36.72/0.9447
Even frames | 33.05/0.9217 | 33.77/0.9300
TABLE IV: Quantitative results on odd and even frames on Vimeo

First of all, we need to confirm that reusing the optical flow does not affect the alignment of the odd frames (LR). As shown in Table IV, we divide the reconstructed frames into odd frames (LR) and even frames (SILR). The results on odd frames (LR) show that, compared with estimating the optical flow between LRs, optical flow reuse does not affect the alignment of the LRs. From the results on even frames, we find that the intermediate flow estimation method achieves better performance. We attribute the improvement to directly estimating the intermediate flows, which can better fit the non-linear motion between frames than linear motion estimation.
3. Effectiveness of the Feature Refinement Module In fact, it is feasible to use only optical flow to perform alignment between frames. To verify the effectiveness of the feature refinement module, we remove it and compare the performance with the original network on Vid4 and Vimeo. The results are shown in columns (b) and (d) of Table V.

Method | (a) | (b) | (c) | (d) | (e)
Bidirectional recurrence
Intermediate flow estimation
Feature refinement module
Sliding-window feature
Vid4 | 26.26 | 26.70 | 26.28 | 27.09 | 26.87
Vimeo | 34.89 | 35.11 | 34.30 | 35.46 | 35.60
TABLE V: Ablation on different modules
Fig. 5: Visual result of the ablation on the feature refinement module

Since our network adopts a recurrent structure, when reconstructing a frame we hope to align the hidden state, which contains knowledge from other frames, to the state of the current LR in order to obtain an information gain. However, the optical flow is almost never completely accurate, and flow warping may introduce noise while performing MEMC, especially for images with complex textures. Therefore, we want the information in the aligned hidden state to be as relevant as possible to the current frame. Figure 5 shows a typical visual comparison; note the cluttered branches indicated by the arrows. The restored results without the feature refinement module show obvious noise points. After applying feature refinement, our model reconstructs more details, especially for images with complex textures.
4. Analysis of fusion space Since we cannot access the ground truth of the even-numbered frames (SILR), we must explore a reasonable way to estimate the intermediate state as accurately as possible. The simplest idea is to directly blend the warped color frames in image space to produce the intermediate frame; this approach is commonly used in image stitching [1, 4], video extrapolation [22], and video stabilization [25]. However, image-space fusion easily leads to ghosting and checkerboard artifacts. To avoid these negative effects, some works [50, 39] pointed out that fusion in feature space achieves better results. To this end, we explored the effect of fusion in different spaces, namely: a) image-space fusion, b) feature-space fusion, and c) hybrid-space fusion. When performing image-space fusion, we use the bidirectional optical flow to warp the two adjacent video frames, average the results of the bidirectional warping, and then use a 1×1 convolution to keep the channel dimension of the feature consistent with the LR (odd frame). When applying feature-space fusion, we use the bidirectional optical flow to warp the adjacent hidden states and likewise average the results from both sides. When performing hybrid-space fusion, we concatenate the results of a) and b) in the channel dimension and use a 1×1 convolution so that the final channel dimension is consistent with the odd-numbered frames. Quantitative results on the Vimeo test set are listed in Table VI.

Fusion space | (a) Image-space | (b) Feature-space | (c) Hybrid-space
PSNR | 35.43 | 35.48 | 35.60
SSIM | 0.9302 | 0.9385 | 0.9393
TABLE VI: Quantitative evaluation of fusion spaces on Vimeo

Hybrid-space fusion achieves the best results of the three variants. This may be because, with image-space fusion alone, the feature dimension is too low, while the hidden state inevitably loses information during recurrence, so feature-space fusion alone cannot fully exploit all the features. We therefore apply hybrid-space fusion to achieve the best performance.
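To make the three fusion variants concrete, here is a hedged sketch; `warp` is a backward-warping routine such as the one in Section III-A, and the channel widths are illustrative.

```python
import torch
import torch.nn as nn

class HybridSpaceFusion(nn.Module):
    """Estimate the intermediate state by fusing image-space and feature-space warps."""
    def __init__(self, warp, img_ch=3, feat_ch=64):
        super().__init__()
        self.warp = warp
        self.img_proj = nn.Conv2d(img_ch, feat_ch, 1)   # lift the image-space blend to feat_ch
        self.mix = nn.Conv2d(2 * feat_ch, feat_ch, 1)   # 1x1 conv for the hybrid variant

    def forward(self, frame_prev, frame_next, hs_prev, hs_next, flow_t2prev, flow_t2next):
        # (a) image space: warp the two adjacent frames to time t and average them
        img_blend = 0.5 * (self.warp(frame_prev, flow_t2prev) + self.warp(frame_next, flow_t2next))
        img_feat = self.img_proj(img_blend)
        # (b) feature space: warp the two adjacent hidden states to time t and average them
        feat_blend = 0.5 * (self.warp(hs_prev, flow_t2prev) + self.warp(hs_next, flow_t2next))
        # (c) hybrid: concatenate both and project back to the working channel width
        return self.mix(torch.cat((img_feat, feat_blend), dim=1))
```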

V Conclusion

In this paper, we propose an efficient and accurate structure for space-time video super-resolution. Thanks to our flow-reuse strategy and coarse-to-fine feature refinement module, our model considerably improves speed and performance compared with previous state-of-the-art methods, particularly when estimating extreme motions. We also discussed how to integrate local and global information so as to reconstruct HR frames both quickly and well, adapting to different needs.

References

  • [1] A. Agarwala, M. Dontcheva, M. Agrawala, S. Drucker, A. Colburn, D. Salesin, and M. Cohen (2012) Bibliography. Cited by: §IV-C.
  • [2] W. Bao, W. Lai, C. Ma, X. Zhang, Z. Gao, and M. Yang (2019) Depth-aware video frame interpolation. In

    IEEE Conference on Computer Vision and Pattern Recognition

    ,
    Cited by: §IV-B.
  • [3] B. D. Brabandere, X. Jia, T. Tuytelaars, and L. V. Gool (2016) Dynamic filter networks. CoRR abs/1605.09673. External Links: Link, 1605.09673 Cited by: §III-B2.
  • [4] T. M. Camera (2016) Image alignment and stitching: a tutorial. Cited by: §IV-C.
  • [5] K. Chan, X. Wang, K. Yu, C. Dong, and C. L. Chen (2020) BasicVSR: the search for essential components in video super-resolution and beyond. Cited by: §II-B, §III-B2.
  • [6] X. Cheng and Z. Chen (2020) Video frame interpolation via deformable separable convolution.

    Proceedings of the AAAI Conference on Artificial Intelligence

    34 (7), pp. 10607–10614.
    Cited by: §I, §II-A.
  • [7] X. Cheng and Z. Chen (2020) Multiple video frame interpolation via enhanced deformable separable convolution. CoRR abs/2006.08070. External Links: Link, 2006.08070 Cited by: §II-A.
  • [8] M. Choi, H. Kim, B. Han, N. Xu, and K. M. Lee (2020) Channel attention is all you need for video frame interpolation. Proceedings of the AAAI Conference on Artificial Intelligence 34 (7), pp. 10663–10671. Cited by: §II-A.
  • [9] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. IEEE. Cited by: §II-B, §IV-B.
  • [10] P. Fischer, A. Dosovitskiy, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox (2016) FlowNet: learning optical flow with convolutional networks. In 2015 IEEE International Conference on Computer Vision (ICCV), Cited by: §II-C.
  • [11] P. Fischer, A. Dosovitskiy, E. Ilg, P. Häusser, C. Hazırbaş, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox (2015) FlowNet: learning optical flow with convolutional networks. External Links: 1504.06852 Cited by: §III-B2.
  • [12] M. Haris, G. Shakhnarovich, and N. Ukita (2018) Deep back-projection networks for super-resolution. arXiv. Cited by: §II-B.
  • [13] M. Haris, G. Shakhnarovich, and N. Ukita (2019) Recurrent back-projection network for video super-resolution. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I, §II-B, §II-B, §IV-B.
  • [14] M. Haris, G. Shakhnarovich, and N. Ukita (2020) Space-time-aware multi-resolution video enhancement. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Fig. 1, §I, §II-C, Fig. 4, §IV-B, Fig. 6, Fig. 7, Fig. 8, Fig. 9.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2015)

    Delving deep into rectifiers: surpassing human-level performance on imagenet classification

    .
    In CVPR, Cited by: §III-D.
  • [16] Z. Huang, T. Zhang, W. Heng, B. Shi, and S. Zhou (2021) RIFE: real-time intermediate flow estimation for video frame interpolation. External Links: 2011.06294 Cited by: §III-A, §III-D.
  • [17] T. Isobe, X. Jia, S. Gu, S. Li, S. Wang, and Q. Tian (2020) Video super-resolution with recurrent structure-detail network. Cited by: §II-B, §III-B2.
  • [18] T. Isobe, S. Li, X. Jia, S. Yuan, G. Slabaugh, C. Xu, Y. L. Li, S. Wang, and Q. Tian (2020) Video super-resolution with temporal group attention. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §II-B.
  • [19] H. Jiang, D. Sun, V. Jampani, M. H. Yang, E. Learned-Miller, and J. Kautz Super slomo: high quality estimation of multiple intermediate frames for video interpolation. Cited by: §II-A, §IV-B.
  • [20] Y. Jo, S. W. Oh, J. Kang, and S. J. Kim (2018) Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I, §II-B.
  • [21] S. Kim, S. Hong, M. Joh, and S. K. Song (2017) DeepRain: convlstm network for precipitation prediction using multichannel radar data. Cited by: §III-B2.
  • [22] S. Lee, J. Lee, B. Kim, K. Kim, and J. Noh (2019) Video extrapolation using neighboring frames. ACM Transactions on Graphics (TOG) 38 (3), pp. 1–13. Cited by: §IV-C.
  • [23] W. Li, X. Tao, T. Guo, L. Qi, and J. Jia (2020) MuCAN: multi-correspondence aggregation network for video super-resolution. Cited by: §III-B2.
  • [24] C. Liu and D. Sun (2011) A bayesian approach to adaptive video super resolution. In CVPR 2011, Vol. , pp. 209–216. External Links: Document Cited by: Fig. 1, §IV-B, §IV.
  • [25] Y. L. Liu, W. S. Lai, M. H. Yang, Y. Y. Chuang, and J. B. Huang (2021) Hybrid neural fusion for full-frame video stabilization. Cited by: §IV-C.
  • [26] Y. Liu, L. Xie, S. Li, W. Sun, and C. Dong (2020) Enhanced quadratic video interpolation. Computer Vision – ECCV 2020 Workshops. Cited by: §III-A.
  • [27] Z. Liu, R. Yeh, X. Tang, Y. Liu, and A. Agarwala (2017) Video frame synthesis using deep voxel flow. IEEE. Cited by: §II-A.
  • [28] I. Loshchilov and F. Hutter (2016)

    SGDR: stochastic gradient descent with warm restarts

    .
    arXiv e-prints. Cited by: §III-D.
  • [29] U. Mudenagudi, S. Banerjee, and P. K. Kalra (2011) Space-time super-resolution using graph-cut optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (5), pp. 995–1008. Cited by: §I, §II-C.
  • [30] S. Nah, S. Baik, S. Hong, G. Moon, S. Son, R. Timofte, and K. Mu Lee (2019-06) NTIRE 2019 challenge on video deblurring and super-resolution: dataset and study. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Cited by: §IV-B, §IV.
  • [31] S. Niklaus and F. Liu (2020) Softmax splatting for video frame interpolation. IEEE. Cited by: §II-A.
  • [32] S. Niklaus, M. Long, and F. Liu (2017) Video frame interpolation via adaptive separable convolution. In 2017 IEEE International Conference on Computer Vision (ICCV), Cited by: §I, §II-A.
  • [33] E. Shechtman, Y. Caspi, and M. Irani (2002) Increasing space-time resolution in video. Computer Vision — ECCV 2002. Cited by: §I, §II-C.
  • [34] W. Shi, J. Caballero, F. Huszár, J. Totz, and Z. Wang (2016)

    Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network

    .
    IEEE. Cited by: §III-C.
  • [35] X. Shi, Z. Chen, H. Wang, D. Y. Yeung, W. K. Wong, and W. C. Woo (2015)

    Convolutional lstm network: a machine learning approach for precipitation nowcasting

    .
    MIT Press. Cited by: §II-B.
  • [36] H. Song, L. Qing, Y. Wu, and X. He (2013) Adaptive regularization-based space–time super-resolution reconstruction. Signal Processing Image Communication 28 (7), pp. 763–778. Cited by: §I.
  • [37] D. Sun, X. Yang, M. Y. Liu, and J. Kautz (2017) PWC-net: cnns for optical flow using pyramid, warping, and cost volume. Cited by: §IV-C.
  • [38] X. Tao, H. Gao, R. Liao, J. Wang, and J. Jia (2017) Detail-revealing deep video super-resolution. IEEE Computer Society. Cited by: §II-B.
  • [39] Y. Tian, Y. Zhang, Y. Fu, and C. Xu (2020) TDAN: temporally-deformable alignment network for video super-resolution. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §II-B, §IV-C.
  • [40] X. Wang, K. Chan, K. Yu, C. Dong, and C. C. Loy (2019) EDVR: video restoration with enhanced deformable convolutional networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Cited by: §I, §II-B, §III-B3, §IV-B, §IV-B.
  • [41] W. Bao, W.-S. Lai, X. Zhang, Z. Gao, and M.-H. Yang (2019) MEMC-Net: motion estimation and motion compensation driven neural network for video interpolation and enhancement. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §II-A.
  • [42] X. Xiang, Y. Tian, Y. Zhang, Y. Fu, J. P. Allebach, and C. Xu (2021) Zooming slowmo: an efficient one-stage framework for space-time video super-resolution. Cited by: Fig. 1, §I, §II-C, §III-B2, Fig. 4, §IV-A, §IV-B, §IV-B, Fig. 6, Fig. 7, Fig. 8, Fig. 9.
  • [43] G. Xu, J. Xu, Z. Li, L. Wang, X. Sun, and M. Cheng (2021-06) Temporal modulation network for controllable space-time video super-resolution. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Fig. 1, §I, §II-C, Fig. 4, §IV-B, §IV-B, Fig. 6, Fig. 7, Fig. 8, Fig. 9.
  • [44] X. Xu, L. Siyao, W. Sun, Q. Yin, and M. H. Yang (2019) Quadratic video interpolation. Cited by: §II-A, §III-A.
  • [45] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman (2019) Video enhancement with task-oriented flow. International Journal of Computer Vision. Cited by: §I.
  • [46] T. Xue, B. Chen, J. Wu, D. Wei, and W. T. Freeman (2017) Video enhancement with task-oriented flow. CoRR abs/1711.09078. External Links: Link, 1711.09078 Cited by: §III-D, §IV-A, §IV-B, §IV.
  • [47] P. Yi, Z. Wang, K. Jiang, J. Jiang, and J. Ma (2020)

    A progressive fusion generative adversarial network for realistic and consistent video super-resolution

    .
    IEEE Transactions on Pattern Analysis and Machine Intelligence PP (99), pp. 1–1. Cited by: §II-B.
  • [48] P. Yi, Z. Wang, K. Jiang, J. Jiang, and J. Ma (2020) Progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §II-B.
  • [49] P. Yi, Z. Wang, K. Jiang, Z. Shao, and J. Ma (2019) Multi-temporal ultra dense memory network for video super-resolution. IEEE Transactions on Circuits and Systems for Video Technology PP (99), pp. 1–1. Cited by: §II-B.
  • [50] P. Yi, Z. Wang, K. Jiang, J. Jiang, T. Lu, X. Tian, and J. Ma (2021) Omniscient video super-resolution. External Links: 2103.15683 Cited by: §II-B, §III-B2, §IV-C.
  • [51] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu (2018) Image super-resolution using very deep residual channel attention networks. Cited by: §IV-B.
  • [52] X. Zhu, H. Hu, S. Lin, and J. Dai (2018) Deformable convnets v2: more deformable, better results. arXiv preprint arXiv:1811.11168. Cited by: §II-B.

VI Appendix

Here, we provide more visual comparisons of synthetic intermediate frames (even-numbered frames) on the REDS dataset. Since some scenes in REDS have very large motions and severe camera shake, they can better verify the model's ability to handle extreme motions. As the following figures show, when there is large motion between two adjacent frames, the kernel-based methods exhibit obvious blurring and severely distorted grid-like regions, while our model can often restore HR frames with better visual quality.

Fig. 6: Qualitative comparison on REDS clip 000 (GT crop, overlay crop, STARnet [14], Zooming [42], TMNet [43], ours).
Fig. 7: Qualitative comparison on REDS clip 001 (GT crop, overlay crop, STARnet [14], Zooming [42], TMNet [43], ours).
Fig. 8: Qualitative comparison on REDS clip 011 (GT crop, overlay crop, STARnet [14], Zooming [42], TMNet [43], ours).
Fig. 9: Qualitative comparison on REDS clip 020 (GT crop, overlay crop, STARnet [14], Zooming [42], TMNet [43], ours).