With economic and social development, the public's expectations for audiovisual experience keep rising. High-resolution (4K, 8K) and high-frame-rate (120 FPS, 240 FPS) televisions are becoming increasingly popular, yet many video sources still have relatively low frame rates and resolutions. Improving the resolution and frame rate of ordinary video is therefore of great practical value. Space-Time Video Super-Resolution (ST-VSR) aims to transform a video with low spatial resolution and a low frame rate into one with higher spatial and temporal resolutions.
The concept of Space-Time Super-Resolution was first proposed in early work that predates deep learning. To mine the temporal and spatial information of video, these early methods used hand-designed feature extractors (such as SVM, Hough pyramid, etc.) to extract video features and then applied pixel regularization. Many of them rest on a series of strict mathematical assumptions: the scene changes little, the motion between two frames is small, the brightness of adjacent pixels is consistent, and so on. Real scenes almost never satisfy these conditions, so the reconstruction results of these traditional methods are often inferior for diverse scenes, particularly those with large motion or inconsistent brightness. Moreover, most of these methods take a long time to process a video. With the development of deep learning, CNN-based video super-resolution (VSR) and video frame interpolation (VFI) methods have made great progress. In fact, Space-Time Super-Resolution can be completed in two steps, i.e., performing VFI and VSR independently, or the two tasks can be combined and trained jointly. However, the two-step scheme ignores the correlation between temporal and spatial information. Furthermore, because the key to both VFI and VSR is the alignment of frames or features, both the two-step scheme and the joint-training scheme inevitably perform alignment twice (once for VFI and once for VSR), which not only slows processing but also introduces substantial parameter redundancy.
In the past few years, video frame interpolation and video super-resolution have made considerable progress, and space-time super-resolution has also received growing attention. Several recent STVSR methods have been proposed to solve the problems of two-stage methods and to better exploit the correlation between time and space. These methods can perform VFI and VSR on low-frame-rate, low-resolution videos simultaneously. STARnet first estimates the optical flow between two adjacent frames and performs feature warping to interpolate the intermediate frame, then performs reconstruction to obtain a high-resolution frame. However, this method can only use information from two adjacent frames and fails to leverage long-range information, and its iteration-based optimization inevitably incurs high computation and memory costs. Xiang et al. proposed a deformable ConvLSTM backbone and performed STVSR in the feature space. Compared with STARnet, this method supports relatively longer input sequences while costing less in computation and memory. Based on Zooming Slow-Mo, Xu et al. further proposed TMNet, which can perform controllable frame interpolation at any intermediate moment. After careful study and reflection on existing work, we design a bidirectional recurrent network for ST-VSR that makes better use of both local and global information with high efficiency. Our contributions are summarized as follows:
An optical-flow-reuse-based bidirectional recurrent network that effectively leverages long-range temporal knowledge and refined fusion of warped features. The proposed flow-reuse strategy halves the temporal-alignment computation cost compared with previous LSTM-based alignment methods.
A Feature Refinement Module (FRM) that enhances the hidden-state features in the bidirectional recurrence, improving performance by refining warped features with additional temporal information.
With the merit of the above techniques, our framework achieves the best restoration performance with the highest computing speed on public test sequences compared with other ST-VSR algorithms.
The remainder of the paper is organized as follows: Section II reviews the background and related works. Details of our proposed flow-reuse-based bidirectional network are given in Section III. Experiments and analysis are provided in Section IV. Finally, the paper is concluded in Section V.
II Related Work
We introduce related work in three parts: Video Frame Interpolation (VFI), Video Super-Resolution (VSR), and Video Space-Time Super-Resolution (ST-VSR).
II-A Video Frame Interpolation
The goal of Video Frame Interpolation (VFI) is to synthesize nonexistent intermediate frames between consecutive frames. Many previous methods use two reference frames to interpolate the intermediate frame. Among flow-based methods, a fully convolutional network was proposed to estimate voxel flow and generate intermediate frames. Jiang et al. introduced a similar method and used two U-Net architectures to compute bidirectional optical flow and visibility maps. Bao et al. first linearly weight the optical flow and then use a depth-aware projection layer to adaptively blend the warped frames. Xu et al. proposed QVI, which exploits four consecutive frames and estimates flow fields from the unknown target frame to the source frames. Different from these backward-warping methods, Niklaus et al. proposed SoftSplat to forward-warp frames and their corresponding feature maps using softmax splatting.
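Backward warping with bilinear sampling underlies most of these flow-based methods. A minimal NumPy sketch is given below (channel-last layout, border pixels clamped; these are implementation details that vary between frameworks):

```python
import numpy as np

def backward_warp(img, flow):
    """Backward-warp `img` (H, W, C) with `flow` (H, W, 2).

    For every target pixel p, sample the source image at p + flow(p)
    with bilinear interpolation, as in flow-based VFI methods.
    """
    h, w, _ = img.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    src_x = xs + flow[..., 0]
    src_y = ys + flow[..., 1]
    # Integer corners of the sampling cell, clamped to the image border.
    x0 = np.clip(np.floor(src_x).astype(int), 0, w - 1)
    y0 = np.clip(np.floor(src_y).astype(int), 0, h - 1)
    x1 = np.clip(x0 + 1, 0, w - 1)
    y1 = np.clip(y0 + 1, 0, h - 1)
    # Bilinear weights.
    wx = np.clip(src_x - x0, 0.0, 1.0)[..., None]
    wy = np.clip(src_y - y0, 0.0, 1.0)[..., None]
    top = img[y0, x0] * (1 - wx) + img[y0, x1] * wx
    bot = img[y1, x0] * (1 - wx) + img[y1, x1] * wx
    return top * (1 - wy) + bot * wy
```

With zero flow the warp is the identity; a constant flow of one pixel in x shifts the image accordingly.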
Apart from using optical flow, another major trend in VFI is to replace the two-step interpolation operation with a convolution process. Niklaus et al. use a pair of 1D kernels to perform spatially adaptive convolution to estimate motion. Cheng et al. proposed DSepConv, which uses deformable separable convolution to enlarge the receptive field of kernel-based methods, and further proposed EDSC to perform multiple interpolation. Choi et al. proposed CAIN, which replaces the optical flow computation module with a channel attention module to capture motion information implicitly.
II-B Video Super-Resolution
Video Super-Resolution (VSR) aims to reconstruct high-resolution video from the corresponding low-resolution video. The key to VSR is how to fully use complementary information across frames. Most recent methods can be divided into two categories according to their generator networks: iterative and recurrent. Specifically, TDAN adopts deformable convolution modules (DCNs) to align different frames at the feature level. EDVR, as a representative method, further uses DCNs in a multi-scale pyramid and utilizes multiple attention layers to perform alignment and then integrate the features. PFNL is a progressive fusion network with an improved non-local operation that avoids complex motion estimation and compensation, obtaining favorable results in both performance and complexity. Isobe et al. proposed TGA, which divides temporal information into groups and utilizes both 2D and 3D residual blocks for inter-group fusion within a temporally sliding window.
Among iterative methods, Tao et al. proposed a sub-pixel motion compensation layer in a CNN framework and utilized a ConvLSTM module to capture long-range temporal information. RBPN extends image super-resolution to VSR, sending LR frames into a projection module step by step with a recurrent back-projection network. MTUDB embeds convolutional long short-term memory (ConvLSTM) into ultra-dense residual blocks to construct a multi-temporal ultra-dense memory (MTUDM) network for video super-resolution. RSDN designs a structure-detail recurrent network to learn the low-frequency and high-frequency information of the image separately.
II-C Video Space-Time Super-Resolution
Space-Time Video Super-Resolution aims to transform a video with low spatial resolution and a low frame rate into one with higher spatial and temporal resolutions. Some earlier methods, which are not based on deep learning, are slow and often fail to generate promising results for complex scenes. Recent deep-learning-based work has made great progress in both speed and quality. STARnet leverages the mutually informative relationship between time and space with an optical flow estimation module and performs feature warping of two consecutive frames to interpolate the intermediate LR frame. Zooming Slow-Mo developed a unified framework with deformable ConvLSTM to align and aggregate temporal information, synthesizing the intermediate features with a bidirectional recurrent network before performing feature fusion for STVSR. Based on Zooming Slow-Mo, Xu et al. proposed a temporal modulation network with a locally-temporal feature comparison module and deformable convolution kernels for controllable feature interpolation, which can interpolate arbitrary intermediate frames.
III Proposed Method
Our network consists of three parts: 1) an optical flow estimation module, 2) a bidirectional recurrent neural network, and 3) a frame reconstruction module. We first compare two ways of estimating bidirectional optical flow and introduce the bidirectional recurrence structure. Then we discuss how to effectively fuse local and global information for better performance. Finally, we describe the structure of the reconstruction module. The symbol table (Table I) lists the main symbols used in this paper. The structure of the entire recurrent network is shown in Figure 2.
III-A Optical Flow Estimation Module
Stage 1: estimate flow between two frames
We employ optical flow to perform motion estimation and motion compensation (MEMC). The whole reconstruction process of a high-resolution frame follows an RNN-like structure: when reconstructing a certain frame, we use optical flow to align the hidden state, which contains forward and backward historical information, to the current frame. Note that our task is space-time super-resolution, which differs from ordinary video super-resolution. During recurrence, we must align the hidden state not only to frames with a low-resolution image (LR) but also to the hidden state of the synthetic intermediate frame (SILR). Therefore, an appropriate intermediate flow estimation strategy is key to aligning the intermediate hidden state. In theory, any off-the-shelf optical flow model can estimate the flow between two given frames, but estimating the optical flow toward the intermediate frame is harder. Some previous works first compute bidirectional flows and refine them, obtaining intermediate results by multiplying the optical flow by a time factor; another way is to estimate the intermediate flows directly. To illustrate the difference, we compare the two flow estimation methods in Figure 3, where we need to estimate both the flow between the two input frames and the flow from the intermediate frame to each of them.
Method 1: We simply compute the flow between two frames and multiply it by a time factor to get the intermediate result.
Method 2: We directly estimate the intermediate flows using a revised IFNet. Specifically, because the resolution of the input LR is relatively low, we downsample the LR only twice and set the number of channels of each IFBlock to 64. As for the flow between two LR frames, we reuse the intermediate flows according to the law of vector addition.
In our experiments, directly estimating the optical flow of the intermediate frames works better than estimating the optical flow between LRs; we verify this in the ablation study.
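The two strategies can be contrasted in a few lines. In the hedged NumPy sketch below, `linear_intermediate_flow` implements the common quadratic weighting behind Method 1 (the linear-motion approximation used, e.g., in SuperSloMo-style methods), while `reuse_lr_flow` illustrates the vector-addition reuse of Method 2; an exact flow composition additionally requires re-sampling, which this sketch omits:

```python
import numpy as np

def linear_intermediate_flow(flow_0to1, flow_1to0, t=0.5):
    """Method 1: approximate the flows from the (unseen) intermediate
    frame at time t by linearly scaling the inter-frame flows.
    Assumes locally linear motion between the two frames."""
    f_t0 = -(1 - t) * t * flow_0to1 + t * t * flow_1to0
    f_t1 = (1 - t) * (1 - t) * flow_0to1 - t * (1 - t) * flow_1to0
    return f_t0, f_t1

def reuse_lr_flow(flow_0_to_t, flow_t_to_1):
    """Method 2 (reuse): compose the two directly-estimated intermediate
    flows into an LR-to-LR flow by vector addition. Simplified: the
    re-sampling needed for exact composition is ignored."""
    return flow_0_to_t + flow_t_to_1
```

For constant (translational) motion the two views agree: scaling the inter-frame flow and adding the two intermediate flows recover the same field.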
TABLE I: Main symbols used in this paper
|Low-resolution (LR) frame of the input sequence|
|Synthetic intermediate frame of low resolution (SILR)|
|Hidden state of an LR frame for backward or forward recurrence|
|Hidden state of an LR frame refined through stacked residual blocks|
|Hidden state of a synthetic intermediate frame for backward or forward recurrence|
|Hidden state of a SILR refined through stacked residual blocks|
|Network that processes forward or backward propagation|
|Alignment from coarse (flow warping) to fine (feature refinement module)|
|Estimated forward or backward flow|
Stage 2: estimate bidirectional flow across frames
In Stage 1, we discussed how to estimate the flow between two consecutive frames; we now discuss how to estimate bidirectional flow across frames. Given a sequence of consecutive low-resolution frames, we extract two subsequences from it, as shown in Figure 2. For each frame with a low-resolution image (LR), we estimate the optical flow at the corresponding positions in the two subsequences with a flow estimation method as mentioned above, and refer to the results as the forward flow and the backward flow.
So far, we have finished the estimation of optical flow and prepared the inputs for the bidirectional recurrent network.
III-B Bidirectional Recurrent Neural Network
In this section, we first introduce the basic structure of the bidirectional recurrent network; then we describe how to apply alignment from coarse (flow warping) to fine (feature refinement module) across the frames of the whole input video sequence. Finally, we explore how to efficiently combine local and global information for better performance.
III-B1 Basic Structure of the Bidirectional Recurrent Network
As shown in Figure 2, the overall structure is a bidirectional recurrent network. It is RNN-like, but unlike a conventional RNN, the input video frames are propagated both forward and backward. In this process, information gained from other frames is transferred through a hidden state. In our bidirectional setting, both frames and features are propagated in two opposite directions, so any LR in an input sequence can leverage knowledge from any other frame, and any synthetic intermediate frame (SILR) can likewise leverage information from neighboring frames and the hidden state. We refer to the two recurrent processes as forward recurrence and backward recurrence; for clarity, we describe the propagation of LR and SILR separately. The backward recurrence of LR can be described as:
During recurrence, the hidden state passes through the "pipeline", and we must apply MEMC (in formula 9) to align the hidden state to the current frame. Specifically, we first perform alignment with flow warping to get a coarse result. Note that we use backward warping, so the direction of the optical flow is opposite to the direction of recurrence. After warping, we apply a feature refinement module (FRM) to further optimize the hidden state, which is introduced in detail later. After this "alignment", the aligned feature is first concatenated with the corresponding LR and then fed into stacked fusion residual blocks to get the refined result.
The forward recurrence of LR can be described as:
The process of forward recurrence is basically the same as that of backward recurrence: MEMC is performed first, followed by refinement. The only difference is that the refined results of the backward recurrence are concatenated with the LR and the corresponding forward hidden state before being sent to the refining network.
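The two passes above can be summarized by the following skeleton. This is a sketch, not the implementation: `warp`, `refine`, and `fuse` stand in for flow warping, the FRM, and the stacked fusion residual blocks, and the real network operates on learned features:

```python
import numpy as np

def bidirectional_recurrence(frames, warp, refine, fuse, hidden_dim):
    """Skeleton of the bidirectional recurrence over LR frames."""
    n = len(frames)
    shape = frames[0].shape[:2] + (hidden_dim,)

    # Backward pass: the hidden state flows from the last frame to the first.
    h = np.zeros(shape)
    backward = [None] * n
    for i in range(n - 1, -1, -1):
        h = refine(warp(h))            # coarse (flow warp) to fine (FRM)
        h = fuse(frames[i], h)         # concat with LR, then refine/fuse
        backward[i] = h

    # Forward pass: additionally consumes the backward features.
    h = np.zeros(shape)
    out = []
    for i in range(n):
        h = refine(warp(h))
        h = fuse(frames[i], h, backward[i])
        out.append(h)
    return out
```

With identity `warp`/`refine` and an additive `fuse`, each output accumulates contributions from both directions, which is exactly the point of the bidirectional design.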
Next, we describe the recurrence of the intermediate state in the same manner. Note, however, that we cannot access the ground truth of the SILR, so in the MEMC procedure we must align the hidden state not only in feature space but also in frame space, so as to stay consistent with the recurrence of LR. The backward recurrence of the intermediate state (HSI) can be described as:
The forward recurrence of intermediate state(HSI) can be described as:
The recurrence of the intermediate state largely follows the same idea as that of LR. However, due to possible occlusion of objects and camera panning in the boundary regions of the image, simply mixing the features from backward and forward recurrence may introduce errors. To reduce them, we introduce two masks to reveal the occluded areas between the two adjacent frames. Points with a value of 0 in a mask correspond to pixels that exist in the estimated intermediate frame but disappear, due to occlusion or motion, in the corresponding input frame; the masks are applied via the Hadamard product.
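The masked blending can be sketched as follows. The mask names and the normalization are illustrative choices; in the network the masks are estimated, not hand-made:

```python
import numpy as np

def blend_with_masks(warped_fwd, warped_bwd, mask_fwd, mask_bwd, eps=1e-6):
    """Blend forward- and backward-warped features with occlusion masks
    (Hadamard products). A zero in a mask marks pixels that are occluded
    or move out of view in the corresponding source frame; the weights
    are normalized so fully visible regions average both sides."""
    num = mask_fwd * warped_fwd + mask_bwd * warped_bwd
    den = mask_fwd + mask_bwd + eps
    return num / den
```

Where both masks are 1 the result is the plain average; where one mask is 0 the blend falls back entirely on the other direction.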
III-B2 Feature Refinement Module
Thanks to the bidirectional recurrent structure, any frame in the input sequence can obtain an information gain from any other frame. However, the hidden-state feature map brings not only this gain but also noise caused by inaccurate alignment and occlusion. One direct remedy is to replace the RNN with an LSTM: since LSTMs can capture longer dependencies and suppress noise, most recent work has used LSTM variants such as ConvLSTM and deformable LSTM. We did not choose an LSTM, because compared to an RNN it must store more intermediate states during recurrence (forget gate, input gate, output gate, and cell state), whereas an RNN only stores the hidden state; an LSTM therefore occupies more memory. In other words, under the same hardware, an LSTM-based network can only handle shorter input sequences than an RNN-based one. Recent VSR works point out that longer input sequences benefit from more long-term information and therefore achieve better performance. Under this assumption, we carefully consider how to keep memory usage low while still suppressing noise.
That is, we need a structure that can adaptively measure the relevance between the current candidate frame and the hidden state, deciding which parts of the hidden state to highlight and which to suppress. Many works have studied how to dynamically generate a convolution kernel and compute the similarity of two input tensors. RSDN designs a hidden state adaptation module that allows the current frame to selectively use useful information from the hidden state. MuCAN performs a temporal multi-correspondence aggregation strategy and a cross-scale non-local correspondence aggregation scheme to exploit the self-similarity of images across scales.
Similar to the "correlation layer" that performs multiplicative patch comparisons between two feature maps, we compute the correlation between the feature map of the LR and the aligned hidden state. Specifically, we first feed the LR into a Conv-LeakyReLU layer; then we compute the local correlation between the LR feature and the hidden state for each channel, comparing the LR feature centered at a position x with the hidden state centered at a position y, with the offset limited to a square patch. After that, we apply a sigmoid activation to the correlation matrix, transforming its values into the range [0, 1], and perform element-wise multiplication between the hidden state and this matrix to obtain the optimized hidden feature. Finally, we concatenate the optimized result with the hidden state.
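A simplified sketch of this gating follows. It departs from the module above in two labeled ways: the correlation is summed over channels into a single map rather than computed per channel, and the patch radius is fixed:

```python
import numpy as np

def frm_gate(lr_feat, hidden, patch=1):
    """Sketch of feature-refinement gating: local correlation between
    the LR feature and the aligned hidden state, squashed by a sigmoid
    and applied as an element-wise (Hadamard) gate, then concatenated
    with the original hidden state."""
    h, w, _ = lr_feat.shape
    pad = patch
    hp = np.pad(hidden, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    corr = np.zeros((h, w))
    # Accumulate multiplicative comparisons over a (2*patch+1)^2 window.
    for dy in range(-pad, pad + 1):
        for dx in range(-pad, pad + 1):
            shifted = hp[pad + dy:pad + dy + h, pad + dx:pad + dx + w]
            corr += np.sum(lr_feat * shifted, axis=-1)
    gate = 1.0 / (1.0 + np.exp(-corr))      # sigmoid into [0, 1]
    gated = hidden * gate[..., None]        # suppress irrelevant parts
    return np.concatenate([gated, hidden], axis=-1)
```

The output doubles the channel dimension, matching the final concatenation step described above.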
III-B3 Hybrid of Iteration and Sliding Window
As the two main frameworks in the VSR field, the sliding-window and iterative methods represent different emphases on local and global information. When a frame is restored, neighboring frames can provide more information, while distant frames contain relatively less useful information. To make comprehensive use of local and global information at the same time, we adopt a hybrid of iteration and sliding window. The iterative part has been discussed in detail above: applying optical flow warping and the Feature Refinement Module (FRM) to update the hidden state. For sliding-window features, inspired by the design of EDVR, we use a PCD (Pyramid, Cascading and Deformable) alignment block to extract features from the LRs within the window. Specifically, we first use a multi-layer stacked residual network to extract features and then align the neighboring frames to the current reference frame with pyramid, cascading, and deformable convolution, setting the window size to five consecutive LRs. After that, we fuse the feature maps with a simple Conv layer. Finally, we concatenate and fuse the obtained features along the channel dimension with the features obtained by the bidirectional network. Through experiments, we found that this hybridization makes better use of local and global information while exploiting the respective advantages of kernel-based and flow-based methods.
III-C Reconstruction Module
So far, we have obtained hidden states that contain the temporal and spatial features of LR and SILR. We then perform spatial reconstruction on these features: specifically, we feed the reconstructed feature maps into two sub-pixel upscaling modules with PixelShuffle and finally output the reconstructed HR video frames.
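The sub-pixel (PixelShuffle) rearrangement can be written directly. The sketch below uses a channel-last layout; the channel ordering may differ from a particular framework's convention:

```python
import numpy as np

def pixel_shuffle(feat, r):
    """Sub-pixel upscaling: rearrange a (H, W, C*r*r) feature map into
    (H*r, W*r, C) by moving channel groups into spatial positions."""
    h, w, crr = feat.shape
    c = crr // (r * r)
    out = feat.reshape(h, w, r, r, c)     # split channels into an r x r cell
    out = out.transpose(0, 2, 1, 3, 4)    # interleave cells with spatial dims
    return out.reshape(h * r, w * r, c)
```

Two such modules in sequence give the overall 4× spatial upscaling used for reconstruction.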
III-D Implementation Details
We crop LR patches of size 64×64 and take the odd-indexed LR frames as inputs, with the corresponding consecutive HR sequence of size 256×256 as supervision. During training, we adopt the Adam optimizer with β1 = 0.9 and β2 = 0.999 and apply standard augmentation such as rotation, flipping, and random cropping. The initial learning rates of the flow estimator and the other parts are set separately and decay with cosine annealing. The batch size is 24, and we train the model on 8 Nvidia 1080Ti GPUs. We initialize the parameters of our network with Kaiming initialization, except for the pre-trained weights (IFNet).
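The cosine-annealed schedule can be sketched as follows. The paper's exact initial and final learning rates are not reproduced here; `lr_max` and `lr_min` are placeholders:

```python
import math

def cosine_annealed_lr(step, total_steps, lr_max, lr_min=0.0):
    """Cosine annealing: the learning rate decays from lr_max at step 0
    to lr_min at `total_steps` along a half cosine curve."""
    cos = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cos
```

At the halfway point the rate is exactly the midpoint of `lr_max` and `lr_min`, which makes the decay gentler than a step schedule early in training.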
IV Experiments
IV-A Experimental Setup
We adopt the Vimeo-90K-T septuplet trainset for training. Vimeo-90K contains 91,701 video sequences, each consisting of 7 frames, with HR frames at a resolution of 448×256. Following prior work, we divide the Vimeo-90K-T testing set into 3 categories according to the average motion flow magnitude: fast motion, medium motion, and slow motion, which include 1225, 4977, and 1613 video clips, respectively. For a fair comparison, we removed 5 video clips from the original medium-motion set and 3 clips from the slow-motion set, because these clips contain only all-black backgrounds, which lead to infinite PSNR values. We also test on Vid4, which contains four scenes and is widely used in VSR. Finally, to verify the robustness of our method across datasets, we also test the model on REDS, which is very challenging due to its diverse scenes and large motions. We follow the experimental setup of prior work, generating LR frames with a downsampling factor of 4 and using odd-indexed LR frames as input to predict the corresponding consecutive HR frames and synthesize the intermediate HR frames.
We adopt Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index (SSIM) for evaluation, as is standard for VFI and VSR. We also report the model size and inference FPS of different networks to compare efficiency. Note that inference FPS is measured over the entire Vid4 dataset on one Nvidia 1080Ti GPU.
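For reference, PSNR can be computed as below; the degenerate case also shows why the all-black clips had to be removed from the test set, since a zero MSE yields an infinite value:

```python
import numpy as np

def psnr(pred, target, data_range=255.0):
    """Peak Signal-to-Noise Ratio in dB between two images or frames."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical inputs, e.g. all-black frames
    return 10.0 * np.log10(data_range ** 2 / mse)
```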
IV-B Comparison with State-of-the-Art Methods
We provide two models for comparison with other methods: the feature-fusion model (OFR-BRN+) and the light model (OFR-BRN). The light model does not concatenate the sliding-window features of Section III-B3 before frame reconstruction.
We compare our model with state-of-the-art two-stage methods (VFI plus VSR) and one-stage STVSR methods. For the two-stage methods, we perform video frame interpolation (VFI) with SuperSloMo or SepConv, and video super-resolution (VSR) with Bicubic Interpolation (BI), RCAN, RBPN, or EDVR. For one-stage STVSR models, we compare our network with the recent state-of-the-art methods Zooming SlowMo, STARnet, and TMNet. During training, we use the Vimeo-90K trainset, feed the odd-indexed LR frames into the model, and reconstruct HR frames corresponding to the entire sequence. All methods are trained on Vimeo-90K and evaluated on the Vimeo-90K test set and the Vid4 dataset.
Quantitative results are listed in Table II, where red and blue indicate the best and second-best performance. Following the suggestion of prior work, we omit baseline models with Bicubic Interpolation when comparing speed. The light model outperforms the second-best method by 0.68 dB on Vid4 and runs 2× faster than other SOTA STVSR methods. On the Vimeo dataset, the light model also outperforms other methods on the quantitative metrics, except for PSNR on Vimeo-slow. Through further experiments, we found that fusing sliding-window features improves performance, especially for short input sequences; most importantly, the fusion model still keeps a relatively fast speed while maintaining high performance. In addition, the parameter counts of both models remain close to those of other SOTA methods. These results show that our model is highly competitive in performance, running speed, and parameter size.
Further experiments on REDS. To further verify the robustness of our method across datasets, we pretrained the model on Vimeo and tested it on REDS. Compared with Vimeo and Vid4, REDS has a larger resolution (720×1280), more complex and diverse scenes, and more significant motion, so it is closer to the super-resolution requirements of real scenes. We set up the test following prior work and compared the PSNR, SSIM, and running speed of different ST-VSR schemes.
As shown in Table III, our model outperforms other ST-VSR methods in both accuracy and speed. Moreover, our method often achieves better subjective visual quality when synthesizing intermediate frames; more subjective comparisons are given in the appendix.
IV-C Ablation Study
Having demonstrated our model's superiority over existing one-stage and two-stage frameworks, we now examine the different modules of our network. Specifically, we focus on: 1) different recurrence strategies; 2) different flow estimation methods; 3) the effectiveness of the Feature Refinement Module (FRM); and 4) the influence of the feature fusion space. The results of the ablation study on the different modules are listed in Table V.
1. One-way recurrence vs. bidirectional recurrence. To test the usefulness of the bidirectional mechanism, we removed the backward branch of the bidirectional recurrence and call the result one-way recurrence. The ablation results are shown in rows (c) and (d) of Table V. The accuracy of one-way recurrence is much lower than that of bidirectional recurrence: in one-way recurrence, each frame can only leverage knowledge from previous frames and cannot use subsequent frames, which causes severe performance degradation.
2. Flow-reuse strategy vs. naive flow estimation. To illustrate the effectiveness of the flow-reuse strategy (intermediate flow estimation), we experimented with the two optical flow estimation schemes described in Section III-A. When applying naive flow estimation between LRs (method 1 in III-A), we employ a pretrained PWC-Net as the flow estimator. First, we confirm that reusing optical flow does not affect the alignment of the odd frames (LR). As shown in Table IV, we divide the reconstructed frames into odd frames (LR) and even frames (SILR). The results on odd frames show that, compared with estimating the optical flow between LRs, optical flow reuse does not affect the alignment of LRs. From the results on even frames, we find that the intermediate flow estimation method performs better. We attribute the improvement to directly estimating the intermediate flows, which fits the non-linear motion between frames better than linear motion estimation.
3. Effectiveness of the Feature Refinement Module. In principle, it is feasible to use only optical flow to align frames. To verify the effectiveness of the feature refinement module, we remove it and compare performance with the original network on Vid4 and Vimeo; the results are shown in rows (b) and (d) of Table V. Since our network adopts a recurrent structure, when reconstructing a frame we want to align the hidden state, which contains knowledge from other frames, to the state of the current LR in order to gain information. However, the optical flow is almost never completely accurate, and flow warping can introduce noise during MEMC, especially for images with complex textures. We therefore want the information in the aligned hidden state to be as relevant as possible to the current frame. Figure 5 shows a typical visual contrast: note the cluttered branches indicated by the arrows. The restored results without the feature refinement module show obvious noise, while with feature refinement our model reconstructs more details, especially for images with complex textures.
4. Analysis of fusion space. Since we cannot access the ground truth of the even-indexed frames (SILR), we must find a reasonable way to estimate the intermediate state as accurately as possible. The simplest idea is to directly blend the warped color frames in image space to produce the intermediate frame; this approach is commonly used in image stitching [1, 4], video extrapolation, and video stabilization. However, image-space fusion easily leads to ghosting and checkerboard artifacts. To avoid these effects, some works [50, 39] point out that fusion in feature space achieves better results. We therefore explored fusion in different spaces: a) image-space fusion, b) feature-space fusion, and c) hybrid-space fusion. For image-space fusion, we use bidirectional optical flow to warp the two adjacent video frames, average the results of the bidirectional warping, and then use a 1×1 convolution to keep the channel dimension consistent with the LR (odd frame). For feature-space fusion, we use bidirectional optical flow to warp the adjacent hidden states and likewise average the results from both sides. For hybrid-space fusion, we concatenate the results of a) and b) along the channel dimension and use a 1×1 convolution to ensure that the final channel dimension is consistent with the odd-indexed frame. Quantitative results on the Vimeo test set are listed in Table VI.
Table VI compares (a) image-space, (b) feature-space, and (c) hybrid-space fusion.
Hybrid-space fusion achieves the best results of the three. With image-space fusion alone, the feature dimension is too low; with feature-space fusion alone, the information that the hidden state inevitably loses during recurrence cannot be fully exploited. Hybrid-space fusion combines both and therefore achieves the best performance.
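The hybrid-space variant can be sketched as follows. This is a toy version: `proj` plays the role of the 1×1 convolution (a per-pixel matrix multiply over channels), and the bidirectional warps are assumed precomputed:

```python
import numpy as np

def hybrid_fusion(img_warp_fwd, img_warp_bwd,
                  feat_warp_fwd, feat_warp_bwd, proj):
    """Hybrid-space fusion sketch: average the image-space and the
    feature-space bidirectional warps, concatenate along channels, and
    project back with a 1x1 convolution (`proj` has shape (C_in, C_out))."""
    img_avg = 0.5 * (img_warp_fwd + img_warp_bwd)      # image-space branch
    feat_avg = 0.5 * (feat_warp_fwd + feat_warp_bwd)   # feature-space branch
    hybrid = np.concatenate([img_avg, feat_avg], axis=-1)
    return hybrid @ proj                               # 1x1 conv as matmul
```

Dropping either branch before the concatenation recovers the pure image-space or pure feature-space variants compared in Table VI.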
V Conclusion
In this paper, we propose an efficient and accurate structure for video space-time super-resolution. Thanks to our flow-reuse strategy and coarse-to-fine feature refinement module, our model considerably improves speed and performance compared with previous state-of-the-art methods, particularly when estimating extreme motions. We also discussed how to integrate local and global information to reconstruct HR frames quickly and well, adapting to different needs.
- (2012) Bibliography.
- (2019) Depth-aware video frame interpolation.
- (2016) Dynamic filter networks. CoRR abs/1605.09673.
- (2016) Image alignment and stitching: a tutorial.
- (2020) BasicVSR: the search for essential components in video super-resolution and beyond.
- Video frame interpolation via deformable separable convolution. Proceedings of the AAAI Conference on Artificial Intelligence 34 (7), pp. 10607–10614.
- (2020) Multiple video frame interpolation via enhanced deformable separable convolution. CoRR abs/2006.08070.
- (2020) Channel attention is all you need for video frame interpolation. Proceedings of the AAAI Conference on Artificial Intelligence 34 (7), pp. 10663–10671.
- (2017) Deformable convolutional networks. IEEE.
- (2016) FlowNet: learning optical flow with convolutional networks. In 2015 IEEE International Conference on Computer Vision (ICCV).
- (2015) FlowNet: learning optical flow with convolutional networks.
- (2018) Deep back-projection networks for super-resolution. arXiv.
- (2019) Recurrent back-projection network for video super-resolution. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- (2020) Space-time-aware multi-resolution video enhancement. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In CVPR.
- (2021) RIFE: real-time intermediate flow estimation for video frame interpolation.
- (2020) Video super-resolution with recurrent structure-detail network.
- (2020) Video super-resolution with temporal group attention. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Super SloMo: high quality estimation of multiple intermediate frames for video interpolation.
- (2018) Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- (2017) DeepRain: ConvLSTM network for precipitation prediction using multichannel radar data.
- (2019) Video extrapolation using neighboring frames. ACM Transactions on Graphics (TOG) 38 (3), pp. 1–13.
- (2020) MuCAN: multi-correspondence aggregation network for video super-resolution.
- (2011) A Bayesian approach to adaptive video super resolution. In CVPR 2011, pp. 209–216.
- (2021) Hybrid neural fusion for full-frame video stabilization.
- (2020) Enhanced quadratic video interpolation. Computer Vision – ECCV 2020 Workshops.
- (2017) Video frame synthesis using deep voxel flow. IEEE.
- SGDR: stochastic gradient descent with warm restarts. arXiv e-prints.
- (2011) Space-time super-resolution using graph-cut optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (5), pp. 995–1008.
- (2019) NTIRE 2019 challenge on video deblurring and super-resolution: dataset and study. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
- (2020) Softmax splatting for video frame interpolation. IEEE.
- (2017) Video frame interpolation via adaptive separable convolution. In 2017 IEEE International Conference on Computer Vision (ICCV).
- (2002) Increasing space-time resolution in video. Computer Vision — ECCV 2002.
- Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. IEEE.
- Convolutional LSTM network: a machine learning approach for precipitation nowcasting. MIT Press.
- (2013) Adaptive regularization-based space–time super-resolution reconstruction. Signal Processing: Image Communication 28 (7), pp. 763–778.
- (2017) PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume.
- (2017) Detail-revealing deep video super-resolution. IEEE Computer Society.
- (2020) TDAN: temporally-deformable alignment network for video super-resolution. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- (2019) EDVR: video restoration with enhanced deformable convolutional networks. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
- (2019) MEMC-Net: motion estimation and motion compensation driven neural network for video interpolation and enhancement. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- (2021) Zooming SlowMo: an efficient one-stage framework for space-time video super-resolution.
- (2021) Temporal modulation network for controllable space-time video super-resolution. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- (2019) Quadratic video interpolation.
- (2019) Video enhancement with task-oriented flow. International Journal of Computer Vision.
- (2017) Video enhancement with task-oriented flow. CoRR abs/1711.09078.
- A progressive fusion generative adversarial network for realistic and consistent video super-resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence PP (99), pp. 1–1.
- (2020) Progressive fusion video super-resolution network via exploiting non-local spatio-temporal correlations. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
- (2019) Multi-temporal ultra dense memory network for video super-resolution. IEEE Transactions on Circuits and Systems for Video Technology PP (99), pp. 1–1.
- (2021) Omniscient video super-resolution.
- (2018) Image super-resolution using very deep residual channel attention networks.
- (2018) Deformable ConvNets v2: more deformable, better results. arXiv preprint arXiv:1811.11168.
Here, we provide more visual comparisons of the synthesized intermediate frames (even-numbered frames) on the REDS dataset. Since some scenes in REDS contain very large motions and severe camera shake, they are well suited to verifying a model's ability to handle extreme motion. As the following figures clearly show, when there is huge motion between two adjacent frames, the kernel-based method produces obvious blurring, and some grid-like regions are severely distorted. In contrast, our model often restores HR frames with better visual quality when dealing with extreme motions.