Super-resolution (SR) aims to generate a high-resolution (HR) image or video from its low-resolution (LR) version. As an extension of single image super-resolution (SISR), video super-resolution (VSR) restores content from degraded video so that the reconstructed frames contain more detail and are sharper. The technology has practical value in many fields, such as video surveillance and ultra-high-definition television.
Unlike SISR, which takes a single low-resolution image as input, VSR hinges on how to exploit the temporal information shared among multiple low-resolution video frames. Vanilla SISR approaches can be applied directly to video by treating each frame as an independent image, but doing so discards the details available in the other frames and rarely yields satisfactory reconstructions. They are therefore not well suited to the VSR task.
To overcome this limitation of SISR, existing VSR methods usually take an LR reference frame and several neighboring frames as inputs to reconstruct the corresponding HR reference frame. Because the camera or objects move, the neighboring frames must first be spatially aligned before their information can be used to recover missing details. To this end, traditional VSR methods [16, 20, 18, 1] usually compute optical flow to estimate sub-pixel motion between LR frames and warp the neighboring frames accordingly. However, fast and reliable flow estimation remains a challenging problem. The brightness constancy assumption on which flow estimation relies may fail under changes in illumination and pose, or in the presence of motion blur and occlusion. Incorrect motion compensation introduces artifacts into the aligned neighboring frames and degrades the final restored images. Explicit flow estimation and motion compensation can therefore be sub-optimal for the video super-resolution task.
In this paper, we propose a novel deformable non-local network (DNLN), a non-flow-based method that performs both implicit motion estimation and video super-resolution. Our network consists of three modules. The first is the alignment module. Inspired by TDAN, we apply deformable convolution and enhance its ability to warp frames adaptively. Specifically, we use the hierarchical feature fusion block (HFFB) to generate the deformable convolution parameters effectively, and through stacked deformable convolutions we perform progressively finer alignment of the input neighboring frame features. The second is the attention guidance module. We exploit a non-local structure to capture the global correlation between each neighboring frame and the reference frame, which is used to assess the importance of different regions in the neighboring frame. The resulting attention is expected to highlight features complementary to the reference frame and to exclude improperly aligned regions. The attention-guided features are then concatenated and fed into a merging layer to reduce the number of channels. The third is the SR reconstruction module. We use residual-in-residual dense blocks (RRDBs) to generate the HR reference frame. RRDBs help make full use of information from different hierarchical levels, so as to retain more details of the input LR frame.
In summary, the main contributions of this paper are as follows:
- We propose a novel deformable non-local network (DNLN) for high-quality video super-resolution. Our method achieves state-of-the-art VSR performance on several benchmark datasets.
- We design an alignment module based on deformable convolution, which performs feature-level alignment in a coarse-to-fine manner without explicit motion compensation.
- We propose a non-local attention guidance module that selects the features of neighboring frames that are useful for recovering the reference frame.
2 Related Work
2.1 Single Image Super-resolution
Dong et al. first proposed SRCNN for single image super-resolution to learn the nonlinear mapping between LR and HR images in an end-to-end manner, and achieved better performance than previous work. VDSR further improved SRCNN by stacking more convolution layers and using residual learning to increase network depth. DRCN first introduced recursive learning in very deep networks for parameter sharing. All of these methods first interpolate the LR input to the desired size, and the reconstruction operates on the interpolated image. This pre-processing step inevitably loses details and adds computation cost. To avoid these problems, extracting features from the original LR input and upscaling the spatial resolution at the end of the network became the main direction for SR networks. Dong et al. took the original LR image directly as input and introduced a transposed convolution layer (also known as a deconvolution layer) to upsample to high resolution. Shi et al. proposed an efficient sub-pixel convolution layer that upscales the final LR feature maps to the HR output and accelerates the network.
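The rearrangement performed by a sub-pixel convolution layer can be illustrated with a minimal sketch. It assumes the common channel ordering in which each group of r*r LR channels fills one r x r block of HR pixels; the function name `pixel_shuffle` and the nested-list layout are choices made for this illustration, not details taken from the cited work.

```python
def pixel_shuffle(feat, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r)."""
    channels_in = len(feat)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels_in // (r * r)
    h, w = len(feat[0]), len(feat[0][0])
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for y in range(h * r):
            for x in range(w * r):
                # The sub-pixel position (y % r, x % r) selects the source channel.
                src_channel = c * r * r + (y % r) * r + (x % r)
                out[c][y][x] = feat[src_channel][y // r][x // r]
    return out


# Four 1x2 LR channels become one 2x4 HR channel for r = 2.
lr = [[[1, 2]], [[10, 20]], [[100, 200]], [[1000, 2000]]]
hr = pixel_shuffle(lr, 2)
```

Because the layer only reorders values, all convolutions before it run at LR resolution, which is where the speed-up comes from.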
Afterwards, Timofte et al. provided a new large dataset (DIV2K) in the NTIRE 2017 challenge that consists of 1000 2K-resolution images. This dataset enables researchers to train deeper and wider networks, which has driven the development of a variety of SR methods. The most advanced SISR networks, such as EDSR, DBPN and RCAN, perform far better than earlier networks when trained on this dataset.
2.2 Video Super-resolution
Liao et al. proposed DECN, which uses two classical optical flow methods, TV-L1 and MDP flow, to generate SR drafts with different parameters and then produces the final result through a deep network. Kappeler et al. proposed VSRnet, which uses a hand-designed optical flow algorithm to perform motion compensation on the input LR frames and feeds the warped frames to a CNN to predict the HR video frame. Caballero et al. introduced the first end-to-end VSR network, VESPCN, which studies early fusion, slow fusion and 3D convolution to learn temporal relationships. It uses a multi-scale spatial transformer to warp the LR frames and then generates an HR frame through another deep network. Tao et al. proposed a sub-pixel motion compensation layer for frame alignment and used a convolutional LSTM architecture in the subsequent SR reconstruction network.
Most prior VSR methods exploit optical flow to estimate motion between frames and warp them to integrate effective features. However, it is difficult to obtain precise flow estimates in the presence of occlusion and large motion. Xue et al. proposed TOFlow, a task-oriented flow with learnable motion cues. It achieves better VSR results than fixed flow algorithms, which reveals that standard optical flow is not the best motion representation for video restoration. To circumvent this problem, DUF uses adaptive upsampling with dynamic filters that depend on the input frames instead of an explicit estimation process. TDAN uses deformable convolution to adaptively align the video frames at the feature level without computing optical flow. Both surpass flow-based approaches through implicit motion compensation.
2.3 Deformable Convolution
To enhance the capability of CNNs to model geometric transformations, Dai et al. proposed deformable convolutions. They add offsets to the regular grid sampling locations of the standard convolution and thereby allow arbitrary deformation of the sampling grid. To further enhance the modeling capability, modulated deformable convolutions were later presented, which also learn dynamic weights for the sampling kernels. Deformable convolution is effective for high-level vision tasks such as object detection and semantic segmentation. TDAN is the first to use deformable convolutions in the VSR task. It adaptively aligns the input frames at the feature level without explicit motion estimation and shows superior performance to previous optical-flow-based VSR networks.
2.4 Non-local Block
Inspired by the classic non-local means in computer vision, Wang et al. proposed a building block for video classification based on non-local operations. For image data, long-range dependencies are commonly modeled via the large receptive fields formed by deep stacks of convolutional layers. Non-local operations, in contrast, capture long-range dependencies directly by computing interactions between any two positions, regardless of their distance. A non-local operation computes the response at a position as a weighted sum over all positions in the input feature maps. The set of positions can be in space, time, or spacetime, so non-local operations are applicable to both image and video problems.
3 Deformable Non-local Networks
3.1 Network Architecture
Given a sequence of $2N+1$ consecutive low-resolution frames $\{I^{LR}_{t-N}, \dots, I^{LR}_{t}, \dots, I^{LR}_{t+N}\}$, where $I^{LR}_{t}$ is the reference frame and the others are the neighboring frames, our goal is to recover the corresponding high-quality video frame $I^{HR}_{t}$ from the reference frame and its $2N$ neighboring frames. Our network therefore takes $\{I^{LR}_{t-N}, \dots, I^{LR}_{t+N}\}$ as input and finally reconstructs $I^{SR}_{t}$. The overall network structure is shown in Fig. 1. It can be divided into four parts: the feature extraction module, the alignment module, the non-local attention guidance module and the final SR reconstruction module.
For all the input LR frames, we first extract features via a shared feature extraction module, which consists of one convolutional layer and several residual blocks. The LR feature of each neighboring frame then enters the alignment module together with the LR reference feature. The alignment module, which consists of stacked deformable convolutional layers, performs adaptive feature-level alignment. Subsequently, the reference feature and each aligned neighboring feature are fed into a non-local attention guidance module. By calculating the global correlation between them, connections between pixels are established and the missing details of the target frame are further enhanced. The last part is the SR reconstruction module, for which we use residual-in-residual dense blocks (RRDBs). We concatenate the 2N+1 features and fuse them through a convolution layer, and the fused feature maps are then fed into the RRDBs. Moreover, we use a skip connection to propagate the LR reference feature to the end of the network, where it is added element-wise to the output of the RRDBs. Finally, a high-quality HR reference frame is recovered through an upsampling layer.
3.2 Alignment Module
To make use of the temporal information in consecutive frames, traditional VSR methods rely on optical flow to perform frame alignment. However, explicit motion compensation can be sub-optimal for the video super-resolution task. DNLN therefore uses modulated deformable convolutions in the alignment module to avoid this limitation.
For each location $p$ on the output feature map $y$, a normal convolution can be expressed as:

$$y(p) = \sum_{k=1}^{K} w_k \cdot x(p + p_k) \tag{1}$$

where $x$ is the input feature map, $p_k$ ranges over the sampling grid with $K$ sampling locations and $w_k$ denotes the weight for each location. For example, $K = 9$ and $p_k \in \{(-1,-1), (-1,0), \dots, (1,1)\}$ defines a $3 \times 3$ convolutional kernel. In the modulated deformable convolution, predicted offsets and modulation scalars are added to the sampling grid, making the deformable kernels spatially variant. In our alignment module, let $F_i(p)$ and $F_i'(p)$ denote the features at location $p$ in the input feature maps $F_i$ and the aligned output feature maps $F_i'$, respectively. The modulated deformable convolution is then:

$$F_i'(p) = \sum_{k=1}^{K} w_k \cdot F_i(p + p_k + \Delta p_k) \cdot \Delta m_k \tag{2}$$

where $\Delta p_k$ and $\Delta m_k$ are the learnable offset and modulation scalar for the $k$-th location, respectively. The convolution operates on irregular positions with dynamic weights, which achieves adaptive sampling of the input features. Both the offsets and the modulation scalars are learned: each input neighboring feature $F_i$ is concatenated with the reference feature $F_t$ to generate the required deformable sampling parameters:

$$\Delta P, \Delta M = f([F_i, F_t]) \tag{3}$$

Here $\Delta P = \{\Delta p_k\}$ and $\Delta M = \{\Delta m_k\}$. They are obtained from the general function $f$. As $p + p_k + \Delta p_k$ may be fractional, we use bilinear interpolation, which is the same as that proposed in .
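To make the sampling in Eq. (2) concrete, the following minimal sketch evaluates a modulated deformable convolution at one output location of a single-channel feature map: each of the nine taps is shifted by its own offset, read with bilinear interpolation, and scaled by its modulation scalar. The function names, the zero padding outside the map and the hand-picked offsets are assumptions for illustration; in the network the offsets and scalars are predicted, not fixed.

```python
import math


def bilinear(feat, y, x):
    """Sample a 2-D list at fractional (y, x); taps outside the map read as 0."""
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feat[yy][xx]
    return value


# Regular 3x3 sampling grid (K = 9), listed row by row.
GRID = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def modulated_deform_sample(feat, p, weights, offsets, masks):
    """Weighted sum over the nine shifted, modulated taps at location p."""
    py, px = p
    total = 0.0
    for (gy, gx), w_k, (oy, ox), m_k in zip(GRID, weights, offsets, masks):
        total += w_k * bilinear(feat, py + gy + oy, px + gx + ox) * m_k
    return total


feat = [[4 * r + c for c in range(4)] for r in range(4)]
center_only = [0.0] * 9
center_only[4] = 1.0                      # kernel that reads only the centre tap
offsets = [(0.0, 0.0)] * 9
masks = [1.0] * 9

plain = modulated_deform_sample(feat, (1, 1), center_only, offsets, masks)
offsets[4] = (0.0, 0.5)                   # shift the centre tap half a pixel right
shifted = modulated_deform_sample(feat, (1, 1), center_only, offsets, masks)
masks[4] = 0.5                            # halve the contribution of that tap
damped = modulated_deform_sample(feat, (1, 1), center_only, offsets, masks)
```

With zero offsets and unit modulation scalars the function reduces to the ordinary convolution of Eq. (1).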
The alignment module proposed in DNLN is composed of several deformable convolution layers, as shown in Fig. 2. In each deformable convolution, as illustrated in Fig. 3, a reference feature and a neighboring feature are concatenated as the input. They pass through a convolution layer that reduces the number of channels, and a hierarchical feature fusion block (HFFB) is then used to obtain the offsets and modulation scalars of the convolution kernel. The HFFB in Fig. 4 introduces a spatial pyramid of dilated convolutions to enlarge the receptive field effectively at relatively low computational cost, which helps to deal with complicated and large motions between frames. In the HFFB, the feature maps obtained with kernels of different dilation rates are hierarchically added before being concatenated, which helps to obtain an effective receptive field. In this way, we can exploit the temporal dependency of pixels more efficiently when generating the sampling parameters.
According to Eq. (2), the deformable kernel can adaptively select sampling positions on the neighboring features, learn implicit motion compensation between two frames, and complete the alignment of the features. With a cascade of several deformable convolutions, we can gradually align the neighboring features and improve sub-pixel alignment accuracy. Note that the reference feature remains unchanged when passing through a deformable convolution layer; it only serves as a reference for aligning the neighboring features. Through this coarse-to-fine process, the neighboring frames are warped adaptively at the feature level.
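The hierarchical addition can be sketched on a single channel as follows. This is an illustration only: the real block uses learned multi-channel kernels, and the dilation rates 1, 2, 3, the all-ones kernels and the helper names here are assumptions. Feeding in a single impulse shows how each fused map covers a progressively wider neighbourhood while keeping the densely sampled centre.

```python
def dilated_conv3x3(feat, kernel, dilation):
    """Single-channel 3x3 convolution with the given dilation and zero padding."""
    h, w = len(feat), len(feat[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in (-1, 0, 1):
                for j in (-1, 0, 1):
                    yy, xx = y + i * dilation, x + j * dilation
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i + 1][j + 1] * feat[yy][xx]
            out[y][x] = acc
    return out


def add_maps(a, b):
    return [[u + v for u, v in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def hffb_sketch(feat, kernels, dilations):
    """Return the maps to be concatenated: running sums of the dilated branches."""
    fused, running = [], None
    for kernel, dilation in zip(kernels, dilations):
        branch = dilated_conv3x3(feat, kernel, dilation)
        running = branch if running is None else add_maps(running, branch)
        fused.append(running)
    return fused


# A single impulse in a 9x9 map reveals the footprint of each fused branch.
impulse = [[0.0] * 9 for _ in range(9)]
impulse[4][4] = 1.0
ones = [[1.0] * 3 for _ in range(3)]
fused = hffb_sketch(impulse, [ones, ones, ones], [1, 2, 3])
footprints = [sum(1 for row in m for v in row if v != 0) for m in fused]
```

Summing before concatenation means the wide, sparse branches are filled in by the narrower ones, which is the usual motivation for this kind of fusion.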
3.3 Non-local Attention Module
Owing to factors such as occlusion, blurring and parallax, the neighboring frames still contain areas that are not well aligned after the alignment module and that do not provide the missing details needed for the reference frame. It is therefore essential to dynamically select valid inter-frame information before merging, and DNLN introduces a non-local attention module to achieve this goal. By capturing the global correlation between the aligned neighboring feature $F_i'$ and the reference feature $F_t$, the non-local module can effectively enhance the desirable fine details in $F_i'$ that are complementary to the reference frame, and suppress the misaligned areas.
The non-local operation shown in Fig. 5 can be defined as:

$$z_m = x_m + \frac{1}{C(y)} \sum_{n} f(x_m, y_n)\, g(y_n) \tag{4}$$

where $x$ and $y$ denote $F_i'$ and $F_t$, respectively. We take $x$ and $y$ as the inputs and generate the attention-guided features $z$. Here $m$ is the index of an output position, and $n$ is the index that enumerates all positions of $y$. The function $g$ can be expressed as $g(y_n) = W_g y_n$, which computes a representation of the input $y$ at position $n$. The pairwise function $f$ calculates the relationship between $x_m$ and $y_n$. We use the embedded Gaussian function to represent this pairwise relationship, and it is normalized by a factor $C(y)$:

$$f(x_m, y_n) = e^{\theta(x_m)^{T} \phi(y_n)}, \qquad C(y) = \sum_{n} f(x_m, y_n) \tag{5}$$

$\theta(x_m) = W_\theta x_m$ and $\phi(y_n) = W_\phi y_n$ are used to linearly embed the inputs and obtain the pairwise relationship. A value is then calculated from the representations of all positions of $y$ and added to the input $x_m$ to give the final output $z_m$. Through the non-local operation, the neighboring features can make full use of their pixel-level correlation with the reference feature to enhance the desired missing details.
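The non-local operation described above can be sketched in a few lines, under simplifying assumptions: the two feature maps are flattened into lists of per-position vectors, and the linear embeddings and the representation function are taken to be the identity (in the network they are learned). The function name `non_local_attention` and the toy inputs are illustrative only.

```python
import math


def non_local_attention(x, y):
    """x: M feature vectors, y: N feature vectors of the same length."""
    z, attention = [], []
    for x_m in x:
        # Embedded-Gaussian pairwise scores between x_m and every position of y.
        scores = [sum(a * b for a, b in zip(x_m, y_n)) for y_n in y]
        peak = max(scores)                         # subtracted for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)                        # the normalization factor
        weights = [wt / norm for wt in weights]
        # Weighted sum over all positions of y, then residual addition of x_m.
        aggregated = [sum(wt * y_n[c] for wt, y_n in zip(weights, y))
                      for c in range(len(x_m))]
        z.append([xc + ac for xc, ac in zip(x_m, aggregated)])
        attention.append(weights)
    return z, attention


x = [[5.0, 0.0], [0.0, 5.0]]
y = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
z, attention = non_local_attention(x, y)
```

Each row of `attention` is a softmax over the positions of `y`, so positions that correlate poorly with `x_m` contribute almost nothing to `z_m`.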
3.4 SR Reconstruction Module
The SR reconstruction module consists of stacked RRDBs and a global skip connection. RRDBs can make full use of hierarchical features from the neighboring frames to obtain better restoration quality. The structure of the RRDB is shown in Fig. 6. More details about RRDB can be found in . The global skip connection transfers the shallow features of the reference frame to the end of the network, so that the reconstruction module focuses on learning residual features from the neighboring frames. It preserves the spatial information of the input LR reference frame and keeps the super-resolved frame structurally similar to the input frame. The final output is produced by an upsampling layer.
4.1 Training Datasets and Details
Datasets. Training high-performance VSR networks requires a large video dataset. Xue et al. collected videos from Vimeo and, after processing, released the VSR dataset Vimeo-90K. The dataset contains 64612 training samples with varied and complex real-world motions. Each sample contains seven consecutive frames at a fixed resolution. We use Vimeo-90K as our training dataset. To generate LR images, we downscale the HR images with the MATLAB imresize function, which first blurs the input frames using cubic filters and then downsamples them using bicubic interpolation.
Training Details. In the feature extraction module, we use 5 residual blocks to extract shallow features. The alignment module then adopts 5 deformable convolution layers to perform feature alignment. In the reconstruction module, we use 16 RRDBs and set the number of increased channels to 64.
During training, we perform data augmentation by horizontal or vertical flipping, rotation and random cropping of the images. Each training batch contains 8 LR patches as inputs. Our model is trained with the Adam optimizer and implemented in PyTorch [17]. We train the network end-to-end by minimizing the L1 loss between the predicted frame and the ground-truth HR frame.
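A minimal sketch of the augmentation and loss described above follows. It assumes that the same flips and rotation must be applied to every LR frame and to the HR target so that they stay aligned, and that an LR crop at (top, left) corresponds to the HR window scaled by the upsampling factor. The patch size, the scale factor, the 0.5 probabilities and all helper names are hypothetical choices for illustration.

```python
import random


def hflip(img):
    return [row[::-1] for row in img]


def vflip(img):
    return img[::-1]


def rot90(img):
    # Clockwise quarter turn: reverse the rows, then transpose.
    return [list(col) for col in zip(*img[::-1])]


def augment(frames, rng):
    """Apply one randomly chosen set of flips/rotation to every frame."""
    ops = [op for op in (hflip, vflip, rot90) if rng.random() < 0.5]
    out = []
    for frame in frames:
        for op in ops:
            frame = op(frame)
        out.append(frame)
    return out


def paired_crop(lr_frames, hr_frame, size, scale, rng):
    """Crop the same LR window from every frame, plus the matching HR window."""
    h, w = len(lr_frames[0]), len(lr_frames[0][0])
    top, left = rng.randrange(h - size + 1), rng.randrange(w - size + 1)
    lr = [[row[left:left + size] for row in f[top:top + size]] for f in lr_frames]
    hr = [row[left * scale:(left + size) * scale]
          for row in hr_frame[top * scale:(top + size) * scale]]
    return lr, hr


def l1_loss(pred, target):
    """Mean absolute error between two same-sized 2-D lists."""
    diffs = [abs(p - t) for rp, rt in zip(pred, target) for p, t in zip(rp, rt)]
    return sum(diffs) / len(diffs)


rng = random.Random(0)
lr_frames = [[[4 * y + x for x in range(4)] for y in range(4)] for _ in range(3)]
hr_frame = [[lr_frames[0][y // 2][x // 2] for x in range(8)] for y in range(8)]
lr_patch, hr_patch = paired_crop(lr_frames, hr_frame, size=2, scale=2, rng=rng)
*lr_aug, hr_aug = augment(lr_patch + [hr_patch], rng)
```

Cropping before augmenting keeps the index arithmetic simple, since the crop coordinates are expressed in the original orientation.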
Figure 7: Visual comparisons on the Vid4 clips “Calendar” and “Walk”. Panels for each clip: (a) Bicubic, (b) DBPN, (c) RCAN, (d) VESPCN, (e) TOFlow, (f) FRVSR, (g) DUF, (h) RBPN, (i) Ours, (j) GT.
Table 1: Quantitative results (PSNR / SSIM) on Vid4.

| Clip Name | Flow Magnitude | Bicubic (1 Frame) | DBPN (1 Frame) | RCAN (1 Frame) | VESPCN (3 Frames) | TOFlow (7 Frames) | FRVSR (recurrent) | DUF (7 Frames) | RBPN (7 Frames) | DNLN (Ours) (7 Frames) |
|---|---|---|---|---|---|---|---|---|---|---|
| Calendar | 1.14 | 20.39 / 0.5720 | 22.27 / 0.7178 | 22.31 / 0.7248 | - | 22.44 / 0.7290 | - | 24.07 / 0.8123 | 23.95 / 0.8076 | 24.13 / 0.8149 |
| City | 1.63 | 25.17 / 0.6024 | 25.84 / 0.6835 | 26.07 / 0.6938 | - | 26.75 / 0.7368 | - | 28.32 / 0.8333 | 27.74 / 0.8051 | 27.97 / 0.8136 |
| Foliage | 1.48 | 23.47 / 0.5666 | 24.70 / 0.6615 | 24.69 / 0.6628 | - | 25.24 / 0.7065 | - | 26.41 / 0.7713 | 26.21 / 0.7578 | 26.30 / 0.7611 |
| Walk | 1.44 | 26.11 / 0.7977 | 28.65 / 0.8706 | 28.64 / 0.8718 | - | 29.03 / 0.8777 | - | 30.63 / 0.9144 | 30.70 / 0.9111 | 30.85 / 0.9131 |
| Average | 1.42 | 23.79 / 0.6347 | 25.37 / 0.7334 | 25.43 / 0.7383 | 25.35 / 0.7557 | 25.86 / 0.7625 | 26.69 / 0.822 | 27.36 / 0.8328 | 27.15 / 0.8204 | 27.31 / 0.8257 |
Table 2: Quantitative results (PSNR / SSIM) on SPMCS.

| Clip Name | Flow Magnitude | Bicubic (1 Frame) | DBPN (1 Frame) | RCAN (1 Frame) | TOFlow (7 Frames) | DUF (7 Frames) | RBPN (7 Frames) | DNLN (Ours) (7 Frames) |
|---|---|---|---|---|---|---|---|---|
| car05001 | 6.21 | 27.75 / 0.7825 | 29.81 / 0.8463 | 29.86 / 0.8484 | 30.10 / 0.8626 | 30.79 / 0.8707 | 31.95 / 0.9021 | 31.95 / 0.8997 |
| hdclub003001 | 0.70 | 19.42 / 0.4863 | 20.37 / 0.6041 | 20.41 / 0.6096 | 20.86 / 0.6523 | 22.05 / 0.7438 | 21.91 / 0.7257 | 22.16 / 0.7380 |
| hitachiisee5001 | 3.01 | 19.61 / 0.5938 | 23.44 / 0.8202 | 23.71 / 0.8369 | 22.88 / 0.8044 | 25.77 / 0.8929 | 26.30 / 0.9049 | 26.76 / 0.9098 |
| hk004001 | 0.49 | 28.54 / 0.8003 | 31.64 / 0.8614 | 31.68 / 0.8631 | 30.89 / 0.8654 | 32.98 / 0.8988 | 33.38 / 0.9016 | 33.55 / 0.9049 |
| HKVTG004 | 0.11 | 27.46 / 0.6831 | 28.71 / 0.7588 | 28.81 / 0.7649 | 28.49 / 0.7487 | 29.16 / 0.7860 | 29.51 / 0.7979 | 29.57 / 0.7997 |
| jvc009001 | 1.24 | 25.40 / 0.7558 | 27.97 / 0.8580 | 28.31 / 0.8717 | 27.85 / 0.8542 | 29.18 / 0.8961 | 30.06 / 0.9105 | 30.75 / 0.9213 |
| NYVTG006 | 0.10 | 28.45 / 0.8014 | 29.79 / 0.8640 | 31.01 / 0.8859 | 30.12 / 0.8603 | 32.30 / 0.9090 | 33.22 / 0.9231 | 33.52 / 0.9287 |
| PRVTG012 | 0.12 | 25.63 / 0.7136 | 26.57 / 0.7785 | 26.56 / 0.7806 | 26.62 / 0.7788 | 27.39 / 0.8166 | 27.60 / 0.8242 | 27.73 / 0.8281 |
| RMVTG011 | 0.18 | 23.96 / 0.6573 | 25.81 / 0.7489 | 26.02 / 0.7569 | 25.89 / 0.7500 | 27.56 / 0.8113 | 27.63 / 0.8170 | 27.78 / 0.8208 |
| veni3011 | 0.36 | 29.47 / 0.8979 | 33.67 / 0.9597 | 34.58 / 0.9629 | 32.85 / 0.9536 | 34.63 / 0.9677 | 36.61 / 0.9735 | 36.92 / 0.9745 |
| veni5015 | 0.36 | 27.41 / 0.8483 | 30.40 / 0.9221 | 31.04 / 0.9262 | 30.03 / 0.9118 | 31.88 / 0.9371 | 32.37 / 0.9409 | 33.28 / 0.9477 |
| Average | 1.17 | 25.73 / 0.7291 | 28.02 / 0.8202 | 28.36 / 0.8279 | 27.87 / 0.8220 | 29.43 / 0.8664 | 30.05 / 0.8747 | 30.36 / 0.8794 |
Table 3: Quantitative results (PSNR / SSIM) on Vimeo-90K-T, grouped by motion velocity.

| Algorithm | Slow | Medium | Fast |
|---|---|---|---|
| Bicubic | 29.34 / 0.8330 | 31.29 / 0.8708 | 34.07 / 0.9050 |
| DBPN | 32.80 / 0.9007 | 35.19 / 0.9249 | 38.25 / 0.9440 |
| RCAN | 32.93 / 0.9032 | 35.35 / 0.9268 | 38.47 / 0.9456 |
| TOFlow | 32.15 / 0.8900 | 35.01 / 0.9254 | 37.70 / 0.9430 |
| DUF | 33.41 / 0.9110 | 36.71 / 0.9446 | 38.87 / 0.9510 |
| RBPN | 34.26 / 0.9222 | 37.39 / 0.9494 | 40.16 / 0.9611 |
| DNLN (Ours) | 34.53 / 0.9253 | 37.64 / 0.9512 | 40.36 / 0.9618 |
| # of clips | 1,616 | 4,983 | 1,225 |
| Avg. Flow Mag. | 0.6 | 2.5 | 8.3 |
4.2 Comparison with the State-of-the-art Methods
We compare our DNLN with several state-of-the-art SISR and VSR methods: DBPN, RCAN, VESPCN, TOFlow, FRVSR, DUF and RBPN. Note that most previous methods were trained on different datasets, and we simply compare against the results they provide. The SR results are evaluated quantitatively with PSNR and SSIM on the Y channel (i.e., luminance) of the transformed YCbCr space. In the evaluation, the first and last two frames are excluded, and we do not crop any border pixels except for DUF. Eight pixels near the image boundary are cropped for DUF because of its severe boundary effects.
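As a minimal sketch of this protocol, the following computes PSNR on the luminance channel only. It assumes 8-bit RGB inputs, the BT.601 conversion used by MATLAB's rgb2ycbcr (so that Y lies in [16, 235]) and a peak value of 255; these conventions are common in SR evaluation but are assumptions here, and the function names are illustrative.

```python
import math


def rgb_to_y(r, g, b):
    """BT.601 luma for 8-bit RGB values, giving Y in the range [16, 235]."""
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def psnr_y(sr, hr):
    """PSNR on the Y channel; sr and hr are same-sized 2-D lists of (r, g, b)."""
    squared_error, count = 0.0, 0
    for row_sr, row_hr in zip(sr, hr):
        for p, q in zip(row_sr, row_hr):
            squared_error += (rgb_to_y(*p) - rgb_to_y(*q)) ** 2
            count += 1
    mse = squared_error / count
    return float("inf") if mse == 0 else 10.0 * math.log10(255.0 ** 2 / mse)


white = [[(255, 255, 255)]]
black = [[(0, 0, 0)]]
worst_case = psnr_y(white, black)
```

Per-clip scores such as those in the tables are then typically obtained by averaging the per-frame values over the evaluated frames.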
We evaluated our models on three datasets: Vid4, SPMCS and Vimeo-90K-T, with the average flow magnitude (pixel/frame) provided in . Vid4 is a commonly used dataset that contains four video sequences: calendar, city, foliage and walk. However, Vid4 has limited inter-frame motion, and there exist artifacts in its ground-truth frames. SPMCS consists of more high-quality video clips with various motions and diverse scenes. Vimeo-90K-T is a much larger dataset. It covers a wide range of flow magnitudes between frames, which makes it well suited to judging the performance of VSR methods.
Table 1 displays the numerical results on Vid4. Our model outperforms all other methods except DUF on average, and it still achieves higher PSNR than DUF on the “calendar” and “walk” clips. Qualitative results are shown in Fig. 7. In the “walk” clip, most previous methods blur the rope and the clothing together, whereas only DNLN clearly distinguishes the two parts and restores the pattern closest to the ground-truth frame.
Results on SPMCS are shown in Table 2. DNLN achieves the best average results among both SISR and VSR approaches, exceeding RBPN by 0.31 dB in average PSNR. Moreover, our model surpasses the optical-flow-based methods, which demonstrates the effectiveness of our flow-free alignment module. Compared with Vid4, SPMCS contains more high-frequency information at higher resolution, which demands strong recovery ability from the algorithms. Visual comparisons are depicted in Fig. 8. Although DUF and RBPN reproduce part of the HR patterns better than the other methods, only DNLN restores the abundant details and clean edges.
Results on Vimeo-90K-T are shown in Table 3, where we classified the video clips into three groups (slow, medium and fast) according to motion velocity. As the motion velocity increases, large object movements blur the objects in the video frames. The PSNR and SSIM of all methods nevertheless increase with motion velocity, which indicates that video frames with larger motion contain more useful temporal information. DNLN achieves the best performance in all three categories, surpassing RBPN by 0.27 dB, 0.25 dB and 0.20 dB in PSNR, respectively. Because the flow magnitude of the fast group in Vimeo-90K-T is much higher than that of Vid4 and SPMCS, the content varies greatly between frames, and the result on this group indicates that DNLN exploits multi-frame information effectively. More qualitative evaluations are shown in Fig. 9, where we mark the positions that show obvious differences among the methods. For the railing texture in the third row, only our method restores the correct and clear pattern, while the others show varying degrees of blurring. Even where other SR frames have equally sharp edges, DNLN is more accurate. For the letters “P” and “W” in the fifth row, for example, the results restored by RBPN and DNLN are equally clear, but RBPN mistakenly connects the two letters.
[Figure: visual comparison. Panels: (a) TOFlow, (b) DUF, (c) RBPN, (d) Ours, (e) GT.]
[Figure: visual comparison. Panels: (b) TOFlow, (c) DUF, (d) RBPN, (e) Ours, (f) GT.]
Table 4: Ablation study on Vimeo-90K-T.

| PSNR / SSIM |
|---|
| 36.86 / 0.9420 |
| 37.38 / 0.9474 |
| 37.28 / 0.9464 |
| 37.42 / 0.9475 |
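Table 5: Effect of the number of deformable convolutions and of the HFFB.

| Model | PSNR / SSIM |
|---|---|
| w/o deform | 36.86 / 0.9420 |
| 1dconv | 37.14 / 0.9449 |
| 2dconv | 37.23 / 0.9458 |
| 3dconv | 37.37 / 0.9471 |
| 4dconv | 37.40 / 0.9473 |
| 5dconv | 37.42 / 0.9475 |
| 5dconv, w/o HFFB | 37.13 / 0.9449 |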
Figure 10: Visual comparisons for different numbers of deformable convolutions. Panels: (b) 2dconv, (c) 3dconv, (d) 4dconv, (e) 5dconv, (f) w/o HFFB, (g) GT.
4.3 Ablation Study
To further investigate the proposed method, we conducted ablation experiments by removing components of our network. The results are shown in Table 4. First, we remove the alignment module, so that the shallow frame features are fed directly into the following network without warping. The resulting PSNR on Vimeo-90K-T is relatively low, which indicates that alignment is crucial for exploiting inter-frame information. Second, we remove the non-local attention guidance module, and the performance decreases a lot. Third, we replace the RRDBs with a simple stack of common residual blocks, which also harms the performance. These quantitative results demonstrate the effectiveness and benefits of the three proposed modules.
The ablation experiments above show that network performance is significantly affected by alignment. We therefore further validated the impact of deformable convolutions on the reconstruction capability. As shown in Table 5, even a single deformable convolution improves the PSNR greatly (from 36.86 dB to 37.14 dB). This demonstrates the importance of alignment for making efficient use of the neighboring frames. As the number of deformable convolutions increases, the network performs better. The visual comparisons are shown in Fig. 10. From left to right, the network progressively alleviates the blurring artifacts on the office building and recovers more details. In addition, when we replace the HFFB used to generate the deformable sampling parameters with a convolution layer, the performance of the network decreases by roughly 0.29 dB. This indicates that, by enlarging the receptive field, the deformable convolution can deal more effectively with complex and large motions.
To study the influence of inter-frame information on the recovery results, we trained our network with different numbers of input frames. From Fig. 11, we can observe a dramatic improvement in DNLN when switching from 3 frames to 5 frames, and DNLN/5 even outperforms RBPN, which uses 7 frames. Switching further to 7 frames still gives a better result, but the improvement becomes minor.
5 Conclusion
In this paper, we propose a novel deformable non-local network (DNLN), a non-flow-based method for effective and efficient video super-resolution. To handle complicated and large motions, we introduce deformable convolution with the HFFB in our alignment module, which aligns the frames well at the feature level. In addition, we adopt a non-local attention guidance module to further extract complementary features between frames. By making full use of the temporal information, we finally restore a high-quality video frame through a reconstruction module. Extensive experiments demonstrate that DNLN achieves state-of-the-art performance on several benchmark datasets.
References
-  Jose Caballero, Christian Ledig, Andrew Aitken, Alejandro Acosta, Johannes Totz, Zehan Wang, and Wenzhe Shi. Real-time video super-resolution with spatio-temporal networks and motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4778–4787, 2017.
-  Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 764–773, 2017.
-  Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In European conference on computer vision, pages 184–199. Springer, 2014.
-  Chao Dong, Chen Change Loy, and Xiaoou Tang. Accelerating the super-resolution convolutional neural network. In European conference on computer vision, pages 391–407. Springer, 2016.
-  Muhammad Haris, Gregory Shakhnarovich, and Norimichi Ukita. Deep back-projection networks for super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1664–1673, 2018.
-  Muhammad Haris, Gregory Shakhnarovich, and Norimichi Ukita. Recurrent back-projection network for video super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3897–3906, 2019.
-  Zheng Hui, Jie Li, Xinbo Gao, and Xiumei Wang. Progressive perception-oriented network for single image super-resolution. arXiv preprint arXiv:1907.10399, 2019.
-  Younghyun Jo, Seoung Wug Oh, Jaeyeon Kang, and Seon Joo Kim. Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3224–3232, 2018.
-  Armin Kappeler, Seunghwan Yoo, Qiqin Dai, and Aggelos K Katsaggelos. Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging, 2(2):109–122, 2016.
-  Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1646–1654, 2016.
-  Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1637–1645, 2016.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Renjie Liao, Xin Tao, Ruiyu Li, Ziyang Ma, and Jiaya Jia. Video super-resolution via deep draft-ensemble learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 531–539, 2015.
-  Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 136–144, 2017.
-  Ce Liu and Deqing Sun. On bayesian adaptive video super resolution. IEEE transactions on pattern analysis and machine intelligence, 36(2):346–360, 2013.
-  Ding Liu, Zhaowen Wang, Yuchen Fan, Xianming Liu, Zhangyang Wang, Shiyu Chang, and Thomas Huang. Robust video super-resolution with learned temporal dynamics. In Proceedings of the IEEE International Conference on Computer Vision, pages 2507–2515, 2017.
-  Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
-  Mehdi SM Sajjadi, Raviteja Vemulapalli, and Matthew Brown. Frame-recurrent video super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6626–6634, 2018.
-  Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1874–1883, 2016.
-  Xin Tao, Hongyun Gao, Renjie Liao, Jue Wang, and Jiaya Jia. Detail-revealing deep video super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, pages 4472–4480, 2017.
-  Yapeng Tian, Yulun Zhang, Yun Fu, and Chenliang Xu. Tdan: Temporally deformable alignment network for video super-resolution. arXiv preprint arXiv:1812.02898, 2018.
-  Radu Timofte, Eirikur Agustsson, Luc Van Gool, Ming-Hsuan Yang, and Lei Zhang. Ntire 2017 challenge on single image super-resolution: Methods and results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 114–125, 2017.
-  Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
-  Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
-  Zhou Wang, Alan C Bovik, Hamid R Sheikh, Eero P Simoncelli, et al. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
-  Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T Freeman. Video enhancement with task-oriented flow. International Journal of Computer Vision, pages 1–20, 2017.
-  Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 286–301, 2018.
-  Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2472–2481, 2018.
-  Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9308–9316, 2019.