Deformable Non-local Network For Video Super-Resolution

09/24/2019 ∙ by Hua Wang, et al. ∙ 10

The video super-resolution (VSR) task aims to restore a high-resolution video frame from its corresponding low-resolution frame and multiple neighboring frames. At present, many deep learning-based VSR methods rely on optical flow to perform frame alignment, so the final recovery results are greatly affected by the accuracy of the optical flow. However, optical flow estimation cannot be completely accurate, and there are always some errors. In this paper, we propose a novel deformable non-local network (DNLN) which is non-flow-based. Specifically, we apply an improved deformable convolution in our alignment module to achieve adaptive frame alignment at the feature level. Furthermore, we utilize a non-local module to capture the global correlation between the reference frame and the aligned neighboring frame, and simultaneously enhance desired fine details in the aligned frame. To reconstruct the final high-quality HR video frames, we use residual in residual dense blocks to take full advantage of the hierarchical features. Experimental results on several datasets demonstrate that the proposed DNLN achieves state-of-the-art performance on the video super-resolution task.




1 Introduction

The goal of super-resolution (SR) is to generate a high-resolution (HR) image or video from its low-resolution (LR) version. As an extension of single image super-resolution (SISR), video super-resolution (VSR) provides a way to restore correct content from degraded video, so that the reconstructed video frames contain more details with higher clarity. This technology is of practical significance and can be used in many fields, such as video surveillance and ultra-high-definition television.

Unlike SISR, which considers only a single low-resolution image as input at a time, VSR depends on how effectively the intrinsic temporal information among multiple low-resolution video frames is used. Vanilla SISR approaches can be applied directly to video frames by treating them as single images, but the details available from the other frames are then wasted and the reconstructed frames are of limited quality. These approaches are therefore not well adapted to the VSR task.

To overcome the limitations of SISR, existing VSR methods usually take an LR reference frame and multiple neighboring frames as inputs to reconstruct the corresponding HR reference frame. Due to the motion of the camera or objects, the neighboring frames must first be spatially aligned so that their information can be used and the missing details extracted from them. To this end, traditional VSR methods [16, 20, 18, 1] usually calculate the optical flow and estimate the sub-pixel motion between LR frames to warp the neighboring frames and perform the alignment. However, fast and reliable flow estimation remains a challenging problem. The illumination consistency assumption that the estimation algorithm relies on may fail due to changes of illumination and posture as well as motion blur and occlusion. Incorrect motion compensation introduces artifacts in the aligned neighboring frames and affects the quality of the final restored images. Therefore, explicit flow estimation and motion compensation can be sub-optimal for the video super-resolution task.

In this paper, we propose a novel deformable non-local network (DNLN), which is non-flow-based, to perform both implicit motion estimation and video super-resolution. Our network consists of three modules. The first is the alignment module. Inspired by TDAN [21], we apply deformable convolution [29] and enhance its ability to adaptively warp the frames. Specifically, we use the hierarchical feature fusion block (HFFB) to effectively generate the deformable convolution parameters. Through the stacked deformable convolutions, we perform finer alignment operations and gradually align the input neighboring frame features. The second is the attention guidance module. We exploit a non-local structure to capture the global correlation between the neighboring frame and the reference frame, which is used to assess the importance of different regions in the neighboring frame. This correlation is expected to highlight the features complementary to the reference frame and exclude regions with improper alignment. The features with attention guidance are then concatenated and fed into a merging layer to reduce channels. The third is the SR reconstruction module. We use the residual in residual dense block (RRDB) to generate the HR reference frame. RRDB helps to make full use of information from different hierarchical levels, so as to retain more details of the input LR frame.

In summary, the main contributions of this paper can be concluded as follows:

  • We propose a novel deformable non-local network (DNLN) to accomplish high quality video super-resolution. Our method achieves the most advanced VSR performance on several benchmark datasets.

  • We design an alignment module based on deformable convolution, which realizes feature-level alignment in a coarse-to-fine manner without explicit motion compensation.

  • We propose a non-local attention guidance module to select useful features from neighboring frames that are conducive to the recovery of reference frame.

2 Related Work

2.1 Single Image Super-resolution

Dong et al. [3] first proposed SRCNN for single image super-resolution to learn the nonlinear mapping between LR and HR images in an end-to-end manner, and achieved better performance than previous work. VDSR [10] further improved SRCNN by stacking more convolution layers and using residual learning to increase network depth. DRCN [11] first introduced recursive learning in very deep networks for parameter sharing. All of these methods first interpolate the LR input to the desired size, and the reconstruction process operates on the interpolated image. Such a pre-processing step inevitably results in loss of details and additional computation cost. To avoid these problems, extracting features from the original LR input and upscaling the spatial resolution at the end of the network became the main direction of SR networks. Dong et al. [4] directly took the original LR image as input and introduced a transposed convolution layer (also known as a deconvolution layer) for upsampling to high resolution. Shi et al. [19] proposed an efficient sub-pixel convolution layer for upscaling the final LR feature maps to the HR output and accelerating the network.

Afterwards, Timofte et al. [22] provided a new large dataset (DIV2K) in the NTIRE 2017 challenge, which consists of 1000 2K-resolution images. This dataset enables researchers to train deeper and wider networks, which has led to a variety of new SR methods. The most advanced SISR networks, such as EDSR [14], DBPN [5] and RCAN [27], trained on this dataset, perform far better than previous networks.

2.2 Video Super-resolution

Liao et al. [13] proposed DECN, which uses two classical optical flow methods, TV-L1 and MDP flow, to generate SR drafts with different parameters, and then produces the final result through a deep network. Kappeler et al. [9] proposed VSRnet, which uses a hand-designed optical flow algorithm to perform motion compensation on the input LR frames, and uses the warped frames as the CNN input to predict the HR video frame. Caballero et al. [1] introduced the first end-to-end VSR network, VESPCN, which studies early fusion, slow fusion, and 3D convolution to learn temporal relationships. They use a multi-scale spatial transformer to warp the LR frames and then generate an HR frame through another deep network. Tao et al. [20] proposed a sub-pixel motion compensation layer for frame alignment and used a convolutional LSTM architecture in the subsequent SR reconstruction network.

Most prior VSR methods exploit optical flow to estimate motion between frames and warp them to integrate effective features. However, it is difficult to obtain precise flow estimation in the case of occlusion and large movement. Xue et al. [26] proposed TOFlow with learnable task-oriented motion cues. It achieves better VSR results than fixed flow algorithms, which reveals that standard optical flow is not the best motion representation for video recovery. To circumvent this problem, DUF [8] uses adaptive upsampling with dynamic filters that depend on the input frames instead of an explicit estimation process. TDAN [21] uses deformable convolution to adaptively align the video frames at the feature level without computing optical flow. These methods surpass the flow-based approaches through implicit motion compensation.

2.3 Deformable Convolution

To enhance the capability of CNNs to model geometric transformations, Dai et al. [2] proposed deformable convolutions. They add offsets to the regular grid sampling locations of the standard convolution and thus enable arbitrary deformation of the sampling grid. To further enhance the modeling capability, modulated deformable convolutions [29] were presented, which can also learn dynamic weights for the sampling kernels. Deformable convolution is effective for high-level vision tasks such as object detection and semantic segmentation. TDAN [21] is the first to utilize deformable convolutions in the VSR task. It uses them to adaptively align the input frames at the feature level without explicit motion estimation and shows superior performance to previous optical flow-based VSR networks.

Figure 1: The architecture of the proposed DNLN framework. We only show one neighboring frame in the figure. Each neighboring frame passes through the extraction module, the alignment module and the non-local attention guidance module. Then all the features are concatenated and fed into the SR reconstruction module to generate the HR reference frame.

2.4 Non-local Block

Inspired by the classic non-local means in computer vision, Wang et al. [23] proposed a building block for video classification based on non-local operations. For image data, long-range dependencies are commonly modeled via large receptive fields formed by deep stacks of convolutional layers. In contrast, non-local operations capture long-range dependencies directly by computing interactions between any two positions, regardless of their positional distance. A non-local operation computes the response at a position as a weighted sum over all positions in the input feature maps. The set of positions can be in space, time, or spacetime, so non-local operations are applicable to both image and video problems.

3 Deformable Non-local Networks

3.1 Network Architecture

Given a sequence of 2N+1 consecutive low-resolution frames $\{I^{LR}_{t-N}, \dots, I^{LR}_{t}, \dots, I^{LR}_{t+N}\}$, where $I^{LR}_{t}$ is the reference frame and the others are the neighboring frames, our goal is to recover the corresponding high-quality video frame $I^{SR}_{t}$ from the reference frame and its 2N neighboring frames. Our network therefore takes $\{I^{LR}_{t-N}, \dots, I^{LR}_{t+N}\}$ as input and reconstructs $I^{SR}_{t}$. The overall network structure is shown in Fig. 1. It can be divided into four parts: the feature extraction module, the alignment module, the non-local attention guidance module and the final SR reconstruction module.

For all the input LR frames, we first extract their features via a shared feature extraction module, which consists of one convolutional layer and several residual blocks. The LR feature of each neighboring frame then enters the alignment module along with the LR reference feature. Our alignment module, which consists of stacked deformable convolutional layers, is responsible for adaptive feature-level alignment. Subsequently, the reference feature and each aligned neighboring feature are fed into a non-local attention guidance module. By calculating the global correlation between them, connections between pixels are established and the missing details of the target frame are further enhanced. The last part is the SR reconstruction module, for which we use residual in residual dense blocks (RRDBs). We concatenate the 2N+1 features and fuse them through a convolution layer, and the fused feature maps are then fed into the RRDBs. Moreover, we use a skip connection to propagate the LR reference feature to the end of the network, where it is added element-wise to the output of the RRDBs. Finally, a high-quality HR reference frame is recovered through an upsampling layer.

Figure 2: The proposed alignment module.

3.2 Alignment Module

In order to make use of temporal information from consecutive frames, traditional VSR methods rely on optical flow to perform frame alignment. However, explicit motion compensation can be sub-optimal for the video super-resolution task. Therefore, DNLN exploits modulated deformable convolutions [29] in the alignment module to avoid this limitation.

For each location $p$ on the output feature map $y$, a normal convolution process can be expressed as:

$$ y(p) = \sum_{k=1}^{K} w_k \cdot x(p + p_k) \quad (1) $$

where $p_k$ represents the sampling grid with $K$ sampling locations and $w_k$ denotes the weights for each location. For example, $K = 9$ and $p_k \in \{(-1,-1), (-1,0), \dots, (1,1)\}$ defines a $3 \times 3$ convolutional kernel. In the modulated deformable convolution, predicted offsets and modulation scalars are added to the sampling grid, making the deformable kernels spatially variant. In our alignment module, let $F_{t+i}(p)$ and $F^{a}_{t+i}(p)$ denote the features at location $p$ in the input feature maps $F_{t+i}$ and the aligned output feature maps $F^{a}_{t+i}$, respectively. The operation of modulated deformable convolution is as follows:

$$ F^{a}_{t+i}(p) = \sum_{k=1}^{K} w_k \cdot F_{t+i}(p + p_k + \Delta p_k) \cdot \Delta m_k \quad (2) $$

where $\Delta p_k$ and $\Delta m_k$ are the learnable offset and modulation scalar for the $k$-th location, respectively. The convolution operates on the irregular positions with dynamic weights to achieve adaptive sampling on the input features. The offsets and modulation scalars are both learned: each input neighboring feature and the reference one are concatenated to generate the required deformable sampling parameters:

$$ (\Delta P, \Delta M) = f([F_{t+i}, F_{t}]) \quad (3) $$

Here $\Delta P = \{\Delta p_k\}$ and $\Delta M = \{\Delta m_k\}$. They are obtained from the general function $f$. As $p + p_k + \Delta p_k$ may be fractional, we use bilinear interpolation, in the same way as proposed in [2].
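To make the sampling rule of the modulated deformable convolution concrete, the following is a minimal single-channel sketch in plain Python. It is an illustration under stated assumptions rather than the implementation used in DNLN: the helper names `bilinear` and `modulated_deform_conv_at` are hypothetical, the offsets and modulation scalars are set by hand instead of being predicted by a network, and positions outside the feature map are treated as zero.

```python
import math

def bilinear(feat, y, x):
    """Bilinearly sample the 2D list `feat` at fractional (y, x); zero outside."""
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * feat[yy][xx]
    return value

# Regular 3x3 sampling grid p_k (K = 9), stored as (dy, dx) pairs.
GRID = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def modulated_deform_conv_at(feat, p, weights, offsets, masks):
    """One output value: sum_k w_k * feat(p + p_k + dp_k) * dm_k."""
    py, px = p
    out = 0.0
    for (gy, gx), w_k, (oy, ox), m_k in zip(GRID, weights, offsets, masks):
        out += w_k * bilinear(feat, py + gy + oy, px + gx + ox) * m_k
    return out

# Toy 4x4 feature map whose value at (row, col) is 4 * row + col.
feat = [[4.0 * r + c for c in range(4)] for r in range(4)]
weights = [1.0 / 9.0] * 9

# Zero offsets and unit modulation reduce to an ordinary 3x3 convolution.
plain = modulated_deform_conv_at(feat, (1, 1), weights, [(0.0, 0.0)] * 9, [1.0] * 9)
# Shifting every sampling point half a pixel to the right changes the result.
shifted = modulated_deform_conv_at(feat, (1, 1), weights, [(0.0, 0.5)] * 9, [1.0] * 9)
```

With zero offsets and unit modulation scalars the function reproduces a regular convolution, which is a convenient sanity check. In a real layer the same computation is repeated for every location and channel, and the offsets and scalars differ per location.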

The alignment module proposed in DNLN is composed of several deformable convolution layers, as shown in Fig. 2. In each deformable convolution, as illustrated in Fig. 3, a reference feature and a neighboring feature are concatenated as the input. They then pass through a convolution layer to reduce channels, and a hierarchical feature fusion block (HFFB) [7] is used to obtain the offsets and modulation scalars of the convolution kernel. The HFFB in Fig. 4 introduces a spatial pyramid of dilated convolutions to effectively enlarge the receptive field at relatively low computational cost, which helps to deal with complicated and large motions between frames. In the HFFB, the feature maps obtained using kernels of different dilation rates are hierarchically added before being concatenated, which helps to acquire an effective receptive field. In this way, we can exploit the temporal dependency of pixels more efficiently to generate the sampling parameters.
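The hierarchical addition can be sketched on a one-dimensional signal. The sketch below is a simplification under several assumptions: the branches use a fixed 3-tap kernel instead of learned 2D kernels, the hierarchical addition is taken to be a running sum over branches ordered by dilation rate, and the function names are hypothetical.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded, same-length 1D dilated convolution (cross-correlation)."""
    n = len(signal)
    half = (len(kernel) // 2) * dilation
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i - half + j * dilation
            if 0 <= idx < n:
                acc += w * signal[idx]
        out.append(acc)
    return out

def hierarchical_fusion(signal, kernel, rates=range(1, 9)):
    """Add the dilated branches cumulatively; the returned maps would then be
    concatenated along the channel dimension."""
    fused, running = [], None
    for rate in rates:
        branch = dilated_conv1d(signal, kernel, rate)
        running = branch if running is None else [r + b for r, b in zip(running, branch)]
        fused.append(running)
    return fused

def receptive_field(kernel_size, dilation):
    """Span covered by a single dilated kernel."""
    return dilation * (kernel_size - 1) + 1

# An impulse shows how far each fused map reaches.
impulse = [0.0] * 21
impulse[10] = 1.0
fused = hierarchical_fusion(impulse, [1.0, 1.0, 1.0])
support = [i for i, v in enumerate(fused[-1]) if v != 0.0]
```

For the impulse input, the last fused map is non-zero from index 2 to index 18, which matches the 17-sample span of a 3-tap kernel with dilation 8, while the earlier maps keep the denser, smaller-range responses.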

According to Eq. (2), the deformable kernel can adaptively select sampling positions on the neighboring features, learn implicit motion compensation between two frames, and complete the alignment of features. With a cascade of several deformable convolutions, we can gradually align the neighboring features and improve the sub-pixel alignment accuracy. Note that when passing through a deformable convolution layer, the reference feature always remains unchanged and only serves as a reference for the alignment of the neighboring features. Through such a coarse-to-fine process, the neighboring frames can be warped adaptively at the feature level.

Figure 3: Detailed illustration of deformable convolution operation.
Figure 4: The hierarchical feature fusion block (HFFB). It contains 8 dilated convolutions with a dilation rate from 1 to 8. The feature maps obtained using kernels of different dilation rates are hierarchically added before concatenating them.

3.3 Non-local Attention Module

Due to factors such as occlusion, blurring, and parallax, even after the alignment module the neighboring frames still have some areas that are not well aligned and do not contain the missing details needed for the reference frame. It is therefore essential to dynamically select valid inter-frame information before merging, and DNLN introduces a non-local attention module to achieve this goal. By capturing the global correlation between the aligned neighboring feature $F^{a}_{t+i}$ and the reference one $F_{t}$, the non-local module can effectively enhance desirable fine details in $F^{a}_{t+i}$ that are complementary to the reference frame, and suppress the misaligned areas.

The non-local operation shown in Fig. 5 takes two inputs $x$ and $y$, which are the reference feature and the aligned neighboring feature, and generates the attention-guided features $z$. Its core is the normalized weighted sum

$$ \frac{1}{\mathcal{C}(x, y)} \sum_{\forall n} f(x_m, y_n)\, g(y_n) \quad (4) $$

Here $m$ is the index of an output position, and $n$ is the index that enumerates all positions of $y$. The function $g$ can be expressed as $g(y_n) = W_g y_n$, which computes a representation of the input $y$ at position $n$. The pairwise function $f$ calculates the relationship between $x_m$ and $y_n$. We use the embedded Gaussian function to represent this pairwise relationship, and it is normalized by a factor $\mathcal{C}(x, y)$:

$$ f(x_m, y_n) = e^{\theta(x_m)^{T} \phi(y_n)}, \qquad \mathcal{C}(x, y) = \sum_{\forall n} f(x_m, y_n) \quad (5) $$

$\theta(x_m) = W_{\theta} x_m$ and $\phi(y_n) = W_{\phi} y_n$ are used to linearly embed the inputs and obtain the pairwise relationship. The weighted sum over the representations of all positions of $y$ is then added to the input to give the final output $z_m$. Through the non-local operation, the neighboring features can make full use of the correlation with the reference feature at the pixel level to enhance the desired missing details.
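The embedded Gaussian weighting amounts to a softmax over all positions of y. The following plain-Python sketch illustrates that weighted aggregation only; it is not the module used in DNLN. Positions are flattened into lists of feature vectors, the embedding matrices are passed in by hand rather than learned, the final residual addition is left out, and the function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def matvec(mat, vec):
    return [dot(row, vec) for row in mat]

def nonlocal_response(x, y, w_theta, w_phi, w_g):
    """For every position m of x, aggregate g(y_n) over all positions n of y
    with weights exp(theta(x_m) . phi(y_n)) normalized by their sum."""
    theta = [matvec(w_theta, x_m) for x_m in x]
    phi = [matvec(w_phi, y_n) for y_n in y]
    g = [matvec(w_g, y_n) for y_n in y]
    out = []
    for t in theta:
        scores = [dot(t, p) for p in phi]
        peak = max(scores)  # subtracting the max keeps exp() numerically stable
        e = [math.exp(s - peak) for s in scores]
        c = sum(e)  # normalization factor
        out.append([sum(e_n * g_n[d] for e_n, g_n in zip(e, g)) / c
                    for d in range(len(g[0]))])
    return out

identity = [[1.0, 0.0], [0.0, 1.0]]
y = [[1.0, 0.0], [0.0, 1.0]]

# A query strongly correlated with y[0] attends almost only to y[0].
focused = nonlocal_response([[10.0, 0.0]], y, identity, identity, identity)
# A zero query gives equal scores, so the result is the plain average of y.
uniform = nonlocal_response([[0.0, 0.0]], y, identity, identity, identity)
```

The two toy queries show the intended behavior: positions of y that correlate strongly with the query dominate the aggregated response, while uncorrelated positions contribute almost nothing.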

Figure 5: The non-local attention module.

3.4 SR Reconstruction Module

The SR reconstruction module consists of stacked RRDBs and a global skip connection. RRDBs can make full use of hierarchical features from the neighboring frames to obtain better restoration quality. The structure of the RRDB is shown in Fig. 6, and more details can be found in [24]. The global skip connection transfers the shallow features of the reference frame to the end of the network, so that the reconstruction module focuses on learning residual features from the neighboring frames. It preserves the spatial information of the input LR reference frame and keeps the input frame and the corresponding super-resolved frame structurally similar. The final result is produced by an upsampling layer.

Figure 6: The residual in residual dense block.

4 Experiments

4.1 Training Datasets and Details

Datasets To train high-performance VSR networks, a large video dataset is required. Xue et al. [26] collected videos from Vimeo and, after processing, released the VSR dataset Vimeo-90K. The dataset contains 64612 training samples with various and complex real-world motions. Each sample contains seven consecutive frames with a fixed resolution of $448 \times 256$. We use the Vimeo-90K dataset as our training dataset. To generate LR images, we downscale the HR images with the MATLAB imresize function, which first blurs the input frames using cubic filters and then downsamples them using bicubic interpolation.

Training Details In the feature extraction module, we utilize 5 residual blocks to extract shallow features. Then the alignment module adopts 5 deformable convolution layers to perform feature alignment. In the reconstruction module, we use 16 RRDBs and set the number of increased channels to 64.

In the training process, we perform data augmentation by horizontal or vertical flipping, rotation and random cropping of the images. In each training batch, 8 LR patches are extracted as inputs. Our model is trained with the Adam optimizer [12]. The learning rate is kept at its initial value for the first 70 epochs and is then halved every 20 epochs. All experiments were conducted on two NVIDIA RTX 2080 GPUs using PyTorch 1.0 [17]. We train the network end-to-end by minimizing the L1 loss between the predicted frame and the ground-truth HR frame.
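The learning-rate schedule and the loss can be written down in a few lines. This is a sketch under assumptions: the first halving is assumed to take place at epoch 70, the initial learning rate is left as the parameter `base_lr` because its value is not given in the text above, and the function names are hypothetical.

```python
def learning_rate(epoch, base_lr, hold_epochs=70, step=20):
    """Keep base_lr for the first `hold_epochs` epochs, then halve it every
    `step` epochs (assumption: the first halving happens at `hold_epochs`)."""
    if epoch < hold_epochs:
        return base_lr
    halvings = (epoch - hold_epochs) // step + 1
    return base_lr * 0.5 ** halvings

def l1_loss(pred, target):
    """Mean absolute error between two equally sized 2D frames."""
    flat_pred = [v for row in pred for v in row]
    flat_target = [v for row in target for v in row]
    return sum(abs(p - t) for p, t in zip(flat_pred, flat_target)) / len(flat_pred)

# Relative schedule for a placeholder base_lr of 1.0.
schedule = [learning_rate(e, base_lr=1.0) for e in (0, 69, 70, 89, 90, 110)]
loss = l1_loss([[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [3.0, 8.0]])
```

If the first halving is instead meant to happen 20 epochs after the hold period ends, the `+ 1` in `halvings` would be dropped.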

Figure 7: Visual results on Vid4. Zoom in to see better visualization. Panels for the “Calendar” and “Walk” clips: (a) Bicubic, (b) DBPN, (c) RCAN, (d) VESPCN, (e) TOFlow, (f) FRVSR, (g) DUF, (h) RBPN, (i) Ours, (j) GT.

| Clip Name | Flow Magnitude | Bicubic (1 Frame) | DBPN [5] (1 Frame) | RCAN [27] (1 Frame) | VESPCN [1] (3 Frames) | TOFlow [26] (7 Frames) | FRVSR [18] (recurrent) | DUF [8] (7 Frames) | RBPN [6] (7 Frames) | DNLN (Ours) (7 Frames) |
|---|---|---|---|---|---|---|---|---|---|---|
| Calendar | 1.14 | 20.39 / 0.5720 | 22.27 / 0.7178 | 22.31 / 0.7248 | - | 22.44 / 0.7290 | - | 24.07 / 0.8123 | 23.95 / 0.8076 | 24.13 / 0.8149 |
| City | 1.63 | 25.17 / 0.6024 | 25.84 / 0.6835 | 26.07 / 0.6938 | - | 26.75 / 0.7368 | - | 28.32 / 0.8333 | 27.74 / 0.8051 | 27.97 / 0.8136 |
| Foliage | 1.48 | 23.47 / 0.5666 | 24.70 / 0.6615 | 24.69 / 0.6628 | - | 25.24 / 0.7065 | - | 26.41 / 0.7713 | 26.21 / 0.7578 | 26.30 / 0.7611 |
| Walk | 1.44 | 26.11 / 0.7977 | 28.65 / 0.8706 | 28.64 / 0.8718 | - | 29.03 / 0.8777 | - | 30.63 / 0.9144 | 30.70 / 0.9111 | 30.85 / 0.9131 |
| Average | 1.42 | 23.79 / 0.6347 | 25.37 / 0.7334 | 25.43 / 0.7383 | 25.35 / 0.7557 | 25.86 / 0.7625 | 26.69 / 0.822 | 27.36 / 0.8328 | 27.15 / 0.8204 | 27.31 / 0.8257 |

Table 1: Quantitative comparison (PSNR / SSIM) of state-of-the-art SR algorithms on Vid4. Red indicates the best and blue indicates the second best performance. In the evaluation, the first and last two frames are not included and we do not crop any border pixels except for DUF. Eight pixels near the image boundary are cropped for DUF.

| Clip Name | Flow Magnitude | Bicubic (1 Frame) | DBPN [5] (1 Frame) | RCAN [27] (1 Frame) | TOFlow [26] (7 Frames) | DUF [8] (7 Frames) | RBPN [6] (7 Frames) | DNLN (Ours) (7 Frames) |
|---|---|---|---|---|---|---|---|---|
| car05001 | 6.21 | 27.75 / 0.7825 | 29.81 / 0.8463 | 29.86 / 0.8484 | 30.10 / 0.8626 | 30.79 / 0.8707 | 31.95 / 0.9021 | 31.95 / 0.8997 |
| hdclub003001 | 0.70 | 19.42 / 0.4863 | 20.37 / 0.6041 | 20.41 / 0.6096 | 20.86 / 0.6523 | 22.05 / 0.7438 | 21.91 / 0.7257 | 22.16 / 0.7380 |
| hitachiisee5001 | 3.01 | 19.61 / 0.5938 | 23.44 / 0.8202 | 23.71 / 0.8369 | 22.88 / 0.8044 | 25.77 / 0.8929 | 26.30 / 0.9049 | 26.76 / 0.9098 |
| hk004001 | 0.49 | 28.54 / 0.8003 | 31.64 / 0.8614 | 31.68 / 0.8631 | 30.89 / 0.8654 | 32.98 / 0.8988 | 33.38 / 0.9016 | 33.55 / 0.9049 |
| HKVTG004 | 0.11 | 27.46 / 0.6831 | 28.71 / 0.7588 | 28.81 / 0.7649 | 28.49 / 0.7487 | 29.16 / 0.7860 | 29.51 / 0.7979 | 29.57 / 0.7997 |
| jvc009001 | 1.24 | 25.40 / 0.7558 | 27.97 / 0.8580 | 28.31 / 0.8717 | 27.85 / 0.8542 | 29.18 / 0.8961 | 30.06 / 0.9105 | 30.75 / 0.9213 |
| NYVTG006 | 0.10 | 28.45 / 0.8014 | 29.79 / 0.8640 | 31.01 / 0.8859 | 30.12 / 0.8603 | 32.30 / 0.9090 | 33.22 / 0.9231 | 33.52 / 0.9287 |
| PRVTG012 | 0.12 | 25.63 / 0.7136 | 26.57 / 0.7785 | 26.56 / 0.7806 | 26.62 / 0.7788 | 27.39 / 0.8166 | 27.60 / 0.8242 | 27.73 / 0.8281 |
| RMVTG011 | 0.18 | 23.96 / 0.6573 | 25.81 / 0.7489 | 26.02 / 0.7569 | 25.89 / 0.7500 | 27.56 / 0.8113 | 27.63 / 0.8170 | 27.78 / 0.8208 |
| veni3011 | 0.36 | 29.47 / 0.8979 | 33.67 / 0.9597 | 34.58 / 0.9629 | 32.85 / 0.9536 | 34.63 / 0.9677 | 36.61 / 0.9735 | 36.92 / 0.9745 |
| veni5015 | 0.36 | 27.41 / 0.8483 | 30.40 / 0.9221 | 31.04 / 0.9262 | 30.03 / 0.9118 | 31.88 / 0.9371 | 32.37 / 0.9409 | 33.28 / 0.9477 |
| Average | 1.17 | 25.73 / 0.7291 | 28.02 / 0.8202 | 28.36 / 0.8279 | 27.87 / 0.8220 | 29.43 / 0.8664 | 30.05 / 0.8747 | 30.36 / 0.8794 |

Table 2: Quantitative comparison (PSNR / SSIM) of state-of-the-art SR algorithms on SPMCS-11.

| Algorithm | Slow | Medium | Fast |
|---|---|---|---|
| Bicubic | 29.34 / 0.8330 | 31.29 / 0.8708 | 34.07 / 0.9050 |
| DBPN [5] | 32.80 / 0.9007 | 35.19 / 0.9249 | 38.25 / 0.9440 |
| RCAN [27] | 32.93 / 0.9032 | 35.35 / 0.9268 | 38.47 / 0.9456 |
| TOFlow [26] | 32.15 / 0.8900 | 35.01 / 0.9254 | 37.70 / 0.9430 |
| DUF [8] | 33.41 / 0.9110 | 36.71 / 0.9446 | 38.87 / 0.9510 |
| RBPN [6] | 34.26 / 0.9222 | 37.39 / 0.9494 | 40.16 / 0.9611 |
| DNLN (Ours) | 34.53 / 0.9253 | 37.64 / 0.9512 | 40.36 / 0.9618 |
| # of clips | 1,616 | 4,983 | 1,225 |
| Avg. Flow Mag. | 0.6 | 2.5 | 8.3 |

Table 3: Quantitative comparison (PSNR / SSIM) of state-of-the-art SR algorithms on Vimeo-90K-T.

4.2 Comparison with the State-of-the-art Methods

We compare our DNLN with several state-of-the-art SISR and VSR methods: DBPN [5], RCAN [27], VESPCN [1], TOFlow [26], FRVSR [18], DUF [8] and RBPN [6]. Note that most previous methods are trained with different datasets and we just compare with the results they provided. The SR results are evaluated with PSNR and SSIM [25] quantitatively on Y channel (i.e., luminance) of transformed YCbCr space. In the evaluation, the first and last two frames are not included and we do not crop any border pixels except DUF [8]. Eight pixels near image boundary are cropped for DUF due to its severe boundary effects.
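For readers who want to reproduce this kind of measurement, the following sketch computes PSNR on a luminance channel. It assumes 8-bit RGB input and the ITU-R BT.601 luma conversion with a 16 to 235 range, which is a common choice in SR evaluation; the text above does not state which conversion was used, so both the coefficients and the function names are assumptions.

```python
import math

def rgb_to_y(r, g, b):
    """Luma of an 8-bit RGB pixel under BT.601, with output range 16 to 235."""
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr(img_a, img_b, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2D arrays."""
    diffs = [(a - b) ** 2
             for row_a, row_b in zip(img_a, img_b)
             for a, b in zip(row_a, row_b)]
    mse = sum(diffs) / len(diffs)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

# Two tiny luminance images that differ by exactly one gray level everywhere.
reference = [[100.0, 120.0], [140.0, 160.0]]
estimate = [[101.0, 121.0], [141.0, 161.0]]
score = psnr(reference, estimate)
```

A uniform error of one gray level corresponds to a PSNR of about 48.13 dB, which gives a sense of scale for the differences of a few tenths of a dB reported in the tables.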

We evaluated our models on three datasets: Vid4 [15], SPMCS [20], and Vimeo-90K-T [26], with the average flow magnitude (pixel/frame) provided in [6]. Vid4 is a commonly used dataset which contains four video sequences: calendar, city, foliage and walk. However, Vid4 has limited inter-frame motion, and there exist artifacts in its ground-truth frames. SPMCS consists of more high-quality video clips with various motions and diverse scenes. Vimeo-90K-T is a much larger dataset. It covers a wide range of flow magnitudes between frames and is therefore well suited to judging the performance of VSR methods.

Table 1 displays numerical results on Vid4. On average, our model outperforms all other methods except DUF. DNLN is nevertheless better than DUF in PSNR on the “calendar” and “walk” clips. Qualitative results are shown in Fig. 7. In the “walk” clip, most previous methods blur the rope and clothing together, whereas DNLN clearly distinguishes these two parts and restores the pattern closest to the ground-truth frame.

Results on SPMCS are shown in Table 2. DNLN achieves the best average results compared with both SISR and VSR approaches, exceeding RBPN by 0.31 dB in PSNR. Moreover, our model exceeds the optical flow-based methods, which demonstrates the effectiveness of our flow-free alignment module. In comparison to Vid4, SPMCS has higher resolution and contains more high-frequency information, which places higher demands on the recovery ability of the algorithms. Visual comparisons are depicted in Fig. 8. Although DUF and RBPN reproduce part of the HR patterns better than the other methods, DNLN is the only approach that restores the abundant details and clean edges.

Table 3 presents the quantitative outcomes on Vimeo-90K-T. As suggested in [6], we classified the video clips into three groups (slow, medium and fast) according to the motion velocity. As the motion velocity increases, the large movement of objects between frames tends to blur them. It can be seen that the PSNR and SSIM of all methods increase with the motion velocity, which suggests that video frames with larger motion amplitude contain more useful temporal information. Our DNLN achieves the best performance in all three categories, surpassing RBPN by 0.27 dB, 0.24 dB and 0.19 dB in PSNR, respectively. Since the flow magnitude of the fast group of Vimeo-90K-T is higher than that of Vid4 and SPMCS, the content varies greatly between video frames, and the result indicates that DNLN can effectively exploit information across multiple frames. More qualitative evaluations are shown in Fig. 9. We mark the positions that display obvious distinctions among the methods. For the railing texture in the third row, only our method restores the correct and clear pattern, while the others show varying degrees of blurring. Even where other SR frames have equally sharp edges, our DNLN is more accurate. For example, for the letters “P” and “W” in the fifth row, the results restored by RBPN and DNLN are equally clear, but RBPN mistakenly connects the two letters.

Figure 8: Visual results on SPMCS. Zoom in to see better visualization. Panels: (a) TOFlow, (b) DUF, (c) RBPN, (d) Ours, (e) GT.

Figure 9: Visual results on Vimeo-90K-T. Zoom in to see better visualization. Panels: (a) Bicubic, (b) TOFlow, (c) DUF, (d) RBPN, (e) Ours, (f) GT.

Table 4: Ablation study of the proposed network on Vimeo-90K-T. PSNR / SSIM of the four evaluated configurations: 36.86 / 0.9420, 37.38 / 0.9474, 37.28 / 0.9464, 37.42 / 0.9475.

| Model | PSNR / SSIM |
|---|---|
| w/o deform | 36.86 / 0.9420 |
| 1dconv | 37.14 / 0.9449 |
| 2dconv | 37.23 / 0.9458 |
| 3dconv | 37.37 / 0.9471 |
| 4dconv | 37.40 / 0.9473 |
| 5dconv | 37.42 / 0.9475 |
| 5dconv, w/o HFFB | 37.13 / 0.9449 |

Table 5: Ablation study on alignment module.

Figure 10: Qualitative results of ablation on alignment module. Panels: (a) 1dconv, (b) 2dconv, (c) 3dconv, (d) 4dconv, (e) 5dconv, (f) w/o HFFB, (g) GT.

Figure 11: Comparison of super-resolution results with a different number of input frames on Vimeo-90K-T.

4.3 Ablation Study

To further investigate the proposed method, we conducted ablation experiments by removing individual components of our network. The results are shown in Table 4. First, we remove the alignment module, so that the shallow frame features are fed directly into the following network without warping. The PSNR on Vimeo-90K-T is relatively low, which indicates that an alignment operation is crucial for utilizing the inter-frame information. Second, we remove the non-local attention guidance module, and the performance decreases. Third, we replace the RRDBs with a simple stack of common residual blocks, which also harms the performance. These quantitative results demonstrate the effectiveness and benefits of the three proposed modules.

From the ablation experiments above, we find that the network performance is significantly affected by the alignment preprocessing. We therefore further validated the impact of deformable convolutions on the reconstruction capability. As shown in Table 5, with only one deformable convolution the PSNR improves by 0.28 dB over the model without deformable convolution (37.14 dB vs. 36.86 dB). This demonstrates the importance of alignment operations for making efficient use of the neighboring frames. As the number of deformable convolutions increases, the network performance improves further. The visual comparisons are shown in Fig. 10. From left to right, the network alleviates the blurring artifacts of the office building to some degree and recovers more details. In addition, when we replace the HFFB used in generating the deformable sampling parameters with a convolution layer, the performance of the network decreases by roughly 0.29 dB. This indicates that by enlarging the receptive field, the deformable convolution can deal more effectively with complex and large motions.
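The step-by-step gains discussed above can be recomputed directly from the PSNR column of Table 5. The short script below only repeats that arithmetic on the rounded table entries; the variable names are illustrative.

```python
# PSNR values (dB) copied from Table 5.
table5_psnr = {
    "w/o deform": 36.86,
    "1dconv": 37.14,
    "2dconv": 37.23,
    "3dconv": 37.37,
    "4dconv": 37.40,
    "5dconv": 37.42,
    "5dconv, w/o HFFB": 37.13,
}

stack = ["w/o deform", "1dconv", "2dconv", "3dconv", "4dconv", "5dconv"]

# Gain contributed by each additional deformable convolution layer.
gains = {after: round(table5_psnr[after] - table5_psnr[before], 2)
         for before, after in zip(stack, stack[1:])}

# Loss when the HFFB is replaced by a plain convolution layer.
hffb_drop = round(table5_psnr["5dconv"] - table5_psnr["5dconv, w/o HFFB"], 2)
total_gain = round(table5_psnr["5dconv"] - table5_psnr["w/o deform"], 2)
```

The first deformable convolution accounts for half of the total 0.56 dB gain, and the last two layers add only 0.03 dB and 0.02 dB, so the returns diminish as the cascade grows.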

In order to study the influence of inter-frame information on the recovery results, we trained our network with different numbers of input frames. From Fig. 11, we can observe a substantial improvement in DNLN when switching from 3 frames to 5 frames, and the performance of DNLN/5 is even better than that of RBPN, which uses 7 frames. When further switching to 7 frames, we still get a better result, but the improvement becomes minor.

5 Conclusion

In this paper, we propose a novel deformable non-local network (DNLN), a non-flow-based method for effective and efficient video super-resolution. To deal with complicated and large motions, we introduce deformable convolution with HFFB in our alignment module, which aligns the frames well at the feature level. In addition, we adopt a non-local attention guidance module to further extract complementary features between frames. By making full use of the temporal information, we finally restore a high-quality video frame through a reconstruction module. Extensive experiments demonstrate that DNLN achieves state-of-the-art performance on several benchmark datasets.


  • [1] Jose Caballero, Christian Ledig, Andrew Aitken, Alejandro Acosta, Johannes Totz, Zehan Wang, and Wenzhe Shi. Real-time video super-resolution with spatio-temporal networks and motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4778–4787, 2017.
  • [2] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 764–773, 2017.
  • [3] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In European conference on computer vision, pages 184–199. Springer, 2014.
  • [4] Chao Dong, Chen Change Loy, and Xiaoou Tang. Accelerating the super-resolution convolutional neural network. In European conference on computer vision, pages 391–407. Springer, 2016.
  • [5] Muhammad Haris, Gregory Shakhnarovich, and Norimichi Ukita. Deep back-projection networks for super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1664–1673, 2018.
  • [6] Muhammad Haris, Gregory Shakhnarovich, and Norimichi Ukita. Recurrent back-projection network for video super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3897–3906, 2019.
  • [7] Zheng Hui, Jie Li, Xinbo Gao, and Xiumei Wang. Progressive perception-oriented network for single image super-resolution. arXiv preprint arXiv:1907.10399, 2019.
  • [8] Younghyun Jo, Seoung Wug Oh, Jaeyeon Kang, and Seon Joo Kim. Deep video super-resolution network using dynamic upsampling filters without explicit motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3224–3232, 2018.
  • [9] Armin Kappeler, Seunghwan Yoo, Qiqin Dai, and Aggelos K Katsaggelos. Video super-resolution with convolutional neural networks. IEEE Transactions on Computational Imaging, 2(2):109–122, 2016.
  • [10] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1646–1654, 2016.
  • [11] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1637–1645, 2016.
  • [12] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [13] Renjie Liao, Xin Tao, Ruiyu Li, Ziyang Ma, and Jiaya Jia. Video super-resolution via deep draft-ensemble learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 531–539, 2015.
  • [14] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 136–144, 2017.
  • [15] Ce Liu and Deqing Sun. On bayesian adaptive video super resolution. IEEE transactions on pattern analysis and machine intelligence, 36(2):346–360, 2013.
  • [16] Ding Liu, Zhaowen Wang, Yuchen Fan, Xianming Liu, Zhangyang Wang, Shiyu Chang, and Thomas Huang. Robust video super-resolution with learned temporal dynamics. In Proceedings of the IEEE International Conference on Computer Vision, pages 2507–2515, 2017.
  • [17] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
  • [18] Mehdi SM Sajjadi, Raviteja Vemulapalli, and Matthew Brown. Frame-recurrent video super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6626–6634, 2018.
  • [19] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1874–1883, 2016.
  • [20] Xin Tao, Hongyun Gao, Renjie Liao, Jue Wang, and Jiaya Jia. Detail-revealing deep video super-resolution. In Proceedings of the IEEE International Conference on Computer Vision, pages 4472–4480, 2017.
  • [21] Yapeng Tian, Yulun Zhang, Yun Fu, and Chenliang Xu. Tdan: Temporally deformable alignment network for video super-resolution. arXiv preprint arXiv:1812.02898, 2018.
  • [22] Radu Timofte, Eirikur Agustsson, Luc Van Gool, Ming-Hsuan Yang, and Lei Zhang. Ntire 2017 challenge on single image super-resolution: Methods and results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 114–125, 2017.
  • [23] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
  • [24] Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Yu Qiao, and Chen Change Loy. Esrgan: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 0–0, 2018.
  • [25] Zhou Wang, Alan C Bovik, Hamid R Sheikh, Eero P Simoncelli, et al. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  • [26] Tianfan Xue, Baian Chen, Jiajun Wu, Donglai Wei, and William T Freeman. Video enhancement with task-oriented flow. International Journal of Computer Vision, pages 1–20, 2017.
  • [27] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 286–301, 2018.
  • [28] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2472–2481, 2018.
  • [29] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9308–9316, 2019.