DRFN: Deep Recurrent Fusion Network for Single-Image Super-Resolution with Large Factors

08/23/2019 ∙ by Xin Yang, et al. ∙ Dalian University of Technology

Recently, single-image super-resolution has made great progress owing to the development of deep convolutional neural networks (CNNs). The vast majority of CNN-based models use a pre-defined upsampling operator, such as bicubic interpolation, to upscale input low-resolution images to the desired size and learn non-linear mapping between the interpolated image and ground truth high-resolution (HR) image. However, interpolation processing can lead to visual artifacts as details are over-smoothed, particularly when the super-resolution factor is high. In this paper, we propose a Deep Recurrent Fusion Network (DRFN), which utilizes transposed convolution instead of bicubic interpolation for upsampling and integrates different-level features extracted from recurrent residual blocks to reconstruct the final HR images. We adopt a deep recurrence learning strategy and thus have a larger receptive field, which is conducive to reconstructing an image more accurately. Furthermore, we show that the multi-level fusion structure is suitable for dealing with image super-resolution problems. Extensive benchmark evaluations demonstrate that the proposed DRFN performs better than most current deep learning methods in terms of accuracy and visual effects, especially for large-scale images, while using fewer parameters.




I Introduction

Single-image super-resolution (SISR) refers to the transformation of an image from low-resolution (LR) to high-resolution (HR). SISR is a long-standing problem in computer graphics and vision. Higher-resolution images often provide more desired information and can be applied in many domains, such as security and surveillance imaging, medical imaging, satellite imaging, and other fields. Therefore, it is necessary to explore the reconstruction performance of image super-resolution with larger upscaling factors.

Various algorithms have been introduced to solve the super-resolution (SR) problem, beginning with initial work by Freeman et al. [14]. Currently, deep-learning-based methods, especially convolutional neural networks (CNNs), are widely used to handle image SR owing to the powerful learning ability of CNNs. Super-Resolution Convolutional Neural Network (SRCNN) [8] pioneered the use of three-layer CNNs to learn the mapping relationship between an interpolated image and HR image and significantly outperformed traditional non-deep learning methods. After that, Kumar et al. [22] tapped into the ability of polynomial neural networks to hierarchically learn refinements of a function that maps LR to HR patches. Shi et al. [33] developed a contextualized multitask learning framework to address the SR problem. Kim et al. proposed two neural network structures with 20-layer convolutions, termed VDSR [20] and DRCN [21] respectively, and achieved state-of-the-art performance. Lim et al. built a wide-network EDSR [25] using residual blocks. To generate photo-realistic natural images, Ledig et al. [24] presented a generative adversarial network for SR. Lai et al. [23] proposed a deep convolutional network within a Laplacian pyramid framework, which progressively predicts high-frequency residuals in a coarse-to-fine manner.

Fig. 1: Visual comparisons of 4× super-resolution on a challenging image from Urban100 [17]. Results of other methods have serious artifacts and are highly blurred. The proposed method suppresses artifacts effectively and generates clear texture details.
Fig. 2: DRFN architecture. Orange arrows indicate transposed convolutions; blue arrows indicate convolutional layers; and the green plus sign indicates the feature map concatenation operation. Black numbers indicate the number of feature maps; ×2 indicates a twofold enlargement of the image size.

However, as the upscaling factor becomes higher, these methods exhibit strong visual artifacts (Figure 1) caused by their network design philosophy. Current approaches possess three inherent limitations. First, most existing methods [8][20][21] apply interpolation strategies such as bicubic interpolation to first process the input image to the desired size and then use CNNs to extract features and learn LR/HR mapping relations. This pre-processing step often results in visible reconstruction artifacts. Second, several methods extract raw features directly from input LR images and replace the pre-defined upsampling operator with transposed convolution [9] or sub-pixel convolution [32]. These methods, however, use relatively small networks and cannot learn complicated mappings well due to limited network capacity. Moreover, these approaches reconstruct HR images in one upsampling step at the end of the network, which makes training more difficult for large scaling factors (e.g., 8×). Third, in the reconstruction stage, many algorithms have only one reconstruction level and cannot fully leverage the underlying information, including original and complementary information, among different recovery stages. Additionally, images reconstructed by a single-level structure lack many realistic texture details.

To address the above limitations, in this paper, we propose a deep recurrent fusion framework (DRFN) for the large-factor SR problem. As illustrated in Figure 2, we jointly extract and upsample raw features from an input LR image by putting the transposed convolution in front of the network. This design does not require a pre-defined upsampling operator (e.g., bicubic interpolation) as the pre-processing step and allows the following convolutional layers to focus on mapping in the HR feature space. After that, we use recurrent residual blocks to gradually recover high-frequency information of the HR image using fewer parameters. Then, three convolutional layers are used to extract features with different receptive field sizes at each recovery stage. In doing so, we can make full use of complementary information among three different level features. Finally, we use a convolutional layer to fuse feature maps and reconstruct HR images.

In summary, we propose a novel end-to-end DRFN framework for single-image super-resolution with high upscaling factors (4× and 8×). Without extraneous steps, DRFN trained from scratch can produce HR images with more texture details and better visual performance. As demonstrated through extensive experiments, the proposed DRFN significantly outperforms existing deep learning methods in terms of accuracy and visual effects, especially when dealing with large scaling factors.

II Related Work

Extensive research has investigated the SR problem. In this section, we summarize the main related works with respect to conventional methods, learning-based methods, and deep-learning-based methods.

Conventional Methods. Early methods were mainly based on image interpolation, namely linear, bicubic, or Lanczos [10]. Later, prior information was introduced to improve results, such as edge prior [5] and edge statistics [12]. Michaeli et al. [28] utilized the recurrent property of image patches to recover an SR blur kernel. Efrat et al. [11] combined accurate reconstruction constraints and gradient regularization to improve reconstruction results. Although most conventional approaches are fast and generate smooth HR images, they tend toward overly smooth solutions, so high-frequency information is difficult to recover.

Learning-based Methods. More approaches focus on recovering complex mapping relations between LR and HR images. These mapping relations can be established by external or internal databases.

Several methods can learn LR/HR mapping relations from external databases using different models and strategies. Yang et al. [42] introduced sparse representations of LR/HR patch pairs. Freeman et al. [14] presented dictionaries of LR/HR patch pairs and reconstructed HR patches with the corresponding nearest neighbors from the LR space. Timofte et al. [37] assumed all LR/HR patches lie on the manifold in the LR/HR space, so outputs were reconstructed by the retrieved patches. Additionally, K-means and random forest [30] algorithms were proposed to seek mapping by partitioning the image database. Methods based on external databases can obtain a mass of different prior knowledge to achieve good performance. Nevertheless, the efficiency of these approaches is rather poor given the cost of matching HR patches.

Methods based on internal databases create LR/HR patch pairs and utilize the self-similarity property of input images. Freedman et al. [13] used image pyramids to seek the local self-similarity property. Singh et al. [34] used directional frequency sub-bands to compose patches. Cui et al. [3] conducted image SR layer by layer. In each layer, they elaborately integrated the non-local self-similarity search and collaborative local auto-encoder. Huang et al. [17] warped the LR patch to find matching patches in the LR image and unwarped the matching patch as the HR patch. Methods based on internal databases have high computational costs to search patches, resulting in slow speed.

Liu et al. [26] proposed a group-structured sparse representation approach to make full use of internal and external dependencies to facilitate image SR. Xu et al. [39] proposed an integration model based on Gaussian conditional random fields, which learns the probabilistic distribution of the interaction between patch-based and deep-learning-based SR methods.

Deep-learning-based Methods. Deep learning methods have achieved great success with SR. Dong et al. [8] pioneered the use of a CNN to solve the SR problem. Shi et al. [32] presented an efficient sub-pixel convolution (ESPCN) layer to upscale LR feature maps into HR output. By doing so, ESPCN achieves a very high average speed. Dong et al. [9] presented an hourglass-shaped CNN to accelerate SRCNN. Motivated by SRCNN, Kim et al. [20] presented a very deep convolutional network (VDSR) to obtain a larger receptive field. VDSR achieves fast convergence via proposed residual learning and gradient clipping. Moreover, VDSR can handle multi-scale SR using a single network. Because deeper networks often introduce more parameters, recurrent learning strategies were applied in this study to reduce the number of parameters along with skip connections to accelerate convergence.

Kim et al. [21] presented an approach that used more layers to increase the receptive field of the network and proposed a very deep recursive layer to avoid excessive parameters. Zhang et al. [45] proposed an effective and fast SISR algorithm by combining clustering and collaborative representation. Tai et al. introduced recursive blocks in DRRN [36] and memory blocks in Memnet [35] for deeper networks, but each method must interpolate the original LR image to the desired size. Yang et al. [43] utilized the LR image and its edge map to infer sharp edge details of an HR image during the recurrent recovery process. Lai et al. [23] presented a Laplacian pyramid network for SR. The proposed model can predict high-frequency information with coarse feature maps. LapSRN is accurate and fast for SISR. Unlike most deep-learning methods, we adopted transposed convolution to replace bicubic interpolation to extract raw features and promote reconstruction performance. Furthermore, a multi-level structure was designed to obtain better reconstruction performance, including visual effects.

III Methodology

In this section, we describe the design methodology of our proposed DRFN. Figure 2 presents the DRFN for image reconstruction, which consists of three main parts: joint feature extraction and upsampling, recurrent mapping in the HR feature space, and multi-level fusion reconstruction. First, the proposed method uses multiple transposed convolution operations to jointly extract and upsample raw features from the input image. Second, two recurrent residual blocks are used for mapping in the HR feature space. Finally, DRFN uses three convolutional layers to extract features with different receptive field sizes at each recovery stage and uses one convolutional layer for multi-level fusion reconstruction. Now, we will present additional technical details of each part of our model.

Fig. 3: Structure of recurrent residual block. The red line indicates a skip connection, and the orange line indicates a recurrent connection. 3×3×64 indicates that the size of the convolution kernel is 3×3, and the number of output channels is 64.

III-A Joint Feature Extraction and Upsampling

In this subsection, we show how to jointly extract and upsample raw features. The transposed convolutional layer consists of diverse, automatically learned upsampling kernels, from which raw features can be extracted simultaneously from the input image to achieve upsampling. For higher SR factors, compared with bicubic interpolation, transposed convolution can alleviate training difficulties while effectively suppressing artifacts. Therefore, we use transposed convolutions to amplify the original LR input image to the desired size.

FSRCNN [9] extracts feature maps in the LR space and replaces the bicubic upsampling operation with one transposed convolution layer at the end of the network, and LapSRN progressively reconstructs an HR image throughout the structure. By contrast, the proposed method first puts the transposed convolution at the forefront of the network for joint feature extraction and upsampling. This setting is conducive to allowing the rest of the network to extract features in the HR feature space to further improve performance. Moreover, extracting raw features from the original image enhances the reconstruction details, which is beneficial in generating large-scale and visually satisfying images. The upsampling process can be formulated as


$$F_s = \sigma\big(f_{\mathrm{up}}(F_{s-1})\big), \qquad (1)$$

where $f_{\mathrm{up}}$ represents the transposed convolution operation that doubles the size of the image, and $\sigma$ is a non-linear operation achieved using a parametric rectified linear unit (PReLU). $F_s$ denotes the feature maps extracted from the transposed convolution of step $s$. When $s = 0$, $F_0$ is the input LR image $x$. Then, the size of the input image is magnified by iterating Eq. 1. The number of iterations can be adjusted to determine the SR scale factor; for example, the scale factor is 8× when $s = 3$.
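As a rough illustration of how iterating a stride-2 transposed convolution sets the scale factor, the sketch below applies the standard output-size formula for transposed convolutions. The kernel size of 4 and padding of 1 are illustrative assumptions (the text only fixes the doubling behaviour), and the function names are hypothetical.

```python
def transposed_conv_output_size(size, kernel=4, stride=2, padding=1, output_padding=0):
    """Spatial size after one transposed convolution (dilation 1).

    kernel=4 and padding=1 are assumed values chosen so that a
    stride-2 layer exactly doubles the input size.
    """
    return (size - 1) * stride - 2 * padding + kernel + output_padding


def upsampled_size(lr_size, steps):
    """Apply the doubling transposed convolution `steps` times."""
    size = lr_size
    for _ in range(steps):
        size = transposed_conv_output_size(size)
    return size


if __name__ == "__main__":
    for steps in (1, 2, 3):
        # The scale factor is the ratio of output size to input size.
        print(steps, upsampled_size(32, steps) // 32)
```

With these assumed settings, each step doubles the spatial size, so two steps give a 4× network and three steps give an 8× network.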

III-B Recurrent Mapping in HR Feature Space

In this subsection, we show the structure of our recurrent residual block and how recurrent blocks gradually recover high-frequency information in HR images. In ResNet [16], the basic residual unit uses batch normalization (BN) and ReLU as the activation function after the weight layers. BN layers normalize features to limit networks' range of flexibility and occupy considerable GPU memory; therefore, we removed the batch normalization layers from the proposed network as Nah et al. [29] did in their image-deblurring work. In addition, we replaced ReLU with parametric ReLU to avoid "dead features" caused by zero gradients in ReLU. Our basic unit for mapping in an HR feature space is illustrated in Figure 3. Deep models are prone to overfitting and become disk hungry, hence our adoption of recurrence learning strategies to reduce the number of parameters. Skip connections were used in recurrent blocks to provide fast and improved convergence.

The size of the convolution kernel used for feature extraction was 3×3, and padding was set to 1 to prevent the size of the feature maps from changing. We used two recurrent residual blocks, each of which looped 10 times. Therefore, our network had a much larger receptive field, which benefits large factors. Unlike other methods that map from an LR feature space to an HR feature space, the proposed DRFN used recurrent blocks to map within the HR feature space. Output feature maps of the recurrent residual block are progressively updated as follows:


where represents the -th cyclically generated feature maps; and and denote convolution and PReLU operation, respectively.
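To give a sense of why looping the recurrent blocks enlarges the receptive field, here is a back-of-the-envelope sketch. It counts only 3×3, stride-1 convolutions inside the two recurrent blocks and assumes three convolution layers per block; layers outside the blocks are ignored, so the figures are indicative rather than the network's exact receptive field.

```python
def receptive_field(num_convs, kernel=3):
    """Receptive field (pixels per side) of stacked stride-1 convolutions.

    Each k x k stride-1 convolution widens the field by (k - 1) pixels.
    """
    return 1 + num_convs * (kernel - 1)


def recurrent_receptive_field(blocks=2, loops=10, convs_per_block=3):
    """Receptive field contributed by the recurrent blocks alone.

    convs_per_block=3 is an assumption for illustration.
    """
    return receptive_field(blocks * loops * convs_per_block)


if __name__ == "__main__":
    print(recurrent_receptive_field(loops=1))   # blocks applied once
    print(recurrent_receptive_field(loops=10))  # blocks looped 10 times
```

Under these assumptions, looping each block 10 times widens the receptive field from 13 to 121 pixels per side without adding any weights.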

III-C Multi-level Fusion Reconstruction

In this subsection, we show how to fuse different-level features and perform HR image reconstruction. A larger SR factor requires more diverse feature information; to meet this need, we propose fusing different-level features to recover the final HR image. As shown in Figure 2, three convolutional layers were used to automatically extract features at different levels. Then, we concatenated these features and applied one convolutional layer to integrate features with different receptive field sizes. Each recurrent residual block gradually refined the rough image information from the preceding block but may lose original feature information needed for reconstruction in this process. Therefore, different-level feature information must be integrated, including refined information and easy-to-lose information (i.e., original feature information), to make full use of complementary information among the three feature levels. Corresponding experiments demonstrated that fusion networks with three levels improve reconstruction accuracy and visual performance compared with single and double levels, especially for larger scale factors.

We chose mean square error (MSE) as the loss function. Let $x$ denote the input LR image patch and $y$ indicate the corresponding HR image patch. Given a training dataset containing $N$ patches, the goal is to minimize the following formula:

$$L(\Theta) = \frac{1}{N}\sum_{i=1}^{N} \left\| F_{\Theta}(x_i) - y_i \right\|^2,$$

where $F_{\Theta}$ represents the feed-forward neural network parameterized by $\Theta$. We used mini-batch stochastic gradient descent (SGD) with backpropagation to optimize the loss function, and DRFN was implemented in Caffe.
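The MSE objective described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the training code: patches are represented as flat lists of pixel values, the function name is hypothetical, and averaging over the number of patches only (not over pixels) is an assumption about the normalization.

```python
def mse_loss(predictions, targets):
    """Squared error summed over pixels, averaged over the N patches.

    `predictions` and `targets` are lists of equally sized flat patches.
    Normalizing by N only is an assumption made for this sketch.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    total = 0.0
    for pred, target in zip(predictions, targets):
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / len(predictions)


if __name__ == "__main__":
    network_outputs = [[1.0, 2.0], [3.0, 4.0]]  # stand-in for network outputs
    hr_patches = [[1.0, 0.0], [0.0, 4.0]]       # stand-in for HR patches
    print(mse_loss(network_outputs, hr_patches))
```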


IV Experiments

In this section, we first describe the training and test datasets of our method. We then introduce the implementation details of the algorithm. Next, we compare our proposed method with several state-of-the-art SISR methods and demonstrate the superiority of DRFN. Finally, the contributions of different components are analyzed.

IV-A Datasets

Training dataset: RFL [30] and VDSR [20] use a training dataset of 291 images, containing 91 images from Yang et al. [42] and 200 images from the Berkeley Segmentation Dataset [27]. We also used these 291 images to ensure fair comparison with other methods. In addition, we rotated the original images by 90°, 180°, and 270° and flipped them horizontally. After this process, each image had eight versions, for a total training set of 2,328 images.
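The eight versions per image come from combining four rotations (0°, 90°, 180°, 270°) with an optional horizontal flip. The following sketch reproduces that count on a tiny list-of-rows image; the helper names are hypothetical and the rotation direction is an arbitrary choice.

```python
def rotate90(image):
    """Rotate a list-of-rows image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def augment(image):
    """Return the 8 versions: 4 rotations, each with and without a flip."""
    versions = []
    current = [list(row) for row in image]
    for _ in range(4):  # 0, 90, 180, and 270 degrees
        versions.append(current)
        versions.append(flip_horizontal(current))
        current = rotate90(current)
    return versions


if __name__ == "__main__":
    versions = augment([[1, 2], [3, 4]])
    print(len(versions))        # versions per image
    print(291 * len(versions))  # size of the augmented training set
```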

Each cell lists PSNR / SSIM / IFC.

| Method | Scale | Set5 | Set14 | BSDS100 | Urban100 | ImageNet400 |
|---|---|---|---|---|---|---|
| Bicubic | 2× | 33.64 / 0.9293 / 5.714 | 30.31 / 0.8693 / 5.699 | 29.55 / 0.8432 / 5.256 | 26.88 / 0.8409 / 6.191 | 30.03 / 0.8667 / 5.970 |
| A+ [37] | 2× | 36.55 / 0.9545 / 8.465 | 32.40 / 0.9064 / 8.001 | 31.23 / 0.8868 / 7.282 | 29.24 / 0.8944 / 8.246 | 32.05 / 0.8998 / 6.913 |
| JOR [4] | 2× | 36.58 / 0.9543 / 8.511 | 32.38 / 0.9063 / 8.052 | 31.22 / 0.8867 / 7.321 | 29.25 / 0.8951 / 8.301 | 32.05 / 0.8998 / 6.969 |
| SRCNN [8] | 2× | 36.35 / 0.9521 / 7.522 | 32.29 / 0.9046 / 7.227 | 31.15 / 0.8851 / 6.653 | 29.10 / 0.8900 / 7.446 | 31.98 / 0.8970 / 6.374 |
| FSRCNN [9] | 2× | 37.00 / 0.9557 / 8.047 | 32.75 / 0.9095 / 7.727 | 31.51 / 0.8910 / 7.068 | 29.88 / 0.9015 / 8.005 | 32.52 / 0.9031 / 6.712 |
| VDSR [20] | 2× | 37.53 / 0.9587 / 8.580 | 33.15 / 0.9132 / 8.159 | 31.90 / 0.8960 / 7.494 | 30.77 / 0.9143 / 8.605 | 33.22 / 0.9106 / 7.096 |
| LapSRN_x2 [23] | 2× | 37.44 / 0.9581 / 8.400 | 33.06 / 0.9122 / 8.011 | 31.78 / 0.8944 / 7.293 | 30.39 / 0.9096 / 8.430 | 32.98 / 0.9082 / 6.912 |
| DRFN_x2 | 2× | 37.71 / 0.9595 / 8.927 | 33.29 / 0.9142 / 8.492 | 32.02 / 0.8979 / 7.721 | 31.08 / 0.9179 / 9.076 | 33.42 / 0.9123 / 8.002 |
| Bicubic | 3× | 30.39 / 0.8673 / 3.453 | 27.62 / 0.7756 / 3.327 | 27.20 / 0.7394 / 3.003 | 24.46 / 0.7359 / 3.604 | 27.91 / 0.7995 / 3.363 |
| A+ [37] | 3× | 32.59 / 0.9077 / 4.922 | 29.24 / 0.8208 / 4.491 | 28.30 / 0.7844 / 3.971 | 26.05 / 0.7984 / 4.812 | 29.42 / 0.8351 / 4.034 |
| JOR [4] | 3× | 32.55 / 0.9067 / 4.892 | 29.19 / 0.8204 / 4.485 | 28.27 / 0.7837 / 3.966 | 25.97 / 0.7972 / 4.766 | 29.34 / 0.8343 / 4.028 |
| SRCNN [8] | 3× | 32.39 / 0.9026 / 4.315 | 29.11 / 0.8167 / 4.027 | 28.22 / 0.7809 / 3.608 | 25.87 / 0.7889 / 4.240 | 29.27 / 0.8294 / 3.617 |
| FSRCNN [9] | 3× | 33.16 / 0.9132 / 4.963 | 29.55 / 0.8263 / 4.551 | 28.52 / 0.7901 / 4.025 | 26.43 / 0.8076 / 4.841 | 29.78 / 0.8376 / 4.078 |
| VDSR [20] | 3× | 33.66 / 0.9213 / 5.203 | 29.88 / 0.8330 / 4.692 | 28.83 / 0.7976 / 4.151 | 27.14 / 0.8284 / 5.163 | 30.37 / 0.8509 / 4.251 |
| LapSRN_x4 [23] (see note) | 3× | 33.78 / 0.9209 / 5.079 | 29.87 / 0.8328 / 4.552 | 28.81 / 0.7972 / 3.946 | 27.06 / 0.8269 / 5.019 | 30.32 / 0.8497 / 4.085 |
| DRFN_x3 | 3× | 34.01 / 0.9234 / 5.421 | 30.06 / 0.8366 / 4.897 | 28.93 / 0.8010 / 4.281 | 27.43 / 0.8359 / 5.481 | 30.59 / 0.8539 / 4.582 |
| Bicubic | 4× | 28.42 / 0.8099 / 2.342 | 26.00 / 0.7025 / 2.259 | 25.96 / 0.6692 / 2.021 | 23.15 / 0.6592 / 2.355 | 26.70 / 0.7530 / 2.137 |
| A+ [37] | 4× | 30.28 / 0.8587 / 3.248 | 27.32 / 0.7497 / 2.962 | 26.82 / 0.7100 / 2.551 | 24.34 / 0.7201 / 3.180 | 27.92 / 0.7877 / 2.660 |
| JOR [4] | 4× | 30.19 / 0.8563 / 3.190 | 27.27 / 0.7479 / 2.923 | 26.79 / 0.7083 / 2.534 | 24.29 / 0.7181 / 3.113 | 27.87 / 0.7865 / 2.630 |
| SRCNN [8] | 4× | 30.48 / 0.8618 / 2.991 | 27.50 / 0.7517 / 2.751 | 26.90 / 0.7115 / 2.396 | 24.16 / 0.7066 / 2.769 | 28.18 / 0.7903 / 2.492 |
| FSRCNN [9] | 4× | 30.70 / 0.8646 / 2.986 | 27.59 / 0.7539 / 2.707 | 26.96 / 0.7174 / 2.359 | 24.62 / 0.7281 / 2.907 | 28.16 / 0.7895 / 2.412 |
| VDSR [20] | 4× | 31.35 / 0.8838 / 3.542 | 28.02 / 0.7678 / 3.106 | 27.29 / 0.7252 / 2.679 | 25.18 / 0.7534 / 3.462 | 28.77 / 0.8056 / 2.820 |
| LapSRN_x4 [23] | 4× | 31.54 / 0.8852 / 3.515 | 28.19 / 0.7716 / 3.089 | 27.32 / 0.7275 / 2.618 | 25.21 / 0.7554 / 3.448 | 28.82 / 0.8082 / 2.785 |
| DRFN_x4 | 4× | 31.55 / 0.8861 / 3.693 | 28.30 / 0.7737 / 3.250 | 27.39 / 0.7293 / 2.766 | 25.45 / 0.7629 / 3.693 | 28.99 / 0.8106 / 2.954 |

Note: Due to the network design of LapSRN, scale factors for training are limited to powers of 2 (e.g., 2×, 4×, or 8×). LapSRN performs SR to other scales by first upsampling input images to a larger scale and then downsampling the output to the desired resolution. As mentioned in their paper, we tested the results for 3× SR by using their 4× model.

TABLE I: Performance comparison of the proposed method with seven SR algorithms on five benchmarks with scale factors of 2×, 3×, and 4×.

Upscaling factor 8×. Each cell lists PSNR / SSIM / IFC.

| Method | Set5 | Set14 | BSDS100 |
|---|---|---|---|
| Bicubic | 24.39 / 0.657 / 0.836 | 23.19 / 0.568 / 0.784 | 23.67 / 0.547 / 0.646 |
| A+ [37] | 25.52 / 0.692 / 1.007 | 23.98 / 0.597 / 0.983 | 24.20 / 0.568 / 0.797 |
| SRCNN [8] | 25.33 / 0.689 / 0.938 | 23.85 / 0.593 / 0.865 | 24.13 / 0.565 / 0.705 |
| FSRCNN [9] | 25.41 / 0.682 / 0.989 | 23.93 / 0.592 / 0.928 | 24.21 / 0.567 / 0.772 |
| VDSR [20] | 25.72 / 0.711 / 1.123 | 24.21 / 0.609 / 1.016 | 24.37 / 0.576 / 0.816 |
| LapSRN_x8 [23] | 26.14 / 0.738 / 1.295 | 24.44 / 0.623 / 1.123 | 24.54 / 0.586 / 0.880 |
| DRFN_x8 | 26.22 / 0.740 / 1.331 | 24.57 / 0.625 / 1.138 | 24.60 / 0.587 / 0.893 |

TABLE II: Performance comparison of the proposed method with six SR algorithms on three benchmarks with a scale factor of 8×.

Test dataset: We evaluated the results on five test sets (Set5 [2], Set14 [44], BSDS100 [1], Urban100 [17], and ImageNet400), which contained 5, 14, 100, 100, and 400 images, respectively. Among these datasets, Set5, Set14, and BSDS100 are composed of natural scenes and have been used often in other studies. Urban100 was created by Huang et al. [17] and includes 100 images of various real-world structures, which presents challenges for many methods. Images in ImageNet400 are randomly selected from ImageNet.


IV-B Implementation Details

We converted the original RGB color images into grayscale images and performed training and testing on the luminance channel. We generated LR training images using bicubic downsampling and cut them into patches with a stride of four. We trained with SGD with momentum, using a mini-batch size of 32, a momentum parameter of 0.9, and weight decay.

All PReLUs were initially set to 0.33. The stride of each transposed convolution was 2 to ensure that each transposed convolution would double the image size. The kernel size of each convolution was 3×3. We used the same strategy as He et al. [15] for convolution weight initialization. The initial learning rate was set to 0.1 and then reduced by a factor of 10 every 10 epochs. We also adopted adjustable gradient clipping [20] to ease the difficulty of training the network. The gradient of each iteration update was limited to $[-\theta/\gamma, \theta/\gamma]$, where $\theta$ is the maximum value of each update step size and $\gamma$ is the current learning rate. We stopped training when the loss ceased to fall. It took approximately 3 days to train the network using an NVIDIA GTX 1080Ti graphics card.
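The step learning-rate schedule and the adjustable gradient clipping of VDSR [20] can be sketched as follows. In VDSR, gradients are clipped to the interval [-θ/γ, θ/γ], where γ is the current learning rate, so the permitted gradient range widens as the learning rate decays. The value of θ used below is a hypothetical placeholder, since the text does not state it, and the function names are illustrative.

```python
def learning_rate(epoch, base_lr=0.1, drop_every=10, factor=0.1):
    """Step schedule: start at base_lr, divide by 10 every 10 epochs."""
    return base_lr * factor ** (epoch // drop_every)


def clip_gradient(grad, theta, lr):
    """Adjustable clipping: restrict grad to [-theta / lr, theta / lr]."""
    bound = theta / lr
    return max(-bound, min(bound, grad))


if __name__ == "__main__":
    theta = 0.01  # hypothetical value, not taken from the paper
    for epoch in (0, 10, 20):
        lr = learning_rate(epoch)
        # The same raw gradient is clipped less tightly as lr shrinks.
        print(epoch, lr, clip_gradient(5.0, theta, lr))
```

Because the bound scales with 1/γ, the effective update size γ·g stays within [-θ, θ] regardless of the current learning rate.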

Fig. 4: Visual comparisons between different algorithms for an Urban100 [17] image with scale factor 4×. Panels (PSNR, SSIM): Ground Truth (GT); Ours (28.48, 0.8605); LapSRN [23] (27.83, 0.8442); VDSR [20] (28.03, 0.8458); FSRCNN [9] (26.78, 0.8059); SRCNN [8] (26.27, 0.7859); A+ [37] (26.65, 0.8034); Bicubic [6] (24.76, 0.7220).
Fig. 5: Visual comparisons between different algorithms for an ImageNet400 image with scale factor 4×. Panels (PSNR, SSIM): Ground Truth (GT); Ours (24.53, 0.7907); LapSRN [23] (24.00, 0.7838); VDSR [20] (24.21, 0.7856); FSRCNN [9] (23.33, 0.7630); SRCNN [8] (22.74, 0.7474); A+ [37] (21.67, 0.7524); Bicubic [6] (21.60, 0.7016).

IV-C Benchmark Results

Quantitative Evaluation. Here, we provide quantitative comparisons for 2×, 3×, and 4× SR results in Table I and 8× SR results in Table II, respectively. We compare our proposed method with bicubic interpolation and the following six state-of-the-art SR methods: A+ [37], JOR [4], SRCNN [8], FSRCNN [9], VDSR [20], and LapSRN [23]. We evaluated SR images based on three commonly used image quality metrics: peak signal-to-noise ratio (PSNR), structural similarity (SSIM) [38], and information fidelity criterion (IFC) [31]. In particular, IFC has been shown to be related to human visual perception [40]. For fair comparison of the 8× factor, we followed LapSRN [23] in using their datasets generated by retrained models of A+ [37], SRCNN [8], FSRCNN [9], and VDSR [20]. For scale factors of 4× and 8×, our approach was superior to other SR methods on all datasets.
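For reference, PSNR is defined as 10·log10(MAX² / MSE), where MAX is the peak pixel value and MSE is the mean squared error between the reference and the reconstruction. The sketch below implements that definition on flat pixel lists; it assumes 8-bit images (MAX = 255) and does not reproduce evaluation conventions such as luminance-only scoring or border cropping.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two flat pixel lists."""
    if len(reference) != len(estimate):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    hr = [10.0, 20.0, 30.0, 40.0]
    sr = [11.0, 21.0, 31.0, 41.0]  # every pixel off by one gray level
    print(round(psnr(hr, sr), 2))
```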

Visual Performance. As indicated in Tables I and II, DRFN achieved the highest IFC on all datasets. Combined with the visual samples in Figures 4, 5, 6, and 7, the findings show that the proposed DRFN estimated better visual details. For instance, images generated by other SR methods exhibited visible artifacts, whereas the proposed method generated a more visually pleasant image with clean details and sharp edges. For example, in Figure 5, the results of other methods are heavily blurred, and only our result demonstrates clear textures. The experimental results show that the proposed method can achieve good visual performance.

Fig. 6: Visual comparisons between different algorithms for a Set14 [44] image with scale factor 8×. Panels (PSNR, SSIM): Ground Truth (GT); Ours (20.31, 0.7926); LapSRN [23] (20.27, 0.7844); VDSR [20] (18.73, 0.6827); FSRCNN [9] (19.45, 0.7096); SRCNN [8] (19.52, 0.7246); A+ [37] (19.48, 0.7171); Bicubic [6] (18.76, 0.6838).
Fig. 7: Visual comparisons between different algorithms for a BSDS100 [1] image with scale factor 8×. Panels (PSNR, SSIM): Ground Truth (GT); Ours (27.79, 0.8018); LapSRN [23] (27.50, 0.7968); VDSR [20] (26.68, 0.7634); FSRCNN [9] (26.44, 0.7452); SRCNN [8] (26.40, 0.7436); A+ [37] (26.48, 0.7544); Bicubic [6] (25.15, 0.7161).

IV-D Model Analysis

In this subsection, we first compare 2× and 3× SR results of the proposed DRFN with existing methods. Then, we study the contributions of different components of the proposed DRFN to SR reconstruction and explore the effects of cycle times of recurrent blocks on reconstruction performance.

Comparisons of 2× and 3× SR Results. Although our work is intended for large-factor SR problems, the proposed DRFN can also perform 2× and 3× SR well. The quantitative results for 2× and 3× SR are presented in Table I. The proposed DRFN outperformed existing methods on 2× and 3× SR results, suggesting that our methodology is reasonable and effective. DRFN is hence powerful enough to handle different scaling factors.

| Version | Set5 | Set14 | BSDS100 |
|---|---|---|---|
| 4×-prebic | 31.41 | 28.10 | 27.29 |
| 4×-posttransconv | 31.50 | 28.19 | 27.34 |
| 4×-pretransconv | 31.55 | 28.30 | 27.39 |
| 8×-posttransconv | 26.03 | 24.39 | 24.50 |
| 8×-pretransconv | 26.22 | 24.47 | 24.60 |

TABLE III: Average PSNR when DRFN performs image magnification using bicubic interpolation or transposed convolution at the front or end of the network, for scale factors 4× and 8× on Set5 [2], Set14 [44], and BSDS100 [1].
Fig. 8: Feature maps of Levels 1, 2, and 3 from Figure 2 with the image of a “butterfly” in Set5 as input. The figure shows parts of all feature maps.
Fig. 9: Contribution of different components in the proposed network.
Fig. 10: PSNR and parameters of existing CNN models for scale factor 4 on Set14 [44]. Red point is our model. With an appropriate number of parameters, DRFN achieves better performance than state-of-the-art methods.

Transposed Convolution. First, to verify the superiority of transposed convolution compared to bicubic interpolation, we used an interpolated image as input and replaced the transposed convolution with general convolution. Second, we used the original small image as input and placed transposed convolution at the final part of the network to enlarge the image. By doing so, we can show that the location of transposed convolution has an effect on reconstruction results. We carried out experiments on 4× and 8× scale factors; results are shown in Table III. In the table, 4×-prebic and 4×(8×)-posttransconv represent the above two contrast experiments, respectively; and 4×(8×)-pretransconv is our DRFN_x4 (DRFN_x8). Visual results are shown in Figure 9 (the three-level network is DRFN_x4). Quantitative and qualitative results indicate that using transposed convolution to replace bicubic interpolation and placing the transposed convolution at the forefront of the network can boost performance. For example, 4×-pretransconv was 0.14 dB higher than 4×-prebic on Set5, and 8×-pretransconv was 0.10 dB higher than 8×-posttransconv on BSDS100. Although pretransconv and posttransconv had the same number of model parameters, pretransconv was more computationally intensive in the prediction phase because the subsequent convolution layers recover high-frequency information in the HR space. The average inference times of posttransconv and pretransconv for each image in Set5 [2] with a scale factor of 4× were 0.584 s and 1.187 s, respectively. For greater performance improvements, we chose pretransconv as our model.

Each cell lists PSNR / SSIM.

| Scale | Version | Set5 | Set14 | BSDS100 |
|---|---|---|---|---|
| 4× | one-level | 31.25 / 0.879 | 28.09 / 0.767 | 27.23 / 0.723 |
| 4× | two-level | 31.41 / 0.883 | 28.14 / 0.770 | 27.28 / 0.726 |
| 4× | three-level | 31.55 / 0.886 | 28.30 / 0.774 | 27.39 / 0.729 |
| 8× | one-level | 25.75 / 0.716 | 24.19 / 0.610 | 24.40 / 0.578 |
| 8× | two-level | 25.85 / 0.721 | 24.28 / 0.614 | 24.45 / 0.580 |
| 8× | three-level | 26.22 / 0.740 | 24.57 / 0.625 | 24.60 / 0.587 |

TABLE IV: Average PSNR and SSIM of DRFN at different levels, for scale factors 4× and 8× on Set5 [2], Set14 [44], and BSDS100 [1].

Recurrent residual learning. Kim et al. [20] demonstrated that the performance of the network improved as the depth increased. However, deeper networks need to train more parameters. Our use of a recurrent learning strategy greatly reduced the model complexity. For instance, a recurrent residual block with three convolution layers that loops five times reuses the same weights in every loop, so the added depth introduces no additional parameters. Figure 10 shows the PSNR performance of several recent CNN models for SR versus the number of parameters. The proposed method achieved better performance with an appropriate number of model parameters.
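The saving from weight sharing can be estimated with simple arithmetic. The sketch below counts the parameters of a block of three 3×3 convolutions and compares a shared-weight block looped five times against five unshared copies. The 64-channel width follows the 3×3×64 label in Figure 3; including bias terms and using 64 input channels for every layer are assumptions, so the totals are illustrative rather than the paper's reported figures.

```python
def conv_params(in_channels, out_channels, kernel=3, bias=True):
    """Number of learnable parameters in one 2-D convolution layer."""
    weights = in_channels * out_channels * kernel * kernel
    return weights + (out_channels if bias else 0)


def block_params(channels=64, convs=3):
    """Parameters in one residual block of `convs` same-width convolutions."""
    return convs * conv_params(channels, channels)


def params_saved_by_sharing(loops, channels=64, convs=3):
    """Parameters avoided by looping one block instead of stacking copies."""
    return (loops - 1) * block_params(channels, convs)


if __name__ == "__main__":
    print(block_params())              # one shared block
    print(5 * block_params())          # five unshared copies
    print(params_saved_by_sharing(5))  # difference due to weight sharing
```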

| Cycle times | PSNR | SSIM | Time |
|---|---|---|---|
|  | 26.67 | 0.7046 | 0.62 |
| 5 | 27.32 | 0.7270 | 0.80 |
| 10 | 27.39 | 0.7293 | 1.28 |

TABLE V: Comparison of different DRFN depths. Different depths can be achieved by setting cycle times of recurrent blocks. Time is the average inference time per image in BSDS100 as measured on an NVIDIA GTX 1080Ti GPU.

Multi-level structure. To verify that the multi-level structure demonstrated an improved role in image reconstruction, (1) we removed the first two levels from DRFN (denoted as one-level), and (2) removed Level 2 from DRFN (denoted as two-level) for experimental comparison. The result of the three-level network was best as shown in Table IV. As displayed in Figure 9, the three-level network reconstructed the image with richer texture details compared with the one-level and two-level networks. Each level had a positive effect on the result. Taking the image “butterfly” in Set5 as input, feature maps of different levels appear in Figure 8. These results suggest that the features at each recovery stage had different context characteristics. The multi-level structure rendered our model more robust and more accurate.

Network depth. We also studied the effect of the number of cycles of the recurrent block. A different number of cycles gives the network a different depth. We set the number of cycles to , , and , respectively, and kept the two recurrent blocks at the same depth. We did not train deeper networks due to GPU memory limitations. The experimental results in Table V show that increasing the number of cycles boosts performance but also increases inference time. To achieve better results, we chose 10 cycles for the benchmark model.
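The trade-off in Table V can be quantified with a few lines of arithmetic. The sketch below uses the table rows as they appear here; the cycle count of the first row is missing in this copy and is therefore kept as `None`, and the time values are used in the table's own units. The function name `step_tradeoffs` is illustrative.

```python
# Rows of Table V: (cycle times, PSNR in dB, SSIM, average inference time).
# The cycle count of the first row is missing in this copy, hence None.
TABLE_V = [
    (None, 26.67, 0.7046, 0.62),
    (5, 27.32, 0.7270, 0.80),
    (10, 27.39, 0.7293, 1.28),
]


def step_tradeoffs(rows):
    """For each consecutive pair of rows, return
    (from_cycles, to_cycles, PSNR gain in dB, relative time increase in %)."""
    steps = []
    for prev, cur in zip(rows, rows[1:]):
        psnr_gain = round(cur[1] - prev[1], 2)
        time_increase = round((cur[3] / prev[3] - 1) * 100, 1)
        steps.append((prev[0], cur[0], psnr_gain, time_increase))
    return steps


for src, dst, d_psnr, d_time in step_tradeoffs(TABLE_V):
    print(f"{src} -> {dst} cycles: +{d_psnr} dB PSNR, +{d_time}% time")
```

On these numbers, going from 5 to 10 cycles adds 0.07 dB of PSNR for a 60% longer inference time, so the returns from extra depth diminish quickly.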

V Conclusions

In this paper, we propose a DRFN for accurate SISR with large scale factors. Our DRFN uses transposed convolution to jointly extract and upsample raw features, and the following convolution layers focus on mapping in the HR feature space. High-frequency information is gradually recovered by recurrent residual blocks. Multi-level fusion makes full use of potential information for HR image reconstruction. The proposed DRFN extends quantitative and qualitative SR performance to a new state-of-the-art level. Extensive benchmark experiments and analyses indicate that DRFN is a superior SISR method, especially for large factors.

As DRFN has achieved outstanding performance at scale factors of 4× and 8×, we intend to apply it to more challenging upscaling factors such as 12×. We also plan to generalize our method to other applications, such as denoising and deblurring.

References

  • [1] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik (2011) Contour detection and hierarchical image segmentation. IEEE transactions on pattern analysis and machine intelligence 33 (5), pp. 898–916. Cited by: Fig. 6, Fig. 7, §IV-A, TABLE III, TABLE IV.
  • [2] M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel (2012) Low-complexity single-image super-resolution based on nonnegative neighbor embedding. Cited by: §IV-A, §IV-D, TABLE III, TABLE IV.
  • [3] Z. Cui, H. Chang, S. Shan, B. Zhong, and X. Chen (2014) Deep network cascade for image super-resolution. In European Conference on Computer Vision, pp. 49–64. Cited by: §II.
  • [4] D. Dai, R. Timofte, and L. Van Gool (2015) Jointly optimized regressors for image super-resolution. In Computer Graphics Forum, Vol. 34, pp. 95–104. Cited by: §IV-C, TABLE I.
  • [5] S. Dai, M. Han, W. Xu, Y. Wu, and Y. Gong (2007) Soft edge smoothness prior for alpha channel super resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §II.
  • [6] C. De Boor (1962) Bicubic spline interpolation. Studies in Applied Mathematics 41 (1-4), pp. 212–218. Cited by: Fig. 4, Fig. 5, Fig. 6, Fig. 7, TABLE I, TABLE II.
  • [7] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §IV-A.
  • [8] C. Dong, C. C. Loy, K. He, and X. Tang (2016) Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence 38 (2), pp. 295–307. Cited by: §I, §I, §II, Fig. 4, Fig. 5, Fig. 6, Fig. 7, §IV-C, TABLE I, TABLE II.
  • [9] C. Dong, C. C. Loy, and X. Tang (2016) Accelerating the super-resolution convolutional neural network. In European Conference on Computer Vision, pp. 391–407. Cited by: §I, §II, §III-A, Fig. 4, Fig. 5, Fig. 6, Fig. 7, §IV-C, TABLE I, TABLE II.
  • [10] C. E. Duchon (1979) Lanczos filtering in one and two dimensions. Journal of Applied Meteorology 18 (8), pp. 1016–1022. Cited by: §II.
  • [11] N. Efrat, D. Glasner, A. Apartsin, B. Nadler, and A. Levin (2013) Accurate blur models vs. image priors in single image super-resolution. In IEEE International Conference on Computer Vision, pp. 2832–2839. Cited by: §II.
  • [12] R. Fattal (2007) Image upsampling via imposed edge statistics. In ACM Transactions on Graphics, Vol. 26, pp. 95. Cited by: §II.
  • [13] G. Freedman and R. Fattal (2011) Image and video upscaling from local self-examples. ACM Transactions on Graphics 30 (2), pp. 12. Cited by: §II.
  • [14] W. T. Freeman, T. R. Jones, and E. C. Pasztor (2002) Example-based super-resolution. IEEE Computer graphics and Applications 22 (2), pp. 56–65. Cited by: §I, §II.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In IEEE International Conference on Computer Vision, pp. 1026–1034. Cited by: §IV-B.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §III-B.
  • [17] J. Huang, A. Singh, and N. Ahuja (2015) Single image super-resolution from transformed self-exemplars. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5197–5206. Cited by: Fig. 1, §II, Fig. 4, §IV-A.
  • [18] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pp. 448–456. Cited by: §III-B.
  • [19] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell (2014) Caffe: convolutional architecture for fast feature embedding. In ACM international conference on Multimedia, pp. 675–678. Cited by: §III-C.
  • [20] J. Kim, J. Kwon Lee, and K. Mu Lee (2016) Accurate image super-resolution using very deep convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1646–1654. Cited by: §I, §I, §II, Fig. 4, Fig. 5, Fig. 6, Fig. 7, §IV-A, §IV-B, §IV-C, §IV-D, TABLE I, TABLE II.
  • [21] J. Kim, J. Kwon Lee, and K. Mu Lee (2016) Deeply-recursive convolutional network for image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1637–1645. Cited by: §I, §I, §II.
  • [22] N. Kumar and A. Sethi (2016) Fast learning-based single image super-resolution. IEEE Transactions on Multimedia 18 (8), pp. 1504–1515. Cited by: §I.
  • [23] W. Lai, J. Huang, N. Ahuja, and M. Yang (2017) Deep laplacian pyramid networks for fast and accurate super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §I, §II, Fig. 4, Fig. 5, Fig. 6, Fig. 7, §IV-C, TABLE I, TABLE II.
  • [24] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2016) Photo-realistic single image super-resolution using a generative adversarial network. IEEE Conference on Computer Vision and Pattern Recognition, pp. 4681–4690. Cited by: §I.
  • [25] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee (2017) Enhanced deep residual networks for single image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §I.
  • [26] J. Liu, W. Yang, X. Zhang, and Z. Guo (2017) Retrieval compensated group structured sparsity for image super-resolution. IEEE Transactions on Multimedia 19 (2), pp. 302–316. Cited by: §II.
  • [27] D. Martin, C. Fowlkes, D. Tal, and J. Malik (2001) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In IEEE International Conference on Computer Vision, Vol. 2, pp. 416–423. Cited by: §IV-A.
  • [28] T. Michaeli and M. Irani (2013) Nonparametric blind super-resolution. In IEEE International Conference on Computer Vision, pp. 945–952. Cited by: §II.
  • [29] S. Nah, T. H. Kim, and K. M. Lee (2016) Deep multi-scale convolutional neural network for dynamic scene deblurring. arXiv preprint arXiv:1612.02177. Cited by: §III-B.
  • [30] S. Schulter, C. Leistner, and H. Bischof (2015) Fast and accurate image upscaling with super-resolution forests. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3791–3799. Cited by: §II, §IV-A.
  • [31] H. R. Sheikh, A. C. Bovik, and G. De Veciana (2005) An information fidelity criterion for image quality assessment using natural scene statistics. IEEE Transactions on Image Processing 14 (12), pp. 2117–2128. Cited by: §IV-C.
  • [32] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1874–1883. Cited by: §I, §II.
  • [33] Y. Shi, K. Wang, C. Chen, L. Xu, and L. Lin (2017) Structure-preserving image super-resolution via contextualized multi-task learning. IEEE Transactions on Multimedia. Cited by: §I.
  • [34] A. Singh and N. Ahuja (2014) Super-resolution using sub-band self-similarity. In Asian Conference on Computer Vision, pp. 552–568. Cited by: §II.
  • [35] Y. Tai, J. Yang, X. Liu, and C. Xu (2017) MemNet: a persistent memory network for image restoration. In Proceedings of International Conference on Computer Vision. Cited by: §II.
  • [36] Y. Tai, J. Yang, and X. Liu (2017) Image super-resolution via deep recursive residual network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Cited by: §II.
  • [37] R. Timofte, V. De Smet, and L. Van Gool (2014) A+: adjusted anchored neighborhood regression for fast super-resolution. In Asian Conference on Computer Vision, pp. 111–126. Cited by: §II, Fig. 4, Fig. 5, Fig. 6, Fig. 7, §IV-C, TABLE I, TABLE II.
  • [38] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612. Cited by: §IV-C.
  • [39] K. Xu, X. Wang, X. Yang, S. He, Q. Zhang, B. Yin, X. Wei, and R. W. Lau (2018) Efficient image super-resolution integration. The Visual Computer 34 (6-8), pp. 1065–1076. Cited by: §II.
  • [40] C. Yang, C. Ma, and M. Yang (2014) Single-image super-resolution: a benchmark. In European Conference on Computer Vision, pp. 372–386. Cited by: §IV-C.
  • [41] C. Yang and M. Yang (2013) Fast direct super-resolution by simple functions. In IEEE International Conference on Computer Vision, pp. 561–568. Cited by: §II.
  • [42] J. Yang, J. Wright, T. S. Huang, and Y. Ma (2010) Image super-resolution via sparse representation. IEEE Transactions on Image Processing 19 (11), pp. 2861–2873. Cited by: §II, §IV-A.
  • [43] W. Yang, J. Feng, J. Yang, F. Zhao, J. Liu, Z. Guo, and S. Yan (2017) Deep edge guided recurrent residual learning for image super-resolution. IEEE Transactions on Image Processing 26 (12), pp. 5895–5907. Cited by: §II.
  • [44] R. Zeyde, M. Elad, and M. Protter (2010) On single image scale-up using sparse-representations. In International conference on curves and surfaces, pp. 711–730. Cited by: Fig. 10, §IV-A, TABLE III, TABLE IV.
  • [45] Y. Zhang, Y. Zhang, J. Zhang, and Q. Dai (2016) CCR: clustering and collaborative representation for fast single image super-resolution. IEEE Transactions on Multimedia 18 (3), pp. 405–417. Cited by: §II.