
Content-adaptive Representation Learning for Fast Image Super-resolution

by   Yukai Shi, et al.

Deep convolutional networks have attracted great attention in image restoration and enhancement. Generally, restoration quality has been improved by stacking more and more convolutional blocks. However, these methods mostly learn one specific model to handle all images and ignore the diversity of difficulty. In other words, an area of an image with high frequency tends to lose more information during compression, while an area with low frequency tends to lose less. In this article, we address the efficiency issue in image SR by incorporating a patch-wise rolling network (PRN) that content-adaptively recovers images according to their difficulty levels. In contrast to existing studies that ignore difficulty diversity, we adopt different stages of a neural network to perform image restoration. In addition, we propose a rolling strategy that utilizes the parameters of each stage more flexibly. Extensive experiments demonstrate that our model not only shows a significant acceleration but also maintains state-of-the-art performance.


I Introduction

Deep learning has been successfully applied in many computer vision fields such as image recognition [residual_net], semantic segmentation [unet] and object detection [ouyang2015deepid]. Inspired by its rapid development and superior performance, many efforts have been made to introduce deep learning into low-level vision and image processing tasks, including image super-resolution [srcnn], image enhancement [dped] and inpainting [shepard]. Meanwhile, single image super-resolution (SISR), namely predicting a high-resolution image from a low-resolution input, is widely used in many computer vision applications and draws plenty of attention [srcnn, vdsr, srgan, edsr, subpixel, laplacian, fsrcnn].

Recently, convolutional neural networks (CNNs) have achieved remarkable improvements in image restoration by adopting a building-block strategy. VDSR [vdsr] utilizes residual connections and a very deep model to achieve promising results in image SR. EDSR [edsr] further improves the results by adopting residual blocks [residual_net] and removing batch normalization. However, these methods advance performance at the cost of many more parameters and a huge computational cost. The dense block [densenet] also exhibits its effectiveness in image enhancement. MemNet [memNet] realizes a coarse-to-fine restoration process by using dense blocks and recursive units. Zhang et al. [residualdensenet] propose an optimized block, which combines the strengths of the dense block and the residual block, and achieve an impressive improvement. However, deep learning-based SR methods [vdsr, memNet, residualdensenet, edsr] crop the image into patches before the training phase. As different patches have various textures and structures, it is inefficient to push all samples through the same feed-forward network, especially the very simple patches. In addition, although such a complicated model can deliver good performance on a graphics processing unit (GPU), it also leads to an expensive computational cost and an explosion of parameters.

Fig. 1: Efficiency analysis of the proposed model. The results are evaluated on BSDS100 with factor 4. The proposed model runs twice as fast as state-of-the-art SR models with superior performance.

Computer vision applications and technologies [dped, shufflenet, mobilenet] for mobile devices draw a lot of attention as they have wide application scenarios. However, using CNNs on a mobile platform places extreme demands on efficiency. MobileNet [mobilenet] makes an attempt to accelerate inference by utilizing depth-wise convolution to reduce the redundancy of CNNs. A similar technique is also adopted by ShuffleNet [shufflenet]. Moreover, ShuffleNet employs a novel shuffle unit, which maintains performance while improving efficiency. However, these methods are limited by the optimization of the computational platform and sometimes run inefficiently. IGC [IGC] utilizes the parameters of the deep network more efficiently by adopting group convolution and permutation of convolutional features. A similar idea is also used by RRC [RRC]. RRC implements a rolling strategy for object detection, which not only utilizes multi-scale features but also realizes an efficient one-stage framework. These methods reveal that features of different scales can be utilized more efficiently. MSDNet [msdnet] proposes a multi-scale dense network, which adaptively uses a specific stage of the deep model to deal with samples of different difficulty levels. For instance, MSDNet adopts early-stage convolutional layers to handle easy samples, and more parameters are applied to process difficult images. However, MSDNet distinguishes the difficulty level with an internal high-level representation of the image itself. Since such an internal high-level prior does not exist in low-level vision, MSDNet cannot be applied to image processing tasks.

Motivated by previous works, we propose a content-adaptive and flexible framework, which can accurately super-resolve images of different difficulty levels according to a gradient prior. In the proposed model, we first define the gradient prior to distinguish different samples. Then, a unified model is proposed to handle samples of different difficulty in a content-adaptive fashion. Since samples of different difficulty cause frequency conflicts and result in performance degradation, we also propose a flexible rolling strategy that alternates the convolution filters to address this problem.

Our main contributions are summarized as follows.

  • We find that it is inefficient to apply an expensive model to mild samples, which have less texture and a simple structure. In contrast, an expensive model is appropriate for samples that have rich texture and a complicated structure.

  • Based on the above observation, we distinguish the difficulty of samples by their gradient prior and content-adaptively adopt different convolutional stages to super-resolve them. This strategy helps us greatly improve SR efficiency.

  • Samples of different difficulty exhibit various properties in the frequency domain, which causes frequency conflicts and leads to performance degradation. We therefore propose a flexible rolling strategy. With our rolling approach, our model not only achieves a balance between mild and severe samples but also increases the receptive field of the early layers.

II Related Work

CNN for image SR. Recently, deep learning based methods have achieved great success in many computer vision fields. Super-resolution, which is considered a typical low-level vision task and is well known for its ill-posed nature, plays an important role in image quality enhancement. Many researchers have devoted themselves to the study of super-resolution and have proposed many insightful works. Recently, the rise of deep learning methods has given new solutions to image SR. Dong et al. [srcnn] first adopt deep convolutional neural networks to learn the mapping from LR to HR patches in an end-to-end manner and greatly boost the performance of image SR. Afterward, many deep learning based methods have been proposed to improve the performance, mainly by developing the network architecture. VDSR [vdsr] and IRCNN [ircnn] increased the network depth by adding more convolutional layers, and DRCN [drcn] introduced recursive learning for parameter sharing. Tai et al. introduced recursive blocks in DRRN [drrn] and memory blocks in MemNet [memNet]. While all of these methods have greatly improved SR performance by exploiting different network architectures, they have not considered the efficiency of SR, which keeps learning-based SR methods away from real-world application.

In contrast to chasing a smaller mean square error, we focus on improving image restoration quality while also boosting the speed of the algorithm, which has been neglected for a long time. FSRCNN [fsrcnn] makes an attempt to address this issue by adopting down-sampled patches as input and using deconvolution to speed up computation. This method effectively reduces redundancy and inspired us to explore the potential of accelerating SR. ESPCN [espcn] used a pixel shuffling operation to reduce the feature volume and the checkerboard effect, which also greatly accelerated the SR network. Although these methods obtain a small running time, they do not fully utilize the inherent property of the SR problem. Image SR has an internal diversity of difficulty: an area of an image with high frequency tends to lose more information during compression, while an area with low frequency tends to lose less. However, the aforementioned methods ignore this property and adopt one feed-forward model to process all samples.

Neural network acceleration. Obtaining a better balance between accuracy and efficiency has attracted research attention for decades. Many studies change the connectivity structure of deep convolutional networks, such as ShuffleNet [shufflenet], or introduce a more compact convolution operation, such as MobileNet [mobilenet] and MobileNetV2 [mobilenetv2]. These studies have done well in reducing computation cost while maintaining or even improving performance. However, these methods can be slower than a plain network on some computing platforms. Some studies focus on reducing model size after training, such as weight pruning [lecun1990, li2016] and weight quantization [hubara2016, rastegari2016]. These studies construct new models at test time and re-train or fine-tune them to achieve performance close to that of the original models.

Other studies focus on altering the evaluation manner. FractalNets [fractalnet] perform prediction at any time by progressively evaluating subnetworks of the full network. Bolukbasi et al. [bolukbasi2017] address this problem by adaptively evaluating neural networks. Different from these works, MSDNet [msdnet] adopts a specially designed network with multiple classifiers, which can directly output confidence scores to control the evaluation process for each test example. The adaptive computation time method [graves2016] and its extension [figurnov2017] also perform an adaptive evaluation of test examples but focus on skipping units rather than layers. Feedback Networks [zamir2017] heavily share parameters and allow early predictions in a recurrent process. However, these methods are less efficient in sharing computation. Our method is most inspired by MSDNet. Different from MSDNet, our proposed method focuses on the difficulty diversity of the image itself. We also explore the frequency conflict that occurs within a single model and therefore propose an original rolling strategy to handle the conflict.

III Methodology

(a) Successful Examples
(b) Failure Examples
Fig. 2: Visualization results of a regular CNN for image SR. The left examples bring an enormous PSNR (dB) gain over Bicubic. The right samples have higher absolute PSNR (dB) values, yet they contribute only a slight PSNR (dB) gain compared with Bicubic. This reveals that a CNN is well suited to restoring texture-rich images and brings a great improvement there, whereas patches with mild texture have a tiny upper bound.

Problem. We investigate the failure cases that lead to poor performance in image SR. Given a 7-layer CNN, which has six convolutional layers of one size and one convolutional layer of another size, we train it for 10 epochs on General-100 [fsrcnn] to super-resolve images with factor 3 and test it on BSDS100 [bsd]. In Fig. 2(a) and 2(b), we visualize its successful and failure examples. We define the examples that achieve more than 1 dB improvement over Bicubic as successful cases. The failure cases are the examples that obtain less than 1 dB improvement. In Fig. 2(b), we can observe that the examples that achieve minor improvement are mild or inherently blurry. In contrast, the successful cases have rich texture and drastic gradients. Moreover, we also extract features from the middle of VDSR [vdsr] and present them in Fig. 3(a) and 3(b). It can be observed that the responses around high-frequency regions are strong, while VDSR gives a low response to mild regions.

Fig. 3: The features of bird and butterfly are visualized in (a) and (b). For better visualization, we reduce the channel dimension to 1 with a max operation.

According to our observation, we make three assumptions as follows. 1) Examples with rich texture can bring an enormous gain, whereas mild examples are unable to demonstrate a similar improvement. 2) Since a deep neural network gives a low response to mild regions and mild examples are very simple, a very deep neural network, which is widely used in image SR, contributes little further improvement on mild samples. 3) Examples with severe, moderate and mild texture can be easily distinguished by their gradient information. To address the aforementioned problems, we propose an end-to-end framework that jointly learns the image SR task with gradient prior knowledge.

Overview of PRN. The proposed PRN aims at learning a framework that can super-resolve images more efficiently. More specifically, the proposed framework first labels patches according to the gradient prior. Thus, we can fetch different patches from different feature levels. Since the bottom convolutional stage has a tiny receptive field and mild patches differ from severe samples in terms of frequency, we then relieve these problems by adopting a novel strategy that rolls the convolutional filters. Next, we first describe the definition of the gradient prior and then present the setting of the proposed framework.

Fig. 4: Gradient properties of 10,000 severe and mild images. (a) Distribution of the vertical-axis gradient of high PSNR (dB) gain samples (i.e., severe images) in Fig. 2(a). (b) Distribution of the vertical-axis gradient of small PSNR (dB) gain cases (i.e., mild images) in Fig. 2(b). We can easily distinguish successful and failure samples according to the gradient value.
Fig. 5: Overview of the proposed PRN. The input image has plenty of blank regions, and therefore we fetch results from different positions within the network to boost efficiency. We use red, blue and black bounding boxes to label mild, moderate and severe patches and then send them into the network. Sub-images with different gradient priors are fetched from different convolutional layers. The arrow lines with the corresponding colors indicate the stage from which we fetch the different patches. With this strategy, our method demonstrates a significant efficiency improvement.

III-A Gradient Prior

The proposed gradient prior is based on the observation that the failure samples in image SR usually have a uniform gradient without sharp edges. As samples with a uniform gradient contain little pattern information and their upper bound for restoration is also pretty low, a simple and fast convolutional neural network can handle them well. We show the vertical gradient distributions of 10,000 successful and failure samples in Fig. 4(a) and 4(b), respectively. It is obvious that mild samples are concentrated at lower vertical gradient values, while the distribution of severe images mainly lies at large values. With this gradient property, severe and mild samples can be distinguished. For an image, we describe the gradient property as follows:


Here, the prior is computed from the input image by counting its gradient along the vertical axis. The resulting value is the gradient prior knowledge, which also serves as a tag in our model. With this tag, PRN is able to separate a set of images into mild, moderate and severe patches.


Two thresholds, an upper and a lower gradient threshold, separate the images. Moreover, we make an ablation study on the gradient thresholds in Section IV-B. Although the prior is proposed based on the assumption that a mild-texture image is too simple to bring an enormous gain, we show that this prior can also be applied to accelerate image SR.
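As a minimal sketch of how such a tag could be computed, the snippet below scores a luminance patch by its vertical gradient and compares the score with a lower and an upper threshold. The aggregation used here (mean absolute difference between vertically adjacent pixels), the function names, and the threshold values in the usage lines are all assumptions made for illustration; the exact formula and thresholds of the paper are not recoverable from the text above.

```python
def vertical_gradient_score(patch):
    """Mean absolute difference between vertically adjacent pixels.

    `patch` is a list of rows of luminance values. The choice of a mean of
    absolute differences is an assumption of this sketch.
    """
    rows, cols = len(patch), len(patch[0])
    if rows < 2:
        return 0.0
    total = 0.0
    for r in range(rows - 1):
        for c in range(cols):
            total += abs(patch[r + 1][c] - patch[r][c])
    return total / ((rows - 1) * cols)


def tag_patch(patch, lower, upper):
    """Tag a patch as 'mild', 'moderate' or 'severe' using two thresholds."""
    score = vertical_gradient_score(patch)
    if score < lower:
        return "mild"
    if score < upper:
        return "moderate"
    return "severe"


# Hypothetical patches and thresholds, chosen only to exercise each branch.
flat = [[10, 10, 10] for _ in range(4)]       # no vertical change
ramp = [[0, 0, 0], [4, 4, 4], [8, 8, 8]]      # gentle vertical ramp
edges = [[0, 0], [255, 255], [0, 0]]          # sharp horizontal edges
tags = [tag_patch(p, lower=2.0, upper=20.0) for p in (flat, ramp, edges)]
```

The tag is computed once per patch before the forward pass, so its cost is a single pass over the pixels and is small compared with the convolutions it lets the network skip.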

III-B Network architecture

As illustrated in Fig. 5, we put the tagged patches into the network for enhancement. To give the network a spacious receptive field, we first use 64 convolutional kernels of size 5×5. To make full use of cuDNN [cudnn], we then employ 4 convolutional layers with 3×3 kernels and 64 channels. Before the deconvolution operation, we apply a shrinking layer with 64 kernels of size 1×1 to reduce parameters. Meanwhile, we add Leaky ReLU as the activation function after each convolutional layer. Because the efficient 3×3 kernels work directly on the small feature maps, the proposed model is significantly faster. Finally, we use a deconvolution layer, whose stride equals the down-sampling factor, to perform the up-sampling operation.

To obtain higher efficiency, we incorporate the auxiliary tag into our model in an end-to-end manner. In other words, a patch can be fetched from a different feature level according to its tag. Since mild patches are smooth and have few edges and little texture, we obtain their features from the first convolutional layer. Then, we use a deconvolution layer to obtain the restored patches. For moderate samples, we fetch from the third convolutional layer, as they have some texture and edges. A similar up-sampling operation also acts on the moderate samples. Because severe patches have rich texture and details, we apply the deconvolution layer to them only after they have passed through all convolutional layers. The parameters of the network are optimized with the training loss. Meanwhile, different levels of parameters are learned with different training pairs. For instance, the early-stage layer is trained not only with mild image pairs but also with severe pairs. In contrast, the high-level parameters are optimized with severe pairs only. As shown in Fig. 5, we adopt such an efficient strategy to perform image SR.
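To make the saving from early fetching concrete, the following sketch counts multiply-accumulate operations (MACs) per low-resolution pixel up to each exit point. It assumes a single-channel (Y) input, one 5×5 layer followed by four 3×3 layers with 64 channels, and exits after the first, third and fifth convolutional layers as described above. The shrinking and deconvolution layers are left out, biases and activations are ignored, and the patch mix in the last line is hypothetical, so the numbers are only a rough illustration rather than the measured cost of the model.

```python
def conv_macs(kernel, in_channels, out_channels):
    """MACs per output pixel of one stride-1 convolution (bias ignored)."""
    return kernel * kernel * in_channels * out_channels


# (kernel size, input channels, output channels) for the five conv layers.
LAYERS = [(5, 1, 64)] + [(3, 64, 64)] * 4

# Number of conv layers a patch passes through before it is fetched.
EXIT_AFTER = {"mild": 1, "moderate": 3, "severe": 5}


def macs_until_exit(tag):
    """Cumulative MACs per low-resolution pixel for a patch with this tag."""
    return sum(conv_macs(*layer) for layer in LAYERS[:EXIT_AFTER[tag]])


def average_macs(tag_fractions):
    """Expected MACs per pixel for a given mix of tags (fractions sum to 1)."""
    return sum(frac * macs_until_exit(tag) for tag, frac in tag_fractions.items())


# Hypothetical mix: half of the patches are mild, half are severe.
mixed_cost = average_macs({"mild": 0.5, "moderate": 0.0, "severe": 0.5})
```

Under these assumptions a mild patch costs roughly one percent of a severe one in the convolutional trunk, which is why the share of mild regions in an image dominates the achievable speed-up.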

Fig. 6: Demonstration of the rolling strategy. The parameters for different patches (mild, moderate and severe) are labeled in red, blue and black. When we input a severe patch, the model rolls the first three dilated convolutional layers (red and blue columns) with regular convolutional layers (black column). Conversely, when we send a mild patch into the model, the model replaces the dilated convolution (red) with a regular convolution layer (black column). With this content-adaptive and flexible strategy, our model improves effectiveness and efficiency significantly.

III-C Rolling the convolutional filters

Although we can effectively enhance the image with the aforementioned model, we still find the following problems: 1) The receptive field of the early stage is tiny. When we try to improve the performance, the tiny receptive field of the early stage becomes a bottleneck. 2) Frequency conflicts. The frequency content of mild and moderate examples is significantly different from that of severe patches. When we train on mild samples in the early stage of the network, the output of the higher levels is influenced. Thus, we make an attempt to resolve the above problems by developing a novel rolling strategy.

The parameters of the CNN consist of four parts. The first part represents the early stage and consists of one convolutional layer. The second part is the middle stage and has two layers. The third part is the last stage and contains two convolutional layers, and the fourth part holds the parameters of the deconvolutional layer. In addition, we define two auxiliary sets of parameters. More specifically, the first auxiliary set represents a dilated convolution layer [dilation] of size 64×5×5 with dilation 1, and the second represents two dilated convolution layers of size 64×3×3 with dilation 1 as well. We denote the high-res and low-res patches by separate symbols, and three superscripts distinguish the mild, moderate and severe samples, respectively. For instance, the superscripted low-res symbols indicate low-res patches annotated with the mild and the severe tag. In sum, the enhancement of a severe patch can be defined as:


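where the operator denotes the enhancement operation. As sketched in Fig. 6, the network rolls the early-stage and middle-stage parameters with the two auxiliary parameter sets and fetches the patch from a different stage according to its tag. More specifically, when a tagged low-res patch is fed into the network, the model enhances it with the parameter sets selected explicitly for that tag. By analogy, the enhancement processes for the remaining two kinds of patches can be formulated as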




With such a flexible and content-adaptive rolling strategy, we not only resolve frequency conflicts but also increase the receptive field of the early stage.
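The effect of dilation on the receptive field can be checked with a short calculation. The sketch below uses the standard formula for a stack of stride-1 convolutions, where each layer adds (kernel − 1) × dilation pixels to the receptive field. It follows the convention in which a dilation of 1 is an ordinary convolution (as in common deep learning frameworks); which convention the text above uses for "1 dilation" is not stated, so the dilation values in the usage lines are assumptions for illustration only.

```python
def receptive_field(layers):
    """Receptive field (in pixels) of stacked stride-1 convolutions.

    `layers` is a list of (kernel_size, dilation) pairs. Here dilation=1
    means an ordinary convolution; dilation=2 skips one pixel between taps.
    """
    field = 1
    for kernel, dilation in layers:
        field += (kernel - 1) * dilation
    return field


# Early stage only: one 5x5 layer, ordinary versus dilated.
early_plain = receptive_field([(5, 1)])
early_dilated = receptive_field([(5, 2)])

# Early plus middle stage: one 5x5 layer followed by two 3x3 layers.
middle_plain = receptive_field([(5, 1), (3, 1), (3, 1)])
middle_dilated = receptive_field([(5, 2), (3, 2), (3, 2)])
```

With these hypothetical settings, dilating the first layer alone widens its view from 5 to 9 pixels without adding parameters, which illustrates how swapping filter sets can enlarge the receptive field available to patches that exit early.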

III-D End-to-end framework

In contrast to training separate models with different datasets, the proposed model is not only able to fetch images from different stages but also optimizes each stage with a specific prior. The whole procedure can be formulated as an end-to-end framework to accelerate speed. We sketch the detailed procedure in Algorithm 1. Since a down-scaled mild sample is similar to its ground truth, the model is unable to learn how to recover realistic details and textures from mild training examples alone. To resolve this problem, we adopt the mild and moderate samples as the training pairs for the corresponding parameter sets. With such an efficient strategy, our model not only greatly improves the performance but also accelerates training and testing.

Input:  Training LR images; HR images;
1:  Crop the high-res and low-res images into patches and divide the patches into mild, moderate and severe sets w.r.t. the gradient prior;
2:  while  do
3:     ;
4:     Choose a set of LR and HR patches, send low-res patches into network;
5:     Obtain , via forward propagation;
6:     Update with and pairs;
7:     Update with and pairs;
8:     Update with and pairs;
9:  end while
Algorithm 1 Learning Algorithm of PRN

IV Experiments

Datasets. To make full use of the parameters in PRN, we use VOC2012 [voc2012] to pre-train our model. VOC2012 contains 17,125 clear images taken from natural scenes. Then, we fine-tune our model with BSD200 [bsd], which contains 200 images and is close to real-world scenes. BSD200 is augmented with scaling and rotation. We employ Set5, Set14, BSDS100, and Urban100 to evaluate our model.

Implementation Details. We use Xavier [xavier] initialization for the parameters of the proposed model. Besides, the deconvolution layer is initialized according to the weights of Bicubic interpolation. We zero-pad each convolutional layer to ensure that the output tensor has the same size as the input. We convert all images from RGB to YCbCr and extract the Y channel for training. The training and testing images are cropped into 54×54 patches and down-scaled with the corresponding factor to obtain the input. For 54×54 patches, the upper and lower gradient thresholds follow the default setting selected in Section IV-B. In training, we set the batch size to 64 and use the same learning rate for all layers. In testing, we set the batch size to 1. The learning rate is reduced by a factor of 10 every 300 epochs. We use leaky ReLU with a negative slope of 0.2 as the activation function. We perform training and testing on a desktop computer with an i7-4790 CPU, a GTX980Ti GPU, and 32GB RAM.
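Two of the settings above are simple enough to write down directly: the step schedule that divides the learning rate by 10 every 300 epochs, and the leaky ReLU with a negative slope of 0.2. The sketch below is framework-free and illustrative only. The function names are hypothetical, and the base learning rate is a parameter because its value is not given in the text.

```python
def step_learning_rate(base_lr, epoch, step=300, factor=10.0):
    """Learning rate after dividing by `factor` once every `step` epochs."""
    return base_lr / (factor ** (epoch // step))


def leaky_relu(x, negative_slope=0.2):
    """Leaky ReLU on a scalar: identity for x >= 0, scaled otherwise."""
    return x if x >= 0 else negative_slope * x


# base_lr=1.0 is a placeholder so that the schedule reads as a multiplier.
multipliers = [step_learning_rate(1.0, epoch) for epoch in (0, 299, 300, 650)]
```

Writing the schedule as a function of the epoch index keeps it stateless, so the same value is obtained when training is resumed from a checkpoint.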

Multi-scale training. Different from some state-of-the-art methods [fsrcnn, subpixel, srcnn], which train their models with a single factor, we adopt a multi-scale learning strategy to train PRN. Specifically, multi-scale learning trains the model with multiple down-sampling factors simultaneously. With multi-scale learning, PRN can learn more contextual knowledge across different degradations and achieves better performance.

Algorithm  Scale  Set5            Set14           BSDS100         Urban100
                  PSNR / SSIM     PSNR / SSIM     PSNR / SSIM     PSNR / SSIM
Bicubic    2x     33.69 / 0.931   30.25 / 0.870   29.57 / 0.844   26.89 / 0.841
A+         2x     36.60 / 0.955   32.32 / 0.906   31.24 / 0.887   29.25 / 0.895
RFL        2x     36.59 / 0.954   32.29 / 0.905   31.18 / 0.885   29.14 / 0.891
SelfEx     2x     36.60 / 0.955   32.24 / 0.904   31.20 / 0.887   29.55 / 0.898
SRCNN      2x     36.72 / 0.955   32.51 / 0.908   31.38 / 0.889   29.53 / 0.896
SCN        2x     36.58 / 0.954   32.35 / 0.905   31.26 / 0.885   29.52 / 0.897
FSRCNN     2x     37.05 / 0.956   32.66 / 0.909   31.53 / 0.892   29.88 / 0.902
Our        2x     37.09 / 0.957   32.90 / 0.910   31.66 / 0.893   30.23 / 0.909
Bicubic    3x     30.41 / 0.869   27.55 / 0.775   27.22 / 0.741   24.47 / 0.737
A+         3x     32.62 / 0.909   29.15 / 0.820   28.31 / 0.785   26.05 / 0.799
RFL        3x     32.47 / 0.906   29.07 / 0.818   28.23 / 0.782   25.88 / 0.792
SelfEx     3x     32.66 / 0.910   29.18 / 0.821   28.30 / 0.786   26.45 / 0.810
SRCNN      3x     32.78 / 0.909   29.32 / 0.823   28.42 / 0.788   26.25 / 0.801
SCN        3x     32.62 / 0.908   29.16 / 0.818   28.33 / 0.783   26.21 / 0.801
FSRCNN     3x     33.18 / 0.914   29.37 / 0.824   28.53 / 0.791   26.43 / 0.808
Our        3x     33.32 / 0.916   29.64 / 0.828   28.72 / 0.794   26.75 / 0.815
Bicubic    4x     28.43 / 0.811   26.01 / 0.704   25.97 / 0.670   23.15 / 0.660
A+         4x     30.32 / 0.860   27.34 / 0.751   26.83 / 0.711   24.34 / 0.721
RFL        4x     30.17 / 0.855   27.24 / 0.747   26.76 / 0.708   24.20 / 0.712
SelfEx     4x     30.34 / 0.862   27.41 / 0.753   26.84 / 0.713   24.83 / 0.740
SRCNN      4x     30.50 / 0.863   27.52 / 0.753   26.91 / 0.712   24.53 / 0.725
SCN        4x     30.41 / 0.863   27.39 / 0.751   26.88 / 0.711   24.52 / 0.726
FSRCNN     4x     30.72 / 0.866   27.61 / 0.755   26.98 / 0.715   24.62 / 0.728
Our        4x     31.08 / 0.875   27.89 / 0.762   27.17 / 0.728   24.86 / 0.733
TABLE I: The PSNR and SSIM results of different approaches on Set5, Set14, BSDS100 and Urban100 with down-sampling factors 2, 3 and 4. We use bold to label the first place.
              L1   L2   L3   L4   U1   U2   U3   U4
Value (10^1)  1    2    5    7    3    5    8    10
TABLE II: We have compared a wide range of potential gradient thresholds. L indicates the lower threshold and U the upper threshold. The number suffixed to L and U identifies the threshold value.

IV-A Comparison with State-of-the-arts

We compare our model with state-of-the-art methods, including A+ [aplus], SRF [srf], SelfEx [selfsr], RFL [rfl], SCN [scn], SRCNN [srcnn], LapSRN [laplacian], VDSR [vdsr], DRCN [drcn], and FSRCNN [fsrcnn]. We adopt the widely used quality metrics PSNR and SSIM to evaluate our model. For DRCN [drcn], we use our own implementation for comparison. For the remaining methods, we use their public code and models to obtain results.
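For reference, PSNR is defined as 10·log10(peak² / MSE), where MSE is the mean squared error between the reference and the restored image. The sketch below implements this definition for single-channel images stored as lists of rows. It assumes 8-bit intensities (peak 255) and applies no border cropping or color conversion, details that evaluation protocols often add and that are not specified in the text above.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    ref = [value for row in reference for value in row]
    est = [value for row in estimate for value in row]
    if len(ref) != len(est):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(ref, est)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)


# Tiny hypothetical example: every pixel is off by 10 grey levels.
reference = [[100, 100], [100, 100]]
restored = [[110, 90], [90, 110]]
score = psnr(reference, restored)
```

Because PSNR is logarithmic in the error, the gains of a few tenths of a dB reported in this section correspond to reductions in mean squared error of a few percent.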

As shown in Table I, our model achieves superior performance among light-weight methods [aplus, srf, selfsr, rfl, scn, srcnn, fsrcnn]. Compared with FSRCNN, our model achieves gains of 0.13 dB and 0.19 dB on BSDS100 with factors 2 and 3, respectively. Similarly, our model obtains a 0.69 dB gain over FSRCNN on Manga109 with factor 4. Limited by its parameter budget, our model is weaker than heavy models [drcn, vdsr, laplacian]. As sketched in Fig. 1, our model shows slightly lower performance than these huge models [drcn, vdsr, laplacian], but it runs several times faster. Therefore, the model is particularly competitive for mobile devices and applications.

We also show qualitative comparisons in Figures 8, 9, 10 and 11. For better visualization, we interpolate the chrominance channels with bicubic interpolation to obtain color images. Compared with other methods, our approach generates images with clearer boundaries and richer details.

IV-B Ablation study

In this section, we mainly investigate different settings of the proposed model and provide insights into the choice of hyper-parameters.

Fig. 7: Efficiency and effectiveness analysis of different gradient threshold on BSDS100.
Panels (PSNR): Bicubic 22.18 dB; A+ [aplus] 24.65 dB; RFL [rfl] 24.56 dB; SelfExSR [selfsr] 24.29 dB; SRCNN [srcnn] 25.65 dB; SCN [scn] 25.51 dB; LapSRN [laplacian] 27.55 dB; FSRCNN [fsrcnn] 25.90 dB; Our 26.71 dB; Original.
Fig. 8: Qualitative comparison on ’butterfly’ with the scaling factor of 4. We use red and blue to label best two results, respectively. Best viewed by zooming in the electronic version.
Panels (PSNR): Bicubic 30.22 dB; A+ [aplus] 32.63 dB; RFL [rfl] 32.33 dB; SelfExSR [selfsr] 32.90 dB; SRCNN [srcnn] 32.61 dB; SCN [scn] 32.47 dB; LapSRN [laplacian] 33.82 dB; FSRCNN [fsrcnn] 32.86 dB; Our 33.16 dB; Original.
Fig. 9: Qualitative comparison on ’bird’ with the scaling factor of 4. We use red and blue to label best two results, respectively. Best viewed by zooming in the electronic version.
Panels (PSNR): Bicubic 23.19 dB; A+ [aplus] 23.62 dB; RFL [rfl] 23.59 dB; SelfExSR [selfsr] 23.51 dB; SRCNN [srcnn] 23.67 dB; SCN [scn] 23.60 dB; LapSRN [laplacian] 23.74 dB; FSRCNN [fsrcnn] 23.64 dB; Our 23.75 dB; Original.
Fig. 10: Qualitative comparison on ’baboon’ with the scaling factor of 3. We use red and blue to label best two results, respectively. Best viewed by zooming in the electronic version.
Panels (PSNR): Bicubic 31.54 dB; A+ [aplus] 33.41 dB; RFL [rfl] 33.33 dB; SelfExSR [selfsr] 33.40 dB; SRCNN [srcnn] 33.55 dB; SCN [scn] 33.36 dB; LapSRN [laplacian] 33.88 dB; FSRCNN [fsrcnn] 33.59 dB; Our 33.67 dB; Original.
Fig. 11: Qualitative comparison on ’lena’ with the scaling factor of 3. We use red and blue to label best two results, respectively. Best viewed by zooming in the electronic version.

Gradient threshold. We first analyze the setting of the upper and lower gradient thresholds by investigating a wide range of potential values. In Table II, we list all thresholds we have compared. Different gradient thresholds influence both efficiency and effectiveness. In Fig. 7, we show the performance and efficiency of each setting. As the lower threshold increases, our model handles more moderate samples at the early stage, which accelerates inference but brings a significant performance drop. A similar situation occurs when we increase the upper threshold: as it grows, the proposed model becomes more efficient while its performance degrades. Since the middle and first stages are unable to handle severe samples well, we expect that sending severe samples to these stages brings an obvious performance drop. However, as illustrated in Fig. 7, the model becomes slower as the thresholds decrease. To achieve a balance between efficiency and performance, we adopt ‘Our_YU2L2’ as the default gradient threshold setting.

Depth of different stages. Here we compare the depth setting of each stage. In other words, we adjust the depth of the early and middle stages to verify our settings. In Table III, we use different depth settings in the early and middle stages for comparison. As shown in Table III, as the depth of the early stage increases, the model shows a slight PSNR gain at the cost of slower inference. Since the early stage is intended to handle mild patches only, we think too many parameters there bring no meaningful further improvement. In contrast, the middle stage is used to deal with moderate samples, which carry some texture and details. Therefore, the performance becomes worse when we reduce the parameters of the middle stage. Thus, we use a light-weight setting in the early stage and increase the parameters of the middle stage to obtain an efficient framework.

PSNR 27.12 27.13 27.15 27.05 27.12 27.15
Time 1.81 1.91 2.21 1.48 1.81 1.99
TABLE III: Comparison of different depths of the early and middle stages on BSDS100. In the column labels, the superscripts denote the depth of each stage and the subscripts indicate the stage.
       w/o Rolling   w/ Rolling
PSNR   27.03         27.12
TABLE IV: Comparison of models with and without the rolling strategy on BSDS100.

Rolling strategy. In order to show the effectiveness of the proposed rolling strategy, we investigate models with and without it. In Table IV, ‘w/o Rolling’ means the model without the rolling strategy and ‘w/ Rolling’ indicates the model with the rolling component. Compared with ‘w/o Rolling’, the model with the rolling strategy achieves a 0.09 dB improvement. Although our model has auxiliary parameters, we can use them content-adaptively to preserve efficiency. Thus, our model achieves superior performance and maintains competitive efficiency by adopting the rolling strategy.

IV-C Limitations

Although our model achieves a good balance between effectiveness and efficiency, it still has some limitations. To improve efficiency, we need to crop the image into smaller images and reassemble them at the end. Thus, our model needs additional time to accomplish the reconstruction procedure. The reconstruction cost is far less than the computational cost of the model, and we have counted the reconstruction time in the time complexity reported in the efficiency analysis. Besides, the acceleration depends on the dataset. For instance, our model accelerates greatly on BSDS100 or DIV2K, as the images in BSDS100 or DIV2K have plenty of blank and mild regions. A similar acceleration does not occur on General-100, as the images in General-100 are full of texture and edges. However, we think the majority of natural images, which are close to BSD500 and DIV2K, contain a certain percentage of blank or mild regions. Therefore, our model can achieve similar acceleration in real-world scenarios.
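The crop-and-reassemble step mentioned above can be sketched as follows. This is a minimal illustration that tiles a single-channel image into non-overlapping patches keyed by their top-left corner and then pastes them back. The non-overlapping layout, the function names and the handling of smaller edge tiles are assumptions of this sketch. In an actual SR pipeline each patch would be enhanced between the two calls and the paste offsets would be multiplied by the up-scaling factor.

```python
def split_into_patches(image, size):
    """Tile an image (list of rows) into non-overlapping size x size patches.

    Returns a dict mapping (top, left) to the patch. Tiles at the right and
    bottom borders may be smaller than `size`.
    """
    height, width = len(image), len(image[0])
    patches = {}
    for top in range(0, height, size):
        for left in range(0, width, size):
            patches[(top, left)] = [
                row[left:left + size] for row in image[top:top + size]
            ]
    return patches


def reassemble(patches, height, width):
    """Paste patches back at their (top, left) offsets."""
    image = [[None] * width for _ in range(height)]
    for (top, left), patch in patches.items():
        for offset, row in enumerate(patch):
            image[top + offset][left:left + len(row)] = row
    return image


# Hypothetical 5x7 image with distinct pixel values, tiled with size 3.
original = [[r * 7 + c for c in range(7)] for r in range(5)]
tiles = split_into_patches(original, 3)
restored = reassemble(tiles, 5, 7)
```

Reassembly touches each output pixel once, which is consistent with the statement above that its cost is small relative to the network computation.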

V Conclusion and further work

In this article, to address the efficiency problem in image SR, we have proposed an end-to-end gradient-aware rolling network. Our model incorporates a gradient prior derived from the image itself and content-adaptively utilizes each stage of the deep neural network to super-resolve corrupted images. Moreover, we have proposed a rolling strategy, which super-resolves images with different sets of filters, to resolve the frequency conflict problem. Experiments have shown that our framework not only obtains competitive performance but also achieves appealing efficiency.

There are several directions in which to extend our work. First, we can introduce an adversarial loss or a perceptual loss in each stage, aiming to restore more realistic details and texture. Second, considering that existing frameworks have to crop the image into patches, we intend to propose a more general framework, which can content-adaptively process different regions with different convolution strides, to boost efficiency.