Deep learning has been successfully applied in many computer vision fields such as image recognition [residual_net], semantic segmentation [unet] and object detection [ouyang2015deepid]. Inspired by this rapid development and superior performance, many efforts have been made to introduce deep learning into low-level vision and image processing tasks, including image super-resolution [srcnn], image enhancement [dped] and inpainting [shepard]. Among them, single image super-resolution (SISR), which predicts a high-resolution image from a low-resolution input, is widely used in many computer vision applications and has drawn considerable attention [srcnn, vdsr, srgan, edsr, subpixel, laplacian, fsrcnn].
Recently, convolutional neural networks (CNNs) have achieved significant improvements in image restoration by adopting a building-block strategy. VDSR [vdsr] utilizes residual connections and a very deep model to achieve promising results in image SR. EDSR [edsr] further improves the results by adopting residual blocks [residual_net] and removing batch normalization. However, these methods advance performance at the price of many more parameters and a huge computational cost. The dense block [densenet] also exhibits its effectiveness in image enhancement. MemNet [memNet] realizes a coarse-to-fine restoration process by using dense blocks and recursive units. Zhang et al. [residualdensenet] propose an optimized block, which combines the strengths of the dense block and the residual block, and achieve an impressive improvement. However, deep learning-based SR methods [vdsr, memNet, residualdensenet, edsr] typically crop the image into patches before the training phase. As different patches have different texture and structure, it is inefficient to run the same full feed-forward network on all samples, especially on very simple patches. In addition, although such a complicated model can deliver good performance on a graphics processing unit (GPU), it also leads to an expensive computational cost and an explosion of parameters.
Computer vision applications and technologies [dped, shufflenet, mobilenet] for mobile devices draw a lot of attention as they have wide application scenarios. However, using CNNs on a mobile platform places extreme requirements on efficiency. MobileNet [mobilenet] attempts to accelerate inference by utilizing depth-wise convolution to reduce the redundancy of CNNs. A similar technique is also adopted by ShuffleNet [shufflenet]. Moreover, ShuffleNet employs a novel shuffle unit, which maintains performance while improving efficiency. However, these methods are limited by the optimization of the computational platform and sometimes run inefficiently. IGC [IGC] utilizes the parameters of a deep network more efficiently by adopting group convolution and permutation of convolutional features. A similar idea is also used by RRC [RRC]. RRC implements a rolling strategy for object detection, which not only utilizes multi-scale features but also realizes an efficient one-stage framework. These methods reveal that features of different scales can be utilized more efficiently. MSDNet [msdnet] proposes a multi-scale dense network, which adaptively uses a specific stage of the deep model to deal with samples of different difficulty levels. For instance, MSDNet adopts early-stage convolutional layers to handle easy samples, while more parameters are applied to process difficult images. However, MSDNet distinguishes the difficulty level through an internal high-level representation of the image itself. Since such an internal high-level prior does not exist in low-level vision, MSDNet cannot be applied directly to image processing tasks.
Motivated by previous works, we propose a content-adaptive and flexible framework, which can accurately super-resolve images of different difficulty levels according to a gradient prior. In the proposed model, we first define the gradient prior to distinguish different samples. Then, a unified model is proposed to handle samples of different difficulty in a content-adaptive fashion. Since samples of different difficulty cause frequency conflicts and result in performance degradation, we also propose a flexible rolling strategy that alternates the convolution filters to address this problem.
Our main contributions are summarized as follows.
We find that it is inefficient to apply an expensive model to mild samples, which have little texture and simple structure. In contrast, an expensive model is appropriate for samples that have rich texture and complicated structure.
According to the above observation, we distinguish the difficulty of samples by their gradient prior and content-adaptively adopt different convolutional stages to super-resolve them. This strategy greatly improves SR efficiency.
Since samples of different difficulty exhibit different properties in the frequency domain, which causes frequency conflicts and leads to performance degradation, we propose a flexible rolling strategy. With our rolling approach, our model not only achieves a balance between mild and severe samples but also increases the receptive field of the early layers.
II Related Work
CNN for image SR. Recently, deep learning based SR methods have achieved great success in many computer vision fields. Super-resolution, which is considered a typical low-level vision task and is well known for its ill-posed nature, plays an important role in image quality enhancement. Many researchers have devoted themselves to the study of super-resolution and have proposed many insightful works. Recently, the rise of deep learning methods has given new solutions to image SR. Dong et al. [srcnn] first adopted deep convolutional neural networks to learn the mapping from LR to HR patches in an end-to-end manner and greatly boosted the performance of image SR. Afterward, many deep learning based methods were proposed to improve the performance, mainly by developing the network architecture. VDSR [vdsr] and IRCNN [ircnn] increased the network depth by adding more convolutional layers, and DRCN [drcn] introduced recursive learning for parameter sharing. Tai et al. introduced recursive blocks in DRRN [drrn] and memory blocks in MemNet [memNet]. While all of these methods have greatly improved SR performance by exploiting different network architectures, they have not considered the efficiency of SR, which keeps learning-based SR methods away from real-world application.
In contrast to chasing a smaller mean square error, we focus on improving image restoration quality while also boosting the speed of the algorithm, which has been neglected for a long time. FSRCNN [fsrcnn] attempts to address this issue by adopting down-sampled patches as input and deconvolution to speed up the computation. Their method effectively reduces redundancy and inspired us to explore the potential of accelerating SR. ESPCN [espcn] used a pixel shuffling operation to reduce the feature volume and the checkerboard effect, which also greatly accelerated the SR network. Although these methods obtain a small running time, they do not fully utilize the inherent properties of the SR problem. Image SR has an internal diversity of difficulty: an area of an image with high frequency tends to lose more information during compression, while an area with low frequency tends to lose less. However, the aforementioned methods ignore this property and adopt the same feed-forward model to process all samples.
Neural network acceleration. Obtaining a better balance between accuracy and efficiency has attracted many research communities for decades. Many studies change the connectivity structure of deep convolutional networks, such as ShuffleNet [shufflenet], or introduce a more compact convolution operation, such as MobileNet [mobilenet] and MobileNetV2 [mobilenetv2]. These studies do well in reducing computation cost while maintaining or even improving performance. However, these methods can be slower than a plain network on some computing platforms. Some studies focus on reducing model size after training, such as weight pruning [lecun1990, li2016] and weight quantization [hubara2016, rastegari2016]. These studies construct new models at test time and re-train or fine-tune them to achieve performance close to that of the original models.
Other studies focus on altering the evaluation manner. FractalNets [fractalnet] perform prediction at any time by progressively evaluating subnetworks of the full network. Bolukbasi et al. [bolukbasi2017] address this problem by adaptively evaluating neural networks. Different from these works, MSDNet [msdnet] adopts a specially designed network with multiple classifiers, which can directly output confidence scores to control the evaluation process for each test example. The adaptive computation time method [graves2016] and its extension [figurnov2017] also perform an adaptive evaluation of test examples but focus on skipping units rather than layers. Feedback Networks [zamir2017] heavily share parameters and allow early predictions in a recurrent process. However, these methods are less efficient in sharing computation. Our method is most inspired by MSDNet. Different from MSDNet, our proposed method focuses on the difficulty diversity of the image itself. We also explore the frequency conflict that occurs in a single model and therefore propose an original rolling strategy to handle the conflict.
Problem. We investigate the failure cases which lead to poor performance in image SR. Given a 7-layer CNN, which has 6 convolutional layers with size of and a convolutional layer with size of , we train it for 10 epochs on General-100 [fsrcnn] to super-resolve images with factor 3 and test it on BSDS100 [bsd]. In Figure (b) and (a), we visualize its successful and failure examples. We define the examples that achieve more than 1 dB improvement over Bicubic as successful cases. The failure cases are the smaller images that obtain less than 1 dB improvement. In Figure (b), we can observe that the examples which achieve minor improvement are mild or inherently blurry. In contrast, the successful cases have rich texture and sharp gradients. Moreover, we also extract features from the middle of VDSR [vdsr] and present them in Fig. (a) and (b). It can be observed that the responses around high-frequency regions are strong, while VDSR gives a low response to mild regions.
According to our observation, we make three assumptions as follows. 1) Examples with rich texture can bring enormous gain, whereas mild examples are unable to demonstrate a similar improvement. 2) Since a deep neural network gives a low response to mild regions and mild examples are very simple, a very deep neural network, which is widely used in image SR, contributes little further improvement on mild samples. 3) Examples with severe, moderate and mild texture can be easily distinguished by their gradient information. To address the aforementioned problems, we propose an end-to-end framework that jointly learns the image SR task with gradient prior knowledge.
Overview of PRN. The proposed PRN aims at learning a framework which can super-resolve images more efficiently. More specifically, the proposed framework first labels patches according to the gradient prior. Thus, we can fetch different patches from different feature levels. Since the bottom convolutional stage has a tiny receptive field and mild patches differ from severe samples in terms of frequency, we then relieve these problems by adopting a novel strategy to roll convolutional filters. Next, we first describe the definition of the gradient prior and then present the setting of the proposed framework.
III-A Gradient Prior
The proposed gradient prior is based on the observation that the failure samples in image SR usually have a uniform gradient without sharp edges. As samples with a uniform gradient contain little pattern information and the upper bound for restoration is also quite low, a simple and fast convolutional neural network can handle them well. We show the vertical gradient distribution of 10,000 successful and failure samples in Figure (a) and (b), respectively. Mild samples are distributed more densely at lower vertical gradients, while the distribution of severe images mainly lies at large values. With this gradient property, severe and mild samples can be distinguished. For an image, we describe the gradient property as follows:
where is the input image, counts the gradient along the vertical axis. are the gradient prior knowledge, which also serves as a tag in our model. With the , PRN is able to separate a set of images into mild, moderate and severe patches.
The and denote the upper and lower gradient thresholds of for separating the images. Moreover, we conduct an ablation study on the gradient thresholds in Section IV-B. Although is proposed based on the assumption that a mild-texture image is too simple to bring enormous gain, we show that this prior can also be applied to accelerate image SR.
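The tagging step can be sketched as follows. This is a minimal illustration rather than the exact definition used above: it assumes the gradient prior is the mean absolute difference between vertically adjacent pixels, and the function names `vertical_gradient_score` and `tag_patch` as well as the threshold values 2.0 and 20.0 are hypothetical.

```python
def vertical_gradient_score(patch):
    """Mean absolute difference between vertically adjacent pixels.

    `patch` is a list of rows of luminance values. The choice of this
    statistic is an assumption made for illustration only.
    """
    rows, cols = len(patch), len(patch[0])
    if rows < 2:
        return 0.0
    total = 0.0
    for r in range(1, rows):
        for c in range(cols):
            total += abs(patch[r][c] - patch[r - 1][c])
    return total / ((rows - 1) * cols)


def tag_patch(patch, lower, upper):
    """Tag a patch as 'mild', 'moderate' or 'severe' (requires lower < upper)."""
    score = vertical_gradient_score(patch)
    if score < lower:
        return "mild"
    if score < upper:
        return "moderate"
    return "severe"


# Toy patches: an almost flat region and a strong horizontal edge pattern.
flat = [[10, 10, 10], [10, 10, 10], [11, 11, 11]]
edgy = [[0, 0, 0], [255, 255, 255], [0, 0, 0]]
print(tag_patch(flat, lower=2.0, upper=20.0))
print(tag_patch(edgy, lower=2.0, upper=20.0))
```

Because the score is a single pass over the patch, the tagging cost is negligible compared with even one convolutional layer.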
III-B Network architecture
As illustrated in Fig. 5, we put the patches with tag into the network for enhancement. To give the network a spacious receptive field, we use 64 convolutional kernels with a size of 5 × 5. To make full use of cuDNN [cudnn], we employ 4 convolutional layers with 3 × 3 kernels and 64 channels. Before the deconvolution operation, we apply a shrinking layer with 64 kernels of size 1 × 1 to reduce parameters. Meanwhile, we add Leaky ReLU [leakyrelu] as the activation function after each convolutional layer. Because the efficient 3 × 3 kernels work directly on small feature maps, the proposed model is significantly faster. Finally, we use a deconvolution layer, whose stride equals the down-sampling factor, to perform the up-sampling operation.
To obtain higher efficiency, we incorporate the auxiliary tag into our model in an end-to-end manner. In other words, a patch can be fetched from different feature levels according to its tag. Since mild patches are smooth and have few edges and little texture, we obtain their features from the first convolutional layer. Then, we use a deconvolution layer to obtain the restored patches. For moderate samples, we fetch features from the third convolutional layer, as they have some texture and edges. A similar up-sampling operation is also applied to the moderate samples for interpolation. Because severe patches have rich texture and details, we apply the deconvolution layer to them after they pass through all convolutional layers. The parameters of the network are optimized by loss. Meanwhile, parameters at different levels are learned with different training pairs. For instance, the early-stage layer is trained not only with mild image pairs but also with severe pairs. In contrast, the high-level parameters are optimized with severe pairs only. As shown in Fig. 5, we adopt such an efficient strategy to perform image SR.
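To make the saving from early exits concrete, the following sketch counts multiply-accumulate operations (MACs) per low-resolution pixel for the convolutional trunk at each exit. It is an estimate under stated assumptions: the input is a single Y channel, the trunk is one 5 × 5 layer followed by four 3 × 3 layers with 64 channels, and the shrinking and deconvolution layers are left out because their exact attachment to each exit is not specified here. The names `TRUNK`, `EXIT_DEPTH`, `macs_per_pixel` and `average_macs`, and the tag fractions in the usage lines, are hypothetical.

```python
# (in_channels, out_channels, kernel_size) for the five trunk layers.
TRUNK = [(1, 64, 5)] + [(64, 64, 3)] * 4

# Number of trunk layers a patch passes through before it exits.
EXIT_DEPTH = {"mild": 1, "moderate": 3, "severe": 5}


def macs_per_pixel(tag):
    """MACs per low-resolution pixel spent in the trunk for a given tag."""
    depth = EXIT_DEPTH[tag]
    return sum(cin * cout * k * k for cin, cout, k in TRUNK[:depth])


def average_macs(tag_fractions):
    """Expected trunk MACs per pixel for a mix of tags (fractions sum to 1)."""
    return sum(frac * macs_per_pixel(tag) for tag, frac in tag_fractions.items())


for tag in ("mild", "moderate", "severe"):
    print(tag, macs_per_pixel(tag))

# Hypothetical mix: half of the patches are mild.
mix = {"mild": 0.5, "moderate": 0.25, "severe": 0.25}
print("average", average_macs(mix), "vs full", macs_per_pixel("severe"))
```

Under these assumptions a mild patch costs roughly one hundredth of the trunk computation of a severe patch, which is why the share of mild patches in a dataset largely determines the achievable speed-up.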
III-C Rolling the convolutional filters
Although we can effectively enhance the image with the aforementioned model, we still find the following problems: 1) The receptive field of the early stage is tiny. When we try to improve the performance, the tiny receptive field of the early stage becomes a bottleneck. 2) Frequency conflicts. The frequency content of mild and moderate examples is significantly different from that of severe patches. When we train mild samples in the early stage of the network, the output of the high-level stages is influenced. Thus, we resolve the above problems by developing a novel rolling strategy.
Let be parameters of a CNN, which consists of four parts of parameters . Here, represents the early stage and consists of one convolutional layer. denotes the middle stage and has two layers. indicates the last stage and contains two convolutional layers, and is the parameters of a deconvolutional layer. In addition, we define two auxiliary sets of parameters and . More specifically, represents a dilated convolution layer [dilation] with size of 64 × 5 × 5 and 1 dilation, and denotes two dilated convolution layers with size of 64 × 3 × 3 and 1 dilation as well. We use and to represent the high-res and low-res patch, respectively. Three superscripts , and are utilized to distinguish the mild, moderate and severe samples, respectively. For instance, and indicate the low-res patches annotated with the mild and severe tags. In sum, the enhancement of a severe patch can be defined as:
where means the enhancement operation. As sketched in Fig. 6, the network will roll and with and and fetch the patch from a different stage according to its tag. More specifically, suppose we input into the network; the model will then enhance the patch with explicitly. By analogy, the enhancement process for and can be formulated as
With such a flexible and content-adaptive rolling strategy, we not only resolve frequency conflicts but also increase the receptive field of the early stage.
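The rolling idea can be sketched as a dispatch over parameter groups. This is one plausible reading of the description, not a definitive implementation: it assumes severe patches use the original early, middle and last stages, moderate patches use the auxiliary early and middle stages, mild patches use only the auxiliary early stage, and a single deconvolution stage is shared by all tags. The group names (`early`, `early_aux`, and so on) and the functions `stage_sequence` and `enhance` are hypothetical, and the stages are replaced by toy tracers instead of real convolutions.

```python
def stage_sequence(tag):
    """Names of the parameter groups applied to a patch with the given tag."""
    if tag == "severe":
        return ["early", "middle", "last", "deconv"]
    if tag == "moderate":
        # Assumption: the auxiliary (dilated) filters replace the originals.
        return ["early_aux", "middle_aux", "deconv"]
    if tag == "mild":
        return ["early_aux", "deconv"]
    raise ValueError("unknown tag: %r" % (tag,))


def enhance(patch, tag, stages):
    """Run `patch` through the stages selected for `tag`."""
    out = patch
    for name in stage_sequence(tag):
        out = stages[name](out)
    return out


def make_tracer(name):
    """Toy stand-in for a stage: it only records that it was applied."""
    return lambda trace: trace + [name]


STAGE_NAMES = ["early", "middle", "last", "early_aux", "middle_aux", "deconv"]
stages = {name: make_tracer(name) for name in STAGE_NAMES}

for tag in ("mild", "moderate", "severe"):
    print(tag, enhance([], tag, stages))
```

Keeping the selection in one small function means the auxiliary filters add parameters but no extra computation for any individual patch, since each patch still passes through only one filter set per stage.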
III-D End-to-end framework
In contrast to training models with different datasets, the proposed model is able not only to fetch images from different stages but also to optimize each stage with a specific prior. The whole procedure can be formulated as an end-to-end framework to accelerate speed. We sketch the detailed algorithm in Algorithm 1. Since a down-scaled mild sample is similar to its ground truth, the model is unable to learn how to recover realistic details and textures from mild training examples. To resolve this problem, we adopt the mild and moderate samples as the training pairs for and . With such an efficient strategy, our model not only greatly improves the performance but also accelerates training and testing.
Datasets. To make full use of the parameters in PRN, we use VOC2012 [voc2012] to pre-train our model. VOC2012 [voc2012] contains 17,125 clear images, which are taken from natural scenes. Then, we fine-tune our model with BSD200 [bsd], which contains 200 images and is close to real-world scenes. BSD200 [bsd]
Implementation Details. We use Xavier [xavier] to initialize the parameters of the proposed model. Besides, the deconvolution layer is initialized according to the weights of Bicubic interpolation. We pad each convolutional layer with zeros to ensure that the input tensor has the same size as the output. We convert all images from RGB to YCbCr and extract the Y channel for training. The training and testing images are cropped into 54 × 54 patches and down-scaled with the corresponding factor to obtain the input. For 54 × 54 patch, the and are set as and , respectively. In training, we set the batch size as 64 and learning rate is for all layers. In testing, we set the batch size as 1. The learning rate is reduced by a factor of 10 every 300 epochs. We use leaky ReLU with a negative slope of 0.2 as the activation function. We perform our training and testing on a desktop computer with an i7-4790 CPU, a GTX980Ti GPU, and 32GB RAM.
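Two of the settings above are simple enough to write down directly: the step-wise learning rate decay and the zero padding that keeps feature maps at the input size. The sketch below is illustrative only; the function names are hypothetical and the base learning rate passed in the usage lines is a placeholder, since the actual value is not given here.

```python
def step_learning_rate(base_lr, epoch, drop_every=300, factor=10.0):
    """Learning rate after dividing by `factor` once every `drop_every` epochs."""
    return base_lr / (factor ** (epoch // drop_every))


def same_padding(kernel_size):
    """Zero padding per side that preserves size for an odd kernel at stride 1."""
    return (kernel_size - 1) // 2


def conv_output_size(input_size, kernel_size, padding, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# Placeholder base rate of 1.0, used only to show the decay pattern.
for epoch in (0, 299, 300, 600):
    print(epoch, step_learning_rate(1.0, epoch))

# A 54 x 54 patch down-scaled by factor 3 gives an 18 x 18 input.
for k in (5, 3, 1):
    print(k, same_padding(k), conv_output_size(18, k, same_padding(k)))
```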
Multi-scale training. Different from some state-of-the-art methods [fsrcnn, subpixel, srcnn], which train their models with a single factor, we adopt a multi-scale learning strategy to train PRN. Specifically, multi-scale learning trains the model with multiple down-sampling factors simultaneously. With multi-scale learning, PRN can learn more contextual knowledge across different degradations and achieves better performance.
IV-A Comparison with State-of-the-arts
We compare our model with state-of-the-art methods, including A+ [aplus], SRF [srf], SelfEx [selfsr], RFL [rfl], SCN [scn], SRCNN [srcnn], LapSRN [laplacian], VDSR [vdsr], DRCN [drcn], and FSRCNN [fsrcnn]. We adopt widely used quality metrics, e.g., PSNR and SSIM, to evaluate our model. For DRCN [drcn], we use our own implementation for comparison. For the remaining methods, we use their public code and models to obtain results.
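For reference, PSNR follows the standard definition 10 log10(peak² / MSE). The sketch below is a minimal version for two equally sized single-channel images stored as lists of rows; the function name `psnr` and the default peak of 255 (8-bit images) are assumptions, and details such as border cropping used in the reported numbers are not reproduced.

```python
import math


def psnr(reference, restored, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    squared_error = 0.0
    count = 0
    for ref_row, res_row in zip(reference, restored):
        for ref_px, res_px in zip(ref_row, res_row):
            squared_error += (ref_px - res_px) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)


reference = [[100, 120], [140, 160]]
off_by_one = [[101, 121], [141, 161]]
print(psnr(reference, off_by_one))
```

Since PSNR is logarithmic in the mean squared error, the dB gains reported below correspond to multiplicative reductions in error rather than additive ones.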
As shown in Table I, our model achieves superior performance among light-weight methods [aplus, srf, selfsr, rfl, scn, srcnn, fsrcnn]. Compared with FSRCNN, our model achieves 0.13 dB and 0.19 dB improvement on BSDS100 with factor 2 and 3. Similarly, our model obtains a 0.69 dB gain over FSRCNN on Manga109 with factor 4. Limited by its parameter budget, our model is weaker than heavy networks [drcn, vdsr, laplacian]. As sketched in Figure 1, our model shows slightly lower performance than the huge models [drcn, vdsr, laplacian], but it is several times faster. Therefore, the model is particularly competitive for mobile devices and applications.
IV-B Ablation study
In this section, we mainly investigate different settings of the proposed model and provide insights into the choice of hyper-parameters.
Gradient threshold. We first analyze the setting of gradient threshold and by investigating a wide range of potential values. In Table II, we list all thresholds we have compared. Different gradient thresholds may influence both efficiency and effectiveness. In Fig. 7, we show the performance and efficiency of each setting. With the increment of , our model handles more moderate samples at an early stage, which accelerates speed but brings a significant performance drop. A similar situation also occurs when we increase the value of . As the growth of , the proposed model exhibits promising efficiency with degradation of performance. Since the middle or first stage is unable to deal with severe samples well, we think too low and may bring an obvious performance drop. However, as illustrated in Fig. 7, the model becomes slower with a decrease of . To achieve a balance between efficiency and performance, we adopt ‘Our_YU2L2’ as the default gradient threshold.
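The trade-off discussed here comes from how the two thresholds split patches across the exits. The following sketch, using made-up gradient scores and hypothetical threshold values, shows how raising the lower threshold moves patches from the middle exit to the first one; the function name `tag_fractions` is also hypothetical.

```python
def tag_fractions(scores, lower, upper):
    """Share of patches tagged mild, moderate and severe for given thresholds."""
    counts = {"mild": 0, "moderate": 0, "severe": 0}
    for score in scores:
        if score < lower:
            counts["mild"] += 1
        elif score < upper:
            counts["moderate"] += 1
        else:
            counts["severe"] += 1
    total = len(scores)
    return {tag: count / total for tag, count in counts.items()}


# Made-up gradient scores for eight patches.
scores = [1, 3, 5, 8, 12, 20, 30, 45]

print(tag_fractions(scores, lower=4, upper=15))
print(tag_fractions(scores, lower=10, upper=15))
```

In this toy example, moving the lower threshold from 4 to 10 doubles the share of patches that leave at the first exit, which is faster but asks the shallowest stage to restore patches with more texture.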
Depth of different stage. In this component, we compare the depth setting of each stage. In other words, we adjust the depth of and to verify our settings. In Table III, we use different depth settings in the early and middle stages for comparison. As shown in Table III, with the increase of , the model shows a slight PSNR improvement with lower efficiency. Since the early stage is adapted to handle mild patches only, we think that adding too many parameters there is meaningless for further improvement. In contrast, the middle stage is utilized to deal with moderate samples, which carry some texture and details. Therefore, the performance becomes worse when we reduce the parameters of . Thus, we use a light-weight setting at the early stage and increase the parameters of the middle stage to obtain an efficient framework.
Rolling strategy. In order to show the effectiveness of the proposed rolling strategy, we investigate models with and without the rolling strategy. In Table IV, ’w/o Rolling’ means the model without the rolling strategy and ’w/ Rolling’ indicates the model with the rolling component. Compared with ’w/o Rolling’, the model with the rolling strategy achieves a 0.09 dB improvement. Although our model has auxiliary parameters, we can use them content-adaptively to ensure efficiency. Thus, our model achieves superior performance and maintains competitive efficiency by adopting the rolling strategy.
Although our model achieves a good balance between effectiveness and efficiency, it still has some limitations. To advance efficiency, we need to crop the image into smaller images and reconstruct them at the end. Thus, our model needs additional time to accomplish the reconstruction procedure. The reconstruction cost is far less than the computational cost of the model, and we have counted the reconstruction time in the time complexity in our efficiency analysis. Besides, the acceleration is influenced by the dataset. For instance, our model can accelerate greatly on BSDS100 or DIV2K, as the images in BSDS100 or DIV2K have plenty of blank and mild regions. Similar acceleration cannot occur on General-100, as the images in General-100 are full of texture and edges. However, we think the majority of natural images, which are close to BSD500 and DIV2K, contain a certain percentage of blank or mild regions. Therefore, our model can achieve similar acceleration in real-world scenarios.
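The crop-and-reconstruct step mentioned above can be sketched as follows. This is an illustration under assumptions: tiles do not overlap, edge tiles may be smaller than the nominal size, and nearest-neighbour up-sampling stands in for the network so that the example stays self-contained. The function names `split_into_patches`, `upscale_nearest` and `reassemble` are hypothetical.

```python
def split_into_patches(image, size):
    """Cut an image (list of rows) into non-overlapping (top, left, patch) tiles."""
    tiles = []
    for top in range(0, len(image), size):
        for left in range(0, len(image[0]), size):
            patch = [row[left:left + size] for row in image[top:top + size]]
            tiles.append((top, left, patch))
    return tiles


def upscale_nearest(patch, scale):
    """Stand-in for the SR network: repeat every pixel `scale` times per axis."""
    out = []
    for row in patch:
        wide = [px for px in row for _ in range(scale)]
        out.extend([list(wide) for _ in range(scale)])
    return out


def reassemble(tiles, rows, cols, scale):
    """Paste up-scaled tiles back at their scaled offsets."""
    canvas = [[0] * (cols * scale) for _ in range(rows * scale)]
    for top, left, patch in tiles:
        for r, row in enumerate(patch):
            for c, value in enumerate(row):
                canvas[top * scale + r][left * scale + c] = value
    return canvas


# 5 x 7 toy image with distinct pixel values, tiled with 3 x 3 patches.
image = [[r * 7 + c for c in range(7)] for r in range(5)]
tiles = [(t, l, upscale_nearest(p, 2)) for t, l, p in split_into_patches(image, 3)]
result = reassemble(tiles, rows=5, cols=7, scale=2)
print(len(result), len(result[0]))
```

Reassembly only copies pixels, so its cost grows linearly with the output size, which is consistent with the observation that it is small relative to the network computation.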
V Conclusion and further work
In this article, to address the efficiency problem in image SR, we have proposed an end-to-end gradient-aware rolling network. Our model incorporates a gradient prior of the image itself and content-adaptively utilizes each stage of the deep neural network to super-resolve corrupted images. Moreover, we have proposed a rolling strategy, which super-resolves images with different sets of filters, to resolve the frequency conflict problem. Experiments have shown that our framework not only obtains competitive performance but also achieves appealing efficiency.
There are several directions in which to extend our work. First, we can introduce an adversarial loss or a perceptual loss in each stage, aiming to restore more realistic details and texture. Second, considering that existing frameworks have to crop the image into patches, we intend to propose a more general framework, which can content-adaptively process different regions with different convolution strides, to boost efficiency.