Difficulty-aware Image Super Resolution via Deep Adaptive Dual-Network

04/11/2019 · Jinghui Qin et al. · Sun Yat-sen University

Recently, deep learning based single image super-resolution(SR) approaches have achieved great development. The state-of-the-art SR methods usually adopt a feed-forward pipeline to establish a non-linear mapping between low-res(LR) and high-res(HR) images. However, due to treating all image regions equally without considering the difficulty diversity, these approaches meet an upper bound for optimization. To address this issue, we propose a novel SR approach that discriminately processes each image region within an image by its difficulty. Specifically, we propose a dual-way SR network that one way is trained to focus on easy image regions and another is trained to handle hard image regions. To identify whether a region is easy or hard, we propose a novel image difficulty recognition network based on PSNR prior. Our SR approach that uses the region mask to adaptively enforce the dual-way SR network yields superior results. Extensive experiments on several standard benchmarks (e.g., Set5, Set14, BSD100, and Urban100) show that our approach achieves state-of-the-art performance.


1 Introduction

Single image super-resolution (SISR) [1] has gained great research attention for decades, because it is used in various computer vision applications, such as face hallucination [2], object detection [3], and video compression [4]. As a typical ill-posed problem, SISR aims to generate a visually clear high-resolution (HR) image from a single corresponding low-resolution (LR) image.

Recently, deep learning based image enhancement methods [5, 6, 7, 8, 9, 10, 11, 12] have achieved significant improvements in restoration quality over conventional SR methods. Among these methods, Dong et al. [5] proposed SRCNN, a three-layer CNN, making the first attempt to learn a nonlinear LR-to-HR mapping for image SR. To accelerate the training and testing of image SR, FSRCNN [6] extracts features from the LR input and upscales the spatial resolution at the tail of the network. Lim et al. [7] built a large SR model called EDSR using simplified residual blocks and achieved a great improvement in restoration quality. LapSRN [13], based on a cascaded CNN framework, takes an LR image as input and progressively reconstructs SR images at different scales. Zhang et al. [14] proposed the Residual Dense Network (RDN), built on residual dense blocks (RDB), to fully exploit the hierarchical features from all convolutional layers. Although each image region has a different difficulty, the above methods process all regions equally, which limits the representational ability of CNNs in the SR task. To address this problem, RCAN [10] proposed a residual-in-residual (RIR) structure to ease the training of deep SR networks, together with a channel attention mechanism that improves representation ability by discriminately treating the abundant low-frequency information across channels. Although this approach can discriminately process an image across channels, it still fails to address the difficulty diversity in the spatial domain, which has great potential for high-quality image SR.

Figure 1: Qualitative comparison of EDSR and SRCNN. EDSR reconstructs the complex/hard region more clearly than SRCNN, but it reconstructs the smooth/plain region worse than SRCNN. This reveals that it is suboptimal to use a single model to process all regions.

Figure 2: Overview of our dual-way SR framework. Our framework consists of three key components: a difficulty identifier module (DIM), a mask generator, and a dual-way SR network. The dual-way SR network consists of two branches: a complex branch (CB) and a plain branch (PB). The CB is used to restore complex/hard patches, while the PB is adopted to reconstruct plain/easy patches. The DIM steers the dual-way SR network toward superior results by routing regions of different difficulty to different branches.

Although the above deep learning-based SR methods bring significant improvements to SISR, they use a unified model to process all regions of an image without considering the difficulty diversity at the region level. Generally, an image consists of some complex regions and some smooth/plain regions, and the difficulty of reconstructing them at high resolution is not equal. As shown in Fig. 1, EDSR [7] produces superior results on complex/hard regions compared with SRCNN [5], but it demonstrates poorer restoration on smooth/plain regions. This shows that reconstruction difficulty varies across the regions of an image, so it is suboptimal to use a single CNN to process all regions within an image: while a heavy model may reconstruct complex texture regions more accurately than a simple one, the simple model still shows better restoration quality in some regions. To address this problem, we propose a novel difficulty-aware region-based SR approach that uses a dual-way SR network to realize a difficulty-adaptive SR process. In our dual-way SR network, one way is trained to better handle easy image regions and the other is trained to better handle hard image regions. To identify the difficulty of image regions, we propose a novel image difficulty recognition network based on a PSNR prior that we observed in the SR task. Our SR approach uses the region mask produced by our difficulty recognition method to adaptively steer the dual-way SR model for accurate image SR.

The main contributions of this paper are summarized as follows. First, we propose an image difficulty recognition network that fully exploits the PSNR prior to deliver a precise difficulty categorization. Second, we propose a novel difficulty-aware SR approach that can discriminately treat each region of an image for accurate SR; with the difficulty recognition network, our dual-way SR network achieves high-quality restoration by alternately utilizing different branches. Third, extensive experiments demonstrate that our approach achieves state-of-the-art performance on several standard benchmarks.

(a) hard image patches (b) easy image patches
Figure 3: Examples of hard and easy image patches. We cropped images from DIV2K into patches of size 48×48, interpolated the LR patches with Bicubic, and computed PSNR values between the interpolated patches and their corresponding HR patches. It can be observed that hard image patches usually have low PSNR scores while easy image patches tend to have high PSNR values.

2 Methodology

2.1 Framework Overview

To treat image regions discriminately based on their difficulty, we propose a novel multi-branch SR framework that can be trained to perform accurate super-resolution. As illustrated in Fig. 2, our proposed SR framework consists of three key components: 1) a difficulty identifier module (DIM); 2) a mask generator; and 3) a dual-way SR network. The DIM identifies the super-resolution difficulty of image regions/patches; we detail the difficulty identifier and the mask generator in Sec. 2.2 and Sec. 2.3, respectively. Our dual-way SR network consists of two independent SR models, a complex SR branch and a plain SR branch, denoted as CB and PB respectively. In our framework, CB is trained to restore hard patches while PB is dedicated to reconstructing easy patches. Unlike other SR methods that run inference on the full-size image, our framework follows a novel SR procedure. First, we divide an LR full image into patches of size 48×48. Then we feed the patches into the DIM, which generates a difficulty probability vector for each patch, where each item in the vector represents the probability that the patch belongs to the corresponding difficulty level. CB and PB reconstruct HR patches with a feed-forward pipeline. Finally, our framework uses the masks generated by the mask generator to adaptively choose HR patches and assembles them into an HR full image, as sketched below.
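To make the procedure concrete, the following PyTorch sketch illustrates the patch-wise inference described above. It is a minimal illustration rather than the paper's released code: `dim_net` and `cb_net` are assumed pre-trained modules, the image sides are assumed divisible by the patch size, and each patch is routed to a single branch, which produces the same output as masking both branches' reconstructions while avoiding redundant computation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def super_resolve(lr, dim_net, cb_net, scale=2, ps=48, hard_threshold=3):
    """lr: (C, H, W) LR image in [0, 1]; H and W assumed divisible by ps."""
    c, h, w = lr.shape
    out = lr.new_zeros(c, h * scale, w * scale)
    for y in range(0, h, ps):
        for x in range(0, w, ps):
            patch = lr[:, y:y + ps, x:x + ps].unsqueeze(0)  # (1, C, ps, ps)
            level = dim_net(patch).argmax(dim=1).item()     # predicted difficulty
            if level >= hard_threshold:                     # hard -> complex branch
                sr = cb_net(patch)
            else:                                           # easy -> plain branch (Bicubic)
                sr = F.interpolate(patch, scale_factor=scale,
                                   mode='bicubic', align_corners=False)
            out[:, y * scale:(y + ps) * scale,
                x * scale:(x + ps) * scale] = sr.squeeze(0).clamp(0, 1)
    return out
```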

2.2 Difficulty Identifier Module

The difficulty identifier module (DIM) is the key component of our framework for reconstructing images: the performance of the dual-way SR network relies on the accuracy of the DIM. As shown in Figure 3, where we visualize some results of Bicubic interpolation, hard patches tend to exhibit lower PSNR values while simple/plain patches show higher PSNR. Based on this observation, we use the Bicubic PSNR score of a patch as its SR difficulty indicator. However, PSNR is a full-reference assessment metric, so it cannot be computed at test time, where no ground-truth HR is available. To exploit the PSNR prior nonetheless, we model SR difficulty identification as a classification problem and train a difficulty identifier using LeNet [15] as the backbone of the DIM. Specifically, we adopt the Bicubic PSNR value as the training target of the DIM. First, we crop LR and HR pairs from our training dataset and reconstruct the upscaled patches with Bicubic interpolation. Then, we compute the Bicubic PSNR values of all samples and categorize the values into 5 classes, where each class represents a difficulty level. We translate the difficulty level of a patch into a one-hot vector according to its Bicubic PSNR: let $y$ be the one-hot ground-truth label vector, $x$ be an input patch, and $\mathcal{Y}$ be the set of 5 possible difficulty labels. We then use a network parameterized by weights $W$ as our difficulty identifier to learn a function mapping a patch to its difficulty level. Our goal is therefore to find weights $W^*$ that minimize the following softmax cross-entropy loss:

$\mathcal{L}(W) = -\sum_{c \in \mathcal{Y}} y_c \log p_c \qquad (1)$

where

$p_c = \dfrac{e^{z_c}}{\sum_{j \in \mathcal{Y}} e^{z_j}} \qquad (2)$

and $z$ is the non-transformed logit output of our difficulty identifier.

Let $\{d_1, d_2, d_3, d_4, d_5\}$ denote the 5 difficulty levels, ordered by index: the greater the index, the harder the level. The output of our difficulty identifier for each patch is a probability vector $p = \{p_1, p_2, p_3, p_4, p_5\}$, where $p_i$ represents the probability that the patch falls in difficulty level $d_i$, $i \in \{1, \ldots, 5\}$. A sketch of this labeling procedure follows.
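Below is a minimal sketch of how such PSNR-prior labels could be constructed. The `psnr` helper and the binning into 5 levels follow the description above, but the `bins` boundary values are illustrative placeholders, since the paper does not state its class boundaries.

```python
import numpy as np
import torch
import torch.nn.functional as F

def psnr(sr: torch.Tensor, hr: torch.Tensor, max_val: float = 1.0) -> float:
    """PSNR (dB) between two image tensors with values in [0, max_val]."""
    mse = F.mse_loss(sr, hr).item()
    return 10.0 * float(np.log10(max_val ** 2 / max(mse, 1e-12)))

def difficulty_label(lr_patch, hr_patch, scale, bins=(25.0, 30.0, 35.0, 45.0)):
    """Assign one of 5 difficulty levels (0 = easiest, 4 = hardest) to a patch
    pair via the PSNR of its Bicubic reconstruction. `bins` are placeholder
    thresholds, not the paper's actual boundaries."""
    up = F.interpolate(lr_patch.unsqueeze(0), scale_factor=scale,
                       mode='bicubic', align_corners=False).squeeze(0)
    p = psnr(up.clamp(0, 1), hr_patch)
    return 4 - int(np.digitize(p, bins))  # low PSNR -> high difficulty level
```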

2.3 Mask Generator

Our mask generator produces masks from the probability vector output by the difficulty identifier, helping our framework adaptively steer the dual-way SR network toward superior results. Let $p = \{p_1, p_2, p_3, p_4, p_5\}$ be the probability output for an LR patch after it is passed through the difficulty identifier, where $p_i$ is the probability that the patch falls in difficulty level $d_i$, $i \in \{1, \ldots, 5\}$. If the maximum value of $p$ corresponds to a hard difficulty level, the mask generator produces an all-one mask with the size of the corresponding reconstructed patch; otherwise, it produces an all-zero mask. Our mask generator can therefore be modeled as:

$M = \begin{cases} \mathbf{1}, & \text{if } \arg\max_i p_i \text{ corresponds to a hard difficulty level} \\ \mathbf{0}, & \text{otherwise} \end{cases} \qquad (3)$

With the help of the mask generator, we can adaptively combine the dual-way SR outputs as follows:

$I^{SR} = M \odot I^{SR}_{CB} + (\mathbf{1} - M) \odot I^{SR}_{PB} \qquad (4)$

where $I^{SR}$ is the final output patch, $I^{SR}_{CB}$ and $I^{SR}_{PB}$ are the patches reconstructed by CB and PB, respectively, and $\odot$ represents element-wise (dot) multiplication.
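The following sketch illustrates Eqs. (3) and (4) for a batch of patches. Treating predicted levels at or above a `hard_threshold` as hard is our assumption; the paper only specifies that the mask selects between the two branches.

```python
import torch

def generate_mask(probs: torch.Tensor, patch_hw: int, hard_threshold: int = 3):
    """Eq. (3): probs is a (B, 5) matrix of DIM difficulty probabilities.
    Returns a (B, 1, H, W) mask that is all ones for patches whose predicted
    level is hard and all zeros otherwise."""
    is_hard = (probs.argmax(dim=1) >= hard_threshold).float()  # (B,)
    return is_hard.view(-1, 1, 1, 1).expand(-1, 1, patch_hw, patch_hw)

def fuse(sr_cb: torch.Tensor, sr_pb: torch.Tensor, mask: torch.Tensor):
    """Eq. (4): SR = M * SR_CB + (1 - M) * SR_PB, broadcast over channels."""
    return mask * sr_cb + (1.0 - mask) * sr_pb
```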

| Dataset | Scale | Bicubic | A+ | SRCNN [5] | FSRCNN [6] | VDSR [16] | LapSRN [13] | MemNet [17] | IDN [18] | Ours |
|---|---|---|---|---|---|---|---|---|---|---|
| Set5 | ×2 | 33.66 / 0.9299 | 36.54 / 0.9544 | 36.66 / 0.9542 | 37.00 / 0.9558 | 37.53 / 0.9587 | 37.52 / 0.9591 | 37.83 / 0.9600 | 37.78 / 0.9597 | 37.87 / 0.9600 |
| | ×3 | 30.39 / 0.8682 | 32.58 / 0.9088 | 32.75 / 0.9090 | 33.16 / 0.9140 | 33.66 / 0.9213 | 33.81 / 0.9220 | 34.11 / 0.9253 | 34.09 / 0.9248 | 34.17 / 0.9252 |
| | ×4 | 28.42 / 0.8104 | 30.28 / 0.8603 | 30.48 / 0.8628 | 30.71 / 0.8657 | 31.35 / 0.8838 | 31.54 / 0.8852 | 31.82 / 0.8903 | 31.74 / 0.8893 | 31.81 / 0.8890 |
| Set14 | ×2 | 30.24 / 0.8688 | 32.28 / 0.9056 | 32.42 / 0.9063 | 32.63 / 0.9088 | 33.03 / 0.9124 | 32.99 / 0.9124 | 33.30 / 0.9148 | 33.28 / 0.9142 | 33.39 / 0.9161 |
| | ×3 | 27.55 / 0.7742 | 29.13 / 0.8188 | 29.28 / 0.8209 | 29.43 / 0.8242 | 29.77 / 0.8314 | 29.79 / 0.8325 | 29.99 / 0.8354 | 30.00 / 0.8350 | 30.02 / 0.8370 |
| | ×4 | 26.00 / 0.7027 | 27.32 / 0.7491 | 27.49 / 0.7503 | 27.59 / 0.7535 | 28.01 / 0.7674 | 28.09 / 0.7700 | 28.25 / 0.7730 | 28.26 / 0.7723 | 28.29 / 0.7750 |
| BSD100 | ×2 | 29.56 / 0.8431 | 31.21 / 0.8863 | 31.36 / 0.8879 | - | 31.90 / 0.8960 | 31.80 / 0.8952 | 32.08 / 0.8985 | 32.08 / 0.8978 | 32.11 / 0.8988 |
| | ×3 | 27.21 / 0.7385 | 28.29 / 0.7835 | 28.41 / 0.7863 | - | 28.82 / 0.7976 | 28.82 / 0.7980 | 28.95 / 0.8013 | 28.96 / 0.8001 | 28.98 / 0.8024 |
| | ×4 | 25.96 / 0.6675 | 26.82 / 0.7087 | 26.90 / 0.7101 | - | 27.29 / 0.7251 | 27.32 / 0.7275 | 27.41 / 0.7297 | 27.40 / 0.7281 | 27.43 / 0.7312 |
| Urban100 | ×2 | 26.88 / 0.8403 | 29.20 / 0.8938 | 29.50 / 0.8946 | - | 30.76 / 0.9140 | 30.41 / 0.9103 | 31.27 / 0.9196 | 31.31 / 0.9195 | 31.77 / 0.9247 |
| | ×3 | 24.46 / 0.7349 | 26.03 / 0.7973 | 26.24 / 0.7989 | - | 27.14 / 0.8279 | 27.07 / 0.8275 | 27.42 / 0.8359 | 27.56 / 0.8376 | 27.78 / 0.8434 |
| | ×4 | 23.14 / 0.6577 | 24.32 / 0.7183 | 24.52 / 0.7221 | - | 25.18 / 0.7524 | 25.21 / 0.7562 | 25.41 / 0.7632 | 25.50 / 0.7630 | 25.71 / 0.7725 |

Table 1: Quantitative comparison on Set5, Set14, BSD100, and Urban100 with down-sampling factors ×2, ×3, and ×4. Each cell reports PSNR (dB) / SSIM.

2.4 Complex Branch and Plain Branch

In our approach, CB is used to recover hard patches while PB reconstructs easy patches. We choose IDN [18] as the backbone of our CB, since IDN is an efficient SR framework with competitive performance. Since Bicubic interpolation demonstrates superior performance in plain areas with high efficiency, we adopt Bicubic interpolation as our PB. Our multi-branch SR framework has several advantages. First, patch-wise SR can take full advantage of the powerful parallel computation of GPUs by reconstructing high-resolution patches in batches. Second, we can combine the SR abilities of different models for more accurate super-resolution; for example, we can use a heavy model such as EDSR as the backbone of CB to reconstruct hard image patches more accurately, while using a lightweight model as the backbone of PB to process easy/plain image patches.

3 Experiments

We first describe the experimental settings: datasets, degradation models, evaluation metrics, and training settings.

3.1 Datasets and Evaluation metrics

Following [19], we use DIV2K as the training set. For testing, we use four standard benchmark datasets, i.e., Set5, Set14, BSD100, and Urban100 [20]. We obtain the LR input with Bicubic downsampling. We evaluate with PSNR and SSIM metrics on the Y channel (i.e., luminance) of the transformed YCbCr space.

3.2 Implementation details

Our framework adopts LeNet-5 [15] as the network backbone of the difficulty identifier, IDN [18] as the network backbone of the CB, and Bicubic interpolation as the PB. The training procedure of our framework is divided into two parts. The first part is the end-to-end training of the DIM, using the dataset labeled with the Bicubic PSNR prior. The second part jointly trains the CB and PB end-to-end with the help of the DIM and the mask generator. All patches are passed through the DIM, CB, and PB to generate the corresponding results; only the reconstructed patches selected by the mask are used to compute the loss and gradients of back-propagation, and the parameters of the corresponding SR branch are updated with those gradients. A sketch of one such training step follows.
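Below is a minimal sketch of one joint training step under this mask routing. The L1 reconstruction loss and a trainable `pb_net` are assumptions; with the paper's Bicubic PB, which has no parameters, only the CB term would carry gradients, which the `requires_grad` guard accounts for.

```python
import torch
import torch.nn.functional as F

def train_step(lr_patches, hr_patches, dim_net, cb_net, pb_net, optimizer,
               hard_threshold=3):
    """lr_patches: (B, C, h, w); hr_patches: (B, C, h*scale, w*scale)."""
    with torch.no_grad():                       # DIM was trained in part one
        levels = dim_net(lr_patches).argmax(dim=1)
    hard = levels >= hard_threshold             # assumed hard/easy routing rule

    loss = lr_patches.new_zeros(())
    if hard.any():                              # complex branch on hard patches
        loss = loss + F.l1_loss(cb_net(lr_patches[hard]), hr_patches[hard])
    if (~hard).any():                           # plain branch on easy patches
        loss = loss + F.l1_loss(pb_net(lr_patches[~hard]), hr_patches[~hard])

    if loss.requires_grad:                      # skip if no trainable branch fired
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return float(loss)
```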

Data augmentation is performed on the DIV2K training set: images are randomly rotated by 90°, 180°, or 270° and flipped horizontally. In each training batch, 64 LR patches of size 48×48 are extracted as inputs. Our model is trained by the ADAM optimizer [21] with $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. The initial learning rate is set to $1 \times 10^{-4}$ and halved every 100 epochs. We implement our models in PyTorch [22] on a Titan Xp GPU.

Figure 4: Qualitative comparisons on the "89000" image from BSD100 and the "ppt3" image from Set14.

3.3 Comparison with state-of-the-art methods

We compare our approach with state-of-the-art SR methods on two commonly used image quality metrics: PSNR and the structural similarity index (SSIM). The compared methods are Bicubic, A+, SRCNN [5], FSRCNN [6], VDSR [16], LapSRN [13], MemNet [17], and IDN [18].

Table 1 shows quantitative comparisons on Set5, Set14, BSD100, and Urban100 with factors 2, 3, and 4. As illustrated in Table 1, our approach surpasses IDN by a clear margin. Specifically, our approach outperforms IDN by 0.11 dB and 0.46 dB under factor 2 on Set14 and Urban100, respectively. Moreover, our full model surpasses IDN by 0.08 dB and 0.22 dB on Set5 and Urban100 under factor 3. As IDN is already an efficient SR framework with strong accuracy, this margin over it among state-of-the-art methods verifies the effectiveness of our model. Compared with the remaining models, our approach outperforms them by a large margin on Urban100 in terms of PSNR, and a similar trend can be observed for SSIM. For instance, our approach achieves 0.50 dB, 0.36 dB, and 0.30 dB improvements over MemNet [17] on Urban100 under factors 2, 3, and 4, respectively.

Figure 4 visualizes some promising examples from BSD100 and Set14. We interpolate the Cb and Cr chrominance channels with the Bicubic method and convert YCbCr back to RGB to generate color images for better viewing. We can observe that our approach restores sharper and clearer images with higher PSNR than the other methods; as shown in Figure 4, our model restores clear structures with fewer artifacts.

3.4 Efficiency

We conduct an efficiency comparison on Urban100 with factor 4 to verify the practicability of our framework, as shown in Table 2. Since the DIM incurs additional computational cost, our full model has more parameters and is slower than IDN. However, our model still demonstrates competitive efficiency among state-of-the-art image SR methods; for instance, compared with LapSRN, our model runs faster with fewer parameters.

| Algorithm | VDSR | LapSRN | IDN | Ours |
|---|---|---|---|---|
| Time (s/frame) | 0.094 | 0.046 | 0.015 | 0.031 |
| Parameters (MB) | 2.824 | 3.327 | 2.597 | 3.226 |

Table 2: Efficiency comparison on Urban100 with factor ×4.

3.5 Ablation study of PB and CB

| Dataset | Scale | PB | CB | PB + CB |
|---|---|---|---|---|
| Urban100 | ×2 | 29.80 | 34.42 | **34.62** |
| | ×3 | 27.59 | 30.49 | **30.95** |
| | ×4 | 25.04 | 27.62 | **27.64** |

Table 3: Investigation of PB and CB on Urban100 (PSNR in dB). Best results are highlighted in boldface. The "PB + CB" combination achieves the best performance.

In this section, we study the effect of each branch in our proposed approach: we disable one branch at a time and compare the resulting differences on different test sets.

Effects on a dataset with diverse difficulty. We first compare the performance of the different SR branches on the Urban100 benchmark. We crop the LR images of Urban100 into patches of size 48×48 and the corresponding HR patches of size (48·scale)×(48·scale). We use all the patches as the input of PB, CB, and our integrated adaptive approach (PB + CB), and compute the PSNR between the reconstructed patches and their HR counterparts. The results are shown in Table 3. We can observe that PB and CB perform differently on the same test set: CB handles a dataset with diverse difficulty better than PB. Although CB alone achieves high PSNR, our dual-way adaptive approach achieves the best performance of the three, which shows that our approach is effective when handling datasets with diverse difficulty.

| Dataset | Scale | PB | CB |
|---|---|---|---|
| Hard patches | ×2 | 28.10 | **33.06** |
| | ×3 | 25.70 | **29.24** |
| | ×4 | 24.19 | **26.92** |
| Easy patches | ×2 | **53.95** | 52.71 |
| | ×3 | **63.66** | 54.25 |
| | ×4 | **54.26** | 51.86 |

Table 4: Study of PB and CB on hard/easy patches in Urban100 (PSNR in dB). Best results are highlighted in boldface.

Effects on datasets with a single difficulty. We further show the effect of PB and CB on datasets with a single difficulty. For simplicity, we crop patch pairs from Urban100, compute their PSNR values, and divide them into two sets (hard and easy) by a PSNR threshold of 45: if the Bicubic PSNR of a patch pair exceeds 45, we assign it to the easy set; otherwise, we assign it to the hard set (a sketch of this split follows). Table 4 shows the performance of PB and CB on the hard/easy patches of Urban100. We can observe that PB and CB perform differently on datasets of different difficulty: PB handles plain/easy patches better than CB, while CB processes hard patches better. This shows that it is hard for a single model to handle all regions of diverse difficulty well at the same time; we should embrace the different SR abilities of different methods to produce more accurate results.
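The split itself can be expressed in a few lines; `patch_pairs` is a hypothetical list of (Bicubic-upscaled, HR) patch tensor pairs with values in [0, 1], and the 45 dB threshold is the one stated above.

```python
import torch
import torch.nn.functional as F

def psnr(a: torch.Tensor, b: torch.Tensor) -> float:
    """PSNR (dB) for tensors with values in [0, 1]."""
    return float(-10.0 * torch.log10(F.mse_loss(a, b)))

# Split patch pairs into hard/easy sets at the 45 dB Bicubic-PSNR threshold.
hard = [(up, hr) for up, hr in patch_pairs if psnr(up, hr) <= 45.0]
easy = [(up, hr) for up, hr in patch_pairs if psnr(up, hr) > 45.0]
```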

4 Conclusion

In this paper, we proposed a novel dual-way adaptive SR approach that can discriminately process each region of an image according to its difficulty. Extensive experiments on several standard benchmarks demonstrate the effectiveness of our approach.

References

  • [1] William T Freeman, Egon C Pasztor, and Owen T Carmichael, “Learning low-level vision,” International journal of computer vision, vol. 40, no. 1, pp. 25–47, 2000.
  • [2] Qingxing Cao, Liang Lin, Yukai Shi, Xiaodan Liang, and Guanbin Li, “Attention-aware face hallucination via deep reinforcement learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 690–698.
  • [3] Guanbin Li, Yukang Gan, Hejun Wu, Nong Xiao, and Liang Lin, “Cross-modal attentional context learning for rgb-d object detection,” IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 1591–1601, 2019.
  • [4] Guo Lu, Wanli Ouyang, Dong Xu, Xiaoyun Zhang, Chunlei Cai, and Zhiyong Gao, “Dvc: An end-to-end deep video compression framework,” arXiv preprint arXiv:1812.00101, 2018.
  • [5] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang, “Learning a deep convolutional network for image super-resolution,” in ECCV, 2014, pp. 184–199.
  • [6] Chao Dong, Chen Change Loy, and Xiaoou Tang, “Accelerating the super-resolution convolutional neural network,” in ECCV, 2016, pp. 391–407.
  • [7] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee, “Enhanced deep residual networks for single image super-resolution,” in Computer Vision and Pattern Recognition Workshops, 2017, pp. 1132–1140.
  • [8] Yukai Shi, Keze Wang, Li Xu, and Liang Lin, “Local-and holistic-structure preserving image super resolution via deep joint component learning,” in 2016 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2016, pp. 1–6.
  • [9] Yukai Shi, Keze Wang, Chongyu Chen, Li Xu, and Liang Lin, “Structure-preserving image super-resolution via contextualized multitask learning,” IEEE transactions on multimedia, vol. 19, no. 12, pp. 2804–2815, 2017.
  • [10] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu, “Image super-resolution using very deep residual channel attention networks,” arXiv preprint arXiv:1807.02758, 2018.
  • [11] Andrey Ignatov, Radu Timofte, Thang Van Vu, Tung Minh Luu, Trung X Pham, Cao Van Nguyen, Yongwoo Kim, Jae-Seok Choi, Munchurl Kim, Jie Huang, et al., “Pirm challenge on perceptual image enhancement on smartphones: report,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • [12] Pengju Liu, Hongzhi Zhang, Kai Zhang, Liang Lin, and Wangmeng Zuo, “Multi-level wavelet-cnn for image restoration,” arXiv preprint arXiv:1805.07071, 2018.
  • [13] Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang, “Deep laplacian pyramid networks for fast and accurate super-resolution,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, vol. 2, p. 5.
  • [14] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu, “Residual dense network for image super-resolution,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [15] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [16] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee, “Accurate image super-resolution using very deep convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [17] Ying Tai, Jian Yang, Xiaoming Liu, and Chunyan Xu, “Memnet: A persistent memory network for image restoration,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4549–4557.
  • [18] Zheng Hui, Xiumei Wang, and Xinbo Gao, “Fast and accurate single image super-resolution via information distillation network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 723–731.
  • [19] Radu Timofte, Eirikur Agustsson, Luc Van Gool, Ming Hsuan Yang, Lei Zhang, Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee, “Ntire 2017 challenge on single image super-resolution: Methods and results,” in Computer Vision and Pattern Recognition Workshops, 2017, pp. 1110–1121.
  • [20] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja, “Single image super-resolution from transformed self-exemplars,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5197–5206.
  • [21] Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
  • [22] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer, “Automatic differentiation in pytorch,” 2017.