Distilling with Residual Network for Single Image Super Resolution

07/05/2019 ∙ by Xiaopeng Sun, et al. ∙ Xidian University

Recently, deep convolutional neural networks (CNNs) have made remarkable progress in single image super-resolution (SISR). However, blindly using residual and dense structures to extract features from LR images can make the network bloated and difficult to train. To address these problems, we propose a simple and efficient distilling with residual network (DRN) for SISR. In detail, we propose a residual distilling block (RDB) containing two branches: one branch performs a residual operation and the other distills effective information. To further improve efficiency, we design a residual distilling group (RDG) by stacking several RDBs with one long skip connection, which can effectively extract local features and fuse them with global features. These efficient features contribute to image reconstruction. Experiments on benchmark datasets demonstrate that our DRN is superior to state-of-the-art methods and, in particular, achieves a better trade-off between performance and model size.



1 Introduction

The task of super-resolution (SR) is to reconstruct a high-resolution (HR) image from a low-resolution (LR) image such that the result is consistent with the LR input. SR has a wide range of applications, such as video surveillance, medical imaging, and target detection. However, SR has to reverse a process of information loss: LR images retain abundant low-frequency information but lack the high-frequency information that is present only in HR images. To address this problem, many learning-based methods have been applied to learn a mapping between pairs of LR and HR images.

Recently, convolutional neural networks (CNNs) have been applied to a large number of visual tasks, including SR, where they achieve better results than traditional methods. Dong et al. [1] first proposed SRCNN, a fully convolutional neural network that learns an end-to-end mapping between LR and HR images and, with only three layers, significantly improves over conventional methods such as A+ [2]. Later, Kim et al. [3] proposed VDSR, which increased the depth to 20 layers and made significant progress over SRCNN. Then, DRCN, proposed by Kim et al. [4], relieved the difficulty of training deep networks by using gradient clipping, skip connections and recursive supervision. Lai et al. [5] proposed LapSRN, which consists of a deep Laplacian pyramid and reconstructs the HR image by progressive upscaling. Based on ResNet [6], Lim et al. [7] designed EDSR, a very deep and wide network. Tai et al. proposed MemNet [8], which consists of memory blocks but increases the computational complexity. Hui et al. proposed IDN [9], which reduces computational complexity through information distillation; however, IDN's distillation operation suffers from information loss. By fully integrating DenseNet and ResNet, Zhang et al. proposed RDN [10], which achieved outstanding results. Later, Zhang et al. introduced an attention mechanism and proposed RCAN [11], which improved results further. However, these two models are complicated and have a huge number of parameters.

(a) Ground-Truth
(b) HR (c) Bicubic (d) LapSRN [5] (e) MemNet [8] (f) IDN [9] (g) SRMDNF [12] (h) MSRN [13] (i) DRN(ours)
Figure 1: Subjective quality assessment for upscaling on the general image: Image074 from Urban100.

To use different receptive fields for extracting information while reducing model complexity, Li et al. proposed MSRN [13], which uses 3×3 and 5×5 convolutional kernels to fuse features. Although they achieved the goal of optimizing the model, the experimental results are not outstanding enough. Their structure over-reuses features, causing the network to become bloated and difficult to train.

To address these problems, we propose a simple and efficient distilling with residual network for SISR. As shown in Fig. 3, the residual distilling group (RDG) is proposed as the building module of DRN. As Fig. 4 shows, we stack several residual distilling blocks (RDBs) with one long skip connection (LSC) in each RDG. These long memory connections bypass rich low-frequency information, which simplifies the flow of information. The output of one RDB can directly access each layer of the next RDB, resulting in continuous feature transfer. In addition, we introduce a convolutional layer with a 1×1 kernel at the end of the RDB for feature fusion and dimensionality reduction. The residual distilling operation in each RDB consists of 1×1, 3×3 and 1×1 convolution kernels. The sum of the RDG output and the global residual learning is sent to image reconstruction by pixelshuffle [14].

Figure 2: Performance and number of parameters. Results are evaluated on Set5 (×4). Our models have a better trade-off between performance and model size.

In summary, our main contributions are three-fold:

  • We propose the residual distilling block (RDB), which enjoys the benefits of ResNet [6] and distills efficient information. It can fuse common features while maintaining the ability to distill important features. Different from IDN [9], our RDB uses a residual distilling structure that retains as much information as possible.

  • Our method has few network parameters and a simple network structure, which makes it easy to reproduce. It is a compact network with a good trade-off between performance and model size.

  • We propose a simple and efficient distilling with residual network (DRN) for high-quality image SR. Moreover, it is easy to understand and better than most state-of-the-art methods.

2 Distilling with Residual Network

2.1 Network Architecture

As shown in Fig. 3, the proposed DRN mainly consists of three parts: low-level feature extraction (LFE), residual distilling groups (RDGs), and image reconstruction (IR). Let $I_{LR}$ and $I_{SR}$ denote the input and output of DRN. As in [15, 7, 11], one convolutional layer is suitable for extracting the low-level feature $F_0$ from the LR input:

$$F_0 = H_{LFE}(I_{LR})$$

where $H_{LFE}$ represents the convolutional function. $F_0$ is then sent to the residual distilling groups and is also used for global residual learning. Furthermore, we obtain $F_{RDG}$, the output of the RDGs:

$$F_{RDG} = H_{RDG}(F_0)$$

where $H_{RDG}$ denotes the operations of the proposed RDGs, which contain $G$ groups. With the deep feature information extracted by the set of RDGs, we can further fuse the features, combining the global residual learning $F_0$ and $F_{RDG}$. So we have all the extracted features $F_{DF}$:

$$F_{DF} = F_0 + F_{RDG}$$

Then $F_{DF}$ is upscaled by the image reconstruction module, giving the upscaled feature $F_{UP}$:

$$F_{UP} = H_{UP}(F_{DF})$$

where $H_{UP}$ denotes the upscaling operation of the image reconstruction module. The upscaled feature is then reconstructed by one convolution layer. Overall, the process can be expressed as

$$I_{SR} = H_{IR}(F_0 + H_{RDG}(F_0)), \qquad F_0 = H_{LFE}(I_{LR})$$

where $H_{IR}$ and $H_{RDG}$ denote the image reconstruction (upscaling followed by the final convolution) and the function of our RDGs, respectively.

Figure 3: Distilling with Residual Network (DRN).

2.2 Residual Distilling Group

We now give more details about the RDG. As Fig. 4 shows, each group contains $D$ residual distilling blocks (RDBs) and one long skip connection (LSC). This structure can achieve high performance in image super-resolution with a moderate number of convolution layers.

With all of the above, the $g$-th RDG is represented as

$$F_g = H_g(F_{g-1})$$

where $H_g$ denotes the function of the $g$-th RDG, and $F_{g-1}$ and $F_g$ are the input and output of the $g$-th RDG. Unrestricted use of residual distilling would increase the number of channels by a very large amount. Therefore, we use a 1×1 convolution with ELU [16] to reduce the number of channels; it also combines the fused distillation features. Finally, when $g$ is $G$, we have the output of the RDGs

$$F_{RDG} = F_G = H_G(H_{G-1}(\cdots H_1(F_0) \cdots))$$

where $F_G$ denotes the output of the $G$-th RDG and $F_0$, the input of the first RDG, is the low-level feature from Section 2.1.
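The channel reduction described above can be illustrated at a single spatial position, where a 1×1 convolution is just a linear map across channels. This is a minimal sketch, not the authors' implementation: the toy sizes, weights, and function names are hypothetical, and only the ELU parameter 0.2 is taken from the implementation details in Section 3.1.

```python
import math

def elu(x, alpha=0.2):
    """Exponential linear unit; alpha=0.2 is the value quoted in Section 3.1."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def conv1x1(pixel, weights, bias):
    """Apply a 1x1 convolution at one spatial position.

    pixel:   list of C_in channel values at that position
    weights: C_out rows, each a list of C_in coefficients
    bias:    list of C_out offsets
    """
    return [sum(w * v for w, v in zip(row, pixel)) + b
            for row, b in zip(weights, bias)]

# Hypothetical toy sizes: 4 input channels compressed to 2 output channels.
pixel = [1.0, -2.0, 0.5, 3.0]
weights = [[0.5, 0.5, 0.0, 0.0],
           [0.0, 0.0, 1.0, -1.0]]
bias = [0.0, 0.0]

compressed = [elu(v) for v in conv1x1(pixel, weights, bias)]
print(len(pixel), "->", len(compressed))
```

Because the kernel covers one pixel only, the layer mixes channels without touching spatial neighbours, which is why it is a cheap way to bring the channel count back down.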

Figure 4: Residual Distilling Group(RDG).

2.3 Residual Distilling Block

LR images have abundant low-frequency information but lack the high-frequency information that only HR images have. Therefore, we need to extract information from the LR image and generate high-frequency information. From the perspective of feature sharing in learning, [17] found that the connection in residual learning is an effective way to mitigate vanishing gradients in deep networks. Inspired by the recent success of [17] on ImageNet, we design residual distilling as the basic convolution in each RDB: while the channels perform the residual operation, new channels are simultaneously distilled out. The channels handled by the residual operation retain the input information as much as possible, and the newly distilled channels contain useful features that are conducive to generating high-frequency information. Residual distilling inherits the advantages of ResNet [6] and distills efficient information, achieving effective reuse and re-exploitation of features.

For intuitive understanding, as shown in Fig. 5, let $C_k$ denote the feature map dimension (number of channels) of the $k$-th layer. The relationship between the convolution layers can then be expressed as:

$$C_k = C_{k-1} + d_k$$

where $d_k$ denotes the number of channels distilled out between the $(k-1)$-th layer and the $k$-th layer. $C_{k-1}$ dimensions perform the residual operation, and $d_k$ dimensions perform the concatenation operation. The whole process can be expressed as:

$$[R_k, D_k] = H_{RD}(X_{k-1}), \qquad X_k = S(X_{k-1} + R_k, D_k)$$

where $[R_k, D_k]$ are the outputs obtained from the layer input $X_{k-1}$ by the convolutional function, $H_{RD}$ denotes the residual distilling function in Fig. 5 (left), and $S$ represents the concatenation operation. The dimension of $R_k$ is the same as that of $X_{k-1}$. Through this process, local residual information is extracted by the residual operation, while the network still retains a distilled path to learn new features flexibly. Going from high resolution to low resolution is a process of information degradation; residual distilling therefore helps the neural network extract useful features from the latent information.
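The split into a residual part and a distilled part can be sketched on a single per-pixel channel vector. This is a minimal illustration under stated assumptions: the convolution output is supplied as plain numbers rather than computed, the function names are hypothetical, and the number of distilling steps in the channel count example is chosen for illustration only (64 base filters and 8 distilled filters are the values given in Section 3.1).

```python
def residual_distill(features, conv_out, d):
    """One residual distilling step on a per-pixel channel vector.

    features: list of C input channel values
    conv_out: list of C + d values produced by the convolution
    d:        number of distilled channels
    """
    c = len(features)
    assert len(conv_out) == c + d
    residual, distilled = conv_out[:c], conv_out[c:]
    # Residual path: the element-wise sum keeps the input information.
    kept = [f + r for f, r in zip(features, residual)]
    # Distilled path: the new channels are appended (concatenated).
    return kept + distilled

def channels_after(c0, d, steps):
    """Channel count after `steps` distilling steps, each adding d channels."""
    return c0 + d * steps

out = residual_distill([1, 2, 3], [10, 20, 30, 7, 8], d=2)
print(out)                        # three summed channels followed by two new ones
print(channels_after(64, 8, 3))   # hypothetical: three steps starting from 64 channels
```

The steady growth in channel count shown by `channels_after` is the reason a compression layer is needed after a stack of such steps.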

As shown in Fig. 4, we stack RDBs in one RDG with one long connection. Too deep a network can cause the learned features to vanish, so we design a long memory connection that allows the network to preserve information from the previous block. We stack $D$ residual distilling blocks (RDBs) in each RDG. So $F_{g,d}$, the output of the $d$-th residual distilling block in the $g$-th RDG, can be expressed as

$$F_{g,d} = H_{g,d}(F_{g,d-1})$$

where $H_{g,d}$ represents the function of the $d$-th RDB. The residual distilling block is simple, lightweight and accurate. Finally, we can get $F_g$, the output of the $g$-th RDG

$$F_g = H_C(F_{g,D})$$

where $H_C$ denotes the compression using a 1×1 convolution with ELU, and $F_{g,D}$ is the output of the $d$-th RDB when $d$ is $D$.

Figure 5: Residual Distilling Block (RDB).

2.4 Image Reconstruction

As discussed in Section 2.3, the low-level feature $F_0$ and the RDG output $F_{RDG}$ of the preceding stages represent global residual information and deep information, respectively. Their sum is sent to the upsampling module.

There are several choices for the upscaling module, such as a deconvolution layer, nearest-neighbor upsampling followed by convolution, and pixelshuffle as proposed in ESPCN [14]. However, as the upscaling factor increases, the network can run into training difficulties, and the weights of the deconvolution change as the network trains. Furthermore, these methods cannot work on odd upscaling factors (e.g. ×3, ×5). Based on the above, we choose pixelshuffle as the upscaling module because it performs best. Detailed parameters of pixelshuffle are given in Table 1.
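The rearrangement performed by pixelshuffle can be sketched in plain Python on nested lists. This is an illustrative sketch, not the authors' code: it assumes the channel ordering used by common sub-pixel convolution implementations, where input channel c·r² + i·r + j feeds output channel c at sub-pixel offset (i, j), and the function name is hypothetical.

```python
def pixel_shuffle(x, r):
    """Rearrange a [C*r*r][H][W] nested list into [C][H*r][W*r]."""
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    assert c_in % (r * r) == 0, "channel count must be divisible by r*r"
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(h * r):
            for j in range(w * r):
                # Sub-pixel offset (i % r, j % r) selects the source channel.
                src = c * r * r + (i % r) * r + (j % r)
                out[c][i][j] = x[src][i // r][j // r]
    return out

# Four 1x1 channels become one 2x2 channel (upscaling factor 2).
x2 = [[[k]] for k in range(4)]
print(pixel_shuffle(x2, 2))

# Nine 1x1 channels become one 3x3 channel (odd factor 3).
x3 = [[[k]] for k in range(9)]
print(pixel_shuffle(x3, 3))
```

The operation has no learned weights of its own; the convolution in front of it produces the r² sub-pixel channels per output channel, and the same code path handles even and odd factors.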

(a) Ground-Truth
(b) HR (c) Bicubic (d) LapSRN[5] (e) MemNet[8] (f) IDN[9] (g) SRMDNF[12] (h) MSRN[13] (i) DRN(ours)
Figure 6: Subjective quality assessment for upscaling on the general image: Image027 from Urban100.
(a) Ground-Truth
(b) HR (c) Bicubic (d) LapSRN[5] (e) MemNet[8] (f) IDN[9] (g) SRMDNF[12] (h) MSRN[13] (i) DRN(ours)
Figure 7: Subjective quality assessment for upscaling on the general image: ppt3 from Set14.
(a) Ground-Truth
(b) HR (c) Bicubic (d) LapSRN[5] (e) MemNet[8] (f) IDN[9] (g) SRMDNF[12] (h) MSRN[13] (i) DRN(ours)
Figure 8: Subjective quality assessment for upscaling on the general image: Image059 from Urban100.

2.5 Loss function

There are many loss functions available in the super-resolution field, such as mean square error (MSE) [1, 3, 18], mean absolute error (MAE) [5, 7], and perceptual and adversarial losses [15]. With the MSE loss, neural networks generate images that are not in line with human vision [5], so we optimize the model with MAE, formulated as follows:

$$L_{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left\| I_{SR}^{i} - I_{HR}^{i} \right\|_1$$

where $N$ denotes the number of training samples in each batch, $I_{SR}^{i}$ is the $i$-th reconstructed HR image, and $I_{HR}^{i}$ denotes the corresponding ground-truth HR image. We also compare the results of using MAE and MSE, as shown in the loss comparison figure.
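To make the difference between the two losses concrete, the sketch below computes MAE and MSE on a tiny batch of flattened images. It assumes the per-sample error is summed over pixels and then averaged over the batch, as in the MAE formula; deep learning frameworks often average over pixels as well, which only rescales the value. The function names and numbers are illustrative.

```python
def mae_loss(sr_batch, hr_batch):
    """Mean over the batch of the L1 distance between SR and HR images."""
    total = 0.0
    for sr, hr in zip(sr_batch, hr_batch):
        total += sum(abs(a - b) for a, b in zip(sr, hr))
    return total / len(sr_batch)

def mse_loss(sr_batch, hr_batch):
    """Mean over the batch of the squared L2 distance between SR and HR images."""
    total = 0.0
    for sr, hr in zip(sr_batch, hr_batch):
        total += sum((a - b) ** 2 for a, b in zip(sr, hr))
    return total / len(sr_batch)

# Two flattened toy "images" per batch (hypothetical values).
sr_batch = [[1, 2], [3, 4]]
hr_batch = [[1, 4], [0, 4]]
print(mae_loss(sr_batch, hr_batch), mse_loss(sr_batch, hr_batch))
```

MSE penalizes the larger pixel error quadratically, so a few large deviations dominate it, whereas MAE weighs all deviations linearly.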

| Layer name | Input channels | Output channels |
| conv input | C | C×M×M |
| PixelShuffle(M) | C×M×M | C |
| conv output | C | 3 |
Table 1: Detailed configuration information about the reconstruction structure.

| Methods | RDBs | No RDBs | DRN+ |
| PSNR on Set5 (×3) | 33.98 | 33.83 | 34.35 |
Table 2: Quantitative comparison of results with or without RDBs on Set5 (×3) at the 100th epoch.

| Dataset | Scale | Bicubic | A+ [2] | VDSR [3] | DRCN [4] | LapSRN [5] | IDN [9] | SRMDNF [12] | MSRN [13] | DRN (ours) | DRN+ (ours) |
| Set5 | ×2 | 33.66/0.9300 | 36.60/0.9542 | 37.53/0.9583 | 37.63/0.9584 | 37.52/0.9581 | 37.83/0.9600 | 37.79/0.9601 | 38.08/0.9605 | 38.06/0.9607 | 38.18/0.9612 |
| Set5 | ×3 | 30.39/0.8688 | 32.63/0.9085 | 33.66/0.9201 | 33.82/0.9215 | 33.82/0.9207 | 34.11/0.9253 | 24.12/0.9254 | 34.38/0.9262 | 34.45/0.9274 | 34.68/0.9293 |
| Set5 | ×4 | 28.42/0.8104 | 30.33/0.8565 | 31.35/0.8838 | 31.53/0.8854 | 31.54/0.8852 | 31.82/0.8903 | 31.96/0.8925 | 32.07/0.8903 | 32.27/0.8964 | 32.49/0.8985 |
| Set14 | ×2 | 30.24/0.8688 | 32.42/0.9059 | 33.03/0.9124 | 33.04/0.9118 | 33.08/0.9124 | 33.32/0.9159 | 33.30/0.9148 | 33.74/0.9170 | 33.64/0.9179 | 33.85/0.9193 |
| Set14 | ×3 | 27.55/0.7742 | 29.25/0.8194 | 29.77/0.8314 | 29.76/0.831 | 29.87/0.8325 | 29.99/0.8354 | 30.04/0.8382 | 30.34/0.8395 | 30.30/0.8664 | 30.57/0.8466 |
| Set14 | ×4 | 26.00/0.7027 | 27.44/0.7450 | 28.01/0.7674 | 28.02/0.7670 | 28.19/0.7700 | 28.25/0.7730 | 28.35/0.7787 | 28.60/0.7751 | 28.69/0.7839 | 28.83/0.7872 |
| BSDB100 | ×2 | 29.56/0.8431 | 31.24/0.8870 | 31.90/0.8960 | 31.85/0.8942 | 31.80/0.8952 | 32.08/0.8985 | 32.05/0.8985 | 32.23/0.9013 | 32.23/0.9001 | 32.32/0.9013 |
| BSDB100 | ×3 | 27.21/0.7385 | 26.05/0.8019 | 28.82/0.7976 | 28.80/0.7963 | 28.82/0.7980 | 28.95/0.8013 | 28.97/0.8025 | 29.08/0.8554 | 29.12/0.8055 | 29.26/0.8090 |
| BSDB100 | ×4 | 25.96/0.6675 | 26.83/0.6999 | 27.29/0.7251 | 27.23/0.7232 | 27.32/0.7284 | 27.41/0.7297 | 27.49/0.7337 | 27.52/0.7273 | 27.65/0.7380 | 27.72/0.7403 |
| Urban100 | ×2 | 26.88/0.8403 | 29.25/0.8955 | 30.76/0.9140 | 30.75/0.9133 | 30.41/0.9103 | 31.27/0.9196 | 31.33/0.9204 | 32.22/0.9326 | 32.22/0.9288 | 32.65/0.9329 |
| Urban100 | ×3 | 24.46/0.7349 | 26.05/0.8019 | 27.14/0.8279 | 27.15/0.8276 | 27.07/0.8275 | 27.42/0.8359 | 27.57/0.8398 | 28.08/0.8554 | 28.18/0.8520 | 28.73/0.8627 |
| Urban100 | ×4 | 23.14/0.6577 | 24.34/0.7211 | 25.18/0.7524 | 25.14/0.7510 | 25.21/0.7562 | 25.41/0.7632 | 25.68/0.7731 | 26.04/0.7896 | 26.26/0.7903 | 26.54/0.7982 |
| Manga109 | ×2 | 30.82/0.9332 | 35.37/0.9663 | 37.22/0.9729 | 37.63/0.9723 | 37.27/0.9855 | 38.02/0.9749 | 38.07/0.9761 | 38.82/0.9868 | 38.75/0.9773 | 38.94/0.9779 |
| Manga109 | ×3 | 26.96/0.8555 | 29.93/0.9089 | 32.01/0.9310 | 32.31/0.9328 | 32.21/0.9318 | 32.79/0.9391 | 33.00/0.9403 | 33.44/0.9427 | 33.78/0.9455 | 34.21/0.9486 |
| Manga109 | ×4 | 24.91/0.7826 | 27.03/0.8439 | 28.83/0.8809 | 28.98/0.8816 | 29.09/0.8845 | 29.41/0.8936 | 30.09/0.9024 | 30.17/0.9034 | 30.87/0.9121 | 31.16/0.9157 |
Table 3: Benchmark results of state-of-the-art SR methods: average PSNR/SSIM for ×2, ×3, and ×4 upscaling.

3 Experimental Results

3.1 Implementation Details

In the proposed networks, all convolutional layers use 3×3 kernels with padding 1 and stride 1, except the convolutional layers for local and global feature fusion, which use 1×1 kernels with no padding and stride 1. Low-level feature extraction layers and feature fusion layers have 64 filters. The number of RDBs is 9, and the number of RDGs is 6. In RD, the number of distilled filters is 8. We refer to the network with 256 original filters in each RDB as DRN+. Other layers in each RDB are followed by the exponential linear unit (ELU [16]) with parameter 0.2. The SR results are evaluated with PSNR and SSIM [19]. We train our model with the ADAM optimizer [20] and the MAE loss, setting $\beta_1$ = 0.9, $\beta_2$ = 0.999 and $\epsilon$ = …. The learning rate is … and is halved every 200 epochs. We trained DRN and DRN+ for about 800 and 300 epochs, respectively.
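The step decay of the learning rate can be written as a one-line schedule. The sketch below is illustrative only: the initial learning rate is not recoverable from the text, so `base_lr` is a hypothetical placeholder chosen for the example, and the function name is an assumption.

```python
def learning_rate(epoch, base_lr, halve_every=200):
    """Step decay: halve the learning rate after every `halve_every` epochs."""
    return base_lr * 0.5 ** (epoch // halve_every)

# base_lr below is a hypothetical placeholder, not the value used in the paper.
base_lr = 1e-4
for epoch in (0, 199, 200, 400, 799):
    print(epoch, learning_rate(epoch, base_lr))
```

Under this schedule, a run of about 800 epochs ends at one eighth of the starting rate, and a run of about 300 epochs sees a single halving.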

3.2 Datasets

Following many existing image SR methods [7, 13], we use the 800 training images of the DIV2K dataset [21] as the training set and five standard benchmark datasets, Set5 [22], Set14 [23], Urban100 [24], BSDB100 [25] and Manga109 [26], as the testing set. We set the batch size to 16. The size of the input image is 48×48. Instead of transforming the RGB patches into YCbCr space, we use the three-channel image information from the RGB patches in order to keep the images realistic.

3.3 Comparisons with state-of-the-art methods

We compare our method with 10 state-of-the-art methods: A+ [2], SRCNN [1], FSRCNN [18], VDSR [3], DRCN [4], LapSRN [5], IDN [9], SRMDNF [12], EDSR [7] and MSRN [13]. We also use self-ensemble [27] to improve our models.
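Self-ensemble is commonly realized as test-time augmentation: the LR input is flipped and rotated, each variant is super-resolved, the outputs are mapped back, and the results are averaged. The sketch below assumes the eight-transform geometric variant (four rotations, with and without a horizontal flip) on single-channel images stored as nested lists; `nearest_x2` is a hypothetical stand-in for a trained network, not the DRN model.

```python
def flip_h(img):
    """Mirror a 2-D list left to right."""
    return [row[::-1] for row in img]

def rot90(img):
    """Rotate a 2-D list by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def rot(img, k):
    """Rotate clockwise by k quarter turns."""
    for _ in range(k % 4):
        img = rot90(img)
    return img

def self_ensemble(model, lr):
    """Average the model output over 8 flip/rotation variants of the input."""
    outputs = []
    for flip in (False, True):
        for k in range(4):
            x = rot(flip_h(lr) if flip else lr, k)
            y = rot(model(x), 4 - k)   # undo the rotation on the output
            if flip:
                y = flip_h(y)          # undo the flip on the output
            outputs.append(y)
    h, w = len(outputs[0]), len(outputs[0][0])
    return [[sum(o[i][j] for o in outputs) / len(outputs) for j in range(w)]
            for i in range(h)]

def nearest_x2(img):
    """Toy x2 upscaler standing in for a trained SR network."""
    return [[v for v in row for _ in range(2)] for row in img for _ in range(2)]

lr_img = [[1, 2], [3, 4]]
print(self_ensemble(nearest_x2, lr_img))
```

For a model that already commutes with flips and rotations, such as the toy upscaler here, the ensemble reproduces the plain output; a trained network is generally not exactly equivariant, so averaging can smooth out orientation-dependent errors at the cost of eight forward passes.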

Table 3 shows the quantitative comparison for ×2, ×3 and ×4 SR. Compared with previous methods, our DRN+ performs best on most datasets at all scaling factors. Even with 64 filters, our DRN is also better than the other compared methods on most datasets. Table 2 shows the ablation test on RDBs: the model with RDBs performs better than the model without them, and DRN+ performs better still. In Fig. 1, Fig. 6, Fig. 7 and Fig. 8 we present visual results on different datasets with different upscaling factors. Fig. 1 shows a visual comparison at scale ×4. For image "Image075", we observe that most methods cannot recover the texture of the windows. In contrast, our DRN better alleviates blurring artifacts and recovers details consistent with the ground truth. In Fig. 6 we observe that the lines of "Image027" recovered by most methods do not correspond well to the ground-truth image, whereas our DRN recovers the lines accurately. For "ppt3" in Fig. 7, most methods blur the word "with" to different degrees, while DRN removes the blur well enough that the word "with" can be recognized. In Fig. 8, most methods fail to recover the window lines of "Image059" consistently with the ground-truth image, whereas our DRN accurately restores the lines and removes almost all of the blur.

We also compare the trade-off between performance and number of network parameters for DRN and existing networks. Fig. 2 shows PSNR versus number of parameters, where the results are evaluated on the Set5 dataset for the ×4 upscaling factor. We can see that our DRN outperforms the relatively small models. In addition, DRN+ achieves higher performance with 54% fewer parameters than EDSR. These comparisons show that our model has a better trade-off between performance and model size.
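For reference, PSNR follows directly from the mean squared error between the reconstructed and ground-truth images. The sketch below assumes 8-bit single-channel images (peak value 255) stored as nested lists and evaluates the whole image; benchmark protocols often differ in color space and border cropping, which this illustration does not model. The function name and pixel values are hypothetical.

```python
import math

def psnr(sr, hr, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2-D images."""
    diffs = [(a - b) ** 2
             for row_s, row_h in zip(sr, hr)
             for a, b in zip(row_s, row_h)]
    mse = sum(diffs) / len(diffs)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

hr_img = [[100, 100], [100, 100]]
sr_img = [[101, 99], [100, 100]]
print(round(psnr(sr_img, hr_img), 2))
```

Because the scale is logarithmic, halving the mean squared error raises PSNR by about 3 dB, so gaps of a few tenths of a dB between methods correspond to modest differences in average pixel error.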

4 Conclusion

In this paper, we propose a simple and efficient distilling with residual network (DRN) for SISR, which is better than most state-of-the-art methods and has fewer parameters. Based on residual distilling (RD), DRN inherits the advantages of residual and dense connection paths, achieving effective reuse and re-exploitation of features. Our DRN and DRN+ have a better trade-off between model size and performance. In the future, we will apply this model to other areas such as de-raining, dehazing, and denoising.


  • [1] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang, “Learning a deep convolutional network for image super-resolution,” in ECCV. Springer, 2014, pp. 184–199.
  • [2] Radu Timofte, Vincent De Smet, and Luc Van Gool, “A+: Adjusted anchored neighborhood regression for fast super-resolution,” in ACCV. Springer, 2014, pp. 111–126.
  • [3] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee, “Accurate image super-resolution using very deep convolutional networks,” in CVPR, 2016, pp. 1646–1654.
  • [4] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee, “Deeply-recursive convolutional network for image super-resolution,” in CVPR, 2016, pp. 1637–1645.
  • [5] Wei-Sheng Lai, Jia-Bin Huang, Narendra Ahuja, and Ming-Hsuan Yang, “Deep laplacian pyramid networks for fast and accurate super-resolution,” in CVPR, 2017, pp. 5835–5843.
  • [6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in CVPR, 2016, pp. 770–778.
  • [7] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee, “Enhanced deep residual networks for single image super-resolution,” in CVPRW, 2017, pp. 1132–1140.
  • [8] Ying Tai, Jian Yang, Xiaoming Liu, and Chunyan Xu, “Memnet: A persistent memory network for image restoration,” in CVPR, 2017, pp. 4539–4547.
  • [9] Zheng Hui, Xiumei Wang, and Xinbo Gao, “Fast and accurate single image super-resolution via information distillation network,” in CVPR, 2018, pp. 723–731.
  • [10] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu, “Residual dense network for image super-resolution,” in CVPR, 2018.
  • [11] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu, “Image super-resolution using very deep residual channel attention networks,” arXiv preprint arXiv:1807.02758, 2018.
  • [12] Kai Zhang, Wangmeng Zuo, and Lei Zhang, “Learning a single convolutional super-resolution network for multiple degradations,” in CVPR, 2018, vol. 6.
  • [13] Juncheng Li, Faming Fang, Kangfu Mei, and Guixu Zhang, “Multi-scale residual network for image super-resolution,” in ECCV, 2018, pp. 517–532.
  • [14] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang, “Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,” in CVPR, 2016, pp. 1874–1883.
  • [15] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew P Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al., “Photo-realistic single image super-resolution using a generative adversarial network,” in CVPR, 2017, vol. 2, p. 4.
  • [16] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter, “Fast and accurate deep network learning by exponential linear units (elus),” arXiv preprint arXiv:1511.07289, 2015.
  • [17] Yunpeng Chen, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, and Jiashi Feng, “Dual path networks,” in NeurIPS, 2017, pp. 4467–4475.
  • [18] Chao Dong, Chen Change Loy, and Xiaoou Tang, “Accelerating the super-resolution convolutional neural network,” in ECCV. Springer, 2016, pp. 391–407.
  • [19] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [20] Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015.
  • [21] Radu Timofte, Eirikur Agustsson, Luc Van Gool, Ming-Hsuan Yang, Lei Zhang, Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, Kyoung Mu Lee, et al., “Ntire 2017 challenge on single image super-resolution: Methods and results,” in CVPRW. IEEE, 2017, pp. 1110–1121.
  • [22] Marco Bevilacqua, Aline Roumy, Christine Guillemot, and Marie Line Alberi-Morel, “Low-complexity single-image super-resolution based on nonnegative neighbor embedding,” 2012.
  • [23] Roman Zeyde, Michael Elad, and Matan Protter, “On single image scale-up using sparse-representations,” in International Conference on Curves and Surfaces. Springer, 2010, pp. 711–730.
  • [24] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja, “Single image super-resolution from transformed self-exemplars,” in CVPR, 2015, pp. 5197–5206.
  • [25] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in ICCV. IEEE, 2001, vol. 2, pp. 416–423.
  • [26] Yusuke Matsui, Kota Ito, Yuji Aramaki, Azuma Fujimoto, Toru Ogawa, Toshihiko Yamasaki, and Kiyoharu Aizawa, “Sketch-based manga retrieval using manga109 dataset,” Multimedia Tools and Applications, vol. 76, no. 20, pp. 21811–21838, 2017.
  • [27] Radu Timofte, Rasmus Rothe, and Luc Van Gool, “Seven ways to improve example-based single image super resolution,” in CVPR, 2016, pp. 1865–1873.