Cross-Scale Residual Network for Multiple Tasks: Image Super-resolution, Denoising, and Deblocking

11/04/2019 ∙ by Yuan Zhou, et al. ∙ 10

In general, image restoration involves mapping low-quality images to their high-quality counterparts. Such an optimal mapping is usually non-linear and can be learned by machine learning. Recently, deep convolutional neural networks have proven promising for learning such mappings. It is desirable for an image processing network to support three vital tasks well, namely, super-resolution, denoising, and deblocking. It is commonly recognized that these tasks are strongly correlated, so it is important to harness the inter-task correlations. To this end, we propose the cross-scale residual network to exploit scale-related features and the inter-task correlations among the three tasks. The proposed network extracts features at multiple spatial scales and establishes multi-temporal feature reuse. Our experiments show that the proposed approach outperforms state-of-the-art methods in both quantitative and qualitative evaluations for multiple image restoration tasks.




I Introduction

Image restoration [15] has been a long-standing problem given its practical value for a variety of low-level vision applications, such as face restoration [6], semantic segmentation [30, 37], and target tracking [31, 52]. In general, image restoration aims to recover a clean image x from its corrupted observation y = D(x) + n, where x is the ground-truth high-quality version of y, D is a degradation function, and n is additive noise. By accommodating different types of degradation function, the resulting mathematical models target specific image restoration tasks, such as image super-resolution, denoising, and deblocking. Image super-resolution reconstructs a high-resolution (HR) image from its low-resolution (LR) counterpart, with D being a composite operator of blurring and down-sampling. Image denoising retrieves a clean image from a noisy observation, with D commonly being the identity function and n being additive white Gaussian noise with standard deviation σ. JPEG image deblocking aims to remove the blocking artifacts from a lossy image, with D corresponding to the JPEG compression function.
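As a rough illustration of the degradation model described above, the following sketch writes it as y = D(x) + n (notation assumed here) and instantiates it for two of the tasks. The function names are hypothetical, and block averaging is only a stand-in for the blurring and down-sampling operator; it is not the bicubic kernel used in the experiments.

```python
import random

def degrade_denoising(x, sigma, rng):
    """Denoising case: D is the identity and n is white Gaussian noise."""
    return [[p + rng.gauss(0.0, sigma) for p in row] for row in x]

def degrade_super_resolution(x, factor):
    """Super-resolution case: D blurs and down-samples.

    Assumption: averaging each factor x factor block stands in for the
    composite blur + down-sampling operator (no noise is added here).
    """
    h, w = len(x), len(x[0])
    out = []
    for i in range(0, h - h % factor, factor):
        row = []
        for j in range(0, w - w % factor, factor):
            block = [x[i + di][j + dj]
                     for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

rng = random.Random(0)
clean = [[float(i + j) for j in range(4)] for i in range(4)]
noisy = degrade_denoising(clean, sigma=15.0, rng=rng)   # same size as clean
low_res = degrade_super_resolution(clean, factor=2)     # half the size
```

Both functions take the same clean image and differ only in the choice of D and n, which is the sense in which the three tasks share one formulation.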

For decades, model-based optimization and dictionary learning have been dominant in single-task image restoration [48, 4, 12, 42, 22]. The recent development of deep learning, especially convolutional neural networks (CNNs), has notably accelerated progress in image restoration [53, 40, 54, 55, 32]. Deep CNNs that enlarge the receptive field or enhance feature reuse through residual learning and dense connections provide state-of-the-art results in single-task image restoration, such as single image super-resolution [24, 29, 17], image denoising [18, 32], and JPEG image artifact removal [14, 33].

It is desirable for an image restoration network to support all three aforementioned tasks well. Unfortunately, most existing models perform well in only one of these tasks. It is commonly recognized that these tasks are strongly correlated. To support all the tasks, an image restoration network must fully harness the inter-task correlations.

Moreover, there exist critical differences in how best to treat the three tasks. In particular, the selection of feature scales is known to significantly impact the performance on these tasks, and each task has its own favorable scales of feature extraction. That is why we propose the cross-scale residual network (CSRnet) to improve the utilization of multi-scale features and the performance on multiple tasks.

Fig. 1: Diagram of proposed CSRnet that comprises three parts: shallow feature extraction stage, hierarchical feature fusion stage, and reconstruction stage.

Fig. 1 shows the diagram of the proposed CSRnet, which extracts various features at different scales and fully uses all the hierarchical features throughout the network. Specifically, we propose cross-scale residual blocks (CSRBs) (see Fig.2), whose three states operate at different spatial resolutions, as the building blocks for the CSRnet. The states capture information at different scales, and the intra-block cross-scale connection of each CSRB produces an information flow from the fine to the coarse scale or vice versa. In addition, the inter-block connection combines information at a given resolution from all the preceding CSRBs, to provide rich features for the current CSRB. Extensive experimental results verify that each proposed component improves the network performance, and hence the CSRnet outperforms state-of-the-art methods in image super-resolution, denoising, and deblocking.

Our main contributions can be summarized in the following aspects:

  1. We propose a cross-scale residual network (CSRnet) which simultaneously implements multi-temporal feature reusing and multi-spatial scale feature learning for multiple image restoration tasks, namely, image super-resolution, denoising, and deblocking.

  2. Cross-scale residual blocks (CSRBs) are proposed as the basic building block of the proposed network. The CSRB adaptively learns feature information from different scales. Although deep-learning-based methods have achieved a notable improvement over traditional methods in the image restoration domain, most of them learn features from the image space at a single scale and thus cannot handle the scenario of multiple tasks. To this end, we design the CSRB to efficiently extract and adaptively fuse features from different scales for multiple image restoration tasks.

  3. To enhance feature reusing in the blocks and gradient flow during training, we propose two kinds of connections, namely, intra-block cross-scale connection and inter-block connection. The former produces an information flow from the fine to the coarse scale or vice versa. The latter allows the information from preceding blocks to be reused for learning of succeeding block features.

II Related Work

II-A Image super-resolution

Methods based on CNNs have recently revolutionized the field of image super-resolution. The most commonly used approach is to consider the interpolated low-resolution image as input to the network. Dong et al. [9] first introduced an end-to-end CNN model called SRCNN to reconstruct interpolated low-resolution images into their high-resolution counterparts. Improvements to the SRCNN include a very deep network for super-resolution (VDSR), which increases the network depth with a smaller filter size and residual learning [24], and a deeply recursive convolutional network (DRCN), which uses recursive layers and multi-supervision [23]. Deep CNN models using block structures [41, 40] based on residual units use features from different temporal levels for reconstruction. Although these methods [24, 50, 40, 9, 23, 41] have considerably improved super-resolution accuracy, the interpolated low-resolution inputs increase the computational complexity and might introduce additional noise.

Given the specificity of image super-resolution, another effective approach directly takes the low-resolution image as input to the CNN [29, 28, 43, 39, 7] to decrease the computational cost. Shi et al. [39] proposed a sub-pixel convolutional layer to effectively up-sample the low-resolution feature maps, an approach that is also used in enhanced deep residual networks for super-resolution [29]. Based on dense connections, the SRDenseNet [43] and the residual dense network employ dense blocks or residual dense blocks to learn high-level features, whose outputs are concatenated into a final output. In addition, generative adversarial networks have been used for image super-resolution [28, 7] to learn adversarial and perceptual content losses that can improve visual quality.

II-B Image denoising

Traditional methods such as the BM3D algorithm [26] and those based on dictionary learning [38] have improved the performance of image denoising to some extent. Still, methods based on CNNs are more suitable for this task. Xie et al. [46] combined sparse coding with an auto-encoder structure for image denoising. Inspired by residual learning and batch normalization, Zhang et al. [53] proposed the DnCNN model to improve the outcome of image denoising. Mao et al. [34] proposed a very deep convolutional auto-encoder network (RED) using symmetric skip connections for image denoising and super-resolution. Du et al. [11] proposed stacked convolutional denoising auto-encoders to map images to hierarchical representations without any label information. Zhang et al. [54] integrated CNN denoisers into model-based optimization for image super-resolution and denoising. Tai et al. [40] proposed a very deep persistent memory network (MemNet) that introduces a memory block consisting of a recursive unit and a gate unit to simultaneously perform several image restoration tasks.

II-C JPEG image deblocking

Given that JPEG compression often induces severe blocking artifacts and undermines visual quality, image deblocking is particularly important in the restoration domain. Chen et al. [5] proposed a flexible learning framework based on nonlinear reaction diffusion models for JPEG image deblocking, super-resolution, and denoising. Wang et al. [45] designed a deep dual-domain-based fast restoration model for JPEG image deblocking, which combines prior knowledge from the JPEG compression scheme and the sparsity-based dual-domain approach. Compared with these traditional methods [5, 45], JPEG image deblocking based on CNNs removes the blocking artifacts more effectively and improves visual quality. Dong et al. [10] proposed an artifact reduction CNN (ARCNN) for JPEG image deblocking. Building on the method in [45], the dual-domain CNN proposed by Guo et al. [13] performs joint learning in the discrete cosine transform and pixel domains. To improve visual quality and artistic appreciation, Guo et al. [14] proposed a one-to-many network for JPEG image deblocking, which measures the output quality using perceptual, naturalness, and JPEG losses.

II-D Multiple task image processing

There exist only a few methods for multi-task image processing. The method proposed by Zhang et al. [54] integrates CNN-based denoisers into model-based optimization for image denoising and super-resolution. A very deep persistent memory network [40] introduces a memory block to explicitly mine persistent memory through adaptive learning for image denoising, super-resolution, and deblocking. Likewise, Zhang et al. [55] proposed a residual dense network to exploit the hierarchical features from all the convolutional layers in three representative image restoration applications. However, these methods learn image mappings at a single scale and ignore that different tasks may require features from different scales.

III Proposed CSRnet for Image Restoration

III-A Architecture

The proposed CSRnet illustrated in Fig. 1 comprises three stages: shallow feature extraction stage, hierarchical feature fusion stage, and reconstruction stage. They are respectively responsible for extracting shallow image features, fusing abundant feature maps, and adding image details. We denote and as the input and output of the CSRnet, respectively.

Shallow Feature Extraction Stage: We utilize two convolutional layers to extract shallow features from low-quality input images. The first convolutional layer extracts features from the input image, and the second convolutional layer reduces the dimension of the features. Shallow feature extraction stage can be expressed as:




Here, the first convolutional operation uses a filter of size 7×7; such a large kernel produces a large receptive field that takes a large image context into account. The second convolutional operation uses a filter of size 3×3. The shallow features are reused for residual learning during the reconstruction stage through a skip connection and serve as the input to the first CSRB.
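To make the receptive-field argument concrete, the sketch below computes the receptive field of a stack of convolutions with the standard recurrence. The helper name is hypothetical, and a stride of 1 for both shallow layers is an assumption, since the stride is not stated in the text.

```python
def receptive_field(layers):
    """Receptive field of stacked convolutions.

    layers: list of (kernel_size, stride) tuples, applied in order.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the field by (k - 1) input steps
        jump *= stride              # stride multiplies the step between output pixels
    return rf

# Shallow stage: a 7x7 layer followed by a 3x3 layer (stride 1 assumed).
rf_shallow = receptive_field([(7, 1), (3, 1)])
# For comparison: two 3x3 layers.
rf_two_small = receptive_field([(3, 1), (3, 1)])
```

Under these assumptions, the 7×7 plus 3×3 stage sees a 9×9 input window per output pixel, versus 5×5 for two 3×3 layers.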

Hierarchical Feature Fusion Stage: The CSRnet learns hierarchical features from every CSRB that has identical structure in this stage. If CSRBs are stacked by inter-block connection, hierarchical feature fusion stage makes full use of the scale state from each CSRB by:


where denotes the concatenation features of the outputs of scale from all the CSRBs and the shallow output from the previous stage, and introduces a convolutional layer to adaptively control the dimension of feature maps before inputting reconstruction stage.

Reconstruction Stage: To further improve information flow and reconstruct image details, this stage contains a skip connection and two convolutional layers and is expressed as:


where the skip connection adds output of the hierarchical feature fusion stage with shallow features from shallow feature extraction stage. And denotes two convolutional operations with filter sizes .

Given a training set of patches in which each ground-truth high-quality patch corresponds to a low-quality patch, we define the loss function of our model with respect to its parameter set as below:

Fig. 2: Diagram of proposed CSRB, the key component of CSRnet.

III-B Cross-Scale Residual Block (CSRB)

To learn image features at different scales, we propose the CSRB as the key component of the CSRnet. A CSRB adopts three branches operating at different scales (i.e., scales 0, 2, and 4) to enable the use of cross-scale features. It is illustrated in Fig. 2 and detailed as follows.

Cross-Scale Design: Unlike models working at a single spatial resolution, the CSRB incorporates information from different scales. Specifically, boxes with different colored edges in Fig. 2 represent the structure designs at different scales. The boxes with black, purple, and yellow edges indicate scales 0, 2, and 4, respectively, where the scale value represents the degree of down-sampling. Two colored links, called intra-block cross-scale connections, indicate transitions between the three scales. The green and blue links respectively produce information flow from fine to coarse scale and vice versa. To learn abundant features from the previous blocks, we add the red link, called inter-block connection, at each scale.

The input at a given scale (0, 2, or 4) in the d-th CSRB is computed by concatenating two kinds of features, namely, 1) same-scale features from all the previous CSRBs; and 2) either shallow features (when d = 1) or the finer-scale feature map (at scales 2 and 4). The finer-scale feature map is obtained by a strided convolution from the higher-resolution layers. The overall inputs of the d-th CSRB are given as:


where a convolutional layer reduces and maintains the dimension of the input at each scale; the finer-scale feature map and the concatenated same-scale features are described in detail under the intra-block cross-scale connection and the inter-block connection, respectively.

To facilitate feature specialization at different resolutions, we modify the residual blocks for each scale input of the d-th CSRB by removing the batch normalization layers from our network, as done by Nah et al. [36]. Enhanced deep residual networks for super-resolution [29] have experimentally shown that this simple modification substantially increases performance. The outputs of the different scales in the d-th CSRB can be formulated as:


where each residual block in the d-th CSRB is a composite operation comprising two convolutional layers and an activation function (ReLU), and a further convolutional layer maintains the dimension of the outputs at the different scales of the CSRB. At scales 0 and 2, the coarser-scale feature map is obtained by deconvolution from the lower-resolution layers, as detailed in the intra-block cross-scale connection.
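The batch-normalization-free residual block described above can be sketched in a deliberately simplified form. The version below works on 1-D signals with a single channel, and the ordering convolution, ReLU, convolution is an assumption borrowed from common practice; all function names are hypothetical.

```python
def conv1d_same(x, kernel):
    """1-D cross-correlation with zero padding; output length equals input length.

    Assumes an odd-length kernel.
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(x))]

def relu(x):
    return [max(0.0, v) for v in x]

def residual_block(x, kernel1, kernel2):
    """conv -> ReLU -> conv plus an identity skip; no batch normalization."""
    y = conv1d_same(relu(conv1d_same(x, kernel1)), kernel2)
    return [a + b for a, b in zip(x, y)]

signal = [1.0, -2.0, 3.0]
identity = [0.0, 1.0, 0.0]
out = residual_block(signal, identity, identity)   # signal + relu(signal)
```

With all-zero kernels the block reduces to the identity mapping, which is the property that makes stacking many such blocks easy to optimize.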

Intra-Block Connection: Features across different scales can provide various types of information for image restoration. Hence, we propose the intra-block cross-scale connection for producing information flow from fine to coarse scale and vice versa.

The finer-scale feature map (at scales 2 and 4) is produced from the higher-resolution layers by:


where the down-sampling convolutional layer uses a stride of 2 to halve the size of the feature map.

Likewise, the coarser-scale feature map (at scales 0 and 2) is obtained from the lower-resolution layers by:


where the up-sampling convolutional (deconvolution) layer uses a fractional stride of 1/2 to double the size of the feature map.
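The spatial sizes produced by the stride-2 down-sampling and the stride-1/2 up-sampling can be checked with the usual output-size formulas. In the sketch below, the kernel size 3, padding 1, output padding 1, and the 48-pixel patch are all illustrative assumptions; none of them is given in the text.

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output length of a strided convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

def deconv_out(size, kernel=3, stride=2, padding=1, output_padding=1):
    """Output length of a transposed (fractionally strided) convolution."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

patch = 48                      # hypothetical patch width at scale 0
scale2 = conv_out(patch)        # fine -> coarse: half size
scale4 = conv_out(scale2)       # fine -> coarse: quarter size
back2 = deconv_out(scale4)      # coarse -> fine: back to half size
back0 = deconv_out(back2)       # coarse -> fine: back to full size
```

With these settings the three branches hold maps of width 48, 24, and 12, and the up-sampling path restores exactly the sizes needed to merge with the finer branches.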

Inter-Block Connection: To enhance feature reuse and gradient flow, we introduce the inter-block connection, which utilizes the information at a given resolution from all previous blocks. The input at a particular scale (0, 2, or 4) of the d-th CSRB receives the corresponding scale features of all the preceding CSRBs as follows:


where represents the concatenation of features retrieved by all the preceding CSRBs at a particular scale. When d=1, .
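One consequence of the inter-block connection is that the number of concatenated channels grows linearly with the block index, which is why a dimension-reducing convolution is needed at the input of each scale. The sketch below only counts channels; the function name is hypothetical, and the 32 channels per block output follow the filter count reported in the network settings.

```python
def concatenated_channels(d, filters=32, extra=0):
    """Channels entering the d-th CSRB at one scale, before dimension reduction.

    d:       1-based index of the current CSRB
    filters: channels produced by each preceding CSRB at this scale
    extra:   channels of any additional input concatenated at this scale
             (e.g., a finer-scale feature map); 0 if there is none
    """
    return (d - 1) * filters + extra

# Illustrative count for an 8-block network at a coarse scale, assuming the
# finer-scale map also has 32 channels.
per_block = [concatenated_channels(d, extra=32) for d in range(1, 9)]
```

In this illustration the last block receives 256 concatenated channels, which the subsequent convolutional layer would map back to the working width.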

Fig. 3: Convergence analysis on multiple scales and inter-block connection. The curve for each model is based on the PSNR in 300k iterations of BSD100, with upscaling factor .

IV Experiments

In this section, we first describe the experimental setup including the datasets and network settings of the proposed CSRnet. Then, taking image super-resolution as an example, we evaluate the contributions of different CSRnet components and parameters through an ablation study, and then analyze the effect of the CSRnet depth. Finally, we compare our model with state-of-the-art methods in both objective and subjective aspects on three image restoration tasks, namely, denoising, super-resolution, and deblocking.

IV-A Datasets

(a) Ground Truth
(b) HR (PSNR/SSIM) (c) Bicubic (26.90/0.9434) (d) SRCNN (31.47/0.9790) (e) VDSR (32.76/0.9869) (f) DRCN (32.32/0.9867) (g) LapSRN (32.76/0.9878) (h) MemNet (34.46/0.9902) (i) Ours (36.25/0.9931)
Fig. 4: Qualitative super-resolution comparison of the proposed CSRnet with other models on an image from the Set14 dataset with upscaling factor ×2. The CSRnet recovers sharp edges of letters, such as "n" or "g" in the image.
(a) Ground Truth
(b) HR (PSNR/SSIM) (c) Bicubic (19.03/0.6517) (d) SRCNN (20.40/0.7402) (e) VDSR (20.82/0.7672) (f) DRCN (20.86/0.7688) (g) LapSRN (20.82/0.7699) (h) MemNet (21.45/0.7886) (i) Ours (22.28/0.8121)
Fig. 5: Qualitative super-resolution comparison of the proposed CSRnet with other models on an image from the Urban100 dataset with upscaling factor ×4. Only the CSRnet clearly recovers parallel line structures.
(a) Ground Truth
(c) VDSR (21.35/0.7489)
(d) LapSRN (22.40/0.8163)
(e) Ours (24.49/0.8498)
Fig. 6: Qualitative super-resolution comparison of the proposed CSRnet with other models on an image from the BSD100 dataset with upscaling factor ×8. Our method more realistically restores the man's eyes and his shirt's stripes.

For image super-resolution, we generated the bicubic up-sampled image patches used as input to the CSRnet with the MATLAB function imresize [8] and its bicubic option. Following [9, 23], we evaluate the proposed model on four popular benchmark datasets, namely Set5 [3], Set14 [51], BSD100 [35], and Urban100 [19], with upscaling factors ×2, ×4, and ×8.

For image denoising, we generated noisy patches as CSRnet input by adding Gaussian noise at three levels, σ = 15, 30, and 50, to the clean patches. Four popular benchmarks, namely a dataset with 14 common images [40], BSD68 [35], Urban100 [19], and the BSD testing set with 200 images [21], were used for evaluation.

For JPEG image deblocking, we compressed the images using the MATLAB JPEG encoder with compression quality settings Q = 10 and 20 to produce the deblocking input to the CSRnet. As in [10], we evaluated the CSRnet and the comparison methods on the Classic5 and LIVE1 datasets.

IV-B Network Settings

The objective function given by Eqn. 5 was optimized via minibatch stochastic gradient descent with backpropagation [10]. To balance the size of the input patches against the available computing power, we set the minibatch size to 10, momentum to 0.9, and weight decay to .
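For reference, one common form of the stochastic gradient descent update with momentum and weight decay is sketched below. The momentum of 0.9 comes from the text; the learning rate and the weight decay value used here are purely hypothetical placeholders, since neither is recoverable from this section, and the function name is illustrative.

```python
def sgd_momentum_step(weights, grads, velocity, lr, momentum=0.9, weight_decay=1e-4):
    """One SGD step with momentum and L2 weight decay.

    weight_decay=1e-4 and lr are hypothetical placeholder values.
    """
    new_velocity = [
        momentum * v - lr * (g + weight_decay * w)   # decay acts as an extra gradient term
        for w, g, v in zip(weights, grads, velocity)
    ]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity

w, v = [1.0], [0.0]
w, v = sgd_momentum_step(w, grads=[0.5], velocity=v, lr=0.1)
```

Frameworks differ slightly in how they scale the velocity by the learning rate, so this is one variant rather than the exact update used for training.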

We use TensorFlow [1] to implement the basic CSRnet. Each convolutional layer, except for the first and final layers, has 32 filters. The first convolutional layer has 64 filters, which are used to extract more shallow information. The final convolutional layer has a single filter, which outputs the high-quality image. Training the basic CSRnet for image super-resolution required roughly three days on a single GTX 1080 GPU (Nvidia Co., Santa Clara, CA, USA). Due to space constraints, we focus on image super-resolution in Sec. IV-C and IV-D, and cover all three tasks in Sec. IV-E.

We evaluate the results of the image restoration tasks in terms of the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) on the Y channel (luminance) of the YCbCr color space. For displaying the results, the other two chrominance channels were taken directly from the interpolated LR images.
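The Y-channel PSNR evaluation described above can be sketched as follows. The luma conversion uses the ITU-R BT.601 coefficients for 8-bit images; whether the authors' evaluation used exactly this conversion, and whether image borders were cropped before scoring, is not stated, so both are assumptions. Function names are hypothetical.

```python
import math

def rgb_to_y(r, g, b):
    """BT.601 luma for 8-bit RGB values in [0, 255]; result lies in [16, 235]."""
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized 2-D images."""
    sq_errors = [(a - b) ** 2
                 for row_a, row_b in zip(reference, estimate)
                 for a, b in zip(row_a, row_b)]
    mse = sum(sq_errors) / len(sq_errors)
    if mse == 0:
        return math.inf           # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

reference = [[10.0, 20.0], [30.0, 40.0]]
estimate = [[11.0, 21.0], [31.0, 41.0]]    # every pixel off by one gray level
score = psnr(reference, estimate)
```

An error of one gray level everywhere gives a mean squared error of 1, so the score equals 20·log10(255), roughly 48.13 dB.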

IV-C Ablation Study

TABLE I lists the PSNR obtained from the ablation study on the effects of multiple scales and inter-block connection. The baseline (denoted as CSRnet-1S) operates at a single scale (scale 0). To further verify the effectiveness of multiple spatial scales, CSRnet-2S adds a coarser scale (scale 2) to the baseline CSRnet-1S, and CSRnet-3S adds another coarser scale (scale 4) to CSRnet-2S. These networks exchange features among different scales via intra-block cross-scale connections. Among the three networks, CSRnet-3S achieves the best performance on the four testing datasets. Adding more scales thus enables better feature learning and further improves the network performance, e.g., from 31.734 dB to 32.075 dB on Urban100.

Dataset    CSRnet-1S   CSRnet-2S   CSRnet-3S   CSRnet-Dense
Set5       37.853      37.867      37.890      37.999
Set14      33.446      33.500      33.543      33.696
BSD100     32.115      32.155      32.202      32.251
Urban100   31.734      31.890      32.075      32.326

TABLE I: PSNR (dB) at upscaling factor ×2 obtained from the ablation study to evaluate multiple scales and inter-block connection on different datasets. The red entries indicate the best performance.

Upscaling factor ×2
Method            Set5           Set14          BSD100         Urban100
Bicubic           33.68/0.9304   30.24/0.8691   29.56/0.8440   26.88/0.8410
SRCNN16[9]        36.65/0.9536   32.45/0.9067   31.36/0.8879   29.52/0.8965
VDSR16[24]        37.53/0.9587   33.05/0.9127   31.90/0.8960   30.77/0.9141
DRCN15[23]        37.63/0.9588   33.06/0.9121   31.85/0.8942   30.76/0.9133
ESPCN16[39]       37.00/0.9559   32.75/0.9098   31.51/0.8939   29.87/0.9065
LapSRN17[27]      37.52/0.9591   32.99/0.9124   31.80/0.8949   30.41/0.9101
MemNet17[40]      37.78/0.9597   33.28/0.9142   32.08/0.8984   31.31/0.9195
WaveResNet17[2]   37.57/0.9586   33.09/0.9129   32.15/0.8995   30.96/0.9169
DSRN18[16]        37.66/0.9594   33.15/0.9132   32.10/0.8979   30.97/0.9163
DRFN18[49]        37.71/0.9595   33.29/0.9142   32.02/0.8979   31.08/0.9123
EEDS19[44]        37.78/0.9609   33.21/0.9151   31.95/0.8963   -/-
Ours              38.00/0.9613   33.70/0.9198   32.25/0.9005   32.33/0.9298

Upscaling factor ×4
Method            Set5           Set14          BSD100         Urban100
Bicubic           28.42/0.8109   26.10/0.7023   25.96/0.6678   23.15/0.6574
SRCNN16[9]        30.48/0.8628   27.50/0.7513   26.91/0.7103   24.53/0.7226
VDSR16[24]        31.35/0.8838   28.03/0.7678   27.29/0.7252   25.18/0.7525
DRCN15[23]        31.53/0.8854   28.04/0.7673   27.24/0.7233   25.14/0.7511
ESPCN16[39]       30.66/0.8646   27.71/0.7562   26.98/0.7124   24.60/0.7360
LapSRN17[27]      31.54/0.8866   28.19/0.7694   27.32/0.7264   25.21/0.7553
WaveResNet17[2]   31.52/0.8864   28.11/0.7699   27.32/0.7266   25.36/0.7614
MemNet17[40]      31.74/0.8893   28.26/0.7723   27.40/0.7281   25.50/0.7630
DRFN18[49]        31.55/0.8861   28.30/0.7737   27.39/0.7293   25.45/0.7629
DSRN18[16]        31.40/0.8834   28.07/0.7702   27.25/0.7243   25.08/0.7471
EEDS19[44]        31.53/0.8869   28.13/0.7698   27.35/0.7263   -/-
Ours              32.12/0.8929   28.51/0.7788   27.55/0.7343   26.10/0.7842

Upscaling factor ×8
Method            Set5           Set14          BSD100         Urban100
Bicubic           24.40/0.6045   23.19/0.5110   23.67/0.4808   20.74/0.4841
SRCNN16[9]        25.34/0.6471   23.86/0.5443   24.14/0.5043   21.29/0.5133
VDSR16[24]        25.73/0.6743   23.20/0.5110   24.34/0.5169   21.48/0.5289
DRCN15[23]        25.93/0.6743   24.25/0.5510   24.49/0.5168   21.71/0.5289
ESPCN16[39]       25.75/0.6738   24.21/0.5109   24.37/0.5277   21.59/0.5420
LapSRN17[27]      26.15/0.7028   24.45/0.5792   24.54/0.5293   21.81/0.5555
Ours              26.44/0.7523   24.65/0.6316   24.76/0.5924   22.31/0.6059

TABLE II: Average PSNR (dB)/SSIM results of the competing methods for the image super-resolution task with upscaling factors ×2, ×4, and ×8 on datasets Set5, Set14, BSD100, and Urban100. The red entries indicate the best performance.

BM3D07[26] PGPD15[47] TNRD15[5] DnCNN16[53] MemNet17[40] FOCNet19[20] Ours

14 images
15 - /- 32.01/0.8984 32.23 /0.9041 32.56/0.9110 -/- -/- 32.86/0.9162
30 28.49/0.8204 26.19/0.7442 27.03/0.7305 29.04/0.8389 29.22/0.8444 -/- 29.45/0.8516

50 26.08/0.7427 24.71/0.6913 26.27/0.7502 26.66/0.7678 26.91/0.7775 -/- 27.09/0.7875

15 -/- 31.38/0.8776 31.65/0.8890 31.99/0.8976 -/- -/- 32.16/0.9017
30 27.31 /0.7755 27.33 /0.7717 26.76/0.7101 28.52/0.8094 28.04/0.8053 -/- 28.82/0.8220

50 25.06/0.6831 25.18/0.6841 26.02/0.7111 26.31/0.7287 25.86/0.7202 -/- 26.64/0.7487

15 32.34/0.9220 32.18/0.9154 31.98/0.9187 32.67/0.9250 -/- 33.15/- 33.35/0.9361
30 -/- 28.59/0.8495 26.79/0.7612 28.88 /0.8566 29.11/0.8633 -/- 30.02/0.8895

50 25.94/0.7791 26.00/0.7760 25.71/0.7756 26.28/0.7869 26.64/0.8023 27.40/- 27.56/0.8373

15 31.08/0.8722 31.13/0.8693 31.42/0.8822 31.73/0.8906 -/- 31.83/- 31.87/0.8952
30 -/- 27.81/0.7693 26.76/0.7108 28.36/0.7999 28.46/0.8039 -/- 28.61/0.8105
50 25.62 /0.6869 25.75/0.6869 25.97/0.7021 26.23/0.7189 26.37/0.7290 26.50/- 26.53/0.7372

TABLE III: Average PSNR (dB)/SSIM results of the competing methods for the image denoising task with noise levels σ = 15, 30, and 50 on datasets S14 and BSD200. The red and blue entries indicate the best.

Dataset    Q    JPEG           ARCNN15[10]    TNRD15[5]      DnCNN16[53]    MemNet17[40]   IACNN19[25]    Ours
Classic5   10   27.82/0.7595   29.03/0.7929   29.28/0.7992   29.40/0.8026   29.69/0.8107   29.43/0.8070   30.03/0.8199
Classic5   20   30.12/0.8344   31.15/0.8517   31.47/0.8576   31.63/0.8610   31.90/0.8658   31.64/0.8628   32.21/0.8708
LIVE1      10   27.77/0.7730   28.96/0.8076   29.15/0.8111   29.19/0.8123   29.45/0.8193   29.34/0.8199   29.72/0.8257
LIVE1      20   30.07/0.8512   31.29/0.8733   31.46/0.8769   31.59/0.8802   31.83/0.8846   31.73/0.8848   32.08/0.8886

TABLE IV: Average PSNR (dB)/SSIM results of the competing methods for the JPEG image deblocking task with quality factors Q = 10 and 20 on datasets Classic5 and LIVE1. The red entries indicate the best performance.

Then, we add inter-block connections to CSRnet-3S and denote the resulting network as CSRnet-Dense, which corresponds to the complete CSRnet. Compared to the previous CSRnet variants, CSRnet-Dense achieves the best results on the four testing datasets, which verifies the effect of inter-block connection. Through the inter-block connections, each component is able to contribute to information and gradient flow through the network.

To demonstrate the convergence of the four evaluated CSRnet variants, we plot their PSNR curves in Fig. 3, with the bicubic results as the reference. The four models train stably without obvious performance degradation. In addition, multiple scales and the inter-block connection not only accelerate convergence but also notably improve performance.

IV-D Depth analysis of our network

Dataset    B4R6     B6R6     B8R6
Set5       37.923   37.959   37.999
Set14      33.635   33.678   33.696
BSD100     32.193   32.215   32.251
Urban100   31.998   32.187   32.326

TABLE V: PSNR (dB) at upscaling factor ×2 for different network depths, determined by the number of CSRBs, on different datasets. The red entries indicate the best performance.

Besides different architectures, we evaluated different depths of the proposed CSRnet. The network depth is determined by two basic parameters: the number of CSRBs (B) and the number of residual blocks per CSRB (R). In this study, we only tested the effect of the number of CSRBs by setting up three structures: B4R6, B6R6, and B8R6. TABLE V lists the PSNR obtained from image super-resolution with these networks on the four evaluated datasets, Set5, Set14, BSD100, and Urban100, with upscaling factor ×2. Increasing the number of CSRBs consistently improves the PSNR on all four datasets (e.g., from 31.998 dB to 32.326 dB on Urban100), since the increased network depth provides more hierarchical features.

IV-E Comparisons with State-of-the-Art Models

We compared the CSRnet with the state-of-the-art models for three restoration tasks, namely, image super-resolution, image denoising, and JPEG image deblocking.

IV-E1 Image Super-resolution

Regarding image super-resolution, we quantitatively compared the proposed CSRnet with ten state-of-the-art methods, namely SRCNN16[9], VDSR16[24], DRCN15[23], ESPCN16[39], LapSRN17[27], MemNet17[40], WaveResNet17[2], DRFN18[49], DSRN18[16], and EEDS19[44]. For a fair comparison, we evaluated all the methods on the luminance channel for all upscaling factors. The comparison results on the four evaluated datasets for three upscaling factors (×2, ×4, ×8) are listed in TABLE II. The proposed CSRnet outperforms the comparison models at every upscaling factor and on every test dataset. On the Urban100 dataset, the CSRnet outperforms the second-best method by a PSNR gain of 0.60 dB at upscaling factor ×4. On BSD100, the CSRnet achieves a PSNR gain of only 0.15 dB over the second-best method at the same factor. Similar results occur at other scales and with respect to other comparison models. Hence, the proposed CSRnet performs especially well on structured images with similar geometric patterns across various spatial resolutions, such as urban scenes (Urban100). Also worth noting is the SSIM performance of our method at upscaling factor ×8: the SSIM of the CSRnet is 0.05 to 0.07 higher than that of LapSRN at ×8, whereas at the other upscaling factors it is no more than 0.03 higher. This indicates that our method retains higher structural similarity under larger upscaling factors.
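The PSNR gains quoted above can be re-derived from TABLE II. The sketch below copies a subset of the ×4 PSNR entries (the three strongest competitors on each dataset plus the proposed method) and computes the margin over the second-best method; the helper name is hypothetical.

```python
# PSNR (dB) at upscaling factor x4, copied from TABLE II (subset of methods).
table_ii_x4 = {
    "Urban100": {"WaveResNet": 25.36, "MemNet": 25.50, "DRFN": 25.45, "Ours": 26.10},
    "BSD100": {"MemNet": 27.40, "DRFN": 27.39, "EEDS": 27.35, "Ours": 27.55},
}

def gain_over_second_best(scores, ours="Ours"):
    """Difference between our score and the best competing score, in dB."""
    best_other = max(v for name, v in scores.items() if name != ours)
    return round(scores[ours] - best_other, 2)

gains = {dataset: gain_over_second_best(scores)
         for dataset, scores in table_ii_x4.items()}
```

In both cases the second-best method at ×4 is MemNet, and the resulting margins match the 0.60 dB and 0.15 dB figures given in the text.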

Besides the quantitative comparison, Figs. 4, 5, and 6 show visual comparisons among the evaluated methods. Fig. 4 shows that the proposed CSRnet reconstructs clearer letters than the other models on an image from the Set14 dataset at upscaling factor ×2. Likewise, Fig. 5 shows that the CSRnet clearly recovers the parallel line structures on an image from the Urban100 dataset at upscaling factor ×4, whereas the other models produce obvious distortions. In Fig. 6, our model more realistically restores the man's eyes and his shirt's stripes for an image from the BSD100 dataset at upscaling factor ×8, whereas the results of the other methods are highly distorted. Overall, the CSRnet outperforms the other evaluated models both quantitatively and qualitatively.

(a) Ground Truth
(b) HR (PSNR/SSIM) (c) Noise (24.62/0.4561) (d) IRCNN (29.81/0.7096) (e) BM3D (35.29/0.9376) (f) TNRD (35.44/0.9362) (g) PGPD (35.09/0.9309) (h) DnCNN (36.23/0.9459) (i) Ours (36.98/0.9559)
Fig. 7: Qualitative comparison of our method with other methods on an image from BSD68 with noise level 15. Our method restores the shape of the coral fish in the water, especially the lip of the fish.
(a) Ground Truth
(b) HR (PSNR/SSIM) (c) Noise (18.68/0.2271) (d) IRCNN (24.70/0.4889) (e) TNRD (27.07/0.7206) (f) PGPD (28.74/0.8659) (g) DnCNN (28.77/0.8774) (h) MemNet (29.08/0.8977) (i) Ours (29.24/0.9052)
Fig. 8: Qualitative comparison of our method with other methods on an image from BSD200 with noise level 30. Our method recovers the clearest identification information on the plane.
(a) Ground Truth
(b) HR (PSNR/SSIM) (c) Noise (14.82/0.1459) (d) IRCNN (22.41/0.3699) (e) TNRD (29.52/0.8408) (f) PGPD (30.29/0.8438) (g) DnCNN (30.19/0.8501) (h) MemNet (30.90/0.8659) (i) Ours (31.63/0.8733)
Fig. 9: Qualitative comparison of our method with other methods on an image from S14 with noise level 50. Our method restores the window of the house more clearly.

IV-E2 Image Denoising

We trained the proposed CSRnet on gray images and compared the results to those obtained from eight denoising methods: BM3D07[26], TNRD15[5], PGPD15[47], DnCNN16[53], IRCNN17[54], RED18[49], MemNet17[40], and FOCNet19[20]. TABLE III lists the average PSNR/SSIM results of the evaluated methods on four benchmark datasets for three noise levels. The PSNR values of the CSRnet are higher than those of the second-best method at every noise level and on every dataset. As for super-resolution, Figs. 7, 8, and 9 show visual comparisons among the evaluated methods on an image from BSD68 with noise level σ = 15, an image from BSD200 with noise level σ = 30, and an image from S14 with noise level σ = 50. The proposed CSRnet recovers sharper and clearer images than the other methods and is thus more faithful to the ground truth.

(a) Ground Truth
(b) HR (PSNR/SSIM) (c) Deblocking (25.71/0.7610) (d) ARCNN (26.83/0.7951) (e) DnCNN (27.50/0.8150) (f) MemNet (28.04/0.8335) (g) Ours (28.87/0.8542)
Fig. 10: Qualitative comparison of our method with other methods on an image from Classic5 with quality factor 10. Our method recovers the lighthouse.
(a) Ground Truth
(b) HR (PSNR/SSIM) (c) Deblocking (29.39/0.7998) (d) ARCNN (34.11/0.8981) (e) DnCNN (34.59/0.9067) (f) MemNet (35.06/0.9136) (g) Ours (35.33/0.9196)
Fig. 11: Qualitative comparison of our method with other methods on an image from LIVE1 with quality factor 20. Our method recovers the lighthouse.

IV-E3 JPEG Image Deblocking

We applied the proposed CSRnet to deblocking on the Y channel only and compared it with five existing methods: ARCNN15 [10], TNRD15 [5], DnCNN16 [53], MemNet17 [40], and IACNN19 [25]. TABLE IV lists the average PSNR/SSIM of the evaluated methods on two benchmark datasets, Classic5 and LIVE1, for quality factors of 10 and 20. CSRnet outperforms IACNN19 [25], the current state-of-the-art method, by more than 0.60 and 0.57 dB on the Classic5 dataset and by 0.38 and 0.35 dB on the LIVE1 dataset for quality factors of 10 and 20, respectively. Figs. 10 and 11 show visual comparisons for JPEG image deblocking. ARCNN, DnCNN, and MemNet were evaluated using their public code. CSRnet removes blocking artifacts and restores detailed textures more effectively than the compared methods.
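Evaluating on the Y channel only means that colour images are first converted to luma before PSNR is computed. The sketch below illustrates one common way to do this, under assumptions the text does not confirm: it uses the ITU-R BT.601 studio-range weights (Y between 16 and 235 for 8-bit input), keeps 255 as the peak value, and uses hypothetical helper names `rgb_to_y` and `psnr_y` and made-up pixel values.

```python
import math


def rgb_to_y(r, g, b):
    """Luma of an 8-bit RGB pixel with BT.601 studio-range weights (assumed convention)."""
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def psnr_y(reference_rgb, restored_rgb, peak=255.0):
    """PSNR in dB computed on the Y channel of two equal-length lists of RGB pixels."""
    if len(reference_rgb) != len(restored_rgb) or not reference_rgb:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = 0.0
    for (r0, g0, b0), (r1, g1, b1) in zip(reference_rgb, restored_rgb):
        squared_error += (rgb_to_y(r0, g0, b0) - rgb_to_y(r1, g1, b1)) ** 2
    mse = squared_error / len(reference_rgb)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


# Hypothetical two-pixel example: the first pixel differs by 10 in every channel.
reference = [(100, 100, 100), (200, 50, 25)]
restored = [(110, 110, 110), (200, 50, 25)]
score = psnr_y(reference, restored)
```

Because the three weights sum to 219, a uniform error of 10 gray levels in R, G, and B becomes an error of about 8.6 in Y, so Y-channel PSNR values are not directly comparable with PSNR computed on RGB or on full-range grayscale.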

V Conclusion

This paper presents CSRnet, a deep network designed to exploit scale-related features and the inter-task correlations among three tasks: super-resolution, denoising, and deblocking. Several CSRBs are stacked in CSRnet and adaptively learn image features at different scales. The same-resolution outputs from all previous CSRBs are used by the current CSRB via inter-block connections to reuse information. The intra-block cross-scale connections within a CSRB allow it to learn richer features from finer to coarser scales and vice versa. Extensive evaluations and comparisons with existing methods verify the advantages of the proposed CSRnet. In future work, we will extend CSRnet to handle more general restoration tasks such as image deblurring and blind deconvolution.


  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, and M. Devin (2016) TensorFlow: large-scale machine learning on heterogeneous distributed systems. Cited by: §IV-B.
  • [2] W. Bae, J. Yoo, and J. Chul Ye (2017) Beyond deep residual learning for image restoration: persistent homology-guided manifold simplification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 145–153. Cited by: §IV-E1, TABLE II.
  • [3] M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. A. Morel (2012) Neighbor embedding based single-image super-resolution using semi-nonnegative matrix factorization. In IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 1289–1292. Cited by: §IV-A.
  • [4] A. Buades, B. Coll, and J. Morel (2008) Nonlocal image and movie denoising. International journal of computer vision 76 (2), pp. 123–139. Cited by: §I.
  • [5] Y. Chen and T. Pock (2015) Trainable nonlinear reaction diffusion: a flexible framework for fast and effective image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6), pp. 1256–1272. Cited by: §II-C, §IV-E2, §IV-E3, TABLE III, TABLE IV.
  • [6] Z. Chen, J. Lin, T. Zhou, and F. Wu (2019) Sequential gating ensemble network for noise robust multiscale face restoration. IEEE transactions on cybernetics. Cited by: §I.
  • [7] M. Cheon, J. Kim, J. Choi, and J. Lee (2018) Generative adversarial network-based image super-resolution using perceptual content losses. arXiv preprint arXiv:1809.04783. Cited by: §II-A.
  • [8] P. I. Corke (2002) A robotics toolbox for matlab. IEEE Robotics And Automation Magazine 3 (1), pp. 24–32. Cited by: §IV-A.
  • [9] C. Dong, C. C. Loy, K. He, and X. Tang (2016) Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (2), pp. 295–307. Cited by: §II-A, §IV-A, §IV-E1, TABLE II.
  • [10] C. Dong, Y. Deng, C. Change Loy, and X. Tang (2015) Compression artifacts reduction by a deep convolutional network. In Proceedings of the IEEE International Conference on Computer Vision, pp. 576–584. Cited by: §II-C, §IV-A, §IV-B, §IV-E3, TABLE IV.
  • [11] B. Du, W. Xiong, J. Wu, L. Zhang, L. Zhang, and D. Tao (2016) Stacked convolutional denoising auto-encoders for feature representation. IEEE transactions on cybernetics 47 (4), pp. 1017–1027. Cited by: §II-B.
  • [12] X. Fang, Q. Zhou, J. Shen, C. Jacquemin, and L. Shao (2018) Text image deblurring using kernel sparsity prior. IEEE transactions on cybernetics. Cited by: §I.
  • [13] J. Guo and H. Chao (2016) Building dual-domain representations for compression artifacts reduction. Springer International Publishing. Cited by: §II-C.
  • [14] J. Guo and H. Chao (2017) One-to-many network for visually pleasing compression artifacts reduction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4867–4876. Cited by: §I, §II-C.
  • [15] J. Han, L. Shao, D. Xu, and J. Shotton (2013) Enhanced computer vision with microsoft kinect sensor: a review. IEEE transactions on cybernetics 43 (5), pp. 1318–1334. Cited by: §I.
  • [16] W. Han, S. Chang, D. Liu, M. Yu, M. Witbrock, and T. S. Huang (2018) Image super-resolution via dual-state recurrent networks. In Proc. CVPR. Cited by: §IV-E1, TABLE II.
  • [17] M. Haris, G. Shakhnarovich, and N. Ukita (2018) Deep back-projection networks for super-resolution. Cited by: §I.
  • [18] S. Harmeling (2012) Image denoising: can plain neural networks compete with bm3d?. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2392–2399. Cited by: §I.
  • [19] J. B. Huang, A. Singh, and N. Ahuja (2015) Single image super-resolution from transformed self-exemplars. In Computer Vision and Pattern Recognition, pp. 5197–5206. Cited by: §IV-A, §IV-A.
  • [20] X. Jia, S. Liu, X. Feng, and L. Zhang (2019) FOCNet: a fractional optimal control network for image denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6054–6063. Cited by: §IV-E2, TABLE III.
  • [21] J. Yang, J. Wright, T. Huang, and Y. Ma (2010) Image super-resolution via sparse representation. IEEE Trans Image Process 19 (11), pp. 2861–2873. Cited by: §IV-A.
  • [22] J. Jiang, Y. Yu, Z. Wang, S. Tang, R. Hu, and J. Ma (2019) Ensemble super-resolution with a reference dataset. IEEE transactions on cybernetics. Cited by: §I.
  • [23] J. Kim, J. K. Lee, and K. M. Lee (2015) Deeply-recursive convolutional network for image super-resolution. pp. 1637–1645. Cited by: §II-A, §IV-A, §IV-E1, TABLE II.
  • [24] J. Kim, J. K. Lee, and K. M. Lee (2016) Accurate image super-resolution using very deep convolutional networks. In Computer Vision and Pattern Recognition, pp. 1646–1654. Cited by: §I, §II-A, §IV-E1, TABLE II.
  • [25] Y. Kim, J. W. Soh, J. Park, B. Ahn, H. Lee, Y. Moon, and N. I. Cho (2019) A pseudo-blind convolutional neural network for the reduction of compression artifacts. IEEE Transactions on Circuits and Systems for Video Technology. Cited by: §IV-E3, TABLE IV.
  • [26] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian (2007) Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE Transactions on Image Processing 16 (8), pp. 2080. Cited by: §II-B, §IV-E2, TABLE III.
  • [27] W. S. Lai, J. B. Huang, N. Ahuja, and M. H. Yang (2017) Deep laplacian pyramid networks for fast and accurate super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5835–5843. Cited by: §IV-E1, TABLE II.
  • [28] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. P. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network.. In CVPR, Vol. 2, pp. 4. Cited by: §II-A.
  • [29] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee (2017) Enhanced deep residual networks for single image super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1132–1140. Cited by: §I, §II-A, §III-B.
  • [30] D. Lin, R. Zhang, Y. Ji, P. Li, and H. Huang (2018) SCN: switchable context network for semantic segmentation of rgb-d images. IEEE transactions on cybernetics. Cited by: §I.
  • [31] F. Liu, C. Gong, X. Huang, T. Zhou, J. Yang, and D. Tao (2018) Robust visual tracking revisited: from correlation filter to template matching. IEEE Transactions on Image Processing 27 (6), pp. 2777–2790. Cited by: §I.
  • [32] P. Liu, H. Zhang, Z. Kai, L. Liang, and W. Zuo (2018) Multi-level wavelet-cnn for image restoration. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Cited by: §I.
  • [33] D. Maleki, S. Nadalian, M. M. Derakhshani, and M. A. Sadeghi (2018) BlockCNN: a deep network for artifact removal and image compression. arXiv preprint arXiv:1805.11091. Cited by: §I.
  • [34] X. Mao, C. Shen, and Y. Yang (2016) Image denoising using very deep fully convolutional encoder-decoder networks with symmetric skip connections. Cited by: §II-B.
  • [35] D. Martin, C. Fowlkes, D. Tal, and J. Malik (2001) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, Vol. 2, pp. 416–423. Cited by: §IV-A, §IV-A.
  • [36] S. Nah, T. H. Kim, and K. M. Lee (2016) Deep multi-scale convolutional neural network for dynamic scene deblurring. pp. 257–265. Cited by: §III-B.
  • [37] D. Nie, L. Wang, E. Adeli, C. Lao, W. Lin, and D. Shen (2018) 3-d fully convolutional networks for multimodal isointense infant brain image segmentation. IEEE transactions on cybernetics (99), pp. 1–14. Cited by: §I.
  • [38] P. Chatterjee and P. Milanfar (2009) Clustering-based denoising with locally learned dictionaries. IEEE Trans Image Process 18 (7), pp. 1438–1451. Cited by: §II-B.
  • [39] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1874–1883. Cited by: §II-A, §IV-E1, TABLE II.
  • [40] Y. Tai, J. Yang, X. Liu, and C. Xu (2017) MemNet: a persistent memory network for image restoration. In IEEE International Conference on Computer Vision, pp. 4549–4557. Cited by: §I, §II-A, §II-B, §II-D, §IV-A, §IV-E1, §IV-E2, §IV-E3, TABLE II, TABLE III, TABLE IV.
  • [41] Y. Tai, J. Yang, and X. Liu (2017) Image super-resolution via deep recursive residual network. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2790–2798. Cited by: §II-A.
  • [42] R. Timofte, R. Rothe, and L. Van Gool (2016) Seven ways to improve example-based single image super resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1865–1873. Cited by: §I.
  • [43] T. Tong, G. Li, X. Liu, and Q. Gao (2017) Image super-resolution using dense skip connections. In IEEE International Conference on Computer Vision, pp. 4809–4817. Cited by: §II-A.
  • [44] Y. Wang, L. Wang, H. Wang, and P. Li (2019) End-to-end image super-resolution via deep and shallow convolutional networks. IEEE Access 7, pp. 31959–31970. Cited by: §IV-E1, TABLE II.
  • [45] Z. Wang, D. Liu, S. Chang, Q. Ling, Y. Yang, and T. S. Huang (2016) D3: deep dual-domain based fast restoration of jpeg-compressed images. pp. 2764–2772. Cited by: §II-C.
  • [46] J. Xie, L. Xu, and E. Chen (2012) Image denoising and inpainting with deep neural networks. In International Conference on Neural Information Processing Systems, pp. 341–349. Cited by: §II-B.
  • [47] J. Xu, L. Zhang, W. Zuo, D. Zhang, and X. Feng (2015) Patch group based nonlocal self-similarity prior learning for image denoising. In Proceedings of the IEEE international conference on computer vision, pp. 244–252. Cited by: §IV-E2, TABLE III.
  • [48] J. Yang, J. Wright, T. S. Huang, and Y. Ma (2010) Image super-resolution via sparse representation. IEEE transactions on image processing 19 (11), pp. 2861–2873. Cited by: §I.
  • [49] X. Yang, H. Mei, J. Zhang, K. Xu, B. Yin, Q. Zhang, and X. Wei (2018) DRFN: deep recurrent fusion network for single-image super-resolution with large factors. IEEE Transactions on Multimedia. Cited by: §IV-E1, §IV-E2, TABLE II.
  • [50] K. Zeng, J. Yu, R. Wang, C. Li, and D. Tao (2017) Coupled deep autoencoder for single image super-resolution. IEEE transactions on cybernetics 47 (1), pp. 27–37. Cited by: §II-A.
  • [51] R. Zeyde, M. Elad, and M. Protter (2012) On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, pp. 711–730. Cited by: §IV-A.
  • [52] H. Zhang, X. Zhou, Z. Wang, H. Yan, and J. Sun (2018) Adaptive consensus-based distributed target tracking with dynamic cluster in sensor networks. IEEE transactions on cybernetics (99), pp. 1–12. Cited by: §I.
  • [53] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang (2016) Beyond a gaussian denoiser: residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing 26 (7), pp. 3142–3155. Cited by: §I, §II-B, §IV-E2, §IV-E3, TABLE III, TABLE IV.
  • [54] K. Zhang, W. Zuo, S. Gu, and L. Zhang (2017) Learning deep cnn denoiser prior for image restoration. In IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2. Cited by: §I, §II-B, §II-D, §IV-E2.
  • [55] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu (2018) Residual dense network for image restoration. CoRR abs/1812.10477. Cited by: §I, §II-D.