Attention Prior for Real Image Restoration

04/26/2020 ∙ by Saeed Anwar, et al. ∙ Australian National University

Deep convolutional neural networks perform better on images containing spatially invariant degradations, also known as synthetic degradations; however, their performance is limited on real degraded photographs, and they typically require multiple-stage network modeling. To advance the practicability of restoration algorithms, this paper proposes a novel single-stage blind real image restoration network (R^2Net) that employs a modular architecture. We use a residual-on-the-residual structure to ease the flow of low-frequency information and apply feature attention to exploit the channel dependencies. Furthermore, the evaluation in terms of quantitative metrics and visual quality for four restoration tasks, i.e., denoising, super-resolution, raindrop removal, and JPEG compression, on 11 real degraded datasets against more than 30 state-of-the-art algorithms demonstrates the superiority of our R^2Net. We also present a comparison on three synthetically generated degraded datasets to showcase the capability of our method on synthetic denoising. The codes, trained models, and results are available on




1 Introduction

In the present digital age, many hand-held camera-based devices allow humans and machines to record or capture video and image data. However, during image and video acquisition, various forms of corruption, for example, noise (Gaussian, speckle, thermal, etc.), compression (JPEG, etc.), and blur (motion, defocus, etc.), are often inevitable and can degrade the visual quality considerably. The process of reducing the artifacts to recover the missing details is called image restoration. Moreover, being a low-level vision task, image restoration is a crucial step for various computer vision and image analysis applications such as computational photography, surveillance, robotic vision, recognition, and classification.


Generally, restoration algorithms can be categorized as model-based and learning-based. Model-based algorithms include non-local self-similarity (NSS) [1, 2], sparsity [3, 4], gradient methods [5, 6], Markov random field models [7], and external restoration priors [8, 9, 10]. These model-based algorithms are computationally expensive and time-consuming, cannot directly suppress spatially variant degradations, and struggle to characterize complex image textures. On the other hand, discriminative learning aims to model the image prior from a set of degraded and ground-truth image pairs. One technique is to learn the prior steps in the context of truncated inference [11], while another approach is to employ brute-force learning, for example, CNN methods [12, 13]. CNN models improved restoration performance thanks to their modeling capacity, network training, and design. However, the performance of current learning models is limited and tailored to specific synthetic degradation types.

A practical restoration algorithm should be efficient, flexible, perform restoration using a single model and handle both spatially variant and invariant degradations when the degradation is known or unknown. Unfortunately, the current state-of-the-art algorithms are far from achieving all of these aims. We present a CNN model that is efficient and capable of handling synthetic as well as real degradation present in images. We summarize the contributions of this work in the following paragraphs.

Fig. 1: A real noisy face image from the RNI15 dataset [15]. Panels, left to right: Noisy, CBDNet [14], R^2Net (Ours). Unlike CBDNet [14], R^2Net does not suffer from over-smoothing or over-contrasting artifacts. (Best viewed in color on a high-resolution display.)

1.1 Contributions

  • A CNN-based approach for real image restoration (and synthetic image denoising) that produces state-of-the-art results using a first-of-its-kind single-stage model.

  • To the best of our knowledge, the first use of feature attention in restoration tasks, specifically in denoising, JPEG compression, and raindrop removal. Feature attention has been used in super-resolution before; in comparison, our model is lightweight and efficient.

  • A modular network less affected by vanishing gradients [16], thus enabling improved performance with an increasing number of modules.

  • Quantitative as well as qualitative experimental results on 11 real-image degradation datasets producing state-of-the-art results against more than 30 algorithms. Additionally, results on three synthetic noisy datasets are provided for denoising.

  • We introduce a single network that can handle spatially variant noise, significant local artifacts such as those from JPEG compression, pixel-by-pixel artifacts such as raindrops, and super-resolution.

2 Related Work

In the following sections, we present the literature for denoising, super-resolution, JPEG compression, and raindrop removal.

2.1 Image denoising

In this section, we present and discuss recent trends in image denoising. Two notable denoising algorithms, NLM [2] and BM3D [1], use self-similar patches. Due to their success, many subsequent variants were proposed, including SADCT [17], SAPCA [18], and NLB [19], which seek self-similar patches in different transform domains. Dictionary-based methods [20, 21] enforce sparsity by employing self-similar patches and learning over-complete dictionaries from clean images. Many algorithms [22, 23] investigated the maximum likelihood algorithm to learn a statistical prior, e.g., the Gaussian Mixture Model of natural patches or patch groups for patch restoration. Furthermore, Levin et al. [24] and Chatterjee et al. [25] motivated external denoising [8, 26, 10] by showing that an image can be recovered with negligible error by selecting reference patches from a clean external database. However, all of the external algorithms are class-specific.

Recently, Schmidt et al. [27] introduced a cascade of shrinkage fields (CSF), which integrated half-quadratic optimization and random fields. Shrinkage aims to suppress smaller values (noise values) and learn mappings discriminatively. CSF assumes the data fidelity term to be quadratic and that it has a discrete Fourier transform-based closed-form solution.

Due to the popularity of convolutional neural networks (CNNs), image denoising algorithms [12, 13, 28, 29, 27, 30] have achieved a performance boost. Two notable denoising networks, DnCNN [12] and IrCNN [13], predict the residue present in the image instead of the denoised image itself; that is, the loss function compares the prediction against the ground-truth noise rather than the original clean image. Both networks achieved better results despite having a simple architecture in which repeated blocks of convolution, batch normalization, and ReLU activation are used. Furthermore, IrCNN [13] and DnCNN [12] depend on blindly predicted noise, i.e., they do not take into account the underlying structures and textures of the noisy image.
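The residual-learning idea described above can be sketched in a few lines. This is only an illustration on flattened pixel lists; the function names `residual_target` and `recover_clean` are hypothetical and are not taken from DnCNN or IrCNN.

```python
def residual_target(noisy, clean):
    """Training target for residual learning: the noise, not the clean image."""
    return [n - c for n, c in zip(noisy, clean)]


def recover_clean(noisy, predicted_residual):
    """Subtract the predicted residual (noise) from the noisy input."""
    return [n - r for n, r in zip(noisy, predicted_residual)]


noisy = [0.75, 0.25]
clean = [0.5, 0.5]
target = residual_target(noisy, clean)   # what the network is trained to predict
restored = recover_clean(noisy, target)  # a perfect prediction recovers the clean image
```

With a perfect noise prediction, subtracting the residual returns the clean pixels exactly, which is why the loss can be computed against the noise instead of the image.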

Another essential image restoration framework is Trainable Nonlinear Reaction-Diffusion (TNRD) [11], which incorporates a field-of-experts prior [7] into a deep neural network for a specific number of inference steps by extending the non-linear diffusion paradigm into a highly trainable set of parametrized linear filters and influence functions. Although the results of TNRD are favorable, the model requires a significant amount of data to learn the parameters and influence functions, as well as overall fine-tuning, hyper-parameter determination, and stage-wise training. Similarly, the non-local color net (NLNet) [28] was motivated by non-local self-similar (NSS) priors and couples non-local self-similarity with discriminative learning. NLNet improved upon the traditional methods, but it lags in performance compared to most CNNs [13, 12] due to the adaptation of NSS priors, as it is unable to find analogs for all the patches in the image. Recently, Anwar et al. introduced CIMM, a deep denoising CNN architecture composed of identity mapping modules [30]. The network learns features in cascaded identity modules using dilated kernels and uses self-ensemble to boost performance. CIMM improved upon all the previous CNN models [12, 31].

Fig. 2: The architecture of the proposed network. Different shades of green of the conv layers denote different dilations, while the smaller width of a conv layer means the kernel is 1×1. The second row shows the architecture of each EAM.

Recently, many algorithms have focused on blind denoising of real noisy images [32, 14, 33]. The algorithms [13, 12, 31] benefitted from the modeling capacity of CNNs and have shown the ability to learn a single blind denoising model; however, the denoising performance is limited, and the results are not satisfactory on real photographs. Generally speaking, real noisy image denoising is a two-step process: the first step involves noise estimation, whereas the second step addresses non-blind denoising. Noise Clinic (NC) [15] estimates a noise model dependent on signal and frequency and then denoises the image using non-local Bayes (NLB). In comparison, Zhang et al. [34] proposed a non-blind Gaussian denoising network, termed FFDNet, that can produce satisfying results on some real noisy images; however, it requires manual intervention to select a high value for high noise levels.

More recently, CBDNet [14] was used to train a blind denoising model for real photographs. CBDNet [14] is composed of two subnetworks: noise estimation and non-blind denoising. CBDNet [14] also incorporates multiple losses, is engineered to be trained on real-synthetic noise and real-image noise, and enforces a higher noise standard deviation value for low-noise images. Moreover, the methods of [14, 34] may require manual intervention to improve the results. In this paper, we present an end-to-end architecture that learns the noise characteristics and produces results on real noisy images without requiring separate subnets or manual intervention.

2.2 Super-resolution

In this section, we provide a chronological record of advancement in the area of deep super-resolution. The initial focus of CNN models was to have a simple architecture with no skip connections. Examples include SRCNN [35], with three convolutional layers, and FSRCNN [36], which has eight convolutional layers and shrinks and expands the channels so that it runs in real time on a CPU. Next, SRMD [37], a linear network (resembling [35, 13]), is able to handle multiple types of degradations. The input to the system is low-resolution images with the corresponding degradation maps.

The introduction of skip-connections in deep networks made its way into super-resolution algorithms. Kim et al. [38] employed a global skip connection to enforce residual learning (VDSR), improving on the previous super-resolution methods. The same authors then developed a deep recursive structure (DRCN) [39] that shares layer parameters, which reduces the number of parameters significantly, although it lags behind VDSR [38] in performance. Following that, to decrease memory usage and computational complexity, Tai et al. [40] proposed a deep recursive residual network (DRRN) that utilizes basic skip-connections to implement residual learning along various convolutional blocks, i.e., a multi-path architecture.

The success of residual blocks motivated many super-resolution works. In the enhanced deep super-resolution (EDSR) network, Lim et al. [41] proposed to employ residual blocks and a global skip-connection rescaling each block output by a factor of 0.1 to avoid exploding gradients and significantly improving on all previous methods. More recently, Ahn et al. [42] proposed an efficient network, namely, the cascading residual network (CARN). The authors use cascading connections with a variant of residual blocks with three convolutional layers.

To improve performance, several super-resolution works are driven by the success of dense concatenation [43]. For example, Tong et al. [44] take the output of all the previous convolutional layers in a block and feed it into the subsequent one. Similarly, the residual-dense network (RDN) [45] learns relationships through dense connections in the patches. Lately, Haris et al. [46] train a series of densely connected downsampling and upsampling layers (single block) with feedback and feed-forward mechanisms.

Recently, Zhang et al. [47] introduced visual attention [48] in their RCAN network. In addition, the authors employ a series of residual blocks and multi-level skip-connections in the network. Furthermore, Kim et al. [49], in parallel to [47], suggested a dual attention procedure, namely, SRRAM. The performance of SRRAM [49] is, however, not on par with RCAN [47]. Recently, Anwar & Barnes introduced densely connected residual units with Laplacian attention [50] to advance the performance of super-resolution. Comparatively, our proposed SR method is lightweight and achieves competitive performance.

2.3 Raindrop Removal

Many papers deal with visibility enhancement, which removes haze, fog, and rain streaks; however, these algorithms do not apply to raindrop removal on a window or camera lens as the image formation models are different.

To remove raindrops, Kurihata et al. [51] propose to apply PCA to the learned shapes of raindrops. During testing, the learned shapes are compared to raindrops, and the matching entities are removed. Due to the irregular shapes of raindrops, it is challenging to learn all the representative shapes; similarly, it is difficult to model transparent raindrops via PCA. Furthermore, there is a risk that local content is falsely detected and removed from the image, mainly due to similar appearance and structure. Roser & Geiger [52] compare synthetically generated raindrops with real ones based on the assumption that the former are simply spherical in shape while the latter are inclined spheres. Although these assumptions seem to enhance visibility, the method lacks generalization capability due to the random sizes and shapes of raindrops.

To detect raindrops, Yamashita et al. [53] employed stereo images using disparity measures. Neighborhood textures replaced the detected raindrops based on the assumption that the raindrop occludes a similar looking background. Furthermore, Yamashita et al. [54] proposed a method relevant to [51], where a sequence of images was used instead of stereo pairs. Similarly, You et al. [55] exploited a motion-based method for raindrop detection, while a video completion method is employed to remove it. The performance of these methods may be satisfactory in certain video scenes; however, a straightforward application to the case of a single image is not possible.

Eigen et al. [56] and DeRain [57] are the only methods specifically designed for this task. Eigen et al. [56] employ a shallow network with three layers, which works for sparse and small raindrops. However, it fails to remove drops that are dense or large, and the output of the network is not sharp. More recently, Qian et al. proposed DeRain [57], which uses a GAN as the backbone with LSTMs, attention, and both global and local assessments. Finally, we also compare with a general method, namely, Pix2Pix [58], which maps the input image to the output one by minimizing a loss function. In contrast, we use attention to remove the raindrops without employing multiple backbone networks or loss functions.

2.4 JPEG Compression

JPEG artifact removal algorithms fall into two categories: 1) deblocking-oriented and 2) restoration-oriented. Deblocking-oriented approaches remove the blocking artifacts in the spatial domain by the use of adaptive filters [59, 60], while in the frequency domain, transforms and thresholds are used at multiple scales to eliminate the artifacts, e.g., the Pointwise Shape-Adaptive DCT (SA-DCT) [17]. Although deblocking methods remove artifacts, they fail to generate sharp edges and tend to produce overly smooth textures.

On the other hand, the restoration-based approach considers the compression as a form of distortion; it includes the sparse-coding-based method [61], the Regression Tree Fields based method (RTF) [62], and CNN-based methods [63, 12]. Recently, Artifacts Reduction Convolutional Neural Networks (AR-CNN) [63] were proposed, inspired by SRCNN [35] and sharing a similar architecture except for having more feature layers. AR-CNN [63] lags behind DnCNN [12] in performance. Contrary to the mentioned networks, our method is useful in suppressing artifacts and preserving edges and sharp details via feature attention, merge-and-run units, and enhanced residual modules.

3 CNN restorer

3.1 Network Architecture

Our model is composed of three main modules, i.e., feature extraction, feature learning residual on the residual, and reconstruction, as shown in Figure 2. Let us consider $x$ a degraded input image and $\hat{y}$ the restored output image. Our feature extraction module is composed of only one convolutional layer to extract initial features $f_0$ from the noisy input:

$$f_0 = M_e(x),$$

where $M_e(\cdot)$ performs convolution on the noisy input image. Next, $f_0$ is passed on to the feature learning residual on the residual module, termed $M_{fl}$,

$$f_r = M_{fl}(f_0),$$

where $f_r$ are the learned features and $M_{fl}(\cdot)$ is the main feature learning residual on the residual component, composed of enhancement attention modules (EAM) that are cascaded together as shown in Figure 2. Our network has a small depth but provides a wide receptive field through kernel dilation in each of the first two branches of convolutions in the EAM. The output features of the final layer are fed to the reconstruction module, which is again composed of one convolutional layer:

$$\hat{y} = M_r(f_r),$$

where $M_r(\cdot)$ denotes the reconstruction layer.

There are several choices available for the loss function to optimize, such as $\ell_2$ [12, 13, 30], perceptual loss [31, 14], total variation loss [31], and asymmetric loss [14]. Some networks [31, 14] make use of more than one loss to optimize the model. Contrary to earlier networks, we only employ one loss, i.e., $\ell_1$. Now, given a batch of $N$ training pairs, $\{x_i, y_i\}_{i=1}^{N}$, where $x_i$ is the noisy input and $y_i$ is the ground truth, the aim is to minimize the loss function

$$L(\mathcal{W}) = \frac{1}{N}\sum_{i=1}^{N} \lVert \mathrm{R^2Net}(x_i) - y_i \rVert_1,$$

where $\mathrm{R^2Net}(\cdot)$ is our network and $\mathcal{W}$ denotes the set of all the network parameters learned. Our feature extraction and reconstruction modules resemble those of previous algorithms [35, 30]. We now focus on the feature learning residual on the residual block and feature attention.
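As a minimal sketch of the training objective described above, the snippet below averages a per-image absolute-error sum over a batch. It assumes the single loss is a mean absolute error (an $\ell_1$ objective); the function name `l1_loss` and the flattened-list image representation are illustrative choices, not the authors' implementation.

```python
def l1_loss(predictions, targets):
    """Average over the batch of the per-image sum of absolute errors.

    Each image is a flattened list of pixel values (an assumption made
    here to keep the sketch dependency-free).
    """
    if len(predictions) != len(targets):
        raise ValueError("batch sizes differ")
    per_image = []
    for prediction, target in zip(predictions, targets):
        if len(prediction) != len(target):
            raise ValueError("image sizes differ")
        per_image.append(sum(abs(p - t) for p, t in zip(prediction, target)))
    return sum(per_image) / len(per_image)


batch_output = [[0.0, 1.0], [0.5, 0.5]]   # hypothetical network outputs
batch_truth = [[0.0, 0.0], [1.0, 0.5]]    # hypothetical ground-truth images
loss = l1_loss(batch_output, batch_truth)
```

In practice the same quantity would be computed by the training framework on tensors; the plain-list version only shows which terms are summed and which are averaged.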

3.2 Feature Learning Residual on the Residual

In this section, we provide more detail on the enhancement attention modules, which use a residual on the residual structure with local skip and short skip connections. Each EAM is further composed of blocks, followed by feature attention. Thanks to the residual on the residual architecture, very deep networks are now possible that improve denoising performance; however, we restrict our model to four EAM modules only. The first part of the EAM covers the full receptive field of the input features, followed by learning on the features; then, the features are compressed for speed, and finally, a feature attention module enhances the weights of essential features from the maps. The first part of the EAM is realized using a novel merge-and-run unit (MRU), as shown in the second row of Figure 2. The input features are branched and passed through two parallel dilated convolutions, then concatenated and passed through another convolution. Next, the features are learned using a residual block (RB) of two convolutions, while compression is achieved by an enhanced residual block (ERB) of three convolutional layers. The last layer of the ERB flattens the features by applying a 1×1 kernel. Finally, the output of the feature attention unit is added to the input of the EAM.
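To illustrate why the dilated branches of the merge-and-run unit widen the receptive field without adding depth, here is a small one-dimensional sketch. The dilation rates and kernel weights are hypothetical, since the text does not state them; the formula for the effective kernel size, k + (k − 1)(d − 1), is the standard one for dilated convolutions.

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by a dilated kernel: k + (k - 1) * (d - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation=1):
    """Zero-padded, same-size 1-D dilated cross-correlation.

    Assumes an odd kernel length so the kernel can be centered.
    """
    half = (len(kernel) // 2) * dilation
    output = []
    for i in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = i - half + k * dilation
            if 0 <= j < len(signal):  # positions outside the signal count as zero
                acc += weight * signal[j]
        output.append(acc)
    return output


signal = [1.0, 2.0, 3.0, 4.0, 5.0]
dense = dilated_conv1d(signal, [1.0, 1.0, 1.0], dilation=1)    # sees 3 neighbors
dilated = dilated_conv1d(signal, [1.0, 1.0, 1.0], dilation=2)  # spans 5 positions
```

Both calls use three weights, but the dilated one reaches two positions away on each side, which is the effect the two parallel branches exploit.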

In image recognition, residual blocks [64] are often stacked together to construct a network of more than 1000 layers. Similarly, in image super-resolution, EDSR [41] stacked residual blocks and used long skip connections (LSC) to form a very deep network. However, to date, very deep networks have not been investigated for denoising. Motivated by the success of [47], we introduce the residual on the residual as a basic module for our network to construct deeper systems. Now consider that the $m$-th module of the EAM is given as

$$R_m = \mathrm{EAM}_m(\mathrm{EAM}_{m-1}(\cdots(\mathrm{EAM}_1(R_0))\cdots)),$$

where $R_m$ is the output of the $m$-th feature learning module and $R_0$ denotes the features produced by the feature extraction layer; in other words, $R_m = \mathrm{EAM}_m(R_{m-1})$. The output of each EAM is added to the input of the group as $R_m \leftarrow R_m + R_{m-1}$. The learned features, i.e., the output of the last EAM, are passed to the reconstruction layer to output the same number of channels as the input of the network. Furthermore, we use a long skip connection to add the input image $x$ to the final network output as

$$x + \mathcal{F}(x; W_{w,b}),$$

where $\mathcal{F}(\cdot)$ denotes the mapping computed by the network before the long skip connection and $W_{w,b}$ are the weights and biases learned in the group. This addition, i.e., the LSC, eases the flow of information across groups and helps in learning the residual (degradation) rather than the image. This technique leads to faster learning than learning the original image, thanks to the sparse representation of the degradation.
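The skip-connection pattern described above can be sketched as follows. This is only a structural illustration: `extract`, `modules`, and `reconstruct` are hypothetical stand-ins for the feature extraction layer, the cascaded EAMs, and the reconstruction layer, and the toy modules simply scale their input.

```python
def add(a, b):
    """Element-wise sum of two equally sized feature lists."""
    return [x + y for x, y in zip(a, b)]


def residual_on_residual(x, extract, modules, reconstruct):
    """Cascade modules with short skips, then add the input back (long skip)."""
    features = extract(x)
    for module in modules:
        # Short skip: the module output is added to the module input.
        features = add(features, module(features))
    # Long skip: the network only has to model the residual (degradation).
    return add(x, reconstruct(features))


x = [1.0, 2.0]
toy_module = lambda f: [0.5 * v for v in f]  # hypothetical stand-in for an EAM
output = residual_on_residual(
    x,
    extract=lambda v: list(v),
    modules=[toy_module, toy_module],
    reconstruct=lambda f: list(f),
)
```

One consequence of this layout is that if the reconstruction branch outputs zeros, the whole network reduces to the identity, so the learned part represents only the correction to the input.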

3.2.1 Feature Attention

Fig. 3: The feature attention mechanism for selecting the essential features.

TABLE I: Investigation of skip connections and feature attention. The best result in PSNR (dB) on BSD68 [7] in 2 iterations is presented.
Components: Long skip connection (LSC), Short skip connection (SSC), Long connection (LC), Feature attention (FA)
PSNR (in dB): 28.45 28.77 28.81 28.86 28.52 28.85 28.86 28.90 28.96

This section provides information about the feature attention mechanism. Attention [65] has been around for some time; however, it has not been employed in image denoising. Channel features in image denoising methods are treated equally, which is not appropriate for many cases. To exploit and learn the critical content of the image, we focus attention on the relationship between the channel features; hence, the name: feature attention (see Figure 3).

An important question here is how to generate attention differently for each channel-wise feature. Images generally can be considered as having low-frequency regions (smooth or flat areas) and high-frequency regions (e.g., lines, edges, and texture). As convolutional layers exploit local information only and are unable to utilize global contextual information, we first employ global average pooling to express the statistics denoting the whole image; other options for aggregating the features can also be explored to represent the image descriptor. Let $f_c$ be the output features of the last convolutional layer having $c$ feature maps of size $h \times w$; global average pooling will reduce the size from $h \times w \times c$ to $1 \times 1 \times c$ as:

$$g_p = \frac{1}{h \times w} \sum_{i=1}^{h} \sum_{j=1}^{w} f_c(i, j),$$

where $f_c(i, j)$ is the feature value at position $(i, j)$ in the feature maps.

Furthermore, as investigated in [66], we propose a self-gating mechanism to capture the channel dependencies from the descriptor retrieved by global average pooling. According to [66], the mentioned mechanism must learn the nonlinear synergies between channels as well as mutually-exclusive relationships. Here, we employ soft-shrinkage and sigmoid functions to implement the gating mechanism. Let $\delta$ and $\alpha$ be the soft-shrinkage and sigmoid operators, respectively. Then the gating mechanism is

$$r_c = \alpha(H_U(\delta(H_D(g_p)))),$$

where $g_p$ is the descriptor obtained by global average pooling, and $H_D$ and $H_U$ are the channel reduction and channel upsampling operators, respectively. The output of the global pooling layer is convolved with a downsampling Conv layer and activated by the soft-shrinkage function. To differentiate the channel features, the output is then fed into an upsampling Conv layer followed by sigmoid activation. Moreover, to compute the statistics, the output of the sigmoid ($r_c$) is adaptively rescaled by the input $f_c$ of the channel features as

$$\hat{f}_c = r_c \times f_c.$$
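The feature attention steps described in this section (global average pooling, channel reduction with soft-shrinkage, channel upsampling with sigmoid, and rescaling) can be sketched without any framework. Everything below is a hedged illustration: the weights are passed in as plain lists, the 1×1 convolutions on the pooled descriptor are written as matrix-vector products, and the soft-shrinkage threshold of 0.5 is an assumed value, not one stated in the text.

```python
import math


def global_average_pool(feature_maps):
    """Reduce each h x w map to a single value (its mean)."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]


def soft_shrink(value, threshold=0.5):
    """Shrink toward zero by `threshold`; small magnitudes become exactly zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def linear(vector, weights, biases):
    """A 1x1 convolution on a 1 x 1 x c descriptor is a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


def feature_attention(feature_maps, w_down, b_down, w_up, b_up, threshold=0.5):
    descriptor = global_average_pool(feature_maps)
    reduced = [soft_shrink(v, threshold) for v in linear(descriptor, w_down, b_down)]
    gates = [sigmoid(v) for v in linear(reduced, w_up, b_up)]
    rescaled = [[[gate * v for v in row] for row in fmap]
                for gate, fmap in zip(gates, feature_maps)]
    return rescaled, gates


maps = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [0.0, 0.0]]]  # two 2x2 channels
rescaled, gates = feature_attention(
    maps,
    w_down=[[1.0, 0.0]], b_down=[0.0],            # hypothetical 2 -> 1 reduction
    w_up=[[0.0], [0.0]], b_up=[0.0, 0.0],         # hypothetical 1 -> 2 expansion
)
```

With the all-zero expansion weights used here, every gate equals sigmoid(0) = 0.5, so each channel is simply halved; learned weights would instead produce a different gate per channel.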


3.3 Implementation

Our proposed model contains four EAM blocks. The kernel size for each convolutional layer is set to 3×3, except the last Conv layer in the enhanced residual block and those of the feature attention units, where the kernel size is 1×1. Zero padding is used for the 3×3 layers so that the output feature maps keep the same size. The number of channels for each convolutional layer is fixed at 64, except for the feature attention downscaling layers, which are reduced by a factor of 16 and hence have only four feature maps. The final convolutional layer outputs either three feature maps or one, depending on the input. As for running time, our method takes about 0.2 seconds to process a image.
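The channel arithmetic in this subsection can be checked with a short parameter count. This is a sketch under stated assumptions: regular layers are taken to use 3×3 kernels, the feature attention layers 1×1 kernels, and every layer is assumed to have a bias term; `conv_params` is a hypothetical helper, not part of any library.

```python
def conv_params(in_channels, out_channels, kernel_size, bias=True):
    """Number of learnable parameters in a square-kernel convolution."""
    weights = in_channels * out_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)


CHANNELS = 64      # channels per convolutional layer (from the text)
REDUCTION = 16     # feature attention reduction factor (from the text)
reduced_channels = CHANNELS // REDUCTION  # four feature maps

regular_conv = conv_params(CHANNELS, CHANNELS, 3)            # assumed 3x3 kernel
attention_down = conv_params(CHANNELS, reduced_channels, 1)  # 64 -> 4
attention_up = conv_params(reduced_channels, CHANNELS, 1)    # 4 -> 64
attention_total = attention_down + attention_up
```

Under these assumptions one feature attention unit costs a few hundred parameters, against tens of thousands for a single regular layer, which is consistent with the claim that the attention mechanism adds little overhead.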

4 Experiments

4.1 Training settings

To generate noisy synthetic images, we employ BSD500 [67], DIV2K [68], and MIT-Adobe FiveK [69], resulting in 4k images, while for real noisy images, we use cropped patches of from SIDD [70], Poly [71], and RENOIR [72]. Data augmentation is performed on the training images and includes random rotations of 90°, 180°, and 270° and horizontal flipping. In each training batch, 32 patches are extracted as inputs with a size of . Adam [73] is used as the optimizer with default parameters. The learning rate is initially set to and then halved after iterations. The network is implemented in the PyTorch framework and trained with an Nvidia Tesla V100 GPU. Furthermore, we use PSNR as the evaluation metric.
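The augmentation described above (rotations by multiples of 90 degrees plus horizontal flipping) can be sketched on a patch stored as a list of rows. The helper names are hypothetical, and two details are assumptions: that "no rotation" is one of the random choices and that the flip is applied with probability 0.5.

```python
import random


def rotate90(patch):
    """Rotate a 2-D patch (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*patch[::-1])]


def flip_horizontal(patch):
    """Mirror each row left to right."""
    return [row[::-1] for row in patch]


def augment(patch, rng):
    """Apply a random rotation (0, 90, 180, or 270 degrees) and a random flip."""
    for _ in range(rng.choice([0, 1, 2, 3])):
        patch = rotate90(patch)
    if rng.random() < 0.5:  # assumed flip probability
        patch = flip_horizontal(patch)
    return patch


rng = random.Random(0)
patch = [[1, 2], [3, 4]]
augmented = augment(patch, rng)
```

All eight combinations of rotation and flip preserve the pixel values and only rearrange them, so the same transform can be applied to the noisy patch and its ground truth to keep the pair aligned.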

4.2 Ablation Studies

4.2.1 Influence of the skip connections

Skip connections play a crucial role in our network. Here, we demonstrate the effectiveness of the skip connections. Our model is composed of three basic types of connections, which include long skip connection (LSC), short skip connections (SSC), and local connections (LC). Table I shows the average PSNR for the BSD68 [7] dataset. The highest performance is obtained when all the skip connections are available while the performance is lower when any connection is absent. We also observed that increasing the depth of the network in the absence of skip connections does not benefit performance.

Noise Methods
15 31.08 31.32 31.19 31.42 31.44 31.73 31.63 31.52 31.63 31.81
25 28.57 28.83 28.68 28.92 29.04 29.23 29.15 29.03 29.23 29.34
50 25.62 25.83 25.67 26.01 26.06 26.23 26.19 26.07 26.29 26.40
TABLE II: The similarity between the denoised and the clean images of the BSD68 dataset [7] for our method and competing methods, measured in terms of average PSNR for σ = 15, 25, and 50 on grayscale images.
Noise level  CBM3D [75]  MLP [29]  TNRD [11]  DnCNN [12]  IrCNN [13]  CNLNet [28]  FFDNet [34]  Ours
15           33.50       -         31.37      33.89       33.86       33.69        33.87        34.01
25           30.69       28.92     28.88      31.33       31.16       30.96        31.21        31.37
50           27.37       26.00     25.94      27.97       27.86       27.64        27.96        28.14
TABLE III: Performance comparison (PSNR in dB) between our network and existing state-of-the-art algorithms on the color version of the BSD68 dataset [7].
Methods       σ = 15   σ = 25   σ = 50
BM3D [1]      32.37    29.97    26.72
WNNM [3]      32.70    30.26    27.05
EPLL [22]     32.14    29.69    26.47
MLP [29]      -        30.03    26.78
CSF [27]      32.32    29.84    -
TNRD [11]     32.50    30.06    26.81
DnCNN [12]    32.86    30.44    27.18
IrCNN [13]    32.77    30.38    27.14
FFDNet [34]   32.75    30.43    27.32
Ours          32.91    30.60    27.43
TABLE IV: The quantitative comparison between denoising algorithms on 12 classical images (in terms of PSNR, in dB). The best results are highlighted as bold.

4.2.2 Feature-attention

Another important aspect of our network is feature attention. Table I compares the PSNR values of the networks with and without feature attention. The results support our claim about the benefit of using feature attention. Since the inception of DnCNN [12], the CNN models have matured, and further performance improvement requires the careful design of blocks and rescaling of the feature maps. The two mentioned characteristics are present in our model in the form of feature-attention and skip connections.

4.3 Denoising Comparisons

We evaluate our algorithm using the Peak Signal-to-Noise Ratio (PSNR) index as the error metric and compare against many state-of-the-art competitive algorithms, which include traditional methods, i.e., CBM3D [1], WNNM [3], EPLL [22], and CSF [27], and CNN-based denoisers, i.e., MLP [29], TNRD [11], DnCNN [12], IrCNN [13], CNLNet [28], FFDNet [34], and CBDNet [14]. To be fair in comparison, we use the default settings of the traditional methods provided by the corresponding authors.
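For reference, PSNR can be computed as below. This is a minimal sketch on flattened pixel lists, assuming 8-bit images with a peak value of 255; the exact evaluation script used for the reported numbers is not given in the text.

```python
import math


def psnr(reference, restored, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB: 10 * log10(peak^2 / MSE)."""
    if len(reference) != len(restored):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)


reference = [0.0, 100.0]
restored = [25.5, 74.5]  # every pixel is off by 25.5, i.e. 10% of the peak
value = psnr(reference, restored)
```

Because the scale is logarithmic, each 10x reduction in mean squared error adds 10 dB, so differences of a few tenths of a dB in the tables correspond to small but consistent reductions in error.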

4.3.1 Denoising Test Datasets

In the experiments, we test on four real-world noisy datasets, i.e., RNI15 [15], DND [76], Nam [77], and SIDD [70]. Furthermore, we prepare three synthetic noisy datasets for testing: the widely used 12 classical images and the color and grayscale versions of the 68 BSD68 [7] images. We corrupt the clean images with additive white Gaussian noise with standard deviations σ of 15, 25, and 50.

  • Classical images: The denoising comparisons would be incomplete without testing on the traditional images. Here, we use 12 classical images for testing.

  • BSD68 [7] is composed of 68 grayscale images (CBSD68 is the same but with color images). The ground-truth images are available, as the degraded dataset is synthetically created.

  • RNI15 [15] provides 15 real-world noisy images. Unfortunately, the clean images are not given for this dataset; therefore, only the qualitative comparison is presented.

  • Nam [77] comprises 11 static scenes and the corresponding noise-free images obtained by taking the mean of 500 noisy images of the same scene. The size of the images is enormous; hence, we cropped the images into patches and randomly selected 110 of those for testing.

  • DnD was recently proposed by Plotz et al. [76] and initially contains 50 pairs of real-world noisy and noise-free scenes. The scenes are further cropped into patches of size by the providers of the dataset, which results in 1000 smaller images. The near noise-free images are not publicly available, and the results (PSNR/SSIM) can only be obtained through the online system introduced by [76].

  • SIDD [70] (Smartphone Image Denoising Dataset) was recently introduced. The authors have collected 30k real noisy images and their corresponding clean images; however, only 320 images are released for training and 1280 image pairs for validation, as the testing images are not released yet. We use the validation images for testing our algorithm and the competitive methods.
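The synthetic test sets listed above are created by adding white Gaussian noise of a fixed standard deviation to clean images. The sketch below shows that corruption step on a flattened pixel list; the function name `add_awgn`, the fixed seed, and the absence of clipping or quantization are assumptions made for illustration.

```python
import random
import statistics


def add_awgn(pixels, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation `sigma` to each pixel."""
    return [p + rng.gauss(0.0, sigma) for p in pixels]


rng = random.Random(0)          # fixed seed so the example is reproducible
clean = [128.0] * 20000         # a flat gray test signal on the 0-255 scale
noisy = add_awgn(clean, sigma=25.0, rng=rng)

noise = [n - c for n, c in zip(noisy, clean)]
measured_sigma = statistics.pstdev(noise)  # should be close to 25
```

The noise here is spatially invariant: every pixel is corrupted with the same distribution regardless of its value, which is exactly the property that real sensor noise does not share.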

Fig. 4: Denoising performance of our R^2Net versus state-of-the-art methods on a color image from [7] for
Panels: Noisy; BM3D [75] (31.68 dB); IRCNN [13] (32.21 dB); DnCNN [12] (32.33 dB); Ours (32.84 dB); GT.

4.3.2 Classical noisy images

Fig. 5: A real noisy example from DND dataset [76] for comparison of our method against the state-of-the-art algorithms.
Panels: Noisy; CBM3D [1] (30.896 dB); WNNM [3] (29.98 dB); NC [15] (30.73 dB); TWSC [78] (29.42 dB); Noisy Image; MCWNNM [79] (30.88 dB); NI [80] (28.43 dB); FFDNet [34] (31.37 dB); CBDNet [14] (31.06 dB); R^2Net (Ours) (32.31 dB).

In this subsection, we evaluate our model on noisy grayscale images corrupted by spatially invariant additive white Gaussian noise. We compare against representative non-local self-similarity models, i.e., BM3D [1] and WNNM [3], and learning-based methods, i.e., EPLL [22], TNRD [11], MLP [29], DnCNN [12], IrCNN [13], and CSF [27].

SET12: In Table IV, we present the PSNR values on Set12. Our method outperforms all the competitive algorithms for all noise levels; this may be due to the larger receptive field in the merge-and-run unit as well as better network modeling capacity.

BSD68: We show the performance of our algorithm against competing methods on BSD68 [7] in Table II. Note that BSD68 [7] and BSD500 [67] are two disjoint sets. Our method shows improvement over all the competitive algorithms for all noise levels. The increase in PSNR indicates a superior network design and better feature learning for denoising tasks. It should be noted that even a marginal improvement on the synthetic noisy datasets is difficult, since according to Levin et al. [24] and Chatterjee et al. [25], synthetic denoising has already approached its optimal limits.

Color noisy images: Next, for noisy color image denoising, we keep all the parameters of the network the same as in the grayscale model, except that the first and last layers are changed to input and output three channels rather than one. Figure 4 presents the visual comparison, and Table III reports the PSNR numbers for our method and the alternative algorithms. Our algorithm consistently outperforms all the other techniques listed in Table III on the CBSD68 dataset [7]. Similarly, our network produces the images with the best perceptual quality, as shown in Figure 4. A closer inspection of the vase reveals that our network generates textures closest to the ground truth, with fewer artifacts and more details.

4.3.3 Real-World noisy images

To assess the practicality of our model, we employ real noise datasets. The evaluation is difficult because of the unknown noise level, the various noise sources (shot noise, quantization noise, etc.), and the imaging pipeline (image resizing, lossy compression, etc.). Furthermore, the noise is spatially variant (non-Gaussian) and signal-dependent; hence, the assumption that noise is spatially invariant, employed by many algorithms, does not hold for real image noise. Therefore, evaluation on real noisy images determines the success of the algorithms in real-world applications.
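To make "signal-dependent" concrete, the sketch below uses a simplified noise model in which the variance grows linearly with pixel intensity. This model, the function names, and the parameter values `gain = 0.01` and `read_variance = 1e-4` are assumptions chosen for illustration; the text does not specify a noise model, and real camera noise is further altered by the imaging pipeline.

```python
import random


def add_signal_dependent_noise(image, gain, read_variance, seed=None):
    """Add zero-mean Gaussian noise whose variance grows with intensity.

    Assumed model (illustration only): var(x) = gain * x + read_variance,
    with pixel values x in [0, 1].
    """
    rng = random.Random(seed)
    return [[x + rng.gauss(0.0, (gain * x + read_variance) ** 0.5) for x in row]
            for row in image]


def residual_std(noisy, clean):
    """Standard deviation of (noisy - clean) over all pixels."""
    diffs = [n - c for noisy_row, clean_row in zip(noisy, clean)
             for n, c in zip(noisy_row, clean_row)]
    mean = sum(diffs) / len(diffs)
    return (sum((d - mean) ** 2 for d in diffs) / len(diffs)) ** 0.5


# Two hypothetical flat patches: a dark one and a bright one.
dark = [[0.1] * 64 for _ in range(64)]
bright = [[0.9] * 64 for _ in range(64)]

dark_std = residual_std(add_signal_dependent_noise(dark, 0.01, 1e-4, seed=0), dark)
bright_std = residual_std(add_signal_dependent_noise(bright, 0.01, 1e-4, seed=1), bright)

# A single global sigma cannot describe both patches.
print(f"dark std = {dark_std:.3f}, bright std = {bright_std:.3f}")
```

Under this assumed model the bright patch is noticeably noisier than the dark one, so a denoiser tuned to one global noise level would either over-smooth dark regions or leave noise in bright ones.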

Method Blind PSNR SSIM
CDnCNNB [12] 32.43 0.7900
EPLL [22] 33.51 0.8244
TNRD [11] 33.65 0.8306
NCSR [81] 34.05 0.8351
MLP [29] 34.23 0.8331
FFDNet [34] 34.40 0.8474
BM3D [1] 34.51 0.8507
FoE [7] 34.62 0.8845
WNNM [3] 34.67 0.8646
NC [15] 35.43 0.8841
NI [80] 35.11 0.8778
CIMM [30] 36.04 0.9136
KSVD [82] 36.49 0.8978
MCWNNM [79] 37.38 0.9294
TWSC [78] 37.96 0.9416
FFDNet+ [34] 37.61 0.9415
CBDNet [14] 38.06 0.9421
RNet (Ours) 39.23 0.9526
TABLE V: The mean PSNR and SSIM denoising results of state-of-the-art algorithms evaluated on the DnD sRGB images [76].
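Table V ranks methods by PSNR and SSIM. For readers less familiar with the first metric, a minimal sketch of the standard PSNR definition, 10 * log10(MAX^2 / MSE), is given here. The function name `psnr`, the flat-list image representation, and the 8-bit peak value of 255 are assumptions for illustration; the benchmark's own evaluation code may differ in details such as color handling.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are given as flat sequences of pixel values.
    """
    if len(reference) != len(estimate):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [100.0, 120.0, 140.0, 160.0]
estimate = [101.0, 119.0, 141.0, 159.0]  # every pixel off by one gray level
print(f"{psnr(reference, estimate):.2f} dB")  # MSE = 1, so about 48.13 dB
```

Because the scale is logarithmic, a gain of 1 dB corresponds to roughly a 21% reduction in mean squared error.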

DnD: Table V presents the quantitative results (PSNR/SSIM) on the sRGB data for the competing algorithms and our method, obtained from the publicly available online DnD benchmark website. The blind Gaussian denoiser DnCNN [12] performs poorly and is unable to achieve better results than BM3D [1] and WNNM [3], due to poor generalization beyond the noise seen during training. Similarly, the non-blind traditional Gaussian denoisers report limited performance, although the noise standard deviation is provided. This may be because these denoisers [1, 3, 22] are tailored for AWGN only, and real noise differs in characteristics from synthetic noise. By incorporating feature attention and capturing the appropriate characteristics of the noise through a novel module, our algorithm leads by a large margin, i.e. 1.17dB PSNR over the second-best method, CBDNet [14]. Furthermore, our algorithm employs only real-noisy images for training with a single loss, while CBDNet [14] uses many techniques such as multiple losses (i.e. total variation and asymmetric learning) and both real-noise and synthetically generated real-noise images. As reported by the authors of CBDNet [14], it achieves 37.72 dB with real-noise images only. Noise Clinic (NC) [15] and Neat Image (NI) [80] are the other two state-of-the-art blind denoisers besides [14]. NI [80] is commercially available as a part of Photoshop and Corel PaintShop. Based on Table V, our network achieves 3.80dB and 4.12dB higher PSNR than NC [15] and NI [80], respectively.

Next, we visually compare the result of our method with the competing methods on the denoised images provided by the online system of Plötz et al. [76] in Figure 5. The PSNR and SSIM values are also taken from the website. From Figure 5, it is clear that the methods of [14, 34, 12] perform poorly in removing the noise from the star, and in some cases the image is over-smoothed. In contrast, our algorithm eliminates the noise while preserving the finer details and structures in the star image.

Noisy DnCNN FFDNet RNet

Fig. 6: A real high noise example from RNI15 dataset [15]. Our method is able to remove the noise in textured and smooth areas without introducing artifacts.

Noisy FFDNet CBDNet RIDNet
Fig. 7: Comparison of our method against the other methods on a real image from RNI15 [15] benchmark containing spatially variant noise.

RNI15: On RNI15 [15], we provide qualitative images only, as the ground-truth images are not available. Figure 7 presents the denoising results on a low noise intensity image. FFDNet [34] and CBDNet [14] are unable to remove the noise completely, as can be seen near the bottom left of the handle and the body of the cup. In contrast, our method removes the noise without introducing any artifacts. We present another example from the RNI15 dataset [15] with high noise in Figure 6. CDnCNN [12] and FFDNet [34] produce limited results, as some noisy elements can be seen near the eye and gloves in the Dog image. In comparison, our algorithm recovers the actual texture and structures without compromising on the removal of noise from the images.

Noisy CBM3D (39.13) IRCNN (33.73)
DnCNN (37.56) CBDNet (40.40) RNet (40.50)
Fig. 8: An image from the Nam dataset [77] with JPEG compression. CBDNet is trained explicitly on JPEG compressed images; still, our method performs better.

Nam: We present the average PSNR scores of the resultant denoised images in Table VII. Unlike CBDNet [14], which is trained on Nam [77] to specifically deal with the JPEG compression, we use the same network to denoise the Nam images [77] and achieve favorable PSNR numbers. Our performance in terms of PSNR is higher than any of the current state-of-the-art algorithms. Furthermore, our claim is supported by the visual quality of the images produced by our model, as shown in Figure 8. The amount of noise present after denoising by our method is negligible as compared to CDnCNN and other counterparts.

25.75 dB 21.97 dB 20.76 dB

19.70 dB 28.84 dB 35.57 dB

Fig. 9: A challenging example from SSID dataset [70]. Our method can remove noise and restore true colors.

SSID: As the last dataset, we employ the SSID real noise dataset, which has the highest number of test (validation) images available. The results in terms of PSNR are shown in the second row of Table VII. Again, it is clear that our method outperforms FFDNet [34] and CBDNet [14] by a margin of 9.5dB and 7.93dB, respectively. In Figure 9, we show the denoised results of a challenging image by different algorithms. Our technique recovers the true colors which are closer to the original pixel values while competing methods are unable to restore original colors and in specific regions induce false colors.

Dataset Scale Bicubic TNRD VDSR DnCNN SRMD CARN RNet RNet+
Set5 2 33.66 / 0.9299 36.86 / 0.9556 37.56 / 0.9591 37.58 / 0.9590 37.79 / 0.9601 37.76 / 0.9590 37.95 / 0.9605 38.07 / 0.9608
Set5 3 30.39 / 0.8682 33.18 / 0.9152 33.67 / 0.9220 33.75 / 0.9222 34.12 / 0.9254 34.29 / 0.9255 34.37 / 0.9269 34.49 / 0.9278
Set5 4 28.42 / 0.8104 30.85 / 0.8732 31.35 / 0.8845 31.40 / 0.8845 31.96 / 0.8925 32.13 / 0.8937 32.15 / 0.8946 32.34 / 0.8970
Set14 2 30.24 / 0.8688 32.51 / 0.9069 33.02 / 0.9128 33.03 / 0.9128 33.32 / 0.9159 33.52 / 0.9166 33.54 / 0.9173 33.63 / 0.9182
Set14 3 27.55 / 0.7742 29.43 / 0.8232 29.77 / 0.8318 29.81 / 0.8321 30.04 / 0.8382 30.29 / 0.8407 30.34 / 0.8419 30.43 / 0.8433
Set14 4 26.00 / 0.7027 27.66 / 0.7563 27.99 / 0.7659 28.04 / 0.7672 28.35 / 0.7787 28.60 / 0.7806 28.62 / 0.7822 28.72 / 0.7842
BSD100 2 29.56 / 0.8431 31.40 / 0.8878 31.89 / 0.8961 31.90 / 0.8961 32.05 / 0.8985 32.09 / 0.8978 32.19 / 0.9001 32.25 / 0.9007
BSD100 3 27.21 / 0.7385 28.50 / 0.7881 28.82 / 0.7980 28.85 / 0.7981 28.97 / 0.8025 29.06 / 0.8034 29.12 / 0.8055 29.17 / 0.8065
BSD100 4 25.96 / 0.6675 27.00 / 0.7140 27.28 / 0.7256 27.29 / 0.7253 27.49 / 0.7337 27.58 / 0.7349 27.60 / 0.7363 27.65 / 0.7376
Urban100 2 26.88 / 0.8403 29.70 / 0.8994 30.76 / 0.9143 30.74 / 0.9139 31.33 / 0.9204 31.92 / 0.9256 32.07 / 0.9280 32.24 / 0.9294
Urban100 3 24.46 / 0.7349 26.42 / 0.8076 27.13 / 0.8283 27.15 / 0.8276 27.57 / 0.8398 28.06 / 0.8493 28.14 / 0.8519 28.28 / 0.8542
Urban100 4 23.14 / 0.6577 24.61 / 0.7291 25.17 / 0.7528 25.20 / 0.7521 25.68 / 0.7731 26.07 / 0.7837 26.18 / 0.7881 26.28 / 0.7905
TABLE VI: The performance of super-resolution algorithms on Set5, Set14, BSD100, and Urban100 datasets for upscaling factors of 2, 3, and 4. Each entry is PSNR / SSIM. The bold highlighted results are the best on single image super-resolution.

4.4 Super-resolution Comparisons


Urban100 “img_72” VDSR LapSRN MS-LapSRN CARN RNet RNet+
Fig. 10: The visual comparisons for 4× super-resolution against several state-of-the-art algorithms on an image from the Urban100 [83] dataset. Our RNet results are the most accurate.

We evaluate our model against networks that aim for efficiency as well as PSNR and that have a similar depth and number of parameters. We compare against TNRD [11], SRMD [37], CARN [42] (which has more than a hundred convolutional layers), etc., as opposed to RCAN [47] and DRLN [50], which have more than 16M parameters, while our model has only 1.49M parameters. We present the performance on the four publicly available datasets given below:

  • Set5 [84] is a classical dataset that contains only five images.

  • Set14 [85] contains 14 RGB images.

  • BSD100 [67] is a subset of the Berkeley Segmentation dataset and consists of one hundred natural images.

  • Urban100 [83] is a recently proposed dataset of 100 images. The images contain human-made objects and buildings. The size of the image and the structures present in the dataset make it very challenging for the super-resolution task.

Datasets BM3D DnCNN FFDNet CBDNet Ours
Nam [77] 37.30 35.55 38.7 39.01 39.09
SSID [70] 30.88 26.21 29.20 30.78 38.71
TABLE VII: Quantitative results for the SSID [70] & Nam [77].

4.4.1 RNet for SR

The RNet is modified for super-resolution because the final output is larger than the input. Two modifications are made to the network structure: 1) an additional layer (upsampling layer) is inserted before the final convolutional layer to super-resolve the input image to the desired resolution, and 2) residual learning is performed by changing the position of the long skip connection (i.e. the output of the feature extraction layer is added to the output of the final EAM block), as shown in Figure 11. The second modification is due to the change in the size of the features after upsampling. The input size to the network is 48×48 for super-resolution. Except for the mentioned modifications, no additional changes are made to the network.
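The first modification can be illustrated with a sub-pixel (pixel shuffle) rearrangement, one common way to implement an upsampling layer. The text does not state which upsampling operator is used, so the choice of pixel shuffle, the function name `pixel_shuffle`, and the single-output-channel setup are assumptions; the sketch only shows how features computed on a 48×48 patch become a 192×192 map for a 4× scale.

```python
def pixel_shuffle(features, scale):
    """Rearrange a [C*scale*scale][H][W] nested list into [C][H*scale][W*scale].

    Channel c*scale*scale + i*scale + j of the input supplies the output
    pixels at row offset i and column offset j inside each scale x scale cell.
    """
    channels_in = len(features)
    height, width = len(features[0]), len(features[0][0])
    if channels_in % (scale * scale) != 0:
        raise ValueError("channel count must be divisible by scale squared")
    channels_out = channels_in // (scale * scale)
    out = [[[0.0] * (width * scale) for _ in range(height * scale)]
           for _ in range(channels_out)]
    for c in range(channels_out):
        for i in range(scale):
            for j in range(scale):
                source = features[c * scale * scale + i * scale + j]
                for h in range(height):
                    for w in range(width):
                        out[c][h * scale + i][w * scale + j] = source[h][w]
    return out


# Hypothetical feature tensor: 16 channels of size 48x48, channel k filled with k.
features = [[[float(k)] * 48 for _ in range(48)] for k in range(16)]
upscaled = pixel_shuffle(features, scale=4)
print(len(upscaled), len(upscaled[0]), len(upscaled[0][0]))  # 1 192 192
```

Since the spatial size changes from 48×48 to 192×192 at this point, a skip connection carrying 48×48 features cannot be added after the upsampling step, which is consistent with the stated reason for moving the long skip connection.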

Fig. 11: The architecture for super-resolution with two modifications: the change in position of long skip connection and insertion of upsampling layer.

4.4.2 Visual Comparisons

We furnish an image from Urban100 [83] for qualitative comparison in Figure 10 against the algorithms that have a similar number of parameters and aim to provide efficient solutions for super-resolution. It is evident from Figure 10 that all the competing methods fail to recover the straight lines forming rectangles shown in the cropped regions of the “img_72” image. Our method performs better than the current state-of-the-art CARN [42]. Moreover, our network produces results that are faithful to the ground-truth image and free of blurring.

4.4.3 Quantitative Comparisons

Table VI shows the average PSNR and SSIM values for the mentioned four datasets against several state-of-the-art algorithms. Our algorithm outperforms all the methods for all scaling factors on all datasets. Our method is very lightweight compared to the recently published CARN [42], i.e. it contains 4× fewer layers than CARN [42]. The performance of our network becomes comparatively better as the number of images in the testing datasets and the scaling factor increase. Similarly, our RNet achieves significantly better results than VDSR [38], which was state-of-the-art until a year ago.

4.5 Raindrop Removal Comparisons

In this section, we present the performance of RNet on raindrop removal and compare with three state-of-the-art algorithms, namely Eigen et al. [56], Pix2Pix [58], and DeRain [57], on the two test datasets introduced in [57], termed “Test_a” and “Test_b”. We use the same number of training images as DeRain [57]; however, we train using cropped patches instead of whole images.
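Training on cropped patches requires that the degraded image and its clean counterpart are cut at the same location. A minimal sketch of such a paired random crop follows; the function name `random_paired_crop`, the patch size of 4, and the toy 8×10 images are assumptions for illustration, since the text does not give the patch size or the cropping procedure.

```python
import random


def random_paired_crop(degraded, clean, patch_size, rng):
    """Cut the same randomly placed square patch from both images."""
    height, width = len(clean), len(clean[0])
    if len(degraded) != height or len(degraded[0]) != width:
        raise ValueError("degraded and clean images must have the same size")
    if patch_size > min(height, width):
        raise ValueError("patch_size exceeds the image size")
    top = rng.randrange(height - patch_size + 1)
    left = rng.randrange(width - patch_size + 1)

    def crop(image):
        return [row[left:left + patch_size] for row in image[top:top + patch_size]]

    return crop(degraded), crop(clean)


# Toy 8x10 pair: each pixel stores its own (row, column) so alignment is visible.
clean = [[(y, x) for x in range(10)] for y in range(8)]
degraded = [[(y, x) for x in range(10)] for y in range(8)]

rng = random.Random(0)
rain_patch, clean_patch = random_paired_crop(degraded, clean, patch_size=4, rng=rng)
print(len(rain_patch), len(rain_patch[0]))  # 4 4
```

Passing an explicit `random.Random` instance keeps the crop positions reproducible across runs.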

Datasets Eigen Pix2pix DeRain RNet (Ours)
Test_a 28.59 / 0.6726 30.14 / 0.8299 31.57 / 0.9023 32.03 / 0.9325
Test_b - - 24.93 / 0.8091 26.42 / 0.8255
TABLE VIII: The average PSNR(dB)/SSIM from different methods on raindrop [57] dataset.
22.25dB 28.14dB
Rain DeRain [57]

Rain Image RNet GT
Fig. 12: The visual comparisons on a rainy image. The figure shows the plate, which is affected by raindrops. Our method is consistent in restoring raindrop-affected areas.
24.73dB 29.35dB
Rain DeRain [57]

Rain Image RNet GT

Fig. 13: Another example of a rainy image. The cropped region shows the road sign affected by raindrops. Our method recovers the distorted colors closer to the ground-truth.

4.5.1 Visual Comparisons

Figure 12 presents an example image from the “Test_b” dataset, showing the cropped region near the front end of the car. DeRain [57] fails to remove the effect of the raindrop and results in the same image as the input. On the other hand, our method restores the edges in the input rainy image.

The second example in Figure 13 shows a rainy urban scene. We crop the road sign to better visualize the differences between the output of RNet and competing methods. The DeRain [57] network removes the edges and color information where the raindrop affected the road sign. In our case, both the edges and the color are restored and are closer to the ground-truth clean image.

4.5.2 Quantitative Comparisons

Table VIII presents the quantitative results on both “Test_a” and “Test_b” for the mentioned algorithms. Compared to the recent state-of-the-art in raindrop removal, i.e. DeRain [57], our gain is 0.46dB on “Test_a” and a significant 1.49dB on the challenging “Test_b”. Similarly, the gain over Eigen et al. [56] on “Test_a” is about 3.44dB. These results illustrate that our method can restore images that are similar in structure to the corresponding ground-truth images.

4.6 JPEG Comparisons

For JPEG compression, our method is compared against three competing methods, which include AR-CNN [63], TNRD [11], and DnCNN [12]. All models are trained for four quality factors of 10, 20, 30, and 40 except for TNRD [11], which is only trained for the first three JPEG quality factors.

Quality AR-CNN TNRD DnCNN RNet (Ours)
10 28.98 / 0.8076 29.15 / 0.8111 29.19 / 0.8123 29.17 / 0.8202
20 31.29 / 0.8733 31.46 / 0.8769 31.59 / 0.8802 32.28 / 0.8957
30 32.67 / 0.9043 32.84 / 0.9059 32.98 / 0.9090 33.61 / 0.9206
40 33.63 / 0.9198 - 33.96 / 0.9247 34.66 / 0.9352
TABLE IX: Average PSNR/SSIM for JPEG image deblocking for quality factors of 10, 20, 30, and 40 on LIVE1 [86] dataset. The best results are in bold.

37.20dB 39.27dB
JPEG Monarch Image DnCNN [12] RNet

Fig. 14: A sample image of a Monarch with JPEG artifacts at a quality factor of 20. Our RNet restores texture correctly, specifically the line, as shown in the zoomed version of the restored patch.

4.6.1 Visual Comparisons

In Figure 14, we show a comparison of our method on the “Monarch” image. Our network can retrieve the fine details such as the straight line in the wing shown in the close up while ARCNN [63] and DnCNN [12] fail to achieve the desired results and produce distorted lines.

Similarly, in Figure 15, the “Parrot” image, it can be observed that our model output has fewer artifacts and restores structures more accurately on the face of the parrot. On the other hand, ARCNN [63] and DnCNN [12] smooth out the texture and lines present in the ground-truth images. These outcomes show the importance of our attention mechanism and the enhanced capacity of the proposed model.


34.20dB 35.73dB
JPEG Parrot Image DnCNN [12] RNet

Fig. 15: A different example of the artifact image removal for a quality factor of 20. RNet restores the texture accurately on the face of the parrot.

4.6.2 Quantitative Comparisons

The JPEG deblocking average results in terms of PSNR and SSIM are listed in Table IX for different methods. Our gain over ARCNN [63] and DnCNN [12] for a quality factor of 40 is significant, i.e. 1.03dB and 0.7dB, respectively. Similarly, the overall improvement of RNet on the LIVE1 dataset, averaged over all four quality factors, is 0.79dB (over [63]) and 0.5dB (over [12]).
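The gains quoted here can be reproduced from the PSNR entries of Table IX with a few lines of arithmetic. In the sketch below, the dictionary layout and helper names are hypothetical, and the assignment of table columns to methods (AR-CNN first, DnCNN third, RNet last) is inferred from the order in which the methods are introduced and from the missing TNRD entry at quality 40.

```python
# PSNR (dB) on LIVE1 copied from Table IX, keyed by JPEG quality factor.
psnr_table = {
    "AR-CNN": {10: 28.98, 20: 31.29, 30: 32.67, 40: 33.63},
    "DnCNN": {10: 29.19, 20: 31.59, 30: 32.98, 40: 33.96},
    "RNet": {10: 29.17, 20: 32.28, 30: 33.61, 40: 34.66},
}


def gain_at(method, quality):
    """PSNR gain of RNet over `method` at one quality factor."""
    return psnr_table["RNet"][quality] - psnr_table[method][quality]


def mean_gain(method):
    """PSNR gain of RNet over `method`, averaged over all quality factors."""
    qualities = sorted(psnr_table[method])
    return sum(gain_at(method, q) for q in qualities) / len(qualities)


for method in ("AR-CNN", "DnCNN"):
    print(f"{method}: +{gain_at(method, 40):.2f} dB at quality 40, "
          f"+{mean_gain(method):.2f} dB on average")
```

The same table also shows that the gain is not uniform: at quality 10 the RNet PSNR (29.17) is slightly below that of DnCNN (29.19), while its SSIM is higher.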

5 Conclusion

In this paper, we present a new CNN restoration model for real degraded photographs. This is the first end-to-end single pass network to show state-of-the-art results across a broad range of real image restoration and enhancement tasks. Specifically, we show results on denoising, super-resolution, raindrop removal, and compression artifacts.

Unlike previous algorithms, our model is a single-stage blind restoration network for real degraded images. We propose a new restoration module to learn the features and, to further enhance the capability of the network, adopt feature attention to rescale the channel-wise features by taking into account the dependencies between the channels. We also use LSC, SSC, and SC to allow low-frequency information to bypass so the network can focus on residual learning. Extensive experiments on 11 real-degraded datasets for four restoration tasks against more than 30 state-of-the-art algorithms demonstrate the effectiveness of our proposed model.


  • [1] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3-D transform-domain collaborative filtering,” 2007.
  • [2] A. Buades, B. Coll, and J.-M. Morel, “A non-local algorithm for image denoising,” in CVPR, 2005.
  • [3] S. Gu, L. Zhang, W. Zuo, and X. Feng, “Weighted nuclear norm minimization with application to image denoising,” in CVPR, 2014.
  • [4] Y. Peng, A. Ganesh, J. Wright, W. Xu, and Y. Ma, “Rasl: Robust alignment by sparse and low-rank decomposition for linearly correlated images,” TPAMI, 2012.
  • [5] J. Xu and S. Osher, “Iterative regularization and nonlinear inverse scale space applied to wavelet-based denoising,” TIP, 2007.
  • [6] Y. Weiss and W. T. Freeman, “What makes a good model of natural images?” in CVPR, 2007.
  • [7] S. Roth and M. J. Black, “Fields of experts,” IJCV, 2009.
  • [8] S. Anwar, F. Porikli, and C. P. Huynh, “Category-specific object image denoising,” TIP, 2017.
  • [9] H. Yue, X. Sun, J. Yang, and F. Wu, “Cid: Combined image denoising in spatial and frequency domains using web images,” in CVPR, 2014.
  • [10] E. Luo, S. H. Chan, and T. Q. Nguyen, “Adaptive image denoising by targeted databases,” TIP, 2015.
  • [11] Y. Chen and T. Pock, “Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration,” TPAMI, 2017.
  • [12] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” TIP, 2017.
  • [13] K. Zhang, W. Zuo, S. Gu, and L. Zhang, “Learning deep cnn denoiser prior for image restoration,” CVPR, 2017.
  • [14] S. Guo, Z. Yan, K. Zhang, W. Zuo, and L. Zhang, “Toward convolutional blind denoising of real photographs,” CVPR, 2018.
  • [15] M. Lebrun, M. Colom, and J.-M. Morel, “The noise clinic: a blind image denoising algorithm,” IPOL, 2015.
  • [16] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” TNN, 1994.
  • [17] A. Foi, V. Katkovnik, and K. Egiazarian, “Pointwise shape-adaptive DCT for high-quality denoising and deblocking of grayscale and color images,” TIP, 2007.
  • [18] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “BM3D image denoising with shape-adaptive principal component analysis,” in Signal Processing with Adaptive Sparse Structured Representations, 2009.
  • [19] M. Lebrun, A. Buades, and J.-M. Morel, “A nonlocal bayesian image denoising algorithm,” SIAM JIS, 2013.
  • [20] M. Elad and D. Datsenko, “Example-based regularization deployed to super-resolution reconstruction of a single image,” Comput. J., 2009.
  • [21] W. Dong, X. Li, D. Zhang, and G. Shi, “Sparsity-based image denoising via dictionary learning and structural clustering,” in CVPR, 2011.
  • [22] D. Zoran and Y. Weiss, “From learning models of natural image patches to whole image restoration,” in ICCV, 2011.
  • [23] J. Xu, L. Zhang, W. Zuo, D. Zhang, and X. Feng, “Patch Group Based Nonlocal Self-Similarity Prior Learning for Image Denoising,” in ICCV, 2015.
  • [24] A. Levin and B. Nadler, “Natural image denoising: Optimality and inherent bounds,” in CVPR, 2011.
  • [25] P. Chatterjee and P. Milanfar, “Is denoising dead?” TIP, 2010.
  • [26] S. Anwar, C. Huynh, and F. Porikli, “Combined internal and external category-specific image denoising,” in BMVC, 2017.
  • [27] U. Schmidt and S. Roth, “Shrinkage fields for effective image restoration,” in CVPR, 2014.
  • [28] S. Lefkimmiatis, “Non-local color image denoising with convolutional neural networks,” CVPR, 2016.
  • [29] H. C. Burger, C. J. Schuler, and S. Harmeling, “Image denoising: Can plain neural networks compete with bm3d?” in CVPR, 2012.
  • [30] S. Anwar, C. P. Huynh, and F. Porikli, “Chaining identity mapping modules for image denoising,” arXiv, 2017.
  • [31] J. Jiao, W.-C. Tu, S. He, and R. W. Lau, “Formresnet: Formatted residual learning for image restoration,” in CVPR Workshops, 2017.
  • [32] T. Plötz and S. Roth, “Neural nearest neighbors networks,” in NIPS, 2018.
  • [33] T. Brooks, B. Mildenhall, T. Xue, J. Chen, D. Sharlet, and J. T. Barron, “Unprocessing images for learned raw denoising,” in CVPR, 2019.
  • [34] K. Zhang, W. Zuo, and L. Zhang, “Ffdnet: Toward a fast and flexible solution for cnn-based image denoising,” TIP, 2018.
  • [35] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” TPAMI, 2016.
  • [36] C. Dong, C. C. Loy, and X. Tang, “Accelerating the super-resolution convolutional neural network,” in ECCV, 2016.
  • [37] K. Zhang, W. Zuo, and L. Zhang, “Learning a single convolutional super-resolution network for multiple degradations,” in CVPR, 2018.
  • [38] J. Kim, J. Kwon Lee, and K. Mu Lee, “Accurate image super-resolution using very deep convolutional networks,” in CVPR, 2016.
  • [39] ——, “Deeply-recursive convolutional network for image super-resolution,” in CVPR, 2016.
  • [40] Y. Tai, J. Yang, and X. Liu, “Image super-resolution via deep recursive residual network,” in CVPR, 2017.
  • [41] B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced deep residual networks for single image super-resolution,” in CVPR workshops, 2017.
  • [42] N. Ahn, B. Kang, and K.-A. Sohn, “Fast, accurate, and lightweight super-resolution with cascading residual network,” ECCV, 2018.
  • [43] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in CVPR, 2017.
  • [44] T. Tong, G. Li, X. Liu, and Q. Gao, “Image super-resolution using dense skip connections,” in ICCV, 2017.
  • [45] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense network for image super-resolution,” in CVPR, 2018.
  • [46] M. Haris, G. Shakhnarovich, and N. Ukita, “Deep backprojection networks for super-resolution,” in CVPR, 2018.
  • [47] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” ECCV, 2018.
  • [48] V. Mnih, N. Heess, A. Graves et al., “Recurrent models of visual attention,” in NIPS, 2014.
  • [49] J.-H. Kim, J.-H. Choi, M. Cheon, and J.-S. Lee, “Ram: Residual attention module for single image super-resolution,” arXiv, 2018.
  • [50] S. Anwar and N. Barnes, “Densely residual laplacian super-resolution,” arXiv preprint arXiv:1906.12021, 2019.
  • [51] H. Kurihata, T. Takahashi, I. Ide, Y. Mekada, H. Murase, Y. Tamatsu, and T. Miyahara, “Rainy weather recognition from in-vehicle camera images for driver assistance,” in IVS, 2005.
  • [52] M. Roser and A. Geiger, “Video-based raindrop detection for improved image registration,” in ICCV Workshops, 2009.
  • [53] A. Yamashita, Y. Tanaka, and T. Kaneko, “Removal of adherent waterdrops from images acquired with stereo camera,” in IROS, 2005.
  • [54] A. Yamashita, I. Fukuchi, and T. Kaneko, “Noises removal from image sequences acquired with moving camera by estimating camera motion from spatio-temporal information,” in IROS, 2009.
  • [55] S. You, R. T. Tan, R. Kawakami, Y. Mukaigawa, and K. Ikeuchi, “Adherent raindrop modeling, detection and removal in video,” TPAMI, 2015.
  • [56] D. Eigen, D. Krishnan, and R. Fergus, “Restoring an image taken through a window covered with dirt or rain,” in ICCV, 2013.
  • [57] R. Qian, R. T. Tan, W. Yang, J. Su, and J. Liu, “Attentive generative adversarial network for raindrop removal from a single image,” in CVPR, 2018.
  • [58] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in CVPR, 2017.
  • [59] P. List, A. Joch, J. Lainema, G. Bjontegaard, and M. Karczewicz, “Adaptive deblocking filter,” TCSVT, 2003.
  • [60] C. Wang, J. Zhou, and S. Liu, “Adaptive non-local means filter for image deblocking,” Signal Processing: Image Communication, 2013.
  • [61] C. Jung, L. Jiao, H. Qi, and T. Sun, “Image deblocking via sparse representation,” Signal Processing: Image Communication, 2012.
  • [62] J. Jancsary, S. Nowozin, and C. Rother, “Loss-specific training of non-parametric image restoration models: A new state of the art,” in ECCV, 2012.
  • [63] C. Dong, Y. Deng, C. Change Loy, and X. Tang, “Compression artifacts reduction by a deep convolutional network,” in ICCV, 2015.
  • [64] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
  • [65] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in ICML, 2015.
  • [66] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in CVPR, 2018.
  • [67] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in ICCV, 2001.
  • [68] E. Agustsson and R. Timofte, “Ntire 2017 challenge on single image super-resolution: Dataset and study,” in CVPR Workshops, 2017.
  • [69] V. Bychkovsky, S. Paris, E. Chan, and F. Durand, “Learning photographic global tonal adjustment with a database of input/output image pairs,” in CVPR, 2011.
  • [70] A. Abdelhamed, S. Lin, and M. S. Brown, “A high-quality denoising dataset for smartphone cameras,” in CVPR, 2018.
  • [71] J. Xu, H. Li, Z. Liang, D. Zhang, and L. Zhang, “Real-world noisy image denoising: A new benchmark,” arXiv, 2018.
  • [72] J. Anaya and A. Barbu, “Renoir–a dataset for real low-light image noise reduction,” Journal of Visual Communication and Image Representation, 2018.
  • [73] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv, 2014.
  • [74] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” 2017.
  • [75] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Color image denoising via sparse 3-D collaborative filtering with grouping constraint in luminance-chrominance space,” in ICIP, 2007.
  • [76] T. Plötz and S. Roth, “Benchmarking denoising algorithms with real photographs,” CVPR, 2017.
  • [77] S. Nam, Y. Hwang, Y. Matsushita, and S. Joo Kim, “A holistic approach to cross-channel image noise modeling and its application to image denoising,” in CVPR, 2016.
  • [78] J. Xu, L. Zhang, and D. Zhang, “A trilateral weighted sparse coding scheme for real-world image denoising,” in ECCV, 2018.
  • [79] J. Xu, L. Zhang, D. Zhang, and X. Feng, “Multi-channel weighted nuclear norm minimization for real color image denoising,” in ICCV, 2017.
  • [80] ABSoft. Neat image. [Online]. Available:
  • [81] W. Dong, L. Zhang, G. Shi, and X. Li, “Nonlocally centralized sparse representation for image restoration,” TIP, 2012.
  • [82] M. Aharon, M. Elad, and A. Bruckstein, “K-svd: An algorithm for designing overcomplete dictionaries for sparse representation,” TIP, 2006.
  • [83] J.-B. Huang, A. Singh, and N. Ahuja, “Single image super-resolution from transformed self-exemplars,” in CVPR, 2015.
  • [84] M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel, “Low-complexity single-image super-resolution based on nonnegative neighbor embedding,” 2012.
  • [85] R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in ICCS, 2010.
  • [86] H. Sheikh, “Live image quality assessment database release 2,” http://live.ece.utexas.edu/research/quality, 2005.