In the present digital age, many hand-held camera-based devices allow humans and machines to record or capture video and image data. However, during image and video acquisition, various forms of corruption, for example, noise (Gaussian, speckle, thermal), compression (e.g. JPEG), and blur (motion, defocus), are often inevitable and can degrade the visual quality considerably. The process of reducing these artifacts to recover the missing details is called image restoration. Moreover, being a low-level vision task, image restoration is a crucial step for various computer vision and image analysis applications such as computational photography, surveillance, robotic vision, recognition, and classification.
Generally, restoration algorithms can be categorized as model-based and learning-based. Model-based algorithms include non-local self-similarity (NSS) [1, 2], sparsity [3, 4], gradient methods [5, 6], Markov random field models, and external restoration priors [8, 9, 10]. These model-based algorithms are computationally expensive and time-consuming, and they can neither suppress spatially variant degradations directly nor characterize complex image textures. On the other hand, discriminative learning aims to model the image prior from a set of degraded and ground-truth image pairs. One technique is to learn the prior steps in the context of truncated inference, while another approach is to employ brute-force learning, for example, CNN methods [12, 13]. CNN models improved restoration performance thanks to their modeling capacity, network training, and design. However, the performance of current learning models is limited, and they are tailored for specific synthetic degradation types.
A practical restoration algorithm should be efficient and flexible, perform restoration using a single model, and handle both spatially variant and invariant degradations, whether the degradation is known or unknown. Unfortunately, the current state-of-the-art algorithms are far from achieving all of these aims. We present a CNN model that is efficient and capable of handling synthetic as well as real degradation present in images. We summarize the contributions of this work as follows.
[Figure: visual comparison of a noisy input, CBDNet, and RNet (ours).]
- A CNN-based approach for real image restoration (and synthetic image denoising) producing state-of-the-art results with a first-of-its-kind single-stage model.
- To the best of our knowledge, the first use of feature attention in restoration tasks, specifically in denoising, JPEG compression artifact removal, and raindrop removal (feature attention has been used in super-resolution, but our model is lightweight and efficient).
- A modular network less affected by vanishing gradients, thus enabling improved performance with an increasing number of modules.
- Quantitative as well as qualitative experimental results on 11 real-image degradation datasets, producing state-of-the-art results against more than 30 algorithms. Additionally, results on three synthetic noisy datasets are provided for denoising.
- A single network that can handle spatially variant noise, significant local artifacts (i.e. JPEG compression), pixel-by-pixel artifacts such as raindrops, and super-resolution.
2 Related Work
In the following sections, we present the literature for denoising, super-resolution, JPEG compression, and raindrop removal.
2.1 Image denoising
In this section, we present and discuss recent trends in image denoising. Two notable denoising algorithms, NLM and BM3D, use self-similar patches. Due to their success, many subsequent variants were proposed, including SADCT, SAPCA, and NLB, which seek self-similar patches in different transform domains. Dictionary-based methods [20, 21] enforce sparsity by employing self-similar patches and learning over-complete dictionaries from clean images. Many algorithms [22, 23] investigated maximum likelihood estimation to learn a statistical prior, e.g. the Gaussian Mixture Model of natural patches or patch groups, for patch restoration. Furthermore, Levin et al. and Chatterjee et al. motivated external denoising [8, 26, 10] by showing that an image can be recovered with negligible error by selecting reference patches from a clean external database. However, all of the external algorithms are class-specific.
Recently, Schmidt et al. introduced a cascade of shrinkage fields (CSF), which integrates half-quadratic optimization and random fields. Shrinkage aims to suppress smaller values (noise values) and learn mappings discriminatively. CSF assumes that the data fidelity term is quadratic and has a discrete Fourier transform-based closed-form solution.
Due to the popularity of convolutional neural networks (CNNs), image denoising algorithms [12, 13, 28, 29, 27, 30] have achieved a performance boost. Notable denoising networks, DnCNN and IrCNN, predict the residue present in the image instead of the denoised image itself as the input to the loss function; that is, they compare against the ground-truth noise instead of the original clean image. Both networks achieved better results despite having a simple architecture in which repeated blocks of convolution, batch normalization, and ReLU activation are used. However, IrCNN and DnCNN depend on blindly predicted noise, i.e. without taking into account the underlying structures and textures of the noisy image.
Another essential image restoration framework is Trainable Nonlinear Reaction-Diffusion (TNRD), which incorporates a field-of-experts prior into a deep neural network for a fixed number of inference steps by extending the non-linear diffusion paradigm into a deeply trainable set of parametrized linear filters and influence functions. Although the results of TNRD are favorable, the model requires a significant amount of data to learn the parameters and influence functions, as well as overall fine-tuning, hyper-parameter determination, and stage-wise training. Similarly, the non-local color net (NLNet) was motivated by non-local self-similarity (NSS) priors, employing non-local self-similarity coupled with discriminative learning. NLNet improved upon the traditional methods, but it lags in performance compared to most CNNs [13, 12] due to its adaptation of NSS priors, as it is unable to find analogs for all the patches in the image. Recently, Anwar et al. introduced CIMM, a deep denoising CNN architecture composed of identity mapping modules. The network learns features in cascaded identity modules using dilated kernels and uses self-ensemble to boost performance. CIMM improved upon all the previous CNN models [12, 31].
CNN-based denoisers have benefitted from the modeling capacity of CNNs and have shown the ability to learn a single blind denoising model; however, their denoising performance is limited, and the results are not satisfactory on real photographs. Generally speaking, real-noisy image denoising is a two-step process: the first step involves noise estimation, whereas the second step addresses non-blind denoising. Noise Clinic (NC) estimates a signal- and frequency-dependent noise model, followed by denoising the image using non-local Bayes (NLB). In comparison, Zhang et al. proposed a non-blind Gaussian denoising network, termed FFDNet, that can produce satisfying results on some real-noisy images; however, it requires manual intervention to select a high value for high noise levels.
CBDNet also incorporates multiple losses, is engineered to be trained on both synthesized real noise and real-image noise, and enforces a higher noise standard deviation for low-noise images. Similarly, the methods of [14, 34] may require manual intervention to improve the results. In this paper, we present an end-to-end architecture that learns the noise characteristics and produces results on real noisy images without requiring separate subnets or manual intervention.
2.2 Image super-resolution

In this section, we provide a chronological record of advancements in deep super-resolution. The initial focus of CNN models was on simple architectures with no skip connections, for example, SRCNN with three convolutional layers, and FSRCNN with eight convolutional layers, which utilizes shrinking and expansion of channels to run in real time on a CPU. Next, SRMD, a linear network (resembling [35, 13]), is able to handle multiple types of degradations; the input to the system is a low-resolution image with the corresponding degradation maps.
The introduction of skip connections in deep networks made its way into super-resolution algorithms. Kim et al. employed a global skip connection to enforce residual learning, improving on the previous super-resolution methods. The same authors then developed a deep recursive structure (DRCN) sharing layer parameters, which reduced the number of parameters significantly, though it lags behind VDSR in performance. Following that, to decrease memory usage and computational complexity, Tai et al. proposed a deep recursive residual network (DRRN) that utilizes basic skip connections to implement residual learning along various convolutional blocks, i.e. a multi-path architecture.
The success of residual blocks motivated many super-resolution works. In the enhanced deep super-resolution (EDSR) network, Lim et al. employed residual blocks and a global skip connection, rescaling each block's output by a factor of 0.1 to avoid exploding gradients, significantly improving on all previous methods. More recently, Ahn et al. proposed an efficient network, namely the cascading residual network (CARN). The authors use cascading connections with a variant of residual blocks with three convolutional layers.
To improve performance, several super-resolution works were driven by the success of dense concatenation. For example, Tong et al. take the output of all the previous convolutional layers in a block and feed it into the subsequent one. Similarly, the residual-dense network (RDN) learns relationships through dense connections in the patches. Lately, Haris et al. train a series of densely connected downsampling and upsampling layers (a single block) with feedback and feed-forward mechanisms.
Recently, Zhang et al. introduced visual attention in their RCAN network. In addition, the authors also employ a series of residual blocks and multi-level skip connections. Furthermore, Kim et al., in parallel, suggested a dual attention procedure, namely SRRAM. The performance of SRRAM is, however, not on par with RCAN. Recently, Anwar & Barnes introduced densely connected residual units with Laplacian attention to advance super-resolution performance. Comparatively, our proposed SR method is lightweight and achieves competitive performance.
2.3 Raindrop Removal
Many papers deal with visibility enhancement, which removes haze, fog, and rain streaks; however, these algorithms do not apply to raindrop removal on a window or camera lens as the image formation models are different.
To remove raindrops, Kurihata et al. propose applying PCA to learned raindrop shapes. During testing, the learned shapes are compared to raindrops, and the matching entities are removed. Due to the irregular shapes of raindrops, it is challenging to learn all the representative shapes; similarly, it is difficult to model transparent raindrops via PCA. Furthermore, there is a risk that local content is falsely detected and removed from the image, mainly due to similar appearance and structure. Roser & Geiger compare synthetically generated raindrops with real ones based on the assumption that the former are simply spherical in shape while the latter are inclined spheres. Although these assumptions seem to enhance visibility, the method lacks generalization capability due to the random sizes and shapes of raindrops.
To detect raindrops, Yamashita et al. employed stereo images using disparity measures. The detected raindrops were replaced by neighborhood textures, based on the assumption that a raindrop occludes a similar-looking background. Furthermore, Yamashita et al. proposed a related method in which a sequence of images was used instead of stereo pairs. Similarly, You et al. exploited a motion-based method for raindrop detection, while a video completion method is employed to remove them. The performance of these methods may be satisfactory on certain video scenes; however, a straightforward application to the case of a single image is not possible.
Eigen et al. and DeRain are the only methods specifically designed for this task. Eigen et al. employ a shallow network with three layers, which works for sparse and small raindrops; however, it fails to remove drops that are dense or large, and the output of the network is not sharp. More recently, DeRain was proposed by Qian et al., which uses a GAN as the backbone with LSTMs, attention, and both global and local assessments. Finally, we also compare with a general method, namely Pix2Pix, that maps the input image to the output one by minimizing a loss function. In contrast, we use attention to remove the raindrops without employing multiple backbone networks or loss functions.
2.4 JPEG Compression
JPEG artifact removal algorithms fall into two categories: 1) deblocking-oriented and 2) restoration-oriented. Deblocking-oriented approaches remove the blocking artifacts in the spatial domain through adaptive filters [59, 60], while in the frequency domain, transforms and thresholds are applied at multiple scales to eliminate the artifacts, e.g. the Pointwise Shape-Adaptive DCT (SA-DCT). Although deblocking methods remove artifacts, they fail to generate sharp edges and tend to over-smooth textures.
On the other hand, restoration-based approaches treat compression as a form of distortion; these include the sparse-coding-based method, the Regression Tree Fields based method (RTF), and CNN-based methods [63, 12]. Recently, the Artifacts Reduction Convolutional Neural Network (AR-CNN) was proposed, inspired by and with a similar architecture to SRCNN, except having more feature layers; however, AR-CNN lags behind DnCNN in performance. Contrary to the mentioned networks, our method suppresses artifacts while preserving edges and sharp details via feature attention, merge-and-run units, and enhanced residual modules.
3 CNN restorer
3.1 Network Architecture
Our model is composed of three main modules, i.e. feature extraction, feature learning residual on the residual, and reconstruction, as shown in Figure 2. Let us denote the degraded input image by $x$ and the restored output image by $\hat{y}$. Our feature extraction module is composed of only one convolutional layer to extract initial features $f_0$ from the noisy input:

$$f_0 = M_e(x),$$

where $M_e(\cdot)$ performs convolution on the noisy input image. Next, $f_0$ is passed on to the feature learning residual on the residual module, termed $M_{fl}$,

$$f_r = M_{fl}(f_0),$$

where $f_r$ denotes the learned features and $M_{fl}$ is the main feature learning residual on the residual component, composed of enhancement attention modules (EAM) that are cascaded together as shown in Figure 2. Our network has a small depth but provides a wide receptive field through kernel dilation in each of the first two branches of convolutions in the EAM. The output features $f_r$ of the final layer are fed to the reconstruction module, which is again composed of one convolutional layer:

$$\hat{y} = M_r(f_r),$$

where $M_r(\cdot)$ denotes the reconstruction layer.

There are several choices available for the loss function to optimize, such as $\ell_2$ [12, 13, 30], perceptual loss [31, 14], total variation loss, and asymmetric loss. Some networks [31, 14] make use of more than one loss to optimize the model. Contrary to earlier networks, we only employ one loss, i.e. $\ell_1$. Now, given a batch of $N$ training pairs $\{x_i, y_i\}_{i=1}^{N}$, where $x_i$ is a noisy input and $y_i$ is the corresponding ground truth, the aim is to minimize the loss function

$$L(\mathcal{W}) = \frac{1}{N} \sum_{i=1}^{N} \left\| \text{RNet}(x_i) - y_i \right\|_1,$$

where RNet$(\cdot)$ is our network and $\mathcal{W}$ denotes the set of all learned network parameters. Our feature extraction and reconstruction modules resemble those of previous algorithms [35, 30]. We now focus on the feature learning residual on the residual block and feature attention.
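As a minimal, framework-agnostic sketch (in NumPy rather than the training framework), the single training objective simply averages the absolute restoration error over a batch; here `restored` stands in for the network output on the noisy inputs:

```python
import numpy as np

def l1_loss(restored, clean):
    """Mean absolute error over a batch: the single l1 objective used for
    training. `restored` stands in for the network output RNet(x_i)."""
    restored = np.asarray(restored, dtype=np.float64)
    clean = np.asarray(clean, dtype=np.float64)
    return float(np.mean(np.abs(restored - clean)))
```

For example, a batch whose restoration is off by one in half of its pixels yields a loss of 0.5.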
3.2 Feature Learning Residual on the Residual
In this section, we provide more detail on the enhancement attention modules (EAM), which use a residual on the residual structure with local skip and short skip connections. Each EAM is further composed of several blocks, followed by feature attention. Thanks to the residual on the residual architecture, very deep networks are possible that improve denoising performance; however, we restrict our model to four EAM modules only. The first part of an EAM covers the full receptive field of the input features; this is followed by learning on the features; then the features are compressed for speed; and finally a feature attention module enhances the weights of essential features in the maps. The first part of the EAM is realized using a novel merge-and-run unit (MRU), as shown in the second row of Figure 2. The input features are branched and passed through two parallel dilated convolutions, then concatenated and passed through another convolution. Next, the features are learned using a residual block (RB) of two convolutions, while compression is achieved by an enhanced residual block (ERB) of three convolutional layers. The last layer of the ERB flattens the features by applying a 1×1 kernel. Finally, the output of the feature attention unit is added to the input of the EAM.
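The EAM dataflow described above can be sketched as plain function composition, with the convolutional operators passed in as callables; the parameter names (`branch1`, `merge`, etc.) are ours, and real versions would be the dilated and 1×1 convolutions of Figure 2:

```python
import numpy as np

def eam(x, branch1, branch2, merge, residual_block, enhanced_residual_block,
        attention):
    """Dataflow of one enhancement attention module (EAM): a merge-and-run
    unit, a residual block, an enhanced residual block, feature attention,
    and finally the skip connection from the module input."""
    # Merge-and-run unit: two parallel (dilated) branches, concatenated
    # along the channel axis, then merged by another convolution.
    merged = merge(np.concatenate([branch1(x), branch2(x)], axis=0))
    learned = residual_block(merged)               # two convolutions + skip
    compressed = enhanced_residual_block(learned)  # three convolutions, last 1x1
    return x + attention(compressed)               # EAM skip connection
```

With identity stand-ins for the blocks (and a channel-halving `merge`), the module reduces to `x + attention(x)`, which makes the outer skip connection explicit.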
In image recognition, residual blocks are often stacked together to construct networks of more than 1000 layers. Similarly, in image super-resolution, EDSR stacked residual blocks and used long skip connections (LSC) to form a very deep network. However, to date, very deep networks have not been investigated for denoising. Motivated by this success, we introduce the residual on the residual as a basic module of our network to construct deeper systems. The output of the m-th EAM module is given as

$$F_m = \text{EAM}_m(F_{m-1}) = \text{EAM}_m(\text{EAM}_{m-1}(\cdots \text{EAM}_1(f_0) \cdots)),$$

where $F_m$ is the output of the m-th feature learning module, in other words $F_0 = f_0$. The output of each EAM is added to the input of the group via a short skip connection. The learned features, i.e. $f_r$, are passed to the reconstruction layer to output the same number of channels as the input of the network. Furthermore, we use a long skip connection to add the input image to the final network output as

$$\hat{y} = x + M_r(f_r; w, b),$$

where $w$ and $b$ denote the weights and biases learned in the group. This addition, i.e. the LSC, eases the flow of information across groups and helps learn the residual (degradation) rather than the image itself. This yields faster learning compared to learning the original image, thanks to the sparse representation of the degradation.
3.2.1 Feature Attention
Table I: Ablation of skip connections and feature attention (average PSNR on BSD68).
|Long skip connection (LSC)|✓|✓|✓|✓|
|Short skip connection (SSC)|✓|✓|✓|✓|✓|
|Long connection (LC)|✓|✓|✓|
|Feature attention (FA)|✓|✓|✓|✓|✓|
|PSNR (in dB)|28.45|28.77|28.81|28.86|28.52|28.85|28.86|28.90|28.96|
This section provides information about the feature attention mechanism. Attention has been around for some time; however, it had not been employed in image denoising. Channel features in image denoising methods are typically treated equally, which is not appropriate in many cases. To exploit and learn the critical content of the image, we focus attention on the relationships between the channel features; hence the name feature attention (see Figure 3).
An important question here is how to generate attention for each channel-wise feature differently. Images can generally be considered as having low-frequency regions (smooth or flat areas) and high-frequency regions (e.g., lines, edges, and texture). Since convolutional layers exploit only local information and are unable to utilize global contextual information, we first employ global average pooling to express the statistics of the whole image; other options for aggregating the features into an image descriptor could also be explored. Let $f_c$ be the output of the last convolutional layer, having $C$ feature maps of size $H \times W$; global average pooling reduces the size from $H \times W \times C$ to $1 \times 1 \times C$ as:

$$g_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} f_c(i, j),$$

where $f_c(i, j)$ is the feature value at position $(i, j)$ in the c-th feature map.
Following this aggregation, the mechanism must learn the nonlinear synergies between channels as well as their mutually-exclusive relationships. Here, we employ soft-shrinkage and sigmoid functions to implement the gating mechanism. Let $\eta$ and $\sigma$ denote the soft-shrinkage and sigmoid operators, respectively. Then the gating mechanism is

$$r_c = \sigma(H_U(\eta(H_D(g_c)))),$$

where $H_D$ and $H_U$ are the channel reduction and channel upsampling operators, respectively. The output of the global pooling layer is convolved with a downsampling Conv layer and activated by the soft-shrinkage function. To differentiate the channel features, the output is then fed into an upsampling Conv layer followed by sigmoid activation. Finally, the output of the sigmoid, $r_c$, adaptively rescales the input channel features $f_c$ as

$$\hat{f}_c = r_c \cdot f_c.$$
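Putting the pooling, gating, and rescaling steps together, the feature attention unit can be sketched in NumPy. Since the 1×1 channel-reduction and upsampling convolutions act on the pooled vector, they are modelled here as matrix multiplies with illustrative (untrained) weights; the function and parameter names are ours:

```python
import numpy as np

def soft_shrink(x, lam=0.1):
    """Soft-shrinkage activation: drives small (noise-like) responses to zero."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feature_attention(f, w_down, w_up, lam=0.1):
    """Feature (channel) attention: global average pooling, channel
    reduction (H_D), soft-shrinkage, channel upsampling (H_U), sigmoid
    gating, then adaptive rescaling of the input features.
    `f` has shape (C, H, W); `w_down` is (C//r, C), `w_up` is (C, C//r)."""
    g = f.mean(axis=(1, 2))                           # (C,) image descriptor
    r = sigmoid(w_up @ soft_shrink(w_down @ g, lam))  # (C,) channel gates
    return f * r[:, None, None]                       # rescale each channel
```

Each channel is thus scaled by a gate in (0, 1) derived from the global statistics of all channels, which is what allows informative channels to be emphasized over flat ones.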
Our proposed model contains four EAM blocks. The kernel size of each convolutional layer is set to 3×3, except the last Conv layer in the enhanced residual block and those of the feature attention units, where the kernel size is 1×1. Zero padding is used for the 3×3 kernels to achieve output feature maps of the same size. The number of channels in each convolutional layer is fixed at 64, except in the feature attention downscaling, where the Conv layers are reduced by a factor of 16 and hence have only four feature maps. The final convolutional layer outputs either three or one feature map, depending on the input. As for running time, our method takes about 0.2 seconds to process an image.
4 Experiments

4.1 Training settings
To generate noisy synthetic images, we employ BSD500, DIV2K, and MIT-Adobe FiveK, resulting in 4k images, while for real noisy images we use cropped patches from SSID, Poly, and RENOIR. Data augmentation is performed on the training images, including random rotations of 90°, 180°, and 270° and horizontal flipping. In each training batch, 32 patches are extracted as inputs. Adam is used as the optimizer with its default parameters; the learning rate is halved after a fixed number of iterations. The network is implemented in the PyTorch framework and trained on an Nvidia Tesla V100 GPU. Furthermore, we use PSNR as the evaluation metric.
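The rotation-and-flip augmentation described above is straightforward to express; this sketch (function names are ours) applies one of the four rotations plus an optional horizontal flip to an H×W(×C) patch:

```python
import numpy as np

def augment(patch, rot_k, flip):
    """Rotate by 90 * rot_k degrees and optionally flip horizontally,
    matching the augmentations used during training."""
    out = np.rot90(patch, k=rot_k, axes=(0, 1))
    if flip:
        out = np.flip(out, axis=1)  # horizontal flip
    return out

def random_augment(patch, rng=None):
    """Draw one of the 8 rotation/flip combinations at random."""
    rng = np.random.default_rng() if rng is None else rng
    return augment(patch, rot_k=int(rng.integers(4)), flip=bool(rng.integers(2)))
```

Together, the four rotations and the flip give eight geometric variants per patch, effectively multiplying the training data eightfold.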
4.2 Ablation Studies
4.2.1 Influence of the skip connections
Skip connections play a crucial role in our network. Here, we demonstrate the effectiveness of the skip connections. Our model is composed of three basic types of connections, which include long skip connection (LSC), short skip connections (SSC), and local connections (LC). Table I shows the average PSNR for the BSD68  dataset. The highest performance is obtained when all the skip connections are available while the performance is lower when any connection is absent. We also observed that increasing the depth of the network in the absence of skip connections does not benefit performance.
[Table II: average PSNR of CBM3D, MLP, TNRD, DnCNN, IrCNN, CNLNet, FFDNet, and our method at noise levels σ = 15, 25, and 50.]
4.2.2 Influence of feature attention

Another important aspect of our network is feature attention. Table I compares the PSNR values of networks with and without feature attention. The results support our claim about the benefit of using feature attention. Since the inception of DnCNN, CNN models have matured, and further performance improvement requires the careful design of blocks and rescaling of the feature maps. Both characteristics are present in our model in the form of feature attention and skip connections.
4.3 Denoising Comparisons
We evaluate our algorithm using the Peak Signal-to-Noise Ratio (PSNR) index as the error metric and compare against many competitive state-of-the-art algorithms, including the traditional methods CBM3D, WNNM, EPLL, and CSF, and the CNN-based denoisers MLP, TNRD, DnCNN, IrCNN, CNLNet, FFDNet, and CBDNet. For a fair comparison, we use the default settings of the traditional methods provided by the corresponding authors.
4.3.1 Denoising Test Datasets
In the experiments, we test on four real-world noisy datasets, i.e. RNI15, DND, Nam, and SSID. Furthermore, we prepare three synthetic noisy datasets from the widely used 12 classical images and the 68 BSD68 images in both grayscale and color. We corrupt the clean images with additive white Gaussian noise of standard deviation σ = 15, 25, and 50.
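Generating a synthetic test image and scoring it follow directly from the definitions above; this sketch (function names are ours) adds white Gaussian noise at a chosen σ and computes the PSNR metric against the clean image:

```python
import numpy as np

def add_awgn(img, sigma, rng=None):
    """Corrupt a clean image (float array, 0-255 range) with additive
    white Gaussian noise of standard deviation sigma."""
    rng = np.random.default_rng(0) if rng is None else rng
    return np.asarray(img, float) + rng.normal(0.0, sigma, size=np.shape(img))

def psnr(clean, restored, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB, the evaluation metric used here."""
    mse = np.mean((np.asarray(clean, float) - np.asarray(restored, float)) ** 2)
    return float("inf") if mse == 0.0 else float(10.0 * np.log10(peak ** 2 / mse))
```

Note that because the noise is drawn independently per pixel, the synthetic degradation is spatially invariant, in contrast to the real-noise datasets.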
Classical images: The denoising comparisons would be incomplete without testing on the traditional images. Here, we use 12 classical images for testing.
BSD68 is composed of 68 grayscale images (CBSD68 is the same but with color images). The ground-truth images are available, as the degraded dataset is synthetically created.
RNI15  provides 15 real-world noisy images. Unfortunately, the clean images are not given for this dataset; therefore, only the qualitative comparison is presented.
Nam comprises 11 static scenes and the corresponding noise-free images, obtained as the mean of 500 noisy images of the same scene. The images are enormous; hence, we cropped them into patches and randomly selected 110 of these for testing.
DND was recently proposed by Plotz et al. and initially contains 50 pairs of real-world noisy and noise-free scenes. The scenes were further cropped into patches by the providers of the dataset, resulting in 1000 smaller images. The near noise-free images are not publicly available, and the results (PSNR/SSIM) can only be obtained through the online system introduced by the authors.
SSID (Smartphone Image Denoising Dataset) was recently introduced. The authors collected 30k real noisy images and their corresponding clean images; however, only 320 image pairs have been released for training and 1280 image pairs for validation, as the testing images have not been released yet. We use the validation images for testing our algorithm and the competing methods.
4.3.2 Classical noisy images
[Figure: visual comparison of the noisy input with CBM3D, WNNM, NC, TWSC, MCWNNM, NI, FFDNet, CBDNet, and RNet (ours).]
In this subsection, we evaluate our model on noisy grayscale images corrupted by spatially invariant additive white Gaussian noise. We compare against representative non-local self-similarity models, i.e. BM3D and WNNM, and learning-based methods, i.e. EPLL, TNRD, MLP, DnCNN, IrCNN, and CSF.
SET12: In Table IV, we present the PSNR values on Set12. Our method outperforms all the competitive algorithms for all noise levels; this may be due to the larger receptive field in the merge-and-run unit as well as better network modeling capacity.
BSD68: We show the performance of our algorithm against competing methods on BSD68 in Table II. It should be remembered that BSD68 and BSD500 are two disjoint sets. Our method improves over all the competitive algorithms at all noise levels. The increase in PSNR demonstrates the superior network design and better feature learning for denoising tasks. It should be noted that even a marginal improvement on the synthetic noisy datasets is difficult since, according to Levin et al. and Chatterjee et al., synthetic denoising has already approached its optimal limits.
Color noisy images: Next, for color image denoising, we keep all the parameters of the network the same as the grayscale model, except that the first and last layers are changed to input and output three channels rather than one. Figure 4 presents a visual comparison, and Table III reports the PSNR numbers for our method and the alternative algorithms. Our algorithm consistently outperforms all the other techniques in Table III on the CBSD68 dataset. Similarly, our network produces the best perceptual quality, as shown in Figure 4. A closer inspection of the vase reveals that our network generates textures closest to the ground truth, with fewer artifacts and more details.
4.3.3 Real-World noisy images
To assess the practicality of our model, we employ real noise datasets. The evaluation is difficult because of the unknown noise level, the various noise sources such as shot noise and quantization noise, and the imaging pipeline, i.e. image resizing, lossy compression, etc. Furthermore, the noise is spatially variant (non-Gaussian) and signal-dependent; hence, the assumption of spatially invariant noise, employed by many algorithms, does not hold for real image noise. Therefore, evaluation on real noisy images determines the success of an algorithm in real-world applications.
DnD: Table V presents the quantitative results (PSNR/SSIM) on the sRGB data for the competitive algorithms and our method, obtained from the publicly available online DnD benchmark website. The blind Gaussian denoiser DnCNN performs poorly and is unable to achieve better results than BM3D and WNNM due to poor generalization of the noise during training. Similarly, the non-blind traditional Gaussian denoisers report limited performance, even though the noise standard deviation is provided. This may be due to the fact that these denoisers [1, 3, 22] are tailored for AWGN only, and real noise has different characteristics to synthetic noise. By incorporating feature attention and capturing the appropriate characteristics of the noise through a novel module, our algorithm leads by a large margin, i.e. 1.17 dB PSNR over the second-best method, CBDNet. Furthermore, our algorithm employs only real-noisy images for training, using only the $\ell_1$ loss, while CBDNet uses many techniques, such as multiple losses (i.e. total variation and asymmetric learning) and both real noise as well as synthetically generated real noise. As reported by the authors of CBDNet, it achieves 37.72 dB with real-noise images only. Noise Clinic (NC) and Neat Image (NI) are the other two state-of-the-art blind denoisers. NI is commercially available as part of Photoshop and Corel PaintShop. Our network achieves 3.82 dB and 4.14 dB higher PSNR than NC and NI, respectively.
Next, in Figure 5, we visually compare the result of our method with the competing methods on the denoised images provided by the online system of Plotz et al. The PSNR and SSIM values are also taken from the website. From Figure 5, it is clear that the methods of [14, 34, 12] perform poorly in removing the noise from the star, and in some cases the image is over-smoothed; on the other hand, our algorithm eliminates the noise while preserving the finer details and structures in the star image.
RNI15: On RNI15, we provide qualitative results only, as the ground-truth images are not available. Figure 7 presents the denoising results on a low-noise-intensity image. FFDNet and CBDNet are unable to remove the noise in its totality, as can be seen near the bottom left of the handle and the body of the cup image. On the contrary, our method removes the noise without introducing any artifacts. We present another example from the RNI15 dataset with high noise in Figure 6. CDnCNN and FFDNet produce results of limited quality, as some noisy elements can be seen near the eye and gloves of the Dog image. In comparison, our algorithm recovers the actual texture and structures without compromising on the removal of noise.
[Figure: noisy input, CBM3D (39.13 dB), IRCNN (33.73 dB), DnCNN (37.56 dB), CBDNet (40.40 dB), and RNet (ours, 40.50 dB).]
Nam: We present the average PSNR scores of the denoised images in Table VII. Unlike CBDNet, which is trained on Nam specifically to deal with JPEG compression, we use the same network to denoise the Nam images and achieve favorable PSNR numbers. Our PSNR is higher than that of any current state-of-the-art algorithm. Furthermore, our claim is supported by the visual quality of the images produced by our model, as shown in Figure 8. The amount of noise remaining after denoising by our method is negligible compared to CDnCNN and the other counterparts.
SSID: Finally, we employ the SSID real-noise dataset, which has the highest number of test (validation) images available. The PSNR results are shown in the second row of Table VII. Again, our method outperforms FFDNet and CBDNet by margins of 9.5 dB and 7.93 dB, respectively. In Figure 9, we show the denoised results of a challenging image for different algorithms. Our technique recovers true colors closer to the original pixel values, while the competing methods are unable to restore the original colors and induce false colors in specific regions.
Table VI: Average PSNR / SSIM for super-resolution at scales ×2, ×3, and ×4.

| Dataset | Scale | Bicubic | TNRD | VDSR | DnCNN | SRMD | CARN | | RNet (Ours) |
|---|---|---|---|---|---|---|---|---|---|
| Set5 | 2 | 33.66 / 0.9299 | 36.86 / 0.9556 | 37.56 / 0.9591 | 37.58 / 0.9590 | 37.79 / 0.9601 | 37.76 / 0.9590 | 37.95 / 0.9605 | 38.07 / 0.9608 |
| Set5 | 3 | 30.39 / 0.8682 | 33.18 / 0.9152 | 33.67 / 0.9220 | 33.75 / 0.9222 | 34.12 / 0.9254 | 34.29 / 0.9255 | 34.37 / 0.9269 | 34.49 / 0.9278 |
| Set5 | 4 | 28.42 / 0.8104 | 30.85 / 0.8732 | 31.35 / 0.8845 | 31.40 / 0.8845 | 31.96 / 0.8925 | 32.13 / 0.8937 | 32.15 / 0.8946 | 32.34 / 0.8970 |
| Set14 | 2 | 30.24 / 0.8688 | 32.51 / 0.9069 | 33.02 / 0.9128 | 33.03 / 0.9128 | 33.32 / 0.9159 | 33.52 / 0.9166 | 33.54 / 0.9173 | 33.63 / 0.9182 |
| Set14 | 3 | 27.55 / 0.7742 | 29.43 / 0.8232 | 29.77 / 0.8318 | 29.81 / 0.8321 | 30.04 / 0.8382 | 30.29 / 0.8407 | 30.34 / 0.8419 | 30.43 / 0.8433 |
| Set14 | 4 | 26.00 / 0.7027 | 27.66 / 0.7563 | 27.99 / 0.7659 | 28.04 / 0.7672 | 28.35 / 0.7787 | 28.60 / 0.7806 | 28.62 / 0.7822 | 28.72 / 0.7842 |
| BSD100 | 2 | 29.56 / 0.8431 | 31.40 / 0.8878 | 31.89 / 0.8961 | 31.90 / 0.8961 | 32.05 / 0.8985 | 32.09 / 0.8978 | 32.19 / 0.9001 | 32.25 / 0.9007 |
| BSD100 | 3 | 27.21 / 0.7385 | 28.50 / 0.7881 | 28.82 / 0.7980 | 28.85 / 0.7981 | 28.97 / 0.8025 | 29.06 / 0.8034 | 29.12 / 0.8055 | 29.17 / 0.8065 |
| BSD100 | 4 | 25.96 / 0.6675 | 27.00 / 0.7140 | 27.28 / 0.7256 | 27.29 / 0.7253 | 27.49 / 0.7337 | 27.58 / 0.7349 | 27.60 / 0.7363 | 27.65 / 0.7376 |
| Urban100 | 2 | 26.88 / 0.8403 | 29.70 / 0.8994 | 30.76 / 0.9143 | 30.74 / 0.9139 | 31.33 / 0.9204 | 31.92 / 0.9256 | 32.07 / 0.9280 | 32.24 / 0.9294 |
| Urban100 | 3 | 24.46 / 0.7349 | 26.42 / 0.8076 | 27.13 / 0.8283 | 27.15 / 0.8276 | 27.57 / 0.8398 | 28.06 / 0.8493 | 28.14 / 0.8519 | 28.28 / 0.8542 |
| Urban100 | 4 | 23.14 / 0.6577 | 24.61 / 0.7291 | 25.17 / 0.7528 | 25.20 / 0.7521 | 25.68 / 0.7731 | 26.07 / 0.7837 | 26.18 / 0.7881 | 26.28 / 0.7905 |
4.4 Super-resolution Comparisons
We evaluate our model against networks that aim for efficiency as well as high PSNR and that have a similar depth and number of parameters. We compare against TNRD, SRMD, and CARN (which has more than one hundred convolutional layers); in contrast, RCAN and DRLN have more than 16M parameters, while our model has only 1.49M parameters. We present the performance on the four publicly available datasets given below:
Set5 is a classical dataset containing only five images.
Set14 contains 14 RGB images.
BSD100 is a subset of the Berkeley Segmentation Dataset and consists of one hundred natural images.
Urban100 is a recently proposed dataset of 100 images containing human-made structures and buildings. The image sizes and the structures present make it very challenging for super-resolution.
4.4.1 RNet for SR
RNet is modified for super-resolution due to the increase in the size of the final output. Two modifications are made to the network structure: 1) an additional upsampling layer is inserted before the final convolutional layer to super-resolve the input image to the desired resolution, and 2) residual learning is performed by changing the position of the long skip connection (i.e., the output of the feature extraction layer is added to the output of the final EAM block), as shown in Figure 11. The second modification is required because the size of the features changes after upsampling. It is also worth mentioning that the input size to the network is 48×48 for super-resolution. Apart from these modifications, no changes are made to the network.
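Such an upsampling layer is commonly realized as sub-pixel convolution (pixel shuffle); the text does not specify the exact operator, so the following numpy sketch of the pixel-shuffle rearrangement is an illustrative assumption:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) feature map into (C, H*r, W*r).
    This is the sub-pixel upsampling step typically placed before the
    final convolution when adapting a restoration network to SR."""
    c_rr, h, w = x.shape
    c = c_rr // (r * r)
    x = x.reshape(c, r, r, h, w)    # split the channel dim into (C, r, r)
    x = x.transpose(0, 3, 1, 4, 2)  # interleave: (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)

# 4 channels with r = 2 collapse into 1 channel at twice the resolution.
feat = np.arange(4 * 3 * 3).reshape(4, 3, 3).astype(np.float32)
up = pixel_shuffle(feat, 2)
print(up.shape)  # (1, 6, 6)
```

The rearrangement matches the standard sub-pixel convention: output pixel (h*r+i, w*r+j) of channel c is read from input channel c*r*r + i*r + j at position (h, w).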
4.4.2 Visual Comparisons
We furnish an image from Urban100 for qualitative comparison in Figure 10 against algorithms that have a similar number of parameters and aim to provide efficient super-resolution. It is evident from the figure that all the competing methods fail to recover the straight lines forming the rectangles shown in the cropped regions of the “img_72” image. Our method performs better than the current state-of-the-art CARN. Moreover, our network produces results that are faithful to the ground-truth image and free of blurring.
4.4.3 Quantitative Comparisons
Table VI shows the average PSNR and SSIM values on the four mentioned datasets against several state-of-the-art algorithms. Our algorithm outperforms all the methods for all scaling factors on all datasets. Our method is also very lightweight compared to the recently published CARN, containing roughly four times fewer layers. The relative performance of our network improves as the number of images in the test datasets and the scaling factor increase. Similarly, our RNet achieves significantly better results than VDSR, which was state-of-the-art until a year ago.
4.5 Raindrop Removal Comparisons
In this section, we present the performance of RNet on raindrop removal and compare it with three state-of-the-art algorithms, namely Eigen et al., Pix2Pix, and DeRain, on the two test datasets introduced alongside DeRain, termed “Test_a” and “Test_b”. We use the same number of training images as DeRain; however, we train on cropped patches instead of whole images.
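Patch-based training requires cropping the same window from the degraded image and its ground truth so the pair stays pixel-aligned; a minimal sketch (the patch size and function name are illustrative, as the exact cropping scheme is not stated):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_aligned_crop(rainy, clean, patch=64):
    """Crop the same random patch from a degraded image and its
    ground truth so the training pair stays pixel-aligned."""
    h, w = rainy.shape[:2]
    top = int(rng.integers(0, h - patch + 1))
    left = int(rng.integers(0, w - patch + 1))
    return (rainy[top:top + patch, left:left + patch],
            clean[top:top + patch, left:left + patch])

rainy = rng.random((480, 720, 3))
clean = rainy.copy()  # stand-in pair; real data uses rainy/clean captures
p_rainy, p_clean = random_aligned_crop(rainy, clean)
print(p_rainy.shape)  # (64, 64, 3)
```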
Table VIII: Average PSNR / SSIM for raindrop removal.

| Dataset | Eigen et al. | Pix2Pix | DeRain | RNet (Ours) |
|---|---|---|---|---|
| Test_a | 28.59 / 0.6726 | 30.14 / 0.8299 | 31.57 / 0.9023 | 32.03 / 0.9325 |
| Test_b | - | - | 24.93 / 0.8091 | 26.42 / 0.8255 |
4.5.1 Visual Comparisons
Figure 13 presents an example image from the “Test_b” dataset, showing the cropped region near the front end of the car. DeRain fails to remove the effect of the raindrop and returns essentially the same image as the input. On the other hand, our method restores the edges in the rainy input image.
The second example in Figure 13 shows a rainy urban scene. We focus on and crop the road sign to better visualize the differences between the output of RNet and the competing methods. The DeRain network removes the edges and color information where the raindrop affected the road sign. In our case, both the edges and the color are restored and are closer to the ground-truth clean image.
4.5.2 Quantitative Comparisons
Table VIII presents the quantitative results on both “Test_a” and “Test_b” for the mentioned algorithms. Compared to the most recent state of the art in raindrop removal, i.e., DeRain, our gain is 0.46dB on “Test_a” and a significant 1.49dB on the challenging “Test_b”. Similarly, the gain over Eigen et al. is about 3.44dB. These results illustrate that our method restores images that are structurally similar to the corresponding ground truth.
4.6 JPEG Comparisons
For JPEG compression artifact removal, our method is compared against three competing methods: AR-CNN, TNRD, and DnCNN. All models are trained for the four quality factors 10, 20, 30, and 40, except TNRD, which is trained only for the first three.
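Training pairs for this task are typically produced by round-tripping clean images through the JPEG codec at each quality factor; a sketch using Pillow (the exact training pipeline is not specified in the text, so treat this as illustrative):

```python
import io

import numpy as np
from PIL import Image

def jpeg_degrade(img, quality):
    """Round-trip a uint8 image through the JPEG codec at a given
    quality factor, producing the compressed input of a training pair."""
    buf = io.BytesIO()
    Image.fromarray(img).save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return np.asarray(Image.open(buf))

# One (degraded, clean) pair per quality factor used in the comparison.
clean = (np.random.default_rng(0).random((64, 64, 3)) * 255).astype(np.uint8)
pairs = {q: (jpeg_degrade(clean, q), clean) for q in (10, 20, 30, 40)}
```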
Table IX: Average PSNR / SSIM for JPEG compression artifact removal on LIVE1.

| Quality | AR-CNN | TNRD | DnCNN | RNet (Ours) |
|---|---|---|---|---|
| 10 | 28.98 / 0.8076 | 29.15 / 0.8111 | 29.19 / 0.8123 | 29.17 / 0.8202 |
| 20 | 31.29 / 0.8733 | 31.46 / 0.8769 | 31.59 / 0.8802 | 32.28 / 0.8957 |
| 30 | 32.67 / 0.9043 | 32.84 / 0.9059 | 32.98 / 0.9090 | 33.61 / 0.9206 |
| 40 | 33.63 / 0.9198 | - | 33.96 / 0.9247 | 34.66 / 0.9352 |
Figure 14: JPEG-compressed “Monarch” image, restored by DnCNN and RNet.
4.6.1 Visual Comparisons
In Figure 14, we show a comparison of our method on the “Monarch” image. Our network retrieves fine details such as the straight line in the wing shown in the close-up, while AR-CNN and DnCNN fail to achieve the desired result and produce distorted lines.
Similarly, for the “Parrot” image in Figure 15, it can be observed that our model's output has fewer artifacts and restores the structures on the parrot's face more accurately. On the other hand, AR-CNN and DnCNN smooth out the texture and lines present in the ground-truth image. These outcomes show the importance of our attention mechanism and the enhanced capacity of the proposed model.
4.6.2 Quantitative Comparisons
The average JPEG deblocking results in terms of PSNR and SSIM are listed in Table IX for the different methods. Our gain over AR-CNN and DnCNN at a quality factor of 40 is significant, i.e., 1.03dB and 0.7dB, respectively. Similarly, averaged over all quality factors on the LIVE1 dataset, RNet improves by 0.79dB over AR-CNN and 0.5dB over DnCNN.
5 Conclusion
In this paper, we present a new CNN restoration model for real degraded photographs. This is the first end-to-end single-pass network to show state-of-the-art results across a broad range of real image restoration and enhancement tasks. Specifically, we show results on denoising, super-resolution, raindrop removal, and compression artifact removal.
Unlike previous algorithms, our model is a single blind restoration network for real degraded images. We propose a new restoration module to learn the features, and to further enhance the capability of the network, we adopt feature attention to rescale the channel-wise features by taking into account the dependencies between the channels. We also use LSC, SSC, and SC to let low-frequency information bypass the network so it can focus on residual learning. Extensive experiments on 11 real-degradation datasets for four restoration tasks against more than 30 state-of-the-art algorithms demonstrate the effectiveness of our proposed model.
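The feature attention described above follows the squeeze-and-excitation pattern of Hu et al.; a minimal numpy sketch (the reduction ratio and random weights are illustrative, not the trained model's):

```python
import numpy as np

def feature_attention(feat, w_down, w_up):
    """Channel attention in the squeeze-and-excitation style: global
    average pooling summarizes each channel, two small projections model
    inter-channel dependencies, and a sigmoid gate rescales the channels."""
    squeeze = feat.mean(axis=(1, 2))               # (C,) channel descriptor
    hidden = np.maximum(squeeze @ w_down, 0.0)     # channel reduction + ReLU
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w_up)))  # (C,) sigmoid weights
    return feat * gate[:, None, None]              # channel-wise rescaling

rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 8, 8))             # (C, H, W) feature map
out = feature_attention(feat,
                        rng.standard_normal((16, 4)),  # reduction ratio 4
                        rng.standard_normal((4, 16)))
print(out.shape)  # (16, 8, 8)
```

Because the gate lies in (0, 1), each channel is attenuated in proportion to its learned importance rather than overwritten.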
-  K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3-D transform-domain collaborative filtering,” TIP, 2007.
-  A. Buades, B. Coll, and J.-M. Morel, “A non-local algorithm for image denoising,” in CVPR, 2005.
-  S. Gu, L. Zhang, W. Zuo, and X. Feng, “Weighted nuclear norm minimization with application to image denoising,” in CVPR, 2014.
-  Y. Peng, A. Ganesh, J. Wright, W. Xu, and Y. Ma, “Rasl: Robust alignment by sparse and low-rank decomposition for linearly correlated images,” TPAMI, 2012.
-  J. Xu and S. Osher, “Iterative regularization and nonlinear inverse scale space applied to wavelet-based denoising,” TIP, 2007.
-  Y. Weiss and W. T. Freeman, “What makes a good model of natural images?” in CVPR, 2007.
-  S. Roth and M. J. Black, “Fields of experts,” IJCV, 2009.
-  S. Anwar, F. Porikli, and C. P. Huynh, “Category-specific object image denoising,” TIP, 2017.
-  H. Yue, X. Sun, J. Yang, and F. Wu, “Cid: Combined image denoising in spatial and frequency domains using web images,” in CVPR, 2014.
-  E. Luo, S. H. Chan, and T. Q. Nguyen, “Adaptive image denoising by targeted databases,” TIP, 2015.
-  Y. Chen and T. Pock, “Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration,” TPAMI, 2017.
-  K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” TIP, 2017.
-  K. Zhang, W. Zuo, S. Gu, and L. Zhang, “Learning deep cnn denoiser prior for image restoration,” CVPR, 2017.
-  S. Guo, Z. Yan, K. Zhang, W. Zuo, and L. Zhang, “Toward convolutional blind denoising of real photographs,” CVPR, 2018.
-  M. Lebrun, M. Colom, and J.-M. Morel, “The noise clinic: a blind image denoising algorithm,” IPOL, 2015.
-  Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” TNN, 1994.
-  A. Foi, V. Katkovnik, and K. Egiazarian, “Pointwise shape-adaptive DCT for high-quality denoising and deblocking of grayscale and color images,” TIP, 2007.
-  K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “BM3D image denoising with shape-adaptive principal component analysis,” in Signal Processing with Adaptive Sparse Structured Representations, 2009.
-  M. Lebrun, A. Buades, and J.-M. Morel, “A nonlocal bayesian image denoising algorithm,” SIAM JIS, 2013.
-  M. Elad and D. Datsenko, “Example-based regularization deployed to super-resolution reconstruction of a single image,” Comput. J., 2009.
-  W. Dong, X. Li, D. Zhang, and G. Shi, “Sparsity-based image denoising via dictionary learning and structural clustering,” in CVPR, 2011.
-  D. Zoran and Y. Weiss, “From learning models of natural image patches to whole image restoration,” in ICCV, 2011.
-  J. Xu, L. Zhang, W. Zuo, D. Zhang, and X. Feng, “Patch Group Based Nonlocal Self-Similarity Prior Learning for Image Denoising,” in ICCV, 2015.
-  A. Levin and B. Nadler, “Natural image denoising: Optimality and inherent bounds,” in CVPR, 2011.
-  P. Chatterjee and P. Milanfar, “Is denoising dead?” TIP, 2010.
-  S. Anwar, C. Huynh, and F. Porikli, “Combined internal and external category-specific image denoising,” in BMVC, 2017.
-  U. Schmidt and S. Roth, “Shrinkage fields for effective image restoration,” in CVPR, 2014.
-  S. Lefkimmiatis, “Non-local color image denoising with convolutional neural networks,” CVPR, 2016.
-  H. C. Burger, C. J. Schuler, and S. Harmeling, “Image denoising: Can plain neural networks compete with bm3d?” in CVPR, 2012.
-  S. Anwar, C. P. Huynh, and F. Porikli, “Chaining identity mapping modules for image denoising,” arXiv, 2017.
-  J. Jiao, W.-C. Tu, S. He, and R. W. Lau, “Formresnet: Formatted residual learning for image restoration,” in CVPR Workshops, 2017.
-  T. Plötz and S. Roth, “Neural nearest neighbors networks,” in NIPS, 2018.
-  T. Brooks, B. Mildenhall, T. Xue, J. Chen, D. Sharlet, and J. T. Barron, “Unprocessing images for learned raw denoising,” in CVPR, 2019.
-  K. Zhang, W. Zuo, and L. Zhang, “Ffdnet: Toward a fast and flexible solution for cnn-based image denoising,” TIP, 2018.
-  C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” TPAMI, 2016.
-  C. Dong, C. C. Loy, and X. Tang, “Accelerating the super-resolution convolutional neural network,” in ECCV, 2016.
-  K. Zhang, W. Zuo, and L. Zhang, “Learning a single convolutional super-resolution network for multiple degradations,” in CVPR, 2018.
-  J. Kim, J. Kwon Lee, and K. Mu Lee, “Accurate image super-resolution using very deep convolutional networks,” in CVPR, 2016.
-  ——, “Deeply-recursive convolutional network for image super-resolution,” in CVPR, 2016.
-  Y. Tai, J. Yang, and X. Liu, “Image super-resolution via deep recursive residual network,” in CVPR, 2017.
-  B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee, “Enhanced deep residual networks for single image super-resolution,” in CVPR workshops, 2017.
-  N. Ahn, B. Kang, and K.-A. Sohn, “Fast, accurate, and lightweight super-resolution with cascading residual network,” ECCV, 2018.
-  G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in CVPR, 2017.
-  T. Tong, G. Li, X. Liu, and Q. Gao, “Image super-resolution using dense skip connections,” in ICCV, 2017.
-  Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu, “Residual dense network for image super-resolution,” in CVPR, 2018.
-  M. Haris, G. Shakhnarovich, and N. Ukita, “Deep backprojection networks for super-resolution,” in CVPR, 2018.
-  Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” ECCV, 2018.
-  V. Mnih, N. Heess, A. Graves et al., “Recurrent models of visual attention,” in NIPS, 2014.
-  J.-H. Kim, J.-H. Choi, M. Cheon, and J.-S. Lee, “Ram: Residual attention module for single image super-resolution,” arXiv, 2018.
-  S. Anwar and N. Barnes, “Densely residual laplacian super-resolution,” arXiv preprint arXiv:1906.12021, 2019.
-  H. Kurihata, T. Takahashi, I. Ide, Y. Mekada, H. Murase, Y. Tamatsu, and T. Miyahara, “Rainy weather recognition from in-vehicle camera images for driver assistance,” in IVS, 2005.
-  M. Roser and A. Geiger, “Video-based raindrop detection for improved image registration,” in ICCV Workshops, 2009.
-  A. Yamashita, Y. Tanaka, and T. Kaneko, “Removal of adherent waterdrops from images acquired with stereo camera,” in IROS, 2005.
-  A. Yamashita, I. Fukuchi, and T. Kaneko, “Noises removal from image sequences acquired with moving camera by estimating camera motion from spatio-temporal information,” in IROS, 2009.
-  S. You, R. T. Tan, R. Kawakami, Y. Mukaigawa, and K. Ikeuchi, “Adherent raindrop modeling, detection and removal in video,” TPAMI, 2015.
-  D. Eigen, D. Krishnan, and R. Fergus, “Restoring an image taken through a window covered with dirt or rain,” in ICCV, 2013.
-  R. Qian, R. T. Tan, W. Yang, J. Su, and J. Liu, “Attentive generative adversarial network for raindrop removal from a single image,” in CVPR, 2018.
-  P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in CVPR, 2017.
-  P. List, A. Joch, J. Lainema, G. Bjontegaard, and M. Karczewicz, “Adaptive deblocking filter,” TCSVT, 2003.
-  C. Wang, J. Zhou, and S. Liu, “Adaptive non-local means filter for image deblocking,” Signal Processing: Image Communication, 2013.
-  C. Jung, L. Jiao, H. Qi, and T. Sun, “Image deblocking via sparse representation,” Signal Processing: Image Communication, 2012.
-  J. Jancsary, S. Nowozin, and C. Rother, “Loss-specific training of non-parametric image restoration models: A new state of the art,” in ECCV, 2012.
-  C. Dong, Y. Deng, C. Change Loy, and X. Tang, “Compression artifacts reduction by a deep convolutional network,” in ICCV, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
-  K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in ICML, 2015.
-  J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in CVPR, 2018.
-  D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in ICCV, 2001.
-  E. Agustsson and R. Timofte, “Ntire 2017 challenge on single image super-resolution: Dataset and study,” in CVPR Workshops, 2017.
-  V. Bychkovsky, S. Paris, E. Chan, and F. Durand, “Learning photographic global tonal adjustment with a database of input/output image pairs,” in CVPR, 2011.
-  A. Abdelhamed, S. Lin, and M. S. Brown, “A high-quality denoising dataset for smartphone cameras,” in CVPR, 2018.
-  J. Xu, H. Li, Z. Liang, D. Zhang, and L. Zhang, “Real-world noisy image denoising: A new benchmark,” arXiv, 2018.
-  J. Anaya and A. Barbu, “Renoir–a dataset for real low-light image noise reduction,” Journal of Visual Communication and Image Representation, 2018.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv, 2014.
-  A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in pytorch,” 2017.
-  K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Color image denoising via sparse 3-D collaborative filtering with grouping constraint in luminance-chrominance space,” in ICIP, 2007.
-  T. Plötz and S. Roth, “Benchmarking denoising algorithms with real photographs,” CVPR, 2017.
-  S. Nam, Y. Hwang, Y. Matsushita, and S. Joo Kim, “A holistic approach to cross-channel image noise modeling and its application to image denoising,” in CVPR, 2016.
-  J. Xu, L. Zhang, and D. Zhang, “A trilateral weighted sparse coding scheme for real-world image denoising,” in ECCV, 2018.
-  J. Xu, L. Zhang, D. Zhang, and X. Feng, “Multi-channel weighted nuclear norm minimization for real color image denoising,” in ICCV, 2017.
-  ABSoft. Neat image. [Online]. Available: https://ni.neatvideo.com/home
-  W. Dong, L. Zhang, G. Shi, and X. Li, “Nonlocally centralized sparse representation for image restoration,” TIP, 2012.
-  M. Aharon, M. Elad, and A. Bruckstein, “K-svd: An algorithm for designing overcomplete dictionaries for sparse representation,” TIP, 2006.
-  J.-B. Huang, A. Singh, and N. Ahuja, “Single image super-resolution from transformed self-exemplars,” in CVPR, 2015.
-  M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel, “Low-complexity single-image super-resolution based on nonnegative neighbor embedding,” in BMVC, 2012.
-  R. Zeyde, M. Elad, and M. Protter, “On single image scale-up using sparse-representations,” in ICCS, 2010.
-  H. Sheikh, “LIVE image quality assessment database release 2,” http://live.ece.utexas.edu/research/quality, 2005.