Image denoising is an important task in computer vision. During image acquisition, noise is often unavoidable due to the limitations of the imaging environment and equipment. Therefore, noise removal is an essential step, not only for visual quality but also for other computer vision tasks. Image denoising has a long history, and many methods have been proposed. Many of the earlier model-based methods find natural image priors and then apply optimization algorithms to solve the model iteratively [22, 2, 29, 40]. These methods are time consuming and often cannot remove noise effectively. With the rise of deep learning, convolutional neural networks (CNNs) have been applied to image denoising tasks and have achieved high-quality results.
On the other hand, the earlier works assume that noise is independent and identically distributed. Additive white Gaussian noise (AWGN) is often adopted to create synthetic noisy images. It is now recognized that real noise takes more complicated forms, which are spatially variant and channel dependent. Therefore, some recent works have made progress in real image denoising [25, 38, 12, 4].
However, despite the many advances in image denoising, some issues remain to be resolved. A traditional CNN can only use the features in a fixed local neighborhood, which may be irrelevant or even contradictory to the current location. Because they lack adaptability to textures and edges, CNN-based methods produce oversmoothing artifacts and lose some details. In addition, the receptive field of a traditional CNN is relatively small. Many methods deepen the network structure or use a non-local module to expand the receptive field [17, 36]. However, these methods incur high memory and time costs, which makes them difficult to apply in practice.
In this paper, we propose a spatial-adaptive denoising network (SADNet) to address the above issues. A residual spatial-adaptive block (RSAB) is designed to adapt to changes in spatial textures and edges. We introduce the modulated deformable convolution in each RSAB to sample the spatially relevant features for weighting. Moreover, we incorporate the RSAB and residual blocks (ResBlock) in an encoder-decoder structure to remove noise from coarse to fine. To further enlarge the receptive field and capture multiscale information, a context block is applied on the coarsest scale. Compared to the state-of-the-art methods, our method achieves good performance while maintaining a relatively small computational overhead.
In summary, the main contributions of our method are as follows:
We propose a novel spatial-adaptive denoising network for efficient noise removal. The network can capture the relevant features from complex image content, and recover details and textures from heavy noise.
We propose the residual spatial-adaptive block, which introduces deformable convolution to adapt to spatial textures and edges. In addition, with a multiscale structure and context block to capture multiscale information, we can estimate offsets and remove noise from coarse to fine.
We experiment on multiple synthetic image datasets and real noisy datasets. The results demonstrate that our model achieves state-of-the-art performance on both synthetic and real images with a relatively small computational overhead.
2 Related works
In general, image denoising methods include model-based and learning-based methods. Model-based methods attempt to model the distribution of natural images or noise. With the distribution as the prior, they attempt to obtain clean images with optimization algorithms. The common priors include local smoothing [22, 29], sparsity [2, 19, 32], non-local self-similarity [5, 9, 8, 31, 11] and external statistical priors [40, 33]. Non-local self-similarity is a notable prior in the image denoising task. This prior assumes that image information is redundant and that similar structures recur within a single image, so self-similar patches can be found and combined to remove noise. Many methods have been proposed based on the non-local self-similarity prior, including NLM, BM3D [9, 8], and WNNM [11, 31], which are currently widely used.
With the popularity of deep neural networks, learning-based denoising methods have developed rapidly. Some works combine natural priors with deep neural networks. TNRD introduced the field-of-experts prior into a deep neural network. NLNet combined the non-local self-similarity prior with a CNN. Limited by the designed priors, their performance is often inferior to that of end-to-end CNN methods. DnCNN introduced residual learning and batch normalization to implement end-to-end denoising. FFDNet introduced the noise level map as an input and enhanced the flexibility of the network for non-uniform noise. MemNet proposed a very deep end-to-end persistent memory network for image restoration, which fuses both short-term and long-term memories to capture different levels of information. Inspired by the non-local self-similarity prior, a non-local module was designed for neural networks. NLRN attempted to incorporate non-local modules into a recurrent neural network (RNN) for image restoration. N3Net proposed the neural nearest neighbors block to achieve a non-local operation. RNAN designed non-local attention blocks to capture global information and pay more attention to the challenging parts. However, the non-local operations lead to high memory and time consumption.
It is now recognized that real noise has a more complicated distribution. Thus, some recent works have focused on real noisy images. Several real noisy datasets have been established by capturing real noisy scenes [24, 3, 1], which promotes research on real denoising methods. N3Net demonstrated its effectiveness on real noisy datasets. CBDNet trained two subnets to perform noise estimation and non-blind denoising in sequence. PD applied the pixel-shuffle downsampling strategy to make real noise approximate AWGN, which allows a model trained on AWGN to adapt to real noise. RIDNet proposed a one-stage denoising network with feature attention for real image denoising. However, these methods lack adaptability to image content and result in oversmoothing artifacts.
The architecture of our proposed spatial-adaptive denoising network (SADNet) is shown in Fig. 2. Let $x$ denote a noisy input image and $\hat{y}$ denote the corresponding output denoised image. Then our model can be described as follows:

$$\hat{y} = \mathrm{SADNet}(x).$$
We use one convolutional layer to extract the initial features from the noisy input and then place them into a multiscale encoder-decoder architecture. In the encoder component, we use ResBlock to extract features at different scales. However, unlike the original ResBlock, we remove the batch normalization and use the leaky ReLU as the activation function. To avoid damaging image structures, we limit the number of downsampling operations and implement a context block to further enlarge the receptive field and capture multiscale information. Then, in the decoder component, we design the residual spatial-adaptive block (RSAB) to sample and weight correlated features to remove noise and reconstruct the textures. In addition, we estimate the offsets and transfer them from coarse to fine, which is beneficial for obtaining more accurate feature locations. Finally, the reconstructed features are fed to a final convolutional layer to restore the denoised image. With the use of the long residual connection, our network only learns the noise component.
In addition to the network architecture, the loss function is crucial to the performance. Several loss functions, such as $L_2$, $L_1$, perceptual loss, and asymmetric loss, have been used in denoising tasks. In general, $L_1$ and $L_2$ are the two most commonly used loss functions in previous works. The $L_2$ loss is well suited to Gaussian noise, whereas the $L_1$ loss has better tolerance for outliers. Given a batch of $N$ training pairs $\{x_i, y_i\}_{i=1}^{N}$ that contain noisy inputs $x_i$ and their corresponding noise-free ground truths $y_i$, the $L_2$ loss is defined as follows:

$$L_2(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| \mathrm{SADNet}(x_i) - y_i \right\|_2^2,$$

where $\Theta$ denotes the learned parameters in the network. In the same way, the $L_1$ loss can be expressed as follows:

$$L_1(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| \mathrm{SADNet}(x_i) - y_i \right\|_1.$$

In our experiment, we use the $L_2$ loss for training on synthetic image datasets and the $L_1$ loss for training on real-image noise datasets.
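As a minimal sketch of the two training losses discussed above, the following pure-Python snippet assumes they are the standard batch-averaged squared-error ($L_2$) and absolute-error ($L_1$) distances between network outputs and ground truths. The function names and the toy batch are illustrative assumptions, not the authors' training code; deep learning frameworks often average over pixels as well, which only rescales the value.

```python
def l2_loss(outputs, targets):
    """Squared L2 distance per pair, averaged over the batch."""
    total = 0.0
    for out, ref in zip(outputs, targets):
        total += sum((o - r) ** 2 for o, r in zip(out, ref))
    return total / len(outputs)


def l1_loss(outputs, targets):
    """L1 distance per pair, averaged over the batch."""
    total = 0.0
    for out, ref in zip(outputs, targets):
        total += sum(abs(o - r) for o, r in zip(out, ref))
    return total / len(outputs)


# Hypothetical toy batch: two flattened 3-pixel "images".
# The second output contains one outlier pixel (3.2 instead of 0.2).
outputs = [[0.5, 0.5, 0.5], [0.2, 0.2, 3.2]]
targets = [[0.5, 0.4, 0.6], [0.2, 0.2, 0.2]]

print(l2_loss(outputs, targets))  # dominated by the squared outlier error
print(l1_loss(outputs, targets))  # grows only linearly with the outlier
```

On this toy batch the single outlier pixel inflates the $L_2$ value much more than the $L_1$ value, which is consistent with the remark that the $L_1$ loss tolerates outliers better.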
In the following sections, we will focus on the RSAB and context block.
3.1 Residual spatial-adaptive block
In this section, we first introduce the deformable convolution and then propose our RSAB in detail.
Let $x(p)$ denote the features at location $p$ from the input feature map $x$. Then, for a traditional convolution operation, the corresponding output features $y(p)$ can be obtained by

$$y(p) = \sum_{p_i \in N(p)} w_i \cdot x(p_i),$$

where $N(p)$ denotes the neighborhood of location $p$, whose size is equal to the convolutional kernel size, $w_i$ denotes the weight, and $p_i$ denotes a location in $N(p)$. As shown, the traditional convolution operation strictly takes the features at fixed locations around $p$ to calculate the output feature. Some unwanted and unrelated features can interfere with the calculation of the output. As shown in Fig. 3, when the current location is near an edge, the distinct features located outside the object are introduced for weighting, which may smooth the edges and destroy the texture. For the denoising task, we hope that only the correlated or similar features are used for noise removal, similar to the non-local weighted mean denoising methods.
Therefore, we introduce deformable convolution [10, 39] to be adaptive to spatial texture changes. In contrast to traditional convolutional layers, deformable convolution can change the shapes of convolutional kernels. It first learns an offset map for every location and applies it to the feature map, which resamples the correlated features for weighting. Here, we use the modulated deformable convolution, which provides another dimension of freedom to adjust its spatial support regions,

$$y(p) = \sum_{p_i \in N(p)} w_i \cdot x(p_i + \Delta p_i) \cdot \Delta m_i,$$

where $\Delta p_i$ is the learnable offset for location $p_i$, and $\Delta m_i$ is the learnable modulation scalar, which lies in the range $[0, 1]$. It indicates the correlation between the sampled features and the features in the current location. Thus, the modulated deformable convolution can modulate the input feature amplitudes to further adjust the spatial support regions. Both $\Delta p$ and $\Delta m$ are obtained from the previous features.
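To make the sampling rule concrete, here is a hedged pure-Python sketch of a modulated deformable convolution evaluated at a single location: each kernel tap reads the feature map at its regular grid position plus an offset (with bilinear interpolation for fractional positions) and is scaled by a modulation scalar in [0, 1]. The helper names, the 3×3 kernel, the zero padding, and the hand-set offsets are assumptions for illustration; in the network the offsets and scalars are predicted from features rather than chosen by hand.

```python
import math

# Regular 3x3 sampling grid as (dy, dx) displacements around the centre.
GRID = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def bilinear(feat, y, x):
    """Sample a 2-D list at fractional (y, x); positions outside the map count as 0."""
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                value += (1 - abs(y - yy)) * (1 - abs(x - xx)) * feat[yy][xx]
    return value


def modulated_deform_conv_at(feat, p, weights, offsets, masks):
    """Sum over taps of weight * feature(grid position + offset) * modulation scalar."""
    py, px = p
    out = 0.0
    for (dy, dx), w, (oy, ox), m in zip(GRID, weights, offsets, masks):
        out += w * bilinear(feat, py + dy + oy, px + dx + ox) * m
    return out


# Hypothetical 5x5 map: an "object" (value 1) on the left, background (value 9) on the right.
feat = [[1.0, 1.0, 1.0, 9.0, 9.0] for _ in range(5)]
weights = [1.0 / 9.0] * 9          # averaging kernel
no_offsets = [(0.0, 0.0)] * 9
ones = [1.0] * 9

# Zero offsets and unit modulation reduce to a regular convolution at (2, 2):
# the three right-hand taps fall on the background.
plain = modulated_deform_conv_at(feat, (2, 2), weights, no_offsets, ones)

# Shift the three right-hand taps one pixel to the left so they stay on the object.
offsets = [(0.0, -1.0) if dx == 1 else (0.0, 0.0) for dy, dx in GRID]
adapted = modulated_deform_conv_at(feat, (2, 2), weights, offsets, ones)

print(plain, adapted)
```

With the shifted taps the output stays at the object value instead of being pulled toward the background, which is the edge-preserving behaviour the text attributes to spatially adaptive sampling.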
In each RSAB, we first fuse the extracted features and the reconstructed features from the previous scale as our input. The RSAB is constructed from a modulated deformable convolution followed by a traditional convolution with a short skip connection. Similar to ResBlock, we implement local residual learning to enhance the information flow and representation ability of the network. However, unlike ResBlock, we replace the first convolution with modulated deformable convolution and use the leaky ReLU as our activation function. Hence, the RSAB can be formulated as

$$F_{RSAB}(x) = F_{cn}(F_{act}(F_{dcn}(x))) + x,$$

where $F_{dcn}$ and $F_{cn}$ are the modulated deformable convolution and traditional convolution, respectively, and $F_{act}$ is the activation function, which is leaky ReLU here. The architecture of the RSAB is shown in Fig. 4.
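The block structure described above (deformable convolution, activation, traditional convolution, then a short skip connection) can be sketched schematically as follows. This is only a structural illustration under stated assumptions: the two layers are passed in as plain callables on 1-D lists, the toy layers are hypothetical stand-ins for learned convolutions, and the leaky ReLU slope of 0.2 is an assumed value that the text does not specify.

```python
def leaky_relu(values, negative_slope=0.2):
    """Leaky ReLU; the slope 0.2 is an assumed value, not taken from the text."""
    return [v if v >= 0 else negative_slope * v for v in values]


def rsab(features, deform_conv, conv, activation=leaky_relu):
    """Residual block: conv(act(deform_conv(x))) + x, with a short skip connection."""
    residual = conv(activation(deform_conv(features)))
    return [f + r for f, r in zip(features, residual)]


# Hypothetical stand-ins for the two learned layers (1-D, single channel).
def toy_deform_conv(x):
    return [2.0 * v - 1.0 for v in x]


def toy_conv(x):
    return [0.5 * v for v in x]


x = [0.0, 1.0, 2.0]
y = rsab(x, toy_deform_conv, toy_conv)
print(y)
```

Because of the skip connection, the two layers only need to model a correction to the input features; if the convolution branch outputs zeros, the block passes its input through unchanged.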
Furthermore, to better estimate the offsets from coarse to fine, we transfer the offsets and modulation scalars of the previous scale, $\{\Delta p^{s-1}, \Delta m^{s-1}\}$, to the current scale $s$, and then use both $\{\Delta p^{s-1}, \Delta m^{s-1}\}$ and the input features $x^{s}$ to estimate $\{\Delta p^{s}, \Delta m^{s}\}$. The formulation is

$$\{\Delta p^{s}, \Delta m^{s}\} = F_{offset}\left(x^{s}, F_{up}\left(\{\Delta p^{s-1}, \Delta m^{s-1}\}\right)\right),$$

where $F_{offset}$ and $F_{up}$ denote the offset transfer function and upsampling function, respectively, which are shown in Fig. 4. The offset transfer function consists of several convolutions; it extracts features from the input and fuses them with the previous offsets to estimate the offsets at the current scale. The upsampling function magnifies both the size and the values of the previous offset maps. In our experiment, bilinear interpolation is adopted to upsample the offsets and modulation scalars.
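The upsampling step can be illustrated with a small pure-Python sketch. It assumes a scale factor of 2 between adjacent scales, half-pixel-centre bilinear interpolation, and that only the offsets (which are displacements measured in pixels) have their values multiplied by the scale factor while modulation scalars are merely resized; these details and the function names are assumptions, since the text only states that the size and value of the offset maps are magnified.

```python
def upsample_bilinear_x2(grid):
    """Bilinear x2 upsampling of a 2-D list using half-pixel-centre alignment."""
    h, w = len(grid), len(grid[0])
    out = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(2 * h):
        for j in range(2 * w):
            # Map the output pixel centre back to input coordinates and clamp to the map.
            y = min(max((i + 0.5) / 2 - 0.5, 0.0), h - 1)
            x = min(max((j + 0.5) / 2 - 0.5, 0.0), w - 1)
            y0, x0 = int(y), int(x)
            y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
            fy, fx = y - y0, x - x0
            top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
            bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
            out[i][j] = top * (1 - fy) + bottom * fy
    return out


def upsample_offsets(values, scale_values=True):
    """Resize a map to twice the resolution; optionally double the values as well."""
    up = upsample_bilinear_x2(values)
    if scale_values:
        # A displacement of d pixels at the coarse scale spans 2*d pixels at the fine scale.
        return [[2.0 * v for v in row] for row in up]
    return up


coarse_dx = [[0.0, 2.0]]                 # hypothetical horizontal offsets, 1x2 map
fine_dx = upsample_offsets(coarse_dx)    # 2x4 map with doubled values

coarse_mask = [[0.5, 0.5]]               # hypothetical modulation scalars
fine_mask = upsample_offsets(coarse_mask, scale_values=False)

print(fine_dx)
print(fine_mask)
```

The upsampled maps would then be combined with the current-scale features by the offset transfer function to predict refined offsets, a step that involves learned convolutions and is not reproduced here.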
3.2 Context block
Multiscale information is important for image denoising tasks; therefore, the downsampling operation is often adopted in networks. However, when the spatial resolution is too small, the structure of the image is destroyed and information is lost, which is not conducive to reconstructing the features.
To increase the receptive field and capture multiscale information without further reducing the spatial resolution, we introduce a context block at the smallest scale between the encoder and decoder. Context blocks have been successfully used in image segmentation and deblurring tasks. In contrast to spatial pyramid pooling, the context block uses several dilated convolutions with different dilation rates rather than downsampling. It can expand the receptive field without increasing the number of parameters or damaging the structure. The features extracted from different receptive fields are then fused to estimate the output (as shown in Fig. 5). This is beneficial for estimating offsets from a larger receptive field.
In our experiment, we remove the batch normalization layer and use only four dilation rates, which are set to 1, 2, 4, and 8. To further simplify the operation and reduce the running time, we first use a convolution to compress the channels of the features. The compression ratio is set to 4. In the fusion step, we use a convolution to output fused features whose number of channels equals that of the original input features. Similarly, a local skip connection between the input and output features is applied to prevent information blocking.
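The claim that dilation enlarges the receptive field without adding parameters can be checked with a short sketch. The kernel size of 3 and the 64-channel branch width (256 channels at the coarsest scale divided by the compression ratio of 4) are assumptions made for this illustration, and the 1-D convolution is a simplified stand-in for the 2-D layers.

```python
def effective_kernel(kernel, dilation):
    """Spatial extent covered by a dilated kernel along one axis."""
    return dilation * (kernel - 1) + 1


def conv_params(kernel, c_in, c_out):
    """Weights of a 2-D conv layer (bias ignored); independent of the dilation rate."""
    return kernel * kernel * c_in * c_out


def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded 1-D dilated convolution (cross-correlation form)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - half) * dilation   # taps are spaced `dilation` samples apart
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


rates = [1, 2, 4, 8]                                   # dilation rates stated in the text
extents = [effective_kernel(3, d) for d in rates]      # assumes 3x3 kernels
branch_params = [conv_params(3, 64, 64) for _ in rates]

# An impulse in the middle of a length-17 signal shows which positions a rate-8 kernel reaches.
impulse = [0.0] * 17
impulse[8] = 1.0
response = dilated_conv1d(impulse, [1.0, 1.0, 1.0], dilation=8)
touched = [i for i, v in enumerate(response) if v != 0.0]

print(extents, branch_params, touched)
```

Every branch has the same number of weights, yet the rate-8 branch spans 17 positions along each axis, so the four branches together see context at several scales while the feature map keeps its resolution.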
In the proposed model, we use four scales for the encoder-decoder architecture, and the number of channels for each scale is set to 32, 64, 128, and 256. Except for the first and last convolutional layers, unless stated otherwise, all other convolutional layers adopt a kernel size of . The kernel size of the first and last convolutional layers is set as . The final output is set to 1 or 3 channels depending on the input.
In this section, we demonstrate the effectiveness of our model on both synthetic datasets and real noisy datasets. We adopt DIV2K, which contains 800 2K images, and add different levels of noise to synthesize noisy datasets. For real noisy images, we use the SIDD, RENOIR and Poly datasets. We randomly rotate and flip the images horizontally and vertically for data augmentation. In each training batch, we use 16 patches with size of as inputs. We train our model using the ADAM optimizer with , , and . The initial learning rate is set to and then halved after . Our model is implemented in the PyTorch framework with an Nvidia GeForce GTX 1080Ti. In addition, PSNR and SSIM are employed to evaluate the results.
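For readers unfamiliar with the evaluation setup, the following hedged sketch shows how a synthetic noisy image can be produced by adding AWGN and how PSNR is computed from the mean squared error. The 8-bit intensity range (peak value 255), the flat test image, and the function names are assumptions for illustration; no clipping or quantization is applied, and SSIM is not reproduced here.

```python
import math
import random


def add_awgn(image, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma to every pixel."""
    return [[pixel + rng.gauss(0.0, sigma) for pixel in row] for row in image]


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB for images on a [0, peak] scale."""
    count = sum(len(row) for row in reference)
    mse = sum(
        (r - e) ** 2
        for ref_row, est_row in zip(reference, estimate)
        for r, e in zip(ref_row, est_row)
    ) / count
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


rng = random.Random(0)                      # fixed seed for repeatability
clean = [[128.0] * 64 for _ in range(64)]   # hypothetical flat gray image
noisy = add_awgn(clean, sigma=50.0, rng=rng)

# For pure AWGN the MSE is close to sigma**2, so PSNR is near 20*log10(255/50) dB.
print(round(psnr(clean, noisy), 2))
```

A denoiser is then judged by how far it raises the PSNR of its output above that of the noisy input when both are compared against the clean image.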
4.1 Ablation study
We perform the ablation study on the Kodak24 dataset with a noise level of $\sigma = 50$. The results are shown in Table 1.
Ablation on RSAB. RSAB is the crucial block in our network. Without RSAB, the network loses its adaptability to image content. When we replace RSAB with an original ResBlock, the performance decreases significantly, which demonstrates its effectiveness.
Ablation on context block. The context block complements the downsampling operations to capture information from a larger field. We can observe that the performance improves when the context block is introduced.
Ablation on offset transfer. We remove the offset transfer from coarse to fine and only use the features on the current scale to estimate the offsets for RSAB. The comparison validates the effectiveness of offset transfer.
We compare our algorithm with the state-of-the-art denoising methods with PSNR as the evaluation metric. For a fair comparison, all methods in the comparison employ the default settings provided by the corresponding authors.
4.2.1 Synthetic noisy images
In the comparisons on synthetic noisy images, we use BSD68 and Kodak24 as our test datasets. Both datasets have color and gray images for testing. We add AWGN of different noise levels to the clean images. We choose BM3D and CBM3D as representatives of the classical traditional methods and several CNN-based methods, including DnCNN, MemNet, FFDNet, RNAN, and RIDNet, for comparison.
Tables 2 and 3 report the quantitative results for gray images and color images at three different noise levels. Our SADNet outperforms the state-of-the-art methods on most of the tested noise levels and datasets. Note that although RNAN achieves results comparable to our method in some cases, it requires more parameters and computational overhead. In addition, we can observe that our method shows larger improvements at higher noise levels, which demonstrates the effectiveness of the method for heavy noise removal.
The visual comparisons are shown in Fig. 6 and Fig. 7. We present some challenging examples from BSD68 and Kodak24. The feathers of birds and the texture of clothes are difficult to separate from heavy noise. The other methods in the comparison remove the details along with the noise, which results in oversmoothing artifacts. Due to its adaptivity to the image content, our method can restore the vivid textures from noise without other artifacts.
4.2.2 Real noisy images
In the comparisons on real noisy images, we choose DND, SIDD and Nam as our test datasets. DND contains 50 real noisy images with their corresponding clean images. One thousand patches with a size of $512 \times 512$ are extracted from the dataset by the providers for testing and comparison. The SIDD validation dataset, which contains 1280 noisy-clean image pairs, is used for our evaluation. Nam includes 15 large image pairs with JPEG compression for 11 scenes. We cropped the images into patches and selected 25 patches, following CBDNet, for testing. We train our model on the SIDD medium dataset and RENOIR for evaluation on the DND and SIDD validation datasets. Then, we finetune our model on the Poly dataset for Nam, which improves the performance on noisy images with JPEG compression. Furthermore, we choose state-of-the-art methods that have demonstrated their validity on real noisy images for comparison, including CBM3D, DnCNN, CBDNet, PD, and RIDNet.
DND. The quantitative results are listed in Table 4 and are obtained from the public DND benchmark website. FFDNet+ is the improved version of FFDNet with a uniform noise level map manually selected by the providers. CDnCNN-B is the original DnCNN model for blind color denoising. DnCNN+ is finetuned on CDnCNN-B with the results of FFDNet+. Both non-blind and blind denoising methods are included for comparison. Our SADNet outperforms the state-of-the-art methods on both PSNR and SSIM values. We further make a visual comparison on denoised images from the DND dataset, which is shown in Fig. 8. We magnify the patches for better comparison. Other methods corrode the edges with residual noise. Our method can effectively remove the noise in the smooth region and keep the edges clear.
SIDD. The images in the SIDD dataset are captured by smartphones, and some noisy images have high noise levels. We employ the 1280 validation images for quantitative comparisons, which are shown in Table 5. The results show that our method achieves significant improvements over the other methods. For visual comparisons, we choose two examples from the denoised results. The first scene has rich texture and the second scene has strong structures. As shown in Fig. 9 and Fig. 10, CDnCNN-B and CBDNet fail to remove the noise, CBM3D introduces pseudo artifacts, and PD and RIDNet destroy the textures. Our network can recover textures and structures that are closer to the ground truth.
Nam. The JPEG compression makes the noise more stubborn on the Nam dataset. For a fair comparison, we use the patches chosen by CBDNet for evaluation. Furthermore, CBDNet (JPEG), which was trained on JPEG compressed datasets, is introduced for comparison. We report the average PSNR values for Nam in Table 6. Our SADNet achieves gains of 1.98, 1.93 and 1.71 dB over RIDNet, PD, and CBDNet (JPEG), respectively. In the visual comparison shown in Fig. 11, our method again obtains the best result among the compared methods.
4.2.3 Parameters and running times
To compare the running times, we test the different methods on color image denoising. Because the running time may depend on the test platform and code, we also provide the number of floating point operations (FLOPs). All methods are implemented in PyTorch. Although SADNet has a large number of parameters, its FLOPs are the lowest, and its running time is relatively short because of the multiple downsampling operations: most operations of our model run on smaller-scale feature maps.
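A back-of-the-envelope calculation illustrates why a multiscale network can hold many parameters yet need few operations. The sketch below counts parameters and multiply-accumulates (MACs) for one convolution per scale, using the channel widths 32, 64, 128, and 256 given earlier; the 3×3 kernel, the 256×256 input, and the halving of resolution at each downsampling are assumptions for illustration and do not reproduce the reported FLOPs.

```python
def conv_cost(height, width, c_in, c_out, kernel=3):
    """Parameters and multiply-accumulates (MACs) of one conv layer, bias ignored."""
    params = kernel * kernel * c_in * c_out
    macs = params * height * width   # one kernel application per output pixel
    return params, macs


channels = [32, 64, 128, 256]   # per-scale widths stated in the text
size = 256                      # hypothetical input height and width

costs = []
for scale, ch in enumerate(channels):
    side = size >> scale        # assumes each downsampling halves the resolution
    costs.append(conv_cost(side, side, ch, ch))

for scale, (params, macs) in enumerate(costs):
    print(f"scale {scale}: params={params}, MACs={macs}")
```

Under these assumptions each step down the pyramid quadruples the parameters of a layer while its operation count stays the same, so placing most layers at coarse scales increases model size much faster than compute.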
In this paper, we propose the spatial-adaptive denoising network for effective noise removal. The network is built from multiscale residual spatial-adaptive blocks, which sample relevant features for weighting based on image content and textures. We further introduce a context block to capture multiscale information and implement offset transfer to estimate the sampling locations more accurately. We find that the introduction of spatially adaptive capability results in richer details for complex scenes. The proposed SADNet achieves state-of-the-art performance with a moderate running time.
This work is partially supported by Science and Technology on Optical Radiation Laboratory (61424080211).
[1] A high-quality denoising dataset for smartphone cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1692–1700.
[2] (2006) K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing 54 (11), pp. 4311–4322.
[3] (2018) RENOIR–a dataset for real low-light image noise reduction. Journal of Visual Communication and Image Representation 51, pp. 144–154.
[4] (2019) Real image denoising with feature attention. arXiv preprint arXiv:1904.07396.
[5] (2005) A non-local algorithm for image denoising. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), Vol. 2, pp. 60–65.
[6] (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.
[7] (2016) Trainable nonlinear reaction diffusion: a flexible framework for fast and effective image restoration. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6), pp. 1256–1272.
[8] (2007) Color image denoising via sparse 3D collaborative filtering with grouping constraint in luminance-chrominance space. In 2007 IEEE International Conference on Image Processing, Vol. 1, pp. I–313.
[9] (2007) Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Transactions on Image Processing 16 (8), pp. 2080–2095.
[10] (2017) Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 764–773.
[11] (2014) Weighted nuclear norm minimization with application to image denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2862–2869.
[12] (2019) Toward convolutional blind denoising of real photographs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1712–1722.
[13] (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (9), pp. 1904–1916.
[14] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[15] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[16] (2017) Non-local color image denoising with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3587–3596.
[17] (2018) Non-local recurrent network for image restoration. In Advances in Neural Information Processing Systems, pp. 1673–1682.
[18] (2013) Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, Vol. 30, pp. 3.
[19] (2009) Non-local sparse models for image restoration. In ICCV, Vol. 29, pp. 54–62.
[20] (2001) A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics.
[21] (2016) A holistic approach to cross-channel image noise modeling and its application to image denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1683–1691.
[22] (2005) An iterative regularization method for total variation-based image restoration. Multiscale Modeling & Simulation 4 (2), pp. 460–489.
[23] (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035.
[24] (2017) Benchmarking denoising algorithms with real photographs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1586–1595.
[25] (2018) Neural nearest neighbors networks. In Advances in Neural Information Processing Systems (NeurIPS).
[26] (2017) MemNet: a persistent memory network for image restoration. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4539–4547.
[27] (2018) Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803.
[28] (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612.
[29] (2007) Iterative regularization and nonlinear inverse scale space applied to wavelet-based denoising. IEEE Transactions on Image Processing 16 (2), pp. 534–544.
[30] (2018) Real-world noisy image denoising: a new benchmark. arXiv preprint arXiv:1804.02603.
[31] (2017) Multi-channel weighted nuclear norm minimization for real color image denoising. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1096–1104.
[32] (2018) A trilateral weighted sparse coding scheme for real-world image denoising. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 20–36.
[33] (2018) External prior guided internal prior learning for real-world noisy image denoising. IEEE Transactions on Image Processing 27 (6), pp. 2996–3010.
[34] (2017) Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing 26 (7), pp. 3142–3155.
[35] (2018) FFDNet: toward a fast and flexible solution for CNN-based image denoising. IEEE Transactions on Image Processing 27 (9), pp. 4608–4622.
[36] (2019) Residual non-local attention networks for image restoration. arXiv preprint arXiv:1903.10082.
[37] (2019) DAVANet: stereo deblurring with view aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10996–11005.
[38] (2019) When AWGN-based denoiser meets real noises. arXiv preprint arXiv:1904.03485.
[39] (2019) Deformable ConvNets v2: more deformable, better results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9308–9316.
[40] (2011) From learning models of natural image patches to whole image restoration. In 2011 International Conference on Computer Vision, pp. 479–486.