Code of 'When AWGN-based Denoiser Meets Real Noises'
Discriminative-learning-based image denoisers have achieved promising performance on synthetic noise such as additive Gaussian noise. However, their performance on images with real noise is often unsatisfactory. The main reason is that real noises are mostly spatially/channel-correlated and spatially/channel-variant. In contrast, the synthetic Additive White Gaussian Noise (AWGN) adopted in most previous work is pixel-independent. In this paper, we propose a novel approach to boost the performance of a real image denoiser that is trained only with synthetic pixel-independent noise data. First, we train a deep model that consists of a noise estimator and a denoiser with mixed AWGN and Random Value Impulse Noise (RVIN). We then investigate a Pixel-shuffle Down-sampling (PD) strategy to adapt the trained model to real noises. Extensive experiments demonstrate the effectiveness and generalization ability of the proposed approach. Notably, our method achieves state-of-the-art performance on real sRGB images in the DND benchmark. Code is available at https://github.com/yzhouas/PD-Denoising-pytorch.
Image denoising is a fundamental task in image processing and computer vision that has been extensively explored over the past several decades for a range of downstream applications [31, 32, 42]. Traditional methods, including those based on image filtering, low-rank approximation [10, 36, 37], sparse coding [8, 7, 35], and image priors [30, 44, 34], have achieved satisfactory results on synthetic noise such as Additive White Gaussian Noise (AWGN). Recently, deep CNNs have been applied to this task, and discriminative-learning-based methods such as DnCNN outperform most traditional methods on AWGN denoising.
Unfortunately, while these learning-based methods work well on the same type of synthetic noise that they are trained on, their performance degrades rapidly on real images, showing poor generalization in real-world applications. This indicates that these data-driven denoising models are highly domain-specific and do not transfer flexibly to noise types beyond AWGN. To improve model flexibility, the recently proposed FFDNet trains a conditional non-blind denoiser with a manually adjusted noise-level map. However, feeding FFDNet high-valued uniform maps yields only over-smoothed results in real image denoising. Blind denoising of real images therefore remains very challenging due to the lack of accurate modeling of the real noise distribution. These unknown real-world noises are much more complex than pixel-independent AWGN: they are spatially variant, spatially correlated, signal-dependent, and even device-dependent.
To better address real image denoising, current attempts can be roughly divided into the following categories: (1) realistic noise modeling, (2) noise profiling such as multi-scale [11, 37], multi-channel, and region-based settings, and (3) data augmentation techniques such as adversarial-learning-based ones. Among them, CBDNet achieves good performance by modeling realistic noise with an in-camera pipeline model. It also trains an explicit noise estimator and sets a larger penalty for under-estimated noise. The network is trained on both synthetic and real noises, but it still cannot fully characterize real noises.
In this work, from a novel viewpoint of real image blind denoising, we seek to adapt a learning-based denoiser trained on pixel-independent synthetic noises to real noises. As shown in Fig. 1, we assume that real noises differ from pixel-independent synthetic noises mainly in spatial/channel variance and correlation. This difference results from in-camera processing steps such as demosaicing and color and gamma transforms. Based on this assumption, we first propose to train a basis denoising network using mixed AWGN and RVIN. Our flexible basis network consists of an explicit noise estimator followed by a conditional denoiser. We demonstrate that such fully convolutional networks are effective in coping with pixel-independent spatial/channel-variant noises. Second, we propose a simple yet effective adaptation strategy, Pixel-shuffle Down-sampling (PD), which employs the divide-and-conquer idea to handle real noises by breaking down the spatial correlation.
In summary, our main contributions include:
We propose a new flexible deep denoising model (trained with AWGN and RVIN) for both blind and non-blind image denoising. We also demonstrate that such fully convolutional models trained on spatial-invariant noises can handle spatial-variant noises.
We adapt the AWGN-RVIN-trained deep denoiser to real noises by applying a novel strategy based on Pixel-shuffle Down-sampling (PD). Spatially-correlated noises are broken down to pixel-wise independent noises by the pixel-shuffle operation.
The proposed method achieves state-of-the-art performance on the DND benchmark and other real noisy images. We also show that the proposed PD strategy can boost the performance of other existing denoising models.
Discriminative Learning based Denoiser. Denoising methods based on CNNs have achieved impressive performance in removing synthetic Gaussian noise. Burger et al. proposed applying a multi-layer perceptron (MLP) to the denoising task. Chen et al. proposed a trainable nonlinear reaction diffusion (TNRD) model for Gaussian noise removal at different levels. MLP and TNRD are high-performance non-blind Gaussian denoising models that are comparable with BM3D. DnCNN was the first to propose a blind Gaussian denoising network using deep CNNs, and it demonstrated the effectiveness of residual learning and batch normalization. Further network structures, such as dilated convolution [40, 38], autoencoders with skip connections, ResNet, and the recursively branched deconvolutional network (RBDN), were proposed to either enlarge the receptive field or balance efficiency. Recently, some interest has been directed at combining image denoising with high-level vision tasks like classification and segmentation. Liu et al. [16, 15] applied segmentation to enhance the denoising performance on different regions. Similar class-aware work was developed in [20, 23]. Due to domain-specific training and insufficient realistic noise data, these deep models are not robust enough on realistic noises. The recently proposed FFDNet performs non-blind denoising by concatenating the noise level as a map to the noisy image. By manually adjusting the noise level to a higher value, FFDNet achieves spatially invariant denoising on realistic noises, but with over-smoothed details.
Blind Denoising on Real Noisy Images. Real noises of CCD cameras are complicated and are related to the optical sensors and the in-camera processing. Specifically, multiple noise sources such as photon noise and read-out noise, together with processing steps including demosaicing and color and gamma transformation, introduce the main characteristics of real noises: spatial/channel correlation, variance, and signal dependence. To approximate real noise, multiple types of synthetic noise have been explored in previous work, including Gaussian-Poisson [9, 18], Gaussian Mixture Model (GMM), in-camera process simulation [14, 29], and GAN-generated noises, to name a few. CBDNet first simulated real noise and trained a subnetwork for noise estimation, in which spatially variant noise is represented as spatial maps. In addition, multi-channel [36, 29] and multi-scale [11, 22, 38, 37] strategies were also investigated for adaptation. Unlike the aforementioned works, which focus on directly synthesizing or simulating noises for training, we apply the AWGN-RVIN model and focus on a pixel-shuffle adaptation strategy to bridge the gap between pixel-independent synthetic noises and pixel-correlated real noises.
The architecture of the proposed basis model is illustrated in Figure 2. The basis noise model is mixed AWGN-RVIN. It is claimed that real-world noise can be locally approximated as AWGN [41, 13, 35], thus we generate AWGN to handle unknown noise sources. RVIN is mixed as a remedy to better deal with impulse noises in real images mostly caused by ADC errors. We generate AWGN, RVIN and mixed AWGN-RVIN following PGB.
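To make the mixed noise model concrete, the following is a minimal sketch of synthesizing AWGN-RVIN corruption together with uniform noise-level maps. It is not the PGB procedure itself: the function names, the grayscale list-of-rows image layout, and the normalization by the maximum level are assumptions made for illustration, and only the level ranges (standard deviation up to 75, impulse ratio up to 0.3) come from the text.

```python
import random

def add_awgn_rvin(image, sigma, ratio, rng):
    """Corrupt a grayscale image (list of rows, values on the 0-255 scale).

    sigma: AWGN standard deviation on the 0-255 scale.
    ratio: fraction of pixels replaced by a uniform random value (RVIN).
    """
    noisy = []
    for row in image:
        out_row = []
        for value in row:
            if rng.random() < ratio:
                # Random-value impulse: the pixel is replaced outright.
                out_row.append(rng.uniform(0.0, 255.0))
            else:
                # Additive white Gaussian noise, independent per pixel.
                out_row.append(value + rng.gauss(0.0, sigma))
        noisy.append(out_row)
    return noisy

def ground_truth_maps(height, width, sigma, ratio, max_sigma=75.0, max_ratio=0.3):
    """Uniform noise-level maps scaled to [0, 1] (scaling rule is an assumption)."""
    awgn_map = [[sigma / max_sigma] * width for _ in range(height)]
    rvin_map = [[ratio / max_ratio] * width for _ in range(height)]
    return awgn_map, rvin_map

rng = random.Random(0)
clean = [[128.0] * 64 for _ in range(64)]
sigma = rng.uniform(0.0, 75.0)   # AWGN level sampled from the training range
ratio = rng.uniform(0.0, 0.3)    # RVIN level sampled from the training range
noisy = add_awgn_rvin(clean, sigma, ratio, rng)
awgn_map, rvin_map = ground_truth_maps(64, 64, sigma, ratio)
```

The sketch does not clip the result to the valid intensity range; whether clipping is applied during training is not stated here.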
The proposed blind denoising model consists of a noise estimator and a follow-up non-blind denoiser. Given a noisy observation produced from a noise-free image by the noise synthesis process, the model aims to jointly learn the noise residual, and it is trained on paired synthetic data. Specifically, the noise estimator outputs six pixel-wise noise-level maps that correspond to two noise types, i.e., AWGN and RVIN, across three channels (R, G, B). The noisy observation is then concatenated with the estimated noise-level maps and fed into the non-blind denoiser, which outputs the noise residual. Three objectives supervise the network training: the noise estimation objective, which compares the estimated maps with the ground truth maps; the blind denoising objective, in which the denoiser is conditioned on the estimated maps; and the non-blind denoising objective, in which the denoiser is conditioned on the ground truth maps. The estimator and the denoiser each have their own trainable parameters. The ground truth noise-level maps of a noisy observation consist of AWGN maps and RVIN maps. For AWGN, the ground truth is represented as uniform maps filled with the same standard deviation value, ranging from 0 to 75, across the R, G, B channels. For RVIN, it is represented as maps filled with the ratio of corrupted pixels, with the upper bound set to 0.3. We further normalize the maps to the range [0,1]. The full objective is a weighted sum of the above three losses, in which the three weights are hyper-parameters that balance the losses, and we set them to be equal for simplicity.
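For reference, the three objectives and their weighted sum can be sketched as follows. This is an illustrative form, not necessarily the exact formulas: the symbols $E$ (noise estimator), $R$ (non-blind denoiser), $y_i$ (noisy observation), $v_i$ (noise residual), $e_i$ (ground truth noise-level maps), $W_E$ and $W_R$ (trainable parameters), $N$ (number of training pairs), and the weights $\alpha, \beta, \gamma$ are notation assumed here, and the squared-error form with the $\tfrac{1}{2N}$ factor is an assumption in the style of DnCNN.

```latex
\begin{aligned}
\mathcal{L}_{e}  &= \frac{1}{2N}\sum_{i=1}^{N}\bigl\| E(y_i; W_E) - e_i \bigr\|_2^2,\\
\mathcal{L}_{b}  &= \frac{1}{2N}\sum_{i=1}^{N}\bigl\| R\bigl(y_i, E(y_i; W_E); W_R\bigr) - v_i \bigr\|_2^2,\\
\mathcal{L}_{nb} &= \frac{1}{2N}\sum_{i=1}^{N}\bigl\| R(y_i, e_i; W_R) - v_i \bigr\|_2^2,\\
\mathcal{L}      &= \alpha\,\mathcal{L}_{e} + \beta\,\mathcal{L}_{b} + \gamma\,\mathcal{L}_{nb}.
\end{aligned}
```

Under this reading, the blind and non-blind terms differ only in whether the denoiser is conditioned on the estimated maps or on the ground truth maps.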
The proposed model structure can perform both blind and non-blind denoising simultaneously, and the model is more flexible in interactive denoising and result adjustment. Explicit noise estimation also benefits noise modeling and disentanglement.
Pixel-shuffle down-sampling is defined to create a mosaic by sampling the image with a given stride. Compared to other down-sampling methods such as linear interpolation, bi-cubic interpolation, and pixel area relation, pixel-shuffle and nearest-neighbour down-sampling of a noisy image do not alter the real noise distribution. In addition, pixel-shuffle benefits image recovery because, unlike the other methods, it preserves the original pixels of the image. These two advantages lead to the two stages of the PD strategy: adaptation and refinement.
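The sampling operation can be sketched as below. This is a minimal illustration on a grayscale image stored as a list of rows; the names `pd_down` and `pd_up` are hypothetical, and the sub-images are returned as a list rather than tiled into a single mosaic image. It assumes the image has at least `stride` rows and columns.

```python
def pd_down(image, stride):
    """Split an image (list of rows) into stride * stride sub-images.

    Sub-image (a, b) keeps the pixels at rows a, a + stride, ... and
    columns b, b + stride, ...; no pixel value is altered.
    """
    return [[row[b::stride] for row in image[a::stride]]
            for a in range(stride) for b in range(stride)]

def pd_up(sub_images, stride):
    """Inverse of pd_down: interleave the sub-images back into one image."""
    height = sum(len(sub_images[a * stride]) for a in range(stride))
    width = sum(len(sub_images[b][0]) for b in range(stride))
    image = [[None] * width for _ in range(height)]
    for index, sub in enumerate(sub_images):
        a, b = divmod(index, stride)
        for i, sub_row in enumerate(sub):
            for j, value in enumerate(sub_row):
                image[a + i * stride][b + j * stride] = value
    return image

image = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]
subs = pd_down(image, 2)
# subs[0] holds the pixels at even rows and even columns: [[0, 2], [8, 10]]
restored = pd_up(subs, 2)
```

Because every output pixel is copied from the input, the per-pixel noise statistics are untouched; only the spatial neighbourhood of each pixel changes.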
Learning-based denoisers trained on AWGN are not robust enough to real noises because of the domain difference. To adapt the noise model to real noise, we briefly analyze and justify our assumption about the difference between real noises and Gaussian noise: spatial/channel variance and correlation.
Suppose a noise estimator is robust, i.e., it can accurately estimate the exact noise level. For a single AWGN-corrupted image, pixel-shuffle down-sampling will then influence neither the AWGN variance nor the estimated values, provided the sampling stride is small enough to preserve the textural structures. Extending this to the real noise case, we arrive at an interesting hypothesis: as we increase the sampling stride of pixel-shuffle, the estimated values of specific noise estimators will first fluctuate and then remain steady over a few stride increments. This hypothesis is plausible because pixel-shuffle breaks spatially correlated noise patterns down into pixel-independent ones, which can be approximated as spatially variant AWGN and thus handled by those estimators.
We test this hypothesis on both an existing noise estimator and our proposed pixel-wise estimator. As shown in Fig. 3 (a), we randomly crop a patch from a random noisy image in SIDD and add AWGN to its noise-free ground truth. After pixel-shuffling both the real noisy patch and the AWGN-corrupted ground truth, the noise pattern of the real noisy patch shows the expected pixel independence from a certain stride on. Using the existing estimator, the estimate for the AWGN-corrupted image is unchanged (Fig. 3 (b), left), whereas the estimate for the real noisy patch (Fig. 3 (b), right) first increases and then stays steady beyond that stride. This is consistent with the visual pattern and with our hypothesis.
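The effect of pixel-shuffle on spatial correlation can be illustrated with a toy experiment. The sketch below is not the SIDD experiment: it assumes a synthetic correlated noise field obtained by averaging white Gaussian noise over 2x2 windows, and compares the correlation between horizontally adjacent pixels before and after taking one stride-2 sub-image. All names are hypothetical.

```python
import random

def correlated_noise(size, rng):
    """White Gaussian noise averaged over 2x2 windows (toy correlated noise)."""
    white = [[rng.gauss(0.0, 1.0) for _ in range(size + 1)] for _ in range(size + 1)]
    return [[(white[i][j] + white[i][j + 1] + white[i + 1][j] + white[i + 1][j + 1]) / 4.0
             for j in range(size)] for i in range(size)]

def neighbour_correlation(field):
    """Pearson correlation between horizontally adjacent pixels."""
    left = [row[j] for row in field for j in range(len(row) - 1)]
    right = [row[j + 1] for row in field for j in range(len(row) - 1)]
    mean_l = sum(left) / len(left)
    mean_r = sum(right) / len(right)
    cov = sum((a - mean_l) * (b - mean_r) for a, b in zip(left, right))
    var_l = sum((a - mean_l) ** 2 for a in left)
    var_r = sum((b - mean_r) ** 2 for b in right)
    return cov / (var_l * var_r) ** 0.5

rng = random.Random(0)
noise = correlated_noise(128, rng)
sub_image = [row[0::2] for row in noise[0::2]]   # one stride-2 sub-image
corr_full = neighbour_correlation(noise)          # adjacent windows overlap
corr_sub = neighbour_correlation(sub_image)       # windows no longer overlap
```

In this toy model adjacent pixels share half of their averaging window, so the full-resolution correlation is clearly positive, while the stride-2 neighbours share no samples and their correlation is close to zero.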
One assumption of the existing estimator is that the noise is additive and evenly distributed across the image. For spatially variant, signal-dependent real noises, our pixel-wise estimator has an advantage. To gather statistics on the spatially variant noise estimates, we extract the three AWGN channels of the estimated noise map, each with the width and height of the input image, and compute the normalized 10-bin histogram of each channel at each stride. We introduce a changing factor, computed across the channels, to monitor how the noise map distribution changes as the stride increases. We then investigate how the sequence of changing factors differs between AWGN and realistic noises. Specifically, we randomly select 50 images from CBSD68 and add random-level AWGN to them. For comparison, we randomly pick 50 image patches from the DND benchmark. In Fig. 3 (c), the sequence remains close to zero for all AWGN-corrupted images (left), while for real noises it shows an abrupt drop at a particular stride, indicating that the spatial correlation has been broken from that stride on.
The above analysis inspires the proposed adaptation strategy based on pixel-shuffle. Intuitively, we aim to find the smallest stride that makes the down-sampled spatially correlated noises match pixel-independent AWGN. Thus we keep increasing the stride until the changing factor drops below a threshold. We run the above experiments on CBSD68 for 100 iterations to select a proper generalized threshold, which we set empirically after averaging the maximum of each iteration.
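The stride search can be sketched as follows. This is a toy version under several assumptions: the learned pixel-wise estimator is replaced by a crude proxy (absolute horizontal neighbour difference), the image is single-channel, the changing factor is taken to be the L1 distance between 10-bin histograms at consecutive strides, and the threshold of 0.15 is chosen for this toy data only. None of these choices are taken from the paper.

```python
import random

def pd_down(image, stride):
    """Pixel-shuffle an image (list of rows) into stride * stride sub-images."""
    return [[row[b::stride] for row in image[a::stride]]
            for a in range(stride) for b in range(stride)]

def proxy_noise_map(image):
    """Stand-in for a pixel-wise noise estimator (assumption, not the CNN)."""
    return [abs(row[j + 1] - row[j]) for row in image for j in range(len(row) - 1)]

def histogram(values, bins=10, upper=2.0):
    """Normalised histogram; values above `upper` fall into the last bin."""
    counts = [0] * bins
    for value in values:
        counts[min(int(value / upper * bins), bins - 1)] += 1
    return [count / len(values) for count in counts]

def changing_factors(image, max_stride):
    """factors[k] is the histogram change between stride k + 1 and k + 2."""
    hists = []
    for stride in range(1, max_stride + 1):
        values = [v for sub in pd_down(image, stride) for v in proxy_noise_map(sub)]
        hists.append(histogram(values))
    return [sum(abs(p - q) for p, q in zip(hists[k], hists[k - 1]))
            for k in range(1, max_stride)]

def select_stride(factors, threshold):
    """Smallest stride after which the estimated distribution stays steady."""
    for k, factor in enumerate(factors):
        if factor < threshold:
            return k + 1
    return len(factors) + 1

rng = random.Random(0)
size = 96
white = [[rng.gauss(0.0, 0.5) for _ in range(size)] for _ in range(size)]
base = [[rng.gauss(0.0, 1.0) for _ in range(size + 1)] for _ in range(size + 1)]
correlated = [[(base[i][j] + base[i][j + 1] + base[i + 1][j] + base[i + 1][j + 1]) / 4.0
               for j in range(size)] for i in range(size)]
stride_white = select_stride(changing_factors(white, 4), threshold=0.15)
stride_correlated = select_stride(changing_factors(correlated, 4), threshold=0.15)
```

In this toy setting the white noise needs no down-sampling (stride 1), while the 2x2-correlated noise settles at stride 2, mirroring the behaviour described above for AWGN and real noise.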
Fig. 4 shows the proposed Pixel-shuffle Down-sampling (PD) refinement strategy: (1) Compute the smallest stride that matches AWGN following the adaptation process, which is 2 in this example and in most CCD image cases, and pixel-shuffle the image into a mosaic; (2) Denoise the mosaic using the basis denoiser; (3) Refill each sub-image with noisy blocks separately and inversely pixel-shuffle them; (4) Denoise each refilled image again using the basis denoiser and average the results to obtain the ‘texture details’; (5) Combine them with the over-smoothed ‘flat regions’ to refine the final result.
The goals of noise removal include preserving texture details and boundaries, smoothing flat regions, and avoiding the generation of artifacts. Therefore, in step (5) above, we propose to further refine the denoised image by combining the ‘texture details’ with the ‘flat regions’. ‘Flat regions’ can be obtained from over-smoothed denoising results generated by lifting the noise estimation levels. In this work, given a noisy observation, refined noise maps are built by lifting the estimated noise levels. Consequently, the ‘flat region’ is the denoising result conditioned on the refined noise maps, and the final result is obtained by blending the ‘flat regions’ with the ‘texture details’.
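One possible reading of the five refinement steps is sketched below. It is an illustration under stated assumptions rather than the released implementation: the trained CNN is replaced by a 3x3 mean filter (`box_denoise`), the over-smoothed ‘flat region’ image is passed in precomputed, and the final blend is assumed to be a convex combination `blend * flat + (1 - blend) * texture`. All function and parameter names are hypothetical.

```python
def pd_down(image, stride):
    """Pixel-shuffle an image (list of rows) into stride * stride sub-images."""
    return [[row[b::stride] for row in image[a::stride]]
            for a in range(stride) for b in range(stride)]

def box_denoise(image):
    """Stand-in denoiser (3x3 mean filter); the paper uses a trained CNN."""
    h, w = len(image), len(image[0])
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            window = [image[y][x]
                      for y in range(max(i - 1, 0), min(i + 2, h))
                      for x in range(max(j - 1, 0), min(j + 2, w))]
            row.append(sum(window) / len(window))
        out.append(row)
    return out

def pd_refine(noisy, stride, denoise, flat, blend):
    h, w = len(noisy), len(noisy[0])
    # Steps (1)-(2): pixel-shuffle into sub-images and denoise each one.
    denoised_subs = [denoise(sub) for sub in pd_down(noisy, stride)]
    texture = [[0.0] * w for _ in range(h)]
    for index, sub in enumerate(denoised_subs):
        a, b = divmod(index, stride)
        # Step (3): keep noisy pixels everywhere except this sub-image's grid.
        refilled = [row[:] for row in noisy]
        for i, sub_row in enumerate(sub):
            for j, value in enumerate(sub_row):
                refilled[a + i * stride][b + j * stride] = value
        # Step (4): denoise the refilled image and accumulate the average.
        second = denoise(refilled)
        for i in range(h):
            for j in range(w):
                texture[i][j] += second[i][j] / len(denoised_subs)
    # Step (5): blend the texture details with the over-smoothed flat regions.
    return [[blend * flat[i][j] + (1.0 - blend) * texture[i][j] for j in range(w)]
            for i in range(h)]

noisy = [[float((3 * i + 7 * j) % 11) for j in range(8)] for i in range(8)]
flat = box_denoise(box_denoise(noisy))   # crude stand-in for the over-smoothed result
refined = pd_refine(noisy, 2, box_denoise, flat, blend=0.5)
```

With `blend` set to 1 the output equals the flat-region image, and with `blend` set to 0 it equals the averaged texture-detail image, which matches the role of a blending factor that trades smoothness against detail.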
In this work, the two sub-networks (the noise estimator and the denoiser) follow the DnCNN structure, with 5 layers and 20 layers. For grayscale image experiments, we also follow DnCNN and crop patches from 400 images. For the color image model, we crop patches with stride 10 from all 500 color images in the Berkeley segmentation dataset (BSD). The training data ratio of single-type noises (either AWGN or RVIN) to mixed noises (AWGN and RVIN) is 1:1. During training, the Adam optimizer is used with a batch size of 128. The learning rate is lowered after 30 epochs, and training stops at epoch 50.
To evaluate the algorithm on synthetic noise (AWGN, mixed AWGN-RVIN, and spatially variant Gaussian), we use the benchmark data from BSD68, Set20, and CBSD68. For realistic noise, we test on RNI15, the DND benchmark, and self-captured night photos. We evaluate the performance of the algorithm in terms of PSNR and SSIM. Qualitative denoising results are also presented, with comparison to other state-of-the-art methods.
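As a reminder of how the first metric is computed, a minimal PSNR sketch is given below. It uses the standard definition with a peak value of 255 for 8-bit images; the function name and the list-of-rows image layout are assumptions, and this is not the benchmark's official evaluation code.

```python
import math

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    squared_error = [(r - e) ** 2
                     for ref_row, est_row in zip(reference, estimate)
                     for r, e in zip(ref_row, est_row)]
    mse = sum(squared_error) / len(squared_error)
    # Identical images have zero error, hence infinite PSNR.
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

reference = [[10.0, 20.0], [30.0, 40.0]]
estimate = [[11.0, 21.0], [31.0, 41.0]]   # every pixel is off by one level
value = psnr(reference, estimate)          # 10 * log10(255**2 / 1)
```

Higher values indicate a closer match to the reference; SSIM additionally accounts for local structure and is not sketched here.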
Table 1: AWGN removal on grayscale BSD68 images, reported for Non-blind (NB) and Blind (B) testing.
We first evaluate the performance of AWGN removal on grayscale images in BSD68. For a fair comparison, the blind DnCNN model is the baseline model trained with the two separate noise types and the mixed one. For the other baseline models, the noise level is assumed to be known before testing. The comparison is shown in Table 1. We present the results of NB (Non-blind) and B (Blind) testing. For non-blind testing, our model is slightly worse than FFDNet because we have a more complicated noise model and a smaller receptive field. For blind denoising, we achieve slightly better results than blind DnCNN. In conclusion, training the model with mixed noises does not greatly hurt the performance of single AWGN removal, and, in contrast to the implicit noise estimation in blind DnCNN, explicit noise estimation benefits the removal of high-level noises.
We evaluate our model on removing mixed AWGN and RVIN on Set20. We also compare our method with other baselines, including BM3D and WNNM, which are non-blind Gaussian denoisers anchored to a specific noise level provided by a separate noise level estimation approach. In addition, we include the PGB denoiser, which is designed for mixed AWGN and RVIN. The result of the blind DnCNN-B, trained with the same strategy as our model, is also presented for reference. The comparison results are shown in Table 2, from which we can see that the proposed method achieves the best performance. Compared to DnCNN-B, our model explicitly disentangles the different noises in complicated mixed noise, which helps the conditional denoiser differentiate mixed noises from other types. In addition, the proposed model can be used to remove different types of noise separately. As shown in Figure 5, after we zero out the unrelated channels, we are able to denoise only AWGN or only RVIN without greatly influencing the other. This conditional denoising can be used to analyze the noise portions and helps to cope with spatially variant and type-variant noise.
We conduct experiments to examine the generalization ability of the fully convolutional model on a generic signal-dependent noise model [29, 9, 18]. Given a clean image, the noise in the noisy observation contains both signal-dependent components and signal-independent components, each with its own variance. Table 3 shows the PSNR comparison. For non-blind models such as BM3D and FFDNet, only a scalar noise estimator is applied, so they cannot cope well with the spatially variant cases. In this experiment, DnCNN-B is the original blind model trained on AWGN with noise levels ranging between 0 and 55. The results show that spatially variant Gaussian noises can be handled by a fully convolutional model trained with spatially invariant AWGN. Compared to DnCNN-B, the proposed network explicitly estimates the pixel-wise map, which makes the model more flexible.
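A common way to instantiate such a signal-dependent noise model is the heteroscedastic Gaussian approximation, in which the per-pixel variance grows linearly with intensity. The sketch below assumes this form, with variance `x * sigma_s**2 + sigma_c**2` on intensities in [0, 1]; the parameter names and values are hypothetical and are not taken from the experiment above.

```python
import random

def add_signal_dependent_noise(image, sigma_s, sigma_c, rng):
    """Add Gaussian noise whose variance is x * sigma_s**2 + sigma_c**2.

    image: list of rows with intensities in [0, 1].
    sigma_s: scale of the signal-dependent component (assumed name).
    sigma_c: scale of the signal-independent component (assumed name).
    """
    return [[x + rng.gauss(0.0, (x * sigma_s ** 2 + sigma_c ** 2) ** 0.5) for x in row]
            for row in image]

def sample_std(values):
    """Population standard deviation of a flat list of numbers."""
    mean = sum(values) / len(values)
    return (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5

rng = random.Random(0)
dark = [[0.0] * 50 for _ in range(40)]     # signal-independent part dominates
bright = [[1.0] * 50 for _ in range(40)]   # signal-dependent part dominates
noisy_dark = add_signal_dependent_noise(dark, 0.2, 0.02, rng)
noisy_bright = add_signal_dependent_noise(bright, 0.2, 0.02, rng)
std_dark = sample_std([v for row in noisy_dark for v in row])
std_bright = sample_std([v for row in noisy_bright for v in row])
```

Because the noise level now varies with image content, a single scalar noise level cannot describe the whole image, which is why a pixel-wise noise map is the more natural conditioning signal in this setting.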
Impulse noises are caused by sudden disturbances in the image signal during image transmission and conversion. Training models with mixed AWGN and RVIN benefits the removal of impulse noises in real images. To verify this, we train another model on AWGN only and test it on our captured noisy night photos. An example is shown in Fig. 6, which demonstrates the benefit of including RVIN in the training data.
We apply different stride numbers while refining the denoised results and compare the visual quality in Fig. 7. For sRGB images directly converted from raw images, as in Fig. 7 (a), a fixed stride is appropriate since the spatial correlation is mostly caused by the demosaicing process. However, for arbitrary sRGB images that may have been resized, as in Fig. 7 (c), the stride number can be computed using our adaptation algorithm with the assistance of the noise estimator. In our experiments, the selected stride is the smallest one for which the changing factor falls below the threshold. A small stride number treats large noise patterns as textures to preserve, as shown in Fig. 7 (d), while a large stride number tends to break the textural structures and details. Interestingly, as shown in Fig. 7 (b), the texture of the fabric becomes invisible at a particular stride setting.
The ablation on the refinement steps is shown in Fig. 8, in which we compare the denoised results of I (directly inverse pixel-shuffling after step (2)), DI (denoising I again with the denoiser), and Full (the current whole pipeline). It shows that both I and DI produce additional visible artifacts, while the whole pipeline smooths out those artifacts and has the best visual quality.
Ambiguity between fine textural details and mid-frequency noises is challenging in real image denoising. We introduce a blending factor as a remedy for the automatic denoising and refinement process. In Fig. 9, as the blending factor increases, the denoised results tend to be over-smoothed, which suits images with more background patterns. A smaller blending factor preserves more fine details, which suits images with more foreground objects.
Qualitative denoising results on the aforementioned real-world datasets are shown in Figs. 10, 11, 12, and 13. The compared methods cover blind real denoisers (CBDNet, NI, and NC), blind Gaussian denoisers (CDnCNN-B), and non-blind Gaussian denoisers (CBM3D, WNNM, and FFDNet). For non-blind methods, we manually select the noise level that gives the best visual quality. The DND results are those reported by the authors. From these examples, we can observe that most results are either noisy (as in DnCNN and WNNM) or over-smoothed in a spatially invariant way (as in FFDNet). CBDNet performs better than the others but still suffers from blurred edges and uncleaned background, as in Fig. 11 (g). Compared with these state-of-the-art methods, our proposed method (PD) achieves better spatially variant denoising by smoothing the background while preserving the textural details in a blind setting.
Since the images in the DND benchmark are all captured by CCD cameras and demosaiced from RAW images, we fix the stride number for this benchmark. We follow the submission guideline of the DND dataset to evaluate our algorithm. From the results in Table 4, we can see that models trained on AWGN (DnCNN, TNRD, MLP) mostly perform poorly on realistic noises, mainly due to the large gap between AWGN and real noise. CBDNet improves the results significantly by training the deep networks with an artificial realistic noise model. Our AWGN-RVIN-trained model with PD refinement achieves better results than CBDNet, and PD also boosts the performance of other AWGN-based methods (+PD).
In this paper, we revisit real image blind denoising from a new viewpoint. We assume that realistic noises are spatially/channel-variant and correlated, and we address the adaptation from AWGN-RVIN noises to real noises. Specifically, we propose a blind and non-blind image denoising network trained on the AWGN-RVIN noise model. The network consists of an explicit multi-type, multi-channel noise estimator and an adaptive conditional denoiser. To generalize the network to real noises, we investigate the Pixel-shuffle Down-sampling (PD) refinement strategy. We show qualitatively that PD behaves better in both spatially variant denoising and detail preservation. Results on the DND benchmark and other realistic noisy images demonstrate that the proposed model with the PD strategy is effective in handling the spatial/channel variance and correlation of real noises without explicit modeling. We also achieve state-of-the-art results on the DND benchmark and boost other denoising algorithms.
Image denoising: Can plain neural networks compete with BM3D? In CVPR, 2012.
Image Processing: Algorithms and Systems, Neural Networks, and Machine Learning, volume 6064, page 606414. International Society for Optics and Photonics, 2006.
Survey of face detection on low-quality images. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 769–773. IEEE, 2018.