Image denoising is a fundamental problem in computer vision and image processing. The task is to recover the latent clean image $x$ given a noisy observation $y$ in the presence of unknown additive noise $n$, namely:
$$y = x + n \qquad (1)$$
If the noise can be effectively estimated, it can be subtracted from the noisy image to produce a denoised result. Therefore, an effective approach to denoising is accurate, per-pixel noise estimation. However, in modern digital cameras this is difficult for the following reasons:
Non-linear ISP operations: Current Image Signal Processing pipelines perform a large number of non-linear operations, including demosaicing and high dynamic range compression. Consequently, noise that can be accurately characterised in the raw domain becomes very complicated at the end of the pipeline. For tractability, much previous work on denoising makes the over-simplifying assumption that the noise is white and Gaussian, and applies a denoiser to the RGB image produced by the ISP. However, recent work  shows the power of building denoising methods in the raw domain, where the modeled noise better matches the physics of image formation.
Types of noise: There are several types of noise present in raw data. Read noise results from thermal and electrical noise in the electronics of the imaging sensor. This stochastic noise can be modelled with a Gaussian distribution. Shot noise is related to the number of photons arriving at the sensor, and is therefore brightness dependent. Shot noise can be effectively modelled with a Poisson distribution. Defective pixel noise arises from different sensitivities of pixels on the sensor to incoming light. This can occur as a consequence of the manufacturing process of the sensor, or due to random failures during service, so these sensitivities can change with time and operating condition. Some pixels may become saturated (maximum brightness) or capture no light (minimum brightness), or take levels in between. A spatially impulsive noise model can represent this type of noise.
Light level (ISO): In digital cameras, ISO is used to adjust the light sensitivity of imaging sensors. As scene lighting decreases, the ISO can be increased to apply an electronic gain to produce a correct exposure. However, amplifying the signal also increases the noise. Particularly in extreme low light scenes (ISO3200), all noise sources described in the previous paragraph are increased, especially defective pixels, which become more apparent in extreme low light. This has a detrimental effect on image quality.
This paper addresses these challenges by proposing a novel image denoising network designed for extreme low light imaging. The method, called Noise Decomposition (NODE), works in the raw domain, where the noise can be more accurately modeled. NODE has a multi-task design that decomposes the noise using separate Gaussian+Poisson and defective pixel noise estimators. We demonstrate that NODE is more effective in extreme low light imaging than single-task state-of-the-art denoisers. An example comparing the denoised result produced using DnCNN to that of NODE is shown in the figure accompanying the title.
2 Related Work
We briefly review traditional approaches to denoising as well as recent deep learning approaches, along with literature focussing on low light denoising. For more detailed coverage of the image denoising literature, we refer the interested reader to [Bertalmío2018Book].
2.1 Traditional methods
Noise is a ubiquitous phenomenon in images, dating back to the first photograph taken in the mid-1820s by Nicéphore Niépce. Early denoising work using digital image processing techniques includes lowpass filters implemented in the spatial domain (box filters, Gaussian filters) or transform (Fourier, DCT) domain. While effective at removing high frequency noise, these methods also tend to blur the image. Subsequent transform domain work such as wavelet shrinkage better preserves detail by exploiting sparsity in the wavelet domain. Edge-preserving denoising was also approached using anisotropic diffusion and bilateral filtering. More recent approaches exploit the self-similarity often inherent in an image. Such work includes non-local means and BM3D, the latter being a block-matching technique that aggregates similar patches and filters them collaboratively in the wavelet domain.
2.2 Deep learning approaches
Closely related to this paper, deep neural networks have been proposed to address the image denoising task. Jain and Seung design a multi-layer convolutional neural network for image denoising. Burger et al. solve the denoising problem by training a multi-layer perceptron (MLP), and show that an MLP can achieve performance comparable to BM3D. Xie et al. combine sparse coding with a deep neural network pre-trained as a denoising auto-encoder to solve denoising and inpainting problems. The architecture proposed by Zhang et al. learns the residual noise given a noisy image by combining convolution, batch normalization and rectified linear units (ReLU) [28]. Their network, DnCNN, achieves better quantitative results, i.e. peak signal-to-noise ratio (PSNR), than the conventional state-of-the-art approaches [14, 39, 11].
Ronneberger et al. propose an encoding-decoding framework for segmentation of biomedical images. This paper utilizes skip connections, which have been applied to train very deep networks in many other applications [18, 10, 20, 34, 9]. Mao et al. propose an encoder-decoder style network (RED) for image restoration. This method demonstrates that, with nested skip connections, the training process converges more easily and quickly.
Motivated by persistence in human thought, Tai et al. introduced a deep persistent memory network called MemNet for image restoration. In this network, the memory block contains a recursive unit and a gate unit to learn multi-level representations with different receptive fields through an adaptive learning process. The paper shows that by combining the short-term memory from the recursive unit and the long-term memory from the memory block, the method can resolve long-term dependencies.
2.3 Low light denoising
Low light imaging is a challenging task that has received some attention in the literature. Guo et al. present a Retinex-based method to adjust image brightness and allow for denoising of low light images. Chatterjee et al. present a joint denoising and demosaicking method applied to low-light images, using vector upsampling based on Local Linear Embedding. While effective, residual noise is still present in the processed images. Li et al. explore a dark channel prior by inverting the image's brightness and applying BM3D-style denoising to superpixels. Remez et al. adaptively employ a deep neural network for Poisson noise removal on low light images.
All of the methods described above operate in medium to low light. In this paper, however, we address extreme low light (ISO3200), where denoising becomes more difficult. Chen et al. propose a UNet-style network that learns the entire ISP pipeline for extreme low light environments. In contrast, the proposed method focusses solely on the denoising task in the raw domain.
While it is possible to build a single network to denoise an extreme low light image, the different noise sources confuse the denoiser, resulting in images that typically have undesirable residual noise. Critically, none of the prior work listed above specifically addresses defective pixel noise, which becomes more acute in extreme low light scenes.
The key idea of this paper is to decompose the noise into Gaussian + Poisson noise and defective pixel noise, using dedicated sub-networks trained in a multi-task setting. One task estimates the Gaussian + Poisson noise, and the other task estimates the defective pixel noise. As these noise types are fundamentally different, each task can focus on a particular type of noise, producing a better result than training a single-task denoising network.
Each sub-network is pre-trained with synthesized data. One sub-network is trained to estimate Gaussian+Poisson noise, whereas the other is trained to estimate defective pixel noise. These two noise estimates are then provided to a denoiser, as a concatenated input with the noisy image. The entire network is then fine-tuned on real images. In essence, NODE estimates the noise in an image-adaptive way, and then denoises the image. The main contributions of the proposed method are:
Multi-task noise estimation: Our neural network simultaneously estimates Gaussian+Poisson noise and defective pixel noise using two separate sub-networks. This way, different parts of the network focus on specific types of noise. To our knowledge, defective pixel noise removal using a deep neural network has not been done before.
End-to-end training: The network is trained end-to-end using noisy and clean image pairs. NODE can produce an optimal solution to the problem of denoising of images corrupted with Gaussian + Poisson and defective pixel noise.
Sub-network design: We design variants of UNet  by replacing some maxpooling and deconvolution operations with shuffle operations (space to depth and depth to space). While simple, these modifications help particularly with impulsive noise resulting from defective pixels.
Extreme low light denoising: The proposed method is designed to work in extreme low light scenarios and is shown to be more effective than conventional denoising techniques.
3 NODE: Noise Decomposition Network
In this section, we present details of the decomposition network and explain how it works with the sub-networks.
3.1 The Overall Architecture
The overall framework for NODE is presented in Figure 1. First, the method takes the raw noisy image from the Bayer pattern (consisting of R, G, G, and B pixels) and packs it into four channels for subsequent processing. These channels are at half the width and height of the original image. Packing is necessary to group same-color pixels together for subsequent convolutional layers.
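The packing step can be sketched in NumPy as follows. This is a minimal illustration and not the authors' code: it assumes the RGGB layout begins at the top-left pixel, and the function names `pack_bayer_rggb` and `unpack_bayer_rggb` are hypothetical.

```python
import numpy as np

def pack_bayer_rggb(raw):
    """Pack an (H, W) RGGB Bayer mosaic into an (H/2, W/2, 4) array.

    Assumes the 2x2 tile is [[R, G], [G, B]] starting at pixel (0, 0).
    """
    if raw.ndim != 2 or raw.shape[0] % 2 or raw.shape[1] % 2:
        raise ValueError("expected a 2-D mosaic with even height and width")
    r = raw[0::2, 0::2]   # red: even rows, even columns
    g1 = raw[0::2, 1::2]  # first green: even rows, odd columns
    g2 = raw[1::2, 0::2]  # second green: odd rows, even columns
    b = raw[1::2, 1::2]   # blue: odd rows, odd columns
    return np.stack([r, g1, g2, b], axis=-1)

def unpack_bayer_rggb(packed):
    """Inverse of pack_bayer_rggb: (H/2, W/2, 4) back to an (H, W) mosaic."""
    h, w, _ = packed.shape
    raw = np.empty((2 * h, 2 * w), dtype=packed.dtype)
    raw[0::2, 0::2] = packed[..., 0]
    raw[0::2, 1::2] = packed[..., 1]
    raw[1::2, 0::2] = packed[..., 2]
    raw[1::2, 1::2] = packed[..., 3]
    return raw

mosaic = np.arange(16, dtype=np.float32).reshape(4, 4)
packed = pack_bayer_rggb(mosaic)  # shape (2, 2, 4)
```

Each output channel holds pixels of a single color site, so a convolution over the packed array never mixes, for example, red and blue samples within one channel.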
The packed, noisy image is then input into two subnetworks: a Gaussian + Poisson noise estimation sub-network, and a defective pixel estimation sub-network. This multi-task architecture is designed to separately decompose the noise, allowing each subnetwork to focus on a different task as the noise types are very different. Gaussian+Poisson noise corrupts every pixel in the image, whereas defective pixel noise is spatially sparse, affecting only certain pixels (which may vary over time). By decomposing the noise into these two different streams, NODE can achieve better results than a single-task denoising network as our experimental results will show.
The upper branch in the NODE architecture estimates a packed form of Gaussian+Poisson noise, essentially the predicted noise at each pixel, but packed into four channels (corresponding to R, G, G, and B). Similarly, the lower branch in the NODE architecture estimates a packed form of the defective pixel noise. These eight packed estimated noise channels serve as an initial estimate of the noise in the raw image. While it would be possible to directly subtract these noise sources from the raw image to produce a denoised output, NODE instead concatenates the estimated noise with the noisy raw image and feeds a 12 channel input to a denoising subnetwork. This final network refines the noise estimate to produce the final denoised image.
Pre-training is critical to the success of the Gaussian+Poisson and the defective pixel estimation subnetworks. Each subnetwork is pre-trained using synthetic data of a specific type. For example, in the Gaussian+Poisson subnetwork, we start with clean raw images, and degrade them by adding Gaussian+Poisson noise. We then train the subnetwork to learn how to predict a clean image given a noisy image degraded by Gaussian+Poisson noise. A similar process is performed for the defective pixel subnetwork.
Let the noisy input image be denoted as $y$, the per-pixel Gaussian+Poisson noise as $n_{gp}$, the per-pixel defective pixel noise as $n_{dp}$, and the clean image as $x$. In this case, we can rewrite Equation 1 as
$$y = x + n_{gp} + n_{dp} \qquad (2)$$
The upper subnetwork is trained to remove Gaussian+Poisson noise, so given a noisy image $y$ it regresses an image $\hat{x} + \hat{n}_{dp}$, where $\hat{x}$ and $\hat{n}_{dp}$ are the estimated clean image and the estimated defective pixel noise. This result is subtracted from the original noisy input to produce the estimated per-pixel Gaussian+Poisson noise, i.e.
$$\hat{n}_{gp} = y - (\hat{x} + \hat{n}_{dp}) \qquad (3)$$
This is the output of the upper subnetwork and is represented by the four channels in Figure 1. In a similar fashion, the lower defective pixel estimation subnetwork estimates the per-pixel defective pixel noise, $\hat{n}_{dp}$.
With the pre-trained subnetworks in place, the entire NODE architecture is fine-tuned end-to-end on real images. This process adapts the weights learned from synthetic noise to the noise of real images. In this way, the method first adaptively estimates the noise at each pixel, and then applies a denoising network given the estimated noise.
At inference, the input is a real noisy image, containing Gaussian + Poisson and defective pixel noise. The noise is estimated, and concatenated with the original image for subsequent refinement by the denoising network, which produces the final denoised image.
Note that although it would be possible to directly subtract estimated Gaussian + Poisson and defective pixel noises from input images, NODE instead concatenates the estimated noise with the real image to refine the estimated noise. This design can be thought of as an image-adaptive noise estimation, followed by a refinement denoising operation.
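The data flow described above can be sketched as follows. This is only an illustration of the wiring, under the assumption that each estimation sub-network returns the input image with its own noise type removed, so that the noise estimate is the residual. The three callables are hypothetical stand-ins for trained networks, not the actual NODE models.

```python
import numpy as np

def node_forward(noisy, gp_subnet, dp_subnet, denoiser):
    """Wire up the two noise estimators and the refinement denoiser.

    noisy:     (H, W, 4) packed raw image.
    gp_subnet: callable returning the image with Gaussian+Poisson noise removed.
    dp_subnet: callable returning the image with defective pixel noise removed.
    denoiser:  callable mapping the (H, W, 12) stack to an (H, W, 4) image.
    """
    # Residual of each branch is its per-pixel noise estimate (4 channels each).
    gp_noise = noisy - gp_subnet(noisy)
    dp_noise = noisy - dp_subnet(noisy)
    # 4 image channels + 4 + 4 noise channels = 12-channel denoiser input.
    stacked = np.concatenate([noisy, gp_noise, dp_noise], axis=-1)
    return denoiser(stacked), stacked

# Toy stand-ins (assumptions for illustration only).
toy_gp = lambda y: 0.9 * y                 # pretends 10% of the signal is noise
toy_dp = lambda y: np.minimum(y, 1.0)      # pretends values above 1.0 are defects
toy_denoiser = lambda s: s[..., :4] - s[..., 4:8] - s[..., 8:12]

noisy = np.random.default_rng(0).random((8, 8, 4)) * 2.0
denoised, stacked = node_forward(noisy, toy_gp, toy_dp, toy_denoiser)
```

The toy denoiser simply subtracts both estimates, which is the direct-subtraction baseline the text mentions; in NODE a trained network takes its place and refines the estimate.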
3.2 The sub-networks
Many neural network designs could be used for the sub-networks described above. In our experiments, we use an encoder/decoder network design inspired by UNet. Our subnetwork architecture is shown in Figure 2.
On the encoding path (starting from the upper left in Figure 2), a series of convolutional layers with leaky ReLU extract features at high resolution, which is important in the raw domain as noise varies from pixel to pixel. We include a bottleneck layer for computational efficiency. In practice, we find that these convolutional layers help preserve high frequency detail.
Next, the resolution of the image data is progressively reduced using a shuffle layer (red arrows in Figure 2). This layer reshapes the data so that the spatial resolution is decreased by a factor of two in width and height, while creating four times as many channels (space to depth). Consequently, this shuffle layer makes the image size smaller while retaining important perceptual information. Symmetrically, deshuffle layers (light green arrows) are used for the decoding process.
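The shuffle (space to depth) and deshuffle (depth to space) operations can be sketched in NumPy as below. This is a minimal sketch for a single channels-last image; the function names are illustrative, and the channel ordering chosen here is an assumption that may differ from the ordering used by a given deep learning framework.

```python
import numpy as np

def space_to_depth(x, block=2):
    """Rearrange (H, W, C) into (H/block, W/block, block*block*C)."""
    h, w, c = x.shape
    if h % block or w % block:
        raise ValueError("height and width must be divisible by block")
    x = x.reshape(h // block, block, w // block, block, c)
    x = x.transpose(0, 2, 1, 3, 4)  # -> (H/b, W/b, b, b, C)
    return x.reshape(h // block, w // block, block * block * c)

def depth_to_space(x, block=2):
    """Inverse of space_to_depth: (H, W, b*b*C) into (H*b, W*b, C)."""
    h, w, c = x.shape
    if c % (block * block):
        raise ValueError("channels must be divisible by block * block")
    c_out = c // (block * block)
    x = x.reshape(h, w, block, block, c_out)
    x = x.transpose(0, 2, 1, 3, 4)  # -> (H, b, W, b, C)
    return x.reshape(h * block, w * block, c_out)

features = np.arange(4 * 6 * 3).reshape(4, 6, 3)
shuffled = space_to_depth(features)   # shape (2, 3, 12)
restored = depth_to_space(shuffled)   # shape (4, 6, 3), equal to features
```

Unlike max pooling, this rearrangement is lossless: every input value survives in some output channel, so a single bright defective pixel is moved rather than merged with its neighbours.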
In subsequent processing, the resolution of the image data is progressively reduced using max pooling (golden arrow), applied between each layer to allow more efficient processing. On the decoding side, transposed convolution (up-conv, yellow arrow) is used for upsampling. Skip connections feed through between corresponding layers, an effective way to achieve good trainability and restore high frequency details.
3.3 The Denoising Network
A very similar architecture to the sub-network is used for the denoising network. The only differences are that there are no extra convolutional layers at the beginning of the encoding path, and that only one pair of shuffle and deshuffle layers is used, at the highest resolution. Please see the supplementary material for a figure showing the denoiser network architecture.
4 Implementation Details
For this paper, we collected a new dataset using a Huawei P20 cellphone at ISO 12800. The data is captured in a lossless, raw format using an RGGB Bayer pattern color filter array. At each pixel there is only a red, green, or blue color.
4.1 Synthetic Images
An important part of this work is noise synthesis to pre-train the sub-networks. For this, we first fit Gaussian+Poisson and defective pixel noise models to real data captured by the device, using a sequence of 12 images captured in a low light, static scene with a static camera. We average the 12 frames, producing a mean image, which serves as a noise-free estimate $\bar{x}$, and a variance image $\sigma^2$ computed for each pixel across the sequence. The well known [16, 26] noise model $\sigma^2 = \sigma_r^2 + \sigma_s \bar{x}$, where $\sigma_r$ is the read noise and $\sigma_s$ is the shot noise, is fit to the noise variance as a function of intensity using RANSAC to robustly handle outliers in the data. Once fit, we can characterise noise using the Gaussian + Poisson noise model. Any pixels that exhibit noise inconsistent with this Gaussian + Poisson noise model are considered defective pixel noise. For this, we consider all pixel intensities outside of the 99% confidence interval of the Gaussian + Poisson noise distribution to be defective pixels. Once the noise models are computed, we can then synthesize realistic noise for the device. An example is shown in Figure 3. In practice, we use 187 noisy sequences of 12 frames at high resolution (2736 × 3648) to generate the noise model and 145 images at the same resolution to generate two synthetic datasets, which contain Gaussian + Poisson noise and defective pixel noise respectively.
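The fitting procedure can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it assumes a linear variance-versus-intensity model, uses a basic two-point RANSAC loop with a hand-picked inlier tolerance `tol`, and approximates the 99% interval with the two-sided Gaussian value z = 2.576. All function and parameter names are hypothetical.

```python
import numpy as np

def burst_statistics(burst):
    """Per-pixel mean and unbiased variance of an (N, H, W) static burst."""
    return burst.mean(axis=0), burst.var(axis=0, ddof=1)

def fit_noise_model(mean, var, n_iter=500, tol=0.5, seed=0):
    """Robustly fit var ~= shot * mean + read with a basic RANSAC loop."""
    rng = np.random.default_rng(seed)
    x, y = mean.ravel(), var.ravel()
    best = np.zeros(x.size, dtype=bool)
    for _ in range(n_iter):
        i, j = rng.choice(x.size, size=2, replace=False)
        if x[i] == x[j]:
            continue  # a vertical line cannot be a valid model
        slope = (y[i] - y[j]) / (x[i] - x[j])
        offset = y[i] - slope * x[i]
        inliers = np.abs(y - (slope * x + offset)) < tol
        if inliers.sum() > best.sum():
            best = inliers
    if best.sum() < 2:
        best[:] = True  # fall back to an ordinary least-squares fit
    shot, read = np.polyfit(x[best], y[best], deg=1)  # refit on the inliers
    return shot, read

def defective_mask(frame, clean, shot, read, z=2.576):
    """Flag pixels whose deviation from the clean estimate exceeds z sigma."""
    sigma = np.sqrt(np.maximum(shot * clean + read, 0.0))
    return np.abs(frame - clean) > z * sigma
```

A typical use would compute `mean, var = burst_statistics(burst)`, fit `shot, read = fit_noise_model(mean, var)`, and then call `defective_mask` on each frame against the mean image. The tolerance has to be tuned to the scale of the variance values.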
Table 1. Quantitative results.

| Metric | BM3D | DnCNN | Unet | MemNet | RED | Ours |
|---|---|---|---|---|---|---|
| PSNR (higher is better) | 38.93 | 40.25 | 40.37 | 40.04 | 40.93 | 41.10 |
| SSIM (higher is better) | 0.9452 | 0.9770 | 0.9755 | 0.9763 | 0.9784 | 0.9789 |
| PSNR (MASK) (higher is better) | 39.01 | 40.34 | 40.48 | 40.11 | 41.04 | 41.55 |
| SSIM (MASK) (higher is better) | 0.9487 | 0.9765 | 0.9752 | 0.9760 | 0.9780 | 0.9796 |
| PI (lower is better) | 6.5607 | 6.4801 | 6.5367 | 6.2536 | 6.4676 | 6.1065 |
We use the noise model to synthesize two different datasets, i.e. one dataset containing only Gaussian + Poisson noise and the other containing only defective pixel noise. These two datasets are then used to pre-train the respective sub-networks. After pre-training, we put the networks together into the full architecture of Figure 1 and fine-tune, training end-to-end on real data.
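Generating the two synthetic datasets could look like the sketch below. It rests on several assumptions that are not taken from the paper: the Gaussian + Poisson noise is approximated by a signal-dependent Gaussian with variance `shot * clean + read`, defective pixels are drawn independently with a hypothetical probability `fraction`, and each defective pixel is replaced by a uniformly random level between black and `white_level`.

```python
import numpy as np

def add_gaussian_poisson(clean, shot, read, rng):
    """Signal-dependent Gaussian approximation of read + shot noise."""
    sigma = np.sqrt(np.maximum(shot * clean + read, 0.0))
    return clean + rng.normal(size=clean.shape) * sigma

def add_defective_pixels(clean, fraction, white_level, rng):
    """Impulsive noise: replace a random subset of pixels by random levels."""
    noisy = np.array(clean, dtype=np.float64)  # float copy of the input
    mask = rng.random(clean.shape) < fraction
    noisy[mask] = rng.uniform(0.0, white_level, size=int(mask.sum()))
    return noisy, mask

rng = np.random.default_rng(0)
clean = np.full((64, 64), 100.0)
gp_noisy = add_gaussian_poisson(clean, shot=0.5, read=4.0, rng=rng)
dp_noisy, dp_mask = add_defective_pixels(clean, fraction=0.01,
                                         white_level=1023.0, rng=rng)
```

Keeping the two corruptions in separate functions mirrors the two pre-training datasets: `gp_noisy` would feed the Gaussian + Poisson sub-network and `dp_noisy` the defective pixel sub-network.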
5.1 Experimental Setting
For training the overall architecture, we collected 123 short/long exposure pairs at high resolution (2736 × 3648) using five Huawei P20 cellphones. The data was randomly split into training (90%) and independent testing (10%) sets, with the phones used in testing different from those in training. The training data was augmented by flipping right/left, top/bottom, or both. All the images are captured in an extreme low light environment at ISO 12800. The network is trained using the Adam optimizer with . We set the patch size , batch size and the learning rate . We use as the loss function.
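The flip augmentation mentioned above can be sketched as follows. It is an illustration under one assumption: the flips are applied to the packed four-channel arrays, which keeps each channel tied to the same color site. Flipping the unpacked mosaic directly would change the Bayer phase, for example turning RGGB into GRBG for a left/right flip. The function name and the 50% flip probabilities are hypothetical.

```python
import numpy as np

def random_flip_pair(noisy, clean, rng):
    """Apply the same random left/right and/or top/bottom flip to a pair.

    Both inputs are (H, W, C) arrays; the channel axis is never flipped.
    """
    if rng.random() < 0.5:   # left/right flip
        noisy, clean = noisy[:, ::-1], clean[:, ::-1]
    if rng.random() < 0.5:   # top/bottom flip
        noisy, clean = noisy[::-1], clean[::-1]
    return noisy, clean

rng = np.random.default_rng(0)
noisy = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)
clean = noisy + 1.0
aug_noisy, aug_clean = random_flip_pair(noisy, clean, rng)
```

Because the same flips are applied to both images, the noisy and clean patches remain pixel-aligned after augmentation.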
The 10 MP images used in our study are real images taken from a phone, and our method can process them using a standard NVIDIA GTX 1080Ti GPU. However, MemNet and RED were not able to process the full resolution images. Therefore, we cropped the test images to a size for inference for fair comparison. Cropping was performed in the center of the image where there was the most salient content. We compare to state-of-the-art denoisers including BM3D, DnCNN, MemNet, RED and Unet. Through a grid search, we set for BM3D as it returned the best PSNR and SSIM values. Aside from BM3D, all methods are implemented using TensorFlow and trained using the same dataset described above for the purpose of fair comparison. The settings of these competing methods are taken from their respective papers.
5.2 Performance Evaluation
We evaluate the different methods on raw images, before automatic white balancing or demosaicing. Because the raw image is captured in extreme low light, it is very dark. For visualisation purposes, we post-process the image by demosaicing using bicubic interpolation and brighten by scaling the image.
For quantitative evaluation, we include PSNR, SSIM and the Perceptual Index (PI). PSNR and SSIM are well-established measures that assess image quality by comparing the estimated denoised image to a reference image. For each, a higher number is desired, representing a better match between the estimated denoised image and the reference image. We augment these quantitative results with the recently proposed perceptual index to better assess the visual quality as perceived by human observers. This no-reference measure was used in the PIRM-SR Challenge. It is a linear combination of NIQE and the score of Ma et al., and does not require a reference, i.e. $\mathrm{PI} = \frac{1}{2}\left((10 - \mathrm{Ma}) + \mathrm{NIQE}\right)$. For the perceptual index, a lower score indicates better quality.
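As a small worked example of these metrics, the sketch below computes PSNR for a given peak value and combines two no-reference scores into the perceptual index as defined by the PIRM challenge. Computing the NIQE and Ma scores themselves is outside the scope of this sketch, so they are passed in as plain numbers; the peak of 1023 is an assumption based on 10 bit raw data.

```python
import numpy as np

def psnr(estimate, reference, peak=1023.0):
    """Peak signal-to-noise ratio in dB; higher is better."""
    diff = np.asarray(estimate, dtype=np.float64) - np.asarray(reference, dtype=np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

def perceptual_index(ma_score, niqe_score):
    """PIRM perceptual index from precomputed scores; lower is better."""
    return 0.5 * ((10.0 - ma_score) + niqe_score)

reference = np.zeros((4, 4))
estimate = np.ones((4, 4))           # every pixel is off by one level
value_db = psnr(estimate, reference)  # equals 20 * log10(1023)
```

With 10 bit data a uniform error of one level already gives about 60 dB, which illustrates why the choice of peak value strongly affects the absolute PSNR numbers.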
The quantitative results are shown in Table 1. Note that the high PSNR/SSIM values are related to the fact that the images are captured under extreme low light, so the pixel intensities are generally small relative to the large available range (10 bit values). Our long exposure images serving as ground truth also had some residual defective pixel noise, visible in Figure 3 (left). We observed that NODE is effective at removing defective pixels, but PSNR/SSIM will penalise NODE on defective pixels in such cases. We therefore mask out the defective pixels in the ground truth (but not in the noisy image) using the method described in Section 4.1. These evaluation results are marked as MASK in Table 1. On all metrics, the proposed method outperforms the state-of-the-art. Representative qualitative results are shown in Figure 4 and Figure 5. From the qualitative results, it is apparent that the proposed method better handles the noise caused by defective pixels. BM3D in particular struggles with defective pixel correction, as it relies on self-similarity, which is less relevant in the presence of defective pixels. Whilst the deep learning networks do better with the defective pixel noise, there is residual noise contamination that is best removed by NODE.
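The masked evaluation can be illustrated with the sketch below, which excludes pixels flagged as defective in the ground truth before computing PSNR. It is a hedged illustration, not the authors' evaluation code: the boolean mask is assumed to come from a separate defective pixel detection step, and the function name and default peak of 1023 (10 bit data) are assumptions.

```python
import numpy as np

def masked_psnr(estimate, reference, defective, peak=1023.0):
    """PSNR in dB over pixels that are not flagged in `defective`."""
    valid = ~np.asarray(defective, dtype=bool)
    if not valid.any():
        raise ValueError("mask excludes every pixel")
    diff = np.asarray(estimate, dtype=np.float64) - np.asarray(reference, dtype=np.float64)
    mse = np.mean(diff[valid] ** 2)
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

reference = np.zeros((4, 4))
reference[0, 0] = 500.0                 # defective pixel left in the ground truth
estimate = np.ones((4, 4))              # denoiser removed the defect
mask = np.zeros((4, 4), dtype=bool)
mask[0, 0] = True
score = masked_psnr(estimate, reference, mask)
```

In this toy case the unmasked error would be dominated by the single defective ground-truth pixel, even though the estimate is arguably the better image there; masking removes that penalty.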
Limitations. Although NODE demonstrates considerable strength in denoising extreme low light raw images, there are limitations to this research. First, the proposed method was developed using raw data collected from a single phone model. In a practical setting, this could be a feasible approach for producing a targeted denoising method. However, this paper does not consider generalisation to other phone models, which is left for future work. Also, the data was collected at a single ISO setting (12800). However, recent work has shown it is straightforward to include additional input channels to give the denoising network knowledge of the expected noise level.
Multi-task noise decomposition proves to be a promising approach for the task of denoising extreme low light raw images. By letting each subnetwork focus on noise of a particular type, better results can be obtained compared to single-task denoising networks. Future improvements can address adaptation of the method to additional sensors, ISOs, and data types including video.
References
- (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283.
- (2018) The 2018 PIRM challenge on perceptual image super-resolution. In European Conference on Computer Vision, pp. 334–355.
- (2018) Unprocessing images for learned raw denoising. arXiv preprint arXiv:1811.11127.
- (2005) A non-local algorithm for image denoising. In CVPR.
- (2012) Image denoising: can plain neural networks compete with BM3D?. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2392–2399.
- (1996) Digital image processing. Prentice Hall.
- (2011) Noise suppression in low-light images through joint denoising and demosaicing. In CVPR 2011, pp. 321–328.
- (2018) Learning to see in the dark. In CVPR.
- (2017) Photographic image synthesis with cascaded refinement networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1511–1520.
- (2016) 3D U-Net: learning dense volumetric segmentation from sparse annotation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 424–432.
- (2007) Image denoising by sparse 3D transform-domain collaborative filtering. In IEEE Transactions on Image Processing.
- (1995) Denoising via soft thresholding. IEEE Transactions on Information Theory 41 (3).
- (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (6), pp. 381–395.
- (2014) Weighted nuclear norm minimization with application to image denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2862–2869.
- (2017) LIME: low-light image enhancement via illumination map estimation. IEEE Transactions on Image Processing 26 (2).
- (1994) Radiometric CCD camera calibration and noise estimation. TPAMI 16 (3).
- (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
- (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134.
- (2009) Natural image denoising with convolutional networks. In Advances in Neural Information Processing Systems, pp. 769–776.
- (2017) Deep convolutional neural network for inverse problems in imaging. IEEE Transactions on Image Processing 26 (9), pp. 4509–4522.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- (2015) A low-light image enhancement method for both denoising and contrast enlarging. In 2015 IEEE International Conference on Image Processing (ICIP), pp. 3730–3734.
- (2017) Learning a no-reference quality metric for single-image super-resolution. Computer Vision and Image Understanding 158, pp. 1–16.
- (2013) Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, Vol. 30, pp. 3.
- (2016) Image restoration using very deep convolutional encoder-decoder networks with symmetric skip connections. In Advances in Neural Information Processing Systems, pp. 2802–2810.
- (2018) Burst denoising with kernel prediction networks. In CVPR.
- (2013) Making a “completely blind” image quality analyzer. IEEE Signal Processing Letters 20 (3), pp. 209–212.
- (2010) Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814.
- (1990) Scale-space and edge detection using anisotropic diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence 12 (7).
- (2017) Deep convolutional denoising of low-light images. arXiv preprint arXiv:1701.01687.
- (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
- (2017) MemNet: a persistent memory network for image restoration. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4539–4547.
- (1998) Bilateral filtering for gray and color images. In ICCV.
- (2018) High-resolution image synthesis and semantic manipulation with conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8798–8807.
- (2018) ESRGAN: enhanced super-resolution generative adversarial networks. In European Conference on Computer Vision, pp. 63–79.
- (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612.
- (2012) Image denoising and inpainting with deep neural networks. In Advances in Neural Information Processing Systems, pp. 341–349.
- (2017) Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing 26 (7), pp. 3142–3155.
- (2011) From learning models of natural image patches to whole image restoration. In 2011 International Conference on Computer Vision, pp. 479–486.