Image noise modeling, estimation, and reduction is an important and active research area (e.g., [Foi2009ClippedDenoising, Hwang2012Difference-basedDistribution, Seybold2013TowardsNoise, Trussell2012TheCameras]) with a long-standing history in computer vision (e.g., [Healey1994RadiometricEstimation, Kuan1985AdaptiveNoise, Liu2008AutomaticImage, Naderi1978EstimationNoise]). A primary goal of such efforts is to remove or correct for noise in an image, either for aesthetic purposes or to help improve other downstream tasks. Towards this end, accurately modeling noise distributions is a critical step.
Existing noise models are not sufficient to represent the complexity of real noise [Abdelhamed2018ACameras, Plotz2017BenchmarkingPhotographs]. For example, a univariate homoscedastic Gaussian model does not represent the fact that photon noise is signal-dependent; that is, the variance of the noise is proportional to the magnitude of the signal. In turn, the signal-dependent heteroscedastic model [Foi2009ClippedDenoising, Foi2015PracticalRaw-data, Makitalo2013OptimalNoise], often referred to as the noise level function (NLF), does not represent the spatial non-uniformity of noise power (e.g., fixed-pattern noise) or other sources of noise and non-linearities, such as amplification noise and quantization [Holst1998CCDDISPLAYS]. See Figure 2. In spite of their well-known limitations, these models are still the most commonly used. More complex models, such as a Poisson mixture [Jin2013ApproximationsNoise, Zhang2017ImprovedNoise], exist, but still do not capture the complex noise sources mentioned earlier.
Contribution We introduce Noise Flow, a new noise model that combines the insights of parametric noise models and the expressiveness of powerful generative models. Specifically, we leverage recent normalizing flow architectures [Kingma2018Glow:Convolutions] to accurately model noise distributions observed from large datasets of real noisy images. In particular, based on the recent Glow architecture [Kingma2018Glow:Convolutions], we construct a normalizing flow model which is conditioned on critical variables, such as intensity, camera type, and gain settings (i.e., ISO). The model can be shown to be a strict generalization of the camera NLF but with the ability to capture significantly more complex behaviour. The result is a single model that is compact (fewer than 2500 parameters) and considerably more accurate than existing models. See Figure 1. We explore different aspects of the model through a set of ablation studies. To demonstrate the effectiveness of Noise Flow, we consider the application of denoising and use Noise Flow to synthesize training data for a denoising CNN, resulting in significant improvements in PSNR. Code and pre-trained models for Noise Flow are available at: https://github.com/BorealisAI/noise_flow.
2 Background and Related Work
Image noise is an undesirable by-product of any imaging system. Image noise can be described as deviations of the measurements from the actual signal and results from a number of causes, including physical phenomena, such as photon noise, or the electronic characteristics of the imaging sensors, such as fixed pattern noise.
Given an observed image $\tilde{\mathbf{I}}$ and its underlying noise-free image $\mathbf{I}$, their relationship can be written as

$$\tilde{\mathbf{I}} = \mathbf{I} + \mathbf{n}, \tag{1}$$

where $\mathbf{n}$ is the noise corrupting $\mathbf{I}$. Our focus in this work is to model $\mathbf{n}$.
Several noise models have been proposed in the literature. The simplest and most common noise model is the homoscedastic Gaussian assumption, also known as additive white Gaussian noise (AWGN). Under this assumption, the distribution of noise in an image is a Gaussian distribution with independent and identically distributed values:

$$n_i \sim \mathcal{N}(0, \sigma^2), \tag{2}$$

where $n_i$ is the noise value at pixel $i$ and follows a normal distribution with zero mean and variance $\sigma^2$.
Despite its prevalence, the Gaussian model does not represent the fact that photon noise is signal-dependent. To account for the signal dependency of noise, a Poisson distribution $\mathcal{P}$ is used instead:

$$n_i \sim \alpha \mathcal{P}(I_i) - I_i, \tag{3}$$

where $I_i$, the underlying noise-free signal at pixel $i$, is both the mean and variance of the noise, and $\alpha$ is a sensor-specific scaling factor of the signal.
Neither the Gaussian nor the Poisson model alone can accurately describe image noise. That is because image noise consists of both signal-dependent and signal-independent components. To address this limitation, a Poisson-Gaussian model has been adopted [Foi2009ClippedDenoising, Foi2015PracticalRaw-data, Makitalo2013OptimalNoise], where the noise is a combination of a signal-dependent Poisson distribution and a signal-independent Gaussian distribution:

$$n_i \sim \alpha \mathcal{P}(I_i) - I_i + \mathcal{N}(0, \delta^2). \tag{4}$$

A more widely accepted alternative to the Poisson-Gaussian model is to replace the Poisson component by a Gaussian distribution whose variance is signal-dependent [Liu2014PracticalImage, Mohsen1975NoiseDevices], which is referred to as the heteroscedastic Gaussian model:

$$n_i \sim \mathcal{N}(0, \alpha^2 I_i + \delta^2). \tag{5}$$

The heteroscedastic Gaussian model is more commonly referred to as the noise level function (NLF) and describes the relationship between image intensity and noise variance:

$$\mathrm{var}(n_i) = \beta_1 I_i + \beta_2, \quad \beta_1 = \alpha^2, \quad \beta_2 = \delta^2. \tag{6}$$
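To make the difference between the homoscedastic and heteroscedastic Gaussian models concrete, the following is a minimal NumPy sketch that samples noise under each. It assumes the NLF variance has the linear form `beta1 * intensity + beta2`; the function names, the two-tone test image, and the parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def sample_awgn(clean, sigma, rng):
    """Homoscedastic Gaussian noise: the same variance at every pixel."""
    return rng.normal(0.0, sigma, size=clean.shape)

def sample_nlf(clean, beta1, beta2, rng):
    """Heteroscedastic Gaussian noise: variance = beta1 * intensity + beta2."""
    std = np.sqrt(beta1 * clean + beta2)
    return rng.normal(0.0, 1.0, size=clean.shape) * std

rng = np.random.default_rng(0)
# Hypothetical clean image: a dark top half (0.1) and a bright bottom half (0.9).
clean = np.concatenate([np.full((64, 128), 0.1), np.full((64, 128), 0.9)])
noise = sample_nlf(clean, beta1=1e-2, beta2=1e-4, rng=rng)
# Under the NLF, the empirical variance should grow with intensity.
var_dark, var_bright = noise[:64].var(), noise[64:].var()
```

With these illustrative parameters the bright region should show a noticeably larger empirical variance than the dark region, whereas `sample_awgn` produces the same variance everywhere.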
Signal-dependent models may accurately describe noise components, such as photon noise. However, in real images there are still other noise sources that may not be accurately represented by such models [Abdelhamed2018ACameras, Foi2009ClippedDenoising, Plotz2017BenchmarkingPhotographs]. Examples of such sources include fixed-pattern noise, defective pixels, clipped intensities, spatially correlated noise (i.e., cross-talk), amplification, and quantization noise. Some attempts have been made to close the gap between the prior models and the realistic cases of noise, for example by using a clipped heteroscedastic distribution to account for clipped image intensities [Foi2009ClippedDenoising] or a Poisson mixture model to account for the tail behaviour of real sensor noise [Zhang2017ImprovedNoise]. Recently, a GAN was trained for synthesizing noise [Chen2018ImageModeling]; however, it was not clear how to quantitatively assess the quality of the generated samples. Thus, there is still a lack of noise models that capture the characteristics of real noise. In this paper, we propose a data-driven normalizing flow model that can estimate the density of a real noise distribution. Unlike prior attempts, our model can capture the complex characteristics of noise that cannot be explicitly parameterized by existing models.
2.1 Normalizing Flows
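Normalizing flows were first introduced to machine learning in the context of variational inference [Rezende2015VariationalFlows] and density estimation [Dinh2015NICE:Estimation] and are seeing increased interest for generative modeling [Kingma2018Glow:Convolutions]. A normalizing flow is a transformation of a random variable with a known distribution (typically Normal) through a sequence of differentiable, invertible mappings. Formally, let $\mathbf{x}_0 \in \mathbb{R}^D$ be a random variable with a known and tractable probability density function $p_{X_0}$ and let $(\mathbf{x}_1, \ldots, \mathbf{x}_N)$ be a sequence of random variables such that $\mathbf{x}_j = f_j(\mathbf{x}_{j-1})$ where $f_j$ is a differentiable, bijective function. Then if $\mathbf{n} = f(\mathbf{x}_0) = f_N \circ \cdots \circ f_1(\mathbf{x}_0)$, the change of variables formula says that the probability density function for $\mathbf{n}$ is

$$p(\mathbf{n}) = p_{X_0}\big(g(\mathbf{n})\big) \prod_{j=1}^{N} \left| \det \mathbf{J}_j \right|^{-1}, \tag{7}$$

where $g = g_1 \circ \cdots \circ g_N$ is the inverse of $f$ (with $g_j = f_j^{-1}$), and $\mathbf{J}_j = \partial f_j / \partial \mathbf{x}_{j-1}$ is the Jacobian of the $j$th transformation $f_j$ with respect to its input $\mathbf{x}_{j-1}$ (i.e., the output of $f_{j-1}$).
Density Estimation A normalizing flow can be directly used for density estimation by finding parameters which maximize the log likelihood of a set of samples. Given the observed data, $\mathcal{D} = \{\mathbf{n}_i\}_{i=1}^{M}$, and assuming the transformations $f_1, \ldots, f_N$ are parameterized by $\Theta = (\theta_1, \ldots, \theta_N)$ respectively, the log likelihood of the data is

$$\log p(\mathcal{D} \mid \Theta) = \sum_{i=1}^{M} \left[ \log p_{X_0}\big(g(\mathbf{n}_i \mid \Theta)\big) - \sum_{j=1}^{N} \log \left| \det \mathbf{J}_j(\theta_j) \right| \right], \tag{8}$$

where each Jacobian is evaluated at the corresponding intermediate value for sample $\mathbf{n}_i$. The first term is the log likelihood of the sample under the base measure and the second term, sometimes called the log-determinant or volume correction, accounts for the change of volume induced by the transformations of the normalizing flow.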
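Bijective Transformations To construct an efficient normalizing flow we need to define differentiable and bijective transformations $f$. Beyond being able to define and compute $f$, we also need to be able to efficiently compute its inverse, $g$, and the log determinant $\log \left| \det \mathbf{J} \right|$, which are necessary to evaluate the data log likelihood in Equation 8. First consider the case of a linear transformation [Kingma2018Glow:Convolutions]

$$f(\mathbf{x}) = \mathbf{A}\mathbf{x} + \mathbf{b}, \tag{9}$$

where $\mathbf{A} \in \mathbb{R}^{D \times D}$ and $\mathbf{b} \in \mathbb{R}^{D}$ are parameters. For $f$ to be invertible $\mathbf{A}$ must have full rank; its inverse is given by $g(\mathbf{x}) = \mathbf{A}^{-1}(\mathbf{x} - \mathbf{b})$ and the determinant of the Jacobian is simply $\det \mathbf{J} = \det \mathbf{A}$.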
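As an illustration of the change of variables formula applied to a single linear bijection, the sketch below evaluates the log density of $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b}$ with a standard Normal base measure. The function names and the particular matrix and offset are hypothetical; the sketch only assumes that the matrix has full rank.

```python
import numpy as np

def linear_forward(x, A, b):
    """Map base samples x (one per row) to y = A x + b."""
    return x @ A.T + b

def linear_inverse(y, A, b):
    """Recover x = A^{-1} (y - b) without forming the explicit inverse."""
    return np.linalg.solve(A, (y - b).T).T

def linear_log_prob(y, A, b):
    """log p(y) for y = A x + b with x ~ N(0, I), via change of variables."""
    x = linear_inverse(y, A, b)
    d = x.shape[1]
    log_base = -0.5 * np.sum(x**2, axis=1) - 0.5 * d * np.log(2 * np.pi)
    _, log_abs_det = np.linalg.slogdet(A)  # log|det J| of the forward map
    return log_base - log_abs_det

# Hypothetical full-rank parameters for a 2-D example.
A = np.array([[2.0, 0.5], [0.0, 1.5]])
b = np.array([0.3, -0.1])
x = np.random.default_rng(0).normal(size=(5, 2))
y = linear_forward(x, A, b)
log_p = linear_log_prob(y, A, b)
```

Because a linear map of a Gaussian is again Gaussian, `log_p` can be checked against the closed-form density of a Normal with mean `b` and covariance `A @ A.T`, which is a convenient sanity test for the sign of the log-determinant term.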
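Affine Coupling To enable more expressive transformations, we can use the concept of coupling [Dinh2015NICE:Estimation]. Let $\mathbf{x}_A, \mathbf{x}_B$ be a disjoint partition of the dimensions of $\mathbf{x}$ and let $\hat{f}(\mathbf{x}_A \mid \theta)$ be a bijection on $\mathbf{x}_A$ which is parameterized by $\theta$. Then a coupling flow is

$$f(\mathbf{x}) = \big(\hat{f}(\mathbf{x}_A; \theta(\mathbf{x}_B)), \mathbf{x}_B\big), \tag{10}$$

where $\theta(\mathbf{x}_B)$ is any arbitrary function which uses only $\mathbf{x}_B$ as input. The power of a coupling flow resides, largely, in the ability of $\theta(\mathbf{x}_B)$ to be arbitrarily complex. For instance, shallow ResNets [He2016DeepRecognition] were used for this function in [Kingma2018Glow:Convolutions].
Inverting a coupling flow can be done by using the inverse of $\hat{f}$. Further, the Jacobian of $f$ is a block triangular matrix where the diagonal blocks are $\hat{\mathbf{J}}$ (the Jacobian of $\hat{f}$) and the identity. Hence the determinant of the Jacobian is simply the determinant of $\hat{\mathbf{J}}$. A common form of a coupling layer is the affine coupling layer [Dinh2017DensityNVP, Kingma2018Glow:Convolutions]

$$\hat{f}(\mathbf{x}; \theta) = \mathbf{D}\mathbf{x} + \mathbf{b}, \tag{11}$$

where $\theta = (\mathbf{D}, \mathbf{b})$ and $\mathbf{D}$ is a diagonal matrix. To ensure that $\mathbf{D}$ is invertible and has non-zero diagonals it is common to use $\mathbf{D} = \mathrm{diag}(\exp(\mathbf{a}))$ for an unconstrained vector $\mathbf{a}$.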
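The affine coupling idea can be sketched in a few lines of NumPy. This is an illustrative toy, not the Glow or Noise Flow implementation: the conditioner `theta` is a hypothetical one-layer network standing in for the shallow ResNet mentioned above, and the split into two equal halves is an assumption.

```python
import numpy as np

def theta(x_b, W, c):
    """Hypothetical conditioner: maps the untouched half to (log_scale, shift)."""
    h = np.tanh(x_b @ W + c)
    log_scale, shift = np.split(h, 2, axis=1)
    return log_scale, shift

def coupling_forward(x, W, c):
    """Transform x_a with an affine map whose parameters depend only on x_b."""
    x_a, x_b = np.split(x, 2, axis=1)
    log_scale, shift = theta(x_b, W, c)
    y_a = np.exp(log_scale) * x_a + shift      # diagonal scale exp(log_scale)
    log_det = log_scale.sum(axis=1)            # log|det J| = sum of log scales
    return np.concatenate([y_a, x_b], axis=1), log_det

def coupling_inverse(y, W, c):
    """x_b passes through unchanged, so the same parameters can be recomputed."""
    y_a, x_b = np.split(y, 2, axis=1)
    log_scale, shift = theta(x_b, W, c)
    x_a = (y_a - shift) * np.exp(-log_scale)
    return np.concatenate([x_a, x_b], axis=1)
```

The key design point is that inversion never requires inverting `theta` itself: because one half of the input is copied through, the conditioner can be arbitrarily complex while the layer stays cheaply invertible with a triangular Jacobian.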
With the above formulation of normalizing flows, it becomes clear that we can utilize their expressive power for modeling real image noise distributions and mapping them to easily tractable simpler distributions. As a by-product, such models can directly be used for realistic noise synthesis. Since the introduction of normalizing flows to machine learning, they have been focused towards image generation tasks (e.g., [Kingma2018Glow:Convolutions]). However, in this work, we adapt normalizing flows to the task of noise modeling and synthesis by introducing two new conditional bijections, which we describe next.
3 Noise Flow
In this section, we define a new architecture of normalizing flows for modeling noise which we call Noise Flow. Noise Flow contains novel bijective transformations which capture the well-established and fundamental aspects of parametric noise models (e.g., signal-dependent noise and gain) which are mixed with more expressive and general affine coupling transformations.
3.1 Noise Modeling using Normalizing Flows
Starting from Equations 1 and 8, we can directly use normalizing flows to estimate the probability density of a complex noise distribution. Let $\mathcal{D} = \{\mathbf{n}_i\}_{i=1}^{M}$ denote a dataset of observed camera noise where $\mathbf{n}_i$ is the noise layer corrupting a raw-RGB image. Noise layers can be obtained by subtracting a clean image from its corresponding noisy one. As is common, we choose an isotropic Normal distribution with zero mean and identity covariance as the base measure. Next, we choose a set of bijective transformations, with a set of parameters $\Theta$, that define the normalizing flow model. Lastly, we train the model by minimizing the negative log likelihood of the transformed distribution, as indicated in Equation 8.
We choose the Glow model [Kingma2018Glow:Convolutions] as our starting point. We use two types of bijective transformations (i.e., layers) from the Glow model: (1) the affine coupling layer as defined in Equation 11 that can capture arbitrary correlations between image dimensions (i.e., pixels); and (2) the 1×1 convolutional layers that are used to capture cross-channel correlations in the input images.
3.2 Noise Modeling using Conditional Normalizing Flows
Existing normalizing flows are generally trained in an unsupervised manner using only data samples and without additional information about the data. In our case, we have some knowledge regarding the noise processes, such as the signal-dependency of noise and the scaling of the noise based on sensor gain. Some of these noise processes are shown in Figure 2 along with their associated imaging processes. Thus, we propose new normalizing flow layers that are conditional on such information. However, many noise processes, such as fixed-pattern noise, cannot be easily specified directly. To capture these other phenomena we use a combination of affine coupling layers (Equations 10 and 11) and convolutional layers (a form of Equation 9) which were introduced by the Glow model [Kingma2018Glow:Convolutions].
Figure 3 shows the proposed architecture of our noise model (Noise Flow). Noise Flow is a sequence of a signal-dependent layer; $K$ unconditional flow steps; a gain layer; and another set of $K$ unconditional flow steps. Each unconditional flow step is a block of an affine coupling layer followed by a 1×1 convolutional layer. The term $K$ is the number of flow steps to be used in the model. In our experiments, we use $K = 4$, unless otherwise specified. The model is fully bijective; that is, it can operate in both directions, meaning that it can be used both for simulating noise (by sampling from the base measure and applying the sequence of transformations) and for likelihood evaluation (by applying the inverse transformation to a given noise sample in order to evaluate Equation 7). The Raw-to-sRGB rendering pipeline is imported from [Abdelhamed2018ACameras]. Next, we discuss the proposed signal-dependent and gain layers in detail.
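The two directions of a bijective model can be sketched generically as follows. This is a schematic only: the `Scale` class is a hypothetical stand-in for the real layers (signal-dependent, gain, affine coupling, and convolutional), and the function names are assumptions made for illustration.

```python
import numpy as np

class Scale:
    """Stand-in bijection y = s * x (hypothetical; the real layers are richer)."""
    def __init__(self, s):
        self.s = float(s)

    def forward(self, x):
        """Sampling direction: base variable towards noise."""
        return self.s * x

    def inverse(self, y):
        """Likelihood direction: returns x and log|det J| of the forward map."""
        x = y / self.s
        return x, x.size * np.log(self.s)

def sample(layers, shape, rng):
    """Simulate noise: draw from N(0, I), then apply every layer in order."""
    x = rng.normal(size=shape)
    for layer in layers:
        x = layer.forward(x)
    return x

def nll(layers, n):
    """Negative log likelihood of a noise sample n under the flow."""
    x, total_log_det = n, 0.0
    for layer in reversed(layers):  # undo the layers in reverse order
        x, log_det = layer.inverse(x)
        total_log_det += log_det
    log_base = -0.5 * np.sum(x**2) - 0.5 * x.size * np.log(2 * np.pi)
    return -(log_base - total_log_det)
```

For example, `nll([Scale(2.0), Scale(0.5), Scale(3.0)], n)` should agree with the negative log density of a zero-mean Gaussian with standard deviation 3, since the three scales compose to a single scale of 3.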
3.2.1 Signal-Dependent Layer
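We construct a bijective transformation that mimics the signal-dependent noise process defined in Equation 5. This layer is defined as

$$f(\mathbf{x}) = \mathbf{s} \odot \mathbf{x}, \quad \mathbf{s} = (\beta_1 \mathbf{I} + \beta_2)^{1/2}. \tag{12}$$

The inverse of this layer is given by $g(\mathbf{x}) = \mathbf{s}^{-1} \odot \mathbf{x}$, where $\mathbf{I}$ is the latent clean image, and $\odot$ is point-wise multiplication. To account for volume change induced by this transformation, we compute the log determinant as

$$\log \left| \det \mathbf{J} \right| = \sum_{i=1}^{D} \log s_i, \tag{13}$$

where $s_i$ is the $i$th element of $\mathbf{s}$ and $D$ is the dimensionality (i.e., number of pixels and channels) of $\mathbf{x}$. The signal-dependent noise parameters $\beta_1$ and $\beta_2$ should be strictly positive as the standard deviation of noise should be positive and an increasing function of intensity. Thus, we parameterize them as $\beta_1 = \exp(b_1)$ and $\beta_2 = \exp(b_2)$. We initialize the signal-dependent layer to resemble an identity transformation by setting $b_1$ to a large negative value and $b_2 = 0$. This way, $\beta_1 \approx 0$ and $\beta_2 = 1$, and hence the initial scale $\mathbf{s} \approx 1$.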
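A minimal NumPy sketch of such a signal-dependent layer is given below. It assumes the per-pixel scale is the square root of a linear function of the clean intensity with exponentially parameterized coefficients, as described above; the function names and the specific initial values are hypothetical.

```python
import numpy as np

def signal_scale(clean, b1, b2):
    """Per-pixel scale s = sqrt(exp(b1) * clean + exp(b2)), always positive."""
    beta1, beta2 = np.exp(b1), np.exp(b2)
    return np.sqrt(beta1 * clean + beta2)

def sdn_forward(x, clean, b1, b2):
    """Sampling direction: scale the incoming variable pixel by pixel."""
    return signal_scale(clean, b1, b2) * x

def sdn_inverse(y, clean, b1, b2):
    """Likelihood direction: returns s^{-1} * y and log|det J| = sum(log s)."""
    s = signal_scale(clean, b1, b2)
    return y / s, np.sum(np.log(s))

# Near-identity initialisation (illustrative values, not taken from the paper):
# a very negative b1 makes beta1 close to 0, and b2 = 0 makes beta2 equal to 1.
clean = np.linspace(0.0, 1.0, 16).reshape(4, 4)
s_init = signal_scale(clean, b1=-20.0, b2=0.0)
```

With this initialisation `s_init` is approximately 1 everywhere, so the layer starts out close to an identity map and only gradually learns a signal-dependent scale.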
3.2.2 Gain Layer
Sensor gain amplifies not only the signal, but also the noise. With the common use of higher gain factors in low-light imaging, it becomes essential to explicitly factor the effect of gain into any noise model. Hence, we propose a gain-dependent bijective transformation as a layer of Noise Flow. The gain layer is modeled as a scale factor of the corresponding ISO level of the image, and hence the transformation is

$$f(\mathbf{x}) = \gamma(\mathrm{ISO}) \odot \mathbf{x}, \quad \gamma(\mathrm{ISO}) = u(\mathrm{ISO}) \times \mathrm{ISO}, \tag{14}$$

where $u(\mathrm{ISO}) > 0$ allows the gain factors to vary somewhat from the strict scaling dictated by the ISO value. The inverse transformation is $g(\mathbf{x}) = \gamma^{-1}(\mathrm{ISO}) \odot \mathbf{x}$, where $u(\mathrm{ISO})$ is parameterized to be strictly positive and is initialized to $1/\mathrm{ISO}$ to account for the typical scale of the ISO values. Finally, the log determinant of this layer is

$$\log \left| \det \mathbf{J} \right| = D \log \gamma(\mathrm{ISO}), \tag{15}$$

where $D$ is the number of dimensions (i.e., pixels and channels) in $\mathbf{x}$. There are many ways to represent $u(\mathrm{ISO})$. However, since the available dataset contained only a small set of discrete ISO levels, we chose to simply use a discrete set of values. Formally $u(\mathrm{ISO}) = \exp(v_{\mathrm{ISO}})$ where the exponential is used to ensure that $u(\mathrm{ISO})$ is positive. We use a single parameter $v_{\mathrm{ISO}}$ for each ISO level in the dataset. The values of $v_{\mathrm{ISO}}$ are initialized so that $\exp(v_{\mathrm{ISO}}) \approx 1/\mathrm{ISO}$ to account for the scale of the ISO value and ensure the initial transformation remains close to an identity transformation.
Different cameras may have different gain factors corresponding to their ISO levels. These camera-specific gain factors are usually proprietary and hard to access but may have a significant impact on the noise distribution of an image. To handle this, we use an additional set of parameters to adjust the gain layer for each camera. In this case, the above gain layer is adjusted by introducing a camera-specific scaling factor. That is,

$$\gamma(\mathrm{ISO}, m) = \psi_m \times u(\mathrm{ISO}) \times \mathrm{ISO}, \tag{16}$$

where $\psi_m > 0$ is the scaling factor for camera $m$. This is a simple model but was found to be effective at capturing differences in gain factors between cameras.
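The gain layer with per-ISO and per-camera parameters can be sketched as a small lookup-based class. This is an illustrative sketch under stated assumptions: the ISO levels passed in are placeholders, the class and attribute names are hypothetical, and storing the camera scale in log space is an assumption made here to keep it positive (the text above does not say how it is parameterized).

```python
import numpy as np

class GainLayer:
    """Scalar gain: gamma = camera_scale * exp(v[iso]) * iso."""

    def __init__(self, iso_levels, num_cameras):
        # One parameter per ISO level, initialised so exp(v) * iso == 1,
        # which keeps the initial transformation close to the identity.
        self.v = {iso: -np.log(iso) for iso in iso_levels}
        # Assumption: camera scale stored as a log so that it stays positive;
        # zeros mean every camera starts with a scale of 1.
        self.log_psi = np.zeros(num_cameras)

    def gamma(self, iso, cam):
        return np.exp(self.log_psi[cam]) * np.exp(self.v[iso]) * iso

    def forward(self, x, iso, cam):
        """Sampling direction: amplify by the gain factor."""
        return self.gamma(iso, cam) * x

    def inverse(self, y, iso, cam):
        """Likelihood direction: returns y / gamma and log|det J| = D * log(gamma)."""
        g = self.gamma(iso, cam)
        return y / g, y.size * np.log(g)

# Placeholder ISO levels and camera count, for illustration only.
layer = GainLayer(iso_levels=[100, 400, 800], num_cameras=2)
initial_gain = layer.gamma(400, cam=0)
```

Because the layer applies one scalar to every element, its log determinant is simply the number of elements times the log of that scalar, which is what `inverse` returns.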
4 Experiments
To assess the performance of Noise Flow, we train it to model the realistic noise distribution of the Smartphone Image Denoising Dataset (SIDD) [Abdelhamed2018ACameras] and also evaluate the sampling accuracy of the trained model.
4.1 Experimental Setup
Dataset We choose the SIDD for training our Noise Flow model. The SIDD consists of thousands of noisy and corresponding ground truth images, from ten different scenes, captured repeatedly with five different smartphone cameras under different lighting conditions and ISO levels. The ISO levels ranged from 50 to 10,000. The images are provided in both Raw-RGB and sRGB color spaces. We believe this dataset is the best fit to our task for noise modeling, mainly due to the great extent of variety in cameras, ISO levels, and lighting conditions.
Data preparation We start by collecting a large number of realistic noise samples from the SIDD. We obtain the noise layers by subtracting the ground truth images from the noisy ones. In this work, we use only raw-RGB images as they directly represent the noise distribution of the underlying cameras. We avoid using sRGB images as rendering images into sRGB space tends to significantly change the noise distribution [Nam2016ADenoising]. We arrange the data as approximately image patches of size pixels. We split the data into a training set of approximately of the data and a testing set of approximately of the data. We ensure that the same set of cameras and ISO levels is represented in both the training and testing sets. For visualization only, we render raw-RGB images through a color processing pipeline into sRGB color space.
The SIDD provides only the gain-amplified clean image and not the true latent clean image $\mathbf{I}$. To handle this, we use the learned gain parameter $\gamma$ to correct for this and estimate the latent clean image as the provided clean image divided by $\gamma$ when it is needed in the signal-dependent layer.
Loss function and evaluation metrics We train Noise Flow as a density estimator of the noise distribution of the dataset, which can also be used to generate noise samples from this distribution. For density estimation training, we use the negative log likelihood (NLL) of the training set (see Equation 8) as the loss function, which is optimized using Adam [Kingma2015Adam:Optimization]. For evaluation, we consider the same NLL evaluated on the test set.
To provide further insight into the differences between the approaches, we also consider the Kullback-Leibler (KL) divergence of the pixel-wise marginal distributions between generated samples and test set samples. Such a measure ignores the ability of a model to capture correlations but focuses on a model’s ability to capture the most basic characteristics of the distribution. Specifically, given an image from the test set, we generate a noise sample from the model, compute histograms of the noise values from the test image and the generated noise, and report the discrete KL divergence between the histograms.
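The marginal KL measure can be sketched as a histogram comparison. The number of bins, the shared bin range, the small smoothing constant, and the direction of the divergence (real relative to generated) are all assumptions made for this illustration; the text above does not specify them.

```python
import numpy as np

def marginal_kl(real_noise, generated_noise, bins=100, eps=1e-8):
    """Discrete KL divergence between histograms of two sets of noise values."""
    # Use common bin edges so both histograms are defined on the same support.
    lo = min(real_noise.min(), generated_noise.min())
    hi = max(real_noise.max(), generated_noise.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(real_noise, bins=edges)
    q, _ = np.histogram(generated_noise, bins=edges)
    # Smooth empty bins and normalise the counts to probabilities.
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 20000)
good_model = rng.normal(0.0, 1.0, 20000)   # matches the real marginal
poor_model = rng.normal(0.0, 3.0, 20000)   # variance is too large
kl_good = marginal_kl(real, good_model)
kl_poor = marginal_kl(real, poor_model)
```

In this toy setup `kl_good` should be much smaller than `kl_poor`, mirroring how a model whose marginal matches the real noise scores a lower divergence. Since only pixel values are histogrammed, spatial correlations play no role in the score.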
Baselines We compare the Noise Flow models against two well-established baseline models. The first is the homoscedastic Gaussian noise model (i.e., AWGN) defined in Equation 2. We prepare this baseline model by computing the maximum likelihood estimate (MLE) of the noise variance of the training set, assuming a univariate Gaussian distribution. The second baseline model is the heteroscedastic Gaussian noise model (i.e., NLF), described in Equations 5 and 6, as provided by the camera devices. The SIDD provides the camera-calibrated NLF for each image. We use these NLFs as the parameters of the heteroscedastic Gaussian model for each image. During testing, we compute the NLL of the testing set against both baseline models.
4.2 Results and Ablation Studies
Noise Density Estimation Figure 3(a) shows the training and testing NLL on the SIDD of Noise Flow compared to (1) the Gaussian noise model and (2) the signal-dependent noise model as represented by the camera-estimated noise level functions (NLFs). It is clear that Noise Flow can model the realistic noise distribution better than Gaussian and signal-dependent models. As shown in Table 1, Noise Flow achieves the best NLL, with and nats/pixel improvement over the Gaussian and camera NLF models, respectively. This translates to and improvement in likelihood, respectively. We calculate the improvement in likelihood by calculating the corresponding improvement in $\exp(-\mathrm{NLL})$.
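The conversion from an NLL difference to a relative likelihood improvement can be sketched as below, assuming the likelihood is taken to be the exponential of the negated per-pixel NLL. The function name and the numbers in the example are hypothetical and are not results from the paper.

```python
import math

def likelihood_improvement(nll_baseline, nll_model):
    """Relative gain in per-pixel likelihood exp(-NLL), returned as a fraction."""
    # (exp(-nll_model) - exp(-nll_baseline)) / exp(-nll_baseline)
    return math.exp(nll_baseline - nll_model) - 1.0

# Hypothetical example: a model that is 0.1 nats/pixel better than a baseline.
gain = likelihood_improvement(nll_baseline=-3.0, nll_model=-3.1)
```

Only the difference between the two NLL values matters, so the same NLL gap always maps to the same relative improvement regardless of the absolute NLL level.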
Table 1: NLL and marginal KL divergence of the Gaussian, camera NLF (Cam. NLF), and Noise Flow models.
Noise Synthesis Figure 3(b) shows the average marginal KL divergence between the generated noise samples and the corresponding noise samples from the testing set for the three models: Gaussian, camera NLF, and Noise Flow. Noise Flow achieves the best KL divergence, with and improvement over the Gaussian and camera NLF models, respectively, as shown in Table 1.
Figure 5 shows generated noise samples from Noise Flow compared to samples from Gaussian and camera NLF models. We show samples from various ISO levels and lighting conditions (N: normal light, L: low light). Noise Flow samples are the closest to the real noise distribution in terms of the marginal KL divergence. Also, there are more noticeable visual similarities between Noise Flow samples and the real samples compared to the Gaussian and camera NLF models.
Learning signal-dependent noise parameters Figure 5(a) shows the learning of the signal-dependent noise parameters $\beta_1$ and $\beta_2$ as defined in Equation 6 while training a Noise Flow model. The parameters are converging towards values that are consistent with the signal-dependent noise model where $\beta_1$ is the dominant noise factor that represents the Poisson component of the noise and $\beta_2$ is the smaller factor representing the additive Gaussian component of the noise. In our experiments, the shown parameters are run through an exponential function to force their values to be strictly positive.
Learning gain factors Figure 5(b) shows the learning of the gain factors as defined in Equation 14 while training a Noise Flow model. The gain factors are consistent with the corresponding ISO levels indicated by the subscript of each gain factor. This shows the ability of the Noise Flow model to properly factor the sensor gain in the noise modeling and synthesis process. Note that we omitted ISO level 200 from the training and testing sets because there are not enough images from this ISO level in the SIDD.
Learning camera-specific parameters In our Noise Flow model, the camera-specific parameters consist of a set of gain scale factors, one for each of the five cameras in the SIDD. Figure 7 shows these gain scales for each camera in the dataset during the course of training. It is clear that there are differences between cameras in the learned gain behaviours. These differences are consistent with the differences in the noise level function parameter of the corresponding cameras shown in Figure 6(b) and capture fundamental differences in the noise behaviour between devices. This demonstrates the importance of the camera-specific parameters to capture camera-specific noise profiles. Training Noise Flow for a new camera can be done by fine-tuning the camera-specific parameters within the gain layers; all other layers (i.e., the signal-dependent and affine coupling layers) can be considered non-camera-specific.
Effect of individual layers Table 2 compares different architecture choices for our Noise Flow model. We denote the different layers as follows: G: gain layer; S: signal-dependent layer; CAM: a layer using camera-specific parameters; Ax1: one unconditional flow step (an affine coupling layer and a 1×1 convolutional layer); Ax4: four unconditional flow steps. The results show a significant improvement in noise modeling (in terms of NLL and marginal KL divergence) resulting from the additional camera-specific parameters (i.e., the S-G-CAM model), confirming the differences in noise distributions between cameras and the need for camera-specific noise parameters. Then, we show the effect of using affine coupling layers and convolutional layers in our Noise Flow model. Adding the Ax1 blocks improves the modeling performance in terms of NLL. Also, increasing the number of unconditional flow steps from one to four introduces a slight improvement as well. This indicates the importance of affine coupling layers in capturing additional pixel correlations that cannot be directly modeled by the signal-dependent or gain layers. The S-Ax4-G-Ax4-CAM is the final Noise Flow model.
5 Application to Real Image Denoising
Preparation To further investigate the accuracy of the Noise Flow model, we use it as a noise generator to train an image denoiser. We use the DnCNN image denoiser [Zhang2017BeyondDenoising]. We use the clean images from the SIDD-Medium [Abdelhamed2018ACameras] as training ground truth and the SIDD-Validation as our testing set. The SIDD-Validation contains both real noisy images and the corresponding ground truth. We compare three different cases for training DnCNN using synthetically generated noise: (1) DnCNN-Gauss: homoscedastic Gaussian noise (i.e., AWGN); (2) DnCNN-CamNLF: signal-dependent noise from the camera-calibrated NLFs; and (3) DnCNN-NF: noise generated from our Noise Flow model. For the Gaussian noise, we randomly sample standard deviations from the range . For the signal-dependent noise, we randomly select from a set of camera NLFs. For the noise generated with Noise Flow, we feed the model with random camera identifiers and ISO levels. The range, camera NLFs, ISO levels, and camera identifiers are all reported in the SIDD. Furthermore, in addition to training with synthetic noise, we also train the DnCNN model with real noisy/clean image pairs from the SIDD-Medium and no noise augmentation (indicated as DnCNN-Real).
Results and discussion Table 3 shows the best achieved testing peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [Wang2004ImageSimilarity] of DnCNN using the aforementioned three noise synthesis strategies and the discriminative model trained on real noise. The model trained on noise generated from Noise Flow yields the highest PSNR and SSIM values, even slightly higher than DnCNN-Real due to the relatively limited number of samples in the training dataset. We also report, in parentheses, the relative improvement introduced by DnCNN-NF over the other two models in terms of root-mean-square-error (RMSE) and structural dissimilarity (DSSIM) [oza2009StructuralVideos, Webb2003StatisticalRecognition], for PSNR and SSIM, respectively. We preferred to report relative improvement in this way because PSNR and SSIM tend to saturate as errors get smaller; conversely, RMSE and DSSIM do not saturate. For visual inspection, in Figure 8, we show some denoised images from the best trained model from the three cases, along with the corresponding noisy and clean images. DnCNN-Gauss tends to over-smooth noise, as in rows 3 and 5, while DnCNN-CamNLF frequently causes artifacts and pixel saturation, as in rows 1 and 5. Although DnCNN-NF does not consistently yield the highest PSNR, it is the most stable across all six images. Noise Flow can be used beyond image denoising in assisting computer vision tasks that require noise synthesis (e.g., robust image classification [Diamond2017DirtyData] and burst image deblurring [Aittala2018BurstNetworks]). In addition, Noise Flow would give us virtually unlimited noise samples compared to the limited numbers in the datasets.
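The conversion from PSNR and SSIM differences to relative RMSE and DSSIM improvements can be sketched as follows. This sketch assumes the standard relation between PSNR and RMSE for a fixed peak value and the commonly used definition of DSSIM as half of one minus SSIM; the function names and example numbers are hypothetical and are not results from Table 3.

```python
def rmse_reduction(psnr_old, psnr_new):
    """Relative RMSE reduction implied by a PSNR gain (same peak value)."""
    # PSNR = 20 * log10(peak / RMSE), so RMSE_new / RMSE_old = 10^(-(gain) / 20).
    return 1.0 - 10.0 ** (-(psnr_new - psnr_old) / 20.0)

def dssim(ssim):
    """Structural dissimilarity, assuming DSSIM = (1 - SSIM) / 2."""
    return (1.0 - ssim) / 2.0

def dssim_reduction(ssim_old, ssim_new):
    """Relative DSSIM reduction implied by an SSIM gain."""
    return 1.0 - dssim(ssim_new) / dssim(ssim_old)

# Hypothetical example: a 6.02 dB PSNR gain roughly halves the RMSE,
# and raising SSIM from 0.90 to 0.95 halves the DSSIM.
rmse_gain = rmse_reduction(40.0, 46.0206)
dssim_gain = dssim_reduction(0.90, 0.95)
```

Expressing gains this way makes small differences near the top of the PSNR and SSIM scales easier to compare, since the error-based quantities keep shrinking proportionally instead of flattening out.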
6 Conclusion
In this paper, we have presented a conditional normalizing flow model for image noise modeling and synthesis that combines well-established noise models and the expressiveness of normalizing flows. As an outcome, we provide a compact noise model with fewer than 2500 parameters that can accurately model and generate realistic noise distributions with nats/pixel improvement (i.e., higher likelihood) over camera-calibrated noise level functions. We believe the proposed method and the provided model will be very useful for advancing many computer vision and image processing tasks. The code and pre-trained models are publicly available at: https://github.com/BorealisAI/noise_flow.
Acknowledgments
This work was supported by Mitacs through the Mitacs Accelerate Program as part of an internship at Borealis AI. This study was also funded in part by the Canada First Research Excellence Fund for the Vision: Science to Applications (VISTA) programme and an NSERC Discovery Grant. Dr. Brown contributed to this article in his personal capacity as a professor at York University. The views expressed are his own and do not necessarily represent the views of Samsung Research. Abdelrahman is partially supported by an AdeptMind scholarship.