Code for "Joint Denoising and Demosaicking with Green Channel Prior for Real-world Burst Images", TIP2021
Denoising and demosaicking are essential yet correlated steps to reconstruct a full color image from the raw color filter array (CFA) data. By learning a deep convolutional neural network (CNN), significant progress has been achieved to perform denoising and demosaicking jointly. However, most existing CNN-based joint denoising and demosaicking (JDD) methods work on a single image while assuming additive white Gaussian noise, which limits their performance on real-world applications. In this work, we study the JDD problem for real-world burst images, namely JDD-B. Considering the fact that the green channel has twice the sampling rate and better quality than the red and blue channels in CFA raw data, we propose to use this green channel prior (GCP) to build a GCP-Net for the JDD-B task. In GCP-Net, the GCP features extracted from green channels are utilized to guide the feature extraction and feature upsampling of the whole image. To compensate for the shift between frames, the offset is also estimated from GCP features to reduce the impact of noise. Our GCP-Net can preserve more image structures and details than other JDD methods while removing noise. Experiments on synthetic and real-world noisy images demonstrate the effectiveness of GCP-Net quantitatively and qualitatively.
Most consumer grade digital cameras capture natural images using a single-chip CCD/CMOS sensor covered by a color filter array (CFA), resulting in incomplete color sampling at each photoreceptor. The process of interpolating the missing colors from mosaicked CFA data is called color demosaicking. The captured data is inevitably corrupted by noise, especially under low-light conditions. Denoising and demosaicking play crucial roles to obtain high quality images in the camera ISP (image signal processing) pipeline, and a variety of image denoising and demosaicking methods[1, 2, 3, 4, 5] have been proposed.
Previous demosaicking and denoising methods are usually designed independently and implemented sequentially in the ISP. However, the demosaicking errors will complicate the denoising process, or the denoising artifacts can be amplified in the demosaicking process. Therefore, joint denoising and demosaicking (JDD) has received considerable research interests [6, 7, 8, 9, 10, 11, 12, 13]. Traditional JDD methods resort to image priors, such as piecewise smoothness  and non-local self-similarity , and employ an optimization model for this joint task. Those handcrafted priors, however, are not accurate enough to reproduce the complex image local structures. Recent JDD methods are mostly data-driven learning methods, where a deep convolutional neural network (CNN) is trained on pairwise dataset with noisy mosaicked images and their clean full color ground truths [9, 10, 12, 13, 14]. By learning deep priors from a large amount of data, those CNN based methods achieve much better JDD performance than traditional model based methods.
Existing JDD methods, including those CNN based ones, mostly work on a single CFA image, and we call them JDD-S methods, which have several limitations when applying to real-world CFA data. First, their performance will deteriorate significantly on CFA images with strong noise level. This often occurs for low-end devices such as smartphone cameras due to the small sensor and lens. The situation becomes worse under low-light imaging conditions. Second, current JDD-S methods [9, 10, 12, 13, 14] usually assume additive white Gaussian noise (AWGN) in the training process, which cannot accurately describe the distribution of real-world noise. As a result, strong visual artifacts will appear in the JDD outputs of real-world noisy CFA images.
Recently, it has been shown that the denoising performance can be significantly improved by using a set of burst images instead of a single image, especially under low-light imaging conditions [16, 17, 18, 19, 20]. Inspired by the success of burst image denoising, we propose to perform JDD with real-world burst images, which is called JDD-B. With some realistic noise modeling methods [16, 21], we can synthesize noisy burst images with clean ground truth by reversing the ISP pipeline on high quality video sequences and adding noise to them. Such pairwise data can be used to train the JDD-B model. It is well known that the green channel of images captured by single-chip digital cameras has better quality than the red and blue channels. On one hand, the green channel has twice the sampling rate of the red/blue channels in most CFA patterns (e.g., the Bayer pattern). On the other hand, the sensor is more sensitive to green than to red/blue. As a result, the green channel has more texture information and higher SNR than the red/blue channels in most natural images, which is demonstrated in Fig. 1 by using the SIDD dataset. We call the above fact and prior knowledge the green channel prior (GCP), and use the GCP to design our JDD-B network, namely GCP-Net, to improve the JDD-B performance on real-world burst images.
Specifically, in GCP-Net, we extract the GCP features from the green channel to guide the deep feature modeling and upsampling of the whole image. The GCP features are also utilized to estimate the offset between frames to relieve the impact of noise. As shown in Fig. 2, with GCP, the JDD-B results can preserve more structures and details while removing noise. Our GCP-Net achieves state-of-the-art JDD-B performance on both synthetic noisy images and real-world burst images captured by smartphones.
Image denoising and demosaicking are two important steps in camera ISP pipeline. A few methods have been proposed for joint denoising and demosaicking on a single raw image (JDD-S) [9, 23, 10, 11, 12, 13, 14]. In , Qian et al. showed that the performance of JDD-S is generally better than performing denoising and demosaicking separately. In , a learning-based method was proposed for JDD-S. Henz et al.  proposed an auto-encoder architecture to model the color-image capturing process on each monochromatic sensor. Kokkinos et al.  proposed a plug-and-play framework for the JDD-S task. To enhance the performance on real-world images, Ehret et al.  proposed a mosaic-to-mosaic framework by finetuning the network using mosaic burst images. It should be noted that though Ehret et al. utilized burst images to fine-tune the network, the input of the network is still a single image. Liu et al.  proposed a self-guided JDD-S network by considering the advantages of the higher sampling rate of green channel and using this prior to guide the upsampling process. In this paper, we further analyze the noise level imbalance among different color channels in real-world photographs, and perform the JDD task using burst images instead of a single image.
Compared with single image restoration tasks, burst image processing encounters new challenges on estimating the offsets among different frames caused by camera movement and moving objects. According to the employed alignment frameworks, we partition burst image restoration methods into three categories, i.e., pre-aligned methods [17, 18], kernel-based methods [16, 24, 25, 26] and feature-based alignment [27, 28, 19, 20].
Pre-alignment methods mostly employ optical flow to estimate the motions and perform warping to compensate for temporal offset. The frame-to-frame method  utilizes the TV- algorithm  to estimate optical flow within frames. ToFlow  utilizes the SpyNet  as the flow estimation module, which is jointly trained with the denoising module. However, the restoration performance of those methods is largely affected by the accuracy of estimated optical flow, while accurate flow is difficult to obtain especially under large motion and severe noise. Kernel-based methods use convolutional neural networks to predict spatially varying kernels, which perform aligning and denoising simultaneously . Compared with the original KPN , Xu et al.  and Marinc et al.  proposed to learn the deformable kernels and multiple kernels. Xia et al.  proposed to predict a set of global basis kernels and the corresponding mixing coefficients to effectively exploit larger denoising kernels.
Compared with the above two categories of alignment methods, performing alignment in the feature domain is a more promising strategy and has achieved SOTA performance on video super-resolution tasks. Liu et al.  proposed to use a localization net to estimate spatial transform parameters from deep features and directly warp the features to align the shift. TDAN  and EDVR  were proposed to estimate the offset of deformable convolution (DConv), which is utilized to align the shift in the feature domain. Compared with  which needs the ground-truth information of spatial transform parameters, TDAN and EDVR do not need such parameters while still achieving SOTA performance. RViDeNet  also utilizes DConv to align multi-scale features for burst denoising. In this work, we perform alignment in the feature domain and utilize deformable convolution to implicitly compensate for offsets. Moreover, we design an inter-frame module which not only utilizes multi-scale information, but also considers the temporal constraint.
Our goal is to recover a clean full-color image, denoted by , from a burst of real-world noisy CFA images, denoted as . The subscript “ref” represents the index of reference frame. Usually, the noisy counterpart of is the center frame in .
The noise in real-world raw images is signal-dependent  due to the photon arrival statistics and the imprecision in readout circuitry. The noise introduced by photon sensing, i.e., shot noise, can be modeled as the Poisson distribution, while the noise introduced in readout circuitry, i.e., read noise, can be modeled by the Gaussian distribution. Denote by the desired clean raw image captured at time . The corresponding noisy raw image can be written as :
where and and are the scale parameters for shot noise and read noise, respectively. represents Gaussian distribution.
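The symbols of Eq. 1 are not legible in this copy, so the following is only a minimal sketch of a shot-plus-read noise model, not the authors' exact formulation. It assumes the widely used heteroscedastic Gaussian approximation, in which the per-pixel variance is `sigma_s * clean + sigma_r**2`; the names `add_shot_read_noise`, `sigma_s` and `sigma_r` are hypothetical. The function also returns the per-pixel standard deviation, which is the kind of noise map the network takes as input.

```python
import numpy as np

def add_shot_read_noise(clean, sigma_s, sigma_r, rng):
    """Add signal-dependent noise to a clean raw image with values in [0, 1].

    Assumption: Poisson shot noise is approximated by a Gaussian whose
    variance grows linearly with the signal (sigma_s * clean), and read
    noise is Gaussian with variance sigma_r**2.
    """
    clean = np.clip(clean, 0.0, 1.0)
    variance = sigma_s * clean + sigma_r ** 2
    noise_map = np.sqrt(variance)  # per-pixel noise standard deviation
    noisy = clean + rng.normal(size=clean.shape) * noise_map
    return noisy, noise_map

rng = np.random.default_rng(0)
clean = np.full((64, 64), 0.5)
noisy, noise_map = add_shot_read_noise(clean, sigma_s=0.01, sigma_r=0.02, rng=rng)
```

Brighter pixels receive a larger noise standard deviation, yet their signal-to-noise ratio is higher.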
As discussed in , the CMOS sensor has different sensitivity to light of different wavelengths or colors, and in most illumination conditions, green channels are brighter than red and blue channels in Bayer pattern CFA images. Since the real-world noise contains the Poissonian shot noise (see Eq. 1), and the signal-to-noise ratio (SNR) has a square root relationship between signal and noise, the brighter green channel often has a higher SNR than red/blue channels. To validate this, we compute the average SNR of different color channels of 50 real-world noisy raw images randomly chosen from the SIDD  benchmark dataset (which contains high-ISO noisy images captured by smartphone cameras), and show the SNR comparison in Fig. 1. One can see that the SNR of green channel is higher than that of red/blue channels for most of the noisy images. In addition, the green channel has twice the sampling rate of red/blue channels. Overall, the green channel preserves better image structure and details than the other two channels. In this paper, we call the prior knowledge that the green channel has higher SNR, higher sampling rate and hence better channel quality than red/blue channels the green channel prior (GCP), which is carefully exploited in this paper to design our JDD-B network.
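The square-root relationship can be made explicit with a short worked derivation. The notation is assumed for illustration only: $x$ is the clean intensity, $\sigma_s$ and $\sigma_r$ are the shot and read noise scales, and the noise variance is taken to be $\sigma_s x + \sigma_r^2$.

```latex
\mathrm{SNR}(x) = \frac{x}{\sqrt{\sigma_s x + \sigma_r^2}}
\;\approx\; \sqrt{\frac{x}{\sigma_s}}
\quad \text{when } \sigma_s x \gg \sigma_r^2,
\qquad
\frac{\mathrm{SNR}(x_G)}{\mathrm{SNR}(x_R)} \approx \sqrt{\frac{x_G}{x_R}} .
```

For example, under this approximation a green channel that is twice as bright as the red channel has an SNR that is higher by a factor of $\sqrt{2} \approx 1.41$, i.e., about 3 dB.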
With GCP, we propose a new network, namely GCP-Net, for JDD-B. Without loss of generality, we assume that the Bayer CFA pattern is used. For each raw image , we reshape it as four R, G, G, B sub-images of the same size so that . We denote the noise map of as , whose value at each location is the standard deviation of signal-dependent noise at that position. The input of our GCP-Net is a sequence of noisy raw images and their corresponding noise maps . The output is the clean full-resolution linear RGB image .
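The reshaping of a Bayer raw frame into four half-resolution sub-images can be sketched as follows. This is an illustrative NumPy version that assumes an RGGB layout with R at the top-left pixel; the function name `pack_rggb` is hypothetical.

```python
import numpy as np

def pack_rggb(raw):
    """Split an (H, W) RGGB Bayer image into a (4, H/2, W/2) stack: R, G, G, B."""
    h, w = raw.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    r = raw[0::2, 0::2]   # even rows, even columns
    g1 = raw[0::2, 1::2]  # even rows, odd columns
    g2 = raw[1::2, 0::2]  # odd rows, even columns
    b = raw[1::2, 1::2]   # odd rows, odd columns
    return np.stack([r, g1, g2, b], axis=0)

raw = np.arange(16).reshape(4, 4)
packed = pack_rggb(raw)  # shape (4, 2, 2)
```

Two of the four planes are green, which is the higher sampling rate that the GCP relies on.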
The overview of GCP-Net is illustrated in Fig. 3. GCP-Net consists of two branches, i.e., a GCP branch and a reconstruction branch. In the GCP branch, the green features are extracted from the concatenation of noisy green channels, denoted by , and their noise level maps, denoted by . This process can be written as:
where consists of several Conv+LReLU blocks. These GCP features are utilized as the guided information for the reconstruction branch. We utilize layer-wise guiding strategy and denote the GCP feature of the -th layer as .
The reconstruction branch utilizes the burst images, the corresponding noise maps and the GCP features to estimate the clean full color image. As illustrated in Fig. 3, it consists of three parts: the intra-frame (IntraF) module, the inter-frame (InterF) module and the merge module. The IntraF module is designed to model the deep features of each frame and it utilizes GCP features to guide the feature extraction. The InterF module is to compensate for the shift between frames by using the DConv in feature domain. To reduce the influence of noise in alignment, offset is estimated from the cleaner GCP features. The merge module is designed to aggregate the aligned features and use the GCP features to perform adaptive upsampling for the full-resolution image reconstruction. The details of these modules are presented in the following sections.
The architecture of the IntraF module is shown in Fig. 4. For the -th frame, the input of IntraF includes the noisy raw image , the corresponding noise level map and the GCP feature . Firstly, one simple convolution layer is used to model the initial features as . Then, the initial features are passed to the concatenation of four green channel attention (GCA) blocks, where the GCP features are used to guide the feature extraction and a dual attention mechanism is designed to better deal with the channel-dependent and spatial-dependent noise. We adopt a layer-wise guiding strategy for GCA blocks and empirically find that such a strategy is favorable to the restoration results. Without loss of generality, we use the -th GCA block to present the modeling process. The output features are denoted as:
where represents the -th GCA block.
The detailed structure of the GCA block is shown in Fig. 5. Inspired by [32, 33], the GCP information can be exploited by using pixel-wise scaling and bias, and the enhanced feature can be expressed as:
where and are two learned modulation parameters of the guided layers. We denote the unit to implement Eq. 4 as the green guided (GG) unit. The green-guided features are estimated by two residual blocks, denoted by ,
is the learned features. Because normal Conv layers treat spatial and channel features equally, they are not well suited to real-world noise, which is channel and spatial dependent. To further enhance the representational power of standard Conv+ReLU blocks, channel attention (CA) and spatial attention (SA) [34, 35, 36] are designed to model the cross-channel and spatial relationships of deep features.
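The pixel-wise scaling and bias of the GG unit can be illustrated with a small NumPy sketch. It is written under stated assumptions and is not the authors' implementation: the two modulation maps are produced from the GCP feature by 1×1 convolutions (the actual layers may differ), the modulation takes the form `gamma * feat + beta`, and all names (`conv1x1`, `green_guided_modulation`, `w_gamma`, `w_beta`) are hypothetical.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1x1 convolution: x is (C_in, H, W), weight is (C_out, C_in), bias is (C_out,)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def green_guided_modulation(feat, gcp_feat, w_gamma, b_gamma, w_beta, b_beta):
    """Modulate image features with a scale and a bias predicted from GCP features."""
    gamma = conv1x1(gcp_feat, w_gamma, b_gamma)  # pixel-wise scale, same shape as feat
    beta = conv1x1(gcp_feat, w_beta, b_beta)     # pixel-wise bias, same shape as feat
    return gamma * feat + beta

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 16, 16))      # features of one frame
gcp_feat = rng.normal(size=(4, 16, 16))  # features extracted from the green channels
out = green_guided_modulation(
    feat, gcp_feat,
    rng.normal(size=(8, 4)), np.ones(8),
    rng.normal(size=(8, 4)), np.zeros(8),
)
```

Because the scale and bias vary per pixel, the cleaner green-channel features can strengthen or suppress responses location by location.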
The features of size are firstly converted into a channel descriptor using global average pooling (GAP). To make use of the aggregated information, the channel descriptor is processed by two convolutional layers with kernel size , followed by a sigmoid activation to obtain the activations . The output of CA is the rescaled feature using .
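A minimal sketch of the channel attention described above is given below. On a pooled descriptor with no spatial extent, a convolution reduces to a matrix product, so the two layers are written as matrix multiplications. The ReLU between the two layers and the function names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, b1, w2, b2):
    """Rescale each channel of feat (C, H, W) by a learned activation in (0, 1)."""
    descriptor = feat.mean(axis=(1, 2))              # global average pooling -> (C,)
    hidden = np.maximum(w1 @ descriptor + b1, 0.0)   # first layer (ReLU is assumed)
    scale = sigmoid(w2 @ hidden + b2)                # second layer + sigmoid -> (C,)
    return feat * scale[:, None, None]

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 16, 16))
out = channel_attention(
    feat,
    rng.normal(size=(2, 8)), np.zeros(2),  # squeeze 8 channels to 2
    rng.normal(size=(8, 2)), np.zeros(8),  # expand back to 8 channels
)
```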
The SA block is designed to model the spatial dependencies of deep features by rescaling the features using the estimated spatial attention map . Instead of using average pooling and max pooling, is adaptively obtained by using two convolutional layers, followed by the sigmoid activation.
The output of the -th GCA block is obtained by:
The features extracted by the IntraF module are then aligned to the reference frame feature in the InterF module. The InterF module, whose architecture is shown in Fig. 6, aims at modeling the temporal dependency between frames. We use deformable convolution to compensate for the offset between frames. To relieve the effect of severe noise and to better model the correlation between neighboring frames, we use the GCP features to estimate the offset. Similar to EDVR  and RViDeNet , pyramidal processing is utilized to handle possible large motions. Moreover, to better exploit the temporal constraint in the offset estimation, we introduce an LSTM regularization in the offset estimation.
For each pyramidal scale of the -th frame, the inter-frame GCP feature, denoted by , is obtained by using
where is the concatenation operator and is the Conv layer. Then, the temporal regularization is introduced by using ConvLSTM , which is a popular 2D sequence data modeling method. The ConvLSTM updates the hidden state and the cell state with:
The updated inter-frame feature can be written as:
As discussed in , the LSTM mechanism has limited ability to deal with complex motions. To handle large motions, multi-scale information is aggregated to estimate the offset more accurately:
where and are two convolutional layers, is the upsampling operator with factor 2.
The aligned features at each position can then be obtained by:
in which is the sampling location of deformable convolution kernel, is the modulation scalar. Following , since is fractional, bilinear interpolation is applied. The final aligned feature at scale is obtained by:
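To make the sampling step concrete, the sketch below evaluates a modulated deformable 3×3 kernel at a single output position of a single-channel feature map. It is an illustration rather than the authors' code: positions outside the map are treated as zero, and the names `bilinear_sample` and `deformable_response` are hypothetical.

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate feat (H, W) at a fractional location (y, x)."""
    h, w = feat.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < h and 0 <= xi < w:  # zero padding outside the map
                value += (1 - abs(y - yi)) * (1 - abs(x - xi)) * feat[yi, xi]
    return value

def deformable_response(feat, center, weights, offsets, masks):
    """Sum over the 3x3 grid of weight * modulation * feature at the shifted location."""
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    out = 0.0
    for k, (dy, dx) in enumerate(grid):
        y = center[0] + dy + offsets[k][0]  # regular grid position plus learned offset
        x = center[1] + dx + offsets[k][1]
        out += weights[k] * masks[k] * bilinear_sample(feat, y, x)
    return out

feat = np.arange(16.0).reshape(4, 4)
no_shift = [(0.0, 0.0)] * 9
box = deformable_response(feat, (1, 1), [1 / 9] * 9, no_shift, [1.0] * 9)
```

With zero offsets and unit modulation the operation reduces to an ordinary 3×3 convolution. Non-zero offsets let each tap read from the displaced content of a neighboring frame.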
where refers to general Conv+LReLU layers.
The merge module is designed to merge the aligned features and output the estimated clean RGB image . The aligned features are firstly concatenated and adaptively merged as follows:
where is the merge function by using one simple convolution layer. Then we upsample the features to full-resolution features using GCP adaptive upsampling. Similar to the green guided operator in GCA block, the GCP adaptive upsampling can be expressed as:
where are the green guided features of size , represents the upsampling operator by a factor , and are the two learned modulation parameters. Transpose convolution is used for the upsampling interpolation.
The final estimation can be written as:
where is designed to estimate the final clean RGB image. In this paper, we utilize a three scale U-Net  architecture for to exploit the multi-scale information as well as enlarging the receptive field. All the Conv kernels are of size , followed by the nonlinear function LReLU. The downsampling and upsampling operators in are strided convolutions and transpose convolutions, respectively.
For the estimated clean , we define the reconstruction loss in the linear color space as follows:
where is the Charbonnier penalty function , is set to . As discussed in KPN , computing loss in sRGB color space can produce a perceptually more relevant estimation. Therefore, we also introduce a loss in the sRGB color space:
where is the operator which transforms linear RGB color space to sRGB space. In this paper, contains white balance, color correction and gamma compression as in . To sum up, the overall loss to optimize our model is:
where is the trade-off parameter and we simply set to 1 in our experiments.
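The two-term loss can be sketched in NumPy as follows. Several details are assumptions, because the corresponding equations and constants are not legible in this copy: the Charbonnier constant `eps=1e-3` is a placeholder, the penalty is averaged over pixels, the trade-off weight multiplies the sRGB term, and `to_srgb_like` applies only gamma compression as a simplified stand-in for the full white balance, color correction and gamma transform.

```python
import numpy as np

def charbonnier(pred, target, eps=1e-3):
    """Charbonnier penalty sqrt(d^2 + eps^2), averaged over all elements."""
    return float(np.mean(np.sqrt((pred - target) ** 2 + eps ** 2)))

def to_srgb_like(linear, gamma=2.2):
    """Simplified stand-in for the linear-to-sRGB operator (gamma compression only)."""
    return np.clip(linear, 0.0, 1.0) ** (1.0 / gamma)

def total_loss(pred, target, lam=1.0, eps=1e-3):
    """Loss in linear space plus lam times the loss in the sRGB-like space."""
    linear_term = charbonnier(pred, target, eps)
    srgb_term = charbonnier(to_srgb_like(pred), to_srgb_like(target), eps)
    return linear_term + lam * srgb_term

rng = np.random.default_rng(0)
target = rng.uniform(size=(8, 8, 3))
pred = np.clip(target + 0.05 * rng.normal(size=target.shape), 0.0, 1.0)
loss = total_loss(pred, target)
```

Gamma compression stretches dark values, so the second term penalizes errors in shadows more strongly than the linear term does.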
Obtaining ground-truth images for training is difficult for real-world image restoration tasks. In some single-image based restoration works [42, 15, 13], real-world degraded data are collected and the corresponding ground-truth images are physically and/or mathematically estimated for pair-wise training. However, for the burst-images based JDD-B task, the misalignment problem and the coherence between denoising and demosaicking make the ground-truth estimation much more difficult. Therefore, we synthesize training data by using an open high quality video dataset, i.e., Vimeo-90K .
Since camera sensor outputs are in the linear color space, we first convert the sRGB images into linear RGB space by using the unprocessing operation in , which can be written as . The unprocessing operation includes inverse gamma compression, inverse color correction, inverse tone mapping and inverse white balance. The converted linear RGB frame is taken as the ground-truth image. By using Eq. 1, the noisy raw image can be synthesized as:
where is the mosaic matrix which downsamples a linear RGB image to a Bayer CFA image. Without loss of generality, the RGGB mosaic pattern is used as to generate the data.
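The mosaic operation keeps one color sample per pixel. A minimal NumPy sketch for the RGGB pattern is shown below; the function name `mosaic_rggb` is hypothetical, and the input is assumed to be an (H, W, 3) linear RGB image with even height and width.

```python
import numpy as np

def mosaic_rggb(rgb):
    """Downsample an (H, W, 3) RGB image to an (H, W) RGGB Bayer image."""
    h, w, _ = rgb.shape
    raw = np.empty((h, w), dtype=rgb.dtype)
    raw[0::2, 0::2] = rgb[0::2, 0::2, 0]  # R at even rows, even columns
    raw[0::2, 1::2] = rgb[0::2, 1::2, 1]  # G at even rows, odd columns
    raw[1::2, 0::2] = rgb[1::2, 0::2, 1]  # G at odd rows, even columns
    raw[1::2, 1::2] = rgb[1::2, 1::2, 2]  # B at odd rows, odd columns
    return raw

rgb = np.zeros((4, 4, 3))
rgb[..., 0], rgb[..., 1], rgb[..., 2] = 1.0, 2.0, 3.0
raw = mosaic_rggb(rgb)
```

In a full synthesis pipeline, signal-dependent noise would be added to `raw` after this step.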
|# of GG Units||0||1||2||3||4||5||4 (w/o GCP upsampling)||using RB to guide|
In our experiments, a number of neighboring frames are used as the input and the central frame is chosen as the reference frame. Following the setting in , the noise level parameters and in Eq. 1 are uniformly sampled from the ranges of and , respectively. We adopt the method in  to initialize the GCP-Net and use the ADAM  algorithm with $\beta_1$ = 0.9 and $\beta_2$ = 0.99 to update the network. The size of mini-batch is 2 and the size of each noisy raw patch is with 4 color channels (RGGB). The reconstructed RGB patch is of size with three color channels (RGB). The learning rate is initialized as and it is decreased using the cosine function . It takes about two days to train our model under the PyTorch framework using two Nvidia GeForce RTX 2080 Ti GPUs.
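A cosine learning-rate decay of the kind mentioned above can be sketched as follows. The initial learning rate and the total number of steps are not legible in this copy, so the values used here are placeholders, and the function name `cosine_lr` is hypothetical.

```python
import math

def cosine_lr(step, total_steps, lr_init, lr_min=0.0):
    """Decay the learning rate from lr_init to lr_min along half a cosine period."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    step = min(max(step, 0), total_steps)  # clamp to the valid range
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_init - lr_min) * cosine

# Placeholder values: 1e-4 and 1000 steps are illustrative, not taken from the paper.
schedule = [cosine_lr(s, 1000, 1e-4) for s in (0, 250, 500, 750, 1000)]
```

The rate falls slowly at first, fastest in the middle of training, and flattens out near the end.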
In this section, we perform ablation studies to discuss the effect of major components in GCP-Net and the setting of some parameters. The Vid4  and REDS4  datasets are used in the experiments.
In GCP-Net, GCP features are used to guide the deep feature extraction and the upsampling process. By removing and adding the GG unit (see Fig. 5) in the GCA block, we can analyze the influence of GCP on deep feature extraction. Fig. 7 shows a patch of a noisy image captured in night time by a smartphone camera, the extracted deep features without and with the GG unit, and the JDD results. As expected, using the GCP features to guide the feature extraction is favorable to suppress the noise and preserve more detailed textures, as shown in Figs. 7 (e) and (f).
To quantitatively verify the contribution of GCP, we implement five variants of GCP-Net with different number of GG units and GCA blocks. The quantitative results on the REDS4 dataset are shown in Table I. We can see that using one GG unit, we can obtain 0.13dB gain over the result without using the GCP guidance. The PSNR value can be further improved by increasing the number of GG units from one to four, and the performance gets saturated when the number of GG units is five. Therefore, we use four GG units and GCA blocks in our GCP-Net. We also train a GCP-Net without using the adaptive upsampling in GCA and the result is shown in Table I. One can see that the adaptive upsampling can obtain about 0.1dB gain for the JDD task.
We also train a network by using the red and blue channels to guide the feature extraction. The results are shown in the last column of Table I. We see that using the red and blue channels to guide feature extraction cannot enhance the performance. Instead, it leads to serious performance degradation (about 0.6dB) compared with the network without using GCP. This is not surprising since the red and blue channels have lower SNR and contain less textures (please see Figs. 2(d)-(g)) so that the network fails to extract more guiding information to enhance the deep features.
There are two types of network structures for JDD: one-stage and two-stage. One-stage algorithms [10, 14] learn to directly estimate the clean demosaicked image, while two-stage algorithms [9, 23, 13, 11, 5] sequentially learn the denoising task and the demosaicking task. In this part, we evaluate which structure is more effective to reconstruct the full color images from a burst of noisy mosaic images with a similar number of trainable parameters.
We train two variants of GCP-Net as the two-stage networks for evaluation, namely, GCP-Net-DE+DM and GCP-Net-DM+DE. GCP-Net-DE+DM first performs burst denoising to obtain the clean mosaic image and then applies demosaicking. The intermediate denoising loss is applied on the estimated , which is obtained by adding one conv layer after the merge function in Eq. 13. The denoised image is then taken as the input for the remaining layers to perform demosaicking. For GCP-Net-DM+DE, it firstly performs demosaicking on every noisy raw image and then performs burst denoising on the demosaicked images. The demosaicking output of the -th frame is a noisy demosaicked image , which is estimated from the output of the last GCA block in the IntraF module. Since there’s no ground-truth for , we pretrain the first stage of GCP-Net-DM+DE for single image demosaicking and then fine-tune the whole network for JDD task.
Table II lists the average PSNR results of the variants of GCP-Net on the Vid4 and REDS4 datasets. We can see that our one-stage GCP-Net consistently outperforms its two-stage variants. This is mainly because denoising and demosaicking are two highly relevant tasks and the two-stage network is not effective to exploit the correlation information of these two tasks. Similar to those multi-task learning works , learning simultaneously the relevant tasks could result in better performance. Thus, we choose to use one-stage structure in GCP-Net.
|Testset||Noise Level||DE + DM||DM + DE||JDD (Ours)|
To demonstrate the contribution of the proposed inter-frame module (see Fig. 6), we implement four variants of GCP-Net, i.e., GCP-Net-w/o-GCP, GCP-Net-w/o-inter, GCP-Net-w/o-MS and GCP-Net-w/o-LSTM. Specifically, GCP-Net-w/o-GCP represents the network without using GCP in the inter-frame module. That is, the offset is directly estimated from the original features, instead of GCP features. In GCP-Net-w/o-inter, we remove the inter-frame module and take the concatenation of the output features of IntraF module as the input of the merge module. GCP-Net-w/o-MS is implemented by removing the multi-scale offset estimation in the interF module, and GCP-Net-w/o-LSTM represents the network without temporal regularization in the offset estimation.
Table III reports the PSNR results of GCP-Net and its four variants with different InterF modules. As expected, the full GCP-Net achieves the best performance, showing that compensating for the shift between frames is crucial for burst image restoration. Estimating the offset from better quality GCP features yields a 0.05dB improvement on JDD-B. Compared with GCP-Net-w/o-MS, utilizing multi-scale information helps to handle large and complex motions, which results in a 0.5 to 0.6dB improvement on the REDS4 dataset. By introducing temporal regularization in the offset estimation part, our full model further improves the JDD-B results by 0.06dB.
|Noise||w/o GCP||w/o Inter||w/o MS||w/o LSTM||full|
In this section, we evaluate the performance of GCP-Net trained with different numbers of frames, denoted as GCP-Net-$N$ with $N \in \{1, 3, 5, 7\}$. The results are listed in Tables IV and V. Compared with the network using a single image as input, the network using three frames as input achieves a performance gain by a great margin (i.e., 1.1 to 1.5dB on Vid4 and 0.3 to 0.4dB on REDS4). By using more frames, GCP-Net-5 further improves on GCP-Net-3 by about 0.5dB at the high noise level and about 0.4dB at the low noise level. By further increasing the frame number from 5 to 7, GCP-Net-7 achieves a slight improvement (0.1dB) on the Vid4 dataset and comparable results on the REDS4 dataset. Considering the computational efficiency and the performance gains, we choose to use 5 frames in our model.
|High noise level||FlexISP||20.89/0.6108||25.61/0.6015||22.41/0.5908||23.73/0.5125||23.16/0.5789|
|Low noise level||FlexISP||22.28/0.7292||28.26/0.7692||24.57/0.7333||27.17/0.6958||25.57/0.7319|
|High noise level||FlexISP||23.44/0.5257||24.09/0.4820||24.37/0.4338||23.75/0.5129||23.91/0.4886|
|Low noise level||FlexISP||25.86/0.6932||27.49/0.6638||27.85/0.6314||26.70/0.6878||26.97/0.6690|
Since there is currently no JDD-B method publicly available, for fair comparison, we combine the representative burst image denoising methods with a state-of-the-art demosaicking method, DemosaicNet (DMN) , to compare with our GCP-Net. We choose four widely used burst denoising algorithms: VBM3D , KPN , EDVR  and RViDeNet . For VBM3D, the noise level is needed as the input, and we use the method in  for noise level estimation. The KPN, EDVR and RViDeNet models are retrained using the same training data as GCP-Net. The retrained EDVR and RViDeNet models adopt 20 residual blocks to perform feature extraction and 40 residual blocks in the reconstruction module. The size of the learned per-pixel kernel of the retrained KPN is 7 × 7.
We also adjust the structures of EDVR and RViDeNet by adding one upsampling operator in the reconstruction step, and train these models for the JDD-B task. These models are denoted as EDVR* and RViDeNet*, respectively. In addition, we compare with two state-of-the-art JDD-S algorithms: FlexISP  and ADMM .
Two video datasets, i.e., Vid4  and REDS4 , are adopted in the experiments. Vid4 is widely used as the test set in the study of video super-resolution, and the video clips (resolution: ) in Vid4 have small motion. The videos in REDS4 dataset have better quality and resolution (720p) but with bigger motions. Video clips are firstly converted to raw space using the unprocessing operator introduced in Sec IV-A. Then the noise is added to the raw images by using Eq. 1. Tables IV and V list the PSNR/SSIM results of different algorithms under different noise levels. Following , both PSNR and SSIM are computed after gamma correction to better reflect perceptual quality. One can see that the proposed GCP-Net achieves the best PSNR/SSIM measures. Visual comparisons on Vid4 and REDS4 are presented in Fig. 8 and Fig. 9, respectively.
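Computing PSNR after gamma correction can be sketched as below. The exponent `gamma=2.2` and the [0, 1] value range are assumptions made for illustration; the exact gamma curve used in the evaluation is not stated in this copy.

```python
import numpy as np

def psnr_after_gamma(pred, target, gamma=2.2):
    """PSNR in dB between two linear images after gamma compression (peak value 1)."""
    pred_g = np.clip(pred, 0.0, 1.0) ** (1.0 / gamma)
    target_g = np.clip(target, 0.0, 1.0) ** (1.0 / gamma)
    mse = float(np.mean((pred_g - target_g) ** 2))
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10(1.0 / mse)

rng = np.random.default_rng(0)
target = rng.uniform(size=(32, 32, 3))
pred = np.clip(target + 0.02 * rng.normal(size=target.shape), 0.0, 1.0)
score = psnr_after_gamma(pred, target)
```

Gamma compression stretches dark values, so errors in dark regions lower this score more than they would lower PSNR computed on linear data.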
From Tables IV and V, we can see that the performance of the JDD-S methods FlexISP and ADMM is generally far below that of the learning based multi-frame algorithms. However, the multi-frame based VBM3D+DMN fails to compete with the single frame based ADMM, especially on the REDS4 dataset. This is mainly because VBM3D+DMN assumes AWGN and cannot handle the large motion in the videos of REDS4. The CNN-based methods KPN+DMN, EDVR+DMN and RViDeNet+DMN achieve much better performance than JDD-S methods because they can learn to handle the misalignments between adjacent frames and exploit the temporal redundancy for denoising. Nonetheless, one can still see that these methods generate noticeable color artifacts and zippers (see Figs. 8(b)(c)(e) and Figs. 9(b)(c)(e)) near edges and complex textures. This is mainly because they perform burst denoising and color demosaicking separately without considering the correlations between the two tasks. For the JDD-B algorithms, i.e., EDVR* and RViDeNet*, the restoration results contain less zippering and fewer color artifacts compared with EDVR+DMN and RViDeNet+DMN, which demonstrates the effectiveness of jointly handling the denoising and demosaicking tasks. However, the results of EDVR* and RViDeNet* suffer from the over-smoothing problem (see Figs. 9(d)(f)). Benefiting from the GCP guidance, our GCP-Net performs the best in the JDD task, making a good balance between noise removal and structure preservation (see Fig. 9(g)).
We also compare different JDD algorithms using real-world burst raw images captured by several smartphone cameras, including iPhone 7, iPhone X and Pixel 2. Since there is no ground-truth for the collected images, we can only provide qualitative comparison. The restored images are converted to sRGB domain for visualization by using the ISP operations, including white balance, color transfer and gamma compression. The white balance parameters, color matrix and the noise level are collected from the camera metadata.
In Fig. 10, we show the JDD results of several noisy images captured by the three smartphone cameras under normal lighting conditions as well as nighttime environments with a wide range of ISO values. Similar conclusions to the synthetic experiments can be made. KPN+DMN, EDVR+DMN and RViDeNet+DMN can remove noise but produce zippering artifacts and smooth the image details. By jointly performing denoising and demosaicking, EDVR* and RViDeNet* can reduce the zippering effect but still produce over-smoothed reconstructions. In the zoom-in patch of the second image in Fig. 10, we can also see that RViDeNet* generates artifacts in the highly noisy area. Our proposed GCP-Net can effectively remove noise while retaining fine textures. It is also free of the moiré pattern. In the last row of Fig. 10, we show the JDD results on night-shot burst images with large motion. We can see that the restoration result of KPN+DMN contains many ghosting artifacts, which are mainly caused by the large motion of an object (i.e., a car) in the burst images. RViDeNet* also generates some motion induced artifacts around the moving objects. The proposed GCP-Net works stably under such large-motion scenes. This is because we utilize the pyramid offset estimation and the offset is estimated on GCP features, which dilutes the impact of noise.
Table VI lists the number of model parameters and the number of floating point operations (FLOPs) of our GCP-Net and the comparison methods, i.e., DMN, KPN, EDVR and RViDeNet. The FLOPs are calculated on 5 frames of size . Because of the GCP branch and some attention modules, GCP-Net has more parameters than DMN, KPN and EDVR, but its model is much smaller than RViDeNet. Though the number of parameters and FLOPs of GCP-Net are 2 times and 1.5 times those of EDVR, it achieves significantly better JDD results than EDVR (3.5dB on Vid4 and 2dB on REDS4). Since RViDeNet utilizes a pre-denoising network and non-local modules, it has very high computational cost but achieves lower JDD performance than GCP-Net. Overall, GCP-Net achieves a good balance between JDD effectiveness and efficiency.
Most of the previous joint denoising and demosaicking (JDD) methods worked on a single color filter array (CFA) image. In this paper, we proposed an effective network, namely GCP-Net, for JDD on real-world burst images (JDD-B). Our method took advantage of the green channel prior (GCP), which refers to the fact that the green channel of CFA raw images usually has higher quality and sampling rate than the red and blue channels. The GCP features were used to guide the intra-frame feature extraction, inter-frame fusion and upsampling process of the multi-frame JDD-B task. Our experiments on synthetic data and real-world data quantitatively and qualitatively demonstrated that GCP-Net achieved superior performance to existing state-of-the-art JDD algorithms. It can remove heavy noise from images captured in low-light conditions and preserve textures and details without generating many visual artifacts.
S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, “Convolutional LSTM network: A machine learning approach for precipitation nowcasting,” in Advances in Neural Information Processing Systems, 2015, pp. 802–810.
K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.