
Low-light Image Restoration with Short- and Long-exposure Raw Pairs

Low-light imaging with handheld mobile devices is a challenging issue. Limited by the existing models and training data, most existing methods cannot be effectively applied in real scenarios. In this paper, we propose a new low-light image restoration method that uses the complementary information of short- and long-exposure images. We first propose a novel data generation method to synthesize realistic short- and long-exposure raw images by simulating the imaging pipeline in a low-light environment. Then, we design a new long-short-exposure fusion network (LSFNet) to deal with the problems of low-light image fusion, including high noise, motion blur, color distortion and misalignment. The proposed LSFNet takes pairs of short- and long-exposure raw images as input and outputs a clear RGB image. Using our data generation method and the proposed LSFNet, we can recover the details and color of the original scene and improve low-light image quality effectively. Experiments demonstrate that our method outperforms state-of-the-art methods.


I Introduction

Imaging in a low-light environment is always challenging, especially on mobile devices. Limited by the quality of their optics and sensors, images captured in low light suffer from high noise levels, poor visibility and color distortion when the exposure time is short. One way to reduce noise and obtain accurate color is to extend the exposure time while reducing the sensor sensitivity. However, due to camera motion of handheld devices or object motion in the scene, the resulting long-exposure images suffer from motion blur. In addition, many highlighted areas may be overexposed, cutting off the dynamic range of the image. Therefore, to obtain high-quality images in dark scenes, image postprocessing is often required.

Most existing methods focus mainly on one aspect, either image denoising [11, 10, 46, 47, 15, 5, 8] or deblurring [42, 31, 30, 29, 36, 45]. Noise removal methods can obtain sharp images while avoiding overexposure. Hence, many low-light enhancement methods capture short-exposure images and then increase the brightness and reduce the noise to produce normal-exposure images. This approach allows a higher dynamic range and avoids motion blur; however, the true colors and missing details of the scene are hard to recover. On the other hand, because the motion of the scene is complex, the blur in the image is not uniform. Motion blur is more difficult to deal with than noise, especially when there are outliers in low-light scenes: the presence of an overexposed area near a light source may mislead blur removal. Deblurring methods cannot handle dark scenes well and often leave blur residuals or other artifacts. Another approach is to combine the advantages of both long- and short-exposure images into one high-quality image. Many previous methods have pursued this goal and made some progress [44, 33, 28, 48]. However, limited by their algorithms or data generation methods, these methods cannot be effectively applied to images taken by real handheld mobile devices.

The existing methods for synthesizing long- or short-exposure images are often straightforward. For short-exposure images, they add Gaussian noise directly in the standard RGB (sRGB) domain to obtain noisy images. For long-exposure images, uniform or nonuniform blur kernels are used to generate blurred images in the sRGB domain or after inverse gamma correction. These methods do not take the formation of noise or blur in the imaging pipeline into account. The images generated by these methods are only approximations of real images and ignore many factors in the imaging process. Recently, many methods have been proposed to generate noisy images by capturing real dark noisy images [2, 8] or by simulating the noise factors in the imaging process [1, 37, 38], and great progress has been made. Limited by the information of a single noisy image, however, the performance remains unsatisfactory. In addition, no method analyzes the generation of long-exposure blurred images from the perspective of a real imaging pipeline.

In this paper, we simulate a real low-light imaging pipeline and propose a novel method to synthesize long/short-exposure images. The camera sensor actually receives raw images, which are then processed by an ISP pipeline to produce visual sRGB images. Therefore, our data generation method first synthesizes raw images with long and short exposures. Then, to avoid the damage to image information caused by ISP operations, we directly use the long- and short-exposure raw images to synthesize a high-quality image.

There are three main challenges in image fusion: misalignment, ghosting, and information fusion. First, since the images are acquired consecutively, the time delay causes displacement between them. Second, areas that are misaligned or inconsistent between images result in ghost artifacts after fusion. Third, the fusion method needs to extract useful information from different images and then improve the visual effects and fidelity. To achieve this goal and avoid the aforementioned limitations, we propose a long-short-exposure fusion network (LSFNet) to fuse long/short-exposure images. LSFNet consists of three dedicated modules that address each of the issues described above. To improve the details, a multiscale structure is used to reconstruct the image from coarse to fine. The experimental results demonstrate that our method is effective at improving low-light image quality.

In conclusion, the main contributions of our work are as follows:

  • Based on a highly accurate imaging model in a low-light environment, we propose a data generation approach that synthesizes long/short-exposure raw images. With the contribution of our dataset, we are able to better deal with real low-light images obtained with mobile devices.

  • We propose a novel long-short exposure fusion network architecture, which takes the long/short-exposed raw images as input. The proposed network can handle misalignment, ghost artifacts and information fusion issues, and then outputs a high-quality RGB image.

  • We compare our method with various low-light enhancement methods on synthetic and real test datasets. The results demonstrate that our model achieves state-of-the-art performance on both synthetic and real images with vivid visual effects.

II Related Works

Previous work has focused mainly on two aspects, deblurring and denoising, to improve image quality. Deblurring and denoising are two classic ill-posed problems and have been studied extensively.

Traditional single-image denoising attempts to model the distribution of natural images or noise, and uses this prior information to recover clear images with optimization algorithms. Common priors include sparsity [3, 40], non-local self-similarity [11, 10, 14] and external statistical priors [34, 50, 41]. Recently, learning-based methods have been proposed to remove noise end to end [46, 47] and to improve performance on more realistic noise [15, 5]. Moreover, great progress has been made in the acquisition of noise data. Many recent works analyze the source of noise from the perspective of image acquisition [1, 37, 38] and remove noise from raw images [8, 6]. Because raw images retain more information than RGB images and have simpler noise distributions, image recovery tasks tend to yield better results on raw images.

Similar to image denoising methods, image deblurring methods can be divided into model-based and learning-based methods. Model-based methods often assume that the image has uniform blur and introduce prior information to suppress noise and ringing artifacts [43, 31, 30]. Learning-based methods use convolutional networks to process blurred images and output sharp images directly. They can deal with nonuniform blur since they do not require explicit estimation of the blur kernel [29, 36, 48]. However, it is difficult for single-image deblurring methods to be robust due to the diversity of motion blur. On the other hand, it is difficult to capture blur-sharp image pairs in real scenes, so the existing blur datasets are synthesized, either by convolving uniform/nonuniform blur kernels [22, 23] or by averaging consecutive short-exposure frames from high-frame-rate videos [29]. These methods all add blur in the sRGB domain without considering the formation of blur in the low-light imaging pipeline, which limits their recovery performance.

In addition to single-image restoration, burst image denoising [16, 13, 27] and deblurring methods [39, 4] have also made much progress. Burst image methods consecutively capture multiple images with the same exposure and fuse them to improve image quality. However, to achieve an ideal signal-to-noise ratio, it is often necessary to acquire many frames, which increases the difficulty of image registration. On the other hand, limited by the constant exposure, complementary information is missing for accurate recovery of high-dynamic-range scenes.

Noisy-blurred image pairs have been used in image deblurring tasks [44, 33, 48]. Using the texture of the noisy image as a constraint, these methods can effectively remove severe blur and suppress ringing artifacts. Short-long-exposure fusion methods [26, 19] were previously used to improve the dynamic range of images, without considering noise and blur in the imaging process. Recently, LSD [28] combined long- and short-exposure images for joint noise and blur removal to improve the quality of low-light images. This approach uses only a channel-augmented U-Net architecture and takes the RGB images after ISP postprocessing as input, and thus cannot obtain satisfactory results on real images captured by mobile devices.

Fig. 1: Intensity maps and histograms for short- and long-exposure images. (a) sRGB images for different exposures. (b) Intensity values of pixels in the red line. (c) Histograms of the raw images.
Fig. 2: The overview of our method.
Fig. 3: The imaging pipeline.

III Background

As mentioned above, imaging in a low-light environment is a challenging issue. Different exposure strategies always have their own drawbacks. A short exposure time with a high ISO value leads to high noise levels and color distortion. A long exposure time with a low ISO results in motion blur. Both exposure strategies may have a cutoff dynamic range limited by the camera sensor. Compounding the problem, these flaws are often not independent of each other.

First, noise is inevitable in the imaging process. Even if we can reduce the noise by increasing the exposure time and lowering the ISO, the noise cannot be completely eliminated, especially in low-light environments. This means that both noise and motion blur exist in the images. Denoising and deblurring tasks are often mutually constrained, and removing blur may amplify noise. Second, motion blur causes overexposed outliers to form sharp edges, which may mislead the deblurring task [9]. Moreover, large areas of overexposure are not conducive to the removal of blur and noise.

The interaction between a high noise level and exposure cutoff is also one of the reasons for color distortion in short-exposure images. The red and blue values are always smaller than the green values in captured raw images. Due to the high noise level, more pixels in the red and blue channels fall below the dark current value and are cut off, resulting in the red and blue channels having higher brightness levels after white balancing.

Noise and blur also limit the expansion of the dynamic range. The common way to extend the dynamic range is multi-exposure fusion. This method extracts the information of normally exposed regions in images with different exposures and then generates a high-dynamic-range image. However, noise and other artifacts are fused as if they were textures of the images, severely degrading the fused image quality. Another way to obtain high-dynamic-range images is to capture short-exposure images that avoid overexposure and then use tone mapping to increase the brightness. However, increasing the brightness amplifies noise at the same time, as shown in Fig. 1. In addition, to cover the high dynamic range of real scenes, most dark regions are strongly compressed. Limited by the quantization range of the sensor, many details in dark regions are compressed into a small bit range. Fig. 1 (c) shows the histograms of raw images for different exposures. Compared to the long-exposure image, the intensity values of the scaled short-exposure image are quantized into a few discrete levels. Moreover, color correction is also a great challenge with short-exposure images.

Therefore, simple denoising or deblurring operations alone cannot effectively improve the quality of low-light images and sometimes even make the results worse. We need to take all factors into consideration and use the complementary information of short and long exposures to improve the image quality.

IV Approach

Long/short-exposure image generation is the key task in learning-based methods. Most existing methods generate blurred images only by convolving with uniform/nonuniform blur kernels and adding homoscedastic Gaussian noise in the sRGB domain. However, this is only an approximation of the actual imaging process. In fact, the blurred and noisy images captured by the camera are processed by the ISP system. Demosaicing operations make pixels no longer independent, and some nonlinear operations, such as gamma correction or tone mapping, also affect the imaging results. Therefore, we generate long- and short-exposure images in the raw domain, which is unaffected by ISP operations and linearly related to the intensity received by the sensor. In practice, long- and short-exposure images can be acquired by the burst mode of mobile devices. Short-exposure inputs have shorter exposure times than long-exposure inputs. Similar to [28], we choose an exposure ratio of 30, which means that the short exposure time is 1/30 of the long exposure time. We scale the short-exposure image to match the brightness before it is input into the network.
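To make the exposure-ratio convention concrete, the following is a minimal sketch (in NumPy; the function and variable names are ours, not the paper's) of brightness-matching a short-exposure raw frame before it enters the network.

```python
import numpy as np

EXPOSURE_RATIO = 30  # long exposure time / short exposure time, as in the paper

def scale_short_exposure(short_raw: np.ndarray, ratio: float = EXPOSURE_RATIO) -> np.ndarray:
    """Brightness-match a linear short-exposure raw frame to its long-exposure pair.

    `short_raw` is assumed to be black-level-corrected and normalized to [0, 1].
    Multiplying by the exposure ratio matches the brightness but also amplifies
    the noise, which is exactly what the fusion network has to cope with.
    """
    return np.clip(short_raw * ratio, 0.0, 1.0)
```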

The proposed LSFNet takes the long- and short-exposure raw images as input and outputs a camera RGB image, which is sharp, noise free and color corrected. Then, the ISP postprocessing transforms the output into a vivid sRGB image. An overview of our method is shown in Fig. 2. In the following sections, we give the details of our data generation method and the architecture of LSFNet.

Fig. 4: The overview of our data generation method.

V Data Generation

The training data include pairs of short/long-exposure images with the corresponding high-quality clear images. The short-exposure images are sharp but noisy and color distorted. In addition, underexposure and quantization artifacts may also exist in dark regions. On the other hand, the long-exposure image is blurred, with areas of overexposure. Moreover, consecutive imaging with handheld devices inevitably introduces inter-frame misalignment. We need to introduce these factors in turn when simulating the imaging process to synthesize data.

The imaging process is shown in Fig. 3. The camera sensor receives the irradiation from the scene and obtains Bayer-pattern raw images through a color filter array. During a long exposure, camera shake overlays the irradiation of different scene points onto the same pixel position of the sensor. Therefore, motion blur is generated in the irradiation domain. Then, the color filter samples the irradiation received by the sensor to form the blurred raw images. At the same time, noise is introduced due to the random fluctuation of photons and the imperfect nature of sensor devices. The blurred and noisy raw data pass through the ISP pipeline in the camera to obtain blurred sRGB images, constituting the final output of the camera. Therefore, to synthesize more realistic blurred and noisy data, we first need to obtain the irradiation-domain images received by the camera sensor.

However, what we can obtain directly from the camera are only sampled Bayer-pattern raw images or the sRGB images processed by the ISP pipeline. To solve this problem, we first collect high-quality raw images without noise and blur, trying to avoid large overexposed and underexposed areas. Then, we split each high-quality Bayer-pattern raw image into its color channels and use the maximum entropy downsampling operation proposed by [20] to align the color channels, so that each pixel has three color values. In this way, we can synthesize the irradiation images received by the sensor. We subsequently add blur and noise to the irradiation images to obtain the long- and short-exposure raw data. An overview of our data generation method is shown in Fig. 4.
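As an illustration of this step, the sketch below builds a rough irradiation image from a clean RGGB Bayer raw. It replaces the learned maximum-entropy downsampling of [20] with plain 2x2 block averaging, so it is only a simplified stand-in, not the paper's procedure.

```python
import numpy as np

def bayer_to_irradiation(raw: np.ndarray) -> np.ndarray:
    """Build a half-resolution, three-channel linear 'irradiation' image from a
    clean RGGB Bayer raw (H x W, values in [0, 1]).

    Simplified stand-in for the maximum-entropy downsampling of [20]: each
    2x2 Bayer block is collapsed to one RGB pixel, averaging the two greens.
    """
    r  = raw[0::2, 0::2]
    g1 = raw[0::2, 1::2]
    g2 = raw[1::2, 0::2]
    b  = raw[1::2, 1::2]
    return np.stack([r, (g1 + g2) / 2.0, b], axis=-1)
```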

V-A Overexposure outliers

Real low-light scenes may include light sources that lead to some overexposed areas. Especially in the light source areas, even short-exposure images are partially overexposed. Therefore, to simulate different degrees of overexposure, we allow a small number of light sources to exist in the irradiation images and exposure cutoffs to exist at the light source points. Then, we increase the overall brightness of the irradiation images by multiplying them by a scale factor uniformly sampled from the range [1.3, 3]. For the ground truth images, we can directly clip the scaled irradiation images to the range [0, 1] to obtain the long-exposure sharp images. For the long- and short-exposure inputs, we clip to this range after adding motion blur and noise.

To simulate the light source regions, when producing short-exposure images we keep the brightness of the cutoff regions unchanged and reduce the brightness of the other regions by dividing by the exposure ratio, which is set to 30, similar to the method in [6]. This allows areas of the light source that are cut off in the original irradiation images to remain cut off in the short-exposure images, while the scaled highlight areas in long-exposure images are normally exposed.
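A hedged sketch of this over-exposure simulation follows; the function and variable names are illustrative, and the light-source handling is simplified to a saturation mask.

```python
import numpy as np

def simulate_overexposure(irradiance, ratio=30.0, rng=None):
    """Sketch of the over-exposure handling of Sec. V-A (names are ours).

    `irradiance` is the clean linear image in [0, 1].  The whole scene is
    brightened by a factor in [1.3, 3]; clipping the result gives the
    long-exposure ground truth.  For the short-exposure input, already
    saturated (light-source) pixels keep their value while the rest is
    divided by the exposure ratio.
    """
    rng = rng or np.random.default_rng()
    scale = rng.uniform(1.3, 3.0)
    scaled = irradiance * scale
    gt_long = np.clip(scaled, 0.0, 1.0)

    cutoff = scaled >= 1.0                        # saturated light-source pixels
    short = np.where(cutoff, 1.0, scaled / ratio)
    return scaled, gt_long, short
```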

V-B Synthesis of long-exposure raw images

Motion blur is the main cause of long-exposure image degradation. As mentioned above, motion blur should be added to the scaled irradiation images. We focus on blur caused by camera motion, which can be described with an imaging model [35, 18]. To generate more realistic motion blur, we obtain camera motion information by recording the built-in gyroscope data. Then, an optical flow map is generated according to the imaging model, which indicates how each pixel moves at each moment. We can thus generate the irradiation images captured by the sensor at each moment. By averaging this moving image series over an exposure time, we obtain the long-exposure blurred image.

Fig. 5: Long-exposure blurred image synthesis method.

We assume that short- and long-exposure images are shot consecutively, and the scene of the ground truth image should be consistent with the short-exposure image. During generation, each pixel continuously moves from its position in the original irradiation image. In addition to the camera motion during the long exposure, the motion in the time interval between the two shots also causes spatial misalignment. To simulate the misalignment between the two images, we discard the first few frames of the optical flow map and only accumulate the following frames, as shown in Fig. 5.

After adding blur to the scaled irradiation image, we sample the blurred irradiation image with a Bayer pattern, add noise and clip the intensity to the range [0, 1] to obtain the blurred raw image. The whole process of generating long-exposure raw images can be described as follows:

$y_l = f_{\mathrm{clip}}\big(f_{\mathrm{bayer}}(f_{\mathrm{blur}}(I)) + n\big)$   (1)

where $y_l$ is the synthetic long-exposure blurred raw image; $I$ is the scaled irradiation image; $f_{\mathrm{clip}}$, $f_{\mathrm{bayer}}$ and $f_{\mathrm{blur}}$ are the clipping function, Bayer sampling function and blur function, respectively; and $n$ is the noise, which is discussed below.
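The following sketch mirrors Eq. (1) under our notational assumptions: the blur is realized by averaging frames that have already been warped along the gyroscope-derived optical flow, followed by Bayer sampling, noise and clipping. The RGGB layout and helper names are ours.

```python
import numpy as np

def bayer_sample(rgb):
    """Sample a 3-channel linear image (H x W x 3) into an RGGB Bayer mosaic."""
    h, w, _ = rgb.shape
    raw = np.zeros((h, w), dtype=rgb.dtype)
    raw[0::2, 0::2] = rgb[0::2, 0::2, 0]   # R
    raw[0::2, 1::2] = rgb[0::2, 1::2, 1]   # G
    raw[1::2, 0::2] = rgb[1::2, 0::2, 1]   # G
    raw[1::2, 1::2] = rgb[1::2, 1::2, 2]   # B
    return raw

def synthesize_long_exposure(warped_frames, noise_fn):
    """Illustrative version of Eq. (1): `warped_frames` is the series of
    irradiation images already warped along the gyroscope-derived optical
    flow; `noise_fn` adds the heteroscedastic noise of Sec. V-C."""
    blurred = np.mean(warped_frames, axis=0)      # temporal average -> motion blur
    raw = bayer_sample(blurred)                   # mosaic
    return np.clip(noise_fn(raw), 0.0, 1.0)       # noisy, clipped raw
```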

V-C Realistic noise

Noise is inevitable in the imaging process, especially in low-light environments. In general, there are two main types of noise in raw images: shot noise and read noise. Shot noise arises from individual photon detection events in the sensor and follows a Poisson process whose variance equals the signal level. Read noise is caused by a combination of sensor readout effects and has an approximately Gaussian distribution. The overall noise can be described by a heteroscedastic Gaussian distribution:

$n(x) \sim \mathcal{N}\!\left(0,\; \sigma_s^2\, x + \sigma_r^2\right)$   (2)

where $x$ is the true signal level and the noise parameters $\sigma_s^2$ and $\sigma_r^2$ are proportional to the ISO value. Since the exposure time of short-exposure images is shorter than that of long-exposure images, the ISO values of short-exposure images must be increased, or the image values scaled, to match the brightness, which magnifies the noise variance of short-exposure images at the same time. In our method, the exposure ratio is 30, so the noise variance of short-exposure images is 30 times that of long-exposure images.
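A minimal sampler for the noise model of Eq. (2) is given below; the parameter names `sigma_s2` and `sigma_r2` are illustrative and would in practice be calibrated per ISO setting.

```python
import numpy as np

def add_heteroscedastic_noise(signal, sigma_s2, sigma_r2, rng=None):
    """Sample noise from Eq. (2): variance = sigma_s2 * signal + sigma_r2.

    `signal` is the clean linear raw image in [0, 1]; `sigma_s2` (shot) and
    `sigma_r2` (read) are illustrative per-ISO parameters.  For the
    short-exposure branch the variance is effectively 30x larger because the
    frame is later scaled by the exposure ratio.
    """
    rng = rng or np.random.default_rng()
    std = np.sqrt(np.clip(sigma_s2 * signal + sigma_r2, 0.0, None))
    return signal + rng.normal(0.0, 1.0, signal.shape) * std
```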

V-D Color distortion

The captured short-exposure image in a low-light environment often exhibits color distortion relative to the long-exposure image, especially when the exposure ratio is large. We assume that the ambient lighting is constant for each shot. In the raw images recorded by the sensor, the green signal is higher than the red and blue signals, so estimated white balance coefficients are used in the ISP to compensate for the red and blue signals and restore the real scene color. Therefore, when the estimation of the white balance coefficients is inaccurate, the image color will be biased.

Another important cause of color distortion is the cutoff effect. Since the red and blue signals are smaller, underexposure cutoff is more likely to occur. When the noise variance of the image is large, the cutoff effect is more obvious. The cutoff effect causes the red and blue signal levels to be raised, thus shifting the underexposed images to purple.

Fig. 6: A typical example of color distortion. The low-light images are chosen from the SID dataset.
Fig. 7: The architecture of our LSFNet.

A typical example of color distortion is shown in Fig. 6. We choose two pairs of short/long-exposure images from the SID dataset [8], which are captured in a real low-light environment. When the noise level is relatively low, the short-exposure image is greenish. However, when the noise level is high, the short-exposure image tends to be purple due to the cutoff effect. Hence, to simulate this phenomenon, we multiply the red and blue channels of the irradiation image by independent random coefficients, each uniformly sampled from the range [0.7, 0.9]. After adding noise, we clip the image brightness and then perform white balancing on the clipped image.

V-E Synthesis of short-exposure raw images

As mentioned above, we first reduce the brightness of the scaled irradiation image by the exposure ratio to obtain the clean short-exposure image. Then, we introduce color distortion and noise. After Bayer sampling, we clip the brightness range to [0, 1] to obtain the short-exposure raw image. The whole process of generating short-exposure raw images can be described as follows:

$y_s = f_{\mathrm{clip}}\big(f_{\mathrm{bayer}}(f_{\mathrm{cd}}(I / r)) + n\big)$   (3)

where $y_s$ is the synthetic short-exposure raw image, $r$ is the exposure ratio and $f_{\mathrm{cd}}$ is the color distortion function. Before being input into the network, the short-exposure image is scaled to match the brightness of the ground truth images.
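Putting the pieces together, a sketch of Eq. (3) might look as follows; it reuses the `bayer_sample` helper from the Eq. (1) sketch and a `noise_fn` implementing Eq. (2), and it omits the handling of saturated light-source pixels from Sec. V-A for brevity.

```python
import numpy as np

def synthesize_short_exposure(irradiance, noise_fn, ratio=30.0, rng=None):
    """Illustrative version of Eq. (3).  `irradiance` is the scaled linear
    image (H x W x 3) and `noise_fn` implements Eq. (2); function and
    variable names are ours, not the paper's."""
    rng = rng or np.random.default_rng()
    dim = irradiance / ratio                   # brightness of a 1/30 exposure
    gains = rng.uniform(0.7, 0.9, size=2)      # color distortion on R and B
    dim[..., 0] *= gains[0]
    dim[..., 2] *= gains[1]
    raw = bayer_sample(dim)                    # reuse bayer_sample() from the Eq. (1) sketch
    return np.clip(noise_fn(raw), 0.0, 1.0)
```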

VI Model

The network takes the long- and short-exposure raw images as input and makes use of their complementary advantages to improve image quality. The output is a three-channel image in the camera RGB domain. Therefore, we accomplish both fusion and demosaicing in the network. Then, we feed the output of the network into the ISP postprocessing to obtain a clear sRGB image. The same postprocessing operation is applied to the clipped and scaled irradiation image to generate the ground truth. We compute the loss in the sRGB domain to take into account the impact of ISP postprocessing.

VI-A Network

Our network structure consists of four main parts: a feature extraction module, an alignment module, a deghosting module and a fusion module. The feature extraction module extracts useful features from the long- and short-exposure images, respectively. The alignment module aligns the input features. We assume that the short-exposure image is spatially consistent with the ground truth image, so we align the long-exposure features to the short-exposure features. After alignment, there may still be some misaligned areas, or moving objects in the scene may cause ghost artifacts. Therefore, we introduce a deghosting module to further suppress these artifacts. Subsequently, we fuse the aligned features to reconstruct the output. To improve the image quality, we use a multiscale structure to reconstruct the clear image from coarse to fine.

The architecture of our LSFNet is shown in Fig. 7. We first use a convolution layer and a Resblock [17] to extract features from the inputs. The alignment blocks adopt deformable convolutions [12, 49]: they first compute offset maps from both the short- and long-exposure inputs and then use them to align the features of the long-exposure input. The deghosting blocks concatenate the two inputs and calculate a weight map, which reweights the long-exposure features and suppresses the inconsistent areas. Then we concatenate all these features and input them into the fusion module, which is constructed from multiscale Resblocks. The coarse-scale features and offset maps are transferred to a larger scale by upsampling to assist in feature reconstruction. At the end of the finest scale, we use a convolutional layer to recombine the features and a pixel-shuffle layer to reconstruct the three-channel output.
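As an illustration of the deghosting idea (not the paper's exact layer configuration), a weight-map block in PyTorch could look like this:

```python
import torch
import torch.nn as nn

class DeghostBlock(nn.Module):
    """Sketch of the deghosting block described above; the layer sizes are ours.

    Concatenate short- and long-exposure features, predict a per-pixel weight
    map, and use it to suppress long-exposure features that are inconsistent
    with the short exposure.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Sigmoid(),                 # weights in (0, 1)
        )

    def forward(self, feat_short: torch.Tensor, feat_long: torch.Tensor) -> torch.Tensor:
        w = self.weight_net(torch.cat([feat_short, feat_long], dim=1))
        return feat_long * w              # reweighted long-exposure features
```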

We construct a four-scale network architecture with 32, 64, 128 and 256 channels. LeakyReLU [24] with a slope of 0.2 is used as the activation function. Convolutional layers with a stride of 2 are used for downsampling, and transposed convolutional layers are used for upsampling. The upsampling operation in offset transmission is bilinear interpolation.

VI-B ISP postprocessing

We perform white balancing and demosaicing before obtaining the output of the LSFNet. The output of the LSFNet lies in the camera RGB space, the same space as the irradiation image. To obtain a visually pleasing sRGB image for display, we need ISP postprocessing, including color space conversion and gamma correction. The color space conversion transforms the image into the sRGB color space with a color correction matrix. Then, gamma correction enhances the details in dark regions, making the image more consistent with human visual perception. We use the standard gamma curve:

$\Gamma(x) = \max(x, \epsilon)^{1/2.2}$   (4)

where $\epsilon$ is a small constant used to prevent numerical instability.
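A compact sketch of this postprocessing, assuming the color correction matrix `ccm` is known (e.g., from the camera metadata) and using the gamma curve of Eq. (4):

```python
import numpy as np

def isp_postprocess(cam_rgb, ccm, eps=1e-8):
    """Minimal sketch of the ISP postprocessing in Sec. VI-B: a 3x3 color
    correction matrix maps camera RGB to sRGB primaries, then the gamma
    curve of Eq. (4) brightens dark regions.  `ccm` is device-specific and
    assumed to be known; `eps` stands in for the stability constant."""
    srgb_linear = cam_rgb @ ccm.T                        # color space conversion
    srgb_linear = np.clip(srgb_linear, 0.0, 1.0)
    return np.maximum(srgb_linear, eps) ** (1.0 / 2.2)   # gamma correction
```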

Fig. 8: Ablation study of the proposed network on the synthetic dataset.
Fig. 9: Results of the Grass image from the synthetic dataset.
Fig. 10: Results of the Chair image from the synthetic dataset.

VI-C HDR compression

As mentioned above, the proposed network takes short- and long-exposure raw pairs as input and outputs a sharp long-exposure image that is noise free and has no motion blur. However, some overexposed areas still exist in the long-exposure image, thus limiting the dynamic range. Therefore, to maintain the details in highlighted areas, we retrain the network with dynamic range compression. The data generation method and the network architecture are the same as described above. The only difference is the input of the network: we reduce the brightness of the long- and short-exposure images to match the brightness of the original irradiation image before they are input into the network. During ISP postprocessing, we replace the gamma correction with the μ-law [19] to compress the dynamic range, which is described as follows:

$T(H) = \dfrac{\log(1 + \mu H)}{\log(1 + \mu)}$   (5)

where $H$ is the HDR input image in the linear domain, $T(H)$ is the tone-mapped output image and $\mu$ is a parameter that controls the amount of compression. We set $\mu$ to 100 in our experiments. For the ground truth, we use the original irradiation image with the same ISP postprocessing, which is not brightness-scaled and maintains normal exposure. In other words, we perform noise and motion blur removal, color correction, demosaicing and HDR compression within the proposed network, without other exposure fusion methods that would incur additional computational overhead.
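The μ-law of Eq. (5) in code form (a direct transcription, with μ = 100 as in the paper):

```python
import numpy as np

def mu_law_compress(hdr, mu=100.0):
    """Mu-law range compression of Eq. (5), used in place of gamma correction
    when the network is retrained for HDR output (Sec. VI-C)."""
    return np.log1p(mu * hdr) / np.log1p(mu)
```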

Fig. 11: Results of the Lamp image in a real scene.
Fig. 12: Results of the Street image in a real scene.

VII Experiments

VII-A Implementation details

We manually select 1322 high-quality raw images from the MIT-Adobe 5K dataset [7] to generate our training data. The selected images have no obvious noise or motion blur. For the synthetic test data, we adopt 70 long-exposure raw images from the SID test dataset [8]. We then synthesize the training and test datasets using the method described in Section V. For the real images, we modify the burst mode of a mobile phone camera so that it consecutively captures a pair of short- and long-exposure raw images. The long exposure time is set to 30 times the short exposure time, and the ISO values of both are set the same.

We train the network with an $\ell_1$ loss computed in the sRGB domain:

$\mathcal{L}(\theta) = \big\| f_{\mathrm{ISP}}\big( f_{\theta}(y_s, y_l) \big) - f_{\mathrm{ISP}}\big( I_{gt} \big) \big\|_1$   (6)

where $\theta$ denotes the learned parameters of the network, $f_{\theta}$ is the LSFNet, $f_{\mathrm{ISP}}$ denotes the ISP postprocessing function, $y_s$ and $y_l$ are the short- and long-exposure raw inputs, and $I_{gt}$ is the ground truth.
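Sketched under the assumption of an L1 penalty, the loss of Eq. (6) can be written as:

```python
import torch

def srgb_l1_loss(pred_cam_rgb, gt_cam_rgb, isp_fn):
    """Loss of Eq. (6), sketched as an L1 penalty: both the network output and
    the camera-RGB ground truth are pushed through the (differentiable) ISP
    postprocessing `isp_fn` before comparison, so that errors are weighted the
    way they will look after gamma correction or tone mapping."""
    return torch.mean(torch.abs(isp_fn(pred_cam_rgb) - isp_fn(gt_cam_rgb)))
```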

In each training batch, we crop the raw images into patches and pack them into four R-G-G-B channels; we use 16 patches per batch as inputs. We train our model with the ADAM optimizer [21], and the learning rate is halved after 100 epochs. Our model is implemented in the PyTorch framework [32] on an NVIDIA GeForce 1080 Ti GPU.
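For reference, packing a Bayer raw into the four R-G-G-B input channels can be done with a standard space-to-depth operation; the tensor layout below is our assumption.

```python
import torch

def pack_rggb(raw: torch.Tensor) -> torch.Tensor:
    """Pack a Bayer raw tensor (N x 1 x H x W, RGGB pattern) into four
    half-resolution channels (N x 4 x H/2 x W/2), the input layout used
    by the network."""
    r  = raw[:, :, 0::2, 0::2]
    g1 = raw[:, :, 0::2, 1::2]
    g2 = raw[:, :, 1::2, 0::2]
    b  = raw[:, :, 1::2, 1::2]
    return torch.cat([r, g1, g2, b], dim=1)
```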

We compare several methods for image restoration, including denoising, deblurring and multi-image fusion methods. For the denoising methods, SID and RIDNet are chosen, which are representative low-light denoising methods in the raw and sRGB domains, respectively. The compared deblurring methods include SRN and DMPHN. LSD is chosen to represent the fusion of long and short exposures. Since the code for LSD is not publicly available, we reimplement it and train it on our synthetic training data.

VII-B Ablation study

Since the proposed network contains multiple modules, we perform an ablation study on our synthetic test dataset to demonstrate their effectiveness. The quantitative results are provided in Table I, and the visual results are shown in Fig. 8. Without the alignment modules, the network cannot effectively align the long- and short-exposure features, which results in many edges being smeared. The deghosting modules complement the alignment modules, and their introduction further improves the details. Moreover, we test a variant that, instead of taking raw images as input directly, takes the sRGB images obtained after ISP pipeline processing. The raw images are first white-balanced and demosaiced using the method proposed by [25]; the demosaiced images are then processed by the ISP postprocessing described above. The ISP pipeline makes the noise distribution more complicated and compresses the image details. The restored images exhibit substantial loss of texture and oversmoothing artifacts, which also results in a marked decline in the quantitative indicators.

Method w/o align w/o deghost w/o raw input full model
PSNR 30.05 30.56 30.10 30.70
SSIM 0.8735 0.8804 0.8725 0.8823
TABLE I: Ablation study of different components.

VII-C Comparison on synthetic images

Method PSNR SSIM
RIDNet 25.75 0.8049
SRN 24.65 0.7829
DMPHN 22.13 0.7621
SID 26.33 0.8345
LSD 29.96 0.8697
LSFNet (ours) 30.70 0.8823
TABLE II: Quantitative comparison on the synthetic test dataset. The outputs of compared methods are adjusted to match the ground truth for fairness.

The quantitative results for the synthetic images are listed in Table II. For a fair comparison, we align the deblurring results of SRN and DMPHN with the ground truth. We also adjust the colors of the RIDNet outputs to match the ground truth. In the quantitative comparison, our method's PSNR and SSIM values exceed those of all the other methods.

Fig. 9 and Fig. 10 show some examples from our synthetic dataset. We increase the brightness of the short-exposure images for better display. RIDNet can handle real noise in sRGB images. However, when the noise of the short-exposure image is high, many textures are smeared and high-frequency information is lost. In addition, RIDNet does not consider color distortion when denoising and cannot correct the color of short-exposure images. SID removes noise from raw images and considers the whole ISP pipeline in an end-to-end network. However, limited by the information in short-exposure images, SID cannot restore rich details. Without long-exposure images as references, SID often makes mistakes in color correction and fails to generalize to different imaging devices.

SRN can deal with motion blur when the scene is relatively simple. However, many artifacts appear in the results. Some blurred edges are sharpened into multiple edges, such as the leaves in Fig. 9, or lead to ghosting as shown in Fig. 10. Motion blur in some areas cannot be removed, and many textures are smoothed. The presence of noise also worsens the deblurring results and leads to artifacts. Similar to our method, LSD utilizes the complementary information of long and short exposures, removing noise and blur simultaneously and restoring true color. However, limited by its network and ISP postprocessing, some details are smoothed. In contrast, our method can restore vivid textures and sharp edges without other artifacts.

VII-D Comparison on real images

We capture some real low-light scenes and evaluate our method on these real images. Figs. 11 and 12 show examples. Without a long exposure as a reference, the denoising methods cannot recover the color correctly. The overexposed areas are enlarged in the blurred images, and deblurring methods such as SRN cannot restore the cutoff details. LSD cannot remove the noise in textured areas such as grass and leaves. Moreover, ghosts may appear in highlighted areas due to misalignment, as shown in Fig. 11. In contrast to the other methods, our method obtains rich textures. We also show the results of our method with HDR compression: the overexposed areas in the long-exposure images are restored, and high contrast is maintained in other areas.

Fig. 13: Comparison with exposure fusion on a real scene.

VII-E Comparison with exposure fusion

Other exposure fusion methods improve the dynamic range by fusing multi-exposure images only. They assume that the input images are of high quality, with only the dynamic range cut off. When the input images contain noise or other artifacts, these may be magnified along with the textures in the outputs. As shown in Fig. 13, we fuse the restored long-exposure image output by LSD and the short-exposure image using [26]. The fusion result has a high noise level. Although the short-exposure image is denoised using RIDNet, some textures are lost, and the fusion result still has low contrast and poor details. In contrast, our LSFNet can output a sharp tone-mapped result directly, without additional exposure fusion.

Fig. 14: An extreme imaging case in a real scene.

VII-F Computational overhead

Method DMPHN SID RIDNet LSD LSFNet (ours)
Params 21.7M 7.8M 1.5M 31.0M 8.4M
FLOPs 234.5G 13.8G 98.1G 54.8G 38.0G
Time (s) 0.128 0.005 0.039 0.011 0.019
PSNR (dB) 22.13 26.33 25.75 29.96 30.70
TABLE III: Parameter number and time comparisons on 256×256 input patches.

We test different methods on 256×256 input patches to compare the computational overhead. All the methods are implemented in PyTorch. We report the number of floating point operations (FLOPs) since the running time may depend on the test platform and code. As shown in Table III, our method achieves the best performance with moderate computational overhead.
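A rough way to reproduce the parameter counts and timings of Table III is sketched below; FLOP counting is omitted since it would need an external profiler, and `measure_overhead` with its arguments is illustrative rather than the paper's benchmarking code.

```python
import time
import torch

def measure_overhead(model: torch.nn.Module, input_shapes, device="cuda", runs=50):
    """Count parameters and time an average forward pass on fixed-size inputs."""
    params = sum(p.numel() for p in model.parameters())
    model = model.to(device).eval()
    inputs = [torch.randn(1, *s, device=device) for s in input_shapes]
    with torch.no_grad():
        for _ in range(5):                       # warm-up iterations
            model(*inputs)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(runs):
            model(*inputs)
        if device == "cuda":
            torch.cuda.synchronize()
    return params, (time.time() - start) / runs

# e.g. for a two-input fusion model fed two 4-channel 128x128 packed raw tensors
# (i.e. a 256x256 Bayer patch per exposure):
# params, sec = measure_overhead(model, [(4, 128, 128), (4, 128, 128)])
```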

VIII Discussion and Limitations

Although our method can produce high-quality images in real scenes, it has several limitations that may inspire future work.

Our method exploits the complementary information of short- and long-exposure images. However, this complementary information is missing when the quality of the captured images is poor. On the one hand, when short-exposure images are too noisy to distinguish details, useful information comes from the long-exposure image only, and the model degenerates into a single-image deblurring network. On the other hand, when long-exposure images are extremely blurred, the quality of the outputs also deteriorates; however, as long as the noise in the short-exposure image is not severe, the correct color and textures can still be recovered. In some harsh imaging environments, such as extremely low-light conditions, the captured image pairs have high noise and large motion blur, which makes it difficult to output satisfactory results.

As shown in Fig. 14, both a high noise level and large motion blur exist in the input image pair. The proposed method fails to recover a sharp result. Under such conditions, other methods also have difficulty achieving good results. More input images may be needed to improve the image quality.

IX Conclusion

In this paper, we propose a novel low-light restoration method that fuses a noisy short-exposure image and a blurred long-exposure image into a high-quality sharp image. By simulating the imaging process in a low-light environment, a new data generation method is proposed to synthesize more realistic raw images with a variety of exposures. Moreover, we design a new network that takes the short- and long-exposure raw images and outputs an RGB image without noise, blur or color distortion. We compare our method with various low-light image enhancement methods and demonstrate that it achieves state-of-the-art performance.

Acknowledgments

This work was supported by Basic Research on Civil Aerospace No.D040301.

References

  • [1] A. Abdelhamed, M. A. Brubaker, and M. S. Brown (2019) Noise flow: noise modeling with conditional normalizing flows. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3165–3173. Cited by: §I, §II.
  • [2] A. Abdelhamed, S. Lin, and M. S. Brown (2018) A high-quality denoising dataset for smartphone cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1692–1700. Cited by: §I.
  • [3] M. Aharon, M. Elad, and A. Bruckstein (2006) K-svd: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on signal processing 54 (11), pp. 4311–4322. Cited by: §II.
  • [4] M. Aittala and F. Durand (2018) Burst image deblurring using permutation invariant convolutional neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 731–747. Cited by: §II.
  • [5] S. Anwar and N. Barnes (2019) Real image denoising with feature attention. arXiv preprint arXiv:1904.07396. Cited by: §I, §II.
  • [6] T. Brooks, B. Mildenhall, T. Xue, J. Chen, D. Sharlet, and J. T. Barron (2019) Unprocessing images for learned raw denoising. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11036–11045. Cited by: §II, §V-A.
  • [7] V. Bychkovsky, S. Paris, E. Chan, and F. Durand (2011) Learning photographic global tonal adjustment with a database of input / output image pairs. In The Twenty-Fourth IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §VII-A.
  • [8] C. Chen, Q. Chen, J. Xu, and V. Koltun (2018) Learning to see in the dark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3291–3300. Cited by: §I, §I, §II, §V-D, §VII-A.
  • [9] S. Cho, J. Wang, and S. Lee (2011) Handling outliers in non-blind image deconvolution. In 2011 International Conference on Computer Vision, pp. 495–502. Cited by: §III.
  • [10] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian (2007) Color image denoising via sparse 3d collaborative filtering with grouping constraint in luminance-chrominance space. In 2007 IEEE International Conference on Image Processing, Vol. 1, pp. I–313. Cited by: §I, §II.
  • [11] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian (2007) Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE Transactions on image processing 16 (8), pp. 2080–2095. Cited by: §I, §II.
  • [12] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 764–773. Cited by: §VI-A.
  • [13] C. Godard, K. Matzen, and M. Uyttendaele (2017) Deep burst denoising. arXiv preprint arXiv:1712.05790. Cited by: §II.
  • [14] S. Gu, L. Zhang, W. Zuo, and X. Feng (2014) Weighted nuclear norm minimization with application to image denoising. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2862–2869. Cited by: §II.
  • [15] S. Guo, Z. Yan, K. Zhang, W. Zuo, and L. Zhang (2019) Toward convolutional blind denoising of real photographs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1712–1722. Cited by: §I, §II.
  • [16] S. W. Hasinoff, D. Sharlet, R. Geiss, A. Adams, J. T. Barron, F. Kainz, J. Chen, and M. Levoy (2016) Burst photography for high dynamic range and low-light imaging on mobile cameras. ACM Transactions on Graphics (TOG) 35 (6), pp. 1–12. Cited by: §II.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §VI-A.
  • [18] S. Hee Park and M. Levoy (2014) Gyro-based multi-image deconvolution for removing handshake blur. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3366–3373. Cited by: §V-B.
  • [19] N. K. Kalantari and R. Ramamoorthi (2017) Deep high dynamic range imaging of dynamic scenes.. ACM Trans. Graph. 36 (4), pp. 144–1. Cited by: §II, §VI-C.
  • [20] D. Khashabi, S. Nowozin, J. Jancsary, and A. W. Fitzgibbon (2014) Joint demosaicing and denoising via learned nonparametric random fields. IEEE Transactions on Image Processing 23 (12), pp. 4968–4981. Cited by: §V.
  • [21] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §VII-A.
  • [22] R. Köhler, M. Hirsch, B. Mohler, B. Schölkopf, and S. Harmeling (2012) Recording and playback of camera shake: benchmarking blind deconvolution with a real-world database. In European conference on computer vision, pp. 27–40. Cited by: §II.
  • [23] W. Lai, J. Huang, Z. Hu, N. Ahuja, and M. Yang (2016) A comparative study for single image blind deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1709. Cited by: §II.
  • [24] A. L. Maas, A. Y. Hannun, and A. Y. Ng (2013) Rectifier nonlinearities improve neural network acoustic models. In Proc. icml, Vol. 30, pp. 3. Cited by: §VI-A.
  • [25] H. S. Malvar, L. He, and R. Cutler (2004) High-quality linear interpolation for demosaicing of bayer-patterned color images. In 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 3, pp. iii–485. Cited by: §VII-B.
  • [26] T. Mertens, J. Kautz, and F. V. Reeth (2007) Exposure fusion. In Computer Graphics and Applications, 2007. PG ’07. 15th Pacific Conference on, Cited by: §II, §VII-E.
  • [27] B. Mildenhall, J. T. Barron, J. Chen, D. Sharlet, R. Ng, and R. Carroll (2018) Burst denoising with kernel prediction networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2502–2510. Cited by: §II.
  • [28] J. Mustaniemi, J. Kannala, J. Matas, S. Särkkä, and J. Heikkilä (2018) LSD -joint denoising and deblurring of short and long exposure images with convolutional neural networks. arXiv preprint arXiv:1811.09485. Cited by: §I, §II, §IV.
  • [29] S. Nah, T. Hyun Kim, and K. Mu Lee (2017) Deep multi-scale convolutional neural network for dynamic scene deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3883–3891. Cited by: §I, §II.
  • [30] J. Pan, Z. Lin, Z. Su, and M. Yang (2016) Robust kernel estimation with outliers handling for image deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2800–2808. Cited by: §I, §II.
  • [31] J. Pan, D. Sun, H. Pfister, and M. Yang (2016) Blind image deblurring using dark channel prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1628–1636. Cited by: §I, §II.
  • [32] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035. Cited by: §VII-A.
  • [33] V. Rengarajan, S. Zhao, R. Zhen, J. Glotzbach, H. Sheikh, and A. C. Sankaranarayanan (2019) Photosequencing of motion blur using short and long exposures. arXiv preprint arXiv:1912.06102. Cited by: §I, §II.
  • [34] S. Roth and M. J. Black (2009) Fields of experts. International Journal of Computer Vision 82 (2), pp. 205. Cited by: §II.
  • [35] O. Sindelar and F. Sroubek (2013) Image deblurring in smartphone devices using built-in inertial measurement sensors. Journal of Electronic Imaging 22 (1), pp. 011003. Cited by: §V-B.
  • [36] X. Tao, H. Gao, X. Shen, J. Wang, and J. Jia (2018) Scale-recurrent network for deep image deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8174–8182. Cited by: §I, §II.
  • [37] W. Wang, X. Chen, C. Yang, X. Li, X. Hu, and T. Yue (2019) Enhancing low light videos by exploring high sensitivity camera noise. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4111–4119. Cited by: §I, §II.
  • [38] K. Wei, Y. Fu, J. Yang, and H. Huang (2020) A physics-based noise formation model for extreme low-light raw denoising. arXiv preprint arXiv:2003.12751. Cited by: §I, §II.
  • [39] P. Wieschollek, M. Hirsch, B. Scholkopf, and H. Lensch (2017) Learning blind motion deblurring. In Proceedings of the IEEE International Conference on Computer Vision, pp. 231–240. Cited by: §II.
  • [40] J. Xu, L. Zhang, and D. Zhang (2018) A trilateral weighted sparse coding scheme for real-world image denoising. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 20–36. Cited by: §II.
  • [41] J. Xu, L. Zhang, and D. Zhang (2018) External prior guided internal prior learning for real-world noisy image denoising. IEEE Transactions on Image Processing 27 (6), pp. 2996–3010. Cited by: §II.
  • [42] L. Xu and J. Jia (2010) Two-phase kernel estimation for robust motion deblurring. In European conference on computer vision, pp. 157–170. Cited by: §I.
  • [43] L. Xu, S. Zheng, and J. Jia (2013) Unnatural l0 sparse representation for natural image deblurring. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1107–1114. Cited by: §II.
  • [44] L. Yuan, J. Sun, L. Quan, and H. Shum (2007) Image deblurring with blurred/noisy image pairs. In ACM SIGGRAPH 2007 papers, pp. 1–es. Cited by: §I, §II.
  • [45] H. Zhang, Y. Dai, H. Li, and P. Koniusz (2019-06) Deep stacked hierarchical multi-patch network for image deblurring. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §I.
  • [46] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang (2017) Beyond a gaussian denoiser: residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing 26 (7), pp. 3142–3155. Cited by: §I, §II.
  • [47] K. Zhang, W. Zuo, and L. Zhang (2018) FFDNet: toward a fast and flexible solution for cnn-based image denoising. IEEE Transactions on Image Processing 27 (9), pp. 4608–4622. Cited by: §I, §II.
  • [48] S. Zhang, A. Zhen, and R. L. Stevenson (2019) Deep motion blur removal using noisy/blurry image pairs. arXiv preprint arXiv:1911.08541. Cited by: §I, §II, §II.
  • [49] X. Zhu, H. Hu, S. Lin, and J. Dai (2019) Deformable convnets v2: more deformable, better results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9308–9316. Cited by: §VI-A.
  • [50] D. Zoran and Y. Weiss (2011) From learning models of natural image patches to whole image restoration. In 2011 International Conference on Computer Vision, pp. 479–486. Cited by: §II.