Burst Photography for Learning to Enhance Extremely Dark Images

06/17/2020 ∙ by Ahmet Serdar Karadeniz, et al. ∙ 0

Capturing images under extremely low-light conditions poses significant challenges for the standard camera pipeline. Images become too dark and too noisy, which makes traditional enhancement techniques almost impossible to apply. Recently, learning-based approaches have shown very promising results for this task since they have substantially more expressive capabilities to allow for improved quality. Motivated by these studies, in this paper, we aim to leverage burst photography to boost the performance and obtain much sharper and more accurate RGB images from extremely dark raw images. The backbone of our proposed framework is a novel coarse-to-fine network architecture that generates high-quality outputs progressively. The coarse network predicts a low-resolution, denoised raw image, which is then fed to the fine network to recover fine-scale details and realistic textures. To further reduce the noise level and improve the color accuracy, we extend this network to a permutation invariant structure so that it takes a burst of low-light images as input and merges information from multiple images at the feature-level. Our experiments demonstrate that our approach leads to perceptually more pleasing results than the state-of-the-art methods by producing more detailed and considerably higher quality images.




I Introduction

Capturing images in low-light conditions is a challenging task, the main difficulty being that the level of the signal measured by the camera sensors is generally much lower than the noise in the measurements [1]. The fundamental factors causing the noise are the variations in the number of photons entering the camera lens and the sensor-based measurement errors that occur when reading the signal [2, 3]. In addition, noise present in a low-light image also affects various image characteristics such as fine-scale structures and color balance, further degrading the image quality.

Figure 2: A sample result obtained with our proposed burst-based extremely low-light image enhancement method. The standard camera output and its scaled version are shown at the top left corner. For comparison, the zoomed-in details from the outputs produced by the existing approaches are given in the subfigures. The results of the single image enhancement models, denoted with (S), are shown on the right. The results of the multiple image enhancement methods are presented at the bottom, with (B) denoting the burst and (E) indicating the ensemble models. Our single image model recovers finer-scale details much better than its state-of-the-art counterparts. Moreover, our burst model gives perceptually the most satisfactory result, compared to all the other methods.

Direct approaches for capturing bright photos in low-light conditions include widening the aperture of the camera lens, lengthening the exposure time, or using camera flash [1, 4]. These methods, however, do not solve the problem completely, as each has its own drawbacks. How far the aperture can be opened is limited by hardware constraints, and when the camera flash is used, the objects closer to the camera are brightened more than the objects or scene elements that are far away [5]. Images captured with long exposure times might have unwanted blur due to camera shake or object movements in the scene [6]. Hence, in the literature, there has been a wide range of studies which try to improve the quality of low-light images, ranging from traditional denoising and enhancement methods to learning-based approaches.

Image denoising is one of the classical problems in image processing, where the aim is to restore a clean image from a noisy image. Several methods have been proposed over the years to denoise images [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]. Most of these approaches rely on images with Gaussian noise for developing a denoising model. Recently, deep learning-based methods that can deal with real image noise have been proposed [3, 20]. However, these approaches are not specialized to extremely low-light images, which are harder to restore than a standard noisy image. Image enhancement is another active field of research, which has seen tremendous progress in the past few years with deep learning [21, 22, 23, 24, 25, 26, 27]. Usually, these methods work with low dynamic range (LDR) input images and hence, their performance is also limited due to the errors accumulated in the camera processing pipeline. When compared to LDR images, raw images straight from the camera are more suitable to use for enhancing extremely low-light images since they contain more information and are processed minimally.

In the context of enhancing extremely dark images, See-in-the-Dark (SID) [28] is the first learning-based attempt to replace the standard camera pipeline, training a convolutional neural network (CNN) model to produce an enhanced RGB image from a single raw low-light image. For this purpose, the authors collected a dataset of short-exposure, dark raw photos and their corresponding long-exposure references. Their method is further improved by Maharjan et al. [29] and Zamir et al. [30] with some changes in the CNN architecture and the objective functions utilized in training. In a similar fashion, in our study, we develop a new multi-scale architecture for single image enhancement and use a different objective by combining contextual and pixel-wise losses. While the previous methods obtain an RGB image from a single dark raw image, we further explore whether the results can be improved by integrating multiple observations regarding the scene.

Bracketing is a well-known technique in photography that relies on rapidly taking several shots of the same scene. These shots usually differ from each other in terms of some camera settings, e.g. exposure, which capture characteristics of the scene differently, and thus they can be used for applications like constructing a high dynamic range (HDR) image. A technique similar to exposure bracketing is shooting each frame in the burst sequence with a constant exposure [4]. Of particular interest to us, when shot with a constant short exposure under low light, these images represent different dark, noisy realizations of the same scene. Naturally, they provide us with multiple observations of the scene, as opposed to a single dark image. While simply averaging these images reduces noise, the results are not always satisfactory. For this reason, different techniques have been introduced to merge the temporal pixels in the burst sequence [1, 4, 31, 32, 33, 34, 35, 36]. Among these approaches, [34, 35, 36] use learning-based methods to process burst images. In these studies, burst images are fed to a CNN either by concatenating through channels or in a recurrent fashion. In our case, we propose a radically different approach and show that processing these burst images in a permutation invariant manner is a simple yet more effective approach. The order of burst images does not affect the output, and accordingly a more accurate output can be obtained. In Fig. 2, we present the results of the aforementioned extremely low-light image enhancement models along with our results. The multiple image enhancement models, which either employ burst imagery or integrate an ensemble of enhanced images, give better results than their single image counterparts, yet they still suffer from artifacts such as over-smoothing, and fail to recover fine-scale details in the image. Despite the remarkable progress of previous studies [28, 29, 30, 36], this example image demonstrates that there is still large room for improvement regarding various issues such as unwanted blur, noise and color inaccuracies in the end results, especially for input images which are extremely dark.

In a nutshell, to alleviate these shortcomings, in this study, we propose a learning-based framework that takes a burst of extremely low-light raw images of a scene as input and generates an enhanced RGB image. In particular, we develop a coarse-to-fine network architecture which allows for simultaneous processing of a burst of dark raw images as input to obtain a high quality RGB image.

Our main contributions are summarized as follows:

  • We introduce a multi-scale deep architecture for image enhancement under extremely dark lighting conditions, which consists of a coarse-scale network and a fine-scale network.

  • We further extend our coarse-to-fine architecture to design a novel permutation invariant CNN model that predicts an enhanced RGB image by integrating features from a burst of images of a dark scene.

  • Our experiments demonstrate that our approach outputs RGB images with less noise and sharper edge details than those of the state-of-the-art methods. These are validated quantitatively based on several quality measures in both single-frame and burst settings.

Our models are publicly available at the project webpage: https://hucvl.github.io/dark-burst-photography/.

II Related Work

Low-light images show different characteristics due to the lighting conditions of the environments, and the noise and/or motion blur they contain. In general, the approaches for low-light image processing can be divided into two groups, with respect to the darkness levels of the input images: (i) low-light image enhancement, and (ii) extremely low-light image enhancement. Generic low-light image enhancement methods refer to the approaches that restore the perceptual quality of images taken under poor illumination conditions, which suffer from low visibility. Enhancement models for extremely low-light images, on the other hand, deal with images captured under more severe conditions, which cannot be handled by the first group of works. In particular, the darkness of an image is directly related to the illuminance of a scene, which is measured in terms of lumens per meter squared (lux). In this sense, extremely low light images denote short exposure images (usually between 1/30 and 1/10 sec exposure) that are taken in 0.2-5 lux outdoor or 0.03-0.3 lux indoor scenes.

In this study, we explore the use of burst photography for enhancing extremely dark images. Since extremely low-light images contain severe noise, our work is also related to generic image denoising and burst photography. Hence, in this section, we provide a brief review of image denoising, low-light image enhancement, extremely low-light image enhancement and burst photography methods proposed in recent years.

Figure 3: For an extremely dark image displayed in (a), the traditional camera pipeline produces a highly noisy image with severe color degradation, as shown in (b). Moreover, as demonstrated in (c), the state-of-the-art denoising methods cannot handle these challenges and give unsatisfactory results. Extremely low-light image enhancement methods, on the other hand, aim for generating an output close to a long-exposure image, like the one given in (d). Panels: (a) Dark, (b) Traditional, (c) Traditional + BM3D denoising, (d) Long exposure.

II-A Image Denoising

Image denoising is a fundamental problem in computer vision that deals with removing noise from an image [37, 38]. Traditionally, methods that exploit the non-local self-similarity prior [7, 8, 9], sparsity [10, 11] and image gradients [12] have been widely used for image denoising. Recently, various deep learning approaches have been proposed for both non-blind Gaussian denoising [13, 14] and blind Gaussian denoising [15, 16], which involve training denoising models under known and unknown noise levels, respectively. Lately, researchers proposed unsupervised deep denoising models [17, 18, 19] that do not use any clean ground truth data during training. Although most of these existing denoising models focus on additive white Gaussian noise, this noise model falls short when real-life images are considered. Hence, the recent trend in image denoising is to develop models that are trained with real-world noisy data [20, 3] and that can generalize much better than the models which consider additive white Gaussian noise. While these recent methods give fairly good results most of the time, they are not well-suited to extremely dark images, which suffer from severe noise and color degradation, as shown in Fig. 3.

Figure 5: Common failure cases for the state-of-the-art extremely low-light image enhancement methods. Subfigures show some cropped images from the results of the existing models together with the corresponding error and the ground truth images, demonstrating that these models suffer from over-smoothing and color bleeding artifacts and fail to properly recover thin structures and textured regions.

II-B Low-Light Image Enhancement

Generic approaches that can be used for low-light image enhancement can be divided into three groups: (i) traditional contrast enhancement methods, (ii) techniques based on Retinex theory, and (iii) learning-based approaches. The most well-known methods for contrast enhancement include histogram equalization based approaches that apply transformations to image histograms [39, 40, 41, 42]. Motivated by human color perception, Retinex theory based approaches decompose the images into illumination and reflectance components, and take into account these components while enhancing the images [43, 44, 45, 46, 47]. On the other hand, learning-based methods mostly include discriminative methods based on sparse autoencoders and CNNs that either directly estimate an enhanced image [22, 23] or extract an illumination map [24, 25]. Recently, researchers suggested some unsupervised models which employ adversarial losses for enhancement [26] or CNNs for illumination curve estimation [27].

These low-light image enhancement methods provide good results under certain conditions. However, they fail to deal with the full extent of the challenges in imaging under extremely dark conditions. These enhancement models mainly accept LDR images generated by the standard camera pipeline. Transforming raw images to LDR images introduces some information loss in the measurements which complicates the enhancement process. Hence, these low-light image enhancement models are favorable only when the input images are partly dark and do not exhibit serious color degradation and severe noise.

II-C Extremely Low-Light Image Enhancement

As discussed in the introduction, enhancing extremely dark images was introduced as a challenging image enhancement task by Chen et al. in [28], and the See-in-the-Dark (SID) model proposed therein is the first model that specifically aims for solving this task. This approach processes a raw image captured under very poor illumination conditions with a U-Net [48] like architecture. Training of the model is carried out on a dataset of paired short and long-exposure images by taking into account a pixel-wise ($\ell_1$) loss.

Very recently, there have been a few attempts to further improve the performance of SID. For instance, Maharjan et al. [29] have proposed to use residual learning to boost the final image quality. Zamir et al. [30] have used a hybrid loss function which is a combination of pixel-wise and multi-scale structural similarity (MS-SSIM) losses and a perceptual loss [49, 50], which is defined by the absolute difference of the features extracted by a deep network. Interestingly, in [36], Ma et al. have developed an enhancement model for extremely low-light images, which employs recurrent convolutional neural networks to obtain a high quality result from a burst of input images. Although these studies demonstrate significant progress in enhancing extremely low-light images, they cannot fully deal with the challenges of the dark scenes. As presented in Fig. 5, the images enhanced by these approaches may suffer from artifacts such as over-smoothing and color bleeding. Moreover, the existing models do not recover texture and fine details such as thin structures successfully.

As will be discussed in the next section, different from the aforementioned methods, we alternatively propose a multi-scale approach which uses a novel coarse-to-fine architecture that better handles the extremely low-light images by giving much sharper and more vivid colors. In addition, we use a combination of the pixel loss and the recently proposed contextual loss function which maintains the image statistics better [51]. Moreover, for our burst model, we employ a set-based permutation invariant architecture that jointly processes low-light input images in an orderless manner, giving perceptually plausible and high quality results.

There are also some recent efforts to extend the aforementioned image enhancement models to videos by additionally taking into account temporal consistencies. For instance, Chen et al. [52] extended their SID model to videos by training a Siamese network on static raw videos. Similarly, Jiang and Zheng [53] proposed a U-Net like architecture containing 3D convolution layers for the same purpose. These models are outside the scope of this paper as they require training with dark videos rather than images, and are mentioned here only for completeness.

II-D Burst Photography

Burst photography refers to the process of capturing a sequence of images each spaced a few milliseconds apart and subsequently integrating them to obtain a higher-quality image. For instance, the most intuitive way to produce a noise-free image is to capture a burst of images and apply simple averaging. Yet, this strategy gives unsatisfactory results in practice due to moving objects and/or a moving camera. Hence, a variety of more complicated methods were introduced to combine the information from multiple images in a more effective manner. Buades et al. proposed to apply standard averaging only for the aligned pixels and utilize the state-of-the-art denoising methods for the remaining pixels [31]. Joshi et al. developed a method that weights the pixels with respect to their sharpness levels by using Laplacian convolution [32] and accordingly utilizes these weights in obtaining higher quality images. Liu et al. proposed to fuse the consistent pixels with an optimal linear estimator [33]. Moreover, some researchers suggested employing the information encoded in the frequency domain for temporal fusion [4, 1, 54]. Recently, more sophisticated approaches have been proposed for denoising such as Kernel Prediction Networks [34], Recurrent Fully Convolutional Networks [35], and Permutation Invariant Networks [55], which process a burst of noisy and blurred images through deep CNN architectures.
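As a rough illustration of the simple averaging baseline mentioned above, the following sketch merges a synthetic burst of a static scene. It assumes perfectly aligned frames and independent zero-mean Gaussian noise, which is a simplification (real low-light sensor noise is not Gaussian), and the names `clean`, `sigma` and `num_frames` are hypothetical values chosen only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical static scene: a constant dark patch (assumption for illustration).
clean = np.full((64, 64), 0.1)
sigma = 0.05       # assumed per-frame noise standard deviation
num_frames = 8     # assumed burst length

# Each burst frame is the same scene corrupted by independent zero-mean noise.
burst = clean + rng.normal(0.0, sigma, size=(num_frames, *clean.shape))

# Simple temporal averaging over the burst axis.
merged = burst.mean(axis=0)

single_err = np.std(burst[0] - clean)
merged_err = np.std(merged - clean)

# For independent noise the residual shrinks roughly by sqrt(num_frames).
print(single_err, merged_err, single_err / merged_err)
```

Under these idealized assumptions the noise level drops by about the square root of the number of frames. With scene or camera motion the frames no longer agree pixel by pixel, which is why the more elaborate merging strategies above were developed.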

Figure 8: Network architectures of the proposed (a) single-frame coarse-to-fine model, and (b) set-based burst model.

These aforementioned models do not cope with the challenges of extremely dark images – with the exception of Liba et al. [1] and Hasinoff et al. [4] where the authors rely on hand-crafted strategies. As mentioned in the previous subsection, the only work that focuses on learning-based burst imagery in the extremely low-light conditions is the work by Ma et al. [36]. In this work, the authors utilized a recurrent convolutional neural network architecture, similar to the one in [35], to enhance a burst of raw low-light images. In our work, specifically motivated by these recent burst photography approaches, we develop a set-based permutation invariant CNN architecture that can be used to obtain a high quality image from a burst of extremely dark images. In particular, our network jointly processes the burst frames in an orderless manner, as compared to the recurrent model in [36] which processes each frame sequentially.

III Our Approach

Table I summarizes the notations used throughout the paper. Our aim is to learn a mapping from the domain of raw low-light images to the domain of long-exposure RGB images. To achieve this, we first propose a single-frame coarse-to-fine model and then extend it to a set-based formulation to process a burst of images. The details of our networks are illustrated in Fig. 8.

Symbol                     Description
$\{x_1, \dots, x_M\}$      Burst of raw low-light input images
$y$, $\hat{y}$             Reference and predicted long-exposure RGB images
$F_c$, $F_f$, $F_s$        Coarse, fine and set-based networks
$c_1, \dots, c_M$          Raw, low-res outputs of the coarse network
$n_1, \dots, n_M$          Noise approximations for $x_1, \dots, x_M$
$t_1, \dots, t_M$          Tensors containing raw inputs, upsampled coarse outputs and noise approximations
$D(\cdot)$, $U(\cdot)$     Downsampling and upsampling functions
Table I: The notations used throughout the paper.

III-A Coarse-to-Fine Model

To recover fine-grained details from dark images, we propose to employ a two-step coarse-to-fine training procedure. Similar strategies have proven very effective in various other tasks such as deblurring [56] and image synthesis [57]. Unlike those approaches, our coarse network outputs a raw (denoised) image. This helps us to decouple the problem of learning the mapping between the raw domain and the RGB domain. Some recent denoising methods use the noise level as an additional input channel [3, 34]. Predicting the coarse outputs in the raw domain also allows us to compute the approximate noise in the input.

In our proposed framework, the raw low-light input image $x$ is first downsampled by a factor of two, $D(x)$, and then fed to our coarse network $F_c$. The coarse network, which is illustrated in Fig. 8(a), is trained on downsampled data and produces a denoised and enhanced output $c$ in low resolution:

$$c = F_c(D(x)).$$

We utilize the output of the coarse network not just for guidance in assisting the fine network but also in approximating the noise by computing the difference between the upsampled coarse prediction $U(c)$ and the raw low-light input, as:

$$n = x - U(c).$$

The fine network $F_f$ takes the concatenation of the low-light raw input image, the output from the coarse network and the noise approximation as inputs and processes them to generate the final RGB output $\hat{y}$:

$$\hat{y} = F_f(x, U(c), n).$$
Both our coarse and fine networks follow a U-Net like encoder-decoder architecture. In the encoder, they contain 10 convolution layers where the number of filters is doubled and the resolution is halved after every 2 convolution layers, with the initial number of filters set to 32. In the decoder, they include deconvolution layers which are concatenated with the corresponding earlier convolution layers through skip connections. As shown in Fig. 13, the coarse network gives a fairly good enhancement result for a given extremely low-light image containing severe noise and color degradation. The fine network further improves the color accuracy and the details of the result of the coarse network, producing a higher quality image.
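The data flow of the coarse-to-fine model can be sketched as follows. This is only a shape-level illustration under several assumptions: the raw input is taken to be a packed 4-channel array, `coarse_net` and `fine_net` are hypothetical placeholders for the trained networks (they perform no real enhancement), 2×2 averaging and nearest-neighbor repetition stand in for the actual resampling functions, and the noise approximation is taken as input minus upsampled coarse prediction.

```python
import numpy as np

def downsample2(x):
    """Halve spatial resolution by 2x2 average pooling (H and W must be even)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def upsample2(x):
    """Double spatial resolution by nearest-neighbor repetition."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def coarse_net(x_low):
    """Hypothetical placeholder for the trained coarse network (raw in, raw out)."""
    return x_low  # identity stand-in, not a real denoiser

def fine_net(stacked):
    """Hypothetical placeholder for the trained fine network (12 channels in, RGB out)."""
    return stacked[..., :3]  # stand-in only, keeps the first three channels

def enhance(x):
    c = coarse_net(downsample2(x))                 # low-resolution raw prediction
    u = upsample2(c)                               # back to the input resolution
    n = x - u                                      # noise approximation (assumed sign)
    stacked = np.concatenate([x, u, n], axis=-1)   # input of the fine network
    return fine_net(stacked), stacked

x = np.random.default_rng(0).random((8, 8, 4))    # toy packed raw patch
rgb, stacked = enhance(x)
print(stacked.shape, rgb.shape)
```

The point of the sketch is the wiring: the fine stage sees the raw input, the upsampled coarse prediction and their residual side by side, so the residual acts as an explicit noise cue.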

III-B Set-Based Extension to Burst Images

Recently, there have been some attempts to study the invariance and equivariance properties of neural networks [58, 59, 60]. Interestingly, Zaheer et al. provided a generic algorithm to train neural networks that operate on sets via a simple parameter sharing scheme [61], which allows for information exchange with a commutative operation. Based on this idea, Aittala and Durand proposed a permutation invariant CNN model for burst image deblurring [55]. In a similar vein, in this study, we develop a permutation invariant CNN architecture but with a much lower computational cost by using multiple encoders and a single decoder.

Figure 13: An example night photo captured with 0.1 sec exposure and its enhanced versions by the proposed coarse, fine and burst networks. As the cropped images demonstrate, the fine network enhances both the color and the details of the coarse result. The burst network produces an even sharper and perceptually more pleasing output. Panels: (a) Traditional, (b) Coarse, (c) Fine, (d) Burst.

We extend our coarse-to-fine model to a novel permutation invariant CNN architecture which takes multiple images of the scene as input and predicts an enhanced image. In particular, first, low-resolution coarse outputs are obtained for each frame $x_i$ in the burst sequence $\{x_1, \dots, x_M\}$, using our coarse network $F_c$:

$$c_i = F_c(D(x_i)), \quad i = 1, \dots, M.$$

In addition, we compute an approximate noise component for each frame, as

$$n_i = x_i - U(c_i),$$

where $D(\cdot)$ and $U(\cdot)$ denote the downsampling and upsampling functions. Finally, our set-based network accepts a set of tensors $\{t_1, \dots, t_M\}$ as input, each instance $t_i$ corresponding to the concatenation of one of the raw burst images $x_i$, its noise approximation $n_i$ and the upsampled version of the coarse prediction $U(c_i)$, and produces the final RGB output $\hat{y}$:

$$\hat{y} = F_s(\{t_1, \dots, t_M\}), \quad t_i = [x_i, n_i, U(c_i)].$$

In the above equation, $F_s$ represents our permutation invariant CNN, which has convolutional subnetworks that allow for information exchange between the features of burst frames. This is achieved by using a max-pooling over the set of burst features after each convolution layer in the encoder part of the network. Then, in the decoder part, instead of concatenating the deconvolution features with the corresponding earlier features, we concatenate them with the corresponding global max-pooled features computed in the encoder part. Hence, without even changing the parameter size, we integrate the advantage of multiple observations into the network. As Fig. 13 demonstrates, processing multiple dark images via the proposed burst network significantly improves the quality of the end result. Our burst network produces perceptually more pleasing and sharper results than our fine network and especially recovers the fine details and the texture much better.
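The permutation invariance obtained from max-pooling over the set of burst features can be checked with a toy example. The sketch below is not the proposed network: `shared_encoder` is a hypothetical single linear layer with ReLU whose weights are shared across frames, and the shapes are arbitrary. It only demonstrates that pooling with a commutative operation makes the result independent of frame order.

```python
import numpy as np

rng = np.random.default_rng(0)

def shared_encoder(frame, weights):
    """Toy per-frame feature extractor; the same weights are used for every frame."""
    return np.maximum(frame @ weights, 0.0)  # linear map followed by ReLU

def set_features(frames, weights):
    """Encode each frame independently, then max-pool over the set axis."""
    feats = np.stack([shared_encoder(f, weights) for f in frames])
    return feats.max(axis=0)

frames = rng.random((5, 16, 12))        # 5 frames, 16 pixels, 12 channels each
weights = rng.normal(size=(12, 32))     # hypothetical shared weights

pooled = set_features(frames, weights)
shuffled = set_features(frames[rng.permutation(5)], weights)
subset = set_features(frames[:2], weights)  # fewer frames need no extra parameters

print(np.allclose(pooled, shuffled), pooled.shape, subset.shape)
```

Because the encoder weights are shared and the pooling output has a fixed size, the same parameters handle any number of frames, which matches the observation that the burst extension does not change the parameter size.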

III-C Losses

To train our networks, we tested combining a pixel-wise loss ($\mathcal{L}_{pix}$) with two alternative feature-wise losses, namely perceptual loss ($\mathcal{L}_{perc}$) [49, 50] and contextual loss ($\mathcal{L}_{cx}$) [51, 62].

Pixel-wise Loss. As the pixel-wise loss, we use the $\ell_1$ loss between the network output $\hat{y}$ and the groundtruth long-exposure image $y$, given as:

$$\mathcal{L}_{pix}(y, \hat{y}) = \| y - \hat{y} \|_1.$$
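Perceptual Loss. To measure the distance at a more semantic level, we employ the commonly used perceptual loss [49, 50], which uses high-level features from a pre-trained VGG-19 network [63], defined as:

$$\mathcal{L}_{perc}(y, \hat{y}) = \| \phi_l(y) - \phi_l(\hat{y}) \|_1,$$

where $y$ and $\hat{y}$ are the reference and predicted images and $\phi_l$ denotes the feature maps at the $l$-th layer of the network.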

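Contextual Loss. As an alternative to the perceptual loss, we also consider the contextual loss proposed in [51, 62], which is shown to better capture changes in fine-scale details. Specifically, it measures the statistical difference between the feature distributions $\phi_l(y)$ and $\phi_l(\hat{y})$ extracted from the reference image $y$ and the prediction $\hat{y}$, respectively, and is defined as:

$$\mathcal{L}_{cx}(y, \hat{y}) = -\log \mathrm{CX}(\phi_l(y), \phi_l(\hat{y})),$$

where the statistical similarity CX is estimated by an approximation of the KL-divergence, as follows.

Let $P = \{p_i\}$ and $Q = \{q_j\}$ respectively represent the set of features extracted from a pair of images, with cardinality $N$, and $d_{ij}$ be the cosine distance between the features $p_i$ and $q_j$. Then, $\mathrm{CX}(P, Q) = \frac{1}{N} \sum_j \max_i A_{ij}$, where $A_{ij} = \mathrm{softmax}_j(1 - \tilde{d}_{ij}/h)$ and $\tilde{d}_{ij} = d_{ij} / (\min_k d_{ik} + \epsilon)$, with $h$ a bandwidth parameter and $\epsilon$ a small constant for numerical stability.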
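To make the contextual similarity more concrete, the following sketch evaluates it on two small feature sets. It follows the commonly used formulation of the contextual loss (row-normalized cosine distances passed through a softmax, then a max over one set and a mean over the other); the bandwidth `h`, the constant `eps` and the random features standing in for VGG activations are assumptions made for illustration, not values taken from this paper.

```python
import numpy as np

def contextual_similarity(p, q, h=0.5, eps=1e-5):
    """Contextual similarity between two feature sets of shape (N, C)."""
    pn = p / np.linalg.norm(p, axis=1, keepdims=True)
    qn = q / np.linalg.norm(q, axis=1, keepdims=True)
    d = 1.0 - pn @ qn.T                                  # cosine distances d_ij
    d_tilde = d / (d.min(axis=1, keepdims=True) + eps)   # row-wise normalization
    logits = 1.0 - d_tilde / h
    logits = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    a = np.exp(logits)
    a = a / a.sum(axis=1, keepdims=True)                 # softmax over j
    return a.max(axis=0).mean()                          # max over i, mean over j

def contextual_loss(p, q):
    return -np.log(contextual_similarity(p, q))

rng = np.random.default_rng(0)
feats_a = rng.normal(size=(32, 16))   # hypothetical features of one image
feats_b = rng.normal(size=(32, 16))   # hypothetical features of another image

loss_same = contextual_loss(feats_a, feats_a)
loss_diff = contextual_loss(feats_a, feats_b)
print(loss_same, loss_diff)
```

Identical feature sets give a similarity close to one and hence a loss close to zero, while unrelated feature sets give a clearly larger loss. Note that the measure compares sets of features without requiring spatial correspondence between them.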

Figure 29: Qualitative comparison of our coarse-to-fine single image (S) method for enhancing extremely low-light images, compared against the state-of-the-art models that also process a single image. From the top to the bottom row, the amplification ratios are 250, 100 and 250, respectively. Panels: (a) SID [28], (b) Maharjan et al. [29], (c) Zamir et al. [30], (d) Ours (S), (e) Ground truth.

Figure 48: Qualitative comparison of our burst (B) model for enhancing extremely low-light images, compared against the burst model by Ma et al. [36] and the ensemble versions (E) of the single image state-of-the-art models. From the top to the bottom row, the amplification ratios are 100, 300 and 300, respectively. Panels: (a) SID (E) [28], (b) Maharjan et al. (E) [29], (c) Zamir et al. (E) [30], (d) Ma et al. (B) [36], (e) Ours (B), (f) Ground truth.

Implementation Details. To generate our training data, we extracted random patches of 512×512 pixels for each input image and also generated their downsampled versions with half resolution (obtained by bilinear interpolation). Hence, the input patch sizes for the coarse and fine networks are 256×256 and 512×512 pixels, respectively. We first trained the coarse network by using the Adam optimizer in two stages of 2000 epochs each, with a separate learning rate for each stage. Then, the fine network was trained with the same hyperparameters without fixing the parameters of the coarse network. Finally, we trained the set-based network for 1000 epochs by initializing its weights from the fine network. During the training of the set-based network, we randomly chose the number of burst input frames between 1 and 10. We trained both of our models by using a hybrid loss that consists of the pixel-wise and the contextual loss functions (in our experiments, we observed that the contextual loss works consistently better than the perceptual loss). For the contextual loss, we used the conv3_2 and conv4_2 layers of the VGG-19 network. We implemented our model with the TensorFlow library on an NVIDIA GeForce GTX 1080 Ti GPU. Training our model lasted about 4 days.
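The data preparation steps described above can be sketched as follows. This is a minimal illustration rather than the actual training pipeline: the image size and channel count are hypothetical, 2×2 averaging is used as a simple stand-in for the bilinear interpolation mentioned in the text, and the helper names are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_patch(image, size=512):
    """Crop a random size x size window from an (H, W, C) array."""
    h, w = image.shape[:2]
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return image[top:top + size, left:left + size]

def half_resolution(patch):
    """Halve resolution by 2x2 averaging (stand-in for bilinear resizing)."""
    h, w, c = patch.shape
    return patch.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def sample_burst_indices(num_available, max_frames=10):
    """Pick between 1 and max_frames distinct frame indices at random."""
    k = int(rng.integers(1, min(max_frames, num_available) + 1))
    return rng.choice(num_available, size=k, replace=False)

image = rng.random((600, 800, 4), dtype=np.float32)  # hypothetical packed raw image
fine_patch = random_patch(image)                     # 512 x 512 input of the fine network
coarse_patch = half_resolution(fine_patch)           # 256 x 256 input of the coarse network
indices = sample_burst_indices(num_available=8)      # frames used in this training step

print(fine_patch.shape, coarse_patch.shape, len(indices))
```

Sampling a different number of frames at each step exposes the set-based network to bursts of varying length during training.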

IV Experimental Evaluation

IV-A Dataset

Obtaining long-exposure images is practically difficult, but they can serve as ground truth images if the low-light scenes are static. We train and evaluate our models on the SID dataset [28], which consists of short-exposure burst raw images taken under extremely dark outdoor (0.2-5 lux) or indoor (0.03-0.3 lux) scenes. These images are acquired with three different exposure times of 1/10, 1/25 and 1/30 sec, where the corresponding reference images are obtained with 10 seconds or 30 seconds exposures depending on the scene. Specifically, we evaluate the performance of our models on the Sony subset, which contains 161, 36 and 93 distinct burst sequences for training, validation and testing, respectively. The number of burst frames varies from 2 to 10 for each distinct scene. The burst images are fully aligned as they are captured with a tripod. The total number of images in this dataset is 2697, including the burst frames. Moreover, the images are categorized into three groups based on their amplification ratios (100, 250, 300), measured as the ratio of the exposure time of the long-exposure ground truth to that of the dark input image.
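The three amplification ratios follow directly from the exposure times listed above. As a small worked example, assuming a 10-second reference exposure paired with each of the three short exposures:

```python
from fractions import Fraction

reference_exposure = Fraction(10)  # seconds, one of the reference exposures in the text
short_exposures = [Fraction(1, 10), Fraction(1, 25), Fraction(1, 30)]  # seconds

# Amplification ratio = long exposure time / short exposure time.
ratios = [int(reference_exposure / s) for s in short_exposures]
print(ratios)  # [100, 250, 300]
```

Exact fractions are used so that the ratios come out as integers without floating-point rounding.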

IV-B Competing Approaches

We compare our models with four state-of-the-art methods, SID [28], Maharjan et al. [29], Ma et al. [36] and Zamir et al. [30]. In our experiments, we used the pre-trained models provided by the authors of [28] and [29], and our implementations of the methods in [36] and [30] as their models are not publicly available. Specifically, for the method of Zamir et al. [30], we trained the U-Net model with the hybrid loss including pixel-wise and MS-SSIM losses and the perceptual loss for 4000 epochs. For the burst-based model by Ma et al. [36], we implemented a recurrent U-Net architecture, where the concatenated features from the previous frame, the single image model and the previous layer are fed to each convolution block of the network. We trained the model for 1000 epochs fixing the parameters of the single image network. It is important to note that among these approaches, only the method by Ma et al. [36] processes a burst of images at once. For a fair comparison with the single image models, we also process each burst image independently via each model, take the average of these enhanced outputs as the final result, and additionally report the predictions of these ensemble models.
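The ensemble baseline described above, in which every burst frame is enhanced independently and the outputs are averaged, can be sketched as follows. Here `single_image_model` is a hypothetical stand-in for any trained single-image enhancer (it merely amplifies and clips the first three channels), and the burst is random toy data.

```python
import numpy as np

def single_image_model(raw_frame):
    """Hypothetical stand-in for a trained single-image enhancement model."""
    return np.clip(raw_frame[..., :3] * 100.0, 0.0, 1.0)

def ensemble_predict(burst):
    """Enhance each frame independently, then average the enhanced outputs."""
    outputs = [single_image_model(frame) for frame in burst]
    return np.mean(outputs, axis=0)

# Toy burst: 4 frames of an 8 x 8 packed raw patch with very low signal values.
burst = np.random.default_rng(0).random((4, 8, 8, 4)) * 0.01
ensemble = ensemble_predict(burst)
print(ensemble.shape)
```

In this scheme the frames only interact at the very end, in the output domain, whereas a burst model can merge information at the feature level.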

IV-C Evaluation Metrics

For quantitative evaluation, we employ the popular peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) metrics, and also two perceptual image quality metrics, namely learned perceptual image patch similarity (LPIPS) [64] and perceptual image-error assessment through pairwise preference (PieAPP) [65]. These perceptual metrics can be used to quantify the natural distortion of images such as noise and blur as well as CNN-based distortions. In addition, we also utilize perceptual index (PI) [66], a recently proposed no-reference perceptual image quality metric.
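Of these metrics, PSNR has a simple closed form, $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The sketch below computes it with NumPy, assuming images scaled to the range [0, 1]; the function name and the toy inputs are chosen only for this illustration.

```python
import numpy as np

def psnr(reference, prediction, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    diff = reference.astype(np.float64) - prediction.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

reference = np.zeros((4, 4, 3))
prediction = np.full((4, 4, 3), 0.1)   # constant error of 0.1 gives an MSE of 0.01
value = psnr(reference, prediction)
print(round(value, 6))  # 20.0
```

A higher PSNR means a smaller pixel-wise error, but it does not necessarily track perceived quality, which is why perceptual metrics are reported alongside it.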

IV-D Experimental Results

We first analyze the effectiveness of our coarse-to-fine strategy, and the performance gains achieved over the existing single image models. Fig. 29 shows a visual comparison of our single image model against the state-of-the-art [28, 29, 30]. For the first image, the color of the books and the details of the text on the spines are better recovered by our model. For the second image, the fine details are more visible and the edges are sharper (e.g. the lines on the wall and the cable) in our result. For the third image, our model greatly reduces the noise in the dark regions. Moreover, it is apparent that our approach preserves the edges better. Table II shows a quantitative analysis of our single image model on the SID dataset. Overall, our model outperforms the state-of-the-art in terms of PSNR and all perceptual metrics, LPIPS, PieAPP and PI, and gives competitive results in terms of SSIM. It should also be noted that our model achieves the highest PSNR on the dark images with amplification ratios of 250 and 300, which are more challenging than the subset with a ratio of 100.



Ratio  Method                PSNR    SSIM   LPIPS  PieAPP  PI
×100   SID [28]              30.087  0.904  0.450  1.427   4.320
×100   Maharjan et al. [29]  30.535  0.906  0.448  1.250   4.481
×100   Zamir et al. [30]     29.922  0.895  0.465  1.310   4.518
×100   Ours (S)              30.464  0.905  0.292  0.968   4.309

×250   SID [28]              28.428  0.887  0.482  1.601   4.577
×250   Maharjan et al. [29]  28.787  0.888  0.488  1.443   4.961
×250   Zamir et al. [30]     28.254  0.878  0.462  1.462   4.956
×250   Ours (S)              28.900  0.884  0.326  1.113   4.551

×300   SID [28]              28.528  0.870  0.507  1.644   4.107
×300   Maharjan et al. [29]  28.382  0.868  0.516  1.645   4.523
×300   Zamir et al. [30]     28.441  0.860  0.494  1.520   4.479
×300   Ours (S)              28.669  0.863  0.356  1.048   4.039

All    SID [28]              28.976  0.886  0.482  1.564   4.319
All    Maharjan et al. [29]  29.167  0.886  0.487  1.462   4.646
All    Zamir et al. [30]     28.838  0.876  0.465  1.437   4.639
All    Ours (S)              29.290  0.882  0.327  1.087   4.281
Table II: Performance comparison of single image models on the SID dataset for different amplification ratios, with the best performing model highlighted with a bold typeface.


Ratio  Method                    PSNR    SSIM   LPIPS  PieAPP  PI
×100   SID (E) [28]              30.361  0.908  0.447  1.441   4.686
×100   Maharjan et al. (E) [29]  30.833  0.909  0.445  1.324   4.863
×100   Zamir et al. (E) [30]     30.120  0.898  0.430  1.335   4.776
×100   Ma et al. (B) [36]        30.429  0.908  0.423  1.312   4.295
×100   Ours (B)                  30.849  0.909  0.280  0.945   4.233

×250   SID (E) [28]              28.915  0.893  0.480  1.622   5.313
×250   Maharjan et al. (E) [29]  29.289  0.893  0.480  1.525   5.609
×250   Zamir et al. (E) [30]     28.630  0.882  0.454  1.495   5.406
×250   Ma et al. (B) [36]        29.053  0.896  0.470  1.517   4.429
×250   Ours (B)                  29.479  0.892  0.313  1.063   4.366

×300   SID (E) [28]              28.979  0.878  0.516  1.699   4.606
×300   Maharjan et al. (E) [29]  28.783  0.875  0.520  1.744   5.003
×300   Zamir et al. (E) [30]     28.750  0.866  0.500  1.581   4.805
×300   Ma et al. (B) [36]        29.078  0.884  0.467  1.464   4.018
×300   Ours (B)                  29.232  0.877  0.322  1.048   3.923

All    SID (E) [28]              29.383  0.892  0.484  1.596   4.850
All    Maharjan et al. (E) [29]  29.568  0.891  0.485  1.548   5.148
All    Zamir et al. (E) [30]     29.132  0.881  0.462  1.480   4.983
All    Ma et al. (B) [36]        29.485  0.895  0.455  1.433   4.232
All    Ours (B)                  29.804  0.891  0.306  1.021   4.157
Table III: Performance comparison of burst (B) and ensemble (E) models on the SID dataset for different amplification ratios, with the best performing model highlighted with a bold typeface.

Fig. 48 presents some visual results of our burst model, along with a comparison to the burst method of [36] and the ensemble versions of the single image methods [28, 29, 30]. As evident from the zoomed-in regions, our permutation-invariant CNN model produces enhancement results with much sharper and better-restored texture details. The ensemble methods, on the other hand, all over-smooth fine-scale details such as the thin lines on the mat and the printed characters on the spine of the book, as well as textured regions like the green bush. The burst method of [36] does relatively better, but its outputs have low contrast. Table III demonstrates the benefit of our approach: it performs best in terms of PSNR and all perceptual metrics (LPIPS, PieAPP and PI).

Method                1 frame  4 frames  8 frames
SID [28]              0.424    1.648     -
Maharjan et al. [29]  2.287    3.045     -
Zamir et al. [30]     0.424    1.648     -
Ma et al. [36]        -        2.001     -
Ours (S)              0.597    1.889     -
Ours (B)              0.597    1.509     2.413
Table IV: Runtime analysis for single image and ensemble/burst models. The fastest model is indicated with a bold typeface. Running times are in seconds.

In Table IV, we report the runtime performance of our single image and burst models in comparison with the competing methods. In particular, we measure the time taken to process a single image and a burst of 4 images. Our experiments are conducted on a machine with an NVIDIA GeForce GTX 1080 Ti 11GB graphics card using images of 4256×2848 pixels. For single image enhancement, our single image model is slightly slower than SID [28] and Zamir et al. [30] due to its multi-scale architecture, though it gives better enhancement results, as discussed before. For burst enhancement, our model achieves the best runtime, 1.509 sec for a burst size of 4. This demonstrates the advantage of having a shared decoder to process burst features, in contrast to the competing approaches. We additionally report the runtime of our burst model on a burst of 8 frames. As can be seen, the runtime grows sub-linearly with the number of processed images: we observe only a 1.6× increase when the burst size is doubled from 4 to 8. Note that for a burst size of 8 we could not report runtimes of the competing models, as enhancing these frames within a single batch with these models exceeds the limits of our GPU memory.
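The scaling behaviour discussed above can be checked directly from the numbers in Table IV. The short computation below only rearranges the reported runtimes; the variable names are illustrative.

```python
# Runtimes (in seconds) of the burst model, as reported in Table IV.
runtime = {1: 0.597, 4: 1.509, 8: 2.413}

# Doubling the burst size from 4 to 8 costs about 1.6x, not 2x.
ratio_4_to_8 = runtime[8] / runtime[4]

# Amortized cost per frame shrinks as the burst grows.
per_frame = {n: t / n for n, t in runtime.items()}

# Ensemble of the single image model on 4 frames (Ours (S), Table IV).
ensemble_4 = 1.889
speedup_over_ensemble = ensemble_4 / runtime[4]

print(f"4 -> 8 frames: x{ratio_4_to_8:.2f}")
for n, t in per_frame.items():
    print(f"{n} frame(s): {t:.3f} s per frame")
print(f"burst vs. ensemble at 4 frames: x{speedup_over_ensemble:.2f}")
```

The per-frame cost drops from roughly 0.60 s for a single image to roughly 0.30 s at 8 frames, which is consistent with part of the network being executed once per burst rather than once per frame.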

Our model is trained entirely on the Sony dataset of SID [28], which contains images captured by the Sony α7S II sensor. To demonstrate that our learned models can (partly) generalize to other camera sensors, in Fig. 49 and Fig. 53 we present example outputs of our single and burst image models on extremely dark photos taken with the cameras of an iPhone 6s and an iPhone SE, respectively. Fig. 49 shows that our model reduces the noise better than the state-of-the-art models [28, 29, 30] while more accurately recovering the texture details of the flower and the leaves. Similarly, Fig. 53 shows the cross-sensor generalization capability of our burst model. Our method produces a better result than both the traditional camera pipeline (https://github.com/letmaik/rawpy) and SID [28], in that it recovers the details of the water hose and the leaves of the tree more accurately.

Figure 49: Enhancement results of a raw image captured by an iPhone 6s using 1/20 sec exposure time and 400 ISO. Panels: SID [28], Maharjan et al. [29], Zamir et al. [30], and Ours (single). Our proposed single image enhancement model provides better noise reduction with more structural details, in comparison to the prior approaches.
Figure 53: Enhancement results on a burst of 8 raw images taken with an iPhone SE with 1/10 sec exposure time and 400 ISO. Resulting images obtained by (a) averaging over the traditional pipeline (ensemble), (b) averaging over the SID [28] predictions (ensemble), and (c) our burst model.

IV-E Ablation Study

To evaluate the effectiveness of our approach in more detail and to better understand the effects of the loss functions and also the contribution of the burst size to the overall quality, we conducted an extensive series of ablation tests.

Losses. As mentioned before, the loss function we used to train our networks consists of two complementary loss terms. The first term is the pixel-wise loss which is used to improve the accuracy of reconstructing a long-exposure image. The second term, on the other hand, is comprised of the contextual loss function, which is utilized to improve the perceived quality of the end result.

In Table V, we quantify the effect of using the contextual loss, as opposed to the perceptual loss, in conjunction with the pixel-wise loss. First of all, the burst model trained with only the pixel-wise loss results in higher PSNR and SSIM but relatively lower perceptual quality, which is in line with previous observations [66, 64]. In that sense, adding either the perceptual loss or the contextual loss to our objective function provides a good tradeoff between pixel-wise and perceptual metrics. To inspect which one is better, we also qualitatively analyze the contribution of incorporating the perceptual loss or the contextual loss. As demonstrated in Fig. 60, either of them improves the perceived quality of the end result. The resulting images have more realistic fine-scale details and texture while avoiding over-smoothing. Interestingly, however, the network trained with the contextual loss tends to better recover thin structures, especially in the darker regions, as compared to the others.

Loss                     PSNR    SSIM   LPIPS  PieAPP  PI
Pixel-wise only          29.843  0.898  0.417  1.364   4.252
Pixel-wise + perceptual  29.895  0.894  0.274  1.053   4.593
Pixel-wise + contextual  29.804  0.891  0.306  1.021   4.157
Table V: Effect of the loss functions on the performance of the proposed burst enhancement model.
Figure 60: Enhancement results of our method with different loss functions. Utilizing the combination of contextual loss and pixel-wise loss gives visually more pleasing results, as compared to using the pixel-wise loss together with and without the perceptual loss.
Figure 66: Effect of the burst size. Panels: (b) single image, (c) 4 frames, (d) 8 frames, (e) ground truth. As can be seen, as we increase the number of images in the burst sequence, the enhancement quality of our burst model improves further.

Burst Processing. In Fig. 66, we analyze how the number of frames in the burst sequence affects the performance of our model. Here, we provide the results obtained with a single input image and with burst sizes of four and eight frames. As can be seen from the zoomed-in results, the output quality improves with an increasing number of burst images: the method gets much better at preserving texture details and thin structures. In Fig. 69, we also compare our (set-based) burst method with the ensemble of our single image model (i.e., processing each image in the burst separately and then taking the average of the individual outputs). Fusing burst images at the feature level is evidently much more effective. Additionally, in Table VI, we quantitatively evaluate the performance of these alternative strategies. (As mentioned before, the burst sizes for the images in the Sony dataset vary between 2 and 10. Here, we report the results obtained using at most four or at most eight burst frames.) Our burst model gets better scores across all metrics as compared to the ensemble approach, even when using only half of the burst images.
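To make the distinction between feature-level fusion and output averaging concrete, the sketch below encodes every frame with a shared encoder, pools the resulting features over the frame axis with a symmetric operation, and decodes once. It is a toy illustration under stated assumptions, not the proposed architecture: the choice of max pooling, the linear `encode`/`decode` functions and the random weights `W_enc`/`W_dec` are all hypothetical.

```python
import numpy as np

def fuse_features(frame_features):
    """Symmetric pooling over the frame axis (max is an assumption here)."""
    return np.max(frame_features, axis=0)

def burst_forward(burst, encode, decode):
    """Shared encoder per frame, one fused feature map, one decoder pass."""
    feats = np.stack([encode(frame) for frame in burst], axis=0)
    return decode(fuse_features(feats))

rng = np.random.default_rng(0)
W_enc = rng.normal(size=(3, 16))   # toy per-pixel encoder weights
W_dec = rng.normal(size=(16, 3))   # toy per-pixel decoder weights

def encode(frame):
    # (H, W, 3) -> (H, W, 16), followed by a ReLU
    return np.maximum(frame @ W_enc, 0.0)

def decode(features):
    # (H, W, 16) -> (H, W, 3)
    return features @ W_dec

burst = rng.random((4, 8, 8, 3))            # 4 frames of size 8x8
out = burst_forward(burst, encode, decode)

# Reordering the frames leaves the output unchanged.
shuffled = burst[rng.permutation(len(burst))]
out_shuffled = burst_forward(shuffled, encode, decode)
```

Because the pooling is symmetric, the result does not depend on the frame order or on a fixed burst length, and the decoder runs once per burst instead of once per frame as in the ensemble.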

Figure 69: A comparison between (a) the ensemble version of our single image model and (b) our burst model for a burst size of 8 images. Our set-based approach, which performs fusion at the feature-level, gives perceptually better enhancement results.
Method                PSNR    SSIM   LPIPS  PieAPP  PI
Ours (S)              29.290  0.882  0.327  1.087   4.281
Ours (E) (4 frames)   29.706  0.888  0.329  1.121   4.762
Ours (E) (8 frames)   29.738  0.889  0.332  1.126   4.716
Ours (B) (4 frames)   29.742  0.890  0.313  1.034   4.197
Ours (B) (8 frames)   29.804  0.891  0.306  1.021   4.157
Table VI: A quantitative comparison of the proposed burst model with the ensemble of the single image model for varying number of burst images.

IV-F Limitations

Our approach does have a few limitations. First and foremost, our burst approach might struggle with burst sequences that contain large motion or camera shake, since it is trained on a dataset where the burst frames are spatially aligned. We present such an example in Fig. 72, in which our burst model introduces some unintuitive edges and blurry textures because of the misalignment, while the single image model produces a much sharper output. Second, as illustrated in Fig. 75, our model may sometimes hallucinate non-existent high-frequency details. We suspect that this is caused by the excessive noise in the raw images and may be alleviated to some extent by better modeling of the sensor noise. Third, our framework does not explicitly learn to perform white balance correction and tone mapping, and this somewhat affects the results. In an attempt to address this, we employ an additional post-processing step. In particular, we first apply the white balance correction method proposed in [67] to our result. Then, we adjust highlights and shadows using the Core Image API by Apple (documentation of the API can be found at https://developer.apple.com/documentation/coreimage). Finally, we merge this image with the white-balanced image by using the exposure fusion method by Mertens et al. [68] to obtain a tone-mapped image. Fig. 78 presents the result of this post-processing step on a sample dark input image. This post-processing strategy leads to a visually more pleasing image with vivid colors, further improving the perceived quality of the enhanced image.
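The final merging step can be illustrated with a deliberately simplified variant of exposure fusion. The sketch below keeps only the well-exposedness weight (a Gaussian around mid-gray with sigma 0.2) and blends per pixel; it omits the contrast and saturation measures and the multi-resolution blending of the full method in [68], so it should be read as an illustration of the weighting idea, not as the post-processing used for Fig. 78. The function names and the constant test images are hypothetical.

```python
import numpy as np

def well_exposedness(image, sigma=0.2):
    """Per-pixel weight favoring intensities near 0.5 (image in [0, 1])."""
    w = np.exp(-((image - 0.5) ** 2) / (2.0 * sigma ** 2))
    return np.prod(w, axis=-1)  # combine the color channels

def naive_exposure_fusion(images, eps=1e-12):
    """Weighted per-pixel blend of same-sized (H, W, 3) images."""
    images = np.stack(images, axis=0)                        # (K, H, W, 3)
    weights = np.stack([well_exposedness(im) for im in images], axis=0)
    weights = weights / (weights.sum(axis=0, keepdims=True) + eps)
    return np.sum(weights[..., None] * images, axis=0)

# Two toy inputs standing in for the white-balanced image and the
# highlight/shadow-adjusted image.
white_balanced = np.full((2, 2, 3), 0.5)
adjusted = np.full((2, 2, 3), 0.9)
fused = naive_exposure_fusion([white_balanced, adjusted])
```

In this toy case the fused result stays close to the mid-gray input, because the brighter image receives a much smaller weight at every pixel.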

Figure 72: A limitation of the proposed burst model, comparing (a) our single image model and (b) our burst model. Our model might generate unintuitive edges and blurry textures when the burst frames are not spatially well-aligned.
Figure 75: Another limitation of the proposed approach, comparing (a) the traditional pipeline and (b) our burst model. Our model may sometimes hallucinate false high-frequency details for extremely noisy regions.
Figure 78: Effect of the post-processing procedure applied to the result of our model for a low-light image captured with 0.1 sec exposure: (a) our result, (b) our result with post-processing. Post-processing further improves the perceived quality of the enhanced image.

V Conclusion

In this study, we tackled the problem of learning to generate long-exposure images from a set of extremely low-light burst images. We developed a new deep method that incorporates a coarse-to-fine strategy to better enhance the details of the output. Moreover, we extended this network architecture to work with a burst of images via a novel permutation-invariant CNN architecture, which efficiently exchanges information between the features of the burst frames. Our experiments show that our burst method achieves higher quality results than the existing state-of-the-art models, better capturing fine details, texture and color information while reducing noise. That being said, our analysis also suggests that there is still much room for improvement, especially for dynamic scenes. In that sense, an interesting future research direction is to extend the proposed framework to videos with moving objects or fast camera motion, where capturing temporal relationships between successive frames is crucial.


This work was supported in part by TUBA GEBIP fellowship awarded to E. Erdem. We would like to thank NVIDIA Corporation for the donation of GPUs used in this research.


  • [1] O. Liba, K. Murthy, Y.-T. Tsai, T. Brooks, T. Xue, N. Karnad, Q. He, J. T. Barron, D. Sharlet, R. Geiss et al., “Handheld mobile photography in very low light,” ACM Trans. Graphics, 2019.
  • [2] S. W. Hasinoff, “Photon, poisson noise,” Computer vision: a reference Guide, 2014.
  • [3] T. Brooks, B. Mildenhall, T. Xue, J. Chen, D. Sharlet, and J. T. Barron, “Unprocessing images for learned raw denoising,” in CVPR, 2019.
  • [4] S. W. Hasinoff, D. Sharlet, R. Geiss, A. Adams, J. T. Barron, F. Kainz, J. Chen, and M. Levoy, “Burst photography for high dynamic range and low-light imaging on mobile cameras,” ACM Trans. Graphics, 2016.
  • [5] G. Petschnigg, R. Szeliski, M. Agrawala, M. Cohen, H. Hoppe, and K. Toyama, “Digital photography with flash and no-flash image pairs,” ACM Trans. Graphics, 2004.
  • [6] D. Sugimura, T. Mikami, H. Yamashita, and T. Hamamoto, “Enhancing color images of extremely low light scenes based on rgb/nir images acquisition with different exposure times,” IEEE Trans. Image Process., 2015.
  • [7] A. Buades, B. Coll, and J.-M. Morel, “A non-local algorithm for image denoising,” in CVPR, 2005.
  • [8] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising by sparse 3-d transform-domain collaborative filtering,” IEEE Trans. Image Process., 2007.
  • [9] H. Talebi and P. Milanfar, “Global image denoising,” IEEE Trans. Image Process., 2013.
  • [10] S. G. Chang, B. Yu, and M. Vetterli, “Adaptive wavelet thresholding for image denoising and compression,” IEEE Trans. Image Process., 2000.
  • [11] M. Elad and M. Aharon, “Image denoising via sparse and redundant representations over learned dictionaries,” IEEE Trans. Image Process., 2006.
  • [12] L. I. Rudin, S. Osher, and E. Fatemi, “Nonlinear total variation based noise removal algorithms,” Physica D: nonlinear phenomena, 1992.
  • [13] V. Jain and S. Seung, “Natural image denoising with convolutional networks,” in NeurIPS, 2009.
  • [14] J. Xie, L. Xu, and E. Chen, “Image denoising and inpainting with deep neural networks,” in NeurIPS, 2012.
  • [15] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Trans. Image Process., 2017.
  • [16] K. Zhang, W. Zuo, and L. Zhang, “FFDNet: Toward a fast and flexible solution for CNN-based image denoising,” IEEE Trans. Image Process., 2018.
  • [17] J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras, M. Aittala, and T. Aila, “Noise2noise: Learning image restoration without clean data,” in ICML, 2018.
  • [18] A. Krull, T.-O. Buchholz, and F. Jug, “Noise2void-learning denoising from single noisy images,” in CVPR, 2019.
  • [19] S. Laine, T. Karras, J. Lehtinen, and T. Aila, “High-quality self-supervised deep image denoising,” in NeurIPS, 2019.
  • [20] S. Guo, Z. Yan, K. Zhang, W. Zuo, and L. Zhang, “Toward convolutional blind denoising of real photographs,” in CVPR, 2019.
  • [21] K. G. Lore, A. Akintayo, and S. Sarkar, “LLNet: A deep autoencoder approach to natural low-light image enhancement,” Pattern Recognition, 2017.
  • [22] L. Tao, C. Zhu, G. Xiang, Y. Li, H. Jia, and X. Xie, “Llcnn: A convolutional neural network for low-light image enhancement,” in VCIP, 2017.
  • [23] F. Lv, F. Lu, J. Wu, and C. Lim, “MBLLEN: Low-light image/video enhancement using CNNs.” in BMVC, 2018.
  • [24] R. Wang, Q. Zhang, C.-W. Fu, X. Shen, W.-S. Zheng, and J. Jia, “Underexposed photo enhancement using deep illumination estimation,” in CVPR, 2019.
  • [25] C. Wei, W. Wang, W. Yang, and J. Liu, “Deep retinex decomposition for low-light enhancement,” in BMVC, 2018.
  • [26] Y. Jiang, X. Gong, D. Liu, Y. Cheng, C. Fang, X. Shen, J. Yang, P. Zhou, and Z. Wang, “Enlightengan: Deep light enhancement without paired supervision,” arXiv:1906.06972, 2019.
  • [27] C. Guo, C. Li, J. Guo, C. C. Loy, J. Hou, S. Kwong, and R. Cong, “Zero-reference deep curve estimation for low-light image enhancement,” arXiv: 2001.06826, 2020.
  • [28] C. Chen, Q. Chen, J. Xu, and V. Koltun, “Learning to see in the dark,” in CVPR, 2018.
  • [29] P. Maharjan, L. Li, Z. Li, N. Xu, C. Ma, and Y. Li, “Improving extreme low-light image denoising via residual learning,” in ICME, 2019.
  • [30] S. W. Zamir, A. Arora, S. H. Khan, F. S. Khan, and L. Shao, “Learning digital camera pipeline for extreme low-light imaging,” arXiv: 1904.05939, 2019.
  • [31] T. Buades, Y. Lou, J.-M. Morel, and Z. Tang, “A note on multi-image denoising,” in International Workshop on Local and Non-Local Approximation in Image Processing, 2009.
  • [32] N. Joshi and M. F. Cohen, “Seeing mt. rainier: Lucky imaging for multi-image denoising, sharpening, and haze removal,” in ICCP, 2010.
  • [33] Z. Liu, L. Yuan, X. Tang, M. Uyttendaele, and J. Sun, “Fast burst images denoising,” ACM Trans. Graphics, 2014.
  • [34] B. Mildenhall, J. T. Barron, J. Chen, D. Sharlet, R. Ng, and R. Carroll, “Burst denoising with kernel prediction networks,” in CVPR, 2018.
  • [35] C. Godard, K. Matzen, and M. Uyttendaele, “Deep burst denoising,” in ECCV, 2018.
  • [36] L. Ma, D. Zhao, S. Li, and D. Yu, “End-to-end denoising of dark burst images using recurrent fully convolutional networks,” in VISIGRAPP, 2020.
  • [37] S. Gu and R. Timofte, “A brief review of image denoising algorithms and beyond,” in Inpainting and Denoising Challenges, 2019.
  • [38] P. Chatterjee and P. Milanfar, “Is denoising dead?” IEEE Trans. Image Process., 2010.
  • [39] R. Hummel, “Image enhancement by histogram transformation,” Computer Graphics and Image Processing, 1977.
  • [40] K. Zuiderveld, “Contrast limited adaptive histogram equalization,” in Graphics Gems IV, 1994.
  • [41] H. Ibrahim and N. Pik Kong, “Brightness preserving dynamic histogram equalization for image contrast enhancement,” IEEE Trans. Consum. Electron., 2007.
  • [42] T. Arici, S. Dikbas, and Y. Altunbasak, “A histogram modification framework and its application for image contrast enhancement,” IEEE Trans. Image Process., 2009.
  • [43] E. H. Land, “The retinex theory of color vision,” Scientific American, 1977.
  • [44] M. K. Ng and W. Wang, “A total variation model for retinex,” SIAM J. Imag. Sci., 2011.
  • [45] X. Fu, D. Zeng, Y. Huang, X.-P. Zhang, and X. Ding, “A weighted variational model for simultaneous reflectance and illumination estimation,” in CVPR, 2016.
  • [46] X. Guo, Y. Li, and H. Ling, “LIME: Low-light image enhancement via illumination map estimation,” IEEE Trans. Image Process., 2017.
  • [47] D. J. Jobson, Z.-u. Rahman, and G. A. Woodell, “A multiscale retinex for bridging the gap between color images and the human observation of scenes,” IEEE Trans. Image Process., 1997.
  • [48] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in MICCAI, 2015.
  • [49] A. Dosovitskiy and T. Brox, “Generating images with perceptual similarity metrics based on deep networks,” in NeurIPS, 2016.
  • [50] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual losses for real-time style transfer and super-resolution,” in ECCV, 2016.
  • [51] R. Mechrez, I. Talmi, F. Shama, and L. Zelnik-Manor, “Maintaining natural image statistics with the contextual loss,” in ACCV, 2018.
  • [52] C. Chen, Q. Chen, M. N. Do, and V. Koltun, “Seeing motion in the dark,” in ICCV, 2019.
  • [53] H. Jiang and Y. Zheng, “Learning to see moving objects in the dark,” in ICCV, 2019.
  • [54] M. Delbracio and G. Sapiro, “Hand-held video deblurring via efficient fourier aggregation,” IEEE Trans. Comput. Imag., 2015.
  • [55] M. Aittala and F. Durand, “Burst image deblurring using permutation invariant convolutional neural networks,” in ECCV, 2018.
  • [56] S. Nah, T. Hyun Kim, and K. Mu Lee, “Deep multi-scale convolutional neural network for dynamic scene deblurring,” in CVPR, 2017.
  • [57] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, “High-resolution image synthesis and semantic manipulation with conditional gans,” in CVPR, 2018.
  • [58] S. Ravanbakhsh, J. Schneider, and B. Poczos, “Equivariance through parameter-sharing,” in ICML, 2017.
  • [59] T. Cohen and M. Welling, “Group equivariant convolutional networks,” in ICML, 2016.
  • [60] R. Gens and P. M. Domingos, “Deep symmetry networks,” in NeurIPS, 2014.
  • [61] M. Zaheer, S. Kottur, S. Ravanbakhsh, B. Poczos, R. R. Salakhutdinov, and A. J. Smola, “Deep sets,” in NeurIPS, 2017.
  • [62] R. Mechrez, I. Talmi, and L. Zelnik-Manor, “The contextual loss for image transformation with non-aligned data,” in ECCV, 2018.
  • [63] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
  • [64] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in CVPR, 2018.
  • [65] E. Prashnani, H. Cai, Y. Mostofi, and P. Sen, “Pieapp: Perceptual image-error assessment through pairwise preference,” in CVPR, 2018.
  • [66] Y. Blau, R. Mechrez, R. Timofte, T. Michaeli, and L. Zelnik-Manor, “The 2018 pirm challenge on perceptual image super-resolution,” in ECCV, 2018.
  • [67] M. Afifi, B. Price, S. Cohen, and M. S. Brown, “When color constancy goes wrong: Correcting improperly white-balanced images,” in CVPR, 2019.
  • [68] T. Mertens, J. Kautz, and F. Van Reeth, “Exposure fusion: A simple and practical alternative to high dynamic range photography,” in Computer Graphics Forum, 2009.