Capturing high-quality images in difficult acquisition conditions is a formidable challenge. Such conditions, which are not uncommon, include low lighting levels and dynamic scenes with significant motion or high dynamic range, e.g. in the presence of both dark shadows and bright highlights. The problems related to low-light imaging affect all cameras, but they are most pronounced in smartphones, currently the most commonly used acquisition devices, where the camera and optics must be small, lightweight, and cheap.
The situation is particularly challenging if the device is handheld or the scene is dynamic, as no satisfactory compromise between short and long exposure times exists. To get rich colors and good brightness with low noise, one should choose a long exposure with a low sensor sensitivity setting (ISO value). However, this will cause strong motion blur when the camera is moving (shaking) or when there is motion in the scene. On the other hand, a short exposure with a high sensitivity setting will produce sharp but noisy images of moving objects. Examples of such short and long exposure images are shown in Fig. 1.
We propose a novel approach that addresses the aforementioned challenges by taking “the best of both worlds” via computational photography, avoiding the unsatisfactory trade-off between the short and long exposure settings. The method captures pairs of short and long exposure images in almost instantaneous succession and fuses them into a single high-quality image using a convolutional neural network (CNN). The overall capture time is only fractionally longer than the long exposure time.
Our CNN-based method, called LSD (short for Long-Short Denoising and Deblurring), performs joint image denoising and deblurring, exploiting information from both images and adapting their contributions to the conditions at hand. Experiments show that LSD handles well situations traditionally falling under the rubric of either blind deblurring or denoising methods. Moreover, the real-valued output of the convolutional neural network may be treated as a high dynamic range image.
Many current mobile devices can be programmed to capture sequences of images with different exposure times in rapid bursts, without extra hardware or notable delay. The proposed approach thus brings significant practical benefits in comparison to conventional denoising methods, which are limited to the information in a single image, solve only one of the two problems, and do not cover situations where both blur and noise have to be addressed.
Besides the problems of noise and blur, mobile imaging suffers from the limited dynamic range of camera sensors, which is often more severe in smartphone cameras than in digital single-lens reflex cameras. Even if the user were able to keep the camera perfectly still, the camera might not be able to capture the full dynamic range of the scene with a single exposure. Thus, details are typically lost either in dark shadows or bright highlights. Our approach provides a solution to this problem and produces more faithful colors and brightness values than in single-exposure input images. We note that previous exposure fusion algorithms such as [prabhakar2017deepfuse] assume that input images are neither blurry nor misaligned.
Only a few papers address both the denoising and deblurring problems jointly in a similar setup [yuan2007image, whyte2012non] and, to the best of our knowledge, our work is the first to utilize deep neural networks for this task.
Our approach has the following key ingredients. We train a U-net-shaped deep convolutional neural network that takes a pair of short and long exposure images as input and provides a single high-quality image as output. The network is trained using both simulated and real data. A large volume of simulated data is generated from regular high-quality photographs by synthesizing both under- and overexposed images and adding realistic blur to the latter. Real training data are acquired by capturing image pairs of static scenes with varying exposure times using a tripod. The long exposure image in each real pair is the ground truth target for the network, and the blurred input is obtained by adding synthetic blur to it. Additionally, we train a second U-net for exposure fusion, which takes the short exposure image and the output of the LSD network as input and produces a tone-mapped result as shown in Fig. 1 (bottom right).
The main contributions of the paper are the following:
We present LSD, the first joint denoising and deblurring approach based on convolutional neural networks, and show results superior to the state-of-the-art. The network will be made public.
We propose a novel approach for generating realistic training and evaluation data. The data will be published to facilitate future research.
We show that processing the output of the LSD network with an exposure fusion network achieves better reproduction of colors and brightness than a single-exposure smartphone image.
We will publish the Android software we developed for acquisition of the back-to-back short and long exposure images, enabling reproducibility of our results and further exploitation of multi-exposure imagery.
2 Related work
Single-image denoising is a classical problem, which has been addressed using various approaches such as sparse representations [elad2006image], transform-domain collaborative filtering [dabov2007image], or nuclear norm minimization [gu2014weighted]. In addition, several deep learning based approaches have been proposed recently [jain2009natural, burger2012image, zhang2017beyond, lehtinen2018noise2noise]. Typically, the deep networks are trained with pairs of clean and noisy images [jain2009natural, burger2012image, zhang2017beyond], but it has been shown that training is possible without clean targets [lehtinen2018noise2noise]. Besides the end-to-end deep learning approaches, there are methods that utilize either conventional feed-forward networks [zhang2017learning] or recurrent networks [chen2017trainable] as learnable priors for denoising. Randomly initialized networks have been used as priors without pretraining [deepimageprior]. Many of the recent methods can be applied to other restoration tasks, such as inpainting [lehtinen2018noise2noise, deepimageprior] and single-image super-resolution [zhang2017beyond, chen2017trainable]. Nevertheless, in contrast to our approach, the aforementioned methods focus on single-image restoration and do not address multi-image denoising and deblurring, which is essential in our case.
Single-image deblurring is an ill-posed problem, and various kinds of priors have been utilized to regularize the solutions. For example, the so-called dark and bright channel priors [pan2016blind, yan2017image] have been used with promising results. However, these methods assume spatially invariant blur, which limits their practicality. Priors based on deep networks have also been proposed [zhang2017learning]. There are end-to-end approaches, where a neural network takes the blurry image as input and directly outputs a deblurred result [nimisha2017blur, nah2017deep, DeblurGAN]. Some methods utilize inertial sensor data in addition to images [mustaniemi19, hee2014gyro]. Other methods first estimate blur kernels and thereafter perform non-blind deconvolution [sun2015learning, gong2017motion], and some approaches utilize deep networks for removing the deconvolution artifacts [son2017fast, wang2018training]. Despite recent progress, single-image deblurring methods often fail to produce satisfactory results since the problem is very challenging and ill-posed. That is, unlike our approach, the aforementioned methods cannot utilize a sharp but noisy image to guide the deblurring.
Recently, several multi-image denoising [hasinoff2016burst, mildenhall2018burst] and deblurring [delbracio2015removing, wieschollek2016end, wieschollek2017learning, aittala2018burst] approaches have been proposed that process a burst of consecutively captured input images. However, unlike our approach, these methods do not vary the exposure time of the images but use either short or long exposure bursts. Hence, they address either denoising or deblurring, but not both problems jointly as we do. Moreover, since the characteristics of their input images are not as complementary as in our case, they cannot get “the best of both worlds” but suffer the drawbacks of either case. For example, a burst of short exposure images may suffer from too little light and a low signal-to-noise ratio in the darkest scene regions, although alignment and weighted averaging of multiple frames can alleviate the problem to some extent [hasinoff2016burst, mildenhall2018burst]. On the other hand, using only relatively long exposures causes problems with dynamic scenes, as there may be severe spatial misalignment between the images, and the capture time is longer so that fast-moving objects may disappear from the view. On top of that, based on our own observations and earlier studies [mildenhall2018burst, aittala2018burst], the non-complementary nature of constant exposure images makes it necessary to use more than two input frames, which increases memory and power consumption in addition to capture time. Finally, with a constant exposure, saturated bright regions cannot easily be avoided and high dynamic range imaging is not achieved.
A similar problem setting as in our work is considered in [yuan2007image, whyte2012non], but without utilizing CNNs. Both [yuan2007image] and [whyte2012non] first estimate blur kernels for the blurry image and thereafter use the so-called residual deconvolution, proposed by [yuan2007image], to iteratively estimate the residual image that is added to the denoised sharp image. Both methods use [portilla2003] for denoising, and [whyte2012non] estimates spatially varying blur kernels whereas [yuan2007image] assumes uniform blur. One limitation of [whyte2012non] is that the model is not applicable to non-static scenes and assumes that the camera motion during exposure is limited to rotations about the optical center, whereas our approach generalizes to more varied motions. Another drawback of [yuan2007image] and [whyte2012non] is that they require a separate photometric and geometric registration stage, where the rotation is estimated manually [yuan2007image]. We compared our approach to [whyte2012non] using their images (static scene, pure rotation) and observed that our results are better or comparable, even though the images have unknown exposure times and were captured with another camera whose noise characteristics differ from those of the camera used to train our model (see Fig. 6).
3 Method Overview
The short and long exposure images can be captured with a modern mobile device that supports per-frame camera control. An example is shown in Fig. 1. The short exposure image is sharp but noisy, as it is taken with a high sensitivity setting (ISO 800). Notice that the colors are also distorted compared to the long exposure image (ISO 200), which is blurry due to camera motion. Furthermore, the images are slightly misaligned even though they are captured immediately one after the other.
Fig. 2 shows an overview of the proposed LSD method. The goal is to recover the underlying sharp and noise-free image using a pair of blurry and noisy images. The input images are jointly denoised and deblurred by a convolutional neural network similar to U-net [ronneberger2015u]. The architecture of the network and training details are covered in Sec. 5.
Capturing real pairs of noisy and blurry images together with the ground truth sharp images is a major challenge. To train the network, we propose a data generation framework that produces realistic training data with the help of real gyroscope readings recorded from handheld movements. Details of the data generation framework are given in the next section. To further improve the performance, the network is fine-tuned with real short and long exposure images captured with a mobile device as described in Sec. 5.3.
4 Data Generation
In order to train the network, we need pairs of noisy and blurry images together with the corresponding sharp images. Since there is no easy way to capture such real-world data, we propose a data generation framework that synthesizes realistic pairs of short and long exposure images. By utilizing images taken from the Internet and gyroscope readings, we can generate unlimited amount of training data with realistic blur while covering a wide range of different scene types.
In the following subsections, we describe the different stages of our data generation pipeline: synthesis of long and short exposure image pairs, addition of noise and realistic blur, and simulation of spatial misalignment. The LSD network operates with images having intensity range $[0, 1]$ and hence we first scale the original RGB values to that range. Since the aforementioned imaging effects occur in linear color space, we invert the gamma correction of the input images. As we do not know the real value of the gamma, we assume the standard value $\gamma = 2.2$. Once the images have been generated, the gamma is re-applied.
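The linearization step above can be sketched as follows. This is a minimal illustration assuming an 8-bit input and a fixed gamma of 2.2 (the exact gamma value and the helper names `to_linear` / `to_gamma` are assumptions for this sketch):

```python
import numpy as np

GAMMA = 2.2  # assumed gamma; the true value of the source images is unknown

def to_linear(img_srgb):
    """Scale 8-bit RGB values to [0, 1] and invert the assumed gamma correction."""
    img = img_srgb.astype(np.float64) / 255.0
    return img ** GAMMA

def to_gamma(img_linear):
    """Re-apply the gamma correction once the imaging effects have been simulated."""
    return np.clip(img_linear, 0.0, 1.0) ** (1.0 / GAMMA)
```

All synthetic imaging effects described below would then be applied between these two calls, in linear color space.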
4.1 Synthesis of Long Exposure Images
We take a regular high-quality RGB image from the Internet as the starting point of our simulation. We avoid overexposed or underexposed photographs. However, at test time our long exposure input image should be slightly overexposed in order to enable high dynamic range and to ensure sufficient illumination of the darkest scene regions. Hence, we need to simulate the saturation of intensities due to overexposure. We do that by first multiplying the intensity values with a random number $c$ sampled uniformly from a fixed interval, chosen so that the brightest intensities may exceed 1. The short exposure image is generated from this intensity-scaled version, as described in the next subsection. Then, by clipping the maximum intensity to 1, we get the sharp long exposure image, which will be the ground truth target for network training. That is, we train the network to predict an output with a similar exposure as the long exposure image. This enables us to use the real long exposure images captured with a tripod as targets when fine-tuning with real data (Sec. 5.3). In practical use, the degree of overexposure can be controlled by utilizing an auto-exposure algorithm to determine the long exposure time. Further, the performance can be improved by keeping the ratio between the short and long exposure times constant even if the absolute times vary, e.g. based on the brightness of the scene. Thus, we record the real image pairs so that the short exposure time is always 1/30 of the long exposure time.
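The overexposure simulation can be sketched as below. The sampling interval `c_range` is a hypothetical placeholder for the fixed interval used in practice; the function returns both the unclipped intensity-scaled image (needed later to derive the short exposure image) and the clipped ground-truth long exposure image:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_long_exposure(img_linear, c_range=(1.0, 2.0)):
    """Simulate overexposure: scale intensities by a random factor c, then clip.

    c_range is a placeholder; the interval is chosen so that the brightest
    intensities may exceed 1 before clipping.
    """
    c = rng.uniform(*c_range)
    scaled = img_linear * c                     # intensities may now exceed 1
    long_exposure = np.clip(scaled, 0.0, 1.0)   # saturation of bright regions
    return scaled, long_exposure
```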
4.2 Underexposure and Color Distortion
The underexposed short exposure image is synthesized from the aforementioned long exposure image, where intensities can exceed 1, by applying an affine intensity change ($x \mapsto \alpha x + \beta$) with random coefficients ($\alpha$, $\beta$) sampled from uniform distributions, whose parameters are determined by analyzing the intensity distributions of real short and long exposure pairs captured with a constant exposure time ratio (1/30).
Our analysis of real image pairs showed that the colors are often distorted in the noisy short exposure image, as shown in Fig. 1. Hence, in order to simulate the distortion, we randomly sample different affine transformation parameters ($\alpha_c$, $\beta_c$) for each color channel $c$. Moreover, the parameters of the uniform distributions for $\alpha_c$ and $\beta_c$ are determined independently for each color channel, and they are chosen such that the synthesized short exposure image is always underexposed relative to the long exposure image. By introducing random color distortions, we encourage the network to learn the colors and brightness mainly from the (blurry) long exposure image.
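A minimal sketch of the per-channel affine intensity change; the distribution bounds below are hypothetical placeholders, since the actual bounds are fitted to real short/long exposure pairs captured with the 1/30 exposure-time ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_short_exposure(scaled):
    """Synthesize the underexposed short exposure image from the unclipped
    intensity-scaled image via a per-channel affine change alpha_c*x + beta_c.

    The uniform bounds below are placeholders; in practice they are fitted
    independently per channel from real data.
    """
    out = np.empty_like(scaled)
    for ch in range(3):
        alpha = rng.uniform(0.02, 0.05)   # placeholder bounds
        beta = rng.uniform(0.0, 0.01)     # placeholder bounds
        out[..., ch] = alpha * scaled[..., ch] + beta
    return out
```

Because each channel receives its own random coefficients, the sketch also reproduces the color distortion described above.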
4.3 Motion Blur
The motion blur is simulated only for the long exposure image. Synthetically blurred images are generated with the help of gyroscope measurements. Similar to prior work [hee2014gyro, vsindelavr2013image], we assume that motion blur is mainly caused by the rotation of the camera. We start by recording a long sequence of gyroscope readings with a mobile device. The device is kept more or less steady during the recording to simulate a real-life imaging situation with a shaking hand.
Let $t_s$ denote the starting time of the synthetic image exposure. It is randomly selected to make each of the blur fields different. The level of motion blur is controlled by the exposure time parameter $t_e$, which defines the end time of the exposure $t_s + t_e$. The rotation of the camera $R(t)$ is obtained by solving the quaternion differential equation driven by the angular velocities and computing the corresponding direction cosine matrices [Titterton+Weston:2004]. Assuming that the translation is zero (or that the scene is far away), the motion blur can be modelled using a planar homography

$$H(t) = K R(t) K^{-1}, \qquad (1)$$

where $K$ is the intrinsic camera matrix. Let $\mathbf{x}$ be the projection of a 3D point in homogeneous coordinates. The point-spread function (PSF) of the blur at that location can be computed by tracing the image-plane trajectory $\{ H(t)\,\mathbf{x} \mid t \in [t_s, t_s + t_e] \}$.
Since mobile devices are commonly equipped with a rolling shutter camera, each row of pixels is exposed at a slightly different time. This is another cause of spatially-variant blur [su2015rolling]. When computing the PSFs, the start time of the exposure needs to be adjusted based on the y-coordinate of the point $\mathbf{x}$. Let $t_r$ denote the camera readout time, i.e. the time difference between the first and last row exposure. The exposure of the $y$:th row starts at $t_s + t_r \frac{y}{h}$, where $t_s$ corresponds to the starting time of the first row exposure and $h$ is the number of pixel rows. To take this into account, we modify Eq. 1 so that

$$H_y(t) = K R\left(t + t_r \frac{y}{h}\right) K^{-1}.$$
An example of computed PSFs is shown in Fig. 2. The blurred image is produced by performing a spatially-variant convolution between the sharp image and the blur kernels (PSFs). To speed up the convolution, we only store and process the nonzero elements of each blur kernel.
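The PSF computation can be sketched as follows. This is a simplified stand-in, not the paper's implementation: it integrates gyroscope samples with a first-order rotation update instead of the quaternion differential equation, and approximates the rolling-shutter row offset by skipping a proportional number of rotation samples; all names and parameters are illustrative:

```python
import numpy as np

def rotation_from_gyro(omega, dt):
    """Integrate angular velocity samples (rad/s) into rotation matrices using
    small-angle increments (a simplified stand-in for quaternion integration)."""
    R = np.eye(3)
    Rs = [R.copy()]
    for w in omega:
        wx, wy, wz = w * dt
        # Skew-symmetric cross-product matrix of the incremental rotation.
        W = np.array([[0.0, -wz, wy], [wz, 0.0, -wx], [-wy, wx, 0.0]])
        R = R @ (np.eye(3) + W)   # first-order approximation of the exponential
        Rs.append(R.copy())
    return Rs

def psf_trajectory(K, Rs, x, y, t_r=0.0, t_e=1.0, h=480):
    """Trace the blur trajectory of pixel (x, y) under H(t) = K R(t) K^{-1}.

    For a rolling-shutter camera the exposure of row y starts t_r * y / h
    later; here this is approximated by skipping the first rotation samples
    proportionally to that delay.
    """
    Kinv = np.linalg.inv(K)
    start = int(len(Rs) * (t_r * y / h) / t_e) if t_e > 0 else 0
    p = np.array([x, y, 1.0])
    traj = []
    for R in Rs[start:]:
        q = K @ R @ Kinv @ p
        traj.append(q[:2] / q[2])   # homogeneous normalization
    return np.array(traj)
```

Rasterizing the returned trajectory into a small image would give the PSF at that pixel; with zero angular velocity the trajectory collapses to a single point (no blur).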
4.4 Spatial Misalignment
It is assumed that the blurry image is captured right after the noisy image. Still, the blurry image might be misaligned with respect to the noisy image due to camera or scene motion. Let us consider a horizontal blur kernel with a length of 5 pixels. Normally, the origin would be at the center of the kernel (middle of the exposure). To introduce the effect of spatial misalignment, we set the origin of each PSF kernel to the beginning of the exposure. In the previous example, that would correspond to the first or last position of the kernel, depending on the motion direction. The effect of misalignment is visualized in Fig. 3. Although we assumed that the images are taken immediately one after the other, this approach also extends to cases where there is a known gap between the two exposures.
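The effect of the kernel origin can be illustrated in one dimension with the 5-pixel kernel mentioned above (`blur_1d` is a hypothetical helper written for this sketch):

```python
import numpy as np

# A horizontal 5-pixel averaging kernel. With the origin at the kernel center
# (middle of the exposure), the blurry signal stays centered on the sharp one;
# placing the origin at the first tap instead shifts the blurry response,
# mimicking the long exposure starting right after the short one.
kernel = np.full(5, 1.0 / 5.0)

def blur_1d(signal, kernel, origin):
    """Spatially-invariant 1-D convolution with an explicit kernel origin."""
    n, k = len(signal), len(kernel)
    out = np.zeros(n)
    for i in range(n):
        for j in range(k):
            src = i + j - origin
            if 0 <= src < n:
                out[i] += kernel[j] * signal[src]
    return out
```

For an impulse input, `origin=2` (kernel center) yields a response centered on the impulse, while `origin=0` (first tap) yields a shifted response, which is exactly the misalignment effect being simulated.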
4.5 Realistic Noise
As a final step, we add shot noise to both generated images. Shot noise is considered the dominant source of noise in photographs and is modeled by a Poisson process. The noise magnitude is varied across different images since it depends on the (ISO) sensitivity setting of the camera. In general, the noise is significantly more apparent in the short exposure image, and we model this by setting the noise magnitude of the short exposure image to be larger by a constant factor of 4. Later, in Sec. 5.3, the network is fine-tuned with real examples of noisy images. This way, the noise characteristics can be learned directly from the data.
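A sketch of the shot-noise model. The `photons` scale and the mapping of the factor-4 noise magnitude to a 16x photon reduction (Poisson noise standard deviation scales as the inverse square root of the photon count) are illustrative assumptions for this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def add_shot_noise(img_linear, photons=1000.0, short_exposure=False):
    """Add Poisson-distributed shot noise.

    `photons` maps intensity 1.0 to an expected photon count (a hypothetical
    scale parameter, varied per image to mimic different ISO settings).
    The short exposure image gets a noise magnitude larger by a constant
    factor of 4, modeled here as 16x fewer photons.
    """
    if short_exposure:
        photons /= 16.0
    counts = rng.poisson(np.clip(img_linear, 0.0, None) * photons)
    return counts / photons
```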
Finally, after adding the noise, we ensure that the maximum intensity of the blurry long exposure image does not exceed the maximum brightness value of 1. That is, we clip larger values at 1.
5 Network and Training Details
The network is based on the popular U-Net architecture [ronneberger2015u]. This type of network has been successfully used in many image-to-image translation problems [pix2pix2017]. In our case, the input of the network is a pair of blurry and noisy images (stacked). Since the network is fully convolutional, the images can be of arbitrary size. The architecture of the network is shown in Fig. 2. First, the input goes through a series of convolutional and downsampling layers. Once the bottleneck, i.e. the lowest resolution, is reached, the process is reversed: the upsampling layers expand the low-resolution representation back into a full-resolution image. The feature maps from the encoder are concatenated with equally sized feature maps of the decoder. The number of feature maps is shown below the layers in Fig. 2.
The LSD network was trained on 100k images taken from an online image collection [huiskes2010new]. The synthetically corrupted images have a resolution of 270 × 480 pixels. We used the Adam optimizer [kingma2015j] for training.
The method is targeted at real-world images that have gone through the unknown image processing pipeline of the camera. To this end, we fine-tune the network with real images captured with the NVIDIA Shield tablet, the same device that will be used in testing. This way, the network can learn the noise and color distortion models directly from the data. Examples of real noise are shown in Fig. 4. Notice the relatively coarse appearance of the noise. Our synthetic noise model assumes that the noise is independent for each pixel. This clearly does not hold because of the camera's internal processing (demosaicing, etc.).
We capture pairs of short and long exposure images while the camera is on a tripod. In this case, the long exposure image is used as the ground truth sharp image. It is also used to generate the blurred image as described in Sec. 4.3. The short exposure image directly corresponds to the noisy image. To increase the number of training samples, we capture several image pairs at once while varying the long exposure between 30 and 330 milliseconds. The ratio of exposure times remains fixed so that the short exposure is always 1/30 of the long exposure. The ISO settings for the long and short exposure images are set to 200 and 800, respectively. The original images are divided into four sub-images to further increase the training data. The network was fine-tuned on 3500 images (480 × 960 pixels) for 30 epochs. The rest of the details are the same as in Sec. 5.2.
6 Experiments
We capture pairs of noisy and blurry images in rapid succession with the NVIDIA Shield tablet. The image acquisition setup is the same as in Sec. 5.3, except this time the camera and/or scene is moving. The resolution of the images is 800 × 800 pixels (cropped from the original images). For the quantitative comparison, we use synthetically blurred and noisy image pairs taken from the validation set. An example of such a pair is shown in Fig. 2.
6.1 Single-Image Approaches
The proposed approach is first compared against the state-of-the-art deblurring and denoising methods DeblurGAN [DeblurGAN] and BM3D [dabov2007image]. The noise standard deviation parameter of BM3D has been manually tuned to achieve a good overall balance between noise removal and detail preservation.
Fig. 4 shows the results on static scenes. The short exposure (noisy) image has been normalized so that its intensity matches the blurry image (for visualization). The most apparent weakness of BM3D is that the color information is partly lost and cannot be recovered using the noisy image alone. LSD does a good job at extracting the colors from the blurry image. Saturated image regions, such as the light streaks, do not cause problems for LSD. There is significantly less noise compared to BM3D, which also tends to over-smooth some of the details. The results of DeblurGAN [DeblurGAN] are unsatisfactory as it fails to remove most of the blur.
Fig. 5 shows the performance on a dynamic scene. Although LSD has not been trained for this type of situation, the results are surprisingly good. However, fine details such as the bike wheels remain blurry. A quantitative comparison of the methods is presented in Table 1. LSD outperforms the other methods by a fair margin. DeblurGAN [DeblurGAN] generates a “grid-like” pattern over the blurry images, which partly explains the poor results. See the supplementary material for more results.
6.2 Multi-Image Approaches
The implementations of Whyte et al. [whyte2012non] and Yuan et al. [yuan2007image] are not publicly available. To compare the methods, we use a pair of blurry and noisy images provided by the authors of [whyte2012non]. As the exposure and ISO settings are different, we skip the fine-tuning of LSD. A comparison against the original result of [whyte2012non] is shown in Fig. 6. Even though the setup is not ideal for LSD, it produces equally good, if not better, results. The output of [whyte2012non] shows a little ringing and slightly less detail. Note that Whyte et al. [whyte2012non] and Yuan et al. [yuan2007image] perform a separate denoising step and their inputs are registered (manually).
A recent burst deblurring method by Aittala and Durand [aittala2018burst] takes an arbitrary number of blurry images as input. Using their implementation, we compare the methods in Fig. 7. Their result clearly improves as more images are added. Nevertheless, the final result appears less sharp compared to ours, which is obtained with only two images (blurry and noisy). Furthermore, saturated regions, such as the overexposed windows, cannot be recovered using the long exposure images alone. We also tried feeding a pair of noisy and blurry images to [aittala2018burst], but the results were poor. This is not surprising, as their method is designed for blurry images only. Similar to [whyte2012non, yuan2007image], the input images need to be registered in advance.
6.3 Exposure Fusion
As described in the previous sections, the LSD network performs joint denoising and deblurring and outputs a sharp version of the long exposure image that is aligned with the short exposure image. Thus, the short exposure image and the output of the LSD network would be suitable inputs to exposure fusion methods, such as [prabhakar2017deepfuse], which assume that the input images are neither blurry nor misaligned. However, instead of utilizing existing methods, we simply train a second U-net for exposure fusion, using similar synthetic long and short exposure image pairs as described in Sections 4.1 and 4.2. This time the random scaling factor was sampled uniformly from a different interval, and the ground truth target is the original image, which has not been intensity-scaled and is presumably taken with “good exposure”.
In order to demonstrate high-dynamic range imaging, we then process the short exposure image and the output of the LSD network with our exposure fusion U-net. The results in Figures 1 and 7 show that we get higher dynamic range and better reproduction of colors and brightness than in either one of the single-exposure input images.
The main purpose of this experiment is to demonstrate the suitability of the LSD approach for handheld high-dynamic-range imaging with smartphones. Since exposure fusion is not the main focus of this paper, a more comprehensive evaluation of different approaches is left for future work.
7 Conclusion
We proposed a CNN-based joint image denoising and deblurring method called LSD. It recovers a sharp and noise-free image given a pair of short and long exposure images. Its performance exceeds that of conventional single-image denoising and deblurring methods on both static and dynamic scenes. Furthermore, LSD compares favorably with existing multi-image approaches. Unlike previous methods that utilize pairs of noisy and blurry images, LSD does not rely on any existing denoising algorithm. Moreover, it does not expect the input images to be pre-aligned. Finally, we demonstrated that the LSD output makes exposure fusion possible even in the presence of motion blur and misalignment.