Attention-guided Low-light Image Enhancement

08/02/2019 ∙ by Feifan Lv, et al. ∙ Beihang University 1

Low-light image enhancement is a challenging task since various factors, including brightness, contrast, artifacts and noise, should be handled simultaneously and effectively. To address such a difficult problem, this paper proposes a novel attention-guided enhancement solution and delivers the corresponding end-to-end multi-branch CNNs. The key of our method is the computation of two attention maps to guide the exposure enhancement and denoising respectively. In particular, the first attention map distinguishes underexposed regions from normally exposed regions, while the second attention map distinguishes noises from real-world textures. Under their guidance, the proposed multi-branch enhancement network can work in an adaptive way. Other contributions of this paper include the "decomposition/multi-branch-enhancement/fusion" design of the enhancement network, the reinforcement-net for contrast enhancement, and the proposed large-scale low-light enhancement dataset. We evaluate the proposed method through extensive experiments, and the results demonstrate that our solution outperforms state-of-the-art methods by a large margin. We additionally show that our method is flexible and effective for other image processing tasks.



There are no comments yet.


page 2

page 3

page 4

page 5

page 7

page 8

page 9

page 10

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Images captured in the insufficiently illuminated environment usually have strong degradations, such as poor visibility, low contrast, unexpected noise, etc. Tackling these degradations and converting low-quality low-light images to normally exposed high-quality images require well developed low-light enhancement techniques. Such a technique has a wide range of applications. For example, it can be used in consumer photography to help the users capture beautiful images in the low-light environment. It’s also useful for a variety of intelligent systems, e.g., automated driving and video surveillance, to capture high-quality inputs under low-light conditions.

However, low-light image enhancement is still a challenging task, since it needs to manipulate colorfulness, contrast, brightness and noise simultaneously given the low quality input with missing information. In recent years, various algorithms have been proposed for this purpose, but there is still a lot of room for improvement. Figure 1

shows some limitations of existing methods, which follow typical assumptions of histogram equalization (HE) and Retinex theory. HE-based methods aim to increase the contrast by stretching the dynamic range of images, while Retinex-based methods manipulate the estimated illumination map. Overall, they ignore the influences of noise and focus on restoring brightness and contrast. However, the noise is in fact inevitable and non-negligible in the low-light images.

To suppress the low-light image noise, some methods directly include a denoising process as a separate component in their enhancement pipeline. However, it is problematic to make a simple cascade of the denoising and enhancement procedures. In particular, applying denoising before enhancement will result in blurring, while applying enhancement before denoising will result in noise amplification. Therefore, in this paper, we propose to model and solve the denoising and low-light enhancement problems simultaneously.

Specifically, this paper proposes an attention-guided double-enhancement solution that achieves denoising and enhancing simultaneously and effectively. We find that the severity of low brightness/contrast and high image noise show certain spatial distributions related to the underexposed areas. Therefore, the key is to handle the problem in a region-aware adaptive manner. To this end, we propose the ue-attention map to evaluate the degree of underexposure. It guides the method to pay more attention to the underexposed areas in low light enhancement. In addition, based on the ue-attention map, we derive the noise map to guide the denoising according to the joint distribution of exposure and noise intensity. Subsequently, we design a multi-branch CNN to simultaneously achieve low-light enhancement and denoising under the guidance of both maps. In the final step, we add a fully-convolutional network for improving the image contrast and colorfulness as the second enhancement.

Another difficulty lies in the lack of large-scale paired low-light image dataset, making it challenging to train an effective network. To address this issue, we propose a low-light image simulation pipeline to synthesize realistic low-light images with ideally exposed ground truth images. Image contrast and colorfulness are also improved to provide good references for our image re-enhancement step. Following the above ideas, we propose a large-scale low-light image dataset as an efficient benchmark for low-light enhancement researches.

[width=1]images/figre1_one.pdf [1][2][3]

Fig. 1: Comparison with existing enhancement methods. Our result is with satisfactory visibility, natural colorfulness and higher contrast. For a fair comparison, our method also doesn’t consider the noise. Please zoom in for a better view.

Overall, our contributions lie in three aspects: 1) We propose an attention-guided double-enhancement method and the corresponding multi-branch network architecture111The original version of the multi-branch network can be found in our previous work [4]. However, significant changes have been made to realize the newly proposed ue-attention module, noise attention module and the contrast/colorfulness enhancement, etc. Supplementary materials can be found in the project page.. Guided by the ue-attention map and the noise map, the proposed method achieves low-light enhancement and denoising simultaneously and effectively. 2) We propose a full pipeline for low-light image simulation with high fidelity, based on which we build a new large-scale paired low-light image dataset to support low-light enhancement researches. 3) Comprehensive experiments have been conducted and the experiment results demonstrate that our method outperforms state-of-the-art methods by a large margin.

Ii Related Work

Image enhancement and denoising have been studied for a long time. In this section, we will briefly overview the most related methods.

Traditional enhancement methods. Traditional methods can be mainly divided into two categories. The first category is built upon the histogram equalization (HE) technique. The differences of different HE-based methods are using different additional priors and constraints. In particular, BPDHE [5] tries to preserve image brightness dynamically; Arici et al[6] propose to analyze and penalize the unnatural visual effects for better visual quality; DHECI [7] introduces and uses the differential gray-level histogram; CVC [8] uses the interpixel contextual information; LDR [9] focuses on the layered difference representation of 2D histogram to try to enlarge the gray-level differences between adjacent pixels. These methods expand the dynamic range and focus on improving the contrast of the entire image instead of considering the illumination. They may cause the problem of over- and under-enhancement.

The other category is based on the Retinex theory [10], which assumes that an image is composed of reflection and illumination. Typical methods, e.g., MSR [11] and SSR [12], try to recover and use the illumination map for low-light image enhancement. Recently, AMSR [13] proposes a weighting strategy based on SSR. NPE [14] balances the enhancement level and image naturalness to avoid over-enhancement. MF [15] processes the illumination map in a multi-scale fashion to improve the local contrast and maintain naturalness. SRIE [16] develops a weighted vibrational model for illumination map estimation. LIME [1] develops a structure-aware smoothing model to estimate the illumination map. BIMEF [17] proposes a dual-exposure fusion algorithm and Ying et al[18] use the camera response model for further enhancement. However, the key to these Retinex-based methods is the estimation of the illumination map, which is hand-crafted and relied on careful parameter tuning. Besides, these Retinex-based methods don’t consider noise removal and often amplify the noise.

Learning-based enhancement methods.

Recently, deep learning has achieved great success in the field of low-level image processing. Powerful tools such as end-to-end networks and GANs 

[19] have been used in image enhancement. LLNet [2]

uses the multilayer perceptron autoencoder for low-light image enhancement and denoising. RetinexNet 

[3] combines the Retinex theory with CNN to estimate the illumination map and enhance the low-light images by adjusting the illumination map. Other CNN-based methods such as LLCNN [20] and [21] rely on some traditional methods and are not end-to-end solution to handle brightness/contrast enhancement and denoising simultaneously. Besides, DPED [22] proposes an end-to-end approach using a composite perceptual error function for translating low-quality mobile phone photos into DSLR-quality photos. PPCN [23] designs a compact network and combines teacher-student information transfer to reduce computational cost. WESPE [24] proposes a weakly-supervised method to overcome the restrictions on requiring paired images. As for extremely low-light scenes, Chen et al[25] develop a CNN-based pipeline to directly process raw sensor images. These learning-based methods don’t explicitly contain the denoising process, and some even rely on traditional denoising methods. However, our approach considers the effects of noise and using two attention maps to guide the enhancing and denoising process. So, our method is complementary to existing learning-based methods.

Image denoising methods. Existing works for image denoising are massive. For Gaussian denoising, BM3D [26] and DnCNN [27] are representatives of the filter-based and deep-learning-based methods. For Poisson denoising, NLPCA [28]

combines elements of dictionary learning with sparse patch-based representations of images and employs an adaptation of Principal Component Analysis. Azzari 

et al[29]

propose an iterative algorithm combined with variance-stablizing transformation (VST) and BM3D filter 

[26]. DenoiseNet [30] uses a deep convolutional network to calculate the negative noise components, which adds directly to the original noisy image to remove Poisson noise. For Gaussian-Poisson mixed denoising, CBDNet [31] presents a convolutional blind denoising network by incorporating asymmetric learning. It’s applicable to real noise images by training on both synthetic and real images. For real-world image denoising, TWSC [32] develop a trilateral weighted sparse coding scheme. Chen et al[33] propose a two-step framework which contains noise distribution estimation using GANs and denoising using CNNs. Directly combining these methods with enhancement methods will result in blurring. To avoid this, our solution performs enhancing and denoising simultaneously.

Fig. 2: Example of image synthesis results. From left to right: original image, synthetic underexposed noise-free image, synthetic low-light image, colorful image via fusion processing, synthetic high-contrast image. See Section III for details.

Iii Dataset

Capturing paired large-scale real low-light dataset is difficult. LOL [25] and SID [25] are the only two publicly available datasets of this kind. Images in the LOL dataset are captured in the daytime by controlling the exposure and ISO. However, the way it generates underexposed images is different from general cases. This will result in performance variation when enhancing low-light images caused by other factors (see the result of RetinexNet in Figure 1). The SID dataset is composed of raw sensor data under extremely low-light scenes, which may limit its use for general low-light enhancement researches. Besides, both datasets are relatively small considering the number of images. Therefore, we propose a low-light simulation pipeline to build a large-scale paired low-light image dataset based on public datasets [34, 35, 36, 37].

Iii-a Candidate Image Selection

Our proposed low-light simulation requires high-quality normally exposed images as input, and these images also serve as the ground truth for low-light enhancement. Therefore, we need to distinguish such high-quality images from low-quality ones given large-scale public image datasets, as shown in Figure 3. To this end, we propose a candidate image selection method which takes the proper brightness, rich colorfulness, blur-free and rich details into account. The selection method contains three steps: darkness estimation, blur estimation and colorfulness estimation.

Darkness estimation. To select images with sufficient brightness, we first apply over-segmentation [38] and restore the segmentation results. Subsequently, we calculate the mean/variance of the V component in HSV color space based on the segmentation results. If the calculated mean/variance is larger than thresholds, we set this segmentation block to be sufficiently bright. Finally, images with more than sufficiently bright blocks are selected as candidates.

Blur estimation. This stage selects unblurred images with rich details. Following the same pipeline in [39], we apply the Laplacian edge extraction, calculate the variance among all the output pixels and use a threshold to determine whether this image can be selected.

Colorfulness estimation. We directly estimate the colorfulness according to [40] to select images with rich colorfulness. A threshold is set to to eliminate those low-quality, gray-scale or unnatural images.

To ensure a good diversity, we select images from a total of images based on the above rules to build the dataset. We randomly select of them as the test set which contains images. In this paper, we use the data-balanced subset including images as the training set.


Fig. 3: Samples of large-scale public datasets. The left is low-quality examples while the right is high-quality examples.

Iii-B Target Image Synthesis

We propose a low-light image simulation method to synthesize realistic low-light images from normal-light images, as shown in Figure 2. This produces an adequate number of paired low/normal light images which are needed for training of learning-based methods.

Fig. 4:

The proposed network with four subnets. The Attention-Net and Noise-Net are used to estimate the attention of exposure and noise. The Enhancement-Net and Reinforce-Net are corresponding to the two enhancement processes. The core network is the multi-branch Enhancement-Net, which is composed of feature extraction module (FEM), enhancement module (EM) and fusion module (FM). The dashed lines represent skip connections and the circles represent discontinuous connections.

Fig. 5: Certification of the low-light simulation method: visual comparison and the histogram of Y channel in YCbCr between synthetic images and real different exposure images.

Low-light image synthesis. Low-light images differ from normal-light images due to two salient features: low brightness and the presence of noise. By analyzing images with different degree of exposure, we find that the combination of linear and gamma transformation can approximately convert normal-light images to low-light images. To certificate this, we test on multi-exposure images and use the histogram of Y channel in YCbCr color space as the metric. As shown in Figure 5, the synthetic low-light images are approximately the same to real low-light images. The low-light image simulation pipeline (without additional noise) can be formulated as:


where and

are linear transformations, the

means the gamma transformation. Three parameters all obey uniform distribution:


As for the noise, many previous methods ignore it for convenience, while our method takes it into account. In particular, we follow [31, 41] to use the Gaussian-Poisson mixed noise model and take the in-camera image processing pipeline into account to simulate real low-light noise. The noise model can be formulated as:


where represents adding Poisson noise with variance , is modeled as AWGN with noise variance , stands for the camera response function, is the function that convert RGB images to Bayer images and is the demosaicing function. We do not consider compression in this paper and the configuration is the same as [31].

Image contrast amplification. The high-quality images in our dataset serve as the ground truth for low-light enhancement. However, directly using them to train an image-to-image regression method may result in low-contrast results (see MBLLEN [4] results in Figure 9). To overcome this, we propose a contrast amplification method by synthesizing a new set of high-quality images as the ground truth of our second enhancement step. In particular, we apply exposure fusion to improve the contrast/colorfulness/exposure. First, we use gamma transforms to synthesize images with different exposure settings and saturation levels from each original image. Subsequently, we fuse these differently exposed images following the same routine in [42]. Finally, we apply image smoothing [43] to further enhance the image details. The final outputs are high-contrast images that can be used as ground truth to train a visually better low-light enhancement network.

Iv Methodology

In this section, we introduce the proposed attention-based double-enhancement solution, including the network architecture, the loss function and the implementation details.

Iv-a Network Architecture

We propose a fully convolutional network containing four subnets: an Attention-Net, a Noise-Net, an Enhancement-Net and a Reinforce-Net. Figure 4 shows the overall network architecture. The Attention-Net is designed for estimating the illumination to guide the method to pay more attention to the underexposed areas in enhancement. Similarly, the Noise-Net is designed to guide the denoising process. Under their guidance, the multi-branch Enhancement-Net can perform enhancing and denoising simultaneously. The Reinforce-Net is designed for contrast re-enhancement to solve the low-contrast limitation caused by regression. The detailed description is provided below.

Attention-Net. We directly adopt U-Net in our implementation. The motivation is to provide a guidance to let Enhancement-Net correctly enhance the underexposed areas and avoid over-enhance the normally exposed areas. The output is an ue-attention map indicating the regional under-exposure level, as shown in Figure 6. The higher the illumination is, the lower ue-attention map values are. The ue-attention map’s value range is and is determined by:


where returns the maximum value among three color channels, is the original bright image and is the synthetic low-light image.

As shown in Figure 6, the inverted ue-attention map looks somewhat similar to the illumination map of the Retinex model. This infers that our ue-attention map carries important information used by the popular Retinex model. On the other hand, using our inverted ue-attention map in Retinex model still cannot ensure satisfactory results. This is because the Retinex-based solution faces difficulties in handling black regions (see black regions in Figure 1) and will result in noise amplification (see LIME results in Figure 11). Therefore, we propose to use the ue-attention map as a guidance for our Enhancement-Net introduced later.

Noise-Net. The image noise can be easily confused with image textures, causing unwanted blurring effect after applying simple denoising methods. Estimating the noise distribution beforehand and making the denoising adaptive may help reduce such an effect. Note that the noise distribution is highly related to the distribution of exposure, and thus we propose to use the ue-attention map to help derive a noise map. Under their guidance, the enhancement-net can perform denoising effectively. The Noise-Net is composed of dilated convolutional layers to increase the receptive field, which is conducive to noise estimation.

Enhancement-Net. The motivation is to decompose the enhancement problem into several sub-problems of different aspects (such as noise removal, texture preserving, color correction and so on) and solve them respectively to produce the final output via multi-branch fusion. It’s the core component of the proposed network and it consists of three types of modules: the feature extraction module (FEM), the enhancement module (EM) and the fusion module (FM). FEM is a single stream network with several convolutional layers, each of which uses

kernels, stride of

and ReLU nonlinearity. The output of each layer is both the input to the next layer and also the input to the corresponding subnet of EM.

EMs are modules following each convolutional layer of the FEM. The input to EM is the output of a certain layer in FEM, and the output size is the same as the input. FM accepts the outputs of all EMs to produce the final enhanced image. We concatenate all the outputs from EMs in the color channel dimension and use the convolution kernel to merge them, which equals to the weighted summation with learnable weights.

We propose five different EM structures. As shown in Figure 4, the design of EM follows U-Net [44] and Res-Net [45]

which have been proven effective extensively. In brief, EM-1 is a stack of convolutional/deconvolutional layers with large kernel size. EM-2 and EM-3 has U-Net like structures, and the difference is the skip connection realization and the feature map size. EM-4 has a Res-Net like structure. We remove the Batch-Normalization 

[46] and use just a few res-blocks to reduce the model parameter. EM-5 is composed of dilated convolutional layers whose output size is the same as the input.

[width=0.48]images/illumination3.pdf [1][3]

Fig. 6: Comparison between our ue-attention map and the illumination maps used for retinex-based methods. Our ue-attention map can generate similar illumination information with more details.

Reinforce-Net. The motivation is to overcome the low-contrast drawback and improve the details (see the difference between MBLLEN [4] and ours in Figure 9). Previous research[47] demonstrates the effectiveness of dilated convolution in image processing. Therefore, we use a similar network to improve contrast and details simultaneously.

Iv-B Loss Function

In order to improve the image quality both qualitatively and quantitatively, we propose a new loss function by further considering the structural information, perceptual information and regional difference of the image. It is expressed as:


where the , , and represent the loss function of Attention-Net, Noise-Net, Enhancement-Net and Reinforce-Net, and are the corresponding coefficients. The details of the four loss functions are given below.

Attention-Net loss. To obtain the correct ue-attention map for guiding the Enhancement-Net, we use the error metric to measure the prediction error as:


where is the input image, and are the predicted and expected ue-attention maps.

Noise-Net loss. Similarly, we use the error metric to measure the prediction error of the Noise-Net as:


where , and are the predicted and expected noise maps.

Enhancement-Net loss. Due to the low brightness of the image, only using common error metrics such as mse or mae may cause structure distortion such as blur effect and artifacts. We design a new loss that consists of four components to improve the visual quality. It is defined as:


where the , , and represent bright loss, structural loss, perceptual loss and regional loss. And , , and are the corresponding coefficients.

The bright loss is designed to ensure that the enhanced results have sufficient brightness. It is defined as:


where and are the predicted and expected enhancement images. is defined as: .

The structural loss is introduced to preserve the image structure and avoid blurring. We use the well-known image quality assessment algorithm SSIM [48] to build our structure loss. The structural loss is defined as:


where and are pixel value averages, and are variances, is the covariance, and and are constants to prevent the denominator to zero.

The perceptual loss is introduced to use higher-level information to improve the visual quality. We use the well-behaved VGG network [49] as the content extractor [50]. In particular, we define the perceptual loss based on the output of the ReLU activation layers of the pre-trained VGG-19 network. The perceptual loss is defined as follows:


where and are the predicted and expected enhancement images, and , and describe the dimensions of the respective feature maps within the VGG-19 network. Besides, indicates the feature map obtained by -th convolution layer in -th block of the VGG-19 Network.

For low-light image enhancement, except taking the image as a whole, we should pay more attention to the underexposed regions. We propose the regional loss to balances the degree of enhancement for different regions. It is defined as:


where represents the image quality assessment algorithm SSIM [48] and is the predicted ue-attention map which is used as the guidance.

Reinforce-Net loss. Similar to the Enhancement-Net loss, the Reinforce-Net loss is defined as:


where , and represent bright loss, structural loss and perceptual loss, and are the same as , and . In the experiments, we empirically set , , , .

Iv-C Implementation Details

Our implementation is done with Keras 


and Tensorflow  

[52]. The proposed network can be quickly converged after being trained for epochs on a Titan-X GPU using the proposed dataset. In order to prevent overfitting, we use random clipping, flipping and rotating for data augmentation. We set the batch-size to and the size of random clipping patches to . The input image values is scaled to . We use the output of the fourth convolutional layer in the third block of VGG-19 network as the perceptual loss extraction layer.

In the experiment, training is done using the Adam optimizer [53] with parameters of , , and . We also use the learning rate decay strategy, which reduces the learning rate to before the next epoch. At the same time, we reduce the learning rate to when the loss metric has stopped improving.

[width=1]images/ex1.pdf [1][17][2][1][17][2]

Fig. 7: Visual comparison on synthetic low-light images. We combine LIME [1] and BIMEF [17] with the denoising method CBDNet [31] to produce the final comparison results. Please zoom in for a better view.

V Experimental Evaluation

We compare our method with existing methods through extensive experiments. We use the publicly-available codes with recommended parameter settings. We evaluate the performance using different metrics, including PSNR, SSIM [48], AB [54], VIF [55], LOE [17], TMQI [56] and LPIPS [57]. Note that in the tables below, red, green and blue colors indicate the best, sub-optimal and third best results, respectively.

Overall, our experiment is divided into three sections. First, we make qualitative and quantitative comparisons based on our synthetic dataset and two public-available real low-light datasets. Second, we make visual comparisons with state-of-the-art methods on natural low-light images and provide a user study. We also show the robustness of our method and the benefit to some high-level tasks. Finally, we provide an ablation study to evaluate the effect of different elements and discuss unsatisfying cases.

V-a Experiments on Synthetic Datasets

Direct comparison. We compare our method with methods on our synthetic dataset. Since most methods do not have the ability to remove noise, we combine them with the state-of-the-art denoising method CBDNet [31] to produce the final comparison results. Quantitative comparison results are shown in Table I and Table II. Our result significantly outperforms other methods in all quality metrics, which fully demonstrates the superiority of our approach.

Fig. 8: Intuitive comparison of different enhancement methods.
PSNR SSIM [48] LPIPS [57] VIF [55] LOE [17] TMQI [56] AB [54]
Input 11.99 0.45 0.26 0.33 677.85 0.80 -59.22
BIMEF [17] 18.28 0.76 0.11 0.49 550.20 0.85 -28.06
LIME [1] 15.80 0.68 0.20 0.48 1121.17 0.80 -2.46
MSRCR [11] 14.87 0.72 0.15 0.52 1249.24 0.82 35.07
MF [15] 15.89 0.68 0.18 0.44 766.00 0.83 -36.88
SRIE [16] 13.83 0.56 0.21 0.37 787.42 0.82 -47.86
Dong [58] 15.37 0.65 0.22 0.35 1228.49 0.81 -33.80
NPE [14] 14.93 0.66 0.18 0.42 875.15 0.83 -41.35
DHECI [7] 18.13 0.76 0.17 0.39 547.12 0.87 -17.37
BPDHE [5] 13.62 0.60 0.24 0.34 609.89 0.82 -47.82
AMSR [13] 7.12 0.12 0.80 0.05 4554.29 0.42 -98.20
HE 17.88 0.76 0.18 0.47 596.67 0.88 19.24
Ying [18] 19.21 0.80 0.11 0.56 778.67 0.83 -9.28
WAHE [6] 15.46 0.65 0.18 0.44 564.83 0.84 -39.38
CVC [8] 14.62 0.59 0.22 0.39 610.84 0.83 -42.70
LDR [9] 15.21 0.59 0.20 0.42 590.66 0.83 -39.27
LLNet [2] 20.11 0.80 0.39 0.40 1088.43 0.87 4.30
MBLLEN [4] 24.21 0.90 0.08 0.63 536.75 0.91 -3.66
Ours 25.24 0.94 0.08 0.67 495.48 0.93 2.04

TABLE I: Quantitative comparison of synthetic low-light image (without additional noise) enhancement.
PSNR SSIM [48] LPIPS [57] VIF [55] LOE [17] TMQI [56] AB [54]
Input 11.23 0.37 0.41 0.23 925.06 0.77 -65.32
BIMEF [17] 16.57 0.64 0.32 0.28 978.96 0.83 -32.65
LIME [1] 14.79 0.59 0.34 0.26 1462.64 0.79 -7.39
MSRCR [11] 14.83 0.62 0.34 0.27 1559.05 0.84 30.98
MF [15] 15.29 0.59 0.33 0.26 1095.33 0.82 -37.46
SRIE [16] 13.10 0.48 0.37 0.25 1095.30 0.80 -52.53
Dong [58] 14.69 0.56 0.35 0.21 1592.27 0.79 -33.99
NPE [14] 14.56 0.58 0.33 0.25 1302.10 0.82 -41.17
DHECI [7] 16.57 0.61 0.37 0.23 924.78 0.86 -15.20
BPDHE [5] 12.60 0.48 0.38 0.23 925.56 0.79 -54.66
AMSR [13] 7.09 0.13 0.80 0.07 4465.68 0.42 -98.30
HE 16.65 0.64 0.36 0.26 1036.22 0.87 20.21
Ying [18] 17.18 0.67 0.31 0.28 1152.94 0.83 -13.97
WAHE [6] 13.97 0.52 0.36 0.27 935.21 0.81 -46.87
CVC [8] 13.66 0.50 0.36 0.25 940.89 0.81 -47.61
LDR [9] 13.62 0.48 0.36 0.26 939.36 0.80 -47.59
LLNet [2] 18.40 0.69 0.56 0.26 1168.75 0.85 -5.25
MBLLEN [4] 19.27 0.73 0.23 0.30 864.57 0.89 -4.87
Ours 20.84 0.82 0.17 0.33 785.64 0.91 4.36

TABLE II: Quantitative comparison of synthetic low-light images (with additional noise) enhancement.

Representative results are visually shown in Figure 7. By checking the details, it is clear that our method achieves better visual effects, including good brightness/contrast and less artifacts. Please zoom in to compare the details.

Efficiency comparison. In addition to the result quality, efficiency is also an important metric to algorithms. In order to demonstrate the superiority of our method, we use images with size

as the benchmark to test running time. It should be pointed out that the LLNet is implemented with Theano 

[59], and other methods are implemented with Matlab. Test computer configuration is Intel i5-8400 CPU, GB memory and NVIDIA Titan-Xp GPU. The comparison methods don’t use GPUs other than LLNet [2]. In order to more intuitively demonstrate the relationship between performance and efficiency, we show Figure 8. Our method clearly performs other methods in terms of both quality and efficiency. The problem with LLNet [2] is that it cannot process the entire high resolution image due to its MLP structure. The only choice is to enhance each patch respectively, which greatly limits its effectiveness.

[width=1]images/LOLSID2.pdf [3][4][3][4][25]

Fig. 9: Visual comparison on the LOL dataset and the SID dataset. We use BM3D [26] as the denoising operation in RetinexNet [3] to get free-noise enhancement results for comparison. Please zoom in for a better view.

V-B Experiments on Real Datasets

Besides synthetic datasets, our method also performs well on real low-light image datasets. We evaluate the performance based on two public-available real low-light datasets and show the visual comparison on challenging images.

LOL dataset. This dataset is captured by control the exposure and ISO in the daytime. We fine-tune our model using this dataset to compare with RetinexNet [25], which is trained on the LOL dataset. In addition, we replace the Enhancement-Net by a standard U-Net to build a lightweight version. Following PPCN [23], we also adopt knowledge transfer to further promote its performance. Quantitative comparison is shown in Table III. For both quality and efficiency comparisons, our method performs better, manifesting that our method effectively learns the adjustment and restoration. Visual comparison is shown in Figure 9. Compared with RetinexNet [3] and MBLLEN [4], our results with clear details, better contrast, normal brightness and natural white balance.

Method PSNR SSIM LPIPS Time Params
RetinexNet [3] 16.77 0.56 0.47 0.06 0.44M
RetinexNet [3] + BM3D 17.91 0.73 0.22 2.75 0.44M
MBLLEN [4] 18.56 0.75 0.19 0.05 0.31M
Ours-lightweight-1 18.33 0.75 0.18 0.05 0.21M
Ours-lightweight-2 18.60 0.76 0.22 0.05 0.25M
Ours-1 18.27 0.78 0.18 0.06 0.88M
Ours-2 18.81 0.76 0.21 0.06 0.92M
Chen et al [25] 28.88 0.79 0.36 0.51 7.76M
Ours 27.96 0.77 0.36 0.48 0.88M

TABLE III: Quantitative comparison between our method and state-of-the-arts on the LOL dataset and the SID dataset.

SID dataset. This dataset contains raw short-exposure images with corresponding long-exposure reference images and is benchmarking single-image processing of extremely low-light raw images. Due to the larger bit depth, raw images are more suitable for extremely low-light scenes compared with rgb images. Different from traditional pipelines, Chen et al[25] develop a pipeline based on an end-to-end network and achieve excellent results. Need to notice that, processing low-light raw images is a related but not identical problem. However, to prove the ability of our multi-branch network, we use the same configuration except that the network is replaced by our Enhancement-Net. Quantitative comparison is shown in Table III. Our model is lightweight and more efficient, but achieves comparable enhancement quality. In addition, our results have better visual effects as shown in Figure 9.

V-C Experiments on Real Images

In this section, we evaluate our method on real low-light images, including natural, monochrome and game scenes. We also show the benefit to object detection and semantic segmentation under low-light environment by directly using our method as the pre-processing.

[width=0.49]images/use.pdf [1][17][2][3]

Fig. 10: Rating distribution of the user study.

[width=1]images/natural_compare2.pdf [58][1][17][2]




Fig. 11: Visual comparison of natural low-light images, which are snapshotted at night. Please zoom in for a better view.

Natural low-light images. We first compare our method with state-of-the-art methods on natural low-light images and the representative visual comparison results are shown in Figure 11. Our method surpasses other methods in two key aspects. On the one hand, our method can restore vivid and natural color to make the enhancement results more realistic. In contrast, Retinex-based methods (such as RetinexNet [3] and LIME [1]) will cause different degrees of color distortion. On the other hand, our method is able to recover better contrast and more details. This improvement is especially evident when compared with LLNet [2], BIMEF [17] and MBLLEN [4].

User study. We invite participants to attend the user study and choose four representative methods for comparison. We randomly select natural low-light scene images to construct groups of enhanced results. Every group contains an original low-light image and five enhanced images.

For every image group, the participants are asked to rank the quality of enhancement results and score from (worst) to (best). The enhancement results are presented in a random order to avoid subjective bias. Participants are advised to measure quality enhancements in terms of brightness, naturalness, detail perception, contrast, and vivid color. We don’t set time limitation to ensure the participants make a full comparison.

Figure 10 shows the rating distribution of the user study. Our method receives more “red” or “orange” and fewer “green” or “blue” ratings, which shows that our results are more preferred by human subjects.

Robustness study. To prove the robustness of our method, we directly apply our trained model to enhance some specific types of low-light scenes (such as monochrome surveillance and game night scenes) that are unseen in the training dataset. Figure 12 shows the enhancement results. The results demonstrate that our method is robust and effective for general low-light image enhancement tasks. Besides, we also show that our approach is beneficial to some high-level tasks in low-light scenes, such as object detection and semantic segmentation, as shown in Figure 13. The performance of Mask-RCNN [60, 61] has been improved a lot by using our method in a pre-processing stage without any fine-tuning.

Fig. 12: Robustness proof: visual results of monochrome surveillance and game night scenes enhancement.

V-D Ablation Study

In this section, we quantitatively evaluate the effectiveness of different components in our method based on our synthetic low-light dataset. Table IV reports the accuracy of the presented change in terms of PSNR and SSIM [48]. Note that the Reinforce-Net is not considered in this study.

Loss functions. We mainly evaluate the loss function of the Enhancement-Net, as shown in Table IV (row 2-5). We use mse as the naive loss function under condition 2. The results show that the quality of enhancement is improving by containing more loss components.

Network structures. As shown in Table IV (row 6-7), we evaluate the effectiveness of different network components. Similar to the loss function, the results demonstrate that more components of our network will result in better performance.

Number of branches. We analyze the effect of different branch numbers (model size) on the network performance, as shown in Table IV (row 8-9). Obviously, the increase of model size will not always improve performance, so we set branches as the default configuration.

Condition PSNR SSIM
1. default configuration 20.84 0.82
2. w/o , w/o , w/o , w/o 19.36 0.73
3. with , w/o , w/o , w/o 20.01 0.76
4. with , with , w/o , w/o 19.92 0.78
5. with , with , with , w/o 20.58 0.81
6. w/o Attention-Net, w/o Noise-Net 19.12 0.71
7. with Attention-Net, w/o Noise-Net 20.66 0.80
8. branch number () 20.66 0.79
9. branch number () 20.83 0.82

TABLE IV: Ablation study. This table reports the performance under each condition based on the synthetic low-light dataset. In this table, ”w/o” means without.

V-E Unsatisfying Cases

Figure 14 presents several example cases where our method, as well as other state-of-the-art methods, all fail to produce satisfying results. For the top image, our method fails to recover the face details, as the face region of the original image without any trace of texture. In addition, the result contains significant blocking artifacts due to strong image compression. For the bottom images, our method fails to clear extremely heavy noise and fails to produce satisfying results for non-visible images, such as Infrared images. Solving these unsatisfying cases will be our future topic.

Vi Conclusion

This paper proposes an attention-guided enhancement solution and implements it by a multi-branch network to handle low-light image enhancement and denoising simultaneously. The key is to use the proposed ue-attention map and noise map to guide the enhancement in a region-aware adaptive manner. We also propose a low-light simulation pipeline and build a large-scale low-light enhancement benchmark dataset to enable network training and evaluation. Extensive experiments demonstrate that our solution outperforms state-of-the-art methods by a large margin. Besides, our proposed multi-branch network is flexible and effective for other low-level image processing tasks. Please refer to Figure 15.

Our future work will focus on solving the unsatisfying cases, such as those with blocking artifacts due to compression, large black regions without any texture, extremely strong noise and so on.

Fig. 13: Improvements of object detection and semantic segmentation in low-light environment by using our method as the pre-processing.
Fig. 14: Unsatisfying cases: images with untextured black regions, images with heavy compression, images with extremely strong noise, and Infrared images.

[width=0.5]images/lowlevel.pdf [62][63]





Fig. 15: Visual comparison of several low-level vision tasks.


  • [1] X. Guo, Y. Li, and H. Ling, “Lime: Low-light image enhancement via illumination map estimation,” IEEE Transactions on Image Processing (TIP), vol. 26, no. 2, pp. 982–993, 2017.
  • [2] K. G. Lore, A. Akintayo, and S. Sarkar, “L lnet: A deep autoencoder approach to natural low-light image enhancement,” Pattern Recognition, vol. 61, pp. 650–662, 2017.
  • [3] W. Y. J. L. Chen Wei, Wenjing Wang, “Deep retinex decomposition for low-light enhancement,” British Machine Vision Conference (BMVC), 2018.
  • [4] F. Lv, F. Lu, J. Wu, and C. Lim, “Mbllen: Low-light image/video enhancement using cnns,” British Machine Vision Conference (BMVC), 2018.
  • [5] H. Ibrahim and N. S. P. Kong, “Brightness preserving dynamic histogram equalization for image contrast enhancement,” IEEE Transactions on Consumer Electronics, vol. 53, no. 4, pp. 1752–1758, 2007.
  • [6] T. Arici, S. Dikbas, and Y. Altunbasak, “A histogram modification framework and its application for image contrast enhancement,” IEEE Transactions on image processing (TIP), vol. 18, no. 9, pp. 1921–1935, 2009.
  • [7] K. Nakai, Y. Hoshi, and A. Taguchi, “Color image contrast enhacement method based on differential intensity/saturation gray-levels histograms,” in Intelligent Signal Processing and Communications Systems (ISPACS), 2013 International Symposium on.   IEEE, 2013, pp. 445–449.
  • [8] T. Celik and T. Tjahjadi, “Contextual and variational contrast enhancement,” IEEE Transactions on Image Processing (TIP), vol. 20, no. 12, pp. 3431–3441, 2011.
  • [9] C. Lee, C. Lee, and C.-S. Kim, “Contrast enhancement based on layered difference representation of 2d histograms,” IEEE transactions on image processing (TIP), vol. 22, no. 12, pp. 5372–5384, 2013.
  • [10] E. H. Land, “The retinex theory of color vision,” Scientific American, vol. 237, no. 6, pp. 108–129, 1977.
  • [11] D. J. Jobson, Z.-u. Rahman, and G. A. Woodell, “A multiscale retinex for bridging the gap between color images and the human observation of scenes,” IEEE Transactions on Image processing, vol. 6, no. 7, pp. 965–976, 1997.
  • [12] ——, “Properties and performance of a center/surround retinex,” IEEE Transactions on Image processing, vol. 6, no. 3, pp. 451–462, 1997.
  • [13] C.-H. Lee, J.-L. Shih, C.-C. Lien, and C.-C. Han, “Adaptive multiscale retinex for image contrast enhancement,” in Signal-Image Technology & Internet-Based Systems (SITIS), 2013 International Conference on.   IEEE, 2013, pp. 43–50.
  • [14] S. Wang, J. Zheng, H.-M. Hu, and B. Li, “Naturalness preserved enhancement algorithm for non-uniform illumination images,” IEEE Transactions on Image Processing, vol. 22, no. 9, pp. 3538–3548, 2013.
  • [15] X. Fu, D. Zeng, Y. Huang, Y. Liao, X. Ding, and J. Paisley, “A fusion-based enhancing method for weakly illuminated images,” Signal Processing, vol. 129, pp. 82–96, 2016.
  • [16] X. Fu, D. Zeng, Y. Huang, X.-P. Zhang, and X. Ding, “A weighted variational model for simultaneous reflectance and illumination estimation,” in

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    , 2016, pp. 2782–2790.
  • [17] Z. Ying, G. Li, and W. Gao, “A bio-inspired multi-exposure fusion framework for low-light image enhancement,” arXiv preprint arXiv:1711.00591, 2017.
  • [18] Z. Ying, G. Li, Y. Ren, R. Wang, and W. Wang, “A new low-light image enhancement algorithm using camera response model,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3015–3022.
  • [19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
  • [20]

    L. Tao, C. Zhu, G. Xiang, Y. Li, H. Jia, and X. Xie, “Llcnn: A convolutional neural network for low-light image enhancement,” in

    Visual Communications and Image Processing (VCIP), 2017 IEEE.   IEEE, 2017, pp. 1–4.
  • [21] L. Tao, C. Zhu, J. Song, T. Lu, H. Jia, and X. Xie, “Low-light image enhancement using cnn and bright channel prior,” in Image Processing (ICIP), 2017 IEEE International Conference on.   IEEE, 2017, pp. 3215–3219.
  • [22] A. Ignatov, N. Kobyshev, R. Timofte, K. Vanhoey, and L. Van Gool, “Dslr-quality photos on mobile devices with deep convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3277–3285.
  • [23] Z. Hui, X. Wang, L. Deng, and X. Gao, “Perception-preserving convolutional networks for image enhancement on smartphones,” in ECCV Workshop, 2018, pp. 197–213.
  • [24] A. Ignatov, N. Kobyshev, R. Timofte, K. Vanhoey, and L. Van Gool, “Wespe: weakly supervised photo enhancer for digital cameras,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 691–700.
  • [25] J. X. Chen Chen, Qifeng Chen and V. Koltun, “Learning to see in the dark,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [26] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, “Image denoising with block-matching and 3d filtering,” in

    Image Processing: Algorithms and Systems, Neural Networks, and Machine Learning

    , vol. 6064.   International Society for Optics and Photonics, 2006, p. 606414.
  • [27] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
  • [28] J. Salmon, Z. Harmany, C.-A. Deledalle, and R. Willett, “Poisson noise reduction with non-local pca,” Journal of mathematical imaging and vision, vol. 48, no. 2, pp. 279–294, 2014.
  • [29] L. Azzari and A. Foi, “Variance stabilization for noisy+ estimate combination in iterative poisson denoising,” IEEE signal processing letters, vol. 23, no. 8, pp. 1086–1090, 2016.
  • [30] T. Remez, O. Litany, R. Giryes, and A. M. Bronstein, “Deep convolutional denoising of low-light images,” arXiv preprint arXiv:1701.01687, 2017.
  • [31] S. Guo, Z. Yan, K. Zhang, W. Zuo, and L. Zhang, “Toward convolutional blind denoising of real photographs,” 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [32] J. Xu, L. Zhang, and D. Zhang, “A trilateral weighted sparse coding scheme for real-world image denoising,” ECCV, 2018.
  • [33] J. Chen, J. Chen, H. Chao, and M. Yang, “Image blind denoising with generative adversarial network based noise modeling,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3155–3164.
  • [34] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” International journal of computer vision, vol. 88, no. 2, pp. 303–338, 2010.
  • [35] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision.   Springer, 2014, pp. 740–755.
  • [36] M. Grubinger, P. Clough, H. Müller, and T. Deselaers, “The iapr tc-12 benchmark: A new evaluation resource for visual information systems,” in Int. Workshop OntoImage, vol. 5, no. 10, 2006.
  • [37]

    S. M. Bileschi, “Streetscenes: Towards scene understanding in still images,” MASSACHUSETTS INST OF TECH CAMBRIDGE, Tech. Rep., 2006.

  • [38] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “Slic superpixels compared to state-of-the-art superpixel methods,” IEEE transactions on pattern analysis and machine intelligence (TPAMI), vol. 34, no. 11, pp. 2274–2282, 2012.
  • [39] J. L. Pech-Pacheco, G. Cristóbal, J. Chamorro-Martinez, and J. Fernández-Valdivia, “Diatom autofocusing in brightfield microscopy: a comparative study,” in Pattern Recognition, 2000. Proceedings. 15th International Conference on, vol. 3.   IEEE, 2000, pp. 314–317.
  • [40] D. Hasler and S. E. Suesstrunk, “Measuring colorfulness in natural images,” in Human vision and electronic imaging VIII, vol. 5007.   International Society for Optics and Photonics, 2003, pp. 87–96.
  • [41] H. Yamashita, D. Sugimura, and T. Hamamoto, “Low-light color image enhancement via iterative noise reduction using rgb/nir sensor,” Journal of Electronic Imaging, vol. 26, no. 4, p. 043017, 2017.
  • [42] T. Mertens, J. Kautz, and F. Van Reeth, “Exposure fusion,” in Computer Graphics and Applications, 2007. PG’07. 15th Pacific Conference on.   IEEE, 2007, pp. 382–390.
  • [43] L. Xu, C. Lu, Y. Xu, and J. Jia, “Image smoothing via l 0 gradient minimization,” in ACM Transactions on Graphics (TOG), vol. 30, no. 6.   ACM, 2011, p. 174.
  • [44] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical image computing and computer-assisted intervention.   Springer, 2015, pp. 234–241.
  • [45] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), 2016, pp. 770–778.
  • [46] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
  • [47] Q. Chen, J. xu, and V. Koltun, “Fast image processing with fully-convolutional networks,” ICCV, 2017.
  • [48] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE transactions on image processing, vol. 13, no. 4, pp. 600–612, 2004.
  • [49] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [50] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al.

    , “Photo-realistic single image super-resolution using a generative adversarial network,”

    arXiv preprint, 2016.
  • [51] F. Chollet et al., “Keras,”, 2015.
  • [52] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
  • [53] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [54] Z. Chen, B. R. Abidi, D. L. Page, and M. A. Abidi, “Gray-level grouping (glg): an automatic method for optimized image contrast enhancement-part i: the basic method,” IEEE transactions on image processing, vol. 15, no. 8, pp. 2290–2302, 2006.
  • [55] H. R. Sheikh and A. C. Bovik, “Image information and visual quality,” IEEE Transactions on image processing, vol. 15, no. 2, pp. 430–444, 2006.
  • [56] H. Yeganeh and Z. Wang, “Objective quality assessment of tone-mapped images,” IEEE Transactions on Image Processing, vol. 22, no. 2, pp. 657–667, 2013.
  • [57]

    R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in

    CVPR, 2018.
  • [58] X. Dong, G. Wang, Y. Pang, W. Li, J. Wen, W. Meng, and Y. Lu, “Fast efficient algorithm for enhancement of low lighting video,” in Multimedia and Expo (ICME).   IEEE, 2011, pp. 1–6.
  • [59] Theano Development Team, “Theano: A Python framework for fast computation of mathematical expressions,” arXiv e-prints, vol. abs/1605.02688, May 2016. [Online]. Available:
  • [60] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, “Mask r-cnn,” 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988, 2017.
  • [61] W. Abdulla, “Mask r-cnn for object detection and instance segmentation on keras and tensorflow,”, 2017.
  • [62] K. He, J. Sun, and X. Tang, “Single image haze removal using dark channel prior,” IEEE transactions on pattern analysis and machine intelligence (TPAMI), vol. 33, no. 12, pp. 2341–2353, 2011.
  • [63] B. Cai, X. Xu, K. Jia, C. Qing, and D. Tao, “Dehazenet: An end-to-end system for single image haze removal,” IEEE Transactions on Image Processing (TIP), vol. 25, no. 11, pp. 5187–5198, 2016.
  • [64]

    R. Zhang, P. Isola, and A. A. Efros, “Colorful image colorization,” in

    European Conference on Computer Vision.   Springer, 2016, pp. 649–666.
  • [65] S. Iizuka, E. Simo-Serra, and H. Ishikawa, “Let there be color!: joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification,” ACM Transactions on Graphics (TOG), vol. 35, no. 4, p. 110, 2016.
  • [66] C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in European conference on computer vision (ECCV).   Springer, 2014, pp. 184–199.
  • [67] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and Y. Fu, “Image super-resolution using very deep residual channel attention networks,” in ECCV, 2018.
  • [68] J. Pan, D. Sun, H. Pfister, and M.-H. Yang, “Blind image deblurring using dark channel prior,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1628–1636.
  • [69]

    O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, and J. Matas, “Deblurgan: Blind motion deblurring using conditional adversarial networks,”

    ArXiv e-prints, 2017.