Nowadays, taking photos is convenient thanks to the omnipresence of cameras on multiple devices. However, photos often suffer from degradations such as low contrast, noise, and color distortion, caused by the environment and equipment limitations. Since visual perception depends on application scenarios and users' aesthetics, image enhancement should be guided by these factors to improve photo quality. Although existing professional software provides tools for manipulating photos to help users obtain visually pleasing images, these tools are either user-unfriendly or perform poorly. Thus, a low-light image enhancement method that meets different needs is essential.
With the rapid development of deep learning, various methods have been proposed to enhance low-light images. Recent algorithms [10, 13, 18, 1, 20, 4, 6, 14] enhance low-light images to a fixed brightness; that is, they learn the brightness difference of the training data pairs. They are therefore inflexible, produce results without diversity, and ignore the subjectivity of enhancement. In KinD [21], the light level is adjusted by a strength ratio, but this may not be a convenient descriptor for users since the relationship between the perceived light level and the strength ratio is non-linear. PieNet [7] models user preferences as vectors to guide the enhancement process, yet preparing the preference vectors is complicated. Furthermore, the vector learns color information in addition to brightness information, which leads to color distortion.
In this paper, we propose a deep learning algorithm for multi-level low-light image enhancement guided by arbitrary images as brightness references. Inspired by style transfer, we assume that an image consists of a content component and a luminance component in the latent space, an assumption that our experiments show to be reasonable. Specifically, content components carry the scene-invariant information that is preserved during enhancement, while luminance components carry brightness-specific information.
A related but distinct theory is Retinex, which decomposes an image into two sub-images, namely reflectance and illumination. It enhances low-light images by adjusting the illumination and then recombining it with the corresponding reflectance. In contrast, our feature components are low-coupling, which allows a new image to be generated by concatenating two feature components from different images.
Our main contributions are summarized as follows:
1) The proposed network decomposes images into content components and luminance components in the latent space, which are independent of each other. The feature components of different images are concatenated to perform low-light image enhancement guided by arbitrary references.
2) Our network achieves multi-level enhancement mapping even though it is trained with paired images. In the training datasets, each low-light image has only one corresponding normal-light image. By comparison, existing methods trained with such datasets produce only a one-to-one result.
3) Extensive experiments demonstrate the strong capacity of our method on various datasets. Furthermore, the network offers diverse outputs according to different brightness references.
The goal of low-light image enhancement is to learn a mapping from a low-light image to a normal-light version. However, the normal light level lies within a range rather than at a single discrete value, from both qualitative and quantitative points of view. Enhancement is therefore better treated as a one-to-many mapping that depends on application scenarios or users' aesthetics. To achieve multi-level low-light image enhancement, we make basic assumptions in Sec. 2.1. The network structure and the loss functions used to optimize the network are then described in detail in Sec. 2.2 and Sec. 2.3, respectively.
Assumption 1 An image can be decomposed into two feature components in the latent space, namely the content component and the luminance component.
Let $\{I_i\}$ be a set of images of the same scene with different light levels. For each image $I_i$, $f_i$ is its feature vector in the latent space, which consists of a content component $c_i$ and a luminance component $l_i$. Under our assumption, $c_i$ is invariant across light levels, while $l_i$ is specific to the light level of $I_i$. In other words, a pair of corresponding images $(I_i, I_j)$, where $i \neq j$, is encoded by an encoder to generate feature vectors $f_i$ and $f_j$. In the latent space, $f_i$ and $f_j$ are decomposed into $(c_i, l_i)$ and $(c_j, l_j)$. Next, $c_i$ and $l_j$ are concatenated to form a new feature vector $f_{i \to j}$, so that $f_{i \to j} = f_j$. The image reconstructed from $f_{i \to j}$ by a decoder is the same as $I_j$. In this way, multi-level mapping is performed by extracting luminance components from images with diverse light levels.
Assumption 2 Two feature components with fixed dimensions are low-coupling.
Sets of same-scene images with different light levels are challenging to acquire in practice, so we instead use images whose content is unrelated to the low-light image as guidance. This paper performs multi-level low-light image enhancement guided by arbitrary images as brightness references, regardless of scene. The components are therefore expected to be low-coupling, so that features from two images can be concatenated without introducing brightness-independent information from the reference image. As shown in Fig. 1, let $(x, y)$ be an image pair with different scenes, where $y$ is the image serving as the brightness reference. The goal of the task is to learn a mapping from $x$ to a corresponding version $\hat{x}$ that is as bright as $y$. Specifically, the feature vectors of $x$ and $y$ are decomposed into $(c_x, l_x)$ and $(c_y, l_y)$, respectively, and then $c_x$ and $l_y$ are concatenated to reconstruct an enhancement result of $x$. The result preserves the original scene-invariant information of $x$ and introduces the target brightness from $y$. By taking different reference images as guidance, multi-level low-light image enhancement is achieved. The key to validating the assumptions is learning an encoder $E$ and a decoder $D$.
Our model is designed to enhance a low-light image to corresponding normal-light versions. We present the network structure in Fig. 2. It consists of an encoder $E$, a feature concatenation module, and a decoder $D$, which form a U shape. The network takes two images as input: a low-light image $x$ and a reference image $y$. During training, $x$ and $y$ are identical in content, while in testing, $y$ is an arbitrary image. The same encoder $E$ is used for both inputs.
Our network employs the down-sampling part of U-Net as the encoder $E$, followed by global average pooling, which encodes $x$ and $y$ as feature vectors $f_x$ and $f_y$, respectively. Correspondingly, the decoder $D$ is the up-sampling part of U-Net, which reconstructs an image from the feature vector. Details of the feature concatenation module, a crucial part of our network, are provided next.
The Feature Concatenation Module
Its function is to regroup components from the two input feature vectors so that the output vector contains all desired information. Specifically, $f_x$ and $f_y$ are fed into the feature concatenation module, and their components are concatenated to obtain a new feature, which consists of $c_x$ and $l_y$. Finally, the model produces the concatenated feature map through a fully connected layer and a dimension expansion operation; this map has the same resolution and number of channels as the corresponding feature map in the encoding stage.
The low-light image is enhanced by introducing $l_y$ while retaining $c_x$. This alleviates color distortion and accords with the essence of the task, namely that only the light level changes.
As stated in the assumptions, input feature vectors are decomposable, and the decomposed components are low-coupling. Therefore, the proposed method uses the loss functions described in Sec. 2.3 to constrain a fixed set of dimensions of each vector to carry brightness information alone, while the remaining dimensions carry other information such as color, structure, and details. These two kinds of information do not overlap.
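The regrouping performed by the feature concatenation module can be sketched as follows. This is a minimal NumPy illustration rather than the actual implementation: the vector length `FEAT_DIM`, the number of luminance dimensions `LUM_DIM`, their position at the end of the vector, the feature-map shape, and the random weights standing in for the fully connected layer are all assumptions made for the example.

```python
import numpy as np

FEAT_DIM = 128                    # assumed length of the pooled feature vector
LUM_DIM = 16                      # assumed number of luminance dimensions
MAP_H, MAP_W, MAP_C = 4, 4, 128   # assumed shape of the bottleneck feature map


def split_components(feature):
    """Split a feature vector into (content, luminance) parts.

    Assumption: the last LUM_DIM entries hold the luminance component.
    """
    content = feature[: FEAT_DIM - LUM_DIM]
    luminance = feature[FEAT_DIM - LUM_DIM:]
    return content, luminance


def concatenate_features(f_low, f_ref, fc_weight, fc_bias):
    """Keep the content of the low-light image and the luminance of the reference."""
    c_low, _ = split_components(f_low)
    _, l_ref = split_components(f_ref)
    mixed = np.concatenate([c_low, l_ref])       # shape (FEAT_DIM,)
    projected = fc_weight @ mixed + fc_bias      # stand-in for the fully connected layer
    # Dimension expansion: repeat the projected vector over the spatial grid.
    feature_map = np.broadcast_to(projected, (MAP_H, MAP_W, MAP_C)).copy()
    return feature_map, mixed


rng = np.random.default_rng(0)
f_low = rng.normal(size=FEAT_DIM)   # hypothetical feature vector of the low-light image
f_ref = rng.normal(size=FEAT_DIM)   # hypothetical feature vector of the reference image
fc_weight = rng.normal(size=(MAP_C, FEAT_DIM)) * 0.01
fc_bias = np.zeros(MAP_C)

feature_map, mixed = concatenate_features(f_low, f_ref, fc_weight, fc_bias)
print(feature_map.shape)
```

In this sketch the content dimensions of the output come only from the low-light image and the luminance dimensions only from the reference, which is the property the assumptions rely on.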
2.3 Loss Function
To perform the task, we propose several differentiable losses to constrain the image-to-image and feature-to-feature processes. The following three losses are minimized to train our network.
Reconstruction loss In the image-to-image process, we compute the reconstruction loss. A norm-based error $\lVert\cdot\rVert$ is used to measure the distance between the prediction and the ground truth. The reconstruction loss can be expressed as:
$$\mathcal{L}_{rec} = \lVert D(c_x \oplus l_y) - y \rVert,$$
where $x$ and $y$ are respectively the low-light and reference normal-light images, $c_x$ is the content component decomposed from $E(x)$, $l_y$ is the luminance component decomposed from $E(y)$, and $\oplus$ denotes feature concatenation. Pixels of all channels in the inputs of the network are normalized to a fixed range.
The loss ensures that the network decomposes image pairs with the same content into identical content components and different luminance components, which is achieved by reconstructing the feature vector composed of $c_x$ and $l_y$ into an image consistent with $y$.
Feature loss The feature loss is designed for the feature-to-feature mapping. Feature components are expected to be recoverable after passing through the decoder and the encoder. To this end, we use a content feature loss and a luminance feature loss to constrain and learn the reconstruction and extraction of feature components. The feature loss is expressed as:
$$\mathcal{L}_{fea} = \mathcal{L}_{c} + \mathcal{L}_{l}.$$
Here, $\mathcal{L}_{c}$ and $\mathcal{L}_{l}$ are respectively the content feature loss and the luminance feature loss. Specifically, the content feature loss is defined as:
$$\mathcal{L}_{c} = \lVert c_x - c_{\hat{x}} \rVert,$$
where $c_x$ and $c_{\hat{x}}$ represent the content components of the low-light image and the prediction, and $\lVert\cdot\rVert$ is the norm-based error. The content feature loss ensures that the content component is unchanged after enhancement and also encourages the feature to remain consistent with the original after decoding and re-encoding. Next, following the definition of the triplet loss, we define the luminance feature loss as:
$$\mathcal{L}_{l} = \max\bigl(0,\; d(l_{\hat{x}}, l_y) - d(l_{\hat{x}}, l_x) + m\bigr),$$
where $l_x$, $l_y$, and $l_{\hat{x}}$ respectively represent the luminance components of the low-light image, the reference image, and the prediction. $\max(0, \cdot)$ is a rectifier: the loss equals the value inside the rectifier when it is greater than 0, and is 0 otherwise. $d(\cdot, \cdot)$ is the squared Euclidean distance between feature vectors. $m$ is a margin, set to 0.08 by averaging the distances between the luminance components of 20 image pairs randomly selected from the dataset.
We choose the triplet form rather than the metric used in the content feature loss because $l_{\hat{x}}$ is expected to be similar to $l_y$ and different from $l_x$, on account of the specificity of the luminance component.
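The triplet-style luminance feature loss can be illustrated with a short NumPy sketch. The roles used here (prediction as anchor, reference as positive, low-light image as negative) are inferred from the description above, and the function names and toy vectors are hypothetical.

```python
import numpy as np

MARGIN = 0.08  # margin value reported in the text


def squared_euclidean(a, b):
    """Squared Euclidean distance between two feature vectors."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sum(diff ** 2))


def luminance_feature_loss(l_pred, l_ref, l_low, margin=MARGIN):
    """Pull the prediction toward the reference and away from the low-light input."""
    value = squared_euclidean(l_pred, l_ref) - squared_euclidean(l_pred, l_low) + margin
    return max(0.0, value)  # rectifier


l_low = np.array([0.0, 0.0])   # toy luminance component of the low-light image
l_ref = np.array([1.0, 1.0])   # toy luminance component of the reference

# Prediction matches the reference: the rectifier clips the loss to zero.
loss_good = luminance_feature_loss(l_ref.copy(), l_ref, l_low)
# Prediction stays at the low-light luminance: the loss is positive.
loss_bad = luminance_feature_loss(l_low.copy(), l_ref, l_low)
print(loss_good, loss_bad)
```

A prediction whose luminance component already sits closer to the reference than to the low-light input by at least the margin contributes no gradient, which is the usual behavior of a triplet loss.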
Content consistency loss Next, the content consistency loss is employed to constrain the enhanced image to be the same as the original low-light image except for the light level. Images are first mapped to the HSV color space. The optimization then penalizes the cosine distance of the H and S channels between the prediction and the low-light image. The content consistency loss is expressed as:
$$\mathcal{L}_{con} = \mathcal{L}_{H} + \mathcal{L}_{S}.$$
Here, $\mathcal{L}_{H}$ and $\mathcal{L}_{S}$ respectively represent the cosine losses of the H and S channels, expressed as:
$$\mathcal{L}_{H} = 1 - \cos(H_x, H_{\hat{x}}), \qquad \mathcal{L}_{S} = 1 - \cos(S_x, S_{\hat{x}}),$$
where $\cos(\cdot, \cdot)$ is an operation that calculates cosine similarity. $H_x$ and $H_{\hat{x}}$ are the H channels of the low-light image and the prediction, respectively. Similarly, $S_x$ and $S_{\hat{x}}$ are the S channels.
We justify this color space mapping with the following experiment. If the H and S channels of the low-light image are combined with the V channel of the content-matching normal-light image and then mapped back to the RGB space, the result is nearly the same as the normal-light image. This indicates that the similarity of the H and S channels between the prediction and the low-light image can measure whether scene-invariant information changes after enhancement.
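The channel-recombination experiment can be mimicked on synthetic data. The sketch below uses the Python standard library `colorsys` module on a toy image pair in which the "normal-light" image is simply the low-light image scaled by a constant; this idealized pair is an assumption for illustration and is not data from the paper.

```python
import colorsys
import numpy as np


def recombine_hs_with_v(low, normal):
    """Take H and S from `low`, V from `normal`, and map the result back to RGB."""
    out = np.empty(normal.shape, dtype=float)
    for idx in np.ndindex(*low.shape[:2]):
        h, s, _ = colorsys.rgb_to_hsv(*low[idx])
        _, _, v = colorsys.rgb_to_hsv(*normal[idx])
        out[idx] = colorsys.hsv_to_rgb(h, s, v)
    return out


rng = np.random.default_rng(0)
low = rng.uniform(0.02, 0.2, size=(6, 6, 3))   # toy low-light image, RGB in [0, 1]
normal = low * 4.0                             # idealized brighter version of the same scene

recombined = recombine_hs_with_v(low, normal)
max_error = float(np.max(np.abs(recombined - normal)))
print(max_error)
```

For this idealized pair the recombined image matches the brighter image up to floating-point error, because uniform scaling of RGB values changes only V. Real image pairs also differ in noise and color, so the match reported above is approximate rather than exact.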
The cosine loss is adopted instead of the norm-based loss for the following reasons. First, the norm-based metric is already calculated in the RGB color space, and it fails to capture whether the directions of the pixel-value vectors are the same. In addition, we observe experimentally that the colors of the enhanced image are closer to the ground truth when the cosine loss is used than when the norm-based loss is used.
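A minimal sketch of a cosine loss on the H and S channels is given below. It assumes that the cosine distance is one minus the cosine similarity of the flattened channels and that RGB values lie in [0, 1]; the helper names are hypothetical, and the per-pixel `colorsys` conversion is chosen for clarity rather than speed.

```python
import colorsys
import numpy as np


def rgb_to_hs(image):
    """Return the flattened H and S channels of an RGB image with values in [0, 1]."""
    pixels = image.reshape(-1, 3)
    hsv = np.array([colorsys.rgb_to_hsv(r, g, b) for r, g, b in pixels])
    return hsv[:, 0], hsv[:, 1]


def cosine_distance(a, b, eps=1e-8):
    """One minus the cosine similarity of two flattened channels."""
    similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return float(1.0 - similarity)


def content_consistency_loss(low, pred):
    """Sum of the cosine distances of the H and S channels."""
    h_low, s_low = rgb_to_hs(low)
    h_pred, s_pred = rgb_to_hs(pred)
    return cosine_distance(h_low, h_pred) + cosine_distance(s_low, s_pred)


rng = np.random.default_rng(0)
low = rng.uniform(0.05, 0.3, size=(8, 8, 3))   # toy low-light image
pred = low * 3.0                               # brightness-only change

loss_value = content_consistency_loss(low, pred)
print(loss_value)
```

A brightness-only change leaves H and S untouched, so the loss stays near zero, whereas the same change produces a large norm-based error in RGB space. This is the behavior that motivates measuring content consistency on H and S.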
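Total loss The proposed network is optimized using the total loss, a weighted combination of the reconstruction loss, the feature loss, and the content consistency loss, where $\lambda$ is the weight of the corresponding loss term.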
In this section, we begin with the dataset and implementation details for training. The proposed method is then compared with state-of-the-art methods through extensive qualitative and quantitative experiments. Finally, its ability to generate multi-level enhancement results is demonstrated with arbitrary brightness references.
The LoL dataset is used for training. It consists of 500 image pairs, where each pair contains a low-light image and its corresponding normal-light image. The first 485 image pairs are used for training and the remaining 15 for testing.
Our network is implemented with TensorFlow on an NVIDIA 2080Ti GPU. The model is optimized using the Adam optimizer with a fixed learning rate of 1e-4. The batch size is set to 8. We train the model for 1000 epochs with the whole image as input. For data augmentation, a horizontal or vertical flip is randomly performed. In addition, a 100×100 image patch is randomly located in each low-light image and replaced with the image patch at the same position in the ground truth. The loss weight $\lambda$ is set to 2.
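The two augmentation steps can be sketched in NumPy as follows. The flip probability of 0.5, the choice to apply the same flip to both images, and the 400×600 image size used in the example are assumptions; only the 100×100 patch size comes from the text.

```python
import numpy as np

PATCH = 100  # patch size stated in the text


def augment_pair(low, gt, rng):
    """Randomly flip a (low-light, ground-truth) pair, then paste one ground-truth patch."""
    if rng.random() < 0.5:                 # assumed flip probability
        axis = int(rng.integers(0, 2))     # 0: vertical flip, 1: horizontal flip
        low, gt = np.flip(low, axis=axis), np.flip(gt, axis=axis)
    low = low.copy()
    height, width = low.shape[:2]
    top = int(rng.integers(0, height - PATCH + 1))
    left = int(rng.integers(0, width - PATCH + 1))
    # Replace the patch in the low-light image with the ground truth at the same position.
    low[top:top + PATCH, left:left + PATCH] = gt[top:top + PATCH, left:left + PATCH]
    return low, gt, (top, left)


rng = np.random.default_rng(0)
low = np.zeros((400, 600, 3))   # toy low-light image (all dark)
gt = np.ones((400, 600, 3))     # toy ground truth (all bright)

low_aug, gt_aug, (top, left) = augment_pair(low, gt, rng)
print(low_aug.shape, top, left)
```

In this toy example, exactly one 100×100 region of the augmented low-light image carries ground-truth pixels, so the pair fed to the network is partly identical in brightness.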
The network is trained in an end-to-end manner. During training, a low-light image and a reference image are taken as input. The model generates an enhanced image, which is also fed into the encoder. The feature concatenation module then produces the feature components of the three images, which are used to calculate the feature loss.
3.1 Performance Evaluation
Our method is evaluated on widely used datasets, including the LoL, LIME, DICM, NPE and MEF datasets. The effectiveness of the proposed algorithm is demonstrated by qualitative and quantitative comparisons with several state-of-the-art methods, such as KinD, MIRNet and PieNet. For PieNet, only the numerical results published by the authors are used for quantitative comparison, since the source code is not available. For the LIME, NPE and MEF datasets, we conduct only qualitative experiments because no ground truth is available.
Figure 3. Panels: (a) Input, (b) LIME, (c) Retinex-Net, (d) GLAD, (e) KinD, (f) MIRNet, (g) Ours, (h) Ground Truth.
Figure 4. Panels: (a) Input, (b) LIME, (c) Retinex-Net, (d) KinD, (e) Ours.
3.1.1 Quantitative Comparison
In the quantitative comparison, PSNR and SSIM [17] are calculated as evaluation metrics. Generally, higher values indicate better results. For a fair comparison, we compare the proposed model with methods trained on the same data. Furthermore, all methods involved in the comparison employ the default training set and test set. Table 1 reports the PSNR/SSIM results of our method and several others on the LoL dataset. The best result for each metric is shown in bold. As the table shows, our network significantly outperforms all the other methods. Notably, on the LoL dataset the proposed model achieves a PSNR 3.76 dB higher than that of MIRNet, previously the best-performing method. There are two main reasons. First, feature concatenation retains the scene-invariant information of the low-light image to the greatest extent, which alleviates color distortion. Second, the well-designed loss functions improve the performance of our network.
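For reference, PSNR can be computed as in the sketch below, which uses the standard definition for images scaled to [0, 1]. The function name and the toy arrays are illustrative and do not reproduce the evaluation code behind Table 1.

```python
import numpy as np


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, max_val]."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))


target = np.zeros((4, 4, 3))        # toy ground-truth image
pred = np.full((4, 4, 3), 0.1)      # toy prediction with a constant error of 0.1

value = psnr(pred, target)
print(round(value, 2))
```

Because PSNR is logarithmic in the mean squared error, a gain of 3.76 dB corresponds to a mean squared error that is lower by a factor of about $10^{0.376} \approx 2.38$.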
3.1.2 Visual and Perceptual Comparisons
Figures 3 and 4 give visual comparisons on low-light images from five datasets, captured under different lighting conditions. As shown in Fig. 3, compared against the ground truth, our method not only enhances dark regions but also brings the colors of the enhanced image closer to the ground truth. Where no ground truth is available, the results of different methods in Fig. 4 show that our method produces a more natural appearance and more realistic images. In contrast, the other methods either fail to enhance the images or suffer from varying degrees of degradation, such as noise and color distortion. In short, the proposed method achieves better contrast, more vivid colors and sharper details, which are more visually satisfying.
3.2 Different Levels of Enhancement
We show results of multi-level mapping in Fig. 5. Our network is able to generate multiple enhanced versions of the same low-light image guided by various reference images. More importantly, the versions enhanced from the same low-light image are essentially identical in details, structures and colors, differing only in light level. In addition, when different low-light images are matched with the same reference image, the results have approximately the same brightness.
Most existing methods trained on paired datasets generate only one fixed-brightness result for a low-light image, which is a one-to-one mapping and therefore lacks diversity. In contrast, our method achieves multi-level enhancement using such datasets.
In this paper, we focus on the subjectivity of enhancement and introduce a brightness reference to produce results that respect this property. We propose a deep network for multi-level low-light image enhancement guided by a reference image. In the network, an image is decomposed into two low-coupling feature components in the latent space, and then the content component of one image and the luminance component of another are concatenated to generate a new image. Multiple normal-light versions of one low-light image are obtained by selecting different reference images as guidance. Extensive experiments demonstrate the superiority of our method compared with existing state-of-the-art methods.
References
- [1] (2018) Learning to see in the dark. In CVPR.
- [2] (2011) Fast efficient algorithm for enhancement of low lighting video. In IEEE International Conference on Multimedia & Expo.
- [3] A weighted variational model for simultaneous reflectance and illumination estimation.
- [4] (2020) Zero-reference deep curve estimation for low-light image enhancement. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1780–1789.
- [5] (2017) LIME: low-light image enhancement via illumination map estimation. IEEE Transactions on Image Processing 26 (2), pp. 982–993.
- [6] (2019) EnlightenGAN: deep light enhancement without paired supervision. CoRR.
- [7] (2020) PieNet: personalized image enhancement. In European Conference on Computer Vision.
- [8] (1977) The retinex theory of color vision. Scientific American 237 (6), pp. 108–128.
- [9] (2012) Contrast enhancement based on layered difference representation. In 2012 19th IEEE International Conference on Image Processing, pp. 965–968.
- [10] LLNet: a deep autoencoder approach to natural low-light image enhancement. Pattern Recognition 61, pp. 650–662.
- [11] (2015) Perceptual quality assessment for multi-exposure image fusion. IEEE Transactions on Image Processing 24 (11), pp. 3345–3356.
- [12] (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI.
- [13] (2017) MSR-net: low-light image enhancement using deep convolutional network.
- [14] (2019) Underexposed photo enhancement using deep illumination estimation. In CVPR.
- [15] (2013) Naturalness preserved enhancement algorithm for non-uniform illumination images. IEEE Transactions on Image Processing 22 (9), pp. 3538–3548.
- [16] (2018) GLADNet: low-light enhancement network with global awareness. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 751–755.
- [17] (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612.
- [18] (2018) Deep retinex decomposition for low-light enhancement. In BMVC.
- [19] (2017) A new image contrast enhancement algorithm using exposure fusion framework. In International Conference on Computer Analysis of Images and Patterns.
- [20] (2020) Learning enriched features for real image restoration and enhancement. In ECCV.
- [21] (2019) Kindling the darkness: a practical low-light image enhancer. In Proceedings of the 27th ACM International Conference on Multimedia, pp. 1632–1640.