Scenes with low light are challenging in photography: cameras usually produce noisy and/or blurry images. In these situations, people often use an external device such as a camera flash, producing flash images. However, when the flash points directly at the subject, its light can be too harsh for the scene and create non-uniform illumination. Comparing a flash image with its counterpart under ambient illumination, the ambient illumination is clearly more natural and uniform because the available light is more evenly distributed (see Figure 1).
Researchers have studied the enhancement of flash images [15, 6, 1, 3], producing enhanced images by combining ambient and flash images, or normalizing the illumination of flash images in a controlled environment (backdrop and studio lighting) but without replicating the natural skin tone of people. However, in a real scenario with low light conditions, there is no information about what the ambient image looks like. Moreover, in scenarios without a backdrop, objects far from the camera receive very little illumination, creating dark areas in the image, given that the camera flash is the only light source. Consequently, in a real scenario with low light conditions, creating ambient images from flash images is a very challenging problem.
Prior works handle the enhancement of low light images, where a scene is underexposed; however, in flash images, objects close to the camera tend to be bright, and these techniques overexpose such regions. Our method attenuates the illumination close to the camera and illuminates the underexposed regions at the same time. Since flash and ambient images represent the same scene, researchers study lighting normalization on a flash image by learning the relationship between the pair of images; this estimated relationship is then added to the respective flash image, normalizing the illumination while maintaining high-frequency information. This approach is not effective at restoring overexposed areas, because those regions are still required to compute the final result.
In this article, we propose a conditional adversarial network in a guided mode, which follows two objective functions. First, the reconstruction loss generates uniform illumination and synthetic ambient shadows. Second, the adversarial loss, which represents the objective function of GANs , forces the model to capture high-frequency details in the output image and to produce a more natural illumination. Both loss functions are guided through the attention mechanism, which is performed by attention maps based on the input image and the ground truth. The attention mechanism allows the model to be more robust to overexposed areas and sideways shadows present in flash images. It also improves the robustness of the model to inconsistent scene matches between pairs of flash and ambient images, since the two are usually not perfectly aligned at the moment of capture. We compare against state-of-the-art enhancement techniques for low light images [7, 9] and flash images . Ablation studies are also performed on the architecture.
In summary, the major contributions of this article are:
An attention mechanism to guide a conditional adversarial network on the task of translating flash images to ambient images. It provides robustness against overexposed areas and shadows present in flash and ambient images, as well as against the scene misalignment between both images. This mechanism guides the adversarial loss to avoid blurry results by discriminating these cases.
Our proposed attention mechanism also guides the reconstruction loss to be robust to high-frequency details through the texture information that the attention map provides.
2 Related Work
2.1 Low Light Image Enhancement
Prior works [15, 6, 1] combine the advantages of both ambient and flash images. These image processing techniques use the information from the image with the available illumination (ambient image) and the image lit by the camera flash (flash image), and create an enhanced image based on both. In contrast with these techniques, our model enhances the flash image without any information about the ambient image.
In SRIE , the reflectance and illumination are estimated by a weighted variational model; then, the images are enhanced with the reflectance and illumination components. LIME , on the other hand, enhances images by estimating their illumination maps. More specifically, the illumination of each pixel is first estimated individually by finding the maximum value among the R, G, and B channels; the illumination map is then refined by imposing a structure prior. This refined illumination map has smooth texture details. Neither SRIE nor LIME addresses sideways shadow removal, reconstruction of overexposed areas, or generation of synthetic ambient shadows.
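The initial illumination estimate described above is simple to sketch; a minimal NumPy version (the structure-prior refinement step of LIME is omitted):

```python
import numpy as np

def initial_illumination_map(img):
    """Initial per-pixel illumination estimate as described for LIME:
    the maximum value among the R, G, and B channels of each pixel.
    `img` is an H x W x 3 array with values in [0, 1]."""
    return img.max(axis=2)
```

LIME then refines this map by imposing a structure prior, which preserves salient structure while smoothing texture details; that refinement is not reproduced here.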
2.2 Image-to-Image Translation
Encoder-decoder networks have been widely used for image-to-image translation tasks such as image segmentation, photo synthesis, and low light image enhancement. These networks are composed of several convolutional layers, where the input is encoded into a latent space representation and then decoded to estimate the desired output. Inspired by the U-Net architecture, our model employs skip connections to share information between the encoder and the decoder, recovering spatial information lost in downsampling operations.
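The skip connections mentioned above simply concatenate encoder features with the corresponding decoder features; a minimal NumPy sketch of the operation (array shapes are illustrative):

```python
import numpy as np

def skip_connection(encoder_feat, decoder_feat):
    """Concatenate an encoder feature map with the decoder feature map of
    the same spatial size along the channel axis (C, H, W), so the decoder
    can recover spatial detail lost by downsampling."""
    assert encoder_feat.shape[1:] == decoder_feat.shape[1:]
    return np.concatenate([encoder_feat, decoder_feat], axis=0)
```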
In DeepFlash , a deep learning model was adapted to turn a smartphone flash selfie into a studio portrait. The model generates uniform illumination, but it does not reproduce the skin tone the person would have under studio lighting. The encoder part of the network consists of the first 13 convolutional blocks of VGG-16, and the weights of the encoder are initialized with a model pre-trained for face recognition . The inputs and target of this network are given filtered, so the network estimates an image with only low-frequency details, which represents the illumination relationship between the ambient and flash images. This pre-processing step is the drawback of the model, because it cannot learn a high-quality illumination relationship between the flash and the ambient image. The step also adds computation time, since the model uses a bilateral filter.
2.3 Conditional GANs
Conditional GANs  have been proposed as a general-purpose solution for image-to-image translation . A cGAN is composed of two architectures, the generator and the discriminator. Both are fully convolutional networks . The generator is an encoder-decoder network in which each step of the encoder and decoder is composed mainly of convolutional layers. The generator and discriminator are conditioned on some type of information such as images, labels, or text. In our case, this information is the flash image, and our cGAN learns to map flash images to ambient images. Thus, the generator synthesizes ambient images that should be indistinguishable from real ambient images, while the discriminator is trained adversarially with respect to the generator to distinguish real from synthesized ambient images. As shown by the pix2pix model , this min-max game encourages the learning of high-frequency details, unlike using only a reconstruction loss such as the MAE (Mean Absolute Error), which outputs smoothed results.
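For reference, the min-max game takes the standard conditional GAN form used by pix2pix (here $x$ denotes the conditioning flash image, $y$ the real ambient image, $G$ the generator, and $D$ the discriminator; the symbols are ours, introduced for illustration):

```latex
\min_G \max_D \;
\mathbb{E}_{x,y}\big[\log D(x, y)\big] +
\mathbb{E}_{x}\big[\log\big(1 - D(x, G(x))\big)\big]
```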
3 Proposed Method
Our model is composed of two architectures, a generator and a discriminator, and translates flash images to ambient images. The training procedure follows two objectives: the reconstruction loss, which aims to minimize the distance between the generated image and the target image; and the adversarial loss, which represents the objective of the cGAN . Figure 2 illustrates an overview of our model architecture.
Both the reconstruction loss and the adversarial loss are guided by our attention mechanism to ensure a better learning procedure. The attention mechanism is applied to the inputs of both losses; that is, the ambient image and the synthetic ambient image first pass through the attention map before the losses are computed.
3.1 Attention Mechanism
The attention mechanism that we propose aims to guide the reconstruction and adversarial losses. The mechanism is simple yet effective: we guide both losses with an attention map based on the flash image and the ambient image. We define the attention map as:
In Equation 1, the attention map assigns a value to each spatial position, computed from the pixel values over all channels. Then, the ambient image and the synthetic ambient image pass through the attention map before computing the reconstruction loss and the adversarial loss,
where the operation denotes element-wise multiplication. Equation 2 guides both losses to a better learning procedure by discriminating overexposed areas, shadows, and scene misalignment between the flash and ambient images. The reconstruction loss, which uses the L1 distance, and the adversarial loss are then defined as:
Through this operation, the reconstruction loss is driven to learn the normalization of the lighting while discounting high-frequency details, since the attention map provides this information via the element-wise multiplication. The attention map also guides the reconstruction loss to be robust to the misaligned scene between flash and ambient images. The adversarial loss, on the other hand, focuses on generating realism and high-frequency details in the regions indicated by the attention map. It does not allow blurry outputs where the attention map indicates, because blurry regions are classified as fake and the adversarial loss tries to fix them by generating high-frequency details in those regions.
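As a concrete illustration, the guided reconstruction loss can be sketched in NumPy as follows. The exact attention map is defined by Equation 1; the per-pixel channel-mean absolute difference used below is our simplified stand-in, not the paper's formula:

```python
import numpy as np

def attention_map(flash, ambient):
    """Hypothetical attention map from the flash/ambient pair: the mean
    absolute difference over channels at each pixel (a stand-in for
    Equation 1). Inputs are H x W x C arrays with values in [0, 1]."""
    return np.abs(flash - ambient).mean(axis=2, keepdims=True)

def guided_l1(synthetic, ambient, attn):
    """Guided reconstruction loss: both images pass through the attention
    map (element-wise multiplication) before the L1 distance."""
    return np.abs(attn * synthetic - attn * ambient).mean()
```

The adversarial loss is guided in the same way: the discriminator receives the attention-weighted images instead of the raw ones.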
Finally, our full objective
is a mix of the reconstruction and the adversarial losses, maintaining the relevance of the reconstruction loss and scaling the adversarial loss by a hyperparameter. Equation 4 determines to what extent the adversarial loss should influence the full objective, thus controlling the generation of artifacts in the output images.
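A plausible written form of this full objective, with $\lambda$ as the scaling hyperparameter (the loss symbols are ours, introduced for illustration):

```latex
\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \lambda \, \mathcal{L}_{\mathrm{adv}}
```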
We perform ablation studies on the architecture and verify the improvements obtained by using our proposed attention mechanism. Our ablation studies also consider whether or not a pre-trained model is used in the generator.
4 Experiments
In this section, we describe the Flash and Ambient Illumination Dataset (FAID) and the custom subset of these images that we use. We also present the training protocol that we followed and show the quantitative and qualitative results that validate our proposal. Finally, we present the controlled experiments that we performed to determine how the components of our architecture affect the overall performance.
4.1 Dataset
Introduced by , the FAID (Flash and Ambient Illumination Dataset) is a collection of pairs of flash and ambient images spanning 6 categories: People, Shelves, Plants, Toys, Rooms, and Objects. As a result, we have pairs of flash and ambient images. We inspected each image in the dataset and found that some ambient images have problems such as low illumination, shadows from external objects, or even reflections. Therefore, we used a reduced set of the entire FAID dataset for our experiments. Our custom dataset has pairs of images for training and 116 for testing, and all images were resized to one of two fixed resolutions depending on their orientation. The custom dataset can be downloaded at https://github.com/jozech/flash-to-ambient.
4.2 Training Details
We freeze all convolutional layers in the encoder part of the generator, and train our model using the Adam optimizer  with , based on . We use different learning rates for the generator and the discriminator; an equal or higher learning rate for the discriminator with respect to the generator results in divergence. To regularize the adversarial loss, we set the scaling hyperparameter empirically: lower values result in blurry outputs, and higher values result in many artifacts. The training procedure uses random crops and random horizontal flipping for data augmentation. Our architecture is implemented in PyTorch, and the training process takes approximately one day on an NVIDIA GeForce GTX 1070 graphics card.
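The paired augmentation described above (random crops and horizontal flips applied identically to both images) can be sketched as follows; the crop size used in the paper is not reproduced here, so it is a parameter:

```python
import numpy as np

def augment_pair(flash, ambient, crop_size, rng):
    """Apply the same random crop and the same random horizontal flip to a
    flash/ambient pair (H x W x C arrays), keeping the two images aligned."""
    h, w = flash.shape[:2]
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    f = flash[top:top + crop_size, left:left + crop_size]
    a = ambient[top:top + crop_size, left:left + crop_size]
    if rng.random() < 0.5:  # flip both images, or neither
        f, a = f[:, ::-1], a[:, ::-1]
    return f, a
```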
4.3 Quantitative and Qualitative Validation
We use the PSNR (Peak Signal-to-Noise Ratio) and the SSIM (Structural Similarity) to measure our quantitative results. Table 1 reports the mean PSNR and the mean SSIM on the test set after 1000 epochs. All hyperparameters are set in the same way, and the encoder-decoder network was pre-trained on the ImageNet dataset  instead of on a model used for face recognition. Our quantitative results do not significantly outperform the state-of-the-art image enhancement methods, but they show improvements on the flash image enhancement task.
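For completeness, the PSNR metric used here can be computed as follows (SSIM involves local statistics and is typically taken from a library such as scikit-image, so only PSNR is sketched):

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak Signal-to-Noise Ratio between two images with values in
    [0, max_val]; higher is better."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images
    return 10.0 * np.log10((max_val ** 2) / mse)
```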
Figure 4 shows that our model synthesizes ambient shadows on flash images of scenes such as shelves, similar to , but struggles to restore overexposed areas produced by the camera flash. LIME  and SRIE  neither attenuate overexposed areas nor synthesize ambient shadows on these types of scenes; these methods do not handle such issues of flash images. The DeepFlash architecture  produces weak ambient shadows, attenuates overexposed areas without restoring them, and outputs many artifacts in its results. In the case of sideways shadow removal, all models fail (including ours).
Estimation of the skin tone of people is shown in Figure 5, where the illumination map created by LIME  leads to brightening and overexposing the flash images. LIME  cannot distinguish the natural color of dark objects and tends to illuminate them. Results of SRIE  do not present considerable changes with respect to the flash images in these kinds of scenes. DeepFlash  presents non-uniform illumination on flash images of people, apparently due to its attempt to simulate shadows. In the case of flash images that contain both low and highly illuminated areas, like the Rubik's Cube in Figure 5, the compared methods  present implausible illumination in their results, while our method shows considerably better results; that is, our result looks much more similar to the ground truth.
Figure 5 also reveals some aspects of the generation of ambient lighting on people; note the synthetic shadows around the mouth and under the chin. Almost all ambient images in the training data were taken with a light source coming from above, as is typical of indoor home lighting. Therefore, the model learns to generate synthetic ambient lighting simulating a light source that comes from above.
4.4 Ablation Study
We perform different experiments to validate the final configuration of our architecture. Table 2 reports the quantitative comparison between our controlled experiments. Furthermore, Figure 6 shows qualitative comparisons between the conditions in Table 2.
Table 2 (excerpt): Condition 1, Default (): PSNR 15.67, SSIM 0.684.
Our quantitative assessment shows that using a pre-trained model significantly improves over the model trained from scratch (condition 4). The other configurations have similar results. This is because models that use only the MAE as the objective function (condition 3) generate blurry results in order to minimize the error between the estimated images and the targets. Condition 2, which corresponds to the default model without the attention mechanism, scores lower than condition 3 because the adversarial loss adds some sharpness to its output images.
We explore our qualitative results (Figure 6) with respect to different loss functions, the attention map, and network architectures.
Loss function. Table 2 reports the influence of the adversarial loss. Condition 3 uses the same generator structure without the adversarial loss, i.e., just an encoder-decoder network with no discriminator. This architecture produces blurred results compared with our default model; in this case the reconstruction loss is not enough to generate high-frequency details (note the blurry image of the headphones in Figure 6). The adversarial loss ensures better quality thanks to the deep discriminator network, which classifies blurry results as fake. Condition 2 also presents blurry results; however, its output images present more uniform illumination due to the adversarial loss.
Attention map. Condition 1, which represents our default model, presents uniform illumination and high-frequency details (note the sharpness of the headphones with respect to the other conditions). Our attention mechanism guides the reconstruction and adversarial losses to obtain uniform illumination and sharp results with fewer artifacts. However, due to its robustness to overexposed areas and shadows, our model cannot relight dark areas with high-frequency details. We believe that a better formulation of the attention mechanism could address this problem.
Network architecture. As reported in Table 2, we evaluate the well-known U-Net  architecture in condition 4. We adopt the model proposed by  for enhancing extreme low light images and train it from scratch. U-Net presents blurry output images and non-uniform illumination. Our default model, which uses transfer learning, achieves better quantitative and qualitative results. We believe this is due to the small number of samples in the training set.
5 Conclusions
Ambient lighting generation is a challenging problem, even more so for flash images under low light conditions. Shadows on the flash image have to be removed, overexposed areas should be reconstructed, and ambient shadows must be synthesized as part of the simulation of an ambient light source. In this paper, we propose a model with a guided reconstruction loss for normalizing the illumination and a guided adversarial loss for modeling high-frequency illumination details on flash images. Our results show that our guided mechanism estimates high-frequency details without introducing visual artifacts in our synthetic ambient images. The guided adversarial loss also produces more realistic ambient illumination on flash images than the state-of-the-art methods. Our current results are promising; nonetheless, there are cases where our model fails, such as restoring overexposed areas, normalizing the lighting of flash images under extreme low light conditions, and sideways shadow removal (see Figure 4). We believe that a more dedicated approach to the adversarial loss would be useful to address these issues.
Other methods based on intrinsic image decomposition  would also be useful, recovering the albedo (reflectance) and shading of the flash image and then modifying the shading component directly to obtain the ambient image. As we show in this article, there are cases that need more dedicated treatment. We aim to further study these cases and evaluate new techniques to improve ambient lighting generation for flash images in such situations.
This work was supported by grant 234-2015-FONDECYT (Master Program) from Cienciactiva of the National Council for Science, Technology and Technological Innovation (CONCYTEC-PERU). I thank all the people who directly or indirectly helped me with this work.
-  (2005-07) Removing photography artifacts using gradient projection and flash-exposure sampling. ACM Trans. Graph. 24 (3), pp. 828–835. External Links: Cited by: §1, §2.1.
-  (2018) A dataset of flash and ambient illumination pairs from the crowd. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 634–649. Cited by: §4.1.
-  (2019) DeepFlash: turning a flash selfie into a studio portrait. Signal Processing: Image Communication 77, pp. 28 – 39. External Links: Cited by: §1, §1, §1, §2.2, Figure 4, Figure 5, §4.3, §4.3, §4.3, Table 1.
-  (2018) Learning to see in the dark. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2, §4.4.
-  (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §2.2, §4.3.
-  (2004-08) Flash photography enhancement via intrinsic relighting. ACM Trans. Graph. 23 (3), pp. 673–678. External Links: Cited by: §1, §2.1.
-  (2016-06) A weighted variational model for simultaneous reflectance and illumination estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.1, Figure 4, Figure 5, §4.3, §4.3, Table 1.
-  (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §1.
-  (2017-02) LIME: low-light image enhancement via illumination map estimation. IEEE Transactions on Image Processing 26 (2), pp. 982–993. External Links: Cited by: §1, §2.1, Figure 4, Figure 5, §4.3, §4.3, Table 1.
-  (2017-07) Image-to-image translation with conditional adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2, §2.3, §3, §4.2.
-  (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, External Links: Cited by: §4.2.
-  (2015-06) Fully convolutional networks for semantic segmentation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 3431–3440. External Links: Cited by: §2.3.
-  (2014) Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784. Cited by: §2.3.
-  (2015-09) Deep face recognition. In Proceedings of the British Machine Vision Conference (BMVC), G. K. L. Tam (Ed.), pp. 41.1–41.12. External Links: Cited by: §2.2, §4.3.
-  (2004-08) Digital photography with flash and no-flash image pairs. ACM Trans. Graph. 23 (3), pp. 664–672. External Links: Cited by: §1, §2.1.
-  (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §2.2, §4.4.
-  (2013-12) Intrinsic image decomposition using a sparse representation of reflectance. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (12), pp. 2904–2915. External Links: Cited by: §5.
-  (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, Cited by: §2.2.